Anatomy of a Self-Healing App on AWS

One of the great things about the cloud is that it enables you to scale workloads so that you pay only for what you use. But another important benefit that doesn’t get quite the same attention is the cloud’s ability to self-heal. To get the most out of your applications running on AWS, you want to bake that self-healing nature in, and that’s done with autoscaling. Let’s take a look at how that works.

To build a self-healing app in AWS, you need the following tools and services:

  •      An Availability Zone (AZ) is a data center or small cluster of data centers interconnected with other Availability Zones in the same region via private high-speed fiber. (A region is a group of Availability Zones.) There’s generally only about 2 milliseconds of latency between Availability Zones, but they are far enough apart that a tornado won’t take all of them out.
  •      CloudWatch is a monitoring and metrics system. In addition to reporting on the overall health of your resources, it lets you orchestrate reactions to your environment when something happens. For example, you might configure CloudWatch to let your application grow in response to a spike in Twitter activity.
  •      A server orchestration or configuration tool is needed to programmatically provision servers. AWS offers OpsWorks (basically a managed version of Chef) and Elastic Beanstalk, or you can use a standalone tool like Puppet, Chef or Ansible.
  •      Route 53 or other highly available DNS. It doesn’t matter if you have the most resilient application in the world if people can’t reach it. Use a highly scalable, robust DNS provider. Don’t host your own DNS internally or use shared-hosting consumer DNS.
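To make the CloudWatch piece above concrete, here’s a minimal sketch of how a threshold alarm evaluates a metric. This is a plain-Python model of the behavior, not the CloudWatch API; the function name and sample values are made up for illustration. An alarm fires only when every datapoint in the evaluation window breaches the threshold.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return 'ALARM' if the last `evaluation_periods` datapoints all
    exceed `threshold`, otherwise 'OK' (a simplified CloudWatch model)."""
    window = datapoints[-evaluation_periods:]
    if len(window) == evaluation_periods and all(d > threshold for d in window):
        return "ALARM"
    return "OK"

# Six one-minute CPU samples; only the last five are evaluated.
print(alarm_state([50, 92, 95, 93, 96, 97], threshold=90, evaluation_periods=5))  # ALARM
print(alarm_state([50, 92, 95, 88, 96, 97], threshold=90, evaluation_periods=5))  # OK
```

One dip below the threshold inside the window resets the alarm to OK, which is what keeps a momentary CPU spike from triggering an unnecessary scale-up.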

Here’s what all of this looks like.


We’ve defined a single Auto Scaling group (ASG) that spans two AZs. The Elastic Load Balancer (ELB) takes incoming requests and round-robins them among the servers in the ASG. We’re using RDS as our persistent data store in its high-availability configuration: an active node and a passive standby. The active node in one AZ does synchronous replication to the standby database in the other AZ. All reads and writes go to the active node and are replicated to the standby. This is what I’d call a pretty simple 2½-tier application: a load balancer, a combined application/web tier, and a persistent data tier.
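The round-robin behavior of the load balancer can be sketched in a few lines. This models the distribution pattern only, not the ELB itself; the hostnames are hypothetical, with two servers in each of the two AZs.

```python
from itertools import cycle

# Four servers in an ASG spanning two AZs (hostnames are made up).
servers = ["app-1a.internal", "app-1b.internal",
           "app-2a.internal", "app-2b.internal"]
next_server = cycle(servers)

def route():
    """Assign each incoming request to the next server in rotation."""
    return next(next_server)

assignments = [route() for _ in range(6)]
print(assignments)
```

After four requests every server has received exactly one, and the fifth request wraps back around to the first server.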

Scale-up Event

Let’s say we get a scale-up event. Suddenly we’re the hot topic on Twitter. A rush of traffic hits the ELB, which trips the CloudWatch rules we’ve set up. For example, we know our application is CPU bound, so we scale up if the aggregate CPU utilization for the ASG remains above 90% for five minutes. The server orchestration tool provisions additional servers; they plug themselves into the load balancer, connect back to the database, and suddenly we’ve doubled our compute capacity.
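The capacity change itself can be modeled as a simple policy on the ASG’s desired count. This is an illustrative sketch, not an AWS API call: the function name is hypothetical, and I’m assuming a doubling policy capped at the group’s configured maximum, matching the “doubled our compute capacity” outcome above.

```python
def scale_up(desired, max_size):
    """Double the ASG's desired capacity on a scale-up event,
    never exceeding the group's configured maximum."""
    return min(desired * 2, max_size)

desired = 4
desired = scale_up(desired, max_size=16)
print(desired)  # 8
```

The cap matters: it’s what keeps a runaway alarm from provisioning (and billing you for) an unbounded number of servers.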

Automatic Database Failover

More importantly, let’s say that in the heat of all this, there’s a hardware failure on the active database node. Not a problem. RDS automatically detects the failure, does a little bit of internal DNS trickery, and sends everybody over to the standby node. This is actually a pretty cool feature of RDS. The last time I had a failover happen, it took 11 seconds from the moment of failure to being operational again. When a replacement node comes online, replication reverts back in the other direction.
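The failover mechanics can be sketched as a tiny model (this is not the RDS API; the class, endpoint, and AZ names are all made up). The key idea is that the application always connects to a single DNS endpoint, which stays constant while the roles of the two nodes swap underneath it.

```python
class MultiAZDatabase:
    """Toy model of an active/standby database pair behind one endpoint."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby
        self.endpoint = "db.example.internal"  # unchanged across failover

    def fail_over(self):
        """Promote the standby to active; once a replacement node comes
        online, replication reverts to the other direction."""
        self.active, self.standby = self.standby, self.active

db = MultiAZDatabase(active="az-1", standby="az-2")
db.fail_over()
print(db.active)   # az-2
print(db.endpoint) # db.example.internal
```

Because the endpoint never changes, the application doesn’t need new configuration after a failover; that’s the “internal DNS trickery” doing the work.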

Self-healing Network

If we lose an application or web server, recovery is even easier because those boxes are stateless. The ELB stops routing traffic to the dead server, and the ASG recognizes that only three servers are online when there should be four. It launches a replacement, waits for it to stabilize, runs a few smoke tests to make sure the code is deployed properly, and then resumes routing traffic to that server. If you design the process properly, the whole cycle can take just a few minutes.
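The detect-and-replace loop above can be sketched as follows. All names here are hypothetical; the `smoke_test` callable stands in for whatever health check you run (for example, hitting a health-check URL) before a replacement is allowed to take traffic.

```python
def heal(healthy, desired, smoke_test):
    """Return the in-service fleet after replacing missing instances.
    A replacement only joins the fleet once it passes the smoke test."""
    in_service = list(healthy)
    while len(in_service) < desired:
        replacement = f"replacement-{len(in_service) + 1}"
        if smoke_test(replacement):
            in_service.append(replacement)
    return in_service

# Three healthy servers when there should be four:
fleet = heal(["app-1", "app-2", "app-3"], desired=4, smoke_test=lambda s: True)
print(fleet)  # ['app-1', 'app-2', 'app-3', 'replacement-4']
```

Gating the replacement on a smoke test is the step that keeps a half-deployed server from receiving traffic before it’s actually ready.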

Scale-down Event

Now let’s say we’re yesterday’s news and traffic has dipped. There’s no need to keep paying for these resources, so the scale-down process begins. Everything shrinks automatically, and we’re back to paying only for what we started with.
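The scale-down side is the mirror image of the scale-up policy sketched earlier, again as a hypothetical model rather than an AWS call: halve the desired capacity, but never drop below the group’s configured minimum.

```python
def scale_down(desired, min_size):
    """Halve the ASG's desired capacity on a scale-down event,
    never dropping below the group's configured minimum."""
    return max(desired // 2, min_size)

desired = 8
desired = scale_down(desired, min_size=4)
print(desired)  # 4
```

The floor is what guarantees you always keep enough servers online to survive the loss of an entire AZ, even at your quietest hour.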

All of this is pretty amazing. Everything fails over automatically – there’s no need to panic in the middle of the night or requisition new hardware. If you’re interested in learning more, I explore the topic further in my presentation, Building Microservices on AWS.