At RightBrain Networks we manage environments for applications old and new. We always strive to build cloud environments according to best practices; however, sometimes application changes are out of scope. When that's the case, we need to plan for failure even though an application server isn't highly available. There are two main ways we approach this problem: Auto Scaling Groups (ASG) and self-healing Amazon Elastic Compute Cloud (EC2) instances.
Auto Scaling Group of One Server
Using an ASG of one server works well if the application state isn't stored on the machine. If the server fails, this configuration enables it to rebuild itself. This methodology is preferred because, in the event of an Availability Zone (AZ) failure within AWS, the application server can migrate itself to a healthy AZ.
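Assuming a launch template for the server already exists, the parameters for such a one-instance ASG can be sketched as below (the names and IDs are hypothetical). Pinning both MinSize and MaxSize to 1 keeps exactly one instance alive, and listing subnets in multiple AZs is what lets the replacement land in a healthy zone:

```python
def single_instance_asg_params(name, launch_template_id, subnet_ids):
    """Build parameters for a self-healing ASG pinned to one instance.

    The resulting dict is shaped for boto3's autoscaling client, e.g.
    boto3.client("autoscaling").create_auto_scaling_group(**params).
    """
    return {
        "AutoScalingGroupName": name,
        "MinSize": 1,            # never fewer than one instance
        "MaxSize": 1,            # and never more than one
        "DesiredCapacity": 1,
        "LaunchTemplate": {
            "LaunchTemplateId": launch_template_id,
            "Version": "$Latest",
        },
        # Subnets in multiple AZs allow the replacement instance
        # to come up in a different, healthy zone.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "HealthCheckType": "EC2",
        "HealthCheckGracePeriod": 300,  # seconds before health checks start
    }

# Hypothetical IDs for illustration:
params = single_instance_asg_params(
    "legacy-app-asg", "lt-0abc123", ["subnet-aaa", "subnet-bbb"])
```

The two-subnet VPCZoneIdentifier is the key piece: with only one subnet, the ASG could never move the server out of a failed AZ.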
One issue that arises with this configuration is that the Amazon Elastic Block Store (EBS) volume attached to the machine can't move between AZs. To solve that problem, we would first look to Amazon Elastic File System (EFS) for a network-attached filesystem that's redundant and does span AZs. While EFS performs well these days, we see it slow down when working with lots of tiny files. The dip in performance is due to the small files using up the available Input/output operations per second (IOPS) in relatively short order. You can always pay for more IOPS, but doing so is usually cost prohibitive.
When there are performance or cost constraints with the EFS solution, we instead use a pattern in which the volume moves with the machine. When the new EC2 instance comes up in the same AZ it was in before the failure, the existing volume is simply reattached. If the instance comes up in another AZ, the new instance can request that the volume be snapshotted; a new volume based on the snapshot is then created in the appropriate AZ and attached. However, this pattern can take significant time to recover.
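The cross-AZ flow can be sketched with boto3's EC2 APIs (create_snapshot, create_volume, attach_volume, and their waiters are real boto3 calls; the wrapper function, IDs, and device name here are illustrative). The client is injected so the logic stays testable:

```python
def move_volume_to_az(ec2, volume_id, target_az, instance_id,
                      device="/dev/xvdf"):
    """Snapshot a volume, recreate it in target_az, and attach it.

    ec2 is expected to behave like boto3.client("ec2"). Recovery time
    grows with volume size because of the snapshot step.
    """
    # 1. Snapshot the stranded volume in the failed AZ.
    snap = ec2.create_snapshot(
        VolumeId=volume_id,
        Description="AZ migration for " + volume_id)
    ec2.get_waiter("snapshot_completed").wait(
        SnapshotIds=[snap["SnapshotId"]])

    # 2. Create a new volume from the snapshot in the healthy AZ.
    vol = ec2.create_volume(
        SnapshotId=snap["SnapshotId"], AvailabilityZone=target_az)
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

    # 3. Attach the new volume to the replacement instance.
    ec2.attach_volume(VolumeId=vol["VolumeId"],
                      InstanceId=instance_id, Device=device)
    return vol["VolumeId"]
```

In practice this would run from the new instance's boot-time tooling with `boto3.client("ec2")` passed in; the waiter on the snapshot is where most of the recovery time is spent.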
Self-Healing Standalone EC2
Another pattern for self-healing application servers that aren't highly available is to provision a single EC2 instance and turn on EC2 Automated Recovery through Amazon CloudWatch. This pattern is usually our least preferred option because the instance can't move itself to another AZ in the event of a complete AZ failure. It does come in handy, however, when the workload was moved to the cloud via a lift-and-shift migration. RightBrain doesn't recommend lift-and-shift migrations in most use cases.
Automated Recovery of a standalone EC2 instance that isn't part of an ASG is configured by setting up CloudWatch alarms for the two types of EC2 status checks: system and instance. The system check monitors the underlying hardware on which the EC2 instance runs, and the instance check monitors the responsiveness of the Operating System (OS).
If a system check fails, the instance needs to be stopped and then restarted. CloudWatch can automatically perform these steps through a provided action called ec2:recover. The instance is stopped and started again, moving it to different underlying hardware within AWS.
In the event an instance check fails, there's usually a problem communicating with the application layer of the host, and a reboot can often remediate the issue. CloudWatch provides an action called ec2:reboot, which reboots the instance monitored by the instance check. This pattern is far from best practice, but it can suffice when there are constraints on budget, time, and application changes.
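The two alarms differ only in the metric they watch and the action they fire. A sketch of the parameters one might pass to CloudWatch's put_metric_alarm follows; the StatusCheckFailed metric names and the arn:aws:automate recover/reboot action ARNs are the real AWS values, while the instance ID, region, and evaluation settings are illustrative:

```python
def status_check_alarm_params(instance_id, region, check="system"):
    """Build alarm parameters for one of the two EC2 status checks.

    check="system"   -> StatusCheckFailed_System   + ec2:recover
    check="instance" -> StatusCheckFailed_Instance + ec2:reboot
    Shaped for boto3.client("cloudwatch").put_metric_alarm(**params).
    """
    metric, action = {
        "system": ("StatusCheckFailed_System",
                   f"arn:aws:automate:{region}:ec2:recover"),
        "instance": ("StatusCheckFailed_Instance",
                     f"arn:aws:automate:{region}:ec2:reboot"),
    }[check]
    return {
        "AlarmName": f"{instance_id}-{check}-status-check",
        "Namespace": "AWS/EC2",
        "MetricName": metric,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        # Require two consecutive failed minutes before acting.
        "EvaluationPeriods": 2,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [action],
    }

# Hypothetical instance and region:
recover = status_check_alarm_params("i-0abc123", "us-east-1", "system")
reboot = status_check_alarm_params("i-0abc123", "us-east-1", "instance")
```

Creating both alarms against the same instance gives the standalone server the full recover-plus-reboot behavior described above.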
These patterns enable self-healing in the cloud for legacy applications that can't distribute their state and won't gain that capability in the near future. By automating the recovery of these instances, we move away from traditional human-managed, human-recovered machines toward nearly fully automated operational processes.