Self-Healing Applications and Managed Cloud Operations

In the cloud, if you don’t plan for failure, you will eventually feel the pain of failure. Issues will arise at some point; it’s inevitable. Once you accept that simple fact and start planning for contingencies, you can begin to lessen the impact of a failure.

Newly designed applications strive to keep their state in highly available data stores. Persistent relational and transactional data is often stored in relational databases like MySQL or PostgreSQL. Non-relational persistent data usually lives in NoSQL solutions like Amazon DynamoDB or Apache Cassandra. Non-persistent, volatile data such as session state sits in in-memory stores like Memcached or Redis. Lastly, objects are kept in a distributed and replicated object store like Amazon Simple Storage Service (S3) or Red Hat Ceph Storage.

The storage layers chosen for your state in production need to be highly available and fail over automatically when something breaks. In Amazon Web Services (AWS), RightBrain uses the Relational Database Service (RDS), ElastiCache, DynamoDB, and S3 when possible. These services take care of the highly available configuration, replication, and failover. While these services may come at a premium, the cost overhead is much smaller than the engineering time that would be spent managing these systems ourselves.

Storing state in these designated tiers enables the application tiers to be highly available and distributed. If an application instance fails, it can be replaced without human intervention because no state is stored on the instance. We achieve this automation through auto-scaling and continual health monitoring.
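To make the idea concrete, here is a minimal sketch of a stateless application instance in Python. It is an illustration, not our production code: SessionStore is a hypothetical in-memory stand-in for an external store such as Redis or ElastiCache, and AppInstance and add_to_cart are names invented for this example.

```python
from dataclasses import dataclass


class SessionStore:
    """Hypothetical stand-in for an external store such as Redis or ElastiCache."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def load(self, session_id: str) -> dict:
        # Return a copy so callers must save explicitly, as with a network store.
        return dict(self._data.get(session_id, {}))

    def save(self, session_id: str, state: dict) -> None:
        self._data[session_id] = dict(state)


@dataclass
class AppInstance:
    """A stateless application instance: all session data lives in the store."""

    name: str
    store: SessionStore

    def add_to_cart(self, session_id: str, item: str) -> list[str]:
        state = self.store.load(session_id)
        cart = state.get("cart", []) + [item]
        state["cart"] = cart
        self.store.save(session_id, state)
        return cart


store = SessionStore()
instance_a = AppInstance("app-a", store)
instance_a.add_to_cart("session-1", "book")

# Simulate instance_a failing and a fresh replacement taking over.
del instance_a
instance_b = AppInstance("app-b", store)
cart = instance_b.add_to_cart("session-1", "pen")
print(cart)  # ['book', 'pen']
```

The replacement instance picks up the session where the failed one left off, because nothing about the session ever lived on the instance itself.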

These principles enable mission-critical applications that run 24/7 to experience failure without major impact on the end client. Automated responses also change the way we look at managed cloud operations. Traditional managed services became obsolete when servers turned into “cattle” (Pets vs. Cattle). Managed cloud operations providers like RightBrain no longer spend cycles patching live servers; they focus on managed DevOps frameworks instead.

Teams like ours at RightBrain aren’t logging into servers to update them or to swap out code. Instead, we’re building automation that responds to failures, using tools like Packer to build machine images with the latest patches installed and swapping out the instances entirely.
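The replace-rather-than-repair approach can be sketched as a small reconcile loop in Python. This is a simplified simulation under stated assumptions: Instance, launch, and reconcile are hypothetical names, launch stands in for a real cloud provider API call, and the image IDs are placeholders for images that a tool like Packer would build.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    image_id: str
    healthy: bool = True


_ids = itertools.count(1)


def launch(image_id: str) -> Instance:
    """Hypothetical launch call; a real system would call the cloud provider's API."""
    return Instance(instance_id=f"i-{next(_ids):04d}", image_id=image_id)


def reconcile(fleet: list[Instance], desired: int, current_image: str) -> list[Instance]:
    """Keep healthy instances on the current image; replace everything else."""
    kept = [i for i in fleet if i.healthy and i.image_id == current_image]
    kept = kept[:desired]
    while len(kept) < desired:
        kept.append(launch(current_image))
    return kept


fleet = [launch("image-v1") for _ in range(3)]
fleet[1].healthy = False  # simulate a failed health check

# The failed instance is replaced, not repaired.
fleet = reconcile(fleet, desired=3, current_image="image-v1")

# A newly patched image is published: every instance is swapped out.
fleet = reconcile(fleet, desired=3, current_image="image-v2")
```

This sketch swaps the whole fleet in one step to keep the example short. A real rollout would normally replace instances in batches so that capacity never drops to zero.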

We’re managing the code pipeline, ensuring the software development life cycle (SDLC) doesn’t have any hiccups, and guaranteeing that logs flow to the centralized logging solution so they’re aggregated and searchable. We hardly ever log in to servers, and that’s the way we think it ought to be.