Building Highly Reliable Deployments

By Vitorrio Brooks

One of the advantages of running your application on a mature public cloud, such as Amazon Web Services, is the built-in redundancy and high availability provided by its services. From the ability to deploy to multiple regions and data centers to the native redundancy of its services and virtual resources, AWS provides a set of highly available foundational components.

However, the deployment, orchestration, and management of those resources raise other questions. Having a reliable, automatable infrastructure does not mean much if the tools and code that control it are unreliable. To build reliable deployments for your infrastructure, you need to build reliability into the server build process itself. There are many ways to accomplish this, but here are four suggestions:

1. Language of Choice:

Use a mature scripting language or high-level language to orchestrate machine provisioning. Relying on batch scripts to orchestrate the deep ecosystem of Windows technologies, or relying on exit codes in Bash as a poor man’s exception handling, will, given enough time and complexity, lead to problems. Using a rich high-level language like Python or an advanced scripting framework like PowerShell not only allows you to develop more reliable code, but also lets you interact more fluidly with the various APIs and services that your code must orchestrate.

For instance, consider what is required to parse a JSON object returned from an API request in a batch script.

At the very least, you’ll need to use curl to make the request (have fun if you intend to @start http://url the request), save the response to a file, parse through it with a FOR loop, and match against the tokens. Alternatively, you could do much of the above but use Findstr and try to find the information you need with a regex pattern.

A quicker, simpler, and more reliable solution would be to make the request using PowerShell, as it’s better suited to the task at hand:

# $url is a placeholder for the endpoint that returns the JSON document
$idDoc = Invoke-RestMethod -Uri $url
$region = $idDoc.region

In the above, we make a simple RESTful call using PowerShell, which automatically creates an object for us out of the returned JSON, exposing the region as a property.
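For comparison, the same idea in Python, the other language suggested above, might look like the following minimal sketch. The function names and the injectable `opener` argument are illustrative assumptions rather than part of any existing tool. The sketch only assumes that the endpoint returns a JSON object containing a `region` field.

```python
import json
from urllib.request import urlopen


def fetch_json(url, opener=urlopen, timeout=5):
    """Request `url` and decode the JSON body into Python objects."""
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def get_region(document):
    """Return the 'region' field, failing loudly if it is missing."""
    try:
        return document["region"]
    except KeyError:
        raise ValueError("response did not contain a 'region' field") from None
```

Passing the opener in as an argument is a design choice that lets the parsing logic be exercised with canned responses, without touching the network.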

2. Detect and handle failures whenever possible.

Write your provisioning and deployment scripts to handle failure gracefully. Using the right tool, such as Python over shell scripting, better prepares you to deal with anomalies. Your code should be written so that every step performed during the build process considers the areas where it might fail and attempts to remedy them. If an error cannot be handled at runtime, tear the machine down and let it rebuild. If the issue is caused by an anomaly, there’s a good chance that the next build will be good. Remember to send yourself a notification with as much information as needed about the error, in case it represents a pattern.
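One way to structure this is sketched below, assuming each build step is a plain Python callable. The names `run_step`, `build_machine`, `teardown`, and `notify` are hypothetical; in a real system the last two would wrap whatever your provisioning and alerting tools provide.

```python
import time


class StepFailed(Exception):
    """Raised when a build step still fails after all retries."""


def run_step(name, action, retries=3, delay=1.0, sleep=time.sleep):
    """Run one build step, retrying with exponential backoff on failure."""
    for attempt in range(1, retries + 1):
        try:
            return action()
        except Exception as error:
            if attempt == retries:
                raise StepFailed(
                    f"{name} failed after {retries} attempts: {error}"
                ) from error
            # Wait delay, 2*delay, 4*delay, ... between attempts.
            sleep(delay * 2 ** (attempt - 1))


def build_machine(steps, teardown, notify, sleep=time.sleep):
    """Run every (name, action) step in order.

    On an unrecoverable error, send a notification, tear the machine
    down so it can be rebuilt, and return False.
    """
    for name, action in steps:
        try:
            run_step(name, action, sleep=sleep)
        except StepFailed as failure:
            notify(str(failure))
            teardown()
            return False
    return True
```

Transient anomalies are absorbed by the retries, while persistent errors surface as a notification that names the failing step, which makes recurring patterns easier to spot.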

3. Test like a programmer.

Operations engineers can often fall into the habit of “building until it works”. As natural as that may feel, this approach does not lead to a reliable outcome. Human error and oversight are the most frequent causes of outages. Test your edge cases and run your code through the same types of tests that a good developer would, to uncover weak points in your design. There are endless resources on the internet about how to “properly” test code, but this link provides a nice overview: Tutorial

As Jez Humble pointed out in Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation:

“The natural step when building these automated delivery scripts should be that every step is tested in a modular thorough way. Downloading the file, unpacking, etc, etc, All should be made reliable and validated before actually considering the script you wrote to be reliable.”

While Humble was speaking of deploying code, the same is true for the scripts you write. They are responsible for installing, configuring, and managing the myriad infrastructure components and services that your application relies on. Since your code is often the glue that holds many disparate components together, it needs to be as reliable as the services themselves.
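As a small illustration of testing one step in isolation, the sketch below covers a hypothetical "verify the downloaded package" step with Python's built-in unittest module. The function `verify_download` and the idea of comparing against a published SHA-256 checksum are assumptions made for the example, not something prescribed by the book.

```python
import hashlib
import unittest


def sha256_of(data):
    """Return the hex SHA-256 digest of a bytes payload."""
    return hashlib.sha256(data).hexdigest()


def verify_download(data, expected_sha256):
    """Raise ValueError unless the payload matches its expected checksum."""
    actual = sha256_of(data)
    if actual != expected_sha256:
        raise ValueError(
            f"checksum mismatch: expected {expected_sha256}, got {actual}"
        )
    return data


class VerifyDownloadTests(unittest.TestCase):
    def test_accepts_matching_payload(self):
        payload = b"release-1.0"
        self.assertEqual(verify_download(payload, sha256_of(payload)), payload)

    def test_rejects_corrupted_payload(self):
        with self.assertRaises(ValueError):
            verify_download(b"corrupted", sha256_of(b"release-1.0"))

    def test_rejects_empty_payload(self):
        # Edge case: a truncated download that produced zero bytes.
        with self.assertRaises(ValueError):
            verify_download(b"", sha256_of(b"release-1.0"))
```

Saved to a file, these tests can be run with `python -m unittest`. The same pattern applies to each remaining step (unpacking, configuring, restarting), so that the script as a whole is only trusted once its parts are.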

4. Code Deployment:

Reliably deploying code does not simply involve writing a script to get code from point A to point B and then restarting a service. Instead, it should involve designing a code deployment system that validates code and infrastructure-as-code during any change to the system. To eliminate human error and build confidence in the process, this should all be automated as well.

This process can vary widely, but at its core are the following principles:

-Keep a master package of “production ready” code on highly-available persistent storage, such as S3, so that during deployments any newly-provisioned machine can grab this package and deploy it on-demand.

-Use health checks/smoke tests after deploying code to verify a proper code deployment has completed. Running a script to validate that an application server can connect to a database after deploying code can save a lot of headaches if connection strings in a configuration file were accidentally changed.

-Use Blue/Green deployment strategies to facilitate zero-downtime deployments. That is, spin up a mirror environment, deploy code to it, and switch over to it via a CNAME change once all smoke tests have passed. Depending on the tier, this process can be quick and painless in the cloud.

-Use a Continuous Integration / Continuous Delivery server to orchestrate deployment pipelines that automate your code deployment process. That is, any change that is made to your environment should be run through an automated, orchestrated series of tests to validate that this change is functional and reliable and ready to be released to production.
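The smoke-test and Blue/Green principles above can be combined so that traffic only moves to the new environment once every check passes. The following is a minimal sketch under stated assumptions: `can_connect` stands in for a database connectivity check using a plain TCP connection, and `switch_traffic` is a hypothetical callback that would perform the CNAME change through your DNS provider's API.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_smoke_tests(checks):
    """Run (name, check) pairs; return the names of the checks that failed."""
    failed = []
    for name, check in checks:
        try:
            passed = check()
        except Exception:
            # A check that raises is treated the same as one that fails.
            passed = False
        if not passed:
            failed.append(name)
    return failed


def promote_if_healthy(checks, switch_traffic):
    """Switch traffic to the new environment only if every check passes.

    Returns the list of failed check names (empty when promoted).
    """
    failed = run_smoke_tests(checks)
    if failed:
        # Leave the current environment serving traffic.
        return failed
    switch_traffic()
    return []
```

A check list might contain an entry such as `("database", lambda: can_connect(db_host, db_port))`, where `db_host` and `db_port` are read from the same configuration file the application uses, so that an accidentally changed connection string is caught before the switch.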

There are many more aspects involved in building highly reliable infrastructure, including proper use of configuration management, monitoring and notification, and well-engineered application architecture. Check back on our blog for updates on these topics.