Developers and testers have traditionally left hardware testing to electrical engineers. However, an interesting thing happens when we deploy code in the cloud — we open ourselves up to a whole new area of testing by replacing hardware with software abstractions. Tools like Puppet, Salt and Chef have turned much of our hardware deployment and provisioning from a physical endeavor to a software task. Is there any reason why we can’t test the software abstraction on which our code runs using the same paradigms we use to test software projects?
When I began exploring the concept of testing the cloud I had to step back and ask myself some fundamental questions. Testing always made sense; I want my project to work, and when it breaks I want to know that, too. However, I had never really sat down and thought about why I spend so much time testing.
Why Do We Test?
In short, we test to make sure our projects work. In general, I’m far more uncomfortable when something I compile works the first time than when I find issues and have to dive in. Tests enable us to say, “I’m done, it works!” But if this were the only reason to test then there wouldn’t be any need to run automated and continuous test suites.
The popularity of continuous integration and continuous delivery (CI/CD) isn’t an accident. Automating and continuously running test suites gives us confidence not only that our projects work, but also that they keep working when changes are made. This is extremely convenient for a single engineer, but as we move towards larger development teams, distributed version control (Git) and rapid deployment models, CI/CD becomes essential.
Metrics… developers hate ‘em, managers love ‘em and everyone besides the U.S., Liberia, and Myanmar uses their system (*bah dum bum*). Testing provides metrics without significant additional effort. When test plans are well written and run frequently, they provide an excellent view of the health of a code base and how ready it is to deploy.
Proving code quality, expediting development and deployment, and providing a view into the health of our code are just a few reasons why we test our code. I’m sure there are many other reasons to test, but I’ll focus on these in a future blog post when I share my lessons learned.
How Do We Test and Why Is it So Difficult?
When engineers and designers add physical interfaces or features to products, they’re limited by inconveniences like the laws of physics. Software developers have more flexibility. But with great flexibility comes great complexity. The more interfaces we build, the more possible paths our code can take. As much as we’d like to test all possible logical transitions, it’s simply not possible. If we think about communication between our systems as graphs, or code paths as transitions from node to node, it’s easy to understand how things get out of hand. Below is a visualization of all the communication paths in 12-, 36-, and 72-node meshes.
A full mesh with n nodes will have (n*(n-1))/2 lines. Given that many of our applications have hundreds, if not thousands, of possible transitions, we can’t possibly test them all. So we generally settle for testing the inputs and outputs of every node. I bring this up for two reasons: first, to explain why testing is an art in itself, and second, because cloud-deployed microservices are analogous to a well-structured program. Where cloud microservices pass data via network requests and responses, a program passes data through function calls and returns. Once we recognize that software testing is a highly skilled art, it becomes less surprising that software teams have historically shied away from the equally highly skilled art of hardware testing.
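To put numbers on that formula, here is a minimal Python sketch. The function name `mesh_edge_count` is my own, chosen only for illustration.

```python
def mesh_edge_count(n: int) -> int:
    """Number of links in a full mesh of n nodes: n * (n - 1) / 2."""
    if n < 0:
        raise ValueError("node count must be non-negative")
    # n * (n - 1) is always even, so integer division is exact.
    return n * (n - 1) // 2


# The three mesh sizes mentioned above.
for nodes in (12, 36, 72):
    print(nodes, mesh_edge_count(nodes))
```

For 12, 36, and 72 nodes this works out to 66, 630, and 2,556 links. Going from 12 to 72 nodes is six times the nodes but nearly forty times the links.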
With the cloud, we’ve resigned ourselves to letting the hardware be someone else’s problem. Instead of having to purchase, rack, wire, and power a physical device, we write configuration files and automate launching instances. We often test deployments as a system, but this new paradigm is ripe for something analogous to unit testing. Some things we could test would be:
- Instance availability – This is nothing new; Nagios, CloudWatch and many other tools have been doing this.
- Resource provisioning – If we’re counting on high I/O, CPU or other resource provisioning, this could be verified even before code is deployed.
- Inter-instance connectivity – Communication between instances, or between instances and non-instance-based utilities, could be verified before a deployment is switched into production.
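As a rough idea of what such "unit tests" for a deployment might look like, here is a small Python sketch using only the standard library. The helper names (`has_min_cpus`, `can_connect`, `run_deployment_checks`) and the idea of passing in a list of peers are assumptions of mine for illustration, not part of any particular tool.

```python
import os
import socket
from typing import Dict, Iterable, Tuple


def has_min_cpus(minimum: int) -> bool:
    """Resource provisioning: does this instance report at least `minimum` CPUs?"""
    count = os.cpu_count()  # may be None if it cannot be determined
    return count is not None and count >= minimum


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Inter-instance connectivity: can we open a TCP connection to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def run_deployment_checks(
    min_cpus: int, peers: Iterable[Tuple[str, int]]
) -> Dict[str, bool]:
    """Run every check and return a name -> passed mapping."""
    results = {f"cpus>={min_cpus}": has_min_cpus(min_cpus)}
    for host, port in peers:
        results[f"tcp:{host}:{port}"] = can_connect(host, port)
    return results
```

Calling something like `run_deployment_checks(4, [("10.0.0.5", 5432)])` (a made-up address and port) on a freshly launched instance, before any application code is deployed, would tell us whether the box we got matches the box we asked for.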
Monitoring the CD in CI/CD
Once we have tests for our deployments, we can run them continually. This will give us confidence that the things that are out of our control (e.g. underlying hardware, cloud provider networking) continue to operate correctly and, if not, that we fail fast. A continuously running suite of unit tests for a cloud deployment with proper reporting also provides valuable data to help track intermittent issues.
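Here is one way the "run continually and report" idea could be sketched in Python. Everything in it (`run_checks`, `monitor`, `failure_counts`, and the shape of the history records) is a hypothetical design of mine, not a reference to an existing monitoring tool; a real setup would more likely lean on a scheduler or the CI system itself.

```python
import time
from collections import Counter
from typing import Callable, Dict, List, Tuple

Check = Callable[[], bool]
History = List[Tuple[float, Dict[str, bool]]]


def run_checks(checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run each named check once; a check that raises counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def monitor(
    checks: Dict[str, Check],
    rounds: int,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> History:
    """Run all checks `rounds` times, pausing `interval_s` seconds between rounds."""
    history: History = []
    for i in range(rounds):
        history.append((time.time(), run_checks(checks)))
        if i < rounds - 1:
            sleep(interval_s)
    return history


def failure_counts(history: History) -> Counter:
    """Count failures per check, which helps surface intermittent issues."""
    counts: Counter = Counter()
    for _, results in history:
        for name, ok in results.items():
            if not ok:
                counts[name] += 1
    return counts
```

Because every round is recorded with a timestamp, a check that fails twice in a hundred rounds shows up in the history instead of being dismissed as a one-off. The `sleep` parameter is injectable only so the loop can be exercised quickly in tests.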