AWS Batch

Objective

The objective of this blog is to share the excitement I found while building a proof of concept with AWS Batch. I believe this platform can solve some interesting problems for batch-style data processing workflows. If you'd like to view the POC now, follow this link to GitHub: AWS Batch Example POC on Github.

Why I think AWS Batch is Cool

In this section I'll explain the features AWS Batch offers and a bit about how it offers them. You'll also gain some insight into what sets AWS Batch apart from a run-your-own solution such as Apache Mesos, which is still a totally valid option.

First and foremost, Batch is simply a platform for running tasks; it isn't prescriptive about how you define or run your own processing. Second, and most exciting from an operations perspective, it's fully managed: AWS Batch manages the infrastructure for you. When defining your Compute Environment you set the minimum and maximum number of vCPUs you'd like to utilize, and Batch scales horizontally within those bounds. If you select "Optimal" as the Instance Type, Batch automatically chooses instance types based on the amount of work in the queue, allowing it to scale vertically as well.
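As a rough sketch, creating such a managed Compute Environment with boto3 might look like the following. The names, subnets, and roles here are placeholders, not values from the POC.

```python
import boto3

batch = boto3.client("batch")

# A managed Compute Environment: Batch picks "optimal" instance types and
# scales between the min and max vCPU bounds based on queued work.
batch.create_compute_environment(
    computeEnvironmentName="poc-ondemand",       # placeholder name
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 64,
        "desiredvCpus": 0,
        "instanceTypes": ["optimal"],            # let Batch choose instance sizes
        "subnets": ["subnet-aaaa1111"],          # placeholder subnet
        "securityGroupIds": ["sg-bbbb2222"],     # placeholder security group
        "instanceRole": "ecsInstanceRole",       # placeholder instance profile
    },
    serviceRole="AWSBatchServiceRole",           # placeholder service role
)
```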

Your Compute Environment can also run on Spot Instances, and AWS Batch will bid on Spot pricing for you. To complement that, your job queues can be worked by multiple Compute Environments: in the event you're not able to procure Spot Instances, a backup Compute Environment with On-Demand pricing can activate and process the work in the queue. The job queue sets an order of preference across the Compute Environments that process its work, meaning you can tell Batch to prefer Spot Instances over On-Demand. This lets AWS Batch optimize your processing power while also optimizing your work-to-cost ratio.
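Here's a hedged sketch of that preference, assuming a Spot and an On-Demand Compute Environment already exist (the environment and queue names are placeholders):

```python
import boto3

batch = boto3.client("batch")

# The queue tries the Spot environment first (order 1) and falls back to the
# On-Demand environment (order 2) when Spot capacity can't be procured.
batch.create_job_queue(
    jobQueueName="poc-queue",                            # placeholder name
    state="ENABLED",
    priority=10,                                         # higher number = higher priority
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "poc-spot"},      # placeholder Spot CE
        {"order": 2, "computeEnvironment": "poc-ondemand"},  # placeholder On-Demand CE
    ],
)
```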

The queues themselves also have a priority, so you can ensure the most important work gets done first. The work is done by container tasks defined in Job Definitions. It appears that Job Definitions can only be defined as containers today; however, when defining a Job Definition in CloudFormation, I did see an error telling me the Type could be "container" or "lambda". That's a bit of foreshadowing that Job Definitions may eventually be able to run on AWS Lambda. A Job Definition can set all the properties of a container task, such as environment variables, parameters, mount points, volumes, commands, etc.
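For illustration, registering a container Job Definition with boto3 could look like this sketch (the image and values are placeholders, not the POC's exact definitions):

```python
import boto3

batch = boto3.client("batch")

# A container Job Definition: image, resources, default command, and environment.
batch.register_job_definition(
    jobDefinitionName="poc-process-job",         # placeholder name
    type="container",                            # the only type available today
    containerProperties={
        "image": "busybox",
        "vcpus": 1,
        "memory": 128,                           # MiB
        "command": ["echo", "hello world"],
        "environment": [
            {"name": "STAGE", "value": "poc"},   # placeholder environment variable
        ],
    },
)
```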

When submitting a job you can override most of the Job Definition's properties. This lets you pass information into your container about its environment and purpose at run time. Jobs can also depend on other jobs, which lets you string multiple jobs together into a workflow while keeping each individual job focused. Another feature of job submission is the ability to run an array of jobs, making Monte Carlo simulations a breeze.
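A sketch of a submission exercising those features, assuming the placeholder queue and Job Definition above; the job name, dependency ID, and command are hypothetical:

```python
import boto3

batch = boto3.client("batch")

# Submit an array of 100 jobs that overrides the definition's environment and
# command, and only starts after a previously submitted job has finished.
batch.submit_job(
    jobName="poc-simulation",                                        # hypothetical name
    jobQueue="poc-queue",
    jobDefinition="poc-process-job",
    dependsOn=[{"jobId": "11111111-2222-3333-4444-555555555555"}],   # hypothetical job ID
    arrayProperties={"size": 100},                                   # fan out 100 child jobs
    containerOverrides={
        "environment": [{"name": "OBJECT_KEY", "value": "incoming/data.csv"}],
        "command": ["python", "simulate.py"],                        # hypothetical command
    },
)
```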

AWS Batch also monitors the progress of each job as it works through different states: SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING, FAILED, and SUCCEEDED. I believe this encapsulates the progression of a job pretty well. An interesting point I found is that a submitted job which depends on work that hasn't completed yet stays in the PENDING status. Once a job transitions to RUNNABLE, the Compute Environment is able to pick up that portion of work. Once jobs transition to SUCCEEDED or FAILED, you're able to quickly retrieve the logs produced by that job in CloudWatch Logs. The shipping of container standard output to CloudWatch Logs is done automatically in a Managed Compute Environment.
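As a rough sketch of checking status and pulling those logs with boto3 (the job ID is hypothetical, and /aws/batch/job is the default log group Batch writes to):

```python
import boto3

batch = boto3.client("batch")
logs = boto3.client("logs")

job_id = "11111111-2222-3333-4444-555555555555"   # hypothetical job ID

# describe_jobs reports the job's current state and its CloudWatch log stream.
job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
print(job["status"])                              # e.g. RUNNABLE, RUNNING, SUCCEEDED

# Once the container has started, fetch its stdout from CloudWatch Logs.
stream = job["container"].get("logStreamName")
if stream:
    events = logs.get_log_events(
        logGroupName="/aws/batch/job",
        logStreamName=stream,
        startFromHead=True,
    )
    for event in events["events"]:
        print(event["message"])
```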

Lastly, Batch is Free. That’s right, Free. You pay only for the compute resources utilized. The queues, the scheduling, the automation, that’s all free.

With all that said, I hope I've provided some excitement around the platform. As mentioned, a fully managed platform is optimal from an operations perspective, and I'm sure development teams will appreciate that as well. The platform gives you many options for defining your own batch processing workflows, and it's free, except for the compute.

The POC I Built

AWS Batch Example POC on Github.

The proof of concept I built demonstrates running a batch workload triggered by an S3 object creation or modification. In this section I'll discuss the main components of the POC: the trigger, the jobs, and the workflow.

AWS Batch POC Diagram
Diagram of the POC.

A file-based trigger is a typical trigger in batch workloads: when a file comes in and is ready to be processed, a Batch Job is submitted. I've configured an S3 notification configuration that notifies AWS Lambda when an object is created or modified by PUT or POST. With the notification we're able to ingest the event passed to Lambda, which details the S3 action, including what object was created or modified, and how. With this data we're able to submit a job to Batch, and we have the opportunity to tell the job about the object via environment variable or CMD overrides.
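A minimal sketch of what that Lambda handler could look like, assuming a queue named poc-queue and a start Job Definition named poc-start-job (both placeholders, not the POC's exact names):

```python
import boto3

batch = boto3.client("batch")


def handler(event, context):
    """Triggered by the S3 notification; submits a Batch job per object."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Tell the job which object triggered it via environment overrides.
        batch.submit_job(
            jobName="start-" + context.aws_request_id,
            jobQueue="poc-queue",                    # placeholder queue name
            jobDefinition="poc-start-job",           # placeholder definition name
            containerOverrides={
                "environment": [
                    {"name": "S3_BUCKET", "value": bucket},
                    {"name": "S3_KEY", "value": key},
                ],
            },
        )
```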

The Job Definition used for the job submitted by Lambda I'll refer to as the Start Job. The Start Job does nothing more than submit more jobs, and the way those jobs are submitted is part of the POC: I wanted to exercise job dependencies and show what's known as a fan-out/fan-in workflow. This type of workflow demonstrates both serial and parallel processing. The workflow schedules some process, and after that process completes, a number of other processes can execute in parallel. Once all of the parallel jobs complete, the workflow can proceed with serial processing.

The jobs scheduled by the Start Job only emulate work; I'll refer to these as Process Jobs. The Process Jobs are simply a busybox container with an "echo 'hello world'" CMD. The Start Job itself is a container built on top of CentOS 7 with Python 3.6. There's a custom-built Python module and command-line interface that uses boto3 to schedule the Process Jobs and define the workflow mentioned above.
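The module itself lives in the POC repository; purely as an illustration, the fan-out/fan-in pattern it implements can be sketched with boto3 roughly like this (queue, definition, and job names are placeholders):

```python
import boto3

batch = boto3.client("batch")

QUEUE = "poc-queue"               # placeholder queue name
PROCESS_JOB = "poc-process-job"   # placeholder Process Job definition


def submit(name, depends_on=None):
    """Submit one Process Job, optionally depending on earlier jobs."""
    response = batch.submit_job(
        jobName=name,
        jobQueue=QUEUE,
        jobDefinition=PROCESS_JOB,
        dependsOn=[{"jobId": job_id} for job_id in (depends_on or [])],
    )
    return response["jobId"]


# Serial step first...
first = submit("fan-out-start")

# ...then fan out to parallel Process Jobs that each depend on the first job...
parallel = [submit("parallel-%d" % i, depends_on=[first]) for i in range(5)]

# ...and fan back in with a final job that waits on every parallel job.
final = submit("fan-in-finish", depends_on=parallel)
```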

This POC is slim, but it demonstrates the capability the AWS Batch platform provides for running batch workloads. As a minimum viable product to get my feet wet with the service, I believe the POC was a success. I was extremely excited to get to play with the new service and hope to get the chance to work with it again soon. If you're interested in learning more, I've made this POC available on GitHub; it includes CloudFormation and a script to help you set up quickly and precisely. If you're interested in implementing this for your batch workload, get in contact with us! AWS Batch Example POC on Github.