Deep Dive Into Auto Scaling Lifecycle Hooks

In July of 2014, without much fanfare, Amazon Web Services released one of its most valuable (and possibly underutilized) updates to its Auto Scaling service: Auto Scaling Lifecycle Hooks. Lifecycle hooks allow EC2 instances that are part of an Auto Scaling group to pause for a specified amount of time during initialization or before terminating.

There are plenty of use cases for lifecycle hooks. For example, you can pause termination during a scale-in event so that you can ship system or application logs to a central store, allow a job queue on the file system to drain, or simply analyze the instance before it terminates.

When a lifecycle hook is added to an Auto Scaling group, instances wait a specified amount of time before transitioning to their next state. In our example, we will configure a lifecycle hook to force the Auto Scaling Group instances to wait thirty minutes before terminating. Notifications of instance scale-in events will be sent to an SQS queue of our choosing. We’ll also implement a simple worker (consumer) on the instances to pull messages off this queue to determine if they are slated for termination and to execute our pre-termination tasks when true.

Finally, we’ll demonstrate how to dynamically modify the termination timeout period, in cases where we need more time to complete our pre-termination tasks, and then terminate the instance once all work is complete.

The example below uses the AWS Command Line Interface (CLI) to create and orchestrate the Auto Scaling group’s lifecycle hook. The same functionality is available in most of the latest Amazon SDKs. Lifecycle hook creation has not, however, been added to the AWS management console and cannot currently be configured in CloudFormation.

Creating Your First Lifecycle Hook

To create a lifecycle hook for an Auto Scaling group, we use the put-lifecycle-hook command and provide values for the various arguments it accepts. This command registers a new hook with the Auto Scaling service that we can use to control what happens when an instance initializes or is ready to terminate.

aws autoscaling put-lifecycle-hook \
    --lifecycle-hook-name "do-some-work" \
    --auto-scaling-group-name "exampleAutoScalingGroup" \
    --lifecycle-transition "autoscaling:EC2_INSTANCE_TERMINATING" \
    --role-arn "arn:aws:iam::123456789:role/AutoScaling" \
    --notification-target-arn arn:aws:sqs:us-east-1:123456789101:exampleQueue \
    --heartbeat-timeout 1800 \
    --default-result 'CONTINUE'

In the example above, we’ve named our lifecycle hook do-some-work, and we’ve applied it to the exampleAutoScalingGroup.

This means that any existing or new instances in that Auto Scaling Group that need to terminate will now do so according to the options we’ve set.

Speaking of configuration, we need to let the Auto Scaling service know that our hook should only apply to instances that are terminating, not to instances that are initializing. To do this, we need to set the lifecycle transition that this hook will apply to. In this case, because we want to create a hook that pauses termination, we'll set the --lifecycle-transition option to the EC2_INSTANCE_TERMINATING transitional state.

--lifecycle-transition "autoscaling:EC2_INSTANCE_TERMINATING"

Now that we've configured when our instances will pause, we need to let Auto Scaling know where we want to receive notifications about instances that are waiting to terminate. This is accomplished with the --notification-target-arn option:

--notification-target-arn arn:aws:sqs:us-east-1:636936778347:exampleQueue

The notification target option takes the ARN of an Amazon SQS queue to which we'd like termination messages published, or alternatively the ARN of an SNS topic to which we'd like the termination message posted. In the command above we've supplied an ARN that points to an Amazon SQS queue we've created called exampleQueue.
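If you haven't created the queue yet, you can create it and look up its ARN ahead of time. Below is a minimal boto3 sketch; the queue name and region simply mirror the example values used in this article.

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# Create the queue (returns the existing queue's URL if it already exists
# with the same attributes).
queue_url = sqs.create_queue(QueueName="exampleQueue")["QueueUrl"]

# Look up the queue's ARN to pass to --notification-target-arn.
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

print(queue_arn)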

When a scale-in event occurs and an instance is slated for termination, a message will be published to this queue which, among other properties, will include the instance ID of the EC2 instance that is slated for termination:

{
    "AutoScalingGroupName": "exampleAutoScalingGroup",
    "Service": "AWS Auto Scaling",
    "Time": "2015-01-07T18:37:17.553Z",
    "AccountId": "356438515751",
    "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
    "RequestId": "6648ba02-138b-4f56-a0c7-bc74f22c3b51",
    "LifecycleActionToken": "eaac0cdf-df85-4c8f-a9ed-f0a685066099",
    "EC2InstanceId": "i-af391367",
    "LifecycleHookName": "do-some-work"
}

Our worker, which runs on the EC2 instance, will poll this queue for messages to determine whether the instance it is running on should run its pre-termination tasks.

It's worth noting that when you first create your lifecycle hook, the Auto Scaling service will publish a test message to your SQS queue with details about the newly created hook:

{
    "AutoScalingGroupName": "exampleAutoScalingGroup",
    "Service": "AWS Auto Scaling",
    "Time": "2015-01-07T17:46:26.005Z",
    "AccountId": "356438515751",
    "Event": "autoscaling:TEST_NOTIFICATION",
    "RequestId": "1a10a7c6-9695-11e4-97d0-730e96ff7596",
    "AutoScalingGroupARN": "arn:aws:autoscaling:us-west-1:356438515751:autoScalingGroup:3adf0ecc-39e7-4d58-9933-67df2bbee7fa:autoScalingGroupName/exampleAutoScalingGroup"
}

This should be taken into account when you create your worker, especially if you'll be adding hooks to your Auto Scaling groups dynamically, as the worker will need to distinguish this test notification from an actual termination message.
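A minimal sketch of that check, assuming the worker has already pulled a raw message body off the queue (the function name is purely illustrative):

import json

def is_termination_notice(raw_body):
    """Return True only for real termination notices, skipping the
    TEST_NOTIFICATION that Auto Scaling publishes when a hook is created."""
    body = json.loads(raw_body)
    if body.get("Event") == "autoscaling:TEST_NOTIFICATION":
        return False
    return body.get("LifecycleTransition") == "autoscaling:EC2_INSTANCE_TERMINATING"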

Now, before the Auto Scaling service can publish termination messages to our queue, it will need permissions to publish to SQS queues. The --role-arn option is used to pass the ARN of an IAM role that gives the Auto Scaling service permissions to publish to an SQS queue.

--role-arn "arn:aws:iam::356438515751:role/lifecycle-role"

An example policy that provides these permissions to the Auto Scaling service is shown below:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sqs:SendMessage",
                "sqs:GetQueueUrl",
                "sns:Publish"
            ],
            "Resource": "*"
        }
    ]
}
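The permissions policy alone isn't sufficient; the role also needs a trust relationship that allows the Auto Scaling service to assume it. Below is a hedged sketch of creating such a role with boto3 (the role and policy names are illustrative, not required values).

import json

import boto3

iam = boto3.client("iam")

# Trust policy: allow the Auto Scaling service to assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "autoscaling.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="lifecycle-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the SQS/SNS publish permissions shown above as an inline policy.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["sqs:SendMessage", "sqs:GetQueueUrl", "sns:Publish"],
        "Resource": "*",
    }],
}

iam.put_role_policy(
    RoleName="lifecycle-role",
    PolicyName="lifecycle-hook-publish",
    PolicyDocument=json.dumps(permissions_policy),
)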

Next, we'll need to set the default timeout period, which is the amount of time that an instance will wait before terminating. In our example above, we've configured each instance in the Auto Scaling group to pause for exactly thirty minutes before terminating by including the --heartbeat-timeout option, which we've set to 1800 seconds. By setting the --default-result option to CONTINUE, we've also ensured that the instance will terminate if our timeout threshold is reached.

--heartbeat-timeout 1800 --default-result 'CONTINUE'

Orchestrating Termination

By issuing the put-lifecycle-hook command as we’ve defined above, we’ve completed setting up our hook. Our machines will now wait thirty minutes before terminating, and the Auto Scaling service will publish a message to our SQS queue identifying each instance that is slated for termination.

Our workers can now pull messages from the queue to determine whether the instances they run on are slated for termination:

Request
aws sqs receive-message --queue-url https://sqs.us-west-1.amazonaws.com/111111111111/exampleQueue --visibility-timeout 60

Response
{
    "Messages": [
        {
            "Body": "{\"AutoScalingGroupName\":\"exampleAutoScalingGroup\",\"Service\":\"AWS Auto Scaling\",\"Time\":\"2015-01-07T19:13:22.375Z\",\"AccountId\":\"356438515751\",\"LifecycleTransition\":\"autoscaling:EC2_INSTANCE_TERMINATING\",\"RequestId\":\"876eac1c-2aaa-407d-98d7-ce9afe597663\",\"LifecycleActionToken\":\"4889fcc7-adc6-43ff-a415-46240e2f57dc\",\"EC2InstanceId\":\"i-883a3a42\",\"LifecycleHookName\":\"do-some-work\"}",
            "ReceiptHandle": "AQEBAjam9pe3ZxzD+w3A==",
            "MD5OfBody": "d872dc653bcd5d1cc981b2eae64d3827",
            "MessageId": "b3308afb-dad3-4eef-abb9-1d99aa9dd50f"
        }
    ]
}

With this information, the worker on the instance can determine if the instance is scheduled for termination and can begin to perform its pre-termination task. Again, this could involve any task that must be completed before the machine terminates, such as shipping logs.
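To make this concrete, here is a minimal worker sketch in Python using boto3. It is only a sketch under the assumptions of this article: the queue URL and region are placeholders, ship_logs() stands in for whatever pre-termination work you need, and error handling and retries are omitted. The LifecycleActionToken returned in the message body is what we'll need for the heartbeat and completion calls described next.

import json
import urllib.request

import boto3

QUEUE_URL = "https://sqs.us-west-1.amazonaws.com/111111111111/exampleQueue"  # placeholder

sqs = boto3.client("sqs", region_name="us-west-1")

# Ask the instance metadata service for this instance's own ID so the worker
# can recognize termination messages that are meant for it.
MY_INSTANCE_ID = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id"
).read().decode()


def ship_logs():
    pass  # placeholder for the real pre-termination work


def poll_once():
    """Pull up to one message and act on it if it targets this instance."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,
        VisibilityTimeout=60,
    )
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        # Ignore the TEST_NOTIFICATION and any non-termination messages.
        if body.get("LifecycleTransition") != "autoscaling:EC2_INSTANCE_TERMINATING":
            continue
        # Leave messages addressed to other instances on the queue; they become
        # visible again once the visibility timeout expires.
        if body.get("EC2InstanceId") != MY_INSTANCE_ID:
            continue
        ship_logs()
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        return body  # contains the LifecycleActionToken used below
    return None

In practice you would run poll_once() in a loop or on a timer until it returns a message addressed to this instance.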

During the execution of the pre-termination task(s), if we find that we are nearing our timeout but need more time, we can use the record-lifecycle-action-heartbeat command (a separate command in its own right) to give the application more time to clear its queue, as demonstrated below:

aws autoscaling record-lifecycle-action-heartbeat --lifecycle-hook-name "do-some-work" --auto-scaling-group-name "exampleAutoScalingGroup" --lifecycle-action-token "A544346G324F3"

The record-lifecycle-action-heartbeat command extends the wait period by the length of time you defined in the heartbeat timeout parameter when you created the lifecycle hook. For example, if after fifteen minutes we decide to call record-lifecycle-action-heartbeat to buy more time, another thirty minutes would be added, giving us a total of forty-five minutes before the heartbeat times out. You must pass the --lifecycle-action-token value when calling the heartbeat command. This token uniquely identifies a specific lifecycle action associated with an instance and is available, in this particular case, in the message published to our SQS queue (see Response above).
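The same heartbeat can also be recorded through the SDK. A minimal sketch, assuming the token was taken from the SQS message body received by the worker above:

import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")  # placeholder region


def request_more_time(token):
    """Extend the wait by another heartbeat-timeout interval (30 minutes here).
    `token` is the LifecycleActionToken from the SQS message body."""
    autoscaling.record_lifecycle_action_heartbeat(
        LifecycleHookName="do-some-work",
        AutoScalingGroupName="exampleAutoScalingGroup",
        LifecycleActionToken=token,
    )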

When our queue is finally empty, we can instruct the instance to terminate using the complete-lifecycle-action call, as demonstrated below:

aws autoscaling complete-lifecycle-action --lifecycle-hook-name "do-some-work" --auto-scaling-group-name "exampleAutoScalingGroup" --lifecycle-action-token "eaac0cdf-df85-4c8f-a9ed-f0a685066099" --lifecycle-action-result "CONTINUE"

In the example above, we use the complete-lifecycle-action call to instruct the Auto Scaling service to proceed with terminating the instance.
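The equivalent SDK call, again as a minimal sketch using the same hypothetical token from the worker's message:

import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")  # placeholder region


def allow_termination(token):
    """Tell the Auto Scaling service that the pre-termination work is done and
    it may proceed with terminating the instance."""
    autoscaling.complete_lifecycle_action(
        LifecycleHookName="do-some-work",
        AutoScalingGroupName="exampleAutoScalingGroup",
        LifecycleActionToken=token,
        LifecycleActionResult="CONTINUE",
    )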

Summary

Admittedly, using lifecycle hooks to manage the transitions of an Auto Scaling group's instances may not be an ideal solution in every case. It is far better to use the many AWS services available (S3, SQS, CloudWatch Logs, etc.) to design your application servers to be as stateless as possible (storing nothing but the application code).

Nonetheless, lifecycle hooks can assist in managing the state of instances and controlling the conditions under which an instance may launch or terminate. To learn more, please visit: Auto Scaling Group Lifecycle.

8 Replies to “Deep Dive Into Auto Scaling Lifecycle Hooks”

  1. If the worker, as you’ve laid it out, is running on multiple instances in an ASG, when they’re all pulling messages from the queue, how do you ensure that the right instance gets the message?
    It looks to me like if one of the other instances pulled the SQS message, it would have to ignore it, but somehow ensure that the right instance eventually picked it up.

  2. This is a fantastic article! There is one piece missing that would really make this article stand out from all others: An example of your worker that was successfully implemented. The reason is that the worker itself is just as critical to the automation process of lifecycle hooks as the creation of the hooks themselves.

  3. Hi Derek,

    Good question, and as there has been some mention of this, I may try to post an example worker/solution in a later article. The solutions to “race conditions” and the like can require a good bit of thought, and SQS is not always the right choice. However, when a worker pulls a message from the queue and parses it, it can read the instance-id from a field in the message: “EC2InstanceId”: “i-ab123456”. The worker then knows that it needs to terminate, can run its post-termination task, and then delete that message from the queue.

It is simpler however, and often more desirable, to set up a more synchronous workflow for managing termination events. That is to say, it's often better to set up the --notification-target-arn to point to an SNS topic. That SNS topic could have the URL of a Jenkins build job subscribed to it. When the Auto Scaling Service then posts to that topic (as it would during a termination event), the build job could execute and start your post-termination tasks immediately (This avoids the need to filter through messages in the queue looking for a termination message with a certain instance-id), speeding up the process and avoiding the possibility that an instance never sees a message, because it’s always being consumed by another instance’s worker.

  4. Hi Thomas,

    Thanks! And yes, that’s exactly right, the worker is critical. There’s a lot that can go into designing it, and the processes around it, to make sure that it’s reliable, can avoid race conditions, etc. I may do a follow up post, then, that demonstrates how to address some of these critical points, perhaps including a full example.

  5. Hello, I am curious as to whether or not you’ve been able to create an example of what a worker would look like checking on whether the instance is ready to be terminated or not. This seems to be my biggest roadblock. I’d be interested to know if this is a script that is baked into the AMI of the instance you’re using, or if it’s being pulled from source control. This was a great article btw, really helped me understand the LifeCycleHooks a lot more than the AWS white pages.

  6. Thank you very much for this article. It was the only place where I found an actual example of what the SQS message for a lifecycle hook event actually looks like, and it saved me from a lot of trial-and-error.

Regarding Derek Lewis’s comment about knowing if the message is sent to you…technically, no two machines will act upon the same event, so you can just set the visibility to 0. There are still downsides, as shown by tests I have run:

    -even with visibility of 0, messages appear to disappear. Not sure why this is, perhaps the “0” case is really “durned close to 0”, or maybe a visibility of 0 doesn’t work.

    -if you have your myriad of workers sniffing the queue, you can run into a starvation situation. It is important for your sniff demons (sounds like a 70’s disco term) to not get in lock-step. It is unclear to me if a random backoff is important, given that it is yet unclear how “visibility=0” works (or should I say, NOT works)

    -make sure your auto queue deletion period isn’t too long: otherwise, you can run into a situation where a crufty (unhandled) message will block the queue. To help reduce this, if you’re using a visibility of 0 (which as stated, doesn’t always work right), just pull off the maximum 10 messages

    -this queue deletion period is ESPECIALLY important if you have a massive descale event storm
