In July of 2014, without much fanfare, Amazon Web Services released one of its most valuable (and possibly underutilized) updates for its Auto Scaling service: Auto Scaling Lifecycle Hooks. Auto Scaling Lifecycle Hooks allow EC2 instances that are part of an Auto Scaling group to pause for a specified amount of time during initialization or before terminating.
There are plenty of use cases for lifecycle hooks. For example, you can pause termination during a scale-in event to ship system or application logs to a central store, to let a job queue on the file system drain, or simply to analyze the instance before it terminates.
When a lifecycle hook is added to an Auto Scaling group, instances wait a specified amount of time before transitioning to their next state. In our example, we will configure a lifecycle hook to force the Auto Scaling Group instances to wait thirty minutes before terminating. Notifications of instance scale-in events will be sent to an SQS queue of our choosing. We’ll also implement a simple worker (consumer) on the instances to pull messages off this queue to determine if they are slated for termination and to execute our pre-termination tasks when true.
Finally, we’ll demonstrate how to dynamically modify the termination timeout period, in cases where we need more time to complete our pre-termination tasks, and then terminate the instance once all work is complete.
The example below uses the AWS Command Line Interface (CLI) to create and orchestrate the Auto Scaling group’s lifecycle hook. The same functionality is available in most of the latest Amazon SDKs. At the time of writing, however, lifecycle hook creation has not been added to the AWS Management Console and cannot be configured in CloudFormation.
Creating Your First Lifecycle Hook
To create a lifecycle hook for an Auto Scaling group, we use the put-lifecycle-hook command and provide values to the various arguments it accepts. This command then creates a new hook into the Auto Scaling Service that we can use to control what happens when the instance initializes or is ready to terminate.
aws autoscaling put-lifecycle-hook --lifecycle-hook-name "do-some-work" --auto-scaling-group-name "exampleAutoScalingGroup" --lifecycle-transition "autoscaling:EC2_INSTANCE_TERMINATING" --role-arn "arn:aws:iam::356438515751:role/lifecycle-role" --notification-target-arn arn:aws:sqs:us-east-1:636936778347:exampleQueue --heartbeat-timeout 1800 --default-result 'CONTINUE'
In the example above, we’ve named our lifecycle hook do-some-work, and we’ve applied it to the exampleAutoScalingGroup.
This means that any existing or new instances in that Auto Scaling Group that need to terminate will now do so according to the options we’ve set.
Speaking of configuration, we need to let the Auto Scaling service know that our hook should only apply to instances that are terminating, not to instances that are initializing. To do this, we need to set the lifecycle transition that this hook will apply to. In this case, because we want to create a hook that pauses termination, we’ll set the --lifecycle-transition option to the autoscaling:EC2_INSTANCE_TERMINATING transition.
--lifecycle-transition "autoscaling:EC2_INSTANCE_TERMINATING"
Now that we’ve configured when our instances will pause, we need to let Auto Scaling know where we want to receive notifications about instances that are waiting to terminate. This is accomplished with the --notification-target-arn option:
--notification-target-arn arn:aws:sqs:us-east-1:636936778347:exampleQueue
The notification target option takes the ARN of either an Amazon SQS queue or an Amazon SNS topic to which termination messages should be published. In the command above we’ve supplied an ARN that points to an Amazon SQS queue we’ve created called exampleQueue.
When a scale-in event occurs and an instance is slated for termination, a message will be published to this queue which, among other properties, will include the instance ID of the EC2 instance that is slated for termination:
{
"AutoScalingGroupName": "exampleAutoScalingGroup",
"Service": "AWS Auto Scaling",
"Time": "2015-01-07T18:37:17.553Z",
"AccountId": "356438515751",
"LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
"RequestId": "6648ba02-138b-4f56-a0c7-bc74f22c3b51",
"LifecycleActionToken": "eaac0cdf-df85-4c8f-a9ed-f0a685066099",
"EC2InstanceId": "i-af391367",
"LifecycleHookName": "do-some-work"
}
Our worker that runs on the EC2 instance will poll the messages in this queue to determine if the instance it is running on should run its pre-termination tasks.
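The check the worker performs on each message body can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function name is_termination_for is hypothetical, and it assumes the worker already knows its own instance ID (for example, from the EC2 instance metadata service).

```python
import json

TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"


def is_termination_for(body: str, instance_id: str) -> bool:
    """Return True if `body` is a termination notice for `instance_id`."""
    try:
        message = json.loads(body)
    except ValueError:
        return False  # Not JSON, so not a lifecycle notification.
    if not isinstance(message, dict):
        return False
    return (
        message.get("LifecycleTransition") == TERMINATING
        and message.get("EC2InstanceId") == instance_id
    )


# Sample body built from the termination message shown above.
sample_body = json.dumps({
    "AutoScalingGroupName": "exampleAutoScalingGroup",
    "LifecycleTransition": TERMINATING,
    "LifecycleActionToken": "eaac0cdf-df85-4c8f-a9ed-f0a685066099",
    "EC2InstanceId": "i-af391367",
    "LifecycleHookName": "do-some-work",
})

print(is_termination_for(sample_body, "i-af391367"))  # True
print(is_termination_for(sample_body, "i-883a3a42"))  # False
```

Because every instance in the group reads from the same queue, a worker that sees a message for a different instance should leave it on the queue so that the intended instance can pick it up.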
It’s worthwhile to note that when you first create your lifecycle hook, the Auto Scaling service will publish a test message to your SQS queue with details about the newly-created hook:
{
"AutoScalingGroupName": "exampleAutoScalingGroup",
"Service": "AWS Auto Scaling",
"Time": "2015-01-07T17:46:26.005Z",
"AccountId": "356438515751",
"Event": "autoscaling:TEST_NOTIFICATION",
"RequestId": "1a10a7c6-9695-11e4-97d0-730e96ff7596",
"AutoScalingGroupARN": "arn:aws:autoscaling:us-west-1:356438515751:autoScalingGroup:3adf0ecc-39e7-4d58-9933-67df2bbee7fa:autoScalingGroupName/exampleAutoScalingGroup"
}
This should be taken into account when you create your worker, especially if you’ll be adding hooks to your auto scaling groups dynamically, as your worker will need to distinguish between this test notification message and an actual termination message.
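One way to tell the two message types apart is to look for the Event field, which appears in the test notification, and the LifecycleTransition field, which appears in a termination message. The sketch below assumes only the fields shown in the two samples above; classify_notification and its return labels are hypothetical names chosen for illustration.

```python
import json


def classify_notification(body: str) -> str:
    """Label a notification body as 'test', 'terminating', or 'other'."""
    message = json.loads(body)
    if message.get("Event") == "autoscaling:TEST_NOTIFICATION":
        return "test"
    if message.get("LifecycleTransition") == "autoscaling:EC2_INSTANCE_TERMINATING":
        return "terminating"
    return "other"


# Trimmed copies of the two sample messages shown above.
test_body = json.dumps({
    "AutoScalingGroupName": "exampleAutoScalingGroup",
    "Event": "autoscaling:TEST_NOTIFICATION",
})
termination_body = json.dumps({
    "AutoScalingGroupName": "exampleAutoScalingGroup",
    "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
    "EC2InstanceId": "i-af391367",
})

print(classify_notification(test_body))         # test
print(classify_notification(termination_body))  # terminating
```

A worker would typically delete messages labeled 'test' and act only on those labeled 'terminating'.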
Now, before the Auto Scaling service can publish termination messages to our queue, it will need permissions to publish to SQS queues. The --role-arn option is used to pass the ARN of an IAM role that gives the Auto Scaling service permissions to publish to an SQS queue.
--role-arn "arn:aws:iam::356438515751:role/lifecycle-role"
An example policy that provides these permissions to the Auto Scaling service is shown below:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sqs:SendMessage",
"sqs:GetQueueUrl",
"sns:Publish"
],
"Resource": "*"
}
]
}
Next, we’ll need to set the heartbeat timeout, which is the amount of time that an instance will wait before terminating. In our example above, we’ve configured each instance in the Auto Scaling group to pause for up to thirty minutes before terminating by including the --heartbeat-timeout option, which we’ve set to 1800 seconds. By setting the --default-result option to CONTINUE, we’ve also ensured that the instance will terminate if our timeout threshold is reached.
--heartbeat-timeout 1800 --default-result 'CONTINUE'
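For readers using an SDK instead of the CLI, the options discussed so far map onto the parameters of the PutLifecycleHook API call. The sketch below only builds and validates the request; build_hook_request is a hypothetical helper, and the 30 to 7200 second range it enforces is taken from the API reference and should be treated as an assumption to verify against current documentation.

```python
VALID_TRANSITIONS = {
    "autoscaling:EC2_INSTANCE_LAUNCHING",
    "autoscaling:EC2_INSTANCE_TERMINATING",
}
VALID_RESULTS = {"CONTINUE", "ABANDON"}


def build_hook_request(hook_name, group_name, transition, role_arn,
                       target_arn, heartbeat_timeout=1800,
                       default_result="CONTINUE"):
    """Return keyword arguments for a PutLifecycleHook request."""
    if transition not in VALID_TRANSITIONS:
        raise ValueError("unknown lifecycle transition: %s" % transition)
    if default_result not in VALID_RESULTS:
        raise ValueError("default result must be CONTINUE or ABANDON")
    # Assumed limits; check the API reference before relying on them.
    if not 30 <= heartbeat_timeout <= 7200:
        raise ValueError("heartbeat timeout must be 30 to 7200 seconds")
    return {
        "LifecycleHookName": hook_name,
        "AutoScalingGroupName": group_name,
        "LifecycleTransition": transition,
        "RoleARN": role_arn,
        "NotificationTargetARN": target_arn,
        "HeartbeatTimeout": heartbeat_timeout,
        "DefaultResult": default_result,
    }


request = build_hook_request(
    "do-some-work",
    "exampleAutoScalingGroup",
    "autoscaling:EC2_INSTANCE_TERMINATING",
    "arn:aws:iam::356438515751:role/lifecycle-role",
    "arn:aws:sqs:us-east-1:636936778347:exampleQueue",
)
print(request["HeartbeatTimeout"], request["DefaultResult"])  # 1800 CONTINUE
```

With boto3, the resulting dictionary could be passed as keyword arguments to the Auto Scaling client's put_lifecycle_hook method.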
Orchestrating Termination
By issuing the put-lifecycle-hook command as we’ve defined above, we’ve completed setting up our hook. Our machines will now wait thirty minutes before terminating, and the Auto Scaling service will publish a message to our SQS queue identifying each instance that is slated for termination.
Our workers can now pull messages from the queue to determine if they are slated for termination:
Request
aws sqs receive-message --queue-url https://sqs.us-west-1.amazonaws.com/111111111111/exampleQueue --visibility-timeout 60
Response
{
"Messages": [
{
"Body": "{\"AutoScalingGroupName\":\"exampleAutoScalingGroup\",
\"Service\":\"AWS Auto Scaling\",\"Time\":\"2015-01-07T19:13:22.375Z\",\"AccountId\":\"356438515751\",
\"LifecycleTransition\":\"autoscaling:EC2_INSTANCE_TERMINATING\",
\"RequestId\":\"876eac1c-2aaa-407d-98d7-ce9afe597663\",\"LifecycleActionToken\":\"4889fcc7-adc6-43ff-a415-46240e2f57dc\",\"EC2InstanceId\":\"i-883a3a42\",\"LifecycleHookName\":\"do-some-work\"}",
"ReceiptHandle": "AQEBAjam9pe3ZxzD+w3A==",
"MD5OfBody": "d872dc653bcd5d1cc981b2eae64d3827",
"MessageId": "b3308afb-dad3-4eef-abb9-1d99aa9dd50f"
}
]
}
With this information, the worker on the instance can determine if the instance is scheduled for termination and can begin to perform its pre-termination task. Again, this could involve any task that must be completed before the machine terminates, such as shipping logs.
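Note that the lifecycle notification arrives as a JSON string inside the Body field of each SQS message, so the worker has to decode it a second time. The following sketch shows one way to unwrap a receive-message response; lifecycle_actions is a hypothetical helper, and the sample response is a trimmed copy of the one above.

```python
import json


def lifecycle_actions(response, instance_id):
    """Yield (action_token, receipt_handle) pairs addressed to instance_id."""
    # An empty poll has no "Messages" key at all, hence the default.
    for message in response.get("Messages", []):
        try:
            body = json.loads(message["Body"])
        except (KeyError, ValueError):
            continue  # Skip anything that is not a JSON notification.
        if not isinstance(body, dict):
            continue
        if body.get("EC2InstanceId") == instance_id and "LifecycleActionToken" in body:
            yield body["LifecycleActionToken"], message["ReceiptHandle"]


sample_response = {
    "Messages": [
        {
            "Body": json.dumps({
                "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
                "LifecycleActionToken": "4889fcc7-adc6-43ff-a415-46240e2f57dc",
                "EC2InstanceId": "i-883a3a42",
                "LifecycleHookName": "do-some-work",
            }),
            "ReceiptHandle": "AQEBAjam9pe3ZxzD+w3A==",
        }
    ]
}

for token, handle in lifecycle_actions(sample_response, "i-883a3a42"):
    print(token, handle)
```

The token is needed later for the heartbeat and completion calls, and the receipt handle is what the worker would use to delete the message once it has been handled.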
During the execution of the pre-termination tasks, if we calculate that we are nearing the timeout but need more time, we can use the record-lifecycle-action-heartbeat command (a separate command from put-lifecycle-hook) to give the application more time to clear its queue, as demonstrated below:
aws autoscaling record-lifecycle-action-heartbeat --lifecycle-hook-name "do-some-work" --auto-scaling-group-name "exampleAutoScalingGroup" --lifecycle-action-token "A544346G324F3"
The record-lifecycle-action-heartbeat command extends the wait period by the length of time you defined in the heartbeat timeout parameter when you created the lifecycle hook. For example, if we send a heartbeat fifteen minutes into the wait, another thirty minutes is added from that point, giving us a total of forty-five minutes before the hook times out. You must pass in the --lifecycle-action-token value when calling the heartbeat command. This token uniquely identifies a specific lifecycle action associated with an instance and is available, in this particular case, in the message published to our SQS queue (see Response above).
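The arithmetic in that example, and a simple rule for deciding when to send a heartbeat, can be sketched as follows. The helper names and the five-minute safety margin are assumptions made for illustration. The autoscaling argument is assumed to be an SDK client such as boto3's Auto Scaling client, whose record_lifecycle_action_heartbeat method accepts these keyword arguments.

```python
def deadline_after_heartbeat(heartbeat_at, heartbeat_timeout):
    """Seconds from the start of the wait until the hook times out,
    given a heartbeat sent `heartbeat_at` seconds into the wait."""
    return heartbeat_at + heartbeat_timeout


def needs_heartbeat(seconds_since_last, heartbeat_timeout, margin=300):
    """True once we are within `margin` seconds of the timeout."""
    return seconds_since_last >= heartbeat_timeout - margin


def send_heartbeat(autoscaling, hook_name, group_name, token):
    """Ask Auto Scaling for more time on this lifecycle action."""
    autoscaling.record_lifecycle_action_heartbeat(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=group_name,
        LifecycleActionToken=token,
    )


# Heartbeat sent 15 minutes in, with an 1800-second timeout.
print(deadline_after_heartbeat(15 * 60, 1800) // 60)  # 45
# 26 minutes since the last heartbeat is inside the 5-minute margin.
print(needs_heartbeat(26 * 60, 1800))  # True
```

Sending the heartbeat a few minutes before the deadline, instead of at the last second, leaves room for slow API calls or retries.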
When our queue is finally empty, we can instruct the instance to terminate using the complete-lifecycle-action call, as demonstrated below:
aws autoscaling complete-lifecycle-action --lifecycle-hook-name "do-some-work" --auto-scaling-group-name "exampleAutoScalingGroup" --lifecycle-action-token "eaac0cdf-df85-4c8f-a9ed-f0a685066099" --lifecycle-action-result "CONTINUE"
In the above example, we use the complete-lifecycle-action call to instruct the Auto Scaling Service to continue to terminate the instance.
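In a worker, this final step might look like the sketch below. The function finish_termination is a hypothetical name. The autoscaling and sqs arguments are assumed to be SDK clients (for example, boto3 clients) exposing complete_lifecycle_action and delete_message with these keyword arguments.

```python
def finish_termination(autoscaling, sqs, queue_url, receipt_handle,
                       hook_name, group_name, token, result="CONTINUE"):
    """Clean up the queue message, then let the termination proceed."""
    # Delete the message first: once the lifecycle action completes,
    # the instance may be shut down before any further code runs.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=group_name,
        LifecycleActionToken=token,
        LifecycleActionResult=result,
    )
```

Deleting the SQS message before completing the action is a design choice in this sketch, not a requirement of the service. It avoids leaving a stale termination message on the queue after the instance is gone.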
Summary
Admittedly, using lifecycle hooks to manage the transitions of an Auto Scaling group’s instances may not be an ideal solution in every case. It is far better to use the many AWS services available (S3, SQS, CloudWatch Logs, etc.) to design your application servers to be as stateless as possible (storing nothing but the application code).
Nonetheless, lifecycle hooks can assist in managing the state of instances and controlling the conditions under which an instance may launch or terminate. To learn more, please visit: Auto Scaling Group Lifecycle.