Controlling how EC2 instances terminate using AutoScaling lifecycle hooks

As always, the AWS documentation is great, but a real-world example of how people are using a feature can often be helpful.

We use the lifecycle hooks of AWS AutoScaling groups to control what happens when instances are terminated. This is useful as we have long running tasks that run on instances inside of AutoScaling groups.

There are two transitions available: EC2_INSTANCE_LAUNCHING and EC2_INSTANCE_TERMINATING. We currently don’t use the launching transition, as our instances do everything they need to from the launch configuration. We do use the terminating transition, however, to delay the termination of an instance until its current task is complete. Setting it all up was relatively simple, and it allows us to control the size of the AutoScaling group without having to worry about tasks being killed whilst they are running. Best of all, we can use the same approach on multiple AutoScaling groups with no extra code or infrastructure.

The basic flow we use is:

  1. Hook lifecycle notification to Amazon SNS

  2. The SNS notification is sent to an ELB endpoint. This ELB is linked to an AutoScaling group containing a single EC2 instance, with the group size fixed at 1. We use an Elastic Load Balancer to ensure that the same endpoint is always available, and the AutoScaling group means that we always have an instance running and can tolerate hardware failures.

  3. This instance handles the SNS notification, and places a record on an ElastiCache Redis instance. The key is unique to the id of the machine the SNS notification is for.

  4. This instance also performs heartbeating. Periodically (every 30 minutes) it checks whether the key is still present. If it is, it sends a heartbeat to keep the terminating instance in the Terminating:Wait state. This is because the AutoScaling group will, by default, time out the lifecycle transition after a period of time unless heartbeats are sent.

  5. The instance that is being terminated is inside a different AutoScaling group. Before it handles a task, it checks whether a key exists in Redis with its instance id. If so, it uses the information stored under the key to send a complete-lifecycle-action and stops picking up tasks.

  6. AWS then completes the lifecycle of the instance by terminating it.

Adding the hook

The first step is to register the hook. We use SNS as the notification target, and there is no UX for adding a hook like this. Since we try to automate everything anyway, so that we can replicate the entire environment, this isn’t an issue for us. FYI, you can add a hook quite simply by using the CLI:

aws autoscaling put-lifecycle-hook --lifecycle-hook-name my-hook --auto-scaling-group-name my-asg --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING --notification-target-arn arn:aws:sns:us-west-2:123456789012:my-sns-topic --role-arn arn:aws:iam::123456789012:role/my-notification-role

SNS Notifications

Adding a subscription to an SNS topic is really easy. In our case, we send to the address of the internal ELB holding our EC2 instance that controls the flow of instances.

Handling SNS Notifications

The code to handle SNS notifications is well documented. We use TypeScript for most of our code. The sample code to handle an SNS notification is listed below. Note that this handles confirming the subscription to the SNS topic, as well as receiving notifications.

static HandleRequest(req: restify.Request, res: restify.Response, next: restify.Next) {

    // Read the raw request body ourselves.
    // See http://stackoverflow.com/questions/18484775/how-do-you-access-an-amazon-sns-post-body-with-express-node-js
    var bodyarr: (string | Buffer)[] = [];
    req.on('data', function (chunk) {
        bodyarr.push(chunk);
    });
    req.on('error', function (error) {
        console.log(error.message);
        return res.send(400, error.message);
    });
    req.on('end', function () {
        let body: any;
        try {
            body = JSON.parse(bodyarr.join(''));
        } catch (syntaxErr) {
            console.log("req.body invalid: " + req.url);
            return res.send(400, "Invalid request body");
        }
        if (body == null) {
            console.log("req.body null: " + req.url);
            return res.send(400, "Invalid request body");
        }

        const messageType = req.headers['x-amz-sns-message-type'];
        if (messageType != null) {
            if (messageType == 'SubscriptionConfirmation') {
                // Confirm the subscription by requesting the URL that SNS sent us
                request(body.SubscribeURL, (error: Error, response: request.RequestResponse, responseBody: any) => {
                    if (error) {
                        console.log(error.message);
                        res.send(400, "Error sending subscription Url");
                    } else {
                        res.send(200, 'OK');
                    }
                });
            } else if (messageType == 'Notification') {

                // regardless of our state - inform AWS we've handled the message
                res.send(200, 'OK');

                let msg: any;
                try {
                    msg = JSON.parse(body.Message);
                } catch (syntaxErr) {
                    console.log("SNS Message is not valid JSON: " + req.url);
                    return;
                }

                console.log('SNS event received: ' + JSON.stringify(msg));

                if (msg.LifecycleTransition == 'autoscaling:EC2_INSTANCE_TERMINATING') {
                    // handle the autoscaling event here!
                }
            } else {
                console.log('Unknown x-amz-sns-message-type: ' + messageType);
                res.send(200, 'OK');
            }
        } else {
            res.send(200, 'OK');
        }
    });
}
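
The "handle the autoscaling event here" branch needs to pull a few fields out of the message before anything can be stored. The following is a minimal sketch of that step, not the code we run. The helper name `parseTerminatingMessage` and the `LifecycleMessage` interface are made up for illustration, and it assumes the lifecycle message carries `EC2InstanceId`, `AutoScalingGroupName`, `LifecycleActionToken` and `LifecycleHookName` as strings. Anything else, such as the test notification Auto Scaling sends when a hook is attached, is ignored by returning null.

```typescript
// Only the fields this post relies on; the real message carries more.
interface LifecycleMessage {
    AutoScalingGroupName: string;
    EC2InstanceId: string;
    LifecycleActionToken: string;
    LifecycleHookName: string;
    LifecycleTransition: string;
}

const REQUIRED_FIELDS: (keyof LifecycleMessage)[] = [
    'AutoScalingGroupName',
    'EC2InstanceId',
    'LifecycleActionToken',
    'LifecycleHookName',
];

// Hypothetical helper: returns the parsed message, or null when the SNS
// Message string is not a terminating lifecycle notification.
function parseTerminatingMessage(rawMessage: string): LifecycleMessage | null {
    let msg: any;
    try {
        msg = JSON.parse(rawMessage);
    } catch (err) {
        return null; // not JSON at all
    }
    if (msg === null || typeof msg !== 'object') {
        return null;
    }
    if (msg.LifecycleTransition !== 'autoscaling:EC2_INSTANCE_TERMINATING') {
        return null; // launching transition, test notification, etc.
    }
    for (const field of REQUIRED_FIELDS) {
        if (typeof msg[field] !== 'string' || msg[field].length === 0) {
            return null; // incomplete message, nothing useful to store
        }
    }
    return {
        AutoScalingGroupName: msg.AutoScalingGroupName,
        EC2InstanceId: msg.EC2InstanceId,
        LifecycleActionToken: msg.LifecycleActionToken,
        LifecycleHookName: msg.LifecycleHookName,
        LifecycleTransition: msg.LifecycleTransition,
    };
}
```

Inside the handler this would be called as `parseTerminatingMessage(body.Message)`, with a null result simply skipped.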

Checking whether an instance should terminate

We place the payload of the AWS message in Redis. We use hmset, where the key is unique to the instance id in the payload and the value is the payload of the SNS notification.
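
To make the storage step concrete, here is a rough sketch of how the key and hash fields could be built. The key prefix, the `HashWriter` interface and the function names are all hypothetical; `HashWriter` only stands in for whatever Redis client you use, on the assumption that it exposes an hmset-style call taking a key and an object of string fields.

```typescript
// The fields needed later to heartbeat or complete the lifecycle action.
interface TerminationNotice {
    AutoScalingGroupName: string;
    EC2InstanceId: string;
    LifecycleActionToken: string;
    LifecycleHookName: string;
}

// Hypothetical stand-in for a Redis client with an hmset-style method.
interface HashWriter {
    hmset(key: string, fields: { [field: string]: string }): unknown;
}

// Hypothetical prefix; the only requirement is that the key is unique per instance id.
const KEY_PREFIX = 'lifecycle:terminating:';

function terminationKey(instanceId: string): string {
    return KEY_PREFIX + instanceId;
}

// Stores the notice as a hash and returns the key that was written.
function recordTermination(store: HashWriter, notice: TerminationNotice): string {
    const key = terminationKey(notice.EC2InstanceId);
    store.hmset(key, {
        AutoScalingGroupName: notice.AutoScalingGroupName,
        LifecycleActionToken: notice.LifecycleActionToken,
        LifecycleHookName: notice.LifecycleHookName,
    });
    return key;
}
```

Because the key is derived only from the instance id, a worker can compute its own key from its instance id without knowing anything else about the notification.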

On each AWS instance that can be terminated in this way, before starting the next task, we can simply check to see if the key exists. If it does, we can inform AWS that the instance is ready to be terminated. A small gotcha we noticed was that we need to do some cleanup before we can be terminated, so we must do this first, then put the application into a “holding state”, and only then inform AWS. If we informed AWS first, it had sometimes terminated our instance before we had completed our cleanup! As part of our cleanup, we remove the key from Redis, so our other notification app knows we’ve completed as well and won’t send any more heartbeats.
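
The ordering described above is the important part, so here is a small sketch of it. Everything in `WorkerDeps` is a hypothetical hook you would wire up to your own Redis and AWS code; the sketch only fixes the sequence: clean up (including removing the key), enter the holding state, and tell AWS last.

```typescript
// What the worker reads back from Redis for its own instance id.
interface StoredHook {
    AutoScalingGroupName: string;
    LifecycleActionToken: string;
    LifecycleHookName: string;
}

// Hypothetical dependencies, injected so the ordering can be tested without AWS or Redis.
interface WorkerDeps {
    loadHook(instanceId: string): Promise<StoredHook | null>; // null when no key exists
    cleanUp(): Promise<void>;                                 // application-specific cleanup
    deleteHook(instanceId: string): Promise<void>;            // removes the Redis key
    enterHoldingState(): void;                                // stop accepting tasks
    completeLifecycle(hook: StoredHook): Promise<void>;       // informs AWS
}

// Call before picking up each task. Resolves to true when the instance has
// handed itself over for termination and must not start another task.
async function stopIfTerminating(instanceId: string, deps: WorkerDeps): Promise<boolean> {
    const hook = await deps.loadHook(instanceId);
    if (hook === null) {
        return false; // no termination requested, carry on working
    }
    await deps.cleanUp();
    await deps.deleteHook(instanceId); // lets the notification app stop heartbeating
    deps.enterHoldingState();
    // Inform AWS last: the instance may be terminated at any point after this call.
    await deps.completeLifecycle(hook);
    return true;
}
```

The hook details are read before the key is deleted, since they are still needed for the final call to AWS.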

The code to complete the lifecycle is really easy:

AutoScaling.CompleteLifecycleAction({
    AutoScalingGroupName: autoScaleInfo.AutoScalingGroupName,
    LifecycleActionResult: 'CONTINUE',
    LifecycleActionToken: autoScaleInfo.LifecycleActionToken,
    LifecycleHookName: autoScaleInfo.LifecycleHookName
});

As discussed above, the instance that is listening for the SNS notifications also checks at regular intervals whether the key exists. If it does, it sends a heartbeat to AWS and then sets another interval, so we check again later to see if we need to send another heartbeat. To send a heartbeat to AWS, use:

AutoScaling.RecordLifecycleActionHeartbeat({
    AutoScalingGroupName: payload.AutoScalingGroupName,
    LifecycleActionToken: payload.LifecycleActionToken,
    LifecycleHookName: payload.LifecycleHookName
});
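
The check-then-reschedule loop around that call could look roughly like the sketch below. The `HeartbeatDeps` interface and both function names are hypothetical, and retrying after a failed heartbeat is a design assumption rather than something described above; the idea is only that each tick re-reads the key, sends a heartbeat if it is still there, and schedules the next tick.

```typescript
// Fields the heartbeat call needs; these mirror the stored SNS payload.
interface HookPayload {
    AutoScalingGroupName: string;
    LifecycleActionToken: string;
    LifecycleHookName: string;
}

// Hypothetical dependencies, injected so the loop can run without AWS or Redis.
interface HeartbeatDeps {
    loadPayload(instanceId: string): Promise<HookPayload | null>; // null once the worker removed its key
    sendHeartbeat(payload: HookPayload): Promise<void>;           // wraps the heartbeat call to AWS
}

// One tick: resolves to true if a heartbeat was sent and another check is needed.
async function heartbeatOnce(instanceId: string, deps: HeartbeatDeps): Promise<boolean> {
    const payload = await deps.loadPayload(instanceId);
    if (payload === null) {
        return false; // the worker has finished; stop heartbeating
    }
    await deps.sendHeartbeat(payload);
    return true;
}

// Schedules ticks until the key disappears. A failed tick is logged and retried
// at the next interval (an assumption; you may prefer to give up instead).
function startHeartbeats(instanceId: string, deps: HeartbeatDeps, intervalMs: number): void {
    setTimeout(() => {
        heartbeatOnce(instanceId, deps)
            .catch((err: unknown) => {
                console.log('heartbeat failed for ' + instanceId + ': ' + String(err));
                return true;
            })
            .then((again: boolean) => {
                if (again) {
                    startHeartbeats(instanceId, deps, intervalMs);
                }
            });
    }, intervalMs);
}
```

With the 30 minute check mentioned earlier, this would be started as `startHeartbeats(instanceId, deps, 30 * 60 * 1000)` once the notification has been stored. Whatever interval you choose needs to be shorter than the heartbeat timeout configured on the hook.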

Simple! This is working really well for us. We can now run tasks that can potentially take hours, and not worry about the AutoScaling group terminating an instance while a task is running on it.

Other notes

If you are creating short-lived tasks, then this won’t be a problem. We are looking into using spot-priced instances for some tasks. As long as you can complete within a couple of minutes, you’ll potentially save a lot of money this way.
