In nearly every n-tier application architecture (not just those in a cloud deployment), the application tier should be treated as stateless, meaning that it contains nothing other than the application code. Databases, files, and even logs should be shipped to another location. Each application server (or EC2 instance) should be treated as if it were disposable.
This design is especially important on AWS, because individual instances tend to have short lifespans. The underlying hardware in Amazon’s datacenters randomly fails and is sometimes scheduled for retirement. And unlike a physical server, which may be repaired, a terminated instance is gone forever.
The most common refactoring required here is in applications that allow users to upload files. These files should be uploaded directly to S3 whenever possible, using the code library available for your language of choice. Best practices also dictate that a database not be stored on the same instance as the application code. Better options are Amazon RDS, SimpleDB, or DynamoDB. For applications requiring a different configuration (such as PostgreSQL), rolling your own using EBS-optimized instances and provisioned IOPS EBS volumes is preferred for high-availability and performance reasons. Using a placement group for a database cluster may also be a viable solution, though I haven’t yet experimented with it.
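For the file-upload case, here is a minimal sketch of pushing an upload straight to S3 instead of the local disk. It assumes the boto3 library and its `upload_fileobj` client method; the function names, the `uploads` prefix, and the bucket name in the usage note are placeholders of mine.

```python
import os
import uuid


def object_key_for(filename, prefix="uploads"):
    """Return a collision-resistant S3 key that keeps the file extension."""
    extension = os.path.splitext(filename)[1].lower()
    return f"{prefix}/{uuid.uuid4().hex}{extension}"


def store_upload(s3, bucket, fileobj, filename):
    """Stream an uploaded file object to S3 and return the key it was stored under.

    `s3` is expected to behave like boto3.client("s3").
    """
    key = object_key_for(filename)
    s3.upload_fileobj(fileobj, bucket, key)
    return key
```

In a request handler this might be called as `store_upload(boto3.client("s3"), "my-app-uploads", request_file, request_file_name)`, with only the returned key saved in the database. Nothing is ever written to the instance itself.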
Autoscaling is the ability of EC2 instances to detect system load and either bring additional servers online or spin down excess capacity. This happens automatically by using CloudWatch alarms that respond to average CPU utilization, outgoing network throughput, disk I/O queue, or any combination of several dozen metrics. Custom metrics can also be defined that scale based on application or business rules (such as number of active user sessions or Twitter activity).
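To make the alarm-driven part concrete, here is a rough sketch of wiring a scale-out policy to a CPU alarm with boto3. The 70% threshold, the five-minute period, and the naming scheme are assumptions I picked for illustration, not recommendations.

```python
def cpu_alarm_params(group_name, policy_arn, threshold=70.0):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        "AlarmName": f"{group_name}-high-cpu",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": group_name}],
        "Statistic": "Average",
        "Period": 300,             # seconds per datapoint (assumed value)
        "EvaluationPeriods": 2,    # two breaching periods in a row
        "Threshold": threshold,    # percent CPU (assumed value)
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [policy_arn],
    }


def create_scale_out_alarm(group_name):
    """Create a +1 instance scaling policy and the alarm that triggers it."""
    import boto3  # third-party; imported here so the helper above needs no AWS setup

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName=group_name,
        PolicyName=f"{group_name}-scale-out",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=1,
        Cooldown=300,
    )
    cloudwatch.put_metric_alarm(**cpu_alarm_params(group_name, policy["PolicyARN"]))
```

A matching scale-in policy would use a negative `ScalingAdjustment` and a `LessThanThreshold` alarm.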
Autoscaling is the heart and soul of AWS: it’s how you build scalable, self-healing applications. If you’re not using it, you’re better off using a traditional VPS or co-lo server, as it will be cheaper and more reliable. An autoscaling group of a single server (a group defined with an instance count of max=1) can also be created if running multiple instances is cost-prohibitive and/or several minutes of random downtime are acceptable. This will permit AWS to kill an unresponsive server and automatically bring a replacement online.
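A single-server group like that might be defined as follows with boto3. This is a sketch under assumptions: the group name, launch template name, and availability zones are placeholders, and I’m using a launch template where older setups would use a launch configuration.

```python
def single_instance_group_params(name, launch_template_name, zones):
    """Keyword arguments for CreateAutoScalingGroup with exactly one instance."""
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        "MinSize": 1,            # never drop below one server
        "MaxSize": 1,            # never grow beyond one server
        "DesiredCapacity": 1,
        "AvailabilityZones": list(zones),
        "HealthCheckType": "EC2",
        "HealthCheckGracePeriod": 300,  # seconds to let a new instance boot
    }


def create_single_instance_group(name, launch_template_name, zones):
    import boto3  # third-party

    boto3.client("autoscaling").create_auto_scaling_group(
        **single_instance_group_params(name, launch_template_name, zones)
    )
```

Listing more than one zone lets the replacement come up elsewhere if a whole zone is having a bad day.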
Amazon’s FAQ for RDS reads, in part:
The Point-In-Time-Restore and Snapshot Restore features of Amazon RDS for MySQL require a crash recoverable storage engine and are supported for InnoDB storage engine only.
I consider these two features to be among the top reasons to be using RDS. Combined, they ensure that your database is automatically being backed up on a regular schedule. Point-In-Time-Restore uses the transaction log to allow you to revert the state of your database to any point in time (down to the second) within the past 24 hours. Snapshots occur once every day and are rotated according to a user-defined schedule. I generally retain a week’s worth of RDS snapshots.
MyISAM tables can nearly always be converted to InnoDB without much issue, except in the case of databases using full-text indexes. MySQL 5.6.4 introduced full-text indexes to InnoDB, but as of this writing, RDS only supports up to MySQL 5.5. Developers using full-text searching in their applications may wish to port that functionality over to AWS’s CloudSearch so InnoDB can be used on RDS. Alternatively, MyISAM can continue to be used, but a manual backup strategy (using the mysqldump utility) will have to be implemented.
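If you do stay on MyISAM, the manual backup can be as small as a script run on a schedule. The sketch below assumes the mysqldump client is installed and that the password lives in an option file such as ~/.my.cnf; the host, user, and file naming are placeholders.

```python
import datetime
import os
import subprocess


def mysqldump_command(host, user, database, out_path):
    """Build the mysqldump invocation as an argument list (no shell involved)."""
    return [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--lock-tables",                # MyISAM has no transactions, so lock while dumping
        f"--result-file={out_path}",
        database,
    ]


def backup(host, user, database, out_dir="."):
    """Dump one database to a timestamped .sql file and return its path."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_path = os.path.join(out_dir, f"{database}-{stamp}.sql")
    subprocess.run(mysqldump_command(host, user, database, out_path), check=True)
    return out_path
```

The resulting file should then be copied off the instance (to S3, for example), in keeping with the stateless-server rule.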
The ability to have cloud-based task scheduling is one glaring omission that I’ve discovered in AWS’s bag of tricks. There’s no (official) way to ensure that a job is run once, and only once, on a regular schedule. Best practices dictate that no single point of failure exists in a deployment, yet I’m not aware of any type of clustered solution that works as simply as a plain ole’ cron table. Attempting to run cron jobs on autoscaling application servers will inevitably cause race conditions, and headaches.
The solutions that I recommend: First, make sure that cron jobs are really the best application design. In many cases, they’re used to communicate between various tiers in an application. A cron job might kick off a batch process to resample audio files, for example. In this use case, a message queue would be a preferable solution. It scales more easily and has the benefit of ensuring that a job was completed successfully. Amazon’s Simple Queue Service (SQS) is a good choice.
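Here is a minimal sketch of that queue pattern, assuming a boto3 SQS client is passed in as `sqs`. The job format and function names are mine. The point to notice is that a message is deleted only after its handler succeeds, which is what gives you the "job completed" guarantee.

```python
import json


def enqueue_job(sqs, queue_url, job):
    """Producer side: publish one job description as a JSON message."""
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(job))


def process_jobs(sqs, queue_url, handler):
    """Worker side: poll once and return the number of jobs completed.

    A message whose handler raises is not deleted, so SQS makes it
    visible again after the queue's visibility timeout and it is retried.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling
    )
    completed = 0
    for message in response.get("Messages", []):
        try:
            handler(json.loads(message["Body"]))
        except Exception:
            continue  # leave the message on the queue for a retry
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        completed += 1
    return completed
```

A worker would call `process_jobs` in a loop, and the worker pool can itself be an autoscaling group. Handlers should be safe to run twice, since SQS standard queues can occasionally deliver a message more than once.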
If actual job scheduling is needed, then the two options are to use an autoscaling group containing a single t1.micro instance or to use a third-party solution such as iron.io.
An Elastic Load Balancer (ELB) sits in front of an application server pool (autoscaling group) and distributes incoming requests amongst the instances. It will automatically pull a dead instance out of the pool and then resume routing traffic to the replacement once it comes online. By default, users may randomly be bounced around to different instances throughout their website visit. If your application uses session cookies, this can cause users to be logged out each time they request a new page. This occurs because session information is stored locally on each instance (generally in text files in a temporary directory). When the user is bounced to a different server, the session is lost.
Elastic Load Balancers include a “sticky session” feature that eliminates this problem. This is a special cookie that instructs the ELB to send the user to the same instance for the duration of their visit (session). The ELB can be configured to use either its own cookie or an existing cookie that is set by the application.
Sticky sessions do have one notable drawback: if the instance that the user is “stuck to” dies or is scaled down, the user loses their session. This may not be a major problem for your use case, or it could mean that a couple hundred shoppers lose their shopping cart contents. The alternative is to move the session data to a shared datastore, generally either ElastiCache (AWS’s hosted Memcached service) or DynamoDB (a very fast key-value “NoSQL” database).
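The shared-datastore approach can be sketched like this. I’m assuming a Memcached-style client exposing `get(key)` and `set(key, value, expire=...)`, which is the shape of the pymemcache client; `DictCache` is an in-memory stand-in so the example runs without a cache server, and all class names here are hypothetical.

```python
import json
import secrets


class DictCache:
    """In-memory stand-in for a Memcached client (ignores expiry)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, expire=0):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class SessionStore:
    """Keeps session data in a shared cache instead of on the instance."""

    def __init__(self, cache, ttl_seconds=1800):
        self.cache = cache
        self.ttl_seconds = ttl_seconds

    def create(self, data):
        session_id = secrets.token_urlsafe(32)  # value to place in the cookie
        self.save(session_id, data)
        return session_id

    def save(self, session_id, data):
        payload = json.dumps(data).encode("utf-8")
        self.cache.set(f"session:{session_id}", payload, expire=self.ttl_seconds)

    def load(self, session_id):
        raw = self.cache.get(f"session:{session_id}")
        return json.loads(raw) if raw is not None else None
```

Because every instance talks to the same cache, a session created on one server can be loaded on any other, and losing an instance no longer loses the shopping cart.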
Since it’s so simple to quickly and cheaply get a temporary server online with EC2, spammers have been drawn to the platform in droves. Unfortunately, the result has been that large swaths of EC2’s address space have been penalized by major email hosts. Additionally, AWS monitors and limits the amount of outgoing SMTP traffic that can be sent per account. These facts can often make reliable email delivery an impossibility.
My recommendation would be to use a hosted SMTP relay (such as AuthSMTP) for simple email needs. It’s extremely cheap ($32/yr) and well worth the additional peace of mind if reliable email delivery is important to your application. If the ability to automatically track bounces, unsubscription requests, and complaints would be helpful, several transactional email providers offer feature-rich solutions, among them JangoSMTP, SendGrid, and Mandrill. Amazon also has Simple Email Service (SES), which provides much of the same functionality.
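Sending through a hosted relay takes very little code. This sketch uses only Python’s standard library; the relay host, port, and credentials are placeholders to be replaced with whatever your provider issues, and it assumes the relay supports STARTTLS.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Assemble a plain-text email."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_via_relay(message, host, port, username, password):
    """Hand the message to an authenticated SMTP relay over STARTTLS."""
    with smtplib.SMTP(host, port, timeout=30) as smtp:
        smtp.starttls()
        smtp.login(username, password)
        smtp.send_message(message)
```

The application server never delivers mail directly, so the reputation of its EC2 address stops mattering.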
It can be tempting to set up and tune a server to work exactly as needed and then “burn” it to an AMI so the configuration can be reused, either in an autoscaling group or as a backup. However, AWS doesn’t offer much in the way of AMI versioning or management. It’s not even possible to determine on which date an AMI was created. This also creates the hassle of having to create a new AMI each time a minor change is applied.
A better option is to use a base AMI from a trusted source, such as Amazon’s official AMIs or one that you’ve custom built, and then use a provisioning-at-boot strategy. When an EC2 instance is launched, a “user data” string can be specified. This string can be something as simple as a Bash script that executes a yum update -y to update all the system software and then checks out a copy of the application code from a Git repository. Or the script can be used to bootstrap a more complex provisioning process managed by Chef, Puppet, Salt, or other configuration management and orchestration tools. Amazon’s official AMIs also support cloud-init scripts, which are a good balance between the simplicity of a shell script and the power of larger orchestration tools.
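As a rough illustration of the simple end of that spectrum, the sketch below renders a Bash user-data script and passes it to an instance at launch with boto3. The repository URL, install directory, AMI ID, and instance type are all placeholders, and it assumes a yum-based AMI.

```python
import shlex


def render_user_data(repo_url, app_dir="/opt/app"):
    """Return a Bash script for EC2 user data: update packages, fetch the code."""
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        "yum update -y",
        "yum install -y git",
        f"git clone {shlex.quote(repo_url)} {shlex.quote(app_dir)}",
    ]
    return "\n".join(lines) + "\n"


def launch_instance(image_id, instance_type, repo_url):
    """Launch one instance that provisions itself at boot."""
    import boto3  # third-party

    ec2 = boto3.client("ec2")
    return ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        UserData=render_user_data(repo_url),
    )
```

In practice the same rendered script would go into the launch configuration or template used by the autoscaling group, so every replacement server builds itself the same way.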
The primary advantage of building the instance at boot is flexibility and control. The provisioning scripts can be stored alongside the application code in a source repository, which makes it simple to track changes to the application’s environment. For example, are you upgrading from Tomcat 6 to 7? Simply change the version number in your bootstrap script, commit the changes, and create a new autoscaling group. Once the new servers boot and provision (10 minutes later), test your application. If it looks OK, then change the elastic load balancer to point at the new server pool. Done.