When it comes to AWS sentiment for enterprise use, there are two distinct views: “AWS wasn’t designed for enterprise workloads” and “AWS can be used for enterprise workloads”. The opinions are heavily polarized, and in my experience, whenever there is this much dichotomy, the reality is somewhere in the middle. This dichotomy, coupled with many of my customers’ interest in AWS (among other public clouds such as Azure, GCE, and vCHS), led me to seek out my own position on this topic via experimenting and testing. I found that many of the opinions written are not from the standpoint of a practitioner, and when it comes to cloud technology evaluation that is one of the most important perspectives.
For my experiment I decided to take an app+DB 2-tier application (for example, Oracle E-Business Suite, a popular enterprise app, utilizes this architecture) running on my virtualized “on-prem infrastructure” and see if, and how, I could achieve the same availability, fault tolerance, and DR in AWS. This type of application stack is very typical for an enterprise. Another constraint was that re-writing the application was not an option. Writing an application utilizing cloud design patterns and architectures is a great strategy for new application development, but the majority of enterprises will show you the door if you tell them the first step to adopting a technology is to re-write all their existing applications. This is just a reality for many, many enterprise customers. It is an example of where technology cannot be viewed in isolation from business requirements and constraints.
Where appropriate, I will compare the design patterns to an on-prem solution utilizing vSphere, as that is probably the most relevant comparison.
With any public cloud provider, it’s important to understand which services are inherently fault tolerant and which require a fault tolerant application architecture.
AWS’s inherently fault tolerant services: S3, SimpleDB, DynamoDB, CloudFront, SWF, Elastic Transcoder, Redshift, SQS, SNS, SES, Route 53, Elastic Load Balancing (ELB), Elastic Beanstalk, ElastiCache, Elastic MapReduce, IAM
AWS’s services which require a fault tolerant application architecture: EC2, VPC, EBS, RDS*
As an example, if you utilize S3 as a data store for your app, you can write to it assuming it will always be up and all failures will be handled by AWS at the infrastructure layer. On the other hand, if you utilize EC2 to run a workload, you should design your application assuming that at any given time your EC2 instance can go down due to physical failures. AWS is not going to do anything to provide fault tolerance at the EC2 layer. This is in contrast to vSphere, which will restart a VM utilizing HA if a physical host goes down.
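In practice, “assume the instance can go down” means application code calling a service hosted on EC2 should tolerate transient failures itself. A minimal sketch of that idea, using a simulated flaky dependency (the function names and failure counts here are illustrative, not part of any AWS API):

```python
import time

def call_with_retry(operation, attempts=3, backoff_s=0.01):
    """Retry a flaky operation with exponential backoff, as an app
    depending on bare EC2 instances must assume failures will happen."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the failure
            time.sleep(backoff_s * (2 ** attempt))

# Simulated dependency that fails twice before succeeding,
# standing in for a service on an EC2 instance that was replaced.
calls = {"n": 0}
def flaky_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("instance unreachable")
    return "ok"

result = call_with_retry(flaky_service)
```

With a managed service like S3, this burden largely shifts to AWS; with EC2, it stays with you.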
The method to provide high availability in AWS is to run workloads in multiple availability zones (AZs). Availability zones can be viewed as independent data centers which are in the same geographical region, separated by synchronous distances with low latency interconnects. Examples of AZs are US-EAST-1a, US-EAST-1b, US-EAST-1d, and so on.
If the workload in question is stateless, such as the web/app tier, multiple instances can be run active in multiple AZs. For a workload such as an Oracle DB, the proper design pattern is to run the DB as active in one AZ, replicating to a stand-by (or read-only) copy in another AZ. You could run the passive copy in the same AZ, but it makes more sense to run it in another AZ, since AWS provides multiple AZs per region and doing so adds another layer of protection against faults. Broadly speaking, HA in AWS is achieved by running one component of each application in at least two AZs.
Most on-prem architectures will have an active DB in their primary DC and a passive copy in their DR DC. This is a subtlety worth noting: while in the on-prem case the passive copy is used for DR, in AWS you will leverage an additional passive copy to provide HA.
Disaster recovery should be conducted in a region which is geographically disparate from your primary workload locations. AWS provides regions to allow for this. Examples of regions are: US-EAST, US-WEST, EU (Ireland), etc. Regions can be thought of as geographically distant locations which contain AWS data centers.
If the application is running out of US-EAST, replicating and backing up to US-WEST can protect against geographical outages, natural disasters, etc. This design pattern should be very familiar to most enterprise architectures.
Putting it all together
This isn’t meant to be a comprehensive guide on AWS design patterns, but at the same time I do want to discuss what services I chose to use and why. Here is a diagram showing my enterprise application living in AWS:
The easiest thing to do is start at the bottom of the stack and work our way up. For the DB layer, I chose to deploy a MySQL database using RDS, AWS’s managed database service. In AWS you have two options when deploying the DB layer: (1) spin up an EC2 instance, install your DB software, configure it, do the same thing in another AZ and region, set up replication/failover between the master/slaves, and set up backups yourself; or (2) simply use RDS to do it in a few clicks and let AWS manage the availability, failover, and upgrading of the DBs. Without exaggerating, I was able to create the DB layer of my stack in a few mouse clicks and 5 minutes. Below are some screenshots illustrating this (not complete, just the highlights):
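The same Multi-AZ deployment can be driven through the API instead of the console. As a sketch, the parameters below mirror the shape of boto3’s `rds.create_db_instance` call; the identifier, password, and instance class are placeholders of my choosing, not values from AWS:

```python
def rds_multi_az_params(instance_id, password):
    """Build kwargs for boto3 rds.create_db_instance: a Multi-AZ MySQL
    deployment with automated daily backups, matching the console clicks."""
    return {
        "DBInstanceIdentifier": instance_id,
        "Engine": "mysql",
        "DBInstanceClass": "db.m1.small",   # placeholder size
        "AllocatedStorage": 10,             # GB
        "MasterUsername": "admin",
        "MasterUserPassword": password,
        "MultiAZ": True,                    # the single "click": standby in another AZ
        "BackupRetentionPeriod": 1,         # automated daily backups retained 1 day
    }

params = rds_multi_az_params("enterprise-db", "example-password")
# To actually create it (requires credentials):
# import boto3
# boto3.client("rds", region_name="us-east-1").create_db_instance(**params)
```

The point is that `MultiAZ: True` is the entire difference between a standalone DB and one with a synchronously replicated standby.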
DB Engine selection
Automated Multi-AZ deployment with a single click
Automated RDS DB Backups
I was amazed by how easy it was to get this going. With a few clicks I had a MySQL DB deployed in two AZs w/ replication and backed up once per day to S3. Failover also happens automatically and that’s something I tested. The Slave DB automatically takes over if the Master DB goes down and the same applies for fail back. You can reboot the Master DB node to test this as per the below:
Compared to setting up a DB on-prem, RDS automates many of the tasks and makes it very easy to get a fault tolerant, highly available DB environment up and running: DBaaS.
Creating a 2nd slave DB (which can also act as a read replica) in another region for DR purposes is another click away:
In my case since I had my workloads deployed in US-EAST for production, I simply selected US-WEST for the read replica.
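The API equivalent of that click is boto3’s `rds.create_db_instance_read_replica`, issued from the DR region; for a cross-region source, RDS expects the source instance’s full ARN. The ARN and identifiers below are hypothetical placeholders:

```python
def cross_region_replica_params(source_arn, replica_id):
    """Build kwargs for rds.create_db_instance_read_replica,
    called from the DR (destination) region."""
    return {
        "DBInstanceIdentifier": replica_id,
        "SourceDBInstanceIdentifier": source_arn,  # full ARN for cross-region sources
    }

params = cross_region_replica_params(
    "arn:aws:rds:us-east-1:123456789012:db:enterprise-db",  # hypothetical source ARN
    "enterprise-db-dr",
)
# To actually create it (requires credentials):
# import boto3
# boto3.client("rds", region_name="us-west-2").create_db_instance_read_replica(**params)
```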
The next thing to look at is the web/app tier. For this I simply used EC2 and installed my application. In my testing I deployed Joomla, a popular CMS. To make it available across AZs, I used Autoscaling groups so that the instances could scale out automatically with load as well as live in multiple AZs:
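An Autoscaling group spanning two AZs can be expressed via the API as well. This sketch mirrors the shape of boto3’s `autoscaling.create_auto_scaling_group`; the group name, launch configuration name, and size bounds are assumptions for illustration:

```python
def autoscaling_group_params(group_name, elb_name):
    """Build kwargs for autoscaling.create_auto_scaling_group: a web/app
    tier spread across two AZs and registered with an ELB."""
    return {
        "AutoScalingGroupName": group_name,
        "LaunchConfigurationName": "joomla-web-lc",        # assumed pre-created
        "MinSize": 2,                                      # at least one instance per AZ
        "MaxSize": 6,                                      # scale out under load
        "AvailabilityZones": ["us-east-1a", "us-east-1b"], # multi-AZ placement
        "LoadBalancerNames": [elb_name],                   # instances register with the ELB
    }

params = autoscaling_group_params("joomla-web-asg", "joomla-elb")
# import boto3
# boto3.client("autoscaling", region_name="us-east-1").create_auto_scaling_group(**params)
```

Keeping `MinSize` at 2 with one instance per AZ is what turns the group from a scaling mechanism into an HA mechanism.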
Note: to tie the instances to a load balancer, you must first create the load balancer. For this I used AWS’s ELB service. Just like the DB layer, there are two ways to leverage load balancers in AWS: deploy them yourself on EC2 instances, or use the ELB service. The ELB service is inherently fault tolerant and is the preferred method. That said, if there are features you need or some other reason to run your own load balancer software on an EC2 instance, you can certainly go that route; you will then have to manage the fault tolerance of the load balancer layer yourself, just like anything else running on top of a bare EC2 instance.
ELB in US-EAST showing my 2 instances
Last but not least is DNS. AWS provides a fully managed DNS service called Route 53. It is also one of the services which is inherently fault tolerant. It’s very advanced, but in my case I used it very simply, to point my imaginary domain name at my load balancer in US-EAST:
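Pointing a domain at an ELB is done with an alias A record. The sketch below builds the `ChangeBatch` structure that Route 53’s `ChangeResourceRecordSets` API expects via boto3; the domain, ELB DNS name, and hosted zone ID are hypothetical placeholders:

```python
def alias_to_elb_change(domain, elb_dns_name, elb_hosted_zone_id):
    """Build the ChangeBatch for route53.change_resource_record_sets:
    an alias A record pointing the domain at an ELB."""
    return {
        "Changes": [{
            "Action": "UPSERT",   # create the record, or update it if it exists
            "ResourceRecordSet": {
                "Name": domain,
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": elb_hosted_zone_id,  # the ELB's zone, not your own
                    "DNSName": elb_dns_name,
                    "EvaluateTargetHealth": False,
                },
            },
        }]
    }

change_batch = alias_to_elb_change(
    "app.example.com.",                        # hypothetical domain
    "joomla-elb-1234.us-east-1.elb.amazonaws.com",
    "Z35SXDOTRQ7X7K",                          # hypothetical ELB hosted zone ID
)
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="ZMYZONEID", ChangeBatch=change_batch)
```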
Operating an Enterprise Application in AWS
With this application architecture built in AWS, Route 53 will send incoming requests to the load balancer(s) in US-EAST. From there, ELB will send traffic to the web/app tier EC2 instances, and the load balancing decision can be based on a variety of health and performance metrics. The web/app tier itself sits on EC2 instances which participate in an Autoscale group and live in multiple AZs. By being in an Autoscale group, the web/app tier can automatically scale out to more instances or scale in, again based on a variety of metrics. From there, the web/app tier communicates with the MasterDB living in US-EAST-1a. The MasterDB replicates to the SlaveDB in US-EAST-1b as well as to the SlaveDB living in the US-WEST region for DR purposes. One thing to note in the diagram is that the load balancer and app/web tier in US-WEST are colored white to indicate that they are set up but not turned on. This avoids the cost of running them in DR until they are actually needed. The SlaveDB is powered on, which allows it to receive up-to-date replication data from the MasterDB in US-EAST.
Route53 & ELB are both AWS managed services that are inherently fault tolerant. Thus, there is no need to do anything special to protect against their failure.
The EC2 instances for the app/web tier live in multiple AZs and will automatically be re-instantiated upon failure by the Autoscaling group. Because they are stateless this won’t be a problem for the application — ELB will simply route the traffic to an instance that is functioning.
RDS has been setup in a manner which has the databases living in multiple AZs and a read-only replica living in another region. In the event that the MasterDB goes down, failover will automatically occur to the SlaveDB by promoting it to Master and the web/app tier will talk to the new MasterDB in US-EAST-1b.
In the event of an entire AZ (data center) issue, ELB will simply route the traffic to the EC2 instances in the other AZ and the aforementioned DB failover will occur automatically. This should be a seamless transition for the most part.
If an entire region fails, there will need to be some orchestration to fire up the ELB and EC2 instances in US-WEST. At the same time the SlaveDB in US-WEST will need to be promoted to active and Route53 will need to be re-configured to point the DNS name(s) to the ELB in US-WEST. This can all be automated through the AWS API or done manually through the AWS console — this is the DR strategy.
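That orchestration can be sketched as an ordered runbook of API calls. All names below are hypothetical, and the Route 53 change is reduced to its essentials; in practice each step would be issued through the corresponding boto3 client against the DR region:

```python
def region_failover_plan(replica_id, dr_asg_name, domain, dr_elb_dns):
    """Ordered (service, api_call, kwargs) steps to fail the stack
    over from US-EAST to US-WEST."""
    return [
        # 1. Promote the DR read replica to a standalone, writable master.
        ("rds", "promote_read_replica",
         {"DBInstanceIdentifier": replica_id}),
        # 2. Spin up the dormant web/app tier in the DR region.
        ("autoscaling", "update_auto_scaling_group",
         {"AutoScalingGroupName": dr_asg_name, "MinSize": 2, "DesiredCapacity": 2}),
        # 3. Repoint DNS at the DR load balancer (simplified record change).
        ("route53", "change_resource_record_sets",
         {"Name": domain, "AliasTarget": dr_elb_dns}),
    ]

plan = region_failover_plan(
    "enterprise-db-dr",                        # hypothetical replica identifier
    "joomla-web-asg-dr",                       # hypothetical DR Autoscaling group
    "app.example.com.",
    "joomla-elb-dr.us-west-2.elb.amazonaws.com",
)
```

Ordering matters: the DB must be writable before the web/app tier comes up, and DNS moves last so traffic only shifts once the DR stack can serve it.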
This was just a very simple example of building and running an enterprise application architecture in AWS. In this case it was a simple Joomla application, but the architecture holds true for many enterprise applications. Scaling the app/web tier is done by autoscaling to more instances, and scaling the DB tier is done by scaling it up (i.e., upgrading AWS instance sizes).
The services that AWS has built on top of a simple IaaS (EC2) and data storage (EBS, S3) are what allow these “legacy” enterprise applications to be deployed in AWS. In most cases it makes sense to leverage these services instead of building them yourself on top of EC2.
While the application did not have to be modified, it was re-architected slightly (i.e., to leverage autoscaling groups, multi-AZ RDS deployment, ELB, Route 53, etc.) to run properly in AWS. We often hear that applications must be “designed for failure”, but I think what most enterprises will find is that there is a continuum for designing applications for cloud. Legacy applications can be re-architected “slightly” to get much of the benefit of running in the public cloud, or they can be re-written completely to take full advantage. An example of a complete re-write would be switching the DB layer for a scale-out NoSQL variant that can run active in multiple AZs and regions and doesn’t require data sharding when scaling out. This would also require code changes in the web/app tier.
I plan to do similar experiments in other public clouds, so stay tuned.