- Use SQS Batch Requests to reduce the number of requests hitting SQS, which saves costs. Sending 10 messages in a single batch request saves $30/month in the example.
- Use SQS Long Polling to reduce extra polling requests; cutting down empty receives saves ~$600 in leakage costs in the example.
- Choose the right search technology by matching it to your activity pattern. For a small application with constant load, a heavily utilized search tier, or seasonal loads, Amazon CloudSearch looks like the cost-efficient play.
- Use Amazon CloudFront Price Classes to minimize costs by selecting the right Price Class for your audience, potentially reducing delivery costs by excluding CloudFront's more expensive edge locations.
- Optimize ElastiCache cluster costs by right-sizing cluster node sizes. For different usage scenarios (heavy, moderate, low) there are optimal instance types; choosing the right type for each scenario saves money.
- Amazon Auto Scaling can save costs by better matching demand and capacity. Certainly not a new idea but the diagrams, different leakage scenarios (daily spike, weekly fluctuation, seasonal spike), and the explanation of potential savings (substantial) are well done.
- Use the Amazon S3 Object Expiration feature to delete old backups, logs, documents, digital media, etc. A leakage of ~20 TB adds up to a tidy ~$1,650 a year.
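To make the SQS arithmetic above concrete, here is a back-of-the-envelope cost model. The per-request price and traffic figures are illustrative placeholders (not the article's exact numbers); check current SQS pricing before relying on them:

```python
# Hypothetical SQS price of $0.01 per 10,000 requests -- a placeholder,
# not current pricing.
PRICE_PER_REQUEST = 0.01 / 10_000

def monthly_cost(requests_per_day, days=30):
    """Cost of a steady request rate sustained for a month."""
    return requests_per_day * days * PRICE_PER_REQUEST

# 1M messages/day sent one at a time vs. in batches of 10.
unbatched = monthly_cost(1_000_000)
batched = monthly_cost(1_000_000 / 10)
batch_savings = unbatched - batched          # ~90% of the send cost

# Empty-receive leakage: 100 queues short-polled once a second all month.
# Long polling makes most of these receives go away.
empty_receive_cost = monthly_cost(100 * 86_400)
```

With these placeholder numbers, batching cuts the send bill from $30/month to $3/month, and the idle short-polling alone costs over $250/month, which is the kind of leakage long polling eliminates.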
All Your Base
AWS has already given you a lot of storage and processing options to choose from, and today we are adding a really important one.
You can now use Apache HBase to store and process extremely large amounts of data (think billions of rows and millions of columns per row) on AWS. HBase offers a number of powerful features including:
- Strictly consistent reads and writes.
- High write throughput.
- Automatic sharding of tables.
- Efficient storage of sparse data.
- Low-latency data access via in-memory operations.
- Direct input and output to Hadoop jobs.
- Integration with Apache Hive for SQL-like queries over HBase tables, joins, and JDBC support.
HBase in Action
HBase has been optimized for low-latency lookups and range scans, with efficient updates and deletions of individual records. Here are some of the things that you can do with it:
Reference Data for Hadoop Analytics – Because HBase is integrated into Hadoop and Hive and provides rapid access to stored data, it is a great way to store reference data that will be used by one or more Hadoop jobs on a single cluster or across multiple Hadoop clusters.
Log Ingestion and Batch Analytics – HBase can handle real-time ingestion of log data with ease, thanks to its high write throughput and efficient storage of sparse data. Combining this with Hadoop’s ability to handle sequential reads and scans in a highly optimized fashion, and you have a powerful tool for log analysis.
Storage for High Frequency Counters and Summary Data – HBase supports high update rates (the classic read-modify-write) along with strictly consistent reads and writes. These features make it ideal for storing counters and summary data. Complex aggregations such as max-min, sum, average, and group-by can be run as Hadoop jobs and the results can be piped back into an HBase table.
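A minimal sketch of that counter/summary pattern in plain Python. The events and column names are made up for illustration, and the HBase write-back mentioned in the comment assumes the happybase client rather than anything specific to EMR:

```python
from collections import defaultdict

# Simulated page-view log records; in practice a Hadoop job would scan
# these out of an HBase table or HDFS.
events = [("home", 3), ("about", 1), ("home", 2), ("pricing", 5)]

# The group-by/sum aggregation the Hadoop job would run. The results
# could then be piped back into an HBase summary table, e.g. with the
# happybase client: table.counter_inc(page, b"stats:views", value=views)
summary = defaultdict(int)
for page, views in events:
    summary[page] += views
```

The same loop structure covers max-min, averages, and other aggregates; only the combine step changes.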
I should point out that HBase on EMR runs in a single Availability Zone and does not guarantee data durability; data stored in an HBase cluster can be lost if the master node in the cluster fails. Hence, HBase should be used for summarization or secondary data or you should make use of the backup feature described below.
You can do all of this (and a lot more) by running HBase on AWS. You’ll get all sorts of benefits when you do so:
Freedom from Drudgery – You can focus on your business and on your customers. You don’t have to set up, manage, or tune your HBase clusters. Elastic MapReduce will handle provisioning of EC2 instances, security settings, HBase configuration, log collection, health monitoring, and replacement of faulty instances. You can even expand the size of your HBase cluster with a single API call.
Backup and Recovery – You can schedule full and incremental backups of your HBase data to Amazon S3. You can rollback to an old backup on an existing cluster or you can restore a backup to a newly launched cluster.
You can start HBase from the command line by launching your Elastic MapReduce cluster with the --hbase flag:
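For example, with the Ruby elastic-mapreduce CLI of the era. Only --create and --hbase are the essential flags here; the name, instance count, and instance type are placeholders:

```shell
# Launch an EMR cluster with HBase installed.
./elastic-mapreduce --create --hbase \
  --name "My HBase Cluster" \
  --num-instances 3 \
  --instance-type m1.large
```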
You can also start it from the Create New Cluster page of the AWS Management Console:
When you create your HBase Job Flow from the console you can restore from an existing backup, and you can also schedule future backups:
Beyond the Basics
Here are a couple of advanced features and options that might be of interest to you:
You can modify your HBase configuration at launch time by using an EMR bootstrap action. For example, you can alter the maximum file size (hbase.hregion.max.filesize) or the maximum size of the memstore (hbase.regionserver.global.memstore.upperLimit).
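A hedged sketch of such a bootstrap action, again using the old elastic-mapreduce CLI. The exact argument syntax of configure-hbase varied between CLI versions, so treat this as illustrative rather than copy-paste ready:

```shell
# Override an hbase-site.xml property at launch via the predefined
# configure-hbase bootstrap action (-s sets a site configuration key).
./elastic-mapreduce --create --hbase --name "Tuned HBase" \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase \
  --args "-s,hbase.hregion.max.filesize=52428800"
```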
You can monitor your cluster with the standard CloudWatch metrics that are generated for all Elastic MapReduce job flows. You can also install Ganglia at startup time by invoking a pair of predefined bootstrap actions (install-ganglia and configure-hbase-for-ganglia). We plan to add additional metrics, specific to HBase, over time.
You can run Apache Hive on the same cluster, or you can install it on a separate cluster. Hive will run queries transparently against HBase and Hive tables. We do advise you to proceed with care when running both on the same cluster; HBase is CPU and memory intensive, while most other MapReduce jobs are I/O bound, with fixed memory requirements and sporadic CPU usage.
HBase job flows are always launched with EC2 Termination Protection enabled. You will need to confirm your intent to terminate the job flow.
I hope you enjoy this powerful new feature!
PS – There is no extra charge to run HBase. You pay the usual rates for Elastic MapReduce and EC2.
Yesterday The 451 Group published a report asking “How will the database incumbents respond to NoSQL and NewSQL?”
That prompted the pertinent question, “What do you mean by ‘NewSQL’?”
Since we are about to publish a report describing our view of the emerging database landscape, including NoSQL, NewSQL and beyond (now available), it is probably a good time to define what we mean by NewSQL. (I haven't mentioned the various NoSQL projects in this post, but they are covered extensively in the report. More on them another day.)
“NewSQL” is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as ‘ScalableSQL’ to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term ‘NewSQL’ in the new report.
And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.
So who would we consider to be the NewSQL vendors? Like NoSQL, NewSQL is used to describe a loosely-affiliated group of companies (ScaleBase has done a good job of identifying some of the several NewSQL sub-types), but what they have in common is the development of new relational database products and services designed to bring the benefits of the relational model to distributed architectures, or to improve the performance of relational databases to the extent that horizontal scalability is no longer a necessity.
In the first group we would include (in no particular order) Clustrix, GenieDB, ScalArc, Schooner, VoltDB, RethinkDB, ScaleDB, Akiban, CodeFutures, ScaleBase, Translattice, and NimbusDB, as well as Drizzle, MySQL Cluster with NDB, and MySQL with HandlerSocket. The latter group includes Tokutek and JustOne DB. The associated “NewSQL-as-a-service” category includes Amazon Relational Database Service, Microsoft SQL Azure, Xeround, Database.com and FathomDB.
(Links provide access to 451 Group coverage for clients. Non-clients can also apply for trial access).
Clearly there is the potential for overlap with NoSQL. It remains to be seen whether RethinkDB will be delivered as a NoSQL key value store for memcached or a “NewSQL” storage engine for MySQL, for example. While at least one of the vendors listed above is planning to enable the use of its database as a schema-less store, we also expect to see support for SQL queries added to some NoSQL databases. We are also sure that Citrusleaf won’t be the last NoSQL vendor to claim support for ACID transactions.
NewSQL is not about attempting to re-define the database market using our own term, but it is useful to broadly categorize the various emerging database products at this particular point in time.
Another clarification: ReadWriteWeb has picked up on this post and reported on the “NewSQL Movement”. I don’t think there is a movement in the sense that we saw the various NoSQL projects/vendors come together under the NoSQL umbrella with a common purpose. Perhaps the NewSQL players will do so (VoltDB and NimbusDB have reacted positively to the term, and Tokutek has become the first that I am aware of to explicitly describe its technology as NewSQL). As Derek Stainer notes, however: “In the end it’s just a name, a way to categorize a group of similar solutions.”
In the meantime, we have already noted the beginning of the end of NoSQL, and the lines are blurring to the point where we expect the terms NoSQL and NewSQL will become irrelevant as the focus turns to specific use cases.
The identification of specific adoption drivers and use cases is the focus of our forthcoming long-form report on NoSQL, NewSQL and beyond, from which the 451 Group report cited above is excerpted.
The report contains an overview of the roots of NoSQL and profiles of the major NoSQL projects and vendors, as well as analysis of the drivers behind the development and adoption of NoSQL and NewSQL databases, the evolving role of data grid technologies, and associated use cases.
It will be available very soon from the Information Management and CAOS practices, and we will also publish more details of the key drivers as we see them and our view of the current database landscape here.
Since The Big List Of Articles On The Amazon Outage was published we’ve had a few updates that people might not have seen. Amazon of course released their Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. Netflix shared their Lessons Learned from the AWS Outage, as did Heroku (How Heroku Survived the Amazon Outage), SmugMug (How SmugMug survived the Amazonpocalypse), and SimpleGeo (How SimpleGeo Stayed Up During the AWS Downtime).
The curious thing from my perspective is the general lack of response to Amazon’s explanation. I expected more discussion. There’s been almost none that I’ve seen. My guess is very few people understand what Amazon was talking about enough to comment whereas almost everyone feels qualified to talk about the event itself.
Lesson for crisis handlers: deep dive post-mortems that are timely, long, honestish, and highly technical are the most effective means of staunching the downward spiral of media attention.
Amazon’s Explanation Of What Happened
- Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region
- Hacker News thread on AWS Service Disruption Post Mortem
- Quite Funny Commentary on the Summary
- AWS outage follow-up: if you wanted details, you got details! by RightScale
- Amazon’s Own Post Mortem by Jeff Darcy
Experiences From Specific Companies, Both Good And Bad
- Lessons Netflix Learned from the AWS Outage by several Netflixians on the Netflix Tech Blog
- How Heroku Survived the Amazon Outage on the Heroku status page
- How SimpleGeo Stayed Up During the AWS Downtime by Mike Malone
- How SmugMug survived the Amazonpocalypse by Don MacAskill (Hacker News discussion)
- How Bizo survived the Great AWS Outage of 2011 relatively unscathed… by Someone at Bizo
- Joe Stump’s explanation of how SimpleGeo survived
- How Netflix Survived the Outage
- Why Twilio Wasn’t Affected by Today’s AWS Issues on Twilio Engineering’s Blog (Hacker News thread)
- On reddit’s outage
- What caused the Quora problems/outage in April 2011?
- Availability, redundancy, failover and data backups at LearnBoost
- How our small startup survived the Amazon EC2 Cloud-pocalypse from mobile app developer
- Recovering from Amazon cloud outage by Drew Engelson of PBS.
- PBS was affected for a while primarily because we do use EBS-backed RDS databases. Despite being spread across multiple availability-zones, we weren’t easily able to launch new resources ANYWHERE in the East region since everyone else was trying to do the same. I ended up pushing the RDS stuff out West for the time being. From Comment
Amazon Web Services Discussion Forum
A fascinating peek into the experiences of people who were dealing with the outage while they were experiencing it. Great real-time social archeology in action.
- Amazon Web Services Discussion Forum
- Cost-effective backup plan from now on?
- Life of our patients is at stake – I am desperately asking you to contact
- Why did the EBS, RDS, Cloudformation, Cloudwatch and Beanstalk all fail?
- Moved all resources off of AWS
- Any success stories?
- Is the mass exodus from East going to cause demand problems in the West?
- Finally back online after about 71 hours
- Amazon EC2 features vs windows azure
- Aren’t Availability Zones supposed to be “insulated from failures”?
- What a lot of people aren’t realizing about the downtime:
- ELB CNAME
- Availability Zones were used in a misleading manner
- Tip: How to recover your instance
- Crying in Forum Gets Results, Silver-level AWS Premium Support Doesn’t
- Well-worth reading: “design for failure” cloud deployment strategy
- New best practice
- Don’t bother with Premium Support
- Best practices for multi-region redundancy
- Learning from this case
- Amazon, still no instructions what to do?
- Anyone else prepared for an all-nighter?
- Is Jeff Bezos going to give a public statement?
- Rackspace, GoGrid, StormonDemand and Others
- Jeff Barr, Werner Vogels and other AWS persons – where have you been???
- After you guys fix EBS do I have do anything on my side?
- Need Help!!! Lives of people and billions in revenue are at risk now!!!
- I’ve Got A Suspicion
- Farewell EC2, Farewell
There were also many, many instances of support and help in the log.
- Amazon EC2 outage: summary and lessons learned by RightScale
- AWS outage timeline & downtimes by recovery strategy by Eric Kidd
- The Aftermath of Amazon’s Cloud Outage by Rich Miller
Taking Sides: It’s The Customer’s Fault
- So Your AWS-based Application is Down? Don’t Blame Amazon by The Storage Architect
- The Cloud is not a Silver Bullet by Joe Stump (Hacker News thread)
- The AWS Outage: The Cloud’s Shining Moment by George Reese (Hacker News discussion)
- Failing to Plan is Planning to Fail by Ted Theodoropoulos
- Get a life and build redundancy/resiliency in your apps on the Cloud Computing group
Taking Sides: It’s Amazon’s Fault
- Stop Blaming the Customers – the Fault is on Amazon Web Services by Klint Finley
- AWS is down: Why the sky is falling by Justin Santa Barbara (Hacker News thread)
- Amazon Web Services are down – Huge Hacker News thread
- The EC2/EBS outage: What Amazon didn’t tell you by Jeremy Gaddis
Lessons Learned And Other Insight Articles
- Amazon’s EBS outage by Robin Harris of StorageMojo
- People Using Amazon Cloud: Get Some Cheap Insurance At Least by Bob Warfield
- Basic scalability principles to avert downtime by Ronald Bradford
- Amazon crash reveals ‘cloud’ computing actually based on data centers by Kevin Fogarty
- Seven lessons to learn from Amazon’s outage By Phil Wainewright
- The Cloud and Outages : Five Key Lessons by Patrick Baillie (Cloud Computing Group discussion)
- Some thoughts on outages by Till Klampaeckel
- Amazon.com’s real problem isn’t the outage, it’s the communication by Keith Smith
- How to work around Amazon EC2 outages by James Cohen (Hacker News thread)
- Today’s EC2 / EBS Outage: Lessons learned on Agile Sysadmin
- Amazon EC2 has gone down -what would a prefered hosting platform be? on Focus
- Single Points of Failure by Mat
- Coping with Cloud Downtime with Puppet
- Amazon Outage Concerns Are Overblown by Tim Crawford
- Where There Are Clouds, It Sometimes Rains by Clay Loveless
- Availability, redundancy, failover and data backups at LearnBoost by Guillermo Rauch
- Cloud hosting vs colocation by Chris Chandler (Hacker News thread)
- Amazon’s EC2 & EBS outage by Arnon Rotem-Gal-Oz
- Complex Systems Have Complex Failures. That’s Cloud Computing by Greg Ferro
- Amazon Web Services, Hosting in the Cloud and Configuration Management by Ian Chilton
- Lessons learned from deploying a production database in EC2 by Grig Gheorghiu of Agile Testing
- Bezos on Amazon as a technology and invention company by John Gruber on Daring Fireball.
- On Importance of Planning for Failure by Dmitriy Samovskiy
- Amazon Outage Proves Value of Riak’s Vision by Basho
- Magical Block Store: When Abstractions Fail Us by Mark Joyent (Hacker News discussion)
- On Cascading Failures and Amazon’s Elastic Block Store by Jason
- An unofficial EC2 outage postmortem – the sky is not falling from CloudHarmony
- Cloudfail: Lessons Learned from AWS Outage by Jyoti Bansal
tl;dr: Amazon had a major outage last week, which took down some popular websites. Despite using a lot of Amazon services, SmugMug didn’t go down because we spread across availability zones and designed for failure to begin with, among other things.
We’ve known for quite some time that SkyNet was going to achieve sentience and attack us on April 21st, 2011. What we didn’t know is that Amazon’s Web Services platform (AWS) was going to be their first target, and that the attack would render many popular websites inoperable while Amazon battled the Terminators.
Sorry about that, that was probably our fault for deploying SkyNet there in the first place.
We’ve been getting a lot of questions about how we survived (SmugMug was minimally impacted, and all major services remained online during the AWS outage) and what we think of the whole situation. So here goes.
HOW WE DID IT
We’re heavy AWS users with many petabytes of storage in their Simple Storage Service (S3) and lots of Elastic Compute Cloud (EC2) instances, load balancers, etc. If you’ve ever visited a SmugMug page or seen a photo or video embedded somewhere on the web (and you probably have), you’ve interacted with our AWS-powered services. Without AWS, we wouldn’t be where we are today – outages or not. We’re still very excited about AWS even after last week’s meltdown.
I wish I could say we had some sort of magic bullet that helped us stay alive. I’d certainly share it if I had one. In reality, our stability during this outage stemmed from four simple things:
First, all of our services in AWS are spread across multiple Availability Zones (AZs). We’d use 4 if we could, but one of our AZs is capacity constrained, so we’re mostly spread across three. (I say “one of our” because your “us-east-1b” is likely different from my “us-east-1b” – every customer is assigned to different AZs and the names don’t match up). When one AZ has a hiccup, we simply use the other AZs. Often this is graceful, but there can be hiccups – there are certainly tradeoffs.
Second, we designed for failure from day one. Any of our instances, or any group of instances in an AZ, can be “shot in the head” and our system will recover (with some caveats – but they’re known, understood, and tested). I wish we could say this about some of our services in our own datacenter, but we’ve learned from our earlier mistakes and made sure that every piece we’ve deployed to AWS is designed to fail and recover.
Third, we don’t use Elastic Block Storage (EBS), which is the main component that failed last week. We’ve never felt comfortable with the unpredictable performance and sketchy durability that EBS provides, so we’ve never taken the plunge. Everyone (well, except for a few notable exceptions) knows that you need to use some level of RAID across EBS volumes if you want some reasonable level of durability (just like you would with any other storage device like a hard disk), but even so, EBS just hasn’t seemed like a good fit for us. Which also rules out their Relational Database Service (RDS) for us – since I believe RDS is, under the hood, EC2 instances running MySQL on EBS. I’ll be the first to admit that EBS’ lack of predictable performance has been our primary reason for staying away, rather than durability, but durability & availability have been a strong secondary consideration. Hard to advocate a “systems are disposable” strategy when they have such a vital dependency on another service. Clearly, at least to us, it’s not a perfect product for our use case.
Which brings us to fourth, we aren’t 100% cloud yet. We’re working as quickly as possible to get there, but the lack of a performant, predictable cloud database at our scale has kept us from going there 100%. As a result, the exact types of data that would have potentially been disabled by the EBS meltdown don’t actually live at AWS at all – it all still lives in our own datacenters, where we can provide predictable performance. This has its own downsides – we had two major outages ourselves this week (we lost a core router and its redundancy earlier, and a core master database server later). I wish I didn’t have to deal with routers or database hardware failures anymore, which is why we’re still marching towards the cloud.
So what did we see when AWS blew up? Honestly, not much. One of our Elastic Load Balancers (ELBs) on a non-critical service lost its mind and stopped behaving properly, especially with regards to communication with the affected AZs. We updated our own status board, and then I tried to work around the problem. We quickly discovered we could just launch another identical ELB, point it at the non-affected zones, and update our DNS. 5 minutes after we discovered this, DNS had propagated, and we were back in business. It’s interesting to note that the ELB itself was affected here – not the instances behind it. I don’t know much about how ELBs operate, but this leads me to believe that ELBs are constructed, like RDS, out of EC2 instances with EBS volumes. That seems like the most logical reason why an ELB would be affected by an EBS outage – but other things like network saturation, network component failures, split-brain, etc could easily cause it as well.
Probably the worst part about this whole thing is that the outage in question spread to more than one AZ. In theory, that’s not supposed to happen – I believe each AZ is totally isolated (physically in another building at the very least, if not on the other side of town), so there should be very few shared components. In practice, I’ve often wondered how AWS does capacity planning for total AZ failures. You could easily imagine people’s automated (and even non-automated) systems simply rapidly provisioning new capacity in another AZ if there’s a catastrophic event (like Terminators attacking your facility, say). And you could easily imagine that surge in capacity taking enough toll on one or more AZs to incapacitate them, even temporarily, which could cause a cascade effect. We’ll have to wait for the detailed post-mortem to see if something similar happened here, but I wouldn’t be surprised if a surge in EBS requests to a 2nd AZ had at least a deteriorating effect. Getting that capacity planning done just right is just another crazy difficult problem that I’m glad I don’t have to deal with for all of our AWS-powered services.
This stuff sounds super simple, but it’s really pretty important. If I were starting anew today, I’d absolutely build 100% cloud, and here’s the approach I’d take:
- Spread across as many AZs as you can. Use all four. Don’t be like this guy and put all of the monitoring for your poor cardiac arrest patients in one AZ (!!).
- If your stuff is truly mission critical (banking, government, health, serious money maker, etc), spread across as many Regions as you can. This is difficult, time consuming, and expensive – so it doesn’t make sense for most of us. But for some of us, it’s a requirement. This might not even be live – just for Disaster Recovery (DR).
- Beyond mission critical? Spread across many providers. This is getting more and more difficult as AWS continues to put distance between themselves and their competitors, grow their platform and build services and interfaces that aren’t trivial to replicate, but if your stuff is that critical, you probably have the dough. Check out Eucalyptus and Rackspace Cloud for starters.
- I should note that since spreading across multiple Regions and providers adds crazy amounts of extra complexity, and complex systems tend to be less stable, you could be shooting yourself in the foot unless you really know what you’re doing. Often redundancy has a serious cost – keep your eyes wide open.
- Build for failure. Each component (EC2 instance, etc) should be able to die without affecting the whole system as much as possible. Your product or design may make that hard or impossible to do 100% – but I promise large portions of your system can be designed that way. Ideally, each portion of your system in a single AZ should be killable without long-term (data loss, prolonged outage, etc) side effects. One thing I mentally do sometimes is pretend that all my EC2 instances have to be Spot instances – someone else has their finger on the kill switch, not me. That’ll get you to build right.
- Understand your components and how they fail. Use any component, such as EBS, only if you fully understand it. For mission-critical data using EBS, that means RAID1/5/6/10/etc locally, and some sort of replication or mirroring across AZs, with some sort of mechanism to get eventually consistent and/or re-instantiate after failure events. There’s a lot of work being done in modern scale-out databases, like Cassandra, for just this purpose. This is an area we’re still researching and experimenting in, but SimpleGeo didn’t seem affected and they use Cassandra on EC2 (and on EBS, as far as I know), so I’d say that’s one big vote.
- Try to componentize your system. Why take the entire thing offline if only a small portion is affected? During the EBS meltdown, a tiny portion of our site (custom on-the-fly rendered photo sizes) was affected. We didn’t have to take the whole site offline, just that one component for a short period to repair it. This is a big area of investment at SmugMug right now, and we now have a number of individual systems that are independent enough from each other to sustain partial outages but keep service online. (Incidentally, it’s AWS that makes this much easier to implement)
- Test your components. I regularly kill off stuff on EC2 just to see what’ll happen. I found and fixed a rare bug related to this over the weekend, actually, that’d been live and in production for quite some time. Verify your slick new eventually consistent datastore is actually eventually consistent. Ensure your amazing replicator will actually replicate correctly or allow you to rebuild in a timely fashion. Start by doing these tests during maintenance windows so you know how it works. Then, once your system seems stable enough, start surprising your Ops and Engineering teams by killing stuff in the middle of the day without warning them. They’ll love you.
- Relax. Your stuff is gonna break, and you’re gonna have outages. If you did all of the above, your outages will be shorter, less damaging, and less frequent – but they’ll still happen. Gmail has outages, Facebook has outages, your bank’s website has outages. They all have a lot more time, money, and experience than you do and they’re offline or degraded fairly frequently, considering. Your customers will understand that things happen, especially if you can honestly tell them these are things you understand and actively spend time testing and implementing. Accidents happen, whether they’re in your car, your datacenter, or your cloud.
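The "kill stuff regularly" advice above can be sketched as a tiny chaos drill. The instance IDs are placeholders, and the actual termination call (left commented out) assumes the boto3 EC2 client rather than anything SmugMug describes using:

```python
import random

def pick_victim(instance_ids, seed=None):
    """Choose one instance to kill, chaos-monkey style.

    A seed makes the choice reproducible for a planned maintenance-window
    drill; omit it in production so nobody can predict what dies next.
    """
    rng = random.Random(seed)
    return rng.choice(instance_ids)

victim = pick_victim(["i-0aaa", "i-0bbb", "i-0ccc"])
# In a real drill you would then terminate it, e.g. with boto3:
# boto3.client("ec2").terminate_instances(InstanceIds=[victim])
print(f"Terminating {victim} - the system should survive this.")
```

If the system cannot survive losing whatever `pick_victim` returns, that component is the next thing to redesign.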
Best part? Most of that stuff isn’t difficult or expensive, in large part thanks to the on-demand pricing of cloud computing.
WHAT ABOUT AWS?
Amazon has some explaining to do about how this outage affected multiple AZs, no question. Even so, high volume sites like Netflix and SmugMug remained online, so there are clearly cloud strategies that worked. Many of the affected companies are probably taking good hard looks at their cloud architecture, as well they should. I know we are, even though we were minimally affected.
Still, SmugMug wouldn’t be where we are today without AWS. We had a monster outage (~8.5 hours of total downtime) with AWS a few years ago, where S3 went totally dark, but that’s been the only significant setback. Our datacenter related outages have all been far worse, for a wide range of reasons, as many of our loyal customers can attest. That’s one of the reasons we’re working so hard to get our remaining services out of our control and into Amazon’s – they’re still better at this than almost anyone else on earth.
Will we suffer outages in the future because of Amazon? Yes. I can guarantee it. Will we have fewer outages? Will we have less catastrophic outages? That’s my bet.
THE CLOUD IS DEAD!
There’s a lot of noise on the net about how cloud computing is dead, stupid, flawed, makes no sense, is coming crashing down, etc. Anyone selling that stuff is simply trying to get page views and doesn’t know what on earth they’re talking about. Cloud computing is just a tool, like any other. Some companies, like Netflix and SimpleGeo, likely understand the tool better. It’s a new tool, so cut the companies that are still learning some slack.
Then send them to my blog.
And, of course, we’re always hiring. Come see what it’s like to love your job (especially if you’re into cloud computing).
UPDATE: Joe Stump is out with the best blog post about the outage yet, The Cloud is not a Silver Bullet, imho.
Amazon has a very well-written account of their 8/8/2011 downtime: Summary of the Amazon EC2, Amazon EBS, and Amazon RDS Service Event in the EU West Region. Power failed, backup generators failed to kick in, there weren’t enough resources for EBS volumes to recover, API servers were overwhelmed, a DNS failure caused failovers to alternate availability zones to fail, and a double fault occurred as the power event interrupted the repair of a different bug. All kinds of typical stuff that just seems to happen.
Considering the previous outage, the big question for programmers is: what does this mean? What does it mean for how systems should be structured? Have we learned something that can’t be unlearned?
The Amazon post has lots of good insights into how EBS and RDS work, plus lessons learned. The short of the problem is large + complex = high probability of failure. The immediate fixes are adding more resources, more redundancy, more isolation between components, more automation, reduce recovery times, and build software that is more aware of large scale failure modes. All good, solid, professional responses. Which is why Amazon has earned a lot of trust.
We can predict, however, problems like this will continue to happen, not because of any incompetence by Amazon, but because: large + complex make cascading failure an inherent characteristic of the system. At some level of complexity any cloud/region/datacenter could be reasonably considered a single failure domain and should be treated accordingly, regardless of the heroic software infrastructure created to carve out availability zones.
Viewing a region as a single point of failure implies to be really safe you would need to be in multiple regions, which is to say multiple locations. Diversity as mother nature’s means of robustness would indicate using different providers as a good strategy. Something a lot of people have been saying for a while, but with more evidence coming in, that conclusion is even stronger now. We can’t have our cake and eat it too.
For most projects this conclusion doesn’t really matter all that much. 100% uptime is extremely expensive and Amazon will usually keep your infrastructure up and working. Most of the time multiple Availability Zones are all you need. And you can always say hey, we’re on Amazon, what can I do? It’s the IBM defense.
All this diversity of course is very expensive and very complicated. Double the budget. Double the complexity. The problem of synchronizing data across datacenters. The problem of failing over and recovering properly. The problem of multiple APIs. And so on.
Another option is a retreat into radical simplicity. Complexity provides a lot of value, but it also creates fragility. Is there a way to become radically simpler?
When EC2 first started, the mental model was of a magic Pez dispenser supplying an infinite stream of instances in any desired flavor. If you needed an instance, because of either a failure or a traffic spike, it would be there. As amazing as EC2 is, this model turned out to be optimistic.
From a thread on the Amazon discussion forum we learn any dispenser has limits:
As Availability Zones grow over time, our ability to continue to expand them can become constrained. In these scenarios, we will prevent customers from launching in the constrained zone if they do not yet have existing resources in that zone. We also might remove the constrained zone entirely from the list of options for new customers. This means that occasionally, different customers will see a different number of Availability Zones in a particular Region. Both approaches aim to help customers avoid accidentally starting to build up their infrastructure in an Availability Zone where they might have less ability to expand.
The solution: if you need guaranteed resources in different zones, purchase Reserved Instances. This will assure capacity when needed. There’s no way to know if the instance types you are interested in are available in an availability zone, so reserving instances is the only solution.
Architecturally this is a pain and removes part of the win of the cloud. Having nailed up instances is nearly one step from dedicated machines in a colo. And now, if you can’t count on on-demand instances, your architecture requires enough reserved machines to handle a disaster scenario, which means your fixed costs are high enough that you really need to make use of those reserved instances all the time, unless you have the money to just keep them as backup for failover. A much more complicated scenario, but I guess you have to run out of Pez eventually.