Amazon has a very well written account of their 8/8/2011 downtime: Summary of the Amazon EC2, Amazon EBS, and Amazon RDS Service Event in the EU West Region. Power failed, backup generators failed to kick in, there weren’t enough resources for EBS volumes to recover, API servers were overwhelmed, a DNS failure caused failovers to alternate availability zones to fail, and a double fault occurred as the power event interrupted the repair of a different bug. All kinds of typical stuff that just seems to happen.
Considering the previous outage, the big question for programmers is: what does this mean? What does it mean for how systems should be structured? Have we learned something that can’t be unlearned?
The Amazon post has lots of good insights into how EBS and RDS work, plus lessons learned. The short version of the problem is large + complex = high probability of failure. The immediate fixes are adding more resources, more redundancy, more isolation between components, and more automation, reducing recovery times, and building software that is more aware of large scale failure modes. All good, solid, professional responses, which is why Amazon has earned a lot of trust.
We can predict, however, that problems like this will continue to happen, not because of any incompetence by Amazon, but because large + complex makes cascading failure an inherent characteristic of the system. At some level of complexity any cloud/region/datacenter could reasonably be considered a single failure domain and should be treated accordingly, regardless of the heroic software infrastructure created to carve out availability zones.
Viewing a region as a single point of failure implies that to be really safe you would need to be in multiple regions, which is to say multiple locations. Diversity is mother nature’s means of robustness, which suggests that using different providers is a good strategy. A lot of people have been saying this for a while, but with more evidence coming in, the conclusion is even stronger now. We can’t have our cake and eat it too.
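To make the failure-domain argument a bit more concrete, here is a back-of-the-envelope sketch in Python. It is only an illustration under stated assumptions: the function names and every number are made up, failures across regions are assumed to be independent, and anything the zones of a region share (power, DNS, the control plane) is lumped into a single hypothetical `shared_availability` factor.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(availabilities):
    """Chance that at least one deployment is up, ASSUMING independent failures."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


def zoned_region_availability(shared_availability, zone_availabilities):
    """Availability of one region whose zones all depend on shared infrastructure.

    The shared part is modeled as a single factor (an assumption), so extra
    zones can never lift the region above shared_availability.
    """
    return shared_availability * combined_availability(zone_availabilities)


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative numbers only, not measurements of any real provider.
    one_region = zoned_region_availability(0.999, [0.99, 0.99, 0.99])
    two_regions = combined_availability([one_region, one_region])
    print(f"one region:  {downtime_hours_per_year(one_region):.2f} hours down per year")
    print(f"two regions: {downtime_hours_per_year(two_regions):.4f} hours down per year")
```

In this toy model, adding zones inside a region quickly stops helping because the shared factor dominates, which is the sense in which a region behaves like a single failure domain. A second region only multiplies the odds in your favor if its failures really are independent of the first.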
For most projects this conclusion doesn’t really matter all that much. 100% uptime is extremely expensive and Amazon will usually keep your infrastructure up and working. Most of the time multiple Availability Zones are all you need. And you can always say hey, we’re on Amazon, what can I do? It’s the IBM defense.
All this diversity is, of course, very expensive and very complicated. Double the budget. Double the complexity. The problem of synchronizing data across datacenters. The problem of failing over and recovering properly. The problem of multiple APIs. And so on.
Another option is a retreat into radical simplicity. Complexity provides a lot of value, but it also creates fragility. Is there a way to become radically simpler?