Perspective: Do Customers Share Blame in Amazon Outages?
The more troubling situation is the infrequent failures that have human error involved, which result in more widespread service failure. In other words, it's not just one application's resources being unavailable, but a service being out for a large number of applications.
It's tempting to believe the problem is that Amazon just doesn't have good process or smart enough people working for it and that, if those aspects were addressed by it (or another provider), then these infrequent failures wouldn't occur.
This attitude is wrong. These corner case outages will continue, unfortunately. We are building a new model of computing-highly automated and vastly scaled, with rich functionality-and the industry is still learning how to operate and manage this new mode of computing. Inevitably, mistakes will occur. The mistakes are typically not simple errors but, rather, unusual conditions triggering unexpected events within the infrastructure. While cloud providers will do everything they can to prevent such situations, they will undoubtedly occur in the future.
In the End, It Comes Down To Risk
What is the solution for these infrequent yet widespread service outages? AWS recommends more extensive redundancy measures that span geographic regions. Given AWS scoping, that would protect against region-wide resource unavailability. There's only one problem. Implementing more expansive redundancy is complex and expensive-far more so than the simpler measures associated with resource redundancy.
Tips: Mitigating the Risk of Cloud Services Failure: How to Avoid Getting Amazon-ed
This brings us back to the topic of risk. Remember, it's frequency probability measured against magnitude of loss associated with a failure. You have to evaluate how frequently you expect these less-frequent, larger-scale resource failures to occur and compare that to the cost of preventing them via design and operations. In some sense, one is evaluating the cost of careful design and operation vs. the cost of a more general failure.
Certainly the cost of the design and operation can be worked out, while many people prefer to avoid thinking of the cost of a more widespread failure that would take their application offline. However, as more large revenue applications move to AWS, failing to evaluate risk and implement appropriate failure-resistant measures will be imprudent.
Overall, it's not as though the possibility of these outages is unknown, or that the appropriate mitigation techniques are easily discoverable as well. You should expect that CSPs will suffer general resource outages and not blame the provider in the event of such an outage. Instead, you should recognise that you made a decision without perhaps acknowledging the risk associated with it. Those who look at these outages and choose to do nothing more than damn the provider and demand perfection don't recognise how dangerous a game they are playing.
Bernard Golden is the vice president of Enterprise Solutions for enStratus Networks, a cloud management software company. He is the author of three books on virtualisation and cloud computing, including Virtualization for Dummies. Follow Bernard Golden on Twitter @bernardgolden.
Sign up for Computerworld eNewsletters.