Frictions in the Cloud, Part 2

Abort, Retry, Fail?

Details have finally begun to emerge in the past few days regarding exactly what caused the outage of Amazon's cloud-computing service. Amazon's own account is rather dense and chewy, but Ars Technica has provided a more digestible explanation.

Reading that account, it's apparent that the failure of Amazon's EC2 (Elastic Compute Cloud) is quite unlike some of the famous historical instances of computer failure or near-failure: the mathematical error that caused Mariner 1 to be destroyed shortly after launch, the race condition that caused the Therac-25 cancer-therapy machine to dispense life-threatening doses of radiation, or even the much-discussed Y2K problem. In those cases a defect local to a particular machine led in fairly straightforward fashion to a bad output from that system. That was not the case with Amazon's recent troubles.

On the other hand, their resemblance to the 1990 failure of the AT&T long-distance network is uncanny. In both cases an initially localized failure rapidly spread across the system because of the very procedures intended to recover from error. In the AT&T case, a software error meant that when a switch received a set of "I'm okay" messages from a recovering neighbor, it reset itself and then sent out its own "I'm okay" messages, which in turn caused other switches to conclude they had entered a bad state and reset themselves, and so on. In the Amazon case, an accidental overloading of one network left the system's nodes unable to reach their existing backups. They all began simultaneously trying to create new backups, overloading the main network and bringing down the overall management system.

The go-to man for understanding these kinds of failures is Charles Perrow. Perrow points to two prerequisites for serious systemic failures:

  1. A complex system (that is, one with many components and many connections among them--a dense graph, in computer-science terms)
  2. A tightly-coupled system (that is, one where event B follows event A without any chance for human intervention in-between)

The presence of both of these conditions is a prerequisite for what Perrow calls "normal accidents": accidents whose occurrence (though not their specific etiology) is so predictable that we should consider them normal. Both conditions certainly apply to the cloud. Furthermore, it's eerie that in the AT&T case, the Amazon case, and Perrow's canonical case of Three Mile Island, it was the very fail-safe mechanisms meant to ensure error recovery that escalated the problem from nuisance to systemic failure. Should we then expect more accidents in the cloud? Perhaps, but as long as they are not too frequent, I expect they will be a price most users are willing to pay. What worries me more are the non-accidents: when hostile forces are seeking to exploit your complex system, both the risks and the consequences of failure escalate (cf. the PlayStation Network).
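To make that feedback loop concrete, here is a toy simulation of a re-mirroring storm of the kind described above: nodes that cannot reach their backups generate heavy recovery traffic on a shared network, and that very traffic cuts still more nodes off from their backups. All of the node counts, load figures, and names here are invented for illustration and bear no relation to EC2's actual architecture or parameters.

```python
# Toy model of a re-mirroring storm. When recovery traffic itself
# overloads the shared network, each node's attempt to create a new
# backup pushes more nodes past the point where they can reach theirs.
# All numbers are illustrative, not Amazon's actual parameters.

NETWORK_CAPACITY = 100       # arbitrary units of bandwidth
NORMAL_LOAD_PER_NODE = 1     # steady-state traffic per healthy node
REMIRROR_LOAD_PER_NODE = 5   # recovery traffic is much heavier
NUM_NODES = 30

def simulate(initially_cut_off):
    """Return, tick by tick, how many nodes are stuck re-mirroring."""
    remirroring = initially_cut_off
    history = [remirroring]
    for _ in range(10):
        healthy = NUM_NODES - remirroring
        load = (healthy * NORMAL_LOAD_PER_NODE
                + remirroring * REMIRROR_LOAD_PER_NODE)
        if load > NETWORK_CAPACITY:
            # Congestion cuts more nodes off from their backups, so
            # they too begin re-mirroring: positive feedback.
            overload = load - NETWORK_CAPACITY
            newly_cut_off = min(healthy, overload // NORMAL_LOAD_PER_NODE)
            remirroring += newly_cut_off
        history.append(remirroring)
    return history

print(simulate(initially_cut_off=2))   # a small fault stays contained
print(simulate(initially_cut_off=20))  # a larger one cascades to every node
```

The point of the sketch is Perrow's tight coupling: below a threshold the recovery procedure works exactly as intended, but past it the same procedure drives the system to total failure with no opportunity for human intervention in between.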