Cascading Failures

05/18/2007

Disasters rarely have just one cause. If you look at most plane crashes, bridge failures, nuclear plant meltdowns, and so on, there are usually a couple of different, possibly unrelated problems that conspire to cause utter disaster.

One small failure happens, and there's a problem with the error recovery, which ends up making things worse, which triggers a larger fault, and before you know it you have a cascading failure, and a real disaster, on your hands.

Mike Nygard's new book, Release It! (which has been NUMBER ONE on Amazon's Software Engineering/Design Tools and Techniques bestseller list for a few weeks now) is chock full of patterns and antipatterns that you can use to make your software stand up to the harsh realities of real-world usage. They are all interrelated and many should be used in combination with each other, but here's a brief excerpt from just the Cascading Failures antipattern:


System failures start with a crack. That crack comes from some fundamental problem. Various mechanisms can retard or stop the crack, which are the topics of the next chapter. Absent those mechanisms, the crack can progress and even be amplified by some structural problems.

A cascading failure occurs when a crack in one layer triggers a crack in a calling layer.

An obvious example is a database failure. If an entire database cluster goes dark, then any application that calls the database is going to experience problems of some kind. If it handles the problems badly, then the application layer will start to fail. One system I saw would tear down any JDBC connection that ever threw a SQLException. Each page request would attempt to create a new connection, get a SQLException, try to tear down the connection, get another SQLException, and then vomit a stack trace all over the user.

Cascading failures require some mechanism to transmit the failure from one layer to another. The failure “jumps the gap” when bad behavior in the calling layer gets triggered by the failure condition in the called layer.

Cascading failures often result from resource pools that get drained because of a failure in a lower layer. Integration Points without Timeouts are a surefire way to create Cascading Failures.
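To make the point about timeouts concrete, here is a minimal sketch of a call to an integration point that cannot block its caller indefinitely. It is not code from the book: the class `TimedCall`, the method `callWithTimeout`, and the fallback-value approach are all assumptions made for illustration, built on the standard `java.util.concurrent` classes.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only.
final class TimedCall {
    private TimedCall() {}

    /**
     * Runs remoteCall on the given executor and waits at most timeoutMillis
     * for its result. If the call is slow, fails, or the caller is
     * interrupted, the fallback value is returned instead.
     */
    static <T> T callWithTimeout(ExecutorService executor,
                                 Callable<T> remoteCall,
                                 long timeoutMillis,
                                 T fallback) {
        Future<T> future = executor.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Frees the caller. The worker thread is only interrupted, which
            // helps only if the underlying call responds to interruption.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The remote call itself threw; treat it as a failed call.
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            future.cancel(true);
            return fallback;
        }
    }
}
```

A request-handling thread that goes through something like `callWithTimeout` gets control back after the limit even when the called layer hangs, so the failure is less likely to jump the gap. The executor's own threads can still pile up behind a hung call, so in practice the executor would need a bounded size as well.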


Just as integration points are the number-one source of cracks, cascading failures are the number-one crack accelerator. Preventing cascading failures is the very key to resilience. The most effective patterns to combat cascading failures are Circuit Breaker and Timeouts.

Stop cracks from jumping the gap

A cascading failure occurs when cracks jump from one system or layer to another, usually because of insufficiently paranoid integration points. A cascading failure can also happen after a chain reaction in a lower layer. Your system surely calls out to other enterprise systems; make sure you can stay up when they go down.

Scrutinize resource pools

A cascading failure often results from a resource pool, such as a connection pool, that gets exhausted when none of its calls return. The threads that get the connections block forever; all other threads get blocked waiting for connections. Safe resource pools always limit the time a thread can wait to check out a resource.
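As a rough illustration of a pool that limits how long a thread can wait to check out a resource, the sketch below uses a `BlockingQueue` with a timed `poll`. The class `BoundedWaitPool` and its `checkOut`/`checkIn` methods are hypothetical names, not an API from the book or from any real connection pool library.

```java
import java.util.Collection;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical pool, for illustration only.
final class BoundedWaitPool<T> {
    private final BlockingQueue<T> idle;

    BoundedWaitPool(Collection<T> resources) {
        // All resources start out idle and available for checkout.
        this.idle = new LinkedBlockingQueue<>(resources);
    }

    /**
     * Waits at most the given time for a free resource. Throws
     * TimeoutException instead of blocking forever when the pool is drained.
     */
    T checkOut(long timeout, TimeUnit unit)
            throws InterruptedException, TimeoutException {
        T resource = idle.poll(timeout, unit); // returns null on timeout
        if (resource == null) {
            throw new TimeoutException("no pooled resource became free in time");
        }
        return resource;
    }

    /** Returns a resource to the pool so other threads can use it. */
    void checkIn(T resource) {
        idle.offer(resource);
    }
}
```

With a bounded wait, a thread that cannot get a connection fails fast and can report an error, rather than joining a growing queue of blocked threads. The limit on checkout does not help the threads that already hold connections and are stuck in a call; those still need timeouts on the calls themselves.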

Defend with Timeouts and Circuit Breaker

A cascading failure happens after something else has already gone wrong. Circuit Breaker protects your system by avoiding calls out to the troubled integration point. Using Timeouts ensures that you can come back from a call out to the troubled one.
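The idea behind Circuit Breaker can be sketched in a few lines. This is a deliberately simplified illustration, not the implementation described in the book: the class `SimpleCircuitBreaker`, its constructor parameters, the injected clock, and the fallback-value behavior are all assumptions. After a number of consecutive failures it stops calling the integration point for a cool-down period, then lets calls through again as a trial.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified breaker, for illustration only.
final class SimpleCircuitBreaker {
    private final int failureThreshold;   // consecutive failures before opening
    private final long openMillis;        // cool-down while the breaker is open
    private final LongSupplier clock;     // time source in millis, injectable for tests
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Calls the integration point unless the breaker is open. */
    <T> T call(Supplier<T> integrationPoint, T fallback) {
        if (!allowRequest()) {
            return fallback; // fail fast without touching the troubled system
        }
        try {
            // The call runs outside the lock so slow calls don't serialize.
            T result = integrationPoint.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback;
        }
    }

    private synchronized boolean allowRequest() {
        // Closed: always allow. Open: allow trial calls once the cool-down has passed.
        return !open || clock.getAsLong() - openedAt >= openMillis;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = clock.getAsLong(); // (re)start the cool-down
        }
    }
}
```

A caller might construct it as `new SimpleCircuitBreaker(5, 30_000L, System::currentTimeMillis)` and route every call to the troubled system through `call`. A breaker like this only counts failures it can see, so it works best when the wrapped call also has a timeout; a call that never returns never registers as a failure.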

Mike's book is available in paperback and as a PDF from pragmaticprogrammer.com.