Cascading Failures
Well, I promised not to discuss my life in here, but since I'm about to use
it as an example of generalized system failure modes, I figure it's okay.
The goal: from Waterloo on Thursday morning, travel to Montreal by Friday
evening at 8 pm. By car, this is a 7-hour drive. No problem, right?
Well, we were going to rent a car and drive on Thursday afternoon, but it
started snowing/raining/sleeting, so we decided not to drive after all;
instead, let's take the train, which is safer in bad weather. Since the
night train leaves you kind of tired, we decided to take the Friday morning
(9:30 am) train from Toronto.
Friday morning, the weather still sucked, but that's okay. We called a taxi
at 6:20 am to take us to the bus station in Waterloo. At 7 am, it finally
arrived - delayed by bad weather, of course. So we missed the 7 am bus to
Toronto. No problem, there's an 8 am bus that should still make it to
Toronto in time. Unfortunately, the 8 am bus showed up at 8:30 (bad
weather), departed shortly afterwards, and got to Toronto at 10:00 (extra
late - bad weather). No problem, though; we rescheduled our train tickets
from the 9:30 to the 11:30 train. We changed the reservation by cell phone
from the bus, luckily, because by the time we arrived all the trains for the
day were fully booked. Turns out all the airports were closed (bad weather)
and the people taking flights had all switched to the train.
As we were picking up the tickets, they made an announcement that the 11:30
train would be leaving at 12:30 instead - bad weather. No problem: the 11:30
is supposed to get to Montreal at 4:45, so an hour later is 5:45, and even
with additional weather delays I should *certainly* be in Montreal by 8 pm.
So, we have some time, let's go for lunch.
At 12:20, we came back and found out that the train had left at 12:07,
having been re-re-scheduled while we were gone. In fact, they had made the
new announcement before we left the station, but because of a ridiculously
loud random music performance (something about the Juno awards) in the
middle of the station at the time, all the public announcements were
inaudible.
Feeling guilty, they asked us to wait while they figured out what they'd do
to get us to Montreal. The result: at 1 pm or so, we found out that they
could squeeze us on the 3:30 train (arrives around 9:30; useless) or a
special 2:30 shuttle bus (could arrive at 8:30 in *good* weather; useless).
So Via Rail wasn't going to be able to help.
Last chance: rent a car after all (there's a rental place at the train
station) and drive it to Montreal. That takes at least 6 hours in good
weather. By 1:30 we had almost finished filling out the rental forms,
meaning that we *could* be in Montreal by 7:30 on a good day. Sadly, it
wasn't a good day. (Interestingly, if we had known at 11:30 that we would
miss the train, the rental would have saved us.)
I mentioned above that the airports were closed too (bad weather).
The Moral of the Story
Despite a metric tonne of backup plans (an extra day; an extra bus; an extra
train; backup train should still arrive early; could rent a car if the train
was cancelled) and slippage, we *still* didn't get to Montreal on time.
In management, we call this "slippage." In clustering, we call this
"cascading failures."
The lesson to learn here is that if you're going to add redundancy (like the
extra buses, trains, time, etc.) you'd best make sure that the same root cause
can't screw up *all* of your backup plans at the same time. That means
don't put a five-station Oracle database cluster on the same power circuit,
don't write software that shuts down and expects the cluster to take over if
it gets confused (because what if *all* the nodes get confused by the same
thing?), and don't plug all your backup servers into the same Internet
connection. For that matter, don't store them all in the same nuclear bunker
in the Swiss Alps. If exactly the wrong thing happens, you'll be in
trouble.
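
To put rough numbers on that, here's a minimal sketch of the arithmetic. The
probabilities are invented for illustration (they aren't measurements from
the trip or from any real cluster), and the function names are mine. It
assumes a very simple model: either one shared root cause knocks out every
backup at once, or each backup fails on its own, independently of the others.

```python
# Illustrative sketch only: the probabilities below are made-up numbers,
# not measurements from the trip or from any real cluster.


def p_all_fail_independent(p_each: float, n: int) -> float:
    """Chance that all n backups fail when each one fails independently."""
    return p_each ** n


def p_all_fail_common_cause(p_common: float, p_each: float, n: int) -> float:
    """Chance that all n backups fail when a shared root cause (probability
    p_common) takes out every backup at once, and otherwise each backup
    fails independently with probability p_each."""
    return p_common + (1.0 - p_common) * p_each ** n


if __name__ == "__main__":
    n = 5            # five backup plans (or five cluster nodes)
    p_each = 0.10    # assumed: each one fails 10% of the time on its own
    p_common = 0.05  # assumed: 5% chance of "bad weather" hitting all of them

    print(f"independent:  {p_all_fail_independent(p_each, n):.6f}")
    print(f"common cause: {p_all_fail_common_cause(p_common, p_each, n):.6f}")
```

With those made-up inputs, five truly independent backups all fail about one
time in 100,000, while the same five backups sharing a 5% root cause all fail
a bit more than 5% of the time. Adding a sixth or a tenth backup barely moves
the second number, because the shared cause sets a floor that no amount of
extra redundancy gets under.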