Why you need STONITH
A very common fallacy when setting up
High-Availability
clusters - be it on Pacemaker + Corosync, Linux-HA, Red Hat
Cluster Suite, or anything else - is thinking that your setup,
despite all the warnings in the documentation or in the
logfiles, does not require node fencing.
What is node fencing?
Fencing is a mechanism by which the
"surviving" nodes in the cluster make sure that the node(s)
that have been evicted from the cluster are truly gone. This
is also referred to as node isolation, or, in a very
descriptive metaphor, STONITH ("Shoot the other node in the
head"). This mechanism is not just "fire and forget": the
cluster software waits for a positive confirmation that
fencing succeeded before proceeding with resource recovery.
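To make the "wait for confirmation" part concrete, here is a minimal sketch in Python. It is not how Pacemaker or any other cluster stack is actually implemented; the function name recover_resources and the fence callback are made up for this illustration. The only point is that recovery is gated on a positive answer from the fencing mechanism.

```python
from typing import Callable, List


def recover_resources(failed_node: str,
                      fence: Callable[[str], bool],
                      resources: List[str]) -> List[str]:
    """Return the resources that may safely be started elsewhere.

    `fence` is a hypothetical callback that returns True only when the
    fencing device has positively confirmed that `failed_node` is gone.
    """
    confirmed = fence(failed_node)
    if not confirmed:
        # No confirmation: the node may still be alive and writing to
        # shared state, so nothing is recovered.
        return []
    # Confirmed dead: it is now safe to take over its resources.
    return list(resources)
```

With a fence callback that reports failure, the function hands back an empty list, which is exactly the situation where a careful cluster keeps waiting instead of starting the service a second time.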
But the node has already failed, otherwise it would not
have been evicted, so why would this be necessary, you ask?
The key here is the distinction between
appearances and reality: a complete loss of
communication with a node looks to all other nodes as if the
node has disappeared. Since you, like the obedient
administrator that you are, have configured redundant
network links, the chance for this to happen is really slim,
right? But a network failure is not the only possible cause.
In fact, the node might still be around, just waiting to come
out of a kernel hang, or hiding behind firewall rules, to spew
a bunch of corrupted data to your shared state.
In short, node fencing/isolation/STONITH ensures the
integrity of your shared state by turning a mere, if
justified, suspicion into
confirmed reality.
(Pacemaker clusters also use this mechanism for escalated
error recovery; if Pacemaker has instructed a node to
release a service (by stopping it), but that operation
fails, the service is essentially "stuck" on that node. The
semantics of the "stop" operation mandate that it must not
fail, so this indicates a more fundamental problem on that
node. Hence, the default process then would be to stop all
other resources on that node, move them elsewhere, and fence
the node - rebooting it tends to be rather effective at
stopping anything that might have been stuck. This can be
disabled per-resource if you don't want some low-priority
failure to shift high-priority resources around, though.)
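The escalation described above can be sketched as a small decision function. This is an illustration under stated assumptions, not Pacemaker's actual code: the Resource class, the fence_on_stop_failure switch and the action strings are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resource:
    name: str
    # Hypothetical per-resource switch: escalate a failed stop to fencing?
    fence_on_stop_failure: bool = True


def actions_after_failed_stop(stuck: Resource,
                              others_on_node: List[Resource],
                              node: str) -> List[str]:
    """List the recovery steps taken after `stuck` failed to stop on `node`."""
    if not stuck.fence_on_stop_failure:
        # Escalation disabled for this resource: leave it where it is
        # and do not disturb the other resources on the node.
        return [f"leave {stuck.name} blocked on {node}"]
    # Default path: move everything else away, then fence the node.
    actions = [f"move {r.name} away from {node}" for r in others_on_node]
    actions.append(f"fence {node}")
    return actions
```

The trade-off is visible in the two branches: the default path guarantees the stuck service is really gone, at the price of shifting every other resource on that node.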
This is all very technical. So let me tell you a story
with several possible endings to illustrate.
Story time!
Once upon a time, three friends were sitting
huddled around a fire, peacefully eating their cookies. It
was a tough time: the world was out to get them, a zombie
infection was spreading, they couldn't trust anyone outside
their trusted cluster of friends. They were always watchful
and paid attention to each other.
Suddenly, one of the three stops responding to the
conversation they were having. How do you proceed?
- My cluster of friends does not require such a crude
mechanism! He'll be careful not to have been infected! If he
stops responding, he will simply be dead! You ignore the
problem, but then your former friend revives, spreads his
infection to your cookie stack, starts clobbering you with a
club to eat your brains, and his howl gives away your
location to all his new friends, who come down on you with
the intent of eating your brains.
- You use an unloaded gun to shoot your friend - the
trigger responds reassuringly. Your former friend
revives, and it is all about eating your brains
again.
- You kindly tap your friend on the shoulder, and
suggest that he please commit suicide. Your former
friend revives, snaps at your tapping hand, and starts
eating your brains.
- You speak a pre-agreed code word, a tiny bomb
goes off in the head of your friend, blows his brains
out, and he drops on the spot. The grue does not eat
you. (In fact, the mechanism monitoring his brain probably
has already blown him up, but you speak the code word anyway
to make sure.)
- You take that crude, trusty shotgun and blow
his brains out, aiming away from the stack of
cookies. The grue does not eat you.
So what?
In order, we have gone through "I do not need
STONITH or have disabled it", "I used the null
mechanism intended only for testing", "I used an
ssh-based mechanism", the recommended "a
poison-pill mechanism with hardware watchdog support" (such
as external/sbd in Pacemaker environments), and the
time-tested "talk to a network power switch, management
board, etc. to cut the power" methods.
Pacemaker's escalated error recovery could be likened to
your friend telling you that despite his best attempts,
his wound has become infected (and he can't bring himself to
cut off his hand); he bravely gives away his
equipment to you, kneels down, says goodbye, and you blow
his brains out.
Does that drive the point home? How would you like to
survive Armageddon? Of course, it is always possible that
you have a secret liking for becoming a zombie, and
crumbling (instead of eating) all your cookies.
In this case, talk to your two friends about
appropriate therapy.