Why you need STONITH
A very common fallacy when setting up High-Availability clusters - be it on Pacemaker + corosync, Linux-HA, RedHat Cluster Suite, or else - is thinking that your setup, despite all the warnings in the documentation or in the logfiles, does not require node fencing.
What is node fencing?
Fencing is a mechanism by which the "surviving" nodes in the cluster make sure that the node(s) that have been evicted from the cluster are truly gone. This is also referred to as node isolation, or, in a very descriptive metaphor, STONITH ("Shoot the other node in the head"). This mechanism is not just "fire and forget", but the cluster software will wait for a positive confirmation from it before proceeding with resource recovery.
But it has already failed, otherwise it would not have been evicted, so why would this be necessary, you ask?
The key here is the distinction between appearances and reality: a complete loss of communication with a node looks to all other nodes as if the node has disappeared. Since you, like the obedient administrator that you are, have configured redundant network links, the chance for this to happen is really slim, right? But that is not the only possible cause. In fact, it might still be around, just waiting to come out of a kernel hang, or hiding behind firewall rules, to spew a bunch of corrupted data to your shared state.
In short, node fencing/isolation/STONITH ensures the integrity of your shared state by turning a mere, if justified, suspicion into confirmed reality.
(Pacemaker clusters also use this mechanism for escalated error recovery; if Pacemaker has instructed a node to release a service (by stopping it), but that operation fails, the service is essentially "stuck" on that node. The semantics of the "stop" operation mandate that it must not fail, so this indicates a more fundamental problem on that node. Hence, the default process then would be to stop all other resources on that node, move them elsewhere, and fence the node - rebooting it tends to be rather effective at stopping anything that might have been stuck. This can be disabled per-resource if you don't want some low-priority failure to shift high-priority resources around, though.)
This is all very technical. So let me tell you a story with several possible endings to illustrate.
Once upon a time, three friends were sitting huddled around a fire, peacefully eating their cookies. It was a tough time: the world was out to get them, a zombie infection was spreading, they couldn't trust anyone outside their trusted cluster of friends. They were always watchful and paid attention to each other.
Suddenly, one of the three stops responding to the conversation they were having. How do you proceed?
- My cluster of friends does not require such a crude mechanism! He'll be careful not to have been infected! If he stops responding, he will simply be dead! You ignore the problem, but then your former friend revives, spreads his infection to your cookie stack, starts clobbering you with a club to eat your brains, and his howl gives away your location to all his new friends, who come down on you with the intent of eating your brains.
- You use an unloaded gun to shoot your friend - the trigger responds reassuringly. Your former friends revives, and it is all about eating your brains again.
- You kindly tap your friend on the shoulder, and
suggest that he please commit suicide. Your former
friend revives, snaps at your tapping hand, and starts
eating your brains.
- You speak a pre-agreed upon code word, a tiny
goes off in the head of your friend, blows his brains
out, and he drops on the spot. The grue does not eat
you. (In fact, the mechanism monitoring his brain probably
has already blown him up, but you speak the code word anyway
to make sure.)
- You take that crude, trusty shotgun and blow his brains out, aiming away from the stack of cookies. The grue does not eat you.
In order, we have gone through the "I do not need STONITH or have disabled it", "I used the null mechanism intended only for testing", "I used an ssh-based mechanism", or the recommended "a poison-pill mechanism with hardware watchdog support" (such as external/sbd in Pacemaker environments) and the time-tested "talk to a network power switch, management board etc to cut the power" methods.
Pacemaker's escalated error recovery could be likened to your friend telling you that despite his best attempts, his wound has become infected (and he can't bring himself to cut off his hand); he bravely gives away his equipment to you, kneels down, says goodbye, and you blow his brains out.
Does that drive the point home? How would you like to survive armageddon? Of course, it is always possible that you have a secret liking for becoming a zombie, and crumbling (instead of eating) all your cookies.
In this case, talk to your two friends about appropriate therapy.