Older blog entries for lmb (starting at number 109)

18 Oct 2010 (updated 18 Oct 2010 at 13:07 UTC)
Linux Magazin article on Pacemaker, OCFS2, and DRBD

What follows is a critique of a German-language article; the post was originally written in German and is given here in English.

Naturally, I was very pleased to find a setup guide for these projects in issue 11/2010, based on openSUSE to boot; these are all topics and projects very close to my heart.

However, the article disappointed me greatly on technical grounds.

My points of criticism in detail:

  • The article configures an active/passive fail-over for a LAMP stack. In that case OCFS2, just like DRBD's active/active mode, is out of place; DRBD should likewise be run in an active/passive ("single primary") configuration.

  • If OCFS2 is used at all, it should in any case be started under the control of Pacemaker and Corosync, and not via init scripts and /etc/fstab. Otherwise, for example, full POSIX locking is not available; furthermore, the configuration of /etc/ocfs2/cluster.conf can be dropped, because this information is taken over automatically from Corosync.

    The same naturally applies to DRBD: this service, too, should be controlled, and thereby also monitored, by Pacemaker. Only then can the full functionality of all cluster components and their interplay be ensured.

  • The article also describes disabling the I/O fencing mechanism "STONITH" without discussing the consequences in any way. This can lead to data discrepancies.

  • I was utterly appalled by the "recommended" wrapper that is supposed to make LSB scripts "cluster-capable". Not only does the cluster stack naturally bring its own way of integrating LSB scripts (via the resource class "LSB"), but the referenced script is also fundamentally broken: it does not wait for the service to have actually started or stopped, it emits incorrect metadata, and the return values of the status and monitor operations are wrong.

    And then this defective wrapper script is even used for services for which the cluster stack naturally ships complete OCF resource agents, namely Apache and MySQL.

  • The cluster configuration could also be streamlined by using a resource group instead of three dependencies.

  • Not to mention that, if one system fails, the other would in fact not take over as the article promises, because the no-quorum-policy is not set.

  • The article also neglects the fundamentals of a redundant system: the absolutely essential redundant network connection is neither configured nor even recommended.

  • That brand names (openSUSE, SLE 11 HAE) are misspelled in places is then merely the icing on the cake.

I find it hard to criticize such an article constructively; dear author, dear editor: this will not do!

On selecting good timeouts

Timeouts are a common design choice or implementation detail in any computer system, but they are particularly popular in High-Availability clusters (such as those built with the SUSE Linux High-Availability Extension and other stacks that are similarly based on corosync and pacemaker).

They seem a straightforward way to detect faults: if a task does not complete within N seconds, it is considered failed, and recovery is attempted. (The task could be anything from a network messaging protocol exchange to a database starting under the cluster's control, any IO, and a number of other cases.)

However, selecting a good value for the timeout is less straightforward than it may seem; more often than not, the chosen values are much too short. This seems to stem from the belief that a fast response to failures is unconditionally a good thing: the system will perform better if timeouts are shorter. This is not quite true, though.

To illustrate, assume two scenarios:

  1. First, that the system has failed in such a way that it will not respond with a failed response to a monitor task immediately, but instead runs indefinitely unless aborted by the timeout.
  2. Second, that the system is operating fine, but experiencing a brief period of stress, where responses are delayed, just to the edge of the timeout value.

Now, let us explore the impact of a timeout that is one second "too long"; and then, one that is one second "too short".

For a too long timeout, the failure in the first scenario is detected one second later, adding one second to the recovery time. In the second scenario, no timeout occurs, and the system continues as normal.

For the too short timeout, the first scenario is recovered one second faster; the second scenario causes an unnecessary recovery, probably incurring a real service outage in the attempt to restart the application, or at least a brief period without service!
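The asymmetry can be made concrete with a small back-of-the-envelope model. The sketch below is purely illustrative: the function name, the 60-second recovery cost, and the 30-second stressed response are assumptions chosen for the example, not measurements from any real cluster.

```python
def downtime(timeout, hung, response_time=None, restart_outage=60.0):
    """Return seconds of service outage caused by one monitor operation.

    hung=True models scenario 1: the monitor never returns, so the failure
    is only noticed when the timeout expires, after which recovery runs.
    hung=False models scenario 2: the service is healthy but slow and
    answers after response_time seconds.
    restart_outage is an assumed cost of one recovery (restart or fail-over).
    """
    if hung:
        # Real failure: detection delay plus the unavoidable recovery.
        return timeout + restart_outage
    if response_time > timeout:
        # False positive: a healthy service is restarted needlessly.
        return restart_outage
    return 0.0


stressed = 30.0  # assumed response time of a healthy but stressed service

for label, timeout in (("too long", 31.0), ("too short", 29.0)):
    print(label,
          "| real failure:", downtime(timeout, hung=True),
          "| stress only:", downtime(timeout, hung=False, response_time=stressed))
```

With these assumed numbers, the shorter timeout recovers a real failure two seconds sooner than the longer one, but turns a brief period of stress into a full recovery outage.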

Another problem arises from how timeouts are often chosen; of course, if they were obviously too short, administrators would immediately notice, since their system would never get off the ground at all, but immediately start spewing errors. Instead, the timeouts are usually adequate for the tested scenario (note that you can use the pacemaker monitoring tools to look at the actual runtime of operations); if your test load exceeds the load of your live system, raise your hand - more often than not, it does not.

Under stress or peak load, the system response tends to degrade exponentially; it will not just slow down by ten percent, but by thirty. If this scenario gets treated as a failure, the likelihood that the fail-over system will experience the same level of stress is high; worse, requests may have queued up, and if (due to the stress, remember) the system did not shut down cleanly, an application-internal recovery phase will compound the effect.

Monitoring application performance for load-distribution is quite a different task from monitoring application correctness. The former is important, and a performance degradation may also imply violation of service level agreements; however, initiating recovery through restart is unlikely to alleviate the problem. (In a pacemaker cluster, this would best be monitored externally and fed into the utilization constraints of the resources and nodes.)

In summary, a too short timeout is the worse choice; it is safer to make hard timeouts large enough beyond reasonable doubt. Yes, this will slow down fail-over and recovery slightly, but at least it will not cause them by mistake.

(For a rather excellent and exhaustive treatment of this subject matter, see K. Wolter, “Stochastic Models for Fault Tolerance: Restart, Rejuvenation and Checkpointing,” Habilitation Thesis, Humboldt-University, 2007.)

19 Jul 2010 (updated 19 Jul 2010 at 22:26 UTC)

It has been a while since I took the chance to blog here; the time has been pretty packed with shipping SUSE Linux Enterprise 11 Service-Pack 1's High-Availability Extension (or SLE HA 11 SP1 for short ;-), and supporting the first deployments.

It is a good time to look back and review the very awesome new features that the community developed along with us, and that we are shipping as Enterprise-ready now.

A feature that I am personally very impressed by is the OCFS2 reflink feature; basically, OCFS2 cracked the hard nut of cluster-wide copy-on-write snapshots, which LVM2 has been trying to crack for years. This allows space-efficient and very fast provisioning of new VMs, snapshots for backup, cloning from templates, cloning from clones, etcetera; it really is amazing.

For those of you who prefer a visual, the team from NGN taped a video with me being interviewed by Sander at Novell's BrainShare in Amsterdam; this is my first video interview ever!

In case you would like an audio-only review, Ron and Terry interviewed me for Novell Open Audio as well.

I hope you find them informative - if so, please spread them, and let me know your feedback.

My colleague Tim has drawn awesome cartoons to illustrate my last cluster zombie story on why you need STONITH (node fencing). Clusters and the undead, I spot an upcoming theme for my stories ...

30 Mar 2010 (updated 30 Mar 2010 at 11:04 UTC)

Why you need STONITH

A very common fallacy when setting up High-Availability clusters - be it on Pacemaker + corosync, Linux-HA, Red Hat Cluster Suite, or something else - is thinking that your setup, despite all the warnings in the documentation or in the logfiles, does not require node fencing.

What is node fencing?

Fencing is a mechanism by which the "surviving" nodes in the cluster make sure that the node(s) that have been evicted from the cluster are truly gone. This is also referred to as node isolation, or, in a very descriptive metaphor, STONITH ("Shoot the other node in the head"). This mechanism is not just "fire and forget", but the cluster software will wait for a positive confirmation from it before proceeding with resource recovery.

But it has already failed, otherwise it would not have been evicted, so why would this be necessary, you ask?

The key here is the distinction between appearances and reality: a complete loss of communication with a node looks to all other nodes as if the node has disappeared. Since you, like the obedient administrator that you are, have configured redundant network links, the chance for this to happen is really slim, right? But that is not the only possible cause. In fact, the node might still be around, just waiting to come out of a kernel hang, or hiding behind firewall rules, to spew a bunch of corrupted data to your shared state.

In short, node fencing/isolation/STONITH ensures the integrity of your shared state by turning a mere, if justified, suspicion into confirmed reality.
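The "wait for positive confirmation" rule can be sketched as a toy model. Everything below is hypothetical and for illustration only (the class, the three fencing functions, and the status strings are invented for this example and are not Pacemaker's API); it merely shows why a confirmation is only worth something if the mechanism behind it really acts.

```python
class Node:
    """Toy model of a cluster node that may merely *look* dead."""

    def __init__(self, name):
        self.name = name
        self.alive = True  # e.g. stuck in a kernel hang, not truly gone


def fence_unreachable(node):
    """Fencing device cannot be reached: no confirmation is given."""
    return False


def fence_null(node):
    """Agent that claims success without doing anything."""
    return True


def fence_power_off(node):
    """Cuts the power (modelled here by flipping a flag) and confirms."""
    node.alive = False
    return True


def recover(node, fence):
    """Start recovery only after fencing reports success."""
    if not fence(node):
        return "blocked: fencing not confirmed"
    if node.alive:
        # The confirmation was hollow: two nodes may now write shared state.
        return "DANGER: recovered while node still alive"
    return "safe: recovered after confirmed fencing"


for fence in (fence_unreachable, fence_null, fence_power_off):
    print(fence.__name__, "->", recover(Node("node2"), fence))
```

In this model, blocking is the correct outcome when no confirmation arrives, and only a mechanism that truly removes the node makes recovery safe.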

(Pacemaker clusters also use this mechanism for escalated error recovery; if Pacemaker has instructed a node to release a service (by stopping it), but that operation fails, the service is essentially "stuck" on that node. The semantics of the "stop" operation mandate that it must not fail, so this indicates a more fundamental problem on that node. Hence, the default process then would be to stop all other resources on that node, move them elsewhere, and fence the node - rebooting it tends to be rather effective at stopping anything that might have been stuck. This can be disabled per-resource if you don't want some low-priority failure to shift high-priority resources around, though.)

This is all very technical. So let me tell you a story with several possible endings to illustrate.

Story time!

Once upon a time, three friends were sitting huddled around a fire, peacefully eating their cookies. It was a tough time: the world was out to get them, a zombie infection was spreading, they couldn't trust anyone outside their trusted cluster of friends. They were always watchful and paid attention to each other.

Suddenly, one of the three stops responding to the conversation they were having. How do you proceed?

  1. My cluster of friends does not require such a crude mechanism! He'll be careful not to have been infected! If he stops responding, he will simply be dead! You ignore the problem, but then your former friend revives, spreads his infection to your cookie stack, starts clobbering you with a club to eat your brains, and his howl gives away your location to all his new friends, who come down on you with the intent of eating your brains.
  2. You use an unloaded gun to shoot your friend - the trigger responds reassuringly. Your former friend revives, and it is all about eating your brains again.
  3. You kindly tap your friend on the shoulder, and suggest that he please commit suicide. Your former friend revives, snaps at your tapping hand, and starts eating your brains.

  4. You speak a pre-agreed upon code word, a tiny bomb goes off in the head of your friend, blows his brains out, and he drops on the spot. The grue does not eat you. (In fact, the mechanism monitoring his brain probably has already blown him up, but you speak the code word anyway to make sure.)

  5. You take that crude, trusty shotgun and blow his brains out, aiming away from the stack of cookies. The grue does not eat you.

So what?

In order, we have gone through "I do not need STONITH or have disabled it", "I used the null mechanism intended only for testing", "I used an ssh-based mechanism", the recommended "poison-pill mechanism with hardware watchdog support" (such as external/sbd in Pacemaker environments), and the time-tested "talk to a network power switch, management board, etc. to cut the power" method.

Pacemaker's escalated error recovery could be likened to your friend telling you that despite his best attempts, his wound has become infected (and he can't bring himself to cut off his hand); he bravely gives away his equipment to you, kneels down, says goodbye, and you blow his brains out.

Does that drive the point home? How would you like to survive armageddon? Of course, it is always possible that you have a secret liking for becoming a zombie, and crumbling (instead of eating) all your cookies.

In this case, talk to your two friends about appropriate therapy.

29 Oct 2009 (updated 29 Oct 2009 at 11:19 UTC)

Again a tip on how to write your OpenAIS/Pacemaker configuration in a simpler fashion; this applies to SUSE Linux Enterprise 11 High-Availability Extension too, of course.

For the full cluster functionality with OpenAIS/OCFS2/cLVM2 and an OCFS2 mount on top, you need to configure DLM, O2CB, cLVM2 clones, one to start the LVM2 volume group, and Filesystem resources to mount the file system. Add in all the dependencies needed, and you end up with a configuration pretty much like this (shown in CRM shell syntax, which is already much more concise than the raw XML):


primitive clvm ocf:lvm2:clvmd
primitive dlm ocf:pacemaker:controld
primitive o2cb ocf:ocfs2:o2cb
primitive ocfs2-2 ocf:heartbeat:Filesystem \
        params device="/dev/cluster-vg/ocfs2" \
        directory="/ocfs2-2" fstype="ocfs2"
primitive vg1 ocf:heartbeat:LVM \
        params volgrpname="cluster-vg"
clone c-ocfs2-2 ocfs2-2 \
        meta target-role="Started" interleave="true"
clone clvm-clone clvm \
        meta target-role="Started" interleave="true" ordered="true"
clone dlm-clone dlm \
        meta interleave="true" ordered="true" target-role="Stopped"
clone o2cb-clone o2cb \
        meta target-role="Started" interleave="true" ordered="true"
clone vg1-clone vg1 \
        meta target-role="Started" interleave="true" ordered="true"
colocation colo-clvm inf: clvm-clone dlm-clone
colocation colo-o2cb inf: o2cb-clone dlm-clone
colocation colo-ocfs2-2 inf: c-ocfs2-2 o2cb-clone
colocation colo-ocfs2-2-vg1 inf: c-ocfs2-2 vg1-clone
colocation colo-vg1 inf: vg1-clone clvm-clone
order order-clvm inf: dlm-clone clvm-clone
order order-o2cb inf: dlm-clone o2cb-clone
order order-ocfs2-2 inf: o2cb-clone c-ocfs2-2
order order-ocfs2-2-vg1 inf: vg1-clone c-ocfs2-2
order order-vg1 inf: clvm-clone vg1-clone

That's quite a mouthful, and it becomes cumbersome for every file system you add.

However, there is a little known feature - you can actually clone a resource group:


primitive clvm ocf:lvm2:clvmd
primitive dlm ocf:pacemaker:controld
primitive o2cb ocf:ocfs2:o2cb
primitive ocfs2-2 ocf:heartbeat:Filesystem \
        params device="/dev/cluster-vg/ocfs2" \
        directory="/ocfs2-2" fstype="ocfs2"
primitive vg1 ocf:heartbeat:LVM \
        params volgrpname="cluster-vg"
group base-group dlm o2cb clvm vg1 ocfs2-2
clone base-clone base-group \
        meta interleave="true"

I think this speaks for itself; roughly 20 lines of configuration are gone. You will also find that crm_mon output is much simpler and shorter, allowing you to see more of the cluster status in one go.
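To see why the group can stand in for the explicit constraints, it helps to spell out what a group implies: its members are started in the listed order and kept on the same node. The following sketch expands a member list into the equivalent pairwise constraints, written in a crm-like notation; the function name and the generated constraint ids are made up for illustration.

```python
def expand_group(members):
    """Return the pairwise constraints a group implies for its members.

    A group starts its members in the listed order and keeps them on the
    same node, so each member depends on the one listed before it.
    """
    lines = []
    for first, then in zip(members, members[1:]):
        lines.append(f"order {first}-before-{then} inf: {first} {then}")
        lines.append(f"colocation {then}-with-{first} inf: {then} {first}")
    return lines


for line in expand_group(["dlm", "o2cb", "clvm", "vg1", "ocfs2-2"]):
    print(line)
```

Note that the chain is strictly linear, so it is slightly stricter than the hand-written version: o2cb is now also started before clvm, although the explicit constraints only made both depend on dlm. All of the original orderings still hold transitively.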

Today I'd like to briefly introduce a new safety feature in Pacemaker.

Many times, we have seen customers and users complain that they thought they had correctly set up their cluster, but then resources were not started elsewhere when they killed one of the nodes. With OCFS2 or clvmd, they would even see access to the filesystem on the surviving nodes block, and processes, including kernel threads, end up in the dreaded "D" state! Surely this must be a bug in the cluster software.

Usually, it turns out that these scenarios escalated fairly quickly, because customers tend to test recovery scenarios only shortly before they want to deploy, or find out after they have already deployed to production. Not a good time for clear thinking.

However, most of these scenarios have a common misconfiguration: no fencing defined. Now, fencing is essential to data integrity, in particular with OCFS2, so the cluster refuses to proceed until fencing has completed; the blocking behaviour is actually correct. The system would warn about this at "ERROR" priority in several places.

Yet it became clear that something needed to be done; people do not like to read their logfiles, it seems. Inspired by a report by Jo de Baer, I thought it would be more convenient if the resources did not even start in the first place if such a gross misconfiguration was detected, and Andrew agreed.

The resulting patch is very short, but effective. Such misconfigurations now fail early, without causing the impression that the cluster might actually be working.
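The idea of failing early can be illustrated with a simplified sanity check. This is only a sketch of the principle, not the actual patch: the function name, the resource dictionaries, and the message text are assumptions made for the example.

```python
def may_start_resources(stonith_enabled, resources):
    """Return (allowed, message) for a simplified start-up sanity check.

    resources is a list of dicts with at least a "class" key; a resource
    of class "stonith" stands for a configured fencing device.
    """
    has_fencing = any(r["class"] == "stonith" for r in resources)
    if stonith_enabled and not has_fencing:
        # Refuse up front instead of blocking later, during a real failure.
        return False, "ERROR: fencing is enabled but no fencing resource is defined"
    return True, "ok"


config = [{"id": "ocfs2-2", "class": "ocf"}]
print(may_start_resources(True, config))

config.append({"id": "fence-1", "class": "stonith"})
print(may_start_resources(True, config))
```

The check only establishes that some fencing resource exists; like the real feature, it says nothing about whether that fencing actually works.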

This certainly does not prevent all errors; it can't directly detect whether fencing is configured properly and actually works, which is too much for a poor policy engine to decide. But we can try to protect some administrators from themselves.

(As time progresses, we will perhaps pick more such low-hanging fruit to make the cluster "more obvious" to configure. But still, I would hope that going forward, more administrators would at least try to read and understand the logs - as you can see from the patch, the message was already very clear before, and "ERROR:" messages definitely should catch any administrator's attention.)

It is with the greatest pleasure that I am able to announce that Novell has just posted the documentation for setting up OpenAIS, Pacemaker, OCFS2, cLVM2, DRBD, based on SUSE Linux Enterprise High-Availability 11 - but equally applicable to other users of this software stack.

We understand it is a work in progress; the up-to-date DocBook sources will also be made available under the LGPL in a Mercurial repository in the very near future, and we hope to turn this into a community project as well, providing the most complete documentation coverage for clustering on Linux one day!

  • So our new test cluster environment is a 16 node HP blade center, which pleases me quite a bit. The blades all have a hardware watchdog card, which of course makes perfect sense for a cluster to use.
  • However, the attempt to set the timeout to 5s was thwarted by the kernel message
    hpwdt: New value passed in is invalid: 5 seconds.
  • So I dived into hpwdt.c, to find:
    static int hpwdt_change_timer(int new_margin)
    {
            /* Arbitrary, can't find the card's limits */
            if (new_margin < 30 || new_margin > 600) {
                    printk(KERN_WARNING "hpwdt: New value passed in is invalid: %d seconds.\n", new_margin);
                    return -EINVAL;
            }
  • Okay, that can happen. Sometimes driver writers have to make guesses when the vendor is not cooperative or unavailable. So who wrote the driver?
    * (c) Copyright 2007 Hewlett-Packard Development Company, L.P.
  • ...

I prefer to ignore Christmas and the madness they call holidays, but would like to close the year with a series of three questions, starting today:

  1. What can Open Source (and/or Linux) contribute to making the world a better place? Think of developing nations and the real large issues, as well as the slightly smaller ones.

Please feel free to e-mail me your answers to lmb at suse dot de, but this is not required to follow this experiment.
