Older blog entries for lmb (starting at number 104)

29 Oct 2009 (updated 29 Oct 2009 at 11:19 UTC) »

Once again, a tip on how to write your OpenAIS/Pacemaker configuration in a simpler fashion; this applies to the SUSE Linux Enterprise 11 High-Availability Extension too, of course.

For the full cluster functionality with OpenAIS/OCFS2/cLVM2 and an OCFS2 mount on top, you need to configure clones for DLM, O2CB, and cLVM2, an LVM resource to start the LVM2 volume group, and Filesystem resources to mount the file system. Add in all the dependencies needed, and you end up with a configuration pretty much like this (shown in CRM shell syntax, which is already much more concise than the raw XML):


primitive clvm ocf:lvm2:clvmd
primitive dlm ocf:pacemaker:controld
primitive o2cb ocf:ocfs2:o2cb
primitive ocfs2-2 ocf:heartbeat:Filesystem \
        params device="/dev/cluster-vg/ocfs2" \
        directory="/ocfs2-2" fstype="ocfs2"
primitive vg1 ocf:heartbeat:LVM \
        params volgrpname="cluster-vg"
clone c-ocfs2-2 ocfs2-2 \
        meta target-role="Started" interleave="true"
clone clvm-clone clvm \
        meta target-role="Started" interleave="true" ordered="true"
clone dlm-clone dlm \
        meta interleave="true" ordered="true" target-role="Stopped"
clone o2cb-clone o2cb \
        meta target-role="Started" interleave="true" ordered="true"
clone vg1-clone vg1 \
        meta target-role="Started" interleave="true" ordered="true"
colocation colo-clvm inf: clvm-clone dlm-clone
colocation colo-o2cb inf: o2cb-clone dlm-clone
colocation colo-ocfs2-2 inf: c-ocfs2-2 o2cb-clone
colocation colo-ocfs2-2-vg1 inf: c-ocfs2-2 vg1-clone
colocation colo-vg1 inf: vg1-clone clvm-clone
order order-clvm inf: dlm-clone clvm-clone
order order-o2cb inf: dlm-clone o2cb-clone
order order-ocfs2-2 inf: o2cb-clone c-ocfs2-2
order order-ocfs2-2-vg1 inf: vg1-clone c-ocfs2-2
order order-vg1 inf: clvm-clone vg1-clone
That's quite a mouthful, and it becomes more cumbersome with every file system you add.

However, there is a little-known feature - you can actually clone a resource group:


primitive clvm ocf:lvm2:clvmd
primitive dlm ocf:pacemaker:controld
primitive o2cb ocf:ocfs2:o2cb
primitive ocfs2-2 ocf:heartbeat:Filesystem \
        params device="/dev/cluster-vg/ocfs2" \
        directory="/ocfs2-2" fstype="ocfs2"
primitive vg1 ocf:heartbeat:LVM \
        params volgrpname="cluster-vg"
group base-group dlm o2cb clvm vg1 ocfs2-2
clone base-clone base-group \
        meta interleave="true"
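The payoff grows with each additional mount: to add another OCFS2 file system, you only define the new Filesystem primitive and redefine the group with the extra member appended. A sketch (the resource name, device, and mount point below are hypothetical):

```
primitive ocfs2-3 ocf:heartbeat:Filesystem \
        params device="/dev/cluster-vg/ocfs2-3" \
        directory="/ocfs2-3" fstype="ocfs2"
group base-group dlm o2cb clvm vg1 ocfs2-2 ocfs2-3
```

Two lines per file system, and all ordering and colocation falls out of the group semantics for free.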

I think this speaks for itself; the configuration shrinks by some 20 lines. You will also find that the crm_mon output is much simpler and shorter, allowing you to see more of the cluster status in one go.

Today I'd like to briefly introduce a new safety feature in Pacemaker.

Many times, we have seen customers and users complain that they thought they had correctly set up their cluster, but resources were not started elsewhere when they killed one of the nodes. With OCFS2 or clvmd, they would even see access to the file system block on the surviving nodes, with processes, including kernel threads, ending up in the dreaded "D" state! Surely this must be a bug in the cluster software.

Usually, these scenarios escalate fairly quickly, because customers tend to test recovery only shortly before they want to deploy, or discover the problem after they have already deployed to production. Not a good time for clear thinking.

However, most of these scenarios share a common misconfiguration: no fencing defined. Now, fencing is essential to data integrity, in particular with OCFS2, so the cluster refuses to proceed until fencing has completed; the blocking behaviour is actually correct. The system warns about this at "ERROR" priority in several places.
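For reference, a minimal fencing setup in CRM shell syntax might look like the following sketch; the SBD device path is a placeholder and must point at your actual shared storage:

```
primitive stonith-sbd stonith:external/sbd \
        params sbd_device="/dev/disk/by-id/my-shared-disk-part1"
property stonith-enabled="true"
```

With a fencing resource defined and stonith enabled, the cluster can actually complete recovery instead of blocking forever.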

Yet it became clear that something needed to be done; people do not like to read their logfiles, it seems. Inspired by a report by Jo de Baer, I thought it would be more convenient if the resources did not even start in the first place if such a gross misconfiguration was detected, and Andrew agreed.

The resulting patch is very short, but effective. Such misconfigurations now fail early, without causing the impression that the cluster might actually be working.

This certainly does not prevent all errors; it cannot directly detect whether fencing is configured properly and actually works, which is too much for a poor policy engine to decide. But we can try to protect some administrators from themselves.

(As time progresses, we will perhaps add more such low-hanging fruit to make the cluster "more obvious" to configure. Still, I would hope that, going forward, more administrators at least try to read and understand the logs - as you can see from the patch, the message was already very clear before, and "ERROR:" messages definitely should catch any administrator's attention.)

It is with the greatest pleasure that I am able to announce that Novell has just posted the documentation for setting up OpenAIS, Pacemaker, OCFS2, cLVM2, and DRBD, based on SUSE Linux Enterprise High-Availability 11 - but equally applicable to other users of this software stack.

It is a work in progress; the up-to-date DocBook sources will also be made available under the LGPL in the very near future in a Mercurial repository, and we hope to turn this into a community project, one day providing the most complete documentation coverage for clustering on Linux!

  • So our new test cluster environment is a 16-node HP blade center, which pleases me quite a bit. The blades all have a hardware watchdog card, which of course makes perfect sense for a cluster to use.
  • However, the attempt to set the timeout to 5s was thwarted by the kernel message
    hpwdt: New value passed in is invalid: 5 seconds.
  • So I dived into hpwdt.c, to find:
    static int hpwdt_change_timer(int new_margin)
    {
            /* Arbitrary, can't find the card's limits */
            if (new_margin < 30 || new_margin > 600) {
                    printk(KERN_WARNING
                           "hpwdt: New value passed in is invalid: %d seconds.\n",
                           new_margin);
                    return -EINVAL;
            }
  • Okay, that can happen. Sometimes driver writers have to make guesses when the vendor is not cooperative or unavailable. So who wrote the driver?
    * (c) Copyright 2007 Hewlett-Packard Development Company, L.P.
  • ...

I prefer to ignore christmas and the madness they call holidays, but would like to close the year with a series of three questions, starting today:

  1. What can Open Source (and/or Linux) contribute to making the world a better place? Think of developing nations and the real large issues, as well as the slightly smaller ones.

Please feel free to e-mail me your answers to lmb at suse dot de, but this is not required to follow this experiment.

15 Oct 2008 (updated 15 Oct 2008 at 13:28 UTC) »
  • An article by heise open covers the Linux Kongress, and also my presentation on the convergence of cluster stacks, even though they represent my message as slightly more tentative than I intended it to be. But maybe I am too optimistic. For what it is worth, here is a picture of the slide where I outlined the components in the joint stack, which heise open calls a "good mix from all sources."
  • It is quite important to note that this is my understanding of the results and goals; even though I believe we had good buy-in from the development community, it should not be understood as a promise or commitment (or lack thereof) by Red Hat or Novell or anyone else to deliver this in the Enterprise distributions in particular, nor that there will be any loss of support for current configurations. If I could speak for both Red Hat and Novell, I would be earning a hell of a lot more money. (Some initial feedback to my blog entry here made me add this paragraph; I did discuss this in the presentation, but it is not captured on the slide shown.)
14 Oct 2008 (updated 14 Oct 2008 at 12:48 UTC) »
  • Lukas Chaplin of Linux-Lancers.com, a Linux recruiting and placement agency, has interviewed me about working from a home office. Working from home is not yet as pervasive elsewhere as it is in the Open Source environment, which is really a shame.
  • Of course, before going to Lukas you should first check whether Novell & SuSE can offer you a new challenge!
  • It's been a while since I blogged, so I have two conference reports as well, starting with the Cluster Developer Summit in Prague, 2008-09-28 - 2008-10-02. (See the link for Fabio's report.)

    This Summit was organized by Fabio from Red Hat and hosted by Novell, with attendees from Oracle, Atix, NTT Japan and others, which Lon captured on this picture. It is my honest belief that within a year or two, we shall have one single cluster stack on Linux; totally awesome! Amazing how much progress one can make if one is not stuck to one's own old code, but willing to select the best-of-breed.

    I think we have come a long way in the last ten years; having explored several different paths through concurrent evolution, we are now seeing more and more convergence as there is less and less justification for the redundant effort expended. Dogs, cats, and mice eating together ... It also reinforced my opinion that small, focused developer events can be exceptionally productive.

  • At Linux Kongress 2008 in beautiful Hamburg, there were many tutorials and sessions where Pacemaker + heartbeat were used to build high-availability clusters. In my own session, I presented the last year or so of development on Pacemaker and heartbeat, and of course summarized the results from the Cluster Developer Summit.

    I also learned about a neat trick Samba's CTDB plays with TCP to make fail-over faster; of course, thanks to this being Open Source, they were able to contribute it to the community instead of reinventing their own cluster stack. (Haha, just kidding, of course they rolled their own - this is Open Source after all.) However, it should be possible to copy it and integrate it as a generic function for IP address fail-over. Cool stuff.

    I also very much enjoyed dinner with James, Jonathan, Andreas, Lars (Ellenberg), and Kay - who lives in Hamburg, but whom I only see at conferences ... Refer to the working from home offices interview!

  • Miguel: you can use getsockopt(sockfd, SOL_SOCKET, SO_PEERCRED, &cred, &len) with a struct ucred to find out the far-side pid and uid from within the server.
23 Aug 2008 (updated 23 Aug 2008 at 22:20 UTC) »
  • Hi all, long time no blog. But with the recent announcement of the Linux Kongress 2008 program, which will happen in my chosen home city Hamburg from 7th to 10th October, I have to share the joy:

    Not one, but three tutorials - both in English and German - explaining how to use Linux-HA with the CRM/Pacemaker as a high-availability cluster environment.

    Congratulations and thanks to Ralph Dehner, Lars Ellenberg, Joerg Jungermann, Maximilian Wilhelm!

    Also, a brief talk by myself on the future of HA on Linux, fresh from the Cluster Developer Summit in Prague.

    All in all, Linux Kongress has a very, very strong program this year, and I look forward to meeting you all in Hamburg - bring your umbrella!

  • On Monday, Hack Week 2008 begins. I will be working on shared storage-based fencing for heartbeat, and possibly some other projects related to clustering.

    I also look forward in particular to the First Penguin Award candidates: the prize for the most daring failure. Failure is crucial to success; learning where the boundaries of our models and theories lie is the foundation of science, and of successful design. Only by anticipating and overcoming failure is success possible. If you doubt this for a single moment, read Petroski: Success through Failure.

    As a member of the panel, and obsessed with things going wrong, I hope your project contributes to our knowledge; the most valuable lesson of the whole week might well be learned from showing what does not work. And there will be a prize too! How good is that?

Jozef has posted a very cool solutions article describing how to build a highly-available load-balancing solution for any TCP-based network service (including mail, web, ftp, etcetera) using entirely Open Source components and of course all included with SUSE Linux Enterprise Server 10 SP2 - Linux-HA, Linux Virtual Server, and ldirectord. Rock on!
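To give a flavour of how little configuration such a setup needs, here is a hypothetical ldirectord.cf fragment for a load-balanced HTTP service; all addresses and the check page are placeholders:

```
# /etc/ha.d/ldirectord.cf (hypothetical addresses)
checktimeout=10
checkinterval=5
quiescent=yes

virtual=192.168.1.100:80
        real=192.168.1.10:80 gate
        real=192.168.1.11:80 gate
        service=http
        scheduler=rr
        checktype=negotiate
        request="index.html"
        receive="It works"
```

ldirectord monitors the real servers with the configured check and adjusts the LVS tables accordingly; Linux-HA then keeps ldirectord and the virtual IP itself highly available.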

Of course, you could buy an expensive appliance instead ...
