13 Jan 2008 lmb   » (Master)

In this post, Alan Robertson discusses cluster stacks. This is interesting, but has some misleading points:

  • Linux-HA (with or without OpenAIS) supports the AIS membership APIs. This is not quite correct: the version of the APIs it supports is close to ancient and, worse, membership by itself is rather pointless. Linux-HA as-is does not provide the messaging or any of the other AIS APIs, so having the membership does not mean that any AIS application could run.

  • Nevertheless, in an ideal world, all cluster components and cluster-aware applications would sit on top of the same set of communications protocols. Let's just keep this one in mind, we're going to need it below!

  • The Linux-HA CRM function is largely divided between the PE and TE – which are described below. The CRM has been split out from the Linux-HA heartbeat project by its developers; I'm not sure how Alan failed to mention this, as he has been objecting to it for the last few weeks ;-)

    Technically, the description is not quite right either. The CRM itself is a fairly important component, electing the transition coordinator, dealing with failed nodes, and implementing the state transitions at the cluster level. Its components include not only the Policy Engine and the Transitioner; the CIB itself is also part of the CRM modules.

  • It's interesting how the PE receives the largest share of criticism, while no comments are made about the scalability and performance of the messaging layer itself. Oh well. The PE actually is modularized and completes its task in several stages (the original design called for placement first and ordering later, as distinct steps), but the modules have a high inter-dependency, and in practice it turned out not to be so easy; clear and robust interfaces are very hard to define. For a similar problem, look at how gcc "modularizes" its optimization steps.

    While the PE does perform round-robin load-balancing, attempting full load balancing based on resource cost turns the problem into an exceptionally hard one; we considered this, and then postponed it until later. For now, our main goal is to keep services alive, and to leave the load balancing to some external component which modifies our node weights; that seems fairly modular to me, in fact.

    It's true that we might step towards modularization (again!) as we understand the problem more and more, but I object to the underlying assumption that we hadn't thought of all that before.

  • The LRM proxy communicates between the CRM and the LRMs on all the various machines. This function is currently built into the CRM. This architectural decision was based on expedience more than anything else. I wonder how else the CRM's TE is supposed to communicate with the LRMs, as needed to carry out the commands and retrieve status, if not by having some form of proxy/interface to them?

  • To support larger clusters this needs to be separated out, made more scalable, and more flexible. This would allow a large number of LRMs to be supported by a small number of LRM proxies. The CRM and its components (TE, CIB) clearly require an interface to the LRMs, so I'm not quite sure how this could be separated out.

    My guess would be that he is referring to the idea of having the CRM manage nodes (virtual or physical) which are not full cluster members as containers for resources. And, supposedly, not suggesting to treat them as virtual cluster members at the membership level any longer! Nice to see he's dropped that idea. Yet, as Alan likes to give credit when he came up with something, maybe he should give credit for this as well ...? Just thinking.

  • In large systems, this would probably use the ClusterIP capability to provide load distribution (leveling) across multiple LRM proxies. I have absolutely no idea what this is supposed to suggest.

  • The description of the quorum daemon might imply the suggestion that Linux-HA supported general split-site clusters right now. As much as I wish it did, this is not true.

    And while quorum in two-node clusters is indeed problematic (because they always end up with a tie when one node is down), the quorum server most certainly is not needed for two-node clusters, as fencing resolves this problem nicely, and has done so for years.

  • For a variety of reasons, kernel space doesn't have access to user-space cluster communications or membership.
    As a result, both the DLM and most cluster filesystems implement their own membership and communications.
    This is technically incorrect; OCFS2 has been instrumented to inherit the membership from user-space, as has GFS2. (Or, in fact, their DLMs inherit this.)

    The discussion of case 1 neglects the detail that the "other" membership also must be told not to talk to the other node, same as in case 2; in fact, each membership must be reduced to the common subset. The method described for case 2 is indeed not pretty and, contrary to the following claim, would not work right now, as the mechanisms do not exist:

  • Although Case 2 isn't pretty, it works, and no amount of wishing and hoping is likely to ever make this kind of problem go away in the general case.

    This is quite certainly the most confusing message in this lecture. First, it is wrong today, even for Linux-HA: OCFS2 avoids this by inheriting the Linux-HA membership through the Filesystem resource agent.

    Second, by porting the CRM modules - now called PaceMaker - to run natively on top of openAIS, just as C-LVM2, GFS2, and OCFS2 will, we are finally on track to solve this perfectly and to have everyone use the same membership.

    However, it should be noted that there has been exactly one person unhappy about this, who is now trying to sell it as if it were his idea, rather than something he still opposes. I wonder who that might be?

I will further admit that it irks and offends me that Alan talks of the CRM as our work (as if he had been involved much in it), and explicitly mentions how he started the OCF in 2001, mentions IBM and Red Hat, yet completely fails to mention the contributions made by many Novell and SUSE engineers, most notably by Andrew Beekhof. Oh well.
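
As a footnote to the quorum and membership points above, here is a minimal sketch of the two bits of arithmetic involved: why a two-node cluster always loses strict-majority quorum when one node is down, and what "reducing each membership to the common subset" amounts to. The function names and node names are made up for illustration; none of this is Linux-HA, OpenAIS, or Pacemaker code.

```python
def has_quorum(live_nodes, all_nodes):
    """Strict majority: more than half of the configured nodes must be alive."""
    return 2 * len(live_nodes) > len(all_nodes)


def common_membership(*views):
    """Reduce several membership views to the nodes that all of them contain."""
    return set.intersection(*(set(view) for view in views))


# Two-node cluster: one survivor is exactly half, which is not a majority.
two_nodes = {"node1", "node2"}
print(has_quorum({"node1"}, two_nodes))              # False

# Three-node cluster: two survivors are a majority.
three_nodes = {"node1", "node2", "node3"}
print(has_quorum({"node1", "node2"}, three_nodes))   # True

# Two layers with diverging views (hypothetical): only the nodes present
# in both views may keep talking to each other.
userspace_view = {"node1", "node2", "node3"}
kernel_view = {"node1", "node3"}
print(sorted(common_membership(userspace_view, kernel_view)))  # ['node1', 'node3']
```

The tie in the two-node case is why something other than counting votes, such as fencing, has to decide who continues; having every layer share a single membership removes the need for the intersection step altogether.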
