In
this post, Alan Robertson discusses cluster stacks. This
is interesting, but has some misleading points:
- "Linux-HA (with or without OpenAIS) supports the AIS
membership APIs." This is not quite correct: the version of
the APIs it supports is close to ancient, and - worse -
membership by itself is rather pointless. Linux-HA as-is
provides neither the messaging nor any of the other AIS
APIs, so having the membership does not mean that any AIS
application could actually run.
- "Nevertheless, in an ideal world, all cluster
components and cluster-aware applications would sit on top
of the same set of communications protocols." Let's just
keep this one in mind; we're going to need it below!
- "The Linux-HA CRM function is largely divided between
the PE and TE – which are described below." The CRM has
been split out from the Linux-HA heartbeat project by its
developers; I'm not sure how Alan failed to mention this, as
he has been objecting to it for the last few weeks ;-)
Technically, the description is not quite right either.
The CRM itself is a fairly important component: it elects
the transition coordinator, deals with failed nodes, and
implements the state transitions at the cluster level. Its
components include not only the Policy Engine and the
Transitioner; the CIB itself is also part of the CRM
modules.
- It's interesting how the PE receives the largest share
of criticism, while no comments are made about the
scalability and performance of the messaging layer itself.
Oh well. The PE actually is modularized and completes its
task in several stages (the original design called for
placement first and ordering later, as distinct steps), but
the modules are highly inter-dependent, and in practice it
turned out not to be so easy; clear and robust interfaces
are very hard to define. For a similar problem, look at how
gcc "modularizes" its optimization steps.
While the PE does perform round-robin load-balancing,
attempting full resource-cost and load balancing turns the
problem into an exceptionally hard one; we considered this,
and then postponed it until later. For now, our main goal is
to keep services alive, and to leave the load balancing to
some external component which modifies our node weights;
that seems fairly modular to me, in fact.
It's true that we might step towards modularization
(again!) as we understand the problem more and more, but I
object to the underlying assumption that we hadn't thought
of all that before.
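To make the division of labour above concrete, here is a minimal Python sketch. It is not the actual PE code: the placement step only looks at node weights (spreading resources round-robin among equally weighted nodes), while a separate, external component does nothing but adjust those weights. All function and variable names are hypothetical.

```python
# Hypothetical sketch: none of these names come from the Linux-HA code base.

def place_resources(resources, node_weights):
    """Assign each resource to the highest-weighted node.

    Among equally weighted nodes, the one with the fewest resources placed
    so far wins, which yields round-robin distribution.
    """
    placement = {}
    load = {node: 0 for node in node_weights}  # resources placed per node
    for resource in resources:
        # Sort key: higher weight first, then lower load, then node name.
        best = min(node_weights, key=lambda n: (-node_weights[n], load[n], n))
        placement[resource] = best
        load[best] += 1
    return placement


def apply_external_load_hint(node_weights, node, penalty):
    """What an external load balancer might do: lower one node's weight."""
    adjusted = dict(node_weights)  # leave the caller's weights untouched
    adjusted[node] -= penalty
    return adjusted
```

With equal weights the resources alternate between nodes; once the external component penalizes a busy node, the same placement step moves everything to the other one, without knowing anything about load itself.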
- "The LRM proxy communicates between the CRM and the
LRMs on all the various machines. This function is
currently built into the CRM. This architectural decision
was based on expedience more than anything else." I
wonder how else the CRM's TE is supposed to communicate with
the LRMs, as needed to carry out the commands and retrieve
status, if not by having some form of proxy/interface to them?
- "To support larger clusters this needs to be separated
out, made more scalable, and more flexible. This would
allow a large number of LRMs to be supported by a small
number of LRM proxies." The CRM and its components (TE,
CIB) clearly require an interface to the LRMs, so I'm not
quite sure how this could be separated out.
My guess would be that he is referring to the idea of
having the CRM manage nodes (virtual or physical) which are
not full cluster members as containers for resources. And,
presumably, he is no longer suggesting that they be treated
as virtual cluster members at the membership level! Nice to see he's
dropped that idea. Yet, as Alan likes to give credit when he
came up with something, maybe he should give credit for this
as well ...? Just thinking.
- "In large systems, this would probably use the
ClusterIP capability to provide load distribution (leveling)
across multiple LRM proxies." I have absolutely no idea
what this is supposed to suggest.
- The description of the quorum daemon might suggest that
Linux-HA supports general split-site clusters right now.
As much as I wish it did, this is not true.
And while quorum in two-node clusters is indeed
problematic (because losing one node always leaves a tie),
the quorum server most certainly is not needed for
two-node clusters, as fencing resolves this problem nicely,
and has done so for years.
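The arithmetic behind the two-node tie can be sketched in a few lines of Python. This is an illustration only, assuming one vote per node and a strict-majority rule; the function names are made up and do not correspond to any Linux-HA API.

```python
# Hypothetical sketch: one vote per node, strict majority required.

def has_quorum(votes_present, total_votes):
    """True if strictly more than half of all configured votes are present."""
    return 2 * votes_present > total_votes


def may_run_resources(votes_present, total_votes, peer_fenced):
    """Decide whether a partition may run resources.

    In a two-node cluster the survivor never holds a strict majority, so this
    sketch lets it proceed once the peer is confirmed fenced.
    """
    if has_quorum(votes_present, total_votes):
        return True
    return total_votes == 2 and peer_fenced
```

A lone survivor of a two-node cluster holds 1 of 2 votes, which is a tie rather than a majority; it is the confirmed fencing of the peer, not an extra vote from a quorum server, that makes it safe to continue.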
- "For a variety of reasons, kernel space doesn't have
access to user-space cluster communications or
membership. As a result, both the DLM and most cluster
filesystems implement their own membership and
communications." This is technically incorrect; OCFS2 has
been instrumented to inherit the membership from user-space,
as has GFS2. (Or, in fact, their DLMs inherit this.)
The discussion of case 1 neglects the detail that the
"other" membership also must be told not to talk to the
other node, same as case 2; in fact, each membership must be
reduced to the common subset. The method described for case
2 is indeed not pretty, and, contrary to what is claimed,
would not work right now (as the mechanisms do not exist):
- "Although Case 2 isn't pretty, it works, and no amount
of wishing and hoping is likely to ever make this kind of
problem go away in the general case."
This is quite certainly the most confusing message in
this lecture. First, it is wrong today, even for Linux-HA:
OCFS2 avoids this by inheriting the Linux-HA membership
through the Filesystem resource agent.
Second, by porting the CRM modules - now called Pacemaker
- to run natively on top of OpenAIS, just as C-LVM2, GFS2,
and OCFS2 will, we are finally on track to solve this
perfectly and have everyone use the same membership.
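As an illustration of the "common subset" requirement mentioned for case 1, and of why a single shared membership makes it trivial, here is a small Python sketch. The layer names, node names, and helper functions are hypothetical; they only model the set arithmetic, not any real membership API.

```python
# Hypothetical sketch: membership views are modelled as plain sets of node names.

def common_membership(*views):
    """Return the nodes that every membership layer agrees are members."""
    if not views:
        return frozenset()
    return frozenset.intersection(*(frozenset(v) for v in views))


def nodes_to_exclude(views):
    """For each named membership layer, the nodes it must stop talking to."""
    common = common_membership(*views.values())
    return {name: frozenset(view) - common for name, view in views.items()}
```

With two independent layers, for example `{"user_space": {"n1", "n2", "n3"}, "dlm": {"n1", "n2"}}`, the user-space layer has to be told to drop `n3`. With one shared membership there is only a single view, so nothing ever needs reconciling.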
However, it should be noted that there has been exactly
one person unhappy about this, who is now trying to sell it
as if it were his idea, rather than something he still
opposes - I wonder who that might be?
I will further admit that it irks and offends me that
Alan talks of the CRM as our work (as if he had been
involved much in it), and explicitly mentions how he
started the OCF in 2001, mentions IBM and Red Hat, yet
completely fails to mention the contributions made by many
Novell and SUSE engineers, most notably by Andrew Beekhof.
Oh well.