0350 Had trouble sleeping. Prayed for a few people. Got
The case is the "restart" test when only one machine is up
looks like it had an obvious bug. It said:
    if node == self.CM.OurNode:
        pat = self.uspat
But it should have said:
    if node == self.CM.OurNode or self.CM.upcount() < 1:
        pat = self.uspat
instead. I applied the fix on "servidor". Looks like it
some problems with X11 forwarding too. I changed the rsh
to suppress forwarding X11 ports (since I don't need them).
Looks like there's also a bug in the Stonith test such that
doesn't look for the right patterns if the other node is
Another wee bug, this one slightly more subtle:
if (self.CM.upcount() == 1 and
Should have simply been:
if (self.CM.upcount() == 1):
I decided the logging should be part of the CtsLab class.
So, it's now in the (as yet unused) CtsLab class.
Another wee bug, this time in the Stonith code:
if (self.CM.upcount() == 1):
should have been
if (self.CM.upcount() <= 1):
It's fixed now.
0615 Sent out CtsLab code to list. Probably ought to take a
Somewhere along in here I spent an hour or two
1615 The entire 500 tests all went successfully.
Definitely fixed the bug,
since I ran it with the same random number set...
Some other minor bugs having to do with reporting at
were introduced. I think I fixed them.
1715 Continuing to write the Lab class. Break time.
2000 Break over. Hope to get the lab class integrated and
2300 Looks like they're working together fine. Better quit
while I'm ahead
after backing things up ;-). G'night.
0700 Today should be mostly a packing and preparing to
I have some loose ends to take care of before I leave,
but today I
have a car, but of course it's snowing pretty nicely
However, I'm going to have to look at the heartbeat
because it looks like I triggered a bug in the
with the tests. I guess that's what testing is
supposed to do ;-)
The test code "hit the jackpot". "Both machines own
The evidence should be in the logs. I'll see. The
about 3 hours into the test run.
The problem is caused by the machine which had just
come up (sgi2)
failing to hear any heartbeats from the machine which
was up all along.
Perhaps this is caused by a piece of the code in the
which waits for the takeover to complete, hence
keeping packets from
being sent out.
Another possibility would be that it is a problem in
the receiving code
startup. This sounds more likely. Perhaps the
startup code should
be more synchronous. This is what the timing looks
Jan 29 02:03:56 sgi2 Starting heartbeat 0.4.8k
Jan 29 02:04:00 sgi2 UDP heartbeat started
Jan 29 02:04:01 sgi2 WARN: node sgi1: is dead
OK, that's technically our deadtime (5 seconds), but
we didn't give the
other guy much of a chance to give us a heartbeat,
because we were not
yet up very long. With a heartbeat interval of only 1
second, this is
almost impossible. Under heavy load or with a 3
second dead time, I
could imagine this being much worse. I think I
if this could happen before.
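
The timing in those three log lines can be checked with a little arithmetic. This is only a sketch using the timestamps quoted above; the variable names are made up for illustration and are not from heartbeat.

```python
from datetime import datetime

def clock(log_line):
    """Parse the HH:MM:SS field (third token) of a log line."""
    return datetime.strptime(log_line.split()[2], "%H:%M:%S")

# The three lines quoted in the journal entry (all on the same day).
started = clock("Jan 29 02:03:56 sgi2 Starting heartbeat 0.4.8k")
udp_up = clock("Jan 29 02:04:00 sgi2 UDP heartbeat started")
declared_dead = clock("Jan 29 02:04:01 sgi2 WARN: node sgi1: is dead")

# Seconds from process start to the "dead" verdict (the full deadtime)...
since_start = (declared_dead - started).total_seconds()
# ...versus seconds in which a heartbeat could actually have been received.
since_udp = (declared_dead - udp_up).total_seconds()

print(f"declared dead {since_start:.0f}s after startup")
print(f"but only {since_udp:.0f}s after UDP heartbeat started")
```

So the 5 second deadtime had technically elapsed, but the other node had only about one heartbeat interval in which to be heard.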
Sounds like we should start the timing of "dead" time
from the moment
we receive an ack that all of the child read/write
up and running. I guess that means that the code
needs to send such
ACKs and that the heartbeat core timing logic needs to
and modify its idea of the "epoch" accordingly.
I guess this is great progress! I've moved from
debugging the test
tool to debugging the thing it's testing! Now I just
need to think
carefully about how to fix this bug in heartbeat ;-)
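
A rough sketch of the "epoch" idea from this entry, written in Python for brevity. It is not the heartbeat code; the class and method names (DeadTimer, io_ready, heard, is_dead) are hypothetical, and times are plain seconds. The assumption is that a node we have never heard from is timed from the epoch, and the epoch moves forward when the read/write children report that they are running.

```python
class DeadTimer:
    """Toy dead-node detector; all names here are illustrative only."""

    def __init__(self, deadtime, start_time):
        self.deadtime = deadtime
        self.epoch = start_time   # reference time for nodes not yet heard from
        self.last_heard = {}      # node name -> time of last heartbeat

    def io_ready(self, now):
        """Children ACKed that they are up: restart the dead-time clock."""
        self.epoch = now

    def heard(self, node, now):
        """Record a heartbeat from a node."""
        self.last_heard[node] = now

    def is_dead(self, node, now):
        """Dead if nothing was heard for deadtime seconds since the reference."""
        reference = self.last_heard.get(node, self.epoch)
        return now - reference >= self.deadtime


# Replaying the log above: start at t=0, UDP up at t=4, check at t=5.
naive = DeadTimer(deadtime=5, start_time=0)
print(naive.is_dead("sgi1", now=5))    # epoch never moved: declared dead

fixed = DeadTimer(deadtime=5, start_time=0)
fixed.io_ready(now=4)                  # epoch moved to when I/O came up
print(fixed.is_dead("sgi1", now=5))    # only 1s of listening so far
```

With the epoch moved, sgi1 gets a full deadtime after the I/O processes are up before it can be declared dead.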
0800 I sent out last week's journal, and saved a similar
email as a template
to make sending it out in the future easier. I'm going to
packing now, and come back to the bug later.
1000 Finally finished packing! Now to run errands and do
all the other
things I need to do before leaving town for a few days.
1345 Got home and am doing a little more cleaning up,
reading email, etc.
1405 Gotta go get Laura from Mandalay (work).
1430 Went to go see the builder of our house and try and
out some things in how the house is put together.
2000 Checking email, printing off schedule document. Need
to stop this to
order some Orinoco cards, and go to bed... (to about
got to bed around 0000.
0430 Today I leave for LWCE/NYC. Expect less detail in the
subsequent entries, since I'll spend most of my time away
laptop. Better pack up the laptop, etc ;-)
Got everything packed and made the plane on time, etc.
Trip went without incident. Coded a little fix to the
I discovered with CTS. Watched the movie. It was
"Remember the Titans". I recommend it highly.
Arrived in NYC a little later than planned. Spent about a
trying to get my cell phone to work in NYC. It was a pain,
vendor needed to take some special security precautions to
NAM, etc. from being stolen and someone from making calls
I took the shuttle to the hotel and checked in fine. By
this happened it was a little too late to make it to the
to check in today.
Worked a little on the code. Got the timing fix "mostly"
Got a call from Horms, and went to dinner with him and his
from VA. Had a good discussion about where I want
heartbeat to go
and what he wants to do with it also. Ate an Aussie meat
It was pretty good. He said it was a little higher-class
than you'd often get in Australia. Went home about 2300.
bed around 0000.
0700 Really tired this morning.
Made it to the Javits Center about 0830 or
so. Talked to LWN staff at the speakers room. Got
both as speaker and as Exhibitor. I did LOTS of
Stacey Quandt from Giga didn't show, but everyone else
did. I also
spoke to a freelance journalist who had very similar ideas
the "small" enterprise and what they need from HA. He had
me speak when I was at Bell Labs in Naperville and dropped
by to see
me. Here was my agenda, which was mostly followed:
1000 Ben Rafanello and friends, IBM
1115 Stacey Quandt (no-show)
1200 D. H. Brown
1400 Jon Doyle & Compaq
1500 Dean Pannell
1500 Peter Badovinatz (IBM) at Developers' Den
1830 IBM Party. Spent a lot of time with Peter B (Wombat)
Learned some very interesting things from Peter, what he
said, and what
he didn't say. Glad I spent the time with him.
It was a long, busy, productive day, and I don't have much
I'm going to have to be careful, or I won't have any voice
my talk on Friday. I'll take some throat lozenges with me
It seems to me that the show has been pretty good as far as
people coming by. I also talked to Ted Ts'o about the
winmodem debacle, and also with someone from IBM (Frank
email@example.com) who will help ensure that Lucent does
thing. I need to tell Ted about him. I also met Patrick
of MandrakeSoft. Dan Cox of Compaq told me to contact
about the HA disk (512) 432-8146.
0630 Got up, pulled down email, finished the fix for the
timing bug. It
seems to work fine now. Wrote a reply to Markus and Jay
they tell me sooner rather than later if they have feedback
I spend my time :-) Updated CVS with the timing fix.
Getting ready to go to the Javits Center. Maybe I'll have
a little time
to look around on the show floor today :-)
Spent about 20 minutes writing up the notes from the show
0810 Go to Javits Center. Bye for now ;-)
2345 I spent the whole day at the show, mostly talking to
customers, suppliers, partners, etc. My appointments today
with Thomas Schaffner of Enterprise Linux, Mike McQuaid of
Systems, and Peter Badovinatz of IBM. I talked to lots of
people though, including one person from Lawrence Berkeley
Laboratories who might be interested in having us provide
services to help him deploy a high-availability web
server. I also
talked to Oracle about HA issues, SGI, and various other
whom I've forgotten. I did finally get out of the
booth for an hour
today to look around. Bought a book. Got a few goodies.
I worked with Joshua Uziel (uzi) to fix a byte-ordering bug
findif.c code had. He packaged it up in a patch and mailed
Other people I talked to: Shane Painter of Dell (whom I
Austin), Eric Lam of Coventive (interesting hardware
Perlstein of SGI (FailSafe support), Charlie Simpson of
Linux, and Satoshi Kawata of Red Hat Japan.
I stopped by the Mission Critical Linux folks and it sounds
they may end up using our open source test tool to help
their clusters. Right now they test everything by hand.
I bought my first meal since leaving home. Everything else
been freebies and a snack or two ;-)
I had a great conversation with our IBM liaison (Malcolm?).
that he didn't know that SuSE had any HA efforts. I
this misimpression. It was a really good thing I think.
It sounds like he may have me go meet some IBM folks.
Apparently Malcolm has good news regarding our relationship
Better go to bed now, and get up to work on my talk
An aside: Apparently John Mehaffey mentioned us in one of
his talks. At least 2 or 3 people came by to see me as a
I'll drop him a thank you note.
Today I give my talk, and I return home.
0500 My stomach was a little unsettled, so I went ahead and
I need to reread my talk and see if I can/need to add
regarding the various APIs to the talk. Get dressed, do a
little packing, etc.
0545 Begin rewriting talk to change emphasis to Linux-HA
being a heartbeat talk.
0645 Began a runthrough of the talk. It took about 45
minutes. It should
fit in the time allotted. I'm a little worried about it
being a little
0740 Start to pack up in earnest. Am tired already. Sad
state of affairs.
Better locate my Penguin mints for later ;-) Took a little
0910 Time to pack up the laptop and leave for the
1950 Went over to the Javits Center by cab. Arrived about
I run into Liz and Michael Hammell from the Linux Weekly
It turns out that Liz is returning to Denver on the same
I am. We make arrangements to share a ride to the airport.
Went by the booth. Talked for quite a while with Anas
clustering issues and then with Andreas Archangelli mainly
debugging tools. I'm glad he has a better attitude about
Linus does. Maybe I ought to duplicate some of the "klog"
for Linux. Wonder if Avaya would open source them? Maybe
have Roger or someone send me some klog output (if he could
some easily) so I could show it to Andreas.
I went to go hear Dirk's talk - a little late. Dirk seems
well-prepared and has a good talk. My PalmPilot alarm goes
near the end. It's time to go check out the room I'll give
in, and run through a little of it. I discover I'm more
than I'd guess. I wonder if anyone much will show up for
last talk in the conference? One couple shows up 30
Others show up shortly afterwards. Doesn't sound like I
to worry about. After a few minutes I sit down with the
who've come in and talk to them. It was nice - seems to
down my nerves. I find that a guy from Bloomberg financial
services that I met before is here. He's a Russian (?)
I get his card. I'm supposed to send him a copy of the
I don't know the routine here. Will someone introduce me?
should I start? About 2 minutes after, I decide that no
introduce me, and I'll start my talk now. I don't find any
for the lights, but someone in the audience tells me and I
the lights dimmed. By a few minutes into the talk there
people in the room. Nice turnout.
I get my first question. It's very confusing. It takes a
minutes to figure out what he wants to know. I'm about to
the discussion off when I figure it out and answer it.
more questions come. I'm beginning to warm up, and my
of humor takes off and the audience laughs. Now I'm having
have lots to say, and they ask lots of questions. The talk
at almost exactly the right time! It went very well. They
good audience. [I agreed to put the slides up on the
A fellow from LynuxWorks wants to talk to me. He's on the
mailing list (but I don't remember him too clearly). He
might put some resources on the Linux-HA project. He tells
they are going to open up the Intel High-Availability forum
other people - he implies that he means people like me,
me specifically. [I look up email from him later, and I
that he's a fellow I accidentally insulted on the list. I
he must have forgiven me]. Liz rings and says she wants to
to folks and will call me a little later.
I go to bag check to get my coat and bag, and go up to the
to talk to folks before Liz calls again. I chat a bit, run
guy from Conectiva. I get him some small SuSE souvenirs
himself and my friends at Conectiva (Marcelo, Olive and
Olive runs SuSE on his machine ;-) My phone rings, and it's
Time to go.
~1530 We get a limo and ride to the airport. It was a bit
than I'd like, but it was starting to rain and lots of
looking for rides, so we take it.
There's another fellow in the car with us, so we all chat.
to know about his company. He reads Linux Weekly News,
to have heard of heartbeat. So we all have something to
We arrive at the airport, in plenty of time. All is well.
exchange travel horror stories. It seems Liz has a bit of
travel problem phobia, and has had a few experiences to
She's going to go to talk at LinuxWorld Expo in Singapore.
She agrees to give me a ride home (it's not far out of the
The bus is fine, but being dropped off at home is nicer. I
that I left my Minidisc player with Stephen Ing. Oops!
says that the LWCE audiences rate the speakers on a 5-point
wonder how my talk was rated?
~1810 We load up on the plane. After we're en route, the
we'll be in Denver 30 minutes early. He seems skeptical of
flight computer ;-) So am I. I nap until they turn off
seat belt sign. They bring dinner. It's not too bad.
comes on, and I dig out my laptop for this report. It took
20 minutes or so to write up the part after 0910.
2033 I switch my watch to Denver time. Now it's 1834 ;-)
1836 I decide to write Stephen an email, along with one to
fellow, and the one I need to send Ted Ts'o. If I feel
I'll try and catch up on the email from the list as well.
send John Mehaffey a note of thanks too. I added Brian's,
Alexander's, and John Mehaffey's info to my address book.
1917 I sent those emails. Now I'll try and catch up on
I applied Uzi's byte ordering patch. I'll try it when I
and have a network. I also need to send email to Rudy
or the Enterprise Linux people.
2019 I got rid of around 100 emails, and replied to many.
I've got about
another half-hour to go on the flight. Guess I'd better
how/when to finish up. Still need to email to/about Rudy.
2025 It's getting rough up here. Better shut down and put
up the laptop.
I had a most pleasant return trip with Liz and her family.
very kindly just dropped me off at home.
0800 Downloaded, read and replied to a little email. About
an hour I
suppose. Wife and I both tired, cranky :-(
I tried to grab email mid-afternoon. DSL down :-(
Got it back up in about a half-hour of time with Qwest.
Very tired after the show. Zzzz.
2100 Read, replied to more mail. Updated main and
on linux-ha web site. Thought some more about the upshot
from my talk. There is a lot of interest in HA things, and
particular I MUST split out the core code from the cluster
manager code. This has to be a near-term development
Users want it, Anas needs it, others too... It just
way more useful that way. I believe the development
from others is blocked because of this. I VERY MUCH need
to update the TODO list. It's WAY out of date.
Another thing to add to the TODO list: Make the
code plug-in modules, too...
2210 G'night. It's 0010 East Coast time now. No wonder I'm
I need to update my personal todo list from this journal
I'll send this out to my loyal readers ;-)