2001/01/28
=================================================================
0350 Had trouble sleeping. Prayed for a few people. Got
up.
The "restart" test case, when only one machine is up,
looks like it had an obvious bug. It said:
    if node == self.CM.OurNode:
        pat = self.uspat
But it should have said:
    if node == self.CM.OurNode or self.CM.upcount() < 1:
        pat = self.uspat
instead. I applied the fix on "servidor". Looks like it
was having
some problems with X11 forwarding too. I changed the rsh
command
to suppress forwarding X11 ports (since I don't need them).
Looks like there's also a bug in the Stonith test such that
it
doesn't look for the right patterns if the other node is
down.
Another wee bug, this one slightly more subtle:
    if (self.CM.upcount() == 1 and
            self.CM.ShouldBeStatus[node] == self.CM["up"]):
Should have simply been:
    if (self.CM.upcount() == 1):
I decided the logging should be part of the CtsLab class.
So, it's now in the (as yet unused) CtsLab class.
Another wee bug, this time in the Stonith code:
    if (self.CM.upcount() == 1):
should have been
    if (self.CM.upcount() <= 1):
It's fixed now.
0615 Sent out CtsLab code to list. Probably ought to take a
nap before
church ;-)
Somewhere along in here I spent an hour or two
packing.
1615 All 500 tests ran successfully.
Definitely fixed the bug,
since I ran it with the same random number set...
Some other minor bugs having to do with reporting at
the end
were introduced. I think I fixed them.
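Rerunning with "the same random number set" works because a seeded generator always yields the same sequence, so the same tests run in the same order. A minimal sketch of the idea follows. The function name, the seed value, and every test name other than "restart" and "stonith" are made up for illustration, and this is not the actual CTS code:

```python
import random


def pick_tests(test_names, count, seed):
    """Return a reproducible sequence of `count` test names for a given seed."""
    # A private generator keeps the sequence independent of any other
    # code that uses the module-level random functions.
    rng = random.Random(seed)
    return [rng.choice(test_names) for _ in range(count)]


# Hypothetical test names; only "restart" and "stonith" appear in this journal.
tests = ["restart", "stonith", "startonebyone", "stopall"]

first_run = pick_tests(tests, 500, seed=20010128)
second_run = pick_tests(tests, 500, seed=20010128)

# Same seed, same order: a failure seen in one run can be replayed exactly.
assert first_run == second_run
```

Recording the seed alongside the test log is what makes a failure three hours into a run repeatable after a fix.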
1715 Continuing to write the Lab class. Break time.
2000 Break over. Hope to get the lab class integrated and
working tonight.
2300 Looks like they're working together fine. Better quit
while I'm ahead
after backing things up ;-). G'night.
2001/01/29
=================================================================
0700 Today should be mostly a packing and preparing to
leave day.
I have some loose ends to take care of before I leave,
and today I
have a car, but of course it's snowing pretty nicely
outside ;-)
However, I'm going to have to look at the heartbeat
code anyway,
because it looks like I triggered a bug in the
heartbeat code
with the tests. I guess that's what testing is
supposed to do ;-)
The test code "hit the jackpot": "Both machines own
foreign resources".
The evidence should be in the logs. I'll see. The
error occurred
about 3 hours into the test run.
The problem is caused by the machine which had just
come up (sgi2)
failing to hear any heartbeats from the machine which
was up all along.
Perhaps this is caused by a piece of the code in the
takeover sequence
which waits for the takeover to complete, hence
keeping packets from
being sent out.
Another possibility would be that it is a problem in
the receiving code
startup. This sounds more likely. Perhaps the
startup code should
be more synchronous. This is what the timing looks
like:
Jan 29 02:03:56 sgi2 Starting heartbeat 0.4.8k
Jan 29 02:04:00 sgi2 UDP heartbeat started
Jan 29 02:04:01 sgi2 WARN: node sgi1: is dead
OK, that's technically our deadtime (5 seconds), but
we didn't give the
other guy much of a chance to give us a heartbeat,
because we were not
yet up very long. With a heartbeat interval of only 1
second, this is
almost impossible. Under heavy load or with a 3
second dead time, I
could imagine this being much worse. I think I
remember wondering
if this could happen before.
Sounds like we should start the timing of "dead" time
from the moment
we receive an ack that all of the child read/write
processes are
up and running. I guess that means that the code
needs to send such
ACKs and that the heartbeat core timing logic needs to
track them
and modify its idea of the "epoch" accordingly.
I guess this is great progress! I've moved from
debugging the test
tool to debugging the thing it's testing! Now I just
need to think
carefully about how to fix this bug in heartbeat ;-)
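The idea of starting the "dead" clock only after the read/write children report in can be sketched roughly as below. This is an illustration in Python, not the heartbeat code (which is C). The class, its method names, and the child names are all hypothetical, and times are plain seconds counted from 02:03:56 in the log above:

```python
class DeadTimer:
    """Decide when a peer may be declared dead.

    Hypothetical sketch: the dead-time clock starts only once every
    child read/write process has acknowledged that it is running.
    """

    def __init__(self, deadtime, children):
        self.deadtime = deadtime
        self.pending = set(children)  # children that have not ACKed yet
        self.epoch = None             # unset until all children are ready
        self.last_heard = {}          # node name -> time of last heartbeat

    def child_ready(self, child, now):
        """Record a child's ACK; the last ACK fixes the epoch."""
        self.pending.discard(child)
        if not self.pending and self.epoch is None:
            self.epoch = now

    def heartbeat(self, node, now):
        self.last_heard[node] = now

    def is_dead(self, node, now):
        if self.epoch is None:
            return False  # we cannot hear anyone yet, so judge no one
        # Never measure silence from before we were able to listen.
        reference = max(self.epoch, self.last_heard.get(node, self.epoch))
        return now - reference >= self.deadtime


# Timeline from the log: start at t=0, UDP up at t=4, false "dead" at t=5.
timer = DeadTimer(deadtime=5, children=["udp-read", "udp-write"])
timer.child_ready("udp-read", now=4)
timer.child_ready("udp-write", now=4)
print(timer.is_dead("sgi1", now=5))  # False: only 1 second of listening
print(timer.is_dead("sgi1", now=9))  # True: a full deadtime of silence
```

With this rule, sgi1 would not have been declared dead at 02:04:01, because only one second had passed since the receiving side came up.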
0800 I sent out last week's journal, and saved a similar
email as a template
to make sending it out in the future easier. I'm going to
go finish
packing now, and come back to the bug later.
1000 Finally finished packing! Now to run errands and do
all the other
things I need to do before leaving town for a few days.
1345 Got home and am doing a little more cleaning up,
reading email, etc.
1405 Gotta go get Laura from Mandalay (work).
1430 Went to go see the builder of our house and try and
straighten
out some things in how the house is put together.
2000 Checking email, printing off schedule document. Need
to stop this to
order some Orinoco cards, and go to bed... (to about
2100). Finally
got to bed around 0000.
2001/01/30
=================================================================
0430 Today I leave for LWCE/NYC. Expect less detail in the
subsequent entries, since I'll spend most of my time away
from my
laptop. Better pack up the laptop, etc ;-)
Got everything packed and made the plane on time, etc.
Trip went without incident. Coded a little fix to the
timing bug
I discovered with CTS. Watched the movie. It was
"Remember the Titans". I recommend it highly.
Arrived in NYC a little later than planned. Spent about a
half-hour
trying to get my cell phone to work in NYC. It was a pain:
my vendor needed to take some special security precautions to
keep my NAM, etc. from being stolen and to keep someone from
making calls on it.
Annoying.
I took the shuttle to the hotel and checked in fine. By
the time
this happened it was a little too late to make it to the
Javits center
to check in today.
Worked a little on the code. Got the timing fix "mostly"
working.
Got a call from Horms, and went to dinner with him and his
buddies
from VA. Had a good discussion about where I want
heartbeat to go
and what he wants to do with it also. Ate an Aussie meat
pie.
It was pretty good. He said it was a little higher-class
pie
than you'd often get in Australia. Went home about 2300.
Got to
bed around 0000.
2001/01/31
=================================================================
0700 Really tired this morning.
Made it to the Javits Center about 0830 or
so. Talked to LWN staff at the speakers room. Got
registered
both as speaker and as Exhibitor. I did LOTS of
appointments today.
Stacey Quandt from Giga didn't show, but everyone else
did. I also
spoke to a freelance journalist who had very similar ideas
about
the "small" enterprise and what they need from HA. He had
heard
me speak when I was at Bell Labs in Naperville and dropped
by to see
me. Here was my agenda, which was mostly followed:
1000 Ben Rafanello and friends, IBM
1115 Stacey Quandt (no-show)
1200 D. H. Brown
1400 Jon Doyle & Compaq
1500 Dean Pannell
1500 Peter Badovinatz (IBM) at Developers' Den
1830 IBM Party. Spent a lot of time with Peter B (Wombat)
Learned some very interesting things from Peter, both from what he
said and from what
he didn't say. Glad I spent the time with him.
It was a long, busy, productive day, and I don't have much
voice left.
I'm going to have to be careful, or I won't have any voice
left for
my talk on Friday. I'll take some throat lozenges with me
tomorrow.
It seems to me that the show has been pretty good as far as
size and
people coming by. I also talked to Ted Ts'o about the
Lucent
winmodem debacle, and also with someone from IBM (Frank
Novak
fnovak@us.ibm.com) who will help ensure that Lucent does
the right
thing. I need to tell Ted about him. I also met Patrick
Martel
of MandrakeSoft. Dan Cox of Compaq told me to contact
Wayne Opland
about the HA disk (512) 432-8146.
2001/02/01
=================================================================
0630 Got up, pulled down email, finished the fix for the
timing bug. It
seems to work fine now. Wrote a reply to Markus and Jay
asking that
they tell me sooner rather than later if they have feedback
on how
I spend my time :-) Updated CVS with the timing fix.
Getting ready to go to the Javits Center. Maybe I'll have
a little time
to look around on the show floor today :-)
Spent about 20 minutes writing up the notes from the show
so far.
0810 Go to Javits Center. Bye for now ;-)
2345 I spent the whole day at the show, mostly talking to
potential
customers, suppliers, partners, etc. My appointments today
were
with Thomas Schaffner of Enterprise Linux, Mike McQuaid of
Winchester
Systems, and Peter Badovinatz of IBM. I talked to lots of
other
people though, including one person from Lawrence Berkeley
Laboratories who might be interested in having us provide
professional
services to help him deploy a high-availability web
server. I also
talked to Oracle about HA issues, SGI, and various other
people
whom I've forgotten. I did finally get out of the
booth for an hour
today to look around. Bought a book. Got a few goodies.
I worked with Joshua Uziel (uzi) to fix a byte-ordering bug
that the
findif.c code had. He packaged it up in a patch and mailed
it to
me.
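For the record, here is a generic illustration of the kind of thing a byte-ordering bug does to address arithmetic. It does not reproduce the actual findif.c bug or Uzi's patch, and the address and netmask are made-up examples:

```python
import socket
import struct

# inet_aton() returns the 4 address bytes in network (big-endian) order.
packed = socket.inet_aton("192.168.1.10")
netmask = struct.unpack("!I", socket.inet_aton("255.255.255.0"))[0]

# Correct: decode the bytes as big-endian before masking.
as_network = struct.unpack("!I", packed)[0]
network = socket.inet_ntoa(struct.pack("!I", as_network & netmask))

# Wrong: decode the same bytes as little-endian, then apply the same mask.
as_little = struct.unpack("<I", packed)[0]
wrong = socket.inet_ntoa(struct.pack("!I", as_little & netmask))

print(network)  # 192.168.1.0
print(wrong)    # 10.1.168.0
```

Code like this can appear to work on one architecture and fail on another, which is why such bugs tend to surface only when someone runs it on different hardware.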
Other people I talked to: Shane Painter of Dell (whom I
met in
Austin), Eric Lam of Coventive (interesting hardware
model), Nate
Perlstein of SGI (FailSafe support), Charlie Simpson of
Enterprise
Linux, and Satoshi Kawata of Red Hat Japan.
I stopped by the Mission Critical Linux folks and it sounds
like
they may end up using our open source test tool to help
test
their clusters. Right now they test everything by hand.
I bought my first meal since leaving home. Everything else
has
been freebies and a snack or two ;-)
I had a great conversation with our IBM liaison (Malcolm?).
It seems
that he didn't know that SuSE had any HA efforts. I
corrected
this misimpression. It was a really good thing I think.
It sounds like he may have me go meet some IBM folks.
Apparently Malcolm has good news regarding our relationship
with IBM.
Better go to bed now, and get up to work on my talk
tomorrow morning.
An aside: Apparently John Mehaffey mentioned us in one of
his talks. At least 2 or 3 people came by to see me as a
result.
I'll drop him a thank you note.
2001/02/02
=================================================================
Today I give my talk, and I return home.
0500 My stomach was a little unsettled, so I went ahead and
got up.
I need to reread my talk and see if I can/need to add
anything
regarding the various APIs to the talk. Get dressed, do a
little packing, etc.
0545 Begin rewriting talk to change emphasis to Linux-HA
APIs from
being a heartbeat talk.
0645 Began a runthrough of the talk. It took about 45
minutes. It should
fit in the time allotted. I'm a little worried about it
being a little
short.
0740 Start to pack up in earnest. Am tired already. Sad
state of affairs.
Better locate my Penguin mints for later ;-) Took a little
nap
before leaving.
0910 Time to pack up the laptop and leave for the
conference.
1950 Went over to the Javits center by cab. Arrived about
10 AM.
I run into Liz and Michael Hammell from the Linux Weekly
News.
It turns out that Liz is returning to Denver on the same
flight
I am. We make arrangements to share a ride to the airport.
Went by the booth. Talked for quite a while with Anas
about
clustering issues and then with Andreas Archangelli mainly
about
debugging tools. I'm glad he has a better attitude about
them than
Linus does. Maybe I ought to duplicate some of the "klog"
tools
for Linux. Wonder if Avaya would open source them? Maybe
I should
have Roger or someone send me some klog output (if he could
get
some easily) so I could show it to Andreas.
I went to go hear Dirk's talk - a little late. Dirk seems
well-prepared and has a good talk. My PalmPilot alarm goes
off
near the end. It's time to go check out the room I'll give
my talk
in, and run through a little of it. I discover I'm more
nervous
than I'd guess. I wonder if anyone much will show up for
the
last talk in the conference? One couple shows up 30
minutes early(!).
Others show up shortly afterwards. Doesn't sound like I
have much
to worry about. After a few minutes I sit down with the
people
who've come in and talk to them. It was nice - seems to
calm
down my nerves. I find that a guy from Bloomberg financial
services that I met before is here. He's a Russian (?)
guy.
I get his card. I'm supposed to send him a copy of the
slides
from today.
I don't know the routine here. Will someone introduce me?
When
should I start? About 2 minutes after, I decide that no
one will
introduce me, and I'll start my talk now. I don't find any
controls
for the lights, but someone in the audience tells me and I
get
the lights dimmed. By a few minutes into the talk there
are 40-50
people in the room. Nice turnout.
I get my first question. It's very confusing. It takes a
few
minutes to figure out what he wants to know. I'm about to
cut
the discussion off when I figure it out and answer it.
Now,
more questions come. I'm beginning to warm up, and my
sense
of humor takes off and the audience laughs. Now I'm having
fun,
have lots to say, and they ask lots of questions. The talk
finishes
at almost exactly the right time! It went very well. They
were a
good audience. [I agreed to put the slides up on the
Linux-HA site].
A fellow from LynuxWorks wants to talk to me. He's on the
mailing list (but I don't remember him too clearly). He
thinks they
might put some resources on the Linux-HA project. He tells
me
they are going to open up the Intel High-Availability forum
to
other people - he implies that he means people like me,
perhaps
me specifically. [I look up email from him later, and I
realize
that he's a fellow I accidentally insulted on the list. I
guess
he must have forgiven me]. Liz rings and says she wants to
say bye
to folks and will call me a little later.
I go to bag check to get my coat and bag, and go up to the
booth
to talk to folks before Liz calls again. I chat a bit, run
into a
guy from Conectiva. I get him some small SuSE souvenirs
for
himself and my friends at Conectiva (Marcelo, Olive and
Luis Claudio).
Olive runs SuSE on his machine ;-) My phone rings, and it's
Liz.
Time to go.
~1530 We get a limo and ride to the airport. It was a bit
more expensive
than I'd like, but it was starting to rain and lots of
people are
looking for rides, so we take it.
There's another fellow in the car with us, so we all chat.
Liz wants
to know about his company. He reads Linux Weekly News,
and seems
to have heard of heartbeat. So we all have something to
talk about.
We arrive at the airport, in plenty of time. All is well.
We
exchange travel horror stories. It seems Liz has a bit of
a
travel problem phobia, and has had a few experiences to
match.
She's going to go to talk at LinuxWorld Expo in Singapore.
She agrees to give me a ride home (it's not far out of the
way).
The bus is fine, but being dropped off at home is nicer. I
realize
that I left my Minidisc player with Stephen Ing. Oops!
Liz also
says that the LWCE audiences rate the speakers on a 5-point
scale. I
wonder how my talk was rated?
~1810 We load up on the plane. After we're enroute, the
pilot thinks
we'll be in Denver 30 minutes early. He seems skeptical of
his
flight computer ;-) So am I. I nap until they turn off
the
seat belts sign. They bring dinner. It's not too bad.
The movie
comes on, and I dig out my laptop for this report. It took
me
20 minutes or so to write up the part after 0910.
2033 I switch my watch to Denver time. Now it's 1834 ;-)
1836 I decide to write Stephen an email, along with one to
the Russian
fellow, and the one I need to send Ted Ts'o. If I feel
like it,
I'll try and catch up on the email from the list as well.
I'll
send John Mehaffey a note of thanks too. I added Brian's,
Alexender's, and John Mehaffey's info to my address book.
1917 I sent those emails. Now I'll try and catch up on
other email.
I applied Uzi's byte ordering patch. I'll try it when I
get home
and have a network. I also need to send email to Rudy
Pawul about
or the Enterprise Linux people.
2019 I got rid of around 100 emails, and replied to many.
I've got about
another half-hour to go on the flight. Guess I'd better
figure out
how/when to finish up. Still need to email to/about Rudy.
2025 It's getting rough up here. Better shut down and put
up the laptop.
Bye :-)
I had a most pleasant return trip with Liz and her family.
They
very kindly just dropped me off at home.
2001/02/03
=================================================================
0800 Downloaded, read and replied to a little email. About
an hour I
suppose. Wife and I both tired, cranky :-(
I tried to grab email mid-afternoon. DSL down :-(
Got it back up in about a half-hour of time with Qwest.
Very tired after the show. Zzzz.
2100 Read, replied to more mail. Updated main and
commercial pages
on linux-ha web site. Thought some more about the upshot
from my talk. There is a lot of interest in HA things, and
in
particular I MUST split out the core code from the cluster
manager code. This has to be a near-term development
priority.
Users want it, Anas needs it, others too... It just
becomes
way more useful that way. I believe the development
especially
from others is blocked because of this. I VERY MUCH need
to update the TODO list. It's WAY out of date.
Another thing to add to the TODO list: Make the
configuration
code plug-in modules, too...
2210 G'night. It's 0010 East Coast time now. No wonder I'm
tired.
I need to update my personal todo list from this journal
next week.
I'll send this out to my loyal readers ;-)