2001/01/28 =================================================================
0350 Had trouble sleeping. Prayed for a few people. Got up. The "restart" test case, when only one machine is up, looks like it had an obvious bug. It said:
    if node == self.CM.OurNode: pat = self.uspat
But it should have said:
    if node == self.CM.OurNode or self.CM.upcount() < 1: pat = self.uspat
instead. I applied the fix on "servidor". Looks like it was having some problems with X11 forwarding too. I changed the rsh command to suppress forwarding X11 ports (since I don't need them). Looks like there's also a bug in the Stonith test such that it doesn't look for the right patterns if the other node is down. Another wee bug, this one slightly more subtle:
    if (self.CM.upcount() == 1 and self.CM.ShouldBeStatus[node] == self.CM["up"]):
Should have simply been:
    if (self.CM.upcount() == 1):
I decided the logging should be part of the CtsLab class. So, it's now in the (as of yet not used) CtsLab class. Another wee bug, this time in the Stonith code:
    if (self.CM.upcount() == 1):
should have been
    if (self.CM.upcount() <= 1):
It's fixed now.
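To make the effect of the two boundary fixes above concrete, here is a minimal sketch. It is not the CTS code: FakeCM, restart_pattern, stonith_single_node_case and the name thempat are hypothetical stand-ins, and it assumes that upcount() returns the number of nodes currently believed to be up.

```python
class FakeCM:
    """Hypothetical stand-in for the cluster manager object (self.CM)."""

    def __init__(self, our_node, status):
        self.OurNode = our_node
        self.status = dict(status)  # node name -> "up" or "down"

    def upcount(self):
        # Assumption: upcount() is the number of nodes currently up.
        return sum(1 for state in self.status.values() if state == "up")


def restart_pattern(cm, node, uspat, thempat):
    """Pick the log pattern to watch for, using the corrected condition."""
    # The original test only checked node == cm.OurNode, so it chose the
    # wrong pattern when no node at all was up.
    if node == cm.OurNode or cm.upcount() < 1:
        return uspat
    return thempat


def stonith_single_node_case(cm):
    """Corrected Stonith check: '<= 1' also covers zero nodes up."""
    return cm.upcount() <= 1


all_down = FakeCM("sgi1", {"sgi1": "down", "sgi2": "down"})
one_up = FakeCM("sgi1", {"sgi1": "up", "sgi2": "down"})

print(restart_pattern(all_down, "sgi2", "US", "THEM"))
print(restart_pattern(one_up, "sgi2", "US", "THEM"))
print(stonith_single_node_case(all_down))
```

The point of the sketch is only that both bugs sit at the same boundary: the case where the count of live nodes drops to zero.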
0615 Sent out CtsLab code to list. Probably ought to take a nap before church ;-)
Somewhere along in here I spent an hour or two packing.
1615 All 500 tests ran successfully. Definitely fixed the bug, since I ran it with the same random number set... Some other minor bugs having to do with reporting at the end were introduced. I think I fixed them.
1715 Continuing to write the Lab class. Break time.
2000 Break over. Hope to get the lab class integrated and working tonight.
2300 Looks like they're working together fine. Better quit while I'm ahead after backing things up ;-). G'night.
2001/01/29 =================================================================
0700 Today should be mostly a packing and preparing to leave day. I have some loose ends to take care of before I leave. Today I have a car, but of course it's snowing pretty nicely outside ;-)
However, I'm going to have to look at the heartbeat code anyway, because it looks like I triggered a bug in it with the tests. I guess that's what testing is supposed to do ;-) The test code "hit the jackpot": "Both machines own foreign resources". The evidence should be in the logs. I'll see. The error occurred about 3 hours into the test run.
The problem is caused by the machine which had just come up (sgi2) failing to hear any heartbeats from the machine which was up all along. Perhaps this is caused by a piece of the code in the takeover sequence which waits for the takeover to complete, hence keeping packets from being sent out.
Another possibility would be that it is a problem in the receiving code startup. This sounds more likely. Perhaps the startup code should be more synchronous. This is what the timing looks like:
    Jan 29 02:03:56 sgi2 Starting heartbeat 0.4.8k
    Jan 29 02:04:00 sgi2 UDP heartbeat started
    Jan 29 02:04:01 sgi2 WARN: node sgi1: is dead
OK, that's technically our deadtime (5 seconds), but we didn't give the other guy much of a chance to give us a heartbeat, because we had not been up very long. With a heartbeat interval of only 1 second, this is almost impossible. Under heavy load or with a 3 second dead time, I could imagine this being much worse. I think I remember wondering if this could happen before.
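As a quick worked check of the three log timestamps in this entry, the arithmetic can be done like this. The year comes from the journal date; treating the "UDP heartbeat started" line as the moment sgi2 could first receive packets is my assumption, not something the log states.

```python
from datetime import datetime

# Syslog timestamps carry no year; 2001 is taken from the journal date.
YEAR = "2001"
FMT = "%Y %b %d %H:%M:%S"


def ts(stamp):
    """Parse a syslog-style timestamp such as 'Jan 29 02:03:56'."""
    return datetime.strptime(f"{YEAR} {stamp}", FMT)


started = ts("Jan 29 02:03:56")        # "Starting heartbeat 0.4.8k"
listening = ts("Jan 29 02:04:00")      # "UDP heartbeat started" (assumed: receive side ready)
declared_dead = ts("Jan 29 02:04:01")  # "WARN: node sgi1: is dead"

since_start = (declared_dead - started).total_seconds()
since_listening = (declared_dead - listening).total_seconds()

print(since_start)      # 5.0 seconds: exactly the configured deadtime
print(since_listening)  # 1.0 second: the real window for hearing sgi1
```

So the full 5 second deadtime elapsed on the clock, but under that assumption only about 1 second of it was spent in a state where a heartbeat from sgi1 could have been heard.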
Sounds like we should start the timing of "dead" time from the moment we receive an ack that all of the child read/write processes are up and running. I guess that means that the code needs to send such ACKs and that the heartbeat core timing logic needs to track them and modify its idea of the "epoch" accordingly.
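The idea in this paragraph can be sketched roughly as follows. This is only an illustration of the proposed timing rule, not heartbeat's actual code or data structures: the DeadTimer class, its method names, and the child names "udp-read" and "udp-write" are all hypothetical, and times are plain seconds since startup.

```python
class DeadTimer:
    """Hypothetical sketch: time deadness from when all children have ACKed."""

    def __init__(self, deadtime, start_time, expected_children):
        self.deadtime = deadtime
        self.epoch = start_time             # provisional until all ACKs arrive
        self.pending = set(expected_children)
        self.last_heard = {}                # node name -> time of last heartbeat

    def child_ack(self, child, now):
        """Record that a child read/write process reports it is running."""
        if child in self.pending:
            self.pending.remove(child)
            if not self.pending:
                # Everything is up: restart the dead-time clock from here.
                self.epoch = now

    def heard_from(self, node, now):
        self.last_heard[node] = now

    def is_dead(self, node, now):
        if self.pending:
            # We cannot hear anyone yet, so we cannot call anyone dead.
            return False
        reference = max(self.last_heard.get(node, self.epoch), self.epoch)
        return now - reference >= self.deadtime


# Replay of the logged timeline: start at t=0, children ready at t=4,
# deadtime of 5 seconds, no heartbeat heard from sgi1.
timer = DeadTimer(deadtime=5.0, start_time=0.0,
                  expected_children=["udp-read", "udp-write"])
timer.child_ack("udp-read", 4.0)
timer.child_ack("udp-write", 4.0)
print(timer.is_dead("sgi1", 5.0))  # False: only 1 second of real listening
print(timer.is_dead("sgi1", 9.0))  # True: a full deadtime with no heartbeat
```

With the naive rule (count from process start), the check at t=5 would already report sgi1 dead, which matches the log. Moving the epoch to the last ACK gives the other node a full deadtime of actual listening before it is declared dead.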
I guess this is great progress! I've moved from debugging the test tool to debugging the thing it's testing! Now I just need to think carefully about how to fix this bug in heartbeat ;-)
0800 I sent out last week's journal, and saved a similar email as a template to make sending it out in the future easier. I'm going to go finish packing now, and come back to the bug later.
1000 Finally finished packing! Now to run errands and do all the other things I need to do before leaving town for a few days.
1345 Got home and am doing a little more cleaning up, reading email, etc.
1405 Gotta go get Laura from Mandalay (work).
1430 Went to go see the builder of our house and try and straighten out some things in how the house is put together.
2000 Checking email, printing off schedule document. Need to stop this to order some Orinoco cards, and go to bed... (to about 2100). Finally got to bed around 0000.
2001/01/30 =================================================================
0430 Today I leave for LWCE/NYC. Expect less detail in the subsequent entries, since I'll spend most of my time away from my laptop. Better pack up the laptop, etc ;-) Got everything packed and made the plane on time, etc. Trip went without incident. Coded a little fix to the timing bug I discovered with CTS. Watched the movie. It was "Remember the Titans". I recommend it highly. Arrived in NYC a little later than planned. Spent about a half-hour trying to get my cell phone to work in NYC. It was a pain: my vendor needed to take some special security precautions to keep my NAM, etc. from being stolen and to keep someone from making calls on it. Annoying.
I took the shuttle to the hotel and checked in fine. By the time this happened it was a little too late to make it to the Javits center to check in today. Worked a little on the code. Got the timing fix "mostly" working. Got a call from Horms, and went to dinner with him and his buddies from VA. Had a good discussion about where I want heartbeat to go and what he wants to do with it also. Ate an Aussie meat pie. It was pretty good. He said it was a little higher-class pie than you'd often get in Australia. Went home about 2300. Got to bed around 0000.
2001/01/31 =================================================================
0700 Really tired this morning. Made it to the Javits Center about 0830 or so. Talked to LWN staff at the speakers room. Got registered both as speaker and as Exhibitor. I did LOTS of appointments today. Stacey Quandt from Giga didn't show, but everyone else did. I also spoke to a freelance journalist who had very similar ideas about the "small" enterprise and what they need from HA. He had heard me speak when I was at Bell Labs in Naperville and dropped by to see me. Here was my agenda, which was mostly followed:
    1000 Ben Rafanello and friends, IBM
    1115 Stacey Quandt (no-show)
    1200 D. H. Brown
    1400 Jon Doyle & Compaq
    1500 Dean Pannell
    1500 Peter Badovinatz (IBM) at Developers' Den
    1830 IBM Party
Spent a lot of time with Peter B (Wombat). Learned some very interesting things from Peter, what he said, and what he didn't say. Glad I spent the time with him.
It was a long, busy, productive day, and I don't have much voice left. I'm going to have to be careful, or I won't have any voice left for my talk on Friday. I'll take some throat lozenges with me tomorrow. It seems to me that the show has been pretty good as far as size and people coming by. I also talked to Ted Ts'o about the Lucent winmodem debacle, and also with someone from IBM (Frank Novak email@example.com) who will help ensure that Lucent does the right thing. I need to tell Ted about him. I also met Patrick Martel of MandrakeSoft. Dan Cox of Compaq told me to contact Wayne Opland about the HA disk (512) 432-8146.
2001/02/01 =================================================================
0630 Got up, pulled down email, finished the fix for the timing bug. It seems to work fine now. Wrote a reply to Markus and Jay asking that they tell me sooner rather than later if they have feedback on how I spend my time :-) Updated CVS with the timing fix. Getting ready to go to the Javits Center. Maybe I'll have a little time to look around on the show floor today :-) Spent about 20 minutes writing up the notes from the show so far.
0810 Go to Javits Center. Bye for now ;-)
2345 I spent the whole day at the show, mostly talking to potential customers, suppliers, partners, etc. My appointments today were with Thomas Schaffner of Enterprise Linux, Mike McQuaid of Winchester Systems, and Peter Badovinatz of IBM. I talked to lots of other people though, including one person from Lawrence Berkeley Laboratories who might be interested in having us provide professional services to help him deploy a high-availability web server. I also talked to Oracle about HA issues, SGI, and various other people whom I've forgotten. I did finally get out of the booth for an hour today to look around. Bought a book. Got a few goodies. I worked with Joshua Uziel (uzi) to fix a byte-ordering bug that the findif.c code had. He packaged it up in a patch and mailed it to me. Other people I talked to: Shane Painter of Dell (whom I met in Austin), Eric Lam of Coventive (interesting hardware model), Nate Perlstein of SGI (FailSafe support), Charlie Simpson of Enterprise Linux, and Satoshi Kawata of Red Hat Japan.
I stopped by the Mission Critical Linux folks and it sounds like they may end up using our open source test tool to help test their clusters. Right now they test everything by hand.
I bought my first meal since leaving home. Everything else has been freebies and a snack or two ;-)
I had a great conversation with our IBM liaison (Malcolm?). It seems that he didn't know that SuSE had any HA efforts. I corrected this misimpression. It was a really good thing I think. It sounds like he may have me go meet some IBM folks. Apparently Malcolm has good news regarding our relationship with IBM. Better go to bed now, and get up to work on my talk tomorrow morning.
An aside: Apparently John Mehaffey mentioned us in one of his talks. At least 2 or 3 people came by to see me as a result. I'll drop him a thank you note.
2001/02/02 =================================================================
Today I give my talk, and I return home.
0500 My stomach was a little unsettled, so I went ahead and got up. I need to reread my talk and see if I can/need to add anything regarding the various APIs to the talk. Get dressed, do a little packing, etc.
0545 Begin rewriting talk to change emphasis to Linux-HA APIs from being a heartbeat talk.
0645 Began a run-through of the talk. It took about 45 minutes. It should fit in the time allotted. I'm a little worried about it being a little short.
0740 Start to pack up in earnest. Am tired already. Sad state of affairs. Better locate my Penguin mints for later ;-) Took a little nap before leaving.
0910 Time to pack up the laptop and leave for the conference.
1950 Went over to the Javits center by cab. Arrived about 10 AM. I run into Liz and Michael Hammell from the Linux Weekly News. It turns out that Liz is returning to Denver on the same flight I am. We make arrangements to share a ride to the airport.
Went by the booth. Talked for quite a while with Anas about clustering issues and then with Andreas Archangelli mainly about debugging tools. I'm glad he has a better attitude about them than Linus does. Maybe I ought to duplicate some of the "klog" tools for Linux. Wonder if Avaya would open source them? Maybe I should have Roger or someone send me some klog output (if he could get some easily) so I could show it to Andreas.
I went to go hear Dirk's talk - a little late. Dirk seems well-prepared and has a good talk. My PalmPilot alarm goes off near the end. It's time to go check out the room I'll give my talk in, and run through a little of it. I discover I'm more nervous than I'd guess. I wonder if anyone much will show up for the last talk in the conference? One couple shows up 30 minutes early(!). Others show up shortly afterwards. Doesn't sound like I have much to worry about. After a few minutes I sit down with the people who've come in and talk to them. It was nice - seems to calm down my nerves. I find that a guy from Bloomberg financial services that I met before is here. He's a Russian (?) guy. I get his card. I'm supposed to send him a copy of the slides from today. I don't know the routine here. Will someone introduce me? When should I start? About 2 minutes after, I decide that no one will introduce me, and I'll start my talk now. I don't find any controls for the lights, but someone in the audience tells me and I get the lights dimmed. By a few minutes into the talk there are 40-50 people in the room. Nice turnout.
I get my first question. It's very confusing. It takes a few minutes to figure out what he wants to know. I'm about to cut the discussion off when I figure it out and answer it. Now, more questions come. I'm beginning to warm up, and my sense of humor takes off and the audience laughs. Now I'm having fun, have lots to say, and they ask lots of questions. The talk finishes at almost exactly the right time! It went very well. They were a good audience. [I agreed to put the slides up on the Linux-HA site].
A fellow from LynuxWorks wants to talk to me. He's on the mailing list (but I don't remember him too clearly). He thinks they might put some resources on the Linux-HA project. He tells me they are going to open up the Intel High-Availability forum to other people - he implies that he means people like me, perhaps me specifically. [I look up email from him later, and I realize that he's a fellow I accidentally insulted on the list. I guess he must have forgiven me]. Liz rings and says she wants to say bye to folks and will call me a little later.
I go to bag check to get my coat, and bag and go up to the booth to talk to folks before Liz calls again. I chat a bit, run into a guy from Conectiva. I get him some small SuSE souvenirs for himself and my friends at Conectiva (Marcelo, Olive and Luis Claudio). Olive runs SuSE on his machine ;-) My phone rings, and it's Liz. Time to go.
~1530 We get a limo and ride to the airport. It was a bit more expensive than I'd like, but it was starting to rain and lots of people are looking for rides, so we take it.
There's another fellow in the car with us, so we all chat. Liz wants to know about his company. He reads Linux Weekly News, and seems to have heard of heartbeat. So we all have something to talk about.
We arrive at the airport, in plenty of time. All is well. We exchange travel horror stories. It seems Liz has a bit of a travel problem phobia, and has had a few experiences to match. She's going to go to talk at LinuxWorld Expo in Singapore. She agrees to give me a ride home (it's not far out of the way). The bus is fine, but being dropped off at home is nicer. I realize that I left my Minidisc player with Stephen Ing. Oops! Liz also says that the LWCE audiences rate the speakers on a 5-point scale. I wonder how my talk was rated?
~1810 We load up on the plane. After we're enroute, the pilot thinks we'll be in Denver 30 minutes early. He seems skeptical of his flight computer ;-) So am I. I nap until they turn off the seat belts sign. They bring dinner. It's not too bad. The movie comes on, and I dig out my laptop for this report. It took me 20 minutes or so to write up the part after 0910.
2033 I switch my watch to Denver time. Now it's 1834 ;-)
1836 I decide to write Stephen an email, along with one to the Russian fellow, and the one I need to send Ted Ts'o. If I feel like it, I'll try and catch up on the email from the list as well. I'll send John Mehaffey a note of thanks too. I added Brian's, Alexender's, and John Mehaffey's info to my address book.
1917 I sent those emails. Now I'll try and catch up on other email. I applied Uzi's byte ordering patch. I'll try it when I get home and have a network. I also need to send email to Rudy Pawul about or the Enterprise Linux people.
2019 I got rid of around 100 emails, and replied to many. I've got about another half-hour to go on the flight. Guess I'd better figure out how/when to finish up. Still need to email to/about Rudy.
2025 It's getting rough up here. Better shut down and put up the laptop. Bye :-)
I had a most pleasant return trip with Liz and her family. They very kindly just dropped me off at home.
2001/02/03 =================================================================
0800 Downloaded, read and replied to a little email. About an hour I suppose. Wife and I both tired, cranky :-( I tried to grab email mid-afternoon. DSL down :-( Got it back up in about a half-hour of time with Qwest. Very tired after the show. Zzzz.
2100 Read, replied to more mail. Updated main and commercial pages on linux-ha web site. Thought some more about the upshot from my talk. There is a lot of interest in HA things, and in particular I MUST split out the core code from the cluster manager code. This has to be a near-term development priority. Users want it, Anas needs it, others too... It just becomes way more useful that way. I believe the development especially from others is blocked because of this. I VERY MUCH need to update the TODO list. It's WAY out of date. Another thing to add to the TODO list: Make the configuration code plug-in modules, too...
2210 G'night. It's 0010 East Coast time now. No wonder I'm tired. I need to update my personal todo list from this journal next week. I'll send this out to my loyal readers ;-)