Older blog entries for alanr (starting at number 6)

I've been testing some Weird stuff. Got a bug report from a guy from Dell that it dies after a day - right about the time the code decides to print all the processes' memory stats. He says it happens every time.

He's running with stonith enabled (which I hardly ever do, it's hard on the hardware). I can reproduce his problem quite readily - I set the memory stats interval down to 5 minutes, and boom: every 5 minutes it dies.

It needs stonith for it to fail. I removed the stonith option and it didn't fail.

I can also make it happen when I send it a SIGUSR2.

It looks like if I send a SIGUSR2 to the highest process id, then it prints out the memory stats and dies -- or just dies...

I need to look at the config file when it comes back up... I had 8 processes, one of them a zombie on sgi2. I'm not sure how many the config wanted...

The config had two links: one serial and one udp. The processes we create are:

	control process (parent process: runs last)
	write process
	read process
	write process
	read process
	master status process

But, I had 8 processes at that time...

There seems to be something wrong here ;-)

Here's the pids from the logs:

310	Prints "configuration validated"
312	prints "udp heartbeat started on..."
320	Still running... Locked...
321	Still running...
322	Still running... Locked...
322	Still running... Locked...
323	Still running... NOT Locked...
324	prints local status now set to... and Heartbeat restart
	prints "link sgi2:eth0 up"  Still running LOCKED
	prints "resource acquisition completed (none)"
	prints "Link sgi1:eth0 dead"
	prints "mach_down takeover complete"

644 defunct... prints "resource acquisition completed"

697	Control process...
697	Also prints heartbeat restart on... (?)
697	Also prints "link sgi2:eth0 up"
684	Control process
	prints "starting serial heartbeat on ..."
693	HBWRITE
694	HBREAD
696	HBWRITE
696	HBREAD
697	writes all the messages ;-) Master status process.
	prints and Heartbeat restart on... but only for local node

Found a fork in initiate_reset() with no exit at the end... and in an error leg in req_our_resources(), and giveup_resources()

This appears to have been the bug. After replacing the implicit return with an explicit exit, and doing so in a couple of other places in some funky error legs, I can't reproduce the problem any more.
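The failure mode is easy to show in miniature. Heartbeat itself is written in C, so what follows is only a Python sketch of the same pattern, and run_in_child() is a made-up helper name rather than anything in the heartbeat source. The point is that the child side of a fork must leave through an explicit exit; if it simply returns, it carries on running the parent's code as an extra copy of the daemon.

```python
import os


def run_in_child(task):
    """Fork, run task() in the child, and return the child's exit code.

    Hypothetical helper for illustration only.
    """
    pid = os.fork()
    if pid == 0:
        # Child side. Whatever happens, leave through os._exit() so the
        # child never falls back into the caller's (the parent's) code.
        status = 0
        try:
            task()
        except Exception:
            status = 1
        finally:
            os._exit(status)
    # Parent side: reap the child so it does not linger as a zombie.
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status)
```

Without the os._exit() call the child would return from run_in_child() just as the parent does, which is one way to end up with more processes than the config asked for.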

I had also gotten a bug report that the multicast option-parsing code didn't work. I had "broken" it by fixing the ppp-udp code. However, my change was correct, and the multicast parsing code was incorrect. So, I fixed the multicast parsing code. In the process I discovered that even with the bug fix in, it still didn't work, because the install process didn't install the mcast code. So, now I have both the mcast code working and this bizarre Stonith bug fixed. I've been running this test configuration with multicast (which I'd never tested before), and the stonith fix (but with stonith turned off, because I suspect that the test code won't deal well with the machines getting rebooted each time they leave the cluster). Guess I ought to run a hundred iterations or so of that (and fix the test code if it's broken). Robert_Macaulay@Dell.com (the original bug reporter) is currently setting it up for testing on his machines. It seems pretty likely that it'll work just fine for him.

I ran 1000 iterations of the test code. The final results are:

2001/03/11_14:23:38 Running test Restart [1000]
2001/03/11_14:24:26 Stopping Cluster Manager on all nodes
2001/03/11_14:24:31 ****************
2001/03/11_14:24:31 Overall Results:{'BadNews': 0, 'success': 1000, 'failure': 0}
2001/03/11_14:24:31 ****************
2001/03/11_14:24:31 Detailed Results
2001/03/11_14:24:31 Test Restart:{'success': 524, 'WasStopped': 156, 'node:sgi1': 253, 'calls': 524, 'node:sgi2': 271, 'skipped': 0, 'failure': 0, 'auditfail':0}
2001/03/11_14:24:31 Test flip:{'down->up': 160, 'up->down': 316, 'success': 476, 'started': 160, 'calls': 476, 'stopped': 316, 'skipped': 0, 'failure': 0, 'auditfail': 0}
2001/03/11_14:24:31 <<<<<<<<<<<<<<<< TESTS COMPLETED

435.94user 75.19system 13:28:48elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (6605568major+2597838minor)pagefaults 0swaps

This is great news. I need to run another set of 1000, and then some other tests (probably involving the stonith_host option), and then we'll declare it stable I think. Many thanks to Aaron Nienhuis and Robert Macaulay for finding these bugs and saving our users from finding them in a "stable" release.

2001/02/04 (Sunday) ==========================================================
Spent an hour or more unpacking from the trip, spread across various times of the day. This isn't quite so time-consuming as packing, fortunately ;-)

1130 Am reconfiguring various computers in the network today. Need to move the KVM switch down to the basement and put it on the ones down there. Hooked up one of the machines. Need 2 more cables to do them all. Better go get them, I guess ;-) I converted my talk to HTML and pushed it and the StarOffice source both out to the linux-HA site. Did a little more general updating of the site. Much more still needs to be done, unfortunately. 1230 This stuff took about an hour or so.

2100 Writing a script to update the lab machines with RPMS automagically. 2145 Done. Gonna try and get the lab machines updated and a set of tests started to see if the fix I made in NYC works. 2200 Tests started successfully. Bye for now.

2001/02/05 =================================================================
0645 Not much mail came in last night. Better check my fetchmail. Last night's test run of 500 iterations completed successfully at 0630. So, the fix I made on the plane/hotel room during the LWCE works. Good timing ;-) I got some more mail. Fetchmail must be working OK. All 500 iterations succeeded. I also cut down the standard failover time to 3 seconds. Need to update the changelog. The viewgraphs I put up on the web point at the home page, which points at Red Hat software, because it's out of date. I guess I'd better fix that as soon as I fix the ChangeLog.

0730 OK. Fixed the changelog. Time to release the development version 'k'. Need to find my freshmeat password, so I can follow the "official" procedures I documented on the web. Better go get my PalmPilot. Freshmeat has changed a lot since I was there last. I need to announce it on freshmeat and on the lists. Found it. I need to change lots of stuff to conform to the way freshmeat is now set up. This may take a while. I guess the freshmeat II rewrite is pretty new (last of Jan), so I would have had to do this, and now wasn't too bad a time.

0810 Now I need to announce it to the mailing lists ;-) Discovered a few minor glitches on the web page. Netscape crashed :-(

0832 Got the release notices out to SuSE internal sources and the various external HA mailing lists. I keep a little whiteboard with my near-term TODO list on it. Here's what's on it right now:

	Fix "sitemap" program
	Read CVS book
	Update TODO list on web
	Post Talk VGs (already did that)
	Test New version (already did that)
	Update home page
	Work more on test scripts
	Move disk drive to "servidor"
	Email: OSCAR folks
	Baytech weirdness
	Release Unstable version (already did that)

I'll update the board. Fortunately it's easy ;-) Done.

0838 I guess my next priority ought to be to fix the home page, since potential SuSE customers will be reading it, and it says I work for Lucent and I recommend Red Hat. It's more than a year out of date - a bit embarrassing :-( OK. Updated the home page (a little). That didn't take long. Now I'll tackle the TODO list. Then either sitemap or some CVS book reading.

0852 Done. On to the todo list... Dropped a note to the Linux Weekly folks about their poor choice in names since it conflicts with LWN - my favorite Linux publication ;-)

0955 Finished the TODO list, and announced it. Hmmm... What next? Guess the CVS knowledge is pretty sorely needed at this point. I'll go read for a while. I need to know about how/when/why to set up CVS branches. I need to add some for linux-ha, I think... Short-term todo list looks much nicer now ;-) Of course, to understand this, I have to know something about CVS tags, too ;-). I also wrote a little script which tags my CVS tree with a tag derived mechanically from the release number.
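The journal doesn't show the tagging script, so this is only a guess at its core step, written in Python. CVS tag names have to start with a letter and can't contain dots, so a release number like 0.4.8k needs some mechanical rewriting first. The function name and the REL_ prefix are assumptions made for the example.

```python
import re


def release_to_cvs_tag(release, prefix="REL_"):
    """Turn a release number such as '0.4.8k' into a legal CVS tag.

    CVS tags must begin with a letter and may contain only letters,
    digits, '-' and '_'. The prefix is a made-up convention.
    """
    tag = prefix + re.sub(r"[^A-Za-z0-9_-]", "_", release)
    if not re.match(r"[A-Za-z]", tag):
        raise ValueError("CVS tags must start with a letter: %r" % tag)
    return tag
```

The result would then be handed to something like "cvs tag REL_0_4_8k" from the top of the checked-out tree.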

Got some question email about heartbeat - answered. Got some question email about AutoMake/Build - answered. Got a suggestion about the ToDo list. Incorporated it. Ate lunch. Took about 10 minutes.

1140 Now on to "sitemap"... What are the symptoms?

	Directories with index.html in them aren't made into links.
	The directory LWCE-NYC-2001 is omitted from the directory name displayed. The links under it are all fine.
	Sorting should be case-insensitive.
	It treats some files as directories.

Perhaps the dirname() function is screwed up? Seems so. Sorting is still case-sensitive. Oh. It's doing perl sorting. Fixed it. Fixed file sorting, too. Somehow we're not picking up the title, etc. from some pages. $Title and $X-Meta-Description are missing from them... Something appears to be wrong/changed with HTML::HeadParser. It isn't always returning the info to us... It seems to have something to do with the DTD line netscape puts in. It doesn't like it. I need to remove it. Sigh... 25-30 page edits later...

1340 Got them all removed. The index looks much better, but still isn't quite right... Sorting is still off... Of course, all the modification times are all wrong :-( I should have tried updating to a newer version of the Perl packages.

1355 Fixed sorting. Now I know why I was avoiding this ;-) Site map all better now.
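The sitemap program is Perl, and the actual fix isn't shown here. For the record, this is the idea of a case-insensitive sort as a small Python sketch; sort_names() is a hypothetical name. It compares on a lowercased key and uses the original spelling only to break ties.

```python
def sort_names(names):
    """Sort file and directory names without regard to case.

    The lowercased name is the primary key; the original name breaks
    ties so the order is deterministic.
    """
    return sorted(names, key=lambda name: (name.lower(), name))
```

A default sort compares raw character codes, so every name starting with a capital letter (LWCE-NYC-2001, say) lands ahead of all the lowercase ones.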

I'm worried about getting DSL service when I move. It seems that there will be a 2-week delay after moving in and getting a basic phone line installed. This would mean I'd have to use dialup for about 2 weeks :-(

1425 OK... Back to working on CTS... I think I'll add the "monitor" function to IPaddr next. The basic thing is to "ping" the address.

1450 Done. Committed to CVS. Now change the code to actually use it in the audits...

1520 It looks like the tests should be pinging the node to make sure it's really serving the IP address as we go along. And, we should be verifying that all resources in a group are being served by the same node. Oh... Except I haven't put the latest version of the code on the test cluster which means it ought to be failing (!?) OOPS. It wasn't actually being called. The if-condition was too complex. It's a little simpler now, and now it fails like it ought to ;-) I distributed the new IPaddr script to the lab machines. It seems to work now. I'll restore a little of the debug logging to make sure... Yep. It's working.
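The "same node" audit could look something like the sketch below. This is not the CTS code: the function name, the shape of the groups dictionary, and the owner_of callback are all assumptions made for the example. In a real audit owner_of would have to ask the cluster (for instance by pinging the address and seeing who answers).

```python
def audit_group_placement(groups, owner_of):
    """Check that every resource in a group is served by one node.

    groups   -- dict mapping a group name to a list of resource names
    owner_of -- callable returning the node serving a resource, or
                None if nobody is serving it (hypothetical interface)

    Returns a list of (group, owners) pairs for groups that fail.
    """
    failures = []
    for group, resources in groups.items():
        owners = {owner_of(resource) for resource in resources}
        # Exactly one owner, and that owner must be a real node.
        if len(owners) != 1 or None in owners:
            failures.append((group, sorted(str(o) for o in owners)))
    return failures
```

An empty list means the audit passed; anything else names the group that is split across nodes or not being served at all.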

1600 I'll start a series of tests running. They take around 8 hours IIRC. I need to check mail before quitting for the evening. Not a lot of mail. Only 7 new emails. Martin Konold pointed out I forgot to mention the download URL.

1610 It's now corrected both inside SuSE and outside. Time to quit for the evening.

2100 My freshmeat entry got thrown away. I'll need to resubmit it and the information on the main branch. Sigh... Got an email from Volker. It needed a reply. I sent out a Call for Refinements for heartbeat. I sent out a wish list for what apps people want to make HA. I bought two more KVM cables. Now I can hook all the machines up to the switch. Wired up another computer to the KVM switch. I'd wire up the other two except I need to wait for the tests to finish. Speaking of tests, 380 or so have already succeeded. Only 120 more to go ;-) Ted Ts'o sent me mail about the Lucent winmodem problems, next chapter. I sent him a brief reply. Sigh...

2210 Bye for now. 395 tests run so far.

2001/02/06 =================================================================
0620 All 500 tests completed successfully. Looks like my mails to the list and to Volker have generated some responses. It'll take a while to go through them. Most of the responses were pretty much what was expected. But, I'll update the ToDo list with a couple of them anyway. Composed an email to send Volker and Markus about staffing. Responded to more email. And more email, and more email.

0900 Time to process more email. Some from Lars, some from the ha list, some from others. Need to check the web stats and see how many downloads have occurred of the new code, but it's probably too soon to see them in the reports yet.

1000 Time to finish hooking the cluster up to the KVM switch. Done. Now, what next? I'm getting pretty close to being happy with the test environment as it stands now. But, I still need the "environment" dimension. I guess that's a good next step. Also add "quorum" to the ClusterManager class. HasQuorum() added. Looks like it works.

Let's see if I can remember all the things we still need to add to the test code. I'll go reread the email on it... The main thing remaining was Scenarios. The idea behind Scenarios is that we might run the tests under a particular set of conditions: what kind of resources are configured, or what kind of workload is running, either driven from the test machine or running on the cluster machines themselves.

Need to drop Lars an email about the state of the test tool and the HasQuorum member functions. On second thought, I'll save that until I'm done. Otherwise too much time is lost. Now on the "Scenario" concept... Worked on it a while, went to lunch (took nearly an hour today - getting out of the house was wonderful - a good break from the more-usual 10 minutes)

1225 Got back, got some detailed mail from SGI about CTS. Am writing a detailed response. This is taking a while.

1345 Finished. Now back to the scenarios...

1435 I now have the code for a basic, robust StartUp scenario. Wonder if it works? ;-)

1530 It seems to work now. It's also integrated into the RandomTest class. Hmmm... It seems the Quorum changes didn't all make it into CVS. I'm putting them back in.

1600 Bye for now.

1915 Got lots of email to respond to. Looks like some folks at HP may want to use heartbeat in a product.

2023 Bye for now.

2055 I just can't seem to stay away. More email responses (~15 mins). Back to the home network configuration ;-) Got an emergency request to make more free space on some FAT partitions. I'm doing that now in the "background". Looks like 338 tests successful so far with new version. I now have CVS access from "servidor" too, so things are easier to do right now ;-)

2230 We're now up to 430 tests successfully done. Tomorrow I need to:

	Do paperwork for LWCE/NYC trip :-(
	Attend the "All hands meeting" conference call at 1100
	Write some kind of nasty ScenarioComponent for something like web server traffic or memory hog or CPU hog or generic network traffic or swap hogs, or something. A flood pingfest comes to mind as being a good place to start <;-)
	Move big disk to backup machine

2300 Bye for now.

2330 Changed my mind. Going to add a VerifyAllIdle action to the ResourceManager script tonight and then invoke it from the startup script. This will give folks who make one of the two most common errors a good clue that they made a mistake. The guys from HP made this common error, and I've had it with this problem!

2345 All 500 tests passed.

0016 The new code for the verifyallidle action is in, and activated. It seems to work just fine. Now to update the ChangeLog.

0030 All put in CVS. Send email to the HP guys ;-)

0040 Bye for now.
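The real VerifyAllIdle action lives in the ResourceManager script and isn't reproduced in this journal, so here is only a Python sketch of what such a check presumably does. It assumes (the entry does not say so) that the common error is having a cluster-managed resource already running before heartbeat starts. Both verify_all_idle() and the is_running callback are hypothetical names.

```python
def verify_all_idle(resources, is_running):
    """Return the resources that are already active at startup.

    resources  -- names of the resources the cluster is meant to manage
    is_running -- callable reporting whether a resource is currently
                  active on this machine (hypothetical interface)

    An empty result means everything is idle, as it should be before
    the cluster manager starts handing resources out.
    """
    return [resource for resource in resources if is_running(resource)]
```

A startup script would log a loud warning for each name returned, so the person who configured the machine gets a clue right away instead of after the first failover.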

2001/02/07 =================================================================
0605 Checked email. A number from Lars, a couple from HA lists, Lenz. Sent replies, filed.

0710 Find receipts for LWCE. Start expense report. Process more email.

0940 Paperwork done. Need to send it out. Now on to the nasty pingfest ScenarioComponent. Should be fun ;-) Looks like the last batch of 500 tests finished successfully at about 09:25.

1045 Looks like the PingFest flood ping test is working - perhaps a little too well ;-) The tests are running really slowly - but they're working! The switch port lights are on pretty nearly solid ;-)

1100 Went to the conference call. SuSE is letting most everyone go here in the US. Looks like I get to find a new job ;-) Update resume, phone call, interview. Repeat until new job.

2001/02/08 =================================================================
2001/02/09 =================================================================
2001/02/10 =================================================================

2001/01/28 =================================================================
0350 Had trouble sleeping. Prayed for a few people. Got up. The "restart" test case for when only one machine is up looks like it had an obvious bug. It said:

	if node == self.CM.OurNode:
		pat = self.uspat
But it should have said:
	if node == self.CM.OurNode or self.CM.upcount() < 1:
		pat = self.uspat
instead. I applied the fix on "servidor". Looks like it was having some problems with X11 forwarding too. I changed the rsh command to suppress forwarding X11 ports (since I don't need them). Looks like there's also a bug in the Stonith test such that it doesn't look for the right patterns if the other node is down. Another wee bug, this one slightly more subtle:
	    if (self.CM.upcount() == 1 and
        	==      self.CM["up"]):
Should have simply been:
	    if (self.CM.upcount() == 1):
I decided the logging should be part of the CtsLab class. So, it's now in the (as of yet not used) CtsLab class. Another wee bug, this time in the Stonith code:
	   if (self.CM.upcount() == 1):
should have been
	   if (self.CM.upcount() <= 1): 
It's fixed now.

0615 Sent out CtsLab code to list. Probably ought to take a nap before church ;-)

Somewhere along in here I spent an hour or two packing.

1615 The entire 500 tests all went successfully. Definitely fixed the bug, since I ran it with the same random number set... Some other minor bugs having to do with reporting at the end were introduced. I think I fixed them.

1715 Continuing to write the Lab class. Break time.

2000 Break over. Hope to get the lab class integrated and working tonight.

2300 Looks like they're working together fine. Better quit while I'm ahead after backing things up ;-). G'night.

2001/01/29 =================================================================
0700 Today should be mostly a packing and preparing to leave day. I have some loose ends to take care of before I leave, but today I have a car, but of course it's snowing pretty nicely outside ;-)

However, I'm going to have to look at the heartbeat code anyway, because it looks like I triggered a bug in the heartbeat code with the tests. I guess that's what testing is supposed to do ;-) The test code "hit the jackpot" . "Both machines own foreign resources". The evidence should be in the logs. I'll see. The error occurred about 3 hours into the test run.

The problem is caused by the machine which had just come up (sgi2) failing to hear any heartbeats from the machine which was up all along. Perhaps this is caused by a piece of the code in the takeover sequence which waits for the takeover to complete, hence keeping packets from being sent out.

Another possibility would be that it is a problem in the receiving code startup. This sounds more likely. Perhaps the startup code should be more synchronous. This is what the timing looks like:

	Jan 29 02:03:56 sgi2 Starting heartbeat 0.4.8k
	Jan 29 02:04:00 sgi2 UDP heartbeat started
	Jan 29 02:04:01 sgi2 WARN: node sgi1: is dead

OK, that's technically our deadtime (5 seconds), but we didn't give the other guy much of a chance to give us a heartbeat, because we had not been up very long: the UDP link had only been started for about a second. With a heartbeat interval of only 1 second, hearing from the other node in that window is almost impossible. Under heavy load or with a 3 second dead time, I could imagine this being much worse. I think I remember wondering before whether this could happen.

Sounds like we should start the timing of "dead" time from the moment we receive an ack that all of the child read/write processes are up and running. I guess that means that the code needs to send such ACKs and that the heartbeat core timing logic needs to track them and modify its idea of the "epoch" accordingly.
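As a rough sketch of that idea (in Python, not heartbeat's own C code, and not the fix that was eventually committed), the dead-time clock for a node can be measured from whichever is later: the last heartbeat heard from it, or the moment the read/write children reported ready. The class and method names below are invented for the example.

```python
class DeadTimeTracker:
    """Decide when a node is dead, counting only from our own readiness."""

    def __init__(self, deadtime):
        self.deadtime = deadtime   # seconds of silence before "dead"
        self.epoch = None          # when all child processes acked ready
        self.last_seen = {}        # node name -> time of last heartbeat

    def children_ready(self, now):
        # Called once every read/write child has acked that it is up.
        self.epoch = now

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def is_dead(self, node, now):
        if self.epoch is None:
            # We cannot hear anything yet, so nobody can be declared dead.
            return False
        reference = max(self.epoch, self.last_seen.get(node, self.epoch))
        return now - reference >= self.deadtime
```

With the timings from the log above (start at 0, links up at 4, deadtime 5), sgi1 could not be declared dead before second 9, instead of at second 5.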

I guess this is great progress! I've moved from debugging the test tool to debugging the thing it's testing! Now I just need to think carefully about how to fix this bug in heartbeat ;-)

0800 I sent out last week's journal, and saved a similar email as a template to make sending it out in the future easier. I'm going to go finish packing now, and come back to the bug later.

1000 Finally finished packing! Now to run errands and do all the other things I need to do before leaving town for a few days.

1345 Got home and am doing a little more cleaning up, reading email, etc. 1405 Gotta go get Laura from Mandalay (work). 1430 Went to go see the builder of our house and try and straighten out some things in how the house is put together. 2000 Checking email, printing off schedule document. Need to stop this to order some Orinoco cards, and go to bed... (to about 2100). Finally got to bed around 0000.

2001/01/30 =================================================================
0430 Today I leave for LWCE/NYC. Expect less detail in the subsequent entries, since I'll spend most of my time away from my laptop. Better pack up the laptop, etc ;-) Got everything packed and made the plane on time, etc. Trip went without incident. Coded a little fix to the timing bug I discovered with CTS. Watched the movie. It was "Remember the Titans". I recommend it highly. Arrived in NYC a little later than planned. Spent about a half-hour trying to get my cell phone to work in NYC. It was a pain, my vendor needed to take some special security precautions to keep my NAM, etc. from being stolen and someone from making calls on it. Annoying.

I took the shuttle to the hotel and checked in fine. By the time this happened it was a little too late to make it to the Javits center to check in today. Worked a little on the code. Got the timing fix "mostly" working. Got a call from Horms, and went to dinner with him and his buddies from VA. Had a good discussion about where I want heartbeat to go and what he wants to do with it also. Ate an Aussie meat pie. It was pretty good. He said it was a little higher-class pie than you'd often get in Australia. Went home about 2300. Got to bed around 0000.

2001/01/31 =================================================================
0700 Really tired this morning. Made it to the Javits Center about 0830 or so. Talked to LWN staff at the speakers room. Got registered both as speaker and as Exhibitor. I did LOTS of appointments today. Stacey Quandt from Giga didn't show, but everyone else did. I also spoke to a freelance journalist who had very similar ideas about the "small" enterprise and what they need from HA. He had heard me speak when I was at Bell Labs in Naperville and dropped by to see me. Here was my agenda, which was mostly followed:

	1000 Ben Rafanello and friends, IBM
	1115 Stacey Quandt (no-show)
	1200 D. H. Brown
	1400 Jon Doyle & Compaq
	1500 Dean Pannell
	1500 Peter Badovinatz (IBM) at Developers' Den
	1830 IBM Party

Spent a lot of time with Peter B (Wombat). Learned some very interesting things from Peter, what he said, and what he didn't say. Glad I spent the time with him.

It was a long, busy, productive day, and I don't have much voice left. I'm going to have to be careful, or I won't have any voice left for my talk on Friday. I'll take some throat lozenges with me tomorrow. It seems to me that the show has been pretty good as far as size and people coming by. I also talked to Ted Ts'o about the Lucent winmodem debacle, and also with someone from IBM (Frank Novak fnovak@us.ibm.com) who will help ensure that Lucent does the right thing. I need to tell Ted about him. I also met Patrick Martel of MandrakeSoft. Dan Cox of Compaq told me to contact Wayne Opland about the HA disk (512) 432-8146.

2001/02/01 =================================================================

0630 Got up, pulled down email, finished the fix for the timing bug. It seems to work fine now. Wrote a reply to Markus and Jay asking that they tell me sooner rather than later if they have feedback on how I spend my time :-) Updated CVS with the timing fix. Getting ready to go to the Javits Center. Maybe I'll have a little time to look around on the show floor today :-) Spent about 20 minutes writing up the notes from the show so far.

0810 Go to Javits Center. Bye for now ;-)

2345 I spent the whole day at the show, mostly talking to potential customers, suppliers, partners, etc. My appointments today were with Thomas Schaffner of Enterprise Linux, Mike McQuaid of Winchester Systems, and Peter Badovinatz of IBM. I talked to lots of other people though, including one person from Lawrence Berkeley Laboratories who might be interested in having us provide professional services to help him deploy a high-availability web server. I also talked to Oracle about HA issues, SGI, and various other people whom I've forgotten. I did finally get out of the booth for an hour today to look around. Bought a book. Got a few goodies. I worked with Joshua Uziel (uzi) to fix a byte-ordering bug that the findif.c code had. He packaged it up in a patch and mailed it to me. Other people I talked to: Shane Painter of Dell (whom I met in Austin), Eric Lam of Coventive (interesting hardware model), Nate Perlstein of SGI (FailSafe support), Charlie Simpson of Enterprise Linux, and Satoshi Kawata of Red Hat Japan.

I stopped by the Mission Critical Linux folks and it sounds like they may end up using our open source test tool to help test their clusters. Right now they test everything by hand.

I bought my first meal since leaving home. Everything else has been freebies and a snack or two ;-)

I had a great conversation with our IBM liaison (Malcolm?). It seems that he didn't know that SuSE had any HA efforts. I corrected this misimpression. It was a really good thing I think. It sounds like he may have me go meet some IBM folks. Apparently Malcolm has good news regarding our relationship with IBM. Better go to bed now, and get up to work on my talk tomorrow morning.

An aside: Apparently John Mehaffey mentioned us in one of his talks. At least 2 or 3 people came by to see me as a result. I'll drop him a thank you note.

2001/02/02 =================================================================
Today I give my talk, and I return home.

0500 My stomach was a little unsettled, so I went ahead and got up. I need to reread my talk and see if I can/need to add anything regarding the various APIs to the talk. Get dressed, do a little packing, etc.

0545 Begin rewriting talk to change emphasis to Linux-HA APIs from being a heartbeat talk.

0645 Began a runthrough of the talk. It took about 45 minutes. It should fit in the time allotted. I'm a little worried about it being a little short.

0740 Start to pack up in earnest. Am tired already. Sad state of affairs. Better locate my Penguin mints for later ;-) Took a little nap before leaving.

0910 Time to pack up the laptop and leave for the conference.

1950 Went over to the Javits center by cab. Arrived about 10 AM. I run into Liz and Michael Hammell from the Linux Weekly News. It turns out that Liz is returning to Denver on the same flight I am. We make arrangements to share a ride to the airport.

Went by the booth. Talked for quite a while with Anas about clustering issues and then with Andreas Archangelli mainly about debugging tools. I'm glad he has a better attitude about them than Linus does. Maybe I ought to duplicate some of the "klog" tools for Linux. Wonder if Avaya would open source them? Maybe I should have Roger or someone send me some klog output (if he could get some easily) so I could show it to Andreas.

I went to go hear Dirk's talk - a little late. Dirk seems well-prepared and has a good talk. My PalmPilot alarm goes off near the end. It's time to go check out the room I'll give my talk in, and run through a little of it. I discover I'm more nervous than I'd guess. I wonder if anyone much will show up for the last talk in the conference? One couple shows up 30 minutes early(!). Others show up shortly afterwards. Doesn't sound like I have much to worry about. After a few minutes I sit down with the people who've come in and talk to them. It was nice - seems to calm down my nerves. I find that a guy from Bloomberg financial services that I met before is here. He's a Russian (?) guy. I get his card. I'm supposed to send him a copy of the slides from today. I don't know the routine here. Will someone introduce me? When should I start? About 2 minutes after, I decide that no one will introduce me, and I'll start my talk now. I don't find any controls for the lights, but someone in the audience tells me and I get the lights dimmed. By a few minutes into the talk there are 40-50 people in the room. Nice turnout.

I get my first question. It's very confusing. It takes a few minutes to figure out what he wants to know. I'm about to cut the discussion off when I figure it out and answer it. Now, more questions come. I'm beginning to warm up, and my sense of humor takes off and the audience laughs. Now I'm having fun, have lots to say, and they ask lots of questions. The talk finishes at almost exactly the right time! It went very well. They were a good audience. [I agreed to put the slides up on the Linux-HA site].

A fellow from LynuxWorks wants to talk to me. He's on the mailing list (but I don't remember him too clearly). He thinks they might put some resources on the Linux-HA project. He tells me they are going to open up the Intel High-Availability forum to other people - he implies that he means people like me, perhaps me specifically. [I look up email from him later, and I realize that he's a fellow I accidentally insulted on the list. I guess he must have forgiven me]. Liz rings and says she wants to say bye to folks and will call me a little later.

I go to bag check to get my coat, and bag and go up to the booth to talk to folks before Liz calls again. I chat a bit, run into a guy from Conectiva. I get him some small SuSE souvenirs for himself and my friends at Conectiva (Marcelo, Olive and Luis Claudio). Olive runs SuSE on his machine ;-) My phone rings, and it's Liz. Time to go.

~1530 We get a limo and ride to the airport. It was a bit more expensive than I'd like, but it was starting to rain and lots of people are looking for rides, so we take it.

There's another fellow in the car with us, so we all chat. Liz wants to know about his company. He reads Linux Weekly News, and seems to have heard of heartbeat. So we all have something to talk about.

We arrive at the airport, in plenty of time. All is well. We exchange travel horror stories. It seems Liz has a bit of a travel problem phobia, and has had a few experiences to match. She's going to go to talk at LinuxWorld Expo in Singapore. She agrees to give me a ride home (it's not far out of the way). The bus is fine, but being dropped off at home is nicer. I realize that I left my Minidisc player with Stephen Ing. Oops! Liz also says that the LWCE audiences rate the speakers on a 5-point scale. I wonder how my talk was rated?

~1810 We load up on the plane. After we're enroute, the pilot thinks we'll be in Denver 30 minutes early. He seems skeptical of his flight computer ;-) So am I. I nap until they turn off the seat belts sign. They bring dinner. It's not too bad. The movie comes on, and I dig out my laptop for this report. It took me 20 minutes or so to write up the part after 0910.

2033 I switch my watch to Denver time. Now it's 1834 ;-)

1836 I decide to write Stephen an email, along with one to the Russian fellow, and the one I need to send Ted Ts'o. If I feel like it, I'll try and catch up on the email from the list as well. I'll send John Mehaffey a note of thanks too. I added Brian's, Alexender's, and John Mehaffey's info to my address book.

1917 I sent those emails. Now I'll try and catch up on other email. I applied Uzi's byte ordering patch. I'll try it when I get home and have a network. I also need to send email to Rudy Pawul about or the Enterprise Linux people.

2019 I got rid of around 100 emails, and replied to many. I've got about another half-hour to go on the flight. Guess I'd better figure out how/when to finish up. Still need to email to/about Rudy.

2025 It's getting rough up here. Better shut down and put up the laptop. Bye :-)

I had a most pleasant return trip with Liz and her family. They very kindly just dropped me off at home.

2001/02/03 =================================================================

0800 Downloaded, read and replied to a little email. About an hour I suppose. Wife and I both tired, cranky :-( I tried to grab email mid-afternoon. DSL down :-( Got it back up in about a half-hour of time with Qwest. Very tired after the show. Zzzz.

2100 Read, replied to more mail. Updated main and commercial pages on linux-ha web site. Thought some more about the upshot from my talk. There is a lot of interest in HA things, and in particular I MUST split out the core code from the cluster manager code. This has to be a near-term development priority. Users want it, Anas needs it, others too... It just becomes way more useful that way. I believe the development especially from others is blocked because of this. I VERY MUCH need to update the TODO list. It's WAY out of date. Another thing to add to the TODO list: Make the configuration code plug-in modules, too...

2210 G'night. It's 0010 East Coast time now. No wonder I'm tired. I need to update my personal todo list from this journal next week. I'll send this out to my loyal readers ;-)

2001/01/27 ===========================================================

0530 Woke up. Decided that this is 0730 NYC time, so it must be time to get up ;-) Checked mail. Went back to restructuring/enhancing the test code. Still having occasional problems with python naming and modules, but now think I have a good strategy worked out for using it. Looks like the latest iteration of restructuring is now working. Guess I'd better go off and figure out what to do next.

0800 Enough for now. I made some pretty good progress on a couple of fronts.

1055 Joined Joe Barr on his recording bridge. He had 10 questions to ask, and I answered them as best I could. He's now interested in HA things and may write an article on it. He'll likely give me another call if he decides to do so. Interview got over at about 1130. I think he got the sound bites he was looking for.

1240 Looks like the test scripts show some failures. Looking at the logs, the heartbeat code is working right, but the test code doesn't think it is. The failing case is the "restart" test: when only one machine is up, it looks for the pattern for "remote machine has joined" when it should be looking for the pattern "local machine has joined" instead. Don't know why yet.

1315 Gotta go to Castle Rock and then going-away party for my cousin :-( Bye.

2001/01/26 =================================================================

0640 Really tired this morning. Guess I'm getting too old to get less than 6 hours sleep too often. Laura is feeling a little better this morning, so she went to school today.

Surprisingly, no reply from Lars on the CTS code. Ahhh... It just came in, in 2 parts. I responded to one of them. Decided to clean up my Trash folder, as it has over 10K unread messages in it. I'll get rid of all Trash from last year. Good to take out the trash once a year whether it needs it or not ;-)

0800 Need to get dressed, etc.

0825 Back to work. I need to look over the current copy of "Enterprise Linux"; it has a pretty cool cover article about the Weather Channel. Didn't actually read the article yet. Responded to Lars' comments.

0940 Finished responding to Lars' comments about CTS. Started implementing some of them. Splitting into multiple files, separating out the audit class.

1030 I'm exhausted and have a headache. Time to take a decongestant, a break and maybe a nap. Maybe I need to eat something? I see that we've done 602 iterations of the heartbeat code without any errors. This time I'm including the Stonith test in my set of tests to run (it slows things down a lot).

1055 Back to the salt mines :-) I feel a bit better. I see we're up to 655 iterations. I wrote the Audit class. I guess I'll stop the ongoing tests on 'servidor' and actually try the restructured code and see if it works, as opposed to "just compiles"

1145 Headache is back. Time for something stronger... Time for lunch... Went to lunch. Received a few boxes full of hardware for installing the network. Spent about 45 minutes checking the stuff out, making sure it was all there etc. Laura came home sick and exhausted, took her to Lunch, since she hadn't eaten. Still don't feel right. Took a half-hour nap. Spent a half-hour or so helping Amy get xawtv working on her PC, without much success. Having trouble importing some Python classes. Learning curve, I guess... (could it be a Python bug?)

Got email from Paddy about possible FailSafe meeting times. Replied, told him to avoid the CLIQ, 'cause I'm running a BOF (and representing SuSE?) there. Got email from Mia with corrected arrival date for hotel.

1740 Time to call it quits for a while and get Laura (and me) dinner.

2000 Called Joe Barr, and set up the appointment for the interview tomorrow at 11 AM. He seems like a really nice guy. I'm now writing the code for the CtsLab class.

2120 Tired. Going to bed. But, I feel better than I did earlier today.

2001/01/25 =================================================================

0525 Started work. This will be an odd day. Thursdays always are. Today a little more so than normal.

I see the overnight run I made crapped out after about 15 minutes because I had too many open files. Hmmm... Never saw that before... Not surprisingly, it was in the new AuditResources code... It was doing a popen for determining if the other node is up. I'm not waiting for the child process to finish before going on. I'll see if waiting for it to finish helps... I see it's gone 130 iterations this time. Before it only went 60. That's a good sign. Looks like that fixed it. It's been > 300 iterations.
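The fix described above (wait for the child before going on) can be sketched roughly like this. This is a minimal sketch, not the actual AuditResources code: the names `command_succeeds` and `node_is_up`, and the use of `ping` as the "is the other node up" check, are assumptions made for illustration. The only point is that each child process is waited for, and its file descriptors released, before the next iteration starts.

```python
import subprocess
import sys


def command_succeeds(argv, timeout=10.0):
    """Run argv, wait for it to exit, and report whether it exited with 0.

    subprocess.run() waits for the child before returning, and no pipes
    are opened here (output goes to DEVNULL), so repeated calls do not
    pile up open file descriptors or unreaped children.
    """
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or it hung: treat as failure.
        return False
    return result.returncode == 0


def node_is_up(host):
    # Hypothetical check: a single ping. The real audit may use
    # a different command entirely.
    return command_succeeds(["ping", "-c", "1", host])


if __name__ == "__main__":
    # Many iterations in a row should not hit "too many open files".
    for _ in range(50):
        assert command_succeeds([sys.executable, "-c", "pass"])
```

With the older `os.popen` style, the equivalent discipline would be to read the output and call `close()` on the returned file object every time, since `close()` is what waits for the child.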

I got email from Lars about the CTS. I've been responding to it. He has some good comments.

0615 Need to get dressed to take Kathy to school so I can have a car today. My wife is sick, my mother-in-law has an infection from her surgery and my father-in-law and I both have doctors' appointments today...

0700 Back to work...

I'm continuing to respond to Lars' email. He made a couple of good points, and some I don't care about. Completing my reply took exactly an hour. We're now up to 580 successful test iterations.

3-4 people subscribed to the linux-ha-dev list today. Replying to them took until 0920 or so. More email, more travel planning...

1025 Time to go to Doctor's appt.

Went to Doctor's, did about 15 mins of coding, went to lunch with a good friend who needed some time to talk. Got done about 1400. Picked up Kathy from school at about 1440.

1500 Started back to work. Lots of email arrived while I was gone. They changed my hotel reservation, so I have to print off new stuff to carry with me and tell Wombat new hotel name.

Included in the email was a VIRUS ALERT, TELL ALL YOUR FRIENDS! ;-)

Apparently disconnecting my laptop stopped the tests running. I had about 1100 iterations at that point.

Got a subscription email from a commercial HA firm. I sent them the same "welcome to the list, what brings you here?" note I send everyone. It'll be interesting to hear what they say.

1645 Need to go make preparations for dinner, etc. Laura stayed in bed all day. No word from my in-laws yet on how they did.

2330 Decided to check mail and read about the worm. Took about an hour. I see the tests I had started finished just fine. G'night.

2001/01/24 =================================================================

I started work this morning about 7:15.

I spent the first two hours this morning dealing with email and talking to Lars on IRC. He now knows my situation and a bit more about the priorities in SuSE, Inc. I agreed to write up a few paragraphs on the Cluster Test System (CTS) for him.

I made a doctor's appointment for tomorrow morning so I can get some prescriptions refilled before taking off to NYC. (about 15 mins)

I spent about an hour or so writing up the CTS for Lars.

I spent about 15 minutes explaining to MilesTek about the troubles I had ordering equipment from their web site. I scanned in some pages and emailed them out. Sigh...

Responded to some email from MC Linux about Stonith. They're considering adopting it, and had a few questions about the expect() function in it. My reply seemed to satisfy them. Guess that's good.

Took off for lunch at 12:17 PM, returned at 14:10. Had to make a trip by the house and pick up Laura from work.

Set up an appointment with horms for Wednesday dinner.

I wrote the code to tell if some, all or none of the resources in a group are held by the current node. Probably even works ;-)
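The some/all/none check can be sketched as a small set comparison. This is a minimal sketch under my own assumptions: the function name `group_status` and the idea of passing plain lists of resource names are illustrative, and are not taken from the real cts.py code.

```python
def group_status(group_resources, held_resources):
    """Say whether this node holds "all", "some" or "none" of a group.

    group_resources: names of the resources that make up the group.
    held_resources:  names of the resources the current node holds.
    An empty group is reported as "none".
    """
    group = set(group_resources)
    held = group & set(held_resources)
    if not held:
        return "none"
    if held == group:
        return "all"
    return "some"


if __name__ == "__main__":
    # Hypothetical group: an IP address plus a service.
    group = ["10.10.10.20", "httpd"]
    print(group_status(group, ["10.10.10.20", "httpd"]))
    print(group_status(group, ["10.10.10.20"]))
    print(group_status(group, []))
```

A "some" answer is the interesting one for auditing: it means a group is only partly held, which is either a transition still in progress or a real problem.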

Doing conference paperwork: Scheduling things, getting the current schedule for the conference room, etc. This will probably take me an hour to do. Meeting with Horms (VA Linux), Ben Rafanello (IBM), Wombat (Peter Badovinatz @ IBM), Thomas Schaffner (Enterprise Linux), Mike McQuaid (Winchester Systems). I also talked to Jon Doyle for a half-hour or so somewhere in here.

Sent some email about the heartbeat API to Ericsson in Montreal. Took about 15 minutes to write.

1645 quitting work for a while (Dogs are going nuts, and wife is sick).

2112 back for a bit. Gonna work on the resource stability thing... Finally backed up the laptop ;-)

2200 Going to bed. Got the new cts.py code working including polling for resources to become acquired.

2001/01/23 =================================================================

I started work this morning about 7:30. I took about a half-hour off for lunch. I stopped around 4:45 or so and put in a half hour or so later in the evening to catch up on email, etc.

More updates to the test suite. Basic Resource Auditing works! It's now in CVS too.

Need to get the CTS harness to not audit resources too soon. It looks like the IP addresses aren't getting set up as fast as the auditing is taking place.

Further examination seems to bear this out, but the heartbeat code doesn't give any particular message when the transition takeover scripts have completed. I put in a little code to loop for a while re-auditing things until they get better. They always seem to get better at least ;-)
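The "loop for a while re-auditing things until they get better" idea looks roughly like the following. This is a minimal sketch, assuming the audit is a callable that returns a list of problems (empty when everything is fine); the name `audit_until_clean` and the timeout and interval defaults are made up for illustration.

```python
import time


def audit_until_clean(audit, timeout=60.0, interval=2.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Re-run audit() until it reports no problems or the timeout expires.

    audit() is assumed to return a list of problem descriptions.
    Returns the last list of problems; an empty list means success.
    sleep and clock are parameters only so the loop can be tested
    without really waiting.
    """
    deadline = clock() + timeout
    problems = audit()
    while problems and clock() < deadline:
        sleep(interval)
        problems = audit()
    return problems


if __name__ == "__main__":
    attempts = []

    def fake_audit():
        # Pretend the IP address shows up on the third look.
        attempts.append(1)
        return [] if len(attempts) >= 3 else ["IP address not up yet"]

    print(audit_until_clean(fake_audit, interval=0.1))
```

Returning the leftover problems instead of a bare True/False keeps the failure report useful when things never do get better.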

There are at least four possible cases:

	A machine went down:
		It held resources - we will take them over
		It didn't hold resources - we won't take them over
	A machine came up:
		It will request resources (only machine, not nicefailback)
		It won't request resources: it has none, or nicefailback

Or maybe it's simpler than that?

	A machine came up - resource acquisition prints completion msg in all cases
	A machine went down - takeover code prints msg when done in all cases

What this really is is looking for the completion of a transition. Right now the code doesn't really know when the resources have been fully acquired locally. This is not a good thing.

I suppose what I need is a message whenever it completes acquisition of a set of resources, or when it decides it's not going to.
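If such completion messages exist, the test harness side of it could be as simple as scanning the log for them. This is a minimal sketch: the two message texts below are assumptions used for illustration (the real wording is whatever heartbeat and the takeover scripts end up printing), and `transition_complete` is a hypothetical helper name.

```python
import re

# Assumed message texts, one for each kind of transition.
COMPLETION_PATTERNS = [
    re.compile(r"resource acquisition completed"),
    re.compile(r"mach_down takeover complete"),
]


def transition_complete(log_lines, patterns=COMPLETION_PATTERNS):
    """Return the first log line that signals a finished transition.

    Returns None if no line matches any of the patterns.
    """
    for line in log_lines:
        if any(p.search(line) for p in patterns):
            return line
    return None


if __name__ == "__main__":
    sample = [
        "heartbeat: link sgi2:eth0 up",
        "heartbeat: resource acquisition completed (none)",
    ]
    print(transition_complete(sample))
```

The harness would then audit resources only after a match, rather than guessing how long the transition takes.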

I put in some new messages that indicate when acquisition of resources completes when done by heartbeat, but not for system failover takeovers. Those will have to go in the mach_down script or something like that. I'll try and get that later tonight. My goal for tonight is to fix this resource auditing problem.

It appears that this will require a new script which synchronously waits for resources to become served. It would be called by mach_down. Or, I suppose that mach_down could just do this itself, but this all sounds really hard, because of the messaging model used by the scripts. Maybe I could use a directory in "/var/lib/heartbeat" to keep track of what resources have been acquired. Or, I suppose I could poll to wait for them to be taken over... Yuck... Could be worse, I guess... Either way I think I get to poll...

I guess I'll just change mach_down to poll for the resources that we are still waiting to acquire rather than add new scripts. This is best done by enhancing ResourceManager to have a groupstat command or something like that, then mach_down can use that without duplicating a lot of code.

This item (the test harness) took by far the majority of my time. I suppose about 60-70%

Dropped Lars an email telling him about the updates to the test suite. Emailed some guy in France about publishing a Stonith paper for an IEEE journal.

Updated the HA web site with several minor things including stuff for Kimberlite, and the Open Cluster group (OSCAR).

Talked for a half-hour or so to Winchester Systems about getting an eval unit of their multi-interface RAID box. Made an appointment to talk at NYC.

Minor updates to the HA thoughts doc about various concerns.

I'm worried about Samba failover, and I'm worried about NFS failover. Jeremy Allison thinks Samba failover is hard, but it may be mainly an app thing. MC Linux has done the NFS failover and thinks it's hard. This may be partly smoke screen. Maybe we can get by without lock failover?

Started this Journal.

Emailed Ibrahim the suggested new paragraph for the Linux Journal.

2001/01/22 =================================================================

Spent several hours struggling with fetchmail problems. Finally got it working again with help from Chris Mahmood. Oakland had changed a bunch of things and they didn't take effect until a reboot happened over the weekend.

Wrote a bunch of code associated with resource auditing for the test suite. This includes the modification of the ClusterManager class and the creation of the new Resource class. Committed the changes to CVS.

Wrote the "HA Thoughts" document for where we're going with HA in SuSE.

Spent a bunch of time trying to figure out what the Baytech is doing. It seems to pause for a second every 3-4 seconds, but respond OK otherwise. More ominously, it seems to give connection refused for a second or two every so often at seemingly random times.

Over lunch tried to call WebGear. They seem to be out of business!

Updated the HA thoughts doc.

I've decided to try keeping my journal up-to-date as a way of tracking (and hopefully improving) my personal productivity. Unfortunately I had forgotten my Advogato password. But now I have it, so away I go...

This entry is really for yesterday (2000/01/16).

I spent a while trading in a plane ticket for COMDEX (which I didn't go to), for a ticket to LWCE at the end of the month.

I continued a dialogue with lmb about changing the Stonith API. We both agree it needs to change, and I think we're converging on how to change it.

I integrated multicast support into heartbeat CVS.

I integrated APC UPS support code into the Stonith subsystem, and put it under CVS.

Since other folks that I (mostly) don't know wrote these pieces of code, the only conclusion that I can draw is that this open source stuff must be working ;-).

I wrote up some release procedures for heartbeat and posted them on the web.

I got the CVS version to build correctly again after all these changes and put it on my test machines.

(I'm trying to follow my own release procedures ;-))

One thing I didn't expect to do was deal with a failure of the black printhead on my HP 2000C printer (it failed about 30% into its expected life).

I dealt with some folks from Avaya, and bought someone lunch who took me to CompUSA to get the print head. I fixed the stupid printer, and helped the fellow who took me to get the print head a little as he repaired our vacuum cleaner.

Somehow my ssh setup for my labs was broken, so I needed to repair that for my test tools.

All in all, a reasonably productive day.

I've been working on the heartbeat API. It actually works pretty well. Marcelo found a couple of bugs in it, and suggested restructuring a small piece of it. So, I fixed one of the bugs, will fix another, and let him do the restructuring (he wanted to). Marcelo's a good guy. I work closely with their people, and really anyone who's interested. I've issued specific invitations to everyone who's active in this area, including those guys in NC. Right now, we have folks from many companies using and contributing to heartbeat. It's a blast!

Linux Fail Safe is nearing its open source release. We're getting pretty excited about it. It's by far the most powerful of the High-Availability products, open or closed source.

I hope to position heartbeat to be able to do membership and low-level communication for lots of different projects. We'll write a new simple cluster manager, and use the heartbeat API. There is a place for an HA batch queueing system. Of course, it could use heartbeat ;-)

I hope to change FailSafe to use it. Perhaps even the folks at Mission-Critical Linux could use it. SGI is eyeing it for things I don't think I'm free to talk about.

It's basic, but it works pretty darn well, and gets better all the time ;-)

I got some nice feedback from Eric Ayers about my talk at the ALE (Atlanta Linux Enthusiasts) meeting last month. If you want me to speak to your LUG or conference about Linux-HA, let me know. I like giving talks.

I guess I ought to write at least one journal entry.

Lately, I've been spending most of my time doing at least six different things:

Promoting Linux-HA. A week ago last Thursday (whenever that was), I spoke to the Atlanta Linux Enthusiasts. Going to Atlanta in July wasn't my idea of good timing (it's hot and humid then), but the audience was very interested, and quite well-informed. The talk was very well received, and I even got an idea for a useful feature in heartbeat, which I implemented a few days later.

Working on reset code for LinuxFailSafe. It uses the STONITH API below.

Designing, writing, implementing and changing a STONITH API. STONITH == Shoot The Other Node In The Head. Also called STOMITH, substituting Machine for Node. I like STONITH, because of the similarity to Stoning a person representing the ultimate rejection from the community. In any case, I've been designing the abstract API, and writing code to implement it for the BayTech RPC-5.

Designing and implementing an API for heartbeat. Heartbeat is pretty nice in several ways, but it is limited in what it can do. It does heartbeats better than any other open source product I know of, but doesn't integrate with other applications to speak of. The API will allow it to be easily used with lots of other applications, whether with FailSafe, or Piranha, or CXFS, or Kimberlite, or with Stephen's new cluster manager, or some newly designed cluster manager or whatever. It is nearly complete, but needs some minor redesign to eliminate certain security issues from it before people start using it. You can get the code for this and the Stonith API from the Linux-HA CVS repository.

Generally working on heartbeat. Fixing it up, etc.

Strategizing on how SuSE should promote and package Linux-HA. Generally worrying about what should be done, and puzzling over how to get it done. This activity overlaps with lmb.

General Notes
I just got a new user for heartbeat that I am absolutely sure will need some tech support this winter. Heartbeat is now running in Tahiti :-)

I just found out that a talk I gave back in April won an award for the best talk of the day at the Lucent Technologies Software Symposium. That was certainly nice.
