2001/01/27 ===========================================================
0530 Woke up. Decided that this is 0730 NYC time, so it must be time to get up ;-) Checked mail. Went back to restructuring/enhancing the test code. Still having occasional problems with Python naming and modules, but I now think I have a good strategy worked out for using it. Looks like the latest iteration of restructuring is now working. Guess I'd better go off and figure out what to do next.
0800 Enough for now. I made some pretty good progress on a couple of fronts.
1055 Joined Joe Barr on his recording bridge. He had 10 questions to ask, and I answered them as best I could. He's now interested in HA things and may write an article on it. He'll likely give me another call if he decides to do so. Interview got over at about 1130. I think he got the sound bites he was looking for.
1240 Looks like the test scripts show some failures. Looking at the logs, the heartbeat code is working right, but the test code doesn't think it is. The failing case is the "restart" test: when only one machine is up, it looks for the pattern for "remote machine has joined" when it should be looking for the pattern for "local machine has joined" instead. Don't know why yet.
1315 Gotta go to Castle Rock and then going-away party for my cousin :-( Bye.
2001/01/26 =================================================================
0640 Really tired this morning. Guess I'm getting too old to get less than 6 hours sleep too often. Laura is feeling a little better this morning, so she went to school today.
Surprisingly, no reply from Lars on the CTS code. Ahhh... It just came in, in 2 parts. I responded to one of them. Decided to clean up my Trash folder, as it has over 10K unread messages in it. I'll get rid of all Trash from last year. Good to take out the trash once a year whether it needs it or not ;-)
0800 Need to get dressed, etc.
0825 Back to work. I need to look over the current copy of "Enterprise Linux"; it has a pretty cool cover article about the Weather Channel. Didn't actually read the article yet. Responded to Lars' comments.
0940 Finished responding to Lars' comments about CTS. Started implementing some of them. Splitting into multiple files, separating out the audit class.
1030 I'm exhausted and have a headache. Time to take a decongestant, a break and maybe a nap. Maybe I need to eat something? I see that we've done 602 iterations of the heartbeat code without any errors. This time I'm including the Stonith test in my set of tests to run (it slows things down a lot).
1055 Back to the salt mines :-) I feel a bit better. I see we're up to 655 iterations. I wrote the Audit class. I guess I'll stop the ongoing tests on 'servidor' and actually try the restructured code and see if it works, as opposed to "just compiles".
1145 Headache is back. Time for something stronger... Time for lunch... Went to lunch. Received a few boxes full of hardware for installing the network. Spent about 45 minutes checking the stuff out, making sure it was all there, etc. Laura came home sick and exhausted; took her to lunch, since she hadn't eaten. Still don't feel right. Took a half-hour nap. Spent a half-hour or so helping Amy get xawtv working on her PC, without much success. Having trouble importing some Python classes. Learning curve, I guess... (could it be a Python bug?)
Got email from Paddy about possible FailSafe meeting times. Replied, told him to avoid the CLIQ, 'cause I'm running a BOF (and representing SuSE?) there. Got email from Mia with corrected arrival date for hotel.
1740 Time to call it quits for a while and get Laura (and me) dinner.
2000 Called Joe Barr, and set up the appointment for the interview tomorrow at 11 AM. He seems like a really nice guy. I'm now writing the code for the CtsLab class.
2120 Tired. Going to bed. But, I feel better than I did earlier today.
2001/01/25 =================================================================
0525 Started work. This will be an odd day. Thursdays always are. Today a little more so than normal.
I see the overnight run I made crapped out after about 15 minutes because I had too many open files. Hmmm... Never saw that before... Not surprisingly, it was in the new AuditResources code... It was doing a popen for determining if the other node is up. I'm not waiting for the child process to finish before going on. I'll see if waiting for it to finish helps... I see it's gone 130 iterations this time. Before it only went 60. That's a good sign. Looks like that fixed it. It's been > 300 iterations.
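A minimal sketch of the fix described above, in present-day Python rather than the 2001 code. The real AuditResources check is not shown in this journal, so the helper name `command_succeeds`, the use of `subprocess`, and the commands run in the demo are all assumptions. The point is only that every child gets waited for, so its pipe is closed before the next test iteration starts.

```python
import subprocess
import sys


def command_succeeds(argv, timeout=10):
    """Run argv, wait for it to exit, and report whether it returned 0.

    Hypothetical helper: stands in for whatever command the audit code
    uses to decide if the other node is up.
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    try:
        # communicate() drains the pipe, waits for the child to exit and
        # closes our end of the pipe, so no descriptor is left open.
        proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()  # reap the killed child and close the pipe
        return False
    return proc.returncode == 0


if __name__ == "__main__":
    # Stand-in command: a child Python process that exits with status 0.
    for i in range(5):
        ok = command_succeeds([sys.executable, "-c", "raise SystemExit(0)"])
        print("iteration", i, "child exited cleanly:", ok)
```

Calling a helper like this in a long loop should keep the number of open descriptors flat, which matches the behaviour seen once the wait was added.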
I got email from Lars about the CTS. I've been responding to it. He has some good comments.
0615 Need to get dressed to take Kathy to school so I can have a car today. My wife is sick, my mother-in-law has an infection from her surgery and my father-in-law and I both have doctors' appointments today...
0700 Back to work...
I'm continuing to respond to Lars' email. He made a couple of good points, and some I don't care about. Completing my reply took exactly an hour. We're now up to 580 successful test iterations.
3-4 people subscribed to the linux-ha-dev list today. Replying to them took until 0920 or so. More email, more travel planning...
1025 Time to go to Doctor's appt.
Went to Doctor's, did about 15 mins of coding, went to lunch with a good friend who needed some time to talk. Got done about 1400. Picked up Kathy from school at about 1440.
1500 Started back to work. Lots of email arrived while I was gone. They changed my hotel reservation, so I have to print off new stuff to carry with me and tell Wombat new hotel name.
Included in the email was a VIRUS ALERT, TELL ALL YOUR FRIENDS! ;-)
Apparently disconnecting my laptop stopped the tests running. I had about 1100 iterations at that point.
Got a subscription email from a commercial HA firm. I sent them the same "welcome to the list, what brings you here?" note I send everyone. It'll be interesting to hear what they say.
1645 Need to go make preparations for dinner, etc. Laura stayed in bed all day. No word from my in-laws yet on how they did.
2330 Decided to check mail and read about the worm. Took about an hour. I see the tests I had started finished just fine. G'night.
I started work this morning about 7:15.
I spent the first two hours this morning dealing with email and talking to Lars on IRC. He now knows my situation and a bit more about the priorities in SuSE, Inc. I agreed to write up a few paragraphs on the Cluster Test System (CTS) for him.
I made a doctor's appointment for tomorrow morning so I can get some prescriptions refilled before taking off to NYC. (about 15 mins)
I spent about an hour or so writing up the CTS for Lars.
I spent about 15 minutes explaining to MilesTek about the troubles I had ordering equipment from their web site. I scanned in some pages and emailed them out. Sigh...
Responded to some email from MC Linux about Stonith. They're considering adopting it, and had a few questions about the expect() function in it. My reply seemed to satisfy them. Guess that's good.
Took off for lunch at 12:17 PM, returned at 14:10. Had to make a trip by the house and pick up Laura from work.
Set up an appointment with horms for Wednesday dinner.
I wrote the code to tell if some, all or none of the resources in a group are held by the current node. Probably even works ;-)
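The some/all/none check can be sketched as a small set comparison. This is an illustration only, not the code that went into the test suite: the function name `group_status` and the resource names in the demo are made up.

```python
def group_status(group, held):
    """Classify how much of a resource group a node currently holds.

    group: iterable of resource names that make up the group
    held:  iterable of resource names the node holds right now
    Returns "all", "some" or "none". An empty group reports "none".
    """
    group = set(group)
    owned = group & set(held)
    if not owned:
        return "none"
    if owned == group:
        return "all"
    return "some"


if __name__ == "__main__":
    # Illustrative resource names only.
    group = ["IPaddr::10.0.0.1", "httpd"]
    print(group_status(group, ["IPaddr::10.0.0.1", "httpd"]))
    print(group_status(group, ["httpd"]))
    print(group_status(group, []))
```

The "some" answer is the interesting one for auditing, since a group that is only partly held points to a transition that is either still in progress or has gone wrong.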
Doing conference paperwork: Scheduling things, getting the current schedule for the conference room, etc. This will probably take me an hour to do. Meeting with Horms (VA Linux), Ben Rafanello (IBM), Wombat (Peter Badovinatz @ IBM), Thomas Schaffner (Enterprise Linux), Mike McQuaid (Winchester Systems). I also talked to Jon Doyle for a half-hour or so somewhere in here.
Sent some email about the heartbeat API to Ericsson in Montreal. Took about 15 minutes to write.
1645 quitting work for a while (Dogs are going nuts, and wife is sick).
2112 back for a bit. Gonna work on the resource stability thing... Finally backed up the laptop ;-)
2200 Going to bed. Got the new cts.py code working including polling for resources to become acquired.
I started work this morning about 7:30. I took about a half-hour off for lunch. I stopped around 4:45 or so and put in a half hour or so later in the evening to catch up on email, etc.
More updates to the test suite. Basic Resource Auditing works! It's now in CVS too.
Need to get the CTS harness to not audit resources too soon. It looks like the IP addresses aren't getting set up as fast as the auditing is taking place.
Further examination seems to bear this out, but the heartbeat code doesn't give any particular message when the transition takeover scripts have completed. I put in a little code to loop for a while re-auditing things until they get better. They always seem to get better at least ;-)
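The re-audit loop amounts to polling with a deadline. A minimal sketch, assuming the audit is available as a callable that returns True once everything checks out. The name `wait_until_stable`, the default timings, and the injectable `sleep` and `clock` parameters are inventions for this example, not what cts.py does.

```python
import time


def wait_until_stable(audit, timeout=60.0, interval=2.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call audit() repeatedly until it returns True or the timeout passes.

    Returns True as soon as one audit passes and False if the deadline is
    reached first. audit() is always called at least once.
    """
    deadline = clock() + timeout
    while True:
        if audit():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Stand-in audit that only passes on its third call, as if the IP
    # addresses took a little while to come up.
    attempts = []

    def fake_audit():
        attempts.append(1)
        return len(attempts) >= 3

    print(wait_until_stable(fake_audit, timeout=5.0, interval=0.1))
```

Polling like this papers over the missing "transition complete" message. It cannot tell a slow takeover from a failed one until the timeout expires, so the timeout has to be longer than the slowest takeover that should still count as a pass.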
There are at least four possible cases:
    A machine went down:
        It held resources - we will take them over
        It didn't hold resources - we won't take them over
    A machine came up:
        It will request resources (only machine, not nicefailback)
        It won't request resources: it has none, or nicefailback
Or maybe it's simpler than that?
    A machine came up - resource acquisition prints completion msg in all cases
    A machine went down - takeover code prints msg when done in all cases
What this really comes down to is detecting the completion of a transition. Right now the code doesn't really know when the resources have been fully acquired locally. This is not a good thing.
I suppose what I need is a message whenever it completes acquisition of a set of resources, or when it decides it's not going to.
I put in some new messages that indicate when acquisition of resources completes when done by heartbeat, but not for system failover takeovers. Those will have to go in the mach_down script or something like that. I'll try and get that later tonight. My goal for tonight is to fix this resource auditing problem.
It appears that this will require a new script which synchronously waits for resources to become served. It would be called by mach_down. Or, I suppose that mach_down could just do this itself, but this all sounds really hard, because of the messaging model used by the scripts. Maybe I could use a directory in "/var/lib/heartbeat" to keep track of what resources have been acquired. Or, I suppose I could poll to wait for them to be taken over... Yuck... Could be worse, I guess... Either way I think I get to poll...
I guess I'll just change mach_down to poll for the resources that we are still waiting to acquire rather than add new scripts. This is best done by enhancing ResourceManager to have a groupstat command or something like that, then mach_down can use that without duplicating a lot of code.
This item (the test harness) took by far the majority of my time. I suppose about 60-70%.
Dropped Lars an email telling him about the updates to the test suite. Emailed some guy in France about publishing a Stonith paper for an IEEE journal.
Updated the HA web site with several minor things including stuff for Kimberlite, and the Open Cluster group (OSCAR).
Talked for a half-hour or so to Winchester Systems about getting an eval unit of their multi-interface RAID box. Made an appointment to talk at NYC.
Minor updates to the HA thoughts doc about various concerns.
I'm worried about Samba failover, and I'm worried about NFS failover. Jeremy Allison thinks Samba failover is hard, but it may be mainly an app thing. MC Linux has done the NFS failover and thinks it's hard. This may be partly smoke screen. Maybe we can get by without lock failover?
Started this Journal.
Emailed Ibrahim the suggested new paragraph for the Linux Journal.
Spent several hours struggling with fetchmail problems. Finally got it working again with help from Chris Mahmood. Oakland had changed a bunch of things and they didn't take effect until a reboot happened over the weekend.
Wrote a bunch of code associated with resource auditing for the test suite. This includes the modification of the ClusterManager class and the creation of the new Resource class. Committed the changes to CVS.
Wrote the "HA Thoughts" document for where we're going with HA in SuSE.
Spent a bunch of time trying to figure out what the Baytech is doing. It seems to pause for a second every 3-4 seconds, but respond OK otherwise. More ominously, it seems to give connection refused for a second or two every so often at seemingly random times.
Over lunch tried to call WebGear. They seem to be out of business!
Updated the HA thoughts doc.