2001/02/04 (Sunday) ==========================================================
Spent an hour or more unpacking from the trip, spread across various times of the day. This isn't quite so time-consuming as packing, fortunately ;-)
1130 Am reconfiguring various computers in the network today. Need to move the KVM switch down to the basement and put it on the ones down there. Hooked up one of the machines. Need 2 more cables to do them all. Better go get them, I guess ;-)
I converted my talk to HTML and pushed it and the StarOffice source both out to the linux-HA site. Did a little more general updating of the site. Much more still needs to be done, unfortunately.
1230 This stuff took about an hour or so.
2100 Writing a script to update the lab machines with RPMs automagically.
2145 Done. Gonna try and get the lab machines updated and a set of tests started to see if the fix I made in NYC works.
2200 Tests started successfully. Bye for now.
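The 2100 entry does not show the update script itself. As a rough sketch only, assuming the lab machines are reachable over ssh, that packages are copied with scp, and that they are installed with `rpm -Uvh`, the commands for such a run could be generated like this. The host names, the package file name, and the staging directory are all hypothetical, and the sketch is in Python rather than whatever the real script was written in.

```python
# Sketch only: builds the commands to push RPMs to lab machines and upgrade them.
# Host names, the RPM file name, and the staging directory are hypothetical.


def update_commands(hosts, rpm_files, staging_dir="/tmp"):
    """Return {host: [copy_cmd, install_cmd]} where each command is an argv list."""
    plan = {}
    for host in hosts:
        # Where each package will live on the remote machine after the copy.
        remote_paths = [
            "%s/%s" % (staging_dir, path.split("/")[-1]) for path in rpm_files
        ]
        copy_cmd = ["scp"] + list(rpm_files) + ["%s:%s/" % (host, staging_dir)]
        # rpm -U upgrades (or installs) a package; -v and -h show progress.
        install_cmd = ["ssh", host, "rpm", "-Uvh"] + remote_paths
        plan[host] = [copy_cmd, install_cmd]
    return plan
```

Keeping the command construction separate from the execution makes it easy to print the plan first and only then hand each argv list to something like `subprocess.call`.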
2001/02/05 =================================================================
0645 Not much mail came in last night. Better check my fetchmail. Last night's test run of 500 iterations completed successfully at 0630. So, the fix I made on the plane/hotel room during the LWCE works. Good timing ;-)
I got some more mail. Fetchmail must be working OK.
All 500 iterations succeeded. I also cut down the standard failover time to 3 seconds. Need to update the changelog.
The viewgraphs I put up on the web point at the home page, which points at Red Hat software, because it's out of date. I guess I'd better fix that as soon as I fix the ChangeLog.
0730 OK. Fixed the changelog. Time to release the development version 'k'. Need to find my freshmeat password, so I can follow the "official" procedures I documented on the web. Better go get my PalmPilot.
Freshmeat has changed a lot since I was there last. I need to announce it on freshmeat and on the lists.
Found it. I need to change lots of stuff to conform to the way freshmeat is now set up. This may take a while. I guess the freshmeat II rewrite is pretty new (last of Jan), so I would have had to do this, and now wasn't too bad a time.
0810 Now I need to announce it to the mailing lists ;-) Discovered a few minor glitches on the web page. Netscape crashed :-(
0832 Got the release notices out to SuSE internal sources and the various external HA mailing lists.
I keep a little whiteboard with my near-term TODO list on it. Here's what's on it right now:
    Fix "sitemap" program
    Read CVS book
    Update TODO list on web
    Post Talk VGs (already did that)
    Test New version (already did that)
    Update home page
    Work more on test scripts
    Move disk drive to "servidor"
    Email: OSCAR folks Baytech weirdness
    Release Unstable version (already did that)
I'll update the board. Fortunately it's easy ;-) Done.
0838 I guess my next priority ought to be to fix the home page, since potential SuSE customers will be reading it, and it says I work for Lucent and I recommend Red Hat. It's more than a year out of date - a bit embarrassing :-(
OK. Updated the home page (a little). That didn't take long. Now I'll tackle the TODO list. Then either sitemap or some CVS book reading.
0852 Done. On to the todo list... Dropped a note to the Linux Weekly folks about their poor choice of name, since it conflicts with LWN - my favorite Linux publication ;-)
0955 Finished the TODO list, and announced it. Hmmm... What next? Guess the CVS knowledge is pretty sorely needed at this point. I'll go read for a while. I need to know about how/when/why to set up CVS branches. I need to add some for linux-ha, I think... Short-term todo list looks much nicer now ;-) Of course, to understand this, I have to know something about CVS tags, too ;-). I also wrote a little script which tags my CVS tree with a tag derived mechanically from the release number.
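The tagging script mentioned above is not shown in this diary. As a sketch of what "derived mechanically from the release number" could mean: CVS tag names must start with a letter and may contain only letters, digits, '-' and '_', so the dots in a release number have to be mapped to something legal. The `REL` prefix and the sample release numbers below are assumptions made for illustration, not the naming scheme actually used.

```python
import re


def release_to_cvs_tag(release, prefix="REL"):
    """Turn a release number such as '1.2.3' into a legal CVS tag name.

    The prefix is a hypothetical convention; it guarantees the tag starts
    with a letter, which CVS requires.
    """
    # Replace every character CVS does not allow in a tag (notably '.') with '_'.
    body = re.sub(r"[^A-Za-z0-9_-]", "_", release.strip())
    if not body:
        raise ValueError("empty release number")
    return "%s_%s" % (prefix, body)
```

The resulting name would then be handed to `cvs tag` from the top of the checked-out tree.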
Got some question email about heartbeat - answered. Got some question email about AutoMake/Build - answered. Got a suggestion about the ToDo list. Incorporated it. Ate lunch. Took about 10 minutes.
1140 Now on to "sitemap"... What are the symptoms?
    Directories with index.html in them aren't made into links.
    The directory LWCE-NYC-2001 is omitted from the directory name displayed. The links under it are all fine.
    Sorting should be case-insensitive.
    It treats some files as directories.
Perhaps the dirname() function is screwed up? Seems so.
Sorting is still case-sensitive. Oh. It's doing perl sorting. Fixed it. Fixed file sorting, too.
Somehow we're not picking up the title, etc. from some pages. $Title and $X-Meta-Description are missing from them... Something appears to be wrong/changed with HTML::HeadParser. It isn't always returning the info to us... It seems to have something to do with the DTD line Netscape puts in. It doesn't like it. I need to remove it. Sigh...
25-30 page edits later...
1340 Got them all removed. The index looks much better, but still isn't quite right... Sorting is still off... Of course, all the modification times are all wrong :-( I should have tried updating to a newer version of the Perl packages.
1355 Fixed sorting. Now I know why I was avoiding this ;-) Site map all better now.
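The sorting symptom is the classic one: a plain string sort compares character codes, so every name starting with an uppercase letter lands ahead of every lowercase one. The sitemap program itself is Perl and its fix is not shown here; the sketch below illustrates the same idea in Python, with made-up file names.

```python
def sort_names(names):
    """Sort names case-insensitively.

    The original spelling is used as a tie-breaker so that names differing
    only in case still come out in a predictable order.
    """
    return sorted(names, key=lambda name: (name.lower(), name))
```

With this key, `LWCE-NYC-2001` sorts among the "l" entries instead of ahead of everything lowercase.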
I'm worried about getting DSL service when I move. It seems that there will be a 2-week delay after moving in and getting a basic phone line installed. This would mean I'd have to use dialup for about 2 weeks :-(
1425 OK... Back to working on CTS... I think I'll add the "monitor" function to IPaddr next. The basic thing is to "ping" the address.
1450 Done. Committed to CVS. Now change the code to actually use it in the audits...
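The real "monitor" function lives in the IPaddr resource script and is not reproduced in this diary. As a minimal sketch of the "ping the address" idea, assuming a Linux iputils-style `ping` (where `-c` is the probe count, `-w` an overall deadline in seconds, and `-q` quiet output), a check could look like this in Python. The function names and the injectable `run` parameter are inventions for the example.

```python
import subprocess


def ping_command(address, count=1, deadline_secs=2):
    """Build the argv for a short, quiet ping of one address."""
    # Assumes Linux iputils ping: -c = number of probes, -w = deadline in seconds.
    return ["ping", "-q", "-c", str(count), "-w", str(deadline_secs), address]


def address_is_served(address, run=subprocess.call):
    """Return True if ping exits with status 0, i.e. a reply came back.

    `run` is injectable so the logic can be exercised without a network.
    """
    return run(ping_command(address)) == 0
```

A reply only shows that some machine answers for the address, not which node it is, so this is a liveness check rather than a placement check.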
1520 It looks like the tests should be pinging the node to make sure it's really serving the IP address as we go along. And, we should be verifying that all resources in a group are being served by the same node.
Oh... Except I haven't put the latest version of the code on the test cluster, which means it ought to be failing (!?)
OOPS. It wasn't actually being called. The if-condition was too complex. It's a little simpler now, and now it fails like it ought to ;-)
I distributed the new IPaddr script to the lab machines. It seems to work now. I'll restore a little of the debug logging to make sure... Yep. It's working.
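The second audit idea above, that every resource in a group should be served by the same node, reduces to a small set check. The sketch below assumes the audit already knows which node serves each resource; the data layout, the function name, and the sample names in the usage are hypothetical and not taken from the CTS code.

```python
NOT_SERVED = "(not served)"


def misplaced_groups(groups, served_by):
    """Return {group: sorted list of nodes} for groups not held by exactly one node.

    groups:    {group name: [resource, ...]}
    served_by: {resource: node name}; a missing resource counts as not served.
    """
    problems = {}
    for group, resources in groups.items():
        nodes = {served_by.get(res, NOT_SERVED) for res in resources}
        # A healthy, non-empty group maps to exactly one real node.
        if resources and nodes != {NOT_SERVED} and len(nodes) == 1:
            continue
        if resources:
            problems[group] = sorted(nodes)
    return problems
```

An empty result means every group is intact; anything else names the group and the nodes it is split across.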
1600 I'll start a series of tests running. They take around 8 hours IIRC. I need to check mail before quitting for the evening. Not a lot of mail. Only 7 new emails. Martin Konold pointed out I forgot to mention the download URL.
1610 It's now corrected both inside SuSE and outside. Time to quit for the evening.
2100 My freshmeat entry got thrown away. I'll need to resubmit it and the information on the main branch. Sigh...
Got an email from Volker. It needed a reply.
I sent out a Call for Refinements for heartbeat. I sent out a wish list for what apps people want to make HA.
I bought two more KVM cables. Now I can hook all the machines up to the switch. Wired up another computer to the KVM switch. I'd wire up the other two except I need to wait for the tests to finish. Speaking of tests, 380 or so have already succeeded. Only 120 more to go ;-)
Ted Ts'o sent me mail about the Lucent winmodem problems, next chapter. I sent him a brief reply. Sigh...
2210 Bye for now. 395 tests run so far.
2001/02/06 =================================================================
0620 All 500 tests completed successfully. Looks like my mails to the list and to Volker have generated some responses. It'll take a while to go through them. Most of the responses were pretty much what was expected. But, I'll update the ToDo list with a couple of them anyway. Composed an email to send Volker and Markus about staffing. Responded to more email. And more email, and more email.
0900 Time to process more email. Some from Lars, some from the ha list, some from others. Need to check the web stats and see how many downloads have occurred of the new code, but it's probably too soon to see them in the reports yet.
1000 Time to finish hooking the cluster up to the KVM switch. Done. Now, what next? I'm getting pretty close to being happy with the test environment as it stands now. But, I still need the "environment" dimension. I guess that's a good next step. Also add "quorum" to the ClusterManager class. HasQuorum() added. Looks like it works.
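The diary does not say how HasQuorum() decides its answer. Purely as an illustration, and assuming quorum means a strict majority of the configured nodes is up (an assumption, not necessarily the rule used in the real ClusterManager class), the member function could be sketched like this. Everything except the names ClusterManager and HasQuorum is invented for the example.

```python
class ClusterManager:
    """Toy stand-in for the test tool's cluster manager class."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.up_nodes = set()

    def mark_up(self, node):
        self.up_nodes.add(node)

    def mark_down(self, node):
        self.up_nodes.discard(node)

    def HasQuorum(self):
        # ASSUMPTION: quorum = strictly more than half of the configured nodes.
        return 2 * len(self.up_nodes) > len(self.nodes)
```

Under this rule a two-node cluster only has quorum with both nodes up, which is one reason real two-node setups often use a different policy.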
Let's see if I can remember all the things we still need to add to the test code. I'll go reread the email on it... The main thing remaining was Scenarios. The idea of a Scenario is that we might run the tests under a particular set of conditions: what kinds of resources are configured, and what kind of workload is applied, either from the test machine or running on the cluster machines themselves.
Need to drop Lars an email about the state of the test tool and the HasQuorum member functions. On second thought, I'll save that until I'm done. Otherwise too much time is lost.
Now on the "Scenario" concept... Worked on it a while, went to lunch (took nearly an hour today - getting out of the house was wonderful - a good break from the more-usual 10 minutes).
1225 Got back, got some detailed mail from SGI about CTS. Am writing a detailed response. This is taking a while.
1345 Finished. Now back to the scenarios...
1435 I now have the code for a basic, robust StartUp scenario. Wonder if it works? ;-)
1530 It seems to work now. It's also integrated into the RandomTest class. Hmmm... It seems the Quorum changes didn't all make it into CVS. I'm putting them back in.
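For the Scenario idea worked on today, here is a minimal sketch of one way such a thing can be structured: a scenario is an ordered list of components, each of which sets something up before the tests run and tears it down afterwards. The class names Scenario and ScenarioComponent echo the diary, but the method names and the rollback behaviour are assumptions for illustration, not the actual CTS code.

```python
class ScenarioComponent:
    """One piece of test environment, e.g. 'cluster started' or 'load running'."""

    def set_up(self, cluster):
        """Return True on success. Subclasses override this."""
        return True

    def tear_down(self, cluster):
        """Undo whatever set_up did. Subclasses override this."""


class Scenario:
    """Runs components in order and undoes them in reverse order."""

    def __init__(self, components):
        self.components = list(components)
        self._started = []

    def set_up(self, cluster):
        for comp in self.components:
            if not comp.set_up(cluster):
                # Roll back the components that did start, newest first.
                self.tear_down(cluster)
                return False
            self._started.append(comp)
        return True

    def tear_down(self, cluster):
        while self._started:
            self._started.pop().tear_down(cluster)
```

Tearing down in reverse order means a workload component added after a start-up component is stopped before the cluster itself is.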
1600 Bye for now.
1915 Got lots of email to respond to. Looks like some folks at HP may want to use heartbeat in a product.
2023 Bye for now.
2055 I just can't seem to stay away. More email responses (~15 mins). Back to the home network configuration ;-) Got an emergency request to make more free space on some FAT partitions. I'm doing that now in the "background". Looks like 338 tests successful so far with new version. I now have CVS access from "servidor" too, so things are easier to do right now ;-)
2230 We're now up to 430 tests successfully done.
Tomorrow I need to:
    Do paperwork for LWCE/NYC trip :-(
    Attend the "All hands meeting" conference call at 1100
    Write some kind of nasty ScenarioComponent for something like web server traffic or memory hog or CPU hog or generic network traffic or swap hogs, or something. A flood pingfest comes to mind as being a good place to start <;-)
    Move big disk to backup machine
2300 Bye for now.
2330 Changed my mind. Going to add a VerifyAllIdle action to the ResourceManager script tonight and then invoke it from the startup script. This will give folks who make one of the two most common errors a good clue that they made a mistake. The guys from HP made this common error, and I've had it with this problem!
2345 All 500 tests passed.
0016 The new code for the verifyallidle action is in, and activated. It seems to work just fine. Now to update the ChangeLog.
0030 All put in CVS. Send email to the HP guys ;-)
0040 Bye for now.
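The diary does not spell out what VerifyAllIdle checks or which common error it catches. Going only by the name, one plausible reading is "before startup, confirm that none of the configured resources is already active on this node". The sketch below implements just that reading; the function signature, the status callback, and the message wording are all assumptions, and the real action lives in the ResourceManager script rather than in Python.

```python
def verify_all_idle(resources, is_running):
    """Return the resources that are already running (empty list = all idle).

    resources:  iterable of resource names
    is_running: callback taking a resource name and returning True or False
    """
    return [res for res in resources if is_running(res)]


def idle_report(resources, is_running):
    """Produce a one-line message suitable for logging at startup."""
    busy = verify_all_idle(resources, is_running)
    if not busy:
        return "OK: all resources idle"
    return "ERROR: resources already active before startup: " + ", ".join(busy)
```

Reporting the offending resources by name is what turns the check into a clue rather than just a failure.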
2001/02/07 =================================================================
0605 Checked email. A number from Lars, a couple from HA lists, Lenz. Sent replies, filed.
0710 Find receipts for LWCE. Start expense report. Process more email.
0940 Paperwork done. Need to send it out. Now on to the nasty pingfest ScenarioComponent. Should be fun ;-) Looks like the last batch of 500 tests finished successfully at about 09:25.
1045 Looks like the PingFest flood ping test is working - perhaps a little too well ;-) The tests are running really slowly - but they're working! The switch port lights are on pretty nearly solid ;-)
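The PingFest code is not included in this diary. As a hedged sketch of the general shape, assuming the component simply launches one `ping -f` (flood mode, which normally requires root) per target node and kills them again on teardown, it might look like the following. The class layout, method names, and the injectable `spawn` parameter are inventions for the example.

```python
import subprocess


class PingFest:
    """Background-load component: flood ping every target until torn down."""

    def __init__(self, targets, spawn=subprocess.Popen):
        self.targets = list(targets)
        self.spawn = spawn  # injectable so the logic can be tried without root
        self.procs = []

    def command_for(self, host):
        # -f = flood ping, -q = no per-packet output.
        return ["ping", "-f", "-q", host]

    def set_up(self):
        for host in self.targets:
            self.procs.append(self.spawn(self.command_for(host)))
        return True

    def tear_down(self):
        for proc in self.procs:
            proc.terminate()
        self.procs = []
```

Since a flood ping sends packets as fast as replies come back, a component like this is best pointed only at an isolated lab network.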
1100 Went to the conference call. SuSE is letting most everyone go here in the US. Looks like I get to find a new job ;-)
    Update resume, phone call, interview
    Repeat until new job.
2001/02/08 =================================================================
2001/02/09 =================================================================
2001/02/10 =================================================================