How I wasted 3 hours in a server room,
Why I'm cursed with this cluster implementation.
I have a cluster of Linux systems, and they are controlled by some USB power switches so that the nodes can power each other off when a node dies.
Kimberlite doesn't understand these switches, and the stonith module I hacked together in a couple of hours on Christmas Eve just sucked too much to be stable, so I scripted the stonith stuff instead.
So I installed the scripts this morning and did some testing, and it all worked well. Until the very last reboot of the secondary node.
The module controlling the USB switch wouldn't finish initialising. It's a problem we had before and I thought I had it fixed. I decided to enable a USB option in the system's BIOS to see if that helped any.
Well, it didn't. The system POSTed the SCSI card, then hung. It didn't even get anywhere near starting to look at the boot drive. Of course this card was controlling the quorum disk and I had to kill the cluster again. I popped out the card and the system then POSTed far enough that I could get back into the BIOS setup. I disabled the option I had enabled, and then put the SCSI card back in again.
So now the system POSTs and starts the boot process, but won't boot the drive I need. I stick in the rescue floppy.
I'm thinking, "Not seen this one before..."
I boot up FIRE and then inspect the boot drive. Everything looks correct, and fsck comes back clean.
I reboot the system, hit Ctrl-C, and disable the BIOS on the SCSI card in case it's pre-empting the IDE boot.
System still isn't booting, and I'm getting sick of seeing "vmlinuz... Ready". I mean, what the heck does it think it is? A C64?
Of course, reasoning with it and trying to persuade my 1GHz server that it's not some 8-bit glorified typewriter gets me nowhere. Just in case, I type in on the console:
10 Print "Hello World"
20 goto 10
just in case I'd found a kernel easter egg or something. Of course it doesn't work.
I try to re-enable the BIOS on the LSI card. I hit Ctrl-C to get into the card's BIOS. Nothing. I reboot and read the screen; it tells me to hit Ctrl-C. So I hit it. Still nothing.
I try to remember other control codes for LSI cards. I try Ctrl-M. Nothing. Ctrl-R, nope. Ctrl-L, nada. Ctrl-H, same. I vaguely remember that some SCSI card I once used took Ctrl-A.
I think to myself, "That only works on Adaptec cards, but this is LSI. It won't work." It worked. My jaw drops. I re-enable the BIOS on the SCSI card and reboot. Yep. It still says Ctrl-C, except that now Ctrl-C actually works.
This has to be the weirdest bug I've seen in an age. With the SCSI BIOS enabled you hit Ctrl-C to enter the card settings; with the SCSI BIOS disabled you have to hit Ctrl-A, even though the card tells you Ctrl-C. Wow.
My Linux server still won't boot my Linux.
I check the system BIOS, and I can't see anything different. I make a few small changes (reset configuration data, etc). Nothing.
I make a few big changes, and then decide I don't want to do these all at once. So I hit F9 to default the settings and make one change (making sure the system starts up in the event of a power cut).
I reboot, and Linux boots first time.
I think, "Hmm... what was different?"
I reboot and head back into the BIOS. I look and compare with what was there at the start, before I made the initial USB change.
Nothing. It was exactly the same. In everything.
I could have saved myself 3 hours of debugging if I'd thought that hitting the default button would have worked instead of undoing the one change I made. Ironically, undoing my initial change brought the system back to its default settings anyway, because that's the way we like our servers.
Now if I can only get Kimberlite to work properly on RH9...