19 Feb 2001 jct (Journeyer)

ODS version 0.9 is almost ready to release! I had planned to get it out this weekend, but had my plans changed for me. Here's the story, as written up for a non-technical friend:

The weekend got off to a great start. I finally figured out the problem with the new network and had it functioning as well as could be expected. The parts had arrived earlier in the week: a new 100 megabit network hub, and three network cards to go in the PCs. I had put the new cards in the computers, connected the cables to the new hub... but it wouldn't work... at least not at the 100 megabit speed, though it would still work at the slower 10 megabit speed. At the higher speed, I couldn't even ping one machine from the other.

Friday night I had finally figured out the problem: it was my homemade network cables. I had been crimping the connectors onto the cables the wrong way. Once I had this figured out, I got out my knife and crimping tool, snipped the old connectors off all the cables, and crimped new connectors on them all. I didn't have that many to recrimp, but it took me most of the night because I'm slow. I was especially glad when the cable I had run through the attic, connecting the server in my study to the PC in the play room, finally worked correctly. Pulling that cable through the attic had been a real headache, and I wasn't looking forward to having to replace it.

I was up until the wee hours Friday, but when I was done, the network was working well. All the cables had been tucked neatly away. The test cables, including the temporary one that had been lying stretched on the floor between my study and the playroom, were tidily coiled and stowed away for future use. And the new network hub was sitting in its new home on the shelf, warm and cozy, its "100" lights all lit up and blinking happily as the computers talked. I went to bed sometime around 3:00 AM, feeling good.

During the night, disaster struck.

I got up Saturday morning expecting to finish writing the manual for my camera software so I could finally release it to the world. But my server was in an odd state: system load was extremely high, and it was unresponsively slow. It appeared to respond to mouse movement, but I couldn't bring up a process list to see who was hogging the CPU. Finally, I switched to the console window and saw an endless list of disk errors spewing out the error log. Uh-oh.

Trying hard not to panic, I attempted to shut down the system, but with the disk not working, I couldn't shut it down normally. I had to just turn the power off. Now, this is a bad thing to do on a Windows box, and it's even worse on a linux machine, but I had no choice. Things weren't looking good.

They got even worse when I tried to restart the machine. One of the disk drives wouldn't respond at all. It was a 9 Gigabyte Ultra-2 SCSI drive: relatively new, the fastest disk in my system. I tried repeatedly to get the drive online, and as I did, I began to get that sick feeling in the pit of my stomach. The drive wouldn't respond.

Eventually I experienced a brief moment of hope as I got the drive to talk to the SCSI controller again. It wasn't dead, it was just badly injured! If I could get the drive online again, even for just a little bit, I could dump its contents onto tape and rescue my information before the drive died completely. I fumbled through my tape bin looking for some empties to dump onto. But my efforts were in vain: the drive wouldn't go back online. It would talk to the SCSI controller, but could only say that it had an internal error. Its injuries had left it brain-dead, and I would get nothing more out of it.

At this point, panic began to look more and more like a viable option. Trouble was, I wasn't exactly sure what to panic over. On a Windows system, you have C: and D: drives, etc. These drive letters usually correspond one-to-one to the separate pieces of hardware installed in your machine. Your A: drive is your floppy. If you have one hard drive, it's your C: drive. If you have a second hard drive, it's your D: drive. If you have a CDROM, it'll then be E: drive.

On any type of unix system, these distinctions are hidden from you; all the different drives are merged seamlessly into one huge filesystem. You don't see the boundaries between one disk and the next. You don't think in terms of which "drive" your information is on, you just think about its directory -- its location in the hierarchy. Eventually you tend to forget about the fact that all this information is really stored on separate pieces of hardware; it's just one large happy collection of information.

This is fine for day-to-day work, but it's an incredible hindrance when you want to panic. I couldn't remember exactly what had been stored on that 9 GB disk drive. I wanted to run about the house screaming "oh my god I've lost all my ________", but I wasn't sure whether to fill in the blank with "e-mail" or "software" or "snapshots" or "porn" or "secret Nixon audio tapes" or what!
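For the technically inclined: the boundaries between disks are hidden, but they can still be looked up. The following is only a minimal Python sketch, not anything I actually ran that weekend. It groups directories by the device number that `os.stat` reports, so directories on the same mounted filesystem land in the same group. The function name `group_by_device` and the directory list are made up for illustration.

```python
import os
from collections import defaultdict


def group_by_device(paths):
    """Group paths by the ID of the device (filesystem) that holds them."""
    groups = defaultdict(list)
    for path in paths:
        # st_dev identifies the device containing the file; paths on the
        # same mounted filesystem share the same value.
        groups[os.stat(path).st_dev].append(path)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical directory list; substitute whatever exists on your system.
    candidates = ["/", "/home", "/usr", "/var", "/tmp"]
    existing = [p for p in candidates if os.path.isdir(p)]
    for device, dirs in group_by_device(existing).items():
        print(f"device {device}: {', '.join(dirs)}")
```

Run while everything is still healthy and saved somewhere off the machine, a listing like this would at least tell you which blank to fill in.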

Eventually, I began to settle down and think again. I booted my linux rescue CD so I could look at the remaining good disk drive and figure out, through process of elimination, what I had lost. I got the good drive mounted, looked through its contents, and found my home directory. So I hadn't lost any e-mail. I had lost my operating system and my archive of digital photos. Brain beginning to function again, I remembered that I had a tape machine, so I probably had backups. Checking the shelf, I found backup tapes. I was finally beginning to see a way out of all this mess.

I won't talk you through the rest of it step by step, but suffice it to say that I spent most of the next 24 hours just bootstrapping a running system. I found two unused old disk drives, and I had to take my system almost completely apart to extract the old drive and find places for the two replacements. I was able to recover a functional operating system from tape; my most recent backups were from July of last year, but not much had changed since then, so my system is almost fully back to normal.

That was Saturday. In getting the new system running again, I'd needed access to the innards of the system, and the best way was just to lay the drives out on the floor, connect their cables, and let them run that way. When I finally got my system up and running again, having worked through the night, it was 7AM and I was too tired to reassemble it. Sunday I spent reassembling my system and cleaning up the mess.

The most critical item on the dead drive is the lost collection of digital photos because they're irreplaceable. The backup tape is from July of last year. Fortunately, I haven't taken many photos since then, and what photos I have taken are still in the camera. So it looks like I might not have lost much... assuming my backup tape is good. I'll try to recover those sometime this week.
