Recent blog entries for Boris

5 May 2005 (updated 5 May 2005 at 23:36 UTC) »

Note to self: There's a really good reason why you don't use dump to back up a filesystem that's NFS mounted...
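One likely reason: dump reads the underlying block device directly rather than going through the mounted filesystem, and an NFS mount has no local block device to read. A minimal sketch of a pre-flight check, assuming mount information in /proc/mounts format. The function names, the sample mount table and the `fileserver:/export/home` export are all made up for illustration.

```python
import posixpath

# Hypothetical mount table in /proc/mounts format:
# device mountpoint fstype options dump pass
SAMPLE_MOUNTS = """/dev/hda1 / ext3 rw 0 0
fileserver:/export/home /home nfs rw 0 0
"""

def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`."""
    path = posixpath.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # A mount contains the path if it is the path itself or a parent directory.
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest (most specific) matching mount point.
            if len(mount_point) > len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type

def is_local_filesystem(path, mounts_text):
    """True when the path sits on a mount whose type is not an NFS variant."""
    fs_type = fs_type_for(path, mounts_text)
    return fs_type is not None and not fs_type.startswith("nfs")

print(is_local_filesystem("/home/boris", SAMPLE_MOUNTS))
```

On a real Linux box the text could come from reading /proc/mounts instead of the sample string.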

<rant>

Grr... I hate outsourcing...

Recently I had a PDA that was broken. I had two options: 1) Ship it back to where I bought it and get a refund (they clearly hadn't checked it - all their fault) or 2) Find out the repair costs.

I thought I'd see what the repair costs were first. They recommended sending a web-form email. Having worked in support for many years, I like to let the poor support techs know what they are in for when I log a call. I did the diagnosis, and logged it nice and clearly listing the problems and symptoms in the email. (Short form: digitizer was broken. What are the repair costs?)

First response: Standard email back saying "Have you upgraded to the version 1.1 firmware? If that doesn't work try a soft reset, and if that doesn't work, try a hard reset. Thank you." What?! It's a hardware problem. I'd listed it as a physical hardware problem. How's that going to fix the busted digitizer? People, at least try and use a script that parses the email. I could write something in 10 minutes in Perl or PHP that would do a better job than that.
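For illustration, here is a minimal sketch of that kind of ten-minute triage script. It is written in Python rather than Perl or PHP, and the keyword lists and canned replies are hypothetical, picked only to match this one complaint.

```python
import re

# Hypothetical keyword lists, chosen only for illustration.
HARDWARE_WORDS = ("digitizer", "cracked", "broken", "screen", "hardware", "physical")
QUOTE_PHRASES = ("repair cost", "how much", "quote")

def triage(message):
    """Pick a canned reply based on crude keyword matching."""
    text = message.lower()
    words = set(re.findall(r"[a-z]+", text))
    is_hardware = any(word in words for word in HARDWARE_WORDS)
    wants_quote = any(phrase in text for phrase in QUOTE_PHRASES)
    if is_hardware and wants_quote:
        return "send repair price list"
    if is_hardware:
        return "route to repair technician"
    # Only fall back to the firmware script when nothing suggests hardware.
    return "send firmware and reset checklist"

print(triage("The digitizer is broken. What are the repair costs?"))
```

Even something this crude would not have answered a broken digitizer with a firmware upgrade.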

Turned it over to my wife to try to get a response (She's much better at getting people to do stuff). She did the email chat thingy from their website. Told them exactly what the problem was. Asked what the repair cost was. Was told it would be a warranty repair. No cost. Sounds good. So, I shipped it off.

A few days later I get a call saying that they looked at it. It would be a $125 repair, and could they have my credit card number?

I called them back, and some Indian bloke says that the head of engineering looked at it, and the digitizer was broken and it would be $125 to repair. I told him that we told them that *twice* and that they said it was a warranty repair.

He said, "Yes Sir, but the digitizer is broken and that's $125, but it has a 3 month warranty".

I said "Yes, but you said 'warranty repair - no cost' and now there's a cost, even though I told you twice that this is exactly what the problem was..."

He said "The head of engineering says that to repair it costs $125..."

I said "I described the problem clearly. You said 'Warranty Repair'. The problem was a broken digitizer. Why is this not a warranty repair?"

I could clearly tell he'd run out of scripted responses. He said "Sorry sir, I'm not a technician. We're in Southern India"

I asked him to transfer me to engineering. He responded "I can't do that". I repeated the request to be transferred to someone other than him, either his boss or someone who was dealing with the PDA. He said "I can't do that".

I said "Well get the head of engineering to call me then" He said "I can't do that - our contract states..." I cut him off and said "I don't care - I was told no cost, and now there is a cost, and you're not telling me why there is suddenly a cost, and you won't get the guy who is saying there is a cost to contact me."

He went back to the script and said "but the repair is under warranty and it's a 3 month warranty and can I have your credit card number"

One thing I hate worse than dealing with machines is people who act like machines. I told him to ship it back and hung up on him in mid sentence.

I long for the days when you could get transferred to some tech who would take 2 minutes out of his day to tell you that it looked worse than they thought. You could at least get an understanding of why you are suddenly paying out almost as much as the thing cost in the first place, and feel better about supporting some guy in Southern California because you would know exactly what was wrong.

I like to know why I'm forking out cash, and not for some one liner email that Mr Call Center in Southern India got with no clear description of the problem and reasoning why there's suddenly a charge.

Grr. I hate outsourcing. If you're going to outsource, at least outsource it to some tech who will know what the heck is going on. Don't outsource it to some script machine.

</rant>

OMG. Now they have spam degrees in Nigeria... Today's spam:

Dear Sir/Madam, I am Princess Chioma , daughter of HRH King Solomon Abonime, the king of Ogoni Kingdom. I am 25 years old and a graduate of Mass Communication. My father was the king of Ogoni Kingdom the highest oil producing area in Nigeria. He was in charge of reviving royalties from the multi-national oil companies and government on behalf of the oil producing communities in Nigeria.

etc, etc

17 Jan 2004 (updated 17 Jan 2004 at 01:10 UTC) »
How I wasted 3 hours in a server room,

OR

Why I'm cursed with this cluster implementation.

I have a cluster of linux systems, and they are controlled by some USB power switches so that the nodes can power each other off when a node dies.

Kimberlite doesn't understand these switches and the stonith module I hacked together in a couple of hours on Christmas eve just sucked too much to be stable, so I scripted the stonith stuff instead.

So I installed the scripts this morning and did some testing, and it all worked well. Until the very last reboot of the secondary node.

The module controlling the USB switch wouldn't finish initialising. It's a problem we had before and I thought I had it fixed. I decided to enable a USB option on the system's BIOS to see if that helped any.

Well it didn't. The system posted the SCSI card, then hung. It didn't even get anywhere near starting to look at the boot drive. Of course this card was controlling the quorum disk and I had to kill the cluster again. I popped out the card and the system then posted far enough that I could get back into the BIOS setup. I disabled the option I had enabled, and then put the SCSI card back in again.

So now, the system posts and starts the boot process, but won't boot the drive I need. I stick in the rescue floppy.

It boots.

It goes:

vmlinuz..................

Ready.

And hangs.

I'm thinking, "Not seen this one before..."

I boot up FIRE and then inspect the boot drive. Everything looks correct, and fsck comes back clean.

I reboot the system and hit Ctrl-C and disable the bios on the scsi card in case it's pre-empting the ide boot.

No change.

System still isn't booting, and I'm getting sick of seeing "vmlinuz... Ready". I mean, what the heck does it think it is? A C64?

Of course reasoning with it and trying to persuade your 1GHz server that it's not some 8 bit glorified typewriter gets me nowhere. Just in case I type in on the console:

10 Print "Hello World"

20 goto 10

just in case I found a kernel easter egg or something. Of course it doesn't work.

I try to reenable the LSI card. I hit Ctrl-C to get into the bios. Nothing. I reboot and read the screen, it tells me to hit Ctrl-C. So I hit it. Still nothing.

I was trying to remember other control codes for LSI cards. I try Ctrl-M. Nothing. Ctrl-R, nope. Ctrl-L, nada. Ctrl-H, same. I vaguely remember that some SCSI card I once used took Ctrl-A. I think to myself "That only works on Adaptec cards, but this is LSI. It won't work". It worked. My jaw drops. I re-enable the bios on the scsi card and reboot. Yep. It still says Control-C, except that now Ctrl-C actually works.

This has to be the weirdest bug I've seen in an age. With scsi bios enabled you hit Ctrl-C to enter the card settings, with scsi bios disabled you have to hit Ctrl-A, even though the card tells you Ctrl-C. Wow.

My linux server still won't boot my linux.

I check the system bios, and I can't see anything different. I make a few small changes (reset configuration data, etc). Nothing.

I make a few big changes, and then decide I don't want to do these all at once. So I hit F9 to default the settings and make one change (making sure the system starts up in the event of a power cut).

I reboot, and linux boots first time.

I think "hmm... what was different"

I reboot and head back into the bios. I look and compare with what was there at the start before I made the initial USB change.

Nothing. It was exactly the same. In everything.

I could have saved myself 3 hours of debug if I'd thought that hitting the default button would have worked instead of undoing the one change I made. Ironically undoing my initial change brought the system back to its default settings anyway, because that's the way we like our servers.

Now if I can only get kimberlite to work properly on RH9...

Just now I got a piece of spam trying to make me download a tool called "Spam Ready" which supposedly stops spam.

Who in their right mind would trust software from a company who spams...

Time to update my content filter yet again...

Been hacking at getting a cluster running without the need for actual shared hard drives (not even for the quorum), and I'm looking at using some form of network block drivers for this. (BTW If anyone has clustered linux servers without using physical shared drives let me know)

Cyclic redundancy for today: I needed to hit google to look up some information on block drivers. Switched back to my code screen and hacked a little. Switched back to google, and entered "google.com" in the google search box.

It came back and one of the options it offers is "Show Google's cache of google.com" Nice. I wonder how many levels of cache I can go to :)

10 Jul 2003 (updated 10 Jul 2003 at 22:30 UTC) »

So there was a fire a few blocks up the street from the university yesterday.

Apart from a few hours when the power was cut to my office the fire seemed to not have any effect on the university.

Until today.

My co-worker started his holidays today.

I get into work and go get a laptop to restart the sun servers after the power cut.

I discover that security has closed the lab where we keep the main servers and our storeroom. The storm drain/sewage system/something had backed up, water had poured into the warehouse, and this caused the subfloor under the lab to fill to a depth of about 8 inches. This water was filled with charred sawdust and wood fragments, so we got flooded, but the place smells like a fire.

I had to move all the equipment in the lab into a store room, close the lab, reschedule two classes, notify staff and students, and deal with a dozen other issues arising from having to close a computer lab for the foreseeable future.

And all I wanted to do was to finish working on my iSCSI clustering project.

Hmm. Real Weasel is a nifty bit of hardware.

Exactly what we need to remotely administer a server that locks up randomly.

My attempt to get my PC to show the modem status lights was a dismal failure.

Earthlink's stupid dialler ate the modem status indicator a long time ago. I've been poking around Windows innards trying to find what was changed.

Last night I got the lights to display once. Just once. They never came back again. I've tried to recreate what I did so I can figure out what needs to be fixed.

Tonight I gave up. I wiped out all the dialup networking files I could find and reinstalled windows hoping that wiping out the registry and dumping new files might work.

Nope. Didn't work. Something was changed that lasts past a reinstall, and I don't know what the heck it is.

I've searched the internet for solutions, but there isn't a solution to be found.

Anyone with ideas or, even better, a solution should email me.

A kernel ghost story...

I just discovered a real nasty thing with grub, raid and RedHat 7.3. It goes like this:

I just spent the last 2 days building a server. I have RedHat 7.3 as the os, 2 ide drives running software raid 1, and these are using ext3 as the filesystem.

7.3 has been out a while and a lot of the packages are out of date, so I do an update of all RPMs to bring the server up to spec. This includes updating the kernel to 2.4.18-5. It shouldn't be a problem.

I check /boot, and it's all been updated to the correct kernel. I check grub, and these files are pointing to vmlinuz 2.4.18-5. I'm happy and I reboot.

Grub loads. It prompts with "Red Hat Linux (2.4.18-3)". I think "What the...?" I try booting it to see what happens, expecting obscure error codes, kernel panics, and the end of civilisation as we know it.

It starts booting, then it loads vmlinuz-2.4.18-3, along with the rest of the OS. Of course all the device drivers fail because the symbols are wrong. I'm amazed, astounded and astonished. I search for signs of 2.4.18-3. It's not there. Totally absent. Not a 2.4.18-3 anywhere on the server. Totally eradicated. I check two different ways just to be sure. I am totally confused, confounded and chastened. The server just semi-successfully booted a ghost; I had thought I felt a chill enter the room. I wrote that off to the AC coming on, maybe it hadn't. I think "Buh?"

I search the internet for answers - it knows everything - this has to have happened before. This will be fixed in two minutes. No problem.

I find lots of questions relating to this. No answers. Nothing. Not a clue. Not a hint.

This is a problem.

I wonder if maybe civilisation did end when I booted a kernel that didn't exist and could only have been a ghost. I idly wonder if Dr. Egon Spengler would have any ideas. He seems like the type that would know about a ghost kernel.

I assume grub is storing the config file elsewhere. It doesn't explain how it's loading the previous kernel though; it can't be storing that much data in the MBR. I try running grub-install. It won't run because it says "/dev/md0 does not have any BIOS drive". Of course not. I try forcing it to do something. Anything. Because I have everything on a raid device it refuses. I try everything short of trying to find a sharp stick and poking it, but it just won't work. By now I'm looking at losing 2 days of a server install because I have no way to get all my config files off the machine (no devices are working, remember?). By now, I'm beginning to drool and idly wondering if it will stain my shirt.

I decide that the system is pretty irreparable, and elect to try desperate measures.

init 1
...
umount /dev/md0
mount -t ext3 /dev/hda1 /tmp
cd /tmp/boot
ls

Suddenly I see the old boot directory. vmlinuz-2.4.18-3 is there, along with the old grub files that it won't change. I wipe the drool off my chin, and type the following:

mkdir old
mv * old
cp -r /boot/* .
ls

All new files are there, I double check permissions and links. I type: sync; init 6

Grub appears on the screen. It happily prompts me to boot "Red Hat Linux (2.4.18-5)". The new kernel boots and all is right with the world again. I've exorcised the ghost of 2.4.18-3 and I can go back to breaking MySQL again.
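The ghost was a stale copy of the boot files on the raw partition, which no longer matched what was mounted at /boot. A minimal sketch of a check that would have caught it, assuming the raw partition is mounted somewhere alongside the live /boot (as with /tmp/boot above). The function name and the throwaway demo directories are made up for illustration.

```python
import filecmp
import os
import tempfile

def compare_boot_dirs(live_boot, raw_boot):
    """Report top-level files that differ between two boot directories."""
    listing = filecmp.dircmp(live_boot, raw_boot)
    # shallow=False forces a byte-by-byte comparison instead of trusting size and mtime.
    _, mismatch, errors = filecmp.cmpfiles(
        live_boot, raw_boot, listing.common_files, shallow=False
    )
    return {
        "only_in_live": sorted(listing.left_only),
        "only_in_raw": sorted(listing.right_only),
        "different": sorted(mismatch + errors),
    }

def _write(directory, name, content):
    with open(os.path.join(directory, name), "w") as handle:
        handle.write(content)

# Demo with throwaway directories standing in for /boot and /tmp/boot.
with tempfile.TemporaryDirectory() as live, tempfile.TemporaryDirectory() as raw:
    _write(live, "vmlinuz-2.4.18-5", "new kernel")
    _write(live, "grub.conf", "title Red Hat Linux (2.4.18-5)\n")
    _write(raw, "vmlinuz-2.4.18-3", "old kernel")
    _write(raw, "grub.conf", "title Red Hat Linux (2.4.18-3)\n")
    report = compare_boot_dirs(live, raw)

print(report)
```

Anything listed under `only_in_raw` or `different` is a candidate ghost: a file the boot loader may still read even though the mounted /boot no longer shows it.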
