Older blog entries for dalke (starting at number 26)

Regarding gary's comment on The ultimate hacker's car. I program so I can forget about things - I can forget how something is implemented because it just works the right way. I don't want an infinitely adjustable car, just one I like. Knowing how to tweak the "camber, castor and toe angles" simply does not interest me. I don't have a customized .emacs file. I haven't changed my app skins. The only window manager settings I've changed are to make the background black and to have "focus follows mouse" - in order to keep things more like the 8 years I spent using SGI's 4DWM.

Watched the Discovery Channel on cable a few weeks ago. (Don't have cable for fear of being a deeper thrall to the TV Fates. :) A couple of years ago I watched it when a show started on about severe storms. "Cool" I thought, expecting to learn more about how storms are formed. Nope. Sorry. It was a dramatic reenactment of a family who lived through a tornado. No science.

Then, two weeks ago I watched a show about storms on other planets, with ... dramatic reenactments of: "what would happen if we had dust storms like those on Mars?" "Storms like Jupiter's Great Red Spot?" "Flying (in a 727 I think) above a supercell in Jupiter's atmosphere?" "Experience a coronal mass ejection 10,000 times greater than any ever experienced?"

Bleh.

Going to be in Cambridge, UK in a couple of weeks, visiting EBI. Then to BOSC and ISMB. Bought luggage yesterday because not only is my only other luggage carry-on size, but the handle broke just before my last set of trips.

Moths everywhere. Open a door and in come a slew. They love cooking themselves on my halogen lamp. I now know the aroma of toasted bug.

Been looking around for text search engines for the biopython.org project. I want to be able to do fast text searches of multi-gigabyte data sets, like GenBank.

So far what I have is the parser generator, which can identify the semantically important regions of the various formats. These would pull out things like "author", "organism", "accession number" and "description", possibly with the help of some Python code or with XSLT.

People would like to be able to search that data. Some of the searches are easy to implement, like a search for the identifier "100K_RAT." Others are a bit more complicated, like searching for the author named "Smith", as compared to "Smith, D. C." which would be the full text of an author field. Finally, people would like boolean and phrase search capabilities, like searching "'hoof and mouth disease' or 'foot and mouth disease'" in the description field. Some of the fields may allow stemming and some may not, although stemming is not a requirement.
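The three kinds of search described above can be sketched with a small in-memory index. This is only an illustration under stated assumptions, not an existing engine: the `FieldIndex` class, its method names, the record keys `REC_A` and `REC_B`, and the sample field text are all made up for the example. There is no stemming, and a real system would keep the postings on disk.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")


def tokenize(text):
    """Lowercased word tokens; no stemming."""
    return [w.lower() for w in WORD.findall(text)]


class FieldIndex:
    """Toy positional index keyed by (field, word). Hypothetical, for illustration."""

    def __init__(self):
        self.postings = defaultdict(dict)   # (field, word) -> {record: [positions]}
        self.exact = defaultdict(set)       # (field, full text) -> records
        self.next_pos = defaultdict(int)    # (record, field) -> next free position

    def add_field(self, record_id, field, text):
        self.exact[(field, text)].add(record_id)
        words = tokenize(text)
        base = self.next_pos[(record_id, field)]
        for offset, word in enumerate(words):
            positions = self.postings[(field, word)].setdefault(record_id, [])
            positions.append(base + offset)
        # Leave a gap so phrases cannot span two separate field values.
        self.next_pos[(record_id, field)] = base + len(words) + 1

    def exact_match(self, field, text):
        """Records where the whole field text is exactly `text`."""
        return set(self.exact.get((field, text), ()))

    def word(self, field, word):
        """Records where `word` appears anywhere in the field."""
        return set(self.postings.get((field, word.lower()), {}))

    def phrase(self, field, phrase):
        """Records where the words of `phrase` appear consecutively."""
        words = tokenize(phrase)
        if not words:
            return set()
        hits = set()
        for record_id, starts in self.postings.get((field, words[0]), {}).items():
            for start in starts:
                if all(
                    start + i in self.postings.get((field, w), {}).get(record_id, ())
                    for i, w in enumerate(words[1:], 1)
                ):
                    hits.add(record_id)
                    break
        return hits


index = FieldIndex()
index.add_field("REC_A", "author", "Smith, D. C.")
index.add_field("REC_A", "description", "Studies of foot and mouth disease")
index.add_field("REC_B", "description", "Hoof and mouth disease in cattle")

# Boolean OR of two phrase searches is just a set union.
either = (index.phrase("description", "hoof and mouth disease")
          | index.phrase("description", "foot and mouth disease"))
```

Boolean AND and NOT would fall out the same way as set intersection and difference on the returned record sets.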

All this data is record oriented and I have full access to the original data files (which may be compressed, so I wouldn't want to require access to the data to do the search).

The data is not in a form which can be handled directly by an indexing engine. For example, every record may have the word "GENE" because that's part of the format definition, but I don't want a search for "gene" to return all records, or ignore all records because "gene" got put on a stop list. ("Gene V" is the name of a specific gene.)

Instead, I can convert the input into a proper form and either call an API to say "here's the text for a new 'description' field" or convert the text into appropriate XML.

I would like to be able to update the database so modified forms of a record replace old entries. (There is a guaranteed unique key for each record.) This is needed for GenBank records which distribute a delta file every day but only do full releases every few months.
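One way to get replace-on-update, sketched here under assumptions: keep a reverse map from each record's unique key to the terms it contributed, so the old postings can be dropped before the new version is indexed. The `UpdatableIndex` class and the key `KEY1` are hypothetical names for this example, and splitting on whitespace stands in for real field parsing.

```python
from collections import defaultdict


class UpdatableIndex:
    """Toy inverted index where put() replaces any earlier version of a record."""

    def __init__(self):
        self.postings = defaultdict(set)   # (field, word) -> record keys
        self.terms_by_key = {}             # record key -> set of (field, word)

    def remove(self, key):
        """Drop every posting contributed by `key`; a no-op for unknown keys."""
        for term in self.terms_by_key.pop(key, ()):
            keys = self.postings[term]
            keys.discard(key)
            if not keys:
                del self.postings[term]

    def put(self, key, fields):
        """Insert or replace a record; `fields` maps field name -> text."""
        self.remove(key)
        terms = {
            (field, word.lower())
            for field, text in fields.items()
            for word in text.split()
        }
        for term in terms:
            self.postings[term].add(key)
        self.terms_by_key[key] = terms

    def search(self, field, word):
        return set(self.postings.get((field, word.lower()), ()))


idx = UpdatableIndex()
idx.put("KEY1", {"description": "putative kinase"})
# A later delta file carries a modified form of the same record.
idx.put("KEY1", {"description": "confirmed kinase"})
```

After the second `put`, a search for "putative" no longer finds the record, while "confirmed" and "kinase" both do.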

Finally, it needs to work on Linux, Solaris and IRIX and hopefully MS Windows and Mac (at least X). Python interface a plus, but I've done plenty of interoperability with command-line programs before, and with calling C APIs using Python.

Oh, and did I mention that fast is good? Ideally, simple lookups for some fields, like record identifier or aliases, should be blisteringly fast, while more complex keyword searches should take a fraction of a second. The system I have now, which only does exact-match word searches, is built on Sleepycat's Berkeley DB and does the lookup part quite quickly.

I know I'm not asking for all that much, am I? :)

I tried looking around for existing software which does this. I've used Glimpse before, but that was when I didn't worry about being able to search for given subfields. (The price of a couple thousand dollars is okay with me, so long as I can test it out before committing.)

The only program which seems close is Zebra, or rather a Z39.50 system in general. But I know almost nothing about those systems and it doesn't look like all that many people use that specific program.

A completely different one would be eXist, which allows searches of XML fields using MySQL. However, I would like not to have a database system running because that makes for a more complicated setup. Plus, it is written in Java, which would put an undue requirement on a bioPYTHON project.

Any thoughts?

Have I ever mentioned how much I hate computers?

Now, don't get me wrong. I love programming. But I hate computers.

I replaced the hard drive in my computer about 1.5 months ago. It dual boots between Windows 98 and Mandrake. Why am I using Windows? Because I couldn't get the ppp connection working under Mandrake - it would stop anywhere between 1 minute and 30 minutes of use. The phone line would still be in use, but just sitting there making noise.

So I replaced the hard drive and upgraded Mandrake from 6.5 to 7.2. Thanks to the help of a friend, it went pretty smoothly, although I exclaimed "I hate computers" a few dozen times during that process.

Last Friday evening, Norton Antivirus kicked in and said the master boot record had changed and would I like to fix it. Admittedly, I didn't read the text too closely where it said "this could be caused by upgrading the OS" so I hit the "Repair" button. This replaced the MBR with the old and now invalid information NAV had saved somewhere. Which meant I couldn't boot my machine.

Did I mention I hate computers?

Please realize that I didn't and still don't have a good idea of what is on the master boot record. All I thought it did was wipe the boot loader, so I spent about 4 hours on Saturday trying to reinstall grub.

It wouldn't work. It wouldn't identify the partition but would say it was unknown and of type "0x83". This is strange because 0x83 is the ext2fs partition type identifier. Perhaps it's something to do with the "stage 1.5" loader, so I tried figuring that out. Nope. Pulled up the latest CVS to see if that fixed it. Nope.

Started going through the code. Frustrating code. Global variables everywhere make it hard to follow the thread of meaning. Various #ifdefs for stage 1.5 vs. stage 2 code. Mixed use of "grub_printf" and "printf". Using a statically embedded function as an iterator for a variable in the surrounding scope. Overuse of the for() statement, with all of the end-of-loop checks packed inside the for() so that the body is empty.

Anyway, I managed to track the problem down to a seven line conditional check for an if statement. Everything was fine except for the last part, which was a superblock check.

It failed because the partition entry no longer pointed to an ext2 file system. This is a good thing. However, it didn't tell me why it failed and only implied that something was wrong with the filesystem type identifier.
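For what it's worth, here is a minimal sketch of a superblock check that says why it failed. It is not grub's code. It relies only on the documented ext2 layout: the superblock starts 1024 bytes into the partition, and its 16-bit little-endian magic number, 0xEF53, sits 56 bytes into the superblock. The function name and the message wording are made up for this example.

```python
import struct

EXT2_SUPERBLOCK_OFFSET = 1024   # superblock starts 1024 bytes into the partition
EXT2_MAGIC_OFFSET = 56          # offset of the s_magic field inside the superblock
EXT2_MAGIC = 0xEF53


def describe_ext2_check(partition_bytes):
    """Return a human-readable reason instead of a bare pass/fail."""
    start = EXT2_SUPERBLOCK_OFFSET + EXT2_MAGIC_OFFSET
    raw = partition_bytes[start:start + 2]
    if len(raw) < 2:
        return "too short to contain an ext2 superblock"
    (magic,) = struct.unpack("<H", raw)
    if magic != EXT2_MAGIC:
        return "superblock magic is 0x%04X, expected 0x%04X" % (magic, EXT2_MAGIC)
    return "ext2 superblock magic found"
```

A message like the second one would have pointed at the partition's contents rather than at the type identifier in the partition table.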

After I figured out that the partition table was bad, I spent the rest of Saturday getting the machine back into shape. Managed to dig up some tools (like rescuept) which reconstruct the partition table information.

Even getting that going was nasty. Had to figure out how to make a bootable floppy. Used a Red Hat 7 distribution to get into a bare-bones Linux mode then figure out how "mknod" worked so I could talk to the different partitions.

But I don't know anything about partition tables. I managed to get the settings in fdisk to match the output of rescuept. (Used another program to double check, "s" something.)

Amazingly enough, that seemed to work.

EXCEPT!

Under Win98 I could no longer put my laptop into suspend or hibernate mode. No problem under Linux. The only thing that changed was the partition table. I had assigned the FAT partition for my D: drive an identifier of 0xB, but couldn't tell if 0xC was really the right one. (There are suggestions that early FAT32 needed special BIOS support, so I conjectured that perhaps a different identifier was used when that changed.) Changed to 0xC. Still no suspend. Conjectured that there was a setting changed in the registry.

There I was able to pull out my secret weapon, which was that I hadn't made all that many changes to Windows since I copied the files from my old drive to the new one, so I was able to export and diff the two sets of registry settings. There were a couple of differences, but changing them didn't fix things. Replacing the new registry with the old also didn't. Doing a recursive diff on the two Windows installations didn't highlight anything.

Windows running from the old disk can hibernate. So there isn't some PROM setting which needs to be changed. I backed up the /dev/hda1 partition and copied over the old installation. Should be identical code. But still Windows doesn't hibernate from that drive.

A friend of mine works for Microsoft. Called him up hoping he could divine why two almost identical configurations would allow one to standby and the other not to do so (even though it did the week before). His divining agents were no help. And he can't get his laptop to hibernate at all under Windows Me.

Did I mention I don't like computers?

While rereading the Linux man page for "fdisk" I see the comment "you should always use an OS-specific partition table program." So I played around with the Windows fdisk, which is a pale shadow of the Linux one.

Then I noticed a small comment on my worksheet. (Paper doesn't get a corrupted partition table!) It had the statement "in extended partition." Apparently there is a difference between "primary" and "extended" partitions and what I had used as primary partitions 2, 3 and 4 really should be extended partitions 5, 6 and 7. While Linux and Windows understood them just fine as primary partitions, I guessed that perhaps Windows acts differently when there's more than one primary partition.
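For reference, the primary slots live in a fixed spot in the first sector of the disk: four 16-byte entries starting at byte 446, followed by the 0x55 0xAA signature at byte 510. Each entry carries a one-byte type identifier, and a type of 0x05 or 0x0F marks the slot as an extended partition, a container for the logical partitions that Linux numbers from 5 upward. A minimal read-only sketch follows; the function name, the dictionary keys, and the sample sector are made up for this example, and the type table is deliberately tiny.

```python
import struct

PARTITION_TYPES = {
    0x00: "empty",
    0x05: "extended",
    0x06: "FAT16",
    0x0B: "FAT32 (CHS)",
    0x0C: "FAT32 (LBA)",
    0x0F: "extended (LBA)",
    0x83: "Linux",
}
EXTENDED_TYPES = (0x05, 0x0F)


def read_primary_entries(sector):
    """Decode the four primary slots from a 512-byte MBR sector."""
    if len(sector) < 512 or sector[510:512] != b"\x55\xaa":
        raise ValueError("missing 0x55AA boot signature")
    entries = []
    for slot in range(4):
        raw = sector[446 + 16 * slot:446 + 16 * (slot + 1)]
        # status, 3 CHS bytes (skipped), type, 3 CHS bytes (skipped),
        # first sector (LBA), sector count
        status, ptype, lba_start, sectors = struct.unpack("<B3xB3xII", raw)
        entries.append({
            "slot": slot + 1,
            "bootable": status == 0x80,
            "type": ptype,
            "name": PARTITION_TYPES.get(ptype, "unknown"),
            "extended": ptype in EXTENDED_TYPES,
            "lba_start": lba_start,
            "sectors": sectors,
        })
    return entries


# A made-up sector: one bootable FAT32 primary plus one extended container.
sample = bytearray(512)
sample[446:462] = struct.pack("<B3xB3xII", 0x80, 0x0B, 63, 1000)
sample[462:478] = struct.pack("<B3xB3xII", 0x00, 0x05, 1063, 5000)
sample[510:512] = b"\x55\xaa"
entries = read_primary_entries(bytes(sample))
```

The logical partitions themselves are described by a chain of further tables inside the extended partition, which this sketch does not follow.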

Rebuilt the partition table to use an extended partition. When I used primary partitions I needed to specify the start and end clusters. When using the extended table I noticed that the start clusters were always correct. Figured this was a good thing.

Somewhere in here my FAT16 and EXT2 partitions became corrupted. Don't know if it was when I created the extended partition or playing around with the Windows fdisk or I forgot to unmount nicely or what it was. But I semi-ruined them. Remember that I said I backed up the FAT16 partition? It was onto the ext2 one, and now that file is in never never land. Also got the only time I've seen where fsck stops and asks for manual intervention.

Computers. Hate. Blech.

Managed to mount and save my home directory. It's about 3GB of which over 1GB are files from various bioinformatics databases. Easy to drop a GB on a 20GB drive. I must admit that nice feature of computers.

So now I'm in the process of reinstalling everything back onto my new hard drive. Didn't lose that much. I was very cautious about my email, so had an extra archive of that. Worst problem will be the final details of my Q4 finances, but even there I still have the paper copies.

Still, I've been working at this for four days, and learned a slew of things I didn't care about and would rather forget. Everything I did needed to be double checked because if I, oh, swapped an if= with an of= I would wipe everything. (I am able to borrow 10GB of space on someone else's machines, so I do have a second backup.)

I hate computers.

How to improve things?

First would be if Norton used the word "Restore" rather than "Repair" in that button I pressed which sent me on this nasty journey. I would really enjoy it if Norton had an "undo last repairs" option, since I figured out there would be a problem before I rebooted. I could also have booted into Windows from a grub'ed boot disk to run that command.

Huh. When anything other than Norton touches the MBR, NAV kicks in to say there might be some virus activity. That would have warned me that something possibly bad was taking place. I wonder how NAV recognizes self and if an anti-antivirus program could take advantage of that.

Second would be if there was no distinction between "primary" and "extended" partition. I bet it was some sort of hack to get around the 2GB limit in early hardware.

Third would be more verbose reporting information in grub, to say why it couldn't recognize a disk with a correct partition number.

Fourth would be better reporting in Windows to say why things failed - that is, an equivalent to the system log under unix. Hmm, I bet there is, but I just don't know how to find it.

There's probably more, but that's enough for now.

Don Norman is right. Computers are still in their primitive infancy. Will they mature in my lifetime or will there continue to be levels upon levels of needless complexity?

Don't come back telling me why things have to be done this way. Think of it as a challenge. People use software to achieve goals in their domains. How can you reduce (eliminate!) the need for knowledge unrelated to those goals?

Bear in mind too that I've been a programmer since '83 and a professional programmer since '95. If I have this much problem with computers, no wonder I'm not the only one who hates them. But as a programmer, they're the only game in town.

How about a Python solution to the Scheme problem? Of course, I made a more general solution allowing any number of "f"s :)

class f:
    def __init__(self, s):
        self.s = s
    def __add__(self, other):
        return Merge(self, other)
    def __getitem__(self, i):
        return self.s[i]
    def __len__(self):
        # needed because Merge.__len__ calls len() on both operands
        return len(self.s)

class Merge:
    def __init__(self, left, right):
        self.left = left
        self.right = right
    def __getitem__(self, i):
        return self.left[i] + self.right[i]
    def __len__(self):
        return min(len(self.left), len(self.right))
    def __str__(self):
        s = ""
        for i in range(len(self)):
            s = s + self[i]
        return s
    __repr__ = __str__
    def __add__(self, other):
        return Merge(self, other)

>>> f("Js nte ceeWnae")+f("utAohrShm anbe")
Just Another Scheme Wannabee
>>> f("123") + f("ABC") + f("abc") + f("!@#")
1Aa!2Bb@3Cc#
>>>

This uses Python 1.5.2. With Python 2.0 I would probably do

import UserString

class f(UserString.UserString):
    def __add__(self, other):
        return Merge(self, other)

An uglier version of __str__ using 2.0 tricks is

def __str__(self):
    return "".join([x[0]+x[1] for x in
                    zip(self.left, self.right)])
I could probably cut the line count in half if I got rid of the generality, so no fair doing a direct line count comparison between this and the Scheme solutions!
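For comparison, and assuming a present-day Python 3 rather than the 1.5.2 and 2.0 interpreters discussed above, the same interleaving can be sketched as a single function. The name `interleave` is made up for this example, and like `Merge` it stops at the shortest input.

```python
def interleave(*strings):
    """Interleave characters from any number of strings, zip-style."""
    return "".join("".join(chars) for chars in zip(*strings))


two = interleave("Js nte ceeWnae", "utAohrShm anbe")
four = interleave("123", "ABC", "abc", "!@#")
```

This gives up the `+` operator syntax, which was the point of the class-based version, in exchange for brevity.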

jdybnis asked:
Does anybody have any others [points to regexp package]?

There's pcre at pcre.org. Python 2.0 includes the sre regexp library. There are a lot more. A quick web search also found Regex++ and the ORO library for Java which is part of Jakarta. Then there's my bastardized regular expression converter to the mxTextTools state machine as part of Martel. (Bastardized since it doesn't do full backtracking.)

I wrote an article for Dr. Dobb's Journal. It's in the January 2000 edition, which I understand is starting to arrive in the mail. Soon, yes, you too will be able to own a copy of words written by yours truly! As an added bonus, this is their 25th anniversary edition.

3 Dec 2000 (updated 3 Dec 2000 at 21:05 UTC) »
deekayen:
You can't have safe sex when you're having sex with someone infected with HIV or AIDS. There are pores in latex condoms. Sure, you can't see them... they're small... but HIV is still 1/100 the size of the pores in the condom and can easily slip right through.

This statement is completely incorrect. Here are some STD virus sizes from The Big Picture Book of Viruses:

  • Retroviridae (HIV) are spherical; 80-100 nm in diameter
  • Herpesviruses (herpes and cytomegalovirus) are spherical; 120-200 nm in diameter
  • Papillomavirus (from http://www.uct.ac.za/depts/mmi/stannard/papillo.html) are about 55 nm in diameter.
  • Hepadnaviridae (Hepatitis B) are filamentous; 40-48 nm in diameter
  • Flaviviridae (Hepatitis C) are spheroidal; enveloped; 40-60 nm in diameter
  • Poxviridae (Molluscum contagiosum) are ovoid, or brick-shaped; 140-260 nm in diameter; 220-450 nm long

So Hepatitis B and C are both smaller than HIV, and the ratio of smallest HIV to largest Molluscum contagiosum is still less than a factor of 4. (The volume is about a factor of 100, but that's for the extreme case and you're talking about pore size.)

Second, what is the pore size of a condom and how important is that in transmission? I found claims like yours which state it as fact but don't back it up. On the other hand, sites like http://www.fda.gov/cdrh/ost/reports/fy98/INFECTION.HTM and http://hivinsite.ucsf.edu/topics/condoms/2098.32ca.html point out the results of tests on the permeability of condoms and find them effective barriers. Yes, the pore size is larger (seems to be about 1000nm if I read the reports correctly) but pore size isn't the only consideration in transmissibility. These findings are backed up by theoretical, laboratory and epidemiological studies.

So you are incorrect both on the relative sizes of the viruses and the importance of pore size on transmissibility of viruses through condoms. Unless you have evidence to back yourself up?

Update

*Chagrin* Oops, I misread the "1/100" and thought deekayen said the HIV virus was 1/100th the size of other viruses. He did not say that. He said they were 1/100th the size of the pores in condoms. The numbers I found suggest it's more towards 1/10th than 1/100th, but not enough to be seriously wrong on that account.

Still, there have been tests upon tests which show that condoms are an effective means to prevent STD infection.

My apologies for the misinterpretation.

Today's my birthday (okay, it's tomorrow in most of the world, but not according to my watch :). I am now 30 years old. Oh my.

As my present to the world, I released Martel-0.2 today and put up my poster from BOSC.

The short description is that it's a parser generator written in Python and designed for stateful *regular* grammars, as compared to context free grammars where you can lexically tokenize with very little knowledge of where you are in the file. This is very handy when parsing files designed to be read by humans or hand-written parsers (that is, written by people with no knowledge of recursive descent).

I use a modified version of a subset of the Perl5 regular expression language to describe the file format. I take the expression and convert it to a table for mxTextTools, which parses the string and returns a taglist describing the file as a tree. (?P<named>groups) are used to label interesting nodes in the tree.

The tree is traversed in prefix order and sent to the callback object as events. I reuse the SAX API for the event names and meanings, so I can leverage off of a lot of existing code and ideas, like DOM.
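To give the flavor of the idea without Martel itself, here is a rough sketch using only the standard library: a regular expression with named groups describes a line, and each named group is reported to a SAX-style handler as a start/characters/end triple. None of this is Martel's API. The `LINE` pattern, the one-line record format, the `parse_line` and `Collector` names, and the sample text are all made up, and the sketch handles only flat groups that always take part in the match, whereas the real thing builds a tree.

```python
import re
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl

# Hypothetical one-line record format, invented for this sketch.
LINE = re.compile(r"ID (?P<entry_name>\S+) +(?P<length>\d+) AA\.")


def parse_line(text, handler):
    """Send SAX events for one line: one element per named group."""
    match = LINE.match(text)
    if match is None:
        raise ValueError("line does not match the format")
    no_attrs = AttributesImpl({})
    handler.startDocument()
    handler.startElement("record", no_attrs)
    pos = match.start()
    # Visit the named groups in the order they appear in the text.
    groups = sorted(LINE.groupindex.items(),
                    key=lambda item: match.start(item[1]))
    for name, number in groups:
        start, end = match.span(number)
        if start > pos:
            handler.characters(text[pos:start])   # unlabeled text in between
        handler.startElement(name, no_attrs)
        handler.characters(text[start:end])
        handler.endElement(name)
        pos = end
    if pos < match.end():
        handler.characters(text[pos:match.end()])
    handler.endElement("record")
    handler.endDocument()


class Collector(ContentHandler):
    """Records the events it receives, to show what a callback object sees."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


collector = Collector()
parse_line("ID EXAMPLE_1 42 AA.", collector)
```

Because the handler is an ordinary `ContentHandler`, anything that already consumes SAX events could be plugged in where `Collector` is used here.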

I'm pretty proud of it. It took a lot of thought and work, and the result is quite nice, and I haven't come across anything like it elsewhere. (Pointers anyone?)

Oh, I see I forgot to mention the license. It's the old Python license "with the serial numbers scratched off", which is basically BSD but without the advertising clause. Enjoy!

This morning - early morning - Mitch and I hiked up Atalaya Mountain (7 miles round trip, from 7340 to 9121ft elevation). It's the biggest mountain closest to Santa Fe, and is the boundary between Santa Fe and the watershed. From the top you get a great view of the city and all up and down the Rio Grande valley.

This was the first non-easy hike I did since the forests reopened after the fires. I've done the hike during the day a couple of times, but always ended up too hot and tired to really enjoy it. The problem being I was doing it during the day, in summer, in a desert, on a trail which is only 1/4th under the trees. (Hint: take lots of water.)

Instead, I figured on doing it at night, which would be cooler, with no need for sun block, and with no chance for an afternoon thunderstorm popping up while on the trail. All good things. I figured the best time to go would be to make it to the top at sunrise, so I could see the city and the valley with all the lights, and also see the change as the sun appears. I'll also point out I chose yesterday because it's a couple of days after a full moon, so the still bright moon will be high in the sky before dawn.

It's about 2 hours up to the top, in the daytime when you need more breaks to drink and cool off. We started about 4:20 for a 6:04 sunrise, and made it just in time. As we climbed up, we were surprised to see the lights of Los Alamos and White Rock in the distance. Climbing even higher and the light of northern Albuquerque appeared in the south-west. The moonlight did help a lot, but we walked off the trail a few times. The timing worked out well since the sky was getting that dawn glow when we did the last bit of trail, which is the most complicated.

At the top. Sun slowly rising up. Watching the rose tipped peaks across the valley. Looking towards Sandia Mountain, we see part of the horizon sky still dark, with no hint of red. Turned out to be the shadow of Atalaya. Coyotes howled handshakes (it seemed :) at the base of the mountain. Rising higher, the new, now golden light brought out the relief of the far mountains. You could see the bright street and highway signs directly downlight - the ones with retro-reflectors in them.

Gorgeous. Highly recommended. But might not want to do it in winter with snow on the ground and rather sub-freezing temperatures. Also, the peak had excellent cell phone coverage.
