Older blog entries for titus (starting at number 3)

10 Nov 2004 »

Tuple spaces

It's good to see tuple spaces gaining some exposure; Patrick Logan mirrored my instinctive reaction to the Amazon Queue service beta in saying that he wished they'd provided a tuple space implementation (notwithstanding the ease of building a tuple space on top of a queue, yes.)

I first ran across tuple spaces when I implemented one without really knowing that it was a tuple space. My batchqueue implementation for Cartwheel is based on the tuple space concept, although it's more like a queue the way it's implemented. Essentially, "producers" (usually the Cartwheel Web site, nicknamed 'canal') dump job requests ("tuples") into a PostgreSQL 'request' table. "Consumers", queue processing programs running on compute nodes, monitor the table for new requests and extract a new request when one is available. Results are returned to the database and linked to the request table.
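The producer/consumer flow above can be sketched in a few functions. This is a minimal sketch, not the Cartwheel code: it assumes sqlite3 as a stand-in for PostgreSQL so that it runs without a server, and the table layout and the function names (make_space, add, take, finish) are hypothetical.

```python
import sqlite3

# Assumption: sqlite3 stands in for PostgreSQL; the schema below is
# hypothetical and is not Cartwheel's actual schema.

def make_space(path=":memory:"):
    # isolation_level=None gives autocommit, so transactions are explicit.
    db = sqlite3.connect(path, isolation_level=None)
    db.execute(
        "CREATE TABLE IF NOT EXISTS request ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'new')"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS result ("
        " request_id INTEGER NOT NULL REFERENCES request(id),"
        " output TEXT NOT NULL)"
    )
    return db

def add(db, payload):
    """Producer side: dump a job request into the request table."""
    cur = db.execute("INSERT INTO request (payload) VALUES (?)", (payload,))
    return cur.lastrowid

def take(db):
    """Consumer side: claim the oldest new request, or return None."""
    db.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = db.execute(
            "SELECT id, payload FROM request"
            " WHERE status = 'new' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            db.execute(
                "UPDATE request SET status = 'taken' WHERE id = ?", (row[0],)
            )
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
    return row

def finish(db, request_id, output):
    """Store a result row linked to the request and mark the request done."""
    db.execute("BEGIN IMMEDIATE")
    try:
        db.execute(
            "INSERT INTO result (request_id, output) VALUES (?, ?)",
            (request_id, output),
        )
        db.execute(
            "UPDATE request SET status = 'done' WHERE id = ?", (request_id,)
        )
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
```

Claiming the row inside a single write transaction is what keeps two consumers from picking up the same request; on a real PostgreSQL server the locking details would differ.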

When I developed the first implementation of Cartwheel, the main goal was to avoid executing "os.system" calls from the Web server. At the time I was using AOLserver/PyWX, a high-performance threaded Web server running my/our Python embedding, and it seemed like a bad idea to do os.system calls from within a threaded app! A side benefit of implementing the queue processing as a tuple space on top of PostgreSQL was that jobs could be distributed across multiple computers. Now, it's a major feature of the thing ;). (And, since I've switched to Quixote/SCGI, os.system still seems like a bad idea but it's less of an issue.)

While my tuple space implementation on top of PostgreSQL isn't well suited for speedy turnaround (picking up a job typically takes up to 1 second), it was absolutely trivial to implement: literally, something like 5 lines of code. You can see it in my pyzine article (search for "tuple_space.add"). Once you add comments, some simple reporting functions, and error handling (so that, e.g., CTRL-C returns the job to the tuple space rather than giving up on it), it adds up to a couple hundred lines of code. All in all, I'd stack tuple spaces up against any other parallel processing technique for simplicity of implementation.
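To make the CTRL-C point concrete, here is a rough sketch of a polling consumer that hands an interrupted job back. It is not the code from the pyzine article: the TupleSpace class is a hypothetical in-memory stand-in for the database-backed space, and consume, poll_interval and max_idle_polls are names made up for illustration.

```python
import time
from collections import deque

class TupleSpace:
    """Hypothetical in-memory stand-in for the database-backed space."""

    def __init__(self):
        self._jobs = deque()

    def add(self, job):
        self._jobs.append(job)

    def take(self):
        # Return the oldest job, or None if the space is empty.
        return self._jobs.popleft() if self._jobs else None

    def __len__(self):
        return len(self._jobs)

def consume(space, handler, poll_interval=1.0, max_idle_polls=None):
    """Poll the space and run handler on each job.

    Stops after max_idle_polls consecutive empty polls (None = run forever).
    If CTRL-C arrives while a job is running, the job goes back into the
    space before the interrupt propagates.
    """
    results = []
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        job = space.take()
        if job is None:
            idle += 1
            time.sleep(poll_interval)  # polling is why pickup can lag
            continue
        idle = 0
        try:
            results.append(handler(job))
        except KeyboardInterrupt:
            space.add(job)  # return the job rather than giving up on it
            raise
    return results
```

The default one-second poll interval mirrors the pickup delay mentioned above; a shorter interval trades database load for latency.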

One recurring idea has been to reimplement Google's MapReduce technique on top of Cartwheel (or some other system) to produce a highly scalable system for whole-genome motif searching. Naturally, the first thing to do is to come up with a name for the system: that's much more important than an implementation! I've been thinking of "Motiefer", along the lines of my FamilyJewels project. (So much less obvious than some dumb acronym like "ParMotSear"... but hmm, "SAR" would be kind of amusing. We'll see.)
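For what it's worth, the map/reduce shape of a motif search is easy to sketch on one machine. This is only an illustration of the idea, not an implementation of Google's system or of anything in Cartwheel: the function names are hypothetical, the motif is treated as a literal string, and everything runs in a single process.

```python
import re
from collections import defaultdict
from functools import reduce

def map_motifs(record, motif):
    """Map step: emit (sequence name, position) for every motif hit."""
    name, seq = record
    # A lookahead makes overlapping occurrences show up as separate hits.
    pattern = re.compile("(?=%s)" % re.escape(motif))
    return [(name, m.start()) for m in pattern.finditer(seq)]

def reduce_hits(acc, pairs):
    """Reduce step: group hit positions by sequence name."""
    for name, pos in pairs:
        acc[name].append(pos)
    return acc

def motif_search(records, motif):
    mapped = (map_motifs(record, motif) for record in records)
    return dict(reduce(reduce_hits, mapped, defaultdict(list)))
```

In a distributed version, each map call would be a job handed to a compute node, and the reduce step would merge the per-job results.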

Huh. Well, I was going to write something specific about Python for the purpose of proving to Ryan Phillips that this blog should be on PlanetPython, but ... I guess I did. OK.

9 Nov 2004 (updated 9 Nov 2004 at 23:57 UTC) »

Hmm, 3rd entry. I guess I have enough things to rant about to keep this a moderately busy diary!

Today's rant inspired by Chinook.

Chinook is a cool-sounding "P2P bioinformatics" application that aims to provide command line services in a P2P manner. I was directed to it by Mike Brudno, one of the authors of LAGAN (a global alignment package); he pointed out that it sounded like it had goals similar to Cartwheel's goals. True 'nuff, it does -- fuzzily defined, "to provide a less-sucky interface to command-line apps".

I'll rant about command-line apps and their prevalence in bioinformatics some other time. (Why, o why, do bioinformatics software developers spend so much time writing standalone binaries?!?)

But the rant about Chinook is a different rant. I quote: "Currently, there is no source code available for Chinook. The source though is licenced under Creative Common's Attribution-NonCommercial license and is freely available on request to chinook@bcgsc.bc.ca."

Sigh.

I'd be the first to admit that no software developer uses my software -- I'm not really writing it for them, anyway, I'm writing it for myself and for the bench biologists who like click-and-drool. But suppose someone, someday, overcomes that first energy barrier and says "hmm, this Cartwheel thing looks interesting. I wonder what the source looks like, and if I could run it myself?" All they need to do is nip over to SourceForge and check it out for themselves. No e-mailing me is necessary. What about FamilyRelations? Heck, people have been finding the tutorial, downloading the thing, and running it, without me ever finding out. (I only find out when I break something in an update. ;)

There's something deeper than mere convenience here: people just aren't going to take the time to even glance at your software if you don't make it available to them w/o hassle. Software developers and scripters aren't even going to give your code the time of day if they have to e-mail you first. I think even a slight inconvenience can have real effects on people adopting your code and/or your project -- which, let's face it, is the goal.

There are other culprits: Apollo pulled this shit too, in the beginning. (I guess people use it now; don't know anyone offhand.) My favorite example of this BS, though, has to be BioHUB. This is a tool that is only really going to be useful if people use it, either by developing for it or by using it directly. Dunno about you, but (as a developer who would like to make use of it) this statement doesn't inspire confidence: "In the future the Caltech BioHub maybe released under an open source license."

Sigh^2.

5 Nov 2004 (updated 5 Nov 2004 at 07:04 UTC) »

A few days ago, I needed to use my motif searching GUI to search a DNA sequence. Normally this would mean that I'd need to:

scp the sequence file to my Mac;
log into my Cartwheel account;
upload the sequence into Cartwheel;
create some analysis in Cartwheel using the sequence;
run FamilyRelationsII and load that analysis;
click on the sequence and select 'motif search' from the menu.

Painful, ehh? Well, the system isn't exactly optimized for command-line use ;).

It turned out to be just as easy to write a separate command-line executable that loaded the sequence from the file and brought up the motif search view. (This required writing a new constructor for the motif search view & factoring out the common constructor functions into a _setup function, but heck, it was probably time to do that anyway.) 15 minutes later, & voila -- 'motif-search sequence.fa' lived!
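The refactoring looks roughly like this. It's a sketch under assumptions, not the FamilyRelationsII code: the class name, the shape of the analysis object, and read_fasta are all hypothetical, and there is no GUI toolkit involved, just the two construction paths sharing one _setup function.

```python
import sys

def read_fasta(path):
    """Return (name, sequence) for the first record in a FASTA file."""
    name, chunks = None, []
    with open(path) as fp:
        for line in fp:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    break  # second record starts here; stop reading
                name = line[1:]
            elif line:
                chunks.append(line)
    return name, "".join(chunks)

class MotifSearchView:
    def __init__(self, analysis):
        # Existing path: build the view from a loaded analysis
        # (here a hypothetical dict with 'name' and 'sequence' keys).
        self._setup(analysis["name"], analysis["sequence"])

    @classmethod
    def from_file(cls, path):
        # New path: build the view straight from a sequence file.
        view = cls.__new__(cls)
        view._setup(*read_fasta(path))
        return view

    def _setup(self, name, sequence):
        # Common construction work shared by both paths.
        self.name = name
        self.sequence = sequence.upper()

def main(argv):
    """Entry point for a hypothetical 'motif-search sequence.fa' command."""
    if len(argv) != 2:
        print("usage: motif-search sequence.fa", file=sys.stderr)
        return 1
    view = MotifSearchView.from_file(argv[1])
    print("%s: %d bp" % (view.name, len(view.sequence)))
    return 0
```

A wrapper script would call sys.exit(main(sys.argv)); the real program would bring up the search window where this one prints a summary.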

So, naturally, I sent out an e-mail to my bioinformatics homies informing them of this mildly useful program. The response from one fella? "Hey, you know, it would be great if you could also search for this kind of motif, not just the simple kind that's easy to type in. Oh, and if you could maybe plot sequence conservation, too, then you'd have a pretty neat program."

Hmm.

First of all, there are a number of user-interface issues there that need to be worked out. Not difficult, but not easy. And making it useful for anyone but the five biologists at Caltech capable of using the command line would be at least a week-long project. (That's why I wrote Cartwheel in the first place; biologists are not UNIX-savvy!)

Second of all -- dude, you're a programmer. It's open-source. I'd be happy to suggest a starting point. I'm willing to admit that GUI programming is trickier than many other kinds of programming, but I've got a working framework going and I'd even be willing to sketch out what you'd have to do. But I'm damned if I'm going to spend time working on something that's only useful to one person, unless that person is me ;).

O well. As a friend says, this is the Curse of the GUI -- the users always want that nifty extra feature, the one that's really only directly useful to them. Who knew that people would want to not only change the color of the elephant, but make him polka-dotted too?

29 Oct 2004 (updated 29 Oct 2004 at 00:23 UTC) »

Too much other stuff to do to create my own diary site. Let's see how Advogato works for me (and if I work for it ;).

Just added my two "full-time" open source projects to Advogato: Cartwheel and FamilyRelations. These are linked (server/client) bioinformatics projects that are part of my PhD research.