Older blog entries for titus (starting at number 6)

14 Nov 2004 »

QOTDE: "It is difficult to make predictions, especially about the future"

The write way to right re-usable bioinformatics tools.

It's frustrating how many fantastic bioinformatics analysis tools exist in a difficult-to-use form. Most of the algorithmically challenging tools I use exist only in command-line form; in fact, I can't think of a single sequence analysis program that has an external API. (I understand the situation may be slightly different in the area of clustering software, but that's not my biz at the moment.) A good external interface for NCBI BLAST or CLUSTALW would have saved me many hours.

It's not only the complex programs that suffer from this lack. One of my favorite whipping dogs is EMBOSS, a collection of many rather small command-line programs that do useful bits of analysis. They have tons of stuff, covering most anything you need to do in sequence analysis, but it's all locked away behind formats and stdin/stdout, and much of it is simply easier to re-write if you don't know how to use the program in the first place. In fact, I bet that over 90% of the programs in EMBOSS could be re-written from scratch in little more than a weekend using the scripting language of your choice (abbr "Python").

This is not an entirely idle contention; I rewrote part of fuzztran a few months ago. It took me 30 minutes -- not because I'm a fantastic programmer, but because I had a pattern-searching library that solved a more general problem. Here's what happened:

fuzztran uses a pattern language to search a database of nucleotide sequences after translation. It's useful in situations where you have a leetle bit of protein sequence -- say, from some Edman sequencing -- and want to search a genome or mRNA library for a match. This was exactly the situation I was in, but I needed to search a rather large library containing over 5 million sequences from a whole-genome shotgun sequencing effort on the sea urchin. Moreover, I needed to do an intersection of the results: I wanted to search for two substrings in proximity to each other.

I trundled on over to EMBOSS, read the fuzztran documentation, and tried running it. I immediately ran into several difficulties: it wasn't particularly fast; I didn't know if it was actually working, or if I had entered things in the wrong format; it didn't permit "percent mismatches", as in "find me sequences that match at the 90% level"; it was annoying to script; and the output format wasn't easily parseable.

Ergh.

I spent about 20 minutes trying to find an easy way to use the thing and finally decided that my time was better spent writing a specific tool for my needs. I ended up using my motility toolkit, which supports fuzzy pattern searching with position-weight matrices. I wrote a quick function to reverse-translate amino acids into codons, and thence into a position-weight matrix; once I had this "translate_protein_to_PWM" function written, the final code was very short:

import motility

matches = []
for prot in protein_list:
    matrix = translate_protein_to_PWM(prot)
    length = len(matrix)
    pwm = motility.PWM(matrix)

    # allow % mismatches (MISMATCH_PERCENT must be a fraction, e.g. 0.1)
    min_score = length - int(float(length) * MISMATCH_PERCENT + 0.5)

    print('searching: %s' % prot)
    for sequence in sequences:
        if pwm.find(sequence, min_score):
            # save the hit for later inspection
            matches.append((prot, sequence))

The code, together with testing and debugging, took a total of 30 minutes to write, and worked great -- we found the right protein & went on to verify it experimentally. (The tool is now in my slippy collection under "search-database-for-prot.py".)
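The "translate_protein_to_PWM" helper itself isn't shown above. What follows is only a rough sketch of how such a function might look, not the original code. It assumes the standard genetic code, a 0/1 score per position, and a matrix of rows in A, C, G, T column order (that ordering is my assumption about what a PWM library would want, so check it against whatever you actually use). Each amino acid becomes three rows, and a base scores 1 if any codon for that amino acid has it at that position.

```python
# Standard genetic code (NCBI translation table 1), with codons enumerated
# in T, C, A, G order at each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (or '*' for stop) to the list of codons encoding it.
CODONS = {}
index = 0
for b1 in BASES:
    for b2 in BASES:
        for b3 in BASES:
            CODONS.setdefault(AMINO_ACIDS[index], []).append(b1 + b2 + b3)
            index += 1

# Assumed column order for each row of the matrix.
PWM_ORDER = "ACGT"


def translate_protein_to_PWM(protein):
    """Reverse-translate a protein into a 0/1 position-weight matrix.

    Returns one row per nucleotide position (three per residue); each row
    holds a score for A, C, G and T in that order.
    """
    matrix = []
    for residue in protein.upper():
        codons = CODONS[residue]  # KeyError on non-standard residues
        for position in range(3):
            allowed = set(codon[position] for codon in codons)
            matrix.append([1 if base in allowed else 0 for base in PWM_ORDER])
    return matrix
```

With this scoring a perfect match scores exactly len(matrix), which is why subtracting the allowed number of mismatches from the length gives a sensible min_score. One caveat: treating the three codon positions independently over-accepts a little (for leucine it also admits TTT, which is really phenylalanine), so hits still want a second look.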

Even better, this code was readily extensible to do other things, like mixed protein searches (where you've gotten mixed sequence, e.g. "RYAAGG" and "YGGGAR" were sequenced simultaneously and can't be deconvolved, so you need to search for "[RY][YG][AG][AG][GA][RG]") and general domain searches. So that was nice.
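To make the mixed-sequence case concrete, here is a small sketch that builds that kind of pattern from two (or more) reads that couldn't be deconvolved. The function name is made up for illustration, and it emits an ordinary regular expression over protein letters rather than a PWM, so it shows the idea and not the search I actually ran.

```python
import re


def mixed_pattern(*reads):
    """Build a character-class pattern, one class per sequenced position.

    All reads must have the same length; position i of the pattern
    accepts any residue seen at position i in any read.
    """
    if len(set(len(read) for read in reads)) != 1:
        raise ValueError("need at least one read, all of the same length")

    classes = []
    for column in zip(*reads):
        residues = []
        for residue in column:
            if residue not in residues:  # drop duplicates, keep read order
                residues.append(residue)
        classes.append("[%s]" % "".join(residues))
    return "".join(classes)


pattern = mixed_pattern("RYAAGG", "YGGGAR")
# Both original reads match, and so does any other mix of the two.
assert re.fullmatch(pattern, "RYAAGG")
assert re.fullmatch(pattern, "YGGGAR")
assert re.fullmatch(pattern, "RGAGGR")
```

The order of letters inside each class depends on the order of the reads, which doesn't matter to the matcher.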

OK. Ungapped fuzzy protein sequence searching is, in many senses, a toy problem. There are tons of ways to do it, I'd bet, and none of them would take very long to implement from scratch. The situation is more frustrating when you have to deal with the warts on something like water, which does a Smith-Waterman alignment. This is a moderately tricky piece of code, and reimplementing it isn't a good option for a short-term project. What would be great is if someone broke out the code that did the tricky bits -- the alignment itself -- from the code that worried about parsing input data and constructing output formats. To their credit, the EMBOSS people seem to have done this, but it's in a library that as far as I can tell isn't documented. So it's probably easiest for Joe Blow Bioinformatician to simply use the command-line program, with all of the clumsiness inherent in that approach.

I'd bitch less about the whole problem if it weren't that the EMBOSS folk, and the NCBI folk (who make BLAST), are paid for software development. As mjg59 points out, most analysis programs are written on research grants, where the short-term view outstrips the long-term view. Not so for EMBOSS, which apparently has a whole team of people writing this stuff. I just don't get it; Perl and Python are perfectly good scripting languages, and they're cross-platform; surely it would be easier to just provide a good embedding of the algorithmically challenging functions and then just write the individual programs as scripts??

O well. Some day I hope to rewrite BLAST and retool CLUSTALW to support a nice library API. 'til then, I guess I'll just gripe about the general problem here ;0).

--titus

12 Nov 2004 (updated 12 Nov 2004 at 16:19 UTC) »

QOTDE: "The lessons of history teach us -- if the lessons of history teach us anything -- that nobody learns the lessons that history teaches us." (R. Heinlein)

Use Python -- or a language like it. Plus, my savage hatred of "system()"

Hey, look -- a fan! Matthew, dontcha know that the best way to defeat trolls is to ignore them? Or was that giant advertising animatroids? I forget. (<-- gratuitous Simpsons reference.)

Quite apart from my drug problems (acid freak, not crackhead -- there's a difference!) and the gratuitous misreference to GUI programming (I agree completely! I hate GUIs even more than I hate command-line programs -- they're just useful, on occasion!) and the unfortunate failure of my former coauthors -- the swinish bastards! -- to recognize my contributions to the deep foundations of every paper on Avida, I have to agree that any statement recommending, say, Python over Perl, APL, Pascal, or COBOL as a solution is likely to be at best disingenuous and at worst just plain wrong. It is well-known that any Turing-complete language (given infinite memory, yada yada) can emulate any other -- so why choose between them?

Dunno. But, repetitive as it may be to say it, I think a large part of the solution to bad scientific programming is to use a language like Python. Seriously, I'm perfectly aware that Lincoln Stein (and likely Matthew Garrett) can kick my ass when it comes to a mano-y-mano, Perl-y-Python scripting contest. I'm even reasonably confident that Lincoln Stein could take me down in person; he looks mean. (I haven't met Matthew.) But to cite an N of 1 ("worked for me!") as an actual argument... well, I'm no math major but it seems like a large std deviation.

An argument that I might make, were I still slavishly and unreasonably devoted to Perl rather than to Python, would be to point out that anyone writing C extensions for Perl by hand without using SWIG and/or XSAPI probably has bigger problems than over-frequent enjoyment of a little crack. If that's the big problem with Perl, then it's not a problem at all.

This argument ignores the value of writing pseudocode instead of line noise, but that seems to be a personal preference rather than an absolute, for some reason...

And (seriously) Matt's point that this is a social problem is entirely correct. Teaching people Python at an earlier age might help there. ;)

...why "system()" sucks.

But let's move on to a different argument: my savage hatred of "system()". Do an experiment: try writing a parser for the "generic" GFF format. What, you say? That's easy? Sure is -- for each and every one of the bajillion programs that output GFF, it's easy! Now, let's see which field(s) they overloaded this time...
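For the record, the "easy" part looks something like the sketch below. It assumes the usual nine tab-separated GFF columns and deliberately leaves the last column as a raw string, because that is the field every program fills in its own way. The names here are mine, not from any particular library.

```python
from collections import namedtuple

GFFRecord = namedtuple(
    "GFFRecord",
    "seqname source feature start end score strand frame attributes",
)


def parse_gff_line(line):
    """Parse one GFF line; return None for blank lines and comments."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None

    # Split into at most nine fields, so tabs inside the last column survive.
    fields = line.split("\t", 8)
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields: %r" % line)
    if len(fields) == 8:
        fields.append("")  # the attributes column is optional

    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    return GFFRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand, frame,
        attrs,  # left unparsed: this is the column that gets overloaded
    )
```

The first eight columns are the easy 90%. Everything interesting (and program-specific) ends up in "attributes", and that is where each new parser gets written.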

The problem, to put it bluntly, is formats. In information theoretic terms, stdout is often a very lossy channel, and it is difficult (and often impossible) to make it 100% clean. Why? Well, suppose someone gives you some brilliantly written (and novel) standalone piece of code, and it takes in sequences in FASTA format together with a couple of parameters. Now the program does some fantastically complex set of calculations -- gene finding, HMM search, Gibbs sampling, sequence alignment -- and spits out some text as a result. That's right -- some text. What does the text mean? At this point the hapless user of a novel program has several options. S/he can:

1. write a one-off parser that grabs the necessary data and runs.
2. write a complete parser that parses all of the output and puts it into a nice structure for later use.
3. hope like hell that the author of the program provided a "standard" format like GFF that captures some significant component of the output.
4. wait for someone more anal retentive (or needier, or smarter, or harder-working) to write a really good parser for the format.

Libraries like BioPerl or BioPython give you #3 and #4 (with time). #2 takes a lot of effort and is only worth it when you really need all of the info in the output. #1 is what everybody does, in practice, right up until it bites 'em in the butt.

There's one huuuuuge problem with all of this, however: you're at the mercy of the author of the package to provide full, honest information in the output. Well, good luck with that, and have a good time rewriting your parser when Joe Package Author decides that semicolons are a better divider than commas...

It should be obvious that the best solutions above (#2/#4) can only ever be as good as a good embedding of the package in your SLOC (Scripting Language Of Choice). And, far too often, the actual parsing solution isn't that good, and can't be extended without breaking everybody else's parsers. That's why command-line executables with no associated library or embedding will, to a general and somewhat loose approximation, always suck.

So, people: use Python. Or COBOL. And write library functions loosely wrapped in main()s, not deeply embedded spaghetti code.
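In case "library functions loosely wrapped in main()s" sounds abstract, here is a minimal sketch of the shape I mean. The GC-content calculation and all of the names are just stand-ins for illustration; the point is that the analysis lives in importable functions that return real values, and main() does nothing but argument handling and printing.

```python
import sys


def gc_content(sequence):
    """Return the fraction of G and C bases in a sequence (0.0 if empty)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / float(len(sequence))


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-format lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def main(argv):
    """Thin command-line wrapper: parse arguments, call the library, print."""
    if len(argv) != 2:
        sys.stderr.write("usage: %s sequences.fa\n" % argv[0])
        return 1
    with open(argv[1]) as handle:
        for name, sequence in read_fasta(handle):
            print("%s\t%.3f" % (name, gc_content(sequence)))
    return 0
```

A script version would end with the usual `if __name__ == "__main__": sys.exit(main(sys.argv))` guard. Anyone who wants the numbers rather than the text can import the module and call gc_content() directly, with no stdout parsing involved.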

--titus

The shoutout today goes to gnutizen, who obviously has his own drug issues; he certified me as "Journeyer"!

p.s. It turns out I was a math major. Huh. Weird.

p.p.s. If someone with some Perl and C/C++ knowledge were to go comment on my SWIG/Perl embedding of motility (see the CVS) it could be most useful to me. Just a thought.

p.p.p.s. In the bioinformatics language wars, I have to say that Bioconductor really takes the cake in the "absurdity" category. I personally like R, but why someone would choose it over a more mainstream language for general-purpose programming <shakes head>...

11 Nov 2004 (updated 11 Nov 2004 at 21:04 UTC) »

QOTDE: "If we knew what we were doing, it wouldn't be called 'research'." (A. Einstein, esq.)

Research programming, and the Doom of Command-Line Executables

Scientific analysis programs are often badly written, and usually available only as command-line executables.

The first question is, why? There are a few different reasons:

1. Scientific programmers are usually grad students and postdocs. These people are entirely untrained (and uninterested) in programming or software engineering.
2. Those who are trained in software engineering are usually computer scientists of some stripe, so 90% of them are completely useless in front of a computer anyway. (See #1 for the resulting authorship.)
3. Most scientific projects are ad hoc piles of crap from the first line of code laid down to the last semicolon written.
4. There is lots of turnover in science: students and postdocs move on quickly.
5. The standard research programming languages (Fortran and C) do not lend themselves to re-usable code, to say the least.

I don't blame the scientists for the resulting poorly built software. After all, the goal in science is to keep moving forward with your research, and if you take the short-term view on software you'll only think about the next step required for your project. Even if you do try to plan ahead, odds are you're going to be screwed by the Real World, which doesn't care what you think your results should be, and often has its own ideas. Then there's the desire to move on, which doesn't lend itself to good software practices. And, in any case, it's not like anyone teaches software development properly, so scientists have to learn how to do it on their own. Plus, if your advisor/mentor/supervisor tells you that Fortran is the way to go... then Fortran you will use.

Short-term thinking is probably the worst culprit in all of this. Advisors have no obvious incentive to promote long software projects. But I do think this focus can be bad. I've ignored my advisor's direction to focus on the short term twice: once it resulted in Avida (still a going concern 11 years later) and once it resulted in Cartwheel. If Charles and I hadn't simply written Avida (against Chris Adami's instructions) we would have been stuck with a modification to the huge pile of crap that was Tierra at the time. My current advisor, Eric Davidson, simply didn't understand the point of Cartwheel until years later (I'm still not sure what he thinks it is, actually). I think Cartwheel is a success because it's taken over much of the sequence annotation functions in the lab -- and now we don't have to run a bunch of Perl scripts, by hand, on our Beowulf cluster, every time we want to annotate a piece o' sequence. Victory over Perl, at least!

Overall, this kind of short-term thinking results in a lot of short-ish, one-off coding projects that solve a particular research need and contain no obviously re-usable code. Typically this can be encapsulated in a simple command-line program that has relatively obvious parameters and spits out a result that is directly interpretable by one person: the person who wrote the code. At this point the project is considered fini and the coder moves on. Result: one undocumented command-line program that other people may or may not find useful and in any case will be difficult to use.

OK, so that settles why badly-written command-line programs exist in such profusion in research. The second question is, why do I hate them? That's probably fairly obvious, but just to hammer in the point, I'll submit a tirade about that some time in the future.

The final question is, what can we do about it?

I'm convinced that a large part of the answer is this: use a scripting language like Python.

Why "like Python"? How "like Python"?

Python is simple, easy to learn, and fairly concise.
Python is easy to read. (It also looks a lot like cleanly written C should, which helps C programmers out.)
Python makes code re-use relatively easy. In particular, Python is inherently module- and object-oriented.
Python is cross-platform.
Python provides easy access to string processing: functionality that C and Fortran don't really have.
C and C++ code can easily be wrapped in Python.
Python is interpreted & provides interactive command-line access.
Python has automatic memory management: no malloc/free nonsense, or resulting memory corruption.

Hopefully it's obvious why these are all good features for a research programming language! Access to C and C++ code is surprisingly important, because an awful lot of useful code -- research and otherwise -- exists in C and C++ libraries. Plus, when you feel the need for speed, C and C++ are still the way to go.

However, none of the other languages that I'm most familiar with (C, C++, Java, Perl, and Tcl) satisfy all of these. C, C++, and Java are not interpreted, and Java can't easily wrap C/C++ code. Plus, C/C++ are not particularly cross-platform unless you know what you're doing. Perl and Tcl are both good scripting languages that satisfy most of the above criteria -- in particular, wrapping C code in Tcl (although not Perl) is fantastically easy, and Perl is very easy to learn for old C/UNIX hands - but neither one is object-oriented from the ground up, and neither one supports code-reuse very nicely.

Perl is a fucking nightmare when it comes to wrapping C code, too; anyone who doesn't think this is invited to try it. Sheesh. What was Larry Wall thinking?!

Ruby might be a good bet, but then I understand that it's basically Python anyway... (<dons asbestos suit hurriedly>).

So use Python. Trust me -- I know what I'm doing. ;)

Well, that's it for today; gotta go read /. I'll leave you with one final thought: the two dominant points of technological friction for bioinformatics are (a) the widespread use of Perl and Java, and (b) the omnipresence of incredibly useful but hideously unscriptable command-line programs like BLAST and 90% of the pimply little programs in EMBOSS.

(I'd hold Lincoln Stein personally responsible for (a), but the truth is that (i) he's a nice guy and (ii) BioPython isn't helping. That's a whole 'nother story. I must admit to complete bafflement re EMBOSS.)

O hey, here's a shoutout to Nathan Gray, the only other person I know who compulsively writes about stuff on the Web.

10 Nov 2004 »

Tuple spaces

It's good to see tuple spaces gaining some exposure; Patrick Logan mirrored my instinctive reaction to the Amazon Queue service beta in saying that he wished they'd provided a tuple space implementation (notwithstanding the ease of building a tuple space on top of a queue, yes.)

I first ran across tuple spaces when I implemented one without really knowing that it was a tuple space. My batchqueue implementation for Cartwheel is based on the tuple space concept, although it's more like a queue the way it's implemented. Essentially, "producers" (usually the Cartwheel Web site, nicknamed 'canal') dump job requests ("tuples") into a PostgreSQL 'request' table. "Consumers", queue processing programs running on compute nodes, monitor the table for new requests and extract a new request when one is available. Results are returned to the database and linked to the request table.

When I developed the first implementation of Cartwheel, the main goal was to avoid executing "os.system" calls from the Web server. At the time I was using AOLserver/PyWX, a high-performance threaded Web server running my/our Python embedding, and it seemed like a bad idea to do os.system calls from within a threaded app! A side benefit of implementing the queue processing as a tuple space on top of PostgreSQL was that jobs could be distributed across multiple computers. Now, it's a major feature of the thing ;). (And, since I've switched to Quixote/SCGI, os.system still seems like a bad idea but it's less of an issue.)

While my tuple space implementation on top of PostgreSQL isn't well suited for speedy turnaround (typically picking up a job requires up to 1 second), it was absolutely trivial to implement: literally, something like 5 lines of code. You can see it in my pyzine article (search for "tuple_space.add"). Once you add comments, and error handling so that e.g. CTRL-C returns the job to the tuple space rather than giving up on it, and some simple reporting functions, it adds up to a couple hundred lines of code. All in all, I'd stack tuple spaces up against any other parallel processing technique for simplicity of implementation.
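Here is a rough sketch of the idea, for the curious. It is not the Cartwheel code: it swaps in Python's sqlite3 module for PostgreSQL so that it is self-contained, and the table layout and function names are invented for illustration. Producers add rows to a 'request' table, consumers claim one row at a time inside a transaction, and an interrupted job goes back into the space.

```python
import sqlite3


def open_space(path=":memory:"):
    """Open (and if needed create) the request table."""
    # isolation_level=None: autocommit, so transactions are explicit below.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS request ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'new',"
        " result TEXT)"
    )
    return conn


def add(conn, payload):
    """Producer side: drop a job request into the space."""
    return conn.execute(
        "INSERT INTO request (payload) VALUES (?)", (payload,)
    ).lastrowid


def take(conn):
    """Consumer side: claim the oldest new request, or return None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, payload FROM request"
            " WHERE status = 'new' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE request SET status = 'taken' WHERE id = ?", (row[0],)
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row


def put_back(conn, job_id):
    """Return a claimed job to the space so another consumer can take it."""
    conn.execute("UPDATE request SET status = 'new' WHERE id = ?", (job_id,))


def finish(conn, job_id, result):
    """Store the result alongside the request."""
    conn.execute(
        "UPDATE request SET status = 'done', result = ? WHERE id = ?",
        (result, job_id),
    )


def process_one(conn, handler):
    """Run handler on one request; return False if nothing was waiting."""
    job = take(conn)
    if job is None:
        return False
    job_id, payload = job
    try:
        result = handler(payload)
    except KeyboardInterrupt:
        put_back(conn, job_id)  # CTRL-C returns the job rather than losing it
        raise
    finish(conn, job_id, result)
    return True
```

A consumer is then just a loop that calls process_one() and sleeps for a moment whenever it returns False, which is also where the up-to-a-second pickup delay comes from.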

One recurring idea has been to reimplement Google's MapReduce technique on top of Cartwheel (or some other system) to produce a highly scalable system for whole-genome motif searching. Naturally, the first thing to do is to come up with a name for the system: that's much more important than an implementation! I've been thinking of "Motiefer", along the lines of my FamilyJewels project. (So much less obvious than some dumb acronym like "ParMotSear"... but hmm, "SAR" would be kind of amusing. We'll see.)

Huh. Well, I was going to write something specific about Python for the purpose of proving to Ryan Phillips that this blog should be on PlanetPython, but ... I guess I did. OK.

9 Nov 2004 (updated 9 Nov 2004 at 23:57 UTC) »

Hmm, 3rd entry. I guess I have enough things to rant about to keep this a moderately busy diary!

Today's rant inspired by Chinook.

Chinook is a cool-sounding "P2P bioinformatics" application that aims to provide command line services in a P2P manner. I was directed to it by Mike Brudno, one of the authors of LAGAN (a global alignment package); he pointed out that it sounded like it had goals similar to Cartwheel's goals. True 'nuff, it does -- fuzzily defined, "to provide a less-sucky interface to command-line apps".

I'll rant about command-line apps and their prevalence in bioinformatics some other time. (Why, o why, do bioinformatics software developers spend so much time writing standalone binaries?!?)

But the rant about Chinook is a different rant. I quote: "Currently, there is no source code available for Chinook. The source though is licenced under Creative Common's Attribution-NonCommercial license and is freely available on request to chinook@bcgsc.bc.ca."

Sigh.

I'd be the first to admit that no software developer uses my software -- I'm not really writing it for them, anyway, I'm writing it for myself and for the bench biologists who like click-and-drool. But suppose someone, someday, overcomes that first energy barrier and says "hmm, this Cartwheel thing looks interesting. I wonder what the source looks like, and if I could run it myself?" All they need to do is nip over to SourceForge and check it out for themselves. No e-mailing to me is necessary. What about FamilyRelations? Heck, people have been finding the tutorial, downloading the thing, and running it, without me ever finding out. (I only find out when I break something in an update. ;)

There's something deeper than mere convenience here: people just aren't going to take the time to even glance at your software if you don't make it available to them w/o hassle. Software developers and scripters aren't even going to give your code the time of day if they have to e-mail you first. I think even a slight inconvenience can have real effects on people adopting your code and/or your project -- which, let's face it, is the goal.

There are other culprits: Apollo pulled this shit too, in the beginning. (I guess people use it now; don't know anyone offhand.) My favorite example of this BS, though, has to be BioHUB. This is a tool that is only really going to be useful if people use it, either by developing for it or by using it directly. Dunno about you, but (as a developer who would like to make use of it) this statement doesn't inspire confidence: "In the future the Caltech BioHub maybe released under an open source license."

Sigh^2.

5 Nov 2004 (updated 5 Nov 2004 at 07:04 UTC) »

A few days ago, I needed to use my motif searching GUI to search a DNA sequence. Normally this would mean that I'd need to:

scp the sequence file to my Mac;
log into my Cartwheel account;
upload the sequence into Cartwheel;
create some analysis in Cartwheel using the sequence;
run FamilyRelationsII and load that analysis;
click on the sequence and select 'motif search' from the menu.

Painful, ehh? Well, the system isn't exactly optimized for command-line use ;).

It turned out to be just as easy to write a separate command-line executable that loaded the sequence from the file and brought up the motif search view. (This required writing a new constructor for the motif search view & factoring out the common constructor functions into a _setup function, but heck, it was probably time to do that anyway.) 15 minutes later, & voila -- 'motif-search sequence.fa' lived!
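The refactoring is a common enough pattern that a sketch may be useful. This is not the FamilyRelations code: the class below is a hypothetical, GUI-free stand-in, and its plain substring "search" is only there to give the view something to do. What it shows is a second way of constructing the view (straight from a FASTA file) that ends up in the same _setup function as the original constructor.

```python
class MotifSearchView(object):
    """Hypothetical stand-in for a motif search view; creates no widgets."""

    def __init__(self, name, sequence):
        # Original construction path: the caller already has the sequence.
        self._setup(name, sequence)

    @classmethod
    def from_fasta(cls, path):
        """New construction path: load the first record of a FASTA file."""
        name, chunks = None, []
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    if name is not None:
                        break  # only the first record is used
                    name = line[1:]
                elif line:
                    chunks.append(line)
        return cls(name or path, "".join(chunks))

    def _setup(self, name, sequence):
        # Shared initialization, factored out of the constructor.
        self.name = name
        self.sequence = sequence.upper()
        self.hits = []

    def search(self, motif):
        """Record and return the start positions of exact motif matches."""
        motif = motif.upper()
        last_start = len(self.sequence) - len(motif)
        self.hits = [
            start for start in range(last_start + 1)
            if self.sequence.startswith(motif, start)
        ]
        return self.hits
```

A command-line wrapper then only has to call MotifSearchView.from_fasta(sys.argv[1]) and show the result.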

So, naturally, I sent out an e-mail to my bioinformatics homies informing them of this mildly useful program. The response from one fella? "Hey, you know, it would be great if you could also search for this kind of motif, not just the simple kind that's easy to type in. Oh, and if you could maybe plot sequence conservation, too, then you'd have a pretty neat program."

Hmm.

First of all, there are a number of user-interface issues there that need to be worked out. Not difficult, but not easy. And making it useful for anyone but the five biologists at Caltech capable of using the command-line would be at least a week-long project. (That's why I wrote Cartwheel in the first place; biologists are not UNIX-savvy!)

Second of all -- dude, you're a programmer. It's open-source. I'd be happy to suggest a starting point. I'm willing to admit that GUI programming is trickier than many other kinds of programming, but I've got a working framework going and I'd even be willing to sketch out what you'd have to do. But I'm damned if I'm going to spend time working on something that's only useful to one person, unless that person is me ;).

O well. As a friend says, this is the Curse of the GUI -- the users always want that nifty extra feature, the one that's really only directly useful to them. Who knew that people would want to not only change the color of the elephant, but make him polka-dotted too?

29 Oct 2004 (updated 29 Oct 2004 at 00:23 UTC) »

Too much other stuff to do to create my own diary site. Let's see how advogato works for me (and if I work for it ;).

Just added my two "full-time" open source projects to Advogato: Cartwheel and FamilyRelations. These are linked (server/client) bioinformatics projects that are part of my PhD research.