Older blog entries for titus (starting at number 13)

The only problem with troubleshooting is that trouble sometimes shoots back. -- Joe Zeff.

I've been noticing a fair amount of commentary on Python and Java lately: I particularly enjoyed Bruce Eckel's take on Static vs Dynamic typing, and Phillip Eby's Python Is Not Java (and Java Is Not Python, either). Phillip Eby makes the point that the Python and Java mindsets are quite different when it comes to frameworks: Python programmers tend to develop the structure out as they need it, while Java designers try to specify the frameworks' structure first & then fill in with specific implementations. Isn't this antithetical to the agile programming paradigm that's been gaining popularity lately?

Jython does a nice job of mingling Java libraries with Python coding; I think many of the Python-native extension modules can be loaded directly by Jython, too. Is this a possible solution to the question of static vs dynamic typing -- build your software in a language like Jython, and then slowly solidify it into Java?

I primarily do research programming, in which the specific goals of the software are largely undefined & the flexibility of the code should be one of the proximal design considerations, so I definitely prefer the Python(/Perl/Ruby) mindset in day-to-day work. There is a question in my mind, though, about where future bioinformatics software efforts will aim: I doubt that the current loosely-coupled/badly-specified project-specific protocols for genome databases and service frameworks will last, so where next? We could either start developing specifications (e.g. the distributed annotation system (DAS) or MAGE) or implementations (e.g. GMOD). If the former, there will be a significant barrier to entry for new projects, as they will need to spend time developing to the standard and confirming adherence. (This is the primary reason why DAS is a failure, I think.) If the latter, I predict a general tendency towards complexity of internal design as different projects try to cram all their needs into a single system. Either situation would be bad.

My preference is for what I think is a middle ground: the development of APIs around common tasks, in a variety of languages. The idea would be to take protocols like DAS and provide fairly simple library implementations that give you 90% of the needed functionality with 10% of the code complexity (based on the well known 90%/10% rule ;). The key is to make sure the implementations work well enough to do something useful & are in enough languages that e.g. the lone maverick Python/OCaml/Ruby programmer in the sea of Perl & Java programmers wants to play as well (just as one example!).

At the moment there are few tasks generic enough to be encapsulated by such an approach: the two that I can think of are annotation & microarray data presentation. Annotation suffers from a general lack of interoperability: not only does everyone have their own standards, but features don't transfer well between standards. I hear microarray data is the same, although I don't work with it much. It'd be interesting to try to work around the ontology problems (do you *really* want to define an ontology before getting your work done!?) to produce a genuinely useful annotation UI that interoperates. I don't see one out there that's usable by "mere" biologists, and I think that's the right target audience...

Why not use, say, XML? Well, properly grokking XML is burdensome and the whole process is pretty legalistic (lots of people yakking etc.). Since the goal is to lower the barrier to entry, I think it's important to have some functioning libraries as soon as possible -- that way people can get the thrill of having the code actually work. When the library moves towards a standard, projects that are already functioning will at least have some reason to move with that library...

Hats off to the Chinook folks, who are developing a P2P bioinformatics system; you can access the code via CVS, finally.


Bugs bugs bugs bugs bugs...

Apparently this week is "let's find bugs in Titus's software" week. Didn't know it was formally defined... but three different people have poked holes in three different-but-related projects. The holes range from already-fixed-but-not-in-the-build (FRII) through important-but-easy-to-fix (Cartwheel) to important-and-bloody-difficult-to-fix (paircomp). I have to say my users are really great: finding two of these bugs required great attention to detail. Thanks, guys!

The trickiest bug to fix involves finding transitive connections between three two-way comparisons (find all paths A-->B-->C such that A-->B, B-->C, and A-->C are all matches). I came up with a clever solution that was easy to understand and easy to implement in simple code; unfortunately, it falls apart in the face of reverse complementing. (As you may know, DNA is readable in two directions: AATTGGCC is equivalent to its reverse complement, GGCCAATT (complement: A <--> T, G <--> C).) This problem is compounded by the asinine data structure that I use to represent the matches. Looks like it's time for a serious refactoring...
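To make the two ideas in that paragraph concrete, here is a minimal Python sketch. It is not the paircomp code: the representation of matches as sets of (position, position) pairs and the function names are assumptions made purely for illustration, and the transitive filter deliberately ignores strand, which is exactly where the real problem gets hard.

```python
# Translation table for complementing DNA: A <-> T, G <-> C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def transitive_paths(ab, bc, ac):
    """Yield (a, b, c) triples where a-->b, b-->c and a-->c are all matches.

    ab, bc and ac are sets of (position, position) pairs (a hypothetical
    representation). Orientation is ignored entirely, so this is only the
    easy, forward-strand half of the problem.
    """
    # Index the B-->C matches by their B position for quick lookup.
    by_b = {}
    for b, c in bc:
        by_b.setdefault(b, []).append(c)

    for a, b in sorted(ab):
        for c in sorted(by_b.get(b, [])):
            if (a, c) in ac:
                yield (a, b, c)
```

Once reverse-complement matches are allowed, each pair would also need to carry an orientation, and the A-->C check would have to confirm that the orientations along A-->B-->C compose consistently.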

All of these bugs remind me of this great quote from an interview with Damian Conway:

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." -- Brian Kernighan, via Damian Conway

I really enjoyed reading this Damian Conway interview on builderau.com. This is a man who has done it all, and has sound advice based on experience. He also gives an excellent reason for using Perl: it's an immensely powerful language that lets you do pretty much whatever you want. (I don't think it's a good idea for inexperienced programmers to use Perl for anything more than short scripts, but -- like Python -- I suspect "short scripts" describes 95% of what is done with Perl ;).

In other news, my OCaml adventures proceed apace. I just finished my very first OCaml program (temp link). dd2.ml implements a simple recursive global-alignment algorithm that finds the optimum gapped alignment between two sequences. Dog slow, but functional (ha ha...)! Now to see if I can add some heuristics into the algorithm to make it speedier.
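For readers who have not met recursive global alignment, the idea can be sketched in a few lines of Python. This is not dd2.ml: the scoring values (match +1, mismatch -1, gap -1) are assumptions, and the memoization via lru_cache is an addition that the plain recursive version would lack (which is one reason such a version is dog slow).

```python
from functools import lru_cache


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for a gapped, end-to-end alignment of strings a and b."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # One sequence is exhausted: the rest of the other aligns to gaps.
        if i == len(a):
            return gap * (len(b) - j)
        if j == len(b):
            return gap * (len(a) - i)
        # Either align a[i] with b[j], or put a gap in one of the sequences.
        diagonal = (match if a[i] == b[j] else mismatch) + best(i + 1, j + 1)
        gap_in_b = gap + best(i + 1, j)
        gap_in_a = gap + best(i, j + 1)
        return max(diagonal, gap_in_b, gap_in_a)

    return best(0, 0)
```

Because it recurses once per position, this sketch only suits short sequences; anything long would hit Python's recursion limit and calls for the usual iterative table instead.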

OCaml is a lot of fun, I must say. At some point I look forward to making use of OCaml's ability to ship cross-platform bytecode around to different machines. It'd be great to be able to add new alignment views and other analyses directly into FamilyRelationsII simply by downloading some new OCaml code! I've also been thinking about how to use OCaml in my tuple space/map-reduce implementation... seems like a good fit!

Last but not least: WSGI. There is now a Web site containing my Quixote and SCGI adapters for the Python WSGI standard. It also turns out I owe Ian Bicking an apology: when I asked why Webware didn't have an adapter, I'd missed Ian's WSGIKit implementation (SVN here, blog here). It's not an adapter so much as a reimplementation effort, as far as I can tell, so I still think there's room for a simple adapter that Just Works (tm). If experiments continue sucking maybe I'll work on that...

ta for now,

30 Nov 2004 (updated 30 Nov 2004 at 20:46 UTC) »
Stevey -- check out http://www.blogtorrent.com/, it might be what you're looking for. [UPDATE: no, it's not. Never mind.]

In other news, I just updated my QWIP/SWAP README with some simple usage examples, after trying them out with WSGI Utils. (They worked! (sort of)) Stupidly enough I previously posted a dated direct link to the qwip-swap .tar.gz, so I'm waiting 'til I can construct a Real Web Site for QWIP/SWAP to post the slightly updated distro.



'Vegetarian' -- it's an old Indian word meaning 'lousy hunter'.
              -- Red Green
30 Nov 2004 (updated 30 Nov 2004 at 08:04 UTC) »

''' There is a joke about American engineers and French engineers. The American team brings a prototype to the French team. The French team's response is: "Well, it works fine in practice; but how will it hold up in theory?" ''' -- unknown, via Mike Vanier.

OCaml, Python/WSGI, and scalable programming:

Spent some time over the last few days "learning" OCaml, by which I mean reading first the C++/Java programmer's intro to OCaml and then an OCaml tutorial. This is all part of an effort to broaden my horizons: I enjoy using Python and C to solve problems on a daily basis, but I've never learned a functional programming language. Man, is it frustrating to pick up a new language -- I feel completely helpless to write even the simplest program. This is compounded by my complete inability to think recursively...

I'm looking into OCaml because several different computer-geek friends suggested I try it out. Since all of them profess a love of Python, yet are wiser and more experienced than I in the ways of programming languages (I guess a CS background is useful for something...) I decided to buckle down and study OCaml a bit. So far I've gained an appreciation for the cleverness of OCaml and OCaml programmers, marvelled at 'match', and realized how cool currying is. Not bad for two days ;).

In other news, David Warnock pointed out in his blog that my simple Thanksgiving Day WSGI wrapper for SCGI might be the best-performing WSGI server around, because it's built on top of mod_scgi/SCGI. mod_scgi/SCGI is already fully functional and used for "real" Web sites that run Quixote, and my leetle SWAP code effectively turns this into a full-blown WSGI server. Cool. It seemed too easy to implement, though, so I must be missing some aspect of the WSGI master plan -- why hasn't Webware done this yet, for example?

In connection with that, I've been thinking that an interesting project would be to implement an SCGI server in OCaml. I don't see anything like it out there on the projects page, and it wouldn't take that long to do...

Last but not least, as part of my OCaml adventure, I came across Mike Vanier's rant on the scalability of languages. In it he says, or implies, many things that I wish I could have said more clearly. Things like "The right way to use languages like C is to implement small, focused low-level components of applications written primarily in higher-level languages". Yeah, that.

Mike is one of the three people that suggested I learn OCaml, so I'm a bit saddened by his epilogue in which he turns a little bit away from OCaml (for good reasons, it sounds like, but nonetheless...)


QOTDE: Things Will Change -- Iain M. Banks, Against a Dark Background (the quote on Gorko's Tomb)

WSGI, Quixote, SCGI, QWIP, and SWAP

In a fit of depression over lousy experimental results, with a healthy serving of turkey on top, I decided to turn my hand to something I do better than experimental molecular biology: program in Python. (Trust me, whatever you think of my programming... my molecular biology is weaker. sigh.)

Pursuant to the general public prodding of various people on the Quixote list, I spent a few hours on the couch today and built two interfaces for WSGI, QWIP and SWAP. (README and source download.)

QWIP, the "Quixote-WSGI interface p(something)", wraps the Quixote publisher in a WSGI-compliant application object. This lets any WSGI-compliant servers out there (are there any?) publish Quixote objects.

SWAP, the "SCGI-WSGI application p(something)", allows the SCGI standalone server interface ('scgi server') to run WSGI-compliant applications. For example, this lets mod_scgi run WSGI applications via the SCGI server -- including QWIP-wrapped applications, which was my testing strategy ;).
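For anyone who hasn't seen one, a "WSGI-compliant application object" is just a callable that takes an environ dictionary and a start_response function. The sketch below is not QWIP or SWAP code: hello_app and call_app are hypothetical names, and it follows the modern Python 3 form of the standard (bytes bodies), with call_app standing in for the server side of the interface.

```python
from wsgiref.util import setup_testing_defaults


def hello_app(environ, start_response):
    """A minimal WSGI application: report the requested path."""
    body = ("Hello from %s\n" % environ.get("PATH_INFO", "/")).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(app, path="/"):
    """Play the server's role: build an environ, call the app, collect output."""
    environ = {}
    setup_testing_defaults(environ)   # fills in the required WSGI keys
    environ["PATH_INFO"] = path

    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

An adapter like QWIP sits where hello_app is (wrapping an existing publisher as the callable), while one like SWAP sits where call_app is (translating an incoming request into environ and streaming the result back).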

Overall, my modicum of experience with the internals of Web servers (mostly from PyWX and some minor hacking on Quixote) served me well; it took me about 1 hr to get QWIP working, and about 3 hours to get SWAP working. (Over half of those three hours was spent figuring out that (1) I was instantiating a new object rather than calling the superconstructor, because I'd left out __init__; and (2) that SCGI expected the input and output streams to be closed to signal that the connection was over. Sigh.) It was pretty satisfying to sit back and set up this set of modules:

Apache <--> mod_scgi <--> SCGI server <--> SWAP <--> QWIP <--> Quixote demo
and have it all work!

I'm now moderately more optimistic about the usefulness of WSGI. I hate (no, loathe) frameworks that attempt to solve the problems of mankind, if you'll just drink this Kool-Aid sir... But, notwithstanding the philosophical debut in the WSGI PEP, it was pleasant to implement the adapters and I could see WSGI being of significant benefit to Web server authors. Or maybe by buying into the framework I've sold out and you can't trust my opinion ;).

So, kudos to Phillip Eby & I hope this stuff is useful to someone! Now, back to making my Quixote applications do more stuff!


p.s. Has anyone else noticed that advogato.com and www.advogato.com read cookies differently? Kind of amusing to go to one or the other and have different options available, one as logged-in member & the other as nobody...

QOTDE: "One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important." -- Bertrand Russell, via Timothy Foreman.

Academic publishing may not quite be ready for Open Source just yet...

When last we met our fearless hero, I'd talked about submitting an article on my software to BioTechniques. We got word back on Tuesday: editorial rejection prior to review. The reason? Lack of originality, because, to quote:

"As noted by the authors, the programs described in this manuscript are available online and already in wide use."

Silly me: I thought that having shown that the programs worked for a wide variety of biological systems was a good thing!

It seems that the logic-challenged people at BioTechniques only want unproven software published. If I convolute my own logic processor, I can understand this, sort of: why would anyone read an article about software that they're already using? Of course, the assumption going into that is that the sole purpose of publishing is to introduce people to completely novel results, not just something that most people won't have seen. It's certainly not like FRII is so widely used that Joe Developmental Biologist will have already seen it.

O well. On the advice of a friend more seasoned than I, I am re-submitting to BMC Bioinformatics, where I am told that functioning software is welcomed.

I do have to say that this little interaction has not raised my opinion of BioTechniques. The editors didn't bother sending it out for peer review, they simply slotted it into their narrow preconceptions of How Software Is Done and cut off the bits that didn't fit. At least I don't have to be upset with my peers; I can just call the BT editors "clueless" and move on!

There appear to be very few places to actually publish software. This is surprising, given how much biology is starting to depend on it in this new era of too much sequence. The standard technique is to do some moderately interesting bit of science using the software & then drop it into a moderately good journal like Genome Research. That's great -- if the software you're writing has some immediate scientific value that can be ascertained without experiments. If you need to do experiments, you're talking about a 6-12 mo wait before you can finish the experiments & then publish the software. Not exactly timely.

It's more troublesome that you can't expose your software to the Real World and publish it as novel once other people know about it. Next time I write a standalone piece of software I'll have to remember not to tell anyone else before publishing it...

Thursday morning miscellany

Johnny Bartlett, a fine member of this august site, asked me to pimp his book, although he acknowledges it's not a must-read for software architecture. The book is "Programming from the Ground Up"; not having read it, I am willing to pimp it not only by request but because Joel Spolsky recommends it. Go buy it.

A synaesthetic friend sent me this fascinating article on tetrachromatic women. I think it's a very interesting philosophical exercise to contemplate what such people see & realize that we will never know. The article makes a big point of the adaptability of the human brain; I'm not that surprised, because it seems like the brain adapts to place people in their own political realities easily enough... ;) I guess that physical adaptation on the level of new nerve pathways is moderately surprising, although not new: see this article, & search for "inverted".

Normally I hate reposting links without commentary, 'cause meme tracking has shown that everyone does it, so why should I waste my time? But sometimes you run across something so hilarious that you've just gotta share: sometimes you need a bigger tow truck than you originally thought. There was also a great story about crashing doorbells, but I won't repost that.

Support our troops (if you're from the US)

Last but not least, check out AnySoldier.com. Whether or not you believe in the war (I do think getting rid of Saddam was a good idea) or support our leader (are you kidding?), we should remember that the troops who are over there are generally good people who are in an uncertain combat environment fighting for their lives. It behooves us to support them, whatever you think of the people who sent them there.

A friend who is also a military nut said this about what units to contribute to:

On one hand, I'd suggest reserve/nat'l guard units - they being in a more protracted and stressful situation than they expected. On the other hand, reservists are more likely to have a better support structure from their families (more likely married). Active duty Army GIs and Marines are more likely to be single, 18-21 year olds. Obviously both have families, but, you know, direct support from a spouse and children in addition to the rest of the family is what I'm getting at.

So, send books & chocolate, and support 'em.


This diary entry dedicated to my synaesthetic friend Tamara.

QOTDE: "The only mistake you can make is to believe you cannot make mistakes." (via Carlos Gershenson)

Web software non-release & good books on software development

I'm waiting eagerly for our server admin to update Cartwheel to my latest version. We have two "working branches" on the SourceForge CVS repository, one for the Beowulf cluster configuration and one for the Web server configuration; to update, I simply merge the development branch into each of these branches and tell Ian (our server admin) to run 'cvs update' and restart. This time is a bit more complicated, because Python, psycopg, and Quixote all have new versions; I've added some new analysis programs (LAGAN and blastz); and I'm now using BioPython to parse NCBI BLAST output. BioPython, in particular, is a pain -- it's a big heap o' code, and it doesn't interact well with Quixote. O well, c'est la vie.

This update is pretty substantive: I added a bunch of new functionality to round out what was already there, then wrote it up in an article we're submitting to BioTechniques. (Let me know if you want a pre-acceptance copy.) I've been told that BioTechniques isn't the highest quality or highest impact journal, but I get the impression that it reaches a fairly wide audience of biologists. And that's my goal: to reach the users, not to publish a scientific article (got some of those on the way!). This paper is paper #2 of 4 dissertation papers, too, and it's nice to get it off my back. It's also the first paper where I'm corresponding author, which is pretty cool; for the non-academics out there, that signals that it's my project, not my advisor's.

I don't know when, if ever, I'll get around to an actual "release" of Cartwheel. There's no point as long as the one server that we run keeps up with demand; I don't think it's near to conking out, but I could be wrong. I've never stress-tested it, because it's not that kind of Web site... Maybe someday other people will start installing it and then I'll want to canonicalize the installation a bit more ;).

The kind of release we're doing now -- "here's the functionality, go play" -- is certainly the right thing for its current users, who are mostly GUI-using biologists. Anyone who wants to take a deeper look can do so via SourceForge; there are some moderately useful Web services APIs in there, for example.

...good books on software development

Rather than being critical of yahoo academic software development, I thought I'd be friendly today. Here's my list of good background books for software design. It's a very short list: Lakos's C++ book, Design Patterns, and Patterns of Software (+ some other links at the bottom). Fowler's Refactoring definitely belongs on there as well.

I regard these as "must-reads" if you're going to seriously think about writing even a moderately sized software project; if you read them and think "that was a waste of time..." you're either very experienced or you should do us a favor and not write any more software. In my not-so-humble opinion.

(Hmm, that wasn't very friendly. I've got to work on those anger issues, it seems.)

Please send me a private e-mail if you have additional suggestions; I'm always interested in good new books.


Today's diary entry dedicated to salmoni, who could use a little love... Stick with it, you'll find something!

QOTDE: "It is difficult to make predictions, especially about the future"

The write way to right re-usable bioinformatics tools.

It's frustrating how many fantastic bioinformatics analysis tools exist in a difficult-to-use form. Most of the algorithmically challenging tools I use exist only in command-line form; in fact, I can't think of a single sequence analysis program that has an external API. (I understand the situation may be slightly different in the area of clustering software, but that's not my biz at the moment.) A good external interface for NCBI BLAST or CLUSTALW would have saved me many hours.

It's not only the complex programs that suffer from this lack. One of my favorite whipping dogs is EMBOSS, a collection of many rather small command-line programs that do useful bits of analysis. They have tons of stuff, covering most anything you need to do in sequence analysis, but it's all locked away behind formats and stdin/stdout, and much of it is simply easier to re-write if you don't know how to use the program in the first place. In fact, I bet that over 90% of the programs in EMBOSS could be re-written from scratch in little more than a weekend using the scripting language of your choice (abbr "Python").

This is not an entirely idle contention; I rewrote part of fuzztran a few months ago. It took me 30 minutes -- not because I'm a fantastic programmer, but because I had a pattern-searching library that solved a more general problem. Here's what happened:

fuzztran uses a pattern language to search a database of nucleotide sequences after translation. It's useful in situations where you have a leetle bit of protein sequence -- say, from some Edman sequencing -- and want to search a genome or mRNA library for a match. This was exactly the situation I was in, but I needed to search a rather large library containing over 5 million sequences from a whole-genome shotgun sequencing effort on the sea urchin. Moreover, I needed to do an intersection of the results: I wanted to search for two substrings in proximity to each other.

I trundled on over to EMBOSS, read the fuzztran documentation, and tried running it. I immediately ran into several difficulties: it wasn't particularly fast; I didn't know if it was actually working, or if I had entered things in the wrong format; it didn't permit "percent mismatches", as in "find me sequences that match at the 90% level"; it was annoying to script; and the output format wasn't easily parseable.


I spent about 20 minutes trying to find an easy way to use the thing and finally decided that my time was better spent writing a specific tool for my needs. I ended up using my motility toolkit, which supports fuzzy pattern searching with position-weight matrices. I wrote a quick function to reverse-translate amino acids into codons, and thence into a position-weight matrix; once I had this "translate_protein_to_PWM" function written, the final code was very short:

for prot in protein_list:
    matrix = translate_protein_to_PWM(prot)
    length = len(matrix)
    pwm = motility.PWM(matrix)

    # allow % mismatches
    min_score = length - int(float(length) * MISMATCH_PERCENT + 0.5)

    print 'searching:', prot
    for sequence in sequences:
        if pwm.find(sequence, min_score):
            # save.

The code, together with testing and debugging, took a total of 30 minutes to write, and worked great -- we found the right protein & went on to verify it experimentally. (The tool is now in my slippy collection under "search-database-for-prot.py".)
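For the curious, the reverse-translation step can be sketched in modern Python without motility at all. Everything here is an illustrative assumption rather than the actual translate_protein_to_PWM or motility code: protein_to_pwm scores a base 1 if any codon for that amino acid has it at that position and 0 otherwise, find is a naive forward-strand scan, and min_score repeats the mismatch arithmetic from the loop above with the mismatch level passed in as a fraction.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), in TTT, TTC, TTA, TTG, TCT ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}
for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS.setdefault(amino_acid, []).append("".join(codon))


def protein_to_pwm(protein):
    """Return one {base: 0 or 1} row per nucleotide position of the protein."""
    matrix = []
    for amino_acid in protein:
        for i in range(3):
            allowed = {codon[i] for codon in CODONS[amino_acid]}
            matrix.append({base: int(base in allowed) for base in "ACGT"})
    return matrix


def min_score(length, mismatch_fraction):
    """Lowest acceptable score when mismatch_fraction of positions may miss."""
    return length - int(float(length) * mismatch_fraction + 0.5)


def find(matrix, sequence, threshold):
    """Return start offsets of ungapped windows scoring at least threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(row.get(base, 0) for row, base in zip(matrix, window))
        if score >= threshold:
            hits.append(start)
    return hits
```

Treating the three codon positions independently is a simplification: for amino acids with several codons it accepts some triplets that do not actually encode them, which is a reasonable trade for a fuzzy search but worth knowing about.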

Even better, this code was readily extensible to do other things, like mixed protein searches (where you've gotten mixed sequence, e.g. "RYAAGG" and "YGGGAR" were sequenced simultaneously and can't be deconvolved, so you need to search for "[RY][YG][AG][AG][GA][RG]") and general domain searches. So that was nice.
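The mixed-sequence pattern can be built mechanically, one character class per position. A small sketch, assuming protein-level matching with Python's re module rather than the PWM machinery used in the real tool (mixed_pattern is a hypothetical helper name):

```python
import re


def mixed_pattern(*peptides):
    """Build a regex with one character class per position of the peptides."""
    if len({len(p) for p in peptides}) != 1:
        raise ValueError("peptides must all have the same length")
    classes = []
    for residues in zip(*peptides):
        # Keep each residue once, in the order the peptides were given.
        unique = "".join(dict.fromkeys(residues))
        classes.append(unique if len(unique) == 1 else "[%s]" % unique)
    return "".join(classes)


pattern = re.compile(mixed_pattern("RYAAGG", "YGGGAR"))
```

Note that such a pattern matches chimeras of the two peptides as well as the originals, which is the whole point when the two reads can't be deconvolved.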

OK. Ungapped fuzzy protein sequence searching is, in many senses, a toy problem. There are tons of ways to do it, I'd bet, and none of them would take very long to implement from scratch. The situation is more frustrating when you have to deal with the warts on something like water, which does a Smith-Waterman alignment. This is a moderately tricky piece of code, and reimplementing it isn't a good option for a short-term project. What would be great is if someone broke out the code that did the tricky bits -- the alignment itself -- from the code that worried about parsing input data and constructing output formats. To their credit, the EMBOSS people seem to have done this, but it's in a library that as far as I can tell isn't documented. So it's probably easiest for Joe Blow Bioinformatician to simply use the command-line program, with all of the clumsiness inherent in that approach.

I'd bitch less about the whole problem if it weren't that the EMBOSS folk, and the NCBI folk (who make BLAST), are paid for software development. As mjg59 points out, most analysis programs are written on research grants, where the short-term view outstrips the long-term view. Not so for EMBOSS, which apparently has a whole team of people writing this stuff. I just don't get it; Perl and Python are perfectly good scripting languages, and they're cross-platform; surely it would be easier to just provide a good embedding of the algorithmically challenging functions and then just write the individual programs as scripts??

O well. Some day I hope to rewrite BLAST and retool CLUSTALW to support a nice library API. 'til then, I guess I'll just gripe about the general problem here ;0).


12 Nov 2004 (updated 12 Nov 2004 at 16:19 UTC) »

QOTDE: "The lessons of history teach us -- if the lessons of history teach us anything -- that nobody learns the lessons that history teaches us." (R. Heinlein)

Use Python -- or a language like it. Plus, my savage hatred of "system()"

Hey, look -- a fan! Matthew, dontcha know that the best way to defeat trolls is to ignore them? Or was that giant advertising animatroids? I forget. (<-- gratuitous Simpsons reference.)

Quite apart from my drug problems (acid freak, not crackhead -- there's a difference!) and the gratuitous misreference to GUI programming (I agree completely! I hate GUIs even more than I hate command-line programs -- they're just useful, on occasion!) and the unfortunate failure of my former coauthors -- the swinish bastards! -- to recognize my contributions to the deep foundations of every paper on Avida, I have to agree that any statement recommending, say, Python over Perl, APL, Pascal, or COBOL as a solution is likely to be at best disingenuous and at worst just plain wrong. It is well-known that any Turing-complete language (given infinite memory, yada yada) can emulate any other -- so why choose between them?

Dunno. But, repetitive as it may be to say it, I think a large part of the solution to bad scientific programming is to use a language like Python. Seriously, I'm perfectly aware that Lincoln Stein (and likely Matthew Garrett) can kick my ass when it comes to a mano-y-mano, Perl-y-Python scripting contest. I'm even reasonably confident that Lincoln Stein could take me down in person; he looks mean. (I haven't met Matthew.) But to cite an N of 1 ("worked for me!") as an actual argument... well, I'm no math major but it seems like a large std deviation.

An argument that I might make, were I still slavishly and unreasonably devoted to Perl rather than to Python, would be to point out that anyone writing C extensions for Perl by hand without using SWIG and/or XSAPI probably has bigger problems than over-frequent enjoyment of a little crack. If that's the big problem with Perl, then it's not a problem at all.

This argument ignores the value of writing pseudocode instead of line noise, but that seems to be a personal preference rather than an absolute, for some reason...

And (seriously) Matt's point that this is a social problem is entirely correct. Teaching people Python at an earlier age might help there. ;)

...why "system()" sucks.

But let's move on to a different argument: my savage hatred of "system()". Do an experiment: try writing a parser for the "generic" GFF format. What, you say? That's easy? Sure is -- for each and every one of the bajillion programs that output GFF, it's easy! Now, let's see which field(s) they overloaded this time...

The problem, to put it bluntly, is formats. In information theoretic terms, stdout is often a very lossy channel, and it is difficult (and often impossible) to make it 100% clean. Why? Well, suppose someone gives you some brilliantly written (and novel) standalone piece of code, and it takes in sequences in FASTA format together with a couple of parameters. Now the program does some fantastically complex set of calculations -- gene finding, HMM search, Gibbs sampling, sequence alignment -- and spits out some text as a result. That's right -- some text. What does the text mean? At this point the hapless user of a novel program has several options. S/he can:

  1. write a one-off parser that grabs the necessary data and runs.
  2. write a complete parser that parses all of the output and puts it into a nice structure for later use.
  3. hope like hell that the author of the program provided a "standard" format like GFF that captures some significant component of the output.
  4. wait for someone more anal retentive (or needier, or smarter, or harder-working) to write a really good parser for the format.

Libraries like BioPerl or BioPython give you #3 and #4 (with time). #2 takes a lot of effort and is only worth it when you really need all of the info in the output. #1 is what everybody does, in practice, right up until it bites 'em in the butt.

There's one huuuuuge problem with all of this, however: you're at the mercy of the author of the package to provide full, honest information in the output. Well, good luck with that, and have a good time rewriting your parser when Joe Package Author decides that semicolons are a better divider than commas...
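To see where the fragility lives, here is a rough sketch of option #1 for GFF-style output. The nine tab-separated columns are the standard ones, but the function name and the decision to leave the attributes column as raw text are illustrative assumptions: that last column is the one every program fills in differently.

```python
from collections import namedtuple

GFFRecord = namedtuple(
    "GFFRecord",
    "seqname source feature start end score strand frame attributes",
)


def parse_gff_line(line):
    """Parse one tab-separated GFF line; return None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields, got %d"
                         % len(fields))
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    return GFFRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand, frame,
        # Left unparsed on purpose: commas, semicolons, or something else
        # entirely, depending on which program wrote the file.
        "\t".join(fields[8:]),
    )
```

The first eight columns parse the same way everywhere; any code that goes on to split the attributes string is betting on one particular program's habits.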

It should be obvious that the best solutions above (#2/#4) can only ever be as good as a good embedding of the package in your SLOC (Scripting Language Of Choice). And, far too often, the actual parsing solution isn't that good, and can't be extended without breaking everybody else's parsers. That's why command-line executables with no associated library or embedding will, to a general and somewhat loose approximation, always suck.

So, people: use Python. Or COBOL. And write library functions loosely wrapped in main()s, not deeply embedded spaghetti code.
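What "library functions loosely wrapped in main()s" might look like, as a minimal sketch: gc_content is a made-up example analysis, not something from any package mentioned here. The point is only that the function returns a value, and the text formatting lives in a thin wrapper that other code never has to go through.

```python
import sys


def gc_content(sequence):
    """Library function: fraction of G/C bases, returned as a number."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def main(argv=None):
    """Thin command-line wrapper: read arguments, call the library, print."""
    args = sys.argv[1:] if argv is None else argv
    if not args:
        print("usage: gc.py SEQUENCE [SEQUENCE ...]", file=sys.stderr)
        return 1
    for seq in args:
        print("%s\t%.3f" % (seq, gc_content(seq)))
    return 0

# A script version would finish by calling sys.exit(main()) under the
# usual __name__ == "__main__" guard.
```

Anyone who wants the number imports gc_content and gets a float; nobody has to parse the tab-separated text unless they choose to use the command line.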


The shoutout today goes to gnutizen, who obviously has his own drug issues; he certified me as "Journeyer"!

p.s. It turns out I was a math major. Huh. Weird.

p.p.s. If someone with some Perl and C/C++ knowledge were to go comment on my SWIG/Perl embedding of motility (see the CVS) it could be most useful to me. Just a thought.

p.p.p.s. In the bioinformatics language wars, I have to say that Bioconductor really takes the cake in the "absurdity" category. I personally like R, but why someone would choose it over a more mainstream language for general-purpose programming <shakes head>...

11 Nov 2004 (updated 11 Nov 2004 at 21:04 UTC) »

QOTDE: "If we knew what we were doing, it wouldn't be called 'research'." (A. Einstein, esq.)

Research programming, and the Doom of Command-Line Executables

Scientific analysis programs are often badly written, and usually available only as command-line executables.

The first question is, why? There are a few different reasons:

  1. Scientific programmers are usually grad students and postdocs. These people are entirely untrained (and uninterested) in programming or software engineering.

  2. Those who are trained in software engineering are usually computer scientists of some stripe, so 90% of them are completely useless in front of a computer anyway. (See #1 for the resulting authorship.)

  3. Most scientific projects are ad hoc piles of crap from the first line of code laid down to the last semicolon written.

  4. There is lots of turnover in science: students and postdocs move on quickly.

  5. The standard research programming languages (Fortran and C) do not lend themselves to re-usable code, to say the least.

I don't blame the scientists for the resulting poorly built software. After all, the goal in science is to keep moving forward with your research, and if you take the short-term view on software you'll only think about the next step required for your project. Even if you do try to plan ahead, odds are you're going to be screwed by the Real World, which doesn't care what you think your results should be, and often has its own ideas. Then there's the desire to move on, which doesn't lend itself to good software practices. And, in any case, it's not like anyone teaches software development properly, so scientists have to learn how to do it on their own. Plus, if your advisor/mentor/supervisor tells you that Fortran is the way to go... then Fortran you will use.

Short-term thinking is probably the worst culprit in all of this. Advisors have no obvious incentive to promote long software projects. But I do think this focus can be bad. I've ignored my advisor's direction to focus on the short term twice: once it resulted in Avida (still a going concern 11 years later) and once it resulted in Cartwheel. If Charles and I hadn't simply written Avida (against Chris Adami's instructions) we would have been stuck with a modification to the huge pile of crap that was Tierra at the time. My current advisor, Eric Davidson, simply didn't understand the point of Cartwheel until years later (I'm still not sure what he thinks it is, actually). I think Cartwheel is a success because it's taken over many of the sequence annotation functions in the lab -- and now we don't have to run a bunch of Perl scripts, by hand, on our Beowulf cluster, every time we want to annotate a piece o' sequence. Victory over Perl, at least!

Overall, this kind of short-term thinking results in a lot of short-ish, one-off coding projects that solve a particular research need and contain no obviously re-usable code. Typically this can be encapsulated in a simple command-line program that has relatively obvious parameters and spits out a result that is directly interpretable by one person: the person who wrote the code. At this point the project is considered fini and the coder moves on. Result: one undocumented command-line program that other people may or may not find useful and in any case will be difficult to use.

OK, so that settles why badly-written command-line programs exist in such profusion in research. The second question is, why do I hate them? That's probably fairly obvious, but just to hammer in the point, I'll submit a tirade about that some time in the future.

The final question is, what can we do about it?

I'm convinced that a large part of the answer is this: use a scripting language like Python.

Why "like Python"? How "like Python"?

  1. Python is simple, easy to learn, and fairly concise.

  2. Python is easy to read. (It also looks a lot like cleanly written C should, which helps C programmers out.)

  3. Python makes code re-use relatively easy. In particular, Python is inherently module- and object-oriented.

  4. Python is cross-platform.

  5. Python provides easy access to string processing: functionality that C and Fortran don't really have.

  6. C and C++ code can easily be wrapped in Python.

  7. Python is interpreted & provides interactive command-line access.

  8. Python has automatic memory management: no malloc/free nonsense, or resulting memory corruption.

Hopefully it's obvious why these are all good features for a research programming language! Access to C and C++ code is surprisingly important, because an awful lot of useful code -- research and otherwise -- exists in C and C++ libraries. Plus, when you feel the need for speed, C and C++ are still the way to go.
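As a small taste of calling C from Python, here's a sketch that uses the standard-library `ctypes` module to call `strlen()` from the C runtime. Note that this is a different technique from generating wrappers with SWIG, and it assumes a POSIX system where the C library can be loaded this way; the helper name `c_strlen` is made up for the example.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX system. find_library("c") locates the C runtime;
# if it returns None, CDLL(None) falls back to the symbols already
# loaded into the running process, which include libc on POSIX.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

def c_strlen(data):
    """Call C's strlen() on a bytes object and return a Python int."""
    return libc.strlen(data)
```

Calling `c_strlen(b"ACGT")` hands the bytes to C and gets an ordinary Python integer back. Wrapping a real research library takes more declarations than this, but the pattern is the same: state the C signature once, then call it like any other Python function.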

However, none of the other languages that I'm most familiar with (C, C++, Java, Perl, and Tcl) satisfies all of these. C, C++, and Java are not interpreted, and Java can't easily wrap C/C++ code. Plus, C/C++ are not particularly cross-platform unless you know what you're doing. Perl and Tcl are both good scripting languages that satisfy most of the above criteria -- in particular, wrapping C code in Tcl (although not Perl) is fantastically easy, and Perl is very easy to learn for old C/UNIX hands -- but neither one is object-oriented from the ground up, and neither one supports code re-use very nicely.

Perl is a fucking nightmare when it comes to wrapping C code, too; anyone who doesn't think this is invited to try it. Sheesh. What was Larry Wall thinking?!

Ruby might be a good bet, but then I understand that it's basically Python anyway... (<dons asbestos suit hurriedly>).

So use Python. Trust me -- I know what I'm doing. ;)

Well, that's it for today; gotta go read /. I'll leave you with one final thought: the two dominant points of technological friction for bioinformatics are (a) the widespread use of Perl and Java, and (b) the omnipresence of incredibly useful but hideously unscriptable command-line programs like BLAST and 90% of the pimply little programs in EMBOSS.

(I'd hold Lincoln Stein personally responsible for (a), but the truth is that (i) he's a nice guy and (ii) BioPython isn't helping. That's a whole 'nother story. I must admit to complete bafflement re EMBOSS.)

O hey, here's a shoutout to Nathan Gray, the only other person I know who compulsively writes about stuff on the Web.
