QOTDE: "If we knew what we were doing, it wouldn't be called 'research'."
(A. Einstein, esq.)
Research programming, and the Doom of Command-Line Executables
Scientific analysis programs are often badly written, and usually
available only as command-line executables.
The first question is, why? There are a few different reasons:
- Scientific programmers are usually grad students and postdocs.
These people are entirely untrained (and uninterested) in programming or
- Those who are trained in software engineering are usually
computer scientists of some stripe, so 90% of them are completely
useless in front of a computer anyway. (See #1 for the resulting
- Most scientific projects are ad hoc piles of crap from the first
line of code laid down to the last semicolon written.
- There is lots of turnover in science: students and postdocs
move on quickly.
- The standard research programming languages (Fortran and C)
do not lend themselves to re-usable code, to say the least.
I don't blame the scientists for the resulting poorly built software.
After all, the goal in science is to keep moving forward with your
research, and if you take the short-term view on software you'll only
think about the next step required for your project. Even if you do try
to plan ahead, odds are you're going to be screwed by the Real World,
which doesn't care what you think your results should be, and often
has its own ideas. Then there's the desire to move on, which doesn't
lend itself to good software practices. And, in any case, it's not
like anyone teaches software development properly, so scientists have
to learn how to do it on their own. Plus, if your
advisor/mentor/supervisor tells you that Fortran is the way to
go... then Fortran you will use.
Short-term thinking is probably the worst culprit in all of this.
Advisors have no obvious incentive to promote long software projects.
But I do think this short-term focus can be bad. Twice I've ignored my
advisor's direction to focus on the short term: once it resulted in Avida
(still a going concern 11 years later) and once it resulted in
Cartwheel. If Charles
and I hadn't simply written Avida (against Chris Adami's instructions)
we would have been stuck with a modification to the huge pile of crap
that was Tierra
at the time. My current advisor, Eric Davidson, simply didn't
understand the point of Cartwheel until years later (I'm still not
sure what he thinks it is, actually). I think Cartwheel is a success
because it's taken over many of the sequence annotation functions in
the lab -- and now we don't have to run a bunch of Perl scripts, by
hand, on our Beowulf cluster, every time we want to annotate a piece
o' sequence. Victory over Perl, at least!
Overall, this kind of short-term thinking results in a lot of
short-ish, one-off coding projects that solve a particular research
need and contain no obviously re-usable code. Typically such a project can be
encapsulated in a simple command-line program that has relatively
obvious parameters and spits out a result that is directly
interpretable by one person: the person who wrote the code. At this
point the project is considered fini and the coder moves on. Result:
one undocumented command-line program that other people may or may not
find useful and in any case will be difficult to use.
OK, so that settles why badly-written command-line programs
exist in such profusion in research. The second question is, why do I
hate them? That's probably fairly obvious, but just to hammer in the
point, I'll submit a tirade about that some time in the future.
The final question is, what can we do about it?
I'm convinced that a large part of the answer is this: use a scripting
language like Python.
Why "like Python"? How "like Python"?
- Python is simple, easy to learn, and fairly concise.
- Python is easy to read. (It also looks a lot like cleanly written C
should, which helps C programmers out.)
- Python makes code re-use relatively easy. In particular, Python is
inherently module- and object-oriented.
- Python is cross-platform.
- Python provides easy access to string processing: functionality that C
and Fortran don't really have.
- C and C++ code can easily be wrapped in Python.
- Python is interpreted & provides interactive command-line access.
- Python has automatic memory management: no malloc/free nonsense,
or resulting memory corruption.
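To make a couple of those points concrete, here's a minimal sketch of string processing and module-style re-use in Python. The function names (`parse_fasta`, `gc_content`) and the toy FASTA-style input are made up for illustration; they aren't taken from any particular project.

```python
"""A tiny module of re-usable sequence helpers (hypothetical example)."""


def parse_fasta(text):
    """Parse FASTA-style text into a dict mapping record name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The first word after '>' is taken as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Sequence lines are upper-cased and appended to the current record.
            records[name] += line.upper()
    return records


def gc_content(seq):
    """Return the fraction of G and C characters in seq (0.0 if seq is empty)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```

Saved in a file of its own, something like this can be imported by the next script that needs it, or poked at from the interactive prompt, instead of being rewritten from scratch for every one-off program.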
Hopefully it's obvious why these are all good features for a research
programming language! Access to C and C++ code is surprisingly
important, because an awful lot of useful code -- research and
otherwise -- exists in C and C++ libraries. Plus, when you feel the
need for speed, C and C++ are still the way to go.
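As a rough illustration of what calling C from Python can look like, the sketch below uses the standard-library `ctypes` module to call the C library's `strlen`. It's only one of several wrapping routes, it assumes a POSIX-like system where the C library can be located, and the wrapper name `c_strlen` is invented for the example.

```python
import ctypes
import ctypes.util

# Locate and load the C standard library. Assumption: a POSIX-like system.
# If find_library returns None, CDLL(None) falls back to the symbols that
# are already loaded into the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Return the length in bytes of an ASCII string, as computed by C's strlen."""
    return libc.strlen(text.encode("ascii"))


print(c_strlen("GATTACA"))
```

When there's a lot of C or C++ to expose, hand-written extension modules or a wrapper generator are the usual alternatives; the idea is the same, with the C code doing the heavy lifting and Python doing the scripting.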
However, none of the other languages that I'm most familiar with (C,
C++, Java, Perl, and Tcl) satisfy all of these. C, C++, and Java are
not interpreted, and Java can't easily wrap C/C++ code. Plus, C/C++
are not particularly cross-platform unless you know what you're doing.
Perl and Tcl are both good scripting languages that satisfy most of the
above criteria -- in particular, wrapping C code in Tcl (although not
Perl) is fantastically easy, and Perl is very easy to learn for old
C/UNIX hands -- but neither one is object-oriented from the ground up,
and neither one supports code re-use very nicely.
Perl is a fucking nightmare when it comes to wrapping C code, too;
anyone who doesn't think this is invited to try it. Sheesh. What was
Larry Wall thinking?!
Ruby might be a good bet, but then I understand that it's basically
Python anyway... (<dons asbestos suit hurriedly>).
So use Python. Trust me -- I know what I'm doing. ;)
Well, that's it for today; gotta go read /. I'll leave you with one
final thought: the two dominant points of technological friction for
bioinformatics are (a) the widespread use of Perl and Java, and (b)
the omnipresence of incredibly useful but hideously unscriptable
command-line programs like BLAST and 90% of the pimply little programs
(I'd hold Lincoln Stein personally responsible for (a), but the truth
is that (i) he's a nice guy and (ii) BioPython isn't helping. That's
a whole 'nother story. I must admit to complete bafflement re EMBOSS.)
Oh hey, here's a shoutout to Nathan Gray, the only
other person I know who compulsively writes about stuff on the Web.