The Future of Bioinformatics (in Python), part 1 (b)
My last post initiated a discussion on the biology-in-python mailing list about BioPython, among other things. (Here is a link to the discussion, which is kind of long and unfocused.)
I'm happy that the bip list is serving as a place for people to interact with the BioPython maintainers to discuss the future of BioPython. Hopefully it will lead to more involvement with BioPython, which would be a good thing.
However, I would like to take the time to question the longer-term utility of the BioPython/Perl approach.
Bioinformatics -- by which I mostly mean sequence analysis -- has predominantly followed the UNIX scripting/pipeline model, in which data is kept in simple, easily-manipulated formats (comma- or tab-separated values, or CSV) and then processed incrementally. This approach has a number of advantages:
- Each step is isolated and so easier to understand.
- Each step produces a simple, easy-to-parse kind of data.
- Each step is language neutral (anything can read CSV).
- New programmers can learn to use each step in isolation.
- The components are re-usable.
I've used this exact approach for well over a decade: first for analyzing Avida data, then Earthshine, and most recently scads of genome data.
This scripting & pipeline approach is what BioPython and BioPerl facilitate. They have a lot of tools for running programs to produce data and loading in different formats, and they serve as a good library for this purpose.
The scripting/pipeline approach does have some deficiencies as a general data-analysis approach:
- Poor (O(n)) scalability: processing CSV files is hard to do supra-linearly, and often the easiest analysis approach is actually O(n**2)
- Hard to test: generally people do not test their scripts. Even now that I've become test infected, I find scripts to be more difficult to test than modules and libraries. I can do it, but it's not natural for me, and empirical evidence suggests that it's not natural for most people.
- Hard to re-use: scripts are often quite fragile with respect to assumptions about input data, and these assumptions are rarely spelled out or asserted within the code. This leads to hard-to-diagnose errors that often occur deep within the tool chain (if they ever show up explicitly).
- Poor metadata support: try attaching metadata to a CSV file. You'll end up with something like GFF3, which overloads the metadata field to mean something slightly different with each database. Awesome.
- Too easy to map into SQL databases: yes, you can load CSV files into SQL databases, but JOINs are a relatively rare form of actual data analysis -- and that's what SQL databases are best at. SQL databases do a particular poor job of interval analysis (overlap/nearest neighbor extraction/etc.)
- Poor abstraction: when you load something into memory from a CSV file, it's easy to treat it as a list. Lists are, generally, a poor way to interact with sequence annotations. (This is really the same problem as the SQL database problem.)
- Poor user interface: it's hard to put lipstick on a script! People who aren't comfortable with UNIX and file munging (i.e. most biologists) have a hard time using scripts, and it's rather difficult to wrap a script in a GUI or Web site.
- Poor reproducibility: every scientist I know has trouble keeping track of what parameters they used last time they ran a script. Even if they keep track of things in a lab notebook, that's a poor medium for reference; logging and notebook software don't seem to work very well for this, either.
These deficiencies didn't bother me too much when I was first interacting with genomic data, but they've become glaringly apparent in the face of massively parallel sequencing data. The advent of 454 and (particularly) Solexa sequencing data, where you can get tens of millions of short reads from a DNA sample, means that scalability concerns dominate; the ready availability of such data means that everyone has some and needs to analyze it, and they want good, fast, correct tools to do so. In the struggle to cope with this data, things like maq emerge, which uses a largely opaque intermediate data format to make Solexa data analysis scalable; this ends up being a bit of an intermediate model, where you query and manipulate maq databases from the command line. It can be scripted, but it doesn't have the advantages of language neutrality or easy parse-ability, and so you lose some of the advantages of scripting. Since maq doesn't really work as a programming library, either, you don't gain the benefits of abstraction (it's designed on extract-transform-load model where you run each command as an isolated operation). There are lots of pieces of bioinformatics software like this: they solve one problem well, but they're not built to output data that can be easily combined with data from another program -- at which point you run into format and scripting issues.
For me, the deficiencies of the scripting model largely come down to the lack of an abstraction layer that separates how the data is stored from how I want to query and manipulate the data. The introduction of a good abstraction layer immediately potentiates re-usability, because now I can separate data loading from data query and start using objects to build queries. It also makes scalability a matter of building a good, general solution once, or perhaps building specific solutions that all look the same at the API level. Once the API is firm, it's relatively straightforward to test; once I can separate the API from implementation I can implement different backend storage and retrieval mechanisms as I like (pickle, SQL, whatever); and I can build a GUI interface without having to change the internals every time I change data storage types or analysis algorithms.
On the flip side, once move into a framework, you now have the problem that you're coding at a level well above most newbie programmers and biologists, so ease-of-use becomes a real issue. This means that people need good documentation and good tutorials, in particular -- the Achilles Heel of open source & academic software. And, of course, the framework has to actually work well and solve problems well enough to reward the casual scientist who needs a tool.
So, with respect to BioPython, I appreciate the functionality it has, but I think the model is wrong for my work (and for work in a world full of genomes and sequence). What I really want is a complete solution stack for sequence analysis and annotation:
data storage
--
object layer
--
scripting layer
--
user interface tools
I'm out of time now, but next installment, I'll talk about how pygr provides much of this "solution stack" for me.
If you're interested in a longer, more detailed version of much the same argument, see Chris Lee's paper with Stott Parker and Michael Gorlick, Evolving from Bioinformatics-in-the-Small to Bioinformatics-in-the-Large.
For a recent overview of pygr's functionality, see the draft paper, Pygr: A Python graph framework for highly scalable comparative genomics and annotation database analysis.
--titus
FOAF updates: Trust rankings are now exported, making the data available to other users and websites. An external FOAF URI has been added, allowing users to link to an additional FOAF file.
Keep up with the latest Advogato features by reading the Advogato status blog.
If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!