Name: Titus Brown
Member since: 2004-10-29 00:14:46
Last Login: 2006-12-13 16:10:13
Homepage: http://chabry.caltech.edu/~t/
Missive from a Swing State
I have a habit of occasionally sending odd e-mails to my postdoc lab mailing list, for reasons that I cannot adequately explain. Here's the latest one:
Dear Bronner-Fraser Lab,
I would like to thank you all for your private letters of support; between the blizzards of Colorado, the floods of Texas, and the red-v-blue fighting in Michigan, it's been a tough year for us newly independent Bronner-Fraser-ites!
Now that McCain has withdrawn his cadre from Michigan and Obama has successfully captured the bastion of Michigan State (strategically located next to the capitol of Michigan!), I expect the out-and-out party fighting to die down, at least in the streets of Lansing. While fiercely partisan Reds will probably continue to snipe at us for a while, Obama continues to clear Lansing block-by-block in his advance with forces of crushing superiority. However, the newly added wrinkle of the pro- and anti-bailout forces (or, as they're known locally, the Inflationists and the Depressionists) leads me to believe that we are not long for peace here; more on that soon.
Overall, we've been very lucky. Our compound is located far enough away from densely populated areas that we haven't been affected much by the fighting, and some of our local friends who are less fortunate have bolstered our ranks. Armored vehicles continue to roam the main roads, but again, we're not close enough to them to be affected. We do have a local supply of drinking water, and so far the mine fields are holding up; I do worry a bit about the ammo situation, but I've directed everyone to switch away from automatics until we get a new shipment in.
Internet has not been interrupted, although this is a mixed blessing; I've had to stop communicating with one of the few remaining die-hard conservatives I know, because his e-mails were correlated with attacks by the local McCain militia. I have come to believe that he was hitting me with pro-Palin blogs as a distraction technique, although I quickly saw through the tactic -- even the conservatives don't think she's viable any more.
Anyway, I'll continue to keep you all updated as the fight continues.
best, --titus
Please send ammo!
Python is a ... little language?
Here at MSU, we just had a 40th anniversary celebration of the Computer Science department. As it happens, Carl Page (Sr.) was a founding member of CSE at MSU, and so his son, Carl Page (Jr.) came and participated in a panel. In response to my question about what we should be doing in the CS department to better prepare our students for the future, he commented that our recent move towards Python was good, and that we should push towards more open-source involvement as well. I'm happy with both of those opinions :).
While talking about Python, however, Carl said:
Python is a scrappy little language that plays well with others.
Well, that, or maybe it was:
Python is a crappy little language that plays well with others.
Hmm, which was it??
--titus
The Future of Bioinformatics (in Python), part 1 (b)
My last post initiated a discussion on the biology-in-python mailing list about BioPython, among other things. (Here is a link to the discussion, which is kind of long and unfocused.)
I'm happy that the bip list is serving as a place for people to interact with the BioPython maintainers to discuss the future of BioPython. Hopefully it will lead to more involvement with BioPython, which would be a good thing.
However, I would like to take the time to question the longer-term utility of the BioPython/Perl approach.
Bioinformatics -- by which I mostly mean sequence analysis -- has predominantly followed the UNIX scripting/pipeline model, in which data is kept in simple, easily-manipulated formats (comma- or tab-separated values, or CSV) and then processed incrementally. This approach has a number of advantages:
- Each step is isolated and so easier to understand.
- Each step produces a simple, easy-to-parse kind of data.
- Each step is language neutral (anything can read CSV).
- New programmers can learn to use each step in isolation.
- The components are re-usable.
I've used this exact approach for well over a decade: first for analyzing Avida data, then Earthshine, and most recently scads of genome data.
This scripting & pipeline approach is what BioPython and BioPerl facilitate. They have a lot of tools for running programs to produce data and loading in different formats, and they serve as a good library for this purpose.
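To make the pipeline model concrete, here is a minimal sketch of a single step written in plain Python. The column names (`name`, `length`), the threshold, and the function name `filter_min_length` are all made up for illustration; this is not BioPython or BioPerl code, just the general shape of a step that reads tab-separated data and writes tab-separated data for the next step.

```python
import csv
import io


def filter_min_length(infile, outfile, min_length):
    """One pipeline step: copy rows whose 'length' column is >= min_length.

    Reads tab-separated text with a header row from infile and writes the
    same format to outfile, so another step can consume the result.
    Returns the number of rows kept.
    """
    reader = csv.DictReader(infile, delimiter="\t")
    writer = csv.DictWriter(
        outfile,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        # Assumption: every row has an integer 'length' column.
        if int(row["length"]) >= min_length:
            writer.writerow(row)
            kept += 1
    return kept


# Hypothetical demo data standing in for a file on disk.
demo_input = io.StringIO("name\tlength\nseqA\t120\nseqB\t35\nseqC\t80\n")
demo_output = io.StringIO()
n_kept = filter_min_length(demo_input, demo_output, min_length=50)
print(demo_output.getvalue(), end="")
```

In a real pipeline the two file objects would typically be `sys.stdin` and `sys.stdout`, so that steps can be chained with shell pipes.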
The scripting/pipeline approach does have some deficiencies as a general data-analysis approach:
- Poor (O(n)) scalability: it is hard to process CSV files in better than linear time, and often the easiest analysis approach is actually O(n**2)
- Hard to test: generally people do not test their scripts. Even now that I've become test infected, I find scripts to be more difficult to test than modules and libraries. I can do it, but it's not natural for me, and empirical evidence suggests that it's not natural for most people.
- Hard to re-use: scripts are often quite fragile with respect to assumptions about input data, and these assumptions are rarely spelled out or asserted within the code. This leads to hard-to-diagnose errors that often occur deep within the tool chain (if they ever show up explicitly).
- Poor metadata support: try attaching metadata to a CSV file. You'll end up with something like GFF3, which overloads the metadata field to mean something slightly different with each database. Awesome.
- Too easy to map into SQL databases: yes, you can load CSV files into SQL databases, but JOINs are a relatively rare form of actual data analysis -- and that's what SQL databases are best at. SQL databases do a particularly poor job of interval analysis (overlap/nearest neighbor extraction/etc.)
- Poor abstraction: when you load something into memory from a CSV file, it's easy to treat it as a list. Lists are, generally, a poor way to interact with sequence annotations. (This is really the same problem as the SQL database problem.)
- Poor user interface: it's hard to put lipstick on a script! People who aren't comfortable with UNIX and file munging (i.e. most biologists) have a hard time using scripts, and it's rather difficult to wrap a script in a GUI or Web site.
- Poor reproducibility: every scientist I know has trouble keeping track of what parameters they used last time they ran a script. Even if they keep track of things in a lab notebook, that's a poor medium for reference; logging and notebook software don't seem to work very well for this, either.
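The scalability and abstraction points in the list above tend to show up together. As a rough illustration (the feature names and coordinates are invented, and `naive_overlaps` is a hypothetical helper, not something from an existing library), this is the kind of code that falls out naturally once annotations have been loaded from a CSV file into a plain list:

```python
def naive_overlaps(features):
    """Return name pairs for all overlapping features.

    features is a list of (name, start, end) tuples with half-open
    intervals [start, end). Every pair is compared, so the running
    time grows as O(n**2) in the number of features.
    """
    pairs = []
    for i, (name_a, start_a, end_a) in enumerate(features):
        for name_b, start_b, end_b in features[i + 1:]:
            # Two half-open intervals overlap when each one starts
            # before the other one ends.
            if start_a < end_b and start_b < end_a:
                pairs.append((name_a, name_b))
    return pairs


# Hypothetical annotations, as they might look after parsing a CSV file.
features = [
    ("geneA", 100, 200),
    ("geneB", 150, 250),
    ("geneC", 300, 400),
    ("geneD", 240, 310),
]
overlaps = naive_overlaps(features)
print(overlaps)
```

This is fine for a few thousand features, but the all-against-all comparison becomes the bottleneck long before you reach tens of millions of reads.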
These deficiencies didn't bother me too much when I was first interacting with genomic data, but they've become glaringly apparent in the face of massively parallel sequencing data. The advent of 454 and (particularly) Solexa sequencing data, where you can get tens of millions of short reads from a DNA sample, means that scalability concerns dominate; the ready availability of such data means that everyone has some and needs to analyze it, and they want good, fast, correct tools to do so. In the struggle to cope with this data, things like maq emerge, which uses a largely opaque intermediate data format to make Solexa data analysis scalable; this ends up being a bit of an intermediate model, where you query and manipulate maq databases from the command line. It can be scripted, but it doesn't have the advantages of language neutrality or easy parse-ability, and so you lose some of the advantages of scripting. Since maq doesn't really work as a programming library, either, you don't gain the benefits of abstraction (it's designed on an extract-transform-load model where you run each command as an isolated operation). There are lots of pieces of bioinformatics software like this: they solve one problem well, but they're not built to output data that can be easily combined with data from another program -- at which point you run into format and scripting issues.
For me, the deficiencies of the scripting model largely come down to the lack of an abstraction layer that separates how the data is stored from how I want to query and manipulate the data. The introduction of a good abstraction layer immediately potentiates re-usability, because now I can separate data loading from data query and start using objects to build queries. It also makes scalability a matter of building a good, general solution once, or perhaps building specific solutions that all look the same at the API level. Once the API is firm, it's relatively straightforward to test; once I can separate the API from implementation I can implement different backend storage and retrieval mechanisms as I like (pickle, SQL, whatever); and I can build a GUI interface without having to change the internals every time I change data storage types or analysis algorithms.
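Here is a small sketch of what that separation could look like. To be clear, this is not pygr's API or anyone else's; `ListStore`, `SortedStore`, `overlapping`, and `count_hits` are hypothetical names chosen for illustration. The point is only that the query code is written once against an interface, and the storage behind it can change without the caller noticing.

```python
import bisect


class ListStore:
    """Simplest possible backend: keep features in a list and scan it."""

    def __init__(self, features):
        self._features = list(features)

    def overlapping(self, start, end):
        """Return features overlapping the half-open interval [start, end)."""
        return sorted(f for f in self._features if f[1] < end and start < f[2])


class SortedStore:
    """Same interface, different storage: features sorted by start position."""

    def __init__(self, features):
        self._features = sorted(features, key=lambda f: f[1])
        self._starts = [f[1] for f in self._features]

    def overlapping(self, start, end):
        """Return features overlapping the half-open interval [start, end)."""
        # Features from index `cutoff` onward start at or after `end`,
        # so they cannot overlap the query and are skipped entirely.
        cutoff = bisect.bisect_left(self._starts, end)
        return sorted(f for f in self._features[:cutoff] if start < f[2])


def count_hits(store, start, end):
    """Analysis code that only knows about the overlapping() method."""
    return len(store.overlapping(start, end))


# Hypothetical (name, start, end) annotations.
features = [
    ("geneA", 100, 200),
    ("geneB", 150, 250),
    ("geneC", 300, 400),
    ("geneD", 240, 310),
]
list_hits = ListStore(features).overlapping(190, 245)
sorted_hits = SortedStore(features).overlapping(190, 245)
print(list_hits == sorted_hits, count_hits(SortedStore(features), 190, 245))
```

Because both backends answer the same question through the same method, one test suite can be run against each of them, and a pickle- or SQL-backed store could be dropped in later without touching `count_hits`.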
On the flip side, once you move into a framework, you now have the problem that you're coding at a level well above most newbie programmers and biologists, so ease-of-use becomes a real issue. This means that people need good documentation and good tutorials, in particular -- the Achilles Heel of open source & academic software. And, of course, the framework has to actually work well and solve problems well enough to reward the casual scientist who needs a tool.
So, with respect to BioPython, I appreciate the functionality it has, but I think the model is wrong for my work (and for work in a world full of genomes and sequence). What I really want is a complete solution stack for sequence analysis and annotation:
data storage
--
object layer
--
scripting layer
--
user interface tools
I'm out of time now, but next installment, I'll talk about how pygr provides much of this "solution stack" for me.
If you're interested in a longer, more detailed version of much the same argument, see Chris Lee's paper with Stott Parker and Michael Gorlick, Evolving from Bioinformatics-in-the-Small to Bioinformatics-in-the-Large.
For a recent overview of pygr's functionality, see the draft paper, Pygr: A Python graph framework for highly scalable comparative genomics and annotation database analysis.
--titus
The Future of Bioinformatics (in Python), part 1 (a)
Chris Lasher wrote a nice blog post naming me as a rabble rouser in the area of "Python in bioinformatics". His post raised a number of interesting points, some of which I'd like to discuss here on my blog.
First, why is Python not more dominant in bioinformatics? I really lay this at the feet of Lincoln Stein, who (from what I can tell) was the dominant force behind BioPerl in the early days. So it worked really well and attracted all sorts of attention and users and actual use. However, I think the tide is shifting away from Perl: from the not-so-imminent release of a complex, backwardsly-incompatible Perl 6, to the massive quantities of completely non-reusable Perl code that have been flung in every direction, people are starting to get sick of Perl. Also, a lot of people in academia are moving towards Python for bioinformatics, if not in a very coordinated way: when I left Caltech, two of the three heavy bioinformatics groups were using Python, and when I arrived at MSU I found several groups doing bioinformatics in Python and only one using Perl (and, at that, mainly because they rely on GMOD).
Heck, there are a lot of Python-in-bio sightings these days. I just went to a talk by Rob Knight, who works on the human microbiome project, and he mentioned developing PyCogent with some collaborators. A lab on campus uses TAMO for motif searching. Cistematic and a variety of tools from the Wold Lab use Python. James Taylor is working hard on developing Galaxy into a general purpose tool. So I don't despair for Python's presence in biology.
I think the world is moving, medium-to-long-term, towards the use of Perl for scripting-level work, Python for frameworks and re-usable software, and R for statistical analysis of data sets (BioConductor is also popping up a lot these days). Personally I think this is the right approach and bodes well in the long term.
--
Second, Chris says,
I think I have not worked with Biopython because I am not encouraged to do so, and am actually discouraged, because of research, and the current culture of academia.
I, too, am struggling with the problem that research scientists, somewhat shockingly, are more interested in doing (and funding) novel research than in building re-usable software. OK, I'm being a bit sarcastic, but that's only a mildly sarcastic statement, really; while it's understandable that researchers want to do research, the rise of large-scale data and computational methods in biology unambiguously argues for computational competence in the next generation of researchers. Part of computational competence is knowing how to get stuff done effectively and correctly, not to mention with reusable software when possible. I am actually shocked that there's so little focus on Software Carpentry-like skills in science and education, and I'm doing my best to push on that front here at MSU (see my very first course here, which is introducing Python, Subversion and automated testing to CSE undergrads).
That computing in biology sucks is not by any means a novel observation; see this nice article, Computational Biology Resources Lack Persistence and Usability, for example. My take on things is that the funding bodies simply need to recognize the utility of software maintenance, which is slowly happening, and that the undergrad and graduate departments need to adapt to the future by teaching this stuff. But there's no question it's going to be Darwinian out there -- as Stewart Brand says, "Once a new technology starts rolling, if you're not part of the steamroller, you're part of the road." Hopefully some of us can be the steamroller and not the road, yeh?
So what's my solution, you might ask? Well, now that I'm a bigshot professor, I'm going to be encouraging (well, demanding where possible) that my students and collaborators use good software development techniques and release their source code and data. But my real "secret" -- and please steal it if you can :) -- is that I hope to continue building a real infrastructure that can underlie solutions to my various research problems. If I can build a re-usable core of well-tested tools on top of a solid framework, I should be able to do research faster, better, smarter, and more reliably than my colleagues and competitors. That should translate into more publications, more grants, and more problems actually solved. (I'll let you know how that goes; it's early days still.)
That, incidentally, is why you should ignore people who tell you not to work on your coding or on general-purpose libraries: because if it's useful to you, it's worth doing right and making it useful to yourself in the future.
This is also one of the reasons why I'm investing a substantial amount of my scarcest resource (time) in pygr. pygr is a solution for scalable storage, retrieval, and named persistence of sequence-associated data, and it works fantastically well. The real problem with pygr is the high barrier to entry, and that's what we're working on lowering, if only so that my own students will have less trouble learning it.
Some other time I'll talk about why pygr and pygr-like solutions are the right solution to reusability in bioinformatics.
So, in summary: don't worry, be happy, Python is coming to bioinformatics one way or another. And don't worry, just work hard at becoming the steamroller (and not the road) by improving your coding skills and becoming a general-purpose computational scientist, or at least general-purpose bioinformatician. You won't regret it.
Heck, you can always come work for me, right? ;)
--titus