Big Data Biology - why efficiency matters
I'm going to pick on Mick Watson today. (It's OK. He's just a foil for this discussion, and I hope he doesn't take it too personally.)
Mick made the following comment on my earlier Big Data Biology blog post:
I do wonder whether there is just a bit too much hand wringing about "big data".
For e.g., the rumen metagenomic data you mentioned above, I can assemble using MetaVelvet on our server in less than a day (admittedly it has 512Gb of RAM, but doesn't everyone?). I can count the 17mers in it using Jellyfish in a few hours.
So I just set the processes running, two days later, I have my analysis. What's the problem? Does it matter that you can do it quicker?
Big data doesn't really worry me.
I know I am being flippant, but really to me the challenge isn't the data, it's the biology. I don't care if it takes 2 hours, 2 days or 2 weeks to process the data.
Improve your computing efficiency by 100x, I don't care; improve your ability to extract biological information by 100x, then I'm interested :)
He makes one very, very, very good point -- who cares if you can run an analysis (whatever it is) and it doesn't provide any value? The end goal of my sequencing analysis is to provide insight into biological processes; I might as well just delete the data (an O(1) "analysis" operation, if one with a big constant in front of it..) if the analysis isn't going to yield useful information.
But he also seems to think that speed and efficiency of analyses doesn't matter for science. And I don't just think he's dead wrong, I know he's dead wrong.
This is both an academic point and a practical point. And, in fact, an algorithmic point.
The academic point is simple: our ability to do thorough exploratory analysis of a large sequencing data set is limited by at least four things. These four things are:
Our ability to do initial processing on the data - error trimming and correction, and data summary (mapping and assembly, basically).
The information available for cross-reference. Most (99.9%) of our bioinformatic analyses rely on homology (for inference of function) and annotation.
(This is why Open Access of data is so freakin' important to us bioinformaticians. If you hide your database from us, it might as well not exist for all we care.)
Statistics. We do a lot of sensitive signal analysis and multiple testing, and we are really quite bad at computing FDRs and other statistical properties. Each statistical advance is greeted with joy.
The ability to complete computations on (1), (2), and (3).
Every 100gb data set takes a day to process. Mapping and assembly can take hours to days to weeks. Each database search costs time and effort (in part because the annotations are all in different formats). Each MCMC simulation or background calculation takes significant time, even if it's automated.
Inefficient computation thus translates to an economic penalty on science (in time, effort, and attention span). This, in turn, leads directly to science that is not as good as it could be (as do poor computational science skills, badly written software, inflexible workflows, opaque pipelines, and too quick a rush to hypotheses -- hey, look, a central theme to my blog posts!)
Anecdote: someone recently e-mailed us to tell us about how they could assemble a comparable soil data set to ours in a mere week and 3 TB of memory. Our internal estimates suggest that for full sensitivity, we need to do 5-10 assemblies of that data set (each with different parameters) followed by a similarly expensive post-assembly merging -- so, minimally, 6 weeks of compute requiring 3 TB of memory, full-time, on as many cores as possible. You've gotta imagine that there's going to be a lot of internal pressure to get results in less time (surely we can get away with only 1 assembly?) with less parameter searching (what, you think we can tell you which parameters are going to work?) and this pressure is going to translate to doing less in the way of data set exploration. (Never mind the actual economics -- since this data set would take about 1 week of sequencer time, and $10,000 or so, to generate today, I think they don't make sense either.)
I can point you to at least three big metagenome Illumina assembly papers where I know these computational limitations truncated their exploration of the data set. (Wait, you say there are only three? Well, I'm not going to tell you which three they are.)
This one's a bit more obvious, but, interestingly, Mick also treads all over it. He says "...I can assemble using MetaVelvet on our server in less than a day (admittedly it has 512 Gb of RAM, but doesn't everyone?"
Well, no, they don't.
We didn't have access to such a big server until recently. We had plenty of offers for occasional access, but when we explained that we needed them for a few weeks of dedicated compute (for parameter exploration -- see above) and also that no, we weren't willing to sign copyright or license for our software over to a national lab for that access, somewhat oddly a lot of the offers came to naught.
It turns out most people don't have access to such bigmem computers, or even big compute clusters; and when they do, those computers and clusters aren't configured for biologists to use.
Democratization of sequencing should mean democratization of analysis, too. Every year our next-gen sequence analysis course gets tons of applicants from small colleges and universities where the compute infrastructure is small and what does exist is overwhelmed by Monte Carlo calculations. Our course explicitly teaches them to use Amazon to do their compute -- with that, they can take that knowledge home, and spend small amounts of money to buy IaaS, or apply for an AWS education grant to do their analysis. We feel for them because we were in their situation until recently.
Expensive compute translates to a penalty on the very ability of many scientists and teachers to access computational science. (Insert snide comment on similar limitations in practical access to US education, health care, and justice).
Assemblers kinda suck. Everyone knows it, and recent contests & papers have done a pretty good job of highlighting the limitations (see GAGE and Assemblathon). This is not because the field is full of stupid people, but rather because assembly is a really, really hard problem (see Nagarajan & Pop) -- so hard that really smart people have worked for decades on it. (In many ways, the fact that it works at all is a tribute to their brilliance.)
Advances in assembly algorithms have led to our current crop of assemblers, but assemblers are still relatively slow and relatively memory consumptive. Our diginorm paper benchmarks Trinity as requiring 38 hours in 42gb of RAM for 100m mouse mRNAseq reads; genome and metagenome assemblers require similar size resources, although the variance depends on the sample, of course. SGA and Cortex seem unreasonably memory efficient to me :), but I understand that they perform less well on things other than single genomes (like, say, metagenomic data) -- in part because the underlying data structures are targeted at specific features of their data.
What's the plan for the future, in which we will be applying next-gen sequencing to non-model organisms, evolutionary experiments, and entire populations of novel critters? These sequencing data sets will have different features from the ones we are used to tackling with current tech -- including higher heterozygosity and strong GC-rich biases.
I personally think the next big advances in assembly will come through the systematic application of sample- or sub-sample specific, compute-expensive algorithms like EMIRGE to our data sets. While perfect assembly may be a pipe dream, significant and useful incremental advances seem very achievable, especially if the practical cost of current assembly algorithms drops.
Not so parenthetically, this is one of the reasons I'm so excited about digital normalization (the general concept, not only our implementation) --
I bet more algorithmically expensive solutions would be investigated, implemented, and applied if memory and time requirements dropped, don't you?
Or if the data could be made less error-prone and simpler?
Or if the volume of data could be reduced without losing much information?
I will take one side of that bet...
Of course, I'm more than a wee bit biased on this whole topic. A big focus of my group has been in spending the last three years fighting the trend of "just use a bigger computer and it will all be OK". Diginorm and partitioning are two of the results, and a few more will be emerging soon. I happen to think it's incredibly important; I would have done something else with my time, energy, and money if not. Hopefully you can agree that it's important, even if you're interested in other things.
So: yes, computational efficiency is not the only thing. And it's a surprisingly convenient moving target; frequently, you yourself can just wait a few months or buy a bigger computer, and achieve similar results. But sometimes that attitude masks the fact that efficient computation can bring better, cheaper, and broader science. We need to pay attention to that, too.
And, Mick? I don't think I can improve your ability to extract biological information by 100x. On metagenomes, would 2-10x be a good enough start?