Older blog entries for titus (starting at number 469)

What is digital normalization, anyway?

I'm out at a Cloud Computing for the Human Microbiome Workshop and I've been trying to convince people of the importance of digital normalization. When I posted the paper the reaction was reasonably positive, but I haven't had much luck explaining why it's so awesome.

At the workshop, people were still confused. So I tried something new.

I first made a simulated metagenome by taking three genomes worth of data from the Chitsaz et al. (2011) paper (see http://bix.ucsd.edu/projects/singlecell/) and shuffling them together. I combined the sequences in a ratio of 10:25:50 for the E. coli sequences, the Staph sequences, and the SAR sequences, respectively; the latter two were single-cell MDA genomic DNA. I took the first 10m reads of this mix and then estimated the coverage.

You can see the coverage of these genomic data sets estimated by using the known reference sequences in the first figure. E. coli looks nice and Gaussian; Staph is smeared from here to heck; and much of the SAR sequence is low coverage. This reflects the realities of single cell sequencing: you get really weird copy number biases out of multiple displacement amplification.

Then I applied three-pass digital normalization (see the paper) and plotted the new abundances. As a reminder, this operates without knowing the reference in advance; we're just using the known reference here to check the effects.

http://ivory.idyll.org/permanent/raw-coverage.png

Coverage of genome read mix, calculated by mapping the mixed reads onto the known reference genomes.

http://ivory.idyll.org/permanent/norm-coverage.png

Coverage post-digital-normalization, again calculated by mapping the mixed reads onto the known reference genomes.

As you can see, digital normalization literally "normalizes" the data to the best of its ability. That is, it cannot create higher coverage where high coverage doesn't exist (for the SAR), but it can convert the existing high coverage into nice, Gaussian distributions centered around a much lower number. You also discard quite a bit of data (look at the X axes -- about 85% of the reads were discarded in downsampling the coverage like this).

When you assemble this, you get as good or better results than assembling the unnormalized data, despite having discarded so much data. This is because no low-coverage data is discarded, so you still retain as much overall covered bases -- just in fewer reads. To boot, it works pretty generically for single genomes, MDA genomes, transcriptomes, and metagenomes.

And, as a reminder? Digital normalization does this in fixed, low memory; in a single pass; and without any reference sequence needed.

Pretty neat.

--titus

Syndicated 2012-04-06 11:17:51 from Titus Brown

What is digital normalization, anyway?

I'm out at a Cloud Computing for the Human Microbiome Workshop and I've been trying to convince people of the importance of digital normalization. When I posted the paper the reaction was reasonably positive, but I haven't had much luck explaining why it's so awesome.

At the workshop, people were still confused. So I tried something new.

I first made a simulated metagenome by taking three genomes worth of data from the Chitsaz et al. (2011) paper (see http://bix.ucsd.edu/projects/singlecell/) and shuffling them together. I combined the sequences in a ratio of 10:25:50 for the E. coli sequences, the Staph sequences, and the SAR sequences, respectively; the latter two were single-cell MDA genomic DNA. I took the first 10m reads of this mix and then estimated the coverage.

You can see the coverage of these genomic data sets estimated by using the known reference sequences in the first figure. E. coli looks nice and Gaussian; Staph is smeared from here to heck; and much of the SAR sequence is low coverage. This reflects the realities of single cell sequencing: you get really weird copy number biases out of multiple displacement amplification.

Then I applied three-pass digital normalization (see the paper) and plotted the new abundances. As a reminder, this operates without knowing the reference in advance; we're just using the known reference here to check the effects.

http://ivory.idyll.org/permanent/raw-coverage.png

Coverage of genome read mix, calculated by mapping the mixed reads onto the known reference genomes.

http://ivory.idyll.org/permanent/norm-coverage.png

Coverage post-digital-normalization, again calculated by mapping the mixed reads onto the known reference genomes.

As you can see, digital normalization literally "normalizes" the data to the best of its ability. That is, it cannot create higher coverage where high coverage doesn't exist (for the SAR), but it can convert the existing high coverage into nice, Gaussian distributions centered around a much lower number. You also discard quite a bit of data (look at the X axes -- about 85% of the reads were discarded in downsampling the coverage like this).

When you assemble this, you get as good or better results than assembling the unnormalized data, despite having discarded so much data. This is because no low-coverage data is discarded, so you still retain as much overall covered bases -- just in fewer reads. To boot, it works pretty generically for single genomes, MDA genomes, transcriptomes, and metagenomes.

And, as a reminder? Digital normalization does this in fixed, low memory; in a single pass; and without any reference sequence needed.

Pretty neat.

--titus

Syndicated 2012-04-05 02:59:51 from Titus Brown

Our approach to replication in computational science

I'm pretty proud of our most recently posted paper, which is on a sequence analysis concept we call digital normalization. I think the paper is pretty kick-ass, but so is the way in which we're approaching replication. This blog post is about the latter.

(Quick note re "replication" vs "reproduction": The distinction between replication and reproducibility is, from what I understand, that "replicable" means "other people get exactly the same results when doing exactly the same thing", while "reproducible" means "something similar happens in other people's hands". The latter is far stronger, in general, because it indicates that your results are not merely some quirk of your setup and may actually be right.)

So what did we do to make this paper extra super replicable?

If you go to the paper Web site, you'll find:

  • a link to the paper itself, in preprint form, stored at the arXiv site;
  • a tutorial for running the software on a Linux machine hosted in the Amazon cloud;
  • a git repository for the software itself (hosted on github);
  • a git repository for the LaTeX paper and analysis scripts (also hosted on github), including an ipython notebook for generating the figures (more about that in my next blog post);
  • instructions on how to start up an EC2 cloud instance, install the software and paper pipeline, and build most of the analyses and all of the figures from scratch;
  • the data necessary to run the pipeline;
  • some of the output data discussed in the paper.

(Whew, it makes me a little tired just to type all that...)

What this means is that you can regenerate substantial amounts (but not all) of the data and analyses underlying the paper from scratch, all on your own, on a machine that you can rent for something like 50 cents an hour. (It'll cost you about $4 -- 8 hours of CPU -- to re-run everything, plus some incidental costs for things like downloads.)

Not only can you do this, but if you try it, it will actually work. I've done my best to make sure the darn thing works, and this is the actual pipeline we ourselves ran to produce the figures in the paper. All the data is there, and all of the code used to process the data, analyze the results, and produce the figures is also there. In version control.

When you combine that with the ability to run this on a specific EC2 instance -- a combination of a frozen virtual machine installation and a specific set of hardware -- I feel pretty confident that at least this component of our paper is something that can be replicated.

A few thoughts on replicability, and effort

Why did I go to all this trouble??

Wasn't it a lot of work?

Well, interestingly enough, it wasn't that much work. I already use version control for everything, including paper text; posting it all to github was a matter of about three commands.

Writing the code, analysis scripts, and paper was an immense amount of work. But I had to do that anyway.

The most extra effort I put in was making sure that the big data files were available. I didn't want to add the the 2gb E. coli resequencing data set to git, for example. So I ended up tarballing those files sticking them on S3.

The Makefile and analysis scripts are ugly, but suffice to remake everything from scratch; they were already needed to make the paper, so in order to post them all I had to do was put in a teensy bit of effort to remove some unintentional dependencies.

The ipython notebook used to generate the figures (again -- next blog post) was probably the most effort, because I had to learn how to use it, which took about 20 minutes. But it was one of the smoothest transitions into using a new tool I've ever experienced in my ~25 years of coding.

Overall, it wasn't that much extra effort on my part.

Why bother in the first place??

The first and shortest answer is, because I could, and because I believe in replication and reproducibility, and wanted to see how tough it was to actually do something like this. (It's a good deal above and beyond what most bioinformaticians do.)

Perhaps the strongest reason is that our group has been bitten a lot in recent months by irreplicable results. I won't name names, but several Science and PNAS and PLoS One papers of interest to us turned out to be basically impossible for us to replicate. And, since we are engaged in developing new computational methods that must be compared to previous work, an inability to regenerate exactly the results in those other papers meant we had to work harder than we should have, simply to reproduce what they'd done.

A number of these problems came from people discarding large data sets after publishing, under the mistaken belief that their submission to the Short Read Archive could be used to regenerate their results. (Often SRA submissions are unfiltered, and no one keeps the filtering parameters around...right?) In some cases, I got the right data sets from the authors and could replicate (kudos to Brian Haas of Trinity for this!), but in most cases, ixnay on the eplicationre.

Then there were the cases where authors clearly were simply being bad computational scientists. My favorite example is a very high profile paper (coauthored by someone I admire greatly), in which the script they sent to us -- a script necessary for the initial analyses -- had a syntax error in it. In that case, we were fairly sure that the authors weren't sending us the script they'd actually used... (It was Perl, so admittedly it's hard to tell a syntax error from legitimate code, but even the Perl interpreter was choking on this.)

(A few replication problems came from people using closed or unpublished software, or being hand-wavy about the parameters they used, or using version X of some Web-hosted pipeline for which only version Y was now available. Clearly these are long-term issues that need to be discussed with respect to replication in comp. bio., but that's another topic.)

Thus, my group has wasted a lot of time replicating other people's work. I wanted to avoid making other people go through that.

A third reason is that I really, really, really want to make it easy for people to pick up this tool and use it. Digital normalization is super ultra awesome and I want as little as possible to stand in the way of others using it. So there's a strong element of self-interest in doing things this way, and I hope it makes diginorm more useful. (I know about a dozen people that have already tried it out in the week or so since I made the paper available, which is pretty cool. But citations will tell.)

What use is replication?

Way back when, Jim Graham politely schooled me in the true meaning of reproducibility, as opposed to replication. He was about 2/3 right, but then he went a bit too far and said

But let's drop the idea that I'm going to take your data and your code and "reproduce" your result. I'm not. First, I've got my own work to do. More importantly, the odds are that nobody will be any wiser when I'm done."

Well, let's take a look at that concern, shall we?

With the benefit of about two years of further practice, I can tell you this is a dangerously wrong way to think, at least in the field of bioinformatics. My objections hinge on a few points:

First, based on our experiences so far, I'd be surprised if the authors themselves could replicate their own computational results -- too many files and parameters are missing. We call that "bad science".

Second, odds are, the senior professor has little or no detailed understanding of what bioinformatic steps were taken in processing the data, and moreover is uninterested in the details; that's why they're not in the Methods. Why is that a problem? Because the odds are quite good that many biological analyses hinge critically on such points. So the peer reviewers and the community at large need to be able to evaluate them (see this RNA editing kerfuffle for an excellent example of reviewer fail). Yet most bioinformatic pipelines are so terribly described that even with some WAG I can't figure out what, roughly speaking, is going on. I certainly couldn't replicate it, and generating specific critiques is quite difficult in that kind of circumstance.

Parenthetically, Graham does refer to the climate sciences struggles with reproducibility and replication. If only they put the same effort into replication and data archiving they did into arguing with climate change deniers...

Third, Graham may be guilty of physics chauvinism (just like I'm almost certainly guilty of bioinformatics chauvinism...) Physics and biology are quite different: in physics, you often have a theoretical framework to go by, and results should at least roughly adhere to that or else they are considered guilty until proven innocent. In biology, we usually have no good idea of what we're expecting to see, and often we're looking at a system for the very first time. In that environment, I think it's important to make the underlying computation WAY more solid than you would demand in physics (see RNA editing above).

As Narayan Desai pointed out to me (following which I then put it in my PyCon talk (slide 5)), physics and biology are quite different in the way data is generated and analyzed. There's fewer sources of data generation in physics, there's more of a computational culture, and there's more theory. Having worked with physicists for much of my scientific life (and having published a number of papers with physicists) I can tell you that replication is certainly a big problem over there, but the consequences don't seem as big -- eventually the differences between theory and computation will be worked out, because they're far more noticeable when you have theory, like in physics. Not so in biology.

Fourth, a renewed emphasis on computational methods (and therefore on replicability of computational results) is a natural part of the transition to Big Data biology. The quality of analysis methods matters A LOT when you are dealing with massive data sets with weak signals and many systematic biases. (I'll write about this more later.)

Fifth, and probably most significant from a practical perspective, Graham misses the point of reuse. In bioinformatics, it behooves us to reuse proven (aka published) tools -- at least we know they worked for someone, at least once, which is not usually the case for newly written software. I don't pretend that it's the responsibility of people to write awesome reusable tools for every paper, but sure as heck I should expect to be able to run them on some combination of hardware and software. Often that's not the case, which means I get to reinvent the wheel (yay...) even when I'm doing the same stupid thing the last five pubs did.

For our paper, khmer and screed should be quite reusable. The analysis pipeline for the paper? It's not that great. But at least you can run it, and potentially steal code from it, too.

When I was talking to a colleague about the diginorm paper, he said something jokingly: "wow, you're making it way too easy for people!" -- presumably he meant it would be way to easy for people to criticize or otherwise complain about the specific way we're doing things. Then, a day or two later he said, "hmm, but now that I think of it, no one ever uses the software we publish, and you seem to have had better luck with that..." -- recognizing that if you are barely able to run your own software, perhaps others might find it even more difficult.

Heck, the diginorm paper itself would have been far harder to write without the data sets from the Trinity paper and the Velvet-SC paper. Having those nice, fresh, well-analyzed data sets already at hand was fantastic. Being able to run Trinity and reproduce their results was wonderful.

There's a saying in software engineering: "one of the main people you should be programming for is yourself, in 6 months." That's also true in science -- I'm sure I won't remember the finer details of the diginorm paper analysis in 2 years -- but I can always go look into version control. More importantly, new graduate students can go look and really see what's going on. (And I can use it for teaching, too.) And so can other people working with me. So there's a lot of utility in simply nailing everything down and making it runnable.

Replication is by no means sufficient for good science. But I'll be more impressed by the argument that "replication isn't all that important" when I see lack of replication as the exception rather than the rule. Replication is essential, and good, and useful. I long for the day when it's not interesting, because it's so standard. In the meantime I would argue that it certainly doesn't do any harm to emphasize it.

(Note that I really appreciate Jim Graham's commentary, as I think he is at worst usefully wrong on these points, and substantially correct in many ways. I'm just picking on him because he wrote it all down in one place for me to link to, and chose to use the word 'sic' when reproducing my spelling mistake. Low blow ;)

The future

I don't pretend to have all, or even many, of the answers; I just like to think about what form they might take.

I don't want to argue that this approach is a panacea or a high-quality template for others to use, inside or out of bioinformatics. For one thing, I haven't automated some of the analyses in the paper; it's just too much work for too little benefit at this point. (Trust me, they're easy to reproduce... :). For another, our paper used a fairly small amount of data overall; only a few dozen gigabytes all told. This makes it easy to post the data for others to use later on. Several of our next few papers will involve over a half terabyte of raw data, plus several hundred gb of ancillary and intermediate results; no idea what we'll do for them.

Diginorm is also a somewhat strange bioinformatics paper. We just analyzed other people's data sets (an approach which for some reason isn't in favor in high impact bioinformatics, probably because high impact journal subs are primarily reviewed by biologists who want to see cool new data that we don't understand, not boring old data that we don't understand). There's no way we can or should argue that biological replicates done in a different lab should replicate the results; that's where reproducibility becomes important.

But I would like it if people considered this approach (or some other approach) to making their analyses replicable. I don't mind people rejecting good approaches because they don't fit; to each their own. But this kind of limited enabling of replication isn't that difficult, frankly, and even if it were, it has plenty of upsides. It's definitely not irrelevant to the practice of science -- I would challenge anyone to try to make that claim in good faith.

--titus

p.s. I think I have to refer to this cancer results not reproducible paper somewhere. Done.

Syndicated 2012-04-02 14:29:39 from Titus Brown

Our approach to replication in computational science

I'm pretty proud of our most recently posted paper, which is on a sequence analysis concept we call digital normalization. I think the paper is pretty kick-ass, but so is the way in which we're approaching replication. This blog post is about the latter.

(Quick note re "replication" vs "reproduction": The distinction between replication and reproducibility is, from what I understand, that "replicable" means "other people get exactly the same results when doing exactly the same thing", while "reproducible" means "something similar happens in other people's hands". The latter is far stronger, in general, because it indicates that your results are not merely some quirk of your setup and may actually be right.)

So what did we do to make this paper extra super replicable?

If you go to the paper Web site, you'll find:

  • a link to the paper itself, in preprint form, stored at the arXiv site;
  • a tutorial for running the software on a Linux machine hosted in the Amazon cloud;
  • a git repository for the software itself (hosted on github);
  • a git repository for the LaTeX paper and analysis scripts (also hosted on github), including an ipython notebook for generating the figures (more about that in my next blog post);
  • instructions on how to start up an EC2 cloud instance, install the software and paper pipeline, and build most of the analyses and all of the figures from scratch;
  • the data necessary to run the pipeline;
  • some of the output data discussed in the paper.

(Whew, it makes me a little tired just to type all that...)

What this means is that you can regenerate substantial amounts (but not all) of the data and analyses underlying the paper from scratch, all on your own, on a machine that you can rent for something like 50 cents an hour. (It'll cost you about $4 -- 8 hours of CPU -- to re-run everything, plus some incidental costs for things like downloads.)

Not only can you do this, but if you try it, it will actually work. I've done my best to make sure the darn thing works, and this is the actual pipeline we ourselves ran to produce the figures in the paper. All the data is there, and all of the code used to process the data, analyze the results, and produce the figures is also there. In version control.

When you combine that with the ability to run this on a specific EC2 instance -- a combination of a frozen virtual machine installation and a specific set of hardware -- I feel pretty confident that at least this component of our paper is something that can be replicated.

A few thoughts on replicability, and effort

Why did I go to all this trouble??

Wasn't it a lot of work?

Well, interestingly enough, it wasn't that much work. I already use version control for everything, including paper text; posting it all to github was a matter of about three commands.

Writing the code, analysis scripts, and paper was an immense amount of work. But I had to do that anyway.

The most extra effort I put in was making sure that the big data files were available. I didn't want to add the the 2gb E. coli resequencing data set to git, for example. So I ended up tarballing those files sticking them on S3.

The Makefile and analysis scripts are ugly, but suffice to remake everything from scratch; they were already needed to make the paper, so in order to post them all I had to do was put in a teensy bit of effort to remove some unintentional dependencies.

The ipython notebook used to generate the figures (again -- next blog post) was probably the most effort, because I had to learn how to use it, which took about 20 minutes. But it was one of the smoothest transitions into using a new tool I've ever experienced in my ~25 years of coding.

Overall, it wasn't that much extra effort on my part.

Why bother in the first place??

The first and shortest answer is, because I could, and because I believe in replication and reproducibility, and wanted to see how tough it was to actually do something like this. (It's a good deal above and beyond what most bioinformaticians do.)

Perhaps the strongest reason is that our group has been bitten a lot in recent months by irreplicable results. I won't name names, but several Science and PNAS and PLoS One papers of interest to us turned out to be basically impossible for us to replicate. And, since we are engaged in developing new computational methods that must be compared to previous work, an inability to regenerate exactly the results in those other papers meant we had to work harder than we should have, simply to reproduce what they'd done.

A number of these problems came from people discarding large data sets after publishing, under the mistaken belief that their submission to the Short Read Archive could be used to regenerate their results. (Often SRA submissions are unfiltered, and no one keeps the filtering parameters around...right?) In some cases, I got the right data sets from the authors and could replicate (kudos to Brian Haas of Trinity for this!), but in most cases, ixnay on the eplicationre.

Then there were the cases where authors clearly were simply being bad computational scientists. My favorite example is a very high profile paper (coauthored by someone I admire greatly), in which the script they sent to us -- a script necessary for the initial analyses -- had a syntax error in it. In that case, we were fairly sure that the authors weren't sending us the script they'd actually used... (It was Perl, so admittedly it's hard to tell a syntax error from legitimate code, but even the Perl interpreter was choking on this.)

(A few replication problems came from people using closed or unpublished software, or being hand-wavy about the parameters they used, or using version X of some Web-hosted pipeline for which only version Y was now available. Clearly these are long-term issues that need to be discussed with respect to replication in comp. bio., but that's another topic.)

Thus, my group has wasted a lot of time replicating other people's work. I wanted to avoid making other people go through that.

A third reason is that I really, really, really want to make it easy for people to pick up this tool and use it. Digital normalization is super ultra awesome and I want as little as possible to stand in the way of others using it. So there's a strong element of self-interest in doing things this way, and I hope it makes diginorm more useful. (I know about a dozen people that have already tried it out in the week or so since I made the paper available, which is pretty cool. But citations will tell.)

What use is replication?

Way back when, Jim Graham politely schooled me in the true meaning of reproducibility, as opposed to replication. He was about 2/3 right, but then he went a bit too far and said

But let's drop the idea that I'm going to take your data and your code and "reproduce" your result. I'm not. First, I've got my own work to do. More importantly, the odds are that nobody will be any wiser when I'm done."

Well, let's take a look at that concern, shall we?

With the benefit of about two years of further practice, I can tell you this is a dangerously wrong way to think, at least in the field of bioinformatics. My objections hinge on a few points:

First, based on our experiences so far, I'd be surprised if the authors themselves could replicate their own computational results -- too many files and parameters are missing. We call that "bad science".

Second, odds are, the senior professor has little or no detailed understanding of what bioinformatic steps were taken in processing the data, and moreover is uninterested in the details; that's why they're not in the Methods. Why is that a problem? Because the odds are quite good that many biological analyses hinge critically on such points. So the peer reviewers and the community at large need to be able to evaluate them (see this RNA editing kerfuffle for an excellent example of reviewer fail). Yet most bioinformatic pipelines are so terribly described that even with some WAG I can't figure out what, roughly speaking, is going on. I certainly couldn't replicate it, and generating specific critiques is quite difficult in that kind of circumstance.

Parenthetically, Graham does refer to the climate sciences struggles with reproducibility and replication. If only they put the same effort into replication and data archiving they did into arguing with climate change deniers...

Third, Graham may be guilty of physics chauvinism (just like I'm almost certainly guilty of bioinformatics chauvinism...) Physics and biology are quite different: in physics, you often have a theoretical framework to go by, and results should at least roughly adhere to that or else they are considered guilty until proven innocent. In biology, we usually have no good idea of what we're expecting to see, and often we're looking at a system for the very first time. In that environment, I think it's important to make the underlying computation WAY more solid than you would demand in physics (see RNA editing above).

As Narayan Desai pointed out to me (following which I then put it in my PyCon talk (slide 5)), physics and biology are quite different in the way data is generated and analyzed. There's fewer sources of data generation in physics, there's more of a computational culture, and there's more theory. Having worked with physicists for much of my scientific life (and having published a number of papers with physicists) I can tell you that replication is certainly a big problem over there, but the consequences don't seem as big -- eventually the differences between theory and computation will be worked out, because they're far more noticeable when you have theory, like in physics. Not so in biology.

Fourth, a renewed emphasis on computational methods (and therefore on replicability of computational results) is a natural part of the transition to Big Data biology. The quality of analysis methods matters A LOT when you are dealing with massive data sets with weak signals and many systematic biases. (I'll write about this more later.)

Fifth, and probably most significant from a practical perspective, Graham misses the point of reuse. In bioinformatics, it behooves us to reuse proven (aka published) tools -- at least we know they worked for someone, at least once, which is not usually the case for newly written software. I don't pretend that it's the responsibility of people to write awesome reusable tools for every paper, but sure as heck I should expect to be able to run them on some combination of hardware and software. Often that's not the case, which means I get to reinvent the wheel (yay...) even when I'm doing the same stupid thing the last five pubs did.

For our paper, khmer and screed should be quite reusable. The analysis pipeline for the paper? It's not that great. But at least you can run it, and potentially steal code from it, too.

When I was talking to a colleague about the diginorm paper, he said something jokingly: "wow, you're making it way too easy for people!" -- presumably he meant it would be way to easy for people to criticize or otherwise complain about the specific way we're doing things. Then, a day or two later he said, "hmm, but now that I think of it, no one ever uses the software we publish, and you seem to have had better luck with that..." -- recognizing that if you are barely able to run your own software, perhaps others might find it even more difficult.

Heck, the diginorm paper itself would have been far harder to write without the data sets from the Trinity paper and the Velvet-SC paper. Having those nice, fresh, well-analyzed data sets already at hand was fantastic. Being able to run Trinity and reproduce their results was wonderful.

There's a saying in software engineering: "one of the main people you should be programming for is yourself, in 6 months." That's also true in science -- I'm sure I won't remember the finer details of the diginorm paper analysis in 2 years -- but I can always go look into version control. More importantly, new graduate students can go look and really see what's going on. (And I can use it for teaching, too.) And so can other people working with me. So there's a lot of utility in simply nailing everything down and making it runnable.

Replication is by no means sufficient for good science. But I'll be more impressed by the argument that "replication isn't all that important" when I see lack of replication as the exception rather than the rule. Replication is essential, and good, and useful. I long for the day when it's not interesting, because it's so standard. In the meantime I would argue that it certainly doesn't do any harm to emphasize it.

(Note that I really appreciate Jim Graham's commentary, as I think he is at worst usefully wrong on these points, and substantially correct in many ways. I'm just picking on him because he wrote it all down in one place for me to link to, and chose to use the word 'sic' when reproducing my spelling mistake. Low blow ;)

The future

I don't pretend to have all, or even many, of the answers; I just like to think about what form they might take.

I don't want to argue that this approach is a panacea or a high-quality template for others to use, inside or out of bioinformatics. For one thing, I haven't automated some of the analyses in the paper; it's just too much work for too little benefit at this point. (Trust me, they're easy to reproduce... :). For another, our paper used a fairly small amount of data overall; only a few dozen gigabytes all told. This makes it easy to post the data for others to use later on. Several of our next few papers will involve over a half terabyte of raw data, plus several hundred gb of ancillary and intermediate results; no idea what we'll do for them.

Diginorm is also a somewhat strange bioinformatics paper. We just analyzed other people's data sets (an approach which for some reason isn't in favor in high impact bioinformatics, probably because high impact journal subs are primarily reviewed by biologists who want to see cool new data that we don't understand, not boring old data that we don't understand). There's no way we can or should argue that biological replicates done in a different lab should replicate the results; that's where reproducibility becomes important.

But I would like it if people considered this approach (or some other approach) to making their analyses replicable. I don't mind people rejecting good approaches because they don't fit; to each their own. But this kind of limited enabling of replication isn't that difficult, frankly, and even if it were, it has plenty of upsides. It's definitely not irrelevant to the practice of science -- I would challenge anyone to try to make that claim in good faith.

--titus

p.s. I think I have to refer to this cancer results not reproducible paper somewhere. Done.

Syndicated 2012-04-02 02:40:28 from Titus Brown

Digital normalization of short-read shotgun data

We just posted a pre-submission paper to arXiv.org:

A single pass approach to reducing sampling variation, removing errors, and scaling de novo assembly of shotgun sequences

Authors: C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, and Timothy H. Brom

arXiv link

Paper Web site, with source code and tutorials

Abstract:

Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified single-cell genomes, and metagenomes enable the sensitive investigation of a wide range of biological phenomena. However, it is increasingly difficult to deal with the volume of data emerging from deep short-read sequencers, in part because of random and systematic sampling variation as well as a high sequencing error rate. These challenges have led to the development of entire new classes of short-read mapping tools, as well as new de novo assemblers. Even newer assembly strategies for dealing with transcriptomes, single-cell genomes, and metagenomes have also emerged. Despite these advances, algorithms and compute capacity continue to be challenged by the continued improvements in sequencing technology throughput. We here describe an approach we term digital normalization, a single-pass computational algorithm that discards redundant data and both sampling variation and the number of errors present in deep sequencing data sets. Digital normalization substantially reduces the size of data sets and accordingly decreases the memory and time requirements for de novo sequence assembly, all without significantly impacting content of the generated contigs. In doing so, it converts high random coverage to low systematic coverage. Digital normalization is an effective and efficient approach to normalizing coverage, removing errors, and reducing data set size for shotgun sequencing data sets. It is particularly useful for reducing the compute requirements for de novo sequence assembly. We demonstrate this for the assembly of microbial genomes, amplified single-cell genomic data, and transcriptomic data. The software is freely available for use and modification.

---

I'll blog more about this stuff over the next few days, but, briefly, this paper presents a single pass, fixed-memory approach to downsampling sequencing data (yeah, the stuff I've been talking about for a while now). This approach is called "digital normalization", or "diginorm" for sure. It eliminates lots of data, evens out coverage, and removes errors from shotgun sequencing data sets. The net effect is an often massive amount of data reduction combined with significant scaling of de novo assembly.

Or, to put it another way, it's a streaming lossy compression algorithm that primarily "loses" errors in sequencing data.

We've implemented it as a preprocessing filter that should work on any data set you want to assemble, with potentially any assembler. It's written in C++, with a Python wrapper, as part of khmer. And, of course, it's freely available for use, re-use, modification, and redistribution under the BSD license, 'cause why the heck not?

If you want to try it out, we've linked to some tutorials for running microbial genome assemblies with Velvet, as well as eukaryotic transcriptome assemblies with Oases and Trinity, on the paper Web site

We'll submit this paper to PNAS on Friday; I'm still waiting for some final proofreading.

Together with our other paper, which is currently under revision for resubmission to PNAS, these two papers form the theoretical basis for our attack on soil metagenome assembly. (Our plan is sheer elegance in its simplicity!)

--titus

Syndicated 2012-03-22 01:38:07 from Titus Brown

Top 12 reasons you know you are a Big Data biologist

A few people have recently asked me what this "Big Data" thing is in biology. It's not an easy question to answer, though, because biology's a bit peculiar, and a lot of Big Data researchers are not working in bio. While I was thinking about this I kept on coming up with anecdotes -- and, well, that turned into this: the Top 12 Reasons You Know You Are a Big Data Biologist.

---

  1. You no longer get enthused by the prospect of more data.

I was at a conference a few months back, and an Brit colleague of mine rushed up to me and said, "Hey! We just got an Illumina HiSeq! Do you have anything you want sequenced?" My immediate, visceral response was "Hell no! We can't even deal with a 10th of the sequence we've already got! PLEASE don't send me any more! Can I PAY you to not send me any more?"

  1. Analysis is the bottleneck.

I'm dangerously close to genomics bingo here, but:

I was chatting with a colleague in the hallway here at MSU, pitching some ideas about microbiome work, and he said, "What a good idea! I have some samples here that I'd like to sequence that could help with that." I responded, "How about we stop producing data and think about the analysis?" He seemed only mildly offended -- I have a rep, you see -- but biology, as a field, is wholly unprepared to go from data to an analyses.

My lab has finally turned the corner from "hey, let's generate more data!" to "cool, data that we can analyze and build hypotheses from!" -- yeah, we're probably late bloomers, but after about 3 years of tech development, there's very little we can't do at a basic sequence level. Analysis at that level is no longer the bottleneck. Next step? Trying to do some actual biology! But we are working in a few very specific areas, and I think the whole field needs to undergo this kind of shift.

  1. You've finally learned that 'for' loops aren't all they're cracked up to be.

This was one of my early lessons. Someone had dumped a few 10s of millions of reads on me -- a mere gigabyte or so of data -- and I was looking for patterns in the reads. I sat down to write my usual opening lines of Python to look at the data: "for sequence in dataset:" and ... just stopped. The data was just too big to do iteration in a scripting language with it! Cleverness of some sort needed to be applied.

Corollary: not just Excel, but BLAST, start to look like Really Bad Ideas That Don't Scale. Sean Eddy's HMMER 3 FTW...

  1. Addressing small questions just isn't that interesting.

Let's face it, if you've just spent a few $k generating dozens of gigabytes amounts of hypothesis-neutral data on gene expression from (say) chick, the end goal of generating a list of genes that are differentially regulated is just not that exciting. (Frankly, you probably could have done it with qPCR, if you didn't want that cachet of "mRNAseq!") What more can you do with the data?

(This point comes courtesy of Jim Tiedje, who says (I paraphrase): "The problem for assistant professors these days is that many of may not be thinking big enough." Also see: Don't be afraid of failure: really go for it)

The biggest training challenge (in my opinion) is going to be training people in how to push past the obvious analysis of the data and go for deeper integrative insight. This will require training not just in biology but in data analysis, computational thinking, significant amounts of informed skepticism, etc. (See our course.) I think about it like this: generating hypotheses from large amounts of data isn't that interesting -- I can do that with publicly available data sets without spending any money! Constraining the space of hypotheses with big data sets is far more interesting, because it gives you the space of hypotheses that aren't ruled out; and its putting your data to good use. I'll, uh, let you know if I ever figure out how to do this myself...

I think there are plenty of people that can learn to do this, but as Greg Wilson correctly points out, there has to be a tradeoff: what do you take out of existing training curricula, to be replaced with training in data analysis? I wish I knew.

  1. Transferring data around is getting more than a bit tedious.

For some recent papers, I had to copy some big files from EC2 over to my lab computer, and from there to our HPC system. It was Slow, in the "time for lunch" sense. And these were small test data sets, compressed. Transferring our big data sets around is getting tedious. Luckily we have a lot of them, so I can usually work on analysis of one while another is transferring.

5. Your sysadmin/HPC administrator yells at you on a regular basis about your disk usage.

We regularly get nastygrams from our local sysadmins accusing of using too many terabytes of disk space. This is in contrast to the good ol' days of physics (which is where I got my sysadmin chops), where your sysadmin would yell at you for using too much CPU...

  1. You regularly crash big memory machines.

My favorite quote of the year so far is from the GAGE paper), in which Salzberg et al (2012) say "For larger genomes, the choice of assemblers is often limited to those that will run without crashing." Basically, they took a reasonably big computer, threw some big data at various assembly packages, and watched the computer melt down. Repeatedly.

Someone recently sent me an e-mail saying "hey, we did it! we took 3 Gb of sequence data from soil and assembled it in only 1 week in 3 TB of RAM!" Pretty cool stuff -- but consider that at least four to six runs would need to be done (parameter sweep!), and it takes only about 1 week and $10k to generate twice that data. In the long run, this does not seem cost effective. (It currently takes us 1-2 months to analyze this data in 300 GB of RAM. I'm not saying we have the answer. ;)

  1. Throwing away data looks better and better.

I made a kind of offhanded comment to a NY Times reporter once (hint: don't do this) about how at some point we're going to need to start throwing away data. He put it as the ultimate quote in his article. People laughed at me for it. (BUT I WAS RIGHT! HISTORY WILL SHOW!)

But seriously, if someone came up to you and said "we can get rid of 90% of your data for you and give you an answer that's just as good", many biologists would have an instant negative response. But I think there's a ground truth in there: a lot of Big Data is noise. If you can figure out how to get rid of it... why wouldn't you? This is an interesting shift in thinking from the "every data point is precious and special" that you adopt when it takes you a !#%!#$ week to generate each data point.

I attended a talk that David Haussler gave at Caltech recently. He was talking about how eventually we would need to resequence millions of individual cancer cells to look for linked sets of mutations. At 50-300 Gb of sequence per cell, that's a lot of data. But most of that data is going to be uninteresting -- wouldn't it be great if we could pick out the interesting people and then throw the rest away? It would certainly help with data archiving and analysis...

8. Big computer companies call you because they're curious about why you're buying such big computers.

True story: a Big Computer Company called up our local HPC to ask why we were buying so many bigmem machines. They said "It's the damned biologists -- they keep on wanting more memory. Why? We don't know - we suspect the software's badly written, but can't tell. Why don't you talk to Titus? He pretends to understand this stuff." I don't think it's weird to get calls trying to sell me stuff -- but it is a bit weird to have our local HPC's buying habits be so out of character, due to work that I and others are doing, that Big Computer Companies notice.

(Note: the software's not mine, and it's not badly written, either.)

9. Your choice must increasingly be "improve algorithms" rather than "buy bigger computers"

I've been banging this drum for a while. Sequencing capacity is outpacing Moore's Law, and so we need to rethink algorithms. An algorithm that was nlogn used to be good enough; now, if analysis requires a supra-linear algorithm, we need to figure out how to make it linear. (Sublinear would be better.)

Anecdote: we developed a nifty data structure for attacking metagenome assembly (see: http://arxiv.org/abs/1112.4193). It scaled (scales) assembly by a factor of about 20x, which got us pretty excited -- that meant we could in theory assemble things like MetaHIT and rumen on commodity hardware without doing abundance filtering. Literally the day that we told our collaborators we had it working, they dumped 10x more data on us and told that they could send us more any time we wanted. (...and that, boys and girls, was our introduction to the HiSeq!) 20x wasn't enough. Sigh.

The MG-RAST folk have told me a similar story. They did some awesomely cool engineering and got their pipeline running about 100x faster. That'll hold them for a year or so against the tidalwave of data.

Corollary: don't waste your time with 2% improvements in sensitivity and specificity unless you also deliver 10x in compute performance.

10. You spend a lot of time worrying about biased noise, cross-validation, and the incorrect statistical models used.

We were delayed in some of our research by about a year, because of some systematic biases being placed in our sequencing data by Illumina. Figuring out that these non-biological features were there took about two months; figuring out how to remove them robustly took another 6 months; and then making sure that removing didn't screw up the actual biological signal took another four months.

This is a fairly typical story from people who do a lot of data analysis. We developed a variety of cross-validation techniques and ways of intuiting whether or not something was "real" or "noise", and we spent a certain amount of time discussing what statistical approaches to use to assess significance. In the end we more or less gave up and pointed out that on simulated data what we were doing didn't screw things up.

  1. Silicon Valley wants to hire your students to go work on non-biology problems.

Hey, it's all Big Data, right?

---

So: what is Big Data in biology?

First, I've talked mostly about DNA sequence analysis, because that's what I work on. But I know that proteomics and image analysis people are facing similar dilemmas. So it's not just sequence data.

Second, compute technology is constantly improving. So I think we need moving definitions.

Here are some more serious points that I think bear on what, exactly, problems for Big Data in biology. (They're not all specific to biology, but I can defend them on my home ground more easily, you see.)

  1. You have archival issues on a large scale.

You have lots of homogeneously formatted data that probably contains answers you don't know you're looking for yet, so you need to save it, metadata it, and catalog it. For a long time.

  1. The rate at which data is arriving is itself increasing.

You aren't just getting one data set. You're getting dozens (or hundreds) this year. And you'll get more than that next year.

One implication of this is that you'd better have a largely automated analysis pipeline, or else you will need an increasing number of people just to work with the data, much less do anything interesting. Another implication is that software reuse becomes increasingly important: if you're building custom software for each data set, you will fall behind. A third implication is that you need a long-term plan for scaling your compute capacity.

  1. Data structure and algorithm research is increasingly needed.

You cannot rely on many heavyweight iterations over your data, or simple data structures for lookup: the data is just too big and existing algorithms are tailored to smaller data. For example, BLAST works fine for a few gigabytes of data; past that, it becomes prohibitively slow.

  1. Infrastructure designers are needed.

Issues of straightforward data transfer, network partitioning, and bus bandwidth start to come to the forefront. Bottleneck analysis needs to be done regularly. In the past, you could get away with "good enough", but as throughput continues to increase, bottlenecks will need to be tackled on a regular basis. For this, you need a person who is immersed in your problems on a regular basis; they are hard to find and hard to keep.

One interesting implication here is for cloud computing, where smart people set up a menu of infrastructure options and you can tailor your software to those options. So far I like the idea, but I'm told by aficionados that (for example) Amazon still falls short.

  1. You have specialized hardware needs.

Sort of a corollary of the above: what kind of analyses do you need to do? And what's the hardware bottleneck? That's where you'll get the most benefit from focused hardware investment.

  1. Hardware, infrastructure design, and algorithms all need to work together.

Again, a corollary of the above, but: if your bottleneck is memory, focus on memory improvements. If your bottleneck is disk I/O, focus on hardware speed and caching. If your bottleneck is data transfer, try to bring your compute to your data.

  1. Software needs to change to be reusable and portable.

Robust, reusable software platforms are needed, with good execution guarantees; that way you have a foundation to build on. This software needs to be flexible (practically speaking, scriptable in a language like Python or Ruby or Perl), well developed and tested, and should fade into the background so that you can focus on more interesting things like your actual analysis. It should also be portable so that you can "scale out" -- bring the compute to your data, rather than vice versa. This is where Hadoop and Pig and other such approaches fit now, and where we seriously need to build software infrastructure in biology.

  1. Analysis thinking needs to change.

Comprehensively analyzing your data sets is tough when your data sets are really big and noisy. Extracting significant signals from them is potentially much easier, and some approaches and algorithms for doing this in biology exist or are being developed (see especially Lior Pachter's eXpress). But this is a real shift in algorithmic thinking, and it's also a real shift in scientific thinking, because you're no longer trying do understand the entire data set -- you're trying to focus on the bits that might be interesting.

  1. Analyses are increasingly integrative.

It's hard to make sense of lots of data on its own: you need to link it in to other data sets. Data standards and software interoperability and "standard" software pipelines are incredibly important for doing this.

  1. The interesting problems are still discipline-specific.

There are many people working on Big Data, and there is big business in generic solutions. There's lots of Open Source stuff going on, too. Don't reinvent those wheels; figure out how to connect them to your biology, and then focus on the bits that are interesting to you and important for your science.

11. New machine learning, data mining, and statistical models need to be developed for data-intensive biological science.

As data volume increases, and integrative hypothesis development proceeds, we need to figure out how to assess the quality and significance of hypotheses. Right now, a lot of people throw their data at several programs, pick their favorite answer, and then recite the result as if it's correct. Since often they will have tried out many programs, this presents an obvious multiple testing problem. And, since users are generally putting in data that violates one or more of the precepts of the program developers, the results may not be applicable.

  1. A lack of fear of computational approaches is a competitive advantage.

The ability to approach computational analyses as just another black box upon which controls must be placed is essential. Even if you can open up the black box and understand what's inside, it needs to be evaluated not on what you think it's doing but on what it's actually doing at the black box level. If there's one thing I try to teach to students, it's to engage with the unknown without fear, so that they can think critically about new approaches.

---

Well, that's it for now. I'd be interested in hearing about what other people think I've missed. And, while I'm at it, a hat tip to Erich Schwarz, Greg Wilson, and Adina Howe for reading and commenting on a draft.

--titus

Syndicated 2012-03-07 14:42:24 from Titus Brown

Top 12 reasons you know you are a Big Data biologist

A few people have recently asked me what this "Big Data" thing is in biology. It's not an easy question to answer, though, because biology's a bit peculiar, and a lot of Big Data researchers are not working in bio. While I was thinking about this I kept on coming up with anecdotes -- and, well, that turned into this: the Top 12 Reasons You Know You Are a Big Data Biologist.

---

  1. You no longer get enthused by the prospect of more data.

I was at a conference a few months back, and an Brit colleague of mine rushed up to me and said, "Hey! We just got an Illumina HiSeq! Do you have anything you want sequenced?" My immediate, visceral response was "Hell no! We can't even deal with a 10th of the sequence we've already got! PLEASE don't send me any more! Can I PAY you to not send me any more?"

  1. Analysis is the bottleneck.

I'm dangerously close to genomics bingo here, but:

I was chatting with a colleague in the hallway here at MSU, pitching some ideas about microbiome work, and he said, "What a good idea! I have some samples here that I'd like to sequence that could help with that." I responded, "How about we stop producing data and think about the analysis?" He seemed only mildly offended -- I have a rep, you see -- but biology, as a field, is wholly unprepared to go from data to an analyses.

My lab has finally turned the corner from "hey, let's generate more data!" to "cool, data that we can analyze and build hypotheses from!" -- yeah, we're probably late bloomers, but after about 3 years of tech development, there's very little we can't do at a basic sequence level. Analysis at that level is no longer the bottleneck. Next step? Trying to do some actual biology! But we are working in a few very specific areas, and I think the whole field needs to undergo this kind of shift.

  1. You've finally learned that 'for' loops aren't all they're cracked up to be.

This was one of my early lessons. Someone had dumped a few 10s of millions of reads on me -- a mere gigabyte or so of data -- and I was looking for patterns in the reads. I sat down to write my usual opening lines of Python to look at the data: "for sequence in dataset:" and ... just stopped. The data was just too big to do iteration in a scripting language with it! Cleverness of some sort needed to be applied.

Corollary: not just Excel, but BLAST, start to look like Really Bad Ideas That Don't Scale. Sean Eddy's HMMER 3 FTW...

  1. Addressing small questions just isn't that interesting.

Let's face it, if you've just spent a few $k generating dozens of gigabytes amounts of hypothesis-neutral data on gene expression from (say) chick, the end goal of generating a list of genes that are differentially regulated is just not that exciting. (Frankly, you probably could have done it with qPCR, if you didn't want that cachet of "mRNAseq!") What more can you do with the data?

(This point comes courtesy of Jim Tiedje, who says (I paraphrase): "The problem for assistant professors these days is that many of may not be thinking big enough." Also see: Don't be afraid of failure: really go for it)

The biggest training challenge (in my opinion) is going to be training people in how to push past the obvious analysis of the data and go for deeper integrative insight. This will require training not just in biology but in data analysis, computational thinking, significant amounts of informed skepticism, etc. (See our course.) I think about it like this: generating hypotheses from large amounts of data isn't that interesting -- I can do that with publicly available data sets without spending any money! Constraining the space of hypotheses with big data sets is far more interesting, because it gives you the space of hypotheses that aren't ruled out; and its putting your data to good use. I'll, uh, let you know if I ever figure out how to do this myself...

I think there are plenty of people that can learn to do this, but as Greg Wilson correctly points out, there has to be a tradeoff: what do you take out of existing training curricula, to be replaced with training in data analysis? I wish I knew.

  1. Transferring data around is getting more than a bit tedious.

For some recent papers, I had to copy some big files from EC2 over to my lab computer, and from there to our HPC system. It was Slow, in the "time for lunch" sense. And these were small test data sets, compressed. Transferring our big data sets around is getting tedious. Luckily we have a lot of them, so I can usually work on analysis of one while another is transferring.

5. Your sysadmin/HPC administrator yells at you on a regular basis about your disk usage.

We regularly get nastygrams from our local sysadmins accusing of using too many terabytes of disk space. This is in contrast to the good ol' days of physics (which is where I got my sysadmin chops), where your sysadmin would yell at you for using too much CPU...

  1. You regularly crash big memory machines.

My favorite quote of the year so far is from the GAGE paper), in which Salzberg et al (2012) say "For larger genomes, the choice of assemblers is often limited to those that will run without crashing." Basically, they took a reasonably big computer, threw some big data at various assembly packages, and watched the computer melt down. Repeatedly.

Someone recently sent me an e-mail saying "hey, we did it! we took 3 Gb of sequence data from soil and assembled it in only 1 week in 3 TB of RAM!" Pretty cool stuff -- but consider that at least four to six runs would need to be done (parameter sweep!), and it takes only about 1 week and $10k to generate twice that data. In the long run, this does not seem cost effective. (It currently takes us 1-2 months to analyze this data in 300 GB of RAM. I'm not saying we have the answer. ;)

  1. Throwing away data looks better and better.

I made a kind of offhanded comment to a NY Times reporter once (hint: don't do this) about how at some point we're going to need to start throwing away data. He put it as the ultimate quote in his article. People laughed at me for it. (BUT I WAS RIGHT! HISTORY WILL SHOW!)

But seriously, if someone came up to you and said "we can get rid of 90% of your data for you and give you an answer that's just as good", many biologists would have an instant negative response. But I think there's a ground truth in there: a lot of Big Data is noise. If you can figure out how to get rid of it... why wouldn't you? This is an interesting shift in thinking from the "every data point is precious and special" that you adopt when it takes you a !#%!#$ week to generate each data point.

I attended a talk that David Haussler gave at Caltech recently. He was talking about how eventually we would need to resequence millions of individual cancer cells to look for linked sets of mutations. At 50-300 Gb of sequence per cell, that's a lot of data. But most of that data is going to be uninteresting -- wouldn't it be great if we could pick out the interesting people and then throw the rest away? It would certainly help with data archiving and analysis...

8. Big computer companies call you because they're curious about why you're buying such big computers.

True story: a Big Computer Company called up our local HPC to ask why we were buying so many bigmem machines. They said "It's the damned biologists -- they keep on wanting more memory. Why? We don't know - we suspect the software's badly written, but can't tell. Why don't you talk to Titus? He pretends to understand this stuff." I don't think it's weird to get calls trying to sell me stuff -- but it is a bit weird to have our local HPC's buying habits be so out of character, due to work that I and others are doing, that Big Computer Companies notice.

(Note: the software's not mine, and it's not badly written, either.)

9. Your choice must increasingly be "improve algorithms" rather than "buy bigger computers"

I've been banging this drum for a while. Sequencing capacity is outpacing Moore's Law, and so we need to rethink algorithms. An algorithm that was nlogn used to be good enough; now, if analysis requires a supra-linear algorithm, we need to figure out how to make it linear. (Sublinear would be better.)

Anecdote: we developed a nifty data structure for attacking metagenome assembly (see: http://arxiv.org/abs/1112.4193). It scaled (scales) assembly by a factor of about 20x, which got us pretty excited -- that meant we could in theory assemble things like MetaHIT and rumen on commodity hardware without doing abundance filtering. Literally the day that we told our collaborators we had it working, they dumped 10x more data on us and told that they could send us more any time we wanted. (...and that, boys and girls, was our introduction to the HiSeq!) 20x wasn't enough. Sigh.

The MG-RAST folk have told me a similar story. They did some awesomely cool engineering and got their pipeline running about 100x faster. That'll hold them for a year or so against the tidalwave of data.

Corollary: don't waste your time with 2% improvements in sensitivity and specificity unless you also deliver 10x in compute performance.

10. You spend a lot of time worrying about biased noise, cross-validation, and the incorrect statistical models used.

We were delayed in some of our research by about a year, because of some systematic biases being placed in our sequencing data by Illumina. Figuring out that these non-biological features were there took about two months; figuring out how to remove them robustly took another 6 months; and then making sure that removing didn't screw up the actual biological signal took another four months.

This is a fairly typical story from people who do a lot of data analysis. We developed a variety of cross-validation techniques and ways of intuiting whether or not something was "real" or "noise", and we spent a certain amount of time discussing what statistical approaches to use to assess significance. In the end we more or less gave up and pointed out that on simulated data what we were doing didn't screw things up.

  1. Silicon Valley wants to hire your students to go work on non-biology problems.

Hey, it's all Big Data, right?

---

So: what is Big Data in biology?

First, I've talked mostly about DNA sequence analysis, because that's what I work on. But I know that proteomics and image analysis people are facing similar dilemmas. So it's not just sequence data.

Second, compute technology is constantly improving. So I think we need moving definitions.

Here are some more serious points that I think bear on what, exactly, problems for Big Data in biology. (They're not all specific to biology, but I can defend them on my home ground more easily, you see.)

  1. You have archival issues on a large scale.

You have lots of homogeneously formatted data that probably contains answers you don't know you're looking for yet, so you need to save it, metadata it, and catalog it. For a long time.

  1. The rate at which data is arriving is itself increasing.

You aren't just getting one data set. You're getting dozens (or hundreds) this year. And you'll get more than that next year.

One implication of this is that you'd better have a largely automated analysis pipeline, or else you will need an increasing number of people just to work with the data, much less do anything interesting. Another implication is that software reuse becomes increasingly important: if you're building custom software for each data set, you will fall behind. A third implication is that you need a long-term plan for scaling your compute capacity.

  1. Data structure and algorithm research is increasingly needed.

You cannot rely on many heavyweight iterations over your data, or simple data structures for lookup: the data is just too big and existing algorithms are tailored to smaller data. For example, BLAST works fine for a few gigabytes of data; past that, it becomes prohibitively slow.

  1. Infrastructure designers are needed.

Issues of straightforward data transfer, network partitioning, and bus bandwidth start to come to the forefront. Bottleneck analysis needs to be done regularly. In the past, you could get away with "good enough", but as throughput continues to increase, bottlenecks will need to be tackled on a regular basis. For this, you need a person who is immersed in your problems on a regular basis; they are hard to find and hard to keep.

One interesting implication here is for cloud computing, where smart people set up a menu of infrastructure options and you can tailor your software to those options. So far I like the idea, but I'm told by aficionados that (for example) Amazon still falls short.

  1. You have specialized hardware needs.

Sort of a corollary of the above: what kind of analyses do you need to do? And what's the hardware bottleneck? That's where you'll get the most benefit from focused hardware investment.

  1. Hardware, infrastructure design, and algorithms all need to work together.

Again, a corollary of the above, but: if your bottleneck is memory, focus on memory improvements. If your bottleneck is disk I/O, focus on hardware speed and caching. If your bottleneck is data transfer, try to bring your compute to your data.

  1. Software needs to change to be reusable and portable.

Robust, reusable software platforms are needed, with good execution guarantees; that way you have a foundation to build on. This software needs to be flexible (practically speaking, scriptable in a language like Python or Ruby or Perl), well developed and tested, and should fade into the background so that you can focus on more interesting things like your actual analysis. It should also be portable so that you can "scale out" -- bring the compute to your data, rather than vice versa. This is where Hadoop and Pig and other such approaches fit now, and where we seriously need to build software infrastructure in biology.

  1. Analysis thinking needs to change.

Comprehensively analyzing your data sets is tough when your data sets are really big and noisy. Extracting significant signals from them is potentially much easier, and some approaches and algorithms for doing this in biology exist or are being developed (see especially Lior Pachter's eXpress). But this is a real shift in algorithmic thinking, and it's also a real shift in scientific thinking, because you're no longer trying do understand the entire data set -- you're trying to focus on the bits that might be interesting.

  1. Analyses are increasingly integrative.

It's hard to make sense of lots of data on its own: you need to link it in to other data sets. Data standards and software interoperability and "standard" software pipelines are incredibly important for doing this.

  1. The interesting problems are still discipline-specific.

There are many people working on Big Data, and there is big business in generic solutions. There's lots of Open Source stuff going on, too. Don't reinvent those wheels; figure out how to connect them to your biology, and then focus on the bits that are interesting to you and important for your science.

11. New machine learning, data mining, and statistical models need to be developed for data-intensive biological science.

As data volume increases, and integrative hypothesis development proceeds, we need to figure out how to assess the quality and significance of hypotheses. Right now, a lot of people throw their data at several programs, pick their favorite answer, and then recite the result as if it's correct. Since often they will have tried out many programs, this presents an obvious multiple testing problem. And, since users are generally putting in data that violates one or more of the precepts of the program developers, the results may not be applicable.

  1. A lack of fear of computational approaches is a competitive advantage.

The ability to approach computational analyses as just another black box upon which controls must be placed is essential. Even if you can open up the black box and understand what's inside, it needs to be evaluated not on what you think it's doing but on what it's actually doing at the black box level. If there's one thing I try to teach to students, it's to engage with the unknown without fear, so that they can think critically about new approaches.

---

Well, that's it for now. I'd be interested in hearing about what other people think I've missed. And, while I'm at it, a hat tip to Erich Schwarz, Greg Wilson, and Adina Howe for reading and commenting on a draft.

--titus

Syndicated 2012-03-06 22:28:01 from Titus Brown

Data Intensive Science, and Workflows

I'm writing this on my way back from Stockholm, where I attended a workshop on the 4th Paradigm. This is the idea (so named by Jim Gray, I gather?) that data-intensive science is a distinct paradigm from the first three paradigms of scientific investigation -- theory, experiment, and simulation. I was invited to attend as a token biologist -- someone in biology who works with large scale data, and thinks about large scale data, but isn't necessarily only devoted to dealing with large scale data.

The workshop was pretty interesting. It was run by Microsoft, who invited me & paid my way out there. The idea was to identify areas of opportunity in which Microsoft could make investments to help out scientists and develop new perspectives on the future of eScience. To do that, we played a game that I'll call the "anchor game", where we divvied up into groups to discuss the blocks to our work that stemmed from algorithms and tools, data, process and workflows, social/organizational aspects. In each group we put together sticky notes with our "complaints" and then ranked them by how big of an anchor they were on us -- "deep" sticky notes held us back more than shallow sticky notes. We then reorganized by discipline, and put together an end-to-end workflow that we hoped in 5 years would be possible, and then finally we looked for short- and medium-term projects that would help get us there.

The big surprise for me in all of this was that it turns out I'm most interested in workflows and process! All of my sticky notes had the same theme: it wasn't tools, or data management, or social aspects that were causing me problems, but rather the development of appropriate workflows for scientific investigation. Very weird, and not what I would have predicted from the outset.

Two questions came up for me during the workshop:

Why don't people use workflows in bioinformatics?

The first question that comes to mind is, why doesn't anyone I know use a formal workflow engine in bioinformatics? I know they exist (Taverna, for one, has a bunch of bioinformatics workflows); I'm reasonably sure they would be useful; but there seems to be some block against using them!

What would the ideal workflow situation be?

During the anchor game, our group (which consisted of two biologists, myself and Hugh Shanahan; a physicist, Geoffrey Fox; and a few computer scientists, including Alex Wade) came up with an idea for a specific tool. The tool would be a bridge between Datascope for biologists and a workflow engine. The essential idea is to combine data exploration with audit trail recording, which could then be hardened into a workflow template and re-used.

---

Thinking about the process I usually use when working on a new problem, it tends to consist of all these activites mixed together:

  1. evaluation of data quality, and application of "computational" controls
  2. exploration of various data manipulation steps, looking for statistical signal
  3. solidifying of various subcomponents of the data manipulation steps into scripted actions
  4. deployment of the entire thing on multiple data sets

Now, I'm never quite done -- new data types and questions always come up, and there are always components to tweak. But portions of the workflow become pretty solid by the time I'm halfway done with the initial project, and the evaluation of data quality accretes more steps but rarely loses one. So it could be quite useful to be able to take a step back and say, "yeah, these steps? wrap 'em up and put 'em into a workflow, I'm done." Except that I also want to be able to edit and change them in the future. And I'd like to be able to post the results along with the precise steps taken to generate them. And (as long as I'm greedy) I would like to work at the command line, while I know that others would like to be able to work in a notebook or graphical format. And I'd like to be able to "scale out" the computation by bringing the compute to the data.

For all of this I need three things: I need workflow agility, I need workflow versioning, and I need workflow tracking. And this all needs to sit on top of a workflow component model that lets me run the components of the workflow wherever the data is.

I'm guessing no tool out there does this, although I know other people are thinking this way, so maybe I'm wrong. The Microsoft folk didn't know of any, though, and they seemed pretty well informed in this area :).

The devil's choice I personally make in all of this is to go for workflow agility, and ignore versioning and tracking and the component model, by scripting the hell out of things. But this is getting old, and as I get older and have to teach my wicked ways to grad students and postdocs, the lack of versioning and tracking and easy scaling out gets more and more obvious. And now that I'm trying to actually teach computational science to biologists, it's simply appallingly difficult to convey this stuff in a sensible way. So I'm looking for something better.

One particularly intriguing type of software I've been looking at recently is the "interactive Web notebook" -- ipython notebook and sagenb. These are essentially Mathematica or matlab-style notebook packages that work with IPython or Sage, scientific computing and mathematical computing systems in Python. They let you run Python code interactively, and colocate it with its output (including graphics); the notebooks can then be saved and loaded and re-run. I'm thinking about working one or the other into my class, since it would let me move away from command-line dependence a bit. (Command lines are great, but throwing them at biologists, along with Python programming, data analysis, and program execution all together, seems a bit cruel. And not that productive.)

It would also be great to have cloud-enabled workflow components. As I embark on more and more sequence analysis, there are only about a dozen "things" we actually do, but mixed and matched. These things could be hardened, parameterized into components, and placed behind an RPC interface that would give us a standard way to execute them. Combined with a suitable data abstraction layer, I could run the components from location A on data in location B in a semi-transparent way, and also record and track their use in a variety of ways. Given a suitably rich set of components and use cases, I could imagine that these components and their interactions could be executed from multiple workflow engines, and with usefully interesting GUIs. I know Domain Specific Languages are already passe, but a DSL might be a good way to find the right division between subcomponents.

I'd be interested in hearing about such things that may already exist. I'm aware of Galaxy, but I think the components in Galaxy are not quite written at the right level of abstraction for me; Galaxy is also more focused on the GUI than I want. I don't know anything about Taverna, so I'm going to look into that a bit more. And, inevitably, we'll be writing some of our own software in this area, too.

Overall, I'm really interested in workflow approaches that let me transition seemlessly between discovery science and "firing for effect" for hypothesis-driven science.

A few more specific thoughts:

In the area of metagenomics (one of my research focuses at the moment), it would be great to see img/m, camera, and MG-RAST move towards a "broken out" workflow that lets semi-sophisticated computational users (hi mom!) run their stuff on the Amazon Cloud and on private HPCs or clouds. While I appreciate hosted services, there are many drawbacks to them, and I'd love to get my hands on the guts of those services. (I'm sure the MG-RAST folk would like me to note that they are moving towards making their pipeline more usable outside of Argonne: so noted.)

In the comments to my post on Four reasons I won't use your data analysis pipeline, Andrew Davison reminds me of VisTrails, which some people at the MS workshop recommended.

I met David De Roure from myExperiment at the MS workshop. To quote, "myExperiment makes it easy to find, use, and share scientific workflows and other Research Objects, and to build communities."

David put me in touch with Carole Goble who is involved with Taverna. Something to look at.

In the cloud computing workshop I organized at the Planet and Animal Genome conference this January, I will get a chance to buttonhole one of the Galaxy Cloud developers. I hope to make the most of this opportunity ;).

It'd be interesting to do some social science research on what difficulties users encounter when they attempt to use workflow engines. A lot of money goes into developing them, apparently, but at least in bioinformatics they are not widely used. Why? This is sort of in line with Greg Wilson's Software Carpentry and the wonderfully named blog It will never work in theory: rather than guessing randomly at what technical directions need to be pursued, why not study it empirically? It is increasingly obvious to me that improving computational science productivity has more to do with lowering learning barriers and changing other societal or cultural issues than with a simple lack of technology, and figuring out how (and if) appropriate technology could be integrated with the right incentives and teaching strategy is pretty important.

--titus

p.s. Special thanks to Kenji Takeda and Tony Hey for inviting me to the workshop, and especially for paying my way. 'twas really interesting!

Syndicated 2011-12-11 15:00:57 from Titus Brown

Trying out 'cram'

I desperately need something to run and test things at the command line, both for course documentation (think "doctest" but with shell prompts) and for script testing (as part of scientific pipelines). At the 2011 testing-in-python BoF, Augie showed us cram, which is the mercurial project's internal test code ripped out for the hoi polloi to use.

Step zero: wonder-twin-powers activate a new virtualenv!

% virtualenv e
% . e/bin/activate

Step one: install!

% pip install cram

... that just works -- always a good sign!

OK, let's test the bejeezus out of 'ls'.

% mkdir cramtest
% cd cramtest

Next, I put

$ ls

into a file. Be careful -- you apparently need exactly two spaces before the $ or it doesn't recognize it like a test.

Now, I run:

% cram ls.t

and I get

.
# Ran 1 tests, 0 skipped, 0 failed.

Awesome! A dot!

The only problem with this is that when I run 'ls' myself, I see:

ls.t    ls.t~

Hmm.

As a test of the cram test software, let's modify the file 'ls.t' to contain a clearly broken test, rather than an empty one:

$ ls
there is nothing here to see

and I get

!
--- /Users/t/dev/cramtest/ls.t
+++ /Users/t/dev/cramtest/ls.t.err
@@ -1,2 +1,1 @@
   $ ls
-  there is nothing here to see

# Ran 1 tests, 0 skipped, 1 failed.

OK, so I can make it break -- excellent! Cram comes advertised with the ability to fix its own tests by replacing broken output with actual output; let's see what happens, shall we?

% cram -i ls.t

!
--- /Users/t/dev/cramtest/ls.t
+++ /Users/t/dev/cramtest/ls.t.err
@@ -1,2 +1,1 @@
   $ ls
-  there is nothing here to see
Accept this change? [yN] y
patching file /Users/t/dev/cramtest/ls.t
Reversed (or previously applied) patch detected!  Assume -R? [n] y
Hunk #1 succeeded at 1 with fuzz 1.

# Ran 1 tests, 0 skipped, 1 failed.
% more ls.t
$ ls
there is nothing here to see
there is nothing here to see

OK, so, first, wtf is the whole reversed patch detected nonsense? Sigh. And second, where's the output from 'ls' going!?

Hmm, maybe cram is setting up a temp directory? That would explain a lot, and would also be a very sensible approach. It's not mentioned explicitly on the front page, but if you read into it a bit, it looks likely. OK.

Let's modify 'ls.t' to create a file:

$ touch testme
$ ls

and run it...

% cram ls.t
!
--- /Users/t/dev/cramtest/ls.t
+++ /Users/t/dev/cramtest/ls.t.err
@@ -1,3 +1,4 @@
   $ touch testme
   $ ls
+  testme


# Ran 1 tests, 0 skipped, 1 failed.

Ah-hah! Now we're getting somewhere! Fix the test by making 'ls.t' read like so:

$ touch testme
$ ls
testme

and run:

% cram ls.t
.
# Ran 1 tests, 0 skipped, 0 failed.

Awesome! Dot-victory ho!

Now let's do something a bit more interesting: check out and run my PyCon 2011 talk code for ngram graphs. Starting with this in 'khmer-ngram.t',

$ git clone git://github.com/ctb/khmer-ngram.git
$ cd khmer-ngram
$ ls
$ python run-doctests.py basic.txt

I run 'cram khmer-ngram.t' and get

!
--- /Users/t/dev/cramtest/khmer-ngram.t
+++ /Users/t/dev/cramtest/khmer-ngram.t.err
@@ -1,4 +1,15 @@
   $ git clone git://github.com/ctb/khmer-ngram.git
+  Initialized empty Git repository in /private/(yada, yada)
   $ cd khmer-ngram
   $ ls
+  basic.html
+  basic.txt
+  data
+  graphsize-book.py
+  hash.py
+  load-book.py
+  run-doctests.py
+  shred-book.py
   $ python run-doctests.py basic.txt
+  ... running doctests on basic.txt
+  *** SUCCESS ***

# Ran 1 tests, 0 skipped, 1 failed.

After getting cram to fix the file (using -i), and re-running cram, it now chokes at exactly one place; betcha you can guess where...:

!
--- /Users/t/dev/cramtest/khmer-ngram.t
+++ /Users/t/dev/cramtest/khmer-ngram.t.err
@@ -1,5 +1,5 @@
   $ git clone git://github.com/ctb/khmer-ngram.git
-  Initialized empty Git repository in /private/(yada, yada)
+  Initialized empty Git repository in /private/(different yada)
   $ cd khmer-ngram
   $ ls
   basic.html

# Ran 1 tests, 0 skipped, 1 failed.

Right. How do you deal with output that does change unpredictably? Easy! Throw in a wildcard regexp like so

Initialized empty Git repository in .* (re)

My whole khmer-ngram.t file now looks like this:

$ git clone git://github.com/ctb/khmer-ngram.git
Initialized empty Git repository in .* (re)
$ cd khmer-ngram
$ ls
basic.html
basic.txt
data
graphsize-book.py
hash.py
load-book.py
run-doctests.py
shred-book.py
$ python run-doctests.py basic.txt
... running doctests on basic.txt
*** SUCCESS ***

And I can run cram on it without a problem:

.
# Ran 1 tests, 0 skipped, 0 failed.

Great!

I love the regexp fix, too; none of this BS that doctest forces upon you.

So, the next question: how do multiple tests work? If you look above, you can see that it's running all the commands as one test. Logically you should be able to just separate out the block of text and make it into multiple tests... let's try adding

I'll add in another test:

  $ ls

to the khmer-ngram.t file; does that work? It looks promising:

!
--- /Users/t/dev/cramtest/khmer-ngram.t
+++ /Users/t/dev/cramtest/khmer-ngram.t.err
@@ -17,3 +17,12 @@
 I'll add in another test:

   $ ls
+  basic.html
+  basic.txt
+  data
+  graphsize-book.py
+  hash.py
+  hash.pyc
+  load-book.py
+  run-doctests.py
+  shred-book.py

# Ran 1 tests, 0 skipped, 1 failed.

and it sees two tests... but, after fixing the expected output using 'cram -i', I only get one test:

.
# Ran 1 tests, 0 skipped, 0 failed.

So it seems like a little internal inconsistency in cram here. Two tests when something's failing, one test when both are running. No big deal in the end.

And... I have to admit, that's about all I need for testing/checking course materials! The cram test format is perfectly compatible with ReStructuredText, so I can go in and write real documents in it, and then test them. Command line testing FTW?

And (I just checked) I can even put in Python commands and run doctest on the same file that cram runs on. Awesome.

Critique:

The requirement for two spaces exactly before the $ was not obvious to me, nor was the implicit (and silent, even in verbose mode) use of a temp directory. I wiped out my test file a few times by answering "yes" to patching, too. What was up with the 'reversed patch' foo?? And of course it'd be nice if the number of dots reflected something more granular than the number of files run. But heck, it mostly just works! I didn't even look at the source code at all!

Verdict: a tentative 8/10 on the "Can titus use your testing tool?" scale.

I'll try using it in anger on a real project next time I need it, and report back from there.

--titus

p.s. To try out my full cram test from above, grab the file from the khmer-ngram repo at github; see:

https://github.com/ctb/khmer-ngram/blob/master/cram-test.t .

Syndicated 2011-03-14 01:46:48 from Titus Brown

My new data analysis pipeline code

First, I write a recipe file, 'metagenome.recipe', laying out my job description for, say, sequence trimming and assembly with Velvet:

fasta_file soil-data.fa

qc_filter min_length=50 remove_Ns=true

graph_filter min_length=400

velvet_assemble k=33 min_length=1000 scaffolding=True

Then I specify machine parameters, e.g. 'bigmem.conf':

[defaults]
n_threads=8

[graph_filtering]
use_mem=32gb

[velvet]
needs_mem=64gb

And finally, I run the pipeline:

% ak-run metagenome.recipe -c bigmem.conf

If I have cloud access (and who doesn't?) I can tell the pipeline to spin up and down nodes as needed:

% ak-aws-run metagenome.recipe -c bigmem.conf

(Bear in mind most of these tasks are multi-hour, if not multi-day, operations, so I'm not too worried about optimizing machine use and re-use.)

Hadoop jobs could be spawned underneath, depending on how each recipe component was actually implemented.

As for testing reproducibility of pipeline results, which is the short-term motivation here, I can store results for regression testing with later versions:

% ak-run metagenome.recipe -c bigmem.conf --save-endpoint=/some/path

and then compare:

% ak-run --check-endpoint=/some/path

---

Now, does anyone know of a package or packages that actually do this, so I/we don't have to write it??

See texttest and ruffus for some of my inspiration/interest.

--titus

Syndicated 2011-03-11 14:56:29 from Titus Brown

460 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!