Digital normalization of short-read shotgun data
We just posted a pre-submission paper to arXiv.org:
A single pass approach to reducing sampling variation, removing errors, and scaling de novo assembly of shotgun sequences
Authors: C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, and Timothy H. Brom
Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified single-cell genomes, and metagenomes enable the sensitive investigation of a wide range of biological phenomena. However, it is increasingly difficult to deal with the volume of data emerging from deep short-read sequencers, in part because of random and systematic sampling variation as well as a high sequencing error rate. These challenges have led to the development of entire new classes of short-read mapping tools, as well as new de novo assemblers. New assembly strategies for dealing with transcriptomes, single-cell genomes, and metagenomes have also emerged. Despite these advances, algorithms and compute capacity continue to be challenged by the continued improvements in sequencing technology throughput. We here describe an approach we term digital normalization, a single-pass computational algorithm that discards redundant data and reduces both sampling variation and the number of errors present in deep sequencing data sets. Digital normalization substantially reduces the size of data sets and accordingly decreases the memory and time requirements for de novo sequence assembly, all without significantly impacting the content of the generated contigs. In doing so, it converts high random coverage to low systematic coverage. Digital normalization is an effective and efficient approach to normalizing coverage, removing errors, and reducing data set size for shotgun sequencing data sets. It is particularly useful for reducing the compute requirements for de novo sequence assembly. We demonstrate this for the assembly of microbial genomes, amplified single-cell genomic data, and transcriptomic data. The software is freely available for use and modification.
I'll blog more about this stuff over the next few days, but, briefly, this paper presents a single-pass, fixed-memory approach to downsampling sequencing data (yeah, the stuff I've been talking about for a while now). This approach is called "digital normalization", or "diginorm" for short. It eliminates lots of data, evens out coverage, and removes errors from shotgun sequencing data sets. The net effect is an often massive amount of data reduction combined with significant scaling of de novo assembly.
Or, to put it another way, it's a streaming lossy compression algorithm that primarily "loses" errors in sequencing data.
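For intuition, here's a minimal sketch of the core idea: stream through the reads once, estimate each read's coverage as the median abundance of its k-mers seen so far, and keep the read only if that estimate is still below a fixed cutoff. (khmer does the counting in a fixed-memory probabilistic data structure; the plain dictionary below, and the particular k and cutoff values, are just illustrative choices of mine, not what the paper prescribes.)

```python
# Minimal digital-normalization sketch (illustrative only; khmer's real
# implementation counts k-mers in a fixed-memory probabilistic structure).
from statistics import median

K = 20        # k-mer size (illustrative value)
CUTOFF = 20   # keep a read only while its estimated coverage is below this

kmer_counts = {}  # k-mer -> observed count (unbounded memory; see note above)

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def keep_read(seq):
    """Return True if the read's estimated coverage is still below CUTOFF."""
    km = kmers(seq)
    if not km:
        return False
    # The median k-mer abundance is a robust per-read coverage estimate
    # that is largely insensitive to a few erroneous k-mers in the read.
    if median(kmer_counts.get(x, 0) for x in km) < CUTOFF:
        for x in km:                      # only kept reads update the counts
            kmer_counts[x] = kmer_counts.get(x, 0) + 1
        return True
    return False
```

Because high-coverage regions stop accepting new reads once the cutoff is reached, the redundant reads (and most of the error-containing k-mers they would have contributed) simply never make it into the retained data.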
We've implemented it as a preprocessing filter that should work on any data set you want to assemble, with potentially any assembler. It's written in C++, with a Python wrapper, as part of khmer. And, of course, it's freely available for use, re-use, modification, and redistribution under the BSD license, 'cause why the heck not?
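To make the "preprocessing filter" framing concrete, here's a toy driver that streams reads through the `keep_read()` sketch above and writes out only the survivors, which you'd then hand to your assembler. The file names and the bare-bones FASTA parsing are placeholders for the example; in practice you'd use the scripts that ship with khmer.

```python
# Toy driver: stream reads through keep_read() and emit the surviving ones.
# File names are placeholders; real pipelines would use khmer's own scripts.
def read_fasta(path):
    name, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(seq)
                name, seq = line[1:], []
            else:
                seq.append(line)
        if name is not None:
            yield name, "".join(seq)

with open("reads.keep.fa", "w") as out:
    for name, seq in read_fasta("reads.fa"):
        if keep_read(seq):
            out.write(f">{name}\n{seq}\n")
```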
If you want to try it out, we've linked to some tutorials for running microbial genome assemblies with Velvet, as well as eukaryotic transcriptome assemblies with Oases and Trinity, on the paper Web site.
We'll submit this paper to PNAS on Friday; I'm still waiting for some final proofreading.
Together with our other paper, which is currently under revision for resubmission to PNAS, these two papers form the theoretical basis for our attack on soil metagenome assembly. (Our plan is sheer elegance in its simplicity!)