I've been doing some more focused bioinformatics programming recently,
and as I'm thinking about how to teach biologists about data analysis, I realize
more and more how much backstory goes into even relatively simple
programming.
The problem: given a reference genome, and a very large set of short,
error-prone, random sequences from an individual (with a slightly
different genome) aligned to that genome, find all differences and
provide enough summary information that probably-real differences can
be distinguished from differences due only to error. More explicitly,
if I have a sequence to which a bunch of shorter sequences have been
aligned, like so:
GAGGGCTACAAAAGGAGGCATAGGACCGATAGGACAT (reference genome)
GGCTAC -HWI-EAS216_0019:6:44:8736:13789#0/1[0:6]
GGCTAAA HWI-EAS216_0019:6:11:12331:10539#0/1[28:35]
GGCTAAAAA HWI-EAS216_0019:6:54:3745:18270#0/1[26:35]
GGCTAAAAAA HWI-EAS216_0019:6:75:4175:16939#0/1[25:35]
GGATACAACAG -HWI-EAS216_0019:6:96:16613:3276#0/1[0:11]
GGCTAAAAAAA HWI-EAS216_0019:6:11:8510:9363#0/1[22:33]
GGCTAAAAAAA HWI-EAS216_0019:6:72:14343:19416#0/1[21:32]
GGCTACAACAG -HWI-EAS216_0019:6:116:10526:5940#0/1[4:15]
GGCTAAAACAA HWI-EAS216_0019:6:103:9198:1208#0/1[18:29]
GGCTACAACAG -HWI-EAS216_0019:6:50:14928:13905#0/1[7:18]
GGCTACAACAG -HWI-EAS216_0019:6:109:18796:4787#0/1[14:25]
GGCTACAACAG -HWI-EAS216_0019:6:102:14050:8429#0/1[18:29]
GGCTACAACAG -HWI-EAS216_0019:6:22:14814:10742#0/1[23:34]
GCTCCACCAG -HWI-EAS216_0019:6:106:8241:6484#0/1[25:35]
TACAACAG HWI-EAS216_0019:6:4:8095:15474#0/1[0:8]
CAACAG HWI-EAS216_0019:6:115:15766:4844#0/1[0:6]
* ** * *
1 23 4 5
how do you gather all of the locations with differences, and then
figure out which differences are significant? (Variations in 1 and 2
are probably errors, 4 and 5 may be real, and 3 is pretty clearly
real, I would say.)
The numbers here aren't trivial: you're looking at ~1 million-3
billion bases for the reference genome, and 20-120 million or more
shorter resequencing reads.
The first thing to point out is that it's not enough to simply look at
all of the aligned sequences individually; you need to know how many
total sequences are aligned to each spot, and what each nucleotide is.
This is because the new sequences are error-prone: observing a single
variation isn't enough, you need to know that it's systematic
variation. So it's a total-information situation, and hence not just
pleasantly parallel: you need to integrate the information afterwards.
I came up with two basic approaches: you can first figure out where
there's any variation, and then iterate again over the short
sequences, figure out which ones overlap detected variation, and
record those counts; or you can figure out where there's variation,
index the short sequences by position, and then query them by
position. Both are easy to code.
For the first, approach A:
for seq, location in aligned_short_sequences:
record_variation(genome[location])
for seq, location in aligned_short_sequences:
if has_variation(location):
record_nucleotide(genome[location], seq)
For the second, approach B:
for seq, location in aligned_short_sequences:
record_variation(genome[location])
for location in recorded_variation:
aligned_subset = get_aligned_short_sequences(location)
for seq, _ in aligned_subset:
record_nucleotide(genome[location], seq)
It's readily apparent that barring stupidity, approach A is generally
going to be faster: there's one less loop, for one thing. Also, if
you think about implementation details, approach B is potentially
going to involve retrieving aligned short sequences multiple times,
while approach A looks at each aligned short sequence exactly once.
However, there are tradeoffs, especially if you're building tools to
support either approach. The most obvious one is, what if you need to
look at variation on less than a whole-genome level? For example, for
visualization, or gene overlap calculations, you're probably not going
to care about an immense region of the genome; you may want to zoom in
on a few hundred bases. In that case, you'd want to use some of the
tools from approach B: given a position, what aligns there?
if variation_at(genome[location]):
aligned_subset = get_aligned_short_sequences(location)
for seq, _ in aligned_subset:
record_nucleotide(genome[location], seq)
Another tradeoff is simply whether or not to store everything in
memory (fast) or on disk (slower, but scales to bigger data sets) -- I
first wrote this code using an in-memory set, to look at low-variation
chicken genome resequencing. Then, when I tried to apply it to a high
variation Drosophila data set and a deeply sequenced Campylobacter
data set, it consumed a lot of RAM. So you have to think about
whether or not you want a single tool that works across all these
situations, and if so, how you want it to behave for large genomes,
with lots of real variation, and extremely deep sequencing -- plan for
the worst, in other words.
I also tend to worry about correctness. Is the code giving you the
right results, and how easy is it to understand? Each approach has
different drawbacks, and relies on different bodies of code that
make certain operations easy and other operations hard.
The upshot is that I wrote several programs implementing different
approaches, and tried them out on different data sets, and chose the
fastest one for one purpose (overall statistics) and a slower one
for visualization. And most programmers I know would have done the
same: tried a few different approaches, benchmarked them, and chosen
the right tool for the current job.
Where does this leave people who don't intuitively understand the
difference between the algorithms, and don't really understand what
the tools are doing underneath, and why one tool might return
different results (based on the parameters given) than another? How
do you explain that sort of thing in a short amount of time?
I don't know, but it's going to be interesting to find out...
BTW, the proper way to do this is with pygr
will someday soon be:
for seq, location in aligned_short_sequences:
record_variation(genome[location])
mapping = index_short_sequences()
results = intersect_mappings(variations, mapping)
where 'intersect_mappings' uses internal knowlege of the data
structures to do everything quickly and correctly. That day is still
a few months off...
--titus