Older blog entries for titus (starting at number 458)

A memory efficient way to remove low-abundance k-mers from large (metagenomic?) DNA data sets

I've spent the last few weeks working on a simple solution to a challenging problem in DNA sequence assembly, and I think we've got a nice simple theoretical solution with an actual implementation. I'd be interested in comments!


Briefly, the algorithmic challenge is this:

We have a bunch of DNA sequences in (let's say) FASTA format,


and we want to pre-filter these sequences so that only sequences containing high-abundance DNA words of length k ("k-mers"), remain. For example, given a set of DNA sequences, we might want to remove any sequence that does not contain a k-mer present at least 5 times in the entire data set.

This is very straightforward to do for small numbers of sequences, or for small k. Unfortunately, we are increasingly confronted by data sets containing 250 million sequences (or more), and we would like to be able to do this for large k (k > 20, at least).

You can break the problem down into two basic steps: first, counting k-mers; and second, filtering sequences based on those k-mer counts. It's not immediately obvious how you would parallelize this task: the counting should be very quick (basically, it's I/O bound) while the filtering is going to rely on wide-reaching lookups that aren't localized to any subset of k-mer space.

tl; dr? we've developed a way to do this for arbitrary k, in linear time and constant memory, efficiently utilizing as many computers as you have available. It's open source and works today, but, uhh, could use some documentation...

Digression: the bioinformatics motivation

(You can skip this if you're not interested in the biological motivation ;)

What we really want to do with this kind of data is assemble it, using a De Bruijn graph approach a la Velvet, ABySS, or SOAPdenovo. However, De Bruijn graphs all rely on building a graph of overlapping k-mers in memory, which means that their memory usage scales as a function of the number of unique k-mers. This is good in general, but Bad in certain circumstances -- in particular, whenever the data set you're trying to assemble has a lot of genomic novelty. (See this fantastic review and my assembly lecture for some background here.)

One particular circumstance in which De Bruijn graph-based assemblers flail is in metagenomics, the isolation and sequencing of genetic material from "the wild", e.g. soil or the human gut. This is largely because the diversity of bacteria present in soil is so huge (although the relatively high error rate of next-gen platforms doesn't help).

Prefiltering can help reduce this memory usage by removing erroneous sequences as well as not-so-useful sequences. First, any sequence arising as the result of a sequencing error is going to be extremely rare, and abundance filtering will remove those. Second, genuinely "rare" (shallowly-sequenced) sequences will generally not contribute much to the assembly, and so removing them is a good heuristic for reducing memory usage.

We are currently playing with dozens (probably hundreds, soon) of gigabytes of metagenomic data, and we really need to do this prefiltering in order to have a chance at getting a useful assembly out of it.

It's worth noting that this is in no way an original thought: in particular, the Beijing Genome Institute (BGI) did this kind of prefiltering in their landmark Human Microbiome paper (ref). We want to do it, too, and the BGI wasn't obliging enough to give us source code (booooooo hisssssssssssssss).

Existing approaches

Existing approaches are inadequate to our needs, as far as we can tell from a literature survey and private conversations. Everyone seems to rely on big-RAM machines, which are nice if you have them, but shouldn't be necessary.

There are two basic approaches.

First, you can build a large table in memory, and then map k-mers into it. This involves writing a simple hash function that converts DNA words into numbers. However, this approach scales poorly with k: for example, there are 4**k unique DNA sequences of length k (or roughly (4**k) / 2 + (4**(k/2))/2, considering reverse complements). So the table for k=17 needs 4**17 entries -- that's 17 gb at 1 byte per counter, which stretches the memory of cheaply available commodity hardware. And we want to use bigger k than 17 -- most assemblers will be more effective for longer k, because it's more specific. (We've been using k values between 30 and 70 for our assemblies, and we'd like to filter on the same k.)

Second, you can observe that k-mer space (for sufficiently large k) is likely to be quite sparsely occupied -- after all, there's only so many actual 30-mers present in a 100gb data set! So, you can do something clever like use a tree representation of k-mers and then attach counters to the end nodes of the tree (see, for example, tallymer. The problem here that you need to use pointers to connect nodes in the tree, which means that while the tree size is going to scale with the amount of novel k-mers -- ok! -- it's going to have a big constant in front of it -- bad!. In our experience this has been prohibitive for real metagenomic data filtering.

These seem to be the two dominant approaches, although if you don't need to actually count the k-mers but only assess presence or absence, you can use something like a Bloom filter -- for example, see Classification of DNA sequences using a Bloom filter, which uses Bloom filters to look for novel sequences (the exact opposite of what we want to do here!). References to other approaches welcome...

Note that you really, really, really want to avoid disk access by keeping everything in memory. These are ginormous data sets and we would like to be able to quickly explore different parameters and methods of sequence selection. So we want to come up with a good counting scheme for k-mers that involves relatively small amounts of memory and as little disk access as possible.

I think this is a really fun kind of problem, actually. The more clever you are, the more likely you are to come up with a completely inappropriate data structure, given the amount of data and the basic algorithmic requirements. It demands KISS! Read on...

An approximate approach to counting

So, we've come up with a solution that scales with the amount of genomic novelty, and efficiently uses your available memory. It can also make effective use of multiple computers (although not multiple processors). What is this magic approach?

Hash tables. Yep. Map k-mers into a fixed-size table (presumably one about as big as your available memory), and have the table entries be counters for k-mer abundance.

This is an obvious solution, except for one problem: collisions. The big problem with hash tables is that you're going to have collisions, wherein multiple k-mers are mapped into a single counting bin. Oh noes! The traditional way to deal with this is to keep track of each k-mer individually -- e.g. when there's a collision, use some sort of data structure (like a linked list) to track the actual k-mers that mapped to a particular bin. But now you're stuck with using gobs of memory to keep these structures around, because each collision requires a new pointer of some sort. It may be possible to get around this efficiently, but I'm not smart enough to figure out how.

Instead of becoming smarter, I reconfigured my brain to look at the problem differently. (Think Different?)

The big realization here is that collisions may not matter all that much. Consider the situation where you're filtering on a maximum abundance of 5 -- that is, you want at least one k-mer in each sequence to be present five or more times across the data set. Well, if you look at the hash bin for a specific k-mer and see the number 4, you immediately know that whether or not there are any collisions, that particular k-mer isn't present five or more times, and can be discarded! That is, the count for a particular hash bin is the sum of the (non-negative) individual counts for k-mers mapping to that bin, and hence that sum is an upper bound on any individual k-mer's abundance.


The tradeoff is false positives: as k-mer space fills up, the hash table is going to be hit by more and more collisions. In turn, more of the k-mers are going to be falsely reported as being over the minimum abundance, and more of the sequences will be kept. You can deal with this partly by doing iterative filtering with different prime hash table sizes, but that will be of limited utility if you have a really saturated hash table.

Note that the counting and filtering is still O(N), while the false positives grow as a function of k-mer space occupancy -- which is to say that the more diversity you have in your data, the more trouble you're in. That's going to be a problem no matter the approach, however.

You can see a simple example of approximate and iterative filtering here, for k=15 (a k-mer space of approximately 1 billion) and hash table sizes ranging from 50m to 100m. The curves all approach the correct final number of reads (which we can calculate exactly, for this data set) but some take longer than others. (The code for this is here.)


Making use of multiple computers

While I was working out the details of the (really simple) approximate counting approach, I was bugged by my inability to make effective use of all the juicy computers to which I have access. I don't have many big computers, but I do have lots of medium sized computers with memory in the ~10-20 gb range. How can I use them?

This is not a pleasantly parallel problem, for at least two reasons. First, it's I/O bound -- reading the DNA sequences in takes more time than breaking them down into k-mers and counting them. And since it's really memory bound -- you want to use the biggest hash table possible to minimize collisions -- it doesn't seem like using multiple processors on a single machine will help. Second, the hash table is going to be too big to effectively share between computers: 10-20 gb of data per computer is too much to send over the network. So what do we do?

I was busy explaining to a colleague that this was impossible -- always a useful way for me to solve problems ;) -- when it hit me that you could use multiple chassis (RAM + CPU + disk) to decrease the false positive rate with only a small amount of communication overhead. Basically, my approach is to partition k-mer space into Z subsets (one for each chassis) and have each computer count only the k-mers that fall into their partition. Then, after the counting stage, each chassis records a max k-mer abundance per partition per sequence, and communicates that to a central node. This central node in turn calculates the max k-mer abundance across all partitions.

The partitioning trick is a more general form of the specific 'prefix' approach -- that is, separately count and get max abundances/sequence for all k-mers starting with 'A', then 'C', then 'G', and then 'T'. For each sequence you will then have four values (the max abundance/sequence for k-mers start with 'A', 'C', 'G', and 'T'), which can be cheaply stored on disk or in memory. Now you can do a single-pass integration and figure out what sequences to keep.

This approach effectively multiplies your working memory by a factor of Z, decreasing your false positive rate equivalently - no mean feat!


The communication load is significant but not prohibitive: replicate a read-only sequence data set across Z computers, and then communicate max values (1 byte for each sequence) back -- 50-500 mb of data per filtering round. The result of each filtering round can be returned to each node as a readmask against the already-replicated initial sequence set, with one bit per sequence for ignore or process. You can even do it on a single computer, with a local hard drive, in multiple iterations.

You can see a simple in-memory implementation of this approach here, and some tests here. I've implemented this using readmask tables and min/max tables (serializable data structures) more generally, too; see "the actual code", below.

Similar approaches

By allowing for false positives, I've effectively turned the hash table into a probabilistic data structure. The closest analog I've seen is the Bloom filter which records presence/absence information using multiple hash functions for arbitrary k. The hash approach outlined above devolves into a maximally efficient Bloom filter for fixed k when only presence/absence information is recorded.

The actual code

Theory and practice are the same, in theory. In practice, of course, they differ. A whole host of minor interface and implementation design decisions needed to be made. The result can be seen at the 'khmer' project, here: http://github.com/ctb/khmer. It's slim on documentation, but there are some example scripts and a reasonable amount of tests. It requires nothing but Python 2.6 and a compiler; nose if you want to run the tests.

The implementation is in C++ with a Python wrapper, much like the paircomp and motility software packages.

It will filter 1m 70-mer sequences in about 45 seconds, and 80 million sequences in less than an hour, on a 3 GHz Xeon with 16 gbs of RAM.

Right now it's limited to k <= 32, because we encode each DNA k-mer in a 64-bit 'long long'.

You can see an exact filtering script here: http://github.com/ctb/khmer/blob/master/scripts/filter-exact.py . By varying the hash table size (second arg to new_hashtable) you can turn it into an inexact filtering script quite easily.

Drop me a note if you want help using the code, or a specific example. We're planning to write documentation, doctests, examples, robust command line scripts, etc. prior to publication, but for now we're primarily trying to use it.

Other uses

It has not escaped our notice that you can use this approach for a bunch of other k-mer based problems, such as repeat discovery and calculating abundance distributions... but right now we're focused on actually using it for filtering metagenomic data sets prior to assembly.


I talked a fair bit with Prof. Rich Enbody about this approach, and he did a wonderful job of double-checking my intuition. Jason Pell and Qingpeng Zhang are co-authors on this project; in particular, Jason helped write the C++ code, and Qingpeng has been working with k-mers in general and tallymer in specific on some other projects, and provided a lot of background help.


We've taken a previously hard problem and made it practically solvable: we can filter ~88m sequences in a few hours on a single-processor computer with only 16gb of RAM. This seems useful.

Our main challenge now is to come up with a better hashing function. Our current hash function is not uniform, even for a uniform distribution of k-mers, because of the way it handles reverse complements.

The approach scales reasonably nicely. Doubling the amount of data doubles the compute time. However, if you have double the novelty, you'll need to do double the partitions to keep the same false positive rate, in which case you quadruple the compute time. So it's O(N^2) for the worst case (unending novelty) but should be something better for real-world cases. That's what we'll be looking at over the next few months.

I haven't done enough background reading to figure out if our approach is particularly novel, although in the space of bioinformatics it seems to be reasonably so. That's less important than actually solving our problem, but it would be nice to punch the "publication" ticket if possible. We're thinking of writing it up and sending it to BMC Bioinformatics, although suggestions are welcome.

It would be particularly ironic if the first publication from my lab was this computer science-y, given that I have no degrees in CS and am in the CS department by kind of a fluke of the hiring process ;).

Syndicated 2010-07-07 15:09:20 from Titus Brown

Teaching scientists how to use computers - hub & spokes

After my recent next-gen sequencing course, which was supposed to tie into the whole software carpentry (SWC) effort but didn't really succeed in doing so the first time through, I started thinking about the Right Way to tie in the SWC material. In particular, how do you both motivate scientists to look at the SWC material, and (re)direct people to the appropriate places?

It's not clear that a Plan is in place. Greg Wilson seems to assume that scientists will find at least some of the material immediately obviously usable, but I think he's targetted at a more sophisticated population of users -- physicists and the like. My experience with bioinformaticians, however, is that they either come from straight biology backgrounds (with little or no computational background and rather limited on-the-job training), straight computation backgrounds (with very little biology), or physics (gonzo programming skills, but no biology). The latter fit neatly into the SWC fold, but they (we ;) are rare in biology. I think computer scientists and biologists are going to need guidance to dive into SWC at an early enough time for it to be the most rewarding.

So, what's a good model for SWC to guide scientists from multiple disciplines into the appropriate material? It's obviously not going to be possible to have Greg et al. tailor the SWC material to individual subgroups -- he doesn't know much (any ;) biology, for example. I don't have the time, patience, or skillset to integrate my next-gen notes into his SWC material, either. So, instead, I propose the hub & spokes model!


Here, the "hub" is the SWC material, and the spokes are all of the individual disciplines.

Basically, the idea is that individual sites (like my own ANGUS site on next-gen sequencing, http://ged.msu.edu/angus/) will develop their own field-specific content, and then link from that content into the SWC notes. This way the experts with feet in both fields can link appropriately, and Greg only has to worry about making the central content general -- which he's already doing quite well, I think. Yes, It's more work than asking Greg to do it, but frankly I'm going to be happy with a kick-ass central SWC site to which I can link -- right now it's dismayingly challenging to teach students why this stuff matters and how to learn it.

From the psychosocial perspective, it's a great fit. Students can get hands on tutorials on how to do X, Y, and Z in their own field -- and then connect into the SWC material to learn the background, or additional computational techniques in support of it. Motivation first!

What do we need SWC to do to support this? Not much -- basically, the central SWC notes need to be stable enough (with permalinks) that I can link into them from my own site(s) and not have to worry about the links becoming broken or (worse) silently migrating in topic. There are other solutions (wholesale incorporation of SWC into my own notes, for example) but I think the permalink idea is the most straightforward. Oh, and we should have a Greg-gets-hit-by-a-bus plan, too; at some point he's going to move on from SWC (perhaps when his lovely wife decide she's had enough and he needs to stop obsessing over it, or perhaps under more dire circumstances ;( and it would be good to know who holds the domain and site keys.

Thoughts? Comments?


Syndicated 2010-07-06 03:12:24 from Titus Brown

Which functional programming language(s) should we teach?

Laurie Dillon just posted the SIGPLAN eduction board article on Why Undergraduates Should Learn the Principles of Programming Languages to our faculty mailing list at the MSU Computer Science department. One question that came up in the ensuing conversation was: what functional programming language(s) would/should we teach?

I mentioned OCaml, Haskell, and Erlang as reasonably pure but still pragmatic FP languages. Anything else that's both "truly" functional and used somewhat broadly in the real world?



Syndicated 2010-06-24 18:31:48 from Titus Brown

Teaching next-gen sequencing data analysis to biologists

Our sequencing analysis course ended last Friday, with an overwhelmingly positive response from the students. The few negative comments that I got were largely about organizational issues, and could be reshaped as suggestions for next time rather than as condemnations of this year's course.


The 23 students -- most with no prior command-line experience -- spent two weeks experiencing at first hand the challenges of dealing with dozens of gigabytes of sequencing data. Each of the students went through genome-scale mapping, genome assembly, mRNAseq analysis on an "emerging model organism" (a.k.a "one with a crappy genome", lamprey), resequencing analysis on E. coli, and ChIP-seq analysis on Myxococcus xanthus. By the beginning of the second week, many students were working with their own data -- a real victory. Python programming competency may take a bit longer, but many of them seem motivated.

If you had told me three weeks ago that we could pull this off, I would have told you that you were crazy. This does beg the question of what I was thinking when I proposed the course -- but don't dwell on that, please...

The locale was great, as you can see:


One of the most important lessons of the course for me is that cloud computing works well to backstop this kind of course. I was very worried about the suitabiliy and reliability and ease of use, but AWS did a great job, providing an easy-to-use Web interface and a good range of machine images. I have little doubt that this course would have been nearly impossible (and either completely ineffective or much more expensive) without it.

In the end, we spent more on beer than on computational power. That says something important to me :)

The course notes are available under a CC license although they need to be reworked to use publicly available data sets before they become truly useful. At that point I expect them to become awesomely useful, though.

From the scientific perspective, the students derived a number of significant benefits from the course. One that I had not really expected was that some students had no idea what went in to computational "sausage", and were kind of shocked to see what kinds of assumptions us comp bio people made on their behalf. This was especially true in the case of students from companies, who have pipelines that are run on their data. One student lamented that "we used to look at the raw traces... now all we get are spreadsheet summaries!" Another student came to me in a panic because they didn't realize that there was no one true answer -- that that was in fact part of the "fun" of all biology, not just experimental biology. These reactions alone made teaching the course worthwhile.

Of course, the main point is that many of the students seem to be capable of at least starting their own analyses now. I was surprised at the practical power of our cut-and-paste approach -- for example, if you look at the Short-read assembly with ABySS tutorial, it turns out to be relatively straightforward to adapt this to doing assemblies of your own genomic or transcriptomic data. I based our approach on Greg Wilson's post on the failure of inquiry-based teaching and so far I like it.

I am particularly amused that we have now documented, in replicable detail, the Kroos Lab MrpC ChIP analysis. We also have the best documentation for Jeff Barrick's breseq software, I think; this is what is used to analyze the Long Term Evolution Experiment lines -- and I can't wait for the anti-evolutionists to pounce on that... "Titus Brown -- making evolution experiments accessible to creationists." Yay?

There were a number of problems and mistakes that we had to steamroller through. In particular, more background and more advanced tutorials would have be great, but we just didn't have time to write them. Some 454, Helicos, and SOLiD data sets (and next year, PacBio?) would be a good addition. We had a general lack of multiplexing data, which is becoming a Big Thing now that sequencing is so ridiculously deep. I would also like to introduce additional real data analyses next year, reprising things like the Cufflinks analysis and whole-vertebrate-genome ChIP-seq/mRNAseq a la the Wold Lab. I'm weighing adding metagenomics data analysis in for a day, although it's a pretty separate field of inquiry (and frankly much harder in terms of "unknown unknowns"). We also desperately need some plant genomics expertise, because frankly I know nothing about plant genomes; my last-minute plant genomics TA fell through due to lack of planning on my part. (Conveniently, plant genomics is something MSU is particularly good at, so I'm sure I can find someone next year.)

Oops, did I say next year? Well, yes. If I can find funding for my princely salary, then I will almost certainly run the course again next year. I can cover TAs and my own room/board and speakers with workshop fees, but if I'm going to keep room+board+fees under $1000/student -- a practical necessity for most -- there's no way I can pay myself, too. And while this year I relied on my lovely, patient, and frankly long-suffering wife to hold down the home fort while I was away for two weeks, I simply can't put her through that again, so I will need to pay for a nanny next year. So doing it for free is not an option.

In other words, if you are a sequencing company, or an NIH/NSF/USDA program director, interested in keeping this going, please get in touch. I plan to apply for this Initiative to Maximize Research Education in Genomics in September, but I am not confident of getting that on the first try, and in any case I will need letters of support from interested folks. So drop me a note at ctb@msu.edu.

Course development this year was sponsored by the MSU Gene Expression in Disease and Development, to whom I am truly grateful. The course would simply not have been possible without their support.

My overall conclusion is that it is possible to teach bench biologists with no prior computational experience to achieve at least minimal competency in real-world data analysis of next-generation sequencing data. I can't conclusively demonstrate this without doing a better job of course evaluation, and of course only time will tell if it sticks for any of the students, but right now I'm feeling pretty good about the course overall. Not to mention massively relieved.


p.s. Update from one student -- "It's not even 12 o'clock Monday morning and I'm already getting people asking me how to run assemblies and analyze data." Heh.

Syndicated 2010-06-14 15:38:31 from Titus Brown

Running a next-gen sequence analysis course using Amazon Web Services

So, I've been teaching a course on next-generation sequence analysis for the last week, and one of the issues I had to deal with before I proposed the course was how to deal with the volume of data and the required computation.

You see, next-generation sequence analysis involves analyzing not just entire genomes (which are, after all, only 3gb or so in size) but data sets that are 100x or 1000x as big! We want to not just map these data sets (which is CPU-intensive), but also perform memory-intensive steps like assembly. If you have a class with 20+ students in it, you need to worry about a lot of things:

  • computational power: how do you provide 24 "good" workstations
  • memory
  • disk space
  • bandwidth
  • "take home" ability

One strategy would be to simply provide some Linux or Mac workstations, with cut-down data sets. But then you wouldn't be teaching reality -- you'd be teaching a cut-down version of reality. This would make the course particularly irrelevant given that one of the extra-fun things about next-gen sequence analysis is how hard it is to deal with the volume of data. You also have to worry that the course would be made even more irrelevant because the students would leave the course and be unable to use the information without finding infrastructure and installing a bunch of software and then administering the machine.

While I enjoy setting up computers and installing software and managing users, I'm clearly masochistic. It's also entirely besides the point for bioinformaticians and biologists - they just want to analyze data!

The solution I came up with was to use Amazon Web Services and rent some EC2 machines. There's a large variety of hardware configurations available (see instance types) and they're not that expensive per hour (see pricing).

This has worked out really, really well.

It's hard to enumerate the benefits, because there have been so many of them ;). A few of the obvious ones --

We've been able to write tutorials (temporary home here: http://ged.msu.edu/angus/) that make use of specific images and should be as future-proof as they can be. We've given students cut and paste command lines that Just Work, and that they can tweak and modify as they want. If it borks, they always just throw it away and start from a clean install.

It's dirt cheap. We spent less than $50 the first week, for ~30 people using an average of 8 hours of CPU time. The second week will increase to an average of 8 hours of CPU time a day, and for larger instances -- so probably about $300 total, or maybe even $500 -- but that's ridiculously cheap, frankly, when you consider that there are no hardware issues or OS re-install problems to deal with!

Students can choose whatever machine specs they need in order to do their analysis. More memory? Easy. Faster CPU needed? No problem.

All of the data analysis takes place off-site. As long as we can provide the data sets somewhere else (I've been using S3, of course) the students don't need to transfer multi-gigabyte files around.

The students can go home, rent EC2 machines, and do their own analyses -- without their labs buying any required infrastructure.

Home institution computer admins can use the EC2 tutorials as documentation to figure out what needs to be installed (and potentially, maintained) in order for their researchers to do next-gen sequence analysis.

The documentation should even serve as a general set of tutorials, once I go through and remove the dependence on private data sets! There won't be any need for students to do difficult or tricky configurations on their home machines in order to make use of the tutorial info.

So, truly awesome. I'm going to be using it for all my courses from now on, I think.

There have been only two minor hitches.

First, I'm using Consolidated Billing to pay for all of the students' computer use during the class, and Amazon has some rules in place to prevent abuse of this. They're limiting me to 20 consolidated billing accounts per AWS account, which means that I've needed to get a second AWS account in order to add all 30 students, TAs, and visiting instructors. I wouldn't even mention it as a serious issue but for the fact that they don't document it anywhere, so I ran into this on the first day of class and then had to wait for them to get back to me to explain what was going on and how to work around it. Grr.

Second, we had some trouble starting up enough Large instances simultaneously on the day we were doing assembly. Not sure what that was about.

Anyway, so I give a strong +1 on Amazon EC2 for large-ish style data analysis. Good stuff.

cheers, --titus

Syndicated 2010-06-08 14:52:29 from Titus Brown

Help! Help! Class notes site?

So, I'm running this summer course and I am trying to figure out how to organize the notes for students. I'd like to mix curriculum-specific notes ("here's what we're doing today, and here are some problems to work on") with tutorials (material independent of a single course, like "here's how to transfer files between computers" or "here's how to parse CSV files"), and allow students to search the documents, annotate them in their Web browser, search the annotations, and perhaps even do public or private bookmarking and tagging. The ability to edit the primary content in something other than a Web GUI would be really, really nice, too -- that way I can write in something like ReST and then upload into the system.

(This is a system I could write myself, but that's kind of silly, dontcha think?)

It should also be lightweight, reasonably mature, easy to set up, and (preferably) written in Python, although I'm willing to compromise on the last simply because I'm desperate.

Pointers, comments, suggestions welcome!


Syndicated 2010-05-21 15:22:10 from Titus Brown

How to teach newbies about Python

Inspired by the brilliant mind(s) behind python-commandments.org, here's a list of ideas you can use to help newbies learn Python!

  1. Write Web pages with overly broad self-granted authority. That way they'll sound authoritative!
  2. Make sure they're anonymous pages, too. It's important that people not recognize that "Commandment" means "My Opinion", especially in the presence of actual Python community-level commandments. You know, like "we're switching to Python 3".
  3. Keep it short. Nobody wants to overwhelm people with, umm, information.
  4. Have a favorite toolkit? Push it!
  5. Oh, wait -- does your favorite toolkit only really address one particular problem? That's OK! Naive technical discussions are just fine -- opinions only, information just gets in the way!
  6. Remember, keep that self-authority coming!

Seriously, I gather that these are the folk helping out on the #python channel on IRC. Gawd help us.


p.s. I stumbled across it when someone pointed it out to me as a reason to not prioritize Python 3 for a GSoC project.

Syndicated 2010-04-26 20:32:52 from Titus Brown

Algorithms, and finding biological variation

I've been doing some more focused bioinformatics programming recently, and as I'm thinking about how to teach biologists about data analysis, I realize more and more how much backstory goes into even relatively simple programming.

The problem: given a reference genome, and a very large set of short, error-prone, random sequences from an individual (with a slightly different genome) aligned to that genome, find all differences and provide enough summary information that probably-real differences can be distinguished from differences due only to error. More explicitly, if I have a sequence to which a bunch of shorter sequences have been aligned, like so:


   GGCTAC          -HWI-EAS216_0019:6:44:8736:13789#0/1[0:6]
   GGCTAAA         HWI-EAS216_0019:6:11:12331:10539#0/1[28:35]
   GGCTAAAAA       HWI-EAS216_0019:6:54:3745:18270#0/1[26:35]
   GGCTAAAAAA      HWI-EAS216_0019:6:75:4175:16939#0/1[25:35]
   GGATACAACAG     -HWI-EAS216_0019:6:96:16613:3276#0/1[0:11]
   GGCTAAAAAAA     HWI-EAS216_0019:6:11:8510:9363#0/1[22:33]
   GGCTAAAAAAA     HWI-EAS216_0019:6:72:14343:19416#0/1[21:32]
   GGCTACAACAG     -HWI-EAS216_0019:6:116:10526:5940#0/1[4:15]
   GGCTAAAACAA     HWI-EAS216_0019:6:103:9198:1208#0/1[18:29]
   GGCTACAACAG     -HWI-EAS216_0019:6:50:14928:13905#0/1[7:18]
   GGCTACAACAG     -HWI-EAS216_0019:6:109:18796:4787#0/1[14:25]
   GGCTACAACAG     -HWI-EAS216_0019:6:102:14050:8429#0/1[18:29]
   GGCTACAACAG     -HWI-EAS216_0019:6:22:14814:10742#0/1[23:34]
    GCTCCACCAG     -HWI-EAS216_0019:6:106:8241:6484#0/1[25:35]
      TACAACAG     HWI-EAS216_0019:6:4:8095:15474#0/1[0:8]
        CAACAG     HWI-EAS216_0019:6:115:15766:4844#0/1[0:6]
     * **  * *
     1 23  4 5

how do you gather all of the locations with differences, and then figure out which differences are significant? (Variations in 1 and 2 are probably errors, 4 and 5 may be real, and 3 is pretty clearly real, I would say.)

The numbers here aren't trivial: you're looking at ~1 million-3 billion bases for the reference genome, and 20-120 million or more shorter resequencing reads.

The first thing to point out is that it's not enough to simply look at all of the aligned sequences individually; you need to know how many total sequences are aligned to each spot, and what each nucleotide is. This is because the new sequences are error-prone: observing a single variation isn't enough, you need to know that it's systematic variation. So it's a total-information situation, and hence not just pleasantly parallel: you need to integrate the information afterwards.

I came up with two basic approaches: you can first figure out where there's any variation, and then iterate again over the short sequences, figure out which ones overlap detected variation, and record those counts; or you can figure out where there's variation, index the short sequences by position, and then query them by position. Both are easy to code.

For the first, approach A:

for seq, location in aligned_short_sequences:

for seq, location in aligned_short_sequences:
   if has_variation(location):
      record_nucleotide(genome[location], seq)

For the second, approach B:

for seq, location in aligned_short_sequences:

for location in recorded_variation:
   aligned_subset = get_aligned_short_sequences(location)
   for seq, _ in aligned_subset:
      record_nucleotide(genome[location], seq)

It's readily apparent that barring stupidity, approach A is generally going to be faster: there's one less loop, for one thing. Also, if you think about implementation details, approach B is potentially going to involve retrieving aligned short sequences multiple times, while approach A looks at each aligned short sequence exactly once.

However, there are tradeoffs, especially if you're building tools to support either approach. The most obvious one is, what if you need to look at variation on less than a whole-genome level? For example, for visualization, or gene overlap calculations, you're probably not going to care about an immense region of the genome; you may want to zoom in on a few hundred bases. In that case, you'd want to use some of the tools from approach B: given a position, what aligns there?

if variation_at(genome[location]):
    aligned_subset = get_aligned_short_sequences(location)
    for seq, _ in aligned_subset:
       record_nucleotide(genome[location], seq)
variation in the genome

Another tradeoff is simply whether or not to store everything in memory (fast) or on disk (slower, but scales to bigger data sets) -- I first wrote this code using an in-memory set, to look at low-variation chicken genome resequencing. Then, when I tried to apply it to a high variation Drosophila data set and a deeply sequenced Campylobacter data set, it consumed a lot of RAM. So you have to think about whether or not you want a single tool that works across all these situations, and if so, how you want it to behave for large genomes, with lots of real variation, and extremely deep sequencing -- plan for the worst, in other words.

I also tend to worry about correctness. Is the code giving you the right results, and how easy is it to understand? Each approach has different drawbacks, and relies on different bodies of code that make certain operations easy and other operations hard.

The upshot is that I wrote several programs implementing different approaches, and tried them out on different data sets, and chose the fastest one for one purpose (overall statistics) and a slower one for visualization. And most programmers I know would have done the same: tried a few different approaches, benchmarked them, and chosen the right tool for the current job.

Where does this leave people who don't intuitively understand the difference between the algorithms, and don't really understand what the tools are doing underneath, and why one tool might return different results (based on the parameters given) than another? How do you explain that sort of thing in a short amount of time?

I don't know, but it's going to be interesting to find out...

BTW, the proper way to do this is with pygr will someday soon be:

for seq, location in aligned_short_sequences:

mapping = index_short_sequences()

results = intersect_mappings(variations, mapping)

where 'intersect_mappings' uses internal knowlege of the data structures to do everything quickly and correctly. That day is still a few months off...


Syndicated 2010-04-04 00:58:05 from Titus Brown

Indexing and retrieving lots of (biological) sequences, quickly

These days, molecular biologists are dealing with lots and lots of sequences, largely due to next-gen sequencing technologies. For example, the Illumina GA2 is producing 100-200 million DNA sequences, each of 75-125 bases, per run; that works out to 20 gb of sequence data per run, not counting metadata such as names and quality scores.

Storing, referencing, and retrieving these sequences is kind of annoying. It's not so much that the files themselves are huge, but that they're in a flat file format called 'FASTA' (or FASTQ), with no way to quickly retrieve a particular sequence short of scanning through the file. This is not something you want to do on-demand or keep in-memory, no matter what BioPython says ;).

So everyone working with these sequences has evolved one way or another to deal with it. A number of different approaches have been used, but they mostly boil down to using some sort of database to either index or store the sequences and associated metadata. It won't surprise you to learn that performance varies widely; that benchmarks are few and far between; and that many people just use MySQL, even though you can predict that using something like MySQL (or really any full-blown r/w database) is going to be suboptimal. This is because FASTA and FASTQ files don't change, and the overhead associated with being able to rearrange your indices to accomodate new data is unnecessary.

A while ago -- nearly 2 years ago!? -- we started looking at various approaches for dealing with the problem of "too much sequence", because the approach we used in pygr was optimized for retrieving subsequences from a relatively small number of relatively large sequences (think human chromosomes). Alex Nolley, a CSE undergrad working in the lab, looked at a number of approaches, and settled on a simple three-level write-once-read-many scheme involving a hash table (mapping names to record IDs), another table mapping fixed-size record IDs to offsets within a database file, and a third randomly-indexed database file. He implemented this in C++ and wrote a Pyrex wrapper and a Python API, and we called it screed.

Imagine our surprise when we found out that even absent the hashing, sqlite was faster than our screed implementation! sqlite is an embedded SQL database that is quite popular and (among other things) included with Python as of Python 2.5. Yet because it's a general purpose database, it seemed unlikely to us that it would be faster than a straightforward 2-level indexed flat file scheme. And yet... it was! A cautionary tale about reimplementing something that really smart/good programmers have spent a lot of time thinking about, eh? (Thanks to James Casbon, in particular, for whacking us over the head on this issue; he wins, I lose.)

Anyway, after a frustrating few weeks trying to think of why screed would be slower than sqlite and trying some naive I/O optimizations, we abandoned our implementation and Alex reimplemented everything on top of sqlite. That is now available as the new screed.

screed can import both common forms of sequence flat file (FASTA and FASTQ), and provides a unified interface for retrieving records by the sequence name. It stores all records as single rows in a table, along with another table to describe what the rows are and what special features they have. Because all of the data is stored in a single database, screed databases are portable and you can regenerate the original file from the index.

screed is designed to be easy to install and use. Conveniently, screed is pure Python: sqlite comes with Python 2.5, and because screed stays away from sqlite extensions, no C/C++ compiler is needed. As a result installation is really easy, and since screed behaves just like a read-only dictionary, anyone -- not just pygr users -- can use it trivially. For example,

>>> import screed
>>> db = screed.read_fasta_sequences('tests/test.fa')


>>> db = screed.read_fastq_sequences('tests/test.fq')

and then

>>> for name in db:
...   print name, db[name].sequence

will index the given database and print out all the associated sequences.

Now, because we are actually considering publishing screed (along with another new tool, which I'll post about next), I asked Alex to benchmark screed against MySQL and PostgreSQL. While sqlite is far more convenient for our purposes -- embeddable, included with Python so no installation needed, and not client-server based -- "convenience" isn't as defensible as "way faster" ;). Alex dutifully did so, and the results for accessing sequences, once they are indexed, are below:

benchmark figure

That is, sqlite/screed seems to achieve a roughly 1.8x speedup over postgresql, and 2.2x speedup over mysql, with a roughly constant retrieval time for up to a million sequences. (We've seen constant retrieval times out to 10m or so, but past that it becomes annoying to repeat benchmarks, so we'd like to get our benchmark code correct before running them.)

The benchmark code is here: screed, mysql, and postgresql. We're still working on making it easy to run for other people but you should get the basic sense of it from the code I linked to.

screed's performance is likely to improve as we streamline some of the internal screed code, although the improvements should be marginal because I/O and hashing should dominate. So, these benchmarks can be read as "underperforming Python interface wrapped around sqlite vs mysql and postgresql".

There is one disappointment with sqlite. sqlite doesn't come with any sort of compression system built-in, so even though DNA sequence is eminently compressible, we're stuck with a storage and indexing system that requires about 120% of the space of the original file.


Syndicated 2010-03-29 02:41:31 from Titus Brown

Course announcement: Analyzing Next-Generation Sequencing Data

Analyzing Next-Generation Sequencing Data

May 31 - June 11th, 2010

Kellogg Biological Station, Michigan State University

CSE 891 s431 / MMG 890 s433, 2 cr

Applications are due by midnight EST, April 9th, 2010.

Course sponsor: Gene Expression in Disease and Development Focus Group at Michigan State University.

Instructors: Dr. C. Titus Brown and Dr. Gregory V. Wilson.

Course Description

This intensive two week summer course will introduce students with a strong biology background to the practice of analyzing short-read sequencing data from the Illumina GA2 and other next-gen platforms. The first week will introduce students to computational thinking and large-scale data analysis on UNIX platforms. The second week will focus on mapping, assembly, and analysis of short-read data for resequencing, ChIP-seq, and RNAseq.

No prior programming experience is required, although familiarity with some programming concepts is suggested, and bravery in the face of the unknown is necessary. 2 years or more of graduate school in a biological science is strongly suggested.

See the course Web site for more information.

Syndicated 2010-03-24 02:22:27 from Titus Brown

449 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!