21 May 2010 »

Help! Help! Class notes site?

So, I'm running this summer course and I am trying to figure out how to organize the notes for students. I'd like to mix curriculum-specific notes ("here's what we're doing today, and here are some problems to work on") with tutorials (material independent of a single course, like "here's how to transfer files between computers" or "here's how to parse CSV files"), and allow students to search the documents, annotate them in their Web browser, search the annotations, and perhaps even do public or private bookmarking and tagging. The ability to edit the primary content in something other than a Web GUI would be really, really nice, too -- that way I can write in something like ReST and then upload into the system.

(This is a system I could write myself, but that's kind of silly, dontcha think?)

It should also be lightweight, reasonably mature, easy to set up, and (preferably) written in Python, although I'm willing to compromise on the last simply because I'm desperate.

Pointers, comments, suggestions welcome!

--titus

Syndicated 2010-05-21 15:22:10 from Titus Brown

26 Apr 2010 »

How to teach newbies about Python

Inspired by the brilliant mind(s) behind python-commandments.org, here's a list of ideas you can use to help newbies learn Python!

Write Web pages with overly broad self-granted authority. That way they'll sound authoritative!
Make sure they're anonymous pages, too. It's important that people not recognize that "Commandment" means "My Opinion", especially in the presence of actual Python community-level commandments. You know, like "we're switching to Python 3".
Keep it short. Nobody wants to overwhelm people with, umm, information.
Have a favorite toolkit? Push it!
Oh, wait -- does your favorite toolkit only really address one particular problem? That's OK! Naive technical discussions are just fine -- opinions only, information just gets in the way!
Remember, keep that self-authority coming!

Seriously, I gather that these are the folk helping out on the #python channel on IRC. Gawd help us.

--titus

p.s. I stumbled across it when someone pointed it out to me as a reason to not prioritize Python 3 for a GSoC project.

Syndicated 2010-04-26 20:32:52 from Titus Brown

4 Apr 2010 »

Algorithms, and finding biological variation

I've been doing some more focused bioinformatics programming recently, and as I'm thinking about how to teach biologists about data analysis, I realize more and more how much backstory goes into even relatively simple programming.

The problem: given a reference genome, and a very large set of short, error-prone, random sequences from an individual (with a slightly different genome) aligned to that genome, find all differences and provide enough summary information that probably-real differences can be distinguished from differences due only to error. More explicitly, if I have a sequence to which a bunch of shorter sequences have been aligned, like so:

GAGGGCTACAAAAGGAGGCATAGGACCGATAGGACAT (reference genome)

   GGCTAC          -HWI-EAS216_0019:6:44:8736:13789#0/1[0:6]
   GGCTAAA         HWI-EAS216_0019:6:11:12331:10539#0/1[28:35]
   GGCTAAAAA       HWI-EAS216_0019:6:54:3745:18270#0/1[26:35]
   GGCTAAAAAA      HWI-EAS216_0019:6:75:4175:16939#0/1[25:35]
   GGATACAACAG     -HWI-EAS216_0019:6:96:16613:3276#0/1[0:11]
   GGCTAAAAAAA     HWI-EAS216_0019:6:11:8510:9363#0/1[22:33]
   GGCTAAAAAAA     HWI-EAS216_0019:6:72:14343:19416#0/1[21:32]
   GGCTACAACAG     -HWI-EAS216_0019:6:116:10526:5940#0/1[4:15]
   GGCTAAAACAA     HWI-EAS216_0019:6:103:9198:1208#0/1[18:29]
   GGCTACAACAG     -HWI-EAS216_0019:6:50:14928:13905#0/1[7:18]
   GGCTACAACAG     -HWI-EAS216_0019:6:109:18796:4787#0/1[14:25]
   GGCTACAACAG     -HWI-EAS216_0019:6:102:14050:8429#0/1[18:29]
   GGCTACAACAG     -HWI-EAS216_0019:6:22:14814:10742#0/1[23:34]
    GCTCCACCAG     -HWI-EAS216_0019:6:106:8241:6484#0/1[25:35]
      TACAACAG     HWI-EAS216_0019:6:4:8095:15474#0/1[0:8]
        CAACAG     HWI-EAS216_0019:6:115:15766:4844#0/1[0:6]
     * **  * *
     1 23  4 5

how do you gather all of the locations with differences, and then figure out which differences are significant? (Variations in 1 and 2 are probably errors, 4 and 5 may be real, and 3 is pretty clearly real, I would say.)

The numbers here aren't trivial: you're looking at ~1 million-3 billion bases for the reference genome, and 20-120 million or more shorter resequencing reads.

The first thing to point out is that it's not enough to simply look at all of the aligned sequences individually; you need to know how many total sequences are aligned to each spot, and what each nucleotide is. This is because the new sequences are error-prone: observing a single variation isn't enough, you need to know that it's systematic variation. So it's a total-information situation, and hence not just pleasantly parallel: you need to integrate the information afterwards.

I came up with two basic approaches: you can first figure out where there's any variation, and then iterate again over the short sequences, figure out which ones overlap detected variation, and record those counts; or you can figure out where there's variation, index the short sequences by position, and then query them by position. Both are easy to code.

For the first, approach A:

for seq, location in aligned_short_sequences:
   record_variation(genome[location])

for seq, location in aligned_short_sequences:
   if has_variation(location):
      record_nucleotide(genome[location], seq)

For the second, approach B:

for seq, location in aligned_short_sequences:
   record_variation(genome[location])

for location in recorded_variation:
   aligned_subset = get_aligned_short_sequences(location)
   for seq, _ in aligned_subset:
      record_nucleotide(genome[location], seq)

It's readily apparent that barring stupidity, approach A is generally going to be faster: there's one less loop, for one thing. Also, if you think about implementation details, approach B is potentially going to involve retrieving aligned short sequences multiple times, while approach A looks at each aligned short sequence exactly once.

However, there are tradeoffs, especially if you're building tools to support either approach. The most obvious one is, what if you need to look at variation on less than a whole-genome level? For example, for visualization, or gene overlap calculations, you're probably not going to care about an immense region of the genome; you may want to zoom in on a few hundred bases. In that case, you'd want to use some of the tools from approach B: given a position, what aligns there?

if variation_at(genome[location]):
    aligned_subset = get_aligned_short_sequences(location)
    for seq, _ in aligned_subset:
       record_nucleotide(genome[location], seq)

Another tradeoff is simply whether or not to store everything in memory (fast) or on disk (slower, but scales to bigger data sets) -- I first wrote this code using an in-memory set, to look at low-variation chicken genome resequencing. Then, when I tried to apply it to a high variation Drosophila data set and a deeply sequenced Campylobacter data set, it consumed a lot of RAM. So you have to think about whether or not you want a single tool that works across all these situations, and if so, how you want it to behave for large genomes, with lots of real variation, and extremely deep sequencing -- plan for the worst, in other words.

I also tend to worry about correctness. Is the code giving you the right results, and how easy is it to understand? Each approach has different drawbacks, and relies on different bodies of code that make certain operations easy and other operations hard.

The upshot is that I wrote several programs implementing different approaches, and tried them out on different data sets, and chose the fastest one for one purpose (overall statistics) and a slower one for visualization. And most programmers I know would have done the same: tried a few different approaches, benchmarked them, and chosen the right tool for the current job.

Where does this leave people who don't intuitively understand the difference between the algorithms, and don't really understand what the tools are doing underneath, and why one tool might return different results (based on the parameters given) than another? How do you explain that sort of thing in a short amount of time?

I don't know, but it's going to be interesting to find out...

BTW, the proper way to do this is with pygr will someday soon be:

for seq, location in aligned_short_sequences:
   record_variation(genome[location])

mapping = index_short_sequences()

results = intersect_mappings(variations, mapping)

where 'intersect_mappings' uses internal knowlege of the data structures to do everything quickly and correctly. That day is still a few months off...

--titus

Syndicated 2010-04-04 00:58:05 from Titus Brown

29 Mar 2010 »

Indexing and retrieving lots of (biological) sequences, quickly

These days, molecular biologists are dealing with lots and lots of sequences, largely due to next-gen sequencing technologies. For example, the Illumina GA2 is producing 100-200 million DNA sequences, each of 75-125 bases, per run; that works out to 20 gb of sequence data per run, not counting metadata such as names and quality scores.

Storing, referencing, and retrieving these sequences is kind of annoying. It's not so much that the files themselves are huge, but that they're in a flat file format called 'FASTA' (or FASTQ), with no way to quickly retrieve a particular sequence short of scanning through the file. This is not something you want to do on-demand or keep in-memory, no matter what BioPython says ;).

So everyone working with these sequences has evolved one way or another to deal with it. A number of different approaches have been used, but they mostly boil down to using some sort of database to either index or store the sequences and associated metadata. It won't surprise you to learn that performance varies widely; that benchmarks are few and far between; and that many people just use MySQL, even though you can predict that using something like MySQL (or really any full-blown r/w database) is going to be suboptimal. This is because FASTA and FASTQ files don't change, and the overhead associated with being able to rearrange your indices to accomodate new data is unnecessary.

A while ago -- nearly 2 years ago!? -- we started looking at various approaches for dealing with the problem of "too much sequence", because the approach we used in pygr was optimized for retrieving subsequences from a relatively small number of relatively large sequences (think human chromosomes). Alex Nolley, a CSE undergrad working in the lab, looked at a number of approaches, and settled on a simple three-level write-once-read-many scheme involving a hash table (mapping names to record IDs), another table mapping fixed-size record IDs to offsets within a database file, and a third randomly-indexed database file. He implemented this in C++ and wrote a Pyrex wrapper and a Python API, and we called it screed.

Imagine our surprise when we found out that even absent the hashing, sqlite was faster than our screed implementation! sqlite is an embedded SQL database that is quite popular and (among other things) included with Python as of Python 2.5. Yet because it's a general purpose database, it seemed unlikely to us that it would be faster than a straightforward 2-level indexed flat file scheme. And yet... it was! A cautionary tale about reimplementing something that really smart/good programmers have spent a lot of time thinking about, eh? (Thanks to James Casbon, in particular, for whacking us over the head on this issue; he wins, I lose.)

Anyway, after a frustrating few weeks trying to think of why screed would be slower than sqlite and trying some naive I/O optimizations, we abandoned our implementation and Alex reimplemented everything on top of sqlite. That is now available as the new screed.

screed can import both common forms of sequence flat file (FASTA and FASTQ), and provides a unified interface for retrieving records by the sequence name. It stores all records as single rows in a table, along with another table to describe what the rows are and what special features they have. Because all of the data is stored in a single database, screed databases are portable and you can regenerate the original file from the index.

screed is designed to be easy to install and use. Conveniently, screed is pure Python: sqlite comes with Python 2.5, and because screed stays away from sqlite extensions, no C/C++ compiler is needed. As a result installation is really easy, and since screed behaves just like a read-only dictionary, anyone -- not just pygr users -- can use it trivially. For example,

>>> import screed
>>> db = screed.read_fasta_sequences('tests/test.fa')

or

>>> db = screed.read_fastq_sequences('tests/test.fq')

and then

>>> for name in db:
...   print name, db[name].sequence

will index the given database and print out all the associated sequences.

Now, because we are actually considering publishing screed (along with another new tool, which I'll post about next), I asked Alex to benchmark screed against MySQL and PostgreSQL. While sqlite is far more convenient for our purposes -- embeddable, included with Python so no installation needed, and not client-server based -- "convenience" isn't as defensible as "way faster" ;). Alex dutifully did so, and the results for accessing sequences, once they are indexed, are below:

That is, sqlite/screed seems to achieve a roughly 1.8x speedup over postgresql, and 2.2x speedup over mysql, with a roughly constant retrieval time for up to a million sequences. (We've seen constant retrieval times out to 10m or so, but past that it becomes annoying to repeat benchmarks, so we'd like to get our benchmark code correct before running them.)

The benchmark code is here: screed, mysql, and postgresql. We're still working on making it easy to run for other people but you should get the basic sense of it from the code I linked to.

screed's performance is likely to improve as we streamline some of the internal screed code, although the improvements should be marginal because I/O and hashing should dominate. So, these benchmarks can be read as "underperforming Python interface wrapped around sqlite vs mysql and postgresql".

There is one disappointment with sqlite. sqlite doesn't come with any sort of compression system built-in, so even though DNA sequence is eminently compressible, we're stuck with a storage and indexing system that requires about 120% of the space of the original file.

--titus

Syndicated 2010-03-29 02:41:31 from Titus Brown

24 Mar 2010 »

Course announcement: Analyzing Next-Generation Sequencing Data

Analyzing Next-Generation Sequencing Data

May 31 - June 11th, 2010

Kellogg Biological Station, Michigan State University

CSE 891 s431 / MMG 890 s433, 2 cr

Applications are due by midnight EST, April 9th, 2010.

Course sponsor: Gene Expression in Disease and Development Focus Group at Michigan State University.

Instructors: Dr. C. Titus Brown and Dr. Gregory V. Wilson.

Course Description

This intensive two week summer course will introduce students with a strong biology background to the practice of analyzing short-read sequencing data from the Illumina GA2 and other next-gen platforms. The first week will introduce students to computational thinking and large-scale data analysis on UNIX platforms. The second week will focus on mapping, assembly, and analysis of short-read data for resequencing, ChIP-seq, and RNAseq.

No prior programming experience is required, although familiarity with some programming concepts is suggested, and bravery in the face of the unknown is necessary. 2 years or more of graduate school in a biological science is strongly suggested.

See the course Web site for more information.

Syndicated 2010-03-24 02:22:27 from Titus Brown

21 Feb 2010 »

What's with the goat?

A new meme was born at PyCon 2010: The Testing Goat.

Or, "Be Stubborn. Obey the Goat."

The goat actually emerged from the Testing In Python Birds of a Feather session at PyCon, where Terry Peppers used slides full of goat in his introduction. This was apparently an overreaction to lolcat, but the testing goat is now being held up in opposition to the Django pony.

sigh.

--titus

Syndicated 2010-02-21 20:35:31 from Titus Brown

2 Jan 2010 »

Managing student expectations for open-source projects

On the heels of my aggressive competence post, about (among other things) my failure to outline my expectations for students, I've started putting together a page to help manage student expectations for the pony-build project, which is participating in the Undergraduate Capstone Open-Source Projects (UCOSP) course this term.

(Please comment over at the Wordpress blog for UCOSP:

http://ucosp.wordpress.com/2010/01/02/managing-student-expectations/