<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Advogato blog for titus</title>
    <link>http://www.advogato.org/person/titus/</link>
    <description>Advogato blog for titus</description>
    <language>en-us</language>
    <generator>mod_virgule</generator>
    <pubDate>Fri, 10 Feb 2012 15:40:49 GMT</pubDate>
    <item>
      <pubDate>Sun, 11 Dec 2011 15:36:11 GMT</pubDate>
      <title>Data Intensive Science, and Workflows</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=462</link>
      <guid>http://ivory.idyll.org/blog/dec-11/data-intensive-science-and-workflows</guid>
      <description>&lt;div&gt;
&lt;p&gt;I'm writing this on my way back from Stockholm, where I attended a
workshop on the &lt;a href="http://research.microsoft.com/en-us/collaboration/fourthparadigm/" &gt;4th Paradigm&lt;/a&gt;.  This is the idea (so named by Jim
Gray, I gather?) that data-intensive science is a distinct paradigm
from the first three paradigms of scientific investigation -- theory,
experiment, and simulation.  I was invited to attend as a token
biologist -- someone in biology who works with large scale data, and
thinks about large scale data, but isn't necessarily &lt;em&gt;only&lt;/em&gt; devoted to
dealing with large scale data.&lt;/p&gt;
&lt;p&gt;The workshop was pretty interesting.  It was run by Microsoft, who
invited me &amp;amp; paid my way out there.  The idea was to identify areas of
opportunity in which Microsoft could make investments to help out
scientists and develop new perspectives on the future of eScience.  To
do that, we played a game that I'll call the "anchor game", where we
divvied up into groups to discuss the blocks to our work that stemmed
from algorithms and tools, data, process and workflows, and
social/organizational aspects. In each group we put together sticky
notes with our "complaints" and then ranked them by how big of an
anchor they were on us -- "deep" sticky notes held us back more than
shallow sticky notes.  We then reorganized by discipline, and put
together an end-to-end workflow that we hoped in 5 years would be
possible, and then finally we looked for short- and medium-term
projects that would help get us there.&lt;/p&gt;
&lt;p&gt;The big surprise for me in all of this was that it turns out I'm most
interested in workflows and process!  All of my sticky notes had the
same theme: it wasn't tools, or data management, or social aspects
that were causing me problems, but rather the development of
appropriate workflows for scientific investigation.  Very weird, and
not what I would have predicted from the outset.&lt;/p&gt;
&lt;p&gt;Two questions came up for me during the workshop:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why don't people use workflows in bioinformatics?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The first question that comes to mind is, why doesn't anyone I know
use a formal workflow engine in bioinformatics?  I know they exist
(Taverna, for one, has a bunch of bioinformatics workflows); I'm
reasonably sure they would be useful; but there seems to be some
block against using them!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What would the ideal workflow situation be?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;During the anchor game, our group (which consisted of two biologists,
myself and Hugh Shanahan; a physicist, Geoffrey Fox; and a few
computer scientists, including Alex Wade) came up with an idea for a
specific tool.  The tool would be a bridge between Datascope for
biologists and a workflow engine.  The essential idea is to combine
data exploration with audit trail recording, which could then be
hardened into a workflow template and re-used.&lt;/p&gt;
&lt;p&gt;---&lt;/p&gt;
&lt;p&gt;Thinking about the process I usually use when working on a new
problem, it tends to consist of all these activities mixed together:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;evaluation of data quality, and application of "computational" controls&lt;/li&gt;
&lt;li&gt;exploration of various data manipulation steps, looking for statistical signal&lt;/li&gt;
&lt;li&gt;solidifying of various subcomponents of the data manipulation steps into scripted actions&lt;/li&gt;
&lt;li&gt;deployment of the entire thing on multiple data sets&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;Now, I'm never quite done -- new data types and questions always come
up, and there are always components to tweak.  But portions of the
workflow become pretty solid by the time I'm halfway done with the
initial project, and the evaluation of data quality accretes more
steps but rarely loses one.  So it could be quite useful to be able to
take a step back and say, "yeah, these steps?  wrap 'em up and put 'em
into a workflow, I'm done."  Except that I also want to be able to
edit and change them in the future.  And I'd like to be able to post
the results along with the precise steps taken to generate them.  And
(as long as I'm greedy) I would like to work at the command line,
while I know that others would like to be able to work in a notebook
or graphical format.  And I'd like to be able to "scale out" the
computation by bringing the compute to the data.&lt;/p&gt;
&lt;p&gt;For all of this I need three things: I need workflow &lt;em&gt;agility&lt;/em&gt;, I need
workflow &lt;em&gt;versioning&lt;/em&gt;, and I need workflow &lt;em&gt;tracking&lt;/em&gt;.  And this all
needs to sit on top of a workflow component model that lets me run the
components of the workflow wherever the &lt;em&gt;data&lt;/em&gt; is.&lt;/p&gt;
&lt;p&gt;I'm guessing no tool out there does this, although I know other people
are thinking this way, so maybe I'm wrong.  The Microsoft folk didn't
know of any, though, and they seemed pretty well informed in this area
:).&lt;/p&gt;
&lt;p&gt;The devil's choice I personally make in all of this is to go for
workflow agility, and ignore versioning and tracking and the component
model, by scripting the hell out of things.  But this is getting old,
and as I get older and have to teach my wicked ways to grad students
and postdocs, the lack of versioning and tracking and easy scaling out
gets more and more obvious.  And now that I'm trying to actually teach
computational science to biologists, it's simply appallingly difficult
to convey this stuff in a sensible way.  So I'm looking for something
better.&lt;/p&gt;
&lt;p&gt;One particularly intriguing type of software I've been looking at
recently is the "interactive Web notebook" --
&lt;a href="http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html" &gt;ipython notebook&lt;/a&gt; and
&lt;a href="http://sagenb.org/" &gt;sagenb&lt;/a&gt;. These are essentially Mathematica or matlab-style notebook
packages that work with IPython or Sage, scientific computing and
mathematical computing systems in Python.  They let you run Python
code interactively, and colocate it with its output (including
graphics); the notebooks can then be saved and loaded and re-run.  I'm
thinking about working one or the other into my class, since it would
let me move away from command-line dependence a bit.  (Command lines
are great, but throwing them at biologists, along with Python programming,
data analysis, and program execution all together, seems a bit cruel.
And not that productive.)&lt;/p&gt;
&lt;p&gt;It would also be great to have cloud-enabled workflow components.  As
I embark on more and more sequence analysis, there are only about a
dozen "things" we actually do, but mixed and matched.  These things
could be hardened, parameterized into components, and placed behind an
RPC interface that would give us a standard way to execute them.
Combined with a suitable data abstraction layer, I could run the
components from location A on data in location B in a semi-transparent
way, and also record and track their use in a variety of ways.  Given
a suitably rich set of components and use cases, I could imagine that
these components and their interactions could be executed from
multiple workflow engines, and with usefully interesting GUIs.  I know
Domain Specific Languages are already passe, but a DSL might be a good
way to find the right division between subcomponents.&lt;/p&gt;
&lt;p&gt;I'd be interested in hearing about such things that may already exist.
I'm aware of Galaxy, but I think the components in Galaxy are not
quite written at the right level of abstraction for me; Galaxy is also
more focused on the GUI than I want.  I don't know anything about
Taverna, so I'm going to look into that a bit more.  And, inevitably,
we'll be writing some of our own software in this area, too.&lt;/p&gt;
&lt;p&gt;Overall, I'm really interested in workflow approaches that let me transition
seamlessly between &lt;a href="http://ivory.idyll.org/blog/dec-11/is-discovery-science-really-bogus.html" &gt;discovery science&lt;/a&gt; and "firing for effect" for hypothesis-driven science.&lt;/p&gt;
&lt;p&gt;A few more specific thoughts:&lt;/p&gt;
&lt;p&gt;In the area of metagenomics (one of my research focuses at the
moment), it would be great to see &lt;a href="http://img.jgi.doe.gov/cgi-bin/m/main.cgi" &gt;img/m&lt;/a&gt;, &lt;a href="http://camera.calit2.net/" &gt;camera&lt;/a&gt;, and &lt;a href="http://metagenomics.anl.gov/" &gt;MG-RAST&lt;/a&gt; move towards a "broken out" workflow
that lets semi-sophisticated computational users (hi mom!) run their
stuff on the Amazon Cloud and on private HPCs or clouds.  While I
appreciate hosted services, there are many drawbacks to them, and I'd love
to get my hands on the guts of those services.  (I'm
sure the MG-RAST folk would like me to note that they are moving
towards making their pipeline more usable outside of Argonne:
&lt;a href="https://github.com/MG-RAST/MG-RAST-pipeline" &gt;so noted&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;In the comments to my post on &lt;a href="http://ivory.idyll.org/blog/dec-11/four-reasons-i-wont-use-your-data-analysis-pipeline.html" &gt;Four reasons I won't use your data
analysis pipeline&lt;/a&gt;,
Andrew Davison reminds me of VisTrails, which some people at the MS
workshop recommended.&lt;/p&gt;
&lt;p&gt;I met David De Roure from &lt;a href="http://www.myexperiment.org/" &gt;myExperiment&lt;/a&gt; at the MS workshop.  To quote,
"myExperiment makes it easy to find, use, and share scientific
workflows and other Research Objects, and to build communities."&lt;/p&gt;
&lt;p&gt;David put me in touch with Carole Goble who is involved with
&lt;a href="http://www.taverna.org.uk/" &gt;Taverna&lt;/a&gt;.  Something to look at.&lt;/p&gt;
&lt;p&gt;In the &lt;a href="http://pag.confex.com/pag/xx/webprogrampreliminary/Session1139.html" &gt;cloud computing workshop&lt;/a&gt;
I organized at the Planet and Animal Genome conference this January, I
will get a chance to buttonhole one of the Galaxy Cloud developers.  I
hope to make the most of this opportunity ;).&lt;/p&gt;
&lt;p&gt;It'd be interesting to do some social science research on what
difficulties users encounter when they attempt to use workflow
engines.  A lot of money goes into developing them, apparently, but at
least in bioinformatics they are not widely used.  Why?  This is sort
of in line with Greg Wilson's Software Carpentry and the wonderfully
named blog &lt;a href="http://www.neverworkintheory.org/" &gt;It will never work in theory&lt;/a&gt;: rather than guessing randomly
at what technical directions need to be pursued, why not study it
empirically?  It is increasingly obvious to me that improving
computational science productivity has more to do with lowering
learning barriers and changing other societal or cultural issues than
with a simple lack of technology, and figuring out how (and if)
appropriate technology could be integrated with the right incentives
and teaching strategy is pretty important.&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;
&lt;p&gt;p.s. Special thanks to Kenji Takeda and Tony Hey for inviting me to the
workshop, and especially for paying my way.  'twas really interesting!&lt;/p&gt;
&lt;/div&gt;</description>
    </item>
    <item>
      <pubDate>Mon, 14 Mar 2011 02:11:42 GMT</pubDate>
      <title>Trying out 'cram'</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=461</link>
      <guid>http://ivory.idyll.org/blog/mar-11/trying-out-cram</guid>
      <description>&lt;div class="document"&gt;
&lt;p&gt;I desperately need something to run and test things at the command
line, both for course documentation (think &amp;quot;doctest&amp;quot; but with shell
prompts) and for script testing (as part of scientific pipelines).  At
the 2011 testing-in-python BoF, Augie showed us &lt;a class="reference" href="http://bitheap.org/cram/" &gt;cram&lt;/a&gt;, which is the mercurial project's
internal test code ripped out for the hoi polloi to use.&lt;/p&gt;
&lt;p&gt;Step zero: wonder-twin-powers activate a new virtualenv!&lt;/p&gt;
&lt;pre class="literal-block"&gt;
% virtualenv e
% . e/bin/activate
&lt;/pre&gt;
&lt;p&gt;Step one: install!&lt;/p&gt;
&lt;pre class="literal-block"&gt;
% pip install cram
&lt;/pre&gt;
&lt;p&gt;... that just works -- always a good sign!&lt;/p&gt;
&lt;p&gt;OK, let's test the bejeezus out of 'ls'.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
% mkdir cramtest
% cd cramtest
&lt;/pre&gt;
&lt;p&gt;Next, I put&lt;/p&gt;
&lt;pre class="literal-block"&gt;
  $ ls
&lt;/pre&gt;
&lt;p&gt;into a file, 'ls.t'.  Be careful -- you apparently need &lt;em&gt;exactly&lt;/em&gt; two spaces before
the $ or it doesn't recognize it as a test.&lt;/p&gt;
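&lt;p&gt;To make the indentation rule concrete, here is a minimal Python sketch of
how each line of a '.t' file seems to get classified.  This is my reading of
the behavior, not cram's actual code, and the function name 'classify_line'
is made up for illustration.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
def classify_line(line):
    """Classify one line of a cram-style .t file (illustrative only)."""
    if line.startswith("  $ "):
        # exactly two spaces, a dollar sign, and a space: run as a command
        return "command"
    if line.startswith("  "):
        # any other line indented by two spaces: expected output
        return "expected output"
    # everything else is treated as free-form commentary
    return "comment"


for line in ["Listing the directory:", "  $ ls", "  testme", " $ ls", "$ ls"]:
    print(repr(line), "is", classify_line(line))
```
&lt;/pre&gt;
&lt;p&gt;Under these assumptions, a '$ ls' with zero or one leading space is
silently treated as commentary, which would explain a test file that runs
nothing and still passes.&lt;/p&gt;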
&lt;p&gt;Now, I run:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
% cram ls.t
&lt;/pre&gt;
&lt;p&gt;and I get&lt;/p&gt;
&lt;pre class="literal-block"&gt;
.
# Ran 1 tests, 0 skipped, 0 failed.
&lt;/pre&gt;
&lt;p&gt;Awesome!  A dot!&lt;/p&gt;
&lt;p&gt;The only problem with this is that when I run 'ls' myself, I see:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
ls.t    ls.t~
&lt;/pre&gt;
&lt;p&gt;Hmm.&lt;/p&gt;
&lt;p&gt;As a test of the cram test software, let's modify the file 'ls.t' to contain a
clearly broken test, rather than an empty one:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
  $ ls
  there is nothing here to see
&lt;/pre&gt;
&lt;p&gt;and run cram again.  Now I get&lt;/p&gt;
&lt;pre class="literal-block"&gt;
!
--- /Users/t/dev/cramtest/ls.t
+++ /Users/t/dev/cramtest/ls.t.err
&amp;#64;&amp;#64; -1,2 +1,1 &amp;#64;&amp;#64;
   $ ls
-  there is nothing here to see

# Ran 1 tests, 0 skipped, 1 failed.
&lt;/pre&gt;
&lt;p&gt;OK, so I can make it break -- excellent!  Cram comes advertised with
the ability to fix its own tests by replacing broken output with
actual output; let's see what happens, shall we?&lt;/p&gt;
&lt;pre class="literal-block"&gt;
% cram -i ls.t

!
--- /Users/t/dev/cramtest/ls.t
+++ /Users/t/dev/cramtest/ls.t.err
&amp;#64;&amp;#64; -1,2 +1,1 &amp;#64;&amp;#64;
   $ ls
-  there is nothing here to see
Accept this change? [yN] y
patching file /Users/t/dev/cramtest/ls.t
Reversed (or previously applied) patch detected!  Assume -R? [n] y
Hunk #1 succeeded at 1 with fuzz 1.

# Ran 1 tests, 0 skipped, 1 failed.
% more ls.t
$ ls
there is nothing here to see
there is nothing here to see
&lt;/pre&gt;
&lt;p&gt;OK, so, first, wtf is the whole reversed patch detected nonsense?  Sigh.
And second, where's the output from 'ls' going!?&lt;/p&gt;
&lt;p&gt;Hmm, maybe cram is setting up a temp directory?  That would explain a lot,
and would also be a very sensible approach.  It's not mentioned explicitly
on the front page, but if you read into it a bit, it looks likely.  OK.&lt;/p&gt;
&lt;p&gt;Let's modify 'ls.t' to create a file:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
  $ touch testme
  $ ls
&lt;/pre&gt;
&lt;p&gt;and run it...&lt;/p&gt;
&lt;pre class="literal-block"&gt;
% cram ls.t
!
--- /Users/t/dev/cramtest/ls.t
+++ /Users/t/dev/cramtest/ls.t.err
&amp;#64;&amp;#64; -1,3 +1,4 &amp;#64;&amp;#64;
   $ touch testme
   $ ls
+  testme


# Ran 1 tests, 0 skipped, 1 failed.
&lt;/pre&gt;
&lt;p&gt;Ah-hah!  Now we're getting somewhere!  Fix the test by making 'ls.t' read
like so:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
  $ touch testme
  $ ls
  testme
&lt;/pre&gt;
&lt;p&gt;and run:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
% cram ls.t
.
# Ran 1 tests, 0 skipped, 0 failed.
&lt;/pre&gt;
&lt;p&gt;Awesome!  Dot-victory ho!&lt;/p&gt;
&lt;p&gt;Now let's do something a bit more interesting: check out and run my
PyCon 2011 talk code for ngram graphs.  Starting with this in 'khmer-ngram.t',&lt;/p&gt;
&lt;pre class="literal-block"&gt;
  $ git clone git://github.com/ctb/khmer-ngram.git
  $ cd khmer-ngram
  $ ls
  $ python run-doctests.py basic.txt
&lt;/pre&gt;
&lt;p&gt;I run 'cram khmer-ngram.t' and get&lt;/p&gt;
&lt;pre class="literal-block"&gt;
!
--- /Users/t/dev/cramtest/khmer-ngram.t
+++ /Users/t/dev/cramtest/khmer-ngram.t.err
&amp;#64;&amp;#64; -1,4 +1,15 &amp;#64;&amp;#64;
   $ git clone git://github.com/ctb/khmer-ngram.git
+  Initialized empty Git repository in /private/(yada, yada)
   $ cd khmer-ngram
   $ ls
+  basic.html
+  basic.txt
+  data
+  graphsize-book.py
+  hash.py
+  load-book.py
+  run-doctests.py
+  shred-book.py
   $ python run-doctests.py basic.txt
+  ... running doctests on basic.txt
+  *** SUCCESS ***

# Ran 1 tests, 0 skipped, 1 failed.
&lt;/pre&gt;
&lt;p&gt;After getting cram to fix the file (using -i), and re-running cram, it now
chokes at exactly one place; betcha you can guess where...:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
!
--- /Users/t/dev/cramtest/khmer-ngram.t
+++ /Users/t/dev/cramtest/khmer-ngram.t.err
&amp;#64;&amp;#64; -1,5 +1,5 &amp;#64;&amp;#64;
   $ git clone git://github.com/ctb/khmer-ngram.git
-  Initialized empty Git repository in /private/(yada, yada)
+  Initialized empty Git repository in /private/(different yada)
   $ cd khmer-ngram
   $ ls
   basic.html

# Ran 1 tests, 0 skipped, 1 failed.
&lt;/pre&gt;
&lt;p&gt;Right.  How do you deal with output that does change unpredictably?
Easy!  Throw in a wildcard regexp like so&lt;/p&gt;
&lt;pre class="literal-block"&gt;
  Initialized empty Git repository in .* (re)
&lt;/pre&gt;
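&lt;p&gt;The idea behind the '(re)' marker can be sketched in a few lines of
Python.  This is an illustration of the matching rule as I understand it, not
cram's real implementation: the function name 'line_matches', the choice of
a full-line match, and the example path are all assumptions.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
import re

RE_SUFFIX = " (re)"


def line_matches(expected, actual):
    """Return True if an actual output line satisfies an expected line."""
    if expected.endswith(RE_SUFFIX):
        # strip the marker and treat the rest as a regular expression
        pattern = expected[:-len(RE_SUFFIX)]
        # fullmatch: the pattern has to cover the entire output line
        return re.fullmatch(pattern, actual) is not None
    # no marker: the line must match literally
    return expected == actual


expected = "Initialized empty Git repository in .* (re)"
# hypothetical temp path, standing in for whatever git prints on a given run
print(line_matches(expected, "Initialized empty Git repository in /private/tmp/a/"))
print(line_matches(expected, "fatal: repository not found"))
print(line_matches("basic.html", "basic.html"))
```
&lt;/pre&gt;
&lt;p&gt;This prints True, False, True: the wildcard absorbs the changing path,
while lines without the marker still have to match exactly.&lt;/p&gt;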
&lt;p&gt;My whole khmer-ngram.t file now looks like this:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
  $ git clone git://github.com/ctb/khmer-ngram.git
  Initialized empty Git repository in .* (re)
  $ cd khmer-ngram
  $ ls
  basic.html
  basic.txt
  data
  graphsize-book.py
  hash.py
  load-book.py
  run-doctests.py
  shred-book.py
  $ python run-doctests.py basic.txt
  ... running doctests on basic.txt
  *** SUCCESS ***
&lt;/pre&gt;
&lt;p&gt;And I can run cram on it without a problem:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
.
# Ran 1 tests, 0 skipped, 0 failed.
&lt;/pre&gt;
&lt;p&gt;Great!&lt;/p&gt;
&lt;p&gt;I love the regexp fix, too; none of this BS that doctest forces upon you.&lt;/p&gt;
&lt;p&gt;So, the next question: how do multiple tests work?  If you look above,
you can see that it's running all the commands as one test.  Logically
you should be able to just separate out the block of text and make it
into multiple tests... let's try adding&lt;/p&gt;
&lt;pre class="literal-block"&gt;
I'll add in another test:

  $ ls
&lt;/pre&gt;
&lt;p&gt;to the khmer-ngram.t file; does that work?  It looks promising:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
!
--- /Users/t/dev/cramtest/khmer-ngram.t
+++ /Users/t/dev/cramtest/khmer-ngram.t.err
&amp;#64;&amp;#64; -17,3 +17,12 &amp;#64;&amp;#64;
 I'll add in another test:

   $ ls
+  basic.html
+  basic.txt
+  data
+  graphsize-book.py
+  hash.py
+  hash.pyc
+  load-book.py
+  run-doctests.py
+  shred-book.py

# Ran 1 tests, 0 skipped, 1 failed.
&lt;/pre&gt;
&lt;p&gt;and it sees two tests... but, after fixing the expected output using
'cram -i', I only get one test:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
.
# Ran 1 tests, 0 skipped, 0 failed.
&lt;/pre&gt;
&lt;p&gt;So it seems like a little internal inconsistency in cram here.  Two tests
when something's failing, one test when both are running.  No big deal
in the end.&lt;/p&gt;
&lt;p&gt;And... I have to admit, that's about all I need for testing/checking
course materials!  The cram test format is perfectly compatible with
ReStructuredText, so I can go in and write real documents in it, and
then test them.  Command line testing FTW?&lt;/p&gt;
&lt;p&gt;And (I just checked) I can even put in Python commands and run doctest
on the same file that cram runs on.  Awesome.&lt;/p&gt;
&lt;p&gt;Critique:&lt;/p&gt;
&lt;p&gt;The requirement for two spaces exactly before the $ was not obvious to
me, nor was the implicit (and silent, even in verbose mode) use of a
temp directory.  I wiped out my test file a few times by answering
&amp;quot;yes&amp;quot; to patching, too.  What was up with the 'reversed patch' foo??
And of course it'd be nice if the number of dots reflected something
more granular than the number of files run.  But heck, it mostly just
works!  I didn't even look at the source code at all!&lt;/p&gt;
&lt;p&gt;Verdict: a tentative 8/10 on the &amp;quot;Can titus use your testing tool?&amp;quot;
scale.&lt;/p&gt;
&lt;p&gt;I'll try using it in anger on a real project next time I need it, and
report back from there.&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;
&lt;p&gt;p.s. To try out my full cram test from above, grab the file from the
khmer-ngram repo at github; see:&lt;/p&gt;
&lt;p&gt;&lt;a class="reference" href="https://github.com/ctb/khmer-ngram/blob/master/cram-test.t" &gt;https://github.com/ctb/khmer-ngram/blob/master/cram-test.t&lt;/a&gt; .&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    <item>
      <pubDate>Fri, 11 Mar 2011 15:13:02 GMT</pubDate>
      <title>My new data analysis pipeline code</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=460</link>
      <guid>http://ivory.idyll.org/blog/mar-11/pipeline</guid>
      <description>&lt;div class="document"&gt;
&lt;p&gt;First, I write a recipe file, 'metagenome.recipe', laying out my
job description for, say, sequence trimming and assembly with Velvet:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
fasta_file soil-data.fa

qc_filter min_length=50 remove_Ns=true

graph_filter min_length=400

velvet_assemble k=33 min_length=1000 scaffolding=True
&lt;/pre&gt;
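&lt;p&gt;Nothing reads this format yet (that's rather the point of this post), but
a minimal sketch of a parser might look like the following.  The function
name 'parse_recipe' is hypothetical, and so is the rule that 'key=value'
words are options while bare words are positional arguments.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
RECIPE = """\
fasta_file soil-data.fa

qc_filter min_length=50 remove_Ns=true

graph_filter min_length=400

velvet_assemble k=33 min_length=1000 scaffolding=True
"""


def parse_recipe(text):
    """Turn recipe text into a list of (step, positional_args, options)."""
    steps = []
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue  # skip the blank lines between steps
        name, args = words[0], words[1:]
        positional = [a for a in args if "=" not in a]
        # option values are kept as strings; type conversion is left to each step
        options = dict(a.split("=", 1) for a in args if "=" in a)
        steps.append((name, positional, options))
    return steps


for name, positional, options in parse_recipe(RECIPE):
    print(name, positional, options)
```
&lt;/pre&gt;
&lt;p&gt;Keeping the values as plain strings sidesteps questions like whether
'true' and 'True' should mean the same thing; a real tool would have to
decide that.&lt;/p&gt;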
&lt;p&gt;Then I specify machine parameters, e.g. 'bigmem.conf':&lt;/p&gt;
&lt;pre class="literal-block"&gt;
[defaults]
n_threads=8

[graph_filtering]
use_mem=32gb

[velvet]
needs_mem=64gb
&lt;/pre&gt;
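&lt;p&gt;The config file is plain INI, so Python's standard 'configparser' module
can read it as-is.  Here is a hedged sketch, assuming that a step falls back
to the [defaults] section for any setting it doesn't define itself; the
helper name 'lookup' is made up for illustration.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
import configparser

CONF = """\
[defaults]
n_threads=8

[graph_filtering]
use_mem=32gb

[velvet]
needs_mem=64gb
"""


def lookup(config, section, key):
    """Read key from section, falling back to the [defaults] section."""
    if config.has_option(section, key):
        return config.get(section, key)
    # assumption: [defaults] supplies anything a section leaves out
    return config.get("defaults", key)


config = configparser.ConfigParser()
config.read_string(CONF)
print(lookup(config, "velvet", "needs_mem"))
print(lookup(config, "velvet", "n_threads"))
```
&lt;/pre&gt;
&lt;p&gt;This prints 64gb and then 8: the velvet section supplies its own memory
requirement and inherits the thread count.&lt;/p&gt;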
&lt;p&gt;And finally, I run the pipeline:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
% ak-run metagenome.recipe -c bigmem.conf
&lt;/pre&gt;
&lt;p&gt;If I have cloud access (and who doesn't?) I can tell the pipeline to
spin up and down nodes as needed:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
% ak-aws-run metagenome.recipe -c bigmem.conf
&lt;/pre&gt;
&lt;p&gt;(Bear in mind most of these tasks are multi-hour, if not multi-day, operations,
so I'm not too worried about optimizing machine use and re-use.)&lt;/p&gt;
&lt;p&gt;Hadoop jobs could be spawned underneath, depending on how each recipe
component was actually implemented.&lt;/p&gt;
&lt;p&gt;As for testing reproducibility of pipeline results, which is the
short-term motivation here, I can store results for regression
testing with later versions:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
% ak-run metagenome.recipe -c bigmem.conf --save-endpoint=/some/path
&lt;/pre&gt;
&lt;p&gt;and then compare:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
% ak-run --check-endpoint=/some/path
&lt;/pre&gt;
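&lt;p&gt;One simple way the save/check pair could work underneath is to hash every
output file and compare the hashes later.  The sketch below assumes that
byte-identical output is the right notion of "same result" (which may well be
too strict for some assemblers), and the names 'snapshot', 'save_endpoint' and
'check_endpoint', as well as the JSON manifest, are hypothetical.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
import hashlib
import json
from pathlib import Path


def snapshot(results_dir):
    """Map each file under results_dir to the SHA-256 of its contents."""
    root = Path(results_dir)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def save_endpoint(results_dir, endpoint_file):
    """Record the current results as the reference endpoint."""
    Path(endpoint_file).write_text(json.dumps(snapshot(results_dir), indent=2))


def check_endpoint(results_dir, endpoint_file):
    """Return the sorted names of files that were added, removed, or changed."""
    saved = json.loads(Path(endpoint_file).read_text())
    current = snapshot(results_dir)
    names = set(saved) | set(current)
    return sorted(n for n in names if saved.get(n) != current.get(n))
```
&lt;/pre&gt;
&lt;p&gt;After a trusted run, save_endpoint("results", "endpoint.json") records the
reference; after a later run, check_endpoint("results", "endpoint.json")
returns an empty list if nothing changed.  The endpoint file should live
outside the results directory so it doesn't end up in its own snapshot.&lt;/p&gt;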
&lt;p&gt;---&lt;/p&gt;
&lt;p&gt;Now, does anyone know of a package or packages that actually do this, so
I/we don't have to write it??&lt;/p&gt;
&lt;p&gt;See &lt;a class="reference" href="http://texttest.carmen.se/" &gt;texttest&lt;/a&gt; and &lt;a class="reference" href="http://www.ruffus.org.uk/tutorials/simple_tutorial/simple_tutorial.html" &gt;ruffus&lt;/a&gt;
for some of my inspiration/interest.&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    <item>
      <pubDate>Thu, 14 Oct 2010 22:11:35 GMT</pubDate>
      <title>The sky is falling! The sky is falling!</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=459</link>
      <guid>http://ivory.idyll.org/blog/oct-10/sky-is-falling</guid>
      <description>&lt;div class="document"&gt;
&lt;p&gt;I just parachuted in on (and heli'd out of?) the &lt;a class="reference" href="http://biomedcentral.cvent.com/EVENTS/Info/Custom.aspx?cid=20&amp;amp;e=89d8be73-d072-43f8-8e35-ec75c44b3a03" &gt;Beyond the Genome
conference&lt;/a&gt;
in Boston.  I gave a very brief workshop on using EC2 for sequence
analysis, which seemed well received.  (Mind you, virtually everything
possible went wrong, from lack of good network access to lack of
attendee computers to truncated workshop periods due to conference
overrun, but I'm used to the Demo Effect.)&lt;/p&gt;
&lt;p&gt;After attending the last bit of the conference, I think that &amp;quot;the
cloud&amp;quot; is actually a really good metaphor for what's happening in
biology these days.  We have an immense science-killing asteroid
heading for us (in the form of ridiculously vast amounts of sequence
data, a.k.a. &amp;quot;sequencing bonanza&amp;quot; -- our sequencing capacity is
doubling every 6-10 months), and we're mostly going about our daily
business because we can't see the asteroid -- it's hidden by the
clouds!  Hence &amp;quot;cloud computing&amp;quot;: computing in the absence of clear
vision.&lt;/p&gt;
&lt;p&gt;But no, seriously.  Our current software and algorithms simply won't
scale to the data requirements, even on the hardware of the future.  A
few years ago I thought that we really just needed better data
management and organization tools.  Then Chris Lee pointed out how
much a good algorithm could do -- cnestedlist, in &lt;a class="reference" href="http://pygr.org/" &gt;pygr&lt;/a&gt;, for doing fast interval queries on extremely
large databases.  That solved a lot of problems for me.  And then
next-gen sequencing data started hitting me and my lab, and kept on
coming, and coming, and coming... A few fun personal items since my
&lt;a class="reference" href="http://ivory.idyll.org/blog/may-10/grim-future-for-sequencing-centers.html" &gt;Death of Sequencing Centers post&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p class="first"&gt;we managed to assemble some 50 Gb of Illumina GA2 metagenomic data
using a novel research approach, and then...&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;...we received 1.6 Tb of HiSeq data from the &lt;a class="reference" href="http://www.jgi.doe.gov/" &gt;Joint Genome
Institute&lt;/a&gt; as part of the same Great
Plains soil sequencing project.  It's hard not to think that our
collaborators were saying &amp;quot;So, Mr. Smarty Pants -- you can develop
new approaches that work for 100 Gb, eh?  Try &lt;em&gt;this&lt;/em&gt; on for size!
BWAHAHAHAHAHAHA!&amp;quot;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;I've been working for the past two weeks to do a lossy (no, NOT
&amp;quot;lousy&amp;quot;) assembly of 1.5 billion reads (100 Gb) of mRNAseq Illumina data
from lamprey, using a derivative research approach to the one above,
and then...&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;...our collaborators at &lt;a class="reference" href="http://www.mountsinai.org/" &gt;Mt. Sinai Medical Center&lt;/a&gt; told us that that we could expect
200 Gb of lamprey mRNA HiSeq data from their next run.&lt;/p&gt;
&lt;p&gt;(See BWAHAHAHAHAHAHAHA, above.)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Honestly, I think most people &lt;em&gt;in&lt;/em&gt; genomics, much less biology, don't
appreciate the game-changing nature of the sequencing bonanza.  In
particular, I don't think they realize the rate at which it's scaling.
Lincoln Stein had a great slide in his talk at the BTG workshop showing
just how fast next-gen sequencing capacity is growing:&lt;/p&gt;
&lt;img alt="http://ivory.idyll.org/permanent/lstein-ngs-capacity.png" src="http://ivory.idyll.org/permanent/lstein-ngs-capacity.png" style="width: 300px;" /&gt;
&lt;p&gt;The blue line is hardware capacity, the yellow line is &amp;quot;first-gen&amp;quot;
sequencing (capillary electrophoresis), and the red line is next-gen
sequencing capacity.&lt;/p&gt;
&lt;p&gt;It helps if you realize that the Y axis is log scale.&lt;/p&gt;
&lt;p&gt;Heh, yeah.&lt;/p&gt;
&lt;p&gt;Now, reflect upon &lt;em&gt;this&lt;/em&gt;: many sequence analysis algorithms (notably,
assembly, but also multiple sequence alignment and really anything
that doesn't rely on a fixed-size &amp;quot;reference&amp;quot;) are supra-linear in
their scaling.&lt;/p&gt;
&lt;p&gt;Heh, yeah.&lt;/p&gt;
&lt;p&gt;We call this &amp;quot;Big Data&amp;quot;, yep yep.&lt;/p&gt;
&lt;p&gt;At the cloud computing workshop, I heard someone -- I won't say who,
because even though I picked specifically on them, it's a common
misconception -- compare computational biology to physics.  Physicists
and astronomers had to learn how to deal with Big Data, right?  So we
can, too! Yeah, but colliders and telescopes are &lt;em&gt;big&lt;/em&gt;, and
&lt;em&gt;expensive&lt;/em&gt;.  Sequencers?  Cheap.  Almost every research institution I
know has at least one, and often two or three.  Every &lt;em&gt;lab&lt;/em&gt; I know
either has some Gb-sized data set or is planning to generate 1-40 Gb
within the next year.  Take that graph above, and extrapolate to 2013
and beyond.  Yeah, that's right -- all your base belong to us,
physical scientists!  Current approaches are not going to scale well
for big projects, no matter what custom infrastructure you build or
rent.&lt;/p&gt;
&lt;p&gt;The closest thing to this dilemma that I've read about is in climate
modeling (see: &lt;a class="reference" href="http://www.easterbrook.ca/steve/?p=1933" &gt;Steve Easterbrook's blog, Serendipity&lt;/a&gt;).  However, I think the
difference with biology is that we're generating new scientific data,
not running modeling programs that generate our data.  Having been in
both situations, I can tell you that it's very different when your
data is not something you can decline to measure, or something you can
summarize and digest more intelligently while generating it.&lt;/p&gt;
&lt;p&gt;I've also heard people claim that this isn't a big problem compared
to, say, the problems that we face with Big Data on the Internet.  I
think the claim at the time was that &amp;quot;in biology, your data is more
structured, and so you haff vays of dealing with it&amp;quot;.  Poppycock!  The
unknown unknowns &lt;em&gt;dominate&lt;/em&gt;, everyone: we often &lt;em&gt;don't know&lt;/em&gt; what
we're looking for in large-scale biological data.  When we do know,
it's a lot easier to devise data analysis strategies; but when we
don't really know, people tend to run a &lt;em&gt;lot&lt;/em&gt; of different analyses,
looking looking looking.  So in many ways we end up with an added
polynomial-time exploratory computation scaling (trawling through N
databases with M half-arsed algorithms) on top of all the other
&amp;quot;stuff&amp;quot; (Big Data, poorly scaling algorithms).&lt;/p&gt;
&lt;p&gt;OK, OK, so the sky is falling.  What do we do about it?&lt;/p&gt;
&lt;p&gt;I don't see much short-term hope in cross-training more people,
despite my efforts in that direction (see: &lt;a class="reference" href="http://ivory.idyll.org/blog/jun-10/ngs-course-postmortem.html" &gt;next-gen course&lt;/a&gt;,
and &lt;a class="reference" href="http://ged.msu.edu/courses/2010-fall-cse-891/" &gt;the BEACON course&lt;/a&gt;).  Training is a
medium-term effort: necessary but not all that helpful in the short
term.&lt;/p&gt;
&lt;p&gt;It's not clear that &lt;a class="reference" href="http://www.nature.com/news/2010/101013/full/467775a.html" &gt;better computational science&lt;/a&gt; is a
solution to the sequencing bonanza.  Yes, most bioinformatics software
has problems, and I'm sure most analyses are wrong in many ways --
including ours, before you ask.  It's a general problem in scientific
computation, and it's aggravated by a lack of training, and we're
working on that, too, with things like &lt;a class="reference" href="http://swc.scipy.org/" &gt;Software Carpentry&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I do see two lights at the end of the tunnel, both spurred by our own
research and that of Michael Schatz (and really the whole &lt;a class="reference" href="http://www.cbcb.umd.edu/~salzberg/" &gt;Salzberg&lt;/a&gt;/&lt;a class="reference" href="http://www.cbcb.umd.edu/~mpop/" &gt;Pop&lt;/a&gt; gang) as well as &lt;a class="reference" href="http://www.mcs.anl.gov/about/people_detail.php?id=280" &gt;Narayan Desai's&lt;/a&gt; talk at the
BTG workshop.&lt;/p&gt;
&lt;p&gt;First, we need to &lt;em&gt;change the way analysis scales&lt;/em&gt; -- see esp. Michael
Schatz's work on &lt;a class="reference" href="http://www.cshl.edu/Faculty/schatz-michael" &gt;assembly in the cloud&lt;/a&gt;, and (soon, hopefully)
our own work on scaling metagenomic and mRNAseq assembly.  Michael's
code isn't available (tsk tsk) and ours is available but isn't
documented, published, or easy to use yet, but we can now do &amp;quot;exact&amp;quot;
assemblies of 100 Gb of metagenomic data, and we're moving towards
nearly-exact assemblies of arbitrarily large RNAseq and metagenomic
data sets.  (Yes, &amp;quot;arbitrary&amp;quot;.  Take THAT, JGI.)&lt;/p&gt;
&lt;p&gt;We will have to do this kind of algorithmic scaling on a case-by-case
basis, however.  I'm focused on certain kinds of sequence analysis,
personally, but there's a huge world of problems out there that will
need constant attention to scale them in the face of the new data.
And right now, I don't see too many CSE people focused on this, because
they don't see the need to scale to Tb.&lt;/p&gt;
&lt;p&gt;Second, Big Data and cloud computing are, combined, going to dynamite
the traditional HPC model and make it clear that our only
hope is to work &lt;em&gt;smarter&lt;/em&gt; and develop better algorithms, in
combination with scaling compute power.  How so?&lt;/p&gt;
&lt;p&gt;As Narayan has eloquently argued many times, it no longer makes sense
for most institutions to run their own HPC, if you take into account
the true costs of power, AC, and hardware.  The only reason it &lt;em&gt;looks&lt;/em&gt;
like HPCs work well is because of the way institutions play games with
funny money (a.k.a. &lt;a class="reference" href="http://en.wikipedia.org/wiki/Overhead_%28business%29" &gt;&amp;quot;overhead charges&amp;quot;&lt;/a&gt;), channeling
it to HPC behind the scenes - often with much politicking involved.
If, as a scientist, your compute is &amp;quot;free&amp;quot; or even heavily subsidized,
you tend not to think much about it.  But now that we have to scale
those clusters by 10x, 100x, or 1000x to deal with data sets hundreds or
millions of times as big, institutions will no longer be able to afford to
build their own clusters with funny money.  And they'll have to charge
scientists for the true computational cost of their work -- or
scientists will have to use the cloud.  Either way, people will be
exposed to how much it &lt;em&gt;really&lt;/em&gt; costs to run, say, BLAST against
100,000,000 short reads.  And soon thereafter they'll stop doing such
foolish things.&lt;/p&gt;
&lt;p&gt;In turn, this will lead to a significant effort to make better use of
hardware, either by building better algorithms or asking questions
more cleverly.  (Goodbye, BLAST!)  It hurts me to say that, because
I'm not algorithmically focused by nature; but if you want to know the
answer to a biological question, and you have the data, but existing
approaches can't handle it within your budget... what else are you
going to do but try to ask, or answer, the question more cleverly?
Narayan said something like &amp;quot;we'll have to start asking if $150/BLAST
is a &lt;em&gt;good deal&lt;/em&gt; or not&amp;quot; which, properly interpreted, makes the point
well: it's a great deal if you have $1000 and only one BLAST to do,
but what if you have 500 BLASTs?  And didn't budget for it?&lt;/p&gt;
&lt;p&gt;Fast, cheap, good.  Choose two.&lt;/p&gt;
&lt;p&gt;Better algorithms and more of a focus on their importance (due to the
exposure of true costs) are two necessary components to solving this
problem, and there are increasingly many people working on them.
So I think there are these two lights at the end of the tunnel for the
Big Data-in-biology challenges.  And probably there are some others
that I'm missing.  Of course, these two lights at the end of the
tunnel may be train headlights, but we can hope, right?&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;
&lt;p&gt;p.s. &lt;a class="reference" href="http://www.bioteam.net/company/leadership.html" &gt;Chris Dagdigian&lt;/a&gt; from BioTeam gave
an absolutely awesome talk on many of these issues, too.  Although he
seems more optimistic than I am, possibly because he's paid by the hour
:).&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    <item>
      <pubDate>Wed, 7 Jul 2010 16:08:14 GMT</pubDate>
      <title>A memory efficient way to remove low-abundance k-mers from large (metagenomic?) DNA data sets</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=458</link>
      <guid>http://ivory.idyll.org/blog/jul-10/kmer-filtering</guid>
      <description>&lt;div class="document"&gt;
&lt;p&gt;I've spent the last few weeks working on a simple solution to a
challenging problem in DNA sequence assembly, and I think we've got a
nice simple theoretical solution with an actual implementation.  I'd
be interested in comments!&lt;/p&gt;
&lt;div class="section"&gt;
&lt;h1&gt;&lt;a id="introduction" name="introduction" &gt;Introduction&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Briefly, the algorithmic challenge is this:&lt;/p&gt;
&lt;p&gt;We have a bunch of DNA sequences in (let's say) FASTA format,&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;850:2:1:1145:4509/1
CCGAGTCGTTTCGGAGGGACCCCGCCATCATACTCGGGGAATTCATCTGAAAGCATGATCATAGAGTCACCGAGCA
&amp;gt;850:2:1:1145:4509/2
AGCCAAGAGCACCCCAGCTATTCCTCCCGGACTTCATAACGTAACGGCCTACCTCGCCATTAAGACTGCGGCGGAG
&amp;gt;850:2:1:1145:14575/1
GACGCAAAAGTAATCGTTTTTTGGGAACATGTTTTCATCCTGATCATGTTCCTGCCGATTCTGATCTCGCGACTGG
&amp;gt;850:2:1:1145:14575/2
TAACGGGCTGAATGTTCAGGACCACGTTTACTACCGTCGGGTTGCCATACTTCAACGCCAGCGTGAAAAAGAACGT
&amp;gt;850:2:1:1145:2219/1
GAAGACAGAGTGGTCGAAAGCTATCAGACGCGATGCCTAACGGCATTTTGTAGGGCGTCTGCGTCAGACGCCAACC
&amp;gt;850:2:1:1145:2219/2
GAAGGCGGGTAATGTCCGACAAACATTTGACGTCAAAGCCGGCTTTTTTGTAGTGGGTTTGACTCTTTCGCTTCCG
&amp;gt;850:2:1:1145:5660/1
GATGGCGTCGTCCGGGTGCCCTGGTGGAAGTTGCGGGGATGCGGATTCATCCGGGACGCGCAGACGCAGGCGGTGG
&lt;/pre&gt;
&lt;p&gt;and we want to pre-filter these sequences so that only sequences
containing high-abundance DNA words of length k (&amp;quot;k-mers&amp;quot;)
remain. For example, given a set of DNA sequences, we might want to
remove any sequence that does not contain a k-mer present at least 5
times in the entire data set.&lt;/p&gt;
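&lt;p&gt;For small inputs, the exact version of this filter fits in a few
lines.  Here's a minimal Python 3 sketch, assuming everything fits in an
ordinary in-memory dictionary and ignoring reverse complements; the
function names are made up for illustration and are not taken from our
actual code:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def filter_by_abundance(seqs, k, min_count):
    """Keep sequences with at least one k-mer seen min_count or more times."""
    counts = Counter()
    for seq in seqs:                  # pass 1: count every k-mer
        counts.update(kmers(seq, k))
    return [seq for seq in seqs       # pass 2: filter on those counts
            if any(counts[km] >= min_count for km in kmers(seq, k))]


reads = ["ACGTACGT", "ACGTTTTT", "GGGGCCCC"]
kept = filter_by_abundance(reads, k=4, min_count=3)
# ACGT occurs three times across the first two reads, so those two are kept.
```
&lt;/pre&gt;
&lt;p&gt;The dictionary is exactly what stops this from scaling: it stores
every distinct k-mer, so memory grows with the novelty in the data.&lt;/p&gt;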
&lt;p&gt;This is very straightforward to do for small numbers of sequences, or
for small k.  Unfortunately, we are increasingly confronted by data sets
containing 250 million sequences (or more), and we would like to be
able to do this for large k (k &amp;gt; 20, at least).&lt;/p&gt;
&lt;p&gt;You can break the problem down into two basic steps: first,
counting k-mers; and second, filtering sequences based on those k-mer
counts.  It's not immediately obvious how you would parallelize this
task: the counting should be very quick (basically, it's I/O
bound) while the filtering is going to rely on wide-reaching lookups
that aren't localized to any subset of k-mer space.&lt;/p&gt;
&lt;p&gt;tl;dr? We've developed a way to do this for arbitrary k, in linear
time and constant memory, efficiently utilizing as many computers as
you have available.  It's open source and works today, but, uhh, could
use some documentation...&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section"&gt;
&lt;h1&gt;&lt;a id="digression-the-bioinformatics-motivation" name="digression-the-bioinformatics-motivation" &gt;Digression: the bioinformatics motivation&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;(You can skip this if you're not interested in the biological motivation ;)&lt;/p&gt;
&lt;p&gt;What we &lt;em&gt;really&lt;/em&gt; want to do with this kind of data is assemble it,
using a &lt;a class="reference" href="http://en.wikipedia.org/wiki/De_Bruijn_graph" &gt;De Bruijn graph approach&lt;/a&gt; a la &lt;a class="reference" href="http://genome.cshlp.org/content/18/5/821.full" &gt;Velvet&lt;/a&gt;, &lt;a class="reference" href="http://www.bcgsc.ca/platform/bioinfo/software/abyss" &gt;ABySS&lt;/a&gt;, or
&lt;a class="reference" href="http://soap.genomics.org.cn/soapdenovo.html" &gt;SOAPdenovo&lt;/a&gt;.
However, De Bruijn graph assemblers all rely on building a graph of overlapping
k-mers in memory, which means that their memory usage scales as a
function of the number of unique k-mers.  This is good in general, but
Bad in certain circumstances -- in particular, whenever the data set
you're trying to assemble has a lot of genomic novelty.  (See &lt;a class="reference" href="http://www.ncbi.nlm.nih.gov/pubmed/20211242" &gt;this
fantastic review&lt;/a&gt; and
my &lt;a class="reference" href="http://ged.msu.edu/angus/files/lecture5-assembly.pdf" &gt;assembly lecture&lt;/a&gt; for some
background here.)&lt;/p&gt;
&lt;p&gt;One particular circumstance in which De Bruijn graph-based assemblers
flail is in &lt;a class="reference" href="http://en.wikipedia.org/wiki/Metagenomics" &gt;metagenomics&lt;/a&gt;, the isolation and
sequencing of genetic material from &amp;quot;the wild&amp;quot;, e.g.  soil or the
human gut.  This is largely because the diversity of bacteria present
in soil is so huge (although the relatively high error rate of
next-gen platforms doesn't help).&lt;/p&gt;
&lt;p&gt;Prefiltering can help reduce this memory usage by removing erroneous
sequences as well as not-so-useful sequences.  First, any sequence
arising as the result of a sequencing error is going to be extremely
rare, and abundance filtering will remove those.  Second, genuinely
&amp;quot;rare&amp;quot; (shallowly-sequenced) sequences will generally not contribute
much to the assembly, and so removing them is a good heuristic for
reducing memory usage.&lt;/p&gt;
&lt;p&gt;We are currently playing with dozens (probably hundreds, soon) of gigabytes
of metagenomic data, and we really need to do this prefiltering in order
to have a chance at getting a useful assembly out of it.&lt;/p&gt;
&lt;p&gt;It's worth noting that this is in no way an original thought: in
particular, the Beijing Genomics Institute (BGI) did this kind of
prefiltering in their landmark Human Microbiome paper (&lt;a class="reference" href="http://www.nature.com/nature/journal/v464/n7285/full/nature08821.html" &gt;ref&lt;/a&gt;).
We want to do it, too, and the BGI wasn't obliging enough to give
us source code (booooooo hisssssssssssssss).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section"&gt;
&lt;h1&gt;&lt;a id="existing-approaches" name="existing-approaches" &gt;Existing approaches&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Existing approaches are inadequate to our needs, as far as we can tell
from a literature survey and private conversations.  Everyone seems
to rely on big-RAM machines, which are nice if you have them, but shouldn't
be necessary.&lt;/p&gt;
&lt;p&gt;There are two basic approaches.&lt;/p&gt;
&lt;p&gt;First, you can build a large table in memory, and then map k-mers into
it.  This involves writing a simple hash function that converts DNA
words into numbers.  However, this approach scales poorly with k: for
example, there are 4**k unique DNA sequences of length k (or roughly
(4**k) / 2 + (4**(k/2))/2, considering reverse complements).  So the table
for k=17 needs 4**17 entries -- that's 17 gb at 1 byte per counter,
which stretches the memory of cheaply available commodity hardware.
And we want to use bigger k than 17 -- most assemblers will be more
effective for longer k, because it's more specific.  (We've been using
k values between 30 and 70 for our assemblies, and we'd like to filter
on the same k.)&lt;/p&gt;
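&lt;p&gt;The arithmetic is easy to check.  The sketch below (Python 3, helper
names are mine) computes the direct-table size for k=17 and brute-forces
the reverse-complement count for tiny k.  One detail worth noting: the
extra 4**(k/2) term only shows up for even k, because an odd-length k-mer
can never equal its own reverse complement (the middle base would have to
be its own complement).&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(kmer):
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


def canonical_count(k):
    """Distinct k-mers once a k-mer and its reverse complement are merged."""
    palindromes = 4 ** (k // 2) if k % 2 == 0 else 0
    return (4 ** k + palindromes) // 2


def brute_force_count(k):
    """Enumerate all 4**k k-mers; only feasible for tiny k."""
    seen = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        seen.add(min(kmer, revcomp(kmer)))
    return len(seen)


table_entries = 4 ** 17              # 17,179,869,184 counters
table_gb = table_entries / 1e9       # about 17.2 at one byte per counter
halved_gb = canonical_count(17) / 1e9
```
&lt;/pre&gt;
&lt;p&gt;Merging reverse complements only buys a factor of two, and every extra
base in k costs a factor of four, so a direct table is out of reach well
before k=30.&lt;/p&gt;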
&lt;p&gt;Second, you can observe that k-mer space (for sufficiently large k) is
likely to be quite sparsely occupied -- after all, there are only so
many actual 30-mers present in a 100gb data set! So, you can do
something clever like use a tree representation of k-mers and then
attach counters to the end nodes of the tree (see, for example,
&lt;a class="reference" href="http://www.ncbi.nlm.nih.gov/pubmed/18976482" &gt;tallymer&lt;/a&gt;).  The
problem here is that you need to use pointers to connect nodes in the
tree, which means that while the tree size is going to scale with the
amount of novel k-mers -- ok! -- it's going to have a big constant in
front of it -- bad!  In our experience this has been prohibitive for
real metagenomic data filtering.&lt;/p&gt;
&lt;p&gt;These seem to be the two dominant approaches, although if you don't
need to actually &lt;em&gt;count&lt;/em&gt; the k-mers but only assess presence or
absence, you can use something like a &lt;a class="reference" href="http://en.wikipedia.org/wiki/Bloom_filter" &gt;Bloom filter&lt;/a&gt; -- for example, see
&lt;a class="reference" href="http://bioinformatics.oxfordjournals.org/cgi/content/full/26/13/1595" &gt;Classification of DNA sequences using a Bloom filter&lt;/a&gt;,
which uses Bloom filters to look for novel sequences (the exact
opposite of what we want to do here!).  References to other approaches
welcome...&lt;/p&gt;
&lt;p&gt;Note that you really, really, really want to avoid disk access by
keeping everything in memory.  These are ginormous data sets and we
would like to be able to quickly explore different parameters and
methods of sequence selection.  So we want to come up with a good counting
scheme for k-mers that involves relatively small amounts of memory and
as little disk access as possible.&lt;/p&gt;
&lt;p&gt;I think this is a really fun kind of problem, actually.  The more
clever you are, the more likely you are to come up with a completely
inappropriate data structure, given the amount of data and the basic
algorithmic requirements.  It demands KISS!  Read on...&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section"&gt;
&lt;h1&gt;&lt;a id="an-approximate-approach-to-counting" name="an-approximate-approach-to-counting" &gt;An approximate approach to counting&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;So, we've come up with a solution that scales with the amount of
genomic novelty, and efficiently uses your available memory.  It can
also make effective use of multiple computers (although not multiple
processors).  What is this magic approach?&lt;/p&gt;
&lt;p&gt;&lt;a class="reference" href="http://en.wikipedia.org/wiki/Hash_table" &gt;Hash tables&lt;/a&gt;.  Yep.  Map
k-mers into a fixed-size table (presumably one about as big as your
available memory), and have the table entries be counters for k-mer
abundance.&lt;/p&gt;
&lt;p&gt;This is an obvious solution, except for one problem: collisions.  The
big problem with hash tables is that you're going to have collisions,
wherein multiple k-mers are mapped into a single counting bin.  Oh
noes!  The traditional way to deal with this is to keep track of each
k-mer individually -- e.g. when there's a collision, use some sort of
data structure (like a linked list) to track the actual k-mers that
mapped to a particular bin.  But now you're stuck with using gobs of
memory to keep these structures around, because each collision
requires a new pointer of some sort.  It may be possible to get around
this efficiently, but I'm not smart enough to figure out how.&lt;/p&gt;
&lt;p&gt;Instead of becoming smarter, I reconfigured my brain to look at the problem
differently.  (Think Different?)&lt;/p&gt;
&lt;p&gt;The big realization here is that collisions &lt;strong&gt;may not matter&lt;/strong&gt; all
that much.  Consider the situation where you're filtering on a minimum
abundance of 5 -- that is, you want at least one k-mer in each
sequence to be present five or more times across the data set.  Well,
if you look at the hash bin for a specific k-mer and see the number
&lt;strong&gt;4&lt;/strong&gt;, you immediately know that whether or not there are any
collisions, that particular k-mer isn't present five or more times,
and can be discarded!  That is, the count for a particular hash bin is
the sum of the (non-negative) individual counts for k-mers mapping to
that bin, and hence that sum is an upper bound on any individual
k-mer's abundance.&lt;/p&gt;
&lt;img alt="http://ivory.idyll.org/permanent/kmer-hashtable.png" src="http://ivory.idyll.org/permanent/kmer-hashtable.png" style="width: 20%;" /&gt;
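&lt;p&gt;To make the upper-bound argument concrete, here is a toy Python 3
version of a fixed-size counting table.  It is a sketch under my own
assumptions (a plain modulo hash, one saturating byte per bin, made-up
class and function names), not the real implementation:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmers(seq, k):
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_to_int(kmer):
    """Read the k-mer as a base-4 number."""
    value = 0
    for base in kmer:
        value = value * 4 + CODES[base]
    return value


class CountingTable:
    def __init__(self, size):
        self.size = size
        self.bins = bytearray(size)       # one byte per counter

    def add(self, kmer):
        i = kmer_to_int(kmer) % self.size
        if self.bins[i] != 255:           # saturate rather than overflow
            self.bins[i] += 1

    def get(self, kmer):
        """Never lower than the true count (up to saturation)."""
        return self.bins[kmer_to_int(kmer) % self.size]


reads = ["ACGTACGT", "ACGTTTTT", "GGGGCCCC"]
table = CountingTable(size=7)             # deliberately tiny, to force collisions
for read in reads:
    for kmer in kmers(read, 4):
        table.add(kmer)

upper_bound = table.get("ACGT")           # the true count is 3
```
&lt;/pre&gt;
&lt;p&gt;Collisions can only push a bin's value up, so a bin that reads below
the threshold is a safe rejection; a bin at or above it might be a false
positive.&lt;/p&gt;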
&lt;p&gt;The tradeoff is false positives: as k-mer space fills up, the hash
table is going to be hit by more and more collisions.  In turn, more
of the k-mers are going to be falsely reported as being over the
minimum abundance, and more of the sequences will be kept.  You can
deal with this partly by doing iterative filtering with different
prime hash table sizes, but that will be of limited utility if you
have a really saturated hash table.&lt;/p&gt;
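&lt;p&gt;Iterative filtering can be sketched as repeated passes over the
surviving sequences, each with a different prime table size, so that
k-mers that collided in one round are unlikely to collide again in the
next.  This is a toy Python 3 illustration with hypothetical function
names and a simple modulo hash, not our benchmark script:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_bins(seq, k, size):
    """Yield the table bin for each k-mer in seq."""
    for i in range(len(seq) - k + 1):
        value = 0
        for base in seq[i:i + k]:
            value = value * 4 + CODES[base]
        yield value % size


def filter_once(seqs, k, min_count, size):
    """One approximate pass: count into size bins, then filter."""
    bins = [0] * size
    for seq in seqs:
        for b in kmer_bins(seq, k, size):
            bins[b] += 1
    return [s for s in seqs
            if any(bins[b] >= min_count for b in kmer_bins(s, k, size))]


def filter_iteratively(seqs, k, min_count, prime_sizes):
    """Re-filter the survivors with each table size in turn."""
    for size in prime_sizes:
        seqs = filter_once(seqs, k, min_count, size)
    return seqs


reads = ["ACGTACGT", "ACGTTTTT", "GGGGCCCC", "TTGACCAT"]
survivors = filter_iteratively(reads, k=4, min_count=3, prime_sizes=[5, 7, 11])
exact = filter_once(reads, k=4, min_count=3, size=4 ** 4)   # no collisions possible
```
&lt;/pre&gt;
&lt;p&gt;Every sequence that the exact filter keeps also survives every
approximate round, because all the reads sharing its abundant k-mer
survive with it; the rounds can only trim false positives.&lt;/p&gt;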
&lt;p&gt;Note that the counting and filtering is still O(N), while the false
positives grow as a function of k-mer space occupancy -- which is to
say that the more diversity you have in your data, the more trouble
you're in.  That's going to be a problem no matter the approach, however.&lt;/p&gt;
&lt;p&gt;You can see a simple example of approximate and iterative filtering
here, for k=15 (a k-mer space of approximately 1 billion) and hash
table sizes ranging from 50m to 100m.  The curves all approach the
correct final number of reads (which we can calculate exactly, for
this data set) but some take longer than others.  (The code for this
is &lt;a class="reference" href="http://github.com/ctb/khmer/blob/master/scripts/ctb-iterative-bench-2.py" &gt;here&lt;/a&gt;.)&lt;/p&gt;
&lt;img alt="http://ivory.idyll.org/permanent/kmer-filtering-iterative.png" src="http://ivory.idyll.org/permanent/kmer-filtering-iterative.png" style="width: 50%;" /&gt;
&lt;/div&gt;
&lt;div class="section"&gt;
&lt;h1&gt;&lt;a id="making-use-of-multiple-computers" name="making-use-of-multiple-computers" &gt;Making use of multiple computers&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;While I was working out the details of the (really simple) approximate
counting approach, I was bugged by my inability to make effective use
of all the juicy computers to which I have access.  I don't have many
&lt;em&gt;big&lt;/em&gt; computers, but I do have lots of medium sized computers with
memory in the ~10-20 gb range.  How can I use them?&lt;/p&gt;
&lt;p&gt;This is not a pleasantly parallel problem, for at least two reasons.
First, it's I/O bound -- reading the DNA sequences in takes more time
than breaking them down into k-mers and counting them.  And since it's
also memory bound -- you want to use the biggest hash table possible
to minimize collisions -- it doesn't seem like using multiple processors
on a single machine will help.  Second, the hash table is going to be too big to
effectively share between computers: 10-20 gb of data per computer is
too much to send over the network.  So what do we do?&lt;/p&gt;
&lt;p&gt;I was busy explaining to &lt;a class="reference" href="http://en.wikipedia.org/wiki/Charles_Ofria" &gt;a colleague&lt;/a&gt; that this was
impossible -- always a useful way for me to solve problems ;) -- when
it hit me that you could use &lt;em&gt;multiple chassis&lt;/em&gt; (RAM + CPU + disk) to
decrease the false positive rate with only a small amount of
communication overhead.  Basically, my approach is to partition k-mer
space into Z subsets (one for each chassis) and have each computer count
only the k-mers that fall into its partition.  Then, after the
counting stage, each chassis records a max k-mer abundance per
partition per sequence, and communicates &lt;em&gt;that&lt;/em&gt; to a central
node.  This central node in turn calculates the max k-mer abundance
across all partitions.&lt;/p&gt;
&lt;p&gt;The partitioning trick is a more general form of the specific 'prefix'
approach -- that is, separately count and get max abundances/sequence
for all k-mers starting with 'A', then 'C', then 'G', and then 'T'.
For each sequence you will then have four values (the max
abundance/sequence for k-mers starting with 'A', 'C', 'G', and 'T'),
which can be cheaply stored on disk or in memory.  Now you can do a
single-pass integration and figure out what sequences to keep.&lt;/p&gt;
&lt;p&gt;This approach effectively multiplies your working
memory by a factor of Z, decreasing your false positive rate
equivalently - no mean feat!&lt;/p&gt;
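&lt;p&gt;Here is a single-process Python 3 sketch of the partitioning idea.
The assumptions are mine: partitions are assigned by taking the k-mer's
integer value modulo Z (the prefix scheme is the same thing with a
different assignment rule), each max is clamped to one byte, and the
function names are hypothetical.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_values(seq, k):
    """Yield each k-mer in seq as a base-4 integer."""
    for i in range(len(seq) - k + 1):
        value = 0
        for base in seq[i:i + k]:
            value = value * 4 + CODES[base]
        yield value


def node_max_abundances(seqs, k, size, part, num_parts):
    """One chassis: count only its partition, report one max per sequence."""
    bins = [0] * size
    for seq in seqs:
        for v in kmer_values(seq, k):
            if v % num_parts == part:
                bins[(v // num_parts) % size] += 1
    maxes = []
    for seq in seqs:
        counts = [bins[(v // num_parts) % size]
                  for v in kmer_values(seq, k) if v % num_parts == part]
        maxes.append(min(255, max(counts, default=0)))   # fits in one byte
    return maxes


def combine(per_node_maxes, min_count):
    """Central node: max across partitions, then a keep/ignore readmask."""
    overall = [max(column) for column in zip(*per_node_maxes)]
    return [m >= min_count for m in overall]


reads = ["ACGTACGT", "ACGTTTTT", "GGGGCCCC"]
Z = 4
tables = [node_max_abundances(reads, k=4, size=64, part=p, num_parts=Z)
          for p in range(Z)]
readmask = combine(tables, min_count=3)
```
&lt;/pre&gt;
&lt;p&gt;With Z=4 and 64 bins per node the four small tables together cover all
256 possible 4-mers without collisions, which is the sense in which Z
machines act like one table Z times the size.&lt;/p&gt;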
&lt;img alt="http://ivory.idyll.org/permanent/kmer-hashtable-par.png" src="http://ivory.idyll.org/permanent/kmer-hashtable-par.png" style="width: 20%;" /&gt;
&lt;img alt="http://ivory.idyll.org/permanent/kmer-filter-process-2.png" src="http://ivory.idyll.org/permanent/kmer-filter-process-2.png" style="width: 40%;" /&gt;
&lt;p&gt;The communication load is significant but not prohibitive: replicate a
read-only sequence data set across Z computers, and then communicate
max values (1 byte for each sequence) back -- 50-500 mb of data per
filtering round.  The result of each filtering round can be returned
to each node as a readmask against the already-replicated initial
sequence set, with one bit per sequence for ignore or process.  You can
even do it on a single computer, with a local hard drive, in multiple
iterations.&lt;/p&gt;
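&lt;p&gt;As a quick sanity check on those numbers, here is the per-node
traffic for a 250-million-sequence data set, assuming exactly one byte
of max abundance and one readmask bit per sequence (the function name is
just for illustration):&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
def round_traffic_bytes(num_sequences):
    """Bytes moved per filtering round for one node."""
    max_table = num_sequences * 1           # one byte of max abundance per sequence
    readmask = (num_sequences + 7) // 8     # one bit per sequence, rounded up to bytes
    return max_table, readmask


maxes, mask = round_traffic_bytes(250_000_000)
# maxes is 250,000,000 bytes (250 mb); mask is 31,250,000 bytes (about 31 mb)
```
&lt;/pre&gt;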
&lt;p&gt;You can see a simple in-memory implementation of this approach &lt;a class="reference" href="http://github.com/ctb/khmer/blob/master/python/khmer/split.py" &gt;here&lt;/a&gt;,
and some tests &lt;a class="reference" href="http://github.com/ctb/khmer/blob/master/python/test_split.py" &gt;here&lt;/a&gt;.
I've implemented this using readmask tables and min/max tables (serializable
data structures) more generally, too; see &amp;quot;the actual code&amp;quot;, below.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section"&gt;
&lt;h1&gt;&lt;a id="similar-approaches" name="similar-approaches" &gt;Similar approaches&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;By allowing for false positives, I've effectively turned the hash
table into a probabilistic data structure.  The closest analog I've
seen is the &lt;a class="reference" href="http://en.wikipedia.org/wiki/Bloom_filter" &gt;Bloom filter&lt;/a&gt; which records
presence/absence information using multiple hash functions for
arbitrary k.  The hash approach outlined above devolves into a
maximally efficient Bloom filter for fixed k when only
presence/absence information is recorded.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section"&gt;
&lt;h1&gt;&lt;a id="the-actual-code" name="the-actual-code" &gt;The actual code&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Theory and practice are the same, in theory.  In practice, of course,
they differ.  A whole host of minor interface and implementation
design decisions needed to be made.  The result can be seen at the
'khmer' project, here: &lt;a class="reference" href="http://github.com/ctb/khmer" &gt;http://github.com/ctb/khmer&lt;/a&gt;.  It's slim on
documentation, but there are some example scripts and a reasonable
amount of tests.  It requires nothing but Python 2.6 and a compiler;
nose if you want to run the tests.&lt;/p&gt;
&lt;p&gt;The implementation is in C++ with a Python wrapper, much like the
paircomp and motility software packages.&lt;/p&gt;
&lt;p&gt;It will filter 1m 70-mer sequences in about 45 seconds, and 80 million sequences
in less than an hour, on a 3 GHz Xeon with 16 gbs of RAM.&lt;/p&gt;
&lt;p&gt;Right now it's limited to k &amp;lt;= 32, because we encode each DNA k-mer in
a 64-bit 'long long'.&lt;/p&gt;
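&lt;p&gt;The limit falls out of the arithmetic: four bases need two bits
each, and 64 / 2 = 32.  The Python 3 sketch below shows one such packing;
the particular base-to-code mapping and the function names are
assumptions for illustration, not necessarily what the C++ code does.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"
MAX_K = 64 // 2          # two bits per base in a 64-bit word


def encode(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    if len(kmer) > MAX_K:
        raise ValueError("k-mer too long for a 64-bit word")
    value = 0
    for base in kmer:
        value = value * 4 + CODES[base]
    return value


def decode(value, k):
    """Unpack an integer back into a k-mer of length k."""
    bases = []
    for _ in range(k):
        value, code = divmod(value, 4)
        bases.append(BASES[code])
    return "".join(reversed(bases))


largest = encode("T" * 32)     # every bit set: 2**64 - 1
roundtrip = decode(encode("GATTACA"), 7)
```
&lt;/pre&gt;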
&lt;p&gt;You can see an exact filtering script here:
&lt;a class="reference" href="http://github.com/ctb/khmer/blob/master/scripts/filter-exact.py" &gt;http://github.com/ctb/khmer/blob/master/scripts/filter-exact.py&lt;/a&gt; .  By
varying the hash table size (second arg to new_hashtable) you can turn
it into an &lt;em&gt;inexact&lt;/em&gt; filtering script quite easily.&lt;/p&gt;
&lt;p&gt;Drop me a note if you want help using the code, or a specific example.
We're planning to write documentation, doctests, examples, robust
command line scripts, etc. prior to publication, but for now we're
primarily trying to use it.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section"&gt;
&lt;h1&gt;&lt;a id="other-uses" name="other-uses" &gt;Other uses&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;It has not escaped our notice that you can use this approach for a bunch of
other k-mer based problems, such as repeat discovery and calculating abundance
distributions... but right now we're focused on actually using it for
filtering metagenomic data sets prior to assembly.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section"&gt;
&lt;h1&gt;&lt;a id="acks" name="acks" &gt;Acks&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;I talked a fair bit with Prof. Rich Enbody about this approach, and he
did a wonderful job of double-checking my intuition.  Jason Pell and
Qingpeng Zhang are co-authors on this project; in particular, Jason
helped write the C++ code, and Qingpeng has been working with k-mers
in general and tallymer specifically on some other projects, and
provided a lot of background help.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section"&gt;
&lt;h1&gt;&lt;a id="conclusions" name="conclusions" &gt;Conclusions&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;We've taken a previously hard problem and made it practically
solvable: we can filter ~88m sequences in a few hours on a
single-processor computer with only 16gb of RAM.  This seems useful.&lt;/p&gt;
&lt;p&gt;Our main challenge now is to come up with a better hashing function.
Our current hash function is not uniform, even for a
uniform distribution of k-mers, because of the way it handles reverse
complements.&lt;/p&gt;
&lt;p&gt;The approach scales reasonably nicely.  Doubling the amount of data
doubles the compute time.  However, if you have double the novelty,
you'll need to do double the partitions to keep the same false
positive rate, in which case you quadruple the compute time.  So it's
O(N^2) for the worst case (unending novelty) but should be something
better for real-world cases.  That's what we'll be looking at over
the next few months.&lt;/p&gt;
&lt;p&gt;I haven't done enough background reading to figure out if our approach
is particularly novel, although in the space of bioinformatics it seems
to be reasonably so.  That's less important than actually solving our
problem, but it would be nice to punch the &amp;quot;publication&amp;quot; ticket if possible.
We're thinking of writing it up and sending it to BMC Bioinformatics,
although suggestions are welcome.&lt;/p&gt;
&lt;p&gt;It would be particularly ironic if the first publication from my lab
was this computer science-y, given that I have no degrees in CS and
am in the CS department by kind of a fluke of the hiring process ;).&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    <item>
      <pubDate>Tue, 6 Jul 2010 04:10:33 GMT</pubDate>
      <title>Teaching scientists how to use computers - hub </title>
      <link>http://www.advogato.org/person/titus/diary.html?start=457</link>
      <guid>http://ivory.idyll.org/blog/jul-10/swc-hub-spokes</guid>
      <description>&lt;div class="document"&gt;
&lt;p&gt;After my recent &lt;a class="reference" href="http://ivory.idyll.org/blog/jun-10/ngs-course-postmortem" &gt;next-gen sequencing course&lt;/a&gt;, which
was supposed to tie into the whole &lt;a class="reference" href="http://software-carpentry.org/" &gt;software carpentry (SWC) effort&lt;/a&gt; but didn't really succeed in doing
so the first time through, I started thinking about the Right Way to
tie in the SWC material.  In particular, how do you both motivate
scientists to look at the SWC material, and (re)direct people to the
appropriate places?&lt;/p&gt;
&lt;p&gt;It's not clear that a Plan is in place.  Greg Wilson seems to assume
that scientists will find at least some of the material immediately
obviously usable, but I think he's targeted at a more sophisticated
population of users -- physicists and the like.  My experience with
bioinformaticians, however, is that they either come from straight
biology backgrounds (with little or no computational background and
rather limited on-the-job training), straight computation backgrounds
(with very little biology), or physics (gonzo programming skills, but
no biology).  The last group fits neatly into the SWC fold, but they (we ;)
are rare in biology.  I think computer scientists and biologists are
going to need guidance to dive into SWC at an early enough time for it
to be the most rewarding.&lt;/p&gt;
&lt;p&gt;So, what's a good model for SWC to guide scientists from multiple
disciplines into the appropriate material?  It's obviously not going
to be possible to have Greg et al. tailor the SWC material to individual
subgroups -- he doesn't know much (any ;) biology, for example.  I don't
have the time, patience, or skillset to integrate my next-gen notes
into his SWC material, either.  So, instead, I propose the hub &amp;amp; spokes
model!&lt;/p&gt;
&lt;img alt="http://ivory.idyll.org/permanent/hub-spokes.png" src="http://ivory.idyll.org/permanent/hub-spokes.png" /&gt;
&lt;p&gt;Here, the &amp;quot;hub&amp;quot; is the SWC material, and the spokes are all of the
individual disciplines.&lt;/p&gt;
&lt;p&gt;Basically, the idea is that individual sites (like my own ANGUS site
on next-gen sequencing, &lt;a class="reference" href="http://ged.msu.edu/angus/" &gt;http://ged.msu.edu/angus/&lt;/a&gt;) will develop their
own field-specific content, and then link from that content into the
SWC notes.  This way the experts with feet in both fields can link
appropriately, and Greg only has to worry about making the central
content general -- which he's already doing quite well, I think.  Yes,
it's more work than asking Greg to do it, but frankly I'm going to be
happy with a kick-ass central SWC site to which I can link -- right
now it's dismayingly challenging to teach students why this stuff
matters and how to learn it.&lt;/p&gt;
&lt;p&gt;From the psychosocial perspective, it's a great fit.  Students can get
hands-on tutorials on how to do X, Y, and Z in their own field -- and
then connect into the SWC material to learn the background, or
additional computational techniques in support of it.  Motivation first!&lt;/p&gt;
&lt;p&gt;What do we need SWC to do to support this?  Not much -- basically, the
central SWC notes need to be stable enough (with permalinks) that I
can link into them from my own site(s) and not have to worry about the
links becoming broken or (worse) silently migrating in topic.  There
are other solutions (wholesale incorporation of SWC into my own notes,
for example) but I think the permalink idea is the most
straightforward.  Oh, and we should have a Greg-gets-hit-by-a-bus plan,
too; at some point he's going to move on from SWC -- perhaps when his
lovely wife decides she's had enough and he needs to stop obsessing over
it, or perhaps under more dire circumstances ;( -- and it would be good to
know who holds the domain and site keys.&lt;/p&gt;
&lt;p&gt;Thoughts?  Comments?&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    <item>
      <pubDate>Thu, 24 Jun 2010 19:13:46 GMT</pubDate>
      <title>Which functional programming language(s) should we teach?</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=456</link>
      <guid>http://ivory.idyll.org/blog/jun-10/functional-programming-languages</guid>
      <description>&lt;div class="document"&gt;
&lt;p&gt;Laurie Dillon just posted the SIGPLAN education board article on &lt;a class="reference" href="http://mt4.acm.org/educationboard/2010/06/why-undergraduates-should-learn-the-principles-of-programming-languages.html" &gt;Why
Undergraduates Should Learn the Principles of Programming Languages&lt;/a&gt;
to our faculty mailing list at the &lt;a class="reference" href="http://www.cse.msu.edu" &gt;MSU Computer Science department&lt;/a&gt;.  One question that came up in the ensuing
conversation was: what functional programming language(s) would/should
we teach?&lt;/p&gt;
&lt;p&gt;I mentioned OCaml, Haskell, and Erlang as reasonably pure but still
pragmatic FP languages.  Anything else that's both &amp;quot;truly&amp;quot; functional
and used somewhat broadly in the real world?&lt;/p&gt;
&lt;p&gt;thanks!&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    <item>
      <pubDate>Mon, 14 Jun 2010 16:14:52 GMT</pubDate>
      <title>Teaching next-gen sequencing data analysis to biologists</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=455</link>
      <guid>http://ivory.idyll.org/blog/jun-10/ngs-course-postmortem</guid>
      <description>&lt;div class="document"&gt;
&lt;p&gt;Our &lt;a class="reference" href="http://bioinformatics.msu.edu/ngs-summer-course-2010" &gt;sequencing analysis course&lt;/a&gt; ended last
Friday, with an overwhelmingly positive response from the students.
The few negative comments that I got were largely about organizational
issues, and could be reshaped as suggestions for next time rather than
as condemnations of this year's course.&lt;/p&gt;
&lt;img alt="http://ivory.idyll.org/permanent/ngs-2010-group.png" src="http://ivory.idyll.org/permanent/ngs-2010-group.png" style="width: 65%;" /&gt;
&lt;p&gt;The 23 students -- most with no prior command-line experience -- spent
two weeks experiencing at first hand the challenges of dealing with
dozens of gigabytes of sequencing data.  Each of the students went
through genome-scale mapping, genome assembly, mRNAseq analysis on an
&amp;quot;emerging model organism&amp;quot; (a.k.a. &amp;quot;one with a crappy genome&amp;quot;, lamprey),
resequencing analysis on E. coli, and ChIP-seq analysis on Myxococcus
xanthus.  By the beginning of the second week, many students were working
with their own data -- a real victory.  Python programming competency
may take a bit longer, but many of them seem motivated.&lt;/p&gt;
&lt;p&gt;If you had told me three weeks ago that we could pull this off, I
would have told you that you were crazy.  This does beg the question
of what I was thinking when I proposed the course -- but don't dwell
on that, please...&lt;/p&gt;
&lt;p&gt;The locale was great, as you can see:&lt;/p&gt;
&lt;img alt="http://ivory.idyll.org/permanent/ngs-2010-beach.png" src="http://ivory.idyll.org/permanent/ngs-2010-beach.png" style="width: 65%;" /&gt;
&lt;p&gt;One of the most important lessons of the course for me is that &lt;a class="reference" href="http://ivory.idyll.org/blog/jun-10/ngs-course-with-aws.html" &gt;cloud
computing works well to backstop this kind of course&lt;/a&gt;.  I
was very worried about the suitability and reliability and ease of use,
but AWS did a great job, providing an easy-to-use Web interface and a
good range of machine images.  I have little doubt that this course
would have been nearly impossible (and either completely ineffective
or much more expensive) without it.&lt;/p&gt;
&lt;p&gt;In the end, we spent more on beer than on computational power.  That says
something important to me :)&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference" href="http://ged.msu.edu/angus/" &gt;course notes&lt;/a&gt; are available under a
CC license, although they need to be reworked to use publicly available
data sets before they become truly useful.  At that point I expect them
to become awesomely useful, though.&lt;/p&gt;
&lt;p&gt;From the scientific perspective, the students derived a number of
significant benefits from the course.  One that I had not really
expected was that some students had no idea what went into
computational &amp;quot;sausage&amp;quot;, and were kind of shocked to see what kinds of
assumptions we comp bio people made on their behalf.  This was
especially true in the case of students from companies, who have
pipelines that are run on their data.  One student lamented that &amp;quot;we
used to look at the raw traces... now all we get are spreadsheet
summaries!&amp;quot;  Another student came to me in a panic because they didn't
realize that there &lt;em&gt;was&lt;/em&gt; no one true answer -- that that was in fact
part of the &amp;quot;fun&amp;quot; of &lt;em&gt;all&lt;/em&gt; biology, not just experimental biology.
These reactions alone made teaching the course worthwhile.&lt;/p&gt;
&lt;p&gt;Of course, the main point is that many of the students seem to be
capable of at least starting their own analyses now.  I was surprised
at the practical power of our cut-and-paste approach -- for example,
if you look at the &lt;a class="reference" href="http://ged.msu.edu/angus/tutorials/short-read-assembly.html" &gt;Short-read assembly with ABySS tutorial&lt;/a&gt;, it
turns out to be relatively straightforward to adapt this to doing
assemblies of your own genomic or transcriptomic data.  I based our
approach on Greg Wilson's post on &lt;a class="reference" href="http://pyre.third-bit.com/blog/archives/3761.html" &gt;the failure of inquiry-based
teaching&lt;/a&gt; and so
far I like it.&lt;/p&gt;
&lt;p&gt;I am particularly amused that we have now documented, in replicable
detail, the Kroos Lab MrpC ChIP analysis.  We also have the best
documentation for Jeff Barrick's breseq software, I think; this is
what is used to analyze the &lt;a class="reference" href="http://en.wikipedia.org/wiki/E._coli_long-term_evolution_experiment" &gt;Long Term Evolution Experiment&lt;/a&gt;
lines -- and I can't wait for the anti-evolutionists to pounce on
that...  &amp;quot;Titus Brown -- making evolution experiments accessible to
creationists.&amp;quot;  Yay?&lt;/p&gt;
&lt;p&gt;There were a number of problems and mistakes that we had to
steamroller through.  In particular, more background and more advanced
tutorials would have been great, but we just didn't have time to write
them.  Some 454, Helicos, and SOLiD data sets (and next year, PacBio?)
would be a good addition.  We had a general lack of multiplexing data,
which is becoming a Big Thing now that sequencing is so ridiculously
deep. I would also like to introduce additional real data analyses
next year, reprising things like the &lt;a class="reference" href="http://www.nature.com/nbt/journal/v28/n5/abs/nbt.1621.html" &gt;Cufflinks analysis&lt;/a&gt; and
whole-vertebrate-genome ChIP-seq/mRNAseq &lt;a class="reference" href="http://www.nature.com/nmeth/journal/v6/n11s/abs/nmeth.1371.html" &gt;a la the Wold Lab&lt;/a&gt;.
I'm weighing adding metagenomics data analysis in for a day, although
it's a pretty separate field of inquiry (and frankly much harder in
terms of &amp;quot;unknown unknowns&amp;quot;).  We also desperately need some plant
genomics expertise, because frankly I know nothing about plant
genomes; my last-minute plant genomics TA fell through due to lack of
planning on my part.  (Conveniently, plant genomics is something MSU
is particularly good at, so I'm sure I can find someone next year.)&lt;/p&gt;
&lt;p&gt;Oops, did I say next year?  Well, yes.  &lt;em&gt;If&lt;/em&gt; I can find funding for my
princely salary, &lt;em&gt;then&lt;/em&gt; I will almost certainly run the course again
next year.  I can cover TAs and my own room/board and speakers with
workshop fees, but if I'm going to keep room+board+fees under
$1000/student -- a practical necessity for most -- there's no way I
can pay myself, too.  And while this year I relied on my lovely,
patient, and frankly long-suffering wife to hold down the home fort
while I was away for two weeks, I simply can't put her through that
again, so I will need to pay for a nanny next year.  So doing it for
free is not an option.&lt;/p&gt;
&lt;p&gt;In other words, &lt;strong&gt;if you are a sequencing company, or an NIH/NSF/USDA
program director, interested in keeping this going, please get in
touch&lt;/strong&gt;.  I plan to apply for this &lt;a class="reference" href="http://grants.nih.gov/grants/guide/pa-files/PAR-09-245.html" &gt;Initiative to Maximize Research
Education in Genomics&lt;/a&gt; in
September, but I am not confident of getting that on the first try,
and in any case I will need letters of support from interested folks.
So &lt;a class="reference" href="mailto:ctb&amp;#64;msu.edu" &gt;drop me a note at ctb&amp;#64;msu.edu&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Course development this year was sponsored by the &lt;a class="reference" href="http://www.bch.msu.edu/GEDD/" &gt;MSU Gene Expression
in Disease and Development&lt;/a&gt;, to whom
I am truly grateful.  The course would simply not have been possible
without their support.&lt;/p&gt;
&lt;p&gt;My overall conclusion is that it is possible to teach bench biologists
with no prior computational experience to achieve at least minimal
competency in real-world data analysis of next-generation sequencing
data.  I can't conclusively &lt;em&gt;demonstrate&lt;/em&gt; this without doing a better
job of course evaluation, and of course only time will tell if it
sticks for any of the students, but right now I'm feeling pretty good
about the course overall.  Not to mention massively relieved.&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;
&lt;p&gt;p.s. Update from one student -- &amp;quot;It's not even 12 o'clock Monday
morning and I'm already getting people asking me how to run assemblies
and analyze data.&amp;quot;  Heh.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    <item>
      <pubDate>Tue, 8 Jun 2010 15:11:06 GMT</pubDate>
      <title>Running a next-gen sequence analysis course using Amazon Web Services</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=454</link>
      <guid>http://ivory.idyll.org/blog/jun-10/ngs-course-with-aws</guid>
      <description>&lt;div class="document"&gt;
&lt;p&gt;So, I've been teaching a &lt;a class="reference" href="http://bioinformatics.msu.edu/ngs-summer-course-2010" &gt;course&lt;/a&gt; on
next-generation sequence analysis for the last week, and one of the
issues I had to deal with before I proposed the course was how to deal
with the &lt;strong&gt;volume of data&lt;/strong&gt; and the required computation.&lt;/p&gt;
&lt;p&gt;You see, next-generation sequence analysis involves analyzing not just
entire genomes (which are, after all, only 3gb or so in size) but data
sets that are 100x or 1000x as big!  We want to not just map these
data sets (which is CPU-intensive), but also perform memory-intensive
steps like assembly.  If you have a class with 20+ students in it, you
need to worry about a lot of things:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;computational power: how do you provide 24 &amp;quot;good&amp;quot; workstations?&lt;/li&gt;
&lt;li&gt;memory&lt;/li&gt;
&lt;li&gt;disk space&lt;/li&gt;
&lt;li&gt;bandwidth&lt;/li&gt;
&lt;li&gt;&amp;quot;take home&amp;quot; ability&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;One strategy would be to simply provide some Linux or Mac
workstations, with cut-down data sets.  But then you wouldn't be
teaching reality -- you'd be teaching a cut-down version of reality.
This would make the course particularly irrelevant given that one of
the extra-fun things about next-gen sequence analysis is how hard it
is to deal with the volume of data.  You also have to worry that the
course would be made even &lt;em&gt;more&lt;/em&gt; irrelevant because the students would
leave the course and be unable to use the information without finding
infrastructure and installing a bunch of software and then administering
the machine.&lt;/p&gt;
&lt;p&gt;While I enjoy setting up computers and installing software and
managing users, I'm clearly masochistic.  It's also &lt;em&gt;entirely beside
the point&lt;/em&gt; for bioinformaticians and biologists -- they just want to
analyze data!&lt;/p&gt;
&lt;p&gt;The solution I came up with was to use Amazon Web Services and rent
some EC2 machines.  There's a large variety of hardware configurations
available (see &lt;a class="reference" href="http://aws.amazon.com/ec2/#instance" &gt;instance types&lt;/a&gt;) and they're not that
expensive per hour (see &lt;a class="reference" href="http://aws.amazon.com/ec2/#pricing" &gt;pricing&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This has worked out really, really well.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It's hard to enumerate the benefits, because there have been so many
of them ;).  A few of the obvious ones --&lt;/p&gt;
&lt;p&gt;We've been able to write tutorials (temporary home here:
&lt;a class="reference" href="http://ged.msu.edu/angus/" &gt;http://ged.msu.edu/angus/&lt;/a&gt;) that make use of specific images and should
be as future-proof as they can be.  We've given students cut and paste
command lines that Just Work, and that they can tweak and modify as
they want.  If it borks, they can always just throw it away and start from
a clean install.&lt;/p&gt;
&lt;p&gt;It's dirt cheap.  We spent less than $50 the first week, for ~30
people using an average of 8 hours of CPU time each.  The second week will
increase to an average of 8 hours of CPU time a day, and for larger
instances -- so probably about $300 total, or maybe even $500 -- but
that's ridiculously cheap, frankly, when you consider that there are
no hardware issues or OS re-install problems to deal with!&lt;/p&gt;
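&lt;p&gt;As a rough sanity check on the numbers above, here is a minimal Python
sketch that backs out the average price per instance-hour they imply.  The
spend, head count, and hours come from this post; reading &amp;quot;8 hours&amp;quot; as per
person for the whole first week, treating the $300-$500 guess as the total
for both weeks, and assuming five class days in the second week are all
guesses made for illustration, as are the variable and function names.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
```python
# Figures quoted in the post.
PEOPLE = 30                   # "~30 people"
WEEK1_SPEND = 50.0            # "less than $50", so this is an upper bound
WEEK1_HOURS_PER_PERSON = 8    # assumption: 8 hours each over the whole week

# Assumptions that are NOT in the post.
WEEK2_DAYS = 5                # hypothetical number of class days in week 2
WEEK2_HOURS_PER_DAY = 8       # "8 hours of CPU time a day"


def implied_hourly_rate(spend, people, hours_per_person):
    """Average dollars per instance-hour implied by a total bill."""
    return spend / (people * hours_per_person)


week1_rate = implied_hourly_rate(WEEK1_SPEND, PEOPLE, WEEK1_HOURS_PER_PERSON)
print(f"week 1: at most ${week1_rate:.2f} per instance-hour")

# Assumption: the $300 and $500 guesses are totals for both weeks,
# so week 2 is whatever remains after the first week's bill.
week2_rates = {}
for course_total in (300.0, 500.0):
    week2_spend = course_total - WEEK1_SPEND
    week2_rates[course_total] = implied_hourly_rate(
        week2_spend, PEOPLE, WEEK2_DAYS * WEEK2_HOURS_PER_DAY
    )
    print(f"${course_total:.0f} total: about "
          f"${week2_rates[course_total]:.2f} per instance-hour in week 2")
```
&lt;/pre&gt;
&lt;p&gt;Under those assumptions the implied rate stays at a few tens of cents per
instance-hour in both weeks, which fits the &amp;quot;dirt cheap&amp;quot; description.&lt;/p&gt;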
&lt;p&gt;Students can choose whatever machine specs they need in order to do
their analysis.  More memory?  Easy.  Faster CPU needed?  No problem.&lt;/p&gt;
&lt;p&gt;All of the data analysis takes place off-site.  As long as we can provide
the data sets somewhere else (I've been using S3, of course) the students
don't need to transfer multi-gigabyte files around.&lt;/p&gt;
&lt;p&gt;The students can go home, rent EC2 machines, and do their own analyses
-- without their labs buying any required infrastructure.&lt;/p&gt;
&lt;p&gt;Home institution computer admins can use the EC2 tutorials as
documentation to figure out what needs to be installed (and
potentially, maintained) in order for their researchers to do next-gen
sequence analysis.&lt;/p&gt;
&lt;p&gt;The documentation should even serve as a general set of tutorials,
once I go through and remove the dependence on private data sets!
There won't be any need for students to do difficult or tricky configurations
on their home machines in order to make use of the tutorial info.&lt;/p&gt;
&lt;p&gt;So, truly awesome.  I'm going to be using it for all my courses from now
on, I think.&lt;/p&gt;
&lt;p&gt;There have been only two minor hitches.&lt;/p&gt;
&lt;p&gt;First, I'm using Consolidated Billing to pay for all of the students'
computer use during the class, and Amazon has some rules in place to
prevent abuse of this.  They're limiting me to 20 consolidated billing
accounts per AWS account, which means that I've needed to get a second
AWS account in order to add all 30 students, TAs, and visiting
instructors.  I wouldn't even mention it as a serious issue but for
the fact that &lt;em&gt;they don't document it anywhere&lt;/em&gt;, so I ran into this on
the first day of class and then had to wait for them to get back to me
to explain what was going on and how to work around it.  Grr.&lt;/p&gt;
&lt;p&gt;Second, we had some trouble starting up enough Large instances
simultaneously on the day we were doing assembly.  Not sure what that
was about.&lt;/p&gt;
&lt;p&gt;Anyway, so I give a strong +1 on Amazon EC2 for large-ish style data
analysis.  Good stuff.&lt;/p&gt;
&lt;p&gt;cheers,
--titus&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    <item>
      <pubDate>Fri, 21 May 2010 16:10:45 GMT</pubDate>
      <title>Help! Help! Class notes site?</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=453</link>
      <guid>http://ivory.idyll.org/blog/may-10/help-with-class-notes</guid>
      <description>&lt;div class="document"&gt;
&lt;p&gt;So, I'm running this &lt;a class="reference" href="http://bioinformatics.msu.edu/ngs-summer-course-2010" &gt;summer course&lt;/a&gt; and I am
trying to figure out how to organize the notes for students.  I'd like
to mix curriculum-specific notes (&amp;quot;here's what we're doing today, and
here are some problems to work on&amp;quot;) with tutorials (material independent
of a single course, like &amp;quot;here's how to transfer files between computers&amp;quot;
or &amp;quot;here's how to parse CSV files&amp;quot;), and allow students to search the
documents, annotate them in their Web browser, search the annotations,
and perhaps even do public or private bookmarking and tagging.  The
ability to edit the primary content in something other than a Web GUI
would be really, really nice, too -- that way I can write in something
like ReST and then upload into the system.&lt;/p&gt;
&lt;p&gt;(This &lt;em&gt;is&lt;/em&gt; a system I could write myself, but that's kind of silly,
dontcha think?)&lt;/p&gt;
&lt;p&gt;It should also be lightweight, reasonably mature, easy to set up, and
(preferably) written in Python, although I'm willing to compromise
on the last simply because I'm desperate.&lt;/p&gt;
&lt;p&gt;Pointers, comments, suggestions welcome!&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
  </channel>
</rss>

