<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Advogato blog for titus</title>
    <link>http://www.advogato.org/person/titus/</link>
    <description>Advogato blog for titus</description>
    <language>en-us</language>
    <generator>mod_virgule</generator>
    <pubDate>Tue, 21 May 2013 09:24:58 GMT</pubDate>
    <item>
      <pubDate>Wed, 15 May 2013 15:19:21 GMT</pubDate>
      <title>Excerpts from Coders At Work: Peter Deutsch Interview</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=480</link>
      <guid>http://ivory.idyll.org/blog/coders-at-work-peter-deutsch.html</guid>
      <description>&lt;p&gt;I've been reading Peter Seibel's excellent book, &lt;a href="http://www.codersatwork.com/" &gt;Coders at Work&lt;/a&gt;, which is a transcription of
interviews with a dozen or so very well known and impactful
programmers.  After the first two interviews, I found myself itching
to highlight certain sections, and then I thought, heck, why not post
some of the bits I found most interesting?  This is a book everyone
should be aware of, and it's surprisingly readable.  Highly
recommended.&lt;/p&gt;
&lt;p&gt;This is the second of my blog posts.  The first contained excerpts
from Seibel's &lt;a href="http://ivory.idyll.org/blog/coders-at-work-joe-armstrong.html" &gt;interview with Joe Armstrong&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The excerpts below come from Seibel's &lt;a href="http://www.codersatwork.com/l-peter-deutsch.html" &gt;interview with Peter Deutsch&lt;/a&gt;, who is (among
many other things) the creator and long-time maintainer of
Ghostscript.&lt;/p&gt;
&lt;p&gt;My comments are labeled 'CTB'.&lt;/p&gt;
&lt;hr class="docutils"/&gt;&lt;div&gt;
&lt;h2&gt;On programmers&lt;/h2&gt;
&lt;p&gt;Seibel: So is it OK for people who don't have a talent for
systems-level thinking to work on smaller parts of software?
Can you split the programmers and the architects? Or do you
really want everyone who's working on systems-style software, since it is
sort of fractal, to be able to think in terms of systems?&lt;/p&gt;
&lt;p&gt;Deutsch: ... But in terms of who should do software, I don't have
a good flat answer to that. I do know that the further down in the plumbing the
software is, the more important it is that it be built by really good people.
That's an elitist point of view, and I'm happy to hold it.&lt;/p&gt;
&lt;p&gt;...&lt;/p&gt;
&lt;p&gt;You know the old story about the telephone and the telephone operators?
The story is, sometime fairly early in the adoption of the telephone,
when it was clear that use of the telephone was just expanding at an incredible
rate, more and more people were having to be hired to work as operators
because we didn't have dial telephones. Someone extrapolated the
growth rate and said "My God. By 20 or 30 years from now, every single
person will have to be a telephone operator." Well, that's happened.
I think something like that may be happening in some big areas of programming
as well.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;CTB: This seemed like interesting commentary on the increasing ...
democratization? ... of computer use.&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;h2&gt;Fast, cheap, good -- pick any two.&lt;/h2&gt;
&lt;p&gt;Deutsch: ...The problem being the old saying in the business: "fast, cheap,
good -- pick any two." If you build things fast and you have some way of building them inexpensively, it's very unlikely that they're going to be good.  But this school of thought says you shouldn't expect software to last.&lt;/p&gt;
&lt;p&gt;I think behind this perhaps is a mindset of software as expense vs
software as capital asset. I'm very much in the software-as-capital-asset school. When I was working at ParcPlace and Adele Goldberg was out there evangelizing object-oriented design, part of the way we talked about objects and part of the way we advocated object-oriented languages and design to our customers and potential customers is to say, "Look, you should treat software as a capital asset."&lt;/p&gt;
&lt;p&gt;And there is no such thing as a capital asset that doesn't require ongoing
maintenance and investment. You should expect that there's going to be
some cost associated with maintaining a growing library of reusable software.
And that is going to complicate your accounting because it means you can't
charge the cost of building a piece of software only to the project
or the customer that's motivating the creation of that software at this
time. You have to think of it the way you would think of a capital asset.&lt;/p&gt;
&lt;p&gt;CTB: A really good perspective that's relevant to &lt;a href="https://metarabbit.wordpress.com/2013/05/06/people-are-right-not-to-share-scientific-code/" &gt;scientists' concerns about software and data&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;h2&gt;On how software practice has (not) improved over the last 30 years&lt;/h2&gt;
&lt;p&gt;Seibel: So you don't believe the original object-reuse pitch quite as strongly now. Was there something wrong with the theory, or has it just not worked out for historical reasons?&lt;/p&gt;
&lt;p&gt;Deutsch: Well, part of the reason that I don't call myself a computer scientist any more is that I've seen software practice over a period of just about 50 years and it basically hasn't improved tremendously in about the last 30 years.&lt;/p&gt;
&lt;p&gt;If you look at programming languages I would make a strong case that programming languages have not improved qualitatively in the last 40 years.  There is no programming language in use today that is qualitatively better than Simula-67. I know that sounds kind of funny, but I really mean it. Java is not that much better than Simula-67.&lt;/p&gt;
&lt;p&gt;Seibel: Smalltalk?&lt;/p&gt;
&lt;p&gt;Deutsch: Smalltalk is somewhat better than Simula-67. But Smalltalk as it exists
today essentially existed in 1976. I'm not saying that today's
languages aren't better than the languages that existed 30 years ago. The language that I do all of my programming in today, Python, is, I think, a lot better
than anything that was available 30 years ago. I like it better than Smalltalk.&lt;/p&gt;
&lt;p&gt;I use the word &lt;em&gt;qualitatively&lt;/em&gt; very deliberately. Every programming language
today that I can think of, that's in substantial use, has the concept of
pointer. I don't know of any way to make software built using that fundamental
concept qualitatively better.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;CTB: Well, that's just a weird opinion in some ways.  But interesting,
especially since he has been around and active for so long, and his
perspective is obviously not based in ignorance.&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;h2&gt;On temptation&lt;/h2&gt;
&lt;p&gt;Deutsch: Every now and then I feel a temptation to design a
programming language but then I just lie down until it goes away.  But
if I were to give in to that temptation, it would have a pretty
fundamental cleavage between a functional part that talked only about
values and had no concept of pointer, and a different sphere of some
kind that talked about patterns of sharing and reference and control.&lt;/p&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;h2&gt;More on Smalltalk and Python&lt;/h2&gt;
&lt;p&gt;Seibel: So, despite it not being qualitatively better than Smalltalk,
you still like Python better.&lt;/p&gt;
&lt;p&gt;Deutsch: I do. There are several reasons. With Python there's a very
clear story of what is a program and what it means to run a program
and what it means to be part of a program. There's a concept of
module, and modules declare basically what information they need from other
modules. So it's possible to develop a module or a group of modules and share
them with other people and those other people can come along and look at those modules and know pretty much exactly what they depend on and know what their boundaries are.&lt;/p&gt;
&lt;p&gt;...&lt;/p&gt;
&lt;p&gt;I've talked with the few of my buddies that are still at VisualWorks about
open-sourcing the object engine, the just-in-time code generator,
which, even though I wrote it, I still think is better than a lot of what's
out there. Gosh, here we have Smalltalk, which has this really great code-generation machinery, which is now very mature -- it's about 20 years old and it's extremely reliable. It's a relatively simple, relatively retargetable, quite efficient just-in-time code generator that's designed to work really well with non type-declared languages. On the other hand, here's Python, which is this wonderful language with these wonderful libraries and a slow-as-mud implementation. Wouldn't it be nice if we could bring the two together?&lt;/p&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;h2&gt;(I'm a bit fixated on Python. OK?)&lt;/h2&gt;
&lt;p&gt;Deutsch: ... But that brings me to the other half, the other reason I
like Python syntax better, which is that Lisp is lexically pretty
monotonous.&lt;/p&gt;
&lt;p&gt;Seibel: I think Larry Wall described it as a bowl of oatmeal with
fingernail clippings in it.&lt;/p&gt;
&lt;p&gt;Deutsch: Well, my description of Perl is something that looks like it
came out of the wrong end of a dog. I think Larry Wall has a lot of
nerve talking about language design -- Perl is an abomination as a
language.  But let's not go there.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;CTB: heh.&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;</description>
    </item>
    <item>
      <pubDate>Mon, 29 Apr 2013 16:11:16 GMT</pubDate>
      <title>Excerpts from Coders At Work: Joe Armstrong Interview</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=479</link>
      <guid>http://ivory.idyll.org/blog/coders-at-work-joe-armstrong.html</guid>
      <description>&lt;p&gt;I've been reading Peter Seibel's excellent book, &lt;a href="http://www.codersatwork.com/" &gt;Coders at Work&lt;/a&gt;, which is a transcription of
interviews with a dozen or so very well known and impactful
programmers.  After the first two interviews, I found myself itching
to highlight certain sections, and then I thought, heck, why not post
some of the bits I found most interesting?  This is a book everyone
should be aware of, and it's surprisingly readable.  Highly
recommended.&lt;/p&gt;
&lt;p&gt;This is the first in what I expect to be a dozen or so blog posts, time
permitting.&lt;/p&gt;
&lt;p&gt;The excerpts below come from Seibel's &lt;a href="http://www.codersatwork.com/joe-armstrong.html" &gt;interview with Joe Armstrong,
the inventor of Erlang&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My comments are labeled 'CTB'.&lt;/p&gt;
&lt;hr class="docutils"/&gt;&lt;div&gt;
&lt;h2&gt;On learning to program&lt;/h2&gt;
&lt;p&gt;Seibel: How did you learn to program? When did it all start?&lt;/p&gt;
&lt;p&gt;Armstrong: When I was at school. I was born in 1950 so there weren't
many computers around then. The final year of school, I suppose I must
have been 17, the local council had a mainframe computer -- probably
an IBM. We could write Fortran on it. It was the usual thing -- you
wrote your programs on coding sheets and you sent them off. A week
later the coding sheets and the punch cards came back and you had to
approve them. But the people who made the punch cards would make
mistakes. So it might go backwards and forwards one or two times. And
then it would finally go to the computer center.&lt;/p&gt;
&lt;p&gt;Then it went to the computer center and came back and the Fortran
compiler had stopped at the first syntactic error in the program. It
didn't even process the remainder of the program. It was something
like three months to run your first program. I learned then, instead
of sending one program you had to develop every single subroutine in
parallel and send the lot. I think I wrote a little program to display
a chess board -- it would plot a chess board on the printer. But I had
to write all the subroutines as parallel tasks because the turnaround
time was so appallingly bad.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;CTB: I think it's fascinating to interpret this statement in light of
Erlang's pattern of small components, working in parallel
(http://en.wikipedia.org/wiki/Erlang_(programming_language)).  Did
Armstrong shape his mental architecture in this pattern from the early
mainframe days, and then translate that over to programming design?
Also, this made me think about unit testing in a whole new way.&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;h2&gt;On modern gizmos like "hierarchical file systems", and productivity&lt;/h2&gt;
&lt;p&gt;Armstrong: The funny thing is, thinking back, I don't think all of
these modern gizmos actually make you any more
productive. Hierarchical file systems -- how do they make you more
productive? Most of software development goes on in your head
anyway. I think having worked with that simpler system imposes a kind
of disciplined way of thinking. If you haven't got a directory system
and you have to put all the files in one directory, you have to be
fairly disciplined. If you haven't got a revision control system, you
have to be fairly disciplined. Given that you apply that discipline to
what you're doing it doesn't seem to me to be any better to have
hierarchical file systems and revision control. They don't solve the
fundamental problem of solving your problem. They probably make it
easier for groups of people to work together. For individuals I don't
see any difference.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;CTB: If your tools require you to be as good as Joe Armstrong in order to
get things done, that's probably not a generalizable solution...&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;h2&gt;On calling out to other languages, and Domain Specific Languages&lt;/h2&gt;
&lt;p&gt;Seibel: So if you were writing a big image processing work-flow
system, then would you write the actual image transformation in some
other language?&lt;/p&gt;
&lt;p&gt;Armstrong: I'd write them in C or assembler or something. Or I might
actually write them in a dialect of Erlang and then cross-compile the
Erlang to C. Make a dialect - this kind of domain-specific language
kind of idea. Or I might write Erlang programs which generate C
programs rather than writing the C programs by hand. But the target
language would be C or assembler or something. Whether I wrote them by
hand or generated them would be the interesting question. I'm tending
toward automatically generating C rather than writing it by hand
because it's just easier.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;CTB: heh. So, I'd just generate C automatically from a dialect of Erlang...&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;h2&gt;On debugging&lt;/h2&gt;
&lt;p&gt;Seibel: What are the techniques that you use there? Print statements?&lt;/p&gt;
&lt;p&gt;Armstrong: Print statements. The great gods of programming said,
"Thou shall put printf statements in your program at the point where
you think it's gone wrong, recompile, and run it."&lt;/p&gt;
&lt;p&gt;Then there's -- I don't know if I read it somewhere or if I invented
it myself -- Joe's Law of Debugging, which is that all errors will be
plus/minus three statements of the place you last changed the program.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;CTB: one surprising commonality amongst many of the interviews thus far
is the lack of use of (or disdain for) debuggers.  Almost everyone trots
out print statements!&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;</description>
    </item>
    <item>
      <pubDate>Wed, 20 Mar 2013 19:09:38 GMT</pubDate>
      <title>PyCon 2013 and Codes of Conduct, more generally</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=478</link>
      <guid>http://ivory.idyll.org/blog/pycon-2013-and-codes-of-conduct.html</guid>
      <description>&lt;p&gt;The tech community is messed up in da head, yo.&lt;/p&gt;
&lt;p&gt;Several times since Steve Holden's &lt;a href="http://holdenweb.blogspot.com/2012/12/im-sorry.html" &gt;I'm Sorry&lt;/a&gt; post I've written
long blog posts about my own views on codes of conduct and
professional behavior, including the views informed by some of my own
extraordinarily embarrassing transgressions.  I never felt that the
end result had much to add to the conversation so I never posted any
of 'em.  Plus, they were really embarrassing transgressions.&lt;/p&gt;
&lt;p&gt;If you want to know, until last week, I was fairly publicly on the
fence about the proposed Python Software Foundation code of conduct
(which is not yet public, but is based on the &lt;a href="http://www.ubuntu.com/project/about-ubuntu/conduct" &gt;Ubuntu CoC&lt;/a&gt;, I think)
because I was worried about CoCs being used to whack people
inappropriately, due to nonspecificity and other things.&lt;/p&gt;
&lt;p&gt;Three things happened at PyCon 2013 that made me decide to (a) change
my mind and (b) post this short note saying so.&lt;/p&gt;
&lt;p&gt;First, I came to PyCon with two women colleagues, one of whom was
harassed nearly constantly by men, albeit on a low level.  Both of
them are friendly people who are willing to engage at both a personal
and a technical level with others, and apparently that signals to some
that they can now feel free to comment on "hotness", proposition them,
and otherwise act like 14-year-old guys.  As one friend said,
(paraphrased) "I'd be more flattered that they seem to want to sleep
with me, if they'd indicated any interest in me as a human being --
you know, asked me why I was at PyCon, what I did, what I worked on,
what I thought about things.  But they didn't."  (Honestly, if I were
to judge by that reported set of interactions, any genetic component
to such behavior would be weeded out in approximately one generation,
'cause such guys would only be able to reproduce through anonymous
donations to sperm banks.)&lt;/p&gt;
&lt;p&gt;Second, at an event for a subcommunity that I help not run, &lt;a href="http://pycon.blogspot.com/2013/03/pycons-response-to-inapropriate.html" &gt;bad shit
happened&lt;/a&gt;.
At the same event, derogatory and not-fun sexist remarks were made,
publicly and loudly, about a presenter.  This made the main organizer
for that event feel horrible, and put a damper on the whole event.&lt;/p&gt;
&lt;p&gt;Third, &lt;a href="http://butyoureagirl.com/14015/forking-and-dongle-jokes-dont-belong-at-tech-conferences/" &gt;this happened&lt;/a&gt;.
As with #2, I found &lt;a href="http://pycon.blogspot.com/2013/03/pycon-response-to-inappropriate.html" &gt;PyCon's response&lt;/a&gt;
perfectly appropriate, which makes me much happier about the way PyCon
specifically and the PSF in general are likely to enforce any code of
conduct in the future.  As for the person who posted Twitter pics
identifying the men she felt were being sexist, I am not very upset by
her actions, because she is not an official representative of PyCon or
the PSF, and she did not post anonymously, so she is taking
responsibility for her actions -- unlike the people &lt;a href="https://news.ycombinator.com/item?id=5408443" &gt;harassing Jesse
Noller&lt;/a&gt; for doing his
effin' job.  I do reject the notion that Adria speaks for me in the
particulars, and I would guess that her claim to speak for all future
women is similarly rejected by many women.  Again, &lt;strong&gt;PyCon is not
responsible for her tweet or her picture, and they should not be held
accountable in any way for it; that's her personal action&lt;/strong&gt;.  (I'm
more upset by &lt;a href="https://news.ycombinator.com/item?id=5398681" &gt;the company that took this to the extent of actually
firing someone over it&lt;/a&gt;, and I'm really glad
my employer (Michigan State University) has rules, procedures, and
formal hearings -- if they fire me, it will be after a certain amount
of due process and not due to what is presumably Internet hearsay.)&lt;/p&gt;
&lt;p&gt;In the end, the latter two incidents are completely overshadowed by
the first, though.  I'm not the parent or guardian of either of my
colleagues, and I suspect they would similarly reject the idea that I
speak for either of them - they're both perfectly capable of telling
people what they think, frankly (and I would love to be an audience
for some of &lt;em&gt;those&lt;/em&gt; conversations).  But, as both a visible member of
the Python community and as the father of two small girls, I am
appalled at the second-hand reports of male behavior.  I'm
particularly appalled at the systemic low-level harassment that seems
to be considered normal behavior by some.  It's not cool, it's not
fun, and it doesn't even have the dubious virtue of being effective.&lt;/p&gt;
&lt;p&gt;In summary,&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;GOOD JOB, PYCON.  The way the incidents were officially handled was
really well done and speaks well of the people we have chosen to
run PyCon.&lt;/li&gt;
&lt;li&gt;We need codes of conduct because they provide some minimal
guidelines for people that (apparently) need 'em, because they
can't figure out how to tie their own shoelaces without such
guidelines.&lt;/li&gt;
&lt;li&gt;As a community, we need to change the way we treat women, because
my daughters will TASER YOU ALL INTO OBLIVION in 10-20 years if we
don't.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;--titus&lt;/p&gt;
&lt;p&gt;p.s. Sorry, no comments.  Go blog 'em and I'll link to the blog posts,
just send them to @ctitusbrown on Twitter.&lt;/p&gt;</description>
    </item>
    <item>
      <pubDate>Sat, 16 Mar 2013 19:09:59 GMT</pubDate>
      <title>My 2013 PyCon talk: Awesome Big Data Algorithms</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=477</link>
      <guid>http://ivory.idyll.org/blog/2013-pycon-awesome-big-data-algorithms-talk.html</guid>
      <description>&lt;p&gt;
  &lt;a href="https://us.pycon.org/2013/schedule/presentation/53/" &gt;Schedule link&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
  &lt;strong&gt;Description&lt;/strong&gt;
&lt;/p&gt;
&lt;p&gt;Random algorithms and probabilistic data structures are
algorithmically efficient and can provide shockingly good practical
results. I will give a practical introduction, with live demos and bad
jokes, to this fascinating algorithmic niche. I will conclude with
some discussions of how our group has applied this to large sequencing
data sets (although this will not be the focus of the talk).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;I propose to start with Python implementations of most of the DS &amp;amp; A mentioned in this excellent blog post:&lt;/p&gt;
&lt;p&gt;
  &lt;a href="http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/" &gt;http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;and also discuss skip lists and any other random algorithms that catch
my fancy. I'll put everything together in an IPython notebook and add
visualizations as appropriate.&lt;/p&gt;
&lt;p&gt;I'll finish with some discussion of how we've put these approaches to
work in my lab's research, which focuses on compressive approaches to
large data sets (and is regularly featured in my Python-ic blog,
&lt;a href="http://ivory.idyll.org/blog/" &gt;http://ivory.idyll.org/blog/&lt;/a&gt;).&lt;/p&gt;
&lt;div&gt;
&lt;h2&gt;Misc talk links&lt;/h2&gt;
&lt;p&gt;&lt;a href="http://www.slideshare.net/c.titus.brown/2013-py-con-awesome-big-data-algorithms" &gt;Slides&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/ctb/2013-pycon-awesome-big-data-algorithms" &gt;Github repo with IPython Notebooks in it&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(I'll put the video link in here when it's available.)&lt;/p&gt;
&lt;div&gt;
&lt;h3&gt;Overviews and linkfoo&lt;/h3&gt;
&lt;p&gt;&lt;a href="http://en.wikipedia.org/wiki/Category:Probabilistic_data_structures" &gt;Wikipedia's category page for Probabilistic Data Structures&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/" &gt;The Highly Scalable Blog on Probabilistic Data Structures for Web
Analytics and Data Mining&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;h3&gt;Specific References&lt;/h3&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;h2&gt;SkipLists:&lt;/h2&gt;
&lt;p&gt;&lt;a href="http://nbviewer.ipython.org/urls/raw.github.com/ctb/2013-pycon-awesome-big-data-algorithms/master/01-skiplist.ipynb" &gt;skiplist IPython Notebook&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://en.wikipedia.org/wiki/Skip_list" &gt;Wikipedia page on SkipLists&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/pyskip.pdf" &gt;John Shipman's excellent writeup&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf" &gt;William Pugh's original article&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The HackerNews (oops!) reference for my reddit-attributed quote about
putting a gun to someone's head and asking them to write a log-time
algorithm for storing stuff:
&lt;a href="https://news.ycombinator.com/item?id=2670632" &gt;https://news.ycombinator.com/item?id=2670632&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;h2&gt;HyperLogLog:&lt;/h2&gt;
&lt;p&gt;&lt;a href="http://nbviewer.ipython.org/urls/raw.github.com/ctb/2013-pycon-awesome-big-data-algorithms/master/02-coinflips.ipynb" &gt;coinflips IPython Notebook&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://nbviewer.ipython.org/urls/raw.github.com/ctb/2013-pycon-awesome-big-data-algorithms/master/03-hyper-log-log-counter.ipynb" &gt;HyperLogLog counter IPython Notebook&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/" &gt;Aggregate Knowledge's EXCELLENT blog post on HyperLogLog&lt;/a&gt;.
The section on Big Pattern Observables is truly fantastic :)&lt;/p&gt;
&lt;p&gt;&lt;a href="http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf" &gt;Flajolet et
al.&lt;/a&gt; is
the original paper.  It gets a bit technical in the middle but the
discussions are great.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation" &gt;Nick Johnson's blog post on cardinality estimation&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/" &gt;MetaMarkets' blog post on cardinality counting&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html" &gt;More High Scalability blog posts, this one by Matt Abrams&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://stackoverflow.com/questions/10164608/how-do-you-count-cardinality-of-very-large-datasets-efficiently-in-python" &gt;The obligatory Stack Overflow Q&amp;amp;A&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Vasily Evseenko's git repo &lt;a href="https://github.com/svpcom/hyperloglog" &gt;https://github.com/svpcom/hyperloglog&lt;/a&gt;,
forked from Nelson Goncalves's git repo,
&lt;a href="https://github.com/goncalvesnelson/Log-Log-Sketch" &gt;https://github.com/goncalvesnelson/Log-Log-Sketch&lt;/a&gt;, served as the
source for my IPython Notebook.&lt;/p&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;h2&gt;Bloom Filters:&lt;/h2&gt;
&lt;p&gt;&lt;a href="http://nbviewer.ipython.org/urls/raw.github.com/ctb/2013-pycon-awesome-big-data-algorithms/master/04-bloom-filters.ipynb" &gt;Bloom filter notebook&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href="http://en.wikipedia.org/wiki/Bloom_filter" &gt;Wikipedia page&lt;/a&gt; is pretty
good.&lt;/p&gt;
&lt;p&gt;Everything I know about Bloom filters comes from &lt;a href="http://pnas.org/content/early/2012/07/25/1121464109.abstract" &gt;my research&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I briefly mentioned the &lt;a href="http://en.wikipedia.org/wiki/Count-Min_sketch" &gt;CountMin Sketch&lt;/a&gt;, which is an
extension of the basic Bloom filter approach, for counting frequency
distributions of objects.&lt;/p&gt;
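&lt;p&gt;For the curious, here is a minimal Python sketch of the Count-Min idea.
It is an illustration only, not the code from the talk: the class name, the
default width and depth, and the use of salted SHA-1 hashes are all
assumptions made for this example.&lt;/p&gt;

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter: a few rows of counters, one hash per row."""

    def __init__(self, width=1000, depth=4):
        # width and depth are illustrative defaults, not tuned values.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # Derive one column index per row by salting the hash with the row number.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        # Bump the item's counter in every row.
        for row, col in self._columns(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.rows[row][col] for row, col in self._columns(item))


sketch = CountMinSketch()
for word in ["ACGT", "ACGT", "TTGA", "ACGT"]:
    sketch.add(word)
print(sketch.estimate("ACGT"))  # never less than the true count of 3
```

&lt;p&gt;Where a Bloom filter sets bits, this keeps counters, so a query returns
an approximate count instead of a yes/no answer.  The estimate can overshoot
but never undershoots; a wider or deeper table shrinks the overshoot at the
cost of memory.&lt;/p&gt;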
&lt;/div&gt;
&lt;div&gt;
&lt;h2&gt;Other nifty things to look at&lt;/h2&gt;
&lt;p&gt;&lt;a href="http://en.wikipedia.org/wiki/Quotient_filter" &gt;Quotient filters&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://en.wikipedia.org/wiki/Rapidly_exploring_random_tree" &gt;Rapidly-exploring random trees&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://en.wikipedia.org/wiki/Random_binary_tree" &gt;Random binary trees&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://en.wikipedia.org/wiki/Treap" &gt;Treaps&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://stackoverflow.com/questions/13263220/is-there-any-probabilistic-data-structure-that-gives-false-negatives-but-not-fal" &gt;StackOverflow question on Bloom-filter like structures that go the other way&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.slideshare.net/StampedeCon/a-survey-of-probabilistic-data-structures-stampedecon-2012" &gt;A survey of probabilistic data structures&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blog.aggregateknowledge.com/2012/07/09/sketch-of-the-day-k-minimum-values/" &gt;K-Minimum Values over at Aggregate Knowledge again&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blog.aggregateknowledge.com/2012/09/12/set-operations-on-hlls-of-different-sizes/" &gt;Set operations on HyperLogLog counters&lt;/a&gt;, again over at Aggregate Knowledge.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://code.activestate.com/recipes/576930-efficient-running-median-using-an-indexable-skipli/" &gt;Using SkipLists to calculate an efficient running median&lt;/a&gt;&lt;/p&gt;
&lt;div&gt;
&lt;h3&gt;My research&lt;/h3&gt;
&lt;p&gt;&lt;a href="http://ged.msu.edu/downloads/2012-bigdata-nsf.pdf" &gt;A fairly readable (?) grant on Big Data in sequencing data sets&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://ged.msu.edu/downloads/2012-career-nsf-final.pdf" &gt;A less readable ;) grant on "infinite assembly"&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In addition to our &lt;a href="http://pnas.org/content/early/2012/07/25/1121464109.abstract" &gt;published paper on using Bloom filters to store
de Bruijn graphs&lt;/a&gt;, you might be interested in:&lt;/p&gt;
&lt;p&gt;&lt;a href="http://arxiv.org/abs/1203.4802" &gt;Our preprint on streaming lossy compression of sequencing data&lt;/a&gt; (aka Digital Normalization)&lt;/p&gt;
&lt;p&gt;&lt;a href="http://arxiv.org/abs/1212.2832" &gt;Our use of these techniques to assemble the heck out of large metagenomic data from soil&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://arxiv.org/abs/1303.2223" &gt;A chapter on optimizing our khmer software&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;</description>
    </item>
    <item>
      <pubDate>Wed, 13 Feb 2013 03:08:36 GMT</pubDate>
      <title>Communicating programming practice with screencasts</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=476</link>
      <guid>http://ivory.idyll.org/blog/communicating-programming-practice.html</guid>
      <description>&lt;p&gt;One of the things that I have struggled with over the years is how to
teach people how to &lt;em&gt;actually&lt;/em&gt; program -- by this I mean the
minute-to-minute process and techniques of generating code, more so
than syntax and data structures and algorithms.  This is generally not
taught explicitly in college: most undergraduate students pick it up
in the process of doing homeworks, by working with other people,
observing TAs, and evolving their own practice.  Most science graduate
students never take a formal course in programming or software
development, as far as I can tell, so they pick it up haphazardly from
their colleagues.  Open source hackers may get their practice from
sprints, but usually by the time you get to a sprint you are already
wedged fairly far into your own set of habits.&lt;/p&gt;
&lt;p&gt;Despite this lack of explicit teaching, I think it's clear that
programming practice is really important.  I and the other Linux/UNIX
geeks I know all have a fairly small set of basic approaches --
command-line driven, mostly emacs or vi, with lots of automation at
the shell -- that we apply to all of our problems, and it is all
pretty optimized for the tools and tasks we have.  I would be hard
pressed to imagine a significantly more efficient and effective
set of practices (which just tells me that there is probably
something much better, but it's far away from my current practices :).&lt;/p&gt;
&lt;p&gt;Now that I'm a professional educator, I'd like to teach this, because
what I see students doing is so darned inefficient by comparison. I
regularly watch students struggle with the mouse to switch between
windows, copy and paste by selecting or dragging, and otherwise
completely fail to make use of keyboard shortcuts.  I see a lot of
code being built from scratch by guess-work, without lots of Google-fu
or copy/pasting and editing.  Version control isn't integrated into
their minute-by-minute process.  Testing?  Hah.  We don't even &lt;em&gt;teach&lt;/em&gt;
automated testing here at MSU. It's an understatement to say that
using all of these techniques together is a conceptual leap that many
students seem ill-prepared to make.&lt;/p&gt;
&lt;p&gt;Last term I co-taught an &lt;a href="http://ged.msu.edu/courses/2012-fall-cse-891/" &gt;intro graduate course in computation for
evolutionary biologists&lt;/a&gt; using IPython
Notebook running in the cloud, and I made extensive use of screencasts
as a way to show the students how I worked and how I thought.  It went
pretty well -- several students told me that they really appreciated
being able to see what I was doing and hear why I was doing it, and
being able to pause and rewind was very helpful when they ran into
trouble.&lt;/p&gt;
&lt;p&gt;So this term, for my database-backed Web development course, I decided
to post videos of the homework solutions for &lt;a href="http://msu-web-dev.readthedocs.org/en/latest/hw2.html" &gt;the second homework&lt;/a&gt;, which is
part of a whole-term class project to build a distributed peer-to-peer
liquor cabinet and party planning Web site.  (Hey, you gotta teach 'em
somehow, right?)&lt;/p&gt;
&lt;p&gt;I posted the example solutions as &lt;a href="https://github.com/ctb/cse491-drinkz/tree/hw2-solutions" &gt;a github branch&lt;/a&gt; as well
as videos showing me solving each of the sub problems in real time,
with discussion:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;HW 2.1 -- &lt;a href="http://www.youtube.com/watch?v=2img0wKdokA" &gt;http://www.youtube.com/watch?v=2img0wKdokA&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;HW 2.2 -- &lt;a href="http://www.youtube.com/watch?v=eQU4qImY9VM" &gt;http://www.youtube.com/watch?v=eQU4qImY9VM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;HW 2.3 -- &lt;a href="http://www.youtube.com/watch?v=YqL18Ip2wws" &gt;http://www.youtube.com/watch?v=YqL18Ip2wws&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;HW 2.4 -- &lt;a href="http://www.youtube.com/watch?v=7iOITFHrqmA" &gt;http://www.youtube.com/watch?v=7iOITFHrqmA&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;HW 2.5 -- &lt;a href="http://www.youtube.com/watch?v=0Ea5yxRCKKw" &gt;http://www.youtube.com/watch?v=0Ea5yxRCKKw&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;HW 2.6 -- &lt;a href="http://www.youtube.com/watch?v=6k8pnl2SgVI" &gt;http://www.youtube.com/watch?v=6k8pnl2SgVI&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I think the videos are decent screencasts, and by breaking them down
this way I made it possible for students to look only at the section
they had questions about.  Each screencast is 5-10 minutes
total, and now I can use them for other classes, too.&lt;/p&gt;
&lt;p&gt;So far so good.  I doubt many students have spent much time looking
at them yet, but maybe some will.  We'll have to see if my contentment
in having produced them matches their actual utility in the class :).&lt;/p&gt;
&lt;p&gt;But then something entertaining happened.  Greg Wilson is always
bugging us (where "us" means pretty much anyone whose e-mail inbox he
has access to) about developing hands-on examples that we can use in
&lt;a href="http://software-carpentry.org" &gt;Software Carpentry&lt;/a&gt;, so I sent
these videos to the SWC 'tutors' mailing list with a note that I'd
love help writing better homeworks.  And within an hour or so, I got
back two nice polite e-mails from other members of the list, offering
better solutions.  One was about &lt;a href="http://ged.msu.edu/courses/2012-fall-cse-891/" &gt;HW 2.1&lt;/a&gt; --&lt;/p&gt;
&lt;blockquote&gt;
It might be safer to .lstrip() the line before checking for comments
(to allow indented comments). Also, not line[0].strip() doesn't test
for lines with only white space. it tests for lines that have white
space as the first character.  'not line.strip()[0]' would be all
white space... That would also make 'not line' redundant.&lt;/blockquote&gt;
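&lt;p&gt;Here is a minimal sketch of the kind of check that comment is driving at.  The function name, the '#' comment character, and the sample lines are assumptions made up for illustration, not the actual homework code.  One caveat: indexing the stripped string (as in 'line.strip()[0]') raises an IndexError on a whitespace-only line, so the sketch tests the stripped string itself instead.&lt;/p&gt;
&lt;pre&gt;
```python
def is_skippable(line, comment_char="#"):
    """True for empty, whitespace-only, or (possibly indented) comment lines."""
    stripped = line.strip()
    # An empty string is falsy, so this covers both "" and "   \n".
    if not stripped:
        return True
    return stripped.startswith(comment_char)


# Hypothetical input lines; the last one stands in for a real data row.
samples = ["# a comment\n", "    # indented comment\n", "   \n", "", "  name,amount\n"]
for text in samples:
    print(repr(text), is_skippable(text))

# For comparison: the first-character test treats an indented data row as blank.
print(not "  name,amount\n"[0].strip())
```
&lt;/pre&gt;
&lt;p&gt;Stripping once up front means a single test handles blank lines, whitespace-only lines, and indented comments alike.&lt;/p&gt;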
&lt;p&gt;I also got a more general note from someone else, offering to peer review my
homework solutions and chastising me for using&lt;/p&gt;
&lt;pre&gt;
fp = open(filename)
try:
  ... do stuff ...
finally:
  fp.close()
&lt;/pre&gt;
&lt;p&gt;instead of&lt;/p&gt;
&lt;pre&gt;
with open(filename) as fp:
   ... do stuff
&lt;/pre&gt;
&lt;p&gt;heh.&lt;/p&gt;
&lt;p&gt;I find this little episode very entertaining. I love the notion that
other people (at least one is another professor) had the spare time to
watch the videos and then critique what I'd done and then send me the
critique; I also like the point that the quest for perfect code is
ongoing.  I am particularly entertained by the fact that they are both
right, and that my explanation of my code was in some cases facile,
shallow, and somewhat wrong (although not significantly enough to make
me redo the videos -- the perfect is the enemy of the good enough!)&lt;/p&gt;
&lt;p&gt;And, finally, although no feedback spoke directly to this, I am in
love with the notion that we can convey effective practice through
video.  I think this episode is a great indication that if we could
get students to record themselves working through problems, we could
learn how they are responding to our instruction and start to develop
a deeper understanding of the traps for the novice that lie within our
current programming processes.&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;</description>
    </item>
    <item>
      <pubDate>Sun, 4 Nov 2012 18:10:11 GMT</pubDate>
      <title>Adding disqus, Google Analytics, and github edit links to ReadTheDocs sites</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=475</link>
      <guid>http://ivory.idyll.org/blog/rtd-comments-and-editing.html</guid>
      <description>&lt;p&gt;Inspired by the awesomeness of disqus on my other sites, I wanted to
make it possible to enable disqus on my sites on &lt;a href="http://readthedocs.org" &gt;ReadTheDocs&lt;/a&gt;.  A bit of googling led me to Mikko
Ohtamaa's excellent work on &lt;a href="http://opensourcehacker.com/2012/01/08/readthedocs-org-github-edit-backlink-and-short-history-of-plone-documentation/" &gt;the Plone documentation&lt;/a&gt;,
where a blinding flash of awesomeness hit me and I realized that
github had, over the past year, &lt;a href="https://github.com/blog/905-edit-like-an-ace" &gt;nicely integrated online editing of
source, together with pull requests&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This meant that I could now give potential contributors completely
command-line-free edit ability for my documentation sites, together
with single-click approval of edits, and automated insta-updating of
the ReadTheDocs site.  Plus disqus commenting.  And Google Analytics.&lt;/p&gt;
&lt;p&gt;I just had to have it.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://labibi.readthedocs.org/en/latest/" &gt;Voila&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Basically, I took Mikko's awesomeness, combined it with some disqus hackery,
refactored a few times, and, well, posted it.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/ctb/labibi" &gt;source is here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Two things --&lt;/p&gt;
&lt;p&gt;I could use some JS help disabling the 'Edit this document!' stuff if the
'github_base_account' variable isn't set in page.html.  Anyone?  See
&lt;a href="https://github.com/ctb/labibi/blob/master/_templates/page.html#L105" &gt;line 105 of page.html&lt;/a&gt;.  You can edit online by hitting 'e' :).&lt;/p&gt;
&lt;p&gt;It would be nice to be able to configure disqus, Google Analytics, and
github editing in conf.py, but I wasn't able to figure out how to pass
variables into Jinja2 from conf.py.  It's probably really easy.&lt;/p&gt;
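&lt;p&gt;For what it's worth, Sphinx does have a documented 'html_context' setting: a dictionary in conf.py whose entries are handed to the template engine for every page.  The sketch below assumes that is enough for this use case.  Only 'github_base_account' comes from the template discussed above; the other keys and all of the values are hypothetical placeholders.&lt;/p&gt;
&lt;pre&gt;
```python
# conf.py (Sphinx configuration) -- illustrative values only
html_context = {
    "github_base_account": "ctb",           # account that owns the docs repository
    "github_project": "labibi",             # hypothetical key name
    "disqus_shortname": "my-shortname",     # hypothetical key and value
    "google_analytics_id": "UA-XXXXXXX-X",  # hypothetical key and value
}
```
&lt;/pre&gt;
&lt;p&gt;With that in place, page.html could probably wrap the edit link in {% if github_base_account %} ... {% endif %}, so that it is simply not rendered when the variable is missing, with no JavaScript needed.&lt;/p&gt;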
&lt;p&gt;But otherwise it all works nicely.&lt;/p&gt;
&lt;p&gt;Enjoy!  And thanks to Mikko, as well as Eric Holscher and the RTD team,
and github, for making this all so frickin' easy.&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;</description>
    </item>
    <item>
      <pubDate>Fri, 24 Aug 2012 15:10:58 GMT</pubDate>
      <title>PyCon 2013 talks I &lt;em&gt;really&lt;/em&gt; don't want to see</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=474</link>
      <guid>http://ivory.idyll.org/blog/pycon13-talks-i-dont-want.html</guid>
      <description>&lt;p&gt;There's been a lot of discussion about PyCon talks that we &lt;em&gt;do&lt;/em&gt; want
to see.  Here's a brief list of those I &lt;em&gt;don't&lt;/em&gt; want to see, for those
of you considering a submission -- in no particular order.&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;1001 Mocking Frameworks - a comparison and overview&lt;/li&gt;
&lt;li&gt;Long Live Twill&lt;/li&gt;
&lt;li&gt;Zope vs. Django - why Zope was right all along&lt;/li&gt;
&lt;li&gt;Why We Need More Men in Python - a Diversity discussion&lt;/li&gt;
&lt;li&gt;Centralized Version Control - it's the future&lt;/li&gt;
&lt;li&gt;Guido van Rossum on Python 4k - it's the future&lt;/li&gt;
&lt;li&gt;Running Python under Windows on my Superboard II&lt;/li&gt;
&lt;li&gt;Lists - way more useful than you ever thought&lt;/li&gt;
&lt;li&gt;What the Python Community Can Learn from Java&lt;/li&gt;
&lt;li&gt;Solving Easy Problems - my very own customized approaches&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;Any other ideas?  Add 'em or send me links :)&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;</description>
    </item>
    <item>
      <pubDate>Mon, 25 Jun 2012 06:10:26 GMT</pubDate>
      <title>Welcome to my new blog!</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=473</link>
      <guid>//new-blog.html</guid>
      <description>&lt;p&gt;I've just moved my blog over to &lt;a href="http://pelican.notmyidea.org/en/2.8/index.html" &gt;Pelican&lt;/a&gt;, a static blog
generator that takes in reStructuredText and spits out, well, this!
I'm now using &lt;a href="http://disqus.com" &gt;Disqus&lt;/a&gt; for commenting, too.&lt;/p&gt;
&lt;p&gt;The main motivations for the move (apart from slightly better theming) were
to escape dynamic-blog-land in favor of static-blog-land, while enabling
a better commenting setup.  Pelican+disqus looked like a great solution;
we'll see how it goes!&lt;/p&gt;
&lt;p&gt;One note -- rather than hack and slash my way through disqus's commenting
system upload fail, I just attached all of the comments as "legacy
comments" on old blog entries.  Yeah, it sucks, sorry.&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;</description>
    </item>
    <item>
      <pubDate>Fri, 8 Jun 2012 19:09:06 GMT</pubDate>
      <title>Some early experience in teaching using ipython notebook</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=472</link>
      <guid>http://ivory.idyll.org/blog/jun-12/teaching-with-ipynb</guid>
      <description>&lt;div&gt;
&lt;p&gt;As part of the &lt;a href="http://bioinformatics.msu.edu/ngs-summer-course-2012" &gt;2012 Analyzing Next-Generation Sequencing Data course&lt;/a&gt;, I've been
trying out ipython notebook for the tutorials.&lt;/p&gt;
&lt;p&gt;In previous years, our tutorials all looked like this: &lt;a href="http://ged.msu.edu/angus/tutorials-2011/short-read-assembly-velvet.html" &gt;Short read
assembly with Velvet&lt;/a&gt;
-- basically, reStructuredText files integrated with Sphinx.  This had a lot
of advantages, including Googleability and simplicity; but it also meant
that students spent a lot of time copying and pasting commands.&lt;/p&gt;
&lt;p&gt;This year, I tried mixing things up with some ipython notebook, using
pre-written notebooks -- see for example a static view of the &lt;a href="http://ged.msu.edu/angus/tutorials-2012/files/static-ngs-10-blast.html" &gt;BLAST
notebook&lt;/a&gt;.
The notebooks are sourced at
&lt;a href="https://github.com/ngs-docs/ngs-notebooks" &gt;https://github.com/ngs-docs/ngs-notebooks&lt;/a&gt;, and can be automatically
updated and placed on an EC2 instance for the students to run.  The
idea is that the students can simply shift-ENTER through the notebooks;
shell commands can easily be run with '!', and we can integrate in
python code that graphs and explores the outputs.&lt;/p&gt;
&lt;p&gt;Once we got past the basic teething pains of badly written notebooks,
broken delivery mechanisms, proper ipython parameters, etc., things seemed
to work really well.  It's been great to be able to add code, annotate
code, and graph stuff interactively!&lt;/p&gt;
&lt;p&gt;Along the way, though, a few points have emerged.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, ipython notebook adds a little bit of confusion to the
process.  Even though it's pretty simple, when you're throwing it in
on top of UNIX, EC2, bioinformatics, and Python, people's minds tend
to boggle.&lt;/p&gt;
&lt;p&gt;For this reason, it's not yet clear how good an addition ipynb is to
the course.  We can't get away with replacing the shell with ipynb,
for a variety of reasons; so it represents an extra cognitive burden.
I think for an entire term course it will be an unambiguous win, but
for an intensive workshop it may be one thing too many.&lt;/p&gt;
&lt;p&gt;I should have a better feeling for this next week.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Second&lt;/strong&gt;, in practice, ipython notebooks need to be written so that
they can be executed multiple times on the same machine.  Workshop
attendees start out very confused about the order of commands vs the
order of execution, and even though ipynb makes this relatively
simple, if they get into trouble it is nice to be able to tell them to
just rerun the entire notebook.  So the notebook commands have to be
designed this way -- for one example, if you're copying a file, make
sure to use 'cp -f' so that it doesn't ask if the file needs to be
copied again.&lt;/p&gt;
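&lt;p&gt;The same idea in Python rather than shell, as a rough sketch: a setup cell that can be executed any number of times.  The function name, file names, and directory layout are made up for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
import os
import shutil
import tempfile


def setup_cell(workdir, source_file):
    """Prepare the working directory; safe to execute any number of times."""
    # exist_ok=True keeps a second run from failing on an existing directory.
    os.makedirs(workdir, exist_ok=True)
    # shutil.copyfile overwrites an existing destination without asking,
    # similar in spirit to 'cp -f'.
    destination = os.path.join(workdir, os.path.basename(source_file))
    shutil.copyfile(source_file, destination)
    return destination


# Hypothetical demo data in a scratch directory.
scratch = tempfile.mkdtemp()
reads = os.path.join(scratch, "reads.fa")
with open(reads, "w") as fp:
    fp.write("ACGT\n")

first = setup_cell(os.path.join(scratch, "work"), reads)
second = setup_cell(os.path.join(scratch, "work"), reads)  # rerun: no error
print(first == second)
```
&lt;/pre&gt;
&lt;p&gt;If every cell is written this way, "just rerun the entire notebook" becomes a safe instruction to give.&lt;/p&gt;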
&lt;p&gt;&lt;strong&gt;Third&lt;/strong&gt;, in practice, ipython notebooks cannot contain long
commands.  If the entire notebook can't be re-run in about 1 minute,
then it's too long.  This became really clear with Oases and Trinity,
where Oases could easily be run on a small data set in about 1-2
minutes, while Trinity took an hour or more.  Neither people nor
browsers handle that well.  Moreover, if you accidentally run the
time-consuming task twice, you're stuck waiting for it to finish, and
it's annoying and confusing to put multi-execution guards on tasks.&lt;/p&gt;
&lt;p&gt;This point is a known challenge with ipython notebook, of course; I've
been talking with Fernando and Brian, among others, about how to deal
with long-running tasks.  I'm converging on the idea that long-running
tasks should be run at the command line (maybe using 'make' or
something better?) and then ipython notebook can be used for data analysis
leading to summaries and/or visualization.&lt;/p&gt;
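&lt;p&gt;For reference, a multi-execution guard of the kind mentioned above might look something like the sketch below.  It assumes the long-running task produces a single output file; the function name, the ".partial" suffix, and the stand-in task are all hypothetical.&lt;/p&gt;
&lt;pre&gt;
```python
import os
import tempfile


def run_once(output_path, task):
    """Run task() only if output_path does not exist yet; return True if it ran."""
    if os.path.exists(output_path):
        return False
    result = task()
    # Write under a temporary name first, so an interrupted run does not
    # leave a half-written file that would fool the guard next time.
    partial = output_path + ".partial"
    with open(partial, "w") as fp:
        fp.write(result)
    os.replace(partial, output_path)
    return True


# Hypothetical stand-in for an assembler run that takes an hour.
target = os.path.join(tempfile.mkdtemp(), "assembly.txt")
print(run_once(target, lambda: "fake contigs\n"))  # first execution does the work
print(run_once(target, lambda: "fake contigs\n"))  # rerunning the cell skips it
```
&lt;/pre&gt;
&lt;p&gt;It works, but it is one more thing for a confused attendee to read past, which is part of why pushing long tasks out to the command line looks attractive.&lt;/p&gt;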
&lt;p&gt;&lt;strong&gt;Fourth&lt;/strong&gt;, ipython notebooks are a bit difficult to share in static
form, which makes the site less useful.  Right now I've been printing
to HTML and then serving that HTML up statically, which is slow and
not all that satisfying.  There are probably easy solutions for this
but I haven't invested in them ;).&lt;/p&gt;
&lt;p&gt;---&lt;/p&gt;
&lt;p&gt;In spite of these teething pains, feedback surrounding ipynb has been
reasonably positive.  Remember, these are biologists who may never
have done any previous shell commands or programming, and we are
throwing a lot at them; but overall the basic concepts of ipynb are
simple, and they recognize that.  Moreover, ipython notebook has
enabled extra flexibility in what we present and make possible for
them to do, and they seem to see and appreciate that.&lt;/p&gt;
&lt;p&gt;The good news is that we figured all this out in the first week, and I still
have a whole week with the guinea pigs, ahem, course attendees, under my
thumb.  We'll see how it goes!&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;
&lt;p&gt;p.s. Totally jonesing for a portfolio system that lets me specify a
machine config, then with a single click spawns the machine,
configures it, sucks down a bunch of ipython notebooks, and points me
at the first one!&lt;/p&gt;
&lt;/div&gt;</description>
    </item>
    <item>
      <pubDate>Sun, 8 Apr 2012 01:09:13 GMT</pubDate>
      <title>Why I don't *really* practice open science</title>
      <link>http://www.advogato.org/person/titus/diary.html?start=471</link>
      <guid>http://ivory.idyll.org/blog/apr-12/blog-practicing-open-science</guid>
      <description>&lt;div&gt;
&lt;p&gt;I'm a pretty big advocate of anything open -- open source, open
access, and open science, in particular.  I always have been.  And now
that I'm a professor, I've been trying to figure out how to actually
&lt;em&gt;practice&lt;/em&gt; open science effectively.&lt;/p&gt;
&lt;p&gt;What is open science?  Well, I think of it as talking regularly about
my unpublished research on the Internet, generally in my blog or on
some other persistent, explicitly public forum.  It should be done
regularly, and it should be done with a certain amount of depth or
self-reflection.  (See, for example, the wunnerful &lt;a href="http://www.nature.com/news/2011/110809/full/news.2011.469.html" &gt;Rosie Redfield&lt;/a&gt;
and &lt;a href="http://www.nature.com/news/2011/110809/full/news.2011.469.html" &gt;Nature's commentary&lt;/a&gt;
on her blogging of the arsenic debacle &amp;amp; tests thereof.)&lt;/p&gt;
&lt;p&gt;Most of my cool, sexy bloggable work is in bioinformatics; I do have a
wet lab, and we're starting to get some neat stuff out of that
(incl. both some ascidian evo-devo and some chick transcriptomics) but
that's not as mature as the computational stuff I'm doing.  And, as
you know if you've seen any of my recent posts on this, I'm pretty
bullish about the computational work we've been doing: the de novo
assembly sequence foo is, frankly, super awesome and seems to solve
most of the scaling problems we face in short-read assembly.  And it
provides a path to solving the problems that it doesn't outright
&lt;em&gt;solve&lt;/em&gt;.  (I'm talking about &lt;a href="http://ivory.idyll.org/blog/dec-11/kmer-percolation-posted.html" &gt;partitioning&lt;/a&gt;
and &lt;a href="http://ivory.idyll.org/blog/mar-12/diginorm-paper-posted.html" &gt;digital normalization&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;While I think we're doing awesome work, I've been uncharacteristically
(for me) shy about proselytizing it prior to having papers ready.  I
occasionally reference it on mailing lists, or in blog posts, or on
twitter, but the most I've talked about the details has been in talks
-- and I've rarely posted those talks online.  When I have, I don't
point out the nifty awesomeness in the talks, either, which of course
means it goes mostly unnoticed.  This seems to be at odds with my
oft-loudly stated position that open-everything is the way to go.
What's going on??  That's what this blog post is about.  I think it
sheds some interesting light on how science is actually practiced, and
why completely open science might waste a lot of people's time.&lt;/p&gt;
&lt;p&gt;I'd like to dedicate this blog post to &lt;a href="http://third-bit.com/" &gt;Greg Wilson&lt;/a&gt;.  He and I chat irregularly about research,
and he's always seemed interested in what I'm doing but is stymied
because I don't talk about it much in public venues.  And he's been a
bit curious about why.  Which made me curious about why.  Which led to
this blog post, explaining why I think why.  (I've had it written for
a few months, but was waiting until I posted diginorm.)&lt;/p&gt;
&lt;hr class="docutils"/&gt;&lt;p&gt;For the past two years or so, I've been unusually focused on the
problem of putting together vast amounts of data -- the problem of de
novo assembly of short-read sequences.  This is because I work on
unusual critters -- soil microbes &amp;amp; non-model animals -- that nobody
has sequenced before, and so we can't make use of prior work.  We're
working in two fields primarily, metagenomics (sampling populations of
wild microbes) and mRNAseq (quantitative sequencing of transcriptomes,
mostly from non-model organisms).&lt;/p&gt;
&lt;p&gt;The problems in this area are manifold, but basically boil down to two
linked issues: vast underlying diversity, and dealing with the even
vaster amounts of sequence necessary to thoroughly sample this
diversity.  There's lots of biology motivating this, but the
computational issues are, to first order, dominant: we can generate
more sequence than we can assemble.  This is the problem that
we've basically solved.&lt;/p&gt;
&lt;p&gt;A rough timeline of our work is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;&lt;li&gt;mid/late 2009: Likit, a graduate student in my lab, points out that
we're getting way better gene models from assembly of chick mRNAseq
than from reference-based approaches.  Motivates interest in assembly.&lt;/li&gt;
&lt;li&gt;mid/late 2009: our lamprey collaborators deliver vast amounts of lamprey
mRNAseq to us.  Reference genome sucks.  Motivates interest in assembly.&lt;/li&gt;
&lt;li&gt;mid/late 2009: the JGI starts delivering ridiculous amounts of soil
sequencing data to us (specifically, Adina).  We do everything
possible to avoid assembly.&lt;/li&gt;
&lt;li&gt;early 2010: we realize that the least insane approach to analyzing
soil sequencing data relies on assembly.&lt;/li&gt;
&lt;li&gt;early 2010: Qingpeng, a graduate student, convinces me that
existing software for counting k-mers (tallymer, specifically)
doesn't scale to samples with 20 or 30 billion unique k-mers.  (He
does this by repeatedly crashing our lab servers.)&lt;/li&gt;
&lt;li&gt;mid-2010: a computational cabal within the lab (Jason, Adina, Rose)
figures out how to count k-mers really efficiently, using a
CountMin Sketch data structure (which we reinvent, BTW, but
eventually figure out isn't novel.  o well).  We implement this in
khmer.  (see &lt;a href="http://ivory.idyll.org/blog/jul-10/kmer-filtering" &gt;k-mer filtering&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;mid-2010: We use khmer to figure out just how much Illumina
sequence sucks.  (see &lt;a href="http://ivory.idyll.org/blog/jul-10/illumina-read-phenomenology" &gt;Illumina read phenomenology&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;mid-2010: Arend joins our computational cabal, bringing detailed
and random knowledge of graph theory with him.  We invent an
&lt;em&gt;actually&lt;/em&gt; novel use of Bloom filters for storing de Bruijn graphs.
(&lt;a href="http://ivory.idyll.org/blog/dec-11/kmer-percolation-posted.html" &gt;blog post&lt;/a&gt;)
The idea of partitioning large metagenomic data sets into
(disconnected) components is born.  (Not novel, as it turns out --
see &lt;a href="http://metavelvet.dna.bio.keio.ac.jp/" &gt;MetaVelvet&lt;/a&gt; and &lt;a href="http://bioinformatics.oxfordjournals.org/content/27/13/i94.abstract" &gt;MetaIDBA&lt;/a&gt;.)&lt;/li&gt;
&lt;li&gt;late 2010: Adina and Rose figure out that Illumina suckage prevents
us from actually getting this to work.&lt;/li&gt;
&lt;li&gt;first half of 2011: Spent figuring out capacity of de Bruijn graph
representation (Jason/Arend) and the right parameters to actually
de-suckify large Illumina data sets (Adina).  We slowly progress
towards actually being able to partition large metagenomic data
sets reliably.  A friend browbeats me into applying the same
technique to his ugly genomic data set, which magically seems to
solve his assembly problems.&lt;/li&gt;
&lt;li&gt;fall 2011: the idea of digital normalization is born: throwing away
redundant data FTW. Early results are very promising (we throw away
95% of data, get identical assembly) but it doesn't scale assembly
as well as I'd hoped.&lt;/li&gt;
&lt;li&gt;October 2011: JGI talk at the &lt;a href="http://www.youtube.com/watch?v=0Oon5viKMmA&amp;amp;list=PL29441D81BD645568&amp;amp;index=8&amp;amp;feature=plpp_video" &gt;metagenome informatics workshop - SLYT&lt;/a&gt;, where
we present our ideas of partitioning and digital normalization,
together, for the first time.  We point out that this potentially
solves all the scaling problems.&lt;/li&gt;
&lt;li&gt;November 2011: We figure out the right parameters for digital
normalization, turning up the awesomeness level dramatically.&lt;/li&gt;
&lt;li&gt;through present: focus on actually writing this stuff up.  See:
&lt;a href="http://ivory.idyll.org/blog/dec-11/kmer-percolation-posted.html" &gt;de Bruijn graph preprint&lt;/a&gt;; &lt;a href="http://ivory.idyll.org/blog/mar-12/diginorm-paper-posted.html" &gt;digital normalization preprint&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;&lt;/blockquote&gt;
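&lt;p&gt;For readers who have not met the two ideas in the timeline that carry most of the weight, here is a toy sketch of both: counting k-mers in a CountMin Sketch, and using those counts to throw away redundant reads.  This is not khmer's implementation.  The table sizes, the SHA-1 hash, k=4, and the cutoff of 2 are arbitrary illustrative choices, and the "drop a read once its median k-mer count reaches the cutoff" rule is a simplified reading of the digital normalization preprint.&lt;/p&gt;
&lt;pre&gt;
```python
import hashlib
from statistics import median


class CountMinSketch:
    """Approximate counter: several fixed-size tables, estimate = smallest cell."""

    def __init__(self, table_sizes=(1009, 1013, 1019, 1021)):
        # Different prime sizes send one hash value to a different slot per table.
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, item):
        digest = hashlib.sha1(item.encode("ascii")).digest()
        value = int.from_bytes(digest[:8], "big")
        return [value % len(table) for table in self.tables]

    def add(self, item):
        for table, slot in zip(self.tables, self._slots(item)):
            table[slot] += 1

    def count(self, item):
        # Collisions can only inflate a cell, so the minimum is the tightest estimate.
        return min(table[slot] for table, slot in zip(self.tables, self._slots(item)))


def kmers(read, k):
    """All overlapping substrings of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=4, cutoff=2):
    """Drop reads whose median k-mer count has already reached the cutoff."""
    sketch = CountMinSketch()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        median_count = median(sketch.count(word) for word in words)
        if min(median_count, cutoff) == cutoff:
            continue  # median has reached the cutoff: treat the read as redundant
        for word in words:
            sketch.add(word)
        kept.append(read)
    return kept


# Five copies of one toy read plus one unrelated read.
reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
print(normalize(reads))
```
&lt;/pre&gt;
&lt;p&gt;The memory used is fixed by the table sizes rather than by the number of distinct k-mers, at the price of occasionally overestimating a count.  That trade is what makes this style of counting plausible for samples with tens of billions of unique k-mers.&lt;/p&gt;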
&lt;hr class="docutils"/&gt;&lt;p&gt;If you read this timeline (yeah, I know it's long, just skim) and look
at the dates of "public disclosure", there's a roughly 15-month gap between
talking about k-mer counting (July 2010) and partitioning/etc (Oct
2011, metagenome informatics talk).  And then there's another
several-month gap before I really talk about digital normalization as
a good solution (basically, mid/late January 2012).&lt;/p&gt;
&lt;p&gt;Why??&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;I was really freakin' busy actually getting the stuff to work, not
to mention teaching, traveling, and every now and then actually
being at home.&lt;/li&gt;
&lt;li&gt;I was definitely worried about "theft" of ideas.  Looking back,
this seems a mite ridiculous, but: I'm junior faculty in a
fast-moving field.  Eeek!  I also have a duty to my grads and
postdocs to get &lt;em&gt;them&lt;/em&gt; published, which wouldn't be helped by being
"scooped".&lt;/li&gt;
&lt;li&gt;We kept on coming up with new solutions and approaches!  Digital
normalization didn't exist until August 2011, for example;
appropriate de-suckifying of Illumina data took until April or May
of 2011; and proving that it all worked was, frankly, quite tough
and took until October.  (More on this below.)&lt;/li&gt;
&lt;li&gt;The code wasn't ready to use, and we hadn't worked out all the
right parameters, and I wasn't ready to do the support necessary to
address lots of people using the software.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;All of these things meant I didn't talk about things openly on my blog.
Is this me falling short of "open science" ideals??&lt;/p&gt;
&lt;p&gt;In my defense, on the "open science" side:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;&lt;li&gt;I gave plenty of invited talks in this period, including a few (one
at JGI and one at UMD CBCB) attended by experts who certainly
understood everything I was saying, probably better than me.&lt;/li&gt;
&lt;li&gt;I posted some of these talks on &lt;a href="http://www.slideshare.net/c.titus.brown/" &gt;slideshare&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;all of our software development has been done on github, under
github.com/ctb/khmer/.  It's all open source, available, etc.&lt;/li&gt;
&lt;/ul&gt;&lt;/blockquote&gt;
&lt;p&gt;...but these are sad excuses for open science.  None of these
activities really disseminated my research openly.  Why?&lt;/p&gt;
&lt;p&gt;Well, invited talks by junior faculty like me are largely attended out
of curiosity and habit, rather than out of a burning desire to
understand what they're doing; odds are, the faculty in question
hasn't done anything particularly neat, because if they had, they'd be
well known/senior, right?  And who the heck goes
through other people's random presentations on slideshare?  So that's
not really dissemination, especially when the talks are given to an in
group already.&lt;/p&gt;
&lt;p&gt;What about the source code?  The "but all my source code is available"
dodge is particularly pernicious.  Nobody, but nobody, looks at other
people's source code in science, unless it's (a) been released, (b)
been documented, and (c) claims to solve YOUR EXACT ACTUAL PROBLEM
RIGHT NOW RIGHT NOW.  The idea that someone is going to come along and
swoop your awesome solution out of your repository seems to me to be
ridiculous; &lt;strong&gt;you'd be lucky to be that relevant, frankly.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;So I don't think any of that is a good way to disseminate what you've
done.  It's necessary for science, but it's not at all sufficient.&lt;/p&gt;
&lt;p&gt;--&lt;/p&gt;
&lt;p&gt;What do I think &lt;em&gt;is&lt;/em&gt; sufficient for dissemination?  In my case, how do
you build solutions and write software that &lt;em&gt;actually has an impact&lt;/em&gt;,
either on the way people think or (even better) on actual practice?
And is it compatible with open science?&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;Write effective solutions to common problems.  The code doesn't
have to be pretty or even work all that well, but it needs to work
well enough to run and solve a common problem.&lt;/li&gt;
&lt;li&gt;Make your software available.  Duh.  It doesn't have to be open
source, as far as I can tell; I think it should be, but plenty
of people have restrictive licenses on their code and software,
and it gets used.&lt;/li&gt;
&lt;li&gt;Write about it in an open setting.  Blogs and mailing lists are ok;
SeqAnswers is probably a good place for my field; but honestly,
you've got to write it all down in a nice, coherent, well-thought
out body of text.  And if you're doing that?  You might as well
publish it.  Here is where Open Access really helps, because The
Google will make it possible for people to find it, read it, and
then go out and find your software.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;The interesting thing about this list is that in addition to all the
less-than-salutary reasons (given above) for not blogging more
regularly about our stuff, I had one &lt;em&gt;very&lt;/em&gt; good reason for not doing
so.&lt;/p&gt;
&lt;p&gt;It's a combination of #1 and #3.&lt;/p&gt;
&lt;p&gt;You see, &lt;strong&gt;until near to the metagenome informatics meeting, I didn't
know if partitioning or digital normalization really worked&lt;/strong&gt;.  We had
really good indications that partitioning worked, but it was never
solid enough for me to push it strongly as an &lt;em&gt;actual&lt;/em&gt; solution to big
data problems.  And digital normalization made so much sense that it
almost &lt;em&gt;had&lt;/em&gt; to work, but, um, proving it was a different problem.
Only in October did we do a bunch of cross-validation that basically
convinced me that partitioning worked &lt;em&gt;really&lt;/em&gt; well, and only in
November did we figure out how awesome digital normalization was.&lt;/p&gt;
&lt;p&gt;So we thought we had solutions, but we weren't sure they were
effective, and we sure didn't have it neatly wrapped in a bow for
other people to use.  So #1 wasn't satisfied.&lt;/p&gt;
&lt;p&gt;And, once we did have it working, we started to put a lot of energy
into demonstrating that it worked and writing it up for publication --
#3 -- which took a few months.&lt;/p&gt;
&lt;p&gt;In fact, I would actually argue that before October 2011, we could
have wasted people's time by pushing our solutions out for general use
when we basically didn't know if they worked well.  Again, we
&lt;em&gt;thought&lt;/em&gt; they did, but we didn't really know.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This is a conundrum for open science: how do you know that someone
else's work is worth your attention?&lt;/strong&gt; Research is really hard, and it
may take months or years to nail down all the details; do you really
want to invest significant time or effort in someone else's research
before that's done?  And when they are done -- well, that's when they
submit it for publication, so you might as well just read that first!&lt;/p&gt;
&lt;p&gt;--&lt;/p&gt;
&lt;p&gt;This is basically the format for open science I'm evolving.  I'll blog
as I see fit, I'll post code and interact with people that I know who
need solutions, but I will wait until we have written a paper to
really open up about what we're doing.  A big part of that is trying
to only push solid science, such that I don't waste other people's
time, energy, and attention.&lt;/p&gt;
&lt;p&gt;So: I'm planning to continue to post all my senior-author papers to
arXiv just before their first submission.  The papers will come with
open source and the full set of data necessary to recapitulate our
results.  And I'll blog about the papers, and the code, and the work,
and try to convince people that it's nifty and awesome and solves some
useful problems, or addresses cool science.  But I don't see much
point in broadly discussing my stuff before a preprint is available.&lt;/p&gt;
&lt;p&gt;Is this open science?  I don't really think so.  I'd really like to
talk more openly about our actual research, but for all the reasons
above, it doesn't seem like a good idea.  So I'll stick to trying to
give presentations on our stuff at conferences, and maybe posting the
presentations to slideshare when I think of it, and interacting with
people privately where I can understand what problems they're running
into.&lt;/p&gt;
&lt;p&gt;What I'm doing is more about &lt;em&gt;open access&lt;/em&gt; than open science: people
won't find out details of our work until I think it's ready for
publication, but they also won't have to wait for the review process
to finish.  While I'm not a huge fan of the way peer review is done, I
accept it as a necessary evil for getting my papers into a journal.
By the time I submit a paper, I'll be prepared to argue, confidently
and with actual evidence, that the approach is sound.  If the
reviewers disagree with me and find an actual mistake, I'll fix the
paper and apologize profusely &amp;amp; publicly; if reviewers just want more
experiments done to round out the story, I'll do 'em, but it's easy to
argue that additional experiments generally don't &lt;em&gt;detract&lt;/em&gt; from the
paper unless they discover flaws (see above, re "apologize").  The
main thing reviewers seem to care about is softening grandiose claims,
anyway; this can be dealt with by (a) not making them and (b) sending
to impact-oblivious journals like PLoS One.  I see no problem with
posting the paper, in any of these circumstances.&lt;/p&gt;
&lt;p&gt;Maybe I'm wrong; experience will tell if this is a good idea.  It'll
be interesting to see where I am once we get these papers out... which
may take a year or two, given all the stuff we are writing up.&lt;/p&gt;
&lt;p&gt;I've also come to realize that most people don't have the time or
(mental) energy to spare to really come to grips with other people's
research.  We were doing some pretty weird stuff (sketch graph
representations? streaming sketch algorithms for throwing away data?),
and I don't have a prior body of work in this area; most people
probably wouldn't be able to guess at whether I was a quack without
really reading through my code and presentations, and understanding it
in depth.  That takes a &lt;em&gt;lot&lt;/em&gt; of effort.  And most people
don't really understand the underlying issues anyway; those who do
probably care about them sufficiently to have their own research ideas
and are pursuing them instead, and don't have time to understand mine.
The rest just want a solution that runs and isn't obviously wrong.&lt;/p&gt;
&lt;p&gt;In the medium term, the best I can hope for is that preprints and blog
posts will spur people either to use our software and approaches or,
even better, to come up with nifty &lt;em&gt;new&lt;/em&gt; approaches
that solve the problems in some new way that I'd never have thought
of.  And then I can read &lt;em&gt;their&lt;/em&gt; work and build on &lt;em&gt;their&lt;/em&gt; ideas.
&lt;strong&gt;This is what we should strive for in science: the shortest
round trip between solid scientific inspiration in different labs.&lt;/strong&gt;
This does not necessarily mean open notebooks.&lt;/p&gt;
&lt;p&gt;Overall, it's been an interesting personal journey from "blind
optimism" about openness to a more, ahem, "nuanced" set of thoughts
(i.e., I was wrong before :).  I'd be interested to hear what other
people have to say... drop me a note or make a comment.&lt;/p&gt;
&lt;p&gt;--titus&lt;/p&gt;
&lt;p&gt;p.s. I recognize that it's too early to really defend the claim that
our stuff provides a broad set of solutions.  That's not up to me to
say, for one thing.  For another, it'll take years to prove out.  So
I'm really talking about the hypothetical situation where it &lt;em&gt;is&lt;/em&gt;
widely useful in practice, and how that intersects with open science
goals &amp;amp; practice.&lt;/p&gt;
&lt;/div&gt;</description>
    </item>
  </channel>
</rss>
