Older blog entries for titus (starting at number 451)

Algorithms, and finding biological variation

I've been doing some more focused bioinformatics programming recently, and as I'm thinking about how to teach biologists about data analysis, I realize more and more how much backstory goes into even relatively simple programming.

The problem: given a reference genome, and a very large set of short, error-prone, random sequences from an individual (with a slightly different genome) aligned to that genome, find all differences and provide enough summary information that probably-real differences can be distinguished from differences due only to error. More explicitly, if I have a sequence to which a bunch of shorter sequences have been aligned, like so:

GAGGGCTACAAAAGGAGGCATAGGACCGATAGGACAT (reference genome)

   GGCTAC          -HWI-EAS216_0019:6:44:8736:13789#0/1[0:6]
   GGCTAAA         HWI-EAS216_0019:6:11:12331:10539#0/1[28:35]
   GGCTAAAAA       HWI-EAS216_0019:6:54:3745:18270#0/1[26:35]
   GGCTAAAAAA      HWI-EAS216_0019:6:75:4175:16939#0/1[25:35]
   GGATACAACAG     -HWI-EAS216_0019:6:96:16613:3276#0/1[0:11]
   GGCTAAAAAAA     HWI-EAS216_0019:6:11:8510:9363#0/1[22:33]
   GGCTAAAAAAA     HWI-EAS216_0019:6:72:14343:19416#0/1[21:32]
   GGCTACAACAG     -HWI-EAS216_0019:6:116:10526:5940#0/1[4:15]
   GGCTAAAACAA     HWI-EAS216_0019:6:103:9198:1208#0/1[18:29]
   GGCTACAACAG     -HWI-EAS216_0019:6:50:14928:13905#0/1[7:18]
   GGCTACAACAG     -HWI-EAS216_0019:6:109:18796:4787#0/1[14:25]
   GGCTACAACAG     -HWI-EAS216_0019:6:102:14050:8429#0/1[18:29]
   GGCTACAACAG     -HWI-EAS216_0019:6:22:14814:10742#0/1[23:34]
    GCTCCACCAG     -HWI-EAS216_0019:6:106:8241:6484#0/1[25:35]
      TACAACAG     HWI-EAS216_0019:6:4:8095:15474#0/1[0:8]
        CAACAG     HWI-EAS216_0019:6:115:15766:4844#0/1[0:6]
     * **  * *
     1 23  4 5

how do you gather all of the locations with differences, and then figure out which differences are significant? (Variations in 1 and 2 are probably errors, 4 and 5 may be real, and 3 is pretty clearly real, I would say.)

The numbers here aren't trivial: you're looking at ~1 million-3 billion bases for the reference genome, and 20-120 million or more shorter resequencing reads.

The first thing to point out is that it's not enough to simply look at all of the aligned sequences individually; you need to know how many total sequences are aligned to each spot, and what each nucleotide is. This is because the new sequences are error-prone: observing a single variation isn't enough, you need to know that it's systematic variation. So it's a total-information situation, and hence not just pleasantly parallel: you need to integrate the information afterwards.

I came up with two basic approaches: you can first figure out where there's any variation, and then iterate again over the short sequences, figure out which ones overlap detected variation, and record those counts; or you can figure out where there's variation, index the short sequences by position, and then query them by position. Both are easy to code.

For the first, approach A:

for seq, location in aligned_short_sequences:
   record_variation(genome[location])

for seq, location in aligned_short_sequences:
   if has_variation(location):
      record_nucleotide(genome[location], seq)

For the second, approach B:

for seq, location in aligned_short_sequences:
   record_variation(genome[location])

for location in recorded_variation:
   aligned_subset = get_aligned_short_sequences(location)
   for seq, _ in aligned_subset:
      record_nucleotide(genome[location], seq)

It's readily apparent that barring stupidity, approach A is generally going to be faster: there's one less loop, for one thing. Also, if you think about implementation details, approach B is potentially going to involve retrieving aligned short sequences multiple times, while approach A looks at each aligned short sequence exactly once.

However, there are tradeoffs, especially if you're building tools to support either approach. The most obvious one is, what if you need to look at variation on less than a whole-genome level? For example, for visualization, or gene overlap calculations, you're probably not going to care about an immense region of the genome; you may want to zoom in on a few hundred bases. In that case, you'd want to use some of the tools from approach B: given a position, what aligns there?

if variation_at(genome[location]):
    aligned_subset = get_aligned_short_sequences(location)
    for seq, _ in aligned_subset:
       record_nucleotide(genome[location], seq)
variation in the genome

Another tradeoff is simply whether or not to store everything in memory (fast) or on disk (slower, but scales to bigger data sets) -- I first wrote this code using an in-memory set, to look at low-variation chicken genome resequencing. Then, when I tried to apply it to a high variation Drosophila data set and a deeply sequenced Campylobacter data set, it consumed a lot of RAM. So you have to think about whether or not you want a single tool that works across all these situations, and if so, how you want it to behave for large genomes, with lots of real variation, and extremely deep sequencing -- plan for the worst, in other words.

I also tend to worry about correctness. Is the code giving you the right results, and how easy is it to understand? Each approach has different drawbacks, and relies on different bodies of code that make certain operations easy and other operations hard.

The upshot is that I wrote several programs implementing different approaches, and tried them out on different data sets, and chose the fastest one for one purpose (overall statistics) and a slower one for visualization. And most programmers I know would have done the same: tried a few different approaches, benchmarked them, and chosen the right tool for the current job.

Where does this leave people who don't intuitively understand the difference between the algorithms, and don't really understand what the tools are doing underneath, and why one tool might return different results (based on the parameters given) than another? How do you explain that sort of thing in a short amount of time?

I don't know, but it's going to be interesting to find out...

BTW, the proper way to do this is with pygr will someday soon be:

for seq, location in aligned_short_sequences:
   record_variation(genome[location])

mapping = index_short_sequences()

results = intersect_mappings(variations, mapping)

where 'intersect_mappings' uses internal knowlege of the data structures to do everything quickly and correctly. That day is still a few months off...

--titus

Syndicated 2010-04-04 00:58:05 from Titus Brown

Indexing and retrieving lots of (biological) sequences, quickly

These days, molecular biologists are dealing with lots and lots of sequences, largely due to next-gen sequencing technologies. For example, the Illumina GA2 is producing 100-200 million DNA sequences, each of 75-125 bases, per run; that works out to 20 gb of sequence data per run, not counting metadata such as names and quality scores.

Storing, referencing, and retrieving these sequences is kind of annoying. It's not so much that the files themselves are huge, but that they're in a flat file format called 'FASTA' (or FASTQ), with no way to quickly retrieve a particular sequence short of scanning through the file. This is not something you want to do on-demand or keep in-memory, no matter what BioPython says ;).

So everyone working with these sequences has evolved one way or another to deal with it. A number of different approaches have been used, but they mostly boil down to using some sort of database to either index or store the sequences and associated metadata. It won't surprise you to learn that performance varies widely; that benchmarks are few and far between; and that many people just use MySQL, even though you can predict that using something like MySQL (or really any full-blown r/w database) is going to be suboptimal. This is because FASTA and FASTQ files don't change, and the overhead associated with being able to rearrange your indices to accomodate new data is unnecessary.

A while ago -- nearly 2 years ago!? -- we started looking at various approaches for dealing with the problem of "too much sequence", because the approach we used in pygr was optimized for retrieving subsequences from a relatively small number of relatively large sequences (think human chromosomes). Alex Nolley, a CSE undergrad working in the lab, looked at a number of approaches, and settled on a simple three-level write-once-read-many scheme involving a hash table (mapping names to record IDs), another table mapping fixed-size record IDs to offsets within a database file, and a third randomly-indexed database file. He implemented this in C++ and wrote a Pyrex wrapper and a Python API, and we called it screed.

Imagine our surprise when we found out that even absent the hashing, sqlite was faster than our screed implementation! sqlite is an embedded SQL database that is quite popular and (among other things) included with Python as of Python 2.5. Yet because it's a general purpose database, it seemed unlikely to us that it would be faster than a straightforward 2-level indexed flat file scheme. And yet... it was! A cautionary tale about reimplementing something that really smart/good programmers have spent a lot of time thinking about, eh? (Thanks to James Casbon, in particular, for whacking us over the head on this issue; he wins, I lose.)

Anyway, after a frustrating few weeks trying to think of why screed would be slower than sqlite and trying some naive I/O optimizations, we abandoned our implementation and Alex reimplemented everything on top of sqlite. That is now available as the new screed.

screed can import both common forms of sequence flat file (FASTA and FASTQ), and provides a unified interface for retrieving records by the sequence name. It stores all records as single rows in a table, along with another table to describe what the rows are and what special features they have. Because all of the data is stored in a single database, screed databases are portable and you can regenerate the original file from the index.

screed is designed to be easy to install and use. Conveniently, screed is pure Python: sqlite comes with Python 2.5, and because screed stays away from sqlite extensions, no C/C++ compiler is needed. As a result installation is really easy, and since screed behaves just like a read-only dictionary, anyone -- not just pygr users -- can use it trivially. For example,

>>> import screed
>>> db = screed.read_fasta_sequences('tests/test.fa')

or

>>> db = screed.read_fastq_sequences('tests/test.fq')

and then

>>> for name in db:
...   print name, db[name].sequence

will index the given database and print out all the associated sequences.

Now, because we are actually considering publishing screed (along with another new tool, which I'll post about next), I asked Alex to benchmark screed against MySQL and PostgreSQL. While sqlite is far more convenient for our purposes -- embeddable, included with Python so no installation needed, and not client-server based -- "convenience" isn't as defensible as "way faster" ;). Alex dutifully did so, and the results for accessing sequences, once they are indexed, are below:

benchmark figure

That is, sqlite/screed seems to achieve a roughly 1.8x speedup over postgresql, and 2.2x speedup over mysql, with a roughly constant retrieval time for up to a million sequences. (We've seen constant retrieval times out to 10m or so, but past that it becomes annoying to repeat benchmarks, so we'd like to get our benchmark code correct before running them.)

The benchmark code is here: screed, mysql, and postgresql. We're still working on making it easy to run for other people but you should get the basic sense of it from the code I linked to.

screed's performance is likely to improve as we streamline some of the internal screed code, although the improvements should be marginal because I/O and hashing should dominate. So, these benchmarks can be read as "underperforming Python interface wrapped around sqlite vs mysql and postgresql".

There is one disappointment with sqlite. sqlite doesn't come with any sort of compression system built-in, so even though DNA sequence is eminently compressible, we're stuck with a storage and indexing system that requires about 120% of the space of the original file.

--titus

Syndicated 2010-03-29 02:41:31 from Titus Brown

Course announcement: Analyzing Next-Generation Sequencing Data

Analyzing Next-Generation Sequencing Data

May 31 - June 11th, 2010

Kellogg Biological Station, Michigan State University

CSE 891 s431 / MMG 890 s433, 2 cr

Applications are due by midnight EST, April 9th, 2010.

Course sponsor: Gene Expression in Disease and Development Focus Group at Michigan State University.

Instructors: Dr. C. Titus Brown and Dr. Gregory V. Wilson.

Course Description

This intensive two week summer course will introduce students with a strong biology background to the practice of analyzing short-read sequencing data from the Illumina GA2 and other next-gen platforms. The first week will introduce students to computational thinking and large-scale data analysis on UNIX platforms. The second week will focus on mapping, assembly, and analysis of short-read data for resequencing, ChIP-seq, and RNAseq.

No prior programming experience is required, although familiarity with some programming concepts is suggested, and bravery in the face of the unknown is necessary. 2 years or more of graduate school in a biological science is strongly suggested.

See the course Web site for more information.

Syndicated 2010-03-24 02:22:27 from Titus Brown

What's with the goat?

A new meme was born at PyCon 2010: The Testing Goat.

Or, "Be Stubborn. Obey the Goat."

The goat actually emerged from the Testing In Python Birds of a Feather session at PyCon, where Terry Peppers used slides full of goat in his introduction. This was apparently an overreaction to lolcat, but the testing goat is now being held up in opposition to the Django pony.

sigh.

--titus

Syndicated 2010-02-21 20:35:31 from Titus Brown

Managing student expectations for open-source projects

On the heels of my aggressive competence post, about (among other things) my failure to outline my expectations for students, I've started putting together a page to help manage student expectations for the pony-build project, which is participating in the Undergraduate Capstone Open-Source Projects (UCOSP) course this term.

(Please comment over at the Wordpress blog for UCOSP:

http://ucosp.wordpress.com/2010/01/02/managing-student-expectations/

so the students can see your words of wisdom!)

--titus

Syndicated 2010-01-02 21:06:53 from Titus Brown

Why use buildbots?

I've recently turned my basilisk eye from Web testing and code coverage analysis to continuous integration, as you can see from my PyCon '10 talk and my UCOSP proposal, not to mention everyone wants a pony.

There's some confusion about what "continuous integration" means (see Martin Fowler on CI) so for simplicities sake I'm just going to talk about "buildbots" that take your code, compile it (if necessary), run all the tests across multiple platforms, and provide some record of the results. (This choice of terms is also confusing because "buildbot" is a widely used Python software package for CI. Sigh.)

why use buildbots?

So, uhh, why use buildbots, anyway?

  1. They build your code and run your tests without your conscious involvement.

Obvious, yes -- that is, after all, ostensibly the point of buildbots. But it has more benefits than you might immediately.

For this to work, you must have a systematized and automated build process.

You must also have some automated tests.

And your your build process and tests are being run on a regular basis, whether or not any particular developer feels like it. And if the build or tests fail, then more likely than not, something changed to make them fail -- and now you'll know.

These are all good and necessary things.

  1. They can build your code and run your tests in multiple environments.

buildbots can build and run your project on whatever operating systems you or your colleagues can access, and report the results to you, with a minimum of setup.

This is the main reason I use buildbots myself: to run tests on other versions of Python, and other operating systems. I'm a UNIX guy, and I develop on Linux; therefore my software usually works on Linux. My pure Python code generally works on Mac OS X, too, although I sometimes run into trouble with compiled code. But I don't ever run my software on Windows systems, because I don't have Windows handy; so my code often doesn't work on Windows. This is where a Windows buildbot comes in really handy, by catching the errors that I otherwise wouldn't even notice.

There's a more subtle point here that many people miss, which is the ability of buildbots to test dependence on a specific full stack of hardware and software. Most developers work with at most one or two build environments, including compiler or interpreter versions, operating system patchlevels, etc. The more different versions you have being tested, the more you can detect sensitivities to specific operating system or compiler or language features; whether or not cross-compiler or cross-version compatibility important to you is a different question, of course, but it's nice to know.

The most entertaining aspect of this is when buildbots detect when developers -- especially inexperienced ones -- introduce unintended or unauthorized new dependencies. "Hey, Joe, since when does our software depend on FizBuzz!?"

These latter points feed particularly into #3 and #4:

  1. They provide a de facto set of docs on your build & test environment.

buildbots require explicit build instructions, so if you've got one running at least your project has some form of build documentation. Not a good one, maybe not an explicit one, but something.

This is not a concern for most big open source projects, because they usually have fairly straightforward and well-documented build environments (although not all -- OLPC/Sugar was horrific!) Where I think this really helps is for small private projects and especially for for academic projects, where the level of software engineering expertise can be, ahem, poor. Having explicit build instructions that graduate student B can use to build & run the code now that graduate student A has left the project is quite helpful.

4. They are evidence that it is possible to build your code and run your tests on at least some platform.

You might be surprised how much some projects really need this kind of evidence :). As with #3, small private projects and academic projects benefit the most from this.

  1. They can run all the tests, even the slow ones, regularly.

This is the third reason that software professionals like continuous integration and buildbots: many tests (in particular, integration and acceptance tests) may take a loooong time to run, and developers may end up simply not running all of them. With buildbots, you can run them on a daily basis and detect problems, without distracting or defocusing your developers.

Are buildbots overkill for your project?

buildbots require setup and maintenance effort, which (in our zero-sum world) takes that effort away from developing new features, exploratory testing, etc. When does the benefit outweigh the cost?

Almost always, I believe.

For small side projects that you may not be constantly focused on, having the tests alert you when something breaks is really helpful. But even if you're in a mature software engineering setting and you have a good build process, a good set of documentation on how to build your software, and a commitment to running the tests regularly, many of the advantages above still apply. In particular, #1 (building w/o conscious effort), #2 (building across multiple environments), and #5 (running all of the tests, especially the slow ones), are advantageous for all projects.

I think buildbots aren't that useful for projects that are mostly UI (which is hard to develop automated tests for) or that are at a very early stage (where you're accumulating technical debt on a daily basis) or that depend on lots of specialized hardware. What else?

What's next?

I personally think that the technology that's out there in the Python world isn't that simple and hackable, so that's what I'm working on. I'd also like to minimize configuration and maintenance. I have a simple implementation "thought project", pony-build, that I'm hoping will address these issues. The goal is to make buildbots "out of sight, out of mind."

A secondary goal (one of many - watch this space) is to enable simple integration into a pipeline where patches can be tested, and/or automatically accepted or rejected, based on whether or not they pass tests on multiple platforms.

--titus

Syndicated 2009-12-22 20:35:00 from Titus Brown

Exhibiting aggressive competence

This last term I facilitated the participation of five MSU students in the Undergraduate Capstone Open Source Projects (UCOSP) program, in which students do distributed open source software development and receive home institution credit. UCOSP was managed out of U Toronto by Greg Wilson, and I was (and am) enthusiastic to participate as it's clearly a good way bring open source into education.

However, I was less thrilled to see that the majority of the MSU students received, ahem, "less than passing" grades from their project leaders. I knew about the problems in one particular project from having met with the students on a regular basis, but the other results caught me by surprise. I would love to kick and scream and complain that I should have been made more aware of what was going on -- and where I had constructive things to suggest, I did -- but the more important failure may have been a mismatch between the MSU students' approach to these projects, and project expectations.

The students variously had a number of problems, ranging from team miscommunication & poor conduct to an inability to get the software to compile. This meant that for several students, no visible work got done -- for example, in one project, it regularly happened that person X was working on a patch, and person Y committed an overlapping patch first. Or on another project, person Z spent two months trying to get the basic project infrastructure compiled, and was reduced (at the very end) to submitting code fixes without testing them in the full project context. Or several times, person A spent a week working out how to refactor a test into something reliable, and resulted in what looked like (and maybe was) a trivial code change.

All of these situations may result (and did result) in low evaluations. This is understandable: no visible work got done, so how is an evaluator supposed to grade them!? Yet, all of the situations are legitimate issues that block progress. What is a student to do?

The answer won't be too hard to guess for anyone who has worked on real-world team projects: make your struggles visible.

Someone steps on your patch? Fine -- submit your patch too, and explain why it's better (or worse) than the first patch. Code review the other patch, while you're at it: who better to do the review than someone who really understands the issues? Then when you get poor marks for not having contributed code, point at your patch. (You are using version control, right?)

Can't compile the software? Fine -- write down what's going wrong, and post it publicly. Document your fix attempts. Ask for help. Bash your head against the wall repeatedly. Either fix the problem, or document the problem thoroughly. Either someone will help you, or you'll figure it out, or you'll leave an audit trail so that others won't have to do all that fail work. Then when you get poor marks for not having contributed any patches, point out that the project has technical issues and either no one could help you (project FAIL) or you spent all your time fixing them.

Trying to debug niggling details that turn out (in the end) not to involve big impressive code changes? Submitting too many unimpressive patches that no one seems to value? Write down why your contributions are valuable. At the end of the day the evaluation may (rightly or wrongly) be "not too smart, but sure did work hard" -- but that's better than "no evidence of any work having been done".

Note how a lot of this seems to involve communication? Right -- that. For team projects, being an effective communicator is more important than being a kick-ass programmer.

At the end of the day, there are things you can control, and things you can't control. You can't control what other people think of you, and you can't control how other people (including project leaders and professors) evaluate you. But you can visibly work hard, and defend yourself based upon that evidence.

I call the general approach of throwing energy at a project "aggressive competence", and I think it's a necessary component of effective team software development. Everyone has days, or weeks, or even months where they look incompetent or ineffective; often that's because outsiders don't understand or appreciate the work that you've done. Tough on you, but I don't think it's reasonable to expect your boss, or colleagues, to look hard at your work to find reasons to praise you. Fundamentally, it's your responsibility to "manage up" and communicate your progress to others effectively.

In open source projects (and elective college courses) the immediate ramifications of a poor evaluation may not be clear -- I'll leave you, dear reader, to figure out the longer term consequences. But I think the ramifications of a poor evaluation are immediately obvious in the context of a capstone course, or a paying job.

Incidentally, this illuminates one of the reasons why I'm such a big fan of UCOSP: it is reality. You're working on an existing project, with other developers, at a distance; and it's not anyone else's responsibility to frame the problem for you. It's your responsibility to make progress.

This is where I think there were mismatched expectations. The students expected that they were going to be managed, helped, and given clear expectations. They weren't. So they got bad evaluations.

What do I plan to do? Well, assuming that UCOSP + MSU goes forward next term, I will be communicating my expectations quite clearly to the students. And I will be asking for regular progress reports, sent to me and CCed to the project leaders. And I'll be sending them this blog post. And I'll be failing the ones that don't listen.

I'll end with a paraphrase of one of my favorite sci-fi authors: "every new developer has problems on a new project. The extent of our sympathy for those problems, however, will be dictated by the efforts made to overcome them."

--titus

p.s. It's also a good way to figure what projects you don't want to work on: I once got dinged for working too hard in a company; I was told that I was "rowing too fast and the boat was going around in circles." My response (that perhaps others might consider rowing faster) was not received well. That's the kind of job situation you can leave without guilt (as I did).

p.p.s. Code reviews can be an extraordinarily effective passive-aggressive way to correctively interact with jerks on a project, too.

Syndicated 2009-12-18 19:27:22 from Titus Brown

Exhibiting aggressive competence

This last term I facilitated the participation of five MSU students in the Undergraduate Capstone Open Source Projects (UCOSP) program, in which students do distributed open source software development and receive home institution credit. UCOSP was managed out of U Toronto by Greg Wilson, and I was (and am) enthusiastic to participate as it's clearly a good way bring open source into education.

However, I was less thrilled to see that the majority of the MSU students received, ahem, "less than passing" grades from their project leaders. I knew about the problems in one particular project from having met with the students on a regular basis, but the other results caught me by surprise. I would love to kick and scream and complain that I should have been made more aware of what was going on -- and where I had constructive things to suggest, I did -- but the more important failure may have been a mismatch between the MSU students' approach to these projects, and project expectations.

The students variously had a number of problems, ranging from team miscommunication & poor conduct to an inability to get the software to compile. This meant that for several students, no visible work got done -- for example, in one project, it regularly happened that person X was working on a patch, and person Y committed an overlapping patch first. Or on another project, person Z spent two months trying to get the basic project infrastructure compiled, and was reduced (at the very end) to submitting code fixes without testing them in the full project context. Or several times, person A spent a week working out how to refactor a test into something reliable, and resulted in what looked like (and maybe was) a trivial code change.

All of these situations may result (and did result) in low evaluations. This is understandable: no visible work got done, so how is an evaluator supposed to grade them!? Yet, all of the situations are legitimate issues that block progress. What is a student to do?

The answer won't be too hard to guess for anyone who has worked on real-world team projects: make your struggles visible.

Someone steps on your patch? Fine -- submit your patch too, and explain why it's better (or worse) than the first patch. Code review the other patch, while you're at it: who better to do the review than someone who really understands the issues? Then when you get poor marks for not having contributed code, point at your patch. (You are using version control, right?)

Can't compile the software? Fine -- write down what's going wrong, and post it publicly. Document your fix attempts. Ask for help. Bash your head against the wall repeatedly. Either fix the problem, or document the problem thoroughly. Either someone will help you, or you'll figure it out, or you'll leave an audit trail so that others won't have to do all that fail work. Then when you get poor marks for not having contributed any patches, point out that the project has technical issues and either no one could help you (project FAIL) or you spent all your time fixing them.

Trying to debug niggling details that turn out (in the end) not to involve big impressive code changes? Submitting too many unimpressive patches that no one seems to value? Write down why your contributions are valuable. At the end of the day the evaluation may (rightly or wrongly) be "not too smart, but sure did work hard" -- but that's better than "no evidence of any work having been done".

Note how a lot of this seems to involve communication? Right -- that. For team projects, being an effective communicator is more important than being a kick-ass programmer.

At the end of the day, there are things you can control, and things you can't control. You can't control what other people think of you, and you can't control how other people (including project leaders and professors) evaluate you. But you can visibly work hard, and defend yourself based upon that evidence.

I call the general approach of throwing energy at a project "aggressive competence", and I think it's a necessary component of effective team software development. Everyone has days, or weeks, or even months where they look incompetent or ineffective; often that's because outsiders don't understand or appreciate the work that you've done. Tough on you, but I don't think it's reasonable to expect your boss, or colleagues, to look hard at your work to find reasons to praise you. Fundamentally, it's your responsibility to "manage up" and communicate your progress to others effectively.

This is where I think there were mismatched expectations. The students expected that they were going to be managed, helped, and given clear expectations. They weren't. So they got bad evaluations.

In open source projects (and elective college courses) the immediate ramifications of a poor evaluation may not be clear -- I'll leave you, dear reader, to figure out the longer term consequences. But I think the ramifications of a poor evaluation are immediately obvious in the context of a capstone course, or a paying job.

Incidentally, this illuminates one of the reasons why I'm such a big fan of UCOSP: it is reality. You're working on an existing project, with other developers, at a distance; and it's not anyone else's responsibility to frame the problem for you. It's your responsibility to make progress.

What do I plan to do? Well, assuming that UCOSP + MSU goes forward next term, I will be communicating my expectations quite clearly to the students. And I will be asking for regular progress reports, sent to me and CCed to the project leaders. And I'll be sending them this blog post. And I'll be failing the ones that don't listen.

I'll end with a paraphrase of one of my favorite sci-fi authors: "every new developer has problems on a new project. The extent of our sympathy for those problems, however, will be dictated by the efforts made to overcome them."

--titus

p.s. It's also a good way to figure what projects you don't want to work on: I once got dinged for working too hard in a company; I was told that I was "rowing too fast and the boat was going around in circles." My response (that perhaps others might consider rowing faster) was not received well. That's the kind of job situation you can leave without guilt (as I did).

p.p.s. Code reviews can be an extraordinarily effective passive-aggressive way to correctively interact with jerks on a project, too.

Syndicated 2009-12-18 19:10:45 from Titus Brown

A Tale of a Bug

or, "those python-dev people are awesome."

My experience with the Python bug tracker has been pretty sparse and largely limited to some of the eternaissues like "make HTMLParser deal with even more broken HTML" that never really get resolved because they're not very important and don't have a champion. So when I filed this minor bug report on test_distutils I was not expecting it to be on anyone's top 10 list.

I filed the bug report at 17:21 and dropped a note to python-dev.

Within an hour or so, several people reported that they could not duplicate it, and one person had reported that they could, on the mailing list.

At 19:58 Ned Deily reported through the bug tracker that he could dupe it.

By 20:36, Tarek had a suggestion for something to try (that wasn't the problem, but never mind).

By 21:58, Ned had tracked it down to a UNIX-y flavah difference (BSD vs SysV) in the way group ownership was set on files.

And at 22:30, Tarek had fixed the problem (by dropping the assertions).

So, umm, wow, that was quick! Just over 5 hours, across at least two continents, on a Sunday...

---

I was impressed by how many people chipped in to get a broad spectrum view of things, how quickly some hypotheses were generated & in the end how quickly the issue was resolved -- for a relatively minor problem in one corner of the test suite that didn't show up on any of the buildbots.

I'd bet that part of this speedy response is because Tarek Ziade has taken on stewardship of the distutils code. If so, this highlights how important it is to have people who feel responsible for various bips and bobs of the stdlib & just do whatever needs to be done.

Oh, and it also highlights the value of continuous integration across many machines; I saw the problem because I was running tests inside pony-build in a certain way on a Mac OS X 10.5 machine, and the error wasn't tripped in the 10.6 buildbot.

--titus

Syndicated 2009-11-30 03:10:17 from Titus Brown

Comparing two Python source files for equivalence

We've been doing some wholesale PEP 8 reformatting over in the pygr project and we're trying to figure out how to review the changes to make sure nothing inadvertently got broken. Most of the changes are in spacing and line breaks, not in variable names, so I think what we really want is a whitespace-ignoring diff utility. That's not that difficult to find -- except that we're coding in Python, so relative whitespace indentation levels are significant!

Any suggestions for tools that could help?

I've tried several approaches based on tokenizing and using ASTs, but I'm not that clever. I'm going to take another look at processing two token streams, and if that doesn't work, I think 2to3 may hold the key...

--titus

Syndicated 2009-11-26 14:20:12 from Titus Brown

442 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!