Older blog entries for titus (starting at number 438)

GHOP to run again; HELP.

The contest formerly known as GHOP is going to run again this fall, and we need your help.

Yes, you. YOU, over there in the corner. Stop avoiding this post!

GHOP, for those of you who don't remember or weren't around 2 years ago, was the very successful pilot sister program to the Google Summer of Code that involved students aged 13 and up from countries around the world (excepting only the Axis of Evil) in open source work. Nearly 400 students (!) participated and there was much rejoicing. (Summary post here, and all of my blog posts on Python's GHOP here.)

The good news was that GHOP was a big success from the perspective of many people: unlike the GSoC, which requires a substantial time investment from the mentor, and is only intended for coding work, GHOP involved byte-sized chunks of work in all areas (docs, testing, etc.) and rewarded both students and mentors for even a little bit of participation. In a signal of GHOP's success, by the end of the contest coming up with new Python-based tasks was easy -- people were literally throwing them at me, because they saw the rate at which existing tasks were being completed! I know that GvR was happy with the doc patches that made it into Python, and Andre Roberge gives GHOPpers a fair bit of credit for their contributions to Crunchy; there are a number of other success stories, too, including when Kumar told me that a task was too big and open-ended and then a 13 year old took the task and aced it, proving that I am not always wrong to ignore Kumar.

The bad news was that running GHOP was an immense amount of work, largely because of a lousy infrastructure -- Google Code isn't intended for this kind of thing, but we had to use something Google-hosted because it was a contest.

So what did Google do? They created the Melange project to help provide infrastructure for the GSoC and the GHOP both. It was used for the GSoC this last summer, and despite its rough edges, it worked out quite well.

Now Google is running GHOP again, and they're aiming to start December 7th. Unfortunately, in order to make that happen, they need a LOT of help on Melange.

Where do YOU come in?

Well, presumably you're a Python coder. You may be an expert in testing. You might be a Django nutcase. You're probably a Web developer (and odds are you've written your own Web framework, too, but never mind).

And guess what Melange is written in?

That's right, the best language on Earth (or at least a reasonable facsimile of it) -- Python.

You already know the language.

You already know how to use it in anger, to make the computer do your bidding.

Why not put your skillz to use?

I will be hitting up specific people and specific lists once we know when the IRC meeting to discuss Melange development is. Why not save yourself the aggravation of feeling guilt when you get my e-mail in a few days, and just sign up for the Melange dev list right now?

---

Seriously, GHOP was awesome last time and we got a lot done for quite a few different Python projects. This time, we're older, more experienced, and better prepared to take advantage of GHOP. Join us, and you will become more powerful than you can possibly imagine!

You can find a list of areas where Melange devs feel they need help right here. I look forward to seeing YOU working on them!

--titus

Syndicated 2009-09-12 03:42:08 from Titus Brown

How the Python stdlib changes (a public service message)

In the interests of social anthropology, I feel compelled to point Pythonistas at this fascinating discussion on the stdlib-sig on adding argparse to the Python stdlib. (Yeah, it's pretty much the only traffic that list has gotten so far this month.)

Fascinating stuff. If there's a secret cabal out there masterminding Python development, they are clearly rather poorly organized ;)

--titus

Syndicated 2009-09-12 02:27:04 from Titus Brown

Buggy Python code?

I'm looking for examples of frustratingly simple-yet-wrong Python code, suitable for an undergrad class to debug. I'd prefer things that don't rely on tricky features of Python (like shared list references), but rather code where subtly bad logic or program flow leads to bad behavior.

Comment below, or e-mail me; I'll post the ones I pick later. thanks!

--titus

Syndicated 2009-09-09 02:13:57 from Titus Brown

Chickens are not a rate limiting factor

My wife and I were talking with my USDA collaborator about some possible chicken research, and I asked about access to animals. His response? "Chickens are not a rate limiting factor."

Did you know that 1 million chickens are slaughtered per hour, on average, in the US? Wow.

--titus

Syndicated 2009-09-06 23:59:31 from Titus Brown

Success, at last!

For only the second time (out of many tries) I managed to smoke some salmon and trout so that it was not overcooked and dry as a bone. Conclusion? I think my smoker thermometer is about 50 deg F off of the true "on grill" temperature, probably because it's about 3/4 of a foot above the grill level. So I just let the smoker sit at a lower measured temperature and voila, tasty fish!

I also got a full Windows dev environment working, from scratch, for Python. I took advantage of the Snakebite MSDN account to grab Windows XP and Visual Studio 2008, and then used Parallels to create a VM and install everything.

A few comments:

Parallels (a Mac OS X app) makes Windows much more bearable. On first blush, they've really got a good setup for people who only occasionally need to use Windows and generally hate every minute of it. It's kind of funny, really; even the Windows emulator is better on the Mac than Windows is itself!

It took about 24 hrs of futzing to get everything installed and updated. I ran into several situations where I had to turn off the Parallels disk sharing setup (which shares disks between the Mac host and the Windows VM) in order to install packages. This included the Python 2.6.2 MSI installer, the Service Pack 2 upgrade, and (I think) MySQL 5.1.

The purpose of this Windows futzing is to get a build client system going for Windows XP; for that, I needed git and svn. I'm happy to report that both git and svn have clients that are pretty much trivial to install on Windows, and seem to Just Work: I used msysgit and the Tigris.org download.

I ran into an infuriatingly opaque error compiling MySQLdb, and had to figure it out myself; I didn't run across this response until too late. Briefly, if you have MySQL 5.1 installed, _winreg returns an error, "the system cannot find the file specified"; you need to update MySQLdb's site.cfg to look for MySQL 5.1 instead of 5.0. This seems like something that should be in MySQLdb...

Speaking of MySQLdb, it'd be nice to have binary packages of some version or another for Python 2.6. Binary builds for Windows of packages with C extensions are really important for users. Hopefully I can help provide a better solution for this down the road.

Anyway, now I have a full blown dev environment: I can compile and test pygr, I can compile and test CPython itself, and I'm happy.

I can even sit back and eat some yummy smoked fish. How is that not a win?

--titus

p.s. Now I have to repeat all of this for another Mac, by which time I will be an expert, I'm sure! Bleah.

Syndicated 2009-09-06 19:12:40 from Titus Brown

Why does my iPhone know how to spell Cthulhu?

Very odd. I mean, it's nice to have my prayers spell-checked and all, but really, Apple? Cthulhu?

Also, jinja2 rocks. I think I'll be teaching it as a templating language this term...

And finally, people interested in using sqlite3 for shelve-like storage in Python 2.x can take a look at issue 52 in pygr's issue tracker; I've taken the code from bugs.python.org/issue3783 and "backported" it to Python 2.x. Since bsddb is no longer going to be part of the Python stdlib, we're planning to switch to using sqlite3 for scalable data storage.
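For anyone wondering what "shelve-like storage on sqlite3" looks like in practice, here is a minimal sketch. It is not the code from bugs.python.org/issue3783 or from the pygr tracker; the class name `SqliteShelf`, the table name `shelf`, and the choice to pickle values are all assumptions made for illustration, and it is written for current Python 3 rather than 2.x.

```python
import pickle
import sqlite3
from collections.abc import MutableMapping


class SqliteShelf(MutableMapping):
    """Hypothetical shelve-like mapping: string keys, pickled values in sqlite3."""

    def __init__(self, path=":memory:"):
        # One table holds everything; the key column is the primary key.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS shelf (key TEXT PRIMARY KEY, value BLOB)"
        )

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM shelf WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return pickle.loads(row[0])

    def __setitem__(self, key, value):
        blob = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
        self.conn.execute(
            "INSERT OR REPLACE INTO shelf (key, value) VALUES (?, ?)", (key, blob)
        )
        self.conn.commit()

    def __delitem__(self, key):
        cur = self.conn.execute("DELETE FROM shelf WHERE key = ?", (key,))
        if cur.rowcount == 0:
            raise KeyError(key)
        self.conn.commit()

    def __iter__(self):
        for (key,) in self.conn.execute("SELECT key FROM shelf"):
            yield key

    def __len__(self):
        return self.conn.execute("SELECT COUNT(*) FROM shelf").fetchone()[0]

    def close(self):
        self.conn.close()
```

Usage is the same as a dict: `shelf = SqliteShelf("data.db"); shelf["seq1"] = {"length": 300}`. Committing on every write keeps the sketch simple but is slow for bulk loads; batching writes in one transaction would be the obvious next step.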

--titus

Syndicated 2009-08-30 17:59:02 from Titus Brown

Teaching girls to program in Ruby

Sarah Mei posts about teaching Ruby to high school girls. Good stuff.

While searching for some GHOP info from way back, I ran across this post asking "where are the girls among the GHOP winners?" (The statistics mentioned in the post may have been posted since, although I haven't seen them.) We asked the Python mentors to "rate" the students, and the hands-down winner was someone who had worked closely with several different mentors and performed very well. Perhaps next time we should highlight everyone who did well; there were several women in the group, too.

In general it's tough to raise the visibility of minority groups, though. Do we engage in affirmative action of some sort, and if so, how do we do so without being unfair to others? Or do we simply rank people in a presumably gender- and color-blind way and see what happens? I've talked several times in various venues about trying to run a female-oriented GSoC derivative, like GNOME's Women's Summer Outreach Program, which would at least call attention to one minority in OSS... and if GHOP ever happens again, we could work on getting younger women involved.

--titus

Syndicated 2009-08-28 03:02:20 from Titus Brown

XML sucks for big data, or hadn't you noticed?

Courtesy of Rich Enbody, this blog post, How XML Threatens Big Data -- Dataspora, elicited a big "duh" from me.

You don't solve any of the semantic problems with data by elaborating on a textual format. You may bring them into the light, but along with the visibility comes "bureaucracy" -- technology, acronyms, proponents and opponents, and the usual cruft.

I find the "embrace lazy data modeling" rule rather funny, personally, because it is the data-world's counterpart to agile methodologies in software development: solve problems you actually have, rather than all the potential ones you see.

I do like the "15 minute" rule: if I can't parse some useful information out of your data format in 15 minutes, you've done something very wrong.

--titus

Syndicated 2009-08-26 05:57:25 from Titus Brown

Who belongs in the PSF?

So, it's nomination season for the Python Software Foundation again... and I have this niggling feeling that I'm forgetting about several people that have demonstrated significant commitment to the Python community, are good 'uns, and are otherwise people I would trust with some part of the future of Python and the community.

PSF membership "worthiness" is hard to define, but (for me) depends on more than just working in your own small patch of Python -- contributing broadly to GSoC/GHOP, for example, seems like a good criterion. Remember, people need a substantial positive vote by the existing PSF, too, so they should be Known. For some value of Known.

Any thoughts? Drop me a line.

--titus

Syndicated 2009-08-25 16:15:27 from Titus Brown

Calculating necessary coverage for ChIP-seq

OK, so you have a genome -- let's say it's about 1 gb in size -- and you want to do ChIP-seq on a transcription factor that you think binds ~1000 places in the genome. You've measured the specificity of the transcription factor and it seems to enrich about 10-fold over background (an OK but not fantastic number). How much sequencing do you need to do to see a statistically significant signal?

We need two other numbers. First, what fragment size are you going to use? And second, what level of signal over background do you want to see?

Let's choose a fragment size of 300 bp and look for a 10:1 signal over background, just for grins.

The math then works out like this: you need roughly 1 sample from each fragment in the ChIP mixture (background + specific fragments) to get an average 10x signal with a 10-fold enrichment. So, if the background is 1 gb / 300 bp, or about 3x10^6 fragments, and the signal is 10x enrichment x 1000 locations, or 10^4, then you need N=(3x10^6 + 10^4) samples to hit each background fragment once & each "real" location 10 times. Note that (3x10^6 + 10^4) is approximately 3x10^6: that is, the background dominates the necessary sequencing for transcription factors that bind in so few places in the genome.

If you want at least one sample (on average) from each of 3x10^6 fragments, you want 3x10^6 samples. A single Solexa lane yields approx 2.5x10^6 mappable reads (as of the last data sets I have -- so it should have improved by now), suggesting that a single Solexa lane should yield nearly enough samples to see a <deep breath> 10x signal with a 10:1 enrichment over background by ChIP in a 1 gb genome with a 300 bp fragment size.
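The arithmetic above is easy to put in a few lines of Python, which makes it simpler to plug in other genomes or fragment sizes. This is only a sketch of the napkin calculation, assuming a uniform background; the function name `chip_seq_reads_needed` and the constant `READS_PER_LANE` are names made up here, and the 2.5x10^6 figure is just the per-lane number quoted above.

```python
def chip_seq_reads_needed(genome_bp, fragment_bp, n_sites, enrichment):
    """Napkin estimate: one read per background fragment, `enrichment`
    reads per true binding site. Assumes a uniform background."""
    background_fragments = genome_bp / fragment_bp
    signal_samples = n_sites * enrichment
    total = background_fragments + signal_samples
    return background_fragments, signal_samples, total


# Numbers from the text: 1 gb genome, 300 bp fragments,
# ~1000 binding sites, 10-fold enrichment.
background, signal, total = chip_seq_reads_needed(1e9, 300, 1000, 10)

# Mappable reads per Solexa lane, as quoted in the text.
READS_PER_LANE = 2.5e6
lanes = total / READS_PER_LANE

print("background fragments: %.3g" % background)
print("signal samples:       %.3g" % signal)
print("total reads needed:   %.3g" % total)
print("lanes needed:         %.2f" % lanes)
```

Without rounding 1 gb / 300 bp down to 3x10^6, the background is closer to 3.3x10^6 fragments, so the estimate comes out at a bit more than one lane rather than a bit less. Either way the background term swamps the 10^4 signal samples.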

Now, are these realistic numbers? In some cases, yes; in others, I don't know, but I think so. Some factoids and guesstimates:

  • the chick genome is 1.2 gb in size, or approximately 1 gb.

  • 300 bp is a "typical" choice for fragment size, although of course you'd prefer smaller (for better resolution).

  • a 10x signal should stand out over background, using an off-the-cuff estimate of the spread (standard deviation sqrt(10) ~ 3, so 95% of your true peaks should have > (10 - 2 sd == 4) reads associated with them).

    Of course, you're going to have a lot of background, so there will be many peaks that are just coincidental. I'm not sure how to do that napkin-sized calculation -- should I just look three standard deviations out (1 + 3 = 4) and see how many peaks I'd expect beyond that? It should be less than 1 percent, but that's still an awful lot of peaks when you're considering a background of 3x10^6 fragments.

  • People (TM) tell me that 10:1 enrichment is not atypical of an OK antibody.

So, assuming these numbers are about right, where can you optimize most easily? There aren't any non-linear influences, so you can't look to tweak there; it comes down to what's easy and cheap.

  • Mo' sequencing = mo' bettah, obviously, but each doubling of signal also doubles your cost.

  • Increasing your fragment size is cheap and divides out linearly -- a doubling from a 300 to a 600 bp fragment size will give you twice as much signal. Unfortunately it will also decrease your resolution, which will in turn have a significant impact on any bioinformatics you might do for finding binding sites.

    But you could always go back and confirm your predictions with ChIP-QPCR, which most people do anyway.

  • You could start with a better antibody, too ;). Of course, usually that's not so easy to obtain.

    I seem to recall that the Johnson et al. paper (pubmed 17540862) used an NRSF antibody that was estimated to yield 100x enrichment. Obviously that would significantly help with your signal-to-noise ratio!
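To compare the three knobs side by side, here is a small sketch of how the expected read count at a true site scales with sequencing depth, fragment size, and enrichment. It assumes the background dominates the read pool (as argued above) and is uniform; `expected_reads_per_site` and the baseline values are names and numbers chosen here for illustration.

```python
def expected_reads_per_site(n_reads, genome_bp, fragment_bp, enrichment):
    """Expected reads landing on one true binding site, assuming the
    reads are spread uniformly over genome_bp / fragment_bp background
    fragments and true sites are sampled `enrichment` times as often."""
    reads_per_background_fragment = n_reads * fragment_bp / genome_bp
    return enrichment * reads_per_background_fragment


# Baseline from the text: ~3x10^6 reads, 1 gb genome, 300 bp, 10x enrichment.
baseline = dict(n_reads=3e6, genome_bp=1e9, fragment_bp=300, enrichment=10)
base_signal = expected_reads_per_site(**baseline)

# Change one knob at a time and report the change relative to baseline.
scenarios = {
    "2x sequencing": dict(baseline, n_reads=6e6),
    "600 bp fragments": dict(baseline, fragment_bp=600),
    "100x antibody": dict(baseline, enrichment=100),
}
for name, params in scenarios.items():
    ratio = expected_reads_per_site(**params) / base_signal
    print("%-18s -> %.1fx the baseline signal" % (name, ratio))
```

Everything is linear, so doubling the reads or the fragment size doubles the signal, while a 100x antibody gives ten times the signal of a 10x one. Only the antibody changes the signal-to-background ratio; more reads or bigger fragments raise background counts by the same factor.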

I guess I'd look to changing the fragment size first, hmm. Wonder what kind of effect that has on the bioinformatics? I'll have to think about that more.

I'd appreciate any comments or thoughts people might have... I'm not even sure this is the right approach to calculating the numbers, but it makes sense to me! Please comment, or drop me a note at ctb@msu.edu.

--titus

p.s. Yes, background is not uniform. But I have yet to see a good method for calculating it; most people simply do a negative control and sequence that, too.

Syndicated 2009-08-19 17:05:24 from Titus Brown
