Older blog entries for jtauber (starting at number 197)

Atom, Google Reader and Duplicates on Planets

For a while I've wondered why posts syndicated across multiple planets don't get picked up by Google Reader as duplicates (and automatically marked as read once I've read them the first time around).

I wasn't sure whether the problem was:

  • the source feeds
  • the planet feeds
  • Google Reader itself

so I decided to investigate further with my own feed as the source and the three planets my site is syndicated to (that I know of).

Let's take my post Cleese and Disk Images.

My feed gives both an id and a link for both the feed itself and each individual entry. That makes it possible, at least, for planets and readers to do the Right Thing. So I don't think the problem is my feed.

On the Official Planet Python:

  • there is an RSS 1.0 feed
  • the rdf:about is not the same as the id in my feed but is the link URL (subtly different, but the planet is doing the wrong thing, IMHO)
  • there is no authorship or source information

On both the Unofficial Planet Python and on Sam Ruby's Planet Intertwingly:

  • there is an Atom feed
  • the entry Cleese and Disk Images has the same id as in my feed and a distinct link.
  • the source element gives appropriate information about my blog and the original feed
  • the author (which I only give on the source feed not each entry) is not inherited on to the entry in the planet feed, only included in the source

Note that the handling of the author by the latter two feeds is correct per the Atom RFC, although I have noticed that Safari's feed reader gets this wrong and, despite the author in the source element, uses the inherited author from the planet feed itself.

But, in short, the Atom-feed-based Planets do the Right Thing, although IMHO the RSS-1.0-based Official Planet Python does not. That may not be the Planet's fault. The RSS 1.0 Spec (or any RSS for that matter) may not make the distinction between id and link.

So given that my feed and two of the planet feeds do the right thing, I guess that places the blame with Google Reader.

Why does Google Reader not honour the entry id and automatically mark duplicates as already read once you've read the entry the first time? That's my pony request for Google Reader.
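As a rough illustration of the behaviour being asked for, here is a minimal Python sketch of a reader that tracks entries by Atom id rather than by link. The two feeds, their ids and URLs, and the helper names entry_ids and unread_entries are all made up for the example (and the feeds are trimmed to the elements that matter); nothing here reflects how Google Reader actually works.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 XML namespace

def entry_ids(feed_xml):
    """Return the atom:id of every entry in an Atom feed document."""
    root = ET.fromstring(feed_xml)
    # findtext only looks at direct children, so the id inside
    # an entry's <source> element is not picked up by mistake.
    return [entry.findtext(ATOM + "id") for entry in root.iter(ATOM + "entry")]

def unread_entries(feed_xml, read_ids):
    """Ids of entries that have not already been read via another feed."""
    return [i for i in entry_ids(feed_xml) if i not in read_ids]

# Hypothetical feeds: the same post appears in the source feed and on a planet.
SOURCE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <id>tag:example.org,2008:blog</id>
  <entry>
    <id>tag:example.org,2008:cleese-and-disk-images</id>
    <link href="http://example.org/blog/cleese-and-disk-images"/>
  </entry>
</feed>"""

PLANET_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <id>tag:planet.example.net,2008:feed</id>
  <entry>
    <id>tag:example.org,2008:cleese-and-disk-images</id>
    <link href="http://planet.example.net/#cleese"/>
    <source><id>tag:example.org,2008:blog</id></source>
  </entry>
  <entry>
    <id>tag:example.com,2008:someone-elses-post</id>
  </entry>
</feed>"""

read_ids = set(entry_ids(SOURCE_FEED))  # the reader has read the source feed
print(unread_entries(PLANET_FEED, read_ids))
```

Only the entry that has not been seen elsewhere comes back as unread, even though the planet gives the shared entry a different link.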

And by the way, the same thing applies to feeds themselves, not just entries. Feedburner, for example, does the right thing and passes through the id of a source Atom feed into its own Atom feed version. However, if you subscribe to both the source and Feedburner versions of a feed, Google Reader doesn't identify them as the same feed. Of course, if either is RSS, I'd assume all bets are off.

So, in summary, Atom supports doing the Right Thing. The Atom-based Planets do the Right Thing. Google Reader doesn't take advantage of this.

Syndicated 2008-11-06 06:04:34 (Updated 2008-11-06 06:04:35) from James Tauber

Dear America

I'm glad you have elected someone you think brings hope and change. I hope he turns out to be one of the truly great presidents.

However, after the last eight years, you need a strong dose of fiscal conservatism. I hope your choice turns out to be the right one for that. I am not yet convinced.

Syndicated 2008-11-05 19:55:43 (Updated 2008-11-05 19:55:44) from James Tauber

Cleese and Disk Images

Previously I talked about setting up a toolchain to compile i386-elf binaries for hobby OS writing on Mac OS X.

The next step in getting Cleese development working on Mac OS X was working out how to build disk images for VMware Fusion for a "hello world" kernel. I got about halfway over the weekend, but Brian Rosner worked out the rest Monday night.

VMware can mount a normal disk image as a floppy but can't do the same for hard drives. Turns out, though, you can create floppy images larger than 1.44MB (although I don't know if there's an upper limit).

Here's the make target Brian came up with:

cleese.img: KERNEL.BIN
        hdiutil create -size 5M -fs "MS-DOS" -layout NONE cleese
        mv cleese.dmg cleese.img
        mkdir -p mnt
        mount_msdos -o nosync `hdid -nomount cleese.img` ./mnt
        cp -r boot KERNEL.BIN ./mnt
        umount -f ./mnt
        rm -r ./mnt

This creates a 5MB disk image, mounts it and copies the "boot" directory from GRUB and our kernel KERNEL.BIN on to the image.

This image isn't bootable by VMware yet. You need to boot off another floppy that has GRUB and is bootable, but this is a one-off operation. You can easily create a bootable GRUB disk with just

cat boot/grub/stage1 boot/grub/stage2 > grub.img

Once you've booted to the GRUB command line, you can switch to cleese.img as your floppy and type

setup (fd0)

and that will copy GRUB onto the boot sector. From that point on, cleese.img is all you need.

To avoid having to do that step every time KERNEL.BIN updates, I wrote an additional make target that just updates KERNEL.BIN on an existing image.

        mkdir -p mnt
        mount_msdos -o nosync `hdid -nomount cleese.img` ./mnt
        cp KERNEL.BIN ./mnt
        umount -f ./mnt
        rm -r ./mnt

As a quick guide to what that's doing:

  • the -p option to mkdir just stops it complaining if mnt already exists
  • hdid -nomount cleese.img binds the disk image to a /dev and returns the device path
  • that device path is then used as an argument to mount_msdos (hence the backticks) which mounts that device as ./mnt
  • the file(s) are copied on, the image unmounted and the mount point deleted

I'm not sure why the -o nosync is needed. Maybe it isn't.

In the original target, the -layout NONE option to hdiutil ensures no partition map is created for the drive.

Syndicated 2008-11-05 09:26:14 (Updated 2008-11-05 09:49:40) from James Tauber

Daylight Saving Time

Yesterday I was asked at work what the origins of daylight saving were. People who know me know I can never just say "I don't know" to a question like that, so I had to go do some research.

The short answer is "war and golf" but here is a longer version, gleaned from various articles online and a little prior knowledge on the topic.

While Benjamin Franklin is sometimes credited with the idea of setting clocks differently in the summer, his idea was well before its time as there wasn't a notion of standard time in his day. The notion that clocks would be set according to the "real time" (i.e. based on the Sun) of some other location has its origin with the railroad system. In November 1840, the Great Western Railway in England adopted London Time for all their schedules. The US and Canada followed suit with their own Standard Time in November 1883.

While Standard Time was initially for the railroads, it began to be adopted across the board, eventually being enacted into law in the US by the Standard Time Act of 1918.

An Englishman, William Willett, made the observation, a century after Ben Franklin had done the same, that people were wasting the early hours of the day in summer by sleeping in. He was also an avid golfer who was frustrated at dusk cutting short his game. So he started campaigning for clocks to be advanced during the summer months. The idea was ridiculed and he died in 1915 without seeing his idea adopted.

In April 1916, however, Germany started advancing the clock an hour to reduce electricity usage and hence fuel consumption during the war. Many European countries immediately followed suit and Britain started in May 1916. When the US joined the war, they too adopted this daylight saving measure.

The US Congress repealed the law in 1919, but Woodrow Wilson (incidentally also an avid golfer) vetoed the repeal. Congress overrode the veto and so daylight saving stopped, although it was adopted locally in some places.

In World War II, it was reintroduced, this time all year around. The US had daylight saving from February 1942 to September 1945. After the war, it went back to being a local issue.

It was a controversial issue through the early 1960s, but the confusion caused by so many local differences resulted in the US passing the Uniform Time Act in 1966, which reintroduced it across the country unless overridden by state law.

My own home state of Western Australia is currently in a three-year trial of daylight saving and will hold a vote next year as to whether to keep it.

Syndicated 2008-11-04 08:21:40 (Updated 2008-11-04 18:46:31) from James Tauber

Python's re.DEBUG Flag

Eric Holscher points out a Python gem I never knew about. If you pass in the number 128 (or, as I have a preference for flags in hex, 0x80) as the second arg to re.compile, it prints out the parse tree of the regex:

>>> import re
>>> pattern = re.compile("a+b*\s\w?", 0x80)
max_repeat 1 65535
  literal 97
max_repeat 0 65535
  literal 98
in
  category category_space
max_repeat 0 1
  in
    category category_word

While re.compile is documented as having the signature

compile(pattern[, flags])

the particular flag 0x80 is not documented as far as I can tell.

I thought I'd dig in further.

Firstly, note that re appears to cache patterns: if you repeat the same re.compile call, it returns the same object and doesn't spit out the parse tree. There is a re.purge function for purging this cache, but while this is mentioned in help(re) it is not in the main documentation.

Secondly, note that the flag 0x80 is actually defined as DEBUG in the re module, so a more robust form would be:

re.compile(pattern, re.DEBUG)
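A small sketch of both points, written for Python 3 rather than the Python 2 session shown above. It captures whatever re.DEBUG prints and then checks the cache with an ordinary pattern; the helper name debug_dump is made up, the exact dump format varies between Python versions, and whether DEBUG-compiled patterns are themselves cached may differ by version, which is why the cache check avoids the flag.

```python
import contextlib
import io
import re

def debug_dump(pattern):
    """Compile with re.DEBUG and return whatever the re module printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()

dump = debug_dump(r"a+b*\s\w?")
print(dump)  # parse tree; layout depends on the Python version

# Cache check with an ordinary (non-DEBUG) pattern.
first = re.compile(r"a+b*")
second = re.compile(r"a+b*")
print(first is second)   # the second call is served from the cache

re.purge()               # empty the cache
third = re.compile(r"a+b*")
print(third is first)    # a fresh object is built after the purge
```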

A source code comment for DEBUG and another undocumented flag TEMPLATE (which supposedly disables backtracking) mentions:

# sre extensions (experimental, don't rely on these)

which explains why they aren't documented.

In the Python source code, there is also a Scanner class defined with the comment "experimental stuff (see python-dev discussions for details)"

A quick search of the python-dev mailing list found nothing. Perhaps a Python core developer could fill us in.

Syndicated 2008-11-03 11:47:43 (Updated 2008-11-03 11:49:27) from James Tauber

Cleese and a New Toolchain

Back in July 2003, I had an idea to "make the Python interpreter a micro-kernel and boot directly to the Python prompt". Thus started Cleese, which I worked on with Dave Long. We made a good deal of progress and I learned a tremendous amount.

In February 2007, I moved Cleese from SourceForge to Google Code Project Hosting in the hope of restarting work on it. In between 2003 and 2007 I'd become a switcher and so I needed to work out how to do on OS X what I'd been doing with a very strange hybrid of Windows command line and Cygwin before. Alas I never got around to that part.

Then about a week ago, inspired by Brian Rosner's interest in the project, I decided to give it another go. I also decided to use it as an opportunity to finally learn Git.

First goal: build a "hello world" kernel (no Python yet). Fortunately I had one from the initial stages of Cleese 2003, but it wouldn't build. In particular ld was barfing on the -T option used to specify a linking script (which OS X's ld doesn't support).

After asking some questions on the #osdev channel on freenode, I discovered I'd need a completely new gcc and binutils toolchain to support i386-elf. This didn't turn out to be difficult at all, though.

Here were my steps:

export PREFIX=/Users/jtauber/Projects/cleese/toolchain
export TARGET=i386-elf

cd ~/Projects/cleese
curl -O http://ftp.gnu.org/gnu/binutils/binutils-2.19.tar.gz
mkdir toolchain
tar xvzf binutils-2.19.tar.gz
cd binutils-2.19
./configure --prefix=$PREFIX --target=$TARGET --disable-nls
make
make install
cd ..
curl -O http://ftp.gnu.org/gnu/gcc/gcc-4.2.4/gcc-core-4.2.4.tar.bz2
bunzip2 gcc-core-4.2.4.tar.bz2
tar xvf gcc-core-4.2.4.tar
cd gcc-4.2.4
./configure --prefix=$PREFIX --target=$TARGET --disable-nls --enable-languages=c --without-headers
make all-gcc
make install-gcc

Now my "hello world" kernel builds. Next goal...working out how to programmatically build disk images for VMware Fusion (or, failing that, Qemu)

Syndicated 2008-11-03 03:43:26 (Updated 2008-11-03 03:44:59) from James Tauber

2 Nov 2008 (updated 2 Nov 2008 at 12:06 UTC)

Cell naming

My previous post introduced my adventures into C. elegans.

I've gone ahead and implemented my own little cell lineage browser using django-mptt. Once I've added more functionality, I'll put it online.

But for now, I'm intrigued by the naming of cells in the lineage. In particular, the majority of cells are named by appending either 'a' or 'p' to the parent cell. What do 'a' and 'p' stand for?

As an example:

P0 -> P1' -> P2' -> C

but then

  • C divides into Ca, Cp
  • Ca divides into Caa and Cap; Cp divides into Cpa and Cpp

Caa, Cpa then have a slightly different progression than Cap and Cpp:

  • Caa and Cpa respectively split into Caaa, Caap and Cpaa and Cpap
  • these then split into Caaaa, Caaap, Caapa, Caapp, Cpaaa, Cpaap, Cpapa and Cpapp
  • these then split into the 16 you'd expect except that Cpapp splits into what are called Cpappd and hyp11, Caapp splits into Caappd and PVR and Caapa splits into Caapap and DVC.

Cap and Cpp progress as follows:

  • they split into Capa, Capp, Cppa, Cppp as you'd expect
  • these split into Capaa, Capap, Cappa, Cappp, Cppaa, Cppap, Cpppa, Cpppp as you'd expect
  • these then split into Capaaa, Capaap, Capapa, Capapp, Cappaa, Cappap, Capppa, Capppp, Cppaaa, Cppaap, Cppapa, Cppapp, Cpppaa, Cpppap, Cppppa, Cppppp
  • and finally the 32 you would expect except Cppppp splits into what are called Cpppppd and Cpppppv

This is just the C lineage, which is less than 10%. But I'd love to know what the 'a' and 'p' stand for; what the 'd' and 'v' stand for; and why hyp11, PVR and DVC get such distinct names.

UPDATE: I added a "cell type" field to my browser and it revealed a couple of useful things: the "leaf nodes" (i.e. final cells) from Cap and Cpp are all marked as of cell type "muscle". The leaf nodes from Cpa (including hyp11) are all marked cell type "hypodermis". The leaf nodes from Caa are a little more interesting: The Caaa... leaf nodes are all "hypodermis". The leaf nodes from Caap are the most interesting, though. Caappd is "hypodermis", Caapap is marked as dying, and PVR and DVC are neurons.

UPDATE 2: Just as a point of comparison, there is another founder cell D whose descendants are a lot cleaner. D results in 20 cells, all of type "muscle". All are named with a/p. The only reason it's not a power of 2 is the two D{a|p}pp split into 4 whereas the others at that level split into only 2.
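The count of 20 can be checked with a few lines of Python, on my reading that "that level" means the eight cells three divisions below D, and that the two cells ending in "pp" each end up as four final cells while the other six each end up as two:

```python
# Three divisions below D: Daaa, Daap, Dapa, Dapp, Dpaa, Dpap, Dppa, Dppp
level3 = ["D" + a + b + c for a in "ap" for b in "ap" for c in "ap"]

# Dapp and Dppp each give 4 final cells; the other six each give 2.
final_cells = sum(4 if name.endswith("pp") else 2 for name in level3)

print(len(level3), final_cells)  # 8 20
```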

UPDATE 3: Based on http://en.wikipedia.org/wiki/Anatomical_terms_of_location I'm now convinced a, p, d, and v refer to anterior, posterior, dorsal and ventral respectively.
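Assuming that reading of the letters is right, the naming scheme is easy to sketch in Python: each division appends a letter to the parent's name, and a name can be decoded back into the sequence of divisions. The function names generation and describe are made up for this example, and the sketch only covers cells named by the suffix rule, not specially named ones like hyp11, PVR or DVC.

```python
AXES = {"a": "anterior", "p": "posterior", "d": "dorsal", "v": "ventral"}

def generation(founder, depth):
    """Names expected after `depth` anterior/posterior divisions of a founder."""
    cells = [founder]
    for _ in range(depth):
        cells = [cell + suffix for cell in cells for suffix in "ap"]
    return cells

def describe(cell, founder):
    """Spell out the division letters appended to a founder cell's name."""
    if not cell.startswith(founder):
        raise ValueError(f"{cell} is not named after founder {founder}")
    return [AXES[letter] for letter in cell[len(founder):]]

print(generation("C", 2))        # ['Caa', 'Cap', 'Cpa', 'Cpp']
print(describe("Cpppppd", "C"))  # five posterior divisions, then dorsal
```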

Syndicated 2008-11-02 05:39:49 (Updated 2008-11-02 06:51:50) from James Tauber

C. elegans

I don't normally talk about biology because I don't know much about it. Growing up, I was the physicist and my sisters were the biologists. But I'm interested in the computational modeling of just about anything, so I've long been interested in biological simulations, artificial life, etc. and have recently been getting into computational neuroscience in a fairly big way.

I can't remember when I first read about Caenorhabditis elegans (henceforth abbreviated, as it is by biologists, to C. elegans) but it was probably about a year ago and it totally blew my mind.

C. elegans is a tiny roundworm, about one millimeter long, but what is remarkable is just how much we know about it. How much? Well, we know every single cell and how it develops from the single-cell zygote. We know every single neuron and how the entire brain is wired. That's pretty incredible. Oh, and of course we've sequenced the entire genome.

C. elegans, along with fruit flies and zebrafish, is an example of a model organism. Model organisms are those that have been studied in great depth in the hope of understanding organisms in general (including humans). Numerous characteristics make a particular organism suitable as a model. In the case of C. elegans I think it's how quickly they generate and the fact they have a very defined development and fixed number of cells. They can also be revived after being frozen.

Now C. elegans are almost always hermaphrodite, although a tiny fraction are male. The hermaphrodites have 959 cells and, as I mentioned, we know how each of them developed from the initial zygote. So P0 splits into AB and P1', P1' into EMS and P2', EMS into MS and E, E into Ea and Ep, and so on. This tree structure is called the cell lineage or pedigree and it's available online at http://www.wormbase.org/db/searches/pedigree. For each cell, there's also an information page and that information is also available in an XML format (e.g. http://www.wormbase.org/db/cell/cell.cgi?name=EMS;class=Cell). Because I wanted to dig around a little more, I ended up writing a data scraping script in Python to download all the XML files (parsing each one to find out what the daughter cells were, then recursing).
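The recursion in such a script might look roughly like the sketch below. It is not the actual scraper: fetch_daughters is a hypothetical stand-in that looks cells up in a small hard-coded table (just the early divisions, with EMS dividing into MS and E) instead of downloading and parsing each cell's XML page, whose format is not shown here.

```python
# Early divisions only; a real scraper would get these from each cell's XML.
DAUGHTERS = {
    "P0": ["AB", "P1'"],
    "P1'": ["EMS", "P2'"],
    "EMS": ["MS", "E"],
    "E": ["Ea", "Ep"],
}

def fetch_daughters(cell):
    """Hypothetical stand-in for downloading and parsing one cell's page."""
    return DAUGHTERS.get(cell, [])

def walk(cell, depth=0):
    """Depth-first traversal yielding (depth, cell) for the whole lineage."""
    yield depth, cell
    for daughter in fetch_daughters(cell):
        yield from walk(daughter, depth + 1)

for depth, cell in walk("P0"):
    print("  " * depth + cell)
```

Cells missing from the table are treated as leaves, so the walk stops wherever the sample data runs out.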

The data I've downloaded also includes the neuronal wiring. At some point I'd like to do a little Django app for navigating around the data in a way that's a little friendlier for the layperson. Might also be a good excuse for me to try out django-mptt.

The data is all in a format that is shared across different model organism research projects and there is open source software for dealing with this data (especially the genomic data). For example, GBrowse is used for browsing and searching the genome of both C. elegans and the fruit fly. GBrowse is part of the GMOD project. Most of the stuff looks like it's Perl CGI scripts.

In my fascination with computer modeling but my complete ignorance of the state of biology, I wonder how far we are from cell-level simulations of organisms like C. elegans. Do we know enough to even begin to think about doing this for a 959-cell organism? I mean, isn't the Blue Brain project supposed to eventually simulate a 10,000-cell neocortical column? Or how far are we from simulating the cell development of C. elegans? i.e. given P0 (including the genome), press play and get the 959 cells of the C. elegans adult hermaphrodite at the end. Given that the most powerful computer in the world and a multi-year project are what it's going to take for 10,000 cells, I guess we're not going to be writing C. elegans simulators in Python on our desktops any time soon.

But hey, it would sure be cool.

Syndicated 2008-11-02 03:18:10 (Updated 2008-11-02 03:29:46) from James Tauber

How I View Blogging

(I'm trying to blog every day. However, if I want to say something in the afternoon and I've already blogged that day, I probably won't postpone posting just to stretch things out. That will likely mean more than 30 posts this month, although it will reduce the chance of me having something to say every day.)

In his first post for the month, Brian Rosner talks about his preference for the "article" type of blog entry over the "random opinion and links" type of entry. It's not clear if that's a preference for the entries he wants to write or the entries he wants to read. He also asks his readers what their take is.

As I've thought (and written) about the topic before, I thought I'd post my random opinions here rather than just in comments on Brian's blog (although afterwards I may go link there).

Comments vs Trackbacks

Which segues nicely into the first point: I like giving more detailed responses to a blog post in another blog post rather than just a comment. In fact, the reason I didn't add comments when I first implemented this blog software was that I wanted people to reply on their own blogs. Back in 2004 that seemed more the "blog way". In a post from that time, Blogs, Annotations, Comments and Trackbacks, I talked about trackbacks (notifying resource A that resource B talks about it) as the fundamental idea. It's just Web annotation really, but trackbacks are primarily blog to blog and comments are really just a variant where the annotating resource is actually inlined with the annotated resource (and generally persisted on the same system, although not always).

Don Park had an idea called Conversation Categories where you could host your responses but still mark them as part of a particular conversation. I never really saw this done beyond broad tagging.

Paucity of Inbound Links

One thing that's always been unusual about my own blog is the paucity of inbound links relative to number of readers. When I've compared my stats with others who've published them, I have a high subscriber count but low number of incoming links.

I've never really worked out why that would be. I guess people find my posts interesting but not noteworthy.

Blogs as Conversations

Back in Belated Thoughts on Blogs and Wikis, I talked mostly about the nature of wikis but also made the comment that while wikis are about collaboration, blogs are about conversations.

I wonder if that's as true any more. Has the conversation moved to Twitter? See more below.

Blog to Contribute to Your Tribe

I've long been inspired by Tom Peters' view that loyalty is no longer to companies but to professions and networks. Nowadays I think that's better rephrased as loyalty to 'tribes'. A few years ago I gave a talk to a business group where I basically said contributing to your tribe was the best way to "network" and, in particular, contributing by sharing knowledge.

Back in the late 90s people were more likely to know me because of posts to mailing lists like xml-dev. Nowadays someone at a conference is far more likely to come up to me and say "oh hi James, I read your blog".

Blogging is a great way to contribute to your tribe(s).

Planetary Effects

I'm on both Planet Pythons. The fact I don't have category-specific feeds means all my non-Python stuff goes to the Python planets too. No one has ever complained to me about it (and, in fact, some people have thanked me for my topic diversity) but I still sometimes feel awkward about it.

One thing's for sure, nothing gets comments like a Python-related post with code included.

The Twitter Effect

There is no doubt that Twitter reduced the amount of blogging I do. Reflecting on this, it could be that blogging was partly fulfilling a desire to tell the world what I was up to and Twitter now does that. I think it's more than that, though. I think it's that Twitter has also taken much of the conversation.

I was always hesitant to post naked links to my blog but now Twitter has completely taken away the possibility of me doing that.

Also, if I have a question that can be expressed in 140 characters, I'll ask it on Twitter whereas I may have previously blogged a longer version of the question.

Twitter also has an impact on the reading side. I can now find out what a friend is up to via Twitter or their Facebook status rather than them having to do a blog post.

Why I Blog

In Blog Goals or Lack Thereof I talked about the fact that blogging is for scribbling or making announcements about projects, not, for me, a project in itself.

Back in Thank You Blog Readers I said:

I think I'll still just continue to blog about things that interest me and things that I'm working on. After all, pretty much every single topic I've written on has put me in contact with some interesting person that I've learnt and am continuing to learn new things from.

How I View Blogging As Reader

I read to be informed and, occasionally, entertained. I want to learn stuff. I want to trigger new ideas. I want to be informed what's going on in particular communities. I want to be informed what's going on with particular friends or their projects. I read too many feeds to deal with too many long articles.

How I View Blogging As a Writer

I want to inform. I want to get a better understanding of things by being forced to articulate them myself. I want to be corrected when I've done something stupidly and want to have my solutions improved upon. I want to find other people who are working (or wanting to work) on similar projects to me. I want to keep people up-to-date with what I'm working on.

In Conclusion

I think I'll continue to blog. They won't be long articles. They won't be naked links. There'll be some announcements, but it will mostly be snippets of thought as I learn and try to interact with other learners.

Syndicated 2008-11-01 17:57:18 (Updated 2008-11-01 18:21:46) from James Tauber

Two Fun(ctional) Questions

Consider the following series of functions:

def x(a):
    if callable(a):
        return lambda i: a(i)
    else:
        return a


def x(a):
    def xx(b):
        if callable(b):
            return lambda i: a(b(i))
        else:
            return a(b)
    return xx


def x(a):
    def xx(b):
        if callable(b):
            def xxx(c):
                if callable(c):
                    return lambda i: a(b(c(i)))
                else:
                    return a(b(c))
            return xxx
        else:
            return a(b)
    return xx

and so on...

Two questions:

  • how would you write a single recursive or iterative version of this that could handle cases of any depth?
  • what would you call what this function x is doing?
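One possible answer to the first question, offered as a sketch rather than the intended solution: let x call itself with the composition built so far. Unlike the fixed-depth versions above, this one never stops accepting callables on its own; it keeps composing until it is handed a non-callable, at which point it applies the whole chain to that value.

```python
def x(a):
    """Chain callables one call at a time; apply them to the first non-callable."""
    if not callable(a):
        return a

    def xx(b):
        if callable(b):
            # Fold b into the accumulated composition and keep collecting.
            return x(lambda i: a(b(i)))
        return a(b)

    return xx


inc = lambda n: n + 1
dbl = lambda n: n * 2

print(x(inc)(dbl)(3))       # inc(dbl(3)) == 7
print(x(inc)(dbl)(inc)(3))  # inc(dbl(inc(3))) == 9
```

Seen this way, one name for what x does might be curried function composition that applies itself once it receives a value.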

Syndicated 2008-11-01 01:03:34 (Updated 2008-11-01 01:18:43) from James Tauber
