17 Apr 2007 malcolm   » (Master)

Odd Python Fact

I'm using the long weekend to get some intensive hacking done, converting Django's internals to be more transparently unicode-aware. All this character encoding twiddling has me thinking about performance, so I've been writing lots of little test programs to time features.

One unusual result that popped up this afternoon concerned reading a UTF-8-encoded file. Contrary to my intuition, this version:

  data = open(filename).read().decode('utf-8')

was consistently a little bit faster than this version:

  data = codecs.open(filename, 'r', 'utf-8').read()

Admittedly the differences were generally (much) less than 5%, in favour of the first version, but I was a little surprised there was any real difference at all. I'm not worried by this result, but I would have guessed incorrectly.

In both cases, I'm reading in the data and converting it to a unicode string. I was running it against some examples I had lying around from Markus Kuhn. The results were consistent if I changed the order of the tests or intermixed them. Aliasing codecs.open to a global variable sped up the second method very slightly, but not enough to catch up. I was careful to pre-fill the disk buffer cache and run each test enough times in a loop for any noise on a single run to be absorbed.
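A minimal sketch of how a comparison like this could be set up with the standard timeit module. The function names, the repeat counts and the generated sample file are all assumptions made for illustration, not the original test programs. The file is opened in binary mode so that the explicit decode step behaves the same on Python 2 and Python 3.

```python
import codecs
import os
import tempfile
import timeit


def read_then_decode(filename):
    # First version: read the raw bytes, then decode them in one step.
    with open(filename, 'rb') as f:
        return f.read().decode('utf-8')


def read_via_codecs(filename):
    # Second version: let codecs.open do the decoding while reading.
    with codecs.open(filename, 'r', 'utf-8') as f:
        return f.read()


def compare(filename, repeat=5, number=100):
    # Read once up front so the file is in the disk buffer cache.
    read_then_decode(filename)
    results = {}
    for func in (read_then_decode, read_via_codecs):
        timer = timeit.Timer(lambda f=func: f(filename))
        # Keep the best of several runs to absorb noise from a single run.
        results[func.__name__] = min(timer.repeat(repeat=repeat, number=number))
    return results


if __name__ == '__main__':
    # Hypothetical sample data: a short Greek string repeated many times.
    fd, path = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'wb') as f:
        f.write((u'\u03b1\u03b2\u03b3\u03b4 ' * 20000).encode('utf-8'))
    try:
        for name, seconds in sorted(compare(path).items()):
            print('%-18s %.4f s' % (name, seconds))
    finally:
        os.remove(path)
```

The absolute numbers will depend on the Python version, the machine and the file contents, so only the relative difference between the two functions is worth looking at.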

Turns out, the results are closest (essentially identical speed) for files that have mostly one byte per character (pure ASCII files being the fastest) and diverge the most for files whose characters need more bytes each. The runic poem, with lots of three-byte characters, and the Greek text, which is entirely two-byte characters, were the most divergent.
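To make the byte counts concrete, here is a small sketch that measures the average UTF-8 bytes per character for three short strings. The strings are made up for illustration and are not taken from the Markus Kuhn example files.

```python
# Illustrative samples only (assumed, not the files used in the timings).
samples = {
    'ascii': u'hello',
    'greek': u'\u03b1\u03b2\u03b3',   # alpha, beta, gamma
    'runic': u'\u16a0\u16a2\u16a6',   # three characters from the Runic block
}


def bytes_per_char(text):
    # Average number of UTF-8 bytes needed per character.
    return len(text.encode('utf-8')) / float(len(text))


for name in sorted(samples):
    print('%s: %.1f' % (name, bytes_per_char(samples[name])))
# ascii: 1.0
# greek: 2.0
# runic: 3.0
```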

Syndicated 2007-04-07 19:26:47 from Malcolm Tredinnick
