Odd Python Fact
Using the long weekend to get some intensive hacking done, I'm converting Django's internals to be more transparently unicode aware. All this character encoding twiddling has me thinking about performance, so I've been writing lots of little test programs to time features.
One unusual result that popped up this afternoon concerned reading a UTF-8-encoded file. Contrary to my intuition, this version:
data = open(filename).read()
data.decode('utf-8')
was consistently a little bit faster than this version:
import codecs
data = codecs.open(filename, 'r', 'utf-8').read()
Admittedly the differences were generally (much) less than 5%, in favour of the first version, but I was a little surprised there was any real difference at all. I'm not worried by this result, but I would have guessed incorrectly.
In both cases, I'm reading in the data and converting it to a unicode string. I was running it against some UTF-8 example files I had lying around from Markus Kuhn. The results were consistent whether I changed the order of the tests or intermixed them. Aliasing codecs.open to a global variable sped up the second method very slightly, but not enough to catch up. I was careful to pre-fill the disk buffer cache and to run each test enough times in a loop for any noise on a single run to be absorbed.
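For anyone who wants to try a comparison like this, here is a rough sketch of a timing harness. It is not the script behind the numbers above: the function names, the temporary sample file and the loop counts are all made up for illustration, and it is written for Python 3, where the first version has to open the file in binary mode before decoding. Expect the absolute numbers, and possibly the ordering, to differ between Python versions and machines.

```python
import codecs
import os
import tempfile
import timeit


def read_then_decode(filename):
    # Version 1: read the raw bytes, then decode them in a single call.
    with open(filename, 'rb') as f:
        return f.read().decode('utf-8')


def read_with_codecs(filename):
    # Version 2: let codecs.open do the decoding while reading.
    with codecs.open(filename, 'r', 'utf-8') as f:
        return f.read()


def make_sample(text, repeat=2000):
    # Hypothetical helper: write a UTF-8 sample file and return its path.
    fd, path = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'wb') as f:
        f.write((text * repeat).encode('utf-8'))
    return path


def compare(filename, number=200, repeat=5):
    # Time both versions; keep the best of several repeats to damp noise.
    results = {}
    for func in (read_then_decode, read_with_codecs):
        func(filename)  # warm-up read so the file is likely in the OS cache
        timings = timeit.repeat(lambda: func(filename),
                                number=number, repeat=repeat)
        results[func.__name__] = min(timings)
    return results


if __name__ == '__main__':
    path = make_sample(u'Γειά σου κόσμε\n')
    try:
        for name, seconds in compare(path).items():
            print(name, seconds)
    finally:
        os.remove(path)
```

Taking the minimum over several repeats is one common way to reduce the influence of background noise; averaging over a long loop, as described above, is another.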
Turns out, the results were closest (essentially identical speed) for files that have mostly one byte per character (pure ASCII files being the fastest) and diverged the most for more complex characters. The runic poem, with lots of three-byte characters, and the Greek text, which is entirely two-byte characters, were the most divergent.
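To make the bytes-per-character point concrete, here is a small sketch (Python 3 again) that measures the average UTF-8 width of a string. The sample strings are short stand-ins I picked for illustration, not the actual test files, and average_bytes_per_char is a hypothetical helper name.

```python
def average_bytes_per_char(text):
    # Encoded length in bytes divided by the number of code points.
    return len(text.encode('utf-8')) / len(text)


# Illustrative samples only; not the files used in the timings.
samples = {
    'ascii': u'hello',
    'greek': u'αβγδε',
    'runic': u'ᚠᚢᚦᚨᚱ',
}

for name, text in samples.items():
    print(name, average_bytes_per_char(text))
```

ASCII characters take one byte each in UTF-8, Greek letters two and runes three, so the decoder has more multi-byte sequences to handle as you move down that list.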
