17 Aug 2004 jamesh   » (Master)

Python Unicode Weirdness

While owen and I were discussing Unicode on IRC, we ran into a peculiarity in Python's Unicode handling. The following code demonstrates it:

>>> s = u'\U00010001\U00010002'
>>> len(s)
>>> s[0]

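Python can be compiled to store each element of a unicode string in either 16 or 32 bits (16 bits being the default). When compiled in 32-bit mode, the results of the last two statements are 2 and u'\U00010001' respectively. When compiled in 16-bit mode, the results are 4 and u'\ud800', because each character outside the Basic Multilingual Plane is stored as a UTF-16 surrogate pair and len() and indexing operate on the individual 16-bit units.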
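To see where the 4 and the u'\ud800' come from, here is a rough sketch of the standard UTF-16 surrogate pair arithmetic. The helper name utf16_code_units is my own invention for illustration and is not something Python provides; it only mimics what a 16-bit build would store for the string above.

```python
def utf16_code_units(code_point):
    """Return the list of 16-bit units UTF-16 uses for one code point.

    Hypothetical helper for illustration only.
    """
    if code_point < 0x10000:
        # Characters in the Basic Multilingual Plane fit in one unit.
        return [code_point]
    # Characters above U+FFFF are split into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return [high, low]


# The two characters from the example string.
units = []
for cp in (0x10001, 0x10002):
    units.extend(utf16_code_units(cp))

print(len(units))                 # number of 16-bit units
print([hex(u) for u in units])    # the units themselves
```

The list has four entries, and the first one is 0xd800, which matches what len(s) and s[0] report on a 16-bit build.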

So rather than just being an implementation detail, the unicode string width chosen at compile time can alter the result of Python programs that manipulate characters outside of the Basic Multilingual Plane. It would be nice if Python programs didn't have to care about this sort of detail ...
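One possible workaround, as a minimal sketch rather than a recommendation: sys.maxunicode reports the largest code point a string element can hold, so a program can at least detect which kind of build it is running on, and it can count characters in a way that treats a surrogate pair as one. The function names is_16bit_build and count_code_points are hypothetical names I made up for this example.

```python
import sys


def is_16bit_build():
    """True if string elements are 16 bits wide (sys.maxunicode is 0xFFFF)."""
    return sys.maxunicode == 0xFFFF


def count_code_points(s):
    """Count characters in s, treating a surrogate pair as one character.

    Hypothetical helper: gives the same answer whether s holds
    surrogate pairs or full 32-bit characters.
    """
    count = 0
    prev_was_high = False
    for ch in s:
        unit = ord(ch)
        if prev_was_high and 0xDC00 <= unit <= 0xDFFF:
            # Second half of a pair: already counted with the first half.
            prev_was_high = False
            continue
        count += 1
        prev_was_high = 0xD800 <= unit <= 0xDBFF
    return count


s = u'\U00010001\U00010002'
print(is_16bit_build())
print(count_code_points(s))
```

With this, count_code_points(s) gives 2 for the example string regardless of the width chosen at compile time, although indexing and slicing would still need similar care.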
