28 Sep 2008 (updated 28 Sep 2008 at 19:58 UTC)
When people tell me that python3000 will "solve" the unicode
problem in python, I always shake my head and say that
unicode handling will be better, but there are still plenty of
places where they'll consider it "broken".
That's because people are really asking whether
they'll stop
getting UnicodeErrors in their code and stop having to
manually convert between byte strings and unicode. That
will never stop, simply because the data stored on
computers is stored as bytes, and those bytes have to go
through a translation before they can be recognized as unicode.
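A minimal sketch of that translation step (the byte values here are the utf-8 encoding of ½ñ; they are just an illustration, not taken from the experiment below):

```python
# Bytes as they sit on disk or come off the wire.
data = b"\xc2\xbd\xc3\xb1"

# The translation step.  Nothing in the bytes themselves says which
# character set they use, so the programmer has to supply one.
text = data.decode("utf-8")
print(text)  # ½ñ
```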
Take filenames on web servers as an example. I recently
created two files on my Apache web server, both named
½ñ.html: one with the name encoded in utf-8, the other
encoded in latin-1. I
verified that Apache was serving both files by hitting them in
Firefox using the percent-encoded forms of their names
(%c2%bd%c3%b1.html for the utf-8 version; %bd%f1.html for
the latin-1 one).
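Those two percent-encoded forms can be reproduced with Python 3's urllib.parse.quote (used here only to show how one unicode name turns into two different URLs depending on the character set):

```python
from urllib.parse import quote

name = "½ñ.html"

# The same filename produces different on-disk bytes, and therefore
# different percent-encoded URLs, depending on the encoding chosen.
print(quote(name.encode("utf-8")))    # %C2%BD%C3%B1.html
print(quote(name.encode("latin-1")))  # %BD%F1.html
```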
In python3.0rc1, this resulted in total failure: I
wasn't
able to retrieve either file, as urllib would
only accept unicode strings that encoded to the ASCII subset
(bug).
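The core of the failure is easy to reproduce without urllib at all: a unicode string containing anything outside ASCII simply won't encode to it. A sketch of the workaround, percent-encoding the raw bytes first so the resulting URL is pure ASCII (the localhost URL and the utf-8 guess are illustrative):

```python
from urllib.parse import quote

name = "½ñ.html"

# This is essentially what urllib tripped over: the name has no
# ASCII representation.
try:
    name.encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")

# Workaround: percent-encode the raw bytes yourself, producing a
# pure-ASCII URL that an ASCII-only API can handle.
url = "http://localhost/" + quote(name.encode("utf-8"))
print(url)  # http://localhost/%C2%BD%C3%B1.html
```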
But what is the "right" solution? A naive guess
would be
that the programmer should deal with the URL as unicode
internally, since the programmer may have to
manipulate the path as a string. In a web app, for
instance, a user might submit a form at
"http://localhost/mysite/files/vaña.txt?edit" and the web
app needs to parse that URL and redirect the user to
"http://localhost/mysite/files/vaña.txt?view" afterwards.
However, this approach suffers from a major problem.
If the
URL is converted into the unicode type before the web app
gets it, then the web app has no way of knowing what character
set to encode the ñ in to get the proper data back.
Is the
ñ written in latin-1? In utf-8? A wrong guess
causes a 404 (or worse, access to the wrong
file). So the underlying libraries have to
pass a byte string up to the web application, and the web
application has to translate that string into a unicode
string just long enough to operate on it with the proper
string tools before sending it back as an encoded byte string.
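That round trip, decode with whatever character set the app believes the path uses, manipulate, re-encode, might be sketched like this (swap_action and the utf-8 default are illustrative, not part of any library):

```python
def swap_action(raw_path: bytes, charset: str = "utf-8") -> bytes:
    """Turn a ?edit URL path into a ?view one, round-tripping the bytes.

    `charset` is the web app's guess at how the path bytes were
    encoded; if the guess is wrong, the decode fails or the result
    points at the wrong file.
    """
    path = raw_path.decode(charset)        # bytes -> unicode, just long enough...
    path = path.replace("?edit", "?view")  # ...to use ordinary string tools...
    return path.encode(charset)            # ...then back to bytes.

raw = "vaña.txt?edit".encode("utf-8")
print(swap_action(raw))  # b'va\xc3\xb1a.txt?view'

# A wrong guess fails outright here: latin-1 bytes are not valid utf-8.
try:
    swap_action("vaña.txt?edit".encode("latin-1"))  # guesses utf-8
except UnicodeDecodeError:
    print("wrong charset guess")
```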
So then, what is the big deal with python3000? How
does it
make the unicode situation better at all? Consistency.
In python2.x, there are many cases where a module will
only
work with byte strings or only work with unicode strings
despite the fact that either one would be valid for the work
being done. In one case I've seen, it's possible to reach
a point where the Windows filesystem requires a unicode type
to do the right thing but the subprocess module requires a
byte string, so there's no way to operate on non-ASCII filenames.
python3000 will make unicode strings the norm when
dealing
with strings in code. If you get a bytes type returned to
you, you'll know that there's a reason you're getting it,
rather than just assuming that a module didn't convert
things to unicode. This will hopefully make errors from
non-conversion occur closer to their origin and make
potential error conditions easier to find, since seeing a bytes
type where you expect a string is a major clue that
something has gone wrong.
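A minimal illustration of that payoff (the names here are hypothetical, not from the experiment above): in python3000, bytes and unicode strings refuse to mix silently, so a missed conversion fails at the point of the mistake instead of corrupting data somewhere downstream.

```python
prefix = "http://localhost/"       # unicode string
name = b"\xc2\xbd\xc3\xb1.html"    # raw bytes straight from the filesystem

# In python 2 this kind of mixing could succeed silently; in python 3
# it fails immediately, right where the missing conversion is.
try:
    url = prefix + name
except TypeError:
    print("can't mix str and bytes; decode first")

url = prefix + name.decode("utf-8")
print(url)  # http://localhost/½ñ.html
```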