When people tell me that python3000 will "solve" the unicode problem in python, I always shake my head and say that unicode handling will be better but there are still plenty of places where they'll consider it "broken".
That's because people are really asking whether they'll stop getting UnicodeErrors in their code and stop having to manually convert between byte strings and unicode. That part will never go away, simply because data on computers is stored as bytes and has to go through a translation before it can be recognized as unicode.
Take filenames on web servers as an example. I recently created two files on my apache web server with filenames of ½ñ.html encoded in utf-8 and ½ñ.html encoded in latin-1. I tested that apache was serving both files by hitting them in firefox using the hex for their encoded names (%c2%bd%c3%b1.html for the utf-8 version; %bd%f1.html for the latin-1).
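To make the two requests concrete, here is a small sketch that derives both percent-encoded names from the same filename. It assumes a current Python 3 release, where urllib.parse.quote accepts byte strings; this is not the code I ran against 3.0rc1, and the variable names are just for illustration.

```python
from urllib.parse import quote, unquote_to_bytes

name = "½ñ.html"

# The same text becomes two different byte sequences on disk.
utf8_bytes = name.encode("utf-8")      # b'\xc2\xbd\xc3\xb1.html'
latin1_bytes = name.encode("latin-1")  # b'\xbd\xf1.html'

# Percent-encode the raw bytes, as a browser would send them.
utf8_url = quote(utf8_bytes)
latin1_url = quote(latin1_bytes)

# Going back from the URL form only ever gives you bytes, not text.
roundtrip = unquote_to_bytes(latin1_url)
```

quote emits uppercase hex digits, so the results differ from the URLs above only in case. Nothing in either URL says which encoding produced it.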
In python3.0rc1, this resulted in total failure (bug): I wasn't able to retrieve either one of these names, as urllib would only take unicode strings which encoded to the ASCII subset.
But what is the "right" solution? A naive guess would be that the programmer should deal with the url as unicode internally. This is because the programmer may have to manipulate the path as a string. In a web app, for instance, a user might submit a form at "http://localhost/mysite/files/vaña.txt?edit" and the web app needs to parse that url and redirect the user to "http://localhost/mysite/files/vaña.txt?view" afterwards.
However, this approach suffers from a major problem. If the URL is converted into the unicode type before the web app gets it, then the web app will not know which character set to use when encoding the ñ again in order to get the proper data back. Is the ñ written in latin-1? Is it written in utf-8? Guessing wrong will cause a 404 (or worse, access to the wrong file). So the underlying libraries have to pass a byte string up to the web application, and the web application has to translate that string into a unicode string just long enough to operate on it with the proper string tools before sending it back as an encoded byte string.
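A minimal sketch of that "decode just long enough" pattern, assuming the web app somehow knows the charset of the incoming URL (how it learns that is the hard part, and is not shown here). The function name switch_action and its parameters are made up for this example.

```python
def switch_action(raw_url: bytes, charset: str, new_action: str) -> bytes:
    """Replace the query part of raw_url, keeping its original encoding."""
    # Decode only for the duration of the string manipulation.
    text = raw_url.decode(charset)
    base, _sep, _old_action = text.partition("?")
    # Re-encode with the same charset so the server finds the same file.
    return (base + "?" + new_action).encode(charset)


edit_url = "http://localhost/mysite/files/vaña.txt?edit"

utf8_result = switch_action(edit_url.encode("utf-8"), "utf-8", "view")
latin1_result = switch_action(edit_url.encode("latin-1"), "latin-1", "view")
```

Both results read as ".../vaña.txt?view" once decoded, but the bytes differ, so they name different files on the server. Passing the latin-1 bytes with charset="utf-8" raises a UnicodeDecodeError instead of silently picking the wrong file.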
So then, what is the big deal with python3000? How does it make the unicode situation better at all? Consistency.
In python2.x, there are many times where a module will only work with byte strings or only work with unicode strings despite the fact that either one would be valid for the work being done. In one case I've seen, it's possible to get to a point where the Windows filesystem requires a unicode type to do the right thing but the subprocess module requires a byte string, so there's no way to operate on non-ASCII filenames.
python3000 will make unicode strings the norm when dealing with strings in code. If you get a byte type returned to you, you'll know there's a reason for it, rather than having to assume that the module simply didn't convert things to unicode. This will hopefully make errors from non-conversion occur closer to their origin and make potential error conditions easier to find, since seeing a byte type where you expect a string will be a major clue that something has gone wrong.
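As a rough illustration of errors surfacing closer to their origin, here is how Python 3 reacts when bytes and strings get mixed. This is a sketch run against a current Python 3; the filename and the assumption that it is utf-8 are invented for the example.

```python
# Pretend a lower-level library handed us a raw filename.
raw_name = b"va\xc3\xb1a.txt"

mixed_error = None
try:
    # Python 3 refuses to combine str and bytes implicitly.
    path = "files/" + raw_name
except TypeError as exc:
    mixed_error = exc

# The fix is an explicit decode at the point where the encoding is known
# (utf-8 is an assumption here).
path = "files/" + raw_name.decode("utf-8")
```

In python2.x the same concatenation would have attempted an implicit ASCII decode, which works for ASCII-only data and fails with a UnicodeDecodeError only once a non-ASCII name like this one shows up, often far from where the byte string entered the program.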





