unicode ate my homework
I've just spent several days trying to adapt git-annex to changes in GHC
7.4's handling of unicode in filenames. And by spent, I mean time
withdrawn from the bank, and frittered away.
In kindergarten, the top of the classroom wall was encircled by the aA bB
cC of the alphabet. I'll bet they still put that up on the walls.
And all the kids who grow up to become involved with computers learn
that was a lie. The alphabet doesn't stop at zZ. It wouldn't all fit
on a wall anymore.
So we're in a transition period, where we've all deeply learnt the
alphabet, but the reality is much more complicated. And the collision
between that intuitive sense of the world and the real world makes things
more complicated still. And so, until we get much farther along in this
transition period, you have to be very lucky indeed not to have wasted
time dealing with that complexity, or at least not to have encountered
Mojibake.
Most of the pain centers around programming languages, and libraries,
which are all at different stages of the transition from ascii
and other legacy encodings to unicode.
- If you're using C, you likely deal with all characters as raw bytes,
and rely on the backwards compatibility built into UTF-8, or you
go to great lengths to manually deal with wide characters, so you can
intelligently manipulate strings. The transition has barely begun,
and will, apparently, never end.
- If you're using perl (at least like I do in ikiwiki), everything
is (probably) unicode internally, but every time you call a library
or do IO you have to manually deal with conversions that are generally
not even documented. You constantly find new encoding bugs.
(If you're lucky, you don't find outright language bugs... I have.)
You're at a very uncomfortable midpoint of the transition.
- If you're using haskell, or probably lots of other languages like python
and ruby, everything is unicode all the time.. except for when it's not.
- If you're using javascript, the transition is basically complete.
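The "except for when it's not" can be seen in one line. For instance (my own illustration, not something from git-annex), Data.ByteString.Char8 looks like it handles text, but keeps only the low 8 bits of each character:

```haskell
import qualified Data.ByteString.Char8 as B

-- B.pack keeps only the low 8 bits of each Char, so any character
-- outside Latin-1 is silently mangled on the round trip.
mangle :: String -> String
mangle = B.unpack . B.pack

-- mangle "\9731" (a snowman, U+2603) comes back as "\ETX" (U+0003).
```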
My most recent pain is because GHC, the Haskell compiler, is moving along
in the transition, getting closer to the end. Or at least finishing
the second 80% and moving into the third 80%. (This is not a quick
transition..)
The change involves filename encodings, a situation that, at least on unix
systems, is a vast mess of its own. Any filename, anywhere, can be in any
encoding, and there's no way to know what's the right one, if you dislike
guessing.
Haskell folk like strongly typed stuff, so this ambiguity about what type
of data is contained in a FilePath was surely anathema. So GHC is
changing to always use UTF-8 for operations on FilePath.
(Or whatever the system encoding is set to, but let's just assume it's
UTF-8.)
Which is great and all, unless you need to write a Haskell program
that can deal with arbitrary files. Let's say you want to delete
a file. Just a simple rm. Now there are two problems:
- The input filename is assumed to be in the system encoding, aka unicode.
What if it cannot be validly interpreted in that encoding?
Probably your rm throws an exception.
- Once the FilePath is loaded, it's been decoded to unicode characters.
In order to call unlink, these have to be re-encoded to get a
filename. Will that be the same bytes as the input filename and the
filename on disk? Possibly not, and then the rm will delete the wrong
thing, or fail.
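To make that concrete, here's the naive rm (a sketch of mine, not code from git-annex):

```haskell
import System.Directory (removeFile)
import System.Environment (getArgs)

-- getArgs decodes the command line using the system encoding, and
-- removeFile re-encodes each FilePath before calling unlink. Either
-- step can fail, or produce different bytes, for a filename that
-- isn't valid in that encoding.
main :: IO ()
main = mapM_ removeFile =<< getArgs
```

Both hazards are invisible in the source; they only show up when the program meets a filename in some other encoding.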
But haskell people are smart, so they thought of this problem, and provided
a separate type that can deal with it. RawFilePath harks back to
kindergarten; the filename is simply a series of bytes with no encoding.
Which means it cannot be converted to a FilePath without encountering the
above problems. But it does let you write a safe rm in GHC 7.4.
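Using the ByteString-based bindings from the unix package (module names are my assumption, as they appear in unix 2.5's interface), the safe rm is just as short:

```haskell
import System.Posix.ByteString (removeLink)
import System.Posix.Env.ByteString (getArgs)

-- This getArgs returns raw ByteStrings, and removeLink takes a
-- RawFilePath (also a ByteString), so the filename bytes pass
-- through to unlink untouched, never decoded.
main :: IO ()
main = mapM_ removeLink =<< getArgs
```

Whatever bytes the shell passed in are exactly the bytes unlink sees.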
So I set out to make something more complicated than a rm, that still needs
to deal with arbitrary filename encodings. And I soon saw it would be
problematic, because the things GHC can do with RawFilePaths are limited.
It can't even split the directory from the filename. We often do need to
manipulate filenames in such ways, even if we don't know their encoding,
when we're doing something more complicated than rm.
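Splitting on bytes is easy enough to hand-roll, though. Here's a minimal sketch (splitFileNameRaw is my own name, not an API):

```haskell
import qualified Data.ByteString.Char8 as B

-- Split a RawFilePath into (directory part, file name) at the last
-- '/', treating the path purely as bytes. UTF-8's backwards
-- compatibility guarantees a '/' byte never occurs inside a
-- multibyte character, so no decoding is needed for this.
splitFileNameRaw :: B.ByteString -> (B.ByteString, B.ByteString)
splitFileNameRaw = B.spanEnd (/= '/')
```

This is the byte-level equivalent of System.FilePath's splitFileName, and is safe for any encoding that is an ASCII superset.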
If you use a library that does anything useful with FilePath, it's not
available for RawFilePath. If you use standard haskell stuff like
readFile and writeFile, it's not available for RawFilePath either.
Enjoy your low-level POSIX interface!
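What that POSIX layer does give you is the raw syscalls. For example, a stat-based file size over RawFilePath (a sketch; fileSizeRaw is my own name, built on the unix package's ByteString interface):

```haskell
import System.Posix.ByteString (RawFilePath, getFileStatus, fileSize)
import System.Posix.Types (FileOffset)

-- stat the file by its raw bytes; no decoding of the name happens.
fileSizeRaw :: RawFilePath -> IO FileOffset
fileSizeRaw path = fmap fileSize (getFileStatus path)
```

Anything higher-level than a syscall wrapper, you get to build yourself.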
So, I went low-level, and wrote my own RawFilePath versions of pretty much
all of System.FilePath, and System.Directory, and parts of MissingH
and other libraries. (And noticed that I can understand all this Haskell
code.. yay!) And I got it close enough to working that, I'm sure,
if I wanted to chase type errors for a week, I could get git-annex, with
GHC 7.4, to fully work on any encoding of filenames.
But now I'm left wondering what to do, because all this work is
regressive; it's swimming against the tide of the transition. GHC's
change is certainly the right change to make for most programs, which are
not like rm. And so most programs and libraries won't use RawFilePath.
This risks leaving a program that does a fish out of water.
At this point, I'm inclined to make git-annex support only unicode (or the
system encoding). That's easy. And maybe have a branch that uses
RawFilePath, in a hackish and type-unsafe way, with no guarantees
of correctness, for those who really need it.
Previously: unicode eye chart, wanted on a bumper sticker, abc boxes, unpacking boxes
Syndicated 2012-02-02 22:12:02 from see shy jo