more on ghc filename encodings
My last post missed an important thing about
GHC 7.4's handling of encodings for FilePath. It can in fact be safe to use
FilePath to write a command like rm. This is because GHC internally uses
a special encoding for FilePath data, which is documented to allow
"arbitrary undecodable bytes to be round-tripped through it". (It seems to
do this by encoding the undecodable bytes as very high Unicode code
points.) So, when presented with a filename that cannot be decoded using
UTF-8 (or whatever the system encoding is), it still handles it, and using
the resulting FilePath will in fact operate on the right file. Whew!
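To see the round-tripping for yourself, here is a minimal sketch. It assumes that the "UTF-8//ROUNDTRIP" suffix accepted by mkTextEncoding behaves the same way as the encoding GHC uses for FilePath, which I have not verified in detail. The helpers readWith and writeWith and the scratch file name are made up for this example.

```haskell
import System.IO
import GHC.IO.Encoding (mkTextEncoding)
import Data.Char (ord)

-- Hypothetical helper: read a whole file as text, strictly,
-- decoding with the named encoding.
readWith :: String -> FilePath -> IO String
readWith encName path = do
    enc <- mkTextEncoding encName
    withFile path ReadMode $ \h -> do
        hSetEncoding h enc
        s <- hGetContents h
        length s `seq` return s

-- Hypothetical helper: write text to a file using the named encoding.
writeWith :: String -> FilePath -> String -> IO ()
writeWith encName path s = do
    enc <- mkTextEncoding encName
    withFile path WriteMode $ \h -> do
        hSetEncoding h enc
        hPutStr h s

main :: IO ()
main = do
    let path = "roundtrip-demo.tmp"  -- scratch file, name is arbitrary
        raw  = "me\xFF"              -- byte 0xFF is never valid in UTF-8
    -- Binary mode writes each Char below 256 as a single byte.
    withBinaryFile path WriteMode (\h -> hPutStr h raw)
    decoded <- readWith "UTF-8//ROUNDTRIP" path
    -- Show which code points the undecodable byte was mapped to.
    print (map ord decoded)
    -- Encode the escaped string again and compare the bytes on disk.
    writeWith "UTF-8//ROUNDTRIP" path decoded
    back <- withBinaryFile path ReadMode $ \h -> do
        s <- hGetContents h
        length s `seq` return s
    print (back == raw)
```

If the round trip works as documented, the final comparison should report that the bytes written back are identical to the original ones.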
Moral of the story is that if you're going to be using GHC 7.4 to read or write filenames from a pipe, or a file, you need to arrange for the Handle you're reading or writing to use this special encoding too. I use this to set up my Handles:
    import System.IO
    import GHC.IO.Encoding
    import GHC.IO.Handle

    fileEncoding :: Handle -> IO ()
    fileEncoding h = hSetEncoding h =<< getFileSystemEncoding
Even if you're only going to write a FilePath to stdout, you need to do this. Otherwise, your program will crash on some filenames! This doesn't seem quite right to me, but I hesitate to file a bug report. (And this is not a new problem in GHC anyway.) If I did, it would have this test case:
    # touch "me¡"
    # LANG=C ghci
    Prelude> :m System.Directory
    Prelude System.Directory> mapM_ putStrLn =<< getDirectoryContents "."
    me*** Exception: <stdout>: hPutChar: invalid argument (invalid character)
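For comparison, here is a small standalone sketch of the same directory listing with stdout switched to the filesystem encoding first. It reuses the fileEncoding helper from above. I would expect it to print the name intact under LANG=C, but treat that as an assumption to check on your own system.

```haskell
import System.IO
import GHC.IO.Encoding (getFileSystemEncoding)
import System.Directory (getDirectoryContents)

-- Same helper as in the post: switch a Handle to the encoding
-- GHC uses for FilePath, so escaped bytes can be written back out.
fileEncoding :: Handle -> IO ()
fileEncoding h = hSetEncoding h =<< getFileSystemEncoding

main :: IO ()
main = do
    -- Without this line, putStrLn can throw "invalid character"
    -- on names holding bytes outside the locale's encoding.
    fileEncoding stdout
    mapM_ putStrLn =<< getDirectoryContents "."
```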
Since git-annex reads lots of filenames from git commands and other places, I had to deal with this extensively. Unfortunately I have not found a way to read Text from a Handle using the fileSystemEncoding. So I'm stuck with slow Strings. But, it does seem to work now.
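As an illustration of the String-based approach, here is a rough sketch of reading NUL-separated filenames from a Handle. This is not git-annex's actual code. The names splitNul and readFileNames are invented for the example, and it assumes the input is NUL-delimited, as produced by something like git ls-files -z.

```haskell
import System.IO
import GHC.IO.Encoding (getFileSystemEncoding)

-- Hypothetical helper: split on NUL characters, dropping the empty
-- piece left after a trailing NUL.
splitNul :: String -> [FilePath]
splitNul s = case break (== '\0') s of
    (name, [])       -> [name | not (null name)]
    (name, _ : rest) -> name : splitNul rest

-- Hypothetical helper: read every NUL-separated filename from a Handle,
-- decoding with the filesystem encoding so odd bytes survive.
readFileNames :: Handle -> IO [FilePath]
readFileNames h = do
    hSetEncoding h =<< getFileSystemEncoding
    fmap splitNul (hGetContents h)

main :: IO ()
main = do
    names <- readFileNames stdin
    -- stdout needs the same encoding, or printing may crash.
    hSetEncoding stdout =<< getFileSystemEncoding
    mapM_ putStrLn names
```

Compiled as, say, readnames, it could be fed with git ls-files -z | ./readnames.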
PS: I found a bug in GHC 7.4 today where one of those famous Haskell immutable values seems to get, well, mutated. Specifically, a [FilePath] that is non-empty at the top of a function ends up empty at the bottom. Unless IO is done involving it at the top. Really. Hope to develop a test case soon. Happily, the code that triggered it did so while working around a bug in GHC that is fixed in 7.4. Language bugs.. gotta love em.