joey is currently certified at Master level.

Name: Joey Hess
Member since: 2000-03-06 23:42:41
Last Login: 2011-12-31 20:04:52

Homepage: http://kitenet.net/~joey

Projects

Recent blog entries by joey

more on ghc filename encodings

My last post missed an important thing about GHC 7.4's handling of encodings for FilePath. It can in fact be safe to use FilePath to write a command like rm. This is because GHC internally uses a special encoding for FilePath data, that is documented to allow "arbitrary undecodable bytes to be round-tripped through it". (It seems to do this by encoding the undecodable bytes as very high unicode code points.) So, when presented with a filename that cannot be decoded using utf-8 (or whatever the system encoding is), it still handles it, and using the resulting FilePath will in fact operate on the right file. Whew!
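The round-trip behaviour described above can be checked with a small sketch. It assumes the `GHC.Foreign` marshalling functions that ship with base alongside `getFileSystemEncoding`; the helper names `decodeFileName`, `encodeFileName`, and `roundTrips` are made up for illustration and are not part of any library.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray, withArrayLen)
import Foreign.Ptr (castPtr)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (getFileSystemEncoding)

-- Hypothetical helper: decode raw filename bytes into a FilePath
-- using the same filesystem encoding GHC applies to filenames.
decodeFileName :: [Word8] -> IO FilePath
decodeFileName bytes = do
  enc <- getFileSystemEncoding
  withArrayLen bytes $ \len ptr ->
    GHC.peekCStringLen enc (castPtr ptr, len)

-- Hypothetical helper: encode a FilePath back into raw bytes.
encodeFileName :: FilePath -> IO [Word8]
encodeFileName path = do
  enc <- getFileSystemEncoding
  GHC.withCStringLen enc path $ \(ptr, len) ->
    peekArray len (castPtr ptr)

-- True when decoding and re-encoding gives back the original bytes.
roundTrips :: [Word8] -> IO Bool
roundTrips bytes = do
  path <- decodeFileName bytes
  bytes' <- encodeFileName path
  return (bytes' == bytes)
```

For instance, `roundTrips [0x6d, 0x65, 0xff]` feeds in the bytes of "me" followed by a byte that is not valid UTF-8, which is the kind of filename the paragraph above is worried about.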

Moral of the story is that if you're going to be using GHC 7.4 to read or write filenames from a pipe, or a file, you need to arrange for the Handle you're reading or writing to use this special encoding too. I use this to set up my Handles:

  import System.IO
  import GHC.IO.Encoding
  import GHC.IO.Handle

  fileEncoding :: Handle -> IO ()
  fileEncoding h = hSetEncoding h =<< getFileSystemEncoding

Even if you're only going to write a FilePath to stdout, you need to do this. Otherwise, your program will crash on some filenames! This doesn't seem quite right to me, but I hesitate to file a bug report. (And this is not a new problem in GHC anyway.) If I did, it would have this testcase:

  # touch "meĀ”"
  # LANG=C ghci
  Prelude> :m System.Directory
  Prelude System.Directory> mapM_ putStrLn =<< getDirectoryContents "."
  me*** Exception: <stdout>: hPutChar: invalid argument (invalid character)

Since git-annex reads lots of filenames from git commands and other places, I had to deal with this extensively. Unfortunately I have not found a way to read Text from a Handle using the fileSystemEncoding. So I'm stuck with slow Strings. But, it does seem to work now.
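As a rough sketch of how filenames might be read from and written to a Handle with this encoding in place: the helpers `readFileNames` and `writeFileNames` below are hypothetical names, not git-annex's actual code, and they assume one filename per line, which breaks for filenames that contain a newline.

```haskell
import GHC.IO.Encoding (getFileSystemEncoding)
import System.IO

-- Switch a Handle to the encoding GHC uses for FilePath data.
fileEncoding :: Handle -> IO ()
fileEncoding h = hSetEncoding h =<< getFileSystemEncoding

-- Hypothetical helper: lazily read newline-separated filenames.
readFileNames :: Handle -> IO [FilePath]
readFileNames h = do
  fileEncoding h
  fmap lines (hGetContents h)

-- Hypothetical helper: write filenames, one per line.
writeFileNames :: Handle -> [FilePath] -> IO ()
writeFileNames h names = do
  fileEncoding h
  mapM_ (hPutStrLn h) names
```

Calling `writeFileNames stdout` on the result of getDirectoryContents would be one way to sidestep the hPutChar exception from the test case, since stdout is switched to the filesystem encoding before anything is printed.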


PS: I found a bug in GHC 7.4 today where one of those famous Haskell immutable values seems to get, well, mutated. Specifically a [FilePath] that is non-empty at the top of a function ends up empty at the bottom. Unless IO is done involving it at the top. Really. Hope to develop a test case soon. Happily, the code that triggered it did so while working around a bug in GHC that is fixed in 7.4. Language bugs.. gotta love em.

Syndicated 2012-02-03 20:11:32 from see shy jo

unicode ate my homework

I've just spent several days trying to adapt git-annex to changes in ghc 7.4's handling of unicode in filenames. And by spent, I mean, time withdrawn from the bank, and frittered away.

In kindergarten, the top of the classroom wall was encircled by the aA bB cC of the alphabet. I'll bet they still put that up on the walls. And all the kids who grow up to become involved with computers learn that was a lie. The alphabet doesn't stop at zZ. It wouldn't all fit on a wall anymore.

So we're in a transition period, where we've all learnt deeply the alphabet, but the reality is much more complicated. And the collision between that intuitive sense of the world and the real world makes things more complicated still. And so, until we get much farther along in this transition period, you have to be very lucky indeed to not have wasted time dealing with that complexity, or at least having encountered Mojibake.

Most of the pain centers around programming languages, and libraries, which are all at different stages of the transition from ascii and other legacy encodings to unicode.

  • If you're using C, you likely deal with all characters as raw bytes, and rely on the backwards compatibility built into UTF-8, or you go to long lengths to manually deal with wide characters, so you can intelligently manipulate strings. The transition has barely begun, and will, apparently, never end.
  • If you're using perl (at least like I do in ikiwiki), everything is (probably) unicode internally, but every time you call a library or do IO you have to manually deal with conversions, that are generally not even documented. You constantly find new encoding bugs. (If you're lucky, you don't find outright language bugs... I have.) You're at a very uncomfortable midpoint of the transition.
  • If you're using haskell, or probably lots of other languages like python and ruby, everything is unicode all the time.. except for when it's not.
  • If you're using javascript, the transition is basically complete.

My most recent pain is because the haskell GHC compiler is moving along in the transition, getting closer to the end. Or at least finishing the second 80% and moving into the third 80%. (This is not a quick transition..)

The change involves filename encodings, a situation that, at least on unix systems, is a vast mess of its own. Any filename, anywhere, can be in any encoding, and there's no way to know what's the right one, if you dislike guessing.

Haskell folk like strongly typed stuff, so this ambiguity about what type of data is contained in a FilePath type was surely anathema. So GHC is changing to always use UTF-8 for operations on FilePath. (Or whatever the system encoding is set to, but let's just assume it's UTF-8.)

Which is great and all, unless you need to write a Haskell program that can deal with arbitrary files. Let's say you want to delete a file. Just a simple rm. Now there are two problems:

  1. The input filename is assumed to be in the system encoding aka unicode. What if it cannot be validly interpreted in that encoding? Probably your rm throws an exception.
  2. Once the FilePath is loaded, it's been decoded to unicode characters. In order to call unlink, these have to be re-encoded to get a filename. Will that be the same bytes as the input filename and the filename on disk? Possibly not, and then the rm will delete the wrong thing, or fail.

But haskell people are smart, so they thought of this problem, and provided a separate type that can deal with it. RawFilePath harks back to kindergarten; the filename is simply a series of bytes with no encoding. Which means it cannot be converted to a FilePath without encountering the above problems. But does let you write a safe rm in ghc 7.4.
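A minimal sketch of such an rm, assuming the ByteString variants of the POSIX bindings in the unix package (`removeLink` from System.Posix.Files.ByteString). The names `rawRm` and `asciiPath` are invented for this example.

```haskell
import Control.Exception (IOException, try)
import qualified Data.ByteString.Char8 as B8
import System.Posix.ByteString.FilePath (RawFilePath)
import System.Posix.Files.ByteString (removeLink)

-- Hypothetical helper: unlink a file named by raw bytes. The bytes are
-- handed to unlink() untouched, so no decoding or re-encoding can
-- change which file is removed. Failures come back as a Left value.
rawRm :: RawFilePath -> IO (Either IOException ())
rawRm = try . removeLink

-- Hypothetical convenience for ASCII-only literals in examples; real
-- input would arrive as bytes already (arguments, pipes, readdir).
asciiPath :: String -> RawFilePath
asciiPath = B8.pack
```

In a real program the paths would more likely come from System.Posix.Env.ByteString.getArgs, which returns the command-line arguments as raw ByteStrings, so nothing ever passes through a String.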

So I set out to make something more complicated than a rm, that still needs to deal with arbitrary filename encodings. And I soon saw it would be problematic. Because the things ghc can do with RawFilePaths are limited. It can't even split the directory from the filename. We often do need to manipulate filenames in such ways, even if we don't know their encoding, when we're doing something more complicated than rm.

If you use a library that does anything useful with FilePath, it's not available for RawFilePath. If you used standard haskell stuff like readFile and writeFile, it's not available for RawFilePath either. Enjoy your low-level POSIX interface!

So, I went low-level, and wrote my own RawFilePath versions of pretty much all of System.FilePath, and System.Directory, and parts of MissingH and other libraries. (And noticed that I can understand all this Haskell code.. yay!) And I got it close enough to working that, I'm sure, if I wanted to chase type errors for a week, I could get git-annex, with ghc 7.4, to fully work on any encoding of filenames.
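To give a flavour of what one such RawFilePath function might look like, here is a sketch of splitting the directory from the filename on raw bytes. It is not the code from git-annex; `rawSplitFileName` is a hypothetical name, and it assumes unix-style paths where only the byte 0x2F ('/') separates components.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8

-- Same representation the unix package uses for RawFilePath.
type RawFilePath = B.ByteString

-- Hypothetical byte-level analogue of splitFileName: everything up to
-- and including the last '/' is the directory, the rest is the file.
-- A bare name gets "./" as its directory.
rawSplitFileName :: RawFilePath -> (RawFilePath, RawFilePath)
rawSplitFileName path
  | B.null dir = (B8.pack "./", file)
  | otherwise = (dir, file)
  where
    (dir, file) = B.breakEnd (== slash) path
    slash = 47 -- the byte for '/'
```

Because the split only looks for one byte value, it never needs to know what encoding the rest of the name is in.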

But, now I'm left wondering what to do, because all this work is regressive; it's swimming against the tide of the transition. GHC's change is certainly the right change to make for most programs, that are not like rm. And so most programs and libraries won't use RawFilePath. This risks leaving a program that does a fish out of water.

At this point, I'm inclined to make git-annex support only unicode (or the system encoding). That's easy. And maybe have a branch that uses RawFilePath, in a hackish and type-unsafe way, with no guarantees of correctness, for those who really need it.


Previously: unicode eye chart wanted on a bumper sticker abc boxes unpacking boxes

Syndicated 2012-02-02 22:12:02 from see shy jo

announcing github-backup

Partly as a followup to a Github survey, and partly because I had a free evening and the need to write more haskell code, any haskell code, I present to you, github-backup.

github-backup is a simple tool you run in a git repository you cloned from Github. It backs up everything Github knows about the repository, including other forks, issues, comments, milestones, pull requests, and watchers.

This is all stored in the repository, as regular files, on a "github" branch.

Available in Cabal now, in Debian maybe if someone packages haskell-github.

Syndicated 2012-01-26 04:44:10 from see shy jo

olduse.net 1982

Hard to believe I've consumed all of 1981's Usenet posts now on olduse.net, and it's been running for 7 months already.


Last night, there was a "very long" post, describing nearly every node on usenet in 1982. There had been a warning about this post the day before, since it would take many sites half an hour to download at 300 baud. It was handily formatted as a shell script, which created per-node files.

So, I ran this code nobody has run since 1982. It worked. I got files. I tossed them on the olduse.net wiki, and used some ikiwiki code TOVA contracted me to write just a few months ago, to make clickable links on my usenet map.

The map data was contributed in another post a while back. By 1982, usenet is getting nearly impossible to map with 1982 technology of ascii art. I enjoyed throwing graphviz, git, wikis, and the web at it.

So, we have a collaboration across time, me and "Mark" and a lot of people who described their usenet nodes and piles of technology that make creating a mashup easy. Awesome!


I blog about stuff I find on the olduse.net blog. It's an open blog; Koldfront also blogs there, and we welcome other bloggers.

Some of the highlights for me have included:

As the space shuttle program is winding down, reading the excitement about the first shuttle flights, and the play-by-play coverage of a launch, posted to net.columbia by a high school student borrowing his dad's account. (A newsgroup name that's hard to read without remembering its fate.)

The announcements of the Motorola M68k, the IBM PC, and the CD-ROM.

Reading the TCP-IP digest, and Postel's plans for launching IPv4 soon, while the world IPv6 launch is being planned now. (The nay-sayers are especially fun to read. Including the guy who was concerned about the address space size, in 1981!)

Learning that nethack ascension tales have a history stretching back 30 years, to rogue, and that the stories back then had much the same flavor as they do today.

Various celebrity sightings. Dennis Ritchie teaching C and Unix. Bill Joy talking vi. RMS talking .. nuclear politics?

The general development of usenet. B-news being rolled out, groups proliferating, many first inklings of what will be major problems and developments in 5 or 10 years. A shift in tone is already apparent; by now usenet is not only about announcements, and there are already some flames.

Still 9 years to go!

Syndicated 2012-01-21 20:58:05 from see shy jo

version numbers

Today I released two entirely different pieces of software with the identical version number 3.20120115. Debian developers also will be soon noticing a piece of software I released with the version number 9.20120115.

I expect to move more of my software to this version number scheme over time, unless I find something badly wrong with it. It reflects how I think about versions for my software; there's a kind of continual "now" that development progresses through, in which individual releases have little discrete meaning and at the same time, there can also be significant discontinuities, that require the user to do something to deal with (such as a new debhelper compat version, or a new git-annex repository format).

Those two things are really all that I need a version number for my software to communicate. I can do without the rest of the things that version numbers are used for:

  • The marketing of version 1.0 and 2.0.
  • The comparative nuances such as whether 1.0 to 1.1 is a relatively big change, and 1.0 to 1.0.1 is a relatively small change.
  • The implication that 0.99 is almost 1.0 ready, and 1.1a is some kind of alpha release.

There is so much software, with so many version numbers, that any signal encoded in such version numbers is swamped in the noise. Even on projects that I develop, a version number like 2.88 is meaningless to me. All I care about is, how long ago was that version? Has there been a major change breaking compatibility since that version? "2.88" doesn't answer these questions well; "3.20111111" does.

It is a little wordy to have the full year in there, and it can be annoying to remember to set the version to the right date on release day (TODO: automate). This is balanced with the version not being so wordy as to include the time of day, which means I might have to do a 3.20120115.1 if I goof up. These minor problems are worth it to instantly know how old a version is when a user pastes it into a bug report.
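A sketch of how this scheme could be generated and read back mechanically, using Data.Time.Calendar from the time package. The function names (`formatVersion`, `parseVersion`, `ageInDays`) are hypothetical and this is not the automation used for any of the releases mentioned above; it assumes the layout "compat.YYYYMMDD" with an optional ".N" suffix.

```haskell
import Data.Char (isDigit)
import Data.Time.Calendar (Day, diffDays, fromGregorianValid, toGregorian)
import Text.Printf (printf)

-- Hypothetical: render a version such as "3.20120115" from a
-- compatibility number and a release date.
formatVersion :: Int -> Day -> String
formatVersion compat day = printf "%d.%04d%02d%02d" compat y m d
  where
    (y, m, d) = toGregorian day

-- Hypothetical: parse "3.20120115" or "3.20120115.1" into the
-- compatibility number and release date; Nothing for anything else.
parseVersion :: String -> Maybe (Int, Day)
parseVersion s =
  case break (== '.') s of
    (major, '.' : rest)
      | validMajor major, length date == 8, validSuffix suffix ->
          fmap (\day -> (read major, day))
               (fromGregorianValid (read year) (read month) (read dom))
      where
        (date, suffix) = span isDigit rest
        (year, monthDay) = splitAt 4 date
        (month, dom) = splitAt 2 monthDay
    _ -> Nothing
  where
    validMajor m = not (null m) && all isDigit m
    validSuffix x = null x || take 1 x == "."

-- Hypothetical: how many days old a version is, relative to a date.
ageInDays :: Day -> String -> Maybe Integer
ageInDays today v =
  fmap (\(_, released) -> diffDays today released) (parseVersion v)
```

With this, a version pasted into a bug report can be turned into an age directly, while something like "2.88" is simply rejected by `parseVersion`.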

And that is probably all I will ever have to say about version numbers. :)

Syndicated 2012-01-16 02:21:07 from see shy jo

 

joey certified others as follows:

  • joey certified joey as Journeyer
  • joey certified davidw as Journeyer
  • joey certified bombadil as Journeyer
  • joey certified dhd as Journeyer
  • joey certified ajt as Journeyer
  • joey certified chrisd as Journeyer
  • joey certified scandal as Journeyer
  • joey certified lewing as Journeyer
  • joey certified jwz as Master
  • joey certified graydon as Journeyer
  • joey certified cas as Journeyer
  • joey certified garrett as Journeyer
  • joey certified lupus as Journeyer
  • joey certified octobrx as Journeyer
  • joey certified pudge as Journeyer
  • joey certified marcel as Journeyer
  • joey certified ljlane as Journeyer
  • joey certified uzi as Journeyer
  • joey certified quinlan as Journeyer
  • joey certified bribass as Journeyer
  • joey certified jonas as Journeyer
  • joey certified dsifry as Journeyer
  • joey certified plundis as Journeyer
  • joey certified deirdre as Journeyer
  • joey certified crackmonkey as Journeyer
  • joey certified jim as Journeyer
  • joey certified vincent as Journeyer
  • joey certified apenwarr as Journeyer
  • joey certified schoen as Journeyer
  • joey certified CentralScrutinizer as Apprentice
  • joey certified wichert as Master
  • joey certified doogie as Journeyer
  • joey certified espy as Journeyer
  • joey certified omnic as Journeyer
  • joey certified hands as Journeyer
  • joey certified stig as Journeyer
  • joey certified nick as Journeyer
  • joey certified tausq as Journeyer
  • joey certified broonie as Journeyer
  • joey certified dunham as Journeyer
  • joey certified austin as Journeyer
  • joey certified lordsutch as Journeyer
  • joey certified Gimptek as Apprentice
  • joey certified jimd as Journeyer
  • joey certified chip as Master
  • joey certified jgg as Master
  • joey certified branden as Journeyer
  • joey certified z as Journeyer
  • joey certified srivasta as Journeyer
  • joey certified danpat as Journeyer
  • joey certified lilo as Journeyer
  • joey certified seeS as Journeyer
  • joey certified netgod as Journeyer
  • joey certified dres as Journeyer
  • joey certified cech as Journeyer
  • joey certified knghtbrd as Journeyer
  • joey certified calc as Journeyer
  • joey certified ruud as Journeyer
  • joey certified edlang as Journeyer
  • joey certified gorgo as Journeyer
  • joey certified jwalther as Journeyer
  • joey certified bma as Journeyer
  • joey certified claw as Apprentice
  • joey certified hp as Journeyer
  • joey certified esr as Master
  • joey certified tobi as Journeyer
  • joey certified ajk as Journeyer
  • joey certified Joy as Journeyer
  • joey certified ejb as Journeyer
  • joey certified corbet as Journeyer
  • joey certified rcw as Journeyer
  • joey certified woot as Journeyer
  • joey certified bcollins as Journeyer
  • joey certified neuro as Journeyer
  • joey certified biffhero as Journeyer
  • joey certified Trakker as Journeyer
  • joey certified bdale as Journeyer
  • joey certified foka as Journeyer
  • joey certified davem as Master
  • joey certified logic as Journeyer
  • joey certified mstone as Journeyer
  • joey certified drow as Journeyer
  • joey certified clameter as Journeyer
  • joey certified mdorman as Journeyer
  • joey certified bwoodard as Journeyer
  • joey certified JHM as Journeyer
  • joey certified lalo as Journeyer
  • joey certified edb as Journeyer
  • joey certified shaleh as Journeyer
  • joey certified x as Apprentice
  • joey certified stephenc as Journeyer
  • joey certified bodo as Journeyer
  • joey certified jpick as Journeyer
  • joey certified ncm as Journeyer
  • joey certified gord as Journeyer
  • joey certified mpav as Journeyer
  • joey certified lazarus as Apprentice
  • joey certified starshine as Journeyer
  • joey certified che as Journeyer
  • joey certified brother as Journeyer
  • joey certified joeysmith as Journeyer
  • joey certified bod as Journeyer
  • joey certified decklin as Journeyer
  • joey certified gibreel as Journeyer
  • joey certified torsten as Journeyer
  • joey certified alfie as Apprentice
  • joey certified aclark as Journeyer
  • joey certified kju as Journeyer
  • joey certified psg as Journeyer
  • joey certified zed as Journeyer
  • joey certified evo as Journeyer
  • joey certified mbaker as Journeyer
  • joey certified cmr as Journeyer
  • joey certified Tv as Journeyer
  • joey certified xtifr as Journeyer
  • joey certified sstrickl as Journeyer
  • joey certified etbe as Journeyer

Others have certified joey as follows:

  • joey certified joey as Journeyer
  • dhd certified joey as Journeyer
  • ajt certified joey as Master
  • davidw certified joey as Journeyer
  • alan certified joey as Journeyer
  • uzi certified joey as Journeyer
  • caolan certified joey as Journeyer
  • tron certified joey as Master
  • bombadil certified joey as Journeyer
  • cas certified joey as Journeyer
  • garrett certified joey as Master
  • lupus certified joey as Journeyer
  • graydon certified joey as Journeyer
  • marcel certified joey as Journeyer
  • mblevin certified joey as Journeyer
  • bribass certified joey as Master
  • plundis certified joey as Journeyer
  • matias certified joey as Journeyer
  • ajv certified joey as Journeyer
  • crackmonkey certified joey as Master
  • jim certified joey as Master
  • CentralScrutinizer certified joey as Master
  • schoen certified joey as Master
  • pedro certified joey as Master
  • omnic certified joey as Master
  • hands certified joey as Master
  • tausq certified joey as Journeyer
  • suzi certified joey as Master
  • broonie certified joey as Master
  • nick certified joey as Journeyer
  • lordsutch certified joey as Master
  • jimd certified joey as Master
  • chip certified joey as Master
  • jgg certified joey as Master
  • branden certified joey as Master
  • srivasta certified joey as Master
  • danpat certified joey as Master
  • darkewolf certified joey as Master
  • z certified joey as Journeyer
  • cech certified joey as Master
  • dres certified joey as Master
  • gorgo certified joey as Master
  • ruud certified joey as Master
  • kaig certified joey as Master
  • wichert certified joey as Master
  • ajk certified joey as Master
  • ljlane certified joey as Master
  • Joy certified joey as Journeyer
  • andrei certified joey as Master
  • rcw certified joey as Master
  • Trakker certified joey as Master
  • neuro certified joey as Master
  • starshine certified joey as Master
  • seeS certified joey as Master
  • foka certified joey as Master
  • pretzelgod certified joey as Master
  • mstone certified joey as Master
  • bcollins certified joey as Master
  • doviende certified joey as Master
  • dmarti certified joey as Master
  • splork certified joey as Master
  • bdale certified joey as Master
  • drow certified joey as Master
  • edward certified joey as Master
  • ljb certified joey as Journeyer
  • claw certified joey as Master
  • edb certified joey as Master
  • shaleh certified joey as Master
  • jpick certified joey as Master
  • zacs certified joey as Journeyer
  • jae certified joey as Master
  • benson certified joey as Journeyer
  • wardv certified joey as Master
  • jeroen certified joey as Master
  • lazarus certified joey as Journeyer
  • mpav certified joey as Master
  • walken certified joey as Master
  • ncm certified joey as Master
  • Barbwired certified joey as Master
  • kraai certified joey as Master
  • che certified joey as Master
  • lstep certified joey as Master
  • brother certified joey as Master
  • nas certified joey as Journeyer
  • acme certified joey as Master
  • moshez certified joey as Master
  • tca certified joey as Journeyer
  • cord certified joey as Master
  • sethcohn certified joey as Master
  • bod certified joey as Journeyer
  • tripix certified joey as Journeyer
  • jLoki certified joey as Master
  • sh certified joey as Master
  • lerdsuwa certified joey as Master
  • torsten certified joey as Master
  • alfie certified joey as Master
  • mhatta certified joey as Master
  • aclark certified joey as Master
  • kju certified joey as Master
  • psg certified joey as Master
  • zed certified joey as Master
  • karlheg certified joey as Master
  • evo certified joey as Master
  • ole certified joey as Master
  • jfs certified joey as Master
  • bma certified joey as Master
  • jtc certified joey as Master
  • gibreel certified joey as Master
  • Jordi certified joey as Master
  • jhasler certified joey as Master
  • cpbs certified joey as Journeyer
  • ths certified joey as Master
  • decklin certified joey as Master
  • Tv certified joey as Master
  • xtifr certified joey as Master
  • joeysmith certified joey as Master
  • mishan certified joey as Master
  • keverets certified joey as Master
  • pa certified joey as Master
  • Slimer certified joey as Master
  • weasel certified joey as Master
  • technik certified joey as Master
  • baretta certified joey as Master
  • robster certified joey as Master
  • juhtolv certified joey as Master
  • rcyeske certified joey as Master
  • kmself certified joey as Master
  • andersee certified joey as Master
  • asuffield certified joey as Master
  • charon certified joey as Master
  • claviola certified joey as Master
  • chrisd certified joey as Master
  • mdz certified joey as Master
  • buckley certified joey as Master
  • moray certified joey as Master
  • jtjm certified joey as Master
  • mwk certified joey as Master
  • proski certified joey as Master
  • cmiller certified joey as Master
  • pau certified joey as Master
  • rkrishnan certified joey as Master
  • dieman certified joey as Master
  • eckes certified joey as Master
  • fxn certified joey as Master
  • etbe certified joey as Master
  • Sam certified joey as Master
  • fallenlord certified joey as Master
  • hanna certified joey as Master
  • maxx certified joey as Master
  • dopey certified joey as Master
  • tfheen certified joey as Master
  • ttroxell certified joey as Master
  • Netsnipe certified joey as Master
  • quarl certified joey as Journeyer
  • amck certified joey as Master
  • riverwind certified joey as Master
  • pere certified joey as Journeyer
  • NoWhereMan certified joey as Master
  • jochen certified joey as Master
  • faw certified joey as Master
  • mako certified joey as Master
  • Pizza certified joey as Master
  • sysdebug certified joey as Master
  • vern certified joey as Master
  • ctrlsoft certified joey as Master
  • lkcl certified joey as Master
  • hasienda certified joey as Master
  • gesslein certified joey as Master
  • ean certified joey as Master
