Older blog entries for joey (starting at number 489)

popcon graphs for tasks

Last year I was able to switch tasksel to using metapackages, instead of the weird non-package task things that had been used before Debian supported Recommends fields well.

An unanticipated result of the new task packages is that I have this nice popcon data available for them, so can get graphs like these.

For new installs of testing, KDE and Xfce are neck and neck. With Gnome being the default, it's hard to say which desktop users really prefer. My feeling is that it's probably nearly evenly split now.

(I installed Xfce on my sister's laptop last week, and anticipate moving all my family to it, rather than Gnome 3.)

The above graph also shows a surprisingly large number of ssh server task installs. In fact, it's the most often manually installed task. Probably many of those are server machines, and so I'm considering having tasksel automatically select ssh on systems where it doesn't automatically select a desktop.

Language data is also available. Tasksel uses language tasks internally, without exposing an interface, so this will be almost entirely users who did an install of testing localised to their language.

Interesting data can be teased out of this too. For example there seem to be more installs in Catalan than Chinese ... and at least 10 Esperanto users. (As with any popcon number, this is a lower bound, to be multiplied by the scaling guesstimate of your choice.)

By the way, I've got a new vanity domain for my blog and wiki: http://joeyh.name/

The old http://kitenet.net/~joey/ will continue to work, like it has since 1997. But the new is easier to type. And it let me move my site to Branchable, at last.

Syndicated 2012-05-12 22:20:08 from see shy jo

moving my email archives and packages to git-annex

I've recently been moving some important data into git-annex, and finding it simplifies things while also increasing my flexibility.

email archives

I've kept my email archives in git for years. This works ok, just choose the right file format (compressed mbox) and number of files (one archive per mailbox per month or so) and git can handle this well enough, as email is not really large.

But, email is not really small either. Keeping my email repository checked out on my netbook consumes 2 gigabytes of its 30 gigabyte SSD, half of which is duplication in .git. Also, I have only kept it at 2 gigabytes through careful selection of what classes of mail I archive. That made sense when archival disk was more expensive, but what makes sense these days is to archive everything.

For a while I've wanted to have a "raw" archive, that includes all email I receive. (Even spam.) This protects against various disasters in mail filtering or reading. Setting that up was my impetus for switching my mail archives to git-annex today.

The new system I've settled on is to first copy all incoming mail into a "raw" maildir folder. Then mailfilter sorts it into the folders I sync (with offlineimap) and read. Each day, the "raw" folder is moved into an mbox archive, and that's added to the git annex. Each month, the mail I've read is moved into a monthly archive, and added to the git annex. A simple script does the work.
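
A minimal sketch of the daily "raw" step, in Python rather than whatever the real script uses. The function name and the paths passed to it are hypothetical, and adding the resulting mbox to the annex (with git annex add) is left out.

```python
import mailbox


def archive_maildir(maildir_path, mbox_path):
    """Move every message in a maildir into an mbox file; return the count."""
    src = mailbox.Maildir(maildir_path, create=False)
    dst = mailbox.mbox(mbox_path)  # created if it does not exist yet
    dst.lock()
    try:
        keys = list(src.keys())
        for key in keys:
            dst.add(src[key])
        dst.flush()  # make sure the archive is on disk first
        for key in keys:
            src.remove(key)  # only then delete the originals
    finally:
        dst.unlock()
        dst.close()
        src.close()
    return len(keys)
```

Flushing the mbox before removing anything from the maildir means an interrupted run can duplicate mail in the archive but should not lose it.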

I counted the number of copies that existed of my mail when it was stored in git, and found 7 copies spread among just 3 drives. I decided to slim that back, and configured git-annex to require only 5 copies. But those 5 copies will be spread among more drives, including several offline archival drives, so it will be more robust overall.

My netbook will have an incomplete checkout of my mail, omitting the "raw" archive. If I need to peek inside a spam folder for a lost mail, I can quickly pull it down; if I need to free up space I can quickly drop older archives. This is the flexibility that git-annex fans love. :)

By the way, this also makes it easier to permanently delete mail, when you really need to (ie, for contractual reasons). Before, I'd have to do a painful git-filter-branch if I needed to get rid of eg, mail for old jobs. Now I can git annex drop --force.

Pro Tip: If you're doing this kind of migration to git-annex, you can save bandwidth by not re-transferring files to machines that already have a copy. I ran this command on my netbook to inject the archives it had in the old repository into the new repository, verifying checksums as it goes:

  cd ~/mail/archive; find -type l -exec git annex reinject ~/mail.old/archive/{} {} \;

debian packages

I'd evolved a complex and fragile chain of personal apt repositories to hold Debian packages I've released. I recently got rid of the mess, which looked like this: dput → local mini-dinstall repo → dput → mini-dinstall repo on my server → dput → Debian

The point of all that was that I could "upload" a package locally while offline and batch transfer it later. And I had a local and a public apt repository of just the packages I've uploaded. But these days, packages uploaded to Debian are available nearly immediately, so there's not much reason to do that.

My old system also had a problem: It only kept the most recent single copy of each package. Again, disk is cheap, so I'd rather have archives of everything I have uploaded. Again I switched to git-annex.

My new system is simplicity itself. I release a package by checking it into a "toupload" directory in my git annex repository on my netbook. Items in that directory are dput to Debian and moved to "released". I have various other clones of that repository, which I git annex move packages to periodically to free up SSD space. In the rare cases when I build a package on a server, I check it into the clone on the server, and again rely on git-annex to copy it around.

Now, does anyone know a good way to download a copy of every package you've ever released from archive.debian.org? (Ideally as a list of urls I can feed to git annex addurl.)


My email and Debian packages were the last large files I was not storing in git-annex. Even backups of my backups end up checked into git-annex and archived away.

Now that I'm using git-annex in every place I can, my goal with it is to make it as easy as possible for as many of you to use it as possible, too. I have some inotify tricks up my sleeve that seem promising. Kickstarter may be involved. Watch this space!

Syndicated 2012-04-22 20:11:38 from see shy jo

Stand by the grey stone when the thrush knocks

Today, map in hand, I explored the "long valley, narrower than the great dale in the South where the Gates of the river stood, and walled with lower spurs of the Mountain".

"The dangerous search on the western slopes for the secret door"
"It seemed as if darkness flowed out like a vapour from the hole in the mountain-side" "They spoke low and never called or sang, for danger brooded in every rock."
"It is almost dark so that its vastness can only be dimly guessed, but rising from the near side of the rocky floor there is a great glow. The glow of Smaug!"

Syndicated 2012-04-09 19:38:20 from see shy jo

ls: the missing options

I'm honored and pleased to be the person who gets to complete ls. This project, begun around when I was born, was slow to turn into anything more than a simple for loop over a dirent. It really took off in the mid and late 80's, when Richard Stallman added numerous features, and the growth has been steady ever since. But, a glance at the man page shows that ls has never quite been complete. It fell to me to finish the job, and I have produced several handy patches to this end:

The only obvious lack now is a -z option, which should make output filenames be NUL terminated for consumption by other programs. I think this would be easy to write, but I've been extremely busy IRL (moving lots of furniture) and didn't get to it. Any takers to write it?

Due to the nature of these patches, they conflict with each other. Here's a combined patch suitable to be applied and tested.

diff -ur orig/coreutils-8.13/src/ls.c coreutils-8.13/src/ls.c
--- orig/coreutils-8.13/src/ls.c    2011-07-28 06:38:27.000000000 -0400
+++ coreutils-8.13/src/ls.c 2012-04-01 12:41:56.835106346 -0400
@@ -270,6 +270,7 @@
 static int format_group_width (gid_t g);
 static void print_long_format (const struct fileinfo *f);
 static void print_many_per_line (void);
+static void print_jam (void);
 static size_t print_name_with_quoting (const struct fileinfo *f,
                                        bool symlink_target,
                                        struct obstack *stack,
@@ -382,6 +383,7 @@
    many_per_line for just names, many per line, sorted vertically.
    horizontal for just names, many per line, sorted horizontally.
    with_commas for just names, many per line, separated by commas.
+   jam to fit in the most information possible.
    -l (and other options that imply -l), -1, -C, -x and -m control
    this parameter.  */
@@ -392,7 +394,8 @@
     one_per_line,      /* -1 */
     many_per_line,     /* -C */
     horizontal,            /* -x */
-    with_commas            /* -m */
+    with_commas,       /* -m */
+    jam            /* -j */
 static enum format format;
@@ -630,6 +633,11 @@
 static bool immediate_dirs;
+/* True means when multiple directories are being displayed, combine
+ * their contents as if all in one directory. -e */
+static bool entangle_dirs;
 /* True means that directories are grouped before files. */
 static bool directories_first;
@@ -705,6 +713,10 @@
 static bool format_needs_type;
+/* Answer "yes" to all prompts. */
+static bool yes;
 /* An arbitrary limit on the number of bytes in a printed time stamp.
    This is set to a relatively small value to avoid the need to worry
    about denial-of-service attacks on servers that run "ls" on behalf
@@ -804,6 +816,7 @@
   {"escape", no_argument, NULL, 'b'},
   {"directory", no_argument, NULL, 'd'},
   {"dired", no_argument, NULL, 'D'},
+  {"entangle", no_argument, NULL, 'e'},
   {"full-time", no_argument, NULL, FULL_TIME_OPTION},
   {"group-directories-first", no_argument, NULL,
@@ -849,12 +862,12 @@
 static char const *const format_args[] =
   "verbose", "long", "commas", "horizontal", "across",
-  "vertical", "single-column", NULL
+  "vertical", "single-column", "jam", NULL
 static enum format const format_types[] =
   long_format, long_format, with_commas, horizontal, horizontal,
-  many_per_line, one_per_line
+  many_per_line, one_per_line, jam
 ARGMATCH_VERIFY (format_args, format_types);
@@ -1448,6 +1461,9 @@
       print_dir_name = true;
+  if (entangle_dirs)
+      print_current_files ();
   if (print_with_color)
       int j;
@@ -1559,6 +1575,7 @@
   print_block_size = false;
   indicator_style = none;
   print_inode = false;
+  yes = false;
   dereference = DEREF_UNDEFINED;
   recursive = false;
   immediate_dirs = false;
@@ -1644,7 +1661,7 @@
       int oi = -1;
       int c = getopt_long (argc, argv,
-                           "abcdfghiklmnopqrstuvw:xABCDFGHI:LNQRST:UXZ1",
+                           "abcdefghijklmnopqrstuvw:xyABCDFGHI:LNQRST:UXZ1",
                            long_options, &oi);
       if (c == -1)
@@ -1667,6 +1684,10 @@
           immediate_dirs = true;
+   case 'e':
+          entangle_dirs = true;
+     break;
         case 'f':
           /* Same as enabling -a -U and disabling -l -s.  */
           ignore_mode = IGNORE_MINIMAL;
@@ -1697,6 +1718,10 @@
           print_inode = true;
+   case 'j':
+     format = jam;
+     break;
         case 'k':
           human_output_opts = 0;
           file_output_block_size = output_block_size = 1024;
@@ -1765,6 +1790,10 @@
           format = horizontal;
+   case 'y':
+     yes = true;
+     break;
         case 'A':
           if (ignore_mode == IGNORE_DEFAULT)
             ignore_mode = IGNORE_DOT_AND_DOTDOT;
@@ -2510,7 +2539,7 @@
       DEV_INO_PUSH (dir_stat.st_dev, dir_stat.st_ino);
-  if (recursive || print_dir_name)
+  if ((recursive || print_dir_name) && ! entangle_dirs)
       if (!first)
         DIRED_PUTCHAR ('\n');
@@ -2526,7 +2555,8 @@
   /* Read the directory entries, and insert the subfiles into the `cwd_file'
      table.  */
-  clear_files ();
+  if (! entangle_dirs)
+     clear_files ();
   while (1)
@@ -2615,7 +2645,7 @@
       DIRED_PUTCHAR ('\n');
-  if (cwd_n_used)
+  if (cwd_n_used && ! entangle_dirs)
     print_current_files ();
@@ -3464,6 +3494,10 @@
       print_with_commas ();
+    case jam:
+      print_jam ();
+      break;
     case long_format:
       for (i = 0; i < cwd_n_used; i++)
@@ -4418,6 +4452,24 @@
   putchar ('\n');
+static void
+print_jam (void)
+  size_t filesno;
+  size_t pos = 0;
+  for (filesno = 0; filesno < cwd_n_used; filesno++)
+    {
+      struct fileinfo const *f = sorted_file[filesno];
+      size_t len = length_of_file_name_and_frills (f);
+      print_file_name_and_frills (f, pos);
+      pos += len;
+    }
+  putchar ('\n');
 /* Assuming cursor is at position FROM, indent up to position TO.
    Use a TAB character instead of two or more spaces whenever possible.  */
@@ -4627,11 +4679,13 @@
   -D, --dired                generate output designed for Emacs' dired mode\n\
 "), stdout);
       fputs (_("\
+  -e, --entangle             display multiple directory contents as one\n\
   -f                         do not sort, enable -aU, disable -ls --color\n\
   -F, --classify             append indicator (one of */=>@|) to entries\n\
       --file-type            likewise, except do not append `*'\n\
       --format=WORD          across -x, commas -m, horizontal -x, long -l,\n\
                                single-column -1, verbose -l, vertical -C\n\
+                               jam -j\n\
       --full-time            like -l --time-style=full-iso\n\
 "), stdout);
       fputs (_("\
@@ -4667,6 +4721,8 @@
   -i, --inode                print the index number of each file\n\
   -I, --ignore=PATTERN       do not list implied entries matching shell PATTERN\
+  -j                         jam output together, makes the most of limited\n\
+                             space on modern systems (cell phones, twitter)\n\
   -k                         like --block-size=1K\n\
 "), stdout);
       fputs (_("\
@@ -4733,6 +4789,7 @@
   -w, --width=COLS           assume screen width instead of current value\n\
   -x                         list entries by lines instead of by columns\n\
   -X                         sort alphabetically by entry extension\n\
+  -y                         answer all questions with \"yes\"\n\
   -Z, --context              print any SELinux security context of each file\n\
   -1                         list one file per line\n\
 "), stdout);

It remains to be seen if multi-option enabled coreutils will be accepted into Debian in time for the next release. Due to some disagreements with the coreutils maintainer, the matter has been referred to the Technical Committee (Flattr me)

Traditionally new ls contributors stop once enough options have been added that they can spell their name, in the best traditions of yellow snow. Once ls -richard -stallman worked, I'm sure RMS moved on to other more pressing concerns. The current maintainer, David MacKenzie, was clearly not done yet, since only ls -david -mack worked. But he was being slow to add these last few features, and ls was very deficient in the realm of spelling my name (ls -o -hss .. srsly?), so I took matters into my own hands in the best tradition of free software.

Syndicated 2012-04-01 16:51:11 from see shy jo

podcasts that don't suck

My public radio station is engaged in a most obnoxious spring pledge drive. Good time to listen to podcasts. Here are the ones I'm currently liking.

  • Free As In Freedom: The best informed podcast on software licensing issues, and highly idealistic. What keeps me coming back, though, is that Karen and Bradley never quite agree on things, and always end up in some lawyerly minutiae cul-de-sac that is somehow interesting to listen to. They once did a whole show about a particular IRS tax form, and I listened to it all. (Granted, I often listen to this while cleaning house, but as Bradley would say, at least I'm not listening to it while driving.)

  • This Developer's Life: At least the early episodes before it got popular are an unashamed imitation of This American Life, and I have quite enjoyed them. Although I often roll my eyes at the proprietary developer mindsets on display in the show. For example, often they'll have a bug and not root cause it, because well, they don't have the source code for the Windows layers. Still, beneath that it's mostly about the parts of software development that are common to all our lives. A particular episode I can recommend is #10 "Disconnecting" -- the first 20 minutes is a perfect story.

  • Off the Hook: This is actually a live radio show, quite well done, with call-ins and everything. So much more polished than your typical podcast. It's hosted by Emmanuel Goldstein! And it's been going on for over 20 years, so why did I never hear about it before? Probably I'm not quite in the right hacker circles. Since it's out of NYC and very anti-authoritarian, I've mostly been enjoying it as a view into the Occupy protests.

  • StarShipSofa: The best science fiction podcast around. Probably not news to anyone who ever looked for such a podcast. Long, and tends to be frontloaded with a lot of administrivia, which I fast-forward to get to the stories.

  • Spider on the Web: The best music and science fiction podcast around. Mostly on hiatus since Jeanne died, but I hope Spider picks it back up. A good exemplar is "Bianca's Hands"

  • Long Now Seminars: Consistently interesting. I visited their space last time I was in SF only to learn they'd had a talk the night before, which would have been a bummer, except they ran the bits of the Clock for us.

  • Linux Outlaws: After 18 years using Linux, I find the level of discourse in most Linux podcasts typically rather annoying. Including this one, but when Fab gets on a rant, it's all worth it. Sometimes some interesting guests.

  • This Week In Debian: Sadly no new episodes lately, and I've been too lame to respond to repeated interview requests. Probably it needs to move away from being an interview show if it is to continue; there are only so many DD's who can give excellent interviews like liw did.

Syndicated 2012-03-30 21:28:22 from see shy jo

case study: adding box.com support to git-annex

git-annex has special remotes that allow large files checked into git to be stored in arbitrary places, that are not proper git remotes. One key use of the special remotes is to store files in The Cloud.

Until now the flagship special remote used Amazon S3, although a few other things like Archive.org, rsync.net, and Tahoe-LAFS can be made to work too. One of my goals is to add as many cloud storage options to git-annex as possible.

Box.com came to my attention because they currently have a promotion that provides 50 gigabytes of free "lifetime" service. Which is a nice amount of cloud storage to have for free. I decided that I didn't want to spend more than 4 hours of my time to make git-annex use it though. (I probably have spent a week on the S3 support by contrast.)

So, this is a case study in quickly adding support for one cloud storage provider to git-annex.

  • First, I had to sign up to box.com. Their promotion requires an android phone be used to get the 50 gigabytes. This wasted about an hour getting my unused phone dusted off etc. This also includes time spent researching ways to access box.com's storage, including reading their API documentation. I found it has a WebDAV interface.
  • Sadly, there is not yet a native WebDAV library for haskell. This is a shame, because it would make the implementation better. But, I'm confident someone will eventually write one. My experience with haskell libraries for other web APIs (S3, GitHub) is that it's an excellent language to write them in, the code tends to be very simple, concise and clear. But I can't do it in 4 hours. So for now, the workaround is to use a WebDAV mounting tool. I picked davfs2 as it was the first one I got to work with box.com's slightly broken WebDAV. 2 hours spent now.
  • With box.com mounted, I was nearly done; git-annex's directory special remote can use the mount point. But there was a catch: box.com only allows files up to 100 MB in size. I spent 1 hour or so adding support to the directory special remote for chunking files into a user-specified size.
    This was a fairly complex problem -- the existing code had a ByteString that when accessed lazily read the whole large file (from disk or from gpg, depending), and just called writeFile on it.
    I needed to still consume it lazily to avoid reading the whole file into memory, but write out chunks. This gets a bit into haskell's ByteString internals, but they're very well suited to this kind of thing, and so after 15 minutes familiarizing myself with the data structures, it was actually fairly easy to write the code. patch
  • I spent my last hour testing and tuning the box.com special remote. Using davfs2 as a quick fix caused some technical debt that I had to make up for. In particular, the chunked filename retrieval code had to make sure not to open every chunk at once, because that makes davfs2 try to cache them all, instead of streaming one at a time. patch
  • Not counted toward my 4 hour limit is the ... er ... 4 hours I spent last night adding a progress bar to the directory special remote. A progress display while transferring the files makes using box.com as a special remote much nicer, but also makes using my phone's SD card as a special remote much nicer! This is why I'm a poor consultant -- when faced with something generic and generally useful like this, I have difficulty billing for it.
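
The chunking idea from the list above can be sketched as follows. This is an illustration in Python, not git-annex's Haskell code: the function names, the ".chunkN" naming scheme and the block size are all assumptions made for the example. The two points it tries to capture are that the source is consumed a block at a time (never held in memory whole), and that reading back opens only one chunk at a time.

```python
def write_chunks(stream, dest_prefix, chunk_size, block_size=64 * 1024):
    """Copy a binary stream into numbered files of at most chunk_size bytes.

    Reads at most block_size bytes at a time. Returns the chunk paths.
    """
    paths = []
    while True:
        block = stream.read(min(block_size, chunk_size))
        if not block:
            break  # source exhausted, no further chunk needed
        path = f"{dest_prefix}.chunk{len(paths) + 1}"  # hypothetical naming
        with open(path, "wb") as out:
            written = 0
            while block:
                out.write(block)
                written += len(block)
                if written >= chunk_size:
                    break  # this chunk is full, start the next one
                block = stream.read(min(block_size, chunk_size - written))
        paths.append(path)
    return paths


def read_chunks(paths, block_size=64 * 1024):
    """Yield the original data block by block, one chunk file open at a time."""
    for path in paths:
        with open(path, "rb") as chunk:
            while True:
                block = chunk.read(block_size)
                if not block:
                    break
                yield block
```

Because read_chunks is a generator, a caller that streams its output to disk never has more than one chunk file open, which is the property that matters when the chunks live on a caching mount like davfs2.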

The end result is that there are detailed instructions for using box.com as a special remote.

And it seems to work quite well now. I just set up my production box.com special remote. All content written to it is gpg encrypted, and various of my computers have access to it, each using their own gpg key to decrypt the files uploaded by the others. (git-annex's encryption feature makes this work really well!)

There is a DropBox API for haskell. But as I'm not a customer, the 2 gb free account hardly makes it worth my while to make git-annex use it. Would someone like to fund my time to add a dropbox special remote to git-annex?

Syndicated 2012-03-04 17:28:59 from see shy jo

leap day

This leap day saw me driving along the river on a rainy day, with 4 chickens in the car's trunk, and 3 terabytes of disk (and a half a bale of straw) in the back seat. I may have not been blogging much lately about life, because these situations can be hard to explain. (Or because "joined the Debian haskell team and spent two days working on rebuilds for the ghc 7.4 transition" is not thrilling reading.)

hens in a car

The Light Sussex chickens are my sister's spare flock, which are "too tame". They're now cozily installed into a coop we built last weekend. In return I gave her a 6 foot long APC power strip, which had been mounted on the wall of my office. I'm preparing my house in town to be rented, and have little need for two dozen power outlets here in solar power land.

Google <3s Your Work

Indeed, today is a gift economy day all around -- when I arrived at the cabin, there on the porch was an unexpected package from Google. Particularly surprising since I never get deliveries here, since the driveway is a mile long and often seems like it could dead-end into the woods at any moment.

The combination of technological wackiness (I also debugged a laptop whose USB hub hangs when a particular trackball is plugged in) and in your face country texture (including coal trains, being stuck behind a tractor, and miles of amazing tree-height mist) made this a memorable day.

Syndicated 2012-02-29 21:30:30 from see shy jo

addicted to $

One of the weird historical accidents of programming languages is that so many of them use $ for important things. The reason is just that out of the available punctuation, nearly all of it has a mathematical or other predefined use that makes sense to retain in a programming language context, while $ (and also @ and #) do not. Still, $ annoys me, it's so asymmetric that we use it all over our code, and never a £ or ฿ to be seen.

The one language that manages to use $ nicely, IMHO, is Haskell. Recently I noticed that it has an actual visual mnemonic in its use of $. And it's used for something I've not seen in other languages.

The visual mnemonic of $ is that it looks like an opening parenthesis, with the related closing parenthesis on a line below it.

  (something (that
    (lisp folks
        (are (very (familiar with)))

And this is also the problem that $ solves:

  something $ that $
    haskell folks $
        are $ very $ familiar with

This is a trivial feature.. but oh so useful. The implementation in Haskell of $ is simply:

  f $ x = f x
  infixr 0 $

Just function application, but at a different precedence than usual.

I am now very addicted to my $. Out of 15 thousand lines of code, only 87 contain )), while 10% use $.

Syndicated 2012-02-17 16:49:54 from see shy jo

more on ghc filename encodings

My last post missed an important thing about GHC 7.4's handling of encodings for FilePath. It can in fact be safe to use FilePath to write a command like rm. This is because GHC internally uses a special encoding for FilePath data, that is documented to allow "arbitrary undecodable bytes to be round-tripped through it". (It seems to do this by encoding the undecodable bytes as very high unicode code points.) So, when presented with a filename that cannot be decoded using utf-8 (or whatever the system encoding is), it still handles it, and using the resulting FilePath will in fact operate on the right file. Whew!
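
For readers who want to poke at the round-trip idea without GHC, Python has an analogous mechanism, the surrogateescape error handler. The sketch below uses that, so it illustrates the technique rather than GHC's own implementation; the filename bytes are a made-up example.

```python
raw = b"me\xa1"  # hypothetical filename bytes: "me¡" in Latin-1, invalid as UTF-8

# Decode with the round-trip handler: the bad byte becomes a lone surrogate.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly,
# so an unlink() on the result would hit the right file.
restored = name.encode("utf-8", errors="surrogateescape")

# An ordinary strict encoder (think: a handle left at its default encoding)
# refuses the escaped character instead.
try:
    name.encode("utf-8")
    strict_write_ok = True
except UnicodeEncodeError:
    strict_write_ok = False
```

The last step mirrors why every handle that carries filenames has to be switched to the round-trip encoding too.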

Moral of the story is that if you're going to be using GHC 7.4 to read or write filenames from a pipe, or a file, you need to arrange for the Handle you're reading or writing to use this special encoding too. I use this to set up my Handles:

  import System.IO
  import GHC.IO.Encoding
  import GHC.IO.Handle

  fileEncoding :: Handle -> IO ()
  fileEncoding h = hSetEncoding h =<< getFileSystemEncoding

Even if you're only going to write a FilePath to stdout, you need to do this. Otherwise, your program will crash on some filenames! This doesn't seem quite right to me, but I hesitate to file a bug report. (And this is not a new problem in GHC anyway.) If I did, it would have this testcase:

  # touch "me¡"
  # LANG=C ghci
  Prelude> :m System.Directory
  Prelude System.Directory> mapM_ putStrLn =<< getDirectoryContents "."
  me*** Exception: <stdout>: hPutChar: invalid argument (invalid character)

Since git-annex reads lots of filenames from git commands and other places, I had to deal with this extensively. Unfortunately I have not found a way to read Text from a Handle using the fileSystemEncoding. So I'm stuck with slow Strings. But, it does seem to work now.

PS: I found a bug in GHC 7.4 today where one of those famous Haskell immutable values seems to get well, mutated. Specifically a [FilePath] that is non-empty at the top of a function ends up empty at the bottom. Unless IO is done involving it at the top. Really. Hope to develop a test case soon. Happily, the code that triggered it did so while working around a bug in GHC that is fixed in 7.4. Language bugs.. gotta love em.

Syndicated 2012-02-03 20:11:32 from see shy jo

unicode ate my homework

I've just spent several days trying to adapt git-annex to changes in ghc 7.4's handling of unicode in filenames. And by spent, I mean, time withdrawn from the bank, and frittered away.

In kindergarten, the top of the classroom wall was encircled by the aA bB cC of the alphabet. I'll bet they still put that up on the walls. And all the kids who grow up to become involved with computers learn that was a lie. The alphabet doesn't stop at zZ. It wouldn't all fit on a wall anymore.

So we're in a transition period, where we've all learnt deeply the alphabet, but the reality is much more complicated. And the collision between that intuitive sense of the world and the real world makes things more complicated still. And so, until we get much farther along in this transition period, you have to be very lucky indeed to not have wasted time dealing with that complexity, or at least having encountered Mojibake.

Most of the pain centers around programming languages, and libraries, which are all at different stages of the transition from ascii and other legacy encodings to unicode.

  • If you're using C, you likely deal with all characters as raw bytes, and rely on the backwards compatibility built into UTF-8, or you go to long lengths to manually deal with wide characters, so you can intelligently manipulate strings. The transition has barely begun, and will, apparently, never end.
  • If you're using perl (at least like I do in ikiwiki), everything is (probably) unicode internally, but every time you call a library or do IO you have to manually deal with conversions, that are generally not even documented. You constantly find new encoding bugs. (If you're lucky, you don't find outright language bugs... I have.) You're at a very uncomfortable midpoint of the transition.
  • If you're using haskell, or probably lots of other languages like python and ruby, everything is unicode all the time.. except for when it's not.
  • If you're using javascript, the transition is basically complete.

My most recent pain is because the haskell GHC compiler is moving along in the transition, getting closer to the end. Or at least finishing the second 80% and moving into the third 80%. (This is not a quick transition..)

The change involves filename encodings, a situation that, at least on unix systems, is a vast mess of its own. Any filename, anywhere, can be in any encoding, and there's no way to know what's the right one, if you dislike guessing.

Haskell folk like strongly typed stuff, so this ambiguity about what type of data is contained in a FilePath type was surely anathema. So GHC is changing to always use UTF-8 for operations on FilePath. (Or whatever the system encoding is set to, but let's just assume it's UTF-8.)

Which is great and all, unless you need to write a Haskell program that can deal with arbitrary files. Let's say you want to delete a file. Just a simple rm. Now there are two problems:

  1. The input filename is assumed to be in the system encoding aka unicode. What if it cannot be validly interpreted in that encoding? Probably your rm throws an exception.
  2. Once the FilePath is loaded, it's been decoded to unicode characters. In order to call unlink, these have to be re-encoded to get a filename. Will that be the same bytes as the input filename and the filename on disk? Possibly not, and then the rm will delete the wrong thing, or fail.
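
Both problems are easy to reproduce in any language that decodes filenames eagerly. Here is a small sketch in Python, used purely as a neutral illustration (it is not what GHC does internally), with a made-up Latin-1 filename:

```python
on_disk = b"caf\xe9.txt"  # hypothetical filename: Latin-1 bytes, not valid UTF-8


def try_strict_decode(raw):
    """Problem 1: strict decoding of the input simply fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Problem 2: a lossy decode "succeeds", but re-encoding yields other bytes,
# so unlink() would be asked to remove a file that is not the one on disk.
lossy = on_disk.decode("utf-8", errors="replace")
reencoded = lossy.encode("utf-8")
```

Here try_strict_decode(on_disk) returns None, and reencoded differs from on_disk because the offending byte was swapped for the replacement character.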

But haskell people are smart, so they thought of this problem, and provided a separate type that can deal with it. RawFilePath harks back to kindergarten; the filename is simply a series of bytes with no encoding. Which means it cannot be converted to a FilePath without encountering the above problems. But does let you write a safe rm in ghc 7.4.

So I set out to make something more complicated than a rm, that still needs to deal with arbitrary filename encodings. And I soon saw it would be problematic. Because the things ghc can do with RawFilePaths are limited. It can't even split the directory from the filename. We often do need to manipulate filenames in such ways, even if we don't know their encoding, when we're doing something more complicated than rm.

If you use a library that does anything useful with FilePath, it's not available for RawFilePath. If you used standard haskell stuff like readFile and writeFile, it's not available for RawFilePath either. Enjoy your low-level POSIX interface!

So, I went lowlevel, and wrote my own RawFilePath versions of pretty much all of System.FilePath, and System.Directory, and parts of MissingH and other libraries. (And noticed that I can understand all this Haskell code.. yay!) And I got it close enough to working that, I'm sure, if I wanted to chase type errors for a week, I could get git-annex, with ghc 7.4, to fully work on any encoding of filenames.

But, now I'm left wondering what to do, because all this work is regressive; it's swimming against the tide of the transition. GHC's change is certainly the right change to make for most programs, that are not like rm. And so most programs and libraries won't use RawFilePath. This risks leaving a program that does use it a fish out of water.

At this point, I'm inclined to make git-annex support only unicode (or the system encoding). That's easy. And maybe have a branch that uses RawFilePath, in a hackish and type-unsafe way, with no guarantees of correctness, for those who really need it.

Previously: unicode eye chart wanted on a bumper sticker abc boxes unpacking boxes

Syndicated 2012-02-02 22:12:02 from see shy jo
