Older blog entries for joey (starting at number 480)

unicode ate my homework

I've just spent several days trying to adapt git-annex to changes in ghc 7.4's handling of unicode in filenames. And by spent, I mean, time withdrawn from the bank, and frittered away.

In kindergarten, the top of the classroom wall was encircled by the aA bB cC of the alphabet. I'll bet they still put that up on the walls. And all the kids who grow up to become involved with computers learn that was a lie. The alphabet doesn't stop at zZ. It wouldn't all fit on a wall anymore.

So we're in a transition period, where we've all learnt deeply the alphabet, but the reality is much more complicated. And the collision between that intuitive sense of the world and the real world makes things more complicated still. And so, until we get much farther along in this transition period, you have to be very lucky indeed to not have wasted time dealing with that complexity, or at least having encountered Mojibake.

Most of the pain centers around programming languages, and libraries, which are all at different stages of the transition from ascii and other legacy encodings to unicode.

  • If you're using C, you likely deal with all characters as raw bytes, and rely on the backwards compatibility built into UTF-8, or you go to long lengths to manually deal with wide characters, so you can intelligently manipulate strings. The transition has barely begun, and will, apparently, never end.
  • If you're using perl (at least like I do in ikiwiki), everything is (probably) unicode internally, but every time you call a library or do IO you have to manually deal with conversions, that are generally not even documented. You constantly find new encoding bugs. (If you're lucky, you don't find outright language bugs... I have.) You're at a very uncomfortable midpoint of the transition.
  • If you're using haskell, or probably lots of other languages like python and ruby, everything is unicode all the time.. except for when it's not.
  • If you're using javascript, the transition is basically complete.

My most recent pain is because the haskell GHC compiler is moving along in the transition, getting closer to the end. Or at least finishing the second 80% and moving into the third 80%. (This is not a quick transition..)

The change involves filename encodings, a situation that, at least on unix systems, is a vast mess of its own. Any filename, anywhere, can be in any encoding, and there's no way to know what's the right one, if you dislike guessing.

Haskell folk like strongly typed stuff, so this ambiguity about what type of data is contained in a FilePath type was surely anathema. So GHC is changing to always use UTF-8 for operations on FilePath. (Or whatever the system encoding is set to, but let's just assume it's UTF-8.)

Which is great and all, unless you need to write a Haskell program that can deal with arbitrary files. Let's say you want to delete a file. Just a simple rm. Now there are two problems:

  1. The input filename is assumed to be in the system encoding aka unicode. What if it cannot be validly interpreted in that encoding? Probably your rm throws an exception.
  2. Once the FilePath is loaded, it's been decoded to unicode characters. In order to call unlink, these have to be re-encoded to get a filename. Will that be the same bytes as the input filename and the filename on disk? Possibly not, and then the rm will delete the wrong thing, or fail.
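Both problems can be seen in miniature outside of Haskell. The following is a minimal sketch in Python, not GHC's actual behaviour; the filename `raw_name` is a made-up example of a Latin-1 encoded name sitting on a system that assumes UTF-8.

```python
# Hypothetical on-disk filename: Latin-1 encoded "café.txt".
# The byte 0xE9 is not valid UTF-8 in this position.
raw_name = b"caf\xe9.txt"


def strict_decode(name: bytes) -> str:
    """Problem 1: decoding with the assumed encoding can simply fail."""
    return name.decode("utf-8")


def lossy_roundtrip(name: bytes) -> bytes:
    """Problem 2: a decode that substitutes characters for bad bytes
    does not re-encode to the bytes that are actually on disk."""
    return name.decode("utf-8", errors="replace").encode("utf-8")


try:
    strict_decode(raw_name)
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

roundtripped = lossy_roundtrip(raw_name)

print("strict decode failed:", decode_failed)
print("same bytes after round trip:", roundtripped == raw_name)
```

An rm built on the first function throws; one built on the second asks the kernel to unlink a name that is not the one on disk.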

But haskell people are smart, so they thought of this problem, and provided a separate type that can deal with it. RawFilePath harks back to kindergarten; the filename is simply a series of bytes with no encoding. Which means it cannot be converted to a FilePath without encountering the above problems. But does let you write a safe rm in ghc 7.4.
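The idea of a bytes-only rm can be sketched as follows. This is Python standing in for the Haskell, so it shows the shape of the approach rather than the RawFilePath API itself; `raw_rm` and the example filename are hypothetical, and it assumes a unix filesystem that accepts arbitrary bytes in filenames.

```python
import os
import tempfile


def raw_rm(path: bytes) -> None:
    """Unlink a file named by raw bytes; the name is never decoded."""
    os.unlink(path)


# Work in a scratch directory, handled as bytes throughout.
workdir = os.fsencode(tempfile.mkdtemp())

# Hypothetical filename that is not valid UTF-8.
victim = os.path.join(workdir, b"caf\xe9.txt")
with open(victim, "wb") as handle:
    handle.write(b"data")

existed_before = os.path.exists(victim)
raw_rm(victim)
exists_after = os.path.exists(victim)

os.rmdir(workdir)
```

Because the bytes handed to unlink are exactly the bytes that were given as input, neither of the two problems above can occur.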

So I set out to make something more complicated than a rm, that still needs to deal with arbitrary filename encodings. And I soon saw it would be problematic. Because the things ghc can do with RawFilePaths are limited. It can't even split the directory from the filename. We often do need to manipulate filenames in such ways, even if we don't know their encoding, when we're doing something more complicated than rm.
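Splitting the directory from the filename does not actually require knowing the encoding, since on unix the kernel treats the byte 0x2F as the path separator no matter what the surrounding bytes mean. A minimal sketch, in Python rather than Haskell; `split_file_name` is a hypothetical helper written for this illustration, not a function from System.FilePath or any other library.

```python
from typing import Tuple


def split_file_name(path: bytes) -> Tuple[bytes, bytes]:
    """Split raw path bytes into (directory, filename) at the last '/'.

    No decoding happens, so the bytes come back exactly as they went in.
    A path with no '/' is treated as living in the current directory.
    """
    head, sep, tail = path.rpartition(b"/")
    if not sep:
        return (b"./", tail)
    return (head + sep, tail)


print(split_file_name(b"photos/caf\xe9.txt"))
print(split_file_name(b"plain"))
```

Multiply this by every function that deals in paths, and the size of the job becomes clear.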

If you use a library that does anything useful with FilePath, it's not available for RawFilePath. If you used standard haskell stuff like readFile and writeFile, it's not available for RawFilePath either. Enjoy your low-level POSIX interface!

So, I went lowlevel, and wrote my own RawFilePath versions of pretty much all of System.FilePath, and System.Directory, and parts of MissingH and other libraries. (And noticed that I can understand all this Haskell code.. yay!) And I got it close enough to working that, I'm sure, if I wanted to chase type errors for a week, I could get git-annex, with ghc 7.4, to fully work on any encoding of filenames.

But, now I'm left wondering what to do, because all this work is regressive; it's swimming against the tide of the transition. GHC's change is certainly the right change to make for most programs, that are not like rm. And so most programs and libraries won't use RawFilePath. This risks leaving a program that does a fish out of water.

At this point, I'm inclined to make git-annex support only unicode (or the system encoding). That's easy. And maybe have a branch that uses RawFilePath, in a hackish and type-unsafe way, with no guarantees of correctness, for those who really need it.


Previously: unicode eye chart wanted on a bumper sticker abc boxes unpacking boxes

Syndicated 2012-02-02 22:12:02 from see shy jo

announcing github-backup

Partly as a followup to a Github survey, and partly because I had a free evening and the need to write more haskell code, any haskell code, I present to you, github-backup.

github-backup is a simple tool you run in a git repository you cloned from Github. It backs up everything Github knows about the repository, including other forks, issues, comments, milestones, pull requests, and watchers.

This is all stored in the repository, as regular files, on a "github" branch.

Available in Cabal now, in Debian maybe if someone packages haskell-github.

Syndicated 2012-01-26 04:44:10 from see shy jo

olduse.net 1982

Hard to believe I've consumed all of 1981's Usenet posts now on olduse.net, and it's been running for 7 months already.


Last night, there was a "very long" post, describing nearly every node on usenet in 1982. There had been a warning about this post the day before, since it would take many sites half an hour to download at 300 baud. It was handily formatted as a shell script, which created per-node files.

So, I ran this code nobody has run since 1982. It worked. I got files. I tossed them on the olduse.net wiki, and used some ikiwiki code TOVA contracted me to write just a few months ago, to make clickable links on my usenet map.

usenet map

The map data was contributed in another post a while back. By 1982, usenet is getting nearly impossible to map with 1982 technology of ascii art. I enjoyed throwing graphviz, git, wikis, and the web at it.

So, we have a collaboration across time, me and "Mark" and a lot of people who described their usenet nodes and piles of technology that make creating a mashup easy. Awesome!


I blog about stuff I find on the olduse.net blog. It's an open blog; Koldfront also blogs there, and we welcome other bloggers.

Some of the highlights for me have included:

As the space shuttle program is winding down, reading the excitement about the first shuttle flights, and the play-by-play coverage of a launch, posted to net.columbia by a high school student borrowing his dad's account. (A newsgroup name that's hard to read without remembering its fate).

The announcements of the Motorola M68k, the IBM PC, and the CD-ROM.

Reading the TCP-IP digest, and Postel's plans for launching IPv4 soon, while the world IPv6 launch is being planned now. (The nay-sayers are especially fun to read. Including the guy who was concerned about the address space size, in 1981!)

Learning that nethack ascension tales have a history stretching back 30 years, to rogue, and that the stories back then had much the same flavor as they do today.

Various celebrity sightings. Dennis Ritchie teaching C and Unix. Bill Joy talking vi. RMS talking .. nuclear politics?

The general development of usenet. B-news being rolled out, groups proliferating, many first inklings of what will be major problems and developments in 5 or 10 years. A shift in tone is already apparent; by now usenet is not only about announcements, there are already some flames.

oldusenet in a period terminal

Still 9 years to go!

Syndicated 2012-01-21 20:58:05 from see shy jo

version numbers

Today I released two entirely different pieces of software with the identical version number 3.20120115. Debian developers also will be soon noticing a piece of software I released with the version number 9.20120115.

I expect to move more of my software to this version number scheme over time, unless I find something badly wrong with it. It reflects how I think about versions for my software; there's a kind of continual "now" that development progresses through, in which individual releases have little discrete meaning and at the same time, there can also be significant discontinuities, that require the user to do something to deal with (such as a new debhelper compat version, or a new git-annex repository format).

Those two things are really all that I need a version number for my software to communicate. I can do without the rest of the things that version numbers are used for:

  • The marketing of version 1.0 and 2.0.
  • The comparative nuances such as whether 1.0 to 1.1 is a relatively big change, and 1.0 to 1.0.1 is a relatively small change
  • The implication that 0.99 is almost 1.0 ready, and 1.1a is some kind of alpha release.

There is so much software, with so many version numbers, that any signal encoded in such version numbers is swamped in the noise. Even on projects that I develop, a version number like 2.88 is meaningless to me. All I care about is, how long ago was that version? Has there been a major change breaking compatibility since that version? "2.88" doesn't answer these questions well; "3.20111111" does.

It is a little wordy to have the full year in there, and it can be annoying to remember to set the version to the right date on release day (TODO: automate). This is balanced with the version not being so wordy as to include the time of day, which means I might have to do a 3.20120115.1 if I goof up. These minor problems are worth it to instantly know how old a version is when a user pastes it into a bug report.
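The scheme is simple enough to sketch in a few lines. This is only an illustration in Python, not the release tooling actually used; all the function names are hypothetical, and it assumes the version is a compatibility number, a dot, a YYYYMMDD date, and an optional ".N" suffix for same-day re-releases.

```python
from datetime import date
from typing import Tuple


def make_version(compat: int, released: date) -> str:
    """Build a version like 3.20120115 from a compat number and a date."""
    return f"{compat}.{released:%Y%m%d}"


def parse_version(version: str) -> Tuple[int, date]:
    """Recover the compat number and release date; any .N suffix is ignored."""
    compat, stamp = version.split(".")[:2]
    released = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))
    return int(compat), released


def age_in_days(version: str, today: date) -> int:
    """How long ago was that version?"""
    return (today - parse_version(version)[1]).days


def breaks_compat(old: str, new: str) -> bool:
    """Has there been a major compatibility change between two versions?"""
    return parse_version(old)[0] != parse_version(new)[0]


print(make_version(3, date(2012, 1, 15)))
print(age_in_days("3.20111111", date(2012, 1, 15)))
```

Both questions asked of a version number in a bug report are answered by the string alone, with no changelog lookup needed.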

And that is probably all I will ever have to say about version numbers. :)

Syndicated 2012-01-16 02:21:07 from see shy jo

a resolution that stuck

Last year, my new year's resolution was to write in my journal every day. That actually stuck; I wrote 262 journal entries in 2011. While I've been keeping a journal intermittently since 1998, last year I doubled the number of entries in it. And wrote a novel's worth of entries -- 53 thousand words!

Most of it is of course banal and mundane stuff. Not good compared with Lars, who does something with his journal where he goes into some detail about code he's working on, and other work. The excerpts I've seen are quite nice. But after I've written code, written a commit message, documentation, perhaps bug reports etc, I often can't find much to say about it in my journal, beyond the bare bones that I worked on $foo today or faced a particularly hard bug. I also worry that the journal, and my reluctance to repeat myself, often tips the balance away from me blogging, if I write down something in the journal first.


Here's my journal for today:

Compare what jokes are funny now with those in 1982. The 1982 ones from net.jokes on olduse.net seem juvenile. Now compare what Unix joke man pages are funny now with those I'm reading from 1982. They seem basically the same. What would Biella make of this?

Liw noticed ikiwiki OOM on pell. Tracked down to a perl markdown bug with long lines. Had quite enough of perl markdown; ikiwiki will be moving to a different engine. Added discount support to it today, still needs Debian package tho.

[censored]

Really gorgeous sunset, with a high wind, moon, puffy low, fast moving clouds. Enjoyed it ecstatically. It's going to get cold soon. Very rainy early, but then got intermittently sunny; power is holding out ok.

Was going to roast a chicken today, but got distracted and had a large lunch besides. Need to find some quick food for supper.

I need to start a new book, should it be the River Cottage book about meat that I stole from Anna, or some SF?

Blogged about journaling, and put this journal entry in it, so also journaled about blogging. Wrote it somewhat self-consciously.


The benefits for me have ranged from being able to go back and work out dates of events, to forwarding the odd excerpts to others. The best thing though is certainly having a regular time of introspection, to look back over my day.

If you've not got a new year's resolution yet, I recommend this one. (Learning Haskell would be another good one, if you haven't yet.)

Just write something, anything, down in your journal every day.

Syndicated 2012-01-01 22:58:57 from see shy jo

solar year

I've been at the cabin, on solar power, for a year now. I have a year of data!

Everything went pretty well until last month. There was an April rainy spell where power felt slightly tight. Then over the summer, plenty of power, no need to conserve. The last month though had what seemed like weeks of continual grey clouds, where I never saw the sun.

high noon today

Of course, even on a sunny day in winter, it does not get far above the hills, and the peak production window is only a few hours. This bad combination had my battery power dipping below the 10 volts that I consider low, down to 9, and even to 8 volts.

I use kerosine lamps in the winter. (I prefer the light anyway.) I've also started unplugging my Thecus server at night to conserve power, meaning no internet late or early. For four or so nights, I had no power to run even my laptop after sunset. On one notable day, there was no power even in the daytime.

Even when it turned sunny again, I found that the batteries would seem to charge to 12 volts during the day, but then precipitously drop to 10 and 9 volts at night. I think the problem was not damaged batteries, but that these Nicads charge most efficiently above 12 volts (14 volts is best), and there was never enough power saved up to get them full enough that they could charge really efficiently.

So, I reluctantly spent three days away this week, to let the batteries soak up sun and recover. It seems to have worked; they've been holding a 12 volt charge overnight again.

Syndicated 2011-12-31 18:15:55 from see shy jo

a Github survey

The great thing about git and other distributed version control systems is that once you clone (or fork) a repository, you have all the data. You don't have to trust that Github will preserve it; everyone who develops the project is a backup.

Github carries this principle quite far among the features they provide. But not all the way. Today I have surveyed their features, and where the data for each is stored.

  • source code -- in git, of course!
  • user and project pages and wiki -- in git
  • gists -- in git
  • issues -- in a database accessible by an API
  • notes on commits -- in a database accessible by an API
  • relationships between repos (who forked what, pull requests) -- in a database accessible by an API
  • your account details and activity -- in a database, accessible by you via an API
  • list of all projects and users -- in a closed database (AFAIK)

The two that really stand out are the issues and notes not being stored in git. This means that, if a project uses github, it gets locked into github to a degree. The records of bugs and features, all the planning, and communication, are locked away in a database where they cannot be cloned, where every developer is not a backup.

Github's intent here is not to control this data to lock you in (to the extent they want to lock you in, they do that by providing a proprietary UI that people rave about); it was probably only expedient to use some sort of database, rather than git, when implementing these features.

They should automatically produce git repository branches containing a project's issues, and notes, based on the contents of their database. (For notes, git notes is the obviously right storage location.) Along with ensuring every developer checkout is a backup, this would allow accessing that data while offline, which is one of the reasons we use distributed version control.

The lack of a global list of projects is problematic in a more global sense. It means that we can't make a backup of all the (public) repositories in Github (assuming that we had the bandwidth and storage to do it). I recently backed up all the repositories on Berlios.de, when it looked to be shutting down; this was only possible because they allowed enumerating them all.

People at The Internet Archive say that their archival coverage of free software is actually quite bad. We trust our version control systems to save our free software data, but while this works individually, it will result in data loss globally over time. I'd encourage Github to help the Internet Archive improve their collections by donating periodic snapshots of their public git repositories to the Archive. You're located in the same city, 5 miles apart; they have lots of hard drives (though less right now during the shortage than usual); this should be pretty easy to do.


Full disclosure: Github has bought me dinner and seemed like stand-up guys to me.

Syndicated 2011-12-27 17:38:45 from see shy jo

roundtrip latency from a cabin with dialup in 2011

alt="imagine an xkcd-style infographic here"

0 seconds

  • peace and quiet
  • full history of all my projects (git repos)
  • my blog
  • email

0.5 seconds

  • chatting on IRC
  • searching through all email received since 1994
  • music
  • cached web pages

5 seconds

  • ssh to a server
  • search the web
  • lwn, hacker news, reddit, metafilter, and other web aggregators

10 seconds

  • resuming laptop from sleep and waiting for network-manager
  • view an unnecessarily pastebinned scrap of text
  • access local Debian mirror
  • looking up a typical bug report

20 seconds

  • click on a typical link from a web aggregator
  • an hour of video pulled from a USB drive with git-annex

2 minutes

  • downloading new email
  • an increasing number of websites that force https (average of 3 reloads needed due to timeouts)

5 minutes

  • viewing a single file, bug report, or merge request on github
  • cloning the full content of a typical not too large git repo
  • retrieving data from archival drives via git-annex
  • going offline and making a phone call
  • apt-get update (thanks aj, for the pdiffs)
  • viewing a single twitter page (megabytes of crud and #! redirections)

10 minutes

  • entering a state of flow while programming
  • boingboing.net (with all the pretty pictures)
  • my mailbox (after a nice walk down a long driveway)

22 minutes

  • milk and eggs
  • a swim in the river

30 minutes

  • broadband internet access
  • someone else who knows what linux is

32 minutes

  • an hour of video pulled from my server with git-annex (includes travel time to broadband access point)

70 minutes

  • a halfway decent but slightly overpriced grocery store
  • a produce stand
  • a coffee shop

180 minutes

  • family
  • a bakery with real bread

300 minutes

  • downloading a typical podcast

Syndicated 2011-11-23 21:44:04 from see shy jo

the Engelbart demo

Just watched the whole Douglas Engelbart demo from 1968. Somehow I'd only heard of this as the first demo of the computer mouse, and only seen a brief clip on youtube. All three 30-minute reels of the film are available online, and well worth a watch in full.

The mouse is the least of it; the demo includes an outlining text editor, model-view-controller, hypertext, wiki, domain specific programming languages, a precursor to email, bug tracking, version control(?), a chorded keyboard. (Ok, that last one didn't really take off.) Probably a dozen other things I've forgotten. All in a single interface, and all before I was born.

Just like any tech demo, there are fumbles and mistakes, which is reassuring to anyone who has tried to give a tech demo.

There's also the awesome crazy hack shown here. They could only afford these tiny, round CRTs, so they pointed a television camera at it, and the camera image was piped to their television console. (So add KVM switch to the list of firsts!) The demo was done in San Francisco, with the computer system remote in Palo Alto, so in this image you see the text on the CRT overlaid with the video from the camera.

Engelbart points out that the delay this added to the system acts as a short-term memory that filtered out flicker in the original display (and made the mouse have a mouse trail). To me it gives the whole demo a unique quality, as if it were underwater.

Despite the piping around of audio and video signals, and the multiuser system, the glaring thing missing from the demo that we have these days is networking. Although there is this amusing bit at the end where they compile a regular expression and then apply it, in order to search for documents containing certain terms, and end up with a hyperlinked list of 10 results, ordered by relevance. Yes, think Google.

Syndicated 2011-11-03 00:14:19 from see shy jo

two random thoughts about bugs

First thought is this: A bug's likelihood of ever being fixed decays with time, starting when I first read it. If I have to read it a second time, the bug has already become more complex, since something prevented me from just fixing it the first time. If more information has to be added to the bug, that makes it yet more complex. If there is an argument in the bug about whether it is a bug, or how to fix it, just revisiting the bug at a later date can become more expensive than it's worth. Much of what is involved in filing good and effective bug reports are obvious corollaries of this. It also follows that it's best to either fix, or at least plan how to fix a bug immediately upon reading it.

Second thought is about "wontfix". A bug submitter and the developer responsible for the bug see this state in very different ways, but the name hides what it really means, which is that there is a meta-bug affecting either the bug submitter, the developer, or both. Once you realize this, wontfix bugs, from either side, become a bit personally insulting. They also quickly decay to uselessness (see first thought), and then just lurk there wasting the developer's time in various ways. Bug tracking systems should not provide a "wontfix" state; if they want to track meta-bugs they should provide a way to reassign such a bug to some other party who can actually resolve such a meta-bug.

Syndicated 2011-10-29 18:08:33 from see shy jo
