Older blog entries for joey (starting at number 463)

size of the git sha1 collision attack surface

The kernel.org compromise has people talking about the security of git's use of sha1. Talking about this is a good thing, I think, but there's a lot of smug "we're cryptographically secure" in the air that does not seem warranted coming from non-cryptographers like me.

Two years ago I had a discussion on my blog about git and sha1, that reached similar conclusions to what I'm seeing here: It seems that current known sha1 attacks require somehow getting an ugly colliding binary file accepted into the repository in the first place. Hard to manage for peer reviewed source code. We all hate firmware in the kernel, so perhaps this is another reason it's bad. ;-) Etc.

Well, not so fast. Git's exposure to sha1 collisions is broader than just files. Git also stores data for commits, and directory trees.

Git's tree objects are interesting because they're a bag of bytes that is rarely if ever manually examined. If there was a way to exploit git such that it ignored some trailing garbage at the end of a tree object, then here's an attack injection vector that would be unlikely to be caught by peer review.

If you can change the content of a tree without changing its sha1, you can simply make it link to an older version of a file that had an exploitable problem. Or you can assemble a combination of files that results in a new exploitable problem. (For example, suppose a buffer size was hardcoded in two files in the kernel, and then the size was changed in both -- make a tree that contains one change and not the other.)

Now, git's tree-walk code, until 2008, mishandled malformed data by accessing memory outside the tree buffer. Was this an exploitable bug in git? I don't know. It is interesting that the fix, in 64cc1c0909949fa2866ad71ad2d1ab7ccaa673d9, relied on the parser stopping at a NULL -- great if you want to put some garbage after the tree's filename. With that said, the particular exploit I describe above probably won't work -- I tried! Here's all the code that stands between us and this exploit:

        if (size < 24 || buf[size - 21])
                die("corrupt tree file");

        path = get_mode(buf, &mode);
        if (!path || !*path)
                die("corrupt tree file");

Any good C programmer would recognise that this magic-constant-laden code needs to be careful about the size of the buffer. It's not as clear though, that it needs to be careful about consuming the entire contents of the buffer. And C programmers involved with git have gotten this code wrong before.
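To make the "consume the entire buffer" point concrete, here is a minimal sketch in Python rather than git's C. It is not git's code. It assumes only the tree entry layout of an octal mode, a space, a name, a NUL byte, and a 20-byte sha1, and `parse_tree` is a name made up for illustration. The loop ends only when the position lands exactly on the end of the buffer, so any trailing bytes must themselves parse as a complete entry or the whole tree is rejected.

```python
def parse_tree(buf: bytes):
    """Parse the body of a git tree object, rejecting any leftover bytes.

    Illustrative only: entries are assumed to be
    b"<octal mode> <name>\\0<20 raw sha1 bytes>", back to back.
    """
    entries = []
    pos = 0
    while pos < len(buf):
        space = buf.find(b" ", pos)
        if space == -1:
            raise ValueError("corrupt tree: no mode terminator")
        mode_bytes = buf[pos:space]
        if not mode_bytes or any(c not in b"01234567" for c in mode_bytes):
            raise ValueError("corrupt tree: bad mode")
        nul = buf.find(b"\0", space + 1)
        if nul == -1:
            raise ValueError("corrupt tree: unterminated name")
        name = buf[space + 1:nul]
        if not name:
            raise ValueError("corrupt tree: empty name")
        sha_end = nul + 1 + 20
        if sha_end > len(buf):
            raise ValueError("corrupt tree: truncated sha1")
        entries.append((int(mode_bytes, 8), name, buf[nul + 1:sha_end].hex()))
        # Advance past this entry; the loop exits only at pos == len(buf).
        pos = sha_end
    return entries
```

The design choice is that there is no "stop at the first thing that looks like an end" path: every byte is either part of a validated entry or an error.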

tldr: If git is a castle, it was built just after cannons were invented, and we've had our fingers in our ears for several years as their power improved. Now the outer wall of sha1 is looking increasingly like one of straw, and we're down to a rather thin inner wall of C code.

Syndicated 2011-09-02 05:47:39 from see shy jo

summer trips wrapup

Finally back from a solid month away.

  • Drive from England to Bosnia, and back. Plus two days of air travel, for a week of travel all told.

    The trip with the UK convoy back from Bosnia was enjoyable. Steve found a great route thru the Alps, and I much enjoyed finally seeing them. Then we stopped at a golf course in Luxembourg, where my hotel room was a suite ... swanky. We bogged down in Belgium and missed our ferry, which provided a chance to play some Eurogames in Europe. Then I visited family in London.

    I hope to eventually have some pictures from that trip, if those who had cameras make them available.

  • In between the European tour, there was DebConf. As always, it was excellent. I did not come out with a large todo list of exciting things, as happened last year, but I did continue nibbling through that list. Had some good conversations about Haskell, met Intrigeri, who wrote the ikiwiki po plugin. Had some meetings on things I feel I've sorta moved on from to some extent but still have to be available for. Didn't manage significant technical work, but this was not unexpected. The day trip was fun, enjoyed seeing the waterfalls and little mills, and swimming the cold, cold river. The last few days I was out of energy. I did not give any presentations, and only realized during the lightning talks that I should have given one about git-annex.

  • I had expected to have most of a week at home after getting back from Europe, and technically did. But it was too annoying and unusual to count. Wildlife ate two trees of pears while I was away. My cat was stressed. I was stressed. It was insanely humid, and the house had been closed up for two weeks, and I had to fight mold and damp.

  • So the added trip to the beach that put this month over the top to beyond insane amounts of travel, turned out to be sorta a good thing. Camping in the dunes, kids, good books, sea turtle eggs fenced off a hundred feet away on the beach waiting to hatch. Lots of kite flying, and somehow no sunburn. And no rain until a dawn rainbow followed by lots of wet just as we were breaking camp.

    Full details in this Ocracode. Nobody but me understands or cares, but that's just fine. :)

    OBX1.1 P6 L6 SC5d+++b--c- U0 T3 f++-b2 R1dw Bn-b++m++ F+u+ SC++s++g0 H+++f2i4Vs---m0 E+++r+++ T6f++-b0 R1w Bn-b++m++ F++u++ SC++s++g1 H+++f2i5 V+++ E+++r++

For the past three days I've been coding, which feels good after all that time away.

Syndicated 2011-08-17 19:44:06 from see shy jo

arrival at DebConf

The trip down from Graz to Banja Luka was much easier than the day before. After a while you just get used to being sat in a car for ages. Plenty of nice scenery to enjoy through Slovenia. After a while our cars' GPSes began to fail, showing us driving through fields, and we were stuck for 1.5 hours in a traffic jam when the 4 lane highway seemed to end. Got around that with some guesswork, and on into Croatia by back roads.

The Bosnian border was an interesting experience, all the guards could say in English was "green card! green card!" -- which from an American POV is an unsettling thing to be asked for at a border, especially if they've already taken your passport away -- but at least we were not detained overnight.

While drivers were away getting the car insurance settled it descended toward farce as we had to hand roll the cars forward to let trucks get into the country. (Or we thought we did.. one was rolled with the keys in it as it turned out.)

Arrived at the Hotel in Banja Luka in the middle of a wedding, which was amazingly loud (I could still hear it from the 5th floor at 2 am). There's also a casino at the hotel, so first impression was garish and loud! ... But now that it's a rainy Sunday, seems much nicer here.

Syndicated 2011-07-24 14:15:49 from see shy jo

from the convoy

As I type this, it's just past midnight. I'm in the back of a BMW somewhere in east Germany, and the Debian UK convoy is doing 110 mph on the autobahn, twenty hours into a twenty-five hour first leg of our trip to Banja Luka.

This all started out so sanely with a 3 am departure to catch the 6 am ferry at Dover. Followed by a couple of hours leisurely breakfast onboard. First hint that yes, this is a road trip in which things will go wrong was a minor bumper denting of one of the convoy's cars by a stray Land Rover during the ferry trip, but it didn't really faze us. On to Cologne, for a very nice lunch and to pick up another person.

But we didn't anticipate how brutal the next leg to Graz would be. Nor did we count on apparently half of Germany and the Netherlands getting out their campers and heading east this Friday. Spent multiple hours stop-and-go, and many more in constant traffic. Finally it opened out, so we can follow the night speed limit. And while we started out horsing around on the radio, we've developed some real comms discipline by now to keep the convoy together.

Also, people amazingly seem to be keeping rested while not driving. To add to my sleep debt, I flew in the day before, but I actually feel caught up now. Still, I've not been driving at all due to a mislaid license and a general inability to safely drive a right side drive stick shift at 100 mph at night. Our 7 drivers are doing an amazing job.

Update: Arrived safely in Graz at 4 am. Austria tantalized with 30+ miles of tunnels thru the Alps, but I've not seen an alp yet.


PS, you'll never appreciate a stinky, free bathroom until you're in a country where all the antiseptic bathrooms cost money and hordes of vacationers are doing the logical thing next to service stations.

Syndicated 2011-07-23 09:13:09 from see shy jo

thoughts on the last shuttle launch

I watched the final shuttle launch this morning; for several years I've tried to catch shuttle events, knowing it would soon be over, but before that the shuttle had faded into the background for me as it did for so many of us. It was an impractical rocket to nowhere that we mostly only paid attention to when it blew up.

Fourteen years ago today, Debian was flying in space aboard Columbia. According to the press release, it ran on an SSD on an embedded 486 in the lab module, and controlled plant watering, telemetry, and video. At the time, I had just become a Debian developer, and was very impressed to be part of a project that was involved in that. It was early days for Debian, and near the midpoint of the shuttle's thirty years.

Now it seems likely that Debian, or its derivatives (or at least Free Software) will easily outlast that thirty year run, but I do wonder to what extent our work will fade into the background (and what interesting ways it will find to explode) over that time span and beyond. We'd say we have better methods than the centralized, committee-driven, top-down, PR-conscious NASA... impressive though it can be at its best. We have dreams just as noble as the ones behind the space program, but also goals that are more adaptable, equally at home flying in space, or embedded in some pocket-lint laden artifact of a perhaps more contemporary inward turn.

Syndicated 2011-07-08 22:29:26 from see shy jo

databranches: using git as a database

I've just released git-annex version 3, which stops cluttering the filesystem with .git-annex directories. Instead it stores its data in a git-annex branch, which it manages entirely transparently to the user. It is essentially now using git as a distributed NOSQL database. Let's call it a databranch.

This is not an unheard of thing to do with git. The git notes built into recent git does something similar, using a dynamically balanced tree in a hidden branch to store notes. My own pristine-tar injects data into a git branch. (Thanks to Alexander Wirt for showing me how to do that when I was a git newbie.) Some distributed bug trackers store their data in git in various ways.

What I think takes git-annex beyond these is that it not only injects data into git, but it does it in a way that's efficient for large quantities of changing data, and it automates merging remote changes into its databranch. This is novel enough to write up how I did it, especially the latter which tends to be a weak spot in things that use git this way.

Indeed, it's important to approach your design for using git as a database from the perspective of automated merging. Get the merging right and the rest will follow. I've chosen to use the simplest possible merge, the union merge: When merging parent trees A and B, the result will have all files that are in either A or B, and files present in both will have their lines merged (and possibly reordered or uniqed).
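As a rough illustration of the union merge just described, here is a small Python sketch. It is not git-annex's code (which is Haskell) and does not touch git at all: trees are modeled as plain dicts from path to a list of lines, and `union_merge` is a name chosen for this example.

```python
def union_merge(tree_a, tree_b):
    """Union-merge two trees modeled as {path: [line, ...]} dicts.

    Every path in either tree survives; for a path in both, the lines
    are combined and exact duplicates dropped.
    """
    merged = {}
    for path in sorted(set(tree_a) | set(tree_b)):
        lines = tree_a.get(path, []) + tree_b.get(path, [])
        # dict.fromkeys keeps first-seen order while removing duplicates.
        merged[path] = list(dict.fromkeys(lines))
    return merged
```

Note that nothing here can conflict: the merge never has to choose between two versions of a line, which is what makes it safe to run unattended. It only works if the file format tolerates reordered and repeated lines.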

The main thing git-annex stores in its databranch is a bunch of presence logs. Each log file corresponds to one item, and has lines with this form:

  timestamp [0|1] id

This records whether the item was present at the specified id at a given time. It can be easily union merged, since only the newest timestamp for an id is relevant. Older lines can be compacted away whenever the log is updated. Generalizing this technique for other kinds of data is probably an interesting problem. :)
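A sketch of how such a log might be compacted and queried, in Python for illustration. The function names are hypothetical, timestamps are assumed to be plain numbers, and ties between equal timestamps are resolved by keeping whichever line is seen first, which is an arbitrary choice of this sketch.

```python
def compact(lines):
    """Keep only the newest 'timestamp [0|1] id' line for each id."""
    newest = {}
    for line in lines:
        timestamp, status, ident = line.split(None, 2)
        if status not in ("0", "1"):
            raise ValueError("bad presence flag: %r" % line)
        when = float(timestamp)
        if ident not in newest or when > newest[ident][0]:
            newest[ident] = (when, line)
    # Emit in timestamp order so the result is stable across merges.
    return [line for _, line in sorted(newest.values())]


def present_at(lines):
    """Return the set of ids where the item is currently present."""
    ids = set()
    for line in compact(lines):
        _, status, ident = line.split(None, 2)
        if status == "1":
            ids.add(ident)
    return ids
```

Because only the newest line per id matters, the answer is the same no matter what order a union merge left the lines in.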

While git can union merge changes into the currently checked out branch, when using git as a database, you want to merge into your internal-use databranch instead, and maintaining a checkout of that branch is inefficient. So git-annex includes a general purpose git-union-merge command that can union merge changes into a git branch, efficiently, without needing the branch to be checked out. Another problem is how to trigger the merge when git pulls changes from remotes. There is no suitable git hook (post-merge won't do because the checked out branch may not change at all). git-annex works around this problem by automatically merging */git-annex into git-annex each time it is run. I hope that git might eventually get such capabilities built into it to better support this type of thing.

So that's the data. Now, how to efficiently inject it into your databranch? And how to efficiently retrieve it?

The second question is easier to answer, although it took me a while to find the right way ... Which is two orders of magnitude faster than the wrong way, and fairly close in speed to reading data files directly from the filesystem. The right choice is to use git-cat-file --batch; starting it up the first time data is requested, and leaving it running for further queries. This would be straightforward, except git-cat-file --batch is a little difficult when a file is requested that does not exist. To detect that, you'll have to examine its stderr for error messages too. Perhaps git-cat-file --batch could be improved to print something machine parseable to stdout when it cannot find a file.
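Here is a hedged Python sketch of the long-running `git cat-file --batch` approach. The reply framing follows the git-cat-file documentation as I understand it: a `<sha1> <type> <size>` header line, then the contents and a trailing newline, with a `<object> missing` line for something git cannot find. Whether that line covers every missing-file case, including the one described above as needing stderr, is not something this sketch settles. The function names are made up for illustration.

```python
import subprocess


def start_cat_file(repo_dir):
    """Start one git cat-file --batch process to reuse for many queries."""
    return subprocess.Popen(
        ["git", "cat-file", "--batch"],
        cwd=repo_dir,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )


def read_reply(stdout):
    """Read one reply from the batch stream; None means 'missing'."""
    header = stdout.readline()
    if not header:
        raise EOFError("git cat-file went away")
    fields = header.rstrip(b"\n").split(b" ")
    if fields[-1] == b"missing":
        return None
    size = int(fields[2])
    content = stdout.read(size)
    stdout.read(1)  # consume the newline that follows the contents
    return content


def query(proc, branch, path):
    """Ask the running process for <branch>:<path>."""
    proc.stdin.write(("%s:%s\n" % (branch, path)).encode())
    proc.stdin.flush()
    return read_reply(proc.stdout)
```

The speedup comes from paying git's startup cost once; each later `query` is just a write and a read on pipes that are already open.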

Efficiently injecting changes into the databranch was another place where my first attempt was an order of magnitude slower than my final code. The key trick is to maintain a separate index file for the branch. (Set GIT_INDEX_FILE to make git use it.) Then changes can be fed into git by using git hash-object, and those hashes recorded into the branch's index file with git update-index --index-info. Finally, just commit the separate index file and update the branch's ref.
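The sequence above might look roughly like this when driven from Python. This is a sketch under assumptions, not git-annex's implementation: it assumes the branch already exists, that `index_path` is an absolute path, and that all files are regular files with mode 100644. The helper names are hypothetical; the git commands themselves are standard plumbing.

```python
import os
import subprocess


def index_info_line(sha1, path, mode="100644"):
    """Format one 'mode SP sha1 TAB path' line for update-index --index-info."""
    return "%s %s\t%s\n" % (mode, sha1, path)


def git(args, repo, env=None, stdin=None):
    """Run a git command in repo and return its trimmed stdout."""
    result = subprocess.run(
        ["git"] + list(args), cwd=repo, env=env, input=stdin,
        stdout=subprocess.PIPE, check=True,
    )
    return result.stdout.decode().strip()


def commit_to_databranch(repo, branch, files, index_path, message="update"):
    """Commit {path: bytes} onto refs/heads/<branch> without a checkout."""
    ref = "refs/heads/" + branch
    # GIT_INDEX_FILE points git at the branch's private index.
    env = dict(os.environ, GIT_INDEX_FILE=index_path)
    if not os.path.exists(index_path):
        # First use: seed the private index from the branch's current tree.
        git(["read-tree", ref], repo, env=env)
    lines = []
    for path, content in files.items():
        sha1 = git(["hash-object", "-w", "--stdin"], repo, stdin=content)
        lines.append(index_info_line(sha1, path))
    git(["update-index", "--index-info"], repo, env=env,
        stdin="".join(lines).encode())
    tree = git(["write-tree"], repo, env=env)
    parent = git(["rev-parse", ref], repo)
    commit = git(["commit-tree", tree, "-p", parent, "-m", message], repo)
    git(["update-ref", ref, commit], repo)
    return commit
```

Because the private index is never the one the user's working tree uses, none of this disturbs whatever branch is checked out.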

That works ok, but the sad truth is that git's index files don't scale well as the number of files in the tree grows. Once you have a hundred thousand or so files, updating an index file becomes slow, since for every update, git has to rewrite the entire file. I hope that git will be improved to scale better, perhaps by some git wizard who understands index files (does anyone except Junio and Linus?) arranging for them to be modified in-place.

In the meantime, I use a workaround: Each change that will be committed to the databranch is first recorded into a journal file, and when git-annex shuts down, it runs git hash-object just once, passing it all the journal files, and feeds the resulting hashes into a single call to git update-index. Of course, my database code has to make sure to check the journal when retrieving data. And of course, it has to deal with possibly being interrupted in the middle of updating the journal, or before it can commit it, and so forth. If gory details interest you, the complete code for using a git branch as a database, with journaling, is here.
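A minimal sketch of the journaling idea, in Python and with the git side abstracted away. The `Journal` class and its callbacks are hypothetical: `read_committed(key)` stands in for reading from the branch, and `commit_batch(changes)` stands in for the single hash-object / update-index / commit pass. The real code deals with more failure modes than this does.

```python
import os
from urllib.parse import quote, unquote


class Journal:
    """Buffer writes as files on disk, then commit them all in one batch."""

    def __init__(self, journal_dir, read_committed, commit_batch):
        self.dir = journal_dir
        self.read_committed = read_committed  # key -> bytes or None
        self.commit_batch = commit_batch      # {key: bytes} -> None
        os.makedirs(journal_dir, exist_ok=True)

    def _path(self, key, prefix="j."):
        # quote() with safe="" makes any key a single reversible file name.
        return os.path.join(self.dir, prefix + quote(key, safe=""))

    def set(self, key, content):
        tmp = self._path(key, prefix="t.")
        with open(tmp, "wb") as f:
            f.write(content)
        # Atomic rename: an interrupted write leaves only an ignored t. file.
        os.replace(tmp, self._path(key))

    def get(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return f.read()  # a journalled value shadows the branch
        except FileNotFoundError:
            return self.read_committed(key)

    def flush(self):
        pending = {}
        for name in os.listdir(self.dir):
            if name.startswith("j."):
                with open(os.path.join(self.dir, name), "rb") as f:
                    pending[unquote(name[2:])] = f.read()
        if pending:
            self.commit_batch(pending)
        # Remove entries only after the commit succeeded; a crash in
        # between just means the same values are committed again later.
        for name in os.listdir(self.dir):
            os.remove(os.path.join(self.dir, name))
```

The ordering in `flush` is the point of the design: committing before deleting makes an interruption cost a redundant commit rather than lost data.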

After all that, git-annex turned out to be nearly as fast as before when it was simply reading files from the filesystem, and actually faster in some cases. And without the clutter of the .git-annex/ directory, git use is overall faster, commits are uncluttered, and there's no difficulty with branching. Using a git branch as a database is not always the right choice, and git's plumbing could be improved to better support it, but it is an interesting technique.

Syndicated 2011-07-02 20:38:57 from see shy jo

I'm going to DebConf 11!

My DebConf trip will involve 7 days of travel spanning ten countries, plus the conference. I'll be part of a three car convoy, all across Europe to B&H, with a bunch of speed-loving brits. I'm sure there will be many references to Top Gear, and probably some multicar wifi. Luckily I'm prepared, since Overfiend (aka Debian's own Stig) introduced me to the show two years ago. This will be a great way to see a lot of Europe, fast, and for pretty cheap all told, even with the hotel in Luxembourg.

I'm sure DebConf 11 will spark a lot more ideas of things to do. For that matter, I still have todo items leftover from Debconf 10. I'm not planning to give any talks, but who knows?

Syndicated 2011-06-24 04:11:36 from see shy jo

date formats of a decade of usenet

I've finished importing the usenet archive for oldusenet. The fun part was parsing the dates to put the posts in order.

No date format was really required on usenet, and so a wide variety of formats were used. Some posts didn't have a Date, but a guess could be made from their Message-ID. Some posts had absurd dates (e.g., 1969, 1995), others had dates that were correct in every way.. except the year was left out (oops). One early post had a date of "_".

Still, this excerpt of my code managed to parse the rest and so gives a fairly complete picture of how messy dates can possibly be. Read and weep.

    p anyzone "%d %b %y %T"       "15 Jun 88 02:27:41 GMT"
, p anyzone "%a, %d %b %y %T"       "Thu, 22 Jun 89 20:02:03 GMT"
, p anyzone "%a, %d-%b-%y %T"       "Thu, 15-Jun-89 18:01:56 EDT"
, p anyzone "%d %b %y %T"       "8 Jan 90 14:07:27 -0400"
, p anyzone "%d %b %y %H:%M"        "4 Oct 89 19:56 GMT"
, p anyzone "%a, %d %b %y %H:%M"    "Thu, 23 May 91 02:13 PDT"
, p anyzone "%a, %d %b %Y %T"       "Thu, 23 May 1991 07:07:00 -0400"
, p anyzone "%a, %d %b %Y %H:%M"    "Sat, 18 May 1991 17:28 CDT"
, p anyzone "%d %b %Y %T"       "11 Apr 1991 12:02:01 GMT"
, p anyzone "%d-%b-%y %H:%M"        "24-Mar-90 14:22 CST"
, p anyzone "%d %b %y, %T"      "22 May 91, 16:31:37 EST"
, p anyzone "%d %b %Y %H:%M"        "30 June 1991 17:15 -0400"
, p anyzone "%a, %d %b T  %T"       "Fri, 8 Feb T  09:49:39 EST"

-- special cases
, p (tzconst est) "%a %b %d %T EST %Y"  "Tue Jan 11 12:44:36 EST 1983"
, p (tzconst est) "%a %b %d %T EST %y"  "Tue Jan 11 12:44:36 EST 83"
, p (tzconst edt) "%a %b %d %T EDT %Y"  "Tue Jan 11 12:44:36 EDT 1983"
, p (tzconst edt) "%a %b %d %T EDT %y"  "Tue Jan 11 12:44:36 EDT 83"
, p (tzconst utc) "%a %b %d %T GMT %Y"  "Thu Nov  1 23:14:37 GMT 1990"
, p (tzconst pdt) "%d %b %y %T -7"  "11 Jun 91 15:41:21 -7"

-- dates with no timezone specified are guessed
, p nozone "%d %b %y %T"        "9 Jan 90 09:33:59"
, p nozone "%d %b %Y %T"        "10 APR 1990 05:25:28"
, p nozone "%a %b %d %T %Y"     "Fri Feb  6 00:19:47 1981"
, p nozone "%a %b %d %T %y"     "Fri Feb  6 00:19:47 81"
, p nozone "%Y-%m-%d %T"        "1981-11-12 18:31:01"
, p nozone "%y-%m-%d %T"        "81-11-12 18:31:01"
, p nozone "%a, %d %b %y %T"        "Sat, 13 Apr 91 08:37:57"
, p nozone "%a, %d %b %Y %T"        "Sun, 16 Jun 1991 13:23:02"
, p nozone "%d %b, %Y %T"       "1 May, 1991 00:00:00"
, p nozone "%d %b %y %H:%M"     "8 Jan 88 18:03"
, p nozone "%a, %d %b %y %H:%M"     "Wed, 29 May 91 17:14"
, p nozone "1 %b %d %T %Y"      "1 Jan 08 20:59:08 1991"

-- this has to come near the end, as it matches greedily
, g nozone "%a %b %d %T %Y ("       "Wed Oct 27 17:02:46 1982 (Tuesday)"
, g nozone "%a, %d %b %y %T +"      "Tue, 21 May 91 16:46:01 +22323328"

-- extract date from message-id headers
-- (used for messages with no Date field)
, g nozone "<%Y%b%d.%H%M%S."        "<1989Jul6.214048.28313@jarvis.csri.toronto.edu>"

(Parsing the often ambiguous, malformed, etc timezones was fun all its own too, of course.)
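For readers who don't read Haskell, the table above amounts to "try each format in turn until one parses". Here is a rough Python analogue covering only a handful of the zoned formats. It is not a translation of the original code: the zone table is a small assumed subset of US abbreviations, and real Usenet zones were messier than this, as noted above.

```python
from datetime import datetime, timedelta, timezone

# Assumed offsets in hours for a few zone names; far from complete.
ZONES = {"GMT": 0, "EST": -5, "EDT": -4, "CST": -6, "CDT": -5,
         "PST": -8, "PDT": -7}

# Formats for the part of the date before the trailing zone token.
ZONED_FORMATS = [
    "%d %b %y %H:%M:%S",
    "%a, %d %b %y %H:%M:%S",
    "%a, %d %b %Y %H:%M:%S",
    "%d %b %y %H:%M",
]


def parse_zone(token):
    """Turn 'GMT' or '-0400' into a tzinfo, or None if unrecognized."""
    if token in ZONES:
        return timezone(timedelta(hours=ZONES[token]))
    if len(token) == 5 and token[0] in "+-" and token[1:].isdigit():
        sign = -1 if token[0] == "-" else 1
        offset = timedelta(hours=int(token[1:3]), minutes=int(token[3:5]))
        return timezone(sign * offset)
    return None


def parse_date(text):
    """Try each known format in order; None if nothing matches."""
    body, _, zone_token = text.rpartition(" ")
    zone = parse_zone(zone_token)
    if zone is None:
        return None
    for fmt in ZONED_FORMATS:
        try:
            return datetime.strptime(body, fmt).replace(tzinfo=zone)
        except ValueError:
            continue
    return None
```

As in the original list, order would start to matter once greedy or overlapping formats are added, which is why those come last there.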

Syndicated 2011-06-08 21:28:38 from see shy jo

hmm

I was going to blog about being on hacker news, and slashdot etc, but it doesn't seem interesting enough for my blog.

Although last showing 50 thousand logins by "oldusenet" is noteworthy.

Happy IPv6 day BTW!

Syndicated 2011-06-08 00:49:45 from see shy jo

announcing olduse.net

As I write this, it's the morning of June 5th, 1981. A few people scattered across the US are waking up, going in to work, sitting down at their terminal with a coffee, and reading Usenet. Usenet is only getting a trickle of posts each day -- it's still in that period where it's easy to read every message posted to it.

Many things lie in Usenet's future. It's still running A-News, which doesn't even have a real From header yet. Later this year it will switch over to B-News, and volume will begin to increase. In 1987 there will be The great renaming. And of course in 1994, the first spam will be posted to Usenet.

But that's all a long way off, here in 1981. Right now, they're talking about 500 mb disk drives that only cost $38000. And rms is inciting flames about nuclear proliferation. And Postel is publishing an RFC for the new Mail Transfer Protocol.

Good morning, Usenet. Who knows what will come next in this fledgeling electronic communications medium!

a ten year real-time historical exhibit

This morning, I'm announcing a new site: Olduse.net

It's Usenet, updated in real time as it was thirty years ago. Planned to be available for the next ten years, unless I run out of inodes (again).

If you missed it the first time around, this is your chance to follow Usenet's flowering.

made possible by

  • Henry Spencer at the University of Toronto, Department of Zoology, who archived Usenet. Back when it was really uncool and really expensive. Our view onto Usenet is thus slightly centric to Canada and Zoology, but that's ok.
  • David Wiseman, who hauled 141 magtapes in a pickup truck.
  • Many who worked to rescue data off the tapes. Including from the deleted stuff at the ends.
  • Rich Skrenta, who somehow got a copy of the archive out from under the Google borg. Although one of the tar files is truncated. Just saying.
  • The creator of Telehack, who pointed me in the right direction, ending my multi-year quest to find the archive. And if you think this is neat, Telehack will blow you away.
  • The developers of Haskell, which enabled me to whip up a B-News to C-News converter, a custom uucp, date parsers for every crazy date format ever used on Usenet, and suitable queue data structures in a rock solid, maintainable way, in 500 lines of code written over 12 hours. When I realized I also needed an A-News to B-News converter, I knew it was worth it to have done things right, because that took only 43 more lines, and worked 100% on the first run! My code repository for olduse.net is here.

PS: You can post to olduse.net, but it won't show up for at least 30 years. :)

Syndicated 2011-06-05 11:49:54 from see shy jo
