Older blog entries for joey (starting at number 523)

guesting on GitMinutes

Last Friday, I spent and hour and a half clamping a landline phone to the side of my head, while also wearing a headset. I was recording an interview on the GitMinutes podcast about git-annex.

I've been listening to GitMinutes for a while, ever since I heard git-annex was mentioned on it. Actually, I think it's come up in 4 or 5 interviews on the podcast. Most notably with core git dev Peff King, who had some interesting things to say, for sure. I responded to that in my interview, and we covered quite a wide amount of stuff in reasonable depth in just over an hour. Thomas is quite a good host and great at drawing stuff out, and it's nice to not need to worry about going into too much technical depth. (Although I didn't get a chance to explain the automatic union merging used to maintain the git-annex branch.)

This is the first podcast I've been in, and I've always worried about audio quality if I was in one. That is, I'd want it to be really good, and probably end up annoying the host. ;) For this one, we settled on using my land line call, which went through some Skype thing to get the Europe, and mixing in a local recording I made with a not too great headset. I think the result is pretty good, considering.

You can listen to the whole thing here, if you dare! (1 hour 8 minutes) http://episodes.gitminutes.com/2013/07/gitminutes-16-joey-hess-on-git-annex.html

(Special bonus guest: The songbird that lives on my porch.)


git-annex fundraising campaign update: Initial goal reached in a mere seven hours. I will be developing git-annex fulltime for at least the next three months! Gone to stretch goal.

Also it made the top of Hacker News: thread

Syndicated 2013-07-15 07:26:08 from see shy jo

New git-annex crowdfunding campaign

Having reached the end of my Kickstarter funded year working on the git-annex assistant, I've decided to try one more crowdfunding campaign, to see if I can get funded to work on it a little while longer. I went back and forth on this for a while. The Kickstarter funded development was extremely successful (one of the most productive years of my life). I certianly want to work on git-annex more, and have lots more stuff to do, particularly around security and encryption. On the other hand, it's hard to frame ongoing development as a normal Kickstarter campaign to start something new.

Anyway, I've decided to go ahead and try it, and not do it through Kickstarter this time. So I have my own website set up and accepting donations, and hopefully I'll make enough to spend a few more months working on git-annex.

campaign.joeyh.name

... And if not, I'll probably spend a few months working on git-annex part time, while looking for paying work with the rest of my time.

By the way, I'm taking payments in both US Dollars (via Paypal) and Bitcoin (via Coinbase). Can't wait to see how this works out!

Syndicated 2013-07-14 23:50:43 from see shy jo

git annex and my mom

[I'm encouraging git-annex users to post their success stories, and this one is my own.]

I set up git-annex on my mom's and sisters' computers a couple of months ago. I noticed this was the first software I've written that I didn't have to really explain how to use. All I told them was, put files in this folder, and the rest of us will be able to see them. Don't put anything too big in there, or anything you don't want others to see.

I paired the computers using XMPP, and set up an encrypted transfer repository using a free account rsync.net gave me for beta testing. I also added a repository on my server, which made things more robust. (XMPP has since improved, but it's still a good idea to have a git repository to suppliment XMPP.) I also have two removable drives that are used to back up our files.

This was all set up using the webapp. And adding a computer takes just a couple of minutes that way. I set it up at my sister's in a spare moment during a visit, and it all just worked.

Our shared git annex contains a couple of hundred files, and is a couple of gigabytes in size. And growing pretty fast as we find things we want to share. Mostly photos and videos so far but I won't be surprised to find poems and books pop up in there from the family's poets and authors. And it'll grow further as I add people who've so far been left out.

Coming home from a week at the beach with my grand nephew and niece, was the first time I really used git-annex without thinking about it. Collapsed on a hotel bed, I plugged in my camera and loaded in the trip's photos. Only to see the hotel wifi cost extra. Urk, no! Later, in the lobby, I found an open wifi network, and watched it automatically sync up.

screenshot

By the time I was home, the video of cute kids playing weathermen and reporting on our near miss by a tropical storm had been enjoyed by the folks who didn't make that family gathering.

Syndicated 2013-06-25 21:33:29 from see shy jo

little disasters

Interesting times.. While the big disasters are ongoing, little ones have been spicing up my life lately.

A pleasant week by the beach ended with a tropical storm passing over the beach house. I've never experienced this before, and though Andrea was diminished by passing over land, it was still more wind than I've ever seen. I love wind, and this was thrilling, right on the edge of danger but not quite there. At least, if you have sense to stay out of the water. Leaving the beach, I heard of someone who tried to go surfing that day, and drowned.

The night before last, I was startled to find nearly an inch of water seeping up from underneath the tile floor of the kitchen. Probably it has something to do with the pressure tank pumping system, which was repaired while I was away, and means I actually have indoor running water here. (Overrated.) This saw me scrambling to close every water valve, and out with a flashlight at 2 am closing the cutoff at the 1000 gallon water reservoir before it all drained into the house. While sopping up dozens of gallons of water from the floor at 3 am probably doesn't sound like fun, I found myself going through the motions elatedly.. Because this means I finally am coming to understand the source of the damp that infests the most earth-sheltered corner of this house. It's not condensation. It's bad plumbing!

Then yesterday, I went out to try a dip in the river, stopped by the neighborhood eatery and bait shop, and ended up sitting out on the back deck eating ribs and listening to a band with "possum playboys" in their name (which makes the full name fairly irrelevant), while looking out over the river and the old-timey green metal bridge. Which was unexpected fun, and the kind of thing you have to take in when it happens, but getting stuck in a newly installed hole in my driveway was not. My car was spinning, and I gave up and called it a night.

Here's the thing. I could feel my brain working on this stupid "underpowered car is stuck in a small rut" issue all night long. Same mental pathways activating that chew over bugs and design issues. Got up this morning with a set of plans and contingency plans all ready to go. The first one, of jacking it up and putting something under the tire was stymied; it seems I am missing a jack. But the second, of digging out all around the tire, and then filling in with gravel and cat litter (a tip from some offroading website I blearily surfed last night), and then riding the gas while releasing the bake, worked great.

All of which is to say, bring em on! But I still prefer my disasters in the form of software bugs.

Syndicated 2013-06-16 16:25:56 from see shy jo

faster dh

With wheezy released, the floodgates are opened on a lot of debhelper changes that have been piling up. Most of these should be pretty minor, but I released one yesterday that will affect all users of dh. Hopefully in a good way.

I made dh smarter about selecting which debhelper commands it runs. It can tell when a package does not use the stuff done by a particular command, and skips running the command entirely.

So the debian/rules binary of a package using dh will now often look like this:

dh binary
   dh_testroot
   dh_prep
   dh_auto_install
   dh_installdocs
   dh_installchangelogs
   dh_perl
   dh_link
   dh_compress
   dh_fixperms
   dh_installdeb
   dh_gencontrol
   dh_md5sums
   dh_builddeb

Which is pretty close to the optimal hand-crafted debian/rules file (and just about as fast, too). But with the benefit that if you later add, say, cron job files, dh_installcron will automatically start being run too.

Hopefully this will not result in any behavior changes, other than packages building faster and with less noise. If there is a bug it'll probably be something missing in the specification of when a command needs to be run.

Beyond speed, I hope that this will help to lower the bar to adding new commands to debhelper, and to the default dh sequences. Before, every such new command slowed things down and was annoying. Now more special-purpose commands won't get in the way of packages that don't need them.


The way this works is that debhelper commands can include a "PROMISE" directive. An example from dh_installexamples

  # PROMISE: DH NOOP WITHOUT examples

Mostly this specifies the files in debian/ that are used by the command, and whose presence triggers the command to run. There is also a syntax to specify items that can be present in the package build directory to trigger the command to run.

(Unfortunatly, dh_perl can't use this. There's no good way to specify when dh_perl needs to run, short of doing nearly as much work as dh_perl would do when run. Oh well.)

Note that third-party dh_ commands can include these directives too, if that makes sense.


I'm happy how this turned out, but I could be happier about the implementation. The PROMISE directives need to be maintained along with the code of the command. If another config file is added, they obviously must be updated. Other changes to a command can invalidate the PROMISE directive, and cause unexpected bugs.

What would be ideal is to not repeat the inputs of the command in these directives, but instead write the command such that its inputs can be automatically extracted. I played around with some code like this:

$behavior = main_behavior("docs tmp(usr/share/doc/)", sub {
       my $package=shift;
       my $docs=shift;
       my $docdir=shift;

       install($docs, $docdir);
});
$behavior->($package);

But refactoring all debhelper commands to be written in this style would be a big job. And I was not happy enough with the flexability and expressiveness of this to continue with it.

I can however, dream about what this would look like if debhelper were written in Haskell. Then I would have a Debhelper a monad, within which each command executes.

main = runDebhelperIO installDocs

installDocs :: Monad a => Debhelper a
installDocs = do
    docs <- configFile "docs"
    docdir <- tmpDir "usr/share/doc"
    lift $ install docs docdir

To run the command, runDebhelperIO would loop over all the packages and run the action, in the Debhelper IO monad.

But, this also allows making an examineDebhelper that takes an action like installDocs, and runs it in a Debhelper Writer monad. That would accumulate a list of all the inputs used by the action, and return it, without performing any side effecting IO actions.

It's been 15 years since I last changed the language debhelper was written in. I did that for less gains than this, really. (The issue back then was that shell getopt sucked.) IIRC it was not very hard, and only took a few days. Still, I don't really anticipate reimplementing debhelper in Haskell any time soon.

For one thing, individual Haskell binaries are quite large, statically linking all Haskell libraries they use, and so the installed size of debhelper would go up quite a bit. I hope that forthcoming changes will move things toward dynamically linked haskell libraries, and make it more appealing for projects that involve a lot of small commands.

So, just a thought experiment for now..

Syndicated 2013-05-08 19:18:06 from see shy jo

the #newinwheezy game: STM

Debian wheezy includes a bunch of excellent new Haskell libraries. I'm going to highlight one that should be interesting to non-Haskell developers, who may have struggled with writing non-buggy threaded programs in other languages: libghc-stm-dev

I had given up on most threaded programs before learning about Software Transactional Memory. Writing a correct threaded program, when multiple threads needed to modify the same state, needed careful uses of locking. In my experience, locking is almost never gotten right the first time.

A real life example I encountered is an app that displays a queue of files to be downloaded, and a list of files currently downloading. Starting a new download would go something like this:

startDownload = do
    file <- getQueuedFile
    push file currentDownLoads
    startDownloadThread file

But there's a point in time in which another thread, that refreshes the display, could then see an inconsistent state, where the file is in neither place. To fix this, you'd need to add lock checking around all accesses to the download queue and current downloads list, and lock them both here. (And be sure to always take the locks in the same order!)

But, it's worse than that, because how is getQueuedFile implemented? If the queue is empty, it needs to wait on a file being added. But how can a file be added the queue if we've locked it in order to perform this larger startDownload operation? What should be really simple code has become really complex juggling of locks.

STM deals with this in a much nicer way:

startDownload = atomically $ do
    file <- getQueuedFile
    push file currentDownLoads
    startDownloadThread file

Now the two operations are performed as one atomic transaction. It's not possible for any other thread to see an inconsistent state. No explicit locking is needed.

And, getQueuedFile can do whatever waiting it needs to, also using STM. This becomes part of the same larger transaction, in a way that cannot deadlock. It might be implemented like this:

getQueuedFile = atomically $
    if empty downloadQueue
        then retry
        else pop downloadQueue

When the queue is empty and this calls "retry", STM automatically waits for the queue to change before restarting the transaction. So this blocks until a file becomes available. It does it without any locking, and without you needing to tell explicitly tell STM what you're waiting on.

I find this beautiful, and am happier with it the more I use it in my code. Functions like getQueuedFile that run entirely in STM are building blocks that can be snapped together without worries to build more and more complex things.

For non-Haskell developers, STM is also available in Clojure, and work is underway to add it to gcc. There is also Hardware Transactional Memory coming, to speed it up. Although in my experience it's quite acceptably fast already.

However, as far as I know, all these other implementations of STM leave developers with a problem nearly as thorny as the original problem with locking. STM inherently works by detecting when a change is made that conflicts with another transaction, throwing away the change, and retrying. This means that code inside a STM transaction may run more than once.

Wait a second.. Doesn't that mean this code has a problem?

startDownload = atomically $ do
    file <- getQueuedFile
    push file currentDownLoads
    startDownloadThread file

Yes, this code is buggy! If the download thread is started, but then STM restarts the transaction, the same file will be downloaded repeatedly.

The C, Clojure, etc, STM implementations all let you write this buggy code.

Haskell, however, does not. The buggy code I showed won't even compile. The way it prevents this involves, well, monads. But essentially, it is able to use type checking to automatically determine that startDownloadThread is not safe to put in the middle of a STM transaction. You're left with no choice but to change things so the thread is only spawned once the transaction succeeds:

startDownload = do
    file <- atomically $ do
        f <- getQueuedFile
        push file currentDownLoads
        return f
    startDownloadThread file

If you appreciate that, you may want to check out some other #newinwheezy stuff like libghc-yesod-dev, a web framework that uses type checking to avoid broken urls, and also makes heavy use of threading, so is a great fit for using with STM. And libghc-quickcheck2-dev, which leverages the type system to automatically test properties about your program.

Syndicated 2013-05-02 16:54:04 from see shy jo

Template Haskell on impossible architectures

Imagine you had an excellent successful Kickstarter campaign, and during it a lot of people asked for an Android port to be made of the software. Which is written in Haskell. No problem, you'd think -- the user interface can be written as a local webapp, which will be nicely platform agnostic and so make it easy to port. Also, it's easy to promise a lot of stuff during a Kickstarter campaign. Keeps the graph going up. What could go wrong?

So, rather later you realize there is no Haskell compiler for Android. At all. But surely there will be eventually. And so you go off and build the webapp. Since Yesod seems to be the pinnacle of type-safe Haskell web frameworks, you use it. Hmm, there's this Template Haskell stuff that it uses a lot, but it only makes compiles a little slow, and the result is cool, so why not.

Then, about half-way through the project, it seems time to get around to this Android port. And, amazingly, a Haskell compiler for Android has appeared in the meantime. Like the Haskell community has your back. (Which they generally seem to.) It's early days and rough, lots of libraries need to be hacked to work, but it only takes around 2 weeks to get a port of your program that basically works.

But, no webapp. Cause nobody seems to know how to make a cross-compiling Haskell compiler do Template Haskell. (Even building a fully native compiler on Android doesn't do the trick. Perhaps you missed something though.)

At this point you can give up and write a separate Android UI (perhaps using these new Android JNI bindings for Haskell that have also appeared in the meantime). Or you can procrastinate for a while, and mull it over; consider rewriting the webapp to not use Yesod but some other framework that doesn't need Template Haskell.

Eventually you might think this: If I run ghc -ddump-splices when I'm building my Yesod code, I can see all the thousands of lines of delicious machine generated Haskell code. I just have to paste that in, in place of the Template Haskell that generated it, and I'll get a program I can build on Android! What could go wrong?

And you even try it, and yeah, it seems to work. For small amounts of code that you paste in and carefully modify and get working. Not a whole big, constantly improving webapp where every single line of html gets converted to types and lambdas that are somehow screamingly fast.

So then, let's automate this pasting. And so the EvilSplicer is born!

That's a fairly general-purpose Template Haskell splicer. First do a native build with -ddump-splices output redirected to a log file. Run the EvilSplicer to fix up the code. Then run an Android cross-compile.

But oh, the caveats. There are so many ways this can go wrong..

  • The first and most annoying problem you'll encounter is that often Template Haskell splices refer to hidden symbols that are not exported from the modules that define the splices. This lets the splices use those symbols, but prevents them being used in your code.

    This does not seem like a good part of the Template Haskell design, to be honest. It would be better if it required all symbols used in splices to be exported.

    But it can be worked around. Just use trial and error to find every Haskell library that does this, and then modify them to export all the symbols they use. And after each one, rebuild all libraries that depend on it.

    You're very unlikely to end up with more than 9 thousand lines of patches. Because that's all it took me..

  • The next problem (and the next one, and the next ...) is that while GHC's code output by -dump-splices (and indeed, by GHC error messages, etc) looks like valid Haskell code to the casual viewer, it's often not.

    To start with, it often has symbols qualified with the package and module name. ghc-prim:GHC.Types.: does not work well where code originally contained :.

    And then there's fun with multi-line strings, which sometimes cannot be parsed back in by GHC in the form it outputs them.

    And then there's the strange way GHC outputs case expressions, which is not valid Haskell at all. (It's missing some semicolons.)

    Oh, and there's the lambda expressions that GHC outputs with insufficient parentheses, leading to type errors at compile time.

    And so much more fun. Enough fun to give one the idea that this GHC output has never really been treated as code that could be run again. Because that would be a dumb thing to need to do.

  • Just to keep things interesting, the Haskell libraries used by your native GHC and your Android GHC need to be pretty much identical versions. Maybe a little wiggle room, but any version skew could cause unwanted results. Probably, most of the time, unwanted results in the form of a 3 screen long type error message.

    (My longest GHC error message seen on this odyessy was actually a full 500+ kilobytes in size. It included the complete text of Jquery and Bootstrap. At times like these you notice that GHC outputs its error messages o.n.e . c.h.a.r.a.c.t.e.r . a.t . a . t.i.m.e.)

Anyway, if you struggle with it, or pay me vast quantities of money, your program will, eventually, link. And that's all I can promise for now.


PS, I hope nobody will ever find this blog post useful in their work.

PPS, Also, if you let this put you off Haskell in any way .. well, don't. You just might want to wait a year or so before doing Haskell on Android.

Syndicated 2013-04-17 06:41:33 from see shy jo

upstream git repositories

Daniel Pocock posted The multiple repository conundrum in Linux packaging. While a generally good and useful post, which upstream developers will find helpful to understand how Debian packages their software, it contains this statement:

If it is the first download, the maintainer creates a new git repository. If it has been packaged before, he clones the repository. The important point here is that this is not the upstream repository, it is an independent repository for Debian packaging.

The only thing important about that point is that it highlights an unnecessary disconnect between the Debian developer and upstream development. One which upstream will surely find annoying and should certainly not be bothered with.

There is absolutely no technical reason to not use the upstream git repository as the basis for the git repository used in Debian packaging. I would never package software maintained in a git repository upstream and not do so.

The details are as follows:

  • For historical reasons that are continuingly vanishing in importance, Debian fetishises the tarballs produced by upstream. While upstreams increasingly consider them an unimportant distraction, Debian insists in hoarding and rolling around in on its nest of gleaming pristine tarballs.

    I wrote pristine-tar to facilitate this behavior, while also pointing fun at it, and perhaps introducing a weak spot with which to eventually slay this particular dragon. It is widely used within Debian.

    .. Anyway, the point is that it's no problem to import upstream's tarball into a clone if their git repository. It's fine if that tarball includes files not present in their git repository. Indeed, upstream can do this at release time if they like. Or Debian developers can do it and push a small quantity of data back to upstream in a branch.

  • Sometimes tagged releases in upstream git repositories differ from the files in their released tarballs. This is actually, in my experience, less due to autotools generated files, and more due to manual and imperfect release processes, human error, etc. (Arguably, autotools are a form of human error.)

    When this happens, and the Debian developer is tracking upstream git, they can quite easily modify their branch to reflect the contents of the tarball as closely as they desire. Or modify the source package uploaded to Debian to include anything left out of the tarball.

    My favorite example of this is an upstream who forgot to include their README in their released tarball. Not a made up example; as mentioned tarballs are increasingly an irrelevant side-show to upstreams. If I had been treating the tarball as canonical I would have released a package with no documentation.

  • Whenever Debian developers interact with upstream, whether it's by filing bug reports or sending patches, they're going to be referring to refs in the upstream git repository. They need to have that repository available. The closer and better the relationship with upstream, the more the DD will use that repository. Anything that pulls them away from using that repository is going to add friction to dealing with upstream.

    There have, historically, been quite a lot of sources of friction. From upstreams who choose one VCS while the DD preferred using another, to DDs low on disk space who decided to only version control the debian directory, and not the upstream source code. With disk space increasingly absurdly cheap, and the preponderance of development converging on git, there's no reason for this friction to be allowed to continue.

So using the upstream git repository is valuable. And there is absolutely no technical value, and plenty of potential friction in maintaining a history-disconnected git repository for Debian packaging.

Syndicated 2013-04-03 16:53:20 from see shy jo

Goodreads vs LibraryThing vs Free software

Four years ago I started using Goodreads to maintain the list of books I've read (which had lived in a flat text file for a decade+ before that).

Now it's been aquired by Amazon. I doubt it will survive in its current form for more than 2 years. Anyway, while Goodreads has been a quite good way to find what my friends are reading, I've been increasingly annoyed by the quality of its recommendations, and its paucity of other features I need. It really doesn't seem to help me keep up with new and interesting fiction at all, unless my friends happen to read it.

So I looked at LibraryThing. Actually, I seem to have looked at it several times before, since it had accounts named "joey", "joeyh", and "joeyhess" that were all mine. Which is what happens to me on sites that lack Openid or Browserid.

Digging a little deeper this time, I am finding its recommendations much better than Goodreads' -- although it seems to sometimes recommend books I've already read. And it has some nice features like tracking series, so you can easily tell when you've read all the books in a series or not. The analytics overall seem quite impressive. The UI is cluttered and it seems to take 5 clicks to add and rate a single book. It supports half stars.

Overall I get the feeling this was designed for a set of needs that doesn't quite match mine. For example, it seems it doesn't have a single database entry per book; instead each time I add a book, it seems to pull in data from primary sources (library of congress, Amazon cough) and treat this as a separate (but related) entry somehow. Weird. Perhaps this makes sense to say, librarians. I'm willing to adjust how I think about things if there's an underlying reason that can be grasped.

There's a quite interesting thread on LibraryThing where the founder says:

Don't say we should open-source the code. That would be a nightmare! And I have limited confidence in APIs. LibraryThing has the book geeks, but not so much the computers geeks.

I assume that the nightmare is that there would be dozens of clones of the site, all balkanized, with no data transfer, no federation between them.

Except, that's the current situation, as every Goodreads user who is now trying to use LibraryThing is discovering.

Before I ever started using Goodreads, I made sure it met my minimum criteria for putting my data into a proprietary silo: That I could get the data back out. I can, and have. LibraryThing can import it. But the import process loses data! And it's majorly clunky. If I want to continue using Goodreads due to its better UI, and get the data into LibraryThing, for its better analytics, I have to do periodic dumps and loads of CSV files with manual fixups.

This is why we have standards. This is why we're building federated social networks like status.net and the upcoming pump.io that can pass structured data between nodes transparently. It doesn't have to be a nightmare. It doesn't have to rely on proprietary APIs. We have the computer geeks.

Thing is, sites like GoodReads and LibraryThing need domain-specific knowledge, and communities to curate data, and stuff like that. Things that work well in a smallish company. (LibraryThing even has a business model that makes sense, yearly payments to store more books in it.)

With free software, it's much more appealing to sink the time we have into the most general-purpose solution we can. Why build a LibraryThing when we could build something that tracks not only books but movies and music? Why build that when we could build a generic federated network for structured social data? And that's great, as infrastructure, but if that infrastructure is only used to build a succession of proprietary data silos, what was the point?

So, could some computer & book geeks please build a free software alternative to these things, focused on books, that federates using any of the fine APIs we have available? Bear in mind that there is already a nice start at a comprehensive collection of book data in the Open Library. I'd happily contribute to a crowd funded project doing this.

Syndicated 2013-03-30 16:20:18 from see shy jo

difficulties in backing up live git repositories

But you can’t just tar.gz up the bare repositories on the server and hope for the best. Maybe a given repository will be in a valid state; maybe it won’t.

-- Jeff Mitchell in a followup to the recent KDE near git disaster

This was a surprising statement to me. I seem to remember that one of (many) selling points for git talked about back in the day was that it avoided the problem that making a simple cp (or backup) of a repository could lead to an inconsistent result. A problem that subversion repositories had, and required annoying commands to work around. (svnadmin $something -- iirc the backend FSFS fixed or avoided most of this issue.)

This prompted me to check how I handle it in ikiwiki-hosting. I must have anticipated a problem at some point, since ikisite backup takes care to lock the git repository in a way that prevents eg, incoming pushes while a backup is running. Probably, like the KDE developers, I was simply exercising reasonable caution.

The following analysis has probably been written up before (train; limited network availability; can't check), but here are some scenarios to consider:

  • A non-bare repository has two parts that can clearly get out of sync during a backup: The work tree and the .git directory.

    • The .git directory will likely be backed up first, since getdirent will typically return it first, since it gets created first . If a change is made to the work tree during that backup, and committed while the work tree is being backed up, the backup won't include that commit -- which is no particular problem and would not be surprising upon restore. Make commit again and get on with life.

    • However, if (part of) the work tree is backed up before .git, then any changes that are committed to git during the backup would not be reflected in the restored work tree, and git diff would show a reversion of those changes. After restore, care would need to be taken to reset the work tree (without losing any legitimate uncommitted changes).

  • A non-bare repository can also become broken in other ways if just the wrong state is snapshotted. For example, if a commit is in progress during a backup, .git/index.lock may exist, and prevent future commits from happening, until it's deleted. These problems can also occur if the machine dies at just the right time during a commit. Git tells you how to recover. (git could go further to avoid these problems than it does; for example it could check if .git/index.lock is actually locked using fcntl. Something I do in git-annex to make the .git/annex/index.lock file crash safe.)

  • A bare repository could be receiving a push (or a non-bare repository a pull) while the backup occurs. These are fairly similar cases, with the main difference being that a non-bare repository has the reflog, which can be used to recover from some inconsist states that could be backed up. Let's concentrate on pushes to bare repositories.

    • A pack could be in the process of being uploaded during a backup. The KDE developers apparently worried that this could result in a corrupt or inconsistent repository, but TTBOMK it cannot; git transfers the pack to a temp file and atomically renames it into place once the transfer is complete. A backup may include an excess temp file, but this can also happen if the system goes down while a push is in progress. Git cleans these things up.

    • A push first transfers the .git/objects, and then updates .git/refs. A backup might first back up the refs, and then the objects. In this case, it would lose the record that refs were pushed. After being restored, any push from another repository would update the refs, even using the objects that did get backed up. So git recovers from this, and it's not really a concern.

    • Perhaps a backup chooses to first back up the objects, and then the refs. In this case, it could back up a newly changed ref, without having backed up the referenced objects (because they arrived after the backup had finished with the objects). When this happens, your bare repository is inconsistent; you have to somehow hunt down the correct ref for the objects you do have.

      This is a bad failure mode. git could improve this, perhaps, by maintaining a reflog for bare repositories, which, in my limited testing, it does not do.

  • A "backup" of a git repository can consist of other clones of it. Which do not include .git/hooks/ scripts, .git/config settings, and potentially other valuable information, that strangely, we do not check into revision control despite having this nice revision control system available. This is the most likely failure mode with "git backups". :P

I think that it's important git support naive backups of git repositories as well as possible, because that's probably how most backups of git repositories are made. We don't all have time to carefully tune our backup systems to do something special around our git repositories to ensure we get them in a consistent state like the KDE project did, and as their experience shows, even if we do it, we can easily introduce other, unanticipated problems.

Can anyone else think of any other failure modes like these, or find holes in my slightly rushed analysis?


PS: git-annex is itself entirely crash-safe, to the best of my abilities, and also safe for naive backups. But inherits any problems with naive backups of git repositories.

Syndicated 2013-03-25 01:39:49 from see shy jo

514 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!