4 Mar 2012 joey   » (Master)

case study: adding box.com support to git-annex

git-annex has special remotes that allow large files checked into git to be stored in arbitrary places, that are not proper git remotes. One key use of the special remotes is to store files in The Cloud.

Until now the flagship special remote used Amazon S3, although a few other things like Archive.org, rsync.net, and Tahoe-Laffs can be made to work too. One of my goals is to add as many cloud storage options to git-annex as possible.

Box.com came to my attention because they currently have a promotion that provides 50 gigabytes of free "lifetime" service. Which is a nice amount of cloud storage to have for free. I decided that I didn't want to spend more than 4 hours of my time to make git-annex use it though. (I probably have spent a week on the S3 support by contrast.)

So, this is a case study in quickly adding support for one cloud storage provider to git-annex.

  • First, I had to sign up to box.com. Their promotion requires an android phone be used to get the 50 gigabytres. This wasted about an hour getting my unused phone dusted off etc. This also includes time spent researching ways to access box.com's storage, including reading their API documentation. I found it has a WebDAV interface.
  • Sadly, there is not yet a native WebDAV library for haskell. This is a shame, because it would make the implementation better. But, I'm confident someone will eventually write one. My experience with haskell libraries for other web APIs (S3, GitHub) is that it's an excellent language to write them in, the code tends to be very simple, concise and clear. But I can't do it in 4 hours. So for now, the workaround is to use a WebDAV mounting tool. I picked davfs2 as it was the first one I got to work with box.com's slightly broken WebDAV. 2 hours spent now.
  • With box.com mounted, I was neary done; git-annex's directory special remote can use the mount point. But there was a catch: box.com only allows up to 100 mb large files. I spent 1 hour or so adding support to the directory special remote for chunking files into a user-specified size.
    This was a fairly complex problem -- the existing code had a ByteString that when accessed lazily read the whole large file (from disk or from gpg, depending), and just called writeFile on it.
    I needed to still consume it lazily to avoid reading the whole file into memory, but write out chunks. This gets a bit into haskell's ByteString internals, but they're very well suited to this kind of thing, and so after 15 minutes familiarizing myself with the data structures, it was actually fairly easy to write the code. patch
  • I spent my last hour testing and tuning the box.com special remote. Using davfs2 as a quick fix caused some technical debt that I had to make up for. In particular, the chunked filename retrieval code had to make sure not to open every chunk at once, because that makes davfs2 try to cache them all, instead of streaming one at a time. patch
  • Not counted toward my 4 hour limit is the ... er ... 4 hours I spent last night adding a progress bar to the directory special remote. A progress display while transferring the files makes using box.com as a special remote much nicer, but also makes using my phone's SD card as a special remote much nicer! This is why I'm a poor consultant -- when faced with something generic and generally useful like this, I have difficulty billing for it.

The end result is that there are detailed instructions for using box.com as a special remote.

And it seems to work quite well now. I just set up my production box.com special remote. All content written to it is gpg encrypted, and various of my computers have access to it, each using their own gpg key to decrypt the files uploaded by the others. (git-annex's encryption feature makes this work really well!)

So..
There is a DropBox API for haskell. But as I'm not a customer, the 2 gb free account hardly makes it worth my while to make git-annex use it. Would someone like to fund my time to add a dropbox special remote to git-annex?

Syndicated 2012-03-04 17:28:59 from see shy jo

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!