22 Apr 2012 joey   » (Master)

moving my email archives and packages to git-annex

I've recently been moving some important data into git-annex, and finding it simplifies things while also increasing my flexability.

email archives

I've kept my email archives in git for years. This works ok, just choose the right file format (compressed mbox) and number of files (one archive per mailbox per month or so) and git can handle this well enough, as email is not really large.

But, email is not really small either. Keeping my email repository checked out on my netbook consumes 2 gigabytes of its 30 gigabyte SSD, half of which is duplication in .git. Also, I have only kept it at 2 gigabytes through careful selection of what classes of mail I archive. That made sense when archival disk was more expensive, but what makes sense these days is to archive everything.

For a while I've wanted to have a "raw" archive, that includes all email I receive. (Even spam.) This protects against various disasters in mail filtering or reading. Setting that up was my impetus for switching my mail archives to git-annex today.

The new system I've settled on is to first copy all incoming mail into a "raw" maildir folder. Then mailfilter sorts it into the folders I sync (with offlineimap) and read. Each day, the "raw" folder is moved into a mbox archive, and that's added to the git annex. Each month, the mail I've read is moved into a monthly archives, and added to the git annex. A simple script does the work.

I counted the number of copies that existed of my mail when it was stored in git, and found 7 copies spread amoung just 3 drives. I decided to slim that back, and configured git-annex to require only 5 copies. But those 5 copies will spread amoung more drives, including several offline archival drives, so it will be more robust overall.

My netbook will have an incomplete checkout of my mail, omitting the "raw" archive. If I need to peek inside a spam folder for a lost mail, I can quickly pull it down; if I need to free up space I can quickly drop older archives. This is the flexability that git-annex fans love. :)

By the way, this also makes it easier to permanantly delete mail, when you really need to (ie, for contractual reasons). Before, I'd have to do a painful git-filter-branch if I needed to get rid of eg, mail for old jobs. Now I can git annex drop --force.

Pro Tip: If you're doing this kind of migration to git-annex, you can save bandwidth by not re-transferring files to machines that already have a copy. I ran this command on my netbook to inject the archives it had in the old repository into the new repository, verifying checksums as it goes:

  cd ~/mail/archive; find -type l -exec git annex reinject ~/mail.old/archive/{} {} \;

debian packages

I'd evolved a complex and fragile chain of personal apt repositories to hold Debian packages I've released. I recently got rid of the mess, which looked like this: dput → local mini-dinstall repo → dput→mini-dinstallrepo on my server →dput` → Debian

The point of all that was that I could "upload" a package locally while offline and batch transfer it later. And I had a local and a public apt repository of just the packages I've uploaded. But these days, packages uploaded to Debian are available nearly immediately, so there's not much reason to do that.

My old system also had a problem: It only kept the most recent single copy of each package. Again, disk is cheap, so I'd rather have archives of everything I have uploaded. Again I switched to git-annex.

My new system is simplicity itself. I release a package by checking it into a "toupload" directory in my git annex repository on my netbook. Items in that directory are dput to Debian and moved to "released". I have various other clones of that repository, which I git annex move packages to periodically to free up SSD space. In the rare cases when I build a package on a server, I check it into the clone on the server, and again rely on git-annex to copy it around.

Now, does anyone know a good way to download a copy of every package you're ever released from archive.debian.org? (Ideally as a list of urls I can feed to git annex addurl.)


My email and Debian packages were the last large files I was not storing in git-annex. Even backups of my backups end up checked into git-annex and archived away.

Now that I'm using git-annex in every place I can, my goal with it is to make it as easy as possible for as many of you to use it as possible, too. I have some inotify tricks up my sleeve that seem promising. Kickstarter may be involved. Watch this space!

Syndicated 2012-04-22 20:11:38 from see shy jo

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!