28 Jul 2013 joey

git-annex as a podcatcher

As a Sunday diversion, I wrote 150 lines of code and turned git-annex into a podcatcher!

I've been using hpodder, a podcatcher written in Haskell. But John Goerzen hasn't had time to maintain it, and it fell out of Debian a while ago. John suggested I maintain it, but I have not found the time, and it'd be another mass of code for me to learn and worry about.

Also, hpodder has some misfeatures common to the "podcatcher" genre:

  • It has some kind of database of feeds and what files have been downloaded from them. And this requires an interface around adding feeds, removing feeds, changing urls, etc.
  • Because it uses a database, there's no particularly good way to run it against the same feeds on multiple computers and sync the results.
  • It doesn't use git annex addurl to register the url a file came from. So when I check files into git-annex after the fact, they're missing that useful metadata, and I can't just git annex get them to re-download them from the podcast (see the example after this list).
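
For comparison, here's roughly what registering that metadata by hand looks like; the filename and url are invented for illustration:

  git annex addurl --file=some_episode.mp3 http://example.com/podcast/episode1.mp3
  git annex drop some_episode.mp3   # done listening; free the disk space
  git annex get some_episode.mp3    # re-downloads it from the recorded url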

So, here's a rethink of the podcatcher genre:

  cd annex; git annex importfeed http://url/to/podcast http://another/podcast

There is no database of feeds at all. Although of course you can check a list of them right into the same git repository, next to the files it adds. git-annex already keeps track of urls associated with content, so it reuses that to know which urls it's already downloaded. So when you're done with a podcast file and delete it, it won't download it again.
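
For example, the list of feeds could be a plain text file checked into the same repository and passed to importfeed; the "feeds" filename is just a convention I'm assuming here, not anything importfeed looks for:

  # feeds: one podcast url per line, versioned alongside the episodes
  xargs git annex importfeed < feeds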

This is a podcatcher that doesn't need to actually download podcast files! With --fast, it only records the existence of files in git, so git annex get will download them from the web (or perhaps from a nearer location that git-annex knows about).
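
So a "subscribe now, download later" workflow might look something like this; the url and the episode filename are hypothetical:

  git annex importfeed --fast http://url/to/podcast
  # later, fetch only the episodes you actually want to listen to
  git annex get Some_Podcast/Some_Episode.mp3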

Took just 3 hours to write, and that's including full control over the filenames it uses (--template='${feedtitle}/${itemtitle}${extension}') and automatic resuming of interrupted downloads. Most of what I needed was already available in git-annex's utility libraries or Hackage.
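
To illustrate the template, a command like this files each episode under a per-feed directory; the feed and episode names shown are invented, and the exact filename munging is up to git-annex:

  git annex importfeed --template='${feedtitle}/${itemtitle}${extension}' http://url/to/podcast
  # yields paths along the lines of: My_Favorite_Podcast/Episode_42.mp3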

Technically, the only part of this that was hard at all was efficiently querying the git repository for a list of all known urls. I found a pretty fast way to do it, but might add a local cache file later on.
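
A rough shell sketch of the same idea, assuming urls are recorded in *.log.web files on the git-annex branch (this is only an illustration, not the actual implementation):

  # dump every url log on the git-annex branch; the urls can be picked out of these lines
  git ls-tree -r --name-only git-annex \
    | grep '\.log\.web$' \
    | while read f; do git show "git-annex:$f"; done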

Syndicated 2013-07-28 21:03:02 from see shy jo
