7 Dec 2006 robogato   » (Master)

Advogato Status Report

A new rev of mod_virgule code went live today. See the changelog for the details.

I've added support for a couple of additional RSS variants with ever more unusual date stamp formats. In theory the RSS pubDate tag is supposed to use the date format described in RFC822. The first problem is that RFC822 allows a lot of variation. The second problem is that RFC822 specifies a two digit year, while most RSS feeds, for obvious reasons, use a four digit year. Mod_virgule's first line of defense is to call the Apache APR routine apr_date_parse_rfc(), which will parse all date strings that actually comply with RFC822, plus nine variants that are not strictly RFC822 compliant but are commonly seen in the wild. So far, at least one common blogging app, Blosxom, produces a pubDate field that is not RFC822 compliant and can't be parsed by apr_date_parse_rfc(). I've added a custom strptime() call that handles these dates. A patch for the Apache APR folks is in the works.
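The fallback idea can be sketched roughly as follows. This is only an illustration, not the mod_virgule code: it uses strptime() for every step rather than calling apr_date_parse_rfc() first, the function name parse_feed_date() is hypothetical, and the format strings are assumptions chosen for the example (the exact format Blosxom emits is not given here). It also assumes a glibc-style system that provides strptime() and timegm().

```c
#define _GNU_SOURCE /* must precede the includes; exposes strptime() and timegm() on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Illustrative fallback formats, tried in order. These are assumptions for
 * the sketch, not the formats mod_virgule or Blosxom actually use. */
static const char *const fallback_formats[] = {
    "%a, %d %b %Y %H:%M:%S", /* RFC822 layout with a four digit year */
    "%d %b %Y %H:%M:%S",     /* day name omitted */
    "%a, %d %b %Y %H:%M",    /* seconds omitted */
};

/* Returns seconds since the epoch, or (time_t)-1 if no format matches.
 * Any trailing time zone text is ignored and the time is read as UTC. */
static time_t parse_feed_date(const char *s)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < sizeof fallback_formats / sizeof fallback_formats[0]; i++) {
        struct tm tm;

        memset(&tm, 0, sizeof tm); /* fields a format omits stay zero */
        if (strptime(s, fallback_formats[i], &tm) != NULL)
            return timegm(&tm);
    }
    return (time_t)-1;
}
```

A string such as "Thu, 07 Dec 2006 14:05 GMT" fails the first two formats and is picked up by the third. A real implementation would also need to honor the time zone field, which this sketch skips.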

Some RSS feeds don't have a pubDate tag at all. Instead they have a date tag that contains an RFC3339 formatted date string rather than an RFC822 one. This is actually much nicer, since it's a slightly more sane format and is the same one used in Atom feeds, so we already have code for handling it.
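To show why RFC3339 is the easier format, here is a minimal parser sketch in portable C. It is not the code mod_virgule uses; the names days_from_civil() and parse_rfc3339() are hypothetical, and the sketch only accepts an uppercase "T" separator with either "Z" or a numeric offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Days between 1970-01-01 and the given Gregorian date (may be negative). */
static long days_from_civil(int y, int m, int d)
{
    long era, yoe, doy, doe;

    y -= (m <= 2); /* count Jan and Feb as the end of the previous year */
    era = (y >= 0 ? y : y - 399) / 400;
    yoe = y - era * 400;
    doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;
    doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + doe - 719468;
}

/* Parses e.g. "2006-12-07T14:05:09Z" or "2006-12-07T15:05:09+01:00".
 * Stores seconds since the epoch (UTC) in *out; returns 0 on success, -1 on error. */
static int parse_rfc3339(const char *s, long long *out)
{
    int y, mo, d, h, mi, sec, n = 0;
    long offset = 0;

    assert(s != NULL && out != NULL);
    if (sscanf(s, "%4d-%2d-%2dT%2d:%2d:%2d%n", &y, &mo, &d, &h, &mi, &sec, &n) != 6)
        return -1;
    if (mo < 1 || mo > 12 || d < 1 || d > 31 ||
        h < 0 || h > 23 || mi < 0 || mi > 59 || sec < 0 || sec > 60)
        return -1;
    s += n;
    if (*s == '.') { /* skip optional fractional seconds */
        s++;
        while (*s >= '0' && *s <= '9')
            s++;
    }
    if (*s == '+' || *s == '-') {
        int oh, om;

        if (sscanf(s + 1, "%2d:%2d", &oh, &om) != 2)
            return -1;
        offset = (oh * 3600L + om * 60L) * (*s == '-' ? -1 : 1);
    } else if (*s != 'Z') {
        return -1;
    }
    *out = days_from_civil(y, mo, d) * 86400LL + h * 3600L + mi * 60L + sec - offset;
    return 0;
}
```

Because the fields are fixed width and always in the same order, one sscanf() call covers the date and time, and only the offset needs separate handling.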

Speaking of Atom, the mod_virgule aggregator now supports the old, deprecated Atom v0.3 feeds in addition to the current Atom v1.0 standard.

So here's what we support right now:

  • Atom 0.3
  • Atom 1.0
  • RSS 0.91 *(only if optional pubDate or date tags are included)
  • RSS 0.92 *(only if optional pubDate or date tags are included)
  • RSS 2.0
  • RDF Site Summary 0.9 *(untested)
  • RDF Site Summary 1 *(all variants seen so far work)
  • RDF Site Summary 1.1 *(untested)

I wish I could support the RSS 0.91/0.92 feeds that don't have any sort of time or date stamps at all, but it would require some reworking of the code in the aggregator that sorts out which posts are new and which have been seen before. In most cases RSS 0.91/0.92 allows the use of both date and pubDate, so if you make sure those tags are included, things should work fine. Otherwise, your best bet is to use something a little more recent like RSS 2.0 or Atom 1.0.

The other update this week was a performance improvement. Each hour the trust metric and blog interest eigenvector ratings are recalculated. The eigenvector recalculation takes several minutes to complete. In the past the process held a read lock on the XML database for the whole run, preventing any other process from taking a write lock. This caused some operations on Advogato to block (such as clicking on the "Read more..." link of articles, which writes an update to the user's "last read" pointers). This problem is now fixed. On each iteration, the eigenvector processing now releases the read lock, gives up its time slice, and then re-acquires the lock. The total processing time is slightly longer (from 3 minutes to 3.25 minutes), but the site should seem significantly less sluggish at the top of the hour when the update runs, and it can be used normally during that time without feeling slow.
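The release-and-yield pattern looks roughly like this. The sketch assumes an flock() style advisory lock on a file descriptor, which may not be the locking mechanism mod_virgule actually uses; run_with_lock_breaks(), iteration_fn, and count_step() are hypothetical names made up for the example.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <sys/file.h>

/* Hypothetical callback: performs one iteration of the long computation. */
typedef void (*iteration_fn)(int iteration, void *ctx);

/* Trivial stand-in for one eigenvector iteration; it only counts calls. */
static void count_step(int iteration, void *ctx)
{
    (void)iteration;
    ++*(int *)ctx;
}

/* Holds a shared (read) lock only for the duration of each iteration, then
 * drops it and yields so that a waiting writer gets a chance to run.
 * Returns 0 on success, -1 if locking or unlocking fails. */
static int run_with_lock_breaks(int lock_fd, int iterations,
                                iteration_fn step, void *ctx)
{
    int i;

    assert(step != NULL);
    for (i = 0; i < iterations; i++) {
        if (flock(lock_fd, LOCK_SH) != 0) /* take the read lock */
            return -1;
        step(i, ctx);
        if (flock(lock_fd, LOCK_UN) != 0) /* release it between iterations */
            return -1;
        sched_yield();                    /* give up the rest of the time slice */
    }
    return 0;
}
```

The trade-off is the one described above: each iteration pays a little extra for the lock round trip and the yield, but writers never wait longer than a single iteration.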

