15 Sep 2008 nbm   » (Journeyer)

Further adventures in Sitemaps

Sitemap by Brian Talbot, CC BY NC

While the two Sitemap formats are straightforward, deciding what data to put into the templates is not always obvious.

There are three main types of metadata about sitemaps and URLs:

  • Last modification time
  • Change Frequency
  • Priority
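
To make the three fields concrete, here is a minimal Python sketch of a single URL entry carrying all of them. The function name `url_entry` and the example URL are purely illustrative; only the element names (`loc`, `lastmod`, `changefreq`, `priority`) come from the Sitemap protocol. A real sitemap would also wrap such entries in a `urlset` element with the sitemaps.org namespace, which is left out here for brevity.

```python
import xml.etree.ElementTree as ET


def url_entry(loc, lastmod, changefreq, priority):
    """Build one <url> entry as a string (hypothetical helper)."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod        # e.g. "2008-09-15"
    ET.SubElement(url, "changefreq").text = changefreq  # e.g. "monthly"
    # Priority is written with one decimal place, e.g. "0.9".
    ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(url, encoding="unicode")


entry = url_entry("http://example.com/about", "2008-09-15", "monthly", 0.9)
```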

Last modified time

squared circles - Clocks by Leo Reynolds, CC BY NC SA

Last modified time of sitemaps

Setting the last modified time on a sitemap allows consumers of the sitemap index to not download the referenced sitemap again if they've already got an up-to-date sitemap.  Getting this wrong (say, by always giving the same last modified time) may mean consumers of your sitemap index will try the referenced sitemaps less often than they should.

The last modified time for a sitemap for a web log will probably be the most recent last modified time of the posts.  Depending on whether the comments constitute valuable content, the last modified time of comments on the posts may be useful too.
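
As a rough sketch of that rule in Python: take the newest post timestamp, and optionally fold in comment timestamps when comments count as content. The function name `sitemap_lastmod` and its parameters are assumptions made for illustration, not code from any particular sitemap generator.

```python
from datetime import datetime, timezone


def sitemap_lastmod(post_times, comment_times=(), include_comments=False):
    """Return the newest timestamp as an ISO 8601 string, or None if empty."""
    candidates = list(post_times)
    if include_comments:
        # Only count comments when they are considered valuable content.
        candidates.extend(comment_times)
    if not candidates:
        return None
    return max(candidates).isoformat()


posts = [
    datetime(2008, 9, 1, tzinfo=timezone.utc),
    datetime(2008, 9, 10, tzinfo=timezone.utc),
]
comments = [datetime(2008, 9, 14, tzinfo=timezone.utc)]

without_comments = sitemap_lastmod(posts, comments)
with_comments = sitemap_lastmod(posts, comments, include_comments=True)
```

With the sample data, including comments moves the sitemap's last modified time from the 10th to the 14th.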

Last modified time of URLs

As with sitemaps in sitemap indices, the last modification time for URLs listed in a sitemap is pretty easy: it is the last time that particular URL's content changed.  For a CMS page or web log post, it would usually be the time of the last edit.  For a post, the time of the last comment is also relevant.

Complications with last modified

Things get a bit murky if you change your web site's style, though: the HTML output has changed, but the most relevant content hasn't.  If your style change significantly affects the navigation potential or relevance of content, it may be worthwhile updating the last modification time.

Things are also complicated on pages that aggregate content from elsewhere.  Take, for example, page two of the archives for March 2008 on a web log.  The "correct" answer there is probably the most recent last updated time of any post originally posted in March 2008.  But if you change from having full content to summary content per post, or remove any content per post, or add tags to your content, or otherwise change navigation or content relevance, then you might want to update the last modified time for all archives pages to when you made the style change.

Change frequency

Toronto subway frequency by Elijah van der Giessen, CC BY NC

Change frequency is (currently) unique to URLs in a sitemap.  It's an opportunity to tell consumers of your sitemap how often you think the content at that URL changes.  Valid values are:

  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

It isn't yet obvious how seriously search engines (for example) take these values.  I imagine that if you say that all your URLs change hourly, then you probably won't get any change in their behaviour.  However, it can help reduce the amount of spider traffic that older pages get, and if consumers trust you, may get some of your pages checked for changes more often.

Determining change frequency of URLs

The change frequency of a front page will probably be hourly.  Similarly, an archives page for the current day, month, year, or all time would be hourly.  The change frequency for an archives page for previous days, months, or years could potentially be considered "never" or "yearly", but you can always set it to "monthly" if you're worried about such long periods of time.  (The sitemap consumer will watch the last modified time of the entry in your sitemap anyway, and will probably try to visit that content more often than that just in case.)

The change frequency for a post on a web log or a news article depends on a few things.  For example, if you use "related posts" or "related stories", you may not want to use values such as "never" or "yearly" even for posts from years back.  If you allow comments, you may similarly want to avoid those values.

The most important indicator of likely change frequency in standard cases is probably how long it has been since a particular page has changed.  In GibeSitemap, I use a relatively naive algorithm:

  • If the content has changed in the last three days, the change frequency is hourly.
  • If changed in the last 15 days, daily.
  • If changed in the last 45 days, weekly.
  • Older, monthly.
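
The thresholds above can be sketched in a few lines of Python. This is an illustration of the rule as described, not the actual GibeSitemap source; the function name `change_frequency` and the choice to treat each boundary as inclusive are assumptions.

```python
from datetime import datetime, timedelta


def change_frequency(last_changed, now):
    """Map the age of the last change to a sitemap changefreq value."""
    age = now - last_changed
    # Boundaries are treated as inclusive here; that detail is an assumption.
    if age <= timedelta(days=3):
        return "hourly"
    if age <= timedelta(days=15):
        return "daily"
    if age <= timedelta(days=45):
        return "weekly"
    return "monthly"


now = datetime(2008, 9, 15)
recent = change_frequency(datetime(2008, 9, 14), now)  # 1 day old
old = change_frequency(datetime(2008, 1, 1), now)      # months old
```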

Priority

Changed priorities ahead by Peter Reed, CC BY NC SA

The priority of a page signals how valuable and relevant the content on that URL is likely to be to the consumer, relative to other pages on your web site.  Priority can run from 0.0 (low) to 1.0 (high).  Your front page is likely to have a very high priority (say, 1.0).  A web log "About" page is probably one of the highest priority pages (say, 0.9).

Determining priority of URLs

For a CMS with a hierarchical path structure, you can use a simple algorithm to determine priority: the fewer folders between the site root and the page, the more important it likely is.  For the Gibe Pages plugin, pages at the top level are given 0.9, losing 0.1 for each additional folder, down to a minimum of 0.6.  So:

  • /about : 0.9
  • /about/team : 0.8
  • /about/team/neil : 0.7
  • /about/team/neil/interests : 0.6
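
A minimal Python sketch of this depth rule follows. It is not the Gibe Pages plugin's code; the function name `path_priority` is hypothetical, and the handling of the bare root path (treated like a top-level page) is an assumption, since the front page would normally get its own priority.

```python
def path_priority(path):
    """Priority from path depth: 0.9 at the top level, minus 0.1 per folder, floor 0.6."""
    segments = [s for s in path.split("/") if s]
    # Assumption: an empty path ("/") is treated like a top-level page.
    depth = max(len(segments), 1)
    # Work in integer tenths to avoid floating-point drift such as 0.7000000000000001.
    tenths = max(9 - (depth - 1), 6)
    return tenths / 10


priorities = {
    p: path_priority(p)
    for p in ["/about", "/about/team", "/about/team/neil", "/about/team/neil/interests"]
}
```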

Web log or news archives pages should not have a high priority, since the content on them is more relevant in the individual posts.  A value of 0.1 is appropriate.

For web log posts or news articles, priority depends on a number of factors.  For example, you may want to give existing popular posts or articles a high priority, so that people are more likely to find them when searching.  You may want to set posts with a particular tag or articles in a particular section to have higher or lower priority.

For the basic case, though, you can probably just use the publishing date or last modification time to help determine the priority.  More recent posts and news are probably more relevant (on your site) than older ones.  You might want to use a simple algorithm like the one I used on Gibe:

  • If the publish date is within the last 15 days, priority of 0.9
  • last month, 0.8
  • last three months, 0.7
  • last half-year, 0.6
  • last year, 0.5
  • last two years, 0.4
  • older, 0.3
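
Expressed as a Python sketch, the age bands become a small lookup table. The exact day counts used for "last month", "last three months", and so on (30, 90, 182, 365, 730) are assumptions, as is the function name `post_priority`; this is not the code used in Gibe.

```python
from datetime import datetime

# (maximum age in days, priority); day counts for each band are assumed.
AGE_PRIORITIES = [
    (15, 0.9),
    (30, 0.8),
    (90, 0.7),
    (182, 0.6),
    (365, 0.5),
    (730, 0.4),
]


def post_priority(published, now):
    """Return a sitemap priority based on how long ago a post was published."""
    age_days = (now - published).days
    for max_days, priority in AGE_PRIORITIES:
        if age_days <= max_days:
            return priority
    return 0.3  # anything older than two years


now = datetime(2008, 9, 15)
fresh = post_priority(datetime(2008, 9, 10), now)   # 5 days old
ancient = post_priority(datetime(2005, 1, 1), now)  # more than two years old
```

Keeping the bands in a table makes it easy to adjust them, or to combine the result with other signals such as tags or popularity.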

Syndicated 2008-09-15 08:47:02 from Neil Blakey-Milner
