26 Aug 2008 nbm   » (Journeyer)

Early adventures with Sitemaps

Perhaps entirely randomly, I decided that TechGeneral would need Sitemaps before I put it live.

A Sitemap (sometimes called a Google Sitemap, although you won't see Google calling it that, and it is a standard that Yahoo!, Ask, and Live all support) is an XML file (or a bunch of XML files) that describes the various resources on your web site, which allows search engines and other programs to discover them more easily.

There are a few advantages to putting together a Sitemap.  Generally, search engines give up after they travel a few links into a web site, to avoid following endless chains of automatically generated links (not necessarily because of malicious intent, but because of weird programming).  With a Sitemap, each listed resource can potentially be treated as if it were the first page visited.  Also, if a site has navigation that search engines can't traverse to get to certain pages, a Sitemap can help search engines find those resources.

Sitemaps can also optionally assign a priority to each resource, as a way to influence the importance assigned to that resource relative to other resources on your web site.  Similarly, an optional update frequency per resource can influence how often a search engine or other program checks back for new versions of that resource.  Optional last modified dates also help to determine whether to try to revisit a resource earlier or later than would normally happen.

Example Sitemap File

<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
        http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
 
    <url>
        <loc>http://techgeneral.org/diary</loc>
        <lastmod>2008-08-16T22:52:41+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.9</priority>
    </url>
 
    <url>
        <loc>http://techgeneral.org/speaking</loc>
        <lastmod>2008-08-16T22:52:13+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.9</priority>
    </url>
 
    <url>
        <loc>http://techgeneral.org/contact</loc>
        <lastmod>2008-08-10T16:59:32+00:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.9</priority>
    </url>
 
    <url>
        <loc>http://techgeneral.org/about</loc>
        <lastmod>2008-08-10T12:42:06+00:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.9</priority>
    </url>
</urlset>
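
A file like the one above doesn't have to be written by hand.  As a rough sketch (not what TechGeneral actually runs), Python's standard xml.etree.ElementTree module can produce it.  The function name build_urlset and the shape of the entries dictionaries are assumptions made for illustration, and the xsi:schemaLocation attribute is left out to keep things short.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries):
    """Return a Sitemap <urlset> document as a string.

    entries is an iterable of dicts with a required "loc" key and
    optional "lastmod", "changefreq" and "priority" keys
    (a hypothetical input format, chosen for this sketch).
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # <loc> is the only required child; the others are optional hints,
        # so they are only written when the entry provides them.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


print(build_urlset([
    {"loc": "http://techgeneral.org/diary",
     "lastmod": "2008-08-16T22:52:41+00:00",
     "changefreq": "daily",
     "priority": 0.9},
]))
```

Leaving out a key simply leaves out the element, which is fine since everything except the location is optional.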

There are two types of Sitemaps: individual Sitemap files and Sitemap Index files.  Why would you want a Sitemap Index?  One reason, less relevant to many, is that an individual Sitemap file can contain at most 50 000 URLs (which, admittedly, the average blog isn't going to hit) and must be no larger than 10MB uncompressed.  Another reason is that you might be using multiple systems that each generate Sitemap files (or you've hacked them to do so) but you don't want to merge them yourself.

Example Sitemap Index

<sitemapindex
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
        http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd">
 
    <sitemap>
        <loc>http://techgeneral.org/sitemap_posts.xml</loc>
    </sitemap>
 
    <sitemap>
        <loc>http://techgeneral.org/sitemap_archives.xml</loc>
    </sitemap>
 
    <sitemap>
        <loc>http://techgeneral.org/sitemap_pages.xml</loc>
    </sitemap>
</sitemapindex>
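
If a site did run into the 50 000 URL limit, the splitting could be sketched roughly as follows.  The names chunk_urls and build_index are made up for this example, and it only handles the URL count; it makes no attempt to check the 10MB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file URL limit mentioned above


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into lists of at most `size` items each."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(sitemap_urls):
    """Return a <sitemapindex> document listing the given Sitemap files."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    return ET.tostring(index, encoding="unicode")


print(build_index([
    "http://techgeneral.org/sitemap_posts.xml",
    "http://techgeneral.org/sitemap_archives.xml",
    "http://techgeneral.org/sitemap_pages.xml",
]))
```

Each chunk would be written out as its own Sitemap file, and the index then lists the locations of those files.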

One useful side-effect of using a Sitemap with Google's webmaster tools is that you can see errors that occur on resources listed in the Sitemap.  So, if a request for a resource starts returning 404 or 500 errors, you can separate that more specific set of errors from those caused by broken links on your site or on other sites.

However, Google's webmaster tools doesn't seem to like having a whole bunch of separate Sitemap files with a central Sitemap Index.  I mean, it seems to work, but it complains (warnings, not errors) that many of the Sitemaps (all on this site, most on my personal web site) only have entries with the same priority.  I'm setting the priority of all the archives low (they have noindex, follow set anyway, so won't show up in search results) and the frontpage high, and the posts get priorities based on age.
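
To see ahead of time which files would trip that warning, something like the following could be run over each Sitemap.  This is only a guess at the condition the warning describes, not Google's actual check, and distinct_priorities is a name invented for this sketch.  Entries without a priority element are counted at 0.5, the default the Sitemap protocol defines.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
DEFAULT_PRIORITY = "0.5"  # protocol default when <priority> is omitted


def distinct_priorities(sitemap_xml):
    """Return the set of priority values used in a Sitemap document."""
    root = ET.fromstring(sitemap_xml)
    values = set()
    for url in root.iter(NS + "url"):
        priority = url.find(NS + "priority")
        if priority is not None and priority.text:
            text = priority.text.strip()
        else:
            text = DEFAULT_PRIORITY
        values.add(float(text))
    return values


example = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://techgeneral.org/about</loc><priority>0.9</priority></url>"
    "<url><loc>http://techgeneral.org/contact</loc><priority>0.9</priority></url>"
    "</urlset>"
)
if len(distinct_priorities(example)) == 1:
    print("every entry in this Sitemap has the same priority")
```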

I get the feeling that the priorities only apply within the same file, and not across the whole site.  This somewhat makes sense, since you can delegate the Sitemap for a particular folder on your web site to someone else, and you wouldn't want an overeager person assigning "1.0" to all content within that folder, overriding your beautifully crafted values for the base site.  However, in this case, they're all at the same level, and I really want the archives lower than the posts, and the frontpage higher than most of the posts.
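
For what it's worth, the kind of scheme described here could look something like the sketch below.  The actual numbers and formula used on TechGeneral aren't given, so the linear fade, the constants, and the name post_priority are all assumptions picked for illustration.

```python
from datetime import datetime, timezone

# Hypothetical fixed values for the non-post pages.
FRONTPAGE_PRIORITY = 1.0
ARCHIVE_PRIORITY = 0.1


def post_priority(published, now=None, newest=0.8, oldest=0.3, fade_days=365):
    """Fade a post's priority linearly from `newest` to `oldest`.

    Posts older than `fade_days` stay at `oldest`, which still keeps
    them above ARCHIVE_PRIORITY.
    """
    now = now or datetime.now(timezone.utc)
    age_days = max((now - published).total_seconds() / 86400, 0)
    fraction = min(age_days / fade_days, 1.0)
    return round(newest - (newest - oldest) * fraction, 1)


now = datetime(2008, 8, 26, tzinfo=timezone.utc)
print(post_priority(datetime(2008, 8, 16, tzinfo=timezone.utc), now=now))
```

With these values the frontpage always sorts above the posts and the archives always sort below them, provided the priorities are compared across the whole site rather than per file.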

Oh well, I'll push on and see whether it's just a matter of warnings that aren't affecting things (my favourite kind) or an indication of things being as I suspect.

Syndicated 2008-08-26 16:50:44 from Neil Blakey-Milner
