Older blog entries for nbm (starting at number 121)

In San Francisco in October

Visa-willing, I'll be in San Francisco for about three weeks from early October.  The SynthaSite Cape Town office is heading over to the San Francisco office for a mix of team training, team building, end-of-year partying, and planning sessions.

My last trip to San Francisco in May/June included Google I/O and a Pylons/TG2/WSGI sprint, and I really enjoyed being in the company of geeks.  This time around, it doesn't seem like there are any good conferences to squeeze in or stay around for and so far my only plans are to attend the Bay Area Python Interest Group with Jonathan.

Are there any interesting tech events happening in October in or around San Francisco I should try to attend?

Syndicated 2008-09-22 11:58:24 from Neil Blakey-Milner

Further adventures in Sitemaps

Sitemap by Brian Talbot, CC BY NC

While the two Sitemap formats are straightforward, deciding on the data to put into the templates is not always obvious.

There are three main types of metadata about sitemaps and URLs:

  • Last modification time
  • Change Frequency
  • Priority

Last modified time

squared circles - Clocks by Leo Reynolds, CC BY NC SA

Last modified time of sitemaps

Setting the last modified time on a sitemap allows consumers of the sitemap index to not download the referenced sitemap again if they've already got an up-to-date sitemap.  Getting this wrong (say, by always giving the same last modified time) may mean consumers of your sitemap index will try the referenced sitemaps less often than they should.

The last modified time for a sitemap for a web log will probably be the most recent last modified time of the posts.  Depending on whether the comments constitute valuable content, the last modified time of comments on the posts may be useful too.
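
As a rough illustration of that rule, here is a minimal sketch.  The function name, its parameters, and the sample dates are all hypothetical; a real web log would pull these times from its own database.

```python
from datetime import datetime

def sitemap_lastmod(post_times, comment_times=(), include_comments=False):
    """Return the most recent modification time, or None if nothing is listed."""
    candidates = list(post_times)
    if include_comments:
        # Only count comments when they are considered valuable content.
        candidates.extend(comment_times)
    return max(candidates) if candidates else None

# Hypothetical sample data: two posts and one later comment.
posts = [datetime(2008, 9, 1, 10, 0), datetime(2008, 9, 14, 8, 30)]
comments = [datetime(2008, 9, 15, 7, 45)]
print(sitemap_lastmod(posts, comments, include_comments=True))
```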

Last modified time of URLs

As with sitemaps in sitemap indices, last modification time for URLs listed in a sitemap is pretty easy: it is the last time that particular URL's content changed.  For a CMS page or web log post, it would usually be the time of the last edit.  For a post, the time of the last comment is relevant too.

Complications with last modified

Things get a bit murky if you change your web site's style though — the HTML output has changed, but the most relevant content hasn't.  If your style change majorly affects the navigation potential or relevance of content, it may be worthwhile updating the last modification time.

Things are also complicated on pages that aggregate content from elsewhere.  For example, page two of the archives for March 2008 on a web log.  The "correct" answer to that is probably the most recent last updated time of the posts originally posted in March 2008.  But if you change from having full-content to summary content per post, or remove any content per post, or add tags to your content, or otherwise change navigation or content relevance, then you might want to update the last modified time for all archives pages to when you made the style change.

Change frequency

Toronto subway frequency by Elijah van der Giessen, CC BY NC

Change frequency is (currently) unique to URLs in a sitemap.  It's an opportunity to tell consumers of your sitemap how often you think the content at that URL changes.  Valid values are:

  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

It isn't yet obvious how seriously search engines (for example) take these values.  I imagine that if you say that all your URLs change hourly, then you probably won't get any change in their behaviour.  However, it can help reduce the amount of spider traffic that older pages get, and if consumers trust you, may get some of your pages checked for changes more often.

Determining change frequency of URLs

The change frequency of a front page will probably be hourly.  Similarly, an archives page for the current day, month, year, or all time would be hourly.  The change frequency for an archives page for previous days, months, or years could potentially be considered "never" or "yearly", but you can always set it to "monthly" if you're worried about such long periods of time.  (The sitemap consumer will watch the last modified time of the entry in your sitemap anyway, and will probably try visit that content more often than that, just in case.)

The change frequency for a post on a web log or a news article depends on a few things.  For example, if you use "related posts" or "related stories", you may not want to use values such as "never" or "yearly" even for posts from years back.  If you allow comments, you may similarly want not to use those values.

The most important indicator of likely change frequency in standard cases is probably how long it has been since a particular page has changed.  In GibeSitemap, I use a relatively naive algorithm:

  • If the content has changed in the last three days, the change frequency is hourly.
  • If changed in the last 15 days, daily.
  • If changed in the last 45 days, weekly.
  • older, monthly.
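
The thresholds above can be sketched in a few lines.  This is not GibeSitemap's actual code; the function name is made up for illustration, and treating each boundary as inclusive is an assumption.

```python
from datetime import datetime, timedelta

def change_frequency(last_modified, now):
    """Map the age of the last change to a sitemap changefreq value."""
    age = now - last_modified
    if age <= timedelta(days=3):
        return "hourly"
    if age <= timedelta(days=15):
        return "daily"
    if age <= timedelta(days=45):
        return "weekly"
    return "monthly"

# Hypothetical reference time for the example.
now = datetime(2008, 9, 15)
print(change_frequency(now - timedelta(days=10), now))
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the rule easy to test.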


Priority

Changed priorities ahead by Peter Reed, CC BY NC SA

The priority of a page signals how valuable and relevant the content on that URL is likely to be to the consumer, relative to other pages on your web site.  Priority can run from 0.0 (low) to 1.0 (high).  Your front page is likely to have a very high priority (say, 1.0).  A web log "About" page is probably one of the highest priority pages (say, 0.9).

Determining priority of URLs

For a CMS with a hierarchical path structure, you can use a simple algorithm to determine priority — the fewer folders between the site root and the page, the more important it likely is.  For the Gibe Pages plugin, pages at the top level are given 0.9, losing 0.1 for each folder until a lowest value of 0.6.  So:

  • /about : 0.9
  • /about/team : 0.8
  • /about/team/neil : 0.7
  • /about/team/neil/interests : 0.6
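
A minimal sketch of that depth rule follows.  It is not the Gibe Pages plugin's code; the function name is hypothetical, and the arithmetic is done in tenths purely to sidestep floating-point rounding.

```python
def page_priority(path):
    """Priority from folder depth: 0.9 at the top level, minus 0.1 per folder, floor 0.6."""
    segments = [s for s in path.split("/") if s]
    # Number of folders between the site root and the page itself.
    depth = max(len(segments) - 1, 0)
    # Work in tenths so that 0.9 - 0.1 * depth stays exact.
    tenths = max(9 - depth, 6)
    return tenths / 10.0

for path in ("/about", "/about/team", "/about/team/neil", "/about/team/neil/interests"):
    print(path, page_priority(path))
```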

Web log or news archives pages should not have remotely high priority, since the content on them is more relevant in the individual posts.  A value of 0.1 is appropriate.

For web log posts or news articles, priority depends on a number of factors.  For example, you may want to set existing popular posts or articles with a high priority, so that people are more likely to find that post or article when searching for them.  You may want to set posts with a particular tag or articles in a particular section to have higher or lower priority.

For the basic case, though, you can probably just use the publishing date or last modification time to help determine the priority.  More recent posts and news are probably more relevant (on your site) than older ones.  You might want to use a simple algorithm like the one I used on Gibe:

  • If the publish date is within the last 15 days, priority of 0.9
  • last month, 0.8
  • last three months, 0.7
  • last half-year, 0.6
  • last year, 0.5
  • last two years, 0.4
  • older, 0.3
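
Expressed as code, the list above might look something like the following.  This is a sketch rather than Gibe's implementation: the names are hypothetical, and the day counts chosen for "month", "three months", "half-year", and so on are assumptions.

```python
from datetime import datetime, timedelta

# (maximum age in days, priority) pairs, checked in order.
AGE_PRIORITIES = [
    (15, 0.9),
    (30, 0.8),
    (90, 0.7),
    (182, 0.6),
    (365, 0.5),
    (730, 0.4),
]

def post_priority(published, now):
    """Return a sitemap priority that decays with the age of the post."""
    age_days = (now - published).days
    for max_days, priority in AGE_PRIORITIES:
        if age_days <= max_days:
            return priority
    return 0.3

# Hypothetical reference time for the example.
now = datetime(2008, 9, 15)
print(post_priority(now - timedelta(days=60), now))
```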

Syndicated 2008-09-15 08:47:02 from Neil Blakey-Milner

Early adventures with Sitemaps

Perhaps entirely randomly, I decided that TechGeneral would need Sitemaps before I put it live.

A Sitemap (sometimes called a Google Sitemap, although you won't see Google calling it that, and it is a standard that Yahoo!, Ask, and Live all support) is an XML file (or bunch of XML files) that describes the various resources on your web site, which allows search engines and other programs to discover them more easily.

There are a few advantages to putting together a Sitemap.  Generally, search engines give up after they travel a few links into a web site to avoid infinite automatically generated links (not because of malicious intent necessarily, but because of weird programming).  With a Sitemap, each listed resource can potentially be treated as a first visit.  Also, if a site has navigation that search engines can't traverse to get to certain pages, Sitemaps can assist search engines to find those resources.

They also optionally assign a priority to each resource as a way to influence the importance assigned to the resource relative to other resources on your web site.  Similarly, an optional update frequency per resource can influence how often a search engine or other program should check back for new versions of that resource.  Last modified dates also optionally help to determine whether to try revisit a resource earlier or later than would normally happen.

Example Sitemap File
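
One way to produce such a file is with Python's standard library.  The sketch below emits the `loc`, `lastmod`, `changefreq`, and `priority` elements defined by the Sitemap protocol; the function name and the example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """entries: dicts with 'loc' and optional 'lastmod', 'changefreq', 'priority'."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Emit only the elements that were supplied, in protocol order.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")

sitemap_xml = build_sitemap([
    {"loc": "http://example.com/", "lastmod": "2008-08-26",
     "changefreq": "hourly", "priority": 1.0},
    {"loc": "http://example.com/archives/2008/03/",
     "changefreq": "monthly", "priority": 0.1},
])
print(sitemap_xml)
```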


There are two types of Sitemaps - individual Sitemap files and Sitemap Index files.  Why would you want a Sitemap Index?  One reason, less relevant to many, is that an individual Sitemap file can contain at most 50 000 URLs (which, admittedly, the average blog isn't going to hit) and must be less than 10MB uncompressed.  Another reason is that you might be using multiple systems that each generate Sitemap files (or you've hacked them to do so) but you don't want to merge them yourself.

Example Sitemap Index
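
A Sitemap Index uses the same namespace, with a `sitemapindex` root and one `sitemap` element per referenced file.  A minimal sketch, again with a hypothetical function name and example.com URLs:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap_index(sitemaps):
    """sitemaps: iterable of (loc, lastmod) pairs; lastmod may be None."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in sitemaps:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        if lastmod is not None:
            # Lets consumers skip sitemaps they already have an up-to-date copy of.
            ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="unicode")

index_xml = build_sitemap_index([
    ("http://example.com/sitemap-posts.xml", "2008-08-26"),
    ("http://example.com/sitemap-pages.xml", None),
])
print(index_xml)
```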


One useful side-effect of using a Sitemap with Google's webmaster tools is that you can see errors that occur on resources listed in the Sitemap.  So, if a request for a resource starts returning 404 or 500 errors, you can separate that more specific set of errors from those caused by broken links on your site or on other sites.

However, Google's webmaster tools doesn't seem to like having a whole bunch of separate Sitemap files with a central Sitemap Index.  I mean, it seems to work, but it complains (warnings, not errors) that many of the Sitemaps (all on this site, most on my personal web site) have only entries with the same priority.  I'm setting the priority of all the archives low (they have noindex, follow set anyway, so won't show up in search results), the frontpage high, and the posts are prioritised by age.

I get the feeling that the priorities only apply within the same file, and not within the same site.  This somewhat makes sense, since one can delegate a sitemap for a particular folder on your web site, and you wouldn't want an overeager person assigning "1.0" to all content within the folder, overriding your beautifully crafted values for the base site.  However, in this case, they're all at the same level, and I really want the archives lower than the posts, and the frontpage higher than most of the posts.

Oh well, I'll push on and see whether it's just a matter of warnings that aren't affecting things (my favourite kind) or an indication of things being as I suspect.

Syndicated 2008-08-26 16:50:44 from Neil Blakey-Milner

Wordpress.com scalability at WordCamp SA 2008

At WordCamp South Africa 2008, held in Cape Town yesterday, we were given a brief overview of how Wordpress.com is set up to scale.

Matt Mullenweg set the scene with some idea of just how huge Wordpress.com is.  I may mess up a few numbers mentioned, but there've been something like 6.5 billion page views on Wordpress.com since the beginning of the year, there are 3.8 million Wordpress.com hosted blogs (only Blogger is bigger), and there are 1.4 billion words in posts created on Wordpress.com.

Warwick Poole then gave us some more in-depth numbers, although pointing out that Wordpress.com was bigger than AdultFriendFinder was, judging by the audience's reaction, a pretty good and well-understood indication of scale.  In May 2008, Wordpress.com served 693 million page views, but this rose to 812 million page views in July.  Over 1TB of media was uploaded in May, 1.3TB in July.  In May, 417TB of traffic left the Wordpress.com data centres.  These numbers are available in the "July wrap-up" post on the Wordpress.com web log.

Apparently, across the approximately 710 servers, 10 000 web requests and 10 000 database requests are handled per second (I wasn't intelligent enough to write down whether this was the average).  110 requests per second are done to Amazon's S3 storage service, while 3TB of media is cached on their own media caches.  They output 1.5TB/s (I wrote TB, so it probably is TB and not Tb.  I'm guessing this is peak). They experience approximately 5 server failures a week.

How is it put together?  They use Round Robin DNS which determines the data centre (from testing, it seems they round-robin six IPs - two IPs for each of three data centres).  There it hits a load balancer using some combination of nginx, wackamole, and spread.  They use Varnish for serving at least media, and currently use Litespeed web servers.  They also use MySQL and memcached.

They use (and developed) the batcache Wordpress plugin to serve content from memcached - according to the documentation, batcache only potentially serves stale content to first-time visitors - visitors who have interacted with the web log receive up to date content.

When new media is uploaded, its existence and initial location is stored in a table.  As necessary, the other data centres will create their own local copies of that media, and update that table.  The backup media stores in the data centres are write-only - apparently nothing is ever deleted from them.

That's about all I wrote down, but there's quite a bit of information about how Wordpress.com is set up and the sort of load/traffic it has on the Wordpress.com blog and on the blogs of various employees (such as this post on nginx replacing Pound, this one on Pound, and another on varnish) giving some useful information which will probably inform some technology choices we might make at SynthaSite.

Syndicated 2008-08-24 17:13:34 from Neil Blakey-Milner

Subversion (SVN) shortcuts to revert previous commits

Good version control system usage prevents many disasters, but that doesn't necessarily mean you won't make your own mistakes.  Today, I mistakenly included a file in a commit that I didn't want to commit yet.  I learned two new tricks while spending a few minutes puzzling out the best way to get back to where I was before with that file.

First, make a mistake:

$ svn commit -m "..."
Sending dev.cfg
Sending gibe/plugin.py
Transmitting file data ..
Committed revision 114.

svn merge is the tool to use for this:

merge: Apply the differences between two sources to a working copy path.
usage: 1. merge sourceURL1[@N] sourceURL2[@M] [WCPATH]
2. merge sourceWCPATH1@N sourceWCPATH2@M [WCPATH]
3. merge [-c M | -r N:M] SOURCE[@REV] [WCPATH]

Trick #1: use svn merge's 3rd usage pattern with the -c option with the negative of the revision you've committed, and (here comes the trick) use . (the current directory) as the source of the merge:

$ svn merge -c -114 .
U gibe/plugin.py
U dev.cfg

With that your working copy is now where the repository was before your commit.  Commit that to the repository, and the repository is back where it was before your commit.

Now your working copy is where it was before you made any changes - but you probably want those changes back.  Easy enough:

$ svn merge -c 114 .
U gibe/plugin.py
U dev.cfg

Now your working copy is back where it was before you did the mistaken commit.

Trick #2: Of course, if your mistake is like mine and you only messed up one file and everything else is as it should be, you can just do this on one file, by using svn merge's 2nd usage pattern:

$ svn merge dev.cfg@114 dev.cfg@113
U dev.cfg

Commit that, and your repository is back to normal.  Then run:

$ svn merge dev.cfg@113 dev.cfg@114
U dev.cfg

Now the file is back where it was before your botch.

Syndicated 2008-08-22 15:01:03 from Neil Blakey-Milner

Simple Routes-based authentication with Pylons

Some of the services in the SynthaSite service layer use Python, WSGI, and the Pylons web application framework (with some TurboGears 2...). Particular functions require authentication, while others do not. We had a few simple working parameters:

  • We only currently need a single user name and password for these particular services, since we are authenticating the connecting application, not a particular user.
  • The authentication details must live in deploy configuration, not code.
  • We would like to easily be able to see a list of all entry points into the application, and see whether they require authentication.
  • If we do not specify otherwise, assume that our functions require authentication.
  • To simplify testing and development, be able to easily turn off the authentication requirement in deploy configuration.

We already have a list of all entry points into these applications, since they use Routes. In the Pylons layout, these live in config/routing.py, and look like this:

from pylons import config
from routes import Mapper

def make_map():
    """Create, configure and return the routes Mapper"""
    map = Mapper(directory=config['pylons.paths']['controllers'])

    # The ErrorController route (handles 404/500 error pages); it should
    # likely stay at the top, ensuring it can always be resolved
    map.connect('error/:action/:id', controller='error')

    # Named routes: route name, URL pattern, then explicit controller and action
    map.connect('users', '/users', controller='users', action='index')
    map.connect('user', '/users/:user_id', controller='users', action='user')
    # ...

While one can use Routes with controllers and actions that depend on the URL, I prefer being explicit.  This creates a single list of all the accessible URLs and their accompanying controllers and actions.

If you attach additional keywords to the map.connect method, then they are added to the defaults attribute of the Route object created for each of those routes.  The RoutesMiddleware WSGI middleware places the route that matches the incoming request into the routes.route key in the environ dictionary that drives WSGI.  So, we can just add _auth = True to routes that require auth and _auth = False to those that don't, and create our own simple authentication middleware.  It would look something like this:

from paste.auth.basic import AuthBasicHandler

class LocalAuthenticationMiddleware(object):
    def __init__(self, app, config):
        realm = config.get('localauthentication.realm', None)
        username = config.get('localauthentication.username', None)
        password = config.get('localauthentication.password', None)
        if realm:
            # Bind the configured credentials as default arguments
            def authfunc(environ, username, password,
                         _wanted_username = username, _wanted_password = password):
                if username == _wanted_username:
                    if password == _wanted_password:
                        return True
            self.protected_app = AuthBasicHandler(app, realm, authfunc)
        else:
            # No realm configured: authentication is switched off
            self.protected_app = app
        self.config = config
        self.app = app

    def __call__(self, environ, start_response):
        route = environ.get('routes.route', None)
        if not route:
            return self.app(environ, start_response)

        # The _auth default is compared as a string
        if route.defaults.get('_auth', 'False') == 'False':
            return self.app(environ, start_response)

        return self.protected_app(environ, start_response)

We use Paste's AuthBasicHandler WSGI middleware to optionally wrap our application.  We keep a reference to our application around, in case we don't want to apply authentication.  When our middleware is called, we check whether we want the AuthBasicHandler-wrapped application, or the plain application, and call the one we want as per standard WSGI middleware.
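
The per-route decision can be exercised without Pylons or Paste installed.  The sketch below is only an illustration: `StubRoute` and `requires_auth` are hypothetical names, and the stub stringifies its defaults on the assumption (suggested by the comparison against the string 'False' above) that Routes hands them back as strings.

```python
class StubRoute(object):
    """Hypothetical stand-in for a Routes Route; only `defaults` is used."""
    def __init__(self, **defaults):
        # Assumption: defaults are stored as strings, as the middleware expects.
        self.defaults = dict((key, str(value)) for key, value in defaults.items())

def requires_auth(environ):
    """Mirror the middleware's choice between the plain and protected app."""
    route = environ.get('routes.route', None)
    if not route:
        # No matched route: fall through to the unprotected application.
        return False
    return route.defaults.get('_auth', 'False') != 'False'

print(requires_auth({'routes.route': StubRoute(_auth=True)}))
```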

Specifying _auth = True and _auth = False for every route is going to be painful.  Instead, we created a simple wrapper function around map.connect that we use instead, and it does the defaulting to requiring authentication for us (amongst other things):

    def connect(route_name, route_url, *args, **kw):
        # Fold a 'method' shortcut into the route's conditions
        if 'method' in kw:
            method = kw.pop('method')
            if 'conditions' in kw:
                kw['conditions']['method'] = method
            else:
                kw['conditions'] = dict(method=method)

        # Unless otherwise specified, require authentication
        kw['_auth'] = kw.get('_auth', True)

        return map.connect(route_name, route_url, *args, **kw)
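
To see what the wrapper actually passes on, it can be tried against a stand-in mapper that just records its arguments.  `RecordingMapper`, `make_connect`, and the `ping` route are hypothetical names used only for this sketch; the wrapper body follows the same logic as above.

```python
class RecordingMapper(object):
    """Hypothetical stand-in for routes.Mapper that records connect() calls."""
    def __init__(self):
        self.calls = []

    def connect(self, *args, **kw):
        self.calls.append((args, kw))

def make_connect(mapper):
    def connect(route_name, route_url, *args, **kw):
        # Fold a 'method' shortcut into the route's conditions.
        if 'method' in kw:
            method = kw.pop('method')
            if 'conditions' in kw:
                kw['conditions']['method'] = method
            else:
                kw['conditions'] = dict(method=method)
        # Unless otherwise specified, require authentication.
        kw['_auth'] = kw.get('_auth', True)
        return mapper.connect(route_name, route_url, *args, **kw)
    return connect

mapper = RecordingMapper()
connect = make_connect(mapper)
connect('users', '/users', controller='users', action='index', method='GET')
connect('ping', '/ping', controller='status', action='ping', _auth=False)
for args, kw in mapper.calls:
    print(args, kw)
```

Only routes that explicitly pass `_auth=False` end up unprotected, which matches the "authenticate by default" requirement listed earlier.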

Syndicated 2008-08-22 14:58:10 from Neil Blakey-Milner

Updating the TechGeneral deployment environment

Over the past few days, I've been putting the final touches on TechGeneral before letting anyone know about it.  The process from development to deployment has been surprisingly simple.

TechGeneral runs gibe, the web log server application I wrote for my personal web log.  Gibe is written in Python, using the TurboGears 1 mega-framework.

When deploying Python applications, using virtualenv (or something equivalent) is the best way to go.  Each virtual Python environment contains the particular versions of libraries necessary to run the applications that run in that environment.  TurboGears 1 is getting a bit old (although that's entirely relative), and needs some older versions of libraries.  No problem accommodating that with virtualenv.

Gibe itself, its plugins, and themes written for it (which are just plugins) are all Python packages, and are most easily installed using easy_install within the virtual Python environment.

This was my first real-life use of mod_wsgi, which manages a WSGI application's lifecycle.  I created a simple .wsgi file using the TurboGears example, and set up a line or two in my Apache config, and I had a fully managed Python process running as a specified user and using the virtual Python environment I'd set up.

At this point, moving from development to deployment was just a matter of creating Python packages, uploading them, installing them with easy_install, and using the Unix command "touch" on the .wsgi file to tell mod_wsgi to redeploy the application.

If a mistake happens, I just remove the new version of the offending package and touch the .wsgi file.

I created a simple development-side alias to create a new Python source distribution of the current Gibe plugin (or Gibe itself), and upload it to my server:

alias tgu="python setup.py sdist && \
scp `ls -trc1 dist/* | tail -1` \

On the server side, I have a simple function (paths removed for simplicity):

tginstall() { easy_install "$@" && \
touch mod_wsgi/techgeneral.wsgi ; }

(I suppose if I got really bored, I could create one command on the development machine to push it up, install it, and reload the server.)

Syndicated 2008-08-20 12:28:43 (Updated 2008-08-20 20:52:18) from Neil Blakey-Milner

Welcome to TechGeneral

I'm Neil Blakey-Milner, a technology generalist based in Cape Town, South Africa.  Welcome to TechGeneral, my new (at least at time of writing) technology web log, where I talk about my wide-ranging interests in technology.

Likely common themes are:

Since April 2003, I've maintained a mixed-bag web log, Cosmic Seriosity Balance.  From today, that will be where I'll talk about things other than technology, such as:

I've realised that the people who read my technology posts (predominantly outside of South Africa) probably don't care much about the other stuff.  The people who read my other stuff (predominantly inside South Africa) probably don't care all that much about the technology stuff either.

I hope this split helps improve the subjective signal to noise ratio for those who follow what I have to say.

Syndicated 2008-08-17 14:07:26 from Neil Blakey-Milner

Be sure to wear a flower in your hair

(This is a repost of my entry "Be sure to wear a flower in your hair" to the South African Tech Leader technology group blog.  My next post, What is a geek?, has just been posted there, if you want to read it before a week or two from now when I'll repost it here.)

It’s really hard to summarise the experience of a first visit to San Francisco, assuming you’re at least somewhat a technology geek. San Francisco (and by that, one generally means the San Francisco Bay Area) is modern technology’s birthplace and still its hometown.

Xerox PARC (as in Palo Alto Research Centre) either created or popularised implementations of modern computing aspects such as the mouse, laser printers, Ethernet, GUI/WIMP interfaces, Object-Oriented Programming with the Smalltalk programming language, and the Integrated Development Environment. The Bay Area is home to the headquarters of technology giants such as Apple, Cisco, eBay, Google, Oracle, Sun Microsystems, and Yahoo!, as well as upstarts like Facebook, Mint.com, and SugarCRM. (And SynthaSite, of course.)

At times during my visit the technology industry seemed entirely pervasive — whether it was randomly walking past three people in the street arguing the merits of various memory allocation techniques (I kid you not) or hearing that one of your colleagues just moved into the apartment the CEO of a popular social media startup just moved out of. It is hard not to let your imagination loose with the idea of what can be achieved here, especially after seeing over 3000 developers, a large portion of them probably local to the area and most certainly at least as geeky as I am, at Google’s I/O conference. (I posted quite extensively about my Google I/O trip on my personal blog, if you want to check it out.)

If I sound a bit in love, it’s because I am. I challenge anyone in our industry to somehow not be a little in love with the vibe and pace and sense of belonging you will find in San Francisco. But this isn’t really about technology in San Francisco — it’s about it in South Africa.

Romance novels suggest that sometimes you need to discover (or be reminded of) what is out there to realise quite what you have, that while you find that there’s a lot of prettiness out there, you will also discover that there have been and always will be many and unassailable reasons for you being with the one you’re with.

I needed that a bit with South Africa. I’ve always wanted to be here for the long run, but it has been hard not to get worn down little by little over the past few years by the scarcity of interesting highly-skilled work and the similar scarcity of ambition in South African technology companies. Now, I have an updated and more accurate idea of what is out there, and while South Africa does fare poorly in some comparisons, there are other, more important, aspects to take into consideration. And those mean that leaving it to find some technology heaven elsewhere sounds like a bad swap.

And it’s not like you have to be in San Francisco to wear a flower in your hair — you can experience and help create your own slice of the San Franciscan vibe wherever you are. All it really takes is creating or finding a workplace you can be passionate about using technologies you’re passionate about with people who share that passion (am I saying “passion” enough?), and finding and building a community of similarly technology obsessed people who can help you, and who you can help, and to make you feel like you’re not alone (and who you can make dinner conversation with without resorting to the weather).

I lucked out on the first one — at SynthaSite I have an ambitious company that knows how to treat their employees well, great colleagues, and challenging work — and a pantry full of snacks, lunches materialising daily at my desk, games consoles, and 40-inch TVs. And there are at least a few similarly-enlightened workplaces around, and more can be created.

I already know a number of geeks who’d give a good argument on the merits of various memory allocation techniques. It takes work, but through efforts like GeekDinner and StarCamp, we come to know more, and different, people and benefit from that meeting as they introduce us to new perspectives and, hopefully, shake our preconceptions. And not only come to know people, but also come to know more about our trade through presentations and less formal conversations sparked by an interest that perhaps we didn’t know we had before others introduced the topic.

While it is easy to moan about the lacks we have here, it seems that by our attitudes and our actions we can create an ever-increasing slice of that seemingly far-away vibe. As we kick off planning for the next StarCamp in Cape Town, and a national web technology conference, I’m hoping we will find positive attitudes and actions in finding co-organisers, presenters, sponsors, and venues.

Syndicated 2008-06-28 12:21:15 from Cosmic Seriosity Balance

First Tech Leader post up

Just before I left for my San Francisco visit, I was approached by Nic on whether I'd like to write for Tech Leader, which is a South African "editorial" group blog about technology, edited and run by the Mail and Guardian Online.

My first post, Be sure to wear a flower in your hair, is on how my trip to San Francisco and the technology vibe and sense of "anything is possible" revitalised me a bit about South Africa and the potential future that could be if technology people stay and work for change (by which I mean in the industry, but it's also good to try change things outside it too).

I'm going to try write a post a week for Tech Leader on less nitty-gritty things, and try get back to a few posts a week here after my recent fortnight of silence dealing with post-travel jetlag and accumulated work responsibilities.  I'll post a pointer to Tech Leader when I post there, and post the full content here two weeks (or so) afterwards.

Syndicated 2008-06-19 13:02:27 from Cosmic Seriosity Balance
