Older blog entries for nbm (starting at number 119)

Early adventures with Sitemaps

Perhaps entirely randomly, I decided that TechGeneral would need Sitemaps before I put it live.

A Sitemap (sometimes called a Google Sitemap, although you won't see Google calling it that; it's a standard that Yahoo!, Ask, and Live also support) is an XML file (or set of XML files) describing the resources on your web site, allowing search engines and other programs to discover them more easily.

There are a few advantages to putting together a Sitemap.  Generally, search engines give up after they travel a few links into a web site, to avoid infinite automatically generated links (not necessarily created with malicious intent, but through weird programming).  With a Sitemap, each listed resource can potentially be treated as a first visit.  Also, if a site has navigation that search engines can't traverse to reach certain pages, a Sitemap can help search engines find those resources.

They also optionally assign a priority to each resource, as a way to influence the importance assigned to the resource relative to other resources on your web site.  Similarly, an optional update frequency per resource can influence how often a search engine or other program should check back for new versions of that resource.  Last modified dates, also optional, help determine whether to revisit a resource earlier or later than would normally happen.

Example Sitemap File
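A minimal Sitemap file in the sitemaps.org format; the URL, date, and values here are illustrative, not actual TechGeneral entries:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/some-post</loc>
    <lastmod>2008-08-26</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

Only loc is required; lastmod, changefreq, and priority are the optional hints discussed above.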


There are two types of Sitemaps - individual Sitemap files and Sitemap Index files.  Why would you want a Sitemap Index?  One reason, less relevant to many, is that an individual Sitemap file can only contain 50 000 URLs (which, admittedly, the average blog isn't going to hit) and must be less than 10MB uncompressed.  Another reason is that you might be using multiple systems that each generate Sitemap files (or you've hacked them to do so) and you don't want to merge them yourself.

Example Sitemap Index
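A minimal Sitemap Index in the same format, pointing at two (again illustrative) Sitemap files:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap-posts.xml</loc>
    <lastmod>2008-08-26</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap-archives.xml</loc>
  </sitemap>
</sitemapindex>
```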


One useful side-effect of using a Sitemap with Google's webmaster tools is that you can see errors that occur on resources listed in the Sitemap.  So, if a request for a resource starts returning 404 or 500 errors, you can separate that more specific set of errors from those caused by broken links on your site or on other sites.

However, Google's webmaster tools doesn't seem to like having a whole bunch of separate Sitemap files with a central Sitemap Index.  I mean, it seems to work, but it complains (warnings, not errors) that many of the Sitemaps (all on this site, most on my personal web site) only have entries with the same priority.  I'm setting the priority of all the archives low (they have noindex, follow set anyway, so won't show up in search results), the frontpage high, and the posts' priorities based on age.

I get the feeling that the priorities only apply within the same file, and not within the same site.  This somewhat makes sense, since one can delegate a sitemap for a particular folder on your web site, and you wouldn't want an overeager person assigning "1.0" to all content within the folder, overriding your beautifully crafted values for the base site.  However, in this case, they're all at the same level, and I really want the archives lower than the posts, and the frontpage higher than most of the posts.

Oh well, I'll push on and see whether it's just a matter of warnings that aren't affecting things (my favourite kind) or an indication of things being as I suspect.

Syndicated 2008-08-26 16:50:44 from Neil Blakey-Milner

Wordpress.com scalability at WordCamp SA 2008

At WordCamp South Africa 2008, held in Cape Town yesterday, we were given a brief overview of how Wordpress.com is set up to scale.

Matt Mullenweg set the scene with some idea of just how huge Wordpress.com is.  I may mess up a few numbers mentioned, but there've been something like 6.5 billion page views on Wordpress.com since the beginning of the year, there are 3.8 million Wordpress.com hosted blogs (only Blogger is bigger), and there are 1.4 billion words in posts created on Wordpress.com.

Warwick Poole then gave us some more in-depth numbers, although, judging by the audience's reaction, pointing out that Wordpress.com is bigger than AdultFriendFinder was the best-understood indication of scale.  In May 2008, Wordpress.com served 693 million page views; this rose to 812 million page views in July.  Over 1TB of media was uploaded in May, 1.3TB in July.  In May, 417TB of traffic left the Wordpress.com data centres.  These numbers are available in the "July wrap-up" post on the Wordpress.com web log.

Apparently, across the approximately 710 servers, 10 000 web requests and 10 000 database requests are handled per second (I wasn't intelligent enough to write down whether this was the average).  110 requests per second go to Amazon's S3 storage service, while 3TB of media is cached on their own media caches.  They output 1.5TB/s (I wrote TB, so it probably is TB and not Tb; I'm guessing this is peak).  They experience approximately 5 server failures a week.

How is it put together?  They use round robin DNS to determine the data centre (from testing, it seems they round robin six IPs: two for each of three data centres).  From there a request hits a load balancer using some combination of nginx, wackamole, and spread.  They use Varnish for serving at least media, and currently use Litespeed web servers.  They also use MySQL and memcached.

They use (and developed) the batcache Wordpress plugin to serve content from memcached.  According to the documentation, batcache only potentially serves stale content to first-time visitors; visitors who have interacted with the web log receive up-to-date content.

When new media is uploaded, its existence and initial location is stored in a table.  As necessary, the other data centres will create their own local copies of that media, and update that table.  The backup media stores in the data centres are write-only - apparently nothing is ever deleted from them.
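The replication scheme described above can be illustrated with a toy table.  The schema, names, and paths here are entirely my invention (not WordPress.com's actual setup): on upload only one data centre holds a copy, and other data centres replicate on first demand, recording their new copy in the same table.

```python
import sqlite3

# Invented schema: one row per (media file, data centre) copy.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE media_copies (
    media_id TEXT, datacentre TEXT, path TEXT,
    PRIMARY KEY (media_id, datacentre))""")

def record_upload(media_id, datacentre, path):
    # On upload, only the receiving data centre holds a copy.
    db.execute("INSERT INTO media_copies VALUES (?, ?, ?)",
               (media_id, datacentre, path))

def local_path(media_id, datacentre):
    # Serve a local copy if we have one; otherwise "replicate" from any
    # data centre that does, and record the new copy in the table.
    row = db.execute(
        "SELECT path FROM media_copies WHERE media_id = ? AND datacentre = ?",
        (media_id, datacentre)).fetchone()
    if row:
        return row[0]
    src = db.execute(
        "SELECT path FROM media_copies WHERE media_id = ?",
        (media_id,)).fetchone()
    if src is None:
        raise KeyError(media_id)
    new_path = "/%s%s" % (datacentre, src[0])  # stands in for the real copy
    db.execute("INSERT INTO media_copies VALUES (?, ?, ?)",
               (media_id, datacentre, new_path))
    return new_path

record_upload("img1", "dc1", "/media/img1.jpg")
print(local_path("img1", "dc2"))  # /dc2/media/img1.jpg
print(local_path("img1", "dc1"))  # /media/img1.jpg
```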

That's about all I wrote down, but there's quite a bit of information about how Wordpress.com is set up and the sort of load/traffic it handles on the Wordpress.com blog and on the blogs of various employees (such as this post on nginx replacing Pound, this one on Pound, and another on Varnish), which will probably inform some technology choices we make at SynthaSite.

Syndicated 2008-08-24 17:13:34 from Neil Blakey-Milner

Subversion (SVN) shortcuts to revert previous commits

Good version control system usage prevents many disasters, but that doesn't necessarily mean you won't make your own mistakes.  Today, I mistakenly included a file in a commit that I didn't want to commit yet.  I learned two new tricks while spending a few minutes puzzling out the best way to get that file back to where it was before.

First, make a mistake:

$ svn commit -m "..."
Sending dev.cfg
Sending gibe/plugin.py
Transmitting file data ..
Committed revision 114.

svn merge is the tool to use for this:

merge: Apply the differences between two sources to a working copy path.
usage: 1. merge sourceURL1[@N] sourceURL2[@M] [WCPATH]
       2. merge sourceWCPATH1@N sourceWCPATH2@M [WCPATH]
       3. merge [-c M | -r N:M] SOURCE[@REV] [WCPATH]

Trick #1: use svn merge's 3rd usage pattern with the -c option with the negative of the revision you've committed, and (here comes the trick) use . (the current directory) as the source of the merge:

$ svn merge -c -114 .
U gibe/plugin.py
U dev.cfg

With that, your working copy is now where the repository was before your commit.  Commit that to the repository, and the repository is back where it was before your commit.

Now your working copy is where it was before you made any changes - but you probably want those changes back.  Easy enough:

$ svn merge -c 114 .
U gibe/plugin.py
U dev.cfg

Now your working copy is back where it was before you did the mistaken commit.

Trick #2: Of course, if your mistake is like mine and you only messed up one file and everything else is as it should be, you can just do this on one file, by using svn merge's 2nd usage pattern:

$ svn merge dev.cfg@114 dev.cfg@113
U dev.cfg

Commit that, and your repository is back to normal.  Then run:

$ svn merge dev.cfg@113 dev.cfg@114
U dev.cfg

Now the file is back where it was before your botch.

Syndicated 2008-08-22 15:01:03 from Neil Blakey-Milner

Simple Routes-based authentication with Pylons

Some of the services in the SynthaSite service layer use Python, WSGI, and the Pylons web application framework (with some TurboGears 2...). Particular functions require authentication, while others do not. We had a few simple working parameters:

  • We only currently need a single user name and password for these particular services, since we are authenticating the connecting application, not a particular user.
  • The authentication details must live in deploy configuration, not code.
  • We would like to easily be able to see a list of all entry points into the application, and see whether they require authentication.
  • If we do not specify otherwise, assume that our functions require authentication.
  • To simplify testing and development, be able to easily turn off the authentication requirement in deploy configuration.

We already have a list of all entry points into these applications, since they use Routes. In the Pylons layout, these live in config/routing.py, and look like this:

from pylons import config
from routes import Mapper

def make_map():
    """Create, configure and return the routes Mapper"""
    map = Mapper(directory=config['pylons.paths']['controllers'],
                 always_scan=config['debug'])

    # The ErrorController route (handles 404/500 error pages); it should
    # likely stay at the top, ensuring it can always be resolved
    map.connect('error/:action/:id', controller='error')

    map.connect('users', '/users', controller='users', action='index')
    map.connect('user', '/users/:user_id', controller='users', action='user')
    # ...

While one can use Routes with controllers and actions that depend on the URL, I prefer being explicit.  This creates a single list of all the accessible URLs and their accompanying controllers and actions.

If you attach additional keywords to the map.connect method, then they are added to the defaults attribute of the Route object created for each of those routes.  The RoutesMiddleware WSGI middleware places the route that matches the incoming request into the routes.route key in the environ dictionary that drives WSGI.  So, we can just add _auth = True to routes that require auth and _auth = False to those that don't, and create our own simple authentication middleware.  It would look something like this:

from paste.auth.basic import AuthBasicHandler

class LocalAuthenticationMiddleware(object):
    def __init__(self, app, config):
        realm = config.get('localauthentication.realm', None)
        username = config.get('localauthentication.username', None)
        password = config.get('localauthentication.password', None)
        if realm:
            def authfunc(environ, username, password,
                         _wanted_username=username, _wanted_password=password):
                if username == _wanted_username:
                    if password == _wanted_password:
                        return True
                return False
            self.protected_app = AuthBasicHandler(app, realm, authfunc)
        else:
            # No realm configured: authentication is turned off.
            self.protected_app = app
        self.config = config
        self.app = app

    def __call__(self, environ, start_response):
        route = environ.get('routes.route', None)
        if not route:
            return self.app(environ, start_response)

        # Routes default to requiring authentication; only an explicit
        # _auth = False opts a route out.
        if not route.defaults.get('_auth', True):
            return self.app(environ, start_response)

        return self.protected_app(environ, start_response)

We use Paste's AuthBasicHandler WSGI middleware to optionally wrap our application.  We keep a reference to our application around, in case we don't want to apply authentication.  When our middleware is called, we check whether we want the AuthBasicHandler-wrapped application, or the plain application, and call the one we want as per standard WSGI middleware.
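As an aside, the authfunc definition relies on default argument values being evaluated at definition time: since the inner username and password parameters shadow the configured ones, the _wanted_* defaults are what carry the configured values into the function.  A standalone sketch of that idiom (credentials invented):

```python
# Suppose these came from deploy configuration, as in __init__ above:
username = 'admin'
password = 's3cret'

# The parameters shadow the outer names, so the configured values are
# captured via default arguments, bound when the function is defined.
def authfunc(environ, username, password,
             _wanted_username=username, _wanted_password=password):
    return (username == _wanted_username and
            password == _wanted_password)

print(authfunc({}, 'admin', 's3cret'))  # True
print(authfunc({}, 'admin', 'wrong'))   # False
```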

Specifying _auth = True and _auth = False for every route is going to be painful.  Instead, we created a simple wrapper function around map.connect that we use instead, and it does the defaulting to requiring authentication for us (amongst other things):

def connect(route_name, route_url, *args, **kw):
    if 'method' in kw:
        method = kw.pop('method')
        if 'conditions' in kw:
            kw['conditions']['method'] = method
        else:
            kw['conditions'] = dict(method=method)

    # Unless otherwise specified, require authentication
    kw['_auth'] = kw.get('_auth', True)

    return map.connect(route_name, route_url, *args, **kw)
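For illustration, here is a runnable sketch of the defaulting behaviour, with a stub standing in for the real Routes Mapper (the route names and URLs are invented):

```python
class StubMapper(object):
    """Stands in for routes.Mapper; just records connect() calls."""
    def __init__(self):
        self.routes = []
    def connect(self, name, url, *args, **kw):
        self.routes.append((name, url, kw))

map = StubMapper()

def connect(route_name, route_url, *args, **kw):
    # Pop 'method' into conditions, and default _auth to True unless
    # explicitly overridden.
    if 'method' in kw:
        method = kw.pop('method')
        if 'conditions' in kw:
            kw['conditions']['method'] = method
        else:
            kw['conditions'] = dict(method=method)
    kw['_auth'] = kw.get('_auth', True)
    return map.connect(route_name, route_url, *args, **kw)

connect('users', '/users', controller='users', action='index')
connect('ping', '/ping', controller='status', action='ping', _auth=False)
print([kw['_auth'] for (_, _, kw) in map.routes])  # [True, False]
```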

Syndicated 2008-08-22 14:58:10 from Neil Blakey-Milner

Updating the TechGeneral deployment environment

Over the past few days, I've been putting the final touches on TechGeneral before letting anyone know about it.  The process from development to deployment has been surprisingly simple.

TechGeneral runs gibe, the web log server application I wrote for my personal web log.  Gibe is written in Python, using the TurboGears 1 mega-framework.

When deploying Python applications, using virtualenv (or something equivalent) is the best way to go.  Each virtual Python environment contains the particular versions of libraries necessary to run the applications that run in that environment.  TurboGears 1 is getting a bit old (although that's entirely relative), and needs some older versions of libraries.  No problem accommodating that with virtualenv.

Gibe itself, its plugins, and themes written for it (which are just plugins) are all Python packages, and are most easily installed using easy_install within the virtual Python environment.

This was my first real-life use of mod_wsgi, which manages a WSGI application's lifecycle.  I created a simple .wsgi file using the TurboGears example, and set up a line or two in my Apache config, and I had a fully managed Python process running as a specified user and using the virtual Python environment I'd set up.
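The Apache side of such a setup might look something like this.  This is a sketch, not my actual configuration: the process name, paths, and virtualenv location are placeholders, though the directives are standard mod_wsgi ones.

```apache
WSGIDaemonProcess techgeneral user=techgeneral group=techgeneral threads=4 \
    python-path=/path/to/virtualenv/lib/python2.5/site-packages
WSGIScriptAlias / /path/to/mod_wsgi/techgeneral.wsgi

<Directory /path/to/mod_wsgi>
    WSGIProcessGroup techgeneral
    Order allow,deny
    Allow from all
</Directory>
```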

At this point, moving from development to deployment was just a matter of creating Python packages, uploading them, installing them with easy_install, and using the Unix command "touch" on the .wsgi file to tell mod_wsgi to redeploy the application.

If a mistake happens, I just remove the new version of the offending package and touch the .wsgi file.

I created a simple development-side alias to create a new Python source distribution of the current Gibe plugin (or Gibe itself), and upload it to my server:

alias tgu="python setup.py sdist && \
scp `ls -trc1 dist/* | tail -1` \

On the server side, I have a simple function (paths removed for simplicity):

tginstall() { easy_install "$@" && \
touch mod_wsgi/techgeneral.wsgi ; }

(I suppose if I got really bored, I could create one command on the development machine to push it up, install it, and reload the server.)

Syndicated 2008-08-20 12:28:43 (Updated 2008-08-20 20:52:18) from Neil Blakey-Milner

Welcome to TechGeneral

I'm Neil Blakey-Milner, a technology generalist based in Cape Town, South Africa.  Welcome to TechGeneral, my new (at least at time of writing) technology web log, where I talk about my wide-ranging interests in technology.

Likely common themes are:

Since April 2003, I've maintained a mixed-bag web log, Cosmic Seriosity Balance.  From today, that will be where I'll talk about things other than technology, such as:

I've realised that the people who read my technology posts (predominantly outside of South Africa) probably don't care much about the other stuff.  The people who read my other stuff (predominantly inside South Africa) probably don't care all that much about the technology stuff either.

I hope this split helps improve the subjective signal to noise ratio for those who follow what I have to say.

Syndicated 2008-08-17 14:07:26 from Neil Blakey-Milner

Be sure to wear a flower in your hair

(This is a repost of my entry "Be sure to wear a flower in your hair" to the South African Tech Leader technology group blog.  My next post, What is a geek?, has just been posted there, if you want to read it before a week or two from now when I'll repost it here.)

It’s really hard to summarise the experience of a first visit to San Francisco, assuming you’re at least somewhat a technology geek. San Francisco (and by that, one generally means the San Francisco Bay Area) is modern technology’s birthplace and still its hometown.

Xerox PARC (as in Palo Alto Research Centre) either created or popularised implementations of modern computing aspects such as the mouse, laser printers, Ethernet, GUI/WIMP interfaces, Object-Oriented Programming with the Smalltalk programming language, and the Integrated Development Environment. The Bay Area is home to the headquarters of technology giants such as Apple, Cisco, eBay, Google, Oracle, Sun Microsystems, and Yahoo!, as well as upstarts like Facebook, Mint.com, and SugarCRM. (And SynthaSite, of course.)

At times during my visit the technology industry seemed entirely pervasive — whether it was randomly walking past three people in the street arguing the merits of various memory allocation techniques (I kid you not) or hearing that one of your colleagues just moved into the apartment the CEO of a popular social media startup just moved out of. It is hard not to let your imagination loose with the idea of what can be achieved here, especially after seeing over 3000 developers, a large portion of them probably local to the area and most certainly at least as geeky as I am, at Google’s I/O conference. (I posted quite extensively about my Google I/O trip on my personal blog, if you want to check it out.)

If I sound a bit in love, it’s because I am. I challenge anyone in our industry to somehow not be a little in love with the vibe and pace and sense of belonging you will find in San Francisco. But this isn’t really about technology in San Francisco — it’s about it in South Africa.

Romance novels suggest that sometimes you need to discover (or be reminded of) what is out there to realise quite what you have, that while you find that there’s a lot of prettiness out there, you will also discover that there have been and always will be many and unassailable reasons for you being with the one you’re with.

I needed that a bit with South Africa. I’ve always wanted to be here for the long run, but it has been hard not to get worn down little by little over the past few years by the scarcity of interesting highly-skilled work and the similar scarcity of ambition in South African technology companies. Now, I have an updated and more accurate idea of what is out there, and while South Africa does fare poorly in some comparisons, there are other, more important, aspects to take into consideration. And those mean that leaving it to find some technology heaven elsewhere sounds like a bad swap.

And it’s not like you have to be in San Francisco to wear a flower in your hair — you can experience and help create your own slice of the San Franciscan vibe wherever you are. All it really takes is creating or finding a workplace you can be passionate about using technologies you’re passionate about with people who share that passion (am I saying “passion” enough?), and finding and building a community of similarly technology obsessed people who can help you, and who you can help, and to make you feel like you’re not alone (and who you can make dinner conversation with without resorting to the weather).

I lucked out on the first one — at SynthaSite I have an ambitious company that knows how to treat their employees well, great colleagues, and challenging work — and a pantry full of snacks, lunches materialising daily at my desk, games consoles, and 40-inch TVs. And there are at least a few similarly-enlightened workplaces around, and more can be created.

I already know a number of geeks who’d give a good argument on the merits of various memory allocation techniques. It takes work, but through efforts like GeekDinner and StarCamp, we come to know more, and different, people and benefit from that meeting as they introduce us to new perspectives and, hopefully, shake our preconceptions. And not only come to know people, but also come to know more about our trade through presentations and less formal conversations sparked by an interest that perhaps we didn’t know we had before others introduced the topic.

While it is easy to moan about the lacks we have here, it seems that by our attitudes and our actions we can create an ever-increasing slice of that seemingly far-away vibe. As we kick off planning for the next StarCamp in Cape Town, and a national web technology conference, I’m hoping we will find positive attitudes and actions in finding co-organisers, presenters, sponsors, and venues.

Syndicated 2008-06-28 12:21:15 from Cosmic Seriosity Balance

First Tech Leader post up

Just before I left for my San Francisco visit, I was approached by Nic about whether I'd like to write for Tech Leader, which is a South African "editorial" group blog about technology, edited and run by the Mail and Guardian Online.

My first post, Be sure to wear a flower in your hair, is on how my trip to San Francisco and the technology vibe and sense of "anything is possible" revitalised me a bit about South Africa and the potential future that could be if technology people stay and work for change (by which I mean in the industry, but it's also good to try change things outside it too).

I'm going to try to write a post a week for Tech Leader on less nitty-gritty things, and try to get back to a few posts a week here after my recent fortnight of silence dealing with post-travel jetlag and accumulated work responsibilities.  I'll post a pointer to Tech Leader when I post there, and post the full content here two weeks (or so) afterwards.

Syndicated 2008-06-19 13:02:27 from Cosmic Seriosity Balance

Pylons/TG2/WSGI Sprint and sight-seeing weekend

I spent much of the weekend in sunny Sebastopol at the Pylons/TG2/WSGI sprint at O'Reilly's headquarters there.  One doesn't expect the lack of fanfare that marks the O'Reilly offices - besides a modest sign on entry to the parking lot, only a Tarsier statue made of recycled metal identifies the pretty normal-looking offices.

There's not much I can say that would probably be of interest to others, other than that I had a lot of fun, and the sprinters were all very friendly, great to chat to, geeky, and generally just like the great group of geeks we have in Cape Town.

My awesome carpool partner, Kelvin, not only managed not to go mad stuck with me in a car for an hour and a bit each way for two days - he was quite keen to show me around San Francisco on Sunday evening.  We did some common tourist things - went to the TransAmerica Pyramid, Union Square, past the Symphony Hall, through the Golden Gate Park and China Town, up Coit Tower on Telegraph Hill, down the "Crookedest Road", and generally meandering all over the place.

Syndicated 2008-06-03 05:58:56 from Cosmic Seriosity Balance

Google I/O: Google App Engine fireside chat

There were a few questions about the choice of Python as a language, and whether and what languages would come next, comparisons to other existing containers, and so forth.  Guido van Rossum said it was partly because Python is one of the three big languages at Google, and because it was (relatively) easy to harden the VM.  Kevin Gibbs said they had to start somewhere, and that they were committed to others.  Paul McDonald said that the two most voted-for issues on the issue tracker are language-related, and that there were teams (ie, more than one) currently actively working on languages (ie, more than one).

A couple of questions around "maturity" - the team says they'll make it clear when it is no longer a preview, and that this will probably happen when they have the billing set up and offline processing.  They expect billing to be available "toward the end of the year".

Question about HTTPS/SSL and access to encryption within GAE code.  Answer is that it's something they want to do, but don't know when they'll get to it.  Data is "strictly" partitioned between apps in the store (BigTable).

A common thread in the answers was that the Google App Engine team is very interested in people being able to get their data and code out of GAE, and they're working on making it easy to bulk-export the data.  They hoped that a standard would emerge for BigTable-like storage (CouchDB, SimpleDB) so that people could write code and host it on GAE or elsewhere.  And people are already working on compatible APIs to make it possible to run on other storage systems (though these may not be too efficient).

Syndicated 2008-05-29 23:33:05 from Cosmic Seriosity Balance
