25 Jun 2009 Ankh   » (Master)

I've been spending a lot of time working on a 200-year-old, 32-volume dictionary of biography that I own (I got it in a second-hand bookshop in Oxford; it was missing two volumes, which I later got elsewhere). I found several versions that had been OCR'd really badly, and have been cleaning up one of them enough that I can then try to use the other versions to detect the remaining errors.
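For anyone curious what "using the other versions to detect errors" might look like, here is a minimal sketch in Python, assuming two plain-text OCR outputs of the same passage. It is only an illustration of the general idea, not the actual cleanup pipeline: the function name `find_disagreements` and the sample sentences are made up, and it simply aligns the two word sequences with the standard library's `difflib` and reports the places where they differ.

```python
import difflib


def find_disagreements(text_a, text_b):
    """Return (version_a_words, version_b_words) pairs where two OCR texts differ.

    Hypothetical helper for illustration: both inputs are assumed to be
    plain-text OCR output of the same passage.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False stops difflib from treating very frequent words as junk,
    # which matters on long texts full of "the", "of", "and".
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    disagreements = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append(
                (" ".join(words_a[a0:a1]), " ".join(words_b[b0:b1]))
            )
    return disagreements


if __name__ == "__main__":
    # Made-up sample lines; "bom" for "born" is a typical OCR confusion.
    version_a = "He was born at Oxford in 1732"
    version_b = "He was bom at Oxford in 1732"
    for a_words, b_words in find_disagreements(version_a, version_b):
        print(f"A: {a_words!r}  B: {b_words!r}")
```

A disagreement only says that at least one version is wrong at that spot, not which one, so each reported pair still needs checking against the page image or a third version.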

The current version, converted first to XML and thence to HTML, is at words.fromoldbooks.org if anyone is interested. I'm hoping eventually to be able to feed the cleaned-up text back to Project Gutenberg and archive.org, and to generate RDF.

Lots of interesting text processing challenges, so a useful diversion for a while.
