I've been spending a lot of time working on a 200-year-old, 32-volume dictionary of biography that I own (I got it in a second-hand bookshop in Oxford, missing two volumes that I later got elsewhere). I found several versions that had been OCR'd really badly, and have been cleaning up one of them enough that I can then try to use the others to detect errors.
The current version, converted first to XML and thence to HTML, is at words.fromoldbooks.org if anyone is interested. I'm hoping eventually to feed the cleaned-up text back to Project Gutenberg and archive.org, and to generate RDF from it.
Lots of interesting text-processing challenges, so it's been a useful diversion for a while.
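
For anyone curious what "use the other versions to detect errors" might look like, here is a minimal sketch in Python, not the actual pipeline. It assumes two plain-text OCR versions of the same passage and uses the standard library's difflib to list the places where they disagree; the function name `find_disagreements` and the sample sentences are made up for illustration, not taken from the dictionary.

```python
import difflib


def find_disagreements(base_text, other_text):
    """Return (base_words, other_words) pairs where two OCR versions differ.

    Hypothetical helper: compares word by word, so a disagreement marks a
    spot worth checking by hand, not a confirmed error in either version.
    """
    base_words = base_text.split()
    other_words = other_text.split()
    # autojunk=False stops difflib from ignoring very frequent words
    # in long texts.
    matcher = difflib.SequenceMatcher(None, base_words, other_words,
                                      autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append((" ".join(base_words[i1:i2]),
                                  " ".join(other_words[j1:j2])))
    return disagreements


if __name__ == "__main__":
    # Invented sample text with typical OCR confusions (rn/m, ll/U).
    mine = "He was born at Oxford in 1612, and educated at Merton college."
    theirs = "He was bom at Oxford in 1612, and educated at Merton coUege."
    for a, b in find_disagreements(mine, theirs):
        print(repr(a), "vs", repr(b))
```

With three or more versions, the same comparison could feed a simple majority vote per word, though in practice the versions would first need aligning page by page or entry by entry.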
