Name: Liam Quin
Member since: 2000-02-18 03:07:39
Last Login: 2009-07-24 20:54:01
Homepage: http://www.holoweb.net/~liam/
Notes:
Living (as a home owner) near Milford, Ontario, an SGML and XML Guru, text retrieval (lq-text), Unix and C programming since 1981 (urp!), Open Source and freeware since 1983 (well, that predates the FSF and GNU and the term Open Source, OK), IRC (Ankh usually), SGML since 1987, co-author of The XML Specification Guide (Wiley 1999), author of the Open Source XML Database Toolkit (Wiley, 2000), and one of three authors of Mastering XML Premium Edition (Sybex).
I currently work for W3C as XML Activity Lead.
Have also been involved in, or worked with, the X Window System, typography, DSSSL, XSLT, Scheme, C, Canadian font standards representative / advisor for ISO-related work, known as the barefoot programmer, what else should I say?
In spare time I scan old photos and engravings from antiquarian books, and put them on the Web together with extracts from the books.
Trying to do a Gtk front end to lq-text, going for long walks in bare feet.
You can email me as liam at holoweb.net if you like. Tell me what colour socks you're wearing.
Ankh certified:
- Graydon, whom I knew when he worked for me
- trance9, whom I've known for even longer than graydon, but who wears shoes more
- zodiac, who is my brother
- deus_x, who is writing an interesting content management system (EFnet/#Perl)
- jwz and jef, who were both giving away X and graphics utilities long before Mosaic was born
- milambar, whom I know from SorceryNet and have met
- jivera, halcy0n and Mysidia, also from SorceryNet
- some people as apprentice so they could post, or at their request
I spent some time with Marc Lehmann's String::Similarity module, which seems to do reasonably well on finding similar strings that were OCR'd independently. I wish Google would get a clue and make higher resolution scans: the OCR error rate would drop hugely, they'd get more of the punctuation and footnotes, and they might even start capturing some of the diagrams! The problem is that it's more lucrative to have millions of badly scanned crap than to have hundreds of thousands of well-scanned books, it seems.
The current version, converted first to XML and thence to HTML, is at words.fromoldbooks.org if anyone is interested. I'm hoping to be able to feed the cleaned up text back to Project Gutenberg and archive.org eventually, and to generate RDF.
Lots of interesting text processing challenges, so a useful diversion for a while.
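For anyone curious what matching independently OCR'd strings looks like in practice, here is a minimal sketch. It is not String::Similarity: it swaps in Python's standard difflib.SequenceMatcher, which uses a different algorithm, and the function names and the 0.8 threshold are assumptions made up for the illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score between 0 and 1; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(line: str, candidates: list[str], threshold: float = 0.8):
    """Return the (candidate, score) pair most similar to line.

    Returns None if there are no candidates or if none reaches the
    threshold. The default of 0.8 is an arbitrary assumption, not a
    tuned value.
    """
    scored = [(c, similarity(line, c)) for c in candidates]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None
```

Calling best_match("The quick brovvn fox", ["The quick brown fox", "A slow red hen"]) picks the first candidate, since an OCR slip like "vv" for "w" only costs a little of the score.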
Image editing is going much better with 8 Gigabytes of memory. I've been able to get three or four images done for FromOldBooks.org in the time it used to take to do one.
On the other hand, the only reason I get any images scanned and edited at all is because I get too tired to do much else; it's pretty insanely busy here.
Unfortunately, Google's ads almost entirely stopped working on my Web site (Google downgraded my pagerank from 8 to 4 a few months ago), and with the fall in the US dollar (it's been bushed), we're struggling a bit more than we'd like. OK, a lot more than we'd like.
Luckily, my spam says that I won the UK Microsoft email lottery, and the prize is either (1) all of Nigeria, or (2) more spam. Speaking of which, SpamAssassin seems to be working better after a one-line fix (I filed a bug for it). Or at least it's not complaining as much.
So, today's image (no, I won't post them every day) is an ammeter from an 1892 book:
24 Jul 2008 (updated 24 Jul 2008 at 22:30 UTC)
It's been the rainiest July on record here - and the month isn't over yet, of course. We discovered that the swimming pool can indeed fill above the top of its liner.
And during the storms, the dog, who is possessed by a daemon, becomes uncontrollable, or controllable only with difficulty.
I still miss being able to have time to concentrate, to focus enough to write reasonable amounts of code, to program. Working at W3C means I get to have a vague warm fuzzy feeling about helping the world a teeny bit, but it isn't always enough compensation.
In what little spare time I have, I scan pictures from old books. Someone recently made a set of Photoshop brushes from the 16th century demonic seals from the Goetia, and the two sets have each had over 900 downloads (they are here and here if you are into such things). I have well over 2,000 images now, with sometimes fairly substantial extracts from the books, captions and other metadata. And there's an encyclopædia, some dictionaries of slang (including Brewer's Phrase and Fable), most of a vitriolic satirical political dictionary from the 1790s, and a bunch of other stuff.
Most of the text is in XML, so every now and then I update the XSLT that makes the HTML files and add smarts to find more cross-references. I want to do geotagging and links to maps, but this is harder than it sounds because the placenames I have are usually from when the books were published, not today.
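As a rough illustration of the cross-reference idea, here is a small sketch in Python rather than XSLT. It is not the stylesheet used on the site; the function name and the approach (matching known headwords as whole words, longest first) are assumptions for the example.

```python
import re


def find_cross_references(text: str, headwords: set[str]) -> list[str]:
    """Return the headwords mentioned in text, in order of first appearance."""
    if not headwords:
        return []
    # Try longer headwords first so "Old Nick" wins over "Old"
    # when both could match at the same position.
    ordered = sorted(headwords, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(h) for h in ordered) + r")\b",
        re.IGNORECASE,
    )
    # Map lower-cased matches back to the headword's original spelling.
    canonical = {h.lower(): h for h in headwords}
    found = []
    for match in pattern.finditer(text):
        word = canonical[match.group(0).lower()]
        if word not in found:
            found.append(word)
    return found
```

Each headword found this way could then be turned into a link to its own entry. Placenames are harder for the reason given above: a plain string match says nothing about whether the name still means the same place today.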
Today's addition is some pictures of fonts, from a book I bought in Boston a couple of weeks ago, although these are not font samples as most people here would expect them to be, I suspect :-)
I did get to do some programming recently, though, and added some XML support to my ancient text retrieval package, lq-text. The changes won't be released until I finish with some UTF-8 issues, but if you are interested, drop me a line. I wrote a short paper on it for the Balisage markup conference, too. I hope soon I'll use lq-text for the search function on my Web site, alongside the XQuery-based search that I have now.
Spending time on XML as character strings makes the world of RDF seem even further away, but I'm reading an interesting book on Ontology Matching to make up for it, in between scanning pictures and working on stuff for XQuery and for XSL-FO 2.0.
Now it's time to go and sedate the dog with some herbal calmer.
I was using the Perl interface, and maybe that's a mistake, because it's obvious that they don't spend as much effort on it as on the C API. The documentation is very minimal, for example. But in the end, and after uninstalling all unwanted versions of bsd db from my laptop, it worked. Query time went down from 11 seconds to 2 seconds, partly because the 11-second version is starting a JVM for each query, partly because dbxml is in C, and partly because I had to remove some features from the query because I couldn't get them to work.
After help from one of the people maintaining the software, I discovered that I'll be able to get the other features to work. The search engine on my Web site isn't actually too slow for most queries (try it here) but it's using more memory than I'd like, and there are some queries on my photographs that do take too long.
The good thing about using XQuery to develop these things is that it's relatively easy to make changes. So maybe some changes are coming.