31 Aug 2005 Ankh   » (Master)

[update: 2 years later and we had a Summer without rain...]

All the way up here in Canada we're getting rain from Hurricane Katrina, now a tropical storm. We're getting maybe 50mm (2 inches) of rain in a few hours. One of our windows blew in (the whole frame, not the glass) during the night. Luckily, the cats didn't leap through the open window and go out. Or if they did, they leapt back inside. And I don't think any other animals came inside either (the perils of living in the country!)

I've done a little more work on lq-text, the text retrieval system that I first released (for Unix) in 1989. I'd like to teach it XPath, but for now I barely have enough time to make sure that the documentation is up to date and that the software actually builds. I see a few people downloading it each month but I rarely hear back from them. As far as I know, lq-text is still one of the better text indexing packages for plain text, but it doesn't do word processing files, PDF, etc. It does index HTML/XML/SGML but only by ignoring the element structure.

I've played a little with OCR programs recently. The GNU gocr turned out to be no help at all for old books (e.g. I tried one printed in 1845, and also saw samples others had tried). Here's some gocr output that's better than average:

Iu a mvoode;: box, in the cl;oir, Do?v lie.s a ?yen:8?Ҁ¢bably- _i;e emgg-., of wood, of ,a Cr;;,s_,adeT; mml3o he ww it is_ í;npossible to tell 8vit); any certaii:ty, but mh-e v.ei;ture to tl;í;3k it rejirt_,R._ents ui;e uf tl(e_ t?h-o

Here is the same passage as read by Abbyy.com's reader:

In a wooden box, in the choir, now lies a remarkably fine effigy. of wood, of a Crusader: who he was it is impossible to tell with any certainty, but we venture to think it represents one of the two distinguished persons

So you can guess which program I'm using. Frankly, if gocr had a user interface as clean as that of Abbyy's program, the quality might be more nearly tolerable: you can click anywhere on the image to go to the corresponding place in the text draft, and vice versa, and the spell checker aligns both text and image as you go, highlighting regions in both very clearly.

I made a transcription (is that the right word here?) using OCR of several pages from Sir Charles Knight's Old England, averaging less than five minutes per page, although careful proof-reading takes longer. I made a simple XML format that preserves all of the typographic distinctions in the original that I can discern and that appear to have been deliberate (e.g. I am not recording where a piece of metal type broke and lost a serif).

This preservation of distinctions is something Project Gutenberg doesn't seem to take care to do. For example, the `Encyclopaedia Gutenberg' (actually the OCR'd text from the 1911 Encyclopaedia Britannica) has lost all the small caps, which were used to denote implicit cross references. As an experiment I have ordered a DVD with scanned images, and I'll see (if the images are good enough) how long it takes me to get something as good. Probably not long if I use their text as a baseline, although some rudimentary analysis of the published Project Gutenberg text found a lot of obvious errors that I doubt are in the original. This is not to say I would not also have many errors, of course, but I don't have a team of people doing proofreading.

When I worked at SoftQuad we did conversion of texts into SGML, often charging US$50,000 or more for a project, but still undercutting some of the competition. The trick was extensive analysis and a lot of scripting. For example, the abbreviation q.v. usually marks a cross-reference, so check for the longest phrase before that marker to find a plausible target for a link. Of course, if there are typographical distinctions it's easier. So now I'm using some of that experience. The transcription I mentioned earlier has thumbnails of pictures. These are pictures I had already scanned over the past five or six years, but because I used consistent filenames I was able to connect them automatically to the text, which has references like (Fig. 12). This in turn gives me a list of figures that are not referenced, which helps me look for errors in the script or in the OCR'd text.
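The figure cross-check can be sketched in a few lines of shell. This is only an illustration, not the script actually used: it assumes scan filenames begin with a zero-padded figure number followed by a hyphen, and that the text cites figures as "Fig. 12". The function name unreferenced_figures and the sample filenames are made up for the example.

```shell
#!/bin/sh
# unreferenced_figures TEXTFILE IMAGEDIR   (hypothetical helper)
# Print figure numbers that have an image file but are never cited in the text.
unreferenced_figures() {
    text=$1
    imgdir=$2
    refs=$(mktemp)
    imgs=$(mktemp)
    # Figure numbers cited in the text, e.g. "Fig. 12" -> 12
    grep -o 'Fig\. [0-9][0-9]*' "$text" |
        sed 's/^Fig\. //; s/^0*\([0-9]\)/\1/' | sort -u > "$refs"
    # Figure numbers taken from filenames, e.g. "012-effigy.jpg" -> 12
    ls "$imgdir" | sed -n 's/^0*\([0-9][0-9]*\)-.*/\1/p' | sort -u > "$imgs"
    # comm -13 keeps lines found only in the second list: images nobody cites
    comm -13 "$refs" "$imgs"
    rm -f "$refs" "$imgs"
}

# Small demonstration with invented data
demo=$(mktemp -d)
mkdir "$demo/img"
printf 'A fine effigy (Fig. 12) and a font (Fig. 3).\n' > "$demo/text.txt"
touch "$demo/img/003-font.jpg" "$demo/img/012-effigy.jpg" "$demo/img/017-porch.jpg"
unreferenced_figures "$demo/text.txt" "$demo/img"    # prints 17
rm -rf "$demo"
```

Swapping comm -13 for comm -23 gives the opposite list: figures cited in the text for which there is no scan, which is often an OCR misreading of the number.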

Combining threads, I made an lq-text index to the Gutenbergopedia, and then I could get a keyword-in-context index of "q.v.":

$ lqphrase "q.v." | lqkwic
==== Document 1: vol1/aargau.xml ====
  1:ower course of the river Aar (q.v.), whence its name.
Its total area is 541
  2:hot sulphur springs of Baden (q.v.) and Schinznach,
while at Rheinfelden th
  3:pital of the canton is Aarau (q.v.), while other
important towns are Baden
  4:er important towns are Baden (q.v.), Zofingen (4591
inhabitants), Reinach (
==== Document 2: vol1/aaron.xml ====
  5: distinct from the Decalogue (q.v.) (Ex. xxxiii. seq.).
Kadesh, and not Sin
  6:o the Mosaite founder of Dan (q.v.). This throws no
light upon the name, wh
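A rough way to turn matches like these into candidate link targets, using plain grep and sed rather than lq-text itself. It is a simplification of the "longest phrase" idea: it takes only the run of capitalised words immediately before "(q.v.)", so lower-case targets are missed. The function name qv_targets is invented for the sketch, and the sample lines are trimmed from the output above.

```shell
#!/bin/sh
# qv_targets (hypothetical helper): read text on stdin and print, one per line,
# the capitalised phrase that immediately precedes each "(q.v.)".
qv_targets() {
    grep -oE '[A-Z][A-Za-z]*( [A-Z][A-Za-z]*)* \(q\.v\.\)' |
        sed 's/ (q\.v\.)$//'
}

qv_targets <<'EOF'
ower course of the river Aar (q.v.), whence its name.
hot sulphur springs of Baden (q.v.) and Schinznach,
pital of the canton is Aarau (q.v.), while other
er important towns are Baden (q.v.), Zofingen (4591
 distinct from the Decalogue (q.v.) (Ex. xxxiii. seq.).
o the Mosaite founder of Dan (q.v.). This throws no
EOF
```

Piping the result through sort | uniq -c | sort -rn shows which targets are cited most often, and anything that appears only once is worth a second look as a possible OCR error.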

Another good error-checking technique is to look for words that only occur once, or whose frequency is very different than one might expect. You need more than just one volume to do frequency analysis really, but I can already see words like a11erican (should be American), ciimate, AAstotle (Aristotle) and so on. In a way you can think of this as debugging: doing experiments that might reveal errors, and then correcting them.
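A minimal sketch of the occurs-only-once check with standard Unix tools (this is not how lq-text does it, and the function name rare_words is invented). It treats any run of letters and digits as a word, so that a token like a11erican survives intact.

```shell
#!/bin/sh
# rare_words (hypothetical helper): read text on stdin and print the
# tokens that occur exactly once, one per line.
rare_words() {
    tr -cs 'A-Za-z0-9' '\n' |     # one token per line; digits kept on purpose
        tr 'A-Z' 'a-z' |          # fold case so "The" and "the" count together
        sort | uniq -c |
        awk '$1 == 1 && NF == 2 { print $2 }'
}

printf 'The climate of the coast; the ciimate of the a11erican coast.\n' | rare_words
```

On a real volume the list is long and mostly legitimate rare words, so it helps to filter it against a dictionary or against the word list from another volume before reading through it.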

There are some other interesting things about OCR'd text to do with grammars and metadata, with links and expressing relationships, but I should put those in my XML blog when I get a chance.

On a tangentially related topic: I remember working at an aircraft company and seeing a junior consultant spend a day doing some editing that I could have done in under five minutes. He didn't know about regular expressions. I may have mentioned this here before, but another thing people often don't think of is to use regular expressions to generate shell scripts.

When I scan images I name the files with the figure number (or page number, if figures are not numbered) at the start, so they sort together, e.g.

-rwxr-xr-x  2 liam liam 200947 Aug  4  2003
-rwxr-xr-x  2 liam liam  54461 Aug  4  2003
-rwxr-xr-x  2 liam liam  68865 Aug  4  2003

(You can see these at fromoldbooks.org.) I use a shell script to extract the image size and rename the files with the widthxheight. It also extracts the JPEG compression quality and adds that if it's not 75%.

Now, suppose I got the figure number wrong, and I have a bunch of files to rename from 071- to 017- (or whatever).

I can use sed (no, don't panic) like this:

ls 071* | sed 's/^071-/017-/'

This gives me the new filenames:


But really I need to generate a set of Unix commands to rename the files:

ls 071* | sed 's/^071-\(.*\)/mv -i & 017-\1/'

If the expression intimidates you, take off your shoes and read it again :-) The \1 in the replacement part means whatever was matched by the \(...\). The & means the whole thing that was matched. So we get this:

mv -i
mv -i
mv -i

I have put the -i option to mv so that, if I make a mistake, mv will prompt me before overwriting files.

Now I'm ready to run it, and I can do that by piping my command to the shell:

ls 071* | sed 's/^071-\(.*\)/mv -i & 017-\1/' | sh

If all this sounds pointless compared to issuing three mv commands and using filename completion with tabs, I'll mention that I usually end up doing it in three or four directories, since I want to rename the original scans as well as the JPEG files I put on the Web, and also that I deliberately used a real but short example.
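Since the same rename gets repeated in several directories, the sed command can be wrapped in a small function. This is a sketch only: rename_prefix and the filenames are made up, it assumes the prefixes are plain digits, and it will not cope with filenames containing spaces.

```shell
#!/bin/sh
# rename_prefix OLD NEW   (hypothetical helper)
# Print mv commands that rename OLD-* to NEW-* in the current directory.
# Nothing is renamed until the output is piped to sh.
rename_prefix() {
    ls | sed -n "s/^$1-\(.*\)/mv -i & $2-\1/p"
}

# Try it in a scratch directory with invented filenames
scratch=$(mktemp -d)
(
    cd "$scratch" || exit 1
    touch 071-abbey.jpg 071-abbey-small.jpg 072-font.jpg
    rename_prefix 071 017          # preview the generated commands
    rename_prefix 071 017 | sh     # then actually run them
    ls
)
rm -rf "$scratch"
```

Looking at the generated commands before adding | sh is the whole point: the preview costs one extra up-arrow, and a wrong regular expression shows up as wrong mv lines rather than as wrongly renamed files.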

The technique of constructing programs on the fly is a very powerful one, and is also used with XSLT, but with shell scripts you get the added benefit that reuse is just an up-arrow away in your history! (or a control-P away if, like me, you don't use the arrow keys much because it's faster to use the control-key equivalents).

OK, enough rambling for now.
