[update: 2 years later and we had a Summer without rain...]
All the way up here in Canada we're getting rain from
Hurricane Katrina, now a tropical storm. We're getting
maybe 50mm (2 inches) of rain in a few hours. One of our
windows blew in (the whole frame, not the glass) during the
night. Luckily, the cats didn't leap through the open window
and go out. Or if they did, they leapt back inside. And I
don't think any other animals came inside either (the perils
of living in the country!)
I've done a little more work on lq-text,
the text retrieval system that I first released (for Unix)
in 1989. I'd like to teach it XPath, but for now I barely
have enough time to work on making sure the documentation is
up to date, and that the software actually builds. I see a
few people downloading it each month but I rarely hear back
from them. As far as I know, lq-text is still one of the
better text indexing packages for plain text, but it doesn't
do word processing files, PDF, etc. It does index
HTML/XML/SGML but only by ignoring the element structure.
I've played a little with OCR programs recently. The
GNU gocr turned out to be no help at all for old books (e.g.
I tried one printed in 1845, and also saw samples others had
tried). Here's some gocr output that's better than
average:
Iu a mvoode;: box, in the cl;oir, Do?v lie.s a
?yen:8?Ҁ¢bably- _i;e emgg-.,
of wood, of ,a Cr;;,s_,adeT; mml3o he ww it is_ í;npossible
to tell 8vit);
any certaii:ty, but mh-e v.ei;ture to tl;í;3k it
rejirt_,R._ents ui;e uf tl(e_ t?h-o
Here is the same passage as read by Abbyy.com's reader:
In a wooden box, in the choir, now lies a remarkably
fine effigy. of wood, of a Crusader: who he was it is
impossible to tell with any certainty, but we venture to
think it represents one of the two distinguished persons
So you can guess which program I'm using. Frankly, if
gocr had a user interface as clean as that of Abbyy's
program, the quality might be more nearly tolerable: you can
click anywhere on the image to go to the corresponding place
in the text draft, and vice versa, and the spell checker
aligns both text and image as you go, highlighting regions
in both very clearly.
I made a transcription
(is that the right word here?) using OCR of several
pages from Sir Charles Knight's Old England averaging
less than five minutes per page, although careful
proof-reading takes longer. I made a simple XML format that
preserves all of the typographic distinctions in the
original that I can discern and that appear to have been
deliberate (e.g. I am not recording where a piece of metal
type broke and lost a serif).
This preservation of distinctions is something Project
Gutenberg doesn't seem to take care to do. For example, the
`Encyclopedia Gutenberg' (actually the OCR'd text from the
1911 Encyclopaedia Britannica) has lost all the small
caps, which were used to denote implicit cross references.
As an experiment I have ordered a DVD with scanned images,
and I'll see (if the images are good enough) how long it
takes me to get something as good. Probably not long if I
use their text as a baseline, although some rudimentary
analysis of the published Project Gutenberg text found a lot
of obvious errors that I doubt are in the original. This is
not to say I would not also have many errors, of course, but
I don't have a team of people doing proofreading.
When I worked at SoftQuad we did conversion of texts
into SGML, often charging US$50,000 or more for a project,
but still undercutting some of the competition. The trick
was extensive analysis and a lot of scripting. For example,
the abbreviation q.v. usually marks a
cross-reference, so check for the longest phrase before that
marker to find a plausible target for a link. Of course, if
there are typographical distinctions it's easier. So now
I'm using some of that experience. The transcription I
mentioned earlier has thumbnails of pictures. These are
pictures I had already scanned over the past five or six
years, but because I used consistent filenames I was able to
connect them to the text, which has references like (Fig.
12), automatically. This in turn gives me a list of figures
not referenced, which helps me look for errors in the script
or in the OCR'd text.
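The cross-check between figure references and scanned files might look something like the sketch below. It is not the script described above: the function names are made up for illustration, and it assumes that the text cites figures as "Fig. N", that image filenames start with the zero-padded figure number and a hyphen, and that grep supports the -o option (GNU and BSD grep do).

```shell
#!/bin/sh
# Sketch only: the function names and file layout are assumptions.

# Figure numbers cited in a text file as "Fig. N", one per line.
cited_figures() {
    grep -o 'Fig\. *[0-9][0-9]*' "$1" | sed 's/[^0-9]//g' | sort -u
}

# Figure numbers taken from image names such as 071-Some-Title-581x857.jpg,
# with leading zeros stripped so that 071 compares equal to "Fig. 71".
scanned_figures() {
    ls "$1" | sed -n 's/^0*\([0-9][0-9]*\)-.*/\1/p' | sort -u
}

# Figures that have a scan but are never cited in the text.
unreferenced_figures() {    # usage: unreferenced_figures TEXTFILE IMAGEDIR
    tmp=$(mktemp -d) || return 1
    cited_figures "$1" > "$tmp/cited"
    scanned_figures "$2" > "$tmp/scanned"
    comm -23 "$tmp/scanned" "$tmp/cited"    # lines only in the scanned list
    rm -rf "$tmp"
}
```

Swapping the two arguments to comm (or using comm -13) gives the opposite list: figures the text cites for which no scan exists, which is where OCR errors in the figure numbers tend to show up.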
Combining threads, I made an lq-text index to the
Gutenbergopedia, and then I could get a keyword-in-context
index of "q.v.":
$ lqphrase "q.v." | lqkwic
==== Document 1: vol1/aargau.xml ====
1:ower course of the river Aar (q.v.), whence its name. Its total area is 541
2:hot sulphur springs of Baden (q.v.) and Schinznach, while at Rheinfelden th
3:pital of the canton is Aarau (q.v.), while other important towns are Baden
4:er important towns are Baden (q.v.), Zofingen (4591 inhabitants), Reinach (
==== Document 2: vol1/aaron.xml ====
5: distinct from the Decalogue (q.v.) (Ex. xxxiii. seq.). Kadesh, and not Sin
6:o the Mosaite founder of Dan (q.v.). This throws no light upon the name, wh
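The next step, pulling out a candidate link target for each hit, can be roughed out with grep and sed. This is a much cruder rule than the longest-phrase analysis described earlier: it assumes the target is the run of capitalised words immediately before "(q.v.)". The name qv_targets is made up for this sketch, and grep -o is assumed to be available.

```shell
#!/bin/sh
# Sketch only: guesses the cross-reference target as the run of
# capitalised words right before "(q.v.)". Reads text on standard input.
qv_targets() {
    grep -o '[[:upper:]][[:alpha:]]*\( [[:upper:]][[:alpha:]]*\)* (q\.v\.)' |
        sed 's/ (q\.v\.)$//' |    # drop the marker, keep the phrase
        sort -u                   # one line per distinct target
}
```

Fed the four hits from aargau.xml above, it prints Aar, Aarau and Baden. Targets written in lower case are missed entirely, which is one reason real conversion work needs more analysis than this.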
Another good error-checking technique is to look for
words that only occur once, or whose frequency is very
different than one might expect. You need more than just
one volume to do frequency analysis really, but I can
already see words like a11erican (should be
American), ciimate, AAstotle (Aristotle) and
so on. In a way you can think of this as debugging: doing
experiments that might reveal errors, and then correcting them.
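Listing the words that occur only once needs nothing beyond the standard Unix tools. Here is a minimal sketch; rare_words is a made-up name, and digits are deliberately kept as part of a word so that misreadings like a11erican survive as a single token.

```shell
#!/bin/sh
# Sketch only: print the words that occur exactly once in the text
# read from standard input, ignoring case.
rare_words() {
    tr -cs '[:alnum:]' '\n' |           # one word per line
        tr '[:upper:]' '[:lower:]' |    # fold case
        sed '/^$/d' |                   # drop the empty line tr can leave
        sort |
        uniq -u                         # keep only lines that are not repeated
}
```

Most of what comes out will be perfectly good rare words, so the list is only a set of candidates to check against the page images, not a list of errors.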
There are some other interesting things about OCR'd text
to do with grammars and metadata, with links and expressing
relationships, but I should put those in my XML blog when I
get a chance.
On a tangentially related topic: I remember working at
an aircraft company and seeing a junior consultant spend a
day doing some editing that I could have done in under five
minutes. He didn't know about regular expressions. I may
have mentioned this here before, but another thing people
often don't think of is to use regular expressions to
generate shell scripts.
When I scan images I name the files with the figure
number (or page number, if figures are not numbered) at the
start, so they sort together, e.g.
-rwxr-xr-x 2 liam liam 200947 Aug 4 2003 071-Penshurst-Place-Kent-the-great-hall-1032x1522.jpg
-rwxr-xr-x 2 liam liam 54461 Aug 4 2003 071-Penshurst-Place-Kent-the-great-hall-581x857.jpg
-rwxr-xr-x 2 liam liam 68865 Aug 4 2003 071-Penshurst-Place-Kent-the-great-hall-774x1142.jpg
(you can see these at
fromoldbooks.org).
I use a shell script to extract the image size and rename
the files with the widthxheight at the end of the name. It
also extracts the JPEG compression quality and adds that if
it's not 75%.
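The renaming script itself isn't shown here, so the following is only a guess at its general shape. It assumes ImageMagick's identify command is installed (its %w, %h and %Q format escapes report width, height and compression quality). The function names and the "-qNN" suffix used for the quality are invented for the example.

```shell
#!/bin/sh
# Sketch only: sized_name, rename_with_size and the -qNN suffix are
# assumptions, not the script described in the text.

# Build the new filename from a name, width, height and JPEG quality.
sized_name() {    # usage: sized_name FILE WIDTH HEIGHT QUALITY
    stem=${1%.*}
    ext=${1##*.}
    suffix="${2}x${3}"
    if [ "$4" -ne 75 ]; then
        suffix="$suffix-q$4"    # only record a non-default quality
    fi
    printf '%s-%s.%s\n' "$stem" "$suffix" "$ext"
}

# Rename each file given, asking ImageMagick for the numbers.
rename_with_size() {    # usage: rename_with_size FILE...
    for f in "$@"; do
        w=$(identify -format '%w' "$f") || continue
        h=$(identify -format '%h' "$f") || continue
        q=$(identify -format '%Q' "$f") || continue
        mv -i -- "$f" "$(sized_name "$f" "$w" "$h" "$q")"
    done
}
```

Keeping the name-building separate from the call to identify means the naming rule can be tried out by hand, for example with sized_name 071-hall.jpg 581 857 75, before any file is touched.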
Now, suppose I got the figure number wrong, and I have a
bunch of files to rename from 071- to 017- (or whatever).
I can use sed (no, don't panic) like this:
ls 071* | sed 's/^071-/017-/'
This gives me the new filenames:
017-Penshurst-Place-Kent-the-great-hall-1032x1522.jpg
017-Penshurst-Place-Kent-the-great-hall-581x857.jpg
017-Penshurst-Place-Kent-the-great-hall-774x1142.jpg
But really I need to generate a set of Unix commands to
rename the files:
ls 071* | sed 's/^071-\(.*\)/mv -i & 017-\1/'
If the expression intimidates you, take off your shoes
and read it again :-) The \1 in the
replacement part means whatever was matched by the
\(...\). The & means the whole
thing that was matched. So we get this:
mv -i 071-Penshurst-Place-Kent-the-great-hall-1032x1522.jpg 017-Penshurst-Place-Kent-the-great-hall-1032x1522.jpg
mv -i 071-Penshurst-Place-Kent-the-great-hall-581x857.jpg 017-Penshurst-Place-Kent-the-great-hall-581x857.jpg
mv -i 071-Penshurst-Place-Kent-the-great-hall-774x1142.jpg 017-Penshurst-Place-Kent-the-great-hall-774x1142.jpg
I have put the -i option to mv so that,
if I make a mistake, mv will prompt me before
overwriting files.
Now I'm ready to run it, and I can do that by piping my
command to the shell:
ls 071* | sed 's/^071-\(.*\)/mv -i & 017-\1/' | sh
If all this sounds pointless compared to issuing three
mv commands and using filename completion with tabs,
I'll mention that I usually end up doing it in three or four
directories, since I want to rename the original scans as
well as the JPEG files I put on the Web, and also that I
used a real but short example deliberately.
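For repeating the rename across several directories, the pipeline can be wrapped in a small function. This is a sketch under some loud assumptions: rename_cmds is a made-up name, the prefixes are pasted straight into the sed expression (so they must not contain slashes or regular-expression characters), and the filenames must not contain spaces or shell metacharacters.

```shell
#!/bin/sh
# Sketch only: print (but do not run) the mv commands that change a
# filename prefix for every matching file in the current directory.
rename_cmds() {    # usage: rename_cmds OLDPREFIX NEWPREFIX
    ls | sed -n "s/^$1\(.*\)/mv -i & $2\1/p"
}

# Inspect the commands first:
#   rename_cmds 071- 017-
# Then run them in each directory (directory names here are hypothetical):
#   for d in scans web; do (cd "$d" && rename_cmds 071- 017- | sh); done
```

One caveat: when the commands are piped into sh, mv inherits the pipe as its standard input, so any prompt from mv -i is answered from the pipe rather than the keyboard. Reading the generated commands before running them is the real safety net.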
The technique of constructing programs on the fly is a
very powerful one, and is also used with XSLT, but with
shell scripts you get the added benefit that reuse is just
an up-arrow away in your history! (or a control-P away if,
like me, you don't use the arrow keys much because it's
faster to use the control-key equivalents).
OK, enough rambling for now.