Older blog entries for Ankh (starting at number 176)

I got behind with digital photos, so I've been uploading basically unsorted pictures; I'm up to 2004, and in particular to the holiday in the UK that my husband [yes, live with it] and I had in September of that year. Some of the pictures are pretty good, but of course most aren't, so I'll have to try and make some selections eventually.

Last week I spent some time explaining to someone the difference between XML's name-based typing and the structure-based typing that was in an early draft of XML Query. I suppose you could say the structure-based typing was like that of an early version of the C Programming Language, in which two types were entirely compatible if they had the same storage size. You could assign bar = foo, in other words, if the number of bits in each variable was the same (more or less). By 1978 C had evolved past this, and there were implementations in which, if you did
    typedef int hatsize;
    typedef int shoesize;

you couldn't call a function expecting a shoesize and pass a hatsize without at least getting a warning. In Java or C++ it would be absurd for an assignment across classes to be anything but an error, regardless of storage sizes. And it's absurd in XML too, in most cases.

Slowly working on my XML blog; I should add stuff about strong typing.

My husband installed a codec for a Web site he trusted, which turned out to've been misplaced trust, as it installed some virulent malware that keeps popping up saying you've been infected with adware or spyware, and need to buy their anti-adware tool. Of course, to make this credible, it also installs some adware in the background.

Part-way through a new Windows installation using the Acer recovery disks, we discovered that one of the disks was missing. And this left the laptop unusable. Well, usable by Linux :-) Luckily, Acer agreed (for a surprisingly small fee) to ship replacement CDs by overnight courier, so we should have them in a couple of days (you have to add a day for the border, usually).

Stupid marketing flyer of the week comes from The Source by Circuit City, which used to be Radio Shack. On the laptop with the smallest screen and least memory they say "Increased memory and larger screen is ideal for gaming and graphics"; on a mouse, "feel the precision with an optical mouse" (all their mice shown are optical)... there's a memory stick shown with the caption "SONY 512MB Memory Stick PRO is smaller than a stick of gum" -- possibly, but the stick shown clearly says 256MB on it. Two adjacent cameras have captions, (1) "6MP digital camera has everything you need to capture your best shots" and (2) "Everything you need in 6MP digital camera". I'm not sure how those captions are supposed to help differentiate the products. Looks like maybe The Source isn't long for this world.

On a more positive note, some of my calligraphy was used on the front cover of an American current affairs magazine called Time, which is cool.

rmathew, I'm not sure I'd take Ian Hixie's rant quite as strongly as you seem to've done. With only a little care you can serve XHTML documents as text/html and use XML tools with them just fine; I suspect Ian Hixie doesn't use XML tools very much. Opera (where he worked until recently) was very much dragged kicking and screaming into a world in which XML support was a given, and they have only recently added client-side XSLT to their browser.

On the subject of renaming folders, it's worth putting a redirect into your .htaccess or apache.conf so that you don't break the Web. Well, so that you don't break your bit of it :-)

Been going through piles of old digital photos, pictures from 2004, slowly catching up. Most of them I took for use as stock for people into photomanipulation, and a lot of them have been used. But I had only posted some of them. Also expanded my Calligraphy booklist somewhat.

I spent much of today patching holes in the upstairs of the barn that we use as an art gallery, so the birds can't nest in it and then spill their poop onto the artwork below. I hope that we get to a point where my husband and I can concentrate on making art, writing software, the garden, and life, instead of concentrating on working on the house. But it's going to take a while!

Distressingly, some scanned engravings from a book on torture have turned out to be fairly popular. Maybe I should not be surprised. I'm glad the images are not in colour.

In the unexpected light relief department, I was going from Toronto airport into the US one day last year, and the US immigration official asked me my job. I said (for the sake of simplicity) that I work for a standards organisation. He promptly looked at me and said, “what have you got in that suitcase?”
“Clothes and toiletries” said I, whereupon he asked,
“You're sure you've not got any metric in there? We don't want any of that!”
I assured him that the metric system was from a different organisation and that we (W3C) don't do that, and he grudgingly let me pass. Was that a twinkle in his eye?

Back from travel to France (W3C Tech Plenary) and California (Unicode conference). The Unicode conference reminds me (it doesn't take much to remind me) of some of the work that is still needed to tame fonts on Linux.

Afterwards, spent some time putting up more scans from old books on the Web. I did some reasonably high resolution scans of some 16th century type (2400dpi I think, I forget) but the files tend to be too large for comfortable Web viewing or downloading. I'm willing to digitise more type samples if they are of use to anyone, and also of course to host them, together with metadata and a search interface. I'm more likely to get requests for pretty initials (drop caps) or for castle plans, though, most of the time.

I wanted to play with the Google map interface to try and provide another interface to locations depicted in the images, but I haven't yet found the time.

titus mentioned Jon Udell's blog entry quoting R0ml as saying that open source means you don't need standards, because you take away the concept of ownership of a core technology. This is a bogus argument in oh-so-many ways. First, being open doesn't always mean that the core technology is not owned by some group or individual. A fork isn't always feasible. But that minor quibble aside, standards do not address the issue of who owns technology. They are about having multiple implementations that work together.

Standards are the reason I can list over 100 open source IRC clients that work together, and the reason there can be so many different clients for the World Wide Web on so many different sorts of hardware and in so many environments.

We need both open implementations and open specifications, and we need specifications to be freely available (as in beer) so that people can afford to implement to the spec directly.

zanee asked how to go about improving an open source project where the design may be questionable but the maintainers and developers don't admit there's a problem.

The act of wondering how to act is a necessary first step, and you've taken it :-) (I sound like a horrorscope from a cheap paper).

It's often a case of having to be very tactful, and also having to get the developers to want to make the changes, and being confrontational is unlikely to do that. Supplying a patch may help, as might convincing your company that they need faster time-to-market/turnaround, or whatever their jargon happens to be, and that it's unreasonable for the software to take 14 hours to run. The first approach gets you working with the developers, and the second gets pressure applied on them to improve the product.

I should note, by the way, that there is nothing about writing object-oriented code in Perl that prevents you from using use strict, and also that Devel::DProf should certainly work, although there are problems with thread-enabled Perl on some platforms that might conceivably interfere, and native (XS) method calls might also be a problem. You may find use strict; no strict 'vars'; of use; see the perldoc page for strict.
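
To illustrate (this is just a throwaway sketch, with made-up package and method names, not code from any particular project), object-oriented Perl compiles quite happily under use strict:

#!/usr/bin/perl
use strict;
use warnings;

package HatSize;

# A tiny class: nothing about OO Perl conflicts with use strict.
sub new {
    my ($class, %args) = @_;
    my $self = { inches => $args{inches} || 0 };
    return bless $self, $class;
}

sub inches {
    my ($self) = @_;
    return $self->{inches};
}

package main;

my $hat = HatSize->new(inches => 23);
print "hat size: ", $hat->inches, "\n";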

Your rant about how "people off the street" should be able to compile some complex Linux package is not, I think, well put. If software is too difficult for the people who intend to use it to configure, that's the software's problem. Every time.

vab, welcome to MIT; it's a fun place to work.

For people reading this blog syndicated, the original article is currently on advogato.org/recentlog.html which shows the most recent few entries.

Wow, lots of snow fell.

pesco, why invent a new syntax? The advantage of XML is not that it's particularly elegant, but rather that it's widely used. There are some nice features -- the end tags, for example, add a level of redundancy/error checking that is particularly useful for markup that can't easily be checked by computer as "right" or "wrong". And there are some less nice features. But most of the interesting research is at the edges (as Tommie Usdin said in her keynote at the Extreme Markup conference this year).

To be sure, there are experiments with alternate syntaxes, such as LMNL, an experimental markup language supporting overlap and structured attributes. Overlap is probably the biggest driving force in markup research at the syntax level today, although most people still try to stay within the bounds of XML in order to take advantage of all that XML software and understanding.

You mention functions. I claim that well-designed XML markup vocabularies are declarative in that they indicate a result or a meaning, if you will, rather than giving an algorithm. Of course, XSLT and XML Query straddle the boundary here. But they operate on XML documents, and those documents can be used and interpreted in many ways. Consider taking an SVG document and producing a set of colour swatches representing the colours used in the image described.

I see a lot of proposals for alternatives to XML, and I'm always interested to see use cases and hear what is being solved that XML can't do. Sometimes people think XML can't do things that it can; sometimes they think that a "more elegant" solution will appeal to a wider audience (but very rarely can people from such widely differing communities as XML users agree on "elegant"), forgetting that it is not elegance that drives the adoption of XML; sometimes they genuinely have new ideas, or things they really can't do with XML.

We recently created a Working Group at W3C to investigate efficient interchange of XML; even Microsoft, who vociferously and publicly opposed the creation of the Working Group, have since said they'll consider using the result if it meets their needs, or if their customers demand it (and they do). Of perhaps more interest to you, we also created an XML Processing Model Working Group to standardise on a way to say how an XML document is to be processed. It's a sort of functional scripting language for XML, or that's what I hope it will be. The processing model work is being done in public view, so you can watch, and maybe also get involved.

[disclaimer, in case it is not obvious: I am the XML Activity Lead at W3C]

Too many blogs. What to do? I wanted to put together some notes on buying and owning a home but rather than start a new blog, I am just making static Web pages for now. Once I'm up to a dozen or so pages I'll maybe rethink things.

Also working on two papers, one for the Unicode conference next Spring (on XQuery) and one for XTech 2006 (the conference is subtitled "Building Web 2.0").

fxn, yes, I agree strongly with Tim's comments there. Before the FSF started, the Unix community used to share "public domain" software. However, I should also give Richard and the FSF credit for a unifying vision of a complete freely-available operating system -- I'll say freely available because, like Tim, I think the politics of the Free part caused some problems.

I actually do support Free software, but I also prefer to try and form consensus and agreement with all parties, and the FSF at that time wasn't known for flexible compromise. I don't think there are easy answers, though.

Spent some time once more thinking about fonts and typography in the open source and Free software community. I wrote a little more of an article on the subject and then got distracted again.

It's hard to reconcile the feature needs of professionals with ease of use for everyone else. The About Face book talks about perpetual intermediates being good people to bear in mind when designing software. I'd like to ask for more powerful font-choosing software, but most people don't care enough about fonts to want the options. So the right answer is to make more of the font environment "just work".

Also remembered my xml blog which is rather barren right now. I need to say something controversial, such as Linux wears white socks or sort is better than cat because it has more options and then I'd get lots of comments. But then I'd wish I had adverts on my blog :-) Maybe I'll blog about efficient XML, but for now it's about opaqueness and meaning.

The adverts on fromoldbooks.org/ now more than pay for the hosting, plus a sandwich for lunch each day. It's a balance: I don't want them to be too obtrusive, but the Google ads have two advantages over others I've tried: they seem to cause Google to index your pages more frequently, and they display relatively interesting information, so I have kept them for now. I also noticed, in trying another company, that the impressions-per-day figures were very different: Google said I had many more page hits than the other company did. Since I have access to my Web server's logs I could check, and Google won out.

TordJ, did you know that "cellar door" was the phrase that got Tolkien started down the path of the Elvish language and the Lord of the Rings? He loved the sound of it. So do I, at least when said with an English or Welsh accent: celadaw.

Binary XML Politics has been interesting of late. I ran sessions at a number of conferences on three different continents, and found that people attending were in favour of W3C defining a more efficient transfer mechanism for XML, but vigorously opposed to "binary XML".

The reasons for opposition varied widely. Very few were stated clearly or coherently, so it's difficult to agree or disagree with them. As best I understand it, people are concerned about W3C introducing a second representation of XML documents into a world that already has dozens of widespread representations and probably thousands all told.

For instance, as far as an XML processor that doesn't understand EBCDIC knows, an XML document marked as encoded in EBCDIC might as well be some form of binary goop -- it's perhaps well-formed XML, but the processor can't do anything with it at all, not even pass it back to an application or check to see if it's well-formed. It just rejects it.

There are already standards (more than one at ISO, at least one at IETF, probably others elsewhere) specifying ways for XML to be interchanged, e.g. over a network, in various non-textual forms, ranging from gzip to the ASN.1 encoding used in Fast Infoset. Most of these will stick around for a long time, although perhaps some of them will be used much less often if W3C defines a spec for exchanging XML documents efficiently.

The politics is irritating because it seems to be based on spreading distrust rather than on technical arguments. Joe Gregorio wrote an article that doesn't allow comments back (fear? fear of spam? I don't know) but that seems pretty paranoid; it says in essence (as I read it) "W3C is saying they are doing one thing but really doing something sinister and evil", without ever explaining why the thing is actually sinister or evil, and without justifying the claim in the slightest. I don't really know how to respond to paranoia apart from suggesting therapy and medical help. Of course, Joe could join the Efficient XML Interchange Working Group, but it's presumably more fun to make snide comments at a distance. I'm not sure saying "no, we're not doing something evil" would have much effect.

I'm singling out Joe here, whom I have never actually met. There are quite a few other people, some of whom I have met, and some of whom work for organizations with a reputation for spreading FUD, helping to make sure anyone with sensible arguments doesn't get heard. I've actually tried quite hard, as have others at W3C and elsewhere, to understand the arguments. I really have. I've flown to Japan, been to Europe and the US and Canada, spoken with (and listened to) many people, and in the end the strong, coherent, well-researched and technically supported arguments on the one side seem to me to outweigh the gibberish, emotional arguments and ranting on the other.

Even that doesn't mean the side who can communicate clearly is right in any useful sense, only that I'm in a position to try to evaluate whether they are right. Nor would it be fair to paint everyone (on either side) with this over-simplifying brush. There have been clear arguments. I remember one from Michael Rys of Microsoft, for example, that was very clearly stated: he was against anything except defining an efficient format as a variant encoding (like <?xml version="goop1.0"?>), so that we are not partitioning the world into two camps and not, as he put it (this from memory), weakening the foundations. It's an argument we heard clearly from a number of people and organisations, and have heeded.

I spent some effort this year to try and help XMLers have a clearer perception of what we're trying to do at W3C, and of the processes we use, some of which have come in part from the IETF, some from open source projects, some from ISO and other standards organisations, and some from within the W3C and its participants. I don't think W3C is perfect. Neither do I think all of our specs are good (although some are definitely above average as specs go). But neither are we evil demons seeking to destroy the XML we have created.

Oh well.

Transcriptions of texts from old books have interested me for years. I have had an eighteenth-century dictionary of underworld slang on my Web site for several years now, and it gets quite a few hits, is linked to by Wikipedia, etc. I recently added a second one, by Captain Francis Grose; it's a little later, The Dictionary of the Vulgar Tongue. The interesting thing about this one is that Project Gutenberg has a text edition, so I wrote a Perl script to convert that to XML, some XSLT to split the result, and compared it to the original book.
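
The conversion script itself isn't shown here, and it was more involved than this, but the general shape is simple. In this sketch the entry format (an upper-case headword followed by a full stop) is an assumption, and the real Gutenberg text needed more special cases:

#!/usr/bin/perl
# Rough sketch: wrap dictionary entries from a plain-text file in simple XML.
use strict;
use warnings;

sub esc {                       # minimal XML escaping
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    return $s;
}

print qq{<?xml version="1.0" encoding="UTF-8"?>\n<dictionary>\n};
my $open = 0;
while (my $line = <>) {
    chomp $line;
    if (my ($head, $rest) = $line =~ /^([A-Z][A-Z' -]+)\.\s*(.*)$/) {
        print "</def></entry>\n" if $open;          # close the previous entry
        print "<entry><headword>", esc($head), "</headword>\n<def>", esc($rest);
        $open = 1;
    } elsif ($open && $line =~ /\S/) {              # continuation of the definition
        print "\n", esc($line);
    }
}
print "</def></entry>\n" if $open;
print "</dictionary>\n";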

As an aside, it pains me that the terms of Project Gutenberg are such that I'm not allowed to give them credit for the work they did, since I have fixed an average of a little over one typo per page, including some misspelt entry headings. I kept a log of changes and will send them back in case they are of use.

XSLT 2.0 (currently a Candidate Recommendation) has some useful new features, including regular expression substitution, which make it easier to do conversions with fewer Perl scripts and more XSLT. I've been using Mike Kay's open source Saxon, and also his commercial Saxon-SA, which is schema-aware; the extra type checking this provides can be very useful.

I linked the two dictionaries together, so words in one point to the other. I didn't do the reverse linking yet, because I want to resurrect the code I used to add internal links by looking at phrases in the definitions and comparing them to possible target headwords, and then checking for words in common in the two possibly-linked entries.

For some reason the other people who have copied the Grose dictionary of slang have mostly kept it in one file, or at most split it into one file per letter, but this makes it hard for people to bookmark entries, and also really confuses Web search engines that try and work out what each HTML document is about based on keywords inside it!

I used my lq-text text retrieval system on some of these texts, including an encyclopædia, to do things like look for words that only occur once (finding possible typos), as well as to help find links.

On this subject, I'm still working on making a new release of lq-text. If you would like to help, let me know. I think importing the RCS files into some versioning system or other (CVS, subversion, arch) and maybe some sort of autoconfigure support are the highest priorities right now, although having HTML documentation rather than SGML and PDF might also be good.

OK, I know I should post more entries instead of a few huge ones. This is what moving house can do to you!

Chromatic, I'm with Tim Bray: stopwords are a bug, not a feature. I admit, as I say that, that my own text retrieval package, lq-text, supports stop words: sometimes the bug is in limited disk and memory.

I found, though, that even if you eliminate stop words, remembering where a stop word was eliminated, but not which one, can be a useful compromise. Hence, lq-text can distinguish "printed in The Times" from "printed times".
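
A toy sketch of the idea (nothing like lq-text's actual data structures, which index word positions; the stop list here is made up): keep a placeholder wherever a stop word was dropped, so the two phrases still index differently:

#!/usr/bin/perl
# Toy illustration: drop stop words but remember that *something* was there.
use strict;
use warnings;

my %stop = map { $_ => 1 } qw(in the a an of and to);

sub index_form {
    my ($phrase) = @_;
    my @out;
    for my $w (split /\s+/, lc $phrase) {
        push @out, $stop{$w} ? '_' : $w;    # '_' marks an elided stop word
    }
    return join ' ', @out;
}

print index_form("printed in The Times"), "\n";   # printed _ _ times
print index_form("printed times"), "\n";          # printed times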

Stemming tends to conflate senses: you might have a document in which recording is common, and another in which records is common, and you can no longer distinguish them. This may or may not matter to you, of course.

I hope you are familiar with the work by the late Gerald Salton's group at Cornell on document similarity.

One way to improve perceived performance can be to pre-compute things. I found that vector cosine differences were much more useful if you used phrases rather than words, but you can eliminate a lot of potential document pairs and make the work much faster that way too.
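
The cosine measure itself is simple enough; here's a sketch over frequency hashes (illustrative only, not lq-text code, and the sample phrases are invented), where the keys happen to be phrases rather than single words:

#!/usr/bin/perl
# Sketch: cosine similarity between two frequency vectors held in hashes.
use strict;
use warnings;

sub cosine {
    my ($v1, $v2) = @_;
    my ($dot, $n1, $n2) = (0, 0, 0);
    for my $k (keys %$v1) {
        $dot += $v1->{$k} * ($v2->{$k} || 0);
        $n1  += $v1->{$k} ** 2;
    }
    $n2 += $_ ** 2 for values %$v2;
    return 0 unless $n1 && $n2;
    return $dot / (sqrt($n1) * sqrt($n2));
}

# Phrase counts for two documents; using phrases rather than single
# words keeps "sound recording" distinct from "medical records".
my %doc1 = ('sound recording' => 4, 'magnetic tape' => 2);
my %doc2 = ('medical records' => 5, 'magnetic tape' => 1);
printf "similarity: %.3f\n", cosine(\%doc1, \%doc2);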

What I did was to treat each new document as a query against the indexed corpus before adding it. But this was more than ten years ago, when I was hoping to get involved in TREC.

Liam

31 Aug 2005 (updated 10 Nov 2007 at 03:12 UTC)

[update: 2 years later and we had a Summer without rain...]

All the way up here in Canada we're getting rain from Hurricane Katrina, now a tropical storm. We're getting maybe 50mm (2 inches) of rain in a few hours. One of our windows blew in (the whole frame, not the glass) during the night. Luckily, the cats didn't leap through the open window and go out. Or if they did, they leapt back inside. And I don't think any other animals came inside either (the perils of living in the country!)

I've done a little more work on lq-text, the text retrieval system that I first released (for Unix) in 1989. I'd like to teach it XPath, but for now I barely have enough time to work on making sure the documentation is up to date, and that the software actually builds. I see a few people downloading it each month but I rarely hear back from them. As far as I know, lq-text is still one of the better text indexing packages for plain text, but it doesn't do word processing files, PDF, etc. It does index HTML/XML/SGML but only by ignoring the element structure.

I've played a little with OCR programs recently. The GNU gocr turned out to be no help at all for old books (e.g. I tried one printed in 1845, and also saw samples others had tried). Here's some gocr output that's better than average:

Iu a mvoode;: box, in the cl;oir, Do?v lie.s a ?yen:8?Ҁ¢bably- _i;e emgg-., of wood, of ,a Cr;;,s_,adeT; mml3o he ww it is_ í;npossible to tell 8vit); any certaii:ty, but mh-e v.ei;ture to tl;í;3k it rejirt_,R._ents ui;e uf tl(e_ t?h-o

Here is the same passage as read by Abbyy.com's reader:

In a wooden box, in the choir, now lies a remarkably fine effigy. of wood, of a Crusader: who he was it is impossible to tell with any certainty, but we venture to think it represents one of the two distinguished persons

So you can guess which program I'm using. Frankly, if gocr had a user interface as clean as that of Abbyy's program, the quality might be more nearly tolerable: you can click anywhere on the image to go to the corresponding place in the text draft, and vice versa, and the spell checker aligns both text and image as you go, highlighting regions in both very clearly.

I made a transcription (is that the right word here?) using OCR of several pages from Sir Charles Knight's Old England, averaging less than five minutes per page, although careful proof-reading takes longer. I made a simple XML format that preserves all of the typographic distinctions in the original that I can discern and that appear to have been deliberate (e.g. I am not recording where a piece of metal type broke and lost a serif).

This preservation of distinctions is something Project Gutenberg doesn't seem to take care to do. For example, the `Encyclopedia Gutenberg' (actually the OCR'd text from the 1911 Encyclopaedia Britannica) has lost all the small caps, which were used to denote implicit cross references. As an experiment I have ordered a DVD with scanned images, and I'll see (if the images are good enough) how long it takes me to get something as good. Probably not long if I use their text as a baseline, although some rudimentary analysis of the published Project Gutenberg text found a lot of obvious errors that I doubt are in the original. This is not to say I would not also have many errors, of course, but I don't have a team of people doing proofreading.

When I worked at SoftQuad we did conversion of texts into SGML, often charging US$50,000 or more for a project, but still undercutting some of the competition. The trick was extensive analysis and a lot of scripting. For example, the abbreviation q.v. usually marks a cross-reference, so check for the longest phrase before that marker to find a plausible target for a link. Of course, if there are typographical distinctions it's easier. So now I'm using some of that experience. The transcription I mentioned earlier has thumbnails of pictures. These are pictures I had already scanned over the past five or six years, but because I used consistent filenames I was able to connect them to the text, which has references like (Fig. 12), automatically. This in turn gives me a list of figures that are not referenced, which helps me look for errors in the script or in the OCR'd text.
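
A hedged sketch of that kind of heuristic (the real conversion scripts were far more elaborate, and the regular expression here is only a first approximation): pull out the few words immediately before each q.v. marker as a candidate link target:

#!/usr/bin/perl
# Sketch: find candidate cross-reference targets by grabbing the phrase
# immediately before a "(q.v.)" marker. Real conversions need much more care.
use strict;
use warnings;

while (my $line = <>) {
    # Capture up to five words before "(q.v.)".
    while ($line =~ /((?:[A-Za-z][\w'-]*\s+){0,4}[A-Za-z][\w'-]*)\s*\(q\.v\.\)/g) {
        print "possible link target: $1\n";
    }
}

A human still has to check the candidates, of course, but it narrows the search enormously.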

Combining threads, I made an lq-text index to the Gutenbergopedia, and then I could get a keyword-in-context index of "q.v.":

$ lqphrase "q.v." | lqkwic
==== Document 1: vol1/aargau.xml ====
  1:ower course of the river Aar (q.v.), whence its name. Its total area is 541
  2:hot sulphur springs of Baden (q.v.) and Schinznach, while at Rheinfelden th
  3:pital of the canton is Aarau (q.v.), while other important towns are Baden
  4:er important towns are Baden (q.v.), Zofingen (4591 inhabitants), Reinach (
==== Document 2: vol1/aaron.xml ====
  5: distinct from the Decalogue (q.v.) (Ex. xxxiii. seq.). Kadesh, and not Sin
  6:o the Mosaite founder of Dan (q.v.). This throws no light upon the name, wh
Another good error-checking technique is to look for words that occur only once, or whose frequency is very different from what one might expect. You really need more than one volume to do frequency analysis, but I can already see words like a11erican (should be American), ciimate, AAstotle (Aristotle) and so on. In a way you can think of this as debugging: doing experiments that might reveal errors, and then correcting them.
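
The check itself needs nothing fancy; a few lines of Perl (a sketch, not part of lq-text) will list the words that appear exactly once:

#!/usr/bin/perl
# Sketch: list words that occur exactly once in the input, as typo candidates.
use strict;
use warnings;

my %count;
while (my $line = <>) {
    $count{lc $_}++ for $line =~ /([A-Za-z][A-Za-z0-9']*)/g;
}
for my $word (sort keys %count) {
    print "$word\n" if $count{$word} == 1;
}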

There are some other interesting things about OCR'd text to do with grammars and metadata, with links and expressing relationships, but I should put those in my XML blog when I get a chance.

On a tangentially related topic: I remember working at an aircraft company and seeing a junior consultant spend a day doing some editing that I could have done in under five minutes. He didn't know about regular expressions. I may have mentioned this here before, but another thing people often don't think of is to use regular expressions to generate shell scripts.

When I scan images I name the files with the figure number (or page number, if figures are not numbered) at the start, so they sort together, e.g.


-rwxr-xr-x  2 liam liam 200947 Aug  4  2003
    071-Penshurst-Place-Kent-the-great-hall-1032x1522.jpg
-rwxr-xr-x  2 liam liam  54461 Aug  4  2003
   071-Penshurst-Place-Kent-the-great-hall-581x857.jpg
-rwxr-xr-x  2 liam liam  68865 Aug  4  2003
   071-Penshurst-Place-Kent-the-great-hall-774x1142.jpg
(you can see these at fromoldbooks.org). I use a shell script to extract the image size and rename the files with the widthxheight. It also extracts the JPEG compression quality and adds that if it's not 75%.
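
The actual script is plain shell, but the idea fits in a few lines of Perl too; this sketch assumes the CPAN Image::Size module is installed, and leaves out the JPEG-quality part:

#!/usr/bin/perl
# Sketch: append WIDTHxHEIGHT to image filenames,
# e.g. foo.jpg -> foo-1032x1522.jpg
use strict;
use warnings;
use Image::Size;

for my $file (@ARGV) {
    my ($w, $h) = imgsize($file);
    next unless defined $w;                    # skip files Image::Size can't read
    (my $new = $file) =~ s/(\.[A-Za-z]+)$/-${w}x${h}$1/;
    next if $new eq $file || -e $new;          # don't clobber anything
    rename $file, $new or warn "rename $file: $!\n";
}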

Now, suppose I got the figure number wrong, and I have a bunch of files to rename from 071- to 017- (or whatever).

I can use sed (no, don't panic) like this:

ls 071* | sed 's/^071-/017-/'

This gives me the new filenames:

017-Penshurst-Place-Kent-the-great-hall-1032x1522.jpg
017-Penshurst-Place-Kent-the-great-hall-581x857.jpg
017-Penshurst-Place-Kent-the-great-hall-774x1142.jpg

But really I need to generate a set of Unix commands to rename the files:

ls 071* | sed 's/^071-\(.*\)/mv -i & 017-\1/'

If the expression intimidates you, take off your shoes and read it again :-) The \1 in the replacement part means whatever was matched by the \(...\). The & means the whole thing that was matched. So we get this:

mv -i
    071-Penshurst-Place-Kent-the-great-hall-1032x1522.jpg
    017-Penshurst-Place-Kent-the-great-hall-1032x1522.jpg
mv -i
    071-Penshurst-Place-Kent-the-great-hall-581x857.jpg
    017-Penshurst-Place-Kent-the-great-hall-581x857.jpg
mv -i
    071-Penshurst-Place-Kent-the-great-hall-774x1142.jpg
    017-Penshurst-Place-Kent-the-great-hall-774x1142.jpg

I have put the -i option to mv so that, if I make a mistake, mv will prompt me before overwriting files.

Now I'm ready to run it, and I can do that by piping my command to the shell:

ls 071* | sed 's/^071-\(.*\)/mv -i & 017-\1/' | sh

If all this sounds pointless compared to issuing three mv commands and using filename completion with tabs, I'll mention that I usually end up doing it in three or four directories, since I want to rename the original scans as well as the JPEG files I put on the Web, and also that I used a real but short example deliberately.

The technique of constructing programs on the fly is a very powerful one, and is also used with XSLT, but with shell scripts you get the added benefit that reuse is just an up-arrow away in your history! (or a control-P away if, like me, you don't use the arrow keys much because it's faster to use the control-key equivalents).

OK, enough rambling for now.
