24 Nov 2005

Binary XML Politics has been interesting of late. I ran sessions at a number of conferences on three different continents, and found that people attending were in favour of W3C defining a more efficient transfer mechanism for XML, but vigorously opposed to "binary XML".

The reasons for opposition varied widely. Very few were stated clearly or coherently, so it's difficult to agree or disagree with them. As best I understand it, people are concerned about W3C introducing a second representation of XML documents into a world that already has dozens of widespread representations and probably thousands all told.

For instance, as far as an XML processor that doesn't understand EBCDIC knows, an XML document marked as encoded in EBCDIC might as well be some form of binary goop -- it's perhaps well-formed XML, but the processor can't do anything with it at all, not even pass it back to an application or check to see if it's well-formed. It just rejects it.
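To make the EBCDIC point concrete, here is a minimal Python sketch of a processor that only handles ASCII-compatible encodings. The function names and the sample document are invented for illustration. The four-byte signature is the one the XML 1.0 specification's appendix on encoding autodetection gives for "<?xm" in EBCDIC, and cp037 is just one EBCDIC code page that Python happens to ship.

```python
# Bytes for "<?xm" in EBCDIC, per the XML 1.0 autodetection appendix.
EBCDIC_SIGNATURE = bytes.fromhex("4c6fa794")


def sniff_family(document: bytes) -> str:
    """Guess the encoding family from the first bytes (hypothetical helper)."""
    if document.startswith(EBCDIC_SIGNATURE):
        return "ebcdic"
    if document.startswith(b"<"):
        return "ascii-compatible"
    return "unknown"


def accept(document: bytes, families=("ascii-compatible",)) -> str:
    """Reject anything outside the families this imaginary processor supports."""
    family = sniff_family(document)
    if family not in families:
        # The processor cannot even check well-formedness; it just gives up.
        raise ValueError(f"unsupported encoding family: {family}")
    return family


# Invented sample document, encoded two ways.
text = '<?xml version="1.0" encoding="cp037"?><entry>goop</entry>'
ebcdic_doc = text.encode("cp037")
ascii_doc = text.replace("cp037", "UTF-8").encode("utf-8")
```

Calling `accept(ascii_doc)` succeeds, while `accept(ebcdic_doc)` raises `ValueError`, even though `ebcdic_doc.decode("cp037")` gives back exactly the same characters.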

There are already standards (more than one at ISO, at least one at IETF, probably others elsewhere) specifying ways for XML to be interchanged, e.g. over a network, in various non-textual forms, ranging from gzip to the ASN.1 used in Fast Infoset. Most of these will stick around for a long time, although perhaps some of them will be used much less often if W3C defines a spec for exchanging XML documents efficiently.

The politics is irritating because it seems to be based on spreading distrust rather than on technical arguments. Joe Gregorio wrote an article that doesn't allow comments back (fear? fear of spam? I don't know) and that seems pretty paranoid. It says in essence (as I read it) "W3C is saying they are doing one thing but really doing something sinister and evil", without ever explaining why the thing is actually sinister or evil, and without justifying the claim in the slightest. I don't really know how to respond to paranoia apart from suggesting therapy and medical help. Of course, Joe could join the Efficient XML Interchange Working Group, but it's presumably more fun to make snide comments at a distance. I'm not sure saying "no, we're not doing something evil" would have much effect.

I'm singling out Joe here, whom I have never actually met. There are quite a few other people, some of whom I have met, and some of whom work for organizations with a reputation for spreading FUD, helping to make sure anyone with sensible arguments doesn't get heard. I've actually tried quite hard, as have others at W3C and elsewhere, to understand the arguments. I really have. I've flown to Japan, been to Europe and the US and Canada, spoken with (and listened to) many people, and in the end the strong, coherent, well-researched and technically supported arguments on the one side seem to me to outweigh the gibberish, emotional arguments and ranting on the other.

Even that doesn't mean the side who can communicate clearly is right in any useful sense, but only that I'm in a position to try to evaluate whether they are right. Nor would it be fair to paint everyone (on either side) with this over-simplifying brush. There have been clear arguments. I remember one from Michael Rys of Microsoft, for example, that was very clearly stated: he was against anything except defining an efficient format as a variant encoding (like <?xml version="goop1.0"?>), so that we are not partitioning the world into two camps and not, as he put it (this from memory), weakening the foundations. It's an argument we heard clearly from a number of people and organisations, and have heeded.

I spent some effort this year to try and help XMLers have a clearer perception of what we're trying to do at W3C, and of the processes we use, some of which have come in part from the IETF, some from open source projects, some from ISO and other standards organisations, and some from within the W3C and its participants. I don't think W3C is perfect. Neither do I think all of our specs are good (although some are definitely above average as specs go). But neither are we evil demons seeking to destroy the XML we have created.

Oh well.

Transcriptions of texts from old books have interested me for years. I have had an eighteenth century dictionary of underworld slang on my Web site for several years now, and it gets quite a few hits, is linked to by Wikipedia, etc. I recently added a second one, by Captain Francis Grose; it's a little later, The Dictionary of the Vulgar Tongue. The interesting thing about this one is that Project Gutenberg has a text edition, so I wrote a Perl script to convert that to XML, some XSLT to split the result, and compared it to the original book.
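The text-to-XML step could look roughly like the following sketch. It is in Python rather than the Perl actually used, and it assumes a layout that may not match the real Gutenberg file: entries separated by blank lines, each starting with an upper-case headword followed by a full stop. The element names and the sample entries are invented, not taken from the book.

```python
import re
import xml.etree.ElementTree as ET

# Assumed entry shape: "HEADWORD. Definition text ..."
ENTRY = re.compile(r"^([A-Z][A-Z' ,-]*)\.\s+(.*)$", re.DOTALL)


def text_to_xml(plain: str) -> ET.Element:
    """Turn blank-line-separated entries into <entry> elements."""
    root = ET.Element("dictionary")
    for block in re.split(r"\n\s*\n", plain.strip()):
        flat = " ".join(block.split())  # undo hard line wraps
        match = ENTRY.match(flat)
        if match is None:
            continue  # not an entry under the assumed layout; skip it
        entry = ET.SubElement(root, "entry")
        ET.SubElement(entry, "headword").text = match.group(1)
        ET.SubElement(entry, "definition").text = match.group(2)
    return root


# Invented sample, standing in for the real text edition.
sample = """EXAMPLE WORD. An invented definition, not from the book.

OTHER. Another invented
definition spanning two lines.
"""
root = text_to_xml(sample)
xml_text = ET.tostring(root, encoding="unicode")
```

Blocks that do not fit the assumed pattern are silently skipped here; for a real conversion it would be safer to log them, since those are exactly the places where the transcription and the printed book are likely to disagree.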

As an aside, it pains me that the terms of Project Gutenberg are such that I'm not allowed to give them credit for the work they did, since I have fixed an average of a little over one typo per page, including some misspelt entry headings. I kept a log of changes and will send them back in case they are of use.

XSLT 2.0 (currently a candidate recommendation) has some useful new features, including regular expression substitution, which make it easier to do conversion with fewer Perl scripts and more XSLT. I've been using Mike Kay's open source Saxon, and also his commercial SaxonSA which is Schema-aware. The extra type checking this provides can be very useful.

I linked the two dictionaries together, so words in one point to the other. I didn't do the reverse linking yet, because I want to resurrect the code I used to add internal links by looking at phrases in the definitions and comparing them to possible target headwords, and then checking for words in common in the two possibly-linked entries.
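The linking idea (look for a target headword as a phrase in a definition, then require some words in common between the two entries) might be sketched like this in Python. This is an illustration under assumptions, not the original code: the function names, the stopword list, the threshold of two shared words and the sample entries are all made up.

```python
import re

STOPWORDS = frozenset({"a", "an", "the", "of", "to", "or", "and", "in"})


def words(text: str) -> set:
    """Lower-cased set of alphabetic words in a piece of text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def candidate_links(source: dict, target: dict, min_shared: int = 2) -> list:
    """Return (source headword, target headword, shared words) triples.

    Both arguments map headword -> definition text.
    """
    links = []
    for head, definition in source.items():
        lowered = definition.lower()
        for target_head, target_def in target.items():
            # Step 1: the target headword must appear as a whole phrase.
            pattern = r"\b" + re.escape(target_head.lower()) + r"\b"
            if not re.search(pattern, lowered):
                continue
            # Step 2: the two entries must share enough non-trivial words.
            shared = (words(definition) & words(target_def)) - STOPWORDS
            if len(shared) >= min_shared:
                links.append((head, target_head, sorted(shared)))
    return links


# Invented entries for illustration only.
source = {"widget": "A small gadget used by a tinker to mend a kettle."}
target = {
    "gadget": "Any small tool; a tinker carries one.",
    "kettle": "A pot for boiling water.",
}
links = candidate_links(source, target)
```

With these entries only "gadget" is linked: "kettle" also appears in the source definition, but the two entries have no words in common beyond stopwords, so it is rejected as a likely false match.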

For some reason the other people who have copied the Grose dictionary of slang have mostly kept it in one file, or at most split it into one file per letter, but this makes it hard for people to bookmark entries, and also really confuses Web search engines that try and work out what each HTML document is about based on keywords inside it!

I used my lq-text text retrieval system on some of these texts, including an encyclopædia, to do things like look for words that only occur once (finding possible typos), as well as to help find links.
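Finding words that occur only once does not need a full retrieval system for small texts. The sketch below is plain Python, not lq-text, and the function name and sample sentences are invented; it simply counts every word across a set of texts and reports the ones seen exactly once.

```python
import re
from collections import Counter


def words_occurring_once(texts) -> list:
    """Return the sorted words that appear exactly once across all texts."""
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of letters, including ones like "æ".
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return sorted(word for word, n in counts.items() if n == 1)


# Invented sample: "matt" is the planted typo.
sample_texts = [
    "the cat sat on the mat",
    "the mat was red",
    "the cat sat on the matt",
]
once = words_occurring_once(sample_texts)
```

Here `once` contains "matt" alongside "red" and "was", which shows the limit of the technique: a word that occurs once is only a possible typo, and a person still has to look at each candidate.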

On this subject, I'm still working on making a new release of lq-text. If you would like to help, let me know. I think importing the RCS files into some versioning system or other (CVS, Subversion, arch) and maybe some sort of autoconfigure support are the highest priorities right now, although having HTML documentation rather than SGML and PDF might also be good.

OK, I know I should post more entries instead of a few huge ones. This is what moving house can do to you!

