Older blog entries for Ankh (starting at number 177)

Every once in a while I'm reminded why I use Mandriva Linux. I watched someone try to plug in a digital camera and upload pictures. Her husband insists on debian (OK, GNU/Debian Linux[tm]), but this means that instead of "plug in the camera, click on the icon that appears on the desktop" it's "find the device in /dev and mount it". A small difference perhaps, but a big one in outlook. Of course one could configure GNU/Debian Linux[tm] to behave the same way, but her system administrator husband looks down on things that are too easy. So he has a computer system that's designed to appeal to a system administrator who looks down on people who are not system administrators. Maybe Ubuntu would be a good compromise for the pair of them: based on debian, but produced by people who care about using computers to do other things.

zanee, you are right: choose your battles.

badvogato, why should I tell you my husband's name when I don't know your name? Pictures of your ankles on a postcard please :-)

OK, I relent, he's called Clyde.

Liam

I got behind with digital photos, so I've been uploading basically unsorted pictures; I'm up to 2004, and in particular to the holiday in the UK that my husband [yes, live with it] and I had in September of that year. Some of the pictures are pretty good, but of course most aren't, so I'll have to try and make some selections eventually.

Last week I spent some time explaining to someone the difference between XML's name-based typing and the structure-based typing that was in an early draft of XML Query. I suppose you could say the structure-based typing was like an early version of the C Programming Language, in which two types were entirely compatible if they had the same underlying storage. You could assign bar = foo, in other words, if the number of bits in each variable was the same (more or less). By 1978 C had evolved past this, and there were implementations in which if you did
    typedef int hatsize;
    typedef int shoesize;

you couldn't call a function expecting a shoesize and pass a hatsize without at least getting a warning. In Java or C++ it would be absurd for an assignment across classes to be anything but an error, regardless of storage sizes. And it's absurd in XML too, in most cases.

Slowly working on my XML blog; I should add stuff about strong typing.

My husband installed a codec for a Web site he trusted, which turned out to've been misplaced trust, as it installed some virulent malware that keeps popping up saying you've been infected with adware or spyware, and need to buy their anti-adware tool. Of course, to make this credible, it also installs some adware in the background.

Part-way through a new Windows installation using the Acer recovery disks, we discovered that one of the disks was missing. And this left the laptop unusable. Well, usable by Linux :-) Luckily, Acer agreed (for a surprisingly small fee) to ship replacement CDs by overnight courier, so we should have them in a couple of days (you have to add a day for the border, usually).

Stupid marketing flyer of the week comes from The Source by Circuit City, which used to be Radio Shack. On the laptop with the smallest screen and least memory they say Increased memory and larger screen is ideal for gaming and graphics; on a mouse, feel the precision with an optical mouse (all their mice shown are optical)... there's a memory stick shown with the caption SONY 512MB Memory Stick PRO is smaller than a stick of gum -- possibly, but the stick shown clearly says 256MB on it. Two adjacent cameras have captions, (1) 6MP digital camera has everything you need to capture your best shots and (2) Everything you need in 6MP digital camera. I'm not sure how those captions are supposed to help differentiate the products. Looks like maybe The Source isn't long for this world.

On a more positive note, some of my calligraphy was used on the front cover of an American current affairs magazine called Time, which is cool.

rmathew, I'm not sure I'd take Ian Hixie's rant quite as strongly as you seem to've done. With only a little care you can serve XHTML documents as text/html and use XML tools with them just fine; I suspect Ian Hixie doesn't use XML tools very much. Opera (where he worked until recently) was very much dragged kicking and screaming into a world in which XML support was a given, and they have only recently added client-side XSLT to their browser.

On the subject of renaming folders, it's worth putting a redirect into your .htaccess or apache.conf so that you don't break the Web. Well, so that you don't break your bit of it :-)
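For example (the folder names here are made up, not real ones on my site), a permanent redirect in .htaccess can be as short as one line:
    # hypothetical old and new locations; this uses the mod_alias Redirect directive
    Redirect permanent /old-gallery/ http://www.example.org/new-gallery/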

Been going through piles of old digital photos, pictures from 2004, slowly catching up. Most of them I took for use as stock for people into photomanipulation, and a lot of them have been used. But I had only posted some of them. Also expanded my Calligraphy booklist somewhat.

I spent much of today patching holes in the upstairs of the barn that we use as an art gallery, so the birds can't nest in it and then spill their poop onto the artwork below. I hope that we get to a point where my husband and I can concentrate on making art, writing software, the garden, and life, instead of concentrating on working on the house. But it's going to take a while!

Distressingly, some scanned engravings from a book on torture have turned out to be fairly popular. Maybe I should not be surprised. I'm glad the images are not in colour.

In the unexpected light relief department, I was going from Toronto airport into the US one day last year, and the US immigration official asked me my job. I said (for the sake of simplicity) that I work for a standards organisation. He promptly looked at me and said, “what have you got in that suitcase?”
“Clothes and toiletries” said I, whereupon he asked,
“You're sure you've not got any metric in there? We don't want any of that!”
I assured him that the metric system was from a different organisation and that we (W3C) don't do that, and he grudgingly let me pass. Was that a twinkle in his eye?

Back from travel to France (W3C Tech Plenary) and California (Unicode conference). The Unicode conference reminds me (it doesn't take much to remind me) of some of the work that is still needed to tame fonts on Linux.

Afterwards, spent some time putting up more scans from old books on the Web. I did some reasonably high resolution scans of some 16th century type (2400dpi I think, I forget) but the files tend to be too large for comfortable Web viewing or downloading. I'm willing to digitise more type samples if they are of use to anyone, and also of course to host them, together with metadata and a search interface. I'm more likely to get requests for pretty initials (drop caps) or for castle plans, though, most of the time.

I wanted to play with the Google map interface to try and provide another interface to locations depicted in the images, but I haven't yet found the time.

titus mentioned Jon Udell's blog entry quoting R0ml as saying that open source means you don't need standards, because you take away the concept of ownership of a core technology. This is a bogus argument in oh-so-many ways. First, being open doesn't always mean that the core technology is not owned by some group or individual. A fork isn't always feasible. But that minor quibble aside, standards do not address the issue of who owns technology. They are about having multiple implementations that work together.

Standards are the reason I can list over 100 open source IRC clients that work together, or that there can be so many different clients for the World Wide Web on so many different sorts of hardware and for so many environments.

We need both open implementations and open specifications, and we need specifications to be freely available (as in beer) so that people can afford to implement to the spec directly.

zanee asked about how to go about improving an open source project where the design may be questionable but the maintainers and developers don't admit a problem.

The act of wondering how to act is a necessary first step, and you've taken it :-) (I sound like a horrorscope from a cheap paper).

It's often a case of having to be very tactful, and also having to get the developers to want to make the changes, and being confrontational is unlikely to do that. Supplying a patch may help, as might convincing your company that they need faster time-to-market/turnaround, or whatever their jargon happens to be, and that it's unreasonable for the software to take 14 hours to run. The first approach gets you working with the developers, and the second gets pressure applied on them to improve the product.

I should note, by the way, that there is nothing about writing object-oriented code in Perl that prevents you from using use strict, and also that Devel::DProf should certainly work, although there are problems with thread-enabled Perl on some platforms that might conceivably interfere, and native (XS) method calls might also be a problem. You may find use strict; no strict 'vars'; of use; see the perldoc page for strict.
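
For what it's worth, here is a minimal sketch of an object-oriented Perl package that runs happily under strict; the package and field names are invented for illustration, not taken from any particular project:
    package Headword;    # hypothetical class, just for illustration
    use strict;
    use warnings;

    sub new {
        my ($class, %args) = @_;
        my $self = { term => $args{term}, definition => $args{definition} };
        return bless $self, $class;
    }

    sub term       { return $_[0]->{term} }
    sub definition { return $_[0]->{definition} }

    1;

Profiling with Devel::DProf is then a matter of running perl -d:DProf yourscript.pl and looking at the results with dprofpp.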

Your rant about how "people off the street" should be able to compile some complex Linux package is not, I think, well spoken. If software is too difficult for the people who intend to use it to configure, that's the software's problem. Every time.

vab, welcome to MIT; it's a fun place to work.

For people reading this blog syndicated, the original article is currently on advogato.org/recentlog.html which shows the most recent few entries.

Wow, lots of snow fell.

pesco, why invent a new syntax? The advantage of XML is not that it's particularly elegant, but rather that it's widely used. There are some nice features -- the end tags add a level of redundancy/error checking, for example, that is particularly useful for markup that can't easily be checked by computer as "right" or "wrong". And there are some less nice features. But most of the interesting research is at the edges (as Tommie Usdin said in her keynote at the Extreme Markup conference this year).

To be sure, there are experiments with alternate syntaxes, such as LMNL, an experimental markup language supporting overlap and structured attributes. Overlap is probably the biggest driving force in markup research at the syntax level today, although most people still try to stay within the bounds of XML in order to take advantage of all that XML software and understanding.

You mention functions. I claim that well-designed XML markup vocabularies are declarative in that they indicate a result or a meaning, if you will, rather than giving an algorithm. Of course, XSLT and XML Query straddle the boundary here. But they operate on XML documents, and those documents can be used and interpreted in many ways. Consider taking an SVG document and producing a set of colour swatches representing the colours used in the image described.

I see a lot of proposals for alternatives to XML, and I'm always interested to see use cases and hear what is being solved that XML can't do. Sometimes people think XML can't do things that it can; sometimes they think that a "more elegant" solution will appeal to a wider audience (but very rarely can people from such widely differing communities as XML users agree on "elegant"), forgetting that it is not elegance that drives the adoption of XML; sometimes they genuinely have new ideas, or things they really can't do with XML.

We recently created a Working Group at W3C to investigate efficient interchange of XML; even Microsoft, who vociferously and publicly opposed the creation of the Working Group, have since said they'll consider using the result if it meets their needs, or if their customers demand it (and they do). Of perhaps more interest to you, we also created an XML Processing Model Working Group to standardise a way to say how an XML document is to be processed. It's a sort of functional scripting language for XML, or that's what I hope it will be. The processing model work is being done in public view, so you can watch, and maybe also get involved.

[disclaimer, in case it is not obvious: I am the XML Activity Lead at W3C]

Too many blogs. What to do? I wanted to put together some notes on buying and owning a home but rather than start a new blog, I am just making static Web pages for now. Once I'm up to a dozen or so pages I'll maybe rethink things.

Also working on two papers, one for the Unicode conference next Spring (on XQuery) and one for XTech 2006 (the conference is subtitled "Building Web 2.0").

fxn, yes, I agree strongly with Tim's comments there. Before the FSF started, the Unix community used to share "public domain" software. However, I should also give Richard and the FSF credit for a unifying vision of a complete freely-available operating system -- I'll say freely available because, like Tim, I think the politics around the Free part caused some problems.

I actually do support Free software, but I also prefer to try and form consensus and agreement with all parties, and the FSF at that time wasn't known for flexible compromise. I don't think there are easy answers, though.

Spent some time once more thinking about fonts and typography in the open source and Free software community. I wrote a little more of an article on the subject and then got distracted again.

It's hard to reconcile the feature needs of professionals with ease of use for everyone else. The About Face book talks about perpetual intermediates being good people to bear in mind when designing software. I'd like to ask for more powerful font-choosing software, but most people don't care about fonts enough to want the options. So the right answer is to make more of the font environment "just work".

Also remembered my XML blog, which is rather barren right now. I need to say something controversial, such as Linux wears white socks or sort is better than cat because it has more options, and then I'd get lots of comments. But then I'd wish I had adverts on my blog :-) Maybe I'll blog about efficient XML, but for now it's about opaqueness and meaning.

The adverts on fromoldbooks.org/ now more than pay for the hosting, plus a sandwich for lunch each day. It's a balance: I don't want them to be too obtrusive, although the Google ads have two advantages over others I've tried: they seem to cause Google to index your page more frequently, and they also display relatively interesting information, so I have kept them for now. I also noticed in trying another company that the impressions-per-day figures were very different: Google said I had many more page hits than the other company did. Since I have access to my Web server's logs I could check, and Google won out.

TordJ, did you know that cellar door was the phrase that got Tolkien started down the path of the Elvish language and the Lord of the Rings? He loved the sound of it. So do I, at least when said with an English or Welsh accent: celadaw.

Binary XML Politics has been interesting of late. I ran sessions at a number of conferences on three different continents, and found that people attending were in favour of W3C defining a more efficient transfer mechanism for XML, but vigorously opposed to "binary XML".

The reasons for opposition varied widely. Very few were stated clearly or coherently, so it's difficult to agree or disagree with them. As best I understand it, people are concerned about W3C introducing a second representation of XML documents into a world that already has dozens of widespread representations and probably thousands all told.

For instance, as far as an XML processor that doesn't understand EBCDIC knows, an XML document marked as encoded in EBCDIC might as well be some form of binary goop -- it's perhaps well-formed XML, but the processor can't do anything with it at all, not even pass it back to an application or check to see if it's well-formed. It just rejects it.

There are already standards (more than one at ISO, at least one at IETF, probably others elsewhere) specifying ways for XML to be interchanged, e.g. over a network, in various non-textual forms, ranging from gzip to the ASN.1 used in Fast Infoset. Most of these will stick around for a long time, although perhaps some of them will be used much less often if W3C defines a spec for exchanging XML documents efficiently.

The politics is irritating because it seems to be based on spreading distrust rather than on technical arguments. Joe Gregorio wrote an article that doesn't allow comments back (fear? fear of spam? I don't know) but that seems pretty paranoid; it says in essence (as I read it) "W3C is saying they are doing one thing but really doing something sinister and evil", without ever explaining why the thing is actually sinister or evil, and without justifying the claim in the slightest. I don't really know how to respond to paranoia apart from suggesting therapy and medical help. Of course, Joe could join the Efficient XML Interchange Working Group, but it's presumably more fun to make snide comments at a distance. I'm not sure saying "no, we're not doing something evil" would have much effect.

I'm singling out Joe here, whom I have never actually met. There are quite a few other people, some of whom I have met, and some of whom work for organizations with a reputation for spreading FUD, helping to make sure anyone with sensible arguments doesn't get heard. I've actually tried quite hard, as have others at W3C and elsewhere, to understand the arguments. I really have. I've flown to Japan, been to Europe and the US and Canada, spoken with (and listened to) many people, and in the end the strong, coherent, well-researched and technically supported arguments on the one side seem to me to outweigh the gibberish, emotional arguments and ranting on the other.

Even that doesn't mean the side who can communicate clearly is right in any useful sense, but only that I'm in a position to try to evaluate whether they are right. Nor would it be fair to paint everyone (on either side) with this over-simplifying brush. There have been clear arguments. I remember one from Michael Rys of Microsoft, for example, which was very clearly stated: he was against anything except defining any efficient format as a variant encoding (like <?xml version="goop1.0"?>), so that we are not partitioning the world into two camps and not, as he put it (this from memory), weakening the foundations. It's an argument we heard clearly from a number of people and organisations, and have heeded.

I spent some effort this year to try and help XMLers have a clearer perception of what we're trying to do at W3C, and of the processes we use, some of which have come in part from the IETF, some from open source projects, some from ISO and other standards organisations, and some from within the W3C and its participants. I don't think W3C is perfect. Neither do I think all of our specs are good (although some are definitely above average as specs go). But neither are we evil demons seeking to destroy the XML we have created.

Oh well.

Transcriptions of texts from old books have interested me for years. I have had an eighteenth-century dictionary of underworld slang on my Web site for several years now, and it gets quite a few hits, is linked to by Wikipedia, etc. I recently added a second one, by Captain Francis Grose; it's a little later, The Dictionary of the Vulgar Tongue. The interesting thing about this one is that Project Gutenberg has a text edition, so I wrote a Perl script to convert that to XML, some XSLT to split the result, and compared it to the original book.
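
The conversion doesn't need anything sophisticated. The following is only a simplified sketch of the kind of script involved, with invented element names and a much cruder entry-matching rule than the real thing used:
    #!/usr/bin/perl
    # Simplified sketch: turn lines like "HEADWORD. definition text" from a
    # plain-text dictionary into XML entries.  The element names and the
    # pattern for recognising a headword are invented for this example.
    use strict;
    use warnings;

    print qq{<?xml version="1.0" encoding="UTF-8"?>\n<dictionary>\n};
    while (my $line = <>) {
        chomp $line;
        next unless $line =~ /\S/;
        if ($line =~ /^([A-Z][A-Z' -]+)\.\s*(.*)$/) {
            my ($head, $def) = ($1, $2);
            print "  <entry>\n",
                  "    <headword>", escape($head), "</headword>\n",
                  "    <definition>", escape($def), "</definition>\n",
                  "  </entry>\n";
        }
    }
    print "</dictionary>\n";

    sub escape {
        my ($s) = @_;
        $s =~ s/&/&amp;/g;
        $s =~ s/</&lt;/g;
        $s =~ s/>/&gt;/g;
        return $s;
    }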

As an aside, it pains me that the terms of Project Gutenberg are such that I'm not allowed to give them credit for the work they did, since I have fixed an average of a little over one typo per page, including some misspelt entry headings. I kept a log of changes and will send them back in case they are of use.

XSLT 2.0 (currently a Candidate Recommendation) has some useful new features, including regular expression substitution, which make it easier to do conversions with fewer Perl scripts and more XSLT. I've been using Mike Kay's open source Saxon, and also his commercial Saxon-SA, which is schema-aware. The extra type checking this provides can be very useful.

I linked the two dictionaries together, so words in one point to the other. I didn't do the reverse linking yet, because I want to resurrect the code I used to add internal links by looking at phrases in the definitions and comparing them to possible target headwords, and then checking for words in common in the two possibly-linked entries.
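
The heuristic is roughly this (the data structures below are invented for the sake of a runnable example; the real code works on the XML, of course): look for another entry's headword inside a definition, and only keep the candidate link if the two definitions also share at least one non-trivial word.
    use strict;
    use warnings;

    # toy data; the real input is the set of XML dictionary entries
    my %entries = (
        'CANT'  => 'The secret language of thieves and beggars.',
        'FLASH' => 'Knowing; also the cant of thieves.',
    );

    my %links;
    for my $from (keys %entries) {
        for my $to (keys %entries) {
            next if $from eq $to;
            # candidate link: the other headword appears in this definition
            next unless $entries{$from} =~ /\b\Q$to\E\b/i;
            # confirm it: require at least one shared word of four letters or more
            my %words  = map { lc($_) => 1 } $entries{$to} =~ /(\w{4,})/g;
            my $shared = grep { $words{ lc $_ } } $entries{$from} =~ /(\w{4,})/g;
            push @{ $links{$from} }, $to if $shared;
        }
    }
    print "$_ -> @{ $links{$_} }\n" for sort keys %links;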

For some reason the other people who have copied the Grose dictionary of slang have mostly kept it in one file, or at most split it into one file per letter, but this makes it hard for people to bookmark entries, and also really confuses Web search engines that try and work out what each HTML document is about based on keywords inside it!

I used my lq-text text retrieval system on some of these texts, including an encyclopædia, to do things like look for words that only occur once (finding possible typos), as well as to help find links.

On this subject, I'm still working on making a new release of lq-text. If you would like to help, let me know. I think importing the RCS files into some versioning system or other (CVS, subversion, arch) and maybe some sort of autoconfigure support are the highest priorities right now, although having HTML documentation rather than SGML and PDF might also be good.

OK, I know I should post more entries instead of a few huge ones. This is what moving house can do to you!

Chromatic, I'm with Tim Bray: stopwords are a bug, not a feature. I admit, as I say that, that my own text retrieval package, lq-text, supports stop words: sometimes the bug is in limited disk and memory.

I found, though, that even if you eliminate stop words, remembering where a stop word was eliminated, but not which one, can be a useful compromise. Hence, lq-text can distinguish "printed in The Times" from "printed times".
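
The idea is easy to sketch (this is an illustration of the principle, not lq-text's actual index format): keep a gap marker where each stop word was, rather than deleting it outright, and the two phrases stop looking the same.
    use strict;
    use warnings;

    my %stop = map { $_ => 1 } qw(a an and in of the);   # tiny stop list for the example

    # replace stop words with an anonymous gap marker instead of dropping them
    sub index_tokens {
        my ($text) = @_;
        return join ' ', map { $stop{ lc $_ } ? '_' : lc $_ } $text =~ /(\w+)/g;
    }

    print index_tokens('printed in The Times'), "\n";   # printed _ _ times
    print index_tokens('printed times'), "\n";          # printed times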

Stemming tends to conflate senses: you might have a document in which recording is common, and another in which records is common, and you can no longer distinguish them. This may or may not matter to you, of course.

I hope you are familiar with the work by the late Gerald Salton's group at Cornell on document similarity.

One way to improve perceived performance can be to pre-compute things. I found that vector cosine differences were much more useful if you used phrases rather than words, but you can eliminate a lot of potential document pairs and make the work much faster that way too.
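
For anyone who hasn't met it, the vector cosine measure itself is only a few lines; this sketch uses single-word counts for brevity, but the same code works unchanged if the hash keys are phrases, which is where it got interesting for me.
    use strict;
    use warnings;

    # build a term-frequency vector; the keys could equally well be phrases
    sub vector {
        my %v;
        $v{ lc $_ }++ for $_[0] =~ /(\w+)/g;
        return \%v;
    }

    # cosine of the angle between two term-frequency vectors
    sub cosine {
        my ($v1, $v2) = @_;
        my ($dot, $n1, $n2) = (0, 0, 0);
        $dot += $v1->{$_} * ($v2->{$_} // 0) for keys %$v1;
        $n1  += $_ ** 2 for values %$v1;
        $n2  += $_ ** 2 for values %$v2;
        return ($n1 && $n2) ? $dot / sqrt($n1 * $n2) : 0;
    }

    my $d1 = vector('the quick brown fox');
    my $d2 = vector('the quick red fox');
    printf "similarity: %.3f\n", cosine($d1, $d2);    # 0.750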

What I did was to treat each new document as a query against the indexed corpus before adding it. But this was more than ten years ago, when I was hoping to get involved in TREC.

Liam

