Recent blog entries for Ankh

27 Jun 2009 »

stan, Python already has regular expression support... if you want only ^.*$ then the simplest and most efficient way might be to prefix all other characters with \ and use the existing regexp support. Most implementations of Perl-style regular expression matching these days can use Boyer-Moore-style delta tables to go massively faster in many common cases. If the code was for your own understanding, though, that's fine, and in any case Rob Pike rocks :-)
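A minimal sketch of that escaping idea, assuming the small matcher treats only ^ . * $ as special. The names `to_python_regex` and `match` are made up for illustration. It is not an exact equivalent: Python's `re` still treats a ^ or $ in the middle of a pattern as an anchor, rejects a pattern that starts with *, and by default does not let . match a newline.

```python
import re

# The only metacharacters the small matcher is assumed to understand.
SUPPORTED = set("^.*$")


def to_python_regex(pattern: str) -> str:
    """Escape every character except ^ . * $ so that `re` treats it literally."""
    return "".join(ch if ch in SUPPORTED else re.escape(ch) for ch in pattern)


def match(pattern: str, text: str) -> bool:
    """Search for `pattern` anywhere in `text`, using the restricted syntax."""
    return re.search(to_python_regex(pattern), text) is not None


if __name__ == "__main__":
    print(to_python_regex("a+b"))      # the + is escaped, so it is literal
    print(match("^ab*c$", "abbbc"))    # anchors and * keep their meaning
    print(match("(a)", "a"))           # parentheses are literal here
```

Characters such as + ( ) [ ] and | lose their special meaning this way, which is the whole point: the full regexp engine does the work, but only the four supported operators are exposed.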

I spent some time with Marc Lehmann's String::Similarity module, which seems to do reasonably well on finding similar strings that were OCR'd independently. I wish Google would get a clue and make higher resolution scans: the OCR error rate would drop hugely, they'd get more of the punctuation and footnotes, and they might even start capturing some of the diagrams! The problem is that it's more lucrative to have millions of badly scanned crap than to have hundreds of thousands of well-scanned books, it seems.
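For anyone who wants to try the same idea without Perl, here is a rough sketch using Python's standard difflib instead of String::Similarity. The scoring is not the same as the Perl module's, so the numbers will differ. The function names, the 0.8 threshold and the sample strings are all invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score between 0 and 1; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(line: str, candidates: list[str], threshold: float = 0.8):
    """Return (candidate, score) for the closest candidate, or None if none reaches `threshold`."""
    scored = [(c, similarity(line, c)) for c in candidates]
    best = max(scored, key=lambda pair: pair[1], default=None)
    return best if best and best[1] >= threshold else None


if __name__ == "__main__":
    # Invented example: one clean line and two lines from another OCR run.
    clean = "the quick brown fox"
    other_run = ["tbe quick brovvn fox", "an entirely different line"]
    print(best_match(clean, other_run))
```

The threshold needs tuning against real data: too low and unrelated lines pair up, too high and badly damaged lines find no partner.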

25 Jun 2009 »

Been spending a lot of time working on a 200-year-old 32-volume dictionary of biography that I own (I got it in a second-hand bookshop in Oxford, missing two volumes that I later got elsewhere). I found several versions that had been OCR'd really badly, and have been cleaning up one version enough that I can then try to use the other versions to detect errors.
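The error-detection step could look something like the sketch below: align two transcriptions of the same passage word by word and list the places where they disagree, so a human only has to look at those. This is an illustration using Python's difflib, not the code actually used on the dictionary, and the sample sentence is invented.

```python
from difflib import SequenceMatcher


def disagreements(version_a: str, version_b: str):
    """Yield (words_a, words_b) for each place where the two transcriptions differ."""
    a, b = version_a.split(), version_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on one side means the words are missing there.
            yield " ".join(a[a1:a2]), " ".join(b[b1:b2])


if __name__ == "__main__":
    # Invented example: the same sentence from two independent OCR runs.
    one = "born at Oxford in 1712"
    two = "bom at Oxford in 1712"
    for left, right in disagreements(one, two):
        print(repr(left), "vs", repr(right))
```

A disagreement only says that at least one version is wrong, not which one; with three or more versions a majority vote becomes possible.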

The current version, converted first to XML and thence to HTML, is at words.fromoldbooks.org if anyone is interested. I'm hoping to be able to feed the cleaned up text back to Project Gutenberg and archive.org eventually, and to generate RDF.

Lots of interesting text processing challenges, so a useful diversion for a while.

27 Jul 2008 »

Clearing the undo history of this image will gain 428.6 MB of memory.

Image editing is going much better with 8 Gigabytes of memory. I've been able to get three or four images done for FromOldBooks.org in the time it used to take to do one.

On the other hand, the only reason I get any images scanned and edited at all is because I get too tired to do much else; it's pretty insanely busy here.

Unfortunately, Google's ads almost entirely stopped working on my Web site (Google downgraded my pagerank from 8 to 4 a few months ago), and with the fall in the US dollar (it's been bushed), we're struggling a bit more than we'd like. OK, a lot more than we'd like.

Luckily, my spam says that I won the UK Microsoft email lottery, and the prize is either (1) all of Nigeria, or (2) more spam. Speaking of which, SpamAssassin seems to be working better after a one-line fix (I filed a bug for it). Or at least it's not complaining as much.

So, today's image (no, I won't post them every day) is an ammeter from an 1892 book:

24 Jul 2008 (updated 24 Jul 2008 at 22:30 UTC) »

Cats and Dogs

It's been the rainiest July on record here - and the month isn't over yet, of course. We discovered that the swimming pool can indeed fill above the top of its liner.

And during the storms, the dog, who is possessed by a daemon, becomes uncontrollable, or controllable only with difficulty.

I still miss being able to have time to concentrate, to focus enough to write reasonable amounts of code, to program. Working at W3C means I get to have a vague warm fuzzy feeling about helping the world a teeny bit, but it isn't always enough compensation.

In what little spare time I have, I scan pictures from old books. Someone recently made a set of Photoshop brushes from the 16th century demonic seals from the Goetia, and the two sets have each had over 900 downloads (they are here and here if you are into such things). I have well over 2,000 images now, with sometimes fairly substantial extracts from the books, captions and other metadata. And there's an encyclopædia, some dictionaries of slang (including Brewer's Phrase and Fable), most of a vitriolic satirical political dictionary from the 1790s, and a bunch of other stuff.

Most of the text is in XML, so every now and then I update the XSLT that makes the HTML files and add smarts to find more cross-references. I want to do geotagging and links to maps, but this is harder than it sounds because the placenames I have are usually from when the books were published, not today.

Today's addition is some pictures of fonts, from a book I bought in Boston a couple of weeks ago, although these are not font samples as most people here would expect them to be, I suspect :-)

I did get to do some programming recently, though, and added some XML support to my ancient text retrieval package, lq-text. The changes aren't yet released, until I finish with some UTF-8 issues, but if you are interested, drop me a line. I wrote a short paper on it for the Balisage markup conference, too. I hope soon I'll use lq-text for the search function on my Web site, alongside the XQuery-based search that I have now.

Spending time on XML as character strings makes the world of RDF seem even further away, but I'm reading an interesting book on Ontology Matching to make up for it, in between scanning pictures and working on stuff for XQuery and for XSL-FO 2.0.

Now it's time to go and sedate the dog with some herbal calmer.

18 Aug 2007 »

After playing with "ajax" a little for random images on my pictures from old books Web site, I spent some time investigating other XQuery engines, and in particular what used to be Sleepycat's dbxml, and is now Oracle Sleepycat DB XML.

I was using the Perl interface, and maybe that's a mistake, because it's obvious that they don't spend as much effort on it as on the C API. The documentation is very minimal, for example. But in the end, and after uninstalling all unwanted versions of bsd db from my laptop, it worked. Query time went down from 11 seconds to 2 seconds, partly because the 11-second version is starting a JVM for each query, partly because dbxml is in C, and partly because I had to remove some features from the query because I couldn't get them to work.

After help from one of the people maintaining the software, I discovered that I'll be able to get the other features to work. The search engine on my Web site isn't actually too slow for most queries (try it here) but it's using more memory than I'd like, and there are some queries on my photographs that do take too long.

The good thing about using XQuery to develop these things is that it's relatively easy to make changes. So maybe some changes are coming.

31 Jul 2007 »

Still catching up after two weeks of vacation, which in turn was after a couple of weeks away on business with only a 4-day gap at home in the middle. And I'm about to go away again to Montreal, for Extreme Markup. Which should be a lot of fun, it's one of the more interesting conferences I go to.

Spent some time thinking about what to do about the future of XML 1.1. Will write something up in a while; it's not pretty. But maybe we can salvage something good.

Also spent time thinking about stylesheets on the Web, something even less pretty. But it will be a while before I have a coherent write-up for that one.

Installed Mandriva 2007 Spring on my desktop computer; pleased to note that the hardware seems to work better than it did in Microsoft Windows XP.

Uraeus, you probably know this, but it really helps if you eat some yoghurt every day while you're on antibiotics. Make sure it's yoghurt with the live culture in it.

etbe, you can combine the water systems. For example, have a hot water tank that's heated with passive solar energy (pipes on the roof) and is at a relatively cool temperature, and then on-demand heating to top up the temperature.

27 Jul 2007 »

Today we go home from vacation, or rather, start going home: Clyde and I have been staying at the Cree Village Ecolodge in Moose Factory, Ontario. Tonight we take the train, the Polar Bear Express, to Cochrane, where we will stay overnight, and then tomorrow we take a bus to Timmins, an hour and a half away, where we can take a 'plane to Toronto, and thence to Kingston, Ontario, and from there drive home.

It was nice to have network access while we were here. I admit, though, I spent some time transcribing more entries from a very cynical 1790s political dictionary, e.g.:

Corruption, “the oil which makes the wheels of Government go well.”

I'm also starting to catch up with writing captions for photographs, and thinking about writing some more articles on using GIMP and other tools to clean up digital images, both photographs and scans.

Looked at the Open Library demo site. It has more of a Web 2.0 feel than a careful librarian feel, and it seems to me it would benefit from more of a synthesis. But it's early. It turns out that there has been a lot of progress made in the past few years on transcribing texts, but the ones done carefully and well are generally kept behind academic non-commercial-use-only walls, so instead we get to see the ones that are done badly, for example by running unattended OCR over 17th century texts with errors in almost every single word. And of course engravings scanned at such a low resolution that you can hardly tell if they are engravings or photographs. As higher-resolution digital back ends get more affordable (e.g. 500 megapixel) this will change.

9 Jul 2007 »

Travel

I just got back from one of those trips where things keep going wrong. Upgradeable trans-atlantic flight has no business class---it felt like what Air Canada calls Premier Economy so not too bad, but no laptop power---and then luggage delayed a couple of days on arrival in Pisa. My laptop (a Dell D600 that I think was kindly donated to W3C by Intel) didn't start, but eventually I found that it would start if I held it on its side. Meetings in Pisa went OK, not as well as I had hoped although a lot of work was done. We did have a really nice trip to San Vivaldo (see below for pics) and San Gimignano (photos coming) both in Tuscany. The trip was the day after the Toronto Pride Parade, where I took 6 Gigabytes of photos, oops.

And then on to Glasgow for the Text Layout Summit, except that a rather badly planned terrorist attack on the airport meant my connecting flight was canceled. So I got on a later flight, but my bag didn't arrive. Second time this trip. Many long calls to BA baggage on my mobile phone at over $2/minute. At which point do you give up and buy a temporary phone? Once you've given them your mobile phone number there's a strong incentive not to change. A week later I am back in Canada, still with no luggage. I hope it does arrive.

The BA people gave me incorrect information, saying the bag was in Glasgow and would be delivered that night (the day after I arrived) when in fact it was in London and never arrove. Service people, don't tell lies just to keep a customer quiet. Tell the truth. Maybe British Airways says they are the "world's favourite airline" because no-one else will say it for them?

On the 'plane back from Glasgow I'd planned to do some work on my mkgallery software that I use for my photo galleries. But I didn't have the energy. The lack of laptop power was my excuse. I did get a little work done.

I upgraded (I needed one of my two Special System Wide certificates that I had been saving for an upcoming trip to Japan, but I was too tired and irritable to cope without, so I did it). Boy I'm whiney today, sorry. I did get to sit next to an actress (she had been in Eastenders a few times, and was going to Toronto to be in an advert/commercial).

Our flight was delayed by a couple of hours. The captain kept us very well informed: it was because some people whose luggage had been loaded were delayed in security (this was in London Heathrow, LHR). Someone in the row behind me was on his 'phone talking about how dirty $ethnicGroup people were always causing trouble, sigh.

Of course, we arrived late, and I was glad I'd booked a hotel at Toronto airport for the night before taking the train the next day, as I'd have missed the last train. They don't understand passenger trains in North America. The rules for success are frequent, fast and cheap, and although you can lose either of the last two properties, the first is essential. Three trains a day doesn't count. There should be a train from Toronto to Montreal every ten minutes, 24 hours a day. And back again, too.

Well, I go only the 2 hours to Belleville, not all the way to Montreal. The train averages a little under 60 miles per hour, with only three or four short stops on the way. I'd expect a fast passenger train in most of Europe or the UK to be able to get up to 120 mph or more, and average at least 80 mph, on such a simple and straight track.

Text Layout Summit

It looks like HarfBuzz is making good progress. This is the next-generation text layout engine to be used by both Pango and Qt. It looks like it will also gain Apple's AAT and SIL's Graphite happy goodness, too, since OpenType isn't by itself sufficient for all the world's scripts. It also looks like it will be powerful enough (or simple enough, if you prefer) to be useful for projects such as Inkscape, Scribus and GIMP, all of which desperately need better text layout and font smarts even for Western scripts, let alone others.

Part of my reason for going was to make sure that what we (W3C) do with XSL-FO (and maybe with SVG and CSS too, as well of course as Internationalization) is compatible with what's going on in the world. That means making sure we're aware of what's going on, and enabling a two-way conversation, inviting people to participate in the W3C work where necessary too.

The Text Layout Summit was hosted by the KDE aKademy, but I didn't get to go to any of the aKademy sessions unfortunately.

One person in our group did try out the Mandriva Flash USB Linux that was given away, and was very impressed with it. He said it was the first Linux that had set up the display on his laptop at the right resolution so it actually worked. I tried it on my HP desktop at home yesterday and it worked there too, which was cool as the computer uses an ATI graphics card which until recently was supported by neither the Free nor the closed source drivers.

Commenting...

kelly, binary thinking is not of course limited to Wikipedians. Them or Us, Bad or Good, White or Black (or, Black or White, depending on context), ignorant or wise, male or female, people like to sort others into categories. Only a white sock wearer would be so stupid as to think this was sensible.

In some societies it seems that there is a strong link between the divisions into categories and "good or bad". It seems to be stronger in much of the US than in much of Canada, for example, which perhaps helps to make Canada more accepting of difference. But that's a generalization, and of course you can meet people in either country at either end of the spectrum (and I have, many times). Just as you can find reliable or unreliable Wikipedia articles, or good or bad articles in pretty much any publication, The Register included.

You have to expect that people will do this categorising and judging. The judging part isn't good of course, but people will do it anyway. "You're not one of us, so you're not such a good person" seems instinctual. When it turns into "you don't agree with me so you're not a good person" something has gone even more badly wrong. Which brings me to...

zbowling, boy what a rant! The speaker (from MSNBC) is right of course, although I don't think the episode was the first example of hypocrisy from Bush. Any leader of any group is under pressure from dissenting views, all the time, and again it may be unreasonable to expect perfection (although people do); on the other foot, it's not clear that the Bush regime is any less corrupt than those it sought to depose, nor that there are fewer deaths or injuries under the Bush colonization than before. At any rate, thanks for the link to the video!

federico, yes, I like very much the 50mm f/1.8 lens that I have for my Canon 400D; there's an f/1.2 but it's too expensive for me right now. I rented a 70-200mm f/2.8 lens a couple of weeks ago and liked that too, especially the image stabilisation, but it would cost more than a thousand pairs of socks! I'll post some of the pictures, or links to them, when I have found time to put them online. I especially like your hanging-dye-bottle picture! The warm colours in the others are great too. It's something I liked about a recent trip to Italy (San Vivaldo in Tuscany). More Italy pictures coming too :-) but I have not processed those in any way, they are just out of the camera.

6 Jul 2007 »

Going Home

I'm about to return to Canada after being away for a couple of weeks. I went to Pisa, where the slowly-falling-tower lives, and on the way Alitalia lost my bag. It arrived a day or so later. We had XML Query and XSL Working Group meetings in Pisa, and then I flew to Glasgow, where British Airways lost my bag. I've been at the Text Layout Summit, which was interesting and I think useful. It was co-hosted by aKademy, the KDE conference, although I didn't get to any of the aKademy sessions unfortunately. Tomorrow morning I go home, but my bag still isn't here. Maybe I can fly naked. Calling the British Airways lost luggage number has been a very expensive and unpleasant affair: I think I've spent over an hour on hold, and I'm using a Canadian mobile phone with both trans-atlantic and roaming fees, yay. I kept hoping the bag would arrive soon and didn't go and rent another phone. Sigh.

HarfBuzz is interesting, and it appears that it will be used by both Qt/KDE and Gtk+/Gnome as the text shaper. So applications in the open source/Free world will start to have access to more advanced AAT and OpenType features, and internationalization will take a big step forward.

I'm also still working away at my scans from antiquarian books, and at least partly as a result, the GIMP image editor is now significantly faster: I routinely work with images that are hundreds of megabytes in size, say, 10,000 pixels on a side. The GIMP developers are very receptive to (sufficiently specific) comments about performance. And today someone offered to help with lq-text, the Unix text retrieval package that I first released in 1989. So that's pretty cool. Oh, and at LGM2 in Montreal someone offered to donate some scans of some old Russian books of alphabets, but unfortunately I lost the person's address (an SK1 developer from the Ukraine) so if you're reading this, sorry, please email me, e.g. liam at holoweb DOT net!

2 May 2007 »

I need to remember to check back often enough to see if people reply when I do talkback stuff. Hmm.

Tomorrow I take the train to the Libre Graphics Meeting in Montreal. After that I'm off to the W3C Advisory Committee Meeting and then WWW2007 in Banff (near Calgary, not to be confused with Calvary).

At both LGM and WWW2007 I'll be talking about what we're doing with XSL-FO 2.0. XSL is a way to format XML documents, for example for print or screen. There are two parts, XSLT and XSL-FO. We just published XSLT 2.0 this January (at the same time as XML Query, as they both build on XPath 2.0) and now we're working on XSL-FO 2.0. It's pretty exciting, as we're considering standardising a whole lot more sophisticated layout stuff than things like CSS give you, much of it stuff that people have been doing for hundreds of years with print and that is understood pretty well. So I'll show some examples of the sorts of things we're thinking about, and talk about how people can get involved.

Csv, yes, it's a big improvement, but from the perspective of graphic design and typography (the user interface of text and communication, if you will) there are still (as always, it seems!) some improvements that could be made. The most obvious to me is that the counter box is not aligned with the other boxes, and alignment is lost elsewhere. I'd get rid of the "Options" heading since the entire dialogue box is about choosing options such as destination folder. I had a quick go at improving it, I hope you don't mind:

EOG save-as dialogue

It raises a HIG question that's been endlessly and uselessly debated... The alignment of the labels in dialogue boxes is always difficult, as there's no single approach that works in all situations. It's similar to the problem of designing a table of contents for a book.

The best guiding principle is of proximity: put related things nearer to each other than to other, unrelated things. For example, a section heading should be nearer to the text it heads than to the preceding section, something Web browsers by default tend to get badly wrong. So, the label should be strongly associated with the value in most cases.

Now, the values are encased in vast and unavoidably ugly boxes which are the most visible things in the design. So we try to turn an ugliness into a strength by aligning them all, to give strength to the design. But if the value boxes are aligned vertically and the labels need to be near them, in a left-to-right world our choices become putting the value to the right of the label, or right-aligning the labels. In a right-to-left environment obviously the choice is the same, but in the other direction.
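To make the two options concrete, here is a small plain-text sketch of the same form laid out both ways, with the value boxes aligned in each. The `render` function and the field labels are invented for illustration; they are not taken from the real dialogue.

```python
def render(fields: dict[str, str], align_right: bool) -> list[str]:
    """Lay out label/value rows so that the value boxes line up vertically."""
    width = max(len(label) for label in fields)
    rows = []
    for label, value in fields.items():
        text = label + ":"
        # Right-aligned labels end next to their value; left-aligned labels
        # leave a gap that grows as the label gets shorter.
        padded = text.rjust(width + 1) if align_right else text.ljust(width + 1)
        rows.append(f"{padded} [{value}]")
    return rows


if __name__ == "__main__":
    # Hypothetical labels, loosely modelled on a save-as dialogue.
    form = {"Folder": "Pictures", "Filename format": "img-%n", "Start counter at": "1"}
    print("\n".join(render(form, align_right=False)))
    print()
    print("\n".join(render(form, align_right=True)))
```

In the left-aligned output the short label "Folder:" sits a long way from its box, while in the right-aligned output every label touches its value, which is the proximity principle at work.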

Of course, other factors come into play. One is familiarity with badly designed dialogue boxes that are already out there. Since familiarity is the most significant factor in comprehension, this is very important, and may be enough of an argument in itself to make an ugly dialogue box that flies in the face of what we know about human perception, but works better because people are accustomed to it. The use of Fraktur typefaces in Germany might be counted as another example of this.

Another factor is whether the labels or the values are the primary items of interest to the user, and this of course varies depending on the dialogue, the user, the application, and also the user's familiarity with the dialogue. I love Alan Cooper's idea of designing for the "perpetual intermediate" and assuming that people are only vaguely familiar with the dialogue, if at all. In that case the ability to scan down quickly and relate labels and values is most important, leading again to the right-aligned version. But sometimes people need to compare labels, or the labels perhaps are sorted alphabetically in a large list, and then left-aligned labels would be best, with the values to the right of them. But the HIG and I disagree in this area I think.

Oh, I should mention that I'd be tempted to treat the filename preview differently, since presumably it might be arbitrarily long and not fit in the dialogue box, but I don't know enough about the possible forms they may take to give a good suggestion I think. Or maybe the dialogue resizes as they grow.

dwmem2, you're missing something about GNOME I think. The idea is not to get rid of all configurability, but to get rid of useless configurability (e.g. whether the rate of acceleration of the panel when it auto-hides should be linear or quadratic). That is, remove useless features without impacting functionality, and to get to a point where most things work without needing to be configured.

How well GNOME is succeeding at this can be argued, not least because it's very subjective, but I see it as a big improvement, even though sometimes I miss configuration options that I used to enjoy :-).
