forrest: Yes, Unicode/UTF-8 should be the default charset and encoding for Advogato (technically, UTF-8 is not a charset). So basically I need to convert all the Latin-1 stuff in the database over, then switch over the reported charset.
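For what it's worth, the conversion step itself is small. Here is a minimal sketch in Python, assuming the stored values are available as raw bytes; the function name `latin1_to_utf8` is made up for illustration and is not part of any Advogato code. It leaves alone anything that already decodes as UTF-8, so it is safe to run twice. That check is a heuristic: a few Latin-1 byte sequences also happen to be valid UTF-8, so a real migration would want to spot-check the results.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8 (hypothetical helper, not Advogato code).

    Bytes that already decode cleanly as UTF-8 (including plain ASCII)
    are returned unchanged, so the conversion is idempotent.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        # Every byte value is a valid Latin-1 character, so this cannot fail.
        return raw.decode("latin-1").encode("utf-8")
```

Only after every row has gone through something like this would it be safe to switch the reported charset.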
By the way, Google search results are now multilingual, with Russian, Japanese, and other alphabets all mixed in on the same page. They seem to have gone back and forth on this; even recently I got the "results can not be displayed in this character set" message. In any case, I think it's cool.
More blog navel-gazing
I expected to get a lot of response to my last entry, but I didn't. I tried to argue it fairly and carefully, to best reach an audience of journalists (to whom I expect it would be quite controversial), but to my usual readers I expect I'm preaching to the choir. Perhaps if I had blamed the media for their role in the unbelievable ignorance of Americans, it would have stirred up more response.
In any case, there are some downsides to blogging, or at least areas where it needs work. For one, not everybody is capable of critical reading (from the survey above, the fraction would seem to be less than 17%). The mainstream media is actually pretty good at distilling a story down to a form where busy people can absorb it quickly. Blogs aren't, at least not yet. I'm hopeful that technical innovations can help with that, not least the use of trust metrics to ferret out the good material, but of course people have to be writing that first.
Needless to say, I didn't get any e-mails from newspaper editors on why they're not covering Bruce Kushnick's book. The most parsimonious answer is that their souls are simply 0wnz0red, and they're no more capable of breaking a story on the corruption of the telecoms industry than Hilary Rosen is capable of writing an editorial on how music trading is sometimes good for artists.
But (and this is a big but), the blog world is not (yet) doing a good job covering this story either. Bruce's publication of the book is a good start, but there's a lot of followup work to be done: fact-checking, correcting mistakes, unearthing more evidence, summarizing the highlights, getting the word out. This is exactly the sort of thing that journalists claim to be good at, because they have the resources to do it. Perhaps bloggers don't, although my personal belief is that it's the kind of work that lends itself to the sort of distributed effort that's so effective in creating free software.
Word to PDF
I'm not sure whether it's better to try to create a batch renderer project now, or whether it's best to work on existing tools, such as the renderer in AbiWord. If the latter is really, really good, then it can be used as a batch renderer, and we're done.
Even if everybody's needs are being well met by the existing projects, in retrospect I think there would have been significant advantages to doing the batch renderer first. As cuenca points out, it's a considerably simpler problem because you don't have to design your data structures for incremental update and so on. So I think there would have been high-quality rendering much earlier than we're seeing now with the GUI-focussed work.
In any case, for people contemplating new projects to work with complex file formats, I think the advice is sound: do the batch processor first, then adapt it to work interactively. ImageMagick and netpbm happened before Gimp, and for a good reason.
A regression suite is absolutely an important part of such a project. Even better, it should be possible to use such a suite with other word processors, such as GUI editors.
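To make the idea concrete, here is a rough sketch of a renderer-agnostic regression harness in Python. Everything in it is an assumption for illustration: the names `run_regression` and `digest`, the idea of treating a renderer as a plain function from document bytes to output bytes, and the choice of comparing SHA-256 digests. Because the harness only sees a callable, the same set of cases could in principle be pointed at a batch renderer or at a GUI editor driven in a headless mode.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# A renderer is modelled as any function from document bytes to output bytes.
Renderer = Callable[[bytes], bytes]


def digest(data: bytes) -> str:
    """Hex SHA-256 digest of the rendered output."""
    return hashlib.sha256(data).hexdigest()


def run_regression(render: Renderer,
                   cases: Dict[str, Tuple[bytes, str]]) -> List[str]:
    """Return the names of cases whose output no longer matches.

    `cases` maps a case name to (input document, expected output digest).
    """
    failures = []
    for name, (doc, expected) in sorted(cases.items()):
        if digest(render(doc)) != expected:
            failures.append(name)
    return failures
```

Exact digest matching is the strictest possible comparison. A suite aimed at "best effort" rendering would more likely compare rasterized pages within some tolerance, but the overall shape would be much the same.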
I'm not enthusiastic about transcoding into another existing document format such as TeX. This path makes it easy to get basic formatting right, but probably much harder to get it really good. The idea of TeX code to match Word's formatting quirks makes me cringe.
AlanShutko: It's not surprising that Word's layout has changed over the years. In fact, it's fair to say that interchange and compatibility in the Word universe only work well if everybody is using the same version. I'm sure that the fact that this fuels upgrading is merely a coincidence :)
Even so, that doesn't make the problem impossible, just harder. I believe that Word documents self-identify the version of Word that generated them. Therefore, in theory at least, it should be possible to create a pixel-perfect rendering of the document as seen by the writer. SMB has many implementation variances, but that doesn't stop Samba from being viable. The goal, as usual, should be "least surprise".
Of course the rendering depends on the font metrics. Is there anyone who believes it shouldn't? Depending on the printer is a misfeature, of course, but as I've argued above, a "best effort" is likely to make people happy.