Older blog entries for raph (starting at number 324)

Autopackage, namespaces, and DNS

I read about autopackage in LWN recently. It seems like a useful project, and I wish it well. I've certainly run into my own share of pain trying to install VoIP software and the like recently.

I'm very happy to see thought going into the question of what packages should look like. I've always felt that Linux package formats have been somewhat ad hoc and given over to the "scripting mentality", and that most distros sidestep the fundamental problem of resolving dependencies and versions by trying to create a snapshot of packages that just happen to work together. Over the long term, I'd love to see this replaced with something more systematic.

A good test of agility for package frameworks is whether they work on systems other than Unix. One of the most interesting things I've seen in this space is CLR Assemblies. From what I've seen, these really do try to be systematic and general, but of course are bound to the CLR runtime.

Indeed, one of the reasons that Java is so disappointing as a desktop platform is that they had the opportunity to really address the packaging problem, but blew it. The reality of Java packages is quite a mess: classpaths, .class files, jar files, war files, and of course "Web start" in a futile attempt to paper over the whole mess.

There is one aspect to autopackage's design that immediately struck me: its use of a DNS-rooted namespace. In fact, DNS is becoming the de-facto root for all kinds of namespaces, of which of course the Web is one of the biggest. This would be very cool if it weren't for the fact that the management of DNS is so corrupt. Even so, it basically works.

One of the discussions I had with John Gilmore at CodeCon was about what a next-generation DNS replacement should look like. I do believe that it's possible to fix many of the political problems of current DNS with better technology. Specifically, the single trust root of the existing DNS is just too tempting a target for parasites like the ICANN leadership. A better system would have distributed trust.

But I don't envy the person who tries to replace DNS with something better. One of the thorniest questions is what the policy for name disputes should be. I'm partial to pure first-come, first-served, largely because it's the only policy simple enough for people to understand, but I think it would encounter a lot of resistance in the real world. In particular, there's nothing to prevent squatters from bulk-registering all the words and trademarks in the world.

But what is a better policy? You can't really talk about a name service being secure unless you've specified a formal policy. It's a thorny problem. I sketched one possibility in my FC '00 submission, and am writing up an expanded version of that as a chapter in my thesis. It is in many ways an appealing design, but even I don't have confidence it's what the world should adopt.

So hopefully, we'll have smart people continue to put some thought into what kind of name service we really want. DNS is a very impressive accomplishment, and of course hugely useful, but eventually we're going to want something better.

Bayes and scoring

There's a lot of talk of Bayesian spam filtering these days, including an implementation in the latest SpamAssassin beta. Indeed, Bayes is cool, but did you know that it's actually equivalent to systems that assign a score to each word (or other feature) and add them up?

Paul Graham popularized Bayesian statistics in his Plan for Spam. He analyzes word frequencies in a corpus of spam, and of non-spam, so each word gets a probability that it's spam. For example, "viagra" might be assigned a probability of 0.99 spam, and "eigenvector" 0.01 or so.

Then, when a mail comes in, you look at the 10 words with the most extreme probabilities (words common to both spam and non-spam don't tell you much and so won't be counted). Bayesian statistics will give you a probability that the email is spam or not, assuming that the probabilities of the individual words are independent (not a really valid assumption, but perhaps close enough).

The combining formula for two probabilities is ab / (ab + (1 - a) (1 - b)). But use the transform f(x) = log(x) - log(1 - x), and the equivalent combining rule is just f(a) + f(b). Do the math!
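
To see it concretely, here's a tiny C program (compile with -lm; the two word probabilities are just the made-up examples from above):

#include <math.h>
#include <stdio.h>

/* Bayesian combination of two spam probabilities. */
static double combine(double a, double b)
{
    return a * b / (a * b + (1 - a) * (1 - b));
}

/* The transform f(x) = log(x) - log(1 - x), i.e. log-odds. */
static double f(double x)
{
    return log(x) - log(1 - x);
}

int main(void)
{
    double a = 0.99;  /* "viagra" */
    double b = 0.01;  /* "eigenvector" */

    /* The Bayesian combination, pushed through f... */
    printf("f(combine(a, b)) = %g\n", f(combine(a, b)));
    /* ...equals the sum of the two per-word scores. */
    printf("f(a) + f(b)      = %g\n", f(a) + f(b));
    return 0;
}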

So you don't really need Bayes to do this computation. Perhaps it's most useful to think of the Bayesian math as giving a sound probabilistic interpretation of score addition, which seems fairly ad hoc at first sight.

Doing this kind of combination entirely in linear space is interesting to me, because it seems much easier to combine with other techniques. After all, eigenvector-based trust metrics are based directly on linear algebra. I haven't wrapped my head completely around how I'd meld these two ideas together, but it certainly is intriguing.

Sleep study

I had my annual physical yesterday (with a new doctor), and again the topic of doing a home sleep study came up. I'm pretty sure I have sleep apnea, but when I got a sleep study done last year, it was inconclusive. It showed only snoring, no actual apnea, but it also showed no REM sleep, which is strange. I'd really like to be able to take measurements over a longer period of time.

Primarily, I want to measure 4 things: EEG (for determining sleep stage), sound (snoring), airflow, and pulse-oximeter (for determining whether breathing is supplying adequate oxygen).

I've done plain sound measurements already, using the mic input of my laptop. EEGs are more challenging. Pro equipment costs a small fortune - even on eBay, there doesn't seem to be a good supply. I see a DIY project called OpenEEG, which might work. Obviously, I need electrodes and pre-amps, but it seems to me an off-the-shelf A/D card (maybe a LabJack) might save me some time, and be more versatile for picking up the other inputs.

Most of the people doing home EEG seem to be into biofeedback, but from what I can tell the requirements are fairly similar.

I'd like to hook up with others who might be interested in building a home sleep study, or who can give me tips on finding the sweet spot between spending too much time and too much money. If I'm successful, I definitely want to post my recipes and software, as it's very likely to be useful for others.

Well, CodeCon is over. I think my talk went pretty well. At least, I got some good questions after the talk, which is always encouraging.

Talk of war

chalst: I'm basically in agreement with cmm here. The free software community has some advantages over the unwashed masses; we can mostly read and write, even sometimes think, and we're very comfortable with challenging the conventional wisdom. But I don't think there's anything that gives us any special insight or privilege compared to other thoughtful people.

Ordinarily, I would consider discussions of politics to be off-topic for this site, but this war threatens to affect us so deeply that I think it deserves some attention from everybody. It's a scary thought, but if it goes badly, it could change some priorities; we could be worrying more about how to treat radiation burns than whether it should be "Linux" or "GNU/Linux".

That said, given the focus here, I'd like to see mostly posts that bring insight, or have some special relevance for free software people. There's an awful lot of stuff written on the Net about the war, and frankly, most of it is dreck. That includes knee-jerk anti-Bush flaming just as much as knee-jerk pro-war (or "anti-peace", as I prefer to call it :) sentiment. I much prefer things that make me think. John Perry Barlow's Sympathy for the Devil is one recent such piece.

I pray that we can avert a large-scale conflagration in which many people die, and hatred of America rises to a fever pitch. I think the uncertainty about it is really hard on people - a lot of people around me seem down, and a friend of mine has observed a trend of "shabbiness".

CSS

sdodji: have you looked at the RCSS codebase at all? It uses some clever algorithms to efficiently do the CSS selector processing. It wasn't written with the Simple API for CSS in mind, but you might find some of it useful in any case. You're welcome to use the code any way you see fit, and if you want me to explain some of the more rocket-scientific aspects, just ask.

Work

A lot of cool things are happening. For one, rillian is getting good results out of the jbig2 code. It actually renders nontrivial PDF files now, although it needs some cleanup to make the error handling more robust, etc. It sounds like we'll have real users soon.

I'm also very, very excited to be working with tor on the design of Fitz and related things. I think the first chunk of released code will be a library of filters for PS/PDF (mostly used for compressed images). This will give us a chance to gain some valuable experience with the new runtime discipline in the context of a well-defined problem domain.

Conscious design of runtimes is fun, but challenging. Our main goals are ease of integration with diverse codebases, performance, and robustness. I've been carefully studying the Ghostscript stream implementation, and have found a number of small bugs, areas where performance can be improved, and ways in which we can better tolerate exceptional and corner cases. I think the new code will be altogether simpler as well.

So we're really trying to do things right. One of the elements going into the runtime is an interface for atoms (in the Lisp sense; they're called "names" in PostScript/PDF lingo). These need to be very fast, have an easy interface, and not leak (I found it interesting to learn that Java interned strings did leak until the JVM 1.3 and weak references). After some discussion, I think we've arrived at a good answer.
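
Just to give a flavor of the kind of interface I mean, here's a hypothetical sketch in C; it's illustrative only, not the actual answer we arrived at:

/* Hypothetical atom (interned name) interface -- a sketch, not the Fitz design. */

typedef struct atom_table atom_table;   /* opaque */
typedef const char *atom;               /* canonical, NUL-terminated */

atom_table *atom_table_new(void);
void        atom_table_free(atom_table *table);

/* Intern a byte string, returning a canonical pointer: two atoms are
 * equal iff their pointers are equal, which is what makes name lookup
 * in a PostScript/PDF interpreter fast. */
atom atom_intern(atom_table *table, const char *name, int len);

/* The "not leak" requirement is the hard part: names interned while
 * processing one job shouldn't pin memory forever.  Per-job tables
 * freed wholesale, reference counts, or weak references (the route
 * Java took for interned strings after 1.3) are all possibilities. */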

Tor and I are mostly using irc to communicate, and it's working well. We had another wide-ranging discussion today, including careful analysis of Quartz Extreme and general design questions about how to get inter-app transparency working well in both software-only and hardware-accelerated environments.

These are exciting times! I'm happy to be alive.

An interim entry from the floor of CodeCon, thanks to wireless networking provided by Up Networks.

Alan's blog

Last night, Alan wrote the first entry in his new blog. I typed most of it, but he's rapidly getting better at keying.

I'm hoping that this blog will motivate his writing. I'm sure he'll appreciate feedback (for now, just send it to me).

My Codecon slides

I'm putting up a draft of my presentation. Some of it might be difficult to follow without the narration, but you might find it interesting nonetheless.

Crowd counting

I've been following the various crowd estimates for the peace marches and demonstrations in San Francisco. Traditionally, it's very much an inexact science, and estimates vary widely.

For the last march, the SF Chronicle did something very cool: they took high-resolution timestamped aerial photographs, measured them, and posted them to the Web. Surprisingly, this count (65,000 at the 1:45pm snapshot) is considerably smaller than the consensus estimate (200,000, including people who left the march earlier or joined later).

In the grand scheme of things, the exact number marching is not that important. The Jan 18 one was an amazing expression from the people, and my friends who were at the Feb 16 one tell me that it was even more intense. It's not just San Franciscans, and it's not just Americans. People from all over the world have expressed themselves.

Even so, the wide range of estimates, and the variations in the reporting, illustrate the impact of viewpoint on what should be, after all, a fairly easily quantifiable, objective truth. We are being asked to evaluate the risks of going to war against the risks of not going to war, based on data that's at least an order of magnitude fuzzier than the simple question of how many people were on the streets of San Francisco. This is not easy.

I am not impressed with the International Answer people's response: `"Oh my word. Come on, that's ridiculous," said Bill Hackwell, spokesman.' It's possible he was simply quoted out of context, but I'm curious to know exactly what he thought was ridiculous.

I am passionately anti-war, even more passionately anti this war, but most deeply pro-truth. The Chronicle showed how seat of the pants guesstimating can be replaced, using a bit of technology, with hard data. I think this is progress, and fervently hope that we see more of it.

Codecon, day 1

I just got back from the first day of codecon and the Google-sponsored speaker reception afterwards. I was expecting it to be intense, but misunderestimated exactly how so. I met a lot of people, including old friends, more than a few cypherpunks, people I know online but met for the first time in person, and people I've been wanting to meet for a while. There are lots more people I didn't get a chance to really talk to; hopefully Monday.

Google is snatching up lots of smart people now. Spencer Kimball and Peter Mattis, of Gimp fame, are reunited once again (in fact, for almost a year, but I only just learned this). We had a very nice talk. They're both passionate about their work for Google. There's a reason why Google is able to provide such an amazingly valuable service, and it has a lot to do with the caliber of people working for them. I also enjoyed talking with Nelson Minar.

I also got to meet Larry Page, but felt like I kinda flubbed it. I also managed to just about lose my temper with John Gilmore arguing about what properties a next-generation DNS should have. This caught me off guard - I'm generally pretty levelheaded. I did apologize, and afterwards John said it was the best discussion about DNS he'd had in a while, so I guess not all is lost.

Vipul, of Vipul's Razor and now CloudMark, is very cool. I was struck by his depth of thinking, and his efforts to balance the technology, the social good (including free software releases), and the business. We talked about some of my more speculative ideas about how to use trust to defeat spam, and we really connected. He seemed to immediately understand the goals of my research, and I appreciated his perspective on deploying real systems for paying customers. I hope we get to work together.

Of the Codecon talks, my favorite was the panel on version control, with Larry McVoy (Bitkeeper), Greg Stein (Subversion), and Jonathan Shapiro (OpenCM). The conference organizers were nervous that it would degenerate into a licensing flamewar, but they needn't have worried. It was obvious that the panelists have a tremendous amount of respect for each other's work, and that the differences between these projects largely reflect differing goals.

A common theme was how difficult it is to get configuration management right. Everybody seriously underestimated how much time it would take to get a usable system going. Also, while there was definite agreement that CVS is broken and not easily fixable, there wasn't a clear consensus that most people have a strong motivation to migrate from CVS to any of these new systems. CVS actually works reasonably well for most open-source projects, where you don't typically have lots of people pounding concurrently on one file. This kind of scenario is very common with paying customers, and BitKeeper handles it well. Of course, any modern configuration management tool (with atomic transactions, robust tracking of changes, etc.) will be able to do a much better job than CVS, but that's not saying much.

I haven't decided whether the Web-based infrastructure of Subversion (particularly WebDAV as the client/server protocol) is a good thing or a bad thing. I think it depends a lot on what kind of user we're talking about. Windows and Mac can mount WebDAV right onto the desktop, which means that unsophisticated users can do version controlled operations just by clicking and dragging. For some applications, this is a huge win, because you can do things like back out unintentionally bungled changes, roll the clock backwards to get a consistent snapshot at some particular time, and so on. These are real problems that users have, and which the stock filesystem based implementation of folders doesn't solve.

For free software programmers, I don't see this as such a big win. Regarding integration with existing tools, people don't mount WebDAV folders from an Emacs mode, but there are Emacs modes for CVS. Then you have to deal with cruft like HTTP authentication (most Subversion deployment seems to use HTTP basic auth over SSL, which I guess is workable, but doesn't strike me as exactly the right way to do this).

In any case, I'm really glad that good work is happening in this space, and I'm hopeful that a really viable alternative to CVS will emerge. Subversion could well be it, but that's not a given, and in the long run, one of the other projects could turn out to be more robust, scalable, and overall a better match for the needs of free software developers.

Oh, and while I generally respect Larry's right to license BitKeeper however he wants, I did not at all get a warm and fuzzy feeling about it. In fact, it feels to me that his "free use" licensing terms are in fairly direct conflict with the spirit of the free software community. I am definitely not tempted to use it for Ghostscript or related projects. But if you're looking at BitKeeper as an alternative to Perforce or some other proprietary CM system, take a look; there's a good chance it'll do what you want.

Booger

Joey DeVilla's favorite amalgam of "Google" and "Blogger" is "Booger". Yeah!

There is one thing, I think, that Google and a blog hosting engine inside the same trust boundary can do that would be somewhat difficult otherwise: making backlinks work really well, based on both linguistic analysis for relevance and, of course, PageRank. It's possible to use a trust metric to automate links between blogs in a more distributed context, but so far nobody's been smart enough and motivated enough to actually try to build it. It's probably a lot more likely to happen in a centralized, infrastructure-rich setting.

Off-topic

Here are two interesting and related interviews. The first describes how child psychiatry has a history of being science-resistant, but advances in the field are overcoming this. The second describes some of the cutting-edge research being done at the NIMH, and the palpable enthusiasm of Dr. Manji in being part of the community. I've long been fascinated by the signalling and computation that goes on in networks of cells, and found my interest rekindled by this interview.

This essay by Kanan Makiya is interesting. He's far from a disinterested party, of course, but I certainly agree that these kinds of discussions should be taking place out in the open. Real democracy is messy and unpredictable. Perhaps it's even true that, as the State Department under Colin Powell and the CIA believe, "it could have a destabilising influence on the region."

Google buys Blogger

Breaking news: Google buys Pyra. This only kinda makes sense to me. As I've written before, Google and blogs have a synergistic relationship, but to pick a single platform in this time of experimentation and ferment seems odd.

cactus: I see your point about the word "blog" being the latest hype fad, but it is a useful word. To my mind, it simply means posting your writings online in a reverse-chronological format, and with plenty of Web links for further reading. Advogato diaries qualify.

Of course, what people do with the format varies widely. Some write about their cat's hairballs. Others use it as a tool for intellectual inquiry, and perhaps to participate in the distributed leadership of the free software community. In fact, by numbers alone there are many more of the former.

One of these days I'm going to have to write up my thoughts on "humble elitism". (When I mentioned this phrase to Heather, she asked me if it was like "compassionate conservatism", so I think I'll have to pick a different name.) I strive to make my blog one of the elite, but only by pouring thought and good writing into it. Usually, "elitism" refers to some kind of caste system. And of course, on any given diary entry I'm liable not to live up to my goals. In any case, I certainly enjoy trying.

A country code for VoIP

I saw on boingboing a few days ago that there's now a country code reserved for Internet phones. I had a little difficulty understanding what that meant, but think I've got it now. Essentially, this is a way to bring VoIP phones into the standard phone number namespace. It is in this sense a dual of ENUM, which is a gateway to access the phone number namespace through DNS.

From what I can see, this new country code is being run by FWD (Free World Dialup). You register for a free account using a simple, straightforward Web form, and you get a number. Mine is 18408. Then, you point your SIP phone's config to the FWD server, register, and then when people query the FWD server for your number, they find your phone. For example, to reach my phone, dial sip:18408@fwd.pulver.com (try it; I'll try to keep a phone app running).

This number also now exists in the POTS number namespace, but your phone company won't route to it yet because they're evil. As soon as public pressure overcomes their evilness, you'll be able to reach my VoIP phone simply by dialing 011 +87810 18408 from your US phone.

I think this is a huge step. To the extent that people can call your phone, it makes it practical to go VoIP only. Of course, you can do that today with a service such as Vonage, but that costs $40/month, and this is free.

From what I can gather, FWD is going to make a little money off "long distance charges" from phone companies that peer with them. I like this idea - it would seem to provide a revenue stream that would actively promote the use of VoIP phones. You can bet that the telcos are going to drag their heels as much as possible.

I think there's one more piece to this, which is phone cards. Even if your scumbag incumbent telco won't peer with FWD, you'll probably be able to shell out $20 for a phone card with a company that will. There's no reason why these companies can't provide service for a penny or two a minute. The standard phonecard service, after all, is basically two telco-to-Internet gateways joined back-to-back. Here, the caller just buys one of them. So this basically solves the problem of being callable by my Mom. All she has to dial is 1-800-call-crd, then a (typically 10-digit) PIN, then 011 87810 18408. Only 34 digits, but at least she'll be able to reach me.

Phones

PCs running phone software don't make good phones. A dedicated piece of hardware is better. Even aside from the general flakiness of sound cards and drivers, phones are a lot better at ringing and being always on.

You can buy a phone like a Cisco ATA 186 for about $150 from eBay, but I think the price is going to come down to $50 or so once D-Link or Linksys gets into the game. Basically, it's the same gear as a phone with a built-in digital answering machine (AT&T brand $30 at Best Buy), plus a 10/100 Ethernet interface.

In any case, I tried out kphone and gnome-meeting again, and was successfully able to complete calls with both. I had trouble compiling GM 0.96, so no doubt I'll give it another go when I upgrade to RH 8.1.

I'm less impressed with kphone. I could receive audio ok but not transmit, so I took a look at the code to see what was wrong. The actual audio interface code is buggy and unsophisticated. One of the most basic problems is their use of usleep(0) to wait for the next timer tick for basic scheduling. This, of course, is hideously dependent on the details of the underlying kernel scheduler, and in any case, gives you very poor temporal resolution on PC hardware. Even worse, if 5 ticks go by without an audio packet being ready, the code reads a packet and drops it on the floor, for what reason I don't know.

There's also a problem with the kernel audio drivers I'm using (alsa 0.90beta12 with Linux 2.4.19). Even though kphone does a SNDCTL_DSP_SETFRAGMENT ioctl to set the fragment size to 128 bytes, the actual value, as returned from SNDCTL_DSP_GETISPACE, is 2048 bytes, which is way too big (it's 125ms). Combined with the packet-dropping logic above, the net result was no audio.
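
For reference, the negotiation looks roughly like this against the OSS API (a sketch only: error handling is mostly omitted, and a real app would set the sample format, channels, and rate before checking):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/soundcard.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/dsp", O_RDONLY);
    if (fd < 0) { perror("/dev/dsp"); return 1; }

    /* Request 128-byte fragments: the low 16 bits are log2 of the
     * fragment size, the high 16 bits are the maximum fragment count. */
    int frag = (0x7fff << 16) | 7;   /* 2^7 = 128 bytes */
    ioctl(fd, SNDCTL_DSP_SETFRAGMENT, &frag);

    /* See what the driver actually granted. */
    audio_buf_info info;
    if (ioctl(fd, SNDCTL_DSP_GETISPACE, &info) == 0)
        printf("asked for 128-byte fragments, got %d bytes\n", info.fragsize);

    close(fd);
    return 0;
}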

People should not have to worry about this. I think it makes sense to wait until you can get a Chinese-made phone with Speex in it at commodity prices. Hopefully, this will happen soon.

A good homepage

I came across Miles Nordin's web site last night after following a link from the Java discussion on our front page. I found myself immediately absorbed. Miles writes well, is well read, and has a fabulously critical attitude. Many of the other pages, especially those having to do with wireless networking, are worth reading.

Word

cinamod: I basically agree with everything you say. If Abiword or OO are good enough, and the code is clean enough to be split out as a batch renderer, then there's no need for a separate codebase.

I've had a look at the Word document format, and it's not quite so bad as I was expecting. The documentation is atrocious, but the format itself seems fairly reasonable. Of course, I'm sure that if I got into the details I'd find lots of corner cases and bad hacks.

The main thing not to like is the obvious lack of design for forwards and backwards compatibility. No doubt, this is economically motivated - gotta keep that upgrade treadmill going.

On the plus side, the format was clearly designed with an implementation in mind (as opposed to the W3C process, for which implementation is a distasteful afterthought). It's fairly easy to see how to process a Word file very efficiently, in both CPU time and memory usage. For example, resolving stylesheets is a straightforward linear chain, as opposed to all the nutjob nondeterministic stack automaton stuff in CSS, or the mini-Lisp in DSSSL/XSLT.
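
For instance, a resolver for a linear based-on chain is just a loop (hypothetical structures here, not Word's actual on-disk records):

/* Sketch of linear stylesheet resolution -- hypothetical structures,
 * just to show the shape of the algorithm. */

struct style {
    const struct style *based_on;   /* NULL at the root */
    int has_font_size;              /* does this style set the property? */
    int font_size;
};

/* Walk up the chain; the nearest style that sets the property wins. */
static int resolve_font_size(const struct style *s, int dflt)
{
    for (; s != NULL; s = s->based_on)
        if (s->has_font_size)
            return s->font_size;
    return dflt;
}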

I'm tempted to write here about Word's plex/fkp/character run architecture as opposed to the more generic tree approach we tend to see these days, but probably most people would be bored with that level of detail. The top-level point is that algorithms for manipulating Word's structures on-disk are straightforward, while manipulating trees efficiently on-disk seems to require a lot of cleverness. Of course, with RAM so cheap these days, it's reasonable to ask whether memory-constrained processing of files is important at all.

The Word format is too tightly bound to a specific implementation, and it certainly shows in what documentation Microsoft has produced. They often seem to confuse the interface, which in this case is the on-disk representation of the document, with the implementation details.

In any case, I'm glad I've learned more about the file format. Its popularity means we have to deal with it somehow. Further, as PDF continues to become document-like and less of a pure graphical representation, it's important to understand the influence that the Word design has on its evolution.

I've commented before on the need for a good, open, editable document format. The lack of adequate documentation and Microsoft's proprietary lock on change control make the Word format unappealing. I've certainly thought about designing my own document format, but it's not easy to make a word-processing format much better than Word, or a graphics-oriented format much better than PDF. So that's probably a windmill I'd be happiest not tilting at.

UTF-8

forrest: Yes, Unicode/UTF-8 should be the default charset and encoding for Advogato (technically, UTF-8 is not a charset). So basically I need to convert all the Latin-1 stuff in the database over, then switch over the reported charset.
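
The byte-level conversion is mechanical, since every Latin-1 code point is also a Unicode code point; here's a minimal filter in C (just a sketch, not the actual migration code for the database):

#include <stdio.h>

/* Convert a Latin-1 byte stream on stdin to UTF-8 on stdout.
 * Bytes below 0x80 pass through; everything else becomes a
 * two-byte UTF-8 sequence. */
int main(void)
{
    int c;
    while ((c = getchar()) != EOF) {
        if (c < 0x80) {
            putchar(c);
        } else {
            putchar(0xc0 | (c >> 6));
            putchar(0x80 | (c & 0x3f));
        }
    }
    return 0;
}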

By the way, Google search results are now multilingual, with Russian, Japanese, and other alphabets all mixed in on the same page. They seem to have gone back and forth on this; even recently I got the "results can not be displayed in this character set" message. In any case, I think it's cool.

More blog navel-gazing

I expected to get a lot of response from my last entry, but I didn't. I tried to argue it fairly and carefully, to best reach an audience of journalists (who, I expect, would consider it quite controversial), but to my usual readers I expect I'm preaching to the choir. Perhaps if I had blamed the media for their role in the unbelievable ignorance of Americans, it would have stirred up more response.

In any case, there are some downsides to blogging, or at least areas where it needs work. For one, not everybody is capable of critical reading (from the survey above, the fraction would seem to be less than 17%). The mainstream media is actually pretty good at distilling a story down to a form where busy people can absorb it quickly. Blogs aren't, at least not yet. I'm hopeful that technical innovations can help with that, not least the use of trust metrics to ferret out the good material, but of course people have to be writing it first.

Needless to say, I didn't get any e-mails from newspaper editors on why they're not covering Bruce Kushnick's book. The most parsimonious answer is that their souls are simply 0wnz0red, and they're no more capable of breaking a story on the corruption of the telecom industry than Hilary Rosen is capable of writing an editorial on how music trading is sometimes good for artists.

But (and this is a big but), the blog world is not (yet) doing a good job covering this story either. Bruce's publication of the book is a good start, but there's a lot of followup work to be done: fact-checking, correcting mistakes, unearthing more evidence, summarizing the highlights, getting the word out. This is exactly the sort of thing that journalists claim to be good at, because they have the resources to do it. Perhaps bloggers don't, although my personal belief is that it's the kind of work that lends itself to the sort of distributed effort that's so effective in creating free software.

Word to PDF

Thanks for the great feedback from cinamod and cuenca on this topic. I'll try to respond.

I'm not sure whether it's better to try to create a batch renderer project now, or whether it's best to work on existing tools, such as the renderer in AbiWord. If the latter is really, really good, then it can be used as a batch renderer, and we're done.

Even if everybody's needs are being well met by the existing projects, in retrospect I think there would have been significant advantages to have done the batch renderer first. As cuenca points out, it's a considerably simpler problem because you don't have to design your data structures for incremental update and so on. So I think there would have been high-quality rendering much earlier than we're seeing now with the GUI-focussed work.

In any case, for people contemplating new projects to work with complex file formats, I think the advice is sound: do the batch processor first, then adapt it to work interactively. ImageMagick and netpbm happened before Gimp, and for a good reason.

Absolutely an important part of such a project is a regression suite. Even better, it should be possible to use such a suite with other Word processors, such as GUI editors.

I'm not enthusiastic about transcoding into another existing document format such as TeX. This path makes it easy to get basic formatting right, but probably much harder to get it really good. The idea of TeX code to match Word's formatting quirks makes me cringe.

AlanShutko: It's not surprising that Word's layout has changed over the years. In fact, it's fair to say that interchange and compatibility in the Word universe only works well if everybody is using the same version. I'm sure that the fact that this fuels upgrading is merely a coincidence :)

Even so, that doesn't make the problem impossible, just harder. I believe that Word documents self-identify the version of Word that generated them. Therefore, in theory at least, it should be possible to create a pixel-perfect rendering of the document as seen by the writer. SMB has many implementation variances, but that doesn't stop Samba from being viable. The goal, as usual, should be "least surprise".

Of course the rendering depends on the font metrics. Is there anyone who believes it shouldn't? Depending on the printer is a misfeature, of course, but as I've argued above, a "best effort" is likely to make people happy.

Fear

Patriot II draft

How blogs are better than mainstream media

The Washington Post recently ran a "journalist checks out blogs, doesn't quite see what the big deal is all about" story; a lot of these have been appearing lately, and this one seems entirely typical. I've been thinking about the differences between blogs and mainstream journalism for some time, so the appearance of this story in a highly regarded newspaper, and Dave Winer's criticism of the piece, inspired me to speak to the issue.

The main theme of the piece, as usual, is that blogs are an interesting phenomenon, but cannot take the place of professional news organizations. The typical blogger, according to the piece, posts mostly opinion and links to news stories from the mainstream media, as opposed to real reporting.

This is basically true, I think, but rather misses the point. Blogs are incredibly diverse, with a wide distribution of things like writing quality, fairness, objectivity, originality, passion, and so on. The average blog, frankly, scores pretty low on all these scales. But I tend not to read too many of those. I seek out the exceptional blogs, the ones that inform and delight me, move me with their words, bring stories to life, make me think. Even though these are a small fraction of all blogs written, I'm able to find quite a few of them.

By contrast, mainstream media tends to be uniformly mediocre. The actual difference in quality between a top newspaper and an average one is small. In fact, thanks to wire services, they tend to run most of the same content. In computers and software, aside from a handful of good technology reporters such as John Markoff and Dan Gillmor, there is almost no good reporting.

I don't read blogs the same way I read the paper, and that difference, I think, captures how blogs can be so much better. My "toolkit" consists of three essential elements: blogs, critical reading, and Google. In combination, they give me a reading diet that is, on most topics, vastly superior to what I'd get from reading the mainstream media.

To me, critical reading has two major pieces. First, trying to separate the wheat from the chaff. This is especially hard on the Internet (and in blogspace), because there is a lot of chaff out there. Second, reading multiple different views on a story, and trying to determine the truth from the bits for which there is consensus, and also to understand the real disagreements at the root of the differing views.

Synthesizing an understanding from multiple views is important because I don't have to depend on the objectivity of the writer. It is, of course, very important to judge how credible the writer is, what their biases are, and to what extent they let that distort the story. This isn't easy, and it's possible to get wrong. Even so, I find that I get a much clearer picture after reading two or more passionate stories from different sides, than one objective, dispassionate story.

Objectivity, while a noble goal, comes at a price. In the context of the media business, it usually guarantees that the reporter doesn't know much about the subject at hand. This, in turn, is most clearly detectable as a high rate of technical errors (Dave Winer points out some in the article under discussion), and the more worrisome, but less quantifiable, lack of insight. Ignorance about a topic also makes journalists more vulnerable to manipulation, at worst simply parroting press releases and "backgrounders". More typical is the way the mainstream papers accepted the SF police's estimate of 55,000 at the Jan 18 marches, even though the actual number was about triple that.

And on a lot of topics, learning about an issue leads one almost inevitably to take a side. Take the management of DNS for example. Of the people who know what's going on, those who do not have an interest in the status quo are almost all outraged. It's hard to find somebody who's both knowledgeable and objective, so insisting on the latter serves the story poorly.

The importance of Google

If you are going to read critically and sift through various viewpoints, the key questions are "what are other people saying about this?" and "how do these viewpoints differ?". As mentioned above, it's not trivial to find good alternate sources. But it's a skill one can learn, and there are tools that can help. Among the most important of these is Google. On any given topic, construct a nice search query, pass it to Google, and in a hundred milliseconds or so you'll be presented with lots of good links to choose from. Not all will be relevant or well-written, but you only have to sift through a dozen or two before coming up with a winner, and you can tell quite a bit from the results page, without even having to visit the link.

I'll give a couple of examples of how Google can provide more information than mainstream press articles. First was Nicholas Kristof's July 2, 2002 editorial in the New York Times entitled "Anthrax? The F.B.I. yawns". This editorial referred to a mysterious "Mr. Z". For whatever reasons (fear of libel suits, perhaps?), the New York Times did not see fit to print the name of this individual, so people reading the papers were in the dark for a while. Googling, of course, revealed the name readily.

A more mundane example is this Washington Post story on a terrorist scare. A kid on the plane asked a flight attendant to pass to the pilot a napkin inscribed "Fast, Neat, Average". This is an Air Force Academy catchphrase, the standard response to an "O-96" dining hall feedback form, and, according to USAF folklore, also used in Vietnam as a challenge-response. Cadets and graduates sometimes write the phrase on a napkin in the hope that the pilot is USAF-trained. In this case, the kid turned out to be a neighbor of an AFA cadet, without much good sense about how cryptic notes might get interpreted. In any case, the Washington Post article carefully omits the response ("Friendly, Good, Good"), even though it's easy enough to find through Google, among other places in a speech by President George H. W. Bush.

Other newspapers do worse. The Washington Times manages to misquote the three-word phrase. The AP wire story, as published by the Seattle Post-Intelligencer, CNN, and other papers, doesn't even bother with the O-96 dining hall part of the story.

Why isn't there any coverage of Teletruth?

The systemic corruption of the telecom industry is one of the most important stories since Enron, but you won't find it in your newspaper. Why not? Bruce Kushnick has written a book detailing the crimes of the telecom corporations, but nobody in the mainstream press is following up on it. A Google News search returns exactly one result for either "Teletruth" or "Bruce Kushnick", and that appears to be a press release.

I'm having real trouble understanding why this story isn't getting any coverage in the mainstream press. I'm having even more trouble reconciling this fact with the ideals of objectivity as professed by journalists. If you're a working editor or journalist, especially in the tech sector, did your publication make a decision not to run the story? Why? I'd really appreciate more insight. Even if Bruce Kushnick is a complete nut (which I doubt), it seems as relevant as the Raelians.

I consider it quite plausible, even likely, that this is a huge story, but for whatever reason, readers of newspapers are completely in the dark about it. Critical readers of blogs, though, aren't.

Conclusion

Just about every time I've had the opportunity to check a mainstream news story, I've found it riddled with errors. Every time I've been interviewed by the mainstream press, the resulting story significantly distorted what I was trying to say, and from what I read in other blogs, this experience is very common. Even in the off chance that a tech story is factually correct, I don't learn much from it. There are important voices missing from mainstream media, especially those critical of big companies, or, more importantly, providing a credible alternative.

By contrast, the best of the blogs I read are passionate, well-informed, topical, and insightful. They don't make a lot of stupid factual errors, but those that slip through are corrected quickly. The best blogs are partial but fair, and up-front about their biases, as opposed to pretending to be totally objective.

It's not just technology reporting, either, although that's obviously close to the hearts of the early blogging community. I think the flaws of mainstream reporting, and the potential of blogging to address those flaws, generalize to many other areas of interest. I'm sure, though, that newspapers are a very important information source for sports gamblers, and will continue to be important in that role for quite some time.

It takes more time and effort to get one's information through critical reading of blogs than it does to read the paper, but the results are well worth it. To paraphrase Thomas Jefferson, were it left to me to decide whether we should have newspapers without blogs, or blogs without newspapers, I should not hesitate a moment to prefer the latter.

