8 Mar 2003 raph   » (Master)

Autopackage, namespaces, and DNS

I read about autopackage in LWN recently. It seems like a useful project, and I wish it well. I've certainly run into my own share of pain trying to install VoIP software and the like recently.

I'm very happy to see thought going into the question of what packages should look like. I've always felt that Linux package formats have been somewhat ad hoc and given over to the "scripting mentality", and that most distros sidestep the fundamental problem of resolving dependencies and versions by trying to create a snapshot of packages that just happen to work together. Over the long term, I'd love to see this replaced with something more systematic.

A good test of agility for package frameworks is whether they work on systems other than Unix. One of the most interesting things I've seen in this space is CLR Assemblies. From what I've seen, these really do try to be systematic and general, but of course are bound to the CLR runtime.

Indeed, one of the reasons that Java is so disappointing as a desktop platform is that they had the opportunity to really address the packaging problem, but blew it. The reality of Java packages is quite a mess: classpaths, .class files, jar files, war files, and of course "Web start" in a futile attempt to paper over the whole mess.

There is one aspect to autopackage's design that immediately struck me: its use of a DNS-rooted namespace. In fact, DNS is becoming the de-facto root for all kinds of namespaces, of which of course the Web is one of the biggest. This would be very cool if it weren't for the fact that the management of DNS is so corrupt. Even so, it basically works.

One of the discussions I had with John Gilmore at CodeCon was about what a next-generation DNS replacement should look like. I do believe that it's possible to fix many of the political problems of current DNS with better technology. Specifically, the single trust root of the existing DNS is just too tempting a target for parasites like the ICANN leadership. A better system would have distributed trust.

But I don't envy the person who tries to replace DNS with something better. One of the thorniest questions is what the policy for name disputes should be. I'm partial to pure first-come, first-served, largely because it's the only policy simple enough for people to understand, but I think it would encounter a lot of resistance in the real world. In particular, there's nothing to prevent squatters from bulk-registering all the words and trademarks in the world.

But what is a better policy? You can't really talk about a name service being secure unless you've specified a formal policy. It's a thorny problem. I sketched one possibility in my FC '00 submission, and am writing up an expanded version of that as a chapter in my thesis. It is in many ways an appealing design, but even I don't have confidence it's what the world should adopt.

So hopefully, we'll have smart people continue to put some thought into what kind of name service we really want. DNS is a very impressive accomplishment, and of course hugely useful, but eventually we're going to want something better.

Bayes and scoring

There's a lot of talk of Bayesian spam filtering these days, including an implementation the latest SpamAssassin beta. Indeed, Bayes is cool, but did you know that it's actually equivalent to systems that assign a score to each word (or other feature) and add them up?

Paul Graham popularized Bayesian statistics in his Plan for Spam. He analyzes word frequencies in a corpus of spam, and of non-spam, so each word gets a probability that it's spam. For example, "viagra" might be assigned a probability of 0.99 spam, and "eigenvector" 0.01 or so.

Then, when a mail comes in, you look at the 10 words with the most extreme probabilities (words common to both spam and non-spam don't tell you much and so won't be counted). Bayesian statistics will give you a probability that the email is spam or not, assuming that the probabilities of the individual words are independent (not a really valid assumption, but perhaps close enough).

The combining formula for two probabilities is ab / (ab + (1 - a) (1 - b)). But use the transform f(x) = log(x) - log(1 - x), and the equivalent combining rule is just f(a) + f(b). Do the math!

So you don't really need Bayes to do this computation. Perhaps it's most useful to think of the Bayesian math as giving a sound probabilistic interpretation of score addition, which seems fairly ad hoc at first sight.

Doing this kind of combination entirely in linear space is interesting to me, because it seems much easier to combine with other techniques. After all, eigenvector-based trust metrics are based directly on linear algebra. I haven't wrapped my head completely around how I'd meld these two ideas together, but it certainly is intriguing.

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!