Older blog entries for jwb (starting at number 11)

Incremental apt-get update

I'd like to meddle with the way Debian distributes software. It is very silly that apt-get update downloads several megabytes of data just to get the incremental changes since the last apt-get update. I believe it should be easy to implement an incremental update that sends only the changes since time t. This would require that each upload have a serial number, but either that already exists somewhere in the apt system, or it could be hacked in without too much trouble.

By my calculations, a daily apt-get update could be reduced from several megabytes to tens of kilobytes.
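To make the idea concrete, here is a minimal sketch in Python. It assumes every upload carries a monotonically increasing serial number, which is exactly the part that may or may not exist in apt today. The `Upload` record and the `changes_since` and `apply_changes` functions are hypothetical names for illustration, not part of any Debian tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Upload:
    serial: int   # hypothetical monotonically increasing upload number
    package: str
    version: str


def changes_since(uploads, last_serial):
    """Return (new_uploads, newest_serial) for uploads after last_serial."""
    new = sorted((u for u in uploads if u.serial > last_serial),
                 key=lambda u: u.serial)
    newest = new[-1].serial if new else last_serial
    return new, newest


def apply_changes(index, new_uploads):
    """Update a local {package: version} index, applying uploads in order."""
    for upload in new_uploads:
        index[upload.package] = upload.version
    return index
```

The client would only need to remember the newest serial it has seen and send it with the next request, so the server can answer with the handful of uploads made since then.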


After reading numerous enthusiastic endorsements of BitTorrent, I downloaded the client and tried to get Mandrake 9.1 images from the network. I must say I was not impressed. The estimated time to complete the download was 42 hours, at about 10KB/sec. Normally I can download at around 135KB/sec. Worse, my client was uploading at around 20KB/sec! On my asymmetrical connection, uploads degrade downloads. Is this something BitTorrent does not account for? The client should limit uploads to at most, say, 10% of downloads. Otherwise there is little reason for the user to participate. Certainly if uploads were going to be 200% of downloads I would never use it.

Also their FAQ desperately needs a "How do I run this farking program?" section. /usr/bin/btdownloadprefetched.py is not obvious.

25 Sep 2002 (updated 25 Sep 2002 at 18:58 UTC) »

Someone mailed me to request my spam filter system, so I packaged it up slightly with some command line arguments and documentation. You may download it here:


I added a --gram-length=n option, so you can play with that dimension of the system.


I recommend you try Workrave, to help keep your hands useful in later life.

n-grams in spam filter

I modified my spam filter to compare the performance of unigrams, digrams, and trigrams. The undesired corpus contains 353 mails; the desired corpus holds 3352. When using trigrams the vocabulary is over 1.1 million terms. My system is the same as Paul Graham's, except I do not double the document frequency of terms in the good corpus, and I consider mail as spam if its probability exceeds 50%.
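For readers who have not seen Graham's scheme, here is a rough Python sketch of the scoring with the two changes described above: no doubling of the good-corpus frequency, and a 50% cutoff. The clamping to [0.01, 0.99], the 0.4 score for unseen terms, and the limit of 15 tokens follow Graham's published description; whether my filter keeps all of them is an assumption of this sketch, and Graham's minimum-occurrence rule is left out for brevity. The function names are hypothetical.

```python
def token_probability(good_df, bad_df, n_good, n_bad):
    """Spam probability of one term from its document frequencies."""
    g = min(1.0, good_df / n_good)   # Graham doubles good_df here; this does not
    b = min(1.0, bad_df / n_bad)
    if g + b == 0:
        return 0.4                   # term never seen in either corpus
    return max(0.01, min(0.99, b / (g + b)))


def spam_probability(probs, most_interesting=15):
    """Combine the probabilities that lie farthest from the neutral 0.5."""
    chosen = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)
    chosen = chosen[:most_interesting]
    prod_spam = 1.0
    prod_good = 1.0
    for p in chosen:
        prod_spam *= p
        prod_good *= 1.0 - p
    return prod_spam / (prod_spam + prod_good)


def is_spam(probs, threshold=0.5):
    """Apply the 50% cutoff instead of Graham's stricter one."""
    return spam_probability(probs) > threshold
```

A mail with no recognizable terms scores exactly 0.5 and is therefore kept, which errs on the side of no false positives.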

The unigram system identified all but eight mails from the spam corpus, with zero false positives. The digram and trigram systems both identified all but three, also with no false positives. Of course the trigram system takes much longer for the analysis, so I believe I will use digrams for the present system.
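The n-gram step itself is tiny. A sketch in Python, assuming a gram is simply n consecutive tokens joined by a space (the joining convention and the `ngrams` name are illustrative, not taken from my Perl code):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as one space-joined term."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary(documents, n):
    """Collect the distinct n-gram terms across a list of token lists."""
    terms = set()
    for tokens in documents:
        terms.update(ngrams(tokens, n))
    return terms
```

Because almost every distinct word sequence becomes its own term, the vocabulary grows quickly with n, which is where the extra analysis time for trigrams comes from.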

The system works so well that I will write a small C library for use by mail clients. I think spam filtering has no effect on spammers until it becomes widespread. So I will try to spread it widely.

tk: I have tried using 2-grams and 3-grams in my spam filter. Of course this tends to bloat the vocabulary and therefore the time required for analysis. Later I will attempt to characterize the effect of term length on filter performance. My parser is something of a hack, so any extensions to the term length will probably result in repulsive code.
7 Sep 2002 (updated 7 Sep 2002 at 23:55 UTC) »

The spam filter is getting better as the training corpus grows. It now has 300 messages, which explains why I need this software. The mail parsing code I posted earlier was not optimal. It turns out that parsing the unencoded content of binary attachments helps to trap viruses and other Windows programs. In both the bodies and the headers it is necessary to ignore very short words, and to remove non-word characters. This means that:

To: jwbaker@acm.org

Is reduced to just


Which is perfectly fine for these purposes. I'm still tweaking the implementation. The latest iteration found five spams buried in my "good" corpus, but decided some things in my spam corpus were good. There are some odd effects currently. For example, my mail has only comparatively recently started going through mail.saturn5.com. Before that it went via a different server, and I still have many spams from that server. Therefore the spam corpus disproportionately represents the old mail server. The term for the new mail server, "mailsaturn5com", can tend to make a spam appear legitimate.
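The normalization described above can be sketched in a few lines of Python. The minimum length of 3, the lowercasing, and the choice of ASCII letters and digits as "word characters" are assumptions made for this illustration; the `normalize` name is hypothetical.

```python
import re

# Assumption: anything that is not an ASCII letter or digit is a non-word character.
NON_WORD = re.compile(r"[^A-Za-z0-9]")


def normalize(text, min_length=3):
    """Split on whitespace, strip non-word characters, drop very short terms."""
    terms = []
    for raw in text.split():
        term = NON_WORD.sub("", raw).lower()
        if len(term) >= min_length:
            terms.append(term)
    return terms
```

With these rules a host name such as mail.saturn5.com collapses into the single term "mailsaturn5com", which is how a mail server ends up as one entry in the vocabulary.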

The goal is still zero false positives.

victory

The best part of a spam filter is the tiny feeling of victory when a spam takes a detour into the spam folder with 100% confidence, and helps to make sure that any spams like it will follow. Muhhhhahhahhaha


I implemented Paul Graham's probabilistic filtering for spam. At first I attempted to simply use rainbow from the libbow distribution, but rainbow has many problems. It is painfully academic, its HTML parsing code always segfaults, it is miserably slow, and it cannot test more than one document per launch. Therefore did I apt-get remove libbow.

My implementation, in Perl, attempts to understand the MIME and RFC822 structures for better tokenizing. Here is the meat of the lexer:

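# Count whitespace-separated tokens from every header field.
# $h is the header object; $w is a hash reference of token counts.
sub lex_header {
    my ($h, $w) = @_;
    my (@t);
    @t = $h->tags();
    foreach my $t (@t) {
        my ($n);
        $n = $h->count($t);
        for (my $i = 0; $i < $n; $i++) {
            foreach my $tok (split(/\s/, $h->get($t, $i))) {
                # The original listing is cut off at this point. Counting
                # the token, as lex_entity does below, is an assumption.
                $w->{$tok}++;
            }
        }
    }
}

# Count tokens from an entity's headers and text body, then recurse
# into each of its parts.
sub lex_entity {
    my ($r, $w) = @_;
    my ($t, $h, $n);
    lex_header($r->head(), $w);
    $t = $r->effective_type();
    $h = $r->bodyhandle();
    if (defined($h)) {
        $h = $h->open("r");
        if ($t eq 'text/plain' || $t eq 'text/html') {
            while ($_ = $h->getline()) {
                chomp();
                if (!/[^A-Za-z0-9\+\=]/) {
                    next;
                }
                foreach my $tok (split(/\b/, $_)) {
                    $w->{$tok}++;
                }
            }
        }
    }
    $n = $r->parts();
    for (my $i = 0; $i < $n; $i++) {
        lex_entity($r->parts($i), $w);
    }
}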

This is a decidedly informal approach to tokenizing a mail, but I think it works. Especially important is the line that rejects base64-encoded text in a text/plain entity. See if you can spot it.

After training the system on a spam corpus of 142 mails and a legit corpus of 3332 mails, the filter can reject about half of incoming spam and has not yet produced a false positive. Hopefully the rejection ratio improves as the spam corpus grows.

One damning bit of trivia arose from this experiment. The mail servers at my previous employer Critical Path are all 99% reliable indicators of spam. One of Critical Path's features is supposed to be spam filtering. Evidently it doesn't work overly well.

CD Changer UI

I have a 3-disc CD changer. It is pretty useless. You play the first CD, then the second, then the third. To play the fourth, you eject the third. To play the fifth, you eject the fourth and so on. The first and second discs stay in there forever. If I really wanted to program four hours of music, I would plug in the iPod.

Gene Kan

I knew Gene Kan, barely. He and Yaroslav are friends with my friends, and are occasional visitors. Gene was in my office a few weeks before his death, installing an email server in a machine room. Why would a man expecting death build an email server to last a decade? I do not know.

Role-based access control

There is a SIG in the ACM for role-based access control (RBAC). They meet every year for a symposium. A lot of papers are presented. I cannot see anything in all the RBAC literature which is more functional than putting users into groups the old-fashioned Unix way. My machines using regular Unix security have many role accounts: httpd, nntpd, ftpd, lpd, uucp, daemon, nobody, and so forth.

The main problem with the Unix authorization model is the lack of delegation of authority, but this is a separate issue from RBAC.

The iPod is my new diversion. I started a project "dopi" to write an application that uses the iPod. So far, I can list all the files, read and print their ID3 v1.1 tags, play music from the iPod, copy data from the iPod to another filesystem, and write the ID3 v1.1 tags. ID3 v1.1 data can be changed in place because the data is fixed length and at a fixed offset. I can't write anything else because libhfsp doesn't support writing and there is no read/write filesystem driver for HFS+ in Linux.
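To show why in-place editing works, here is a small Python sketch of the ID3 v1.1 layout: the tag is the last 128 bytes of the file, and every field sits at a fixed position inside it. This is not the dopi code; `read_tag` and `write_title` are hypothetical helpers, and Latin-1 is assumed for the text fields.

```python
import io
import struct

TAG_SIZE = 128
# "TAG", title, artist, album, year, comment, zero byte, track, genre
LAYOUT = struct.Struct("<3s30s30s30s4s28sBBB")


def _text(field):
    """Decode a fixed-width field, stopping at the first NUL byte."""
    return field.split(b"\x00", 1)[0].decode("latin-1").rstrip()


def read_tag(f):
    """Return the tag of a seekable binary file as a dict, or None."""
    f.seek(0, io.SEEK_END)
    size = f.tell()
    if size < TAG_SIZE:
        return None
    f.seek(size - TAG_SIZE)
    raw = f.read(TAG_SIZE)
    if raw[:3] != b"TAG":
        return None
    _, title, artist, album, year, comment, zero, track, genre = LAYOUT.unpack(raw)
    return {
        "title": _text(title),
        "artist": _text(artist),
        "album": _text(album),
        "year": _text(year),
        "comment": _text(comment),
        # In v1.1 the track number is only valid when the byte before it is zero.
        "track": track if zero == 0 else None,
        "genre": genre,
    }


def write_title(f, title):
    """Overwrite the 30-byte title field without changing the file size."""
    f.seek(0, io.SEEK_END)
    size = f.tell()
    if size < TAG_SIZE:
        raise ValueError("file too small to hold a tag")
    f.seek(size - TAG_SIZE)
    if f.read(3) != b"TAG":
        raise ValueError("no ID3 v1 tag present")
    f.seek(size - TAG_SIZE + 3)   # title starts right after "TAG"
    f.write(title.encode("latin-1")[:30].ljust(30, b"\x00"))
```

Since the write never changes the length of the file, it needs nothing from the filesystem beyond overwriting existing bytes.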

I just wrote a stupid little program that ties everything together by walking through the iPod playing 200KiB at offset 400KiB from the beginning of every mpeg.
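The walking part of that program amounts to something like the following Python sketch. The ".mp3" extension filter is an assumption, the function names are hypothetical, and actually playing the bytes (handing them to a decoder) is left out.

```python
import os

KIB = 1024


def sample(path, offset=400 * KIB, length=200 * KIB):
    """Read up to `length` bytes starting `offset` bytes into the file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def walk_samples(root, extensions=(".mp3",)):
    """Yield (path, sample_bytes) for every matching file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(extensions):
                path = os.path.join(dirpath, name)
                yield path, sample(path)
```

Files shorter than 400KiB simply yield an empty sample, so the walk does not need a special case for them.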

I don't think Tk is going to cut the mustard for the UI, so I am going to go off and read a tutorial on Qt. C++ is evil, but GTK+ is more evil.

If you have an iPod and a suggestion please send it along.
