Older blog entries for jwb (starting at number 5)

spam

I implemented Paul Graham's probabilistic filtering for spam. At first I attempted to simply use rainbow from the libbow distribution, but rainbow has many problems. It is painfully academic, its HTML parsing code always segfaults, it is miserably slow, and it cannot test more than one document per launch. Therefore did I apt-get remove libbow.

My implementation, in Perl, attempts to understand the MIME and RFC822 structures for better tokenizing. Here is the meat of the lexer:

sub lex_header {
    my ($h, $w) = @_;
    my (@t);
    
    @t = $h->tags();
    
    foreach my $t (@t) {
        my ($n);
        
        $n = $h->count($t);
        
        for (my $i = 0; $i < $n; $i++) {
            foreach my $tok (split(/\s/, $h->get($t, $i))) {
                $w->{$tok}++;
            }
        }
    }
}

sub lex_entity {
    my ($r, $w) = @_;
    my ($t, $h, $n);
    
    lex_header($r->head(), $w);
    
    $t = $r->effective_type();
    $h = $r->bodyhandle();
    
    if (defined($h)) {
        $h = $h->open("r");
        
        if ($t eq 'text/plain' || $t eq 'text/html') {
            while ($_ = $h->getline()) {
                chomp();
                
                if (!/[^A-Za-z0-9\+\=]/) {
                    next;
                }
                
                foreach my $tok (split(/\b/, $_)) {
                    $w->{$tok}++;
                }
            }
        }
    }
    
    $n = $r->parts();
    
    for (my $i = 0; $i < $n; $i++) {
        lex_entity($r->parts($i), $w);
    }
}

This is a decidedly informal approach to tokenizing a mail, but I think it works. Especially important is the line that rejects base64-encoded text in a text/plain entity. See if you can spot it.

After training the system on a spam corpus of 142 mails and a legit corpus of 3332 mails, the filter can reject about half of incoming spam and has not yet produced a false positive. Hopefully the rejection ratio improves as the spam corpus grows.
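
For readers who have not seen the essay, here is a minimal sketch of the per-token arithmetic from Graham's "A Plan for Spam", written in C rather than the Perl above. It is not the code behind the numbers quoted here. The function names are hypothetical. The constants (doubling the legit count, the five-occurrence cutoff, 0.4 for rare tokens, clamping to 0.01 and 0.99) follow the essay. The essay also keeps only the fifteen tokens farthest from 0.5 before combining, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* Spam probability of one token.
 * good, bad:   occurrences of the token in the legit and spam corpora
 * ngood, nbad: number of mails in each corpus */
double token_spam_prob(unsigned good, unsigned bad,
                       unsigned ngood, unsigned nbad)
{
    double g, b, gr, br, p;

    assert(ngood > 0 && nbad > 0);

    g = 2.0 * good;             /* legit counts are doubled to bias against false positives */
    b = (double)bad;

    if (g + b < 5.0)
        return 0.4;             /* too rare to judge: treated like an unseen token */

    gr = g / ngood;
    if (gr > 1.0)
        gr = 1.0;
    br = b / nbad;
    if (br > 1.0)
        br = 1.0;

    p = br / (gr + br);         /* gr + br > 0 here because g + b >= 5 */
    if (p < 0.01)
        p = 0.01;
    if (p > 0.99)
        p = 0.99;
    return p;
}

/* Combine per-token probabilities into one score for the whole mail. */
double combine_probs(const double *p, size_t n)
{
    double prod = 1.0, inv = 1.0;
    size_t i;

    for (i = 0; i < n; i++) {
        prod *= p[i];
        inv *= 1.0 - p[i];
    }
    return prod / (prod + inv);
}
```

In the essay a mail is treated as spam when the combined score exceeds 0.9. With corpora of 3332 legit and 142 spam mails, a token seen ten times in each lands well above 0.9 on its own, because the spam corpus is so much smaller.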

One damning bit of trivia arose from this experiment. The mail servers at my previous employer Critical Path are all 99% reliable indicators of spam. One of Critical Path's features is supposed to be spam filtering. Evidently it doesn't work overly well.

CD Changer UI

I have a 3-disc CD changer. It is pretty useless. You play the first CD, then the second, then the third. To play the fourth, you eject the third. To play the fifth, you eject the fourth and so on. The first and second discs stay in there forever. If I really wanted to program four hours of music, I would plug in the iPod.

Gene Kan

I knew Gene Kan, barely. He and Yaroslav are friends with my friends, and are occasional visitors. Gene was in my office a few weeks before his death, installing an email server in a machine room. Why would a man expecting death build an email server to last a decade? I do not know.

Role-based access control

There is a SIG in the ACM for role-based access control (RBAC). They meet every year for a symposium. A lot of papers are presented. I cannot see anything in all the RBAC literature which is more functional than putting users into groups the old-fashioned Unix way. My machines using regular Unix security have many role accounts: httpd, nntpd, ftpd, lpd, uucp, daemon, nobody, and so forth.

The main problem with the Unix authorization model is the lack of delegation of authority, but this is a separate issue from RBAC.

The iPod is my new diversion. I started a project, "dopi", to write an application that uses the iPod. So far, I can list all the files, read and print their ID3 v1.1 tags, play music from the iPod, copy data from the iPod to another filesystem, and write ID3 v1.1 tags. ID3 v1.1 data can be changed in place because the tag has a fixed length and sits at a fixed offset. I can't write anything else because libhfsp doesn't support writing and there is no read/write filesystem driver for HFS+ in Linux.
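
To make the fixed-length point concrete, here is a minimal sketch in C of the ID3 v1.1 layout: the tag is the last 128 bytes of the file, and every field has a fixed width. This is not dopi's code; the struct and function names are hypothetical. It works on a 128-byte buffer, so fetching those bytes (for example with fseek to 128 bytes before the end of the file) is left to the caller.

```c
#include <assert.h>
#include <string.h>

enum { ID3V1_SIZE = 128 };

struct id3v1 {
    char title[31];
    char artist[31];
    char album[31];
    char year[5];
    char comment[29];       /* 28 bytes in v1.1; a 30-byte v1.0 comment is cut short */
    unsigned char track;    /* 0 when the tag carries no track number */
    unsigned char genre;
};

/* Copy a fixed-width field and strip trailing NUL or space padding. */
static void copy_field(char *dst, const unsigned char *src, size_t width)
{
    memcpy(dst, src, width);
    dst[width] = '\0';
    while (width > 0 && (dst[width - 1] == '\0' || dst[width - 1] == ' '))
        dst[--width] = '\0';
}

/* Parse a 128-byte tag. Returns 0 on success, -1 if the "TAG" marker is missing. */
int id3v1_parse(const unsigned char tag[ID3V1_SIZE], struct id3v1 *out)
{
    assert(tag != NULL && out != NULL);

    if (memcmp(tag, "TAG", 3) != 0)
        return -1;

    copy_field(out->title,   tag + 3,  30);
    copy_field(out->artist,  tag + 33, 30);
    copy_field(out->album,   tag + 63, 30);
    copy_field(out->year,    tag + 93, 4);
    copy_field(out->comment, tag + 97, 28);

    /* v1.1: a zero byte at offset 125 means offset 126 holds the track number */
    out->track = (tag[125] == 0) ? tag[126] : 0;
    out->genre = tag[127];
    return 0;
}

/* Overwrite the title in place; the tag stays exactly 128 bytes long. */
void id3v1_set_title(unsigned char tag[ID3V1_SIZE], const char *title)
{
    size_t n;

    assert(tag != NULL && title != NULL);
    n = strlen(title);
    if (n > 30)
        n = 30;
    memset(tag + 3, 0, 30);
    memcpy(tag + 3, title, n);
}
```

Because a rewrite never changes the size of the tag, the modified 128 bytes can be written back over the old ones without moving any audio data, which is why this kind of edit needs no filesystem support beyond overwriting existing blocks.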

I just wrote a stupid little program that ties everything together by walking through the iPod playing 200KiB at offset 400KiB from the beginning of every mpeg.

I don't think Tk is going to cut the mustard for the UI, so I am going to go off and read a tutorial on Qt. C++ is evil, but GTK+ is more evil.

If you have an iPod and a suggestion please send it along.

I have begun work on an implementation of the W3C DOM Level 2 Core and Events. The implementation language is C. There are existing DOM implementations in C, not least DOMC and GDOME. These packages don't implement my style of C API, so I decided I needed my own. I absolutely insist on consistent return values from functions, naming conventions, etc.

So far I have some relatively good ideas for the nuts and bolts. It should be easy to use a shared string table to reduce memory usage on long documents, and I have planned ahead of time how to implement live NodeList objects (which neither existing implementation implements).
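
As an illustration of the shared string table idea, here is a minimal sketch in C. It is not the planned implementation: all names are hypothetical, it stores plain char strings rather than DOMStrings to stay short, and it looks strings up with a linear scan where a real table would presumably hash. It assumes a C99 compiler for the flexible array member.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct strtab_entry {
    struct strtab_entry *next;
    char text[];                /* C99 flexible array member */
};

struct strtab {
    struct strtab_entry *head;
};

/* Return the table's single shared copy of s, adding it on first use.
 * Returns NULL if memory runs out. */
const char *strtab_intern(struct strtab *t, const char *s)
{
    struct strtab_entry *e;
    size_t len;

    assert(t != NULL && s != NULL);

    for (e = t->head; e != NULL; e = e->next)
        if (strcmp(e->text, s) == 0)
            return e->text;     /* already stored: share it */

    len = strlen(s);
    e = malloc(sizeof *e + len + 1);
    if (e == NULL)
        return NULL;
    memcpy(e->text, s, len + 1);
    e->next = t->head;
    t->head = e;
    return e->text;
}

/* Release every stored string; pointers handed out earlier become invalid. */
void strtab_free(struct strtab *t)
{
    assert(t != NULL);
    while (t->head != NULL) {
        struct strtab_entry *e = t->head;
        t->head = e->next;
        free(e);
    }
}
```

A long document repeats the same element and attribute names thousands of times, so each node can hold a pointer into the table instead of its own copy. A side benefit is that two interned names can be compared by pointer.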

The major stumbling block so far has been deciding what C datatype to use for DOMString, which is a string of unsigned 16-bit values. wchar_t is a joke: just like all other C datatypes (save char), you never have any idea of its actual width. The C library functions for dealing with wchar_t are horrible. I have dodged this problem by skipping wchar_t altogether and just using auto* tools to detect and define a 16-bit-wide system datatype.

As usual, C's definition is a hindrance. There ought to only be signed and unsigned 8, 16, 32, and 64-bit integer datatypes in C, and they ought to be explicitly defined that way.
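
For comparison, C99 added exact-width integer types in <stdint.h>. Below is a minimal sketch of what a 16-bit DOMString unit could look like with them. It assumes a C99 toolchain, which the approach described above does not, and the names dom_char and dom_strlen are hypothetical rather than part of any existing DOM library.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* One DOMString code unit: exactly 16 bits, unsigned. */
typedef uint16_t dom_char;

/* Compile-time check: the array size is negative, and the build fails,
 * if dom_char is not 16 bits wide. */
typedef char dom_char_is_16_bits[(sizeof(dom_char) * CHAR_BIT == 16) ? 1 : -1];

/* Length of a zero-terminated DOMString, counted in 16-bit units. */
size_t dom_strlen(const dom_char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}
```

On a compiler without <stdint.h>, the typedef is the one line that a configure-time check would have to supply instead; the rest of the code does not care where the 16-bit type came from.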

I am playing with this site. It let me certify myself as Apprentice :) I think this is a niche community worth investigating.

PS: There doesn't seem to be a way to revoke a certification.
