Older blog entries for jwb (starting at number 8)

n-grams in spam filter

I modified my spam filter to compare the performance of unigrams, digrams, and trigrams. The undesired corpus contains 353 mails; the desired corpus holds 3352. When using trigrams the vocabulary is over 1.1 million terms. My system is the same as Paul Grahams, except I do not double the document frequency of terms in the good corpus, and I consider mail as spam if its probability exceeds 50%.

The unigram system identified all but eight mails from the spam corpus, with zero false positives. The digram and trigram systems both identified all but three, also with no false positives. Of course the trigram system takes much longer for the analysis, so I believe I will use digrams for the present system.

The system works so well that I will write a small C library for use by mail clients. I think spam filtering has no effect on spammers until it becomes widespread. So I will try to spread it widely.

tk: I have tried using 2-grams and 3-grams in my spam filter. Of course this tends to bloat the vocabulary and therefore the time required for analysis. Later I will attempt to characterize the effect of term length on filter performance. My parser it something of a hack, so any extensions to the term length will probably result in repulsive code.
7 Sep 2002 (updated 7 Sep 2002 at 23:55 UTC) »
spam

The spam filter is getting better as the training corpus grows. It now has 300 messages, which explains why I need this software. The mail parsing code I posted earlier was not optimal. It turns out that parsing the unencoded content of binary attachments helps to trap viruses and other windows programs. In both the bodies and the headers it is necessary to ignore very short words, and to remove non-word characters. This means that:

To: jwbaker@acm.org

Is reduced to just

jwbakeracmorg

Which is perfectly fine for these purposes. I'm still tweaking the implementation. The latest iteration found five spams buried in my "good" corpus, but decided some things in my spam corpus were good. There are some odd effects currently. For example, my mail goes through mail.saturn5.com comparatively recently. Before it went via a different server, and I still have many spams from that server. Therefore the spam corpus disproportionately represents the old mail server. The term for the new mail server "mailsaturn5com" can tend to make a spam appear legitimate.

The goal is still zero false positives.

victory The best part of a spam filter is the tiny feeling of victory when a spam takes a detour into the spam folder with 100% confidence, and helps to make sure that any spams like it will follow. Muhhhhahhahhaha

spam

I implemented Paul Graham's probabilistic filtering for spam. At first I attempted to simply use rainbow from the libbow distribution, but rainbow has many problems. It is painfully academic, its HTML parsing code always segfaults, it is miserably slow, and it cannot test more than one document per launch. Therefore did I apt-get remove libbow.

My implementation, in Perl, attempts to understand the MIME and RFC822 structures for better tokenizing. Here is the meat of the lexer:

sub lex_header {
    my ($h, $w) = @_;
    my (@t);
    
    @t = $h->tags();
    
    foreach my $t (@t) {
        my ($n);
        
        $n = $h->count($t);
        
        for (my $i = 0; $i < $n; $i++) {
            foreach my $tok (split(/\s/, $h->get($t, $i))) {
              $w->{$tok}++;
            }
        }
    }
}

sub lex_entity { my ($r, $w) = @_; my ($t, $h, $n); lex_header($r->head(), $w); $t = $r->effective_type(); $h = $r->bodyhandle(); if (defined($h)) { $h = $h->open("r"); if ($t eq 'text/plain' || $t eq 'text/html') { while ($_ = $h->getline()) { chomp(); if (!/[^A-Za-z0-9\+\=]/) { next; } foreach my $tok (split(/\b/, $_)) { $w->{$tok}++; } } } } $n = $r->parts(); for (my $i = 0; $i < $n; $i++) { lex_entity($r->parts($i), $w); } }

This is a decidedly informal approach to tokenizing a mail, but I think it works. Especially important is the line that rejects base64-encoded text in a text/plain entity. See if you can spot it.

After training the system on a spam corpus of 142 mails and a legit corpus of 3332 mails, the filter can reject about half of incoming spam and has not yet produced a false positive. Hopefully the rejection ratio improves as the spam corpus grows.

One damning bit of trivia arose from this experiment. The mail servers at my previous employer Critical Path are all 99% reliable indicators of spam. One of Critical Path's features is supposed to be spam filtering. Evidently it doesn't work overly well.

CD Changer UI

I have a 3-disc CD changer. It is pretty useless. You play the first CD, then the second, then the third. To play the fourth, you eject the third. To play the fifth, you eject the fourth and so on. The first and second discs stay in there forever. If I really wanted to program four hours of music, I would plug in the iPod.

Gene Kan

I knew Gene Kan, barely. He and Yaroslav are friends with my friends, and are occassional visitors. Gene was in my office a few weeks before his death, installing an email server in a machine room. Why would a man expecting death build an email server to last a decade? I do not know.

Role-based access control

There is a SIG in the ACM for role-based access control (RBAC). They meet every year for a symposium. A lot of papers are presented. I cannot see anything in all the RBAC literature which is more functional than putting users into groups the old-fashioned Unix way. My machines using regular Unix security have many role accounts: httpd, nntpd, ftpd, lpd, uucp, daemon, nobody, and so forth.

The main problem with the Unix authorization model is the lack of delegation of authority, but this is a separate issue from RBAC.

The iPod is my new diversion. I started a project "dopi" to write an application that uses the iPod. So far, I can list all the files, read and print their ID3 v1.1 tags, play music from the iPod, copy data from the iPod to another filesystem, and write the ID3 v1.1. ID3 v1.1 data can be changed in place because the data is fixed length and offset. I can't write anything else because libhfsp doesn't support writing and there is no read/write filesystem driver for HFS+ in Linux.

I just wrote a stupid little program that ties everything together by walking through the iPod playing 200KiB at offset 400KiB from the beginning of every mpeg.

I don't think Tk is going to cut the mustard for the UI, so I am going to go off and read a tutorial on QT. C++ is evil, but GTK+ is more evil.

If you have an iPod and a suggestion please send it along.

I have begun work on an implementation of the W3C DOM Level 2 Core and Events. The implementation language is C. There are existing DOM implementations in C, not least DOMC and GDOME. These packages don't implement my style of C API, so I decided I needed my own. I absolutely insist on consistent return values from functions, naming conventions, etc.

So far I have some relatively good ideas for the nuts and bolts. It should be easy to use a shared string table to reduce memory usage on long documents, and I have planned ahead of time how to implement live NodeList objects (which neither existing implementation implements).

The major stumbling block so far has been deciding what C datatype to use for DOMString, which is a sting of unsigned 16-bit values. wchar is a joke: just like all other C datatypes (save char), you never have any idea of its actual width. The C library functions for dealing with wchar are horrible. I have dodged this problem by skipping wchar altogether and just using auto* tools to detect and define a 16-bit-wide system datatype.

As usual, C's definition is a hindrance. There ought to only be signed and unsigned 8, 16, 32, and 64-bit integer datatypes in C, and they ought to be explicity defined that way.

I am playing with this site. It let me certify myself as Apprentice :) I think this is a niche community worth investigating.

PS: There doesn't seem to be a way to revoke a certification.

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!