I implemented Paul Graham's probabilistic filtering for spam. At first I attempted to simply use rainbow from the libbow distribution, but rainbow has many problems. It is painfully academic, its HTML parsing code always segfaults, it is miserably slow, and it cannot test more than one document per launch. Therefore did I apt-get remove libbow.
My implementation, in Perl, attempts to understand the MIME and RFC822 structures for better tokenizing. Here is the meat of the lexer:
    # Count whitespace-separated tokens from every header field into %$w.
    sub lex_header {
        my ($h, $w) = @_;
        my (@t);

        @t = $h->tags();
        foreach my $t (@t) {
            my ($n);

            $n = $h->count($t);
            for (my $i = 0; $i < $n; $i++) {
                foreach my $tok (split(/\s/, $h->get($t, $i))) {
                    $w->{$tok}++;
                }
            }
        }
    }

    # Tokenize one MIME entity (headers, then any text body),
    # then recurse into its sub-parts.
    sub lex_entity {
        my ($r, $w) = @_;
        my ($t, $h, $n);

        lex_header($r->head(), $w);

        $t = $r->effective_type();
        $h = $r->bodyhandle();
        if (defined($h)) {
            $h = $h->open("r");
            if ($t eq 'text/plain' || $t eq 'text/html') {
                while ($_ = $h->getline()) {
                    chomp();
                    if (!/[^A-Za-z0-9\+\=]/) {
                        next;
                    }
                    foreach my $tok (split(/\b/, $_)) {
                        $w->{$tok}++;
                    }
                }
            }
        }

        $n = $r->parts();
        for (my $i = 0; $i < $n; $i++) {
            lex_entity($r->parts($i), $w);
        }
    }
This is a decidedly informal approach to tokenizing a mail, but I think it works. Especially important is the line that rejects base64-encoded text in a text/plain entity. See if you can spot it.
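For anyone who wants to poke at that check in isolation, here is a minimal standalone sketch. The helper names looks_like_base64 and count_tokens are hypothetical and not part of the lexer above. One assumption differs from the original: this version also treats '/' as a base64 character, since it belongs to the standard base64 alphabet. Note that any line consisting of a single alphanumeric word also matches, so a check like this trades a few lost tokens for keeping encoded noise out of the word counts.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original lexer): true when a line
# consists solely of characters from the base64 alphabet.
# Assumption: '/' is included because it is part of standard base64.
sub looks_like_base64 {
    my ($line) = @_;
    return $line =~ m{^[A-Za-z0-9+/=]+$} ? 1 : 0;
}

# Hypothetical helper: count tokens from the lines that survive the check,
# splitting on word boundaries as the lexer does for message bodies.
sub count_tokens {
    my ($lines, $counts) = @_;
    foreach my $line (@$lines) {
        next if looks_like_base64($line);
        foreach my $tok (split(/\b/, $line)) {
            $counts->{$tok}++;
        }
    }
    return $counts;
}

my %counts;
count_tokens(
    [ 'Buy cheap pills now', 'TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5' ],
    \%counts,
);
print join('|', sort keys %counts), "\n";
```

Only the tokens from the first line end up in %counts; the second line is skipped whole.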
After training the system on a spam corpus of 142 mails and a legit corpus of 3332 mails, the filter can reject about half of incoming spam and has not yet produced a false positive. Hopefully the rejection ratio improves as the spam corpus grows.
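The scoring step is not shown in this post, so what follows is only a sketch of the formula from Graham's essay "A Plan for Spam", not necessarily what my implementation does line for line. The sub names token_probability and combined_probability are hypothetical, and the token counts in the example are made up for illustration; only the corpus sizes (142 spam, 3332 legit) come from the text above. The sketch assumes Graham's choices: legit counts are doubled, tokens seen fewer than five times are ignored, probabilities are clamped to [0.01, 0.99], and the 15 most extreme tokens are combined.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Per-token spam probability (hypothetical helper).
# $good/$bad: occurrences of the token in the legit and spam corpora;
# $ngood/$nbad: number of mails in each corpus.
sub token_probability {
    my ($good, $bad, $ngood, $nbad) = @_;
    my $g = 2 * $good;              # doubling legit counts biases against false positives
    return if $g + $bad < 5;        # too rare to say anything useful
    my $pg = min(1, $g / $ngood);
    my $pb = min(1, $bad / $nbad);
    return max(0.01, min(0.99, $pb / ($pg + $pb)));
}

# Combine the 15 probabilities farthest from 0.5 (hypothetical helper).
sub combined_probability {
    my (@probs) = @_;
    my @top = sort { abs($b - 0.5) <=> abs($a - 0.5) } @probs;
    splice(@top, 15) if @top > 15;
    my ($spam, $ham) = (1, 1);
    foreach my $p (@top) {
        $spam *= $p;
        $ham  *= 1 - $p;
    }
    return $spam / ($spam + $ham);
}

# Illustrative counts only: a token seen 10 times in spam and never in legit mail.
my $p = token_probability(0, 10, 3332, 142);
my $score = combined_probability($p, 0.99, 0.2);
print "token: $p, mail: $score\n";
```

In Graham's scheme a mail is treated as spam when the combined score exceeds 0.9.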
One damning bit of trivia arose from this experiment. The mail servers at my previous employer, Critical Path, are all 99% reliable indicators of spam. One of Critical Path's features is supposed to be spam filtering. Evidently it doesn't work overly well.