7 Sep 2002 jwb   » (Journeyer)


The spam filter is getting better as the training corpus grows. It now has 300 messages, which explains why I need this software. The mail parsing code I posted earlier was not optimal. It turns out that parsing the unencoded content of binary attachments helps to trap viruses and other windows programs. In both the bodies and the headers it is necessary to ignore very short words, and to remove non-word characters. This means that:

To: jwbaker@acm.org

Is reduced to just


Which is perfectly fine for these purposes. I'm still tweaking the implementation. The latest iteration found five spams buried in my "good" corpus, but decided some things in my spam corpus were good. There are some odd effects currently. For example, my mail goes through mail.saturn5.com comparatively recently. Before it went via a different server, and I still have many spams from that server. Therefore the spam corpus disproportionately represents the old mail server. The term for the new mail server "mailsaturn5com" can tend to make a spam appear legitimate.

The goal is still zero false positives.

victory The best part of a spam filter is the tiny feeling of victory when a spam takes a detour into the spam folder with 100% confidence, and helps to make sure that any spams like it will follow. Muhhhhahhahhaha

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!