The spam filter is getting better as the training corpus grows. It now has 300 messages, which explains why I need this software. The mail parsing code I posted earlier was not optimal. It turns out that parsing the unencoded content of binary attachments helps to trap viruses and other Windows programs. In both the bodies and the headers it is necessary to ignore very short words and to remove non-word characters. This means that:
To: jwbaker@acm.org
Is reduced to just
jwbakeracmorg
Which is perfectly fine for these purposes. I'm still tweaking the implementation. The latest iteration found five spams buried in my "good" corpus, but it also decided that some things in my spam corpus were good. There are some odd effects currently. For example, my mail has only recently started going through mail.saturn5.com. Before that it went via a different server, and I still have many spams from that server. Therefore the spam corpus disproportionately represents the old mail server, while the new server shows up mostly in good mail. As a result, the token for the new mail server, "mailsaturn5com", tends to make a spam appear legitimate.
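The header reduction described above can be sketched in a few lines of Python. This is only an illustration, not the code the filter actually uses: the function name `normalize`, the three-character cutoff for "very short" words, and the lowercasing step are all assumptions of mine.

```python
import re

# Assumed cutoff: the post only says "very short" words are ignored.
MIN_WORD_LENGTH = 3


def normalize(line, min_length=MIN_WORD_LENGTH):
    """Split a line on whitespace, strip non-word characters from each
    piece, and drop pieces that end up shorter than min_length."""
    tokens = []
    for word in line.split():
        # Keep only letters and digits; lowercasing is an assumption.
        cleaned = re.sub(r"[^A-Za-z0-9]", "", word).lower()
        if len(cleaned) >= min_length:
            tokens.append(cleaned)
    return tokens


print(normalize("To: jwbaker@acm.org"))  # ['jwbakeracmorg']
print(normalize("mail.saturn5.com"))     # ['mailsaturn5com']
```

With these assumptions, "To:" shrinks to a two-letter word and is dropped, while the address and the mail server name each collapse into a single token.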
The goal is still zero false positives.
The best part of a spam filter is the tiny feeling of victory when a spam takes a detour into the spam folder with 100% confidence, and helps to make sure that any spams like it will follow. Muhhhhahhahhaha