21 Aug 2002 madscientist   » (Master)

I spent an hour or so playing with an implementation of Paul Graham's anti-spam algorithm, described recently in A Plan for Spam.

I implemented two different tools, both in Perl: spamcalc takes two sets of filenames, separated by "--"; the first set is a list of files containing "good" email (you can have lots of email messages in a single file, or one per file--but it only groks standard UNIX mailbox format, with '^From ' delimiters). This script reads in and tokenizes (using the same algorithm Paul describes) all the "good" messages, counting how many times each appears, then does the same for bad messages. Then, it does the weight calculation and constructs a DB file containing all the valid (appeared enough times to count) tokens and their weights.

The second script, spamcheck, takes a single message, tokenizes it, and computes the 15 most "interesting" tokens. It then applies Bayes Rule and shows you the resulting probability that this mail is spam. The implementation (barring any stupid coding errors on my part) is identical to that described in the paper, including ignoring case, etc.

I then played around with it for a bit. The main problem I have is that, as I suspect is the case with most people, I don't keep my old spam. So, I had to dig hard to come up with some spam to test with, and only managed to find 10 messages that I had received. So, I'm just testing--who cares, right?

Next, I have a huge archive of every email I've ever sent (well, since 1995 or so--the older stuff is on backup somewhere), but that's not really what I want since I'm trying to test email others have sent me: it seems likely to me that email I sent would give a different, skewed statistical "look" from email I receive, and harm the filter. However I also have a pretty large set of folders containing mail others have sent me, so I used all of that for the "good" mail. I then ran some test email, both spam and not-spam, through the filter.

Well, the results were disappointing: everything was categorized as spam! Looking at the results shows why: there are about 5 instances of the year ("2002") as a token in the test messages (in the headers, etc.), and each one of those was labeled individually as very interesting, and they all had a strong correlation to spam (.88 or so). Why is this? Easy, once you think about it: my spam was all of very recent vintage: today, actually. However, my good email was from folders where a very large number of messages were from previous years. So, the "2002" token appeared in all the spam messages, but a much smaller percentage of the good messages, hence the year was treated as a high-probability indicator of spam! Not good. Maybe if I had more spam (even if it was all from 2002) there would be more interesting words than the year and this wouldn't matter. Of course, older spam would also solve this problem.

Then I decided to try to get more spam to test with, so I went looking at archives of mailing lists, like the GNU lists, which I know get lots of spam. I found 30-40 messages and saved them, and re-ran spamcalc. Now when I tested messages, they were all categorized as not spam! Again, checking the details shows why, and it's related to mail headers: all the email to me contained headers that showed my hostname, etc. All the spam I installed from the archives did not. So, any tokens containing my host, etc. indicated a low probability of spam... again, not good!

So, I changed my "good" list to be just my inbox, which does contain some older messages but most of which are more recent, to solve the first problem, and I included only the spam I'd actually received to solve the second. This works better than the other two, but still I don't have enough spam mail to get a really good filter yet. But, I've started saving spam so maybe it won't be too much longer :).

In summary, if you want to use this algorithm be aware that for good results it's best if both your good and spam sets of messages are of similar vintage (not just due to the year, but other things in the headers like different local hostnames, etc.), and that you use spam you actually received rather than public archives of spam.

One way around this would be to enhance the algorithm to ignore some kinds of tokens outright: maybe avoiding things that look like dates, and maybe the first one or two (or some well-known set) of Received headers (ones that will be in every message you receive anyway); obviously now we're moving slightly away from a pure statistical analysis and trying to inject some AI into the algorithm. Which kind of goes against the whole idea.

Anyway, I thought it was an interesting experiment.

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!