I spent an hour or so playing with an implementation of Paul Graham's
anti-spam algorithm, described recently in A Plan for Spam.
I implemented two different tools, both in Perl: spamcalc takes two
sets of filenames, separated by "--"; the first set is a list of files
containing "good" email (you can have lots of email messages in a single
file, or one per file--but it only groks standard UNIX mailbox format,
with '^From ' delimiters). This script reads in and tokenizes (using
the same algorithm Paul describes) all the "good" messages, counting how
many times each appears, then does the same for bad messages. Then, it
does the weight calculation and constructs a DB file containing all the
valid (appeared enough times to count) tokens and their weights.
The second script, spamcheck, takes a single message, tokenizes it,
and computes the 15 most "interesting" tokens. It then applies Bayes
Rule and shows you the resulting probability that this mail is spam.
The implementation (barring any stupid coding errors on my part) is
identical to that described in the paper, including ignoring case, etc.
I then played around with it for a bit. The main problem I have is
that, as I suspect is the case with most people, I don't keep my old
spam. So, I had to dig hard to come up with some spam to test with, and
only managed to find 10 messages that I had received. So, I'm just
testing--who cares, right?
Next, I have a huge archive of every email I've ever sent (well,
since 1995 or so--the older stuff is on backup somewhere), but that's
not really what I want since I'm trying to test email others have sent
me: it seems likely to me that email I sent would give a different,
skewed statistical "look" from email I receive, and harm the filter.
However I also have a pretty large set of folders containing mail others
have sent me, so I used all of that for the "good" mail. I then ran
some test email, both spam and not-spam, through the filter.
Well, the results were disappointing: everything was categorized as
spam! Looking at the results shows why: there are about 5 instances of
the year ("2002") as a token in the test messages (in the headers,
etc.), and each one of those was labeled individually as very
interesting, and they all had a strong correlation to spam (.88 or so).
Why is this? Easy, once you think about it: my spam was all of very
recent vintage: today, actually. However, my good email was from
folders where a very large number of messages were from previous years.
So, the "2002" token appeared in all the spam messages, but a much
smaller percentage of the good messages, hence the year was treated as a
high-probability indicator of spam! Not good. Maybe if I had more spam
(even if it was all from 2002) there would be more interesting words
than the year and this wouldn't matter. Of course, older spam would
also solve this problem.
Then I decided to try to get more spam to test with, so I went
looking at archives of mailing lists, like the GNU lists, which I know
get lots of spam. I found 30-40 messages and saved them, and re-ran
spamcalc. Now when I tested messages, they were all categorized as
not spam! Again, checking the details shows why, and it's
related to mail headers: all the email to me contained headers that
showed my hostname, etc. All the spam I installed from the archives did
not. So, any tokens containing my host, etc. indicated a low
probability of spam... again, not good!
So, I changed my "good" list to be just my inbox, which does contain
some older messages but most of which are more recent, to solve the
first problem, and I included only the spam I'd actually received to
solve the second. This works better than the other two, but still I
don't have enough spam mail to get a really good filter yet. But, I've
started saving spam so maybe it won't be too much longer :).
In summary, if you want to use this algorithm be aware that for good
results it's best if both your good and spam sets of messages are of
similar vintage (not just due to the year, but other things in the
headers like different local hostnames, etc.), and that you use spam you
actually received rather than public archives of spam.
One way around this would be to enhance the algorithm to ignore some
kinds of tokens outright: maybe avoiding things that look like dates,
and maybe the first one or two (or some well-known set) of Received
headers (ones that will be in every message you receive anyway);
obviously now we're moving slightly away from a pure statistical
analysis and trying to inject some AI into the algorithm. Which kind of
goes against the whole idea.
Anyway, I thought it was an interesting experiment.