10 Sep 2002 jwb   » (Journeyer)

n-grams in spam filter

I modified my spam filter to compare the performance of unigrams, digrams, and trigrams. The undesired (spam) corpus contains 353 mails; the desired corpus holds 3352. With trigrams the vocabulary exceeds 1.1 million terms. My system is the same as Paul Graham's, except that I do not double the document frequency of terms in the good corpus, and I classify a mail as spam if its probability exceeds 50%.
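To make the setup concrete, here is a minimal Python sketch of an n-gram filter in this style. It is not my actual code: the function names, whitespace tokenization, the 0.4 score for unseen terms, the 0.01/0.99 clamps, and the limit of 15 most interesting terms are assumptions borrowed from Graham's published description. It also omits his rule of ignoring rare terms. The two changes described above are marked in the comments.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_frequency(corpus, n):
    """For each n-gram, count how many mails in the corpus contain it."""
    df = Counter()
    for mail in corpus:
        # set() so a term counts once per mail, however often it repeats.
        df.update(set(ngrams(mail.split(), n)))
    return df


def term_probability(term, good_df, bad_df, n_good, n_bad):
    """Estimate the probability that a mail containing this term is spam."""
    g = good_df[term]  # change 1: the good count is NOT doubled
    b = bad_df[term]
    if g + b == 0:
        return 0.4  # assumed score for a term never seen in training
    good_ratio = min(1.0, g / n_good)
    bad_ratio = min(1.0, b / n_bad)
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


def spam_probability(mail, good_df, bad_df, n_good, n_bad, n=2, top=15):
    """Combine the most interesting term probabilities into one score."""
    terms = set(ngrams(mail.split(), n))
    probs = [term_probability(t, good_df, bad_df, n_good, n_bad) for t in terms]
    # "Interesting" means far from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam, ham = 1.0, 1.0
    for p in probs[:top]:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


def is_spam(mail, good_df, bad_df, n_good, n_bad, n=2):
    """Change 2: the threshold is 50% rather than Graham's 90%."""
    return spam_probability(mail, good_df, bad_df, n_good, n_bad, n) > 0.5
```

Switching between unigrams, digrams, and trigrams is then only a matter of passing n=1, n=2, or n=3 to both the training and the scoring step.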

The unigram system identified all but eight of the 353 mails in the spam corpus, with zero false positives. The digram and trigram systems both identified all but three, also with no false positives. The trigram system takes much longer to run and catches no more spam than the digram system, so I will use digrams for now.
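Expressed as rates, the counts above work out as follows. This is only arithmetic on the figures already given; the helper name detection_rate is made up for the illustration.

```python
SPAM_MAILS = 353  # size of the spam corpus


def detection_rate(missed, total=SPAM_MAILS):
    """Percentage of spam mails caught, rounded to two decimal places."""
    return round(100.0 * (total - missed) / total, 2)


unigram_rate = detection_rate(8)  # 345 of 353 caught
digram_rate = detection_rate(3)   # 350 of 353 caught; trigrams score the same

print(unigram_rate)  # 97.73
print(digram_rate)   # 99.15
```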

The system works so well that I will write a small C library for use by mail clients. I think spam filtering will have no effect on spammers until it becomes widespread, so I will try to spread it widely.
