I modified my spam filter to compare the performance of unigrams, digrams, and trigrams. The undesired corpus contains 353 mails; the desired corpus holds 3352. With trigrams, the vocabulary exceeds 1.1 million terms. My system is the same as Paul Graham's, except that I do not double the document frequency of terms in the good corpus, and I treat a mail as spam if its probability exceeds 50%.
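As a rough illustration of the setup described above, here is a minimal Python sketch. It is not the filter's actual code: the tokenizer regex and all function names are assumptions made for the example. The clamping to [0.01, 0.99], the 0.4 default for unseen terms, and the use of the 15 most extreme terms follow Graham's published description; his rule of skipping rare terms is left out for brevity. The `good_weight=1` default reflects the choice not to double the good counts (Graham uses 2), and the threshold is 0.5 instead of his 0.9.

```python
import re
from collections import Counter

def ngrams(text, n):
    """Return the set of word n-grams in one mail (a set, so each term counts once per mail)."""
    # Assumed tokenizer: lowercase runs of letters, digits, $, ' and -.
    words = re.findall(r"[a-z0-9$'-]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def document_frequencies(mails, n):
    """Count, for every n-gram, the number of mails it appears in."""
    df = Counter()
    for mail in mails:
        df.update(ngrams(mail, n))
    return df

def term_probability(term, good_df, bad_df, ngood, nbad, good_weight=1):
    """Graham-style spam probability of one term; good_weight=2 would restore his doubling."""
    g = good_weight * good_df.get(term, 0)
    b = bad_df.get(term, 0)
    if g + b == 0:
        return 0.4  # Graham's default for terms never seen before
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))

def spam_probability(text, n, good_df, bad_df, ngood, nbad, top=15):
    """Combine the `top` terms whose probabilities lie farthest from 0.5."""
    probs = [term_probability(t, good_df, bad_df, ngood, nbad)
             for t in ngrams(text, n)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam, ham = 1.0, 1.0
    for p in probs[:top]:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)

def is_spam(text, n, good_df, bad_df, ngood, nbad, threshold=0.5):
    """Flag a mail as spam when its combined probability exceeds the threshold."""
    return spam_probability(text, n, good_df, bad_df, ngood, nbad) > threshold
```

Switching between unigrams, digrams, and trigrams only changes `n` (1, 2, or 3); the frequency tables have to be rebuilt for each value, which is where the vocabulary growth shows up.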
The unigram system identified all but eight mails from the spam corpus, with zero false positives. The digram and trigram systems both identified all but three, also with no false positives. Of course, the trigram system takes much longer to analyze a mail, so I will use digrams for the present system.
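To put those counts in proportion, the short Python sketch below turns them into catch rates. The function name is made up for the example; the figures are the ones reported above.

```python
def catch_rate(spam_total, missed):
    """Fraction of the spam corpus that was correctly identified."""
    return (spam_total - missed) / spam_total

SPAM_TOTAL = 353  # size of the undesired corpus

for name, missed in [("unigram", 8), ("digram", 3), ("trigram", 3)]:
    print(f"{name}: {catch_rate(SPAM_TOTAL, missed):.1%}")
# unigram: 97.7%
# digram: 99.2%
# trigram: 99.2%
```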
The system works so well that I will write a small C library for use by mail clients. I think spam filtering has no effect on spammers until it becomes widespread. So I will try to spread it widely.