19 Aug 2002 barryp   » (Journeyer)

I was intrigued by Bram's Python code for analyzing spam, and have been studying Paul Graham's article and raph's comment on it, but am still a bit perplexed at the significance of the values +-4.6, -1.4 and 2.2. Do they really mean something or are they just pulled out of thin air? 4.6 = log(100) makes a small bit of sense, but the other two I don't quite get.
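A quick way to poke at those numbers is to treat them as natural logs and see what ratios come back out. That reading is only an assumption on my part (it is what makes 4.6 line up with 100), and the helper name `as_ratio` is made up for illustration. It says nothing about why those particular values were chosen.

```python
import math


def as_ratio(log_value):
    """Interpret a value as a natural log and return the ratio it stands for."""
    return math.exp(log_value)


# 4.6 is close to ln(100), so +-4.6 would correspond to 100:1 and 1:100.
print(round(math.log(100), 3))

# The other two, read the same way (an assumption, not a documented meaning).
for value in (4.6, -4.6, -1.4, 2.2):
    print(value, round(as_ratio(value), 3))
```

Under that reading, -1.4 comes out near 0.25 and 2.2 near 9, which at least gives something concrete to compare against the article.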

Anyhow, had the idea that a possible way to have users submit mail for analysis/training would be to have them copy messages into special IMAP folders - which gave an excuse to play around with Python's imaplib library. Created folders named "Learn-Spam" and "Learn-OK" and had a script pull messages from there and remove them when finished.
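Roughly, a script along those lines could look like the sketch below. The folder names are the ones mentioned above; the function names, the "spam"/"ok" labels and the `connect` helper are made up for illustration, and this is not the actual script. It only uses the basic imaplib calls (`select`, `search`, `fetch`, `store`, `expunge`).

```python
import imaplib

# Folder names from the post; the labels they map to are an arbitrary choice.
LEARN_FOLDERS = {"Learn-Spam": "spam", "Learn-OK": "ok"}


def connect(host, user, password):
    """Open an SSL IMAP connection and log in (host and credentials are the caller's)."""
    conn = imaplib.IMAP4_SSL(host)
    conn.login(user, password)
    return conn


def drain_folder(conn, folder):
    """Return the raw bytes of every message in `folder` and delete what was fetched."""
    status, _ = conn.select(folder)
    if status != "OK":
        return []
    status, data = conn.search(None, "ALL")
    if status != "OK" or not data or not data[0]:
        return []
    messages = []
    for num in data[0].split():
        status, parts = conn.fetch(num, "(RFC822)")
        if status != "OK":
            continue  # leave the message in place if it could not be read
        # A fetch response mixes (envelope, body) tuples with bare bytes.
        for part in parts:
            if isinstance(part, tuple):
                messages.append(part[1])
        conn.store(num, "+FLAGS", "\\Deleted")
    conn.expunge()
    return messages


def collect_training_mail(conn):
    """Map each label to the list of raw messages users dropped in its folder."""
    return {label: drain_folder(conn, folder)
            for folder, label in LEARN_FOLDERS.items()}
```

Taking the connection as a parameter keeps the folder handling separate from login details, so the same function works with any object that offers those five methods.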

One thing I see is that you're gonna have to make sure to do base64 and quoted-printable decoding of message parts, otherwise spammers could easily obscure their stuff from scanning.
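A minimal sketch of that decoding step, using the standard `email` package: `get_payload(decode=True)` undoes the Content-Transfer-Encoding of a part, whether base64 or quoted-printable. The function name and the latin-1 fallback for missing or unknown charsets are assumptions for illustration.

```python
import email


def decoded_text_parts(raw_bytes):
    """Yield the decoded text of every text/* part of a raw message."""
    msg = email.message_from_bytes(raw_bytes)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers, images, attachments
        # decode=True reverses base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            yield payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name: latin-1 maps every byte to some character.
            yield payload.decode("latin-1")
```

The tokenizer would then run over these strings rather than over the raw message source, so an encoded body gets scanned like any other.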

For persistent storage of tokens, scores and such - tried PostgreSQL and found that inserting hundreds of small records per message took a *lot* of time. Tried a PyBSDDB dbshelve, which was smoking fast by comparison for this type of job.
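To give an idea of the shelve-style approach, here is a sketch using the standard library's `shelve` module as a stand-in for the PyBSDDB dbshelve, so it is not the same backend and says nothing about relative speed. The function names and the record layout (a small dict of per-label counts for each token) are assumptions.

```python
import shelve
from collections import Counter

EMPTY = {"spam": 0, "ok": 0}


def record_tokens(db_path, label, tokens):
    """Add one message's token counts under `label` ("spam" or "ok")."""
    counts = Counter(tokens)
    with shelve.open(db_path) as db:
        for token, n in counts.items():
            entry = dict(db.get(token, EMPTY))
            entry[label] += n
            # Reassign so the shelf writes the updated record back to disk.
            db[token] = entry


def token_counts(db_path, token):
    """Return the stored counts for `token`, or zeros if it was never seen."""
    with shelve.open(db_path) as db:
        return dict(db.get(token, EMPTY))
```

Each token is one key lookup and one write in a local file, with no per-record round trip to a database server.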
