The CodeCon 2003 archives are now on codecon.org. We had lots of great presentations this year; those of you who missed the conference can now hear them.
Pushed out a new release yesterday, and a bugfix release today. Get it on the download page.
tjansen: requiring hosters of big content to purchase a big net connection and then recoup the costs later would force them to take on a huge risk, and slow down new deployments immensely. Besides, we don't have micropayments and won't for the foreseeable future, so the point is moot.
brondsem: Counting a repeated instance multiple times causes long messages to be weighted disproportionately. Each message either is or is not spam; a long spam message does not count as two pieces of spam.
Also, if a non-spam message is about mortgages it's likely to contain the word 'mortgage' many times. Counting that commonly used spam word against it repeatedly makes it much more likely to be tagged as spam, when in fact the repeated instances are just as likely to occur in non-spam as in spam.
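To make the difference concrete, here's a rough sketch of the two ways of tallying words over a pile of messages. The function names and the whitespace tokenizer are just made up for illustration, not taken from any particular filter.

```python
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def word_counts(messages, once_per_message=True):
    """Tally words across messages.

    With once_per_message=True each word counts at most once per
    message, so a long or repetitive message carries the same weight
    as a short one.
    """
    counts = Counter()
    for text in messages:
        words = tokenize(text)
        counts.update(set(words) if once_per_message else words)
    return counts

# Made-up example data.
spam = ["mortgage mortgage mortgage rates low mortgage", "cheap pills"]

per_message = word_counts(spam, once_per_message=True)
per_occurrence = word_counts(spam, once_per_message=False)
print(per_message["mortgage"])     # 1
print(per_occurrence["mortgage"])  # 4
```

One repetitive message quadruples the tally for 'mortgage' in the second case, even though it's still only one piece of spam.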
The idea behind only counting a few words is to not make all messages look the same. If you count all the words in all messages then most messages wind up with pretty much the same average value.
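Here's a small sketch of that effect, assuming each word already has a spam probability and that 'interesting' means far from 0.5. The probabilities, the cutoff of three words, and the plain average are all invented for illustration; a real filter would likely combine the values differently.

```python
def most_interesting(probs, n):
    # Keep the n probabilities farthest from the neutral value 0.5.
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]

def average(values):
    return sum(values) / len(values)

# Hypothetical message: three strongly spammy words among 47 neutral ones.
probs = [0.99, 0.98, 0.97] + [0.5] * 47

avg_all = average(probs)                          # about 0.53
avg_top = average(most_interesting(probs, n=3))   # about 0.98
print(round(avg_all, 4), round(avg_top, 4))
```

Averaged over every word, the message looks almost neutral; averaged over just the standout words, it looks like what it is.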
You are absolutely right that real backtracing is needed to know for sure, though. If you're setting up a spam filtering system, please save all your spam and set things up so you can easily do backtracing and find out how many false positives and false negatives different techniques would have produced.
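The kind of harness I have in mind is nothing fancy. Here's a minimal sketch, assuming you've saved your mail as (text, is_spam) pairs; the toy filter and the four-message corpus are made up, and any technique you want to compare just needs to be a function from text to True/False.

```python
def backtest(classify, labeled_messages):
    """Run a classifier over saved mail and count its mistakes.

    Returns (false_positives, false_negatives).
    """
    false_pos = false_neg = 0
    for text, is_spam in labeled_messages:
        predicted = classify(text)
        if predicted and not is_spam:
            false_pos += 1   # good mail tagged as spam
        elif is_spam and not predicted:
            false_neg += 1   # spam that got through
    return false_pos, false_neg

def naive_filter(text):
    # Hypothetical technique under test: flag anything mentioning mortgages.
    return "mortgage" in text.lower()

# Made-up saved corpus.
corpus = [
    ("Low mortgage rates, act now", True),
    ("Cheap pills online", True),
    ("Can you send me the mortgage paperwork?", False),
    ("Lunch on Friday?", False),
]

result = backtest(naive_filter, corpus)
print(result)  # (1, 1)
```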