The Graham/Bayes approach to spam is interesting and seems to work quite well. However, it seems to have pretty major issues with multilingual mail, and I am not sure how to fix this in a convenient manner.
I get lots of "good" English and German mail, but there is far more English spam than German spam in my inbox. As a result, a word that should appear in nearly every German mail, e.g. "ein", appears rarely in spam mails and much more frequently in good mails. Suddenly a word that should be neutral for detecting spam becomes evidence for a good mail. In the case of "ein", the spam probability is 0.05 in my database.
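To make the effect concrete, here is a minimal sketch of the per-word probability as described in Graham's "A Plan for Spam" (good counts doubled, words seen fewer than five times ignored, result clamped to 0.01..0.99). The function name and all counts are made up for illustration; they are not taken from my database. For comparison, the same word is also scored against the German mails only.

```python
def token_spam_probability(good_count, bad_count, ngood, nbad):
    """Per-word spam probability in the style of Graham's "A Plan for Spam".

    good_count / bad_count: occurrences of the word in good / spam mail
    ngood / nbad: number of good / spam mails in the corpus
    """
    g = 2 * good_count          # good occurrences are doubled to bias against false positives
    b = bad_count
    if g + b < 5:               # too rare to say anything about
        return None
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    p = bad_ratio / (good_ratio + bad_ratio)
    return max(0.01, min(0.99, p))


# Hypothetical corpus: 1000 good mails (500 German), 1000 spam mails (50 German).
# "ein" occurs 450 times in good mail and 45 times in spam.
p_mixed = token_spam_probability(good_count=450, bad_count=45, ngood=1000, nbad=1000)

# The same word, counted against the German mails only (500 good, 50 spam).
p_german_only = token_spam_probability(good_count=450, bad_count=45, ngood=500, nbad=50)

print(round(p_mixed, 3), round(p_german_only, 3))  # 0.048 0.474
```

With these invented numbers the mixed corpus makes "ein" look like strong evidence for good mail, while against German mail alone it is close to neutral.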
It is not that bad because I do not get much German spam. However, it seems like a fundamental problem to me, and it most probably cannot be addressed without separate databases per language and a way to determine which language a mail is written in (this could most probably work the same way as distinguishing between spam and non-spam). However, the training/sorting work would increase significantly - I usually don't sort my mails by language...
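A rough sketch of what I mean by determining the language "the same way": a tiny naive Bayes classifier over word counts that picks a language and then selects a per-language spam database. Everything here is hypothetical (the function names, the four training sentences, the empty `spam_databases` tables), it assumes equal priors for both languages, and a real setup would need far more training text.

```python
import math
from collections import Counter


def train(samples):
    """Build per-language word counts from (label, text) pairs."""
    counts = {}
    for label, text in samples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


def classify(counts, text):
    """Return the label with the highest log-likelihood (add-one smoothing)."""
    vocab = set()
    for word_counts in counts.values():
        vocab.update(word_counts)
    best_label, best_score = None, -math.inf
    for label, word_counts in counts.items():
        total = sum(word_counts.values())
        score = sum(
            math.log((word_counts[tok] + 1) / (total + len(vocab)))
            for tok in text.lower().split()
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, far too small training set; for illustration only.
language_model = train([
    ("de", "das ist ein test und ein beispiel"),
    ("de", "ich habe eine frage zu der rechnung"),
    ("en", "this is a test and an example"),
    ("en", "i have a question about the invoice"),
])

# Placeholder per-language token tables; each would be trained separately.
spam_databases = {"de": {}, "en": {}}


def database_for(text):
    """Pick the spam database matching the detected language of the mail."""
    return spam_databases[classify(language_model, text)]
```

The catch stays the same: each per-language database still needs its own good and spam examples, so the sorting effort is not reduced, only the language labelling is automated.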
On the other hand, the very same effect is useful for me with CJK mails - I don't speak any of these languages, so there are no "good" CJK mails in my inbox. It is perfectly reasonable that the filter classifies them as spam...