15 Mar 2003
(updated 17 Mar 2003 at 00:11 UTC) »
I wonder if anyone replied to what I last wrote about Advogato diaries and LiveJournal? I didn't come back in time to check everyone's diaries and find out, so I don't know. If you replied, can you mail me and let me know?
Needless to say, I don't have this problem with LiveJournal, because replies are attached to the comment they reply to and I receive email alerts about them.
Bayes and scoring
raph discusses Paul Graham's plan to use Bayesian statistics to defeat spam and points out that his combiner is isomorphic to a linear one. It turns out that Graham's combiner assumes that the a priori probability that a mail is spam is 0.5, which isn't right, but you can easily fix the problem without sacrificing linearity.
For those who like me were confused by what "a probability of 0.99 spam" means, it means Pr(S|A) where S is the event "the mail is spam" and A is the event "the mail contains the word viagra". The extent to which A and B together indicate that the mail is spam is given by
Pr(S|A and B) = (a o b) = ab/(ab + (1-a)(1-b)) (where a = Pr(S|A))
but this formula is only correct if you assume that
Pr(S) = Pr(not S) = 0.5
and that the events are independent apart from their dependence on whether or not it's spam:
Pr(A and B|S) = Pr(A|S) Pr(B|S)
Pr(A and B|not S) = Pr(A|not S) Pr(B|not S)
Assuming that the events are independent is fundamental to the whole approach, but assuming Pr(S) = 0.5 seems to me like a mistake. Interestingly, though, if my math is right then the correct combiner for arbitrary a priori Pr(S) = s is (a o_s b) = (a o b o (1-s)) (do the math, or mail me and ask me to post it here). Thus if you define f(x) = log(x) - log(1-x) as before so that f(a o b) = f(a) + f(b), then you can define
f_s(x) = f(x) - f(s)
and you find
f_s(a o_s b) = f(a o b o (1-s)) - f(s)
= f(a) + f(b) + f(1 - s) - f(s)
= f(a) + f(b) - f(s) - f(s)
= f_s(a) + f_s(b)
This makes sense if you think about it: if a = s, then Pr(S|A) = Pr(S) so A tells you nothing about whether or not it's spam, and f_s(a) = f(s) - f(s) = 0 so it contributes nothing to the score either way.
So you don't need to assume that the a priori probability of spam is 0.5 to make raph's linear approach work.