6 Jul 2003 Stevey   » (Master)

Bayesian Spelling

 A lot of people have heard of Bayesian spam filtering recently, as a result of Paul Graham's "A Plan for Spam" article.

 I confess that my maths knowledge is lacking, but I can follow his idea. Counting tokens is trivial stuff, and applying weights to the different tokens seems reasonable, so I can see how it all works.

 Reading through the code of several implementations has been rewarding as I can see it all in action.

 The whole process has piqued my interest in statistics, something I've never really been that interested in before. I guess the closest statistical thing I have coded before is Genetic Algorithms, where this kind of thing doesn't really turn up to the same extent.

 My formal maths training isn't terribly advanced, much like my computer training. Most of the things I know I've picked up by accidental discovery rather than pure theory, although I have read a lot of the literature over the past few years to shore up my home-learning approach to programming.

The Idea

 Whilst I was typing up the latest entry for my online journal I enabled the online spell checker.

 This managed to correct my erroneous spelling of "muscles" to "mussels". This was quite a fun mis-correction, but it did make me pause for thought.

 I've often seen this with spell checkers before: you type "that", which is a real word, but not the one you meant to write.

 Perhaps what we need is a statistical approach to spell checking, much like Paul's work: look over a corpus of previous emails, blog entries, or whatever, and look at the word distribution.

 Examining pairs of words, it should be possible to see, for example, that "hot this" doesn't ever occur, but that "sex", "curry" and "weather" are all acceptable words to follow "hot".
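
 A rough sketch of that pair-counting idea in Python might look like the following. The corpus is made up, and the helper names (`tokenize`, `count_pairs`, `suspicious_pairs`) are purely hypothetical; this is an illustration of the idea under those assumptions, not a real spell checker.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def count_pairs(corpus):
    """Count how often each adjacent pair of words occurs in the corpus."""
    pairs = Counter()
    for document in corpus:
        words = tokenize(document)
        pairs.update(zip(words, words[1:]))
    return pairs

def suspicious_pairs(text, pairs):
    """Return the adjacent word pairs in text that never occur in the corpus."""
    words = tokenize(text)
    return [pair for pair in zip(words, words[1:]) if pairs[pair] == 0]

# Made-up corpus standing in for previous emails or blog entries (an assumption).
corpus = [
    "We had a hot curry last night",
    "The hot weather is back",
    "I think that curry was too hot",
]
pairs = count_pairs(corpus)

# "hot this" never appears in the corpus, so it is reported,
# along with the pair that follows it.
flagged = suspicious_pairs("we had a hot this last night", pairs)
```

 A real checker would presumably want a much larger corpus and some kind of threshold or weighting rather than a plain "never seen" test, but the counting itself really is this simple.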

 I guess this does break down badly when you're using a word you've never used before, as there wouldn't be an entry in the database to describe it. So the first time you wrote "hot Madagascar" you'd be flagged as if you'd made an error.
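
 One possible way to soften this, offered only as a sketch: keep single-word counts alongside the pair counts, and report a pair containing a never-seen word as "unknown word" rather than as an error. The function names and the three labels below are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_counts(corpus):
    """Return (word counts, adjacent-pair counts) for the corpus."""
    words_seen, pairs_seen = Counter(), Counter()
    for document in corpus:
        words = tokenize(document)
        words_seen.update(words)
        pairs_seen.update(zip(words, words[1:]))
    return words_seen, pairs_seen

def classify_pair(first, second, words_seen, pairs_seen):
    """Label a word pair as 'ok', 'unknown word' or 'suspicious'."""
    if pairs_seen[(first, second)] > 0:
        return "ok"
    # A word with no history at all tells us nothing either way.
    if words_seen[first] == 0 or words_seen[second] == 0:
        return "unknown word"
    # Both words are familiar, but never next to each other in this order.
    return "suspicious"

# Made-up corpus (an assumption for the example).
corpus = ["The hot weather is back", "I like this curry"]
words_seen, pairs_seen = build_counts(corpus)

known = classify_pair("hot", "weather", words_seen, pairs_seen)
new_word = classify_pair("hot", "madagascar", words_seen, pairs_seen)
odd = classify_pair("hot", "this", words_seen, pairs_seen)
```

 This doesn't make the problem go away, since a genuinely misspelt word would also land in the "unknown word" bucket, but it does separate "never seen this word" from "seen both words, never together".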

 It's an interesting idea nonetheless. I wonder if it's been done before?
