Hacking about a bit with Portaloo, writing quite a lot, and wondering what on earth this job site is trying to tell me.
Finally managed to get some thoughts together on the spam/non-spam issue, a mere fortnight behind pretty much everybody else.
I've focused on the corpus-collection side of things, since I worked on the SpeechDat(II) project for a while (the link via the Welsh flag on that page has long since died, sorry). I could have written more about lexical model adaptation, but chose not to in the end.
Anyway, here's a link to what I wrote. Comments appreciated.
I've been wondering how the current crop of probabilistic spam filters (from Vipul's Razor through SpamAssassin to those inspired by Paul Graham's work) actually collect their spam and non-spam corpora, and, where appropriate, adapt their n-gram and other lexical analyses. I'm putting that here to embarrass myself into writing something about it in the very near future.
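For readers who haven't seen the Graham-style approach: the filter counts how often each token appears in a spam corpus versus a non-spam corpus, turns those counts into a per-token spam probability, and combines the probabilities naively at classification time. The sketch below is a minimal toy version under my own assumptions (whitespace tokenisation, tiny hypothetical corpora, Graham's clamping and unseen-token prior); it is not how any of the filters named above actually implement it.

```python
from collections import Counter

def train(spam_msgs, ham_msgs):
    """Count token occurrences in each corpus (toy corpora, hypothetical)."""
    spam = Counter(t for m in spam_msgs for t in m.lower().split())
    ham = Counter(t for m in ham_msgs for t in m.lower().split())
    return spam, ham

def token_prob(token, spam, ham, nspam, nham):
    """Graham-style per-token spam probability, clamped away from 0 and 1."""
    if spam[token] + ham[token] == 0:
        return 0.4                         # Graham's prior for unseen tokens
    s = spam[token] / max(nspam, 1)
    h = 2 * ham[token] / max(nham, 1)      # doubling ham biases against false positives
    return min(0.99, max(0.01, s / (s + h)))

def score(msg, spam, ham, nspam, nham):
    """Combine per-token probabilities with the naive-Bayes product rule."""
    p = q = 1.0
    for t in set(msg.lower().split()):
        pt = token_prob(t, spam, ham, nspam, nham)
        p *= pt
        q *= 1.0 - pt
    return p / (p + q)
```

The interesting part, for my purposes, is everything this sketch takes for granted: where the two corpora come from, and when and how the counts get updated as mail arrives.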
A lot's happened in the past month. My PhD grinds on, very slowly - current deadline for completion is March 31st. I have a Real Job for when I finish that. And I've almost completely neglected Advogato (sorry), but I'm glad I'm not a Journeyer any more. Later then...