My ghod, it's been a while since I updated the diary. Things I've done since then:
Wrote SpamAssassin, a mail filter that identifies spam using text analysis. Using its rule base, it runs a wide range of heuristic tests on mail headers and body text.
This is pretty neat. It does a good job of differentiating spam from not-spam without too many false positives or negatives; and it's a proper Perl module, so it can be plugged into other mail delivery or filtering systems quite easily (at some stage ;).
I've been using something similar for a long time, but I eventually decided to reinvent the wheel. The end result is pretty good so IMHO it was worth it.
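For the curious, the general idea of rule-based scoring can be sketched in a few lines. This is only an illustration in Python, not SpamAssassin's actual code or rule base: the rule names, the regexes, the scores and the threshold are all made up for the example.

```python
import re
from email import message_from_string

# Hypothetical rules: (name, where to look, pattern, score).
# None of these come from the real SpamAssassin rule base.
RULES = [
    ("SUBJECT_ALL_CAPS", "header:Subject",
     re.compile(r"^[^a-z]*[A-Z][^a-z]*$"), 1.5),
    ("BODY_CLICK_HERE", "body",
     re.compile(r"click here", re.I), 1.0),
    ("BODY_MONEY_BACK", "body",
     re.compile(r"money[- ]back guarantee", re.I), 2.0),
]

THRESHOLD = 3.0  # assumed cut-off; a real filter would tune this


def score_message(raw):
    """Run every rule against the message; return (total score, rule hits)."""
    msg = message_from_string(raw)
    # Keep the sketch simple: only look at single-part bodies.
    body = "" if msg.is_multipart() else msg.get_payload()
    total, hits = 0.0, []
    for name, where, pattern, score in RULES:
        if where == "body":
            text = body
        else:
            text = msg.get(where.split(":", 1)[1], "")
        if pattern.search(text):
            hits.append(name)
            total += score
    return total, hits


def is_spam(raw):
    """A message is flagged once its summed rule scores reach the threshold."""
    return score_message(raw)[0] >= THRESHOLD
```

Each rule contributes a small score and only the sum is compared to the threshold, so no single heuristic decides the outcome on its own. That is roughly why this kind of approach can keep false positives down.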
Helped start up Ireland Offline, a new organisation campaigning to sort out Ireland's internet backwater status and bring fat pipes to the people. This is going well... lots of interest, press and support, and some great people involved.
Decided to move to Australia ;) Yep, despite getting involved in Ireland Offline, I'm heading off to Melbourne in a month's time. Haven't really figured out the job situation there, but hopefully it shouldn't be too tricky getting hold of one. If anyone reading is in a position to hire a UNIX guru (hey, I'm allowed to plug myself for this), give us a mail.
The scoops page is an interesting situation. Every night, a cron job runs off and downloads pages from 136 sites (typically the ones that have clear-ish terms allowing redistribution of their content). The sitescooper script is run 5 times, once for each of the 5 output formats the site provides. Since sitescooper caches these pages in a per-format cache (which allows it to run diffs on pages to see what's changed) as well as a shared cache (which ensures the network is only accessed once for each page), that meant 6 copies of each page on disk.
The cache is expired every few days, removing pages older than a month or so. Still, it was getting pretty big. I've now implemented a Singleton pattern for the cache usage, which brings it down to 1 ref-counted copy of each page, and 6 pointers. After a few weeks of this, the cache disk usage is running at about 120 megs, down from about 800.
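Here's a rough sketch of the "one ref-counted copy, several pointers" idea, in Python and in memory rather than on disk. It is not sitescooper's code: the class and method names are invented, and keying the stored copy by a content hash is my assumption for the example, not necessarily how the real cache works.

```python
import hashlib


class RefCountedCache:
    """Keeps one stored copy per distinct page version, shared via pointers."""

    def __init__(self):
        self._blobs = {}     # digest -> [content, refcount]
        self._pointers = {}  # (cache_name, url) -> digest

    def put(self, cache_name, url, content):
        """Point (cache_name, url) at content, storing it only if it is new."""
        digest = hashlib.sha256(content).hexdigest()
        key = (cache_name, url)
        if self._pointers.get(key) == digest:
            return  # already pointing at this exact version
        self.drop(cache_name, url)  # release any older version first
        entry = self._blobs.setdefault(digest, [content, 0])
        entry[1] += 1
        self._pointers[key] = digest

    def get(self, cache_name, url):
        """Return the page bytes for this cache and URL, or None."""
        digest = self._pointers.get((cache_name, url))
        return None if digest is None else self._blobs[digest][0]

    def drop(self, cache_name, url):
        """Remove one pointer; free the copy when nothing references it."""
        digest = self._pointers.pop((cache_name, url), None)
        if digest is None:
            return
        entry = self._blobs[digest]
        entry[1] -= 1
        if entry[1] == 0:
            del self._blobs[digest]

    def blob_count(self):
        return len(self._blobs)

    def pointer_count(self):
        return len(self._pointers)
```

With 5 per-format caches plus the shared one all holding the same page, this ends up with 6 pointers and 1 stored copy. Expiry becomes a matter of dropping pointers and letting the count fall to zero.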
This unfortunately may still be too much for the poor overburdened colocated server I use, especially since I'll be on the other side of the world. :( As a result the list of sites on the page may need another diet. We'll see...
WebMake: lots of new stuff in the pipeline. It now supports plugins, which are library files that can define library functions for the inline perl code, and -- since I've added tag-definition support -- a plugin can also add new tags, for use either in the HTML input documents, or in the WebMake .wmk XML file itself. Who needs taglibs? ;)
This has allowed lots of new features, without messing up the core. It's been in the released version for a while.
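To give a flavour of what "a plugin can add new tags" means, here is a toy tag registry. It's a Python sketch of the general technique only: WebMake itself is Perl, and the names used here (define_tag, expand_tags, the thumb tag) are hypothetical and are not WebMake's API.

```python
import re

# Registry filled in by "plugins": tag name -> handler(attrs) -> replacement.
TAG_HANDLERS = {}

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
# Only self-closing tags such as <thumb name="x" /> are handled here.
TAG_RE = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*/>')


def define_tag(name, handler):
    """Called by a plugin to register a new tag."""
    TAG_HANDLERS[name] = handler


def expand_tags(text):
    """Replace every registered tag in text; leave unknown tags untouched."""
    def replace(match):
        handler = TAG_HANDLERS.get(match.group(1))
        if handler is None:
            return match.group(0)
        return handler(dict(ATTR_RE.findall(match.group(2))))
    return TAG_RE.sub(replace, text)


# A made-up "plugin" that defines a thumbnail tag.
def thumbnail_tag(attrs):
    return '<a href="{0}"><img src="thumb_{0}" /></a>'.format(attrs["name"])


define_tag("thumb", thumbnail_tag)
```

The core only knows about the registry, so new tags arrive without touching it, which is the property that matters here.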
However, a new new feature, not released yet, is IMHO neater. It's "edit-in-browser" support, which is long overdue.
This is really just a CGI script and a set of modules, allowing a WebMake site to be managed in a web browser. The user logs in using traditional htpasswd authentication, picks a WebMake site (i.e. a .wmk file), and can then pick bits of content from the file and edit them in a textbox. It also has a directory browser/file manager for the tags that load content from a directory tree, like contents and media.
Once they're done editing, they can build the site (using WebMake, obviously), and -- the really neat bit -- check their changes into CVS.
Since CVS support is built-in, this means that I can update my sites from anywhere in the world, with a web browser, or do it quickly at the command-line from anywhere I have the sites checked out -- at home, in work, etc. It also gives a bonus in that it makes site replication super-easy -- just cvs checkout and it's done. And it's free. CVS is cool.
So I'm just documenting this up, grabbing screenshots etc., and then I'll release it.