Older blog entries for jmason (starting at number 56)

Patches and Contributed Code

Here's an interesting one. I've written a few free-software apps in the past, and recently SpamAssassin has taken off. It's very much sysadmin-oriented: a mail filter for spam that works well as a system-wide filter.

This illustrates a big difference between the two audiences, app users and sysadmins: sysadmins will regularly hack the code to ''scratch their itch'' and send back a patch, whereas patches rarely come from users.


My ghod, it's been a while since I updated the diary. Things I've done since then:

  • wrote SpamAssassin, a mail filter to identify spam using text analysis. Using its rule base, it runs a wide range of heuristic tests on mail headers and body text to identify spam.

    This is pretty neat. It does a good job of differentiating spam from not-spam without too many false positives or negatives; and it's a proper Perl module, so it can be plugged into other mail delivery or filtering systems quite easily (at some stage ;).

    I've been using something similar for a long time, but I eventually decided to reinvent the wheel. The end result is pretty good so IMHO it was worth it.
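    The rule-base idea is simple enough to sketch. SpamAssassin itself is a Perl module, but here's the gist in Python; the rules, patterns, scores and threshold below are made up for illustration and are not SpamAssassin's actual rules.

```python
import re

# Hypothetical rules in the spirit of a heuristic rule base: each rule is
# a (name, pattern, score) triple, and the total score decides spam-ness.
RULES = [
    ("ALL_CAPS_SUBJECT", re.compile(r"^Subject: [A-Z !]+$", re.MULTILINE), 2.0),
    ("MENTIONS_MILLIONS", re.compile(r"million dollars", re.IGNORECASE), 3.5),
    ("CLICK_HERE", re.compile(r"click here", re.IGNORECASE), 1.5),
]
THRESHOLD = 5.0  # arbitrary cut-off for this sketch

def score_message(msg):
    """Run every rule against the message; return (total_score, hit_names)."""
    hits = [(name, score) for name, pat, score in RULES if pat.search(msg)]
    return sum(s for _, s in hits), [n for n, _ in hits]

def is_spam(msg):
    total, _ = score_message(msg)
    return total >= THRESHOLD
```

    The nice property is that no single rule condemns a message; it takes several hits to cross the threshold, which is what keeps the false-positive rate down.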

  • Helped start up Ireland Offline, a new organisation campaigning to sort out Ireland's internet backwater status and bring fat pipes to the people. This is going well... lots of interest, press and support, and some great people involved.

  • Decided to move to Australia ;) Yep, despite getting involved in Ireland Offline, I'm heading off to Melbourne in a month's time. Haven't really figured out the job situation there, but hopefully it shouldn't be too tricky getting hold of one. If anyone reading is in a position to hire a UNIX guru (hey, I'm allowed to plug myself for this), give us a mail.

  • Sitescooper: not an awful lot of news here; Plucker support is pretty good now, and I've put its caching subsystem on a diet in preparation for a move to a new server for the Nightly Scoops site.

    The scoops page is an interesting situation. Every night, a cron job runs off and downloads pages from 136 sites (typically the ones that have clear-ish terms allowing redistribution of their content). The sitescooper script is run 5 times, for the 5 output formats that site provides. Since sitescooper caches these pages in a per-format cache (which allows it to run diffs on pages to see what's changed) as well as a shared cache (which ensures the network is only accessed once for each page), that was 6 copies of each page.

    The cache is expired every few days, removing pages older than a month or so. Still, it was running pretty big. I've now implemented a Singleton pattern for the cache usage, which brings it down to 1 ref-counted copy of each page, and 6 pointers. After a few weeks of this, the cache disk usage is running at about 120 megs, down from about 800.

    This unfortunately may still be too much for the poor overburdened colocated server I use, especially since I'll be on the other side of the world. :( As a result the list of sites on the page may need another diet. We'll see...
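    The Singleton arrangement above boils down to "store each page body once, ref-counted, and hand the per-format caches small keys instead of copies". A sketch of that idea in Python (sitescooper is Perl, and the class and method names here are invented):

```python
import hashlib

class SharedPageCache:
    """One stored copy of each page body, reference-counted; per-format
    caches hold only small keys (pointers) into this shared store."""

    def __init__(self):
        self._store = {}  # digest -> page body
        self._refs = {}   # digest -> reference count

    def add(self, body):
        """Store a page body (once) and return its key for the caller."""
        key = hashlib.sha1(body.encode()).hexdigest()
        if key not in self._store:
            self._store[key] = body
            self._refs[key] = 0
        self._refs[key] += 1
        return key  # the caller keeps this small key, not a copy of the body

    def get(self, key):
        return self._store[key]

    def release(self, key):
        """Drop one reference; the body is deleted when no cache needs it."""
        self._refs[key] -= 1
        if self._refs[key] == 0:
            del self._store[key], self._refs[key]
```

    Six per-format caches calling add() on the same page still cost only one stored copy, which is where the 800-megs-to-120 saving comes from.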

  • WebMake: lots of new stuff in the pipeline. It now supports plugins: library files that can define functions for the inline perl code, and -- since I've added tag-definition support -- a plugin can also add new tags, for use either in the HTML input documents or in the WebMake .wmk XML file itself. Who needs taglibs? ;)

    This has allowed lots of new features, without messing up the core. It's been in the released version for a while.

    However, a newer feature, not yet released, is IMHO neater. It's "edit-in-browser" support, which is long overdue.

    This is really just a CGI script and a set of modules, allowing a WebMake site to be managed in a web browser; the user logs in using traditional htpasswd authentication, picks a WebMake site (i.e. a .wmk file), and can then pick bits of content from the file and edit them in a textbox. It also has a directory browser/file manager for the tags that load content from a directory tree, like contents and media.

    Once they're done editing, they can build the site (using WebMake, obviously), and -- the really neat bit -- check their changes into CVS.

    Since CVS support is built-in, this means that I can update my sites from anywhere in the world, with a web browser, or do it quickly at the command-line from anywhere I have the sites checked out -- at home, in work, etc. It also gives a bonus in that it makes site replication super-easy -- just cvs checkout and it's done. And it's free. CVS is cool.

    So I'm just documenting this up, grabbing screenshots etc., and then I'll release it.

Just certified Dave Brownell as a Master, seeing as he's one of those guys who just keeps cropping up in the most interesting projects.

Still need to do a proper diary update at some stage...

Sitescooper 3.0.2 released -- and about time too ;)

Been a long time since I updated the diary. There's a few reasons:

  • been busy :( -- trying to get up a head of steam to fight software patents in Europe -- Ireland is backing the move, so I'm trying to get some ILUG members (myself included) to fight it. Problem is, I don't know where to start, myself -- letter-writing and political campaigning are not my strong points :(

  • Also, I don't think recentlog.html is scaling, it's too difficult to follow the diaries. Generally if I check my diary the morning after posting, it's already scrolled off. This makes it very tricky to be bothered posting, if there's a 90pc chance no-one's going to read it... after all, who actually goes to a /person page to read their diaries? 's the tragedy of the commons, innit. ;)

But notwithstanding the latter point, I'll throw a few opinions into the ether on what I've read in other diaries. And might as well do an update on WebMake and sitescooper...

---- WebMake

Released 0.7. It works quite well, generates sitemaps, breadcrumb trails, back/forward navigation links, and other nifty metadata things. Not sure what needs to be done next... I have a few non-urgent plans:

generate RDF sitemaps

as suggested in Dan Bricklin's paper, URL on the WebMake todo list. This could be cool, esp. if it can be reused to generate RSS "what's new" lists for My Netscape, Scripting News, oreilly.net, etc.

access to stat() data on links

Allow automatic generation of file size info, by making file size a metadatum on a content item -- this'd be handy for download pages.

come up with an intermediate XML format for EtText

caolan suggested this one, and it's a goodie. If EtText generates an XML format instead of plain XHTML, it may be a neat way of (a) allowing more flexible styling of the HTML, (b) allowing other output formats (WML, DocBook, etc.), (c) some neat XSL tricks.

"edit-in-browser" functionality

Throw in a CGI which can parse and edit WebMake files and EtText, and you've got good ol' "edit-in-browser" as seen on Advogato, editthispage.com, blogger, etc.

Mebbe I'll just let it get stable first though.

---- Sitescooper

Not much here -- need to fix the NYT login problem (again). Lots of hassle with sites blocking us out of their "AvantGo versions"; AG are taking a strong line with the sites to block us out, it looks like. Nasty.

Mandrake caused a bit of a stink recently, with their announcement that Mandrake News and the Mandrake Forum would be made palm-readable with AvantGo, and not a mention of sitescooper or Plucker. So I've made a site file for MF, which AG still can't handle ;).

Michael Nordström from Plucker asked them for the URL of their PDA-friendly version, but got no response. Hmm.

Maybe we should look into making a sitescooper-on-Mandrake RPM for their Cooker distro, and subvert from the inside ;)

---- Comments

lkcl --

i was going to have to send < and friends because of the break-ups in the data flow: jabber has a wrapper around data called a <stream>. this is where things start to get scary.

It's a nasty problem -- you could try using CDATA sections, which act as opaque blocks of character data; XML tags inside them won't get parsed. Not sure how well libxml supports 'em though.
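To illustrate the behaviour (using Python's ElementTree here rather than libxml, but CDATA handling is defined by the XML spec, so any conformant parser does the same):

```python
import xml.etree.ElementTree as ET

# The <stream> wrapper carries embedded markup inside a CDATA section;
# everything between <![CDATA[ and ]]> is treated as character data, so
# the inner tags never get parsed as XML.
doc = "<stream><![CDATA[<message>hi</message>]]></stream>"
root = ET.fromstring(doc)
print(root.text)  # prints: <message>hi</message>
```

The embedded markup comes back as plain text on the element, so there's no need to escape every < and & by hand. The one gotcha is that the literal sequence ]]> can't appear inside the data.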

mrorganic mentioned:

Personal: got the QNX/RTP stuff loaded and working last night. I haven't done much with it yet, but I already know I like it better than anything I've gotten running on Linux. Photon makes X look like the buggy, bloated hack job that it is. I haven't made much use of PhAB yet (the GUI-builder for Photon), and reports indicate it is still unstable, but I'll probably play around with it a bit tonight and see what it's capable of.

I've always been a fan of OSes like VxWorks and QNX because they seem so much *cleaner* than other architectures.

I've been using QNX4 (the previous version, before RTP) for the last year and a half. It's not much cleaner than Linux, it just has less functionality. And oh, the bugs -- don't get me started ;)

BTW someone mentioned shouldexist.org. There's also halfbakery.com with a similar anti-patents concept.

thomasq, the graduation gowns are colour-coded according to institution and the type of degree (BA, M.Sc etc.) -- just encountered this recently at my GF's Ph.D graduation. The stage looked like someone had gone crazy with the flood-fill.

Great paper from the O'Reilly OSS Convention in Monterey about Salon's CMS system. Looks cool, must nick some ideas ;)

Hey caolan, re: QNX -- don't believe the hype! It's nice, but not that nice... mark it up as a bit like Be.

Released Sitescooper 3.0.1 today, with quite a few bugs fixed and lots of new sites. It's nice to put that one to bed for a few days; maybe I can get back to WebMake for a while and fix a dependencies-with-perl-code problem.

BTW -- sitescooper users -- note that sitescooper.cx will be disappearing soon. It's sitescooper.org from now on. Those cheap sods in the .cx ccTLD registry folded their "free domains for open source projects" less than 6 months after it was first offered, so I'm f---ed if I'm going to pay them for a .cx after that.

Anyway, there's nothing I like better in the routine code maintenance dept than firing up the profiler, spotting a hotspot, spending 15 minutes refactoring it and getting a 10% speedup. Beauty!

In other news -- I joined FoRK and got a mail from James Casey, who (a) actually is a friend of Rohit Khare, like the list sez, and (b) I haven't seen in ages. He's apparently off in That London at the mo', but pints will be had next time we're in the same city I should hope.

Argh, Netscape 4.75 crashed while editing the diary, probably due to some weirdness where AbiWord mucked up my fonts. Looking forward to an X11 where fonts just work :(

Anyway, released WebMake 0.5 last night.

It's pretty nice already for static, informational sites like homepages etc.; I rejigged the Irish Internet Users pages to use it in 5 minutes, which was handy, and it's a big improvement on what I had there previously.

However I need to add more support for sites where the index page is dynamically generated from a list of static story files. Here's how it works currently:

  1. WebMake file indicates location of one or more story archives, containing 1 story per file

  2. each file can also include meta tags to indicate metadata, like its title, one-line abstract, priority (aka score), section, etc.

  3. some perl code gets the names of all the story content items

  4. perl code then sorts them by section, score and title

  5. foreach item, set title, url, abstract, section, score variables, and fill out a user-specified template with them

  6. set a content item to contain that list

  7. list is written to whatever <out> files it's used in.

That's all well and good, but it's not tidy; the Perl code makes it too messy... I think steps 3 to 6 need tidying up, and possibly some kind of no-perl-required way to do it.
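Steps 3 to 6 above can be sketched as follows (Python here, with hypothetical in-memory stories standing in for WebMake content items read from a story archive):

```python
# Hypothetical stories, each carrying the metadata from step 2.
stories = [
    {"title": "B story", "abstract": "second item", "url": "b.html", "section": "news", "score": 10},
    {"title": "A story", "abstract": "top item", "url": "a.html", "section": "news", "score": 50},
    {"title": "C story", "abstract": "misc item", "url": "c.html", "section": "misc", "score": 20},
]

# A stand-in for the user-specified template of step 5.
TEMPLATE = '<li><a href="{url}">{title}</a> ({section}): {abstract}</li>'

def build_index(stories):
    """Steps 4-6: sort, fill out the template per item, join into one
    content item ready to be written to the <out> files."""
    # sort by section, then score (highest first), then title
    ordered = sorted(stories, key=lambda s: (s["section"], -s["score"], s["title"]))
    return "\n".join(TEMPLATE.format(**s) for s in ordered)
```

The no-perl-required version would presumably hide the sort keys and the template loop behind a tag, so the .wmk file only has to name the archive, the sort order and the template.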

Joined FoRK, so now I'm thoroughly snowed ;)

WebMake now has a significant chunk of CMS magic included, in that it can handle metadata and use this to order and query content chunks, in order to generate indices and sitemaps. Better still, the dependency checking works with it, so unchanged files don't even need to be read to get their metadata; it's cached in a per-site db file.

BTW the big win of WebMake's dependency support is that it means that WebMake is a CMS which works with web caches nicely. Wes Felter's HtP site brought this point up on the radar last month with a pointer to Resin's caching system.

Anyway, 0.4, just released, does this nicely, and even has some doco ;)

It's getting to the stage where it's satisfied the functionality I needed it to have, so I'll probably be slowing down soon and letting it accumulate some bugfixes and get stable.

One thing first, though: the CVS code now can generate a sitemap using only 3 types of data:

  • an "up" metadatum, pointing to the content item that is "up" from the current node

  • a "root" attribute on a content item, indicating that it's the root of the content tree

  • a pair of content templates which will be filled out with the details of each node, to generate the list

This is a beaut. It means that an RSS site summary file, or even a Slashdot-style "front page", can be generated entirely using a <sitemap> tag. Well, nearly -- I still need to write support for the visibility time range metadata types...
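Those three pieces of data really are enough to recover the whole tree. A sketch of the idea in Python, with made-up node names and a plain indented-text rendering standing in for the pair of content templates:

```python
# Each node names its "up" parent; the node whose parent is None plays the
# part of the one with the "root" attribute.
nodes = {
    "index":    {"title": "Home", "up": None},
    "projects": {"title": "Projects", "up": "index"},
    "webmake":  {"title": "WebMake", "up": "projects"},
    "diary":    {"title": "Diary", "up": "index"},
}

def render(name, depth=0):
    """Fill out a 'template' for this node, then recurse into the nodes
    that point up at it."""
    children = sorted(n for n, d in nodes.items() if d["up"] == name)
    lines = ["  " * depth + nodes[name]["title"]]
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines

def sitemap():
    root = next(n for n, d in nodes.items() if d["up"] is None)
    return "\n".join(render(root))
```

Swap the indented-title line for an RSS <item> template and the same walk generates a site summary file instead; that's why one <sitemap> tag can cover both.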

Other thing on the TODO list: allow WebMake to get content from an external command, and write up a doco on how WebMake can be used from within mod_perl to act as a conventional, dynamic-server-pages style system.

Hmm.... wonder what the wiki tag does? BTW still need a project tag ;)

