Older blog entries for thayer (starting at number 9)

Gzip Appending.

So, dear diary, I found something really cool the other day. Apparently gzip supports appending multiple compressed files to a single file. E.g. the following actually works:

$ gzip < log1 > biglog.gz
$ gzip < log2 >> biglog.gz
$ gunzip < biglog.gz

The result is just like a "cat log1 log2". In fact, you can even tail a gzipped log:

$ foo | gzip >> biglog.gz
$ tail -f biglog.gz | gunzip &
$ bar | gzip >> biglog.gz

I haven't had time to look at the gzip code, but this has great implications for the sorts of things I have to work with. For example, I may be able to avoid uncompressing things to work with them and then recompressing, i.e. I'll need a lot less disk. Considering that we have the weblogs for ny.com back to 1994, that's some serious data to cope with. And that's nothing compared to the terabytes a week at work (Inktomi/Yahoo).
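The same trick is visible from Python's gzip module, which reads concatenated members transparently. A minimal sketch (the file name and log contents are made up; assumes Python 3):

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "biglog.gz")

# First member: like "gzip < log1 > biglog.gz"
with gzip.open(path, "wb") as f:
    f.write(b"from log1\n")

# Second member appended: like "gzip < log2 >> biglog.gz"
with gzip.open(path, "ab") as f:
    f.write(b"from log2\n")

# Reading decompresses every member in sequence, like gunzip
with gzip.open(path, "rb") as f:
    data = f.read()

# data is b"from log1\nfrom log2\n" -- the same as "cat log1 log2"
```

The "ab" mode is what makes this work: each open-and-write adds a fresh gzip member rather than corrupting the existing stream.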

So I'm not sure how I'll use this, but it sure is cool, and surprising that I've never noticed it before. Furthermore, a quick survey of my co-workers and friends revealed that no one has seen this before...

Rebirth of prolog as the Semantic Web: Looking over some of the REST vs SOAP vs XMLRPC debate, I feel that the semantic web offers little more than "obj1 link obj2", which looks a whole lot like a prolog-ish world-view. And I don't necessarily mean that in a bad way. However, I enjoy the simplicity of XMLRPC over REST, and find SOAP is just silly given the context in which it lives. For SOAP, why have all this flexible typing when you're throwing type constraints in at the same time?

  • REST is good for data.
  • XMLRPC is good for simplistic RPC
  • SOAP is good for wasting bandwidth
  • Prolog is good for taking data and inferring or querying

I'll keep my bits to myself now, I promise.

Memento as a programming model: I was talking to Mark English about the movie Memento. The main character can't create or store long-term memories as the result of a violent accident. The movie proceeds from the end and moves backward (and forward). This has the effect of the viewer experiencing reality in the same way as someone with impaired memory. The amazing thing is that the main character has to live out his life without knowing what's just happened, and he's developed a variety of strategies and habits to get by, such as tattooing his body with key messages about the murder of his wife.

Anyway, it occurred to me that "lack of memory" is closely related to "good programming practice." Imagine if every day you had to work on a program, you forgot the prior day's coding!

  • You'd develop simplifying habits: simple variable names, variable names that hint at their type.

  • Your code would be well broken down. No function or class that couldn't be understood in 15 minutes.

  • You'd document it first then fill in the details. You'd need to concisely capture a description of the problems being solved.

  • You'd write test code before even starting on the code. The first thing you'd do every day after re-reading the documentation would be to run the unit test to know if you were done yet.

  • Versions would be checked in often. You would naturally save it once you started forgetting parts.

Well, I realize it's a tenuous sort of argument. I probably couldn't program with limited memory. Some of the best programmers I know, or have heard of, have tremendously strong memory. James Tanis tells me that Seth Robertson was able to do some of his best kernel work because he could keep so much of it in his head at once. Nevertheless, it's a fascinating metric for judging good programming practice. If there's a question of how to program something, ask yourself what someone without long-term memory would do.

Where was I?

30 Apr 2002 (updated 30 Apr 2002 at 07:02 UTC) »
SpamAssassin is a Perl script which I've been using with good success to clear out spam. I have it set up to filter all my incoming mail, then hand off to slocal.

E.g. /usr/local/bin/spamassassin -P | /usr/lib/nmh/slocal -user $1

My maildelivery looks for the special header added by spamassassin, X-Spam-Flag. The default rules catch about 80% of my spam, which I dump in my spam folder just in case I want to double-check the results. There are some very complex configuration rules one can build that result in each message being scored as potential spam. Users can extend the rules, and apply new scores for existing rules.
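A sketch of the .maildelivery side, from memory -- the field layout (field, pattern, action, result, string) and the +spam folder name are my assumptions for this setup, so check the slocal man page before copying:

```
# field        pattern  action  result  string
X-Spam-Flag    YES      pipe    ?       "/usr/lib/nmh/rcvstore +spam"
```

The "?" result means the rule only claims the message if nothing earlier has delivered it, so real mail keeps flowing to the normal rules.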


Ifile: ifile is a neat program for automatically filtering email into appropriate folders, such as spam. Its design seems to be specifically for mh/exmh and slocal (which I use).

It seems to simply use word analysis without much regard for the semantic or syntactic structure of email. Email comes into slocal and is sorted based on ifile's simple database mapping folders to word occurrences. It's smart enough to learn when you refile a piece of mail.

I was hoping it would identify spam for me, but having tried it for a couple of weeks, I'm turning it off. Unfortunately, it's not appropriate for the way I file my email. I can imagine situations where it's useful, but it's got several problems:

  • refile is slightly busted. I get an occasional "Not able to open..." (I use nmh and mh-e emacs-mode, which might not all interact well enough)
  • treats words equally whether in the header or not
  • doesn't notice how slocal rules caused things to be filed. I'd like it to learn from the rules I have already. (Need to periodically run knowledge_base.mh for this reason.)
  • seems to pick up folders which weren't in my .folders, namely my OLD/inbox and ARCHIVE/inbox. (I auto-migrate stuff over a month old, and then archive and compress after two months.)
  • I'd like it to treat my manual refiles as extra important in its database.

One day, I'd like to get around to making some minor adjustments and fixes/updates. It seems like parts of it should be a standard part of the unix toolkit, such as word frequency analysis. Here are some guidelines if you're considering using it:

  • Use it if you use inc instead of using slocal/maildelivery. All your email comes to one inbox and gets filed by you. It will do a pretty good job of guessing how you file.
  • Don't use it if you programmatically filter. ifile is good for human text, it won't save you much if you deal with highly structured email. You'll have to continue to write rules and ifile will often get confused.
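The word-frequency idea behind ifile can be sketched as a tiny naive-Bayes-style classifier. This is purely my own illustration of the approach, not ifile's actual code, and the folder data is toy data:

```python
import math
from collections import Counter

# Per-folder word counts, like ifile's database (toy numbers here).
folders = {
    "spam": Counter({"free": 10, "money": 8, "click": 6}),
    "work": Counter({"meeting": 9, "patch": 7, "server": 5}),
}

def classify(text):
    """Pick the folder whose word counts best explain the message."""
    words = text.lower().split()
    best, best_score = None, float("-inf")
    for folder, counts in folders.items():
        total = sum(counts.values())
        # Sum of log-probabilities, with add-one smoothing for unseen words
        score = sum(math.log((counts[w] + 1) / (total + len(counts)))
                    for w in words)
        if score > best_score:
            best, best_score = folder, score
    return best

def refile(text, folder):
    """Learning on refile: credit the message's words to the right folder."""
    folders[folder].update(text.lower().split())
```

This also shows why highly structured, machine-generated mail confuses it: every message shares most of its words, so the scores barely separate.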

MoinMoinWiki: I'd like to mention a bit about WikiWebs:

I've been using MoinMoinWiki for several months now at work. The size of the Wiki, being used actively by about 6 people, has finally reached a point where the organization is a problem. But then I discovered two really great things: 1) the CategoryCategory page, and 2) the template feature. Boy, this Wiki is a nice balance of powerful and simple. Really worth its salt. At first I didn't think much of it, but I'm really starting to appreciate it.

I highly recommend everyone try the MoinMoinWiki which can be found at this SourceForge page. For those unfamiliar with WikiWikiWebs, here are some of the salient points, based on my experience with MoinMoinWiki:

  • The editing style is simple text, so you don't need to do very much to create okay looking docs. For example, in MoinMoin you can get a large headline with "== Headline ==" and if you start some lines with "1. " it turns that into a numbered list. There are a bunch of these, but it's quite easy to get started.

  • Linking between documents is automatic. You use a WikiWord, and it's automatically highlighted as a link. A WikiWord is two or more words conjoined at the hip and differentiated with capitalization. Eg. WikiWord is a WikiWord, CharlesThayer is a WikiWord, and so are AgentFunctionalSpec, MoonRakerReleaseNotes, XmlLinks, RpcLinks, BestQuotes, WalnutCreekPhoneList, etc.

  • A Wiki lets anyone create very unstructured documents. People can format short documents quickly, and migrate them to fancier, more official documents when they become large enough.

  • The web interface makes it possible to do this from just about anywhere without the overhead of an application plus version control. But it still provides very readable docs and the ability to track changes.

  • In MoinMoinWiki you can turn off anonymous updates, or allow anyone to make changes. The whole notion behind WikiWebs is that folks can make updates immediately and easily. Also, seeing differences between versions is quite easy.

  • MoinMoinWiki is a fairly simple Python code base which doesn't take much effort to install.

  • In relation to the World-Wide Web, a wiki is: (a) editable by everyone, (b) quick, (c) lets you link easily out of a page, and trivially see links *to* the page you're looking at.

Let me describe the two features of MoinMoinWiki that I mentioned earlier:

  • CategoryCategory: To tag a document as belonging to a category, one simply puts a link to the Category page. If I have a category for collections of web links, I create a page called "CategoryLinks". Then on the pages which I want under this category, I put a link to "CategoryLinks". From this page, I include the list of all pages pointing to it. Voila, the CategoryLinks page is an index for any page that refers to it. If I wanted to create a page of web references about XML, for example, I'd create XmlLinks and put a mention of CategoryLinks at the bottom. Note that the CategoryLinks page would include a reference to CategoryCategory so that it appears on that page (and vice versa).

  • Templates: Naming a page with Template at the end results in a template. When you go to create a new page, you're given a list of templates to choose from. In this way, you can loosely specify the structure for different types of documents. For the Links example above, I'd create LinksTemplate and at the bottom I'd have the text "Part of: CategoryLinks". This way any page about links would auto-magic-ally appear on the links index page, CategoryLinks.

For more advanced features there's a Perl wiki called TWikiWeb which I haven't looked into yet. It seems to support some sophisticated structured data (i.e. fill-out forms).

I'm wondering if a link to perlmonks will work here: perl monks. Now if I could find a python monks site, I'd be set...

Disturbing happenings at OSDN (sourceforge and freshmeat) are making me antsy.

My fondness for Jabber is continuing to grow. It's really smart in subtle ways. I think it makes a good general-purpose monitor for events happening everywhere. If any shell script could report a message to a Jabber channel for me across the whole network, then managing a lot of machines proactively would be much easier. Sort of a personalized syslog... The remaining problems are issues around Unicode and internationalization, and embedding XML messages to support JAM (Jabber as a middle layer).
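The payload such a script-to-Jabber notifier would send is just a small XML message stanza. A sketch of building one safely (the JID and alert text are made up; a real client still needs a stream and authentication around this):

```python
from xml.sax.saxutils import escape

def notify_stanza(to_jid, text):
    """Build the XMPP-style <message/> stanza a monitoring script would send."""
    return ("<message to='%s' type='chat'><body>%s</body></message>"
            % (escape(to_jid), escape(text)))

stanza = notify_stanza("admin@example.org", "disk > 90% on web1")
```

Escaping the body is the important part: shell scripts will happily emit <, >, and &, which would otherwise corrupt the XML stream.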

Things missing or things which I long to find a better version of:

  • Ad manager for web sites.
  • Wiki web (using MoinMoin WikiWeb in python)
  • Blogger
  • Web email system based on imap/pop for vanity accounts (check.com)
  • Map to web integration (mapit looks good)
  • Web based list manager (majordomo, gnu mailman okay)
  • XML diff

Python Networking

I've discovered several important deficiencies in some Python networking classes. The problem is in the structure of both medusa/asyncore and the regular Python SocketServer plus BaseHTTPServer.

Namely, a request handler can't declare that (a) it doesn't want to handle the request and the server should continue, or (b) the server should stop right there.

Both these actions are important for fork()ing. If the request handler forks, you have two processes that both want to reply to the same socket. These libraries are structured to support always forking or always threading, but not a mixture of the two determined within the request handler itself.

Also, although the medusa base classes in asyncore and asynchat are smart enough to handle a special ExitNow exception, the HTTP server classes built on top are rude and catch all exceptions. Someone using these can't raise an ExitNow because it gets turned into a 500 or other HTTP error.

These problems have made it impossible for me to build a workable XMLRPC/SSL server which forks. Alas, I'm fairly happy with most of the base classes / modules. I'm going to have to add some lines to these packages and send out patches.
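The control flow I'm after can be sketched in a few lines. ExitNow mirrors asyncore's exception of that name; "Declined" is purely my own name for "let another handler try" -- neither is what these libraries actually expose today:

```python
# Control-flow exceptions a request handler could raise.
class ExitNow(Exception):
    """Stop the server loop immediately (e.g. in a forked child)."""

class Declined(Exception):
    """This handler won't take the request; the server should continue."""

def dispatch(handlers, request):
    """What I wish the HTTP layer did: only real errors become 500s."""
    for handler in handlers:
        try:
            return handler(request)
        except Declined:
            continue              # the next handler gets a shot
        except ExitNow:
            raise                 # never swallow control-flow exceptions
        except Exception as err:
            return ("500", str(err))
    return ("404", "no handler accepted the request")
```

A forking handler would then raise ExitNow in whichever process should stop serving, so only one process ever replies on the socket.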

Much progress to report though none of it particularly technical. Moved, changed jobs, hired some developers, working hard, exploring, dealing with WTC bombing fall-out effects.

On the technical side: using jabber server extensively for middleware messaging layer, lots of python going on, really like the MoinMoin Wiki, couldn't get samba to compile under cygwin though it should be doable (it's probably not worthwhile), Jarl/everybuddy are alright jabber clients for Linux, Windows registry entries can have duplicates or null key/values.

Had some new thoughts on category trees. Greater-than and less-than on a path or tree translate to child-of and parent-of. There's a really simple relationship between trees and strings which makes lots of algorithms constant-order in surprising ways. [more on that later]
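A sketch of the tree/string idea, with illustrative names of my own: encode each node as a slash-separated path string, and child-of becomes a simple prefix test, while plain lexicographic sorting keeps every subtree contiguous:

```python
# Encode tree nodes as slash-separated path strings.  "Child of" is then
# a string-prefix test, and sorting the strings lexicographically keeps
# every subtree contiguous in the sorted order.

def child_of(node, ancestor):
    """True if node is a descendant of ancestor in the path encoding."""
    return node.startswith(ancestor + "/")

def parent_of(node, descendant):
    return child_of(descendant, node)

paths = sorted(["b", "a/d", "a", "a/b/c", "a/b"])
# sorted order: ["a", "a/b", "a/b/c", "a/d", "b"] -- subtree "a" is contiguous
```

Appending the "/" before testing matters: without it, "a/bc" would wrongly look like a child of "a/b".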

Upgraded to reiserfs and I'm a happy camper. Got some more IDE drives to mount and a 2.4.0 kernel to finish compiling.
