Name: Charles Thayer
Member since: 2000-08-29 17:57:27
Last Login: 2010-03-17 16:11:21

Homepage: http://www.b2si.com/~thayer/

Notes:

Haunts: teleias.com as Senior Engineer; cityrealty.com as CTO; b2si.com as founder; mediabridge.com as founder and Chief Scientist; ny.com as founder; cs.columbia.edu as CRF.

Recent blog entries by thayer

Gzip Appending.

So, dear diary, I found something really cool the other day. Apparently gzip supports appending multiple compressed streams to a single file. E.g., the following actually works:

$ gzip < log1 > biglog.gz
$ gzip < log2 >> biglog.gz
$ gunzip < biglog.gz

The result is just like a "cat log1 log2". In fact, you can even tail a gzipped log:

$ foo | gzip >> biglog.gz
$ tail -f biglog.gz | gunzip &
$ bar | gzip >> biglog.gz

I haven't had time to look at the gzip code, but it has great implications for the sorts of things I have to work with. For example, I may be able to avoid uncompressing things to work with them, then recompressing. I.e., I'll need a lot less disk. Considering that we have the weblogs for ny.com back to 1994, that's some serious data to cope with. And that's nothing compared to the terabytes per week at work (Inktomi/Yahoo).
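For instance, merging logs that are already compressed seems to need no recompression at all (a quick sketch; the file names here are made up):

$ cat log1.gz log2.gz > biglog.gz    # concatenating gzip files yields a valid multi-member file
$ gunzip < biglog.gz | wc -l         # gunzip walks every member, same result as "cat log1 log2 | wc -l"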

So I'm not sure how I'll use this, but it sure is cool, and it's surprising that I've never noticed it before. Furthermore, a quick survey of my co-workers and friends revealed that no one has seen this before...

Rebirth of Prolog as the Semantic Web: Looking over some of the REST vs SOAP vs XMLRPC debate, I feel that the Semantic Web offers little more than "obj1 link obj2", which looks a whole lot like a Prolog-ish world-view. And I don't necessarily mean that in a bad way. However, I enjoy the simplicity of XMLRPC over REST, and I find SOAP just silly given the context in which it lives. For SOAP, why have all this flexible typing when you're throwing type constraints in at the same time?

  • REST is good for data.
  • XMLRPC is good for simplistic RPC.
  • SOAP is good for wasting bandwidth.
  • Prolog is good for taking data and inferring or querying.

I'll keep my bits to myself now, I promise.

Memento as a programming model: I was talking to Mark English about the movie Memento. The main character can't create or store long-term memories as the result of a violent accident. The movie proceeds from the end and moves backward (and forward). This has the effect of letting the viewer experience reality in the same way as someone with impaired memory. The amazing thing is that the main character has to live out his life without knowing what's just happened, and he's developed a variety of strategies and habits to get by, such as tattooing his body with key messages about the murder of his wife.

Anyway, it occurred to me that "lack of memory" is closely related to "good programming practice." Imagine if every day you had to work on a program, you forgot the prior day's coding!

  • You'd develop simplifying habits: simple variable names, variable names that hint at their type.

  • Your code would be well broken down: no function or class that couldn't be understood in 15 minutes.

  • You'd document it first, then fill in the details. You'd need to concisely capture a description of the problems being solved.

  • You'd write test code before even starting on the code. The first thing you'd do every day after re-reading the documentation would be to run the unit test to know if you were done yet.

  • Versions would be checked in often. You'd naturally save your work once you started forgetting parts.

Well, I realize it's a tenuous sort of argument. I probably couldn't program with limited memory. Some of the best programmers I know, or have heard of, have tremendously strong memory. James Tanis tells me that Seth Robertson was able to do some of his best kernel work because he could keep so much of it in his head at once. Nevertheless, it's a fascinating metric for judging good programming practice. If there's a question of how to program something, ask yourself what someone without long-term memory would do.

Where was I?

30 Apr 2002 (updated 30 Apr 2002 at 07:02 UTC) »
SpamAssassin is a Perl script which I've been using with good success to clear out spam. I have it set up to filter all my incoming mail, then hand off to slocal.

E.g.: /usr/local/bin/spamassassin -P | /usr/lib/nmh/slocal -user $1

My maildelivery looks for the special header added by spamassassin, X-Spam-Flag. The default rules catch about 80% of my spam, which I dump in my spam folder just in case I want to double-check the results. There are some very complex configuration rules one can build; each message is scored against them and flagged as potential spam once the score crosses a threshold. Users can extend the rules and apply new scores to existing rules.
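For reference, the pieces look roughly like this (a sketch from memory of nmh's .maildelivery format and SpamAssassin's user_prefs; the folder names, threshold, and rule are illustrative, not my actual configuration):

# ~/.maildelivery -- file flagged mail into +spam, everything else into +inbox
# field        pattern  action  result  string
X-Spam-Flag    YES      folder  ?       spam
default        -        folder  ?       inbox

# ~/.spamassassin/user_prefs -- raise the threshold, re-weight one stock rule
required_hits  6
score NO_REAL_NAME 0.5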

http://spamassassin.org/

Ifile: ifile is a neat program for automatically filtering email into appropriate folders, such as spam. Its design seems to be aimed specifically at mh/exmh and slocal (which I use).

It seems to simply use word analysis, without much regard for the semantic or syntactic structure of email. Email comes into slocal and is sorted based on ifile's simple database mapping folders to word occurrences. It's smart enough to learn when you refile a piece of mail.

I was hoping it would identify spam for me, but having tried it for a couple of weeks, I'm turning it off. Unfortunately, it's not appropriate for the way I file my email. I can imagine situations where it's useful, but it has several problems:

  • refile is slightly busted: I get an occasional "Not able to open..." error. (I use nmh and the mh-e Emacs mode, which might not all interact well enough.)
  • It treats words equally whether they appear in the headers or the body.
  • It doesn't notice how slocal rules caused things to be filed; I'd like it to learn from the rules I already have. (I need to run knowledge_base.mh periodically for this reason.)
  • It seems to pick up folders which weren't in my .folders, namely my OLD/inbox and ARCHIVE/inbox. (I auto-migrate stuff over a month old, then archive and compress after two months.)
  • I'd like it to treat my manual refiles as extra important in its database.

One day, I'd like to get around to making some minor adjustments and fixes/updates. It seems like parts of it, such as the word frequency analysis, should be a standard part of the Unix toolkit (a rough approximation with standard tools is sketched after the list below). Here are some guidelines if you're considering using it:

  • Use it if you use inc instead of slocal/maildelivery: all your email comes to one inbox and gets filed by you, and ifile will do a pretty good job of guessing how you file.
  • Don't use it if you filter programmatically: ifile is good for human text, and it won't save you much if you deal with highly structured email. You'll have to continue writing rules, and ifile will often get confused.
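As an aside, the word-frequency part can be approximated with standard tools (a quick sketch run against a hypothetical file named message; this is not what ifile actually does internally):

$ tr -cs '[:alnum:]' '\n' < message | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head

ifile's database is essentially tables of counts like these, one per folder.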

thayer certified others as follows:

  • thayer certified itamar as Journeyer


