Older blog entries for madscientist (starting at number 7)

21 Aug 2002 (updated 21 Aug 2002 at 13:18 UTC) »

I spent an hour or so playing with an implementation of Paul Graham's anti-spam algorithm, described recently in A Plan for Spam.

I implemented two different tools, both in Perl. The first, spamcalc, takes two sets of filenames separated by "--": the first set is a list of files containing "good" email (you can have lots of email messages in a single file, or one per file, but it only groks standard UNIX mailbox format, with '^From ' delimiters). This script reads in and tokenizes (using the same algorithm Paul describes) all the "good" messages, counting how many times each token appears, then does the same for the bad messages. Then it does the weight calculation and constructs a DB file containing all the valid tokens (those that appeared enough times to count) and their weights.
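The weight calculation from Paul's paper can be sketched like this (a Python sketch rather than my actual Perl, with hypothetical names: `good` and `bad` are per-token occurrence counts, `ngood`/`nbad` are message counts):

```python
import re
from collections import Counter

def tokenize(text):
    # Graham-style tokenizer: runs of alphanumerics, apostrophes,
    # dollar signs, and dashes; case is ignored.
    return re.findall(r"[A-Za-z0-9'$-]+", text.lower())

def spam_probability(token, good, bad, ngood, nbad):
    g = 2 * good[token]   # good counts are doubled to bias against false positives
    b = bad[token]
    if g + b < 5:         # token didn't appear enough times to count
        return None
    p = min(1.0, b / nbad) / (min(1.0, g / ngood) + min(1.0, b / nbad))
    return max(0.01, min(0.99, p))   # clamp per the paper
```

For example, a token seen once in 10 good messages and 10 times in 10 spams gets a weight of about 0.83.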

The second script, spamcheck, takes a single message, tokenizes it, and picks out the 15 most "interesting" tokens. It then applies Bayes' Rule and shows you the resulting probability that the mail is spam. The implementation (barring any stupid coding errors on my part) is identical to the one described in the paper, including ignoring case, etc.
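The scoring step looks roughly like this (again a Python sketch, not my Perl; `probs` is the token-weight table spamcalc built, and tokens we've never seen get Graham's neutral 0.4):

```python
from math import prod

def message_score(tokens, probs, unknown=0.4, n=15):
    # "Interesting" means farthest from the neutral 0.5; note each
    # occurrence of a token counts separately, as in my implementation.
    ps = sorted((probs.get(t, unknown) for t in tokens),
                key=lambda p: abs(p - 0.5), reverse=True)[:n]
    # Naive-Bayes combination of the selected probabilities.
    return prod(ps) / (prod(ps) + prod(1 - p for p in ps))
```

A message whose only known token has weight 0.99 scores 0.99; one made entirely of unknown tokens scores 0.4.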

I then played around with it for a bit. The main problem I have is that, as I suspect is the case with most people, I don't keep my old spam. So, I had to dig hard to come up with some spam to test with, and only managed to find 10 messages that I had received. So, I'm just testing--who cares, right?

Next, I have a huge archive of every email I've ever sent (well, since 1995 or so; the older stuff is on backup somewhere), but that's not really what I want, since I'm trying to test against email others have sent me: it seems likely that email I sent would have a different, skewed statistical "look" from email I receive, and would harm the filter. However, I also have a pretty large set of folders containing mail others have sent me, so I used all of that for the "good" mail. I then ran some test email, both spam and not-spam, through the filter.

Well, the results were disappointing: everything was categorized as spam! Looking at the results showed why: there were about 5 instances of the year ("2002") as a token in the test messages (in the headers, etc.), each one was individually labeled as very interesting, and they all had a strong correlation to spam (.88 or so). Why is this? Easy, once you think about it: my spam was all of very recent vintage: today, actually. My good email, however, came from folders where a very large number of messages were from previous years. So the "2002" token appeared in all the spam messages but in a much smaller percentage of the good messages, and hence the year was treated as a high-probability indicator of spam! Not good. Maybe if I had more spam (even if it was all from 2002) there would be more interesting words than the year and this wouldn't matter. Of course, older spam would also solve this problem.

Then I decided to try to get more spam to test with, so I went looking at archives of mailing lists, like the GNU lists, which I know get lots of spam. I found 30-40 messages and saved them, then re-ran spamcalc. Now when I tested messages, they were all categorized as not spam! Again, checking the details showed why, and it's related to mail headers: all the email to me contained headers showing my hostname, etc. All the spam I pulled from the archives did not. So any tokens containing my hostname, etc. indicated a low probability of spam... again, not good!

So, I changed my "good" list to be just my inbox, which does contain some older messages but most of which are more recent, to solve the first problem, and I included only the spam I'd actually received to solve the second. This works better than the other two, but still I don't have enough spam mail to get a really good filter yet. But, I've started saving spam so maybe it won't be too much longer :).

In summary, if you want to use this algorithm be aware that for good results it's best if both your good and spam sets of messages are of similar vintage (not just due to the year, but other things in the headers like different local hostnames, etc.), and that you use spam you actually received rather than public archives of spam.

One way around this would be to enhance the algorithm to ignore some kinds of tokens outright: maybe avoiding things that look like dates, and maybe the first one or two (or some well-known set) of Received headers (ones that will be in every message you receive anyway); obviously now we're moving slightly away from a pure statistical analysis and trying to inject some AI into the algorithm. Which kind of goes against the whole idea.
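A minimal sketch of that kind of pre-filter (hypothetical, and only covering the year case; a fuller version would also skip tokens from the well-known Received headers):

```python
import re

# Tokens that are just a 4-digit year, 1900-2099.
DATE_LIKE = re.compile(r"^(19|20)\d\d$")

def filter_tokens(tokens):
    # Drop year-like tokens before the interestingness ranking,
    # so recent-vintage spam doesn't make the year itself "spammy".
    return [t for t in tokens if not DATE_LIKE.match(t)]
```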

Anyway, I thought it was an interesting experiment.

Back from another vacation. Kind of crazy: two weeks away, then two weeks back, then another week away. But I'm darn relaxed (and a good thing too, because due to other vacations I've already taken I only have a few days left this year...). This one was with my family, which is actually always pretty fun, and some extended relatives I hadn't seen in a number of years showed up as well. Biking, beaching, and reading.

GNU make
I added another new feature or two to make, so I need another pretest release. Also, someone reported a problem with jobserver support on PTX (whatever that is!), which we haven't closed on yet. Still, it looks good.

CrossOver Office 1.2.0
Wow! Way cool software. I'm really impressed with how much progress WINE has been making; I didn't realize they were doing so well. And CodeWeavers has done a nice job of productizing it. I'm thrilled because I was able to get Quicken running on my Linux box, which was the very last application that made me reboot into Windows (I'd love to switch to GnuCash or similar, but it just doesn't have enough features for me to get there yet). This was causing my personal finances to suffer, as I avoided rebooting as much as possible :). Although there are some small glitches, I'm able to download my online credit card statement, E*Trade account info, and my online bank info as well, and import it all relatively easily. Certainly it's less labor intensive than rebooting into Windows...

Ya know, sore backs suck. My dad has had problems with his back and mine has been acting up on occasion as well--it's been kind of sore ever since the marathon plane ride to Hawai'i. It seems to be finally getting better now though. You don't really realize how much you depend on your back muscles until they start complaining... my wife has some yoga tapes and I might start trying those to see if that helps.

29 Jul 2002 (updated 29 Jul 2002 at 04:46 UTC) »

Back from vacation: almost two weeks in Hawai'i, half in Kona and half outside of Hilo. It was good to be back after so many years (last time I was in Honolulu). Most excellent. Most relaxing. Didn't even bring a laptop. My fingers aren't working so well yet--have to recondition them. Good diving.

The volcano is active, so we got to walk right up to the flow (the week we were there the lava chewed up more of the road). Did the helicopter thing: my first time in a helicopter; it was very cool. Got to see the vent and also lava flowing into the ocean. Unbelievable. Then we went to visit Kalapana, where the most beautiful black sand beach on the island (along with most of the little village next to it) was completely covered in lava back around 1991 (IIRC). It was incredible to hike out hundreds of yards across black, broken lava and think that ten years ago we would have been standing right in the ocean!

Drove down into the Waipi'o Valley and had a good day there; spent a day on 69 beach (and got burned :( ); visited the Pu'uhonua o Honaunau national park: I'm telling you, it was kind of eerie--I've never really had such a strong feeling of history right there... and I live less than a mile from the Battle Road in Massachusetts. Spelunked lava tubes, saw the Akaka and Rainbow falls, did a lu'au (of course), saw lots of sea turtles (very cool), and generally had a blast. A very long plane ride, but definitely worth it.

GNU make
Got a pretest release out before I left. Seems mostly fine, although there were a few small problems. I need to follow up with some issues that were raised, though. Still, I don't think it will be too much longer before a new release.

Web Hosting
Well, my subdomain is finally working. My hosting service offers very cool features at a very reasonable price (and all their servers run Linux! :)), but when there are glitches it can take a little bit of perseverance to get them ironed out. Oh well, whaddaya gonna do.

Internet Banking
I just love internet banking. I've been so happy ever since I said "screw you" to Fleet and went Internet. Although, this vacation I did have a minor glitch: it turns out that my bank won't transfer money automatically from my reserve credit line to cover ATM transactions (although they do for checks of course). Annoying, but not a huge deal. Now that I know about it :).

Free Software
Geez, you're gone for two weeks and it takes a full day just to catch up. Debian 3.0 released: need to apt-get dist-upgrade on a number of my systems. New versions of gettext and automake: need to update my packages for that (esp. gettext as I was using a prerelease version before). And, of course, who knows what's going on when I get to work tomorrow ...

People who crack free software sites suck. Yeah, yeah, we're so impressed that you can take advantage of friendly people who are trying to help you; that really shows your chops. Gee, you're so l33t! Assholes.

I've gotten quite a bit done on GNU make this week. I want to try to get a few more things done but if I don't, no big deal. I'm off on vacation starting Sunday morning and I will have a pretest release out before I go. Some neat new things in there.

Why is it always that the week before you go on vacation you have to work four times as hard? Annoying.

chipx86, I don't understand your issues with gettext. Why do you want backward-compatibility with 0.10.x? This seems not very useful, as that version was really broken. Some of the new features in gettext are really excellent: personally I've removed all gettext code from my source tree and now I use the external mode. This is really nice. The next version of gettext comes with a new tool, autopoint, which does what you want: it does not modify ChangeLog, etc. Unfortunately it's been stuck in beta limbo for a couple of months now; I've been using the pretest which has a small bug (easy to work around though).

For heaven's sake. I don't know why this stuff always has to happen: why is Murphy always right? So I've been trying to hack together a new release of GNU make and the FSF's systems have been whacked all weekend: the mail server (mail.gnu.org) is completely incommunicado; I can't SSH to the server or FTP to the alpha FTP server; my CVS connections only work sporadically, etc. etc. I can't even get mail to anyone there because it all needs to go through systems that are broken. Ugh!

Still, I managed to get a decent amount done; I just can't finish up the details. Very annoying.

Must ... do ... work ...
My family will be back in town tomorrow evening, so I need to get the release candidate for the next GNU make release done this weekend. I promised myself I'd finish it before they got back. It's so hard to sit down and do all the minutiae involved with a release, even with all the excellent supporting rules provided by automake etc. It's about this time that I always wish I had a better test suite framework: make is about the simplest program you can imagine to write tests for... yet working with the test suite is actually more difficult than writing the tests themselves, which is never a good thing.

I've been trying to find a good spec for, or implementation of, a reliable packet-based protocol: RDP or whatever. None of them seemed to have really caught on though. I wonder why? If anyone has any pointers to good reliable packet protocols that are used and considered mature and stable, I'm interested. What the heck, I'm interested even if they're cool but not mature :).

sdodji, I have to disagree with a few of your comments on ClearCase. As a preface let me say I've never used the Windows version of ClearCase, but we've been using it on UNIXen for over 6 years, since CC 2.0. This is a free software forum so I don't want to get into a big discussion, and some of your comments are right on: it does require more admin commitment than other tools, and it does require very reliable and fast network and server hardware. I think many companies buy ClearCase that shouldn't: it is a very high-end tool. If you are willing to plunk down the $$ needed for licenses, then you should have enough to not blink too much at getting a good server and good network components.

However, ClearCase is not unreliable (again, on UNIX). We've been using the same server for about four years, with well over 100 developers and thousands of workspaces at a time, and it has never crashed or hung. My Linux box at work has currently been up for 149 days. This is my main development system: I do tons of work on it with multiple views, both local and mounted from remote systems.

Also, you are certainly not forced to use their merge utility (their text based one is about the worst I've ever seen; the graphical tool is mediocre). I use GNU diff3 or Emacs ediff for all my ClearCase merging. Their findmerge command has a -exec option that lets you invoke whatever merge tool you want, no problem.

Awesome fireworks in Boston today! We really have some of the best 4th entertainment around, I believe. But it was HOT. It's always this time of year I think I should have gotten AC.

Tom: it's not just you, or automake. We all get that kind of email. I agree that it's frustrating. And I know exactly what you mean: it's such a drag when you find yourself engaged in a meaningless flamefest and realize you're simply wasting your time with someone who has no real interest in having a discussion but just wants to vent. Who has energy for that? I unsubscribed from gmd years ago and that's helped a good bit with my free time :).

One thing I often do is write an email, then delete it before sending it. In a way it's still frustrating because maybe I wasted 15 minutes on the thing, but at least it doesn't drag on, and I usually feel better afterwards.

4 Jul 2002 (updated 4 Jul 2002 at 02:12 UTC) »

I guess I'm naive. For some reason I assumed that, with the current tech slowdown, there would be both a glut of at least reasonably talented people needing work, and a new commitment on the part of companies to keep the customers they have, and win new ones.

So why, then, has service in every segment of the tech industry (that I'm a customer of) begun to suck so badly? Is it really that companies don't have enough money to provide decent customer service any more? Or are people just too stunned and depressed to care?

I bring this up as I wait, for the second day now, for someone at my ISP to clean out the /var partition on my hosting server. Since that partition filled up sometime early yesterday morning, major chunks of my site have stopped working: I can't ssh into the box (I have a session still running from before the disk filled, which is how I know what's going on), various parts of my pages, like counters, are broken, and anyone trying to send email to my mailing lists gets it rejected with an error about "insufficient resources". I've filed numerous cases with no response at all, and now even that doesn't work. How hard is it to delete some log files? For that matter, how hard is it to install some trivial scripts that proactively email the admins when a system disk is getting full?
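The kind of trivial check I mean, as a sketch (hypothetical helper; a cron job could mail the resulting list to the admins):

```python
import shutil

def full_filesystems(mounts, threshold=0.90):
    # Return the mount points whose disk usage meets or exceeds the threshold.
    full = []
    for m in mounts:
        usage = shutil.disk_usage(m)
        if usage.used / usage.total >= threshold:
            full.append(m)
    return full
```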

And as bad as that is, I don't even want to get into the experiences my wife has had with her new Treo--she loves the Treo, but the customer service for everything ranging from the rebate offer to the email account to the phone service has been a nightmare.

So, what is it? Are companies broke? Did they fire anyone who knew anything? Are employees too depressed to care? I always used to assume bad customer service was due to record low unemployment--but now there are people who need work, and companies who need customers, and yet the customer service seems worse than ever.

I'm confused.
