Work
Survived another reduction last month. Who'd have thought the telecom industry would implode this badly in only two years? If the trades are to be believed it ain't over yet, either. *Sigh*. With all that, some of our plans are on hold, some are reprioritized, and so it goes...
Home
Headed to West Virginia for another Buckwheat Festival with the family a few weeks ago. As always, great fun was had by all, although the weather was variable. Just as one of the parades was ending the sky let loose with an absolutely torrential downpour. It was quite the sight to see the formerly crowded sidewalks emptied in mere seconds and the street turned into a river; discarded cups and containers caromed around and through the waterlogged feet of the final, soaked high school marching band as they plodded towards the end of the route. As always, too, I got sick while we were there. Just mild this year though.
21 Aug 2002 (updated 21 Aug 2002 at 13:18 UTC) »
I implemented two different tools, both in Perl. The first, spamcalc, takes two sets of filenames, separated by "--": the first set is a list of files containing "good" email and the second set is a list of files containing spam (you can have lots of email messages in a single file, or one per file--but it only groks standard UNIX mailbox format, with '^From ' delimiters). This script reads in and tokenizes (using the same algorithm Paul describes) all the "good" messages, counting how many times each token appears, then does the same for bad messages. Then, it does the weight calculation and constructs a DB file containing all the valid (appeared enough times to count) tokens and their weights.
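For the curious, here is a rough sketch of that counting and weighting step. It is in Python, not the Perl of the actual spamcalc, and it works on plain message strings rather than mailbox files. The function names, the token character set, the doubling of good counts, the minimum count of 5 and the clamping to .01/.99 are all assumptions based on the formula in Paul's paper as best it can be reconstructed here, not a copy of the real script.

```python
import re
from collections import Counter

# Assumed token alphabet: letters, digits, dollar signs, apostrophes, dashes.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    """Split text into lowercase tokens (case is ignored)."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]


def count_tokens(messages):
    """Count every token occurrence across a list of message strings."""
    counts = Counter()
    for msg in messages:
        counts.update(tokenize(msg))
    return counts


def token_weights(good_msgs, bad_msgs, min_count=5):
    """Return {token: spam probability} for tokens seen often enough."""
    if not good_msgs or not bad_msgs:
        raise ValueError("need at least one good and one bad message")
    good = count_tokens(good_msgs)
    bad = count_tokens(bad_msgs)
    ngood, nbad = len(good_msgs), len(bad_msgs)
    weights = {}
    for tok in set(good) | set(bad):
        g = 2 * good[tok]  # good counts doubled to bias against false positives
        b = bad[tok]
        if g + b < min_count:
            continue  # too rare to say anything about
        good_ratio = min(1.0, g / ngood)
        bad_ratio = min(1.0, b / nbad)
        p = bad_ratio / (good_ratio + bad_ratio)
        weights[tok] = max(0.01, min(0.99, p))  # never fully certain
    return weights
```

Note that this tokenizer keeps all-digit tokens such as years, which is the behavior that matters for the results described further down.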
The second script, spamcheck, takes a single message, tokenizes it, and picks out the 15 most "interesting" tokens. It then applies Bayes' Rule and shows you the resulting probability that this mail is spam. The implementation (barring any stupid coding errors on my part) is identical to that described in the paper, including ignoring case, etc.
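The scoring step can be sketched roughly like this, again in Python and not the real spamcheck. It assumes "interesting" means farthest from a neutral 0.5, that tokens missing from the DB get a default of 0.4, and that repeated tokens each count separately. The function name and the plain dict standing in for the DB file are made up for illustration.

```python
from math import prod


def spam_probability(tokens, weights, n=15, unknown=0.4):
    """Combine the n most interesting token probabilities with Bayes' Rule.

    tokens  -- list of tokens from one message (repeats are kept)
    weights -- dict mapping token -> spam probability (stand-in for the DB)
    """
    probs = [weights.get(tok, unknown) for tok in tokens]
    # "Interesting" here means far from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:n]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)
```

Something like `spam_probability(tokenized_message, weights)` then gives the final number. With no tokens at all it falls back to an even 0.5.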
I then played around with it for a bit. The main problem I have is that, as I suspect is the case with most people, I don't keep my old spam. So, I had to dig hard to come up with some spam to test with, and only managed to find 10 messages that I had received. So, I'm just testing--who cares, right?
Next, I have a huge archive of every email I've ever sent (well, since 1995 or so--the older stuff is on backup somewhere), but that's not really what I want since I'm trying to test email others have sent me: it seems likely to me that email I sent would give a different, skewed statistical "look" from email I receive, and harm the filter. However I also have a pretty large set of folders containing mail others have sent me, so I used all of that for the "good" mail. I then ran some test email, both spam and not-spam, through the filter.
Well, the results were disappointing: everything was categorized as spam! Looking at the results shows why: there are about 5 instances of the year ("2002") as a token in the test messages (in the headers, etc.), and each one of those was labeled individually as very interesting, and they all had a strong correlation to spam (.88 or so). Why is this? Easy, once you think about it: my spam was all of very recent vintage: today, actually. However, my good email was from folders where a very large number of messages were from previous years. So, the "2002" token appeared in all the spam messages, but a much smaller percentage of the good messages, hence the year was treated as a high-probability indicator of spam! Not good. Maybe if I had more spam (even if it was all from 2002) there would be more interesting words than the year and this wouldn't matter. Of course, older spam would also solve this problem.
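To see how badly a handful of repeated tokens can swamp the result, here is the arithmetic as a small Python sketch. The five tokens at .88 are the numbers from above; the `combine` helper is just an assumed rendering of the Bayes combination, not code lifted from spamcheck.

```python
from math import prod


def combine(probs):
    """Naive Bayes combination of per-token spam probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# Five occurrences of "2002", each weighted .88 on its own.
year_tokens = [0.88] * 5
print(combine(year_tokens))  # well above 0.99
```

Each extra occurrence multiplies the odds in favor of spam by 0.88/0.12, a bit more than 7, so five of them are close to a guaranteed "spam" verdict unless the rest of the message pulls hard the other way.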
Then I decided to try to get more spam to test with, so I went looking at archives of mailing lists, like the GNU lists, which I know get lots of spam. I found 30-40 messages and saved them, and re-ran spamcalc. Now when I tested messages, they were all categorized as not spam! Again, checking the details shows why, and it's related to mail headers: all the email to me contained headers that showed my hostname, etc. All the spam I saved from the archives did not. So, any tokens containing my host, etc. indicated a low probability of spam... again, not good!
So, I changed my "good" list to be just my inbox, which does contain some older messages but most of which are more recent, to solve the first problem, and I included only the spam I'd actually received to solve the second. This works better than the other two, but still I don't have enough spam mail to get a really good filter yet. But, I've started saving spam so maybe it won't be too much longer :).
In summary, if you want to use this algorithm be aware that for good results it's best if both your good and spam sets of messages are of similar vintage (not just due to the year, but other things in the headers like different local hostnames, etc.), and that you use spam you actually received rather than public archives of spam.
One way around this would be to enhance the algorithm to ignore some kinds of tokens outright: maybe avoiding things that look like dates, and maybe the first one or two (or some well-known set) of Received headers (ones that will be in every message you receive anyway); obviously now we're moving slightly away from a pure statistical analysis and trying to inject some AI into the algorithm. Which kind of goes against the whole idea.
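A rough Python sketch of what that kind of pre-filtering might look like follows. Everything in it is an assumption for illustration: the function names, the idea that "date-like" means four-digit years plus month and weekday abbreviations, and the choice to skip the first two Received headers (the topmost ones are the last hops, normally your own servers).

```python
import re
from email import message_from_string

YEAR_RE = re.compile(r"^(?:19|20)\d\d$")
# Caveat: "may", "sun", "sat" and "mar" are also ordinary words.
DATE_WORDS = {
    "mon", "tue", "wed", "thu", "fri", "sat", "sun",
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec",
}


def is_date_like(token):
    """True for tokens that look like part of a date stamp."""
    tok = token.lower()
    return bool(YEAR_RE.match(tok)) or tok in DATE_WORDS


def headers_for_scoring(raw_message, skip_received=2):
    """Return header lines, minus the first few Received headers."""
    msg = message_from_string(raw_message)
    kept = []
    skipped = 0
    for name, value in msg.items():
        if name.lower() == "received" and skipped < skip_received:
            skipped += 1  # added by local hops, present in every message
            continue
        kept.append(f"{name}: {value}")
    return kept
```

Tokens would then be dropped with something like `[t for t in tokens if not is_date_like(t)]` before counting or scoring.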
Anyway, I thought it was an interesting experiment.
GNU make
I added another new feature or two to make, so I need another pretest
release. Also, someone reported a problem on PTX (whatever that is!)
with jobserver support which we haven't closed on yet. Still, it looks
good.
CrossOver Office 1.2.0
Wow! Way cool software. I'm really impressed with how much progress
WINE has been making; I didn't realize they were
doing so well. And Codeweavers has done a nice job of productizing it.
I'm very thrilled because I was able to get Quicken running on my Linux
box, which is the very last application that caused me to reboot into
Windows (I'd love to switch to GnuCash or similar,
but it just does not have enough features for me to get there yet).
This was causing my personal finances to suffer as I avoided rebooting
as much as possible :). Although there are some small glitches, I'm
able to download my online credit card statement, E*Trade account info,
and my online bank info as well and import it all relatively easily.
Certainly it's less labor intensive than rebooting into Windows...
Health
Ya know, sore backs suck. My dad has had problems with his back and
mine has been acting up on occasion as well--it's been kind of sore ever
since the marathon plane ride to Hawai'i. It seems to be finally
getting better now though. You don't really realize how much you depend
on your back muscles until they start complaining... my wife has some
yoga tapes and I might start trying those to see if that helps.
29 Jul 2002 (updated 29 Jul 2002 at 04:46 UTC) »
GNU make
Got a pretest release out before I left. Seems mostly fine, although there were a few small problems. I need to follow up with some issues that were raised, though. Still, I don't think it will be too much longer before a new release.
Web Hosting
Well, my subdomain is finally working. My hosting service offers very cool features, and a very reasonable price (and all their servers run Linux! :)) but when there are glitches it can take a little bit of perseverance to get them ironed out. Oh well, whaddaya gonna do.
Internet Banking
I just love internet banking. I've been so happy ever since I said "screw you" to Fleet and went Internet. Although, this vacation I did have a minor glitch: it turns out that my bank won't transfer money automatically from my reserve credit line to cover ATM transactions (although they do for checks of course). Annoying, but not a huge deal. Now that I know about it :).
Free Software
Geez, you're gone for two weeks and it takes a full day just to catch up. Debian 3.0 released: need to apt-get dist-upgrade on a number of my systems. New versions of gettext and automake: need to update my packages for that (esp. gettext as I was using a prerelease version before). And, of course, who knows what's going on when I get to work tomorrow ...
I've gotten quite a bit done on GNU make this week. I want to try to get a few more things done but if I don't, no big deal. I'm off on vacation starting Sunday morning and I will have a pretest release out before I go. Some neat new things in there.
Why is it always that the week before you go on vacation you have to work four times as hard? Annoying.
chipx86, I don't understand your issues with gettext. Why do you want backward-compatibility with 0.10.x? This seems not very useful, as that version was really broken. Some of the new features in gettext are really excellent: personally I've removed all gettext code from my source tree and now I use the external mode. This is really nice. The next version of gettext comes with a new tool, autopoint, which does what you want: it does not modify ChangeLog, etc. Unfortunately it's been stuck in beta limbo for a couple of months now; I've been using the pretest which has a small bug (easy to work around though).
Still, I managed to get a decent amount done; I just can't finish up the details. Very annoying.
RDP?
I've been trying to find a good spec for, or implementation of, a reliable packet-based protocol: RDP or whatever. None of them seemed to have really caught on though. I wonder why? If anyone has any pointers to good reliable packet protocols that are used and considered mature and stable, I'm interested. What the heck, I'm interested even if they're cool but not mature :).
ClearCase
sdodji, I have to disagree with a few of your comments on ClearCase. As a preface let me say I've never used the Windows version of ClearCase, but we've been using it on UNIXen for over 6 years, since CC 2.0. This is a free software forum so I don't want to get into a big discussion, and some of your comments are right on: it does require more admin commitment than other tools, and it does require very reliable and fast network and server hardware. I think many companies buy ClearCase that shouldn't: it is a very high-end tool. If you are willing to plunk down the $$ needed for licenses, then you should have enough to not blink too much at getting a good server and good network components.
However, ClearCase is not unreliable (again, on UNIX). We've been using the same server for about four years, with well over 100 developers using it and thousands of workspaces at a time, and it has never crashed or hung. My Linux box at work has currently been up for 149 days. This is my main development system: I do tons of work on it with multiple views, both local and mounted from remote systems.
Also, you are certainly not forced to use their merge utility (their text based one is about the worst I've ever seen; the graphical tool is mediocre). I use GNU diff3 or Emacs ediff for all my ClearCase merging. Their findmerge command has a -exec option that lets you invoke whatever merge tool you want, no problem.
Tom: it's not just you, or automake. We all get that kind of email. I agree that it's frustrating. And I know exactly what you mean: it's such a drag when you find yourself engaged in a meaningless flamefest and realize you're simply wasting your time with someone who has no real interest in having a discussion but just wants to vent. Who has energy for that? I unsubscribed from gmd years ago and that's helped a good bit with my free time :).
One thing I often do is write an email, then delete it before sending it. In a way it's still frustrating because maybe I wasted 15 minutes on the thing, but at least it doesn't drag on, and I usually feel better afterwards.
4 Jul 2002 (updated 4 Jul 2002 at 02:12 UTC) »
I guess I'm naive. For some reason I assumed that, with the current tech slowdown, there would be both a glut of at least reasonably talented people needing work, and a new commitment on the part of companies to keep the customers they have, and win new ones.
So why, then, has service in every segment of the tech industry (that I'm a customer of) begun to suck so badly? Is it really that companies don't have enough money to provide decent customer service any more? Or are people just too stunned and depressed to care?
I bring this up as I wait for the second day for someone at my ISP to clean out the /var partition on my hosting server: since that partition filled up sometime early yesterday morning major chunks of my site no longer work: I can't ssh into the box (I have a session still running from before the disk filled, which is how I know what's going on), various parts of my pages, like counters, no longer work, and anyone trying to send email to my mailing lists gets it rejected with an error about "insufficient resources". I've filed numerous cases with no response at all, and now even that doesn't work. How hard is it to delete some log files? If it comes to that, how hard is it to install some trivial scripts to proactively email the admins when a system disk is getting full?
And as bad as that is, I don't even want to get into the experiences my wife has had with her new Treo--she loves the Treo, but the customer service for everything ranging from the rebate offer to the email account to the phone service has been a nightmare.
So, what is it? Are companies broke? Did they fire anyone who knew anything? Are employees too depressed to care? I always used to assume bad customer service was due to record low unemployment--but now there are people who need work, and companies who need customers, and yet the customer service seems worse than ever.
I'm confused.