Older blog entries for slamb (starting at number 63)

spam flags

I mentioned before that Thunderbird and Mail.app have slightly different flags for indicating that a message is ham rather than spam. Well, their interaction seemed to be even weirder than that alone would explain - if a message was marked as not junk in Mail.app, no attempt to mark it as junk in Thunderbird would stick. Look for NonJunk and you'll find this (reformatted to fit your television):


PRBool messageClassified = PR_TRUE;
...
if (FindInReadable(NS_LITERAL_CSTRING("NonJunk"), keywords...)
  mDatabase->SetStringProperty(uidOfMessage, "junkscore", "0");
// Mac Mail uses "NotJunk"
else if (FindInReadable(NS_LITERAL_CSTRING("NotJunk"), keywords...)
  mDatabase->SetStringProperty(uidOfMessage, "junkscore", "0");
// ### TODO: we really should parse the keywords into
// space delimited keywords before checking
else if (FindInReadable(NS_LITERAL_CSTRING("Junk"), keywords...)
{
  PRUint32 newFlags;
  dbHdr->AndFlags(~MSG_FLAG_NEW, &newFlags);
  mDatabase->SetStringProperty(uidOfMessage, "junkscore", "100");
}
else
  messageClassified = PR_FALSE;

On startup, Thunderbird says that a message is not junk if Mail.app said it was NotJunk. When marking a message as Junk, it doesn't clear Mail.app's NotJunk flags. Brilliant! How could this plan possibly fail?

What annoys me is that Thunderbird added this feature after Mail.app but made a subtle change that broke interoperability. Then they realized their parsing sucked and they were interpreting Mail.app's NotJunk as saying Junk. They fixed it with this hack job and the bug popped up elsewhere - now Thunderbird's attempt to change the marking to junk won't stay across restarts. A little forethought and there wouldn't have been this mess.
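To make the parsing problem concrete, here is a rough Python sketch. It is not Thunderbird's code, and the function names are invented for illustration. It contrasts substring matching with splitting the keyword string on spaces first (what the TODO comment asks for), and shows the missing step: clearing both not-junk spellings when a message is marked as junk.

```python
# Thunderbird's and Mail.app's spellings of "this is ham".
NOT_JUNK_KEYWORDS = {"NonJunk", "NotJunk"}


def classify_by_substring(keywords: str):
    """Mimic the quoted checks: substring search, not-junk spellings first."""
    if "NonJunk" in keywords or "NotJunk" in keywords:
        return 0
    if "Junk" in keywords:
        return 100
    return None  # message not classified


def classify_by_token(keywords: str):
    """Split into space-delimited keywords before checking."""
    tokens = set(keywords.split())
    if tokens & NOT_JUNK_KEYWORDS:
        return 0
    if "Junk" in tokens:
        return 100
    return None


def mark_junk(keywords: str) -> str:
    """Hypothetical fix: drop every not-junk spelling when adding Junk."""
    tokens = [t for t in keywords.split() if t not in NOT_JUNK_KEYWORDS]
    if "Junk" not in tokens:
        tokens.append("Junk")
    return " ".join(tokens)
```

With substring matching, "Junk" is found inside "NotJunk", which is why the not-junk checks had to come first. A message carrying both "NotJunk" and "Junk" then scores 0 on every startup unless the stale keyword is removed, as mark_junk does here.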

8 Jun 2007 (updated 8 Jun 2007 at 05:08 UTC) »
Training server-side Bayesian filters

Last night I worked on an unobtrusive way to train SpamAssassin's Bayesian database. (Autotraining sure spam and ham as it's delivered is nice, but you at least need a way of correcting its mistakes or it will keep making them.) The sa-learn utility is quite easy to use, but how do you specify what messages to feed to it? I haven't seen any good glue for this. You want to feed it messages which have been examined and categorized, and ideally you want to feed it each message exactly once. (sa-learn does realize that it's seen a message before, but it still takes some processing time to do even that.)

I decided to harness the power of RFC 2060. My trainer connects via IMAP4rev1, executes a SEARCH command for candidates (letting the server do the work of an arbitrarily complex query), downloads the messages and pipes them through sa-learn, flags them as learned (so the next search will skip them), and disconnects. I implemented it using imapfilter, and so far it works quite well. This approach would even work well if the SpamAssassin machine were separate from the mail store machine.
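The search, fetch, learn, flag loop could look roughly like the following in Python with imaplib and subprocess. This is a sketch under assumptions, not the imapfilter script described above: the "Learned" keyword name is made up, the not-junk keyword defaults to Thunderbird's spelling, and error handling is omitted.

```python
import imaplib  # used by the caller to build the connection (see the note below)
import subprocess


def search_criteria(junk: bool, not_junk_keyword: str = "NonJunk") -> str:
    """Build a SEARCH query for categorized messages not yet learned."""
    category = "Junk" if junk else not_junk_keyword
    # "Learned" is a hypothetical keyword marking already-trained messages.
    return f"KEYWORD {category} UNKEYWORD Learned"


def run_sa_learn(message: bytes, junk: bool) -> None:
    """Pipe one message to sa-learn on stdin."""
    flag = "--spam" if junk else "--ham"
    subprocess.run(["sa-learn", flag], input=message, check=True)


def train(conn, junk: bool, learn=run_sa_learn) -> int:
    """Feed each candidate to the learner once, then flag it as learned."""
    _, data = conn.uid("SEARCH", search_criteria(junk))
    uids = data[0].split() if data and data[0] else []
    for uid in uids:
        # BODY.PEEK[] fetches the whole message without setting \Seen.
        _, parts = conn.uid("FETCH", uid, "(BODY.PEEK[])")
        learn(parts[0][1], junk)
        conn.uid("STORE", uid, "+FLAGS", "(Learned)")
    return len(uids)
```

To use it, create an imaplib.IMAP4_SSL connection to your server, call login() and select() on a mailbox, then call train(conn, True) for spam and train(conn, False) for ham. Like the imapfilter version, this spawns sa-learn once per message, so it has the same Perl startup cost.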

In the process, I noticed that Thunderbird updates spam status on the IMAP server in the Junk and NonJunk keywords. Mail.app does the same, in the Junk and NotJunk keywords (plus a few others). Did you see it? One uses NonJunk, the other NotJunk. How hard would it have been to get these guys in a room to fight this one out? Grr. They have a weird interaction because they just didn't put any thought into it.

I also tried out Lua for the first time, as it's imapfilter's extension language. Turns out I hate it. I really wanted to like it. I had been thinking of using it all over an embedded product for rapid development with little resources. It's minimalist, fast, and so on. But it's just unpleasant to use. Maybe it's too minimalist. I would have liked a separate array type (rather than just "tables" / associative arrays), and I hate "high-level" languages without exceptions. imapfilter's library is also a bit limiting - its fetch_message and pipe_to do everything in memory. That makes me more irritated that Lua doesn't just have an array slice syntax I can use to pass message lists to fetch_message. And it means I have to spawn sa-learn a bunch of times for reasonable memory consumption, and starting a Perl process heavy with modules takes a long time.

I might end up rewriting my trainer in Python using either imaplib and subprocess or twisted.mail.imap4 and twisted.internet.process. I'm not real impressed with either mail API, though. I like the JavaMail API better, but forking and interacting with child processes from Java (or even Jython) sounds painful.

2 May 2007 (updated 2 May 2007 at 22:40 UTC) »
clarkbw, re: security choices

C. You are connected to a site pretending to be www.url.com … Something evil could be going on! Someone might be trying to trick you! Though odds are this isn’t true, it’s likely that guilt or the legal department required us to put this dialog up just for this case.

No, no, no, no, no! This text is the entire purpose of SSL. If it's really unlikely, then thousands of people wouldn't have created an entire ecosystem around validating identities. You have to realize that a private conversation is totally worthless if you don't know who you are talking to, and if nothing warns you when that validation fails, why would you have validation at all? This text wasn't added by lawyers; it was added by people who just spent man-centuries creating cryptosystems which would be absolutely worthless if this text were not displayed.

This dialog box shouldn't say "don't worry, this is probably something wrong with their setup. Just go on, send them your credit card number like always." That would defeat the purpose of the system so badly I'm having trouble coming up with an analogy. It's sort of like a policeman seeing someone trying to pick a lock and opening it for them, then standing by, smiling, as they walk off with all the valuables the lock was protecting. If you downplay the security concerns of sending important information over this link, you're basically telling the lock "sometimes keys screw up, just let him in." (I warned you the analogy sucked.)

It should be alarming! It needs to be alarming enough that if someone goes to their bank's website and sees this dialog box, they won't enter their password. Instead, they'll call their bank on the telephone and tell them that they've spotted fraud. This is the correct action - it's either true or it will get the correct people angry at the security people who screwed up the configuration. It's very rare for a major bank to totally botch their security setup like this.

On the other hand, it shouldn't be so alarming that it will prevent people from browsing some random untrusted website which they have no intention of sending important information to. It's not uncommon for people to require SSL on a site, not bother paying the money to have it signed by a widely-trusted CA, and have instructions for people with particularly sensitive passwords to import the certificate into their browser. That's not a site configuration problem, either - it's a "you haven't given the computer a way to verify their identity" problem.

I agree that examining a certificate and finding the problem is unrealistic for most people. Maybe the details of the certificate should be in an "Advanced" pull-out or something.

2 May 2007 (updated 2 May 2007 at 00:17 UTC) »
clarkbw, re: security choices

I'm not convinced there's a problem with the status quo. For the 90% of people you describe, the SSL certificate dialog box comes down to this:

Your connection to www.bigbank.com is insecure. It's likely that people are trying to steal your money.

Give them my money | Cancel

My parents don't understand X.509 PKI, but they do understand that they care if a connection is secure if and only if they plan to send financial credentials over it. They know - and the computer doesn't - what information they are planning to send. Thus, they are capable of responding to this dialog correctly 100% of the time. Choosing either option for them would be right less than 100% of the time. A complicated voting scheme would be right less than 100% of the time.

apenwarr, re: tabbed MDI

Tabbed browsing is [...] less flexible [than MDI], because there's no way to display two documents side-by-side. Imagine if Photoshop used tabbing between images: useless! (In fairness, the hybrid model used in Firefox, where you can open a new window or a new tab, is a really good balance. I just wish there was an easy way to "convert this tab into a window" or vice versa.)

That's not a fundamental limitation of tabbed MDI. Do you have a Mac? Open Adium with a couple of conversations. You can drag the tabs - not only to rearrange them within a window, but also to drag one out of one window into another and back. It's intuitive and even has nice eye candy. (IIRC gaim has this same feature, though it doesn't look as nice.) I'd post a trendy screencapture video if I knew how to do such things easily. Every now and then I try to do it in Safari or Firefox and am disappointed that it doesn't work.

haruspex

I read the manual. Ironically, you did what you accused me of - not reading fully before complaining. Read my full post, and you'll see a mention of bison and (not coincidentally) the same bison pattern rule found in the GNU make manual. Unfortunately, those tools are not universally available, and this particular project is developed by a lot of BSD people. I anticipate requiring GNU make and bison would be a hard sell.

24 Mar 2007 (updated 24 Mar 2007 at 10:17 UTC) »
make oddities

I'm trying to correctly express the dependencies for running yacc, which produces multiple targets from a single invocation. Let's start with a rule from racoon2's lib/Makefile.in:


.y.c:
        $(YACC) $(YFLAGS) $<
        mv -f y.tab.c cfparse.c

There are three problems with this:

  • it has a hardcoded filename in a pattern rule,
  • it has an intermediate file with a generic filename, which causes problems when run in parallel. I use make -j4, so this gets run in parallel if there are multiple .y files, and also in a non-obvious case I'll mention below,
  • it doesn't mention the y.tab.h target that other files (cftoken.o) depend on.

As far as I can see, there's no way to tell make about the generic intermediate file problem. The best you could do is to use lockfile(1). But it's not universally available, and if I'm going to try convincing a project to switch to tools not universally available, I might as well just try for bison, which produces unique filenames directly. Now my rule can look like this:


%.tab.c %.tab.h: %.y
	$(BISON) $(YFLAGS) $<

This works under GNU make, and almost works under BSD make. The problem there is that it's run twice. Easy to see with a test file:


.PHONY: all
all: foo.out1 foo.out2


%.out1 %.out2: %.in
	lockfile -r0 mylock
	touch $*.out1 $*.out2
	sleep 1
	rm -f mylock

clean:
	rm -f mylock foo.out[12]

It works with gmake but fails with bsdmake. And here's something odd: replace that pattern rule with a static one...


foo.out1 foo.out2: foo.in
	lockfile -r0 mylock
	touch foo.out1 foo.out2
	sleep 1
	rm -f mylock

...and GNU make fails, too. A non-intuitive difference between static and pattern rules: static rules use multiple targets on a line to say that both targets are made with similar commands (but different $@), while a pattern one says that both targets are made with the exact same invocation.

What about cheating by making one target depend on another?


%.tab.c: %.y
	$(BISON) $(YFLAGS) $<


%.tab.h: %.tab.c

I thought about it, but there's no guarantee the two files are produced in a particular order, and if they come out in the opposite order from the one I declared, make will rebuild. It might end up doing that over and over again if my choice is consistently wrong.

I guess what I can do is make cftoken.o depend on cfparse.tab.c, as a surrogate for cfparse.tab.h. It's silly, but it works.

My conclusion: make sucks, and GNU make sucks a little less than most. But I guess I already knew that from Recursive Make Considered Harmful. (Besides making the argument you'd expect from the title, it has some good points like how GNU make's reparsing of changed include files put it a step above the rest.)

lkcl, re: message passing

First, you're wrong about rename(): it's not atomic. There is an intermediate step that processes can see. From the Linux manpage:

However, when overwriting there will probably be a window in which both oldpath and newpath refer to the file being renamed.

(The atomicity people say it provides is carefully limited...newpath either refers to the old file or the new file. That's generally all you need.)
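That limited guarantee is what the usual write-then-rename pattern relies on. A minimal Python sketch, assuming a POSIX system where os.replace maps onto rename() and where the temporary file lives in the same directory (and so the same filesystem) as the target; replace_atomically is an invented name:

```python
import os
import tempfile


def replace_atomically(path: str, data: bytes) -> None:
    """Write data to a temp file, then rename it over path.

    Readers opening path see either the old contents or the new ones,
    never a partially written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # get the data on disk before renaming
        os.replace(tmp_path, path)
    finally:
        # Only still present if something failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
```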

Second, your statement that Linux lacks message passing and atomic operations is false. Linux has many forms of message passing between processes/threads - pipes/FIFOs/sockets, POSIX message queues, etc, and they are entirely suitable for the task pphaneuf asked about. I'm not sure what atomicity guarantee he'd need that Linux doesn't provide. You call write() to send a byte on a wake-up pipe, and that byte is either in the buffer or it's not. There's no intermediate state; therefore, it is atomic. What more do you want? Block until the other process/thread has actually grabbed the byte? Why? You can build that if desired.
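For reference, the wake-up pipe idea can be sketched in a few lines of Python on a Unix-like system. The helper names are invented, and a real event loop would watch the read end alongside its other descriptors rather than call select on it alone.

```python
import os
import select


def make_wakeup_pipe():
    """Create a pipe with both ends non-blocking; returns (read_fd, write_fd)."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)
    os.set_blocking(write_fd, False)
    return read_fd, write_fd


def wake(write_fd: int) -> None:
    """Send one byte; it is either in the pipe buffer afterwards or it is not."""
    try:
        os.write(write_fd, b"\0")
    except BlockingIOError:
        pass  # buffer full, so a wake-up is already pending


def wait_for_wakeup(read_fd: int, timeout: float) -> bool:
    """Wait up to timeout seconds; drain pending bytes and report a wake-up."""
    readable, _, _ = select.select([read_fd], [], [], timeout)
    if not readable:
        return False
    try:
        while os.read(read_fd, 4096):
            pass  # coalesce several wake-ups into one
    except BlockingIOError:
        pass  # nothing left to read
    return True
```

Several wake() calls before the reader runs collapse into a single wake-up, which is usually what an event loop wants.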

Third, your mention of microkernels is a non sequitur. I find Tanenbaum's work, and the L4 project you're alluding to by mentioning those universities, to be quite interesting. However, microkernels are not necessary to provide any particular IPC facility for userspace processes to communicate amongst themselves.

9 Feb 2007 (updated 9 Feb 2007 at 08:43 UTC) »
ncm, here are five tidbits you probably didn't know about me:

  1. I first learned about the Internet by dialing ISCABBS through my dad's dual-speed 300/1200 baud modem when I was 10.

  2. My official major in college was electrical engineering then computer science, but my favorite classes and professors were in physics. If I hadn't already loved computing for nine years before finding the physics for majors classes, I'd be on a different path today.

  3. When I drove from Iowa City to the Bay Area with a carload of belongings, I showed up at the doorstep of good friends I met through ISCABBS. I'd never been to Northern California or met them in person, and I lived with them for the next year.

  4. I was once nearly thrown in a Tanzanian jail. The Arusha and Serengeti offices of Tanzanian Immigration and Customs disagree on the legality of my crossing the border by bicycle from Kenya so far from an official border post. I probably shouldn't have taken the advice of a Kenyan who'd spent three days in Egyptian military custody for entering without a visa.[*]

  5. During that crossing, my friends and I went through a dozen inner tubes and two dozen patches before running out twelve kilometers from our destination and walking. Three days cycling from Karen, Kenya to Loliondo, Tanzania - much of it on a rocky, hilly, windy earthen path that diverged wildly from our map - will do that. We had to bum a 450 km ride to Arusha where we could get replacements. Probably the stupidest thing I've ever done, but the trip was one of the greatest experiences of my life. The people were kind, generous, and amazing in the truest sense of the word.

[*] "What did you see?" "Camels and sand. Camels and sand. What do you see?" But all worked out for him, too - he met his wife in Egypt.

5 Feb 2007 (updated 5 Feb 2007 at 22:00 UTC) »
ncm, fejj

The data structure you're talking about is called a Bloom filter. It seems to be one that I see and think "oh, that'll come in handy sometime" but then it never does. The nastiest limitation is that you can't ever remove anything from it. So if your web crawler should eventually rescan the same URL (as I hope it would), it'd be unsuitable for hashing URLs. I confess that I don't really understand what the interview question is getting at.
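For anyone who hasn't seen one, a Bloom filter fits in a few lines of Python. This is a minimal sketch: the bit count, the number of hashes, and the use of salted SHA-256 as the hash family are arbitrary choices for illustration, not tuned values.

```python
import hashlib


class BloomFilter:
    """Set-like structure: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

There is deliberately no remove method. Clearing an item's bits could also clear bits shared with other items, which would introduce false negatives, and that is the limitation mentioned above.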
