10 Apr 2000 FarcePest   » (Journeyer)

You now have the opportunity to join the most extraordinary and most powerful wealth building program in the world!

Typical morning:

I am the comstar.net Spam Disposal Unit.

My day usually starts with a new pot of coffee. Unfortunately it was on all weekend, so I need to wait for it to cool down. I really need to get one that shuts off on its own.

After logging in, I check my inbox, and then switch to the relays folder. 64 new relays. For a Monday, that's fairly typical.

comstar.net is a business ISP, which among other things means we don't have dialup services (excepting ISDN). Some of our customers are dialup ISPs, though. But anyway, part of this is mail hosting for a couple hundred different domains. Which means we get spam. Most of it seems to be for comstar.com, which is one of ours, but there are hardly any users within that domain. However, some spammer, somewhere along the line, got the idea that this domain has a million users in it. Part of the problem here is that we use qmail for the MTA, and qmail-smtpd doesn't check recipients during the SMTP session, except to make sure that the mail is for a domain it should accept. So from the sender's perspective, all those recipients seem to exist.

So the general scenario goes like this:

  • Spammer sends to some 10K pseudo-random addresses, probably generated from some list of common names.
  • We attempt delivery on them.
  • Nearly all of them bounce.
  • The envelope sender is fake, of course, so they bounce again.
  • The double-bounces go into my spamgrab script.
This is where the fun begins.


The spamgrab script (mostly procmail) starts out by getting the original bounced message. With the qmail bounce format, this is pretty easy. sed does the job nicely.
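As a rough illustration of that extraction step (in Python rather than the sed the script actually uses), the sketch below cuts a double-bounce down to the original message. The two marker strings are what stock qmail-send is assumed to write; check them against real bounces before relying on them. The function name is made up for this example.

```python
# Marker lines assumed from the stock qmail-send bounce format.
DOUBLE_MARK = "--- Below this line is the original bounce."
COPY_MARK = "--- Below this line is a copy of the message."


def original_message(double_bounce: str) -> str:
    """Return the text of the message that originally bounced."""
    # Skip past the double-bounce notice, if there is one.
    _, found, rest = double_bounce.partition(DOUBLE_MARK)
    bounce = rest if found else double_bounce
    # Everything after the copy marker is the original message.
    _, found, message = bounce.partition(COPY_MARK)
    if not found:
        raise ValueError("no copy-of-message marker found")
    return message.lstrip("\n")
```

Because `partition` splits on the first occurrence, a marker-like line inside the spam body itself does not confuse the cut.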

Next it finds the IP of the host we got the message from, and it compares this against a cache. The cache entries stay around for a week, but after 24 hours they expire. That sounds a little contradictory. There are really two tests against the cache: The first test checks to see if the IP is present (up to a week). If it's there, it's a known spam host. The second test checks to see if a report has been sent within the last 24 hours. Messages from hosts that are in the cache are sent to /dev/null after any reporting.
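The two tests are easier to see in code. This is a minimal sketch, assuming the cache keeps one timestamp per IP (the time of the last report) and that both windows are measured from it; the class and method names are invented for the example, and the real cache may be stored quite differently.

```python
WEEK = 7 * 24 * 3600  # seconds a host stays "known"
DAY = 24 * 3600       # minimum seconds between reports for one host


class SpamHostCache:
    def __init__(self):
        self._last_report = {}  # ip -> time of the last report sent

    def is_known(self, ip, now):
        """First test: has this host been reported within the last week?"""
        ts = self._last_report.get(ip)
        return ts is not None and now - ts < WEEK

    def report_due(self, ip, now):
        """Second test: true unless a report went out in the last 24 hours."""
        ts = self._last_report.get(ip)
        return ts is None or now - ts >= DAY

    def record_report(self, ip, now):
        self._last_report[ip] = now
```

A host reported two days ago is therefore both known (so its mail is discarded) and due for another report.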

The host might not be in there at all, of course. However, we also employ an RBLCheck script that runs just before qmail-smtpd. This checks against ORBS, RSS, RBL, and DUL, and tags the headers to indicate which lists that host is on. It does some other fun things as well; more on that later.

The tagging is for the benefit of the spamgrab script, when the mail eventually double-bounces. The spamgrab script looks for these tags, and generates a report if they are there, under certain conditions.
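For readers who have not seen a DNS blacklist check, here is a hedged sketch of the usual mechanism: reverse the IP's octets, append the list's zone, and see whether the name resolves. The zone names and the `X-RBLCheck` header name below are placeholders invented for the example, not the ones RBLCheck uses.

```python
import socket

# Placeholder zones for illustration only; substitute the real list zones.
ZONES = {
    "ORBS": "orbs.example",
    "RSS": "rss.example",
    "RBL": "rbl.example",
    "DUL": "dul.example",
}


def query_name(ip, zone):
    """Build the lookup name: 1.2.3.4 in zone Z becomes 4.3.2.1.Z."""
    return ".".join(reversed(ip.split("."))) + "." + zone


def listed_on(ip, zones=ZONES, resolve=socket.gethostbyname):
    """Return the names of the lists that have an entry for this IP."""
    hits = []
    for name, zone in zones.items():
        try:
            resolve(query_name(ip, zone))
        except OSError:
            continue  # lookup failed: treated as not listed
        hits.append(name)
    return hits


def tag_header(hits):
    """Format a header line recording which lists matched (name is hypothetical)."""
    return "X-RBLCheck: " + (",".join(hits) if hits else "none")
```

The `resolve` parameter exists only so the lookup can be swapped out for testing without touching the network.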

NEW Stock Holders and Investors Alert - for April 7

I haven't said much about the reports yet. Due to the cache, reports are only sent for a given host once every 24 hours.

If the host is on DUL, it sends a spam complaint (original message only) to the host's ISP, using the abuse.net database.

Otherwise, it's assumed to be relay spam. This generates a detailed relay spam report (including the entire original double-bounce) to the ISP's abuse department. It also generates a relay report, saving it in my relays mailbox. The relay report is for ORBS and/or RSS, avoiding reporting to lists that it is already on. Later on I go through and inspect these, and pump them back through the script so that they are actually mailed out.
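The reporting rules boil down to a small decision function. The sketch below is one reading of them, with invented report labels; it assumes that a message with no tags at all produces no automatic report, and it only plans the reports rather than mailing anything.

```python
RELAY_LISTS = ("ORBS", "RSS")  # lists that accept relay reports


def plan_reports(tags, report_due=True):
    """Return (kind, target) pairs for one double-bounced spam."""
    tags = set(tags)
    if not report_due or not tags:
        return []  # reported within 24 hours, or nothing to go on
    if "DUL" in tags:
        # Dialup host: complain to its ISP, original message only.
        return [("spam-complaint", "isp-abuse")]
    # Otherwise assume relay spam: detailed report to the ISP...
    reports = [("relay-spam-report", "isp-abuse")]
    # ...plus a relay report for each list the host is not already on.
    for name in RELAY_LISTS:
        if name not in tags:
            reports.append(("relay-report", name))
    return reports
```

For a host listed only on ORBS, this plans an ISP report and a relay report for RSS; the relay reports would sit in the relays mailbox until inspected.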

One time, a spammer sent us the same spam through at least 300 different relays, twice on the same weekend. (600 total.) But since most of these were listed on ORBS, the spamgrab script sucked them all up, reported the relays to their ISPs, and generated reports for RSS. On average, though, I only generate about 3000 relay reports a month. Most of those are unique.

Hello Natural Health Enthusiast,

Now I know what you're thinking: If I'm using ORBS, RSS, RBL, and DUL, why do I have any spam to bounce?

Answer: Because I have leaky spam filters, by design.

  • If the host is on ORBS and either RSS or RBL, we refuse the mail at the SMTP session.
  • If the host is on DUL, it's throttled: Additional recipients after the first get a temporary failure code. In the Battle of the Bandwidth, DS3 beats V.90 any day.
  • If it's just on RSS, it's temporarily failed about 90% of the time. The other 10% of the time, it gets through.
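The three rules above can be sketched as a single policy function. This is only an illustration: the function name and return strings are made up, and any combination of listings the rules do not mention is assumed to be accepted.

```python
import random


def smtp_decision(tags, rcpt_index=0, rng=random.random):
    """Decide what to do with one recipient of one incoming message.

    tags: the lists the connecting host is on
    rcpt_index: 0 for the first recipient in the session, 1 for the next, ...
    rng: returns a float in [0, 1); injectable so the policy can be tested
    """
    tags = set(tags)
    if "ORBS" in tags and ("RSS" in tags or "RBL" in tags):
        return "reject"
    if "DUL" in tags:
        # Throttle: only the first recipient is accepted.
        return "accept" if rcpt_index == 0 else "tempfail"
    if tags == {"RSS"}:
        # Leaky on purpose: roughly one in ten gets through.
        return "tempfail" if rng() < 0.9 else "accept"
    return "accept"  # assumption: anything else is let in
```

Temporary failures cost the sender a retry but cost a legitimate MTA nothing permanent, which is what makes the leak safe to run.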

The leaky filters are what enable me to send so many relay reports. If I blocked on ORBS directly, I wouldn't have spams to send to RSS. Besides, a lot of ORBS hosts aren't yet abused by spammers; remember that I only bother with the double-bounced spams. But once they are on both, I don't need or want 'em. RBL I just don't trust that much; their policies seem too erratic. But I will block on RBL if there's an ORBS listing. ORBS at least has an objective criterion: Does the host relay, or is it the smarthost for another relay? RSS is a little different: Does the host relay, and has it relayed spam? I never liked the idea of blocking all dialup connections. It's a bit unfair to Linux users who actually can run a real MTA.

But the leaky filters are just the beginning. I log all these incoming connections, and there's another script that finds the worst ones for the most recent period. Those hosts, the ones that are connecting the most and are spam-listed, get put on the firewall for a while.
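Picking the worst offenders is essentially a counting job. A minimal sketch, assuming the log has already been parsed into (ip, tags) pairs for the recent period; the function name and the `top` and `min_count` thresholds are invented for the example.

```python
from collections import Counter


def worst_hosts(connections, top=10, min_count=100):
    """Return the busiest spam-listed IPs from (ip, tags) log entries."""
    # Only connections from hosts on at least one list are counted.
    counts = Counter(ip for ip, tags in connections if tags)
    # most_common() sorts by connection count, highest first.
    return [ip for ip, n in counts.most_common(top) if n >= min_count]
```

The returned addresses would then be handed to whatever adds temporary firewall rules.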

On a typical weekday, we refuse something like 70% of the incoming connections. On weekends, this goes up to about 95%.

Home Improvement Loans Here

What spam does get through, and past the spamgrab script, goes in my spam box. I sort these by size, look for clusters, pick a likely candidate, select a unique string ("waste your time", "university diplomas", "international driver's license"), and then pump those back into the script with an option that tells it: This is relay spam. This forces it to generate reports.

We (I) would be completely swamped without all this, and it's evolved over time to the point where it's gotten pretty efficient. It would be tough to do this with sendmail. qmail's modular design makes it relatively easy.

And I haven't told you about smeat yet... :)
