Recent blog entries for FarcePest

7 Mar 2005 (updated 7 Mar 2005 at 23:28 UTC) »

I'm going to PyCon 2005, and I'll be at the Zope3AppSprint.

One of the things I need for my MySQL Users Conference talk is a sample application. Originally this talk was submitted as a tutorial, which I doubted would fly at MySQLUC anyway. For a tutorial, I think I'd spend an hour or more on the application. In this case, I have 45 minutes total, so I think I have about 5-10 minutes for a brief application tour with some code snippets.

Last time I did a tutorial (for OSCON 2002), I wrote a small application that would index your MP3 collection using ID3 tags. This time, I'm going for another timely subject: an RSS aggregator. This will actually be pretty easy, since I intend to use Universal Feed Parser to handle all the RSS and/or Atom feeds.

Part of my motivation is I still haven't found an RSS reader that I'm completely happy with. Currently the one I like best is rawdog. rawdog uses the Universal Feed Parser, and creates static web pages. Articles appear in reverse chronological order; there are also plugins to change the ordering. To actually update the feed database and write the pages, you use cron. You can set the refresh time separately for each feed, which is kind to the feed operators. The shortest feed time I use is an hour, and I have some at 2-8 hours, and a couple at one day or more. One big benefit is it's just a web page, so you can read it from anywhere.

Mozilla Thunderbird has RSS reader support, and it basically treats each feed as it would a newsgroup; it does everything you'd expect a USENET news reader to do, except post.

Mozilla Firefox can detect RSS and Atom feeds via the <link rel="alternate" ...> element, and then you can add the feed as a Live Bookmark. This looks like a regular folder of bookmarks, but it is refreshed periodically. This does not take the place of an RSS reader, since you only get a list of links, and no other content, and no tracking of read items, but it is useful. It's particularly easy to add new feeds.

Sage is an RSS and Atom feed aggregator and reader with feed discovery, in addition to Firefox's built-in feed discovery. You can import and export feeds in OPML format. It can render the feeds as HTML and tracks what you have read. I think I like it somewhat better than Thunderbird for this application.

Blam is an RSS reader (no Atom support) written in C#. It works with Mono, and can read and write OPML. Unlike some of the others, it doesn't update feed titles, and it may be a little too happy to reload feeds. If you don't already have Mono (or .NET), it's a lot to install.

What I want out of an RSS reader is a more dynamic version of rawdog, with a dash of Gmail and del.icio.us. I am less concerned about where the articles are coming from, and more concerned with their classification (news, security, humor, blog, etc.).

So much for hanging out more on Advogato.

I'll be speaking at the 2005 MySQL Users Conference on (what else) Python and MySQL. I haven't yet figured out how long I'm staying at the conference, though.

MySQLdb is now six years old and on version 1.2. There were some times when development was a little... slow. 1.2 was a nice milestone because I got to throw a lot of old Python stuff away, and I get to do it some more in 1.3.x.

A lot of the stuff on my home page has really started to rust and/or smell bad. In theory, I am a web developer of sorts lately, and my own site is unmaintained crap. Maybe, someday...

I did put out a minor update of adns-python. Just a few minor bug fixes, plus support for TXT records. Someone gave me the patch for TXT records years ago; I just never made another release. It could still be updated to use some of the newer Python memory API stuff, and I almost worked on it, but I decided not to break it for now. adns has only had two releases in five years, so I don't feel so bad. No new releases means it works, right?

HyperText is another bit-rotting project. I think a few people actually use this, though: Webware mentions it as a "Future" item, but development on Webware seems to have slowed down a lot, although maybe there is a little more going on lately. XIST credits HyperText for some basic concepts; I may have to give it a try for a project I have in mind.

I should push ZMySQLDA-2.0.9 out the door with minimal changes; it's only been a beta for 2+ years.

SQLORB is another bit-rot example. Last release in 2002, worked on it a little more, never released it. There are a couple of other competing (I use the word loosely) projects:

SQLObject
There are some things SQLObject does right, and some things it does wrong (IMHO). Maybe I'll elaborate later...
ORM
Another similar project I haven't tried.

I am golden brown and del.icio.us. I was skeptical about del.icio.us at first, but it's kinda growing on me. I also think at some point, there's going to be a big meltdown, so please don't use it, or it'll go the way of orkut (slow and crashy).

If you love or hate spam, visit my Random Spam Generator. Now with hideous CSS styling and RSS feeds! Are you making more money that ever? Offensive content guaranteed! I then decided to look out for someone whom I could introduce to the investor as someone who will administer the usage of the funds. some companies willingly give up their secrets and disclosed our money confidently lodged there or many outright blackmail. Later uses were in industrial applications such as petrochemicals, poultry & farming, printed -circuit board manufacturing and food processing applications. Machine is FREE!

I think that's enough for now. See you in three years!

-- I LIVE -- AGAIN!

"That VOICE! Coming from the MECHANICAL BEAST! It's HAUNTINGLY FAMILIAR! I'd recognize it ANYWHERE -- even after all these DECADES! And those words: 'I LIVE -- AGAIN!' Who can FORGET such words? Not ME! I've heard them TOO OFTEN!" -- Megaton Man #2

Two years between diary entries is quite a while...

Well. Around September 2000, the company I was working for (comstar.net) got bought by Globix, which shortly thereafter began its rapid swirling motion around the bowl. Globix filed a pre-packaged bankruptcy in March 2002. In April they closed my office in Athens, GA, and by the end of May, it dawned on them that gee, maybe they should stop paying me. (They are still paying me anyway.) Globix went from over 900 employees worldwide (perhaps briefly as many as 1200) to fewer than 300 (closer to 200) today. Atlanta went from 50+ to about 1/10th that. My own prediction is that they will be a memory by 2004.

Oh well, at least I got to go to (and speak at) OSCon 2002.

Anyway. Maybe I'll start hanging out more on Advogato.

Moved my Python pages due to the recent disk crash of starship.python.net. Just about everything is over there now. It's the home page link.

I'm still recovering from Saturday's paintball excursion.

7:30 a.m.: Get out of bed. Not used to getting up that early on any day of the week, let alone Saturday. Drive 90 mi to the paintball place. Yawn.

11 a.m.: We finally get organized for the first game. You could actually rent camo at this place, but I brought some old black&white urban/winter camo, with a dark blue shirt that's kinda like long underwear in texture. I'm not sure how badly that affected my visibility, as we were in a forest. Then again, the paintballs were also blue, but this turned out not to matter.

First game: Capture the flag, take it to the enemy base. I'm part of the group going after the flag. Thwock! Hit in the back of the head. I get a little cover, start feeling around, and... no paint == no kill. But we wiped 'em out.

Second game: Same as first. Took a hit on the gun. That does count. I think we lost that one.

Third game: The ridge. Same basic rules, the terrain is much, much more hilly. I am way too out of shape to be doing a lot of running around, so I stayed back to defend the base. Got hit on the wrist, no paint again. Last player alive on our team. They called it when they thought we were all dead, but it would have been five against one...

Fourth game: Same. Stayed to defend. Saw basically nothing. But by moving around I might have lured some enemies out to shoot at me, whereupon they got taken out. We won that one.

To this point, I have not gotten any paint on my body.

Lunch: Huge subs from Publix. One of the other players who only wore a T-shirt had a nice crater in his arm, courtesy of Centove. Those paintballs travel 300 feet per second. From what I could tell, it took a couple layers of skin. Ouch.

Fifth game: POW rescue. Defenders (us) entrenched. Rescuers have to grab the flag from us, take it back to their base, within 10 minutes. Finally I get to do a lot of shooting. Snipe, snipe, snipe. After getting a little bored, and with about a minute left, I moved up to the next barricade. Got nailed in the head. The headgear they give you covers your eyes, face, and sides of your head, which is good because this one hit right on the ear. Headgear did its job, though some paint came through the vents.

Sixth game: Same, except some people on the other team had to leave, so I was on defense again. My old team came in running suicidally. Blew away Centove, maybe my boss, too. When you've got a whole bunch of 'em running straight towards you, you pretty much just have to get the range right. I almost pegged one of the refs; he sure did dance around a lot, but he was standing right between two legitimate targets... I got enough paint on the ankle to count as a hit, but we won immediately thereafter.

Seventh game: Another huge hill. Ran as far as I could, then got some cover. Sighted an enemy trying to outflank us and splattered him. Turned out to be Centove yet again. Couldn't really see anyone else. Got nailed, hard, trying to move up.

Eighth game: Same. Best cover I had was three pine logs. Had to stay pretty low. I think I got two. Then I got into a real short-range firefight with someone behind a big piece of plywood. We exchanged a lot of rounds, but I eventually took a gun hit.

Throughout all the games, I maybe took eight to ten hits on my upper body, and none of them broke paint. Lucky? More like, my body was absorbing the blow enough that the ball didn't break. I have a nice silver dollar size bruise on my shoulder from one, and some smaller bruises elsewhere.

Back hurt some later that evening. Right leg hurt a lot more the next day, and today. Whatever muscle it is on top of your thigh that you need for climbing stairs.

It was fun though, and we were lucky: T-storms were forecast, and it never rained, but it was really humid towards the end.

You now have the opportunity to join the most extraordinary and most powerful wealth building program in the world!

Typical morning:

I am the comstar.net Spam Disposal Unit.

My day usually starts with a new pot of coffee. Unfortunately it was on all weekend, so I need to wait for it to cool down. I really need to get one that shuts off on its own.

After logging in, I check my inbox, and then switch to the relays folder. 64 new relays. For a Monday, that's fairly typical.

comstar.net is a business ISP, which among other things means we don't have dialup services (excepting ISDN). Some of our customers are dialup ISPs, though. But anyway, part of this is mail hosting for a couple hundred different domains. Which means, we get spam. Most of it seems to be for comstar.com, which is one of ours, but there are hardly any users within that domain. However, some spammer, somewhere along the line, got the idea that this domain has a million users in it. Part of the problem here is we use qmail for the MTA, and qmail-smtpd doesn't check recipients during the SMTP session, except to make sure that it's for a domain it should accept for. So from a sender perspective, all those recipients seem to exist.

So the general scenario goes like this:

  • Spammer sends to some 10K pseudo-random addresses, probably generated from some list of common names.
  • We attempt delivery on them.
  • Nearly all of them bounce.
  • The envelope sender is fake, of course, so they bounce again.
  • The double-bounces go into my spamgrab script.
This is where the fun begins.

BEEN LOOKING FOR AN EXPLOSIVE BUSINESS OPPORTUNITY?

The spamgrab script (mostly procmail) starts out by getting the original bounced message. With the qmail bounce format, this is pretty easy. sed does the job nicely.
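A rough Python equivalent of that sed step might look like this. The marker line is my assumption about the qmail bounce format, not something verified against the actual script:

```python
# Pull the original message out of a qmail bounce.  The marker line
# below is assumed from qmail's bounce format; the real script uses sed.
MARKER = "--- Below this line is a copy of the message."

def extract_original(bounce_text):
    lines = bounce_text.splitlines(True)
    for i, line in enumerate(lines):
        if line.rstrip("\n") == MARKER:
            # Skip the marker and the blank line that follows it.
            return "".join(lines[i + 2:])
    return None
```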

Next it finds the IP of the host we got the message from, and compares it against a cache. There are really two tests against the cache: the first checks whether the IP is present at all (entries stay around for a week); if it is, it's a known spam host. The second checks whether a report has been sent for that host within the last 24 hours. Messages from hosts that are in the cache are sent to /dev/null after any reporting.
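The two-window cache logic can be sketched like so. This is my reading of the description, with hypothetical names, not the actual script:

```python
WEEK = 7 * 24 * 3600
DAY = 24 * 3600

class SpamHostCache:
    """Sketch: an IP stays 'known' for a week after first sighting,
    but a fresh report is allowed only once every 24 hours."""

    def __init__(self):
        self._entries = {}  # ip -> (first_seen, last_reported)

    def check(self, ip, now):
        entry = self._entries.get(ip)
        if entry and now - entry[0] >= WEEK:
            del self._entries[ip]          # whole entry expired
            entry = None
        known = entry is not None
        if entry is None:
            self._entries[ip] = (now, now)
            report = True
        elif now - entry[1] >= DAY:
            self._entries[ip] = (entry[0], now)
            report = True
        else:
            report = False
        return known, report
```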

The host might not be in there at all, of course. However, we also employ an RBLCheck script that runs just before qmail-smtpd. This checks against ORBS, RSS, RBL, and DUL, and tags the headers to indicate which lists that host is on. It does some other fun things as well; more on that later.

The tagging is for the benefit of the spamgrab script, when the mail eventually double-bounces. The spamgrab script looks for these tags, and generates a report if they are there, under certain conditions.

NEW Stock Holders and Investors Alert - for April 7

I haven't said much about the reports yet. Due to the cache, reports are only sent for a given host once every 24 hours.

If the host is on DUL, it sends a spam complaint (original message only) to the host's ISP, using the abuse.net database.

Otherwise, it's assumed to be relay spam. This generates a detailed relay spam report (including the entire original double-bounce) to the ISP's abuse department. It also generates a relay report, saving it in my relays mailbox. The relay report is for ORBS and/or RSS, avoiding reporting to lists that it is already on. Later on I go through and inspect these, and pump them back through the script so that they are actually mailed out.

One time, a spammer sent us the same spam through at least 300 different relays, twice on the same weekend. (600 total.) But since most of these were listed on ORBS, the spamgrab script sucked them all up, reported the relays to their ISPs, and generated reports for RSS. On average, though, I only generate about 3000 relay reports a month. Most of those are unique.

Hello Natural Health Enthusiast,

Now I know what you're thinking: If I'm using ORBS, RSS, RBL, and DUL, why do I have any spam to bounce?

Answer: Because I have leaky spam filters, by design.

  • If the host is on ORBS and either RSS or RBL, we refuse the mail at the SMTP session.
  • If the host is on DUL, it's throttled: Additional recipients after the first get a temporary failure code. In the Battle of the Bandwidth, DS3 beats V.90 any day.
  • If it's just on RSS, it's temporarily failed about 90% of the time. The other 10% of the time, it gets through.
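Those three rules boil down to a small decision function. This is my paraphrase of the rules above, not the actual RBLCheck code:

```python
import random

def smtp_verdict(lists, rcpt_index, rng=random.random):
    """`lists` is the set of DNSBLs the connecting host appears on;
    `rcpt_index` is 0 for the first RCPT.  Returns 'reject',
    'tempfail', or 'accept'.  (Hypothetical names throughout.)"""
    if "ORBS" in lists and ("RSS" in lists or "RBL" in lists):
        return "reject"                   # refuse at the SMTP session
    if "DUL" in lists and rcpt_index > 0:
        return "tempfail"                 # throttle dialups to one rcpt per try
    if lists == {"RSS"}:
        # leak ~10% through so there's evidence left to report
        return "accept" if rng() < 0.1 else "tempfail"
    return "accept"
```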

The leaky filters are what enable me to send so many relay reports. If I blocked on ORBS directly, I wouldn't have spams to send to RSS. Besides, a lot of ORBS hosts aren't yet abused by spammers; remember that I only bother with the double-bounced spams. But once they are on both, I don't need or want 'em. RBL I just don't trust that much; their policies seem too erratic. But I will block on RBL if there's an ORBS listing. ORBS at least has an objective criterion: Does the host relay, or is it the smarthost for another relay? RSS is a little different: Does the host relay, and has it relayed spam? I never liked the idea of blocking all dialup connections. It's a bit unfair to Linux users who actually can run a real MTA.

But the leaky filters are just the beginning. I log all these incoming connections, and there's another script that finds the worst ones for the most recent period. Those hosts, the ones that are connecting the most and are spam-listed, get put on the firewall for awhile.
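The worst-offenders pass could be as simple as a tally over the connection log. A minimal sketch of the idea, with hypothetical inputs, not the actual script:

```python
from collections import Counter

def worst_offenders(log_ips, listed_hosts, top_n=10):
    """Tally connections per IP and return the busiest hosts that are
    also on a spam list -- candidates for a temporary firewall rule."""
    counts = Counter(log_ips)
    ranked = [(ip, n) for ip, n in counts.most_common()
              if ip in listed_hosts]
    return ranked[:top_n]
```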

On a typical weekday, we refuse something like 70% of the incoming connections. On weekends, this goes up to about 95%.

Home Improvement Loans Here

What spam does get through, and past the spamgrab script, goes in my spam box. I sort these by size, look for clusters, pick a likely candidate, select a unique string ("waste your time", "university diplomas", "international driver's license"), and then pump those back into the script with an option that tells it: This is relay spam. This forces it to generate reports.

We (I) would be completely swamped without all this, and it's evolved over time to the point where it's gotten pretty efficient. It would be tough to do this with sendmail. qmail's modular design makes it relatively easy.

And I haven't told you about smeat yet... :)

gstein certified me. I ended up browsing his Python pages and found a cool way to make Python byte-code files (.pyc) executable. Then I thought of a cooler way to do it. Or at least, if you want to make your .pyc files executable, this is an easier way to go about it.

P.S. 10-Apr-2000: It turns out that it only works if you directly reference the Python interpreter on the #! line, i.e. my examples with /usr/bin/env python don't actually work. I've since fixed the page. Thanks again to argent for some discussions that led up to this discovery.

Released MySQLdb-0.2.0. A lot of changes in this version, and most of them are non-obvious. Which is a good thing, right?

There's now a mutex in the standard Cursor class, which allows two threads to share a connection. Personally, I think sharing connections is a bad idea. For one, each connection is a separate thread in mysqld, so by sharing a connection, you don't get to take advantage of multiple CPUs, or doing something while another operation blocks. Second, transactions are coming to MySQL, or so I hear, sometime in 3.23. It seems likely that transactions will begin and end on a per-connection basis, i.e. in most database designs, the commit/rollback is done on the connection, not the cursor.
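The idea behind the mutex is just to serialize access to the shared connection. A sketch with hypothetical class names (this is not MySQLdb's actual code):

```python
import threading

class LockingCursor:
    """Serialize access so two threads can share one connection
    without interleaving queries mid-flight."""

    def __init__(self, connection):
        self._conn = connection
        self._lock = threading.Lock()

    def execute(self, query):
        with self._lock:          # only one query in flight at a time
            return self._conn.query(query)
```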

There are now multiple cursor classes, built using several MixIn classes. I won't list them here; they are in the documentation. But it's something like this:

  • return rows as tuples or columns
  • use client-side (mysql_store_result) or server-side (mysql_use_result) cursors (the latter you especially don't want to share connections with)
  • raise Warning or not

Or you can mix up your own class.

0.1.3 introduced some methods that were intended to create a little backwards compatibility with the older MySQLmodule: cursor.fetchXXXDict(). It didn't quite work out that way. MySQLmodule would set the keys to be "table.column", and my implementation used "column". I figure, if your columns aren't unique, you should just alias them with AS in your SQL. The people who wanted the old way of doing things seemed to use SELECT * a lot, which is usually a bad idea.

So then I got the bright idea: Set the key to "column" unless it already exists; otherwise set it to "table.column". This gets rid of most of the ambiguity problems.
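In code, that key scheme is a one-liner per column. A sketch of the rule as described (hypothetical function, not the actual implementation):

```python
def row_to_dict(columns, row):
    """Use the bare column name unless it's already taken, in which
    case fall back to table.column.  `columns` is a list of
    (table, column) pairs in result-set order."""
    result = {}
    for (table, column), value in zip(columns, row):
        key = column if column not in result else "%s.%s" % (table, column)
        result[key] = value
    return result
```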

The other thing I did which relates to backwards-compatibility is I got rid of nearly all of the keyword options on the db.cursor() method, namely use and dict. There is a new one, cursorclass, which you can set to your own cursorclass. If you are a freak for fetching dictionaries, you would now do db.cursor(cursorclass=DictCursor), and this gives you a cursor that returns dictionaries. cursor.fetchXXXDict() is deprecated; just use cursor.fetchXXX() on the DictCursor instance.

Breaking the Cursor class into a bunch of MixIns let me do another important optimization. By default, the old cursor class used client-side cursors (mysql_store_result). This means your entire result set is sucked into memory on the client side. Then the various fetch calls would return rows. When you deleted the cursor, only then would the MYSQL_RES be freed.

What it does now is: After it does the query, it fetches all the rows, and hides them in the cursor. Then it frees the MYSQL_RES. Theoretically, this cuts the memory utilization down. Queries are probably slower (for large result sets), but fetches now are literally just slices.

This also allowed the introduction of some non-standard methods: cursor.seek(offset[, whence=0]) and cursor.tell().
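The store-everything strategy makes seek/tell trivial, since the cursor is just a list plus a position. A sketch of the shape of it (hypothetical class, not MySQLdb's actual code):

```python
class StoredResultCursor:
    """Fetch all rows up front (standing in for draining and freeing
    the MYSQL_RES), then serve fetches as plain list slices."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0

    def fetchone(self):
        rows = self.fetchmany(1)
        return rows[0] if rows else None

    def fetchmany(self, size):
        rows = self._rows[self._pos:self._pos + size]
        self._pos += len(rows)
        return rows

    def tell(self):
        return self._pos

    def seek(self, offset, whence=0):
        # whence: 0 = absolute, 1 = relative (mirroring file semantics)
        self._pos = offset if whence == 0 else self._pos + offset
```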

Another optimization: A new string literal function in _mysql that not only does mysql_escape_string but adds the necessary quotes in place. This should speed up string INSERTS, since not as many strings are being built up and torn down.
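The escape-and-quote idea, sketched in Python rather than the C of _mysql (the escape set here is my assumption, not mysql_escape_string's exact behavior):

```python
def string_literal(s):
    """Escape the string and wrap it in quotes in one pass, so the
    caller gets a ready-to-embed SQL literal."""
    escaped = (s.replace("\\", "\\\\")
                .replace("'", "\\'")
                .replace("\n", "\\n")
                .replace("\0", "\\0"))
    return "'" + escaped + "'"
```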

Because of the work to make connections sharable, you now have to call cursor.insert_id() instead of db.insert_id() to get the last inserted auto increment field value.

I wonder sometimes whether the MixIn classes for Cursors are really worth it. They might cost performance, since you need the BaseCursor class and three MixIns to implement the standard cursor. On the other hand, some program logic is removed, which is usually a good thing.

On the Zope front: There's a ZMySQLDA-1.1.4 out. The patch I have to make ZMySQLDA use MySQLdb/_mysql doesn't work on it. However, I do have a new version (culled together from ZOracleDA) that I haven't released yet. If you want to test it, let me know. It seems to work perfectly fine for SELECT; someone has told me it doesn't work for other things. I'm using it in an on-going project; I just haven't had to write to the database yet.

MySQLdb is the number one download at The Vaults of Parnassus, beating out stuff like Numerical Python and Zope. Wow. I guess a lot of people like MySQL. So due to demand, I've started to build Red Hat RPMs as well. The package is named MySQL-python so that it's a sub-package of MySQL.
