Older blog entries for barryp (starting at number 66)

Was just googling for some info about Python and threads, and got burned by something that's been bugging me about search engines...

You get an awful lot of false hits because of words that appear in the surrounding navigational "fluff" that appears on most webpages.

For example, just about every page in every Python mailing list contains a "next in thread" link, referring to the mailing list threads. So "thread" is a horrendous word to try and search for :(

Many mailing lists show the subject lines of next or previous messages, and lots of pages have navigational links where a word here and a word there might match what you're looking for, but have nothing to do with the useful content of the page.

Would be nice if there was a standard way to tag within a page what the "meat" and/or "fluff" is, so search engines can focus on or ignore parts of a page.

If an outfit like Google defined something like this (with the incentive of a somewhat improved PageRank being dangled in front of you), and mail-list web archive software such as Mailman were updated to use it, it could really help out in web searches.
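Just to make the idea concrete, here's a rough sketch of what an indexer could do if pages marked their fluff. The `class="fluff"` marker is completely made up for this example (no such standard exists), and the code assumes reasonably well-formed HTML where every non-void tag gets closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MeatExtractor(HTMLParser):
    """Collect page text, skipping anything inside an element marked as fluff.

    The class name "fluff" is a hypothetical convention for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.fluff_depth = 0   # > 0 while inside a fluff-marked element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Once inside fluff, every nested element deepens the count.
        if self.fluff_depth or "fluff" in classes:
            self.fluff_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.fluff_depth:
            self.fluff_depth -= 1

    def handle_data(self, data):
        if not self.fluff_depth and data.strip():
            self.chunks.append(data.strip())


def meat_text(html):
    """Return only the text a search engine should index."""
    parser = MeatExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With something like this, a "next in thread" link wrapped in a fluff-marked `div` would never make it into the index, so searching for "thread" would only hit pages that actually talk about threads.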

jfleck:

If you've got a Windows machine that various bastards may have messed with, you may want to try running something like Spybot - Search & Destroy to make sure you got everything.

I gave it a try after seeing recommendations for it on a couple TechTV websites (Screensavers, Call for Help), and it seems like a pretty decent, easy-to-use program.

Cool Python Stuff

I ran across a very handy Python package called SimpleTAL that is...

...a stand alone Python implementation of the TAL and TALES specifications used in Zope to power HTML and XML templates. SimpleTAL is an independent implementation of TAL; there are no dependencies on Zope nor is any of the Zope work re-used.

I used it on something where Zope would have been way overkill. Also, for a mini-DAV server, where I needed to generate XML for a PROPFIND response, SimpleTAL was much easier than generating the output through "print" statements or Python's miniDOM. Good stuff, amazing what can get done with just two small module files.

24 Jan 2003 (updated 24 Jan 2003 at 00:39 UTC) »
Troll Hiding

raph mentions:

I usually read the recentlog with a threshold of 3, so I don't tend to even notice troll posts unless someone else points to them.

How do you set that threshold? It would be nice to hide the low-ranking diary entries, but I don't see any preference/control for that.

...boy I hope somebody sees this.. :)

24 Nov 2002 (updated 24 Nov 2002 at 05:36 UTC) »

FreeBSD Ports

Been dabbling with Subversion off and on for a while now, and thought I'd try putting my FreeBSD box's /etc and /usr/local/etc dirs under SVN control. The ports tree has an older version, which works well enough - but I thought I'd see how hard it would be to come up with a newer port. (yeah, I know you can build it with just configure/make, but it would be nice to see the port updated).

Anyhow, after spending a bit of time looking at the FreeBSD Porter's Handbook, I have to say I was horrified at how much of a PITA it was to set up. It was one of those kinds of deals where you look at it and think: you have *got* to be shitting me. (To be fair though, like everything else in life, maybe it wouldn't be so bad once you got a few under your belt.)

I never got it working exactly right, and ended up blowing it off. I figured that even if it did work, ports patches seem to have a tendency to be blown off or ignored (there's a nearly 2-month old patch for an intermediate Neon in the PR system just sitting there).

On the other hand though, I have more appreciation now for the people that do keep those ports updated and get things committed.

Another way

I was browsing through freshports.org, and something that caught my eye was A-A-P, which is a Python-powered package builder/installer meant to maybe someday supplant things like the *BSD ports systems.

Took a whack at putting together A-A-P "recipes" for Neon 0.23.5 and Subversion 0.15, and found that it did make building and installing FreeBSD packages pretty easy. Pretty good for a version 0.1 release.

Just found that Greg Ward has a project on SourceForge named elspy - that's basically the same as my own exim/python scanner. Oops, oh well, you do learn something by doing it yourself.

20 Oct 2002 (updated 20 Oct 2002 at 20:57 UTC) »
Exim-Python local_scan

Finally broke down and finished my Exim add-on that allows you to write local_scan functions in Python. Great for fooling around with Bayesian spam-detection and such. Been using it for the last month or so on a FreeBSD box, should work on other Unixes.

9 Sep 2002 (updated 9 Sep 2002 at 01:36 UTC) »

Went geocaching this afternoon. Strangely, found out afterwards that the last cache I logged on the website was exactly one year ago today. Must be something about September weekends.

One of those deals where for the last half-mile or so you're wondering what the heck you've gotten yourself into. Seemingly nobody around for miles - if something happened (I dunno, wild dogs, fall in a hole, killed by drifters), they'd probably not find your body for years.

Then, as I'm walking along - just a little spooked, my cellphone rings, and it's my mom wanting to know if I wanted to go out for dinner since my grandma was in town. I had to explain that I was out-of-town, and couldn't make it - but didn't get into details about exactly where I was at the moment.

Modern life and technology sure are strange sometimes - there I was, in the middle of nowhere, taking a phone call that sounded exactly the same as it does at home.

Anyhow, found the cache, but had no pencil to write in the log with :( . Made it out alive. Will try to remember to bring a walking stick to fend off wild dogs with next time (and a pen).

1 Sep 2002 (updated 1 Sep 2002 at 16:27 UTC) »

There's been a lot of talk lately about fighting spam, but I've been wondering lately about the problem of e-mail viruses.

At work we're getting a steady 200-300 Klez-infected e-mails a week, and it's been that way ever since it first hit. It seems like it's never going to stop - people just don't ever get a clue. And that's just one particular virus - as soon as a new and better one comes around, people will fall for it all over again.

Virus Relays

Maybe the thing to do is to start fighting e-mail viruses the way you fight spam - by identifying "virus relays" and blacklisting them. Eventually ISPs would have to implement virus scanning on their SMTP servers, and would push the problem back to the edges of the network where the actual infected machines would have to be cleaned up and better-protected if they wanted to ever send any e-mail.

I know that's pretty impractical, but I suppose you could still try identifying "virus relays" and periodically send them reports of how many infected emails were received from their system, and suggest they do something about it rather than just blindly passing the problem along.

I was intrigued by Bram's Python code for analyzing spam, and have been studying Paul Graham's article and raph's comment to it, but am still a bit perplexed at the significance of the values +-4.6, -1.4 and 2.2. Do they really mean something or are they just pulled out of thin air? 4.6 = log(100) makes a small bit of sense, but the other two I don't quite get.
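For what it's worth, the 4.6 part is easy to check. The snippet below only confirms the arithmetic; reading it as the log-odds of a probability clamped near 0.99 is my own guess rather than anything the article or the comment states, and it says nothing about where -1.4 and 2.2 come from. The `log_odds` helper is just a name I picked for illustration.

```python
import math


def log_odds(p):
    """Natural-log odds of a probability p (hypothetical helper for this check)."""
    return math.log(p / (1.0 - p))


# The natural log of 100 is roughly 4.605.
print(math.log(100))

# For comparison, the log-odds of p = 0.99 is log(99), roughly 4.595.
print(log_odds(0.99))
```

Both values round to about 4.6, so either reading would fit that one number.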

Anyhow, had the idea that a possible way to have users submit mail for analysis/training would be to have them copy messages into special IMAP folders - which gave an excuse to play around with Python's imaplib library. Created folders named "Learn-Spam" and "Learn-OK" and had a script pull messages from there and remove when finished.
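Roughly, the pull-and-remove part can be sketched like this with `imaplib`. This is a minimal reconstruction, not the actual script: the function name `drain_folder` is made up, and it assumes you hand it a connection that's already logged in.

```python
import imaplib


def drain_folder(conn, folder):
    """Return the raw bytes of every message in `folder`, then delete them.

    `conn` is assumed to be an already logged-in imaplib.IMAP4 or
    imaplib.IMAP4_SSL connection.
    """
    messages = []
    status, _ = conn.select(folder)
    if status != "OK":
        return messages
    status, data = conn.search(None, "ALL")
    if status != "OK":
        return messages
    for num in data[0].split():
        status, parts = conn.fetch(num, "(RFC822)")
        if status != "OK":
            continue
        # Message content arrives as (envelope, body) tuples; skip the rest.
        for part in parts:
            if isinstance(part, tuple):
                messages.append(part[1])
        # Flag the message for deletion once it has been read.
        conn.store(num, "+FLAGS", "\\Deleted")
    # Actually remove everything flagged above.
    conn.expunge()
    return messages


# Example usage (host and credentials are placeholders):
#   conn = imaplib.IMAP4_SSL("mail.example.com")
#   conn.login("user", "password")
#   spam = drain_folder(conn, "Learn-Spam")
#   ok = drain_folder(conn, "Learn-OK")
#   conn.logout()
```

Note that this deletes messages as soon as they're fetched, so if the training step falls over afterwards the mail is gone. A more careful version would flag for deletion only after training succeeds.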

One thing I see is that you're gonna have to make sure to do base64 and quoted-printable decoding of message parts, otherwise spammers could easily obscure their stuff from scanning.
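The standard `email` package will do that decoding for you. Here's a minimal sketch (the function name `decoded_text_parts` is made up for illustration) that walks a message and returns the text parts with the transfer encoding undone, ready for tokenizing:

```python
from email import message_from_bytes


def decoded_text_parts(raw_message):
    """Return the decoded text of every text/* part in a raw message."""
    msg = message_from_bytes(raw_message)
    texts = []
    for part in msg.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            texts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown or bogus charset name: fall back to UTF-8.
            texts.append(payload.decode("utf-8", errors="replace"))
    return texts
```

Scanning the raw body of a base64 message would only ever see gibberish like `QnV5IGNoZWFw...`, which is exactly the hiding place you want to take away.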

For persistent storage of tokens, scores and such - tried PostgreSQL and found that inserting hundreds of small records per message took a *lot* of time. Tried a PyBSDDB dbshelve, which was smoking fast by comparison for this type of job.
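The shelve approach looks something like the sketch below. To keep it self-contained it uses the standard library's `shelve` as a stand-in for PyBSDDB's dbshelve, and the `[spam_count, ok_count]` record layout and function names are assumptions for illustration, not what the original script stored.

```python
import shelve


def record_tokens(db_path, tokens, is_spam):
    """Add one message's tokens to persistent [spam_count, ok_count] tallies."""
    index = 0 if is_spam else 1
    with shelve.open(db_path) as db:
        # Count each distinct token once per message.
        for token in set(tokens):
            counts = db.get(token, [0, 0])
            counts[index] += 1
            # Reassign so the updated list is written back to the shelf.
            db[token] = counts


def token_counts(db_path, token):
    """Return (spam_count, ok_count) for a token, or (0, 0) if unseen."""
    with shelve.open(db_path) as db:
        return tuple(db.get(token, [0, 0]))
```

Each message becomes a batch of simple key lookups and writes against a local file, with no SQL statement per token, which is presumably a big part of why this kind of store feels so much faster for the job.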
