17 Mar 2009 ingvar   » (Master)

OK, as an addendum to my previous post, I ended up screen-scraping what I needed, parsing the data I wanted out of it, and generating SQL statements to (later) populate a database with. It would probably have been more elegant to connect to the database and insert the data directly, but a FORMAT call is quite convenient, as it were.
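Something along these lines, roughly; a minimal sketch where the table name and column layout are made up for illustration, but the FORMAT directive is the convenient bit:

  ;; Sketch: turn one extracted row (a list of string values) into an
  ;; INSERT statement. Table name and values here are placeholders.
  (defun row->insert (stream table row)
    "Write an SQL INSERT for ROW to STREAM, one statement per row."
    (format stream "INSERT INTO ~a VALUES (~{'~a'~^, ~});~%" table row))

  ;; (row->insert *standard-output* "scraped" '("foo" "42"))
  ;; => INSERT INTO scraped VALUES ('foo', '42');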

The screen-scraper was constructed by using DRAKMA to fetch the pages and then some substring functions to extract the data I needed. An estimated 30 minutes of writing and testing the Lisp, then a further "lots" of actual scraping.
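The fetch-and-extract pattern looks more or less like this (a rough sketch, assuming DRAKMA is already loaded via ASDF; the URL and delimiter strings are placeholders, not the actual ones I used):

  ;; Pull a page and grab the text between two known markers.
  (defun extract-between (text start-marker end-marker)
    "Return the substring of TEXT between START-MARKER and END-MARKER, or NIL."
    (let ((start (search start-marker text)))
      (when start
        (let* ((from (+ start (length start-marker)))
               (end (search end-marker text :start2 from)))
          (when end
            (subseq text from end))))))

  (defun scrape (url)
    ;; DRAKMA returns the page body as a string for text/html responses.
    (let ((page (drakma:http-request url)))
      (extract-between page "<td class=\"value\">" "</td>")))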

But my main musing for today is something I've noticed recently in my Apache logs. It seems there's an active business in "referring page" spam. I haven't run the numbers, but from eyeballing the logs I am seeing at least a couple of page fetches per day where the "referring page" field contains URLs that trigger my wetware "this is spam" detection. I wonder what the reasoning behind it is? Maybe they're banking on sites publishing their stats publicly?
