24 Dec 2009 dmarti   » (Master)

Webspam feeds?

So here's the basic idea. You have a web site with a bunch of user-generated links that may or may not be spam. These links can come from anywhere: comments, wiki edits, trackbacks, referers. Meanwhile, a spam link on some other site could be pointing at a page on your site. (For example, somebody makes an account on your site, and puts a bunch of crap in his or her profile page.)

So you want to find the spammy links on your site, and you want the rest of the webmasters in the world to clean up the spammy pages on their sites.

So what do you do? Well, when you clean up, you post the URIs that you don't like to a link reputation clearinghouse. (You can also post the good URIs that appear on your site as good, to help the clearinghouse decide that you're a legit user, and to help keep those URIs from showing up as spam.) You might report to more than one clearinghouse, since they all accept basically the same HTTP POSTs. It's all easy to automate as part of the moderation process in your CMS if you want.
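Here's a rough sketch of what that reporting step might look like from inside a CMS moderation hook. Everything specific in it is an assumption for illustration: the endpoint URLs are placeholders, and the form field names `uri` and `verdict` are made up, since no actual report format is defined here.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoints; no real report URLs are named in this post.
CLEARINGHOUSES = [
    "https://clearinghouse-a.example/report",
    "https://clearinghouse-b.example/report",
]


def build_report(endpoint, uri, verdict):
    """Build (but do not send) a POST reporting one URI as 'good' or 'bad'."""
    if verdict not in ("good", "bad"):
        raise ValueError("verdict must be 'good' or 'bad'")
    # Field names 'uri' and 'verdict' are assumed, not a documented format.
    body = urlencode({"uri": uri, "verdict": verdict}).encode("ascii")
    return Request(endpoint, data=body, method="POST")


def send_with_urlopen(request):
    """Default sender: perform the POST and return the HTTP status code."""
    with urlopen(request, timeout=10) as response:
        return response.status


def report_everywhere(uri, verdict, send=send_with_urlopen):
    """Send the same report to every configured clearinghouse."""
    return [send(build_report(endpoint, uri, verdict))
            for endpoint in CLEARINGHOUSES]
```

The `send` argument is there so the moderation code can be tested without touching the network. In production you would call something like `report_everywhere("http://spam.example/pills", "bad")` when a moderator deletes a comment.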

The clearinghouse does some digestion. (Naturally, spammers are going to try to clobber the clearinghouse with bogus reports, and naturally, some links are going to be reported the wrong way by mistake.) Each clearinghouse does its own digestion and reputation magick internally.
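Since every clearinghouse is free to digest however it likes, the following is only one possible toy scheme, not a description of any real service: weight each report by how much the reporter is trusted, so that a pile of bogus reports from unknown accounts can't easily outvote one established site. The weights, the default for unknown reporters, and the threshold are all made-up numbers.

```python
from collections import defaultdict

# Assumed weight for reporters the clearinghouse has never seen before.
DEFAULT_WEIGHT = 0.1


def digest(reports, reporter_weight):
    """Combine (reporter, uri, verdict) reports into a score per URI.

    Positive scores lean good, negative scores lean bad.
    """
    scores = defaultdict(float)
    for reporter, uri, verdict in reports:
        weight = reporter_weight.get(reporter, DEFAULT_WEIGHT)
        scores[uri] += weight if verdict == "good" else -weight
    return dict(scores)


def classify(score, threshold=0.5):
    """Turn a score into 'good', 'bad', or 'unknown' (too little evidence)."""
    if score >= threshold:
        return "good"
    if score <= -threshold:
        return "bad"
    return "unknown"
```

With this scheme, one "bad" report from a reporter weighted 1.0 still beats three "good" reports from unknown accounts, and a URI with only one low-weight report stays "unknown" until more evidence shows up.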

Then the clearinghouse generates RSS feeds by domain. You subscribe to one or more feeds from one or more clearinghouse services, and when you see a possibly bad page on your domain, you check it out. You can pick and choose among clearinghouses, since some will end up doing better digestion than others.
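On the subscriber side, checking a feed could be as simple as the sketch below. It assumes the feed is ordinary RSS 2.0 and that each item's `link` element holds the flagged page, which is a guess at the layout rather than a published format. The sample feed and host names are invented.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Invented sample; a real feed would be fetched from the clearinghouse.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Possibly bad pages on mysite.example</title>
    <item>
      <title>Reported profile page</title>
      <link>http://mysite.example/user/1234</link>
    </item>
    <item>
      <title>Entry for some other domain</title>
      <link>http://othersite.example/wiki/Spam</link>
    </item>
  </channel>
</rss>"""


def flagged_pages(rss_text, my_host):
    """Return the item links in the feed whose host matches my_host."""
    root = ET.fromstring(rss_text)
    pages = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if urlsplit(link).hostname == my_host:
            pages.append(link)
    return pages
```

Calling `flagged_pages(SAMPLE_FEED, "mysite.example")` gives you the list of pages to go look at. The host filter is a safety check in case a feed mixes in entries for other domains.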

Big web sites that host a lot of user-generated content might want to run their own clearinghouses. Another logical place to put a clearinghouse is at a site that does link sharing or URL shortening. Individual webmasters might subscribe to just one clearinghouse, and clearinghouses might subscribe to each other.

Here's a simple, easy-to-use clearinghouse: Aloodo. Right now it's seeded with good and bad links from this site, along with a few other public sources. There's also a simple way to query the good and bad lists, so, for example, you can check out a new user's profile page and forum postings before deciding whether to make them public. If you have a webspam problem, let's talk about how this could be useful to you—either as a customized subscription or as an in-house install.
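To make the "check before publishing" idea concrete, here is a minimal sketch. It does not use any real query interface: `lookup` is a hypothetical function you would supply yourself, taking a URI and returning "good", "bad", or "unknown" however your clearinghouse of choice exposes that.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in a chunk of user-submitted HTML."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


def safe_to_publish(html_text, lookup):
    """Check every link with lookup(uri); block if any comes back 'bad'.

    lookup is a caller-supplied, hypothetical function returning
    'good', 'bad', or 'unknown'.
    """
    verdicts = {uri: lookup(uri) for uri in extract_links(html_text)}
    return all(v != "bad" for v in verdicts.values()), verdicts
```

Holding back only on "bad" is a design choice; a stricter site might also send "unknown" links to a moderation queue instead of publishing them right away.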

Syndicated 2009-12-24 18:06:50 from Don Marti
