Implementing spam protection in Wiki engines

Posted 2 May 2011 at 08:26 UTC (updated 2 May 2011 at 11:35 UTC) by audriusa

Spam is a real problem in many Wiki communities, often forcing administrators to restrict editing rights, at least temporarily. Most recent attempts to find a solution focus on captchas and spam lists. Captchas may be effective to some extent; the problem is that making them unreadable to bots also makes them hard for humans to read. Lists seem less and less effective, often accumulating thousands of entries and still leaving enough gaps for spammers. Spammers frequently use the Wiki search box to check whether there is already spam on the site: existing spam suggests the Wiki is poorly maintained and that more can be added. Hence it may make sense to delay search indexing of new content, but that also delays indexing of legitimate content. Blocking IP addresses is also no longer useful because of DHCP.

One solution may be to use combined protection rather than relying on a single "killer" approach. The rationale is to force the spammer to invest more and more work into building the spam bot. Requiring a complex bot does not make the attack impossible, but it may statistically eliminate the significant percentage of spammers who are not willing to invest enough resources.

While maintaining our site (ultrastudio.org), we observed that a significant percentage of spam can also be stopped by relatively simple means that, to our surprise, were missing from the JAMWiki 0.8.4 we use (before we added them), so they may be missing from many other Wiki engines as well. If you work with the source code, the following extensions can be added to basically any Wiki engine that is edited through a web form:

1. When processing the edit form, check the request method and require POST (a bot that uses GET is much easier to implement). This may look funny, but there really are wandering bots that periodically try to post spam links as new pages using a GET request. A minimal sketch of this check appears after the list.

2. The edit session is always a three-page session: the user visits the viewing page, then gets the edit page by following the edit link, and then submits the edit form. Tie these three pages together through cookies or other obvious means (the second sketch after the list shows one way). Again, a bot that needs to make one request, understand the response and submit another request including data from the previous reply is more complex to write.

3. Set a minimal duration for the edit session, especially if multiple edits follow in rapid succession. A human needs at least a few seconds for an edit and about the same time to start another one. A bot frequently tries to edit a different page every quarter of a second, making it possible to auto-discover and auto-block it (the third sketch after the list illustrates the timing check).

4. Check the order of the fields and the overall structure of the HTTP headers, and verify whether the browser identified in the User-Agent header is likely to produce such a request. Reject edit calls of clearly non-browser origin (the last sketch after the list shows a few such heuristics). This forces the spam master to abandon the simple web access functions present in the standard libraries of many languages.
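
To make check 1 concrete, here is a minimal sketch written as a Java servlet filter, which is the environment JAMWiki runs in. The class name and the idea of mapping the filter to the URL that receives edit submissions are only for illustration; this is not JAMWiki's actual code.

    // Check 1: a browser submits the edit form via POST; reject any other method.
    // RequirePostFilter is an invented name, not a JAMWiki class.
    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class RequirePostFilter implements Filter {

        public void init(FilterConfig config) {
        }

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest request = (HttpServletRequest) req;
            HttpServletResponse response = (HttpServletResponse) res;
            // Wandering bots often try to create pages with a single GET request.
            if (!"POST".equals(request.getMethod())) {
                response.sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED);
                return;
            }
            chain.doFilter(req, res);
        }

        public void destroy() {
        }
    }

The filter would be mapped in web.xml to whatever URL the edit form posts to.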
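
Check 2 can be implemented with a one-time token: issue it when the edit form is rendered, embed it as a hidden field, and verify it when the form comes back. The sketch below assumes a servlet session; the class and attribute names are invented for the example.

    // Check 2: tie the view -> edit -> save sequence together with a random token.
    import java.security.SecureRandom;
    import java.util.Base64;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpSession;

    public class EditTokenHelper {
        private static final SecureRandom RANDOM = new SecureRandom();

        /** Call when rendering the edit form; embed the result as a hidden input. */
        public static String issueToken(HttpSession session) {
            byte[] bytes = new byte[16];
            RANDOM.nextBytes(bytes);
            String token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
            session.setAttribute("editToken", token);
            session.setAttribute("editTokenIssuedAt", System.currentTimeMillis());
            return token;
        }

        /** Call when processing the submitted form. */
        public static boolean isValidSubmission(HttpServletRequest request) {
            HttpSession session = request.getSession(false);
            if (session == null) {
                return false; // the client never fetched the edit page
            }
            String expected = (String) session.getAttribute("editToken");
            String submitted = request.getParameter("editToken");
            return expected != null && expected.equals(submitted);
        }
    }

A bot that posts the form directly, without first fetching the edit page and echoing the token back, fails the check.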
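
The timestamp stored in the previous sketch also gives check 3 almost for free: compare it with the time of the save, and additionally require a pause between consecutive saves from the same session. The thresholds below are only illustrative.

    // Check 3: a human needs a few seconds to edit; quarter-second saves are a bot.
    import javax.servlet.http.HttpSession;

    public class EditTimingCheck {
        /** Minimum plausible time between opening the edit form and saving it. */
        private static final long MIN_EDIT_MILLIS = 3000;
        /** Minimum plausible pause between two consecutive saves. */
        private static final long MIN_PAUSE_MILLIS = 2000;

        public static boolean isTooFast(HttpSession session) {
            long now = System.currentTimeMillis();
            Long issuedAt = (Long) session.getAttribute("editTokenIssuedAt");
            Long lastSave = (Long) session.getAttribute("lastSaveAt");
            boolean tooFast =
                    (issuedAt != null && now - issuedAt < MIN_EDIT_MILLIS)
                    || (lastSave != null && now - lastSave < MIN_PAUSE_MILLIS);
            session.setAttribute("lastSaveAt", now);
            return tooFast; // the caller can reject the edit or flag the session for blocking
        }
    }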
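
For check 4, the standard servlet API does not expose the raw order of the incoming headers, so the sketch below approximates the idea with simpler consistency heuristics: default library User-Agent strings, missing headers that real browsers send with every navigation, and an implausible Content-Type for a form post. The specific heuristics are illustrative, not a complete list.

    // Check 4: cheap tests for "does this edit request look like it came from a browser?"
    import javax.servlet.http.HttpServletRequest;

    public class BrowserLikenessCheck {
        public static boolean looksLikeBrowser(HttpServletRequest request) {
            String userAgent = request.getHeader("User-Agent");
            String accept = request.getHeader("Accept");
            String acceptLanguage = request.getHeader("Accept-Language");
            String contentType = request.getContentType();

            // Standard library clients send no User-Agent or a default like "Java/1.6.0_20".
            if (userAgent == null
                    || userAgent.startsWith("Java/")
                    || userAgent.startsWith("Python-urllib")) {
                return false;
            }
            // Real browsers send Accept and Accept-Language headers.
            if (accept == null || acceptLanguage == null) {
                return false;
            }
            // A browser form post is urlencoded or multipart; anything else is suspicious.
            if (contentType == null
                    || !(contentType.startsWith("application/x-www-form-urlencoded")
                         || contentType.startsWith("multipart/form-data"))) {
                return false;
            }
            return true;
        }
    }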

Protection of this kind only eliminates relatively simple bots: it is certainly possible to write a bot that bypasses it. However, in my experience, simple bots make up a significant percentage of all bots, and stopping them saves a lot of resources. At least in my case, the amount of spam-related work dropped by an order of magnitude, freeing a lot of time to work on content instead.


Blogspam.net, posted 2 May 2011 at 10:48 UTC by chalst (Master)

There are a few options for farming out blog spam filtering to web services. A big advantage of doing this is that centralised repositories have a bigger picture of what's out there, and so learn more quickly.

Stevey's blogspam.net has been adapted to serve the Ikiwiki wiki engine, and has shown that adapting blog spam filtering to wiki spam filtering works.
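
The integration itself is small: the wiki posts the suspect text to the service and acts on the verdict. The endpoint and field names below are purely illustrative, not blogspam.net's actual API; they only show the shape of such a call.

    // Illustrative only: ask a hypothetical spam-classification web service about an edit.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RemoteSpamCheck {
        public static boolean isSpam(String authorIp, String pageText) throws Exception {
            // Crude escaping for the sketch; a real implementation would use a JSON library.
            String json = "{\"ip\":\"" + authorIp + "\",\"comment\":\""
                    + pageText.replace("\\", "\\\\").replace("\"", "\\\"") + "\"}";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://spam-check.example.org/classify")) // hypothetical endpoint
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(json))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // Assume the service answers with a body containing "SPAM" or "OK".
            return response.body().contains("SPAM");
        }
    }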

