Advogato has been under sustained attack from spammers since 11:00 UTC Sunday. The attack originates from a botnet of at least several hundred nodes distributed worldwide. It is automated and creates 10 to 20 new user accounts every minute, each with large, spam-filled blog posts. I discovered the attack around two hours after it started and immediately turned off new account creation.
Mod_virgule buffers the 100 most recent new accounts for display in the "recent people joining" box on the front page. The attackers blew past that number pretty quickly, so I had to use the web server logs to track down and remove the bad accounts. Removing them left the recent accounts buffer completely empty. It will fill up again once I'm able to turn new account creation back on.
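To see why the front-page box was no help here: a fixed-size buffer only remembers the newest entries, so at 10 to 20 sign-ups a minute a 100-entry buffer turns over completely in five to ten minutes. The sketch below uses Python's collections.deque as a stand-in. It is not mod_virgule's actual data structure, and the function and account names are made up for illustration.

```python
from collections import deque

RECENT_LIMIT = 100  # buffer size mentioned in the post


def record_signups(names, limit=RECENT_LIMIT):
    """Return what a fixed-size 'recent accounts' buffer still holds."""
    recent = deque(maxlen=limit)  # oldest entries fall off automatically
    for name in names:
        recent.append(name)
    return list(recent)


# Hypothetical run: 150 sign-ups arrive, only the newest 100 remain visible.
recent = record_signups("user%d" % i for i in range(150))
print(recent[0], recent[-1], len(recent))  # user50 user149 100
```

Anything older than the newest 100 sign-ups has to be recovered from somewhere else, which is why the web server logs were needed.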
I spent a while Sunday logging and blocking IPs for individual nodes of the attacking botnet but basically gave up after blocking the first hundred or so. With account creation off, the attackers can't create accounts, and what we're left with is a low-level DDoS attack. The bandwidth being used isn't enough to disable the site, and hopefully the attackers will give up once they realize no new accounts are being created.
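For anyone facing something similar, picking botnet nodes out of the logs can be roughed out as below. This is only a sketch under assumptions: it expects Apache-style common log format lines, and the "/signup" path is a hypothetical placeholder, not the real mod_virgule account-creation URL.

```python
import re
from collections import Counter

# Client address, then the method and path from the quoted request line.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*"')


def count_clients(log_lines, path_prefix):
    """Count POST requests per client address for paths under path_prefix."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group(2) == "POST" and match.group(3).startswith(path_prefix):
            hits[match.group(1)] += 1
    return hits


# Made-up sample lines using documentation-only IP ranges.
sample = [
    '203.0.113.5 - - [01/Jan/2000:11:00:01 +0000] "POST /signup HTTP/1.0" 200 512',
    '203.0.113.5 - - [01/Jan/2000:11:00:09 +0000] "POST /signup HTTP/1.0" 200 512',
    '198.51.100.7 - - [01/Jan/2000:11:00:10 +0000] "GET /index.html HTTP/1.0" 200 900',
]
for address, count in count_clients(sample, "/signup").most_common():
    print(address, count)
```

With several hundred nodes each making only a handful of requests, a list like this gets long fast, which is roughly why blocking address by address stops being worth it.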
The switch to the libxml2 HTML parser solved a lot of internal problems but, as some of you have noticed, it introduced a new one. Libxml2 "thinks" in XML, and when it comes across a pair of HTML tags with no content, such as <em></em>, it turns the pair into a self-closing tag: <em />. That's great if you're reading the result with an XML parser, but most browser HTML parsers can't treat certain tags as self-closing and instead see an open tag with no corresponding close. This has the effect of including all the subsequent markup on the page inside the offending tag, usually terminating display of the page.
It looks like only a handful of tags produce this effect, so it should be possible to filter them out, either by dropping empty tag pairs before parsing or by converting the self-closing tags back to open/close pairs.
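The second option, turning self-closing tags back into open/close pairs, might look something like the sketch below. It is a regex-based illustration in Python, not the mod_virgule fix. The list of elements allowed to stay self-closed is the HTML 4 set of empty elements, the function name is made up, and the pattern will be fooled by a ">" inside an attribute value.

```python
import re

# HTML 4 elements that are legitimately empty and can stay self-closed.
VOID_ELEMENTS = {"area", "base", "br", "col", "hr", "img",
                 "input", "link", "meta", "param"}

# Tag name, optional attribute text, then the self-closing "/>".
SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)((?:\s[^<>]*?)?)\s*/>")


def expand_self_closed(markup):
    """Rewrite <em /> style tags as <em></em>, leaving void elements alone."""
    def fix(match):
        name, attrs = match.group(1), match.group(2).rstrip()
        if name.lower() in VOID_ELEMENTS:
            return match.group(0)
        return "<%s%s></%s>" % (name, attrs, name)
    return SELF_CLOSED.sub(fix, markup)


print(expand_self_closed('one <em /> two<br /> <a name="top"/>'))
# one <em></em> two<br /> <a name="top"></a>
```

A text-level filter like this is easy to bolt on but fragile. Doing the same thing on the parsed tree avoids the attribute edge cases.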
Redi: in theory yes, but the mod_virgule codebase is a scary mix of HTML 4 (and earlier), XHTML, and XML. Throw in the random markup coming in from syndicated blogs and the resulting tag soup is very difficult to normalize without breaking something. However, incoming blog markup was previously being normalized to XHTML by libxml2, and I'm now thinking we may have to switch that to HTML 4 to force the open/close tags. The function you mention produces different output depending on what markup type is specified on the tree (or on the individual node). So, parse the blog, walk the tree forcing it all to HTML 4, then ask libxml2 to export it. Maybe... I'm doing some work on the code today, so I'll let you know.
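The idea of serializing the same tree under HTML rules instead of XML rules can be shown with a small stand-in. The sketch below uses Python's xml.etree.ElementTree rather than libxml2, so it only illustrates the principle. It assumes the input is already well-formed XML, which real syndicated blog markup often is not, and the helper name is made up.

```python
import xml.etree.ElementTree as ET


def reserialize_as_html(xhtml_fragment):
    """Parse a well-formed fragment and write it back out using HTML rules."""
    root = ET.fromstring(xhtml_fragment)
    return ET.tostring(root, encoding="unicode", method="html")


sample = "<p>before <em/> after<br/></p>"

# XML serialization collapses the empty element into a self-closing tag.
as_xml = ET.tostring(ET.fromstring(sample), encoding="unicode")

# HTML serialization writes an explicit open/close pair for <em>.
as_html = reserialize_as_html(sample)

print(as_xml)
print(as_html)
```

Same tree, two serializers: the XML one emits the self-closing <em /> that trips up browsers, while the HTML one emits <em></em> and leaves genuinely empty elements like <br> unclosed.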
Another Update: I've got some code changes in that might (or might not) help with the broken tag problem. We'll have to see if any incoming blog posts break anything over the next day or so. Nothing new on the spam attack; it's still going strong. I'm going to look at implementing a few more security features in the code that might allow us to turn account creation back on without waiting for the attack to subside.