Older blog entries for robogato (starting at number 35)

More Minor Security Updates

I declared an Advogato hacking day today and got a little more work done on our security ToDo list. I've added a set of cryptographic nonce functions to generate tokens for email verification and CSRF prevention. The tokens have configurable expiration times. The new code replaces the hard-coded token generation used by the original cookie functions.
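For readers curious what an expiring token of this kind can look like, here is a minimal sketch in Python rather than mod_virgule's C. It is not the code running on Advogato: the HMAC-signed layout, the `SECRET_KEY` value, and the function names `make_token` and `check_token` are all assumptions made for illustration.

```python
import hashlib
import hmac
import secrets
import time

# Hypothetical per-site secret; a real deployment would load this from config.
SECRET_KEY = secrets.token_bytes(32)


def _sign(payload):
    """Return a hex HMAC-SHA256 signature for the payload string."""
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def make_token(purpose, lifetime_seconds, now=None):
    """Build 'expiry:nonce:signature' bound to a purpose such as 'csrf'."""
    now = int(time.time()) if now is None else int(now)
    expiry = now + lifetime_seconds
    nonce = secrets.token_hex(16)  # random value so tokens are unpredictable
    return f"{expiry}:{nonce}:{_sign(f'{purpose}:{expiry}:{nonce}')}"


def check_token(token, purpose, now=None):
    """Accept the token only if it is well formed, unexpired, and signed by us."""
    now = int(time.time()) if now is None else int(now)
    try:
        expiry_text, nonce, signature = token.split(":")
        expiry = int(expiry_text)
    except ValueError:
        return False  # wrong number of fields or non-numeric expiry
    expected = _sign(f"{purpose}:{expiry}:{nonce}")
    return hmac.compare_digest(signature.encode(), expected.encode()) and now < expiry
```

Binding the purpose into the signature means a token issued for a form can't be replayed as an email verification token. This sketch is stateless, so a token stays valid until it expires; a strict one-time nonce would also need server-side bookkeeping.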

I also added a generic email function that can be used for account verification. This replaced the hard-coded part of the password recovery email function.

I was able to get the CSRF token code integrated with the account creation forms. It's tested and live. Hopefully this will knock out a few more of our automated account spammers including the commercial Incansoft spamming tools. I've still got a little more work to do before I can turn on the email verification but we're nearly there.

12 Sep 2011 (updated 12 Sep 2011 at 22:29 UTC) »

Status Update

Advogato has been under a sustained attack from spammers since 11:00 UTC Sunday. The attack is originating from a botnet of at least several hundred nodes with worldwide distribution. The attack is automated and creates 10 to 20 new user accounts with large, spam-filled blog posts every minute. I discovered the attack around two hours after it started and immediately turned off new account creation.

Mod_virgule buffers the 100 most recent new accounts for display in the "recent people joining" box on the front page. The attackers blew past that number pretty quickly, so I had to use the web server logs to track down and remove the bad accounts. Removing them left the recent accounts buffer completely empty. It will fill up again once I'm able to turn new account creation back on.

I spent a while Sunday logging and blocking IPs for individual nodes of the attacking botnet but basically gave up after blocking the first hundred or so. With account creation off, the attackers fail to create accounts and what we're left with is a low-level DDoS attack. The bandwidth being used isn't disabling and hopefully the attacker will give up once they realize no new accounts are being created.

Other Fun

The switch to the libxml2 HTML parser solved a lot of internal problems but, as some of you have noticed, it introduced a new one. Libxml2 "thinks" in XML, and when it comes across a set of HTML tags with no content, such as <em></em>, it turns that into a self-closing tag: <em />. That's great if you're viewing the result with an XML parser, but most browser HTML parsers don't treat certain tags as self-closing and instead see an opening tag with no corresponding close. This has the effect of including all the subsequent markup on the page inside the offending tag, usually terminating display of the page.

It looks like only a handful of tags produce this effect, so it should be possible to filter them out. It may be possible to drop empty tag pairs before parsing, or to convert the self-closing tags back to open/close pairs afterwards.
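As a rough illustration of the second idea, converting self-closing tags back to open/close pairs, here is a sketch in Python. It is a regex filter over the serialized markup, not the libxml2-based approach mod_virgule would actually use, and the names `VOID_ELEMENTS` and `expand_self_closing` are made up for this example. A regex like this won't cope with every piece of tag soup (for instance, attribute values that contain "/>").

```python
import re

# HTML void elements: these never take a closing tag, so leave them alone.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
}

# Matches <name ... /> and captures the tag name and any attribute text.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9]*)((?:\s[^<>]*?)?)\s*/>")


def expand_self_closing(markup):
    """Rewrite <em /> as <em></em> for every non-void element."""
    def fix(match):
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_ELEMENTS:
            return match.group(0)  # e.g. <br /> is fine as it is
        return f"<{name}{attrs.rstrip()}></{name}>"
    return SELF_CLOSING.sub(fix, markup)
```

Running it over `a <em /> b` gives `a <em></em> b`, while `<br />` passes through untouched.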

Redi: in theory, yes, but the mod_virgule codebase is a scary mix of HTML 4 (and earlier), XHTML, and XML. Throw in the random markup coming in from syndicated blogs and the resulting tag soup is very difficult to normalize without breaking something. However, incoming blog markup was previously being normalized to XHTML by libxml2 and I'm thinking now, we may have to switch that to HTML 4 to force the open/close tags. The function you mention produces different output depending on what markup type is specified on the tree (or on the individual node). So, parse the blog, walk the tree forcing it all to HTML 4, then ask libxml2 to export it. Maybe... I'm doing some work on the code today, so I'll let you know.

Another Update: I've got some code changes in that might (or might not) help with the broken tag problem. We'll have to see if any incoming blog posts break anything over the next day or so. Nothing new on the spam attack; it's still going strong. I'm going to look at implementing a few more security features in the code that might allow us to turn account creation back on without waiting for the attack to subside.

2 Jun 2011 (updated 3 Jun 2011 at 19:32 UTC) »

Robogato Returns

We had a bad hardware crash recently and, as I was restoring Advogato to new hardware, I realized that it's been too long since I've devoted any significant time to improving the code around here. I took advantage of the downtime caused by the crash to make some final tweaks to the long-awaited libxml2 based HTML parser and made it live. It fixes a lot of the rendering problems already and will fix more once I make a few more tweaks.

I'm also working on improving security in general and making account creation by spammers harder in particular. I had a nice email exchange with dkg about the subject a while back. He took a look at the code and provided a laundry list of things that needed fixing or improving. I'm working on those now. The first change just went live this week - mod_virgule now requires the POST method for submitted forms. This minor change already stopped a couple of our automated account spammers who were creating accounts with GETs. Only the dumbest spammers were doing that, I'd think. Using POST isn't much harder. More changes to come.

If you're wondering what caused the increase in spam accounts we've been seeing for the last year, here's a possible contributor: Incansoft, apparently a purveyor of web-based spam tools, added an Advogato attack to a spamming tool they sell called Web20Bot (sorry, not going to link to it but you can google it). Web20Bot will create phony account profiles containing your backlink spam on 20 websites including Advogato.org, squidoo.com, wordpress.com, blogger.com, tumblr.com, and livejournal.com. They claim Web20Bot handles email verification and captchas, so working out a defense may be interesting. I doubt any of their spam lasts more than 48 hours around here anyway but it would be nice to make life harder for them. (Incidentally, if someone were to come up with a copy of this thing so we could analyze it, that might be cool - maybe we could help other sites being attacked by it too.)

Update: Thanks for pointing out those issues, Redi. I've fixed the diary edit problem, it should not have been checking for a POST. The <person>, <project>, and <wiki> tags were special cases in the old HTML handler. If one is broken, all three probably are. I'll get on that now. It will take me a little while to track down the problem. <proj> was deprecated in favor of <project> way back in the Raph days but the code checking for <proj> wasn't dropped until this most recent update. I didn't realize anyone still used it. I can add it back in.

Update 2: Ok, found the problem. The old tag handlers output directly to the Apache buffer while the new handlers modify the XML tree, which is rendered to the buffer later. I need to modify or replace the handlers for those three tags. I'll try to get to it today if time allows.

Update 3: I think the special tag issue is fixed now. Let's try this code for a day or so and see if any problems show up.

<person> test: redi

<proj> test: mod_virgule

<project> test: mod_virgule

<wiki> test: WikiPedia:Advogato.org

Watch for Spammers

If you're wondering about the source of the recent increase in phony users signing up for Advogato accounts, I think I've found it. A number of Russian SEO/spammer blogs are discussing a list of websites that seem to be highly trusted by Google based on the ratio of pages in the main Google index to the supplemental Google index. Advogato is #16 on the list. (I'd provide some links but giving them links from Advogato is the last thing we should do. If you're curious you should be able to find them using a site like Technorati to find blogs that have linked to Advogato in the last few weeks.)

A side effect has been a big bandwidth hit. I thought at first we'd been slashdotted. But the main result is a rash of SEO spammers signing up for Advogato accounts and trying to find some way to get backlinks to their link farms and spam sites. Average survival time for their profiles has been less than 48 hours so probably nothing to worry about but everyone should take a look at the "recent people joining" list and flag anyone who looks like spam. Hopefully it will die down in a week or two.

24 Feb 2008 (updated 21 Jan 2009 at 19:12 UTC) »

Test post for the libxml2 HTML parser

In theory, the libxml2 HTML parser should make best guesses on how to fix screwed up, illegal HTML and all tags should get closed at the end of this diary entry, preventing problems in diary entries that follow or elsewhere on the page.

bold tag with no close

italics tag with no close

strike tag with no close

Update Jan 2009: after a long downtime, I'm finally working on the HTML parser again. Should have it live this month!

Advogato Status Report

My New Year's resolution is to start doing monthly status reports again! Here's the first one.

Even though I haven't posted a status update in a while, minor code updates have continued. To find out what's changed in the live mod_virgule code running Advogato, see the changelog. It's always there and nearly always up to date.

The biggest change has been in the XML file store locking code. The previous system relied on a site-wide read/write lock that locked out access to the entire database when writes were happening. This was getting to be a problem because of trust recalculations and diary syndication that happens at the top of the hour. Write locks were often clogging things up for 10 to 15 minutes per hour.

But it's all good now. All the locking code has been totally ripped out and replaced with file-level locking. There should almost never be any detectable site delays caused by locking now. Besides fixing the hourly slowdowns, this also gives us a little more breathing room to continue growing.
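To make the difference concrete, here is a small sketch of per-file locking in Python. It is only an analogy for the C code in mod_virgule: the helper name `locked_file` is made up for this example, and it relies on `fcntl.flock`, which is available on Unix-like systems only. Readers take a shared lock and writers an exclusive one, but only on the single file being touched, so a write to one record no longer blocks access to everything else.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def locked_file(path, write=False):
    """Open one record file under a shared (read) or exclusive (write) lock."""
    flags = (os.O_RDWR | os.O_CREAT) if write else os.O_RDONLY
    fd = os.open(path, flags, 0o644)
    handle = os.fdopen(fd, "r+" if write else "r")
    try:
        # Blocks only if another process holds a conflicting lock on THIS file.
        fcntl.flock(handle, fcntl.LOCK_EX if write else fcntl.LOCK_SH)
        yield handle
    finally:
        handle.close()  # closing the descriptor also releases the lock
```

With a site-wide lock, an hourly trust recalculation would hold up every request; with this shape, a request for one profile waits only if that profile's own file is being written at that moment.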

Another recent change is a patch from fzort that improves the HTML parsing code to eliminate undesirable tag attributes. The long-term plan is still to switch to libxml2's HTML parser and junk the one in mod_virgule but, until then, this should make things a little more secure.

A few other fixes and improvements:

The GUID of syndicated blog posts is now preserved when they go out on the Advogato diary RSS feed.

Mod_virgule now has built in support for Google Analytics. Drop your GA ID code into the config.xml and the appropriate GA markup appears on every page throughout the site.

Joe Presbrey of MIT contributed a patch for an external FOAF URI on the user profile. This allows you to link your Advogato FOAF to any other existing FOAF profile you may have, helping to consolidate your online identity.

The computed trust level for each user is now exported via FOAF, referencing a local RDF schema that describes the trust levels. This mechanism was suggested by Sean B. Palmer and Dan Connolly on the W3C #swig IRC channel.

31 Aug 2007 (updated 31 Aug 2007 at 23:33 UTC) »

Advogato Status Report

A new rev of mod_virgule code is live on Advogato. See the changelog for the details. Here are a few highlights.

A discussion between ncm, raph, and chrisd speculated on why there seemed to be a decline in Google rankings for individual blog content on Advogato lately. It was suggested that a change in the Google ranking algorithm may be placing less value on pages with dynamic URLs like http://www.advogato.org/person/ncm/diary.html?start=191. Advogato has long had static URLs for individual articles, so I've added similar support for each individual blog post. If you click the permalink marker beside one of your blog posts, you'll see it now goes to a static URL with just that one post on the page instead of to a dynamic URL that includes a range of posts. For example: http://www.advogato.org/person/ncm/diary/190.html. The old, dynamic system is still in place so search engines and existing links will get to the right place, of course. There's another advantage to having the static URLs to individual blog entries. These will be used for comment pages eventually. Yes, blog comments are really coming. I promise. Some day.

There's also a fix for a minor foaf:mbox_sha1sum bug that was noticed by Andreas Harth.
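The nature of that bug isn't covered here, but for reference, the FOAF vocabulary defines foaf:mbox_sha1sum as the SHA-1 hash of the mailbox URI including its "mailto:" prefix. A minimal Python sketch (the function name and the example address are made up for illustration):

```python
import hashlib


def mbox_sha1sum(email):
    """Return the FOAF mbox_sha1sum: SHA-1 of the full 'mailto:' URI, in hex."""
    return hashlib.sha1(f"mailto:{email}".encode("utf-8")).hexdigest()


# Hypothetical address, used only to show the call.
digest = mbox_sha1sum("someone@example.org")
```

Hashing the bare address without the "mailto:" prefix produces a different value, which is an easy mistake to make when generating FOAF by hand.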

You may have noticed that our Italian cittaditorino spammers were back with a vengeance the last couple of weeks. The community spam flagging system seems to be controlling them. Most of the bogus accounts are being deleted within a few days of creation. At ncm's suggestion, I've added rel="nofollow" attributes to all links to untrusted users in the recentlog, recent people joining list, and Advogato People index. There were already nofollows on all links created by untrusted users but this new addition should prevent search engines from even indexing their profile and blog pages. With all these spam control measures in place, keep in mind it's a little harder than it used to be for real users to create an Advogato account and get certified. Well-known users aren't having much trouble and the new trust injected by adding mako as a seed has helped tremendously. But there are users here and there who haven't collected enough certs to become trusted, like pabs3.
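The nofollow rule is simple enough to sketch in a few lines. This is Python rather than mod_virgule's C, and the `person_link` helper and the exact set of levels treated as trusted are assumptions for the example, not the site's actual logic:

```python
from html import escape

# Assumption for this sketch: anyone certified above Observer counts as trusted.
TRUSTED_LEVELS = {"Apprentice", "Journeyer", "Master"}


def person_link(username, trust_level):
    """Render a profile link, adding rel="nofollow" for untrusted users."""
    href = f"/person/{escape(username, quote=True)}/"
    rel = "" if trust_level in TRUSTED_LEVELS else ' rel="nofollow"'
    return f'<a href="{href}"{rel}>{escape(username)}</a>'
```

So a link to a Master-level user renders as a plain anchor, while a link to a freshly created Observer account carries the nofollow hint.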

That's all the news for now but more new features are on the way.

The URL rendering bug that redi spotted has been fixed, I think. Looks like it was an artifact of the Apache APR 1.3 to 2.0 upgrade that had gone unnoticed for quite a while. If anyone spots any other URL issues in the project section, let me know.

Advogato Status Report

A new rev of mod_virgule code is live on Advogato. See the changelog for the details.

Aside from the usual minor bugfixes and tweaks, there are two new features you may have noticed already.

New certification indicators: A visual indication is now added to trust certifications that are less than 30 days old. This should make it easier to spot new certs on the user profiles. You can check this out on your own user profile if you've certified anyone, or been certified by anyone, in the last 30 days.

Article lists: Ever wonder how many Advogato articles you've posted? Or wanted to read other articles by a particular poster? Each user profile now includes a reverse chronological list of the 10 most recent articles posted by that user. For users who are more prolific, there is a link to a separate page that includes a complete listing of all articles posted by that user.

In addition to providing a new way to explore Advogato's articles, this should provide another direct route for search engine robots to find the static links to the articles.

11 Jul 2007 (updated 11 Jul 2007 at 20:40 UTC) »

Advogato Status Report

New mod_virgule code is live on Advogato. See the changelog for the details.

More minor bug fixes. The aggregator should do a better job now of rejecting dupes from feeds that retroactively alter the post date on blog entries. The no_cache and no_local_copy flags in the Apache request records are now set for logouts to prevent browsers from caching old logout results and to prevent the server from sending a 304. This was preventing some Galeon users (and possibly users of other browsers) from logging out.
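One way to make an aggregator robust against altered post dates is to identify entries by something other than the date. The sketch below, in Python, keys each entry on its GUID and falls back to the link or title; this illustrates the general idea only and is not a description of how mod_virgule's aggregator does it. The dictionary field names and the function names are assumptions.

```python
def entry_key(entry):
    """Identify a feed entry by GUID, then link, then title; never by date."""
    return entry.get("guid") or entry.get("link") or entry.get("title", "")


def new_entries(seen_keys, feed_entries):
    """Return entries not seen before, recording their keys in seen_keys."""
    fresh = []
    for entry in feed_entries:
        key = entry_key(entry)
        if key and key not in seen_keys:
            seen_keys.add(key)
            fresh.append(entry)
    return fresh
```

An entry whose date changes between two polls keeps the same key, so it is skipped the second time around instead of being syndicated again.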

I replaced the social bookmarking test links on the article pages with a fully functional social bookmarking tool, linked from the standard "share this" icon. The share link is now available on project and profile pages as well as on articles. If someone has a favorite social bookmarking service that's not listed yet, let me know and I'll add it.

Time has been a scarce resource for me lately, so progress through the ToDo list has been slower. More updates as time allows and, as always, patches are welcome.

Social Networking

Google sponsored a CMU project last year to study and reinvent online social networking. The result was Socialstream, a design concept based on the idea of a Unified Social Network (USN). A lot of what they came up with sounds similar to what the semantic web folks are working on with OpenID and formats such as FOAF and DOAP. Basically, they're suggesting that social network sites standardize on a data sharing format that would allow them to interact easily with each other and become part of a larger network of sites.

The project also did some interesting research, ranging from social networking theory and taxonomy to identifying common complaints about social networking sites and desirable features. They also researched who uses social networks and broke down the results into archetypical user types. The researchers also created a video demo of the Socialstream concept site. Some of the ideas they mention are already in Advogato or are on the ToDo list. I think there are plenty of other ideas here we can incorporate into Advogato as well.

Trust/Authority Metrics

Someone pointed out a link to an article by Michael Jensen in the Chronicle Review: The New Metrics of Scholarly Authority. It talks a lot about Web 2.0 authority models. It mentions the Google PageRank system but, oddly, leaves out any mention of the mod_virgule trust metrics implemented on Advogato. Still, it's an interesting read.
