Older blog entries for robogato (starting at number 13)

15 Jan 2007 (updated 15 Jan 2007 at 22:11 UTC) »

Advogato Status Report

A new rev of mod_virgule code went live today. See the changelog for the details. This upgrade required taking Advogato offline for about an hour to modify the XML database.

Until today, mod_virgule has stored timestamps in the XML data store that reflected the server's local time zone. The code then made assumptions about the time zone when rendering articles, posts, or RSS feeds. Prior to 3pm, 1 October, 2006, the server's local time zone was US Pacific time. When Advogato got transferred to our hosting facility, the new server was using the US Central time zone. This created a further complication because of the two-hour time shift. Adding the blog aggregator made things worse because 99% of the incoming blog feeds use UTC timestamps.

Having to juggle three time zones on a regular basis was creating a bit of a headache for me. I decided it was time to get things under control before the code got so complicated that only a Time Lord from Gallifrey could understand it. So mod_virgule now uses UTC for everything. The code changes were relatively straightforward but normalizing Advogato's rather large XML data store was another matter. I wrote a Perl program that recursively scanned Advogato's 30,000+ XML files looking for timestamps in several different formats and adjusted them to UTC (which required a different offset depending on whether they were recorded before or after 3pm, 1 Oct, 2006). That's the reason for the brief downtime.
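As a rough illustration of the normalization step, here is a minimal Python sketch. It is not the Perl program that was actually run. It assumes a single naive "YYYY-MM-DD HH:MM:SS" timestamp format (the real data store used several), and it assumes that any local timestamp earlier than 3pm on 1 Oct 2006 was written under US Pacific time and anything later under US Central time. The names to_utc, CUTOVER and FORMAT are made up for the example.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: 3pm, 1 Oct 2006, compared against the naive local timestamp.
CUTOVER = datetime(2006, 10, 1, 15, 0, 0)
PACIFIC = ZoneInfo("America/Los_Angeles")
CENTRAL = ZoneInfo("America/Chicago")
FORMAT = "%Y-%m-%d %H:%M:%S"  # hypothetical; one of several formats in practice


def to_utc(stamp: str) -> str:
    """Convert a naive local timestamp string to a UTC timestamp string."""
    local = datetime.strptime(stamp, FORMAT)
    # Pick the zone the server was using when the timestamp was recorded.
    zone = PACIFIC if local < CUTOVER else CENTRAL
    # Attaching the zone lets the tz database apply the right DST offset.
    aware = local.replace(tzinfo=zone)
    return aware.astimezone(timezone.utc).strftime(FORMAT)


if __name__ == "__main__":
    for sample in ("2006-09-15 12:00:00", "2006-11-15 12:00:00"):
        print(sample, "->", to_utc(sample))
```

Using named zones rather than fixed offsets means daylight saving time is accounted for on both sides of the cutover.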

So, anyway, we're back up and everything should be working the same as always aside from being on UTC time rather than Central time. If anyone notices any breakage, let me know.

5 Jan 2007 (updated 7 Jan 2007 at 02:47 UTC) »

Advogato Status Report

The first new rev of mod_virgule code for 2007 went live today. See the changelog for the details. Basically, it's all bug fixes.

The important one is a rewrite of the diary entry storage code. For users whose posts arrive via syndication, the new code will allow local editing and XML-RPC editing without the save wiping out all the extra XML tags that store syndication state info. This bug was causing the occasional duplicate of syndicated posts (and it's why I warned against mixing local blogging with syndicated blogging when we turned on the aggregator).

Update: Hmmmm... okay, there's still at least one other problem with mixed local and syndicated blogging that can lead to duplicated entries. I'll see if I can track it down soon...

Update 2: Fixed what should be the last issue causing problems for mixed posting. It may actually be safe now. Unfortunately, I discovered one more cause of duplicated posts. There's an RSS variant that retroactively alters the post time of an entry each time it's edited, which confuses our simple little aggregator into thinking it's a new post. Working on a fix now. The world would be such a nicer place if everyone used a sane syndication method like Atom...

Update 3: RSS feeds with shifting date stamps should now be handled a little better. At least if the feed in question has unique item identifiers (some do, some don't - you never know what you'll get with RSS).
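To show why a unique item identifier helps here, this is a small Python sketch of the general idea, not the mod_virgule aggregator code. Items are plain dictionaries, and the names item_key and new_items, along with the "guid", "title" and "pubdate" keys, are assumptions made for the example.

```python
def item_key(item: dict) -> tuple:
    """Build an identity key for a feed item.

    If the feed supplies a unique identifier, use it alone so that a
    shifting date stamp cannot make an old post look new. Otherwise fall
    back to title plus date, which is still fooled by a changed date.
    """
    guid = item.get("guid")
    if guid:
        return ("guid", guid)
    return ("fallback", item.get("title", ""), item.get("pubdate", ""))


def new_items(seen: set, items: list) -> list:
    """Return the items not seen before and remember their keys."""
    fresh = []
    for item in items:
        key = item_key(item)
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    return fresh


if __name__ == "__main__":
    seen = set()
    first = [{"guid": "post-1", "title": "Hello", "pubdate": "Mon"}]
    edited = [{"guid": "post-1", "title": "Hello", "pubdate": "Tue"}]
    print(len(new_items(seen, first)))   # first fetch
    print(len(new_items(seen, edited)))  # same post, shifted date
```

A feed without identifiers falls through to the weaker key, which matches the caveat above that the fix only helps when identifiers are present.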

Advogato Status Report

A new rev of mod_virgule code went live today. See the changelog for the details - but only if you're really bored. There were only very, very minor changes. With the holidays coming up, I'm not sure how much time I'll have to work on the code over the next couple of weeks. So don't expect any spectacular new features.

What would be nice is seeing one shiny new article posted on Advogato before the end of December. If any Advogato users presented at the recent OSDC and have an interesting paper, maybe you could post it here as an article. Just a thought.

Advogato and Greenhouse Gas

I noticed pphaneuf's post about Second Life, computer power consumption and the relation to CO2 emissions. I may not have mentioned before that the server Advogato is hosted on now, and our entire little facility, is powered by 100% wind-generated power. We recently got our EPA Green Power Partner approval. I've never calculated the electricity used by just Advogato but overall we use about 4,000 kWh per month. According to most estimates I've seen, this translates into 6,000 - 8,000 pounds of CO2 that we avoid putting into the air each month. And we aren't the first. I've seen several other hosting facilities that have gone to 100% non-polluting power providers. Here in Texas, it's actually saving us money too, since the cost of wind tends not to be affected much by the rising cost of gas and coal. So maybe some of the Second Life users should ask about that.

7 Dec 2006 (updated 7 Dec 2006 at 01:37 UTC) »

Advogato Status Report

A new rev of mod_virgule code went live today. See the changelog for the details.

I've added support for a couple of additional RSS variants with ever more unusual date stamp formats. In theory the RSS pubDate tag is supposed to use the date format described in RFC822. The first problem is that RFC822 allows a lot of variation. The second problem is that RFC822 specifies a two-digit year. For obvious reasons most RSS feeds use a four-digit year. Mod_virgule's first line of defense is to call the Apache APR routine apr_date_parse_rfc(), which will parse all date strings that actually comply with RFC822, plus nine variants that are not strictly RFC822 compliant but are commonly seen in the wild. So far, at least one common blogging app, Blosxom, produces a pubDate field that is not RFC822 compliant and can't be parsed by apr_date_parse_rfc(). I've added a custom strptime() call that handles these. A patch for the Apache APR folks is in the works.
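The "standard parser first, custom strptime() second" approach can be sketched in Python as follows. This is only an analogy: it uses Python's email.utils.parsedate_to_datetime in place of apr_date_parse_rfc(), and the fallback formats are made-up examples, not the actual format Blosxom emits. The function name parse_pubdate is also hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical non-compliant formats; the real fallback format is not shown
# in the post, so these are placeholders for illustration only.
FALLBACK_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y/%m/%d %H:%M:%S"]


def parse_pubdate(text: str):
    """Parse a pubDate string; return an aware UTC datetime or None."""
    parsed = None
    try:
        # First line of defense: the RFC 822 style parser.
        parsed = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        parsed = None
    if parsed is None:
        # Second line of defense: try each custom strptime() format.
        for fmt in FALLBACK_FORMATS:
            try:
                parsed = datetime.strptime(text.strip(), fmt)
                break
            except ValueError:
                continue
    if parsed is None:
        return None
    if parsed.tzinfo is None:
        # Assumption: a date with no zone information is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


if __name__ == "__main__":
    print(parse_pubdate("Thu, 07 Dec 2006 01:37:00 -0600"))
    print(parse_pubdate("2006-12-07 01:37:00"))
```

Returning None for anything unparseable leaves the decision about undated items to the caller.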

Some RSS feeds don't have a pubDate tag at all. Instead they have a date tag which, instead of RFC822, contains an RFC3339 formatted date string. This is actually much nicer, since it's a slightly more sane format and is the same one used in Atom feeds, so we already have code for handling it.
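For comparison, an RFC3339 date string is regular enough that a few lines of Python can handle the common cases. This is a sketch, not the mod_virgule code; the function name parse_rfc3339 is made up, and it assumes the string carries a "Z" or numeric offset as RFC3339 requires.

```python
from datetime import datetime, timezone


def parse_rfc3339(text: str) -> datetime:
    """Parse an RFC3339 timestamp and return it as an aware UTC datetime."""
    text = text.strip()
    # Older Python versions do not accept a trailing "Z", so spell it out.
    if text[-1:] in ("Z", "z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text).astimezone(timezone.utc)


if __name__ == "__main__":
    print(parse_rfc3339("2006-12-07T01:37:00Z"))
    print(parse_rfc3339("2006-12-06T19:37:00-06:00"))
```

Both sample strings name the same instant, so both print the same UTC time.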

Speaking of Atom, the mod_virgule aggregator now supports the old, deprecated Atom v0.3 feeds in addition to the current Atom v1.0 standard.

So here's what we support right now:

  • Atom 0.3
  • Atom 1.0
  • RSS 0.91 *(only if optional pubDate or date tags are included)
  • RSS 0.92 *(only if optional pubDate or date tags are included)
  • RSS 2.0
  • RDF Site Summary 0.9 *(untested)
  • RDF Site Summary 1 *(all variants seen so far work)
  • RDF Site Summary 1.1 *(untested)

I wish I could support the RSS 0.91/0.92 feeds that don't have any sort of time or date stamps at all but it would require some reworking of the code in the aggregator that sorts out which posts are new and which have been seen before. In most cases RSS 0.91/0.92 allows the use of both date and pubDate, so if you make sure those tags are included, things should work fine. Otherwise, your best bet is to use something a little more recent like RSS 2.0 or Atom 1.0.

The other update this week was a performance improvement. Each hour the trust metric and blog interest eigenvector ratings are recalculated. The eigenvector recalculation takes several minutes to complete. In the past the process held a read lock on the XML database, preventing any other process from taking a write lock. This caused some operations on Advogato to block (such as clicking on the "Read more..." link of articles, which writes an update to the user's "last read" pointers). This problem is now fixed. The site should seem significantly less sluggish at the top of the hour when the update runs. The eigenvector processing now releases the read lock and gives up its time slice, then re-acquires the lock on each iteration. The total processing time is slightly longer (from 3 minutes to 3.25 minutes) but during that time the site can be used normally without feeling slow.
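The locking pattern described above can be sketched in Python like this. It is a toy model under several assumptions: a plain threading.Lock stands in for the read lock on the XML database, a simple power iteration stands in for the real eigenvector calculation, and the names db_lock and power_iteration are invented for the example.

```python
import threading
import time

# Stand-in for the database lock. The real code uses a read/write lock;
# a plain mutex is enough to show the release-and-reacquire pattern.
db_lock = threading.Lock()


def power_iteration(matrix, iterations=50):
    """Estimate the dominant eigenvector, holding the lock per iteration only."""
    n = len(matrix)
    vector = [1.0 / n] * n
    for _ in range(iterations):
        with db_lock:  # acquire the lock for this iteration only
            product = [
                sum(matrix[i][j] * vector[j] for j in range(n))
                for i in range(n)
            ]
            total = sum(product)
            vector = [value / total for value in product]
        # Lock is released here; yield so a waiting writer can run.
        time.sleep(0)
    return vector


if __name__ == "__main__":
    print(power_iteration([[2.0, 0.0], [0.0, 1.0]]))
```

The trade-off is the one noted above: acquiring and releasing the lock on every pass costs a little total run time, but no writer ever waits longer than one iteration.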

Advogato Status Report

A new rev of mod_virgule code went live Wednesday, with some additional fixes going live last night. See the changelog for the details.

This release adds support to the aggregator for blog entry updates via syndicated feeds. As far as I can tell, only Atom supports updates in any obvious way. In theory, it should be possible to detect updates to RSS or RDF Site Summary feeds by doing a diff on the content of the entry in the feed against the local copy or by making some other type of guess but it didn't seem worth the trouble right now (patches accepted, of course). Meanwhile, updates should work fine if you're using Atom. For an example see Zaitcev's blog. The Advogato date stamp and "updated" date stamp reflect the time at which the original post and the update respectively hit Advogato. The date stamps in the syndication link at the bottom of the entry reflect the times claimed in the Atom feed for the original post and update. All times have been converted to server local time (currently CST but I feel a change coming...).
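For RSS or RDF Site Summary feeds, the content comparison mentioned above might be approximated by storing a digest of each entry and checking it on later fetches. The following Python sketch is purely hypothetical, since nothing like it is in this release; the function names content_digest and is_updated are invented here.

```python
import hashlib


def content_digest(html: str) -> str:
    """Return a digest of the entry content, ignoring whitespace changes."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_updated(stored_digest: str, feed_content: str) -> bool:
    """Guess whether the entry in the feed differs from the local copy."""
    return content_digest(feed_content) != stored_digest


if __name__ == "__main__":
    stored = content_digest("<p>Hello  world</p>")
    print(is_updated(stored, "<p>Hello world</p>\n"))   # whitespace only
    print(is_updated(stored, "<p>Hello, world</p>"))    # real edit
```

This only detects that something changed. It still needs a stable way to match a feed item to its local copy, which is exactly what feeds without unique identifiers lack.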

It looks like we've now got 10 ex-Advogatoans who've returned to the recentlog via the syndication feature. Hopefully more will follow as word gets out that it's available.

As Professor Farnsworth likes to say, "Good news, everyone". The mod_virgule codebase is now in a Subversion repository. The latest changelog can be found in mod_virgule/trunk/ChangeLog. If you want to submit any patches, make them against the code in mod_virgule/trunk. Release versions can be found in mod_virgule/tags. To check out the latest development code:

svn checkout http://svn.dprg.org/repos/mod_virgule/trunk

Or to get the current release:

svn checkout http://svn.dprg.org/repos/mod_virgule/tags/1.41-20061201

Advogato Status Report

A new rev of mod_virgule code went live today. See the changelog for the details. This release includes slightly refactored diary (blog) code that does two things: 1) it can display permalinks and timestamps gathered from syndicated blog entries and 2) it reduces the amount of code by providing a single function to render blog entries.

If you look at an Advogato blog entry posted via syndication, you'll notice the new features. The blog entry's title is shown in bold, and a permalink to the original posting is appended in grey at the end of the post.

I've also added support for more variants of RSS.

I think the blog aggregation code is solid enough now to let people know about it. It would be nice if someone whose blog is syndicated over at Planet (former) Advogato could post something about it. Any ex-Advogato users who'd like to see their blog return to the Advogato recentlog need only log into their account, check the "Syndicate your blog from another site" box, and add the URL of an Atom, RSS, or RDF Site Summary feed.

I filed a bug report and patch for the UTF8ToHtml() function in libxml2 to correct the handling of UTF8 characters like the Chinese Han ideographs here on Advogato. DV indicated he'd accept the patch pending regression testing.

17 Nov 2006 (updated 17 Nov 2006 at 06:44 UTC) »

Advogato Status Report

Okay, I think we have a fix for badvogato's Chinese character problem. I've posted four test cases below. Remember that even with mod_virgule working 100%, some browsers may not have a UTF-8 font that will render every possible character correctly. If your UTF-8 font is missing a character it will normally display a little box with the character code in it.

This one was a brain teaser. Turns out the problem has been there (in my codebase) for well over a year and was never noticed because most bloggers at robots.net post in English. I added the accept-charset="UTF-8" to all the forms generated by mod_virgule some time back as part of an attempt to make it more UTF-8 friendly. As it turns out, one of the older mod_virgule functions, virgule_nice_htext(), is not UTF-8 safe. It assumes the input is ASCII or, at least, something where one byte = one character. UTF-8 characters that were multiple bytes were getting mangled, leading to undesirable results.

Initially I thought a fix would be as simple as passing the form data through the libxml2 function UTF8ToHtml() which should convert UTF-8 to ASCII + encoded entities. Many hours later, I figured out this just doesn't work. Due to what I believe is a bug in UTF8ToHtml(), it fails on valid UTF-8 strings that contain characters for which there is not a named HTML entity value. That means it fails on almost all UTF-8 strings that contain anything other than common European variants of ASCII characters. A Latin character with an acute or a circumflex is converted correctly but, for example, a Chinese ideograph would cause the conversion process to terminate with an error.

In the end, I patched UTF8ToHtml() to use numerical entities in this case and now all seems to be well. I'll run this by DV and see if incorporating the patch upstream is warranted.
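The effect of falling back to numerical entities can be illustrated in a few lines of Python. This is not the libxml2 patch itself, just a sketch of the same idea using Python's built-in "xmlcharrefreplace" error handler; the function name to_ascii_with_entities is made up. Unlike UTF8ToHtml(), this sketch never emits named entities and does not escape markup characters such as & or <.

```python
def to_ascii_with_entities(raw: bytes) -> str:
    """Decode UTF-8 bytes and return ASCII text with numeric entities."""
    # Decode first so multi-byte sequences are treated as whole characters
    # rather than as independent bytes.
    text = raw.decode("utf-8")
    # Any character outside ASCII becomes a decimal reference like &#20013;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    sample = "é中".encode("utf-8")
    print(len(sample))                     # bytes, not characters
    print(to_ascii_with_entities(sample))
```

The byte count in the example is larger than the character count, which is the one byte = one character assumption that breaks on multi-byte input.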

UTF-8 Tests

1. Problematic Han ideographs as mentioned in the Chinese XML FAQ:

兡也包因沘氓侷柵苗孫孫財 崧淫設弼琶跑愍窟榜蒸奭稽 霄瓢館縲擻鼕孃魔釁佉沎岠 狋垚柛胅娭涘罞偟惈牻荺傒 焱菏酡廅滘絺赩塴榗箂踃嬁 澕蓴醊獧螗餟燱螬駸礑鎞瀧 鄿瀯騬醹躕鱕

2. Cut-and-paste sample from hjclub.com website:

今天在海归网上浏览,发现一个贴子:《[保陈良宇的出笼新解释]胡 锦涛被套牢 陈良宇是赢家不是输家?》 (海纳百川 www.hjclub.com)

粗读了一下,觉得这篇文章大有深意,跟党中央不太一致是肯定的。 我看了一下别的网站,文学城、万维都登了。但海归网是商业网站, 不能成为政治斗争的牺牲品。海归网的版主因为国庆长假,未必会上 网看着。所以我就顺手删去了这个贴子。我删贴其实没有什么用处, 因为这个贴子在海外已经广泛流传。 (海纳百川 www.hjclub.com)

3. Sample from badvogato's blog






4. Cut-and-paste from Wikipedia language menu:

# العربية # Bahasa Indonesia # Български # Català # Česky # Dansk # Deutsch # Eesti # Español # Esperanto # Français # עברית # Hrvatski # Italiano # Nederlands # 日本語 # 한국어 # Lietuvių # Magyar # Norsk (bokmål) # Polski # Português # Română # Русский # Slovenščina # Slovenčina # Српски / Srpski # Suomi # Svenska # తెలుగు # Türkçe # Українська # 中文

Advogato Status Report

A new rev of mod_virgule code went live today. See the changelog for the details. Other than a few more bug fixes, the big change is the addition of a blog aggregator. This will allow Advogato users who keep their blog somewhere else to syndicate it here so it shows up in the recentlog. There are already seven users whose posts have returned to the recentlog. Hopefully more past Advogato users will follow.

Initially the aggregator supports Atom v1.0, RSS v0.91, v0.92, v2.0, and RDF Site Summary (sometimes known as RSS v1.0, a fork of "real" RSS). My recommendation is to use Atom v1.0 if you've got it, with RSS v2.0 as a safe alternative. I expect there are still some bugs to work out, so bear with me for a week or so as we sort things out. There are a few known caveats:

  • Due to limitations in the existing recentlog code, bursts of multiple syndicated entries from the same user that arrive within a narrow time window will only result in one recentlog entry. This is only likely to be noticed the first time the feed is grabbed when maybe 5 or 10 entries get sucked in at once.
  • The blog post title, link, and original posting date are stored locally but the current diary code doesn't display them yet. The additional info should start showing up after the next code release. Soon...
  • Some variants of older RSS (v0.xx) feeds may produce unexpected results. There seem to be an endless number of variations of the RSS formats and I may not have accounted for them all yet.
  • RDF Site Summary format is more complex than Atom or RSS. It's a "modular standard" with dozens of different modules. Trying to parse the output of every conceivable combination of modules is non-trivial. Fortunately, this format isn't very common. Right now, I'm parsing a couple of combinations that use RDF Site Summary v1.0, plus the date tag from the Dublin Core module and the content encoding fields of the most recent draft version of the Content module. That's working for the one RDF Site Summary feed I know of on Advogato. If you can't use Atom or RSS and your RDF Site Summary feed doesn't work, send me a link and I'll try to support it.
  • It will be safer either to use Advogato for blogging or to syndicate your blog here from another site. Mixing the two options, while possible, may produce unexpected results with regard to the ordering of the posts if you post multiple times per day.

9 Nov 2006 (updated 9 Nov 2006 at 16:26 UTC) »

Advogato Status Report

A new rev of mod_virgule code went live this morning. This is a bugfix-only release to correct a couple of bugs introduced in the last version.

The missing projects are now visible again thanks to the addition of a missing pair of brackets around an if statement.

A bug in the account deletion code was causing only the first reference to a user to be deleted from the recentlog. The switch to multiple recentlog posts revealed the problem. This is now fixed also.

There was some doubt expressed about whether a recently deleted account was actually spam. I've restored two accounts deleted last night, phpgurru and xerox (the most recent blog post of each account was lost as both were deleted after the Wednesday backup and just prior to the Thursday backup). I'm not sure what these accounts are. Maybe just non-native English speakers? Xerox appears to have been a member since 2002 but most of Xerox's blog posts are either in Chinese or some sort of autogenerated content. Maybe another Chinese-speaking Advogato user could check out Xerox's blog and give us a clue as to what it's all about?

I've increased the spam score needed for account deletion from 10 to 15. Now that most of the easy-to-ID spammers are gone, it probably makes sense to require a larger consensus of users before doing something as drastic as deleting an account.
