Older blog entries for robogato (starting at number 15)

Advogato Status Report

I'm working on more code improvements but it will probably be next week before anything interesting emerges. In the meantime...

FAQ: I've added the beginnings of an Advogato FAQ to the site to help cut down on the time I spend answering emails. At present, there's no index and the questions are roughly in order of how frequently they're asked (okay, one or two I just made up - I suppose they're Frequently Imagined Questions!). Have a look and don't hesitate to point out errors or new questions that need to be added.

FOAF: Does anyone have any strong opinions on FOAF? Someone requested we add a FOAF file for each profile and this looks like it would be relatively easy to do. I'm not entirely sure I grok what the point of it is though. Does anyone, anywhere actually use FOAF RDF files for anything useful? Would it be a Good Thing if we suddenly add 10,000 people to the FOAF-o-sphere?

Article Quality: One thing that still seems to need fixing on Advogato is the quality of articles posted on our home page. At present every trusted Advogato user has the freedom to post articles. Unfortunately, not every trusted Advogato user has the ability to post relevant, quality articles. Is there a way to enforce quality without taking away everyone's freedom to post? For background see these two previous Advogato discussions on this subject:

Advogato Status Report

If you haven't been following the saga of Netscape and the RSS 0.91 DTD, here's the summary: On Jan 12 the folks at Deviceforge noticed that Netscape had removed the DTD from their website sometime after Jan 1, 2007. After Slashdot picked up on it, enough people complained that even someone at Netscape acknowledged the problem.

Yesterday, we got an official pronouncement from Netscape. They've agreed to restore the DTD but only until July 1, 2007 after which it will be removed again. Why? According to Netscape, your application shouldn't be "relying on the availability of a static document on a third-party Web server" like, say, a DTD. It's not clear what will happen to RSS 0.91 after July 1. Maybe Netscape will transfer their copyright on the DTD to the W3C and the URL will change. Maybe everyone will have to update their RSS software to ignore the DTD. Maybe everyone will stop using RSS 0.91. Who knows.

Why do we care? Because mod_virgule has always generated RSS 0.91 feeds for the articles on the main page and the user blog feeds. Most RSS readers don't bother to check the DTD but many do, and if the DTD is gone, no more Advogato feed. There was already a task on the ToDo list to bump all our feeds to RSS 2.0, so I did that today as it seemed like the easiest way to bypass the whole issue. All Advogato feeds are now RSS 2.0. I also added some of the optional tags that make life easy for aggregators like guid and pubDate.
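For illustration, here is a minimal Python sketch of an RSS 2.0 item carrying the optional guid and pubDate tags. It is not the mod_virgule code (which is C); the function name build_rss_item and the example.org URL are made up for the example, and using the permalink as the guid is an assumption.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import xml.etree.ElementTree as ET


def build_rss_item(title, link, published):
    """Return an RSS 2.0 <item> element with guid and pubDate filled in."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    # Assumption: the permalink doubles as the unique identifier.
    guid = ET.SubElement(item, "guid", isPermaLink="true")
    guid.text = link
    # pubDate in the RFC 822 style, with a four digit year and GMT zone.
    ET.SubElement(item, "pubDate").text = format_datetime(published, usegmt=True)
    return item


# Hypothetical entry; the URL is a placeholder, not a real Advogato link.
item = build_rss_item(
    "Advogato Status Report",
    "http://example.org/diary/15",
    datetime(2007, 1, 15, 22, 11, tzinfo=timezone.utc),
)
print(ET.tostring(item, encoding="unicode"))
```

An aggregator can then use guid to recognise an item it has already seen and pubDate to order items, without guessing from the title or link.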

15 Jan 2007 (updated 15 Jan 2007 at 22:11 UTC) »

Advogato Status Report

A new rev of mod_virgule code went live today. See the changelog for the details. This upgrade required taking Advogato offline for about an hour to modify the XML database.

Until today, mod_virgule has stored timestamps in the XML data store that reflected the server's local time zone. The code then made assumptions about the time zone when rendering articles, posts, or RSS feeds. Prior to 3pm, 1 October, 2006, the server's local time zone was US Pacific time. When Advogato got transferred to our hosting facility, the new server was using the US Central time zone. This created a further complication because of the two hour time shift. Adding the blog aggregator made things worse because 99% of the incoming blog feeds use UTC timestamps.

Having to juggle three time zones on a regular basis was creating a bit of a headache for me. I decided it was time to get things under control before the code got so complicated that only a Time Lord from Gallifrey could understand it. So mod_virgule now uses UTC for everything. The code changes were relatively straightforward but normalizing Advogato's rather large XML data store was another matter. I wrote a Perl program that recursively scanned Advogato's 30,000+ XML files looking for timestamps in several different formats and adjusted them to UTC (which required a different offset depending on whether they were recorded before or after 3pm, 1 Oct, 2006). That's the reason for the brief downtime.
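The cut-over logic can be sketched roughly as follows. This is a Python illustration, not the Perl program itself. The timestamp format, the names to_utc and CUTOVER, and above all the fixed UTC offsets are assumptions: a real conversion would also have to account for daylight saving time, which this sketch deliberately ignores.

```python
from datetime import datetime, timedelta, timezone

# ASSUMPTION: fixed standard-time offsets. Daylight saving is ignored here,
# so real data would need proper time zone rules instead.
PACIFIC = timezone(timedelta(hours=-8))
CENTRAL = timezone(timedelta(hours=-6))

# Server move: 3pm, 1 Oct 2006, compared as naive local wall-clock time.
CUTOVER = datetime(2006, 10, 1, 15, 0)


def to_utc(stamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a stored local timestamp string to a UTC string.

    The format is hypothetical; the real data store used several formats.
    """
    local = datetime.strptime(stamp, fmt)
    # Pick the zone the server was in when the timestamp was recorded.
    zone = PACIFIC if local < CUTOVER else CENTRAL
    return local.replace(tzinfo=zone).astimezone(timezone.utc).strftime(fmt)


print(to_utc("2006-09-30 12:00:00"))  # recorded before the move
print(to_utc("2006-12-01 12:00:00"))  # recorded after the move
```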

So, anyway, we're back up and everything should be working the same as always aside from being on UTC time rather than Central time. If anyone notices any breakage, let me know.

5 Jan 2007 (updated 7 Jan 2007 at 02:47 UTC) »

Advogato Status Report

The first new rev of mod_virgule code for 2007 went live today. See the changelog for the details. Basically, it's all bug fixes.

The important one is a rewrite of the diary entry storage code. For users whose posts arrive via syndication, the new code will allow local editing and xml-rpc editing without the save wiping out all the extra XML tags that store syndication state info. This bug was causing the occasional duplicate of syndicated posts (and it's why I warned against mixing local blogging with syndicated blogging when we turned on the aggregator).

Update: Hmmmm... okay, there's still at least one other problem with mixed local and syndicated blogging that can lead to duplicated entries. I'll see if I can track it down soon...

Update 2: Fixed what should be the last issue causing problems for mixed posting. It may actually be safe now. Unfortunately, I discovered one more cause of duplicated posts. There's an RSS variant that retroactively alters the post time of an entry each time it's edited, which confuses our simple little aggregator into thinking it's a new post. Working on a fix now. The world would be such a nicer place if everyone used a sane syndication method like Atom...

Update 3: RSS feeds with shifting date stamps should now be handled a little better. At least if the feed in question has unique item identifiers (some do, some don't - you never know what you'll get with RSS).
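One way to picture the idea, as a Python sketch rather than the actual aggregator code: key each entry on its unique item identifier when the feed supplies one, and only fall back to the link and date stamp when it doesn't. The dictionary fields and the function names are hypothetical.

```python
def entry_key(entry):
    """Prefer a stable item identifier; fall back to link plus date stamp."""
    if entry.get("guid"):
        return ("guid", entry["guid"])
    # Without an identifier, a shifted date stamp still looks like a new post.
    return ("link+date", entry.get("link"), entry.get("date"))


def new_entries(feed_entries, seen_keys):
    """Return entries whose key has not been seen, recording them as seen."""
    fresh = []
    for entry in feed_entries:
        key = entry_key(entry)
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(entry)
    return fresh
```

With this scheme an edited post that keeps its identifier is not re-imported when its date changes, while a feed without identifiers can still produce duplicates, which matches the "some do, some don't" caveat above.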

Advogato Status Report

A new rev of mod_virgule code went live today. See the changelog for the details - but only if you're really bored. There were only very, very minor changes. With the holidays coming up, I'm not sure how much time I'll have to work on the code over the next couple of weeks. So don't expect any spectacular new features.

What would be nice is seeing one shiny new article posted on Advogato before the end of December. If any Advogato users presented at the recent OSDC and have an interesting paper, maybe you could post it here as an article. Just a thought.

Advogato and Greenhouse Gas

I noticed pphaneuf's post about Second Life, computer power consumption and the relation to CO2 emissions. I may not have mentioned before that the server Advogato is hosted on now, and our entire little facility, is powered by 100% wind-generated power. We recently got our EPA Green Power Partner approval. I've never calculated the electricity used by just Advogato but overall we use about 4,000 kWh per month. According to most estimates I've seen, this translates into 6,000 - 8,000 pounds of CO2 that we avoid putting into the air each month. And we aren't the first. I've seen several other hosting facilities that have gone to 100% non-polluting power providers. Here in Texas, it's actually saving us money too, since the cost of wind tends not to be affected much by the rising cost of gas and coal. So maybe some of the Second Life users should ask their hosting provider about that.

7 Dec 2006 (updated 7 Dec 2006 at 01:37 UTC) »

Advogato Status Report

A new rev of mod_virgule code went live today. See the changelog for the details.

I've added support for a couple of additional RSS variants with ever more unusual date stamp formats. In theory the RSS pubDate tag is supposed to use the date format described in RFC822. The first problem is that RFC822 allows a lot of variation. The second problem is that RFC822 specifies a two digit year. For obvious reasons most RSS feeds use a four digit year. Mod_virgule's first line of defense is to call the Apache APR routine apr_date_parse_rfc(), which will parse all date strings that actually comply with RFC822, plus nine variants that are not strictly RFC822 compliant but are commonly seen in the wild. So far, at least one common blogging app, Blosxom, produces a pubDate field that is not RFC822 compliant and can't be parsed by apr_date_parse_rfc(). I've added a custom strptime() call that handles these. A patch for the Apache APR folks is in the works.
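The same "standard parser first, custom strptime() second" approach looks roughly like this in Python. This is only an analogy to the C code: email.utils.parsedate_to_datetime stands in for apr_date_parse_rfc(), the fallback format is a made-up example (not the actual Blosxom format), and treating zone-less dates as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# HYPOTHETICAL non-compliant format, for illustration only.
FALLBACK_FORMATS = ["%Y-%m-%d %H:%M"]


def parse_pub_date(text):
    """Parse a pubDate string to an aware UTC datetime, or return None."""
    try:
        # First line of defense: the standard RFC 822 style parser.
        parsed = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        parsed = None
    if parsed is None:
        # Second line of defense: explicit strptime() formats.
        for fmt in FALLBACK_FORMATS:
            try:
                parsed = datetime.strptime(text.strip(), fmt)
                break
            except ValueError:
                continue
    if parsed is None:
        return None
    if parsed.tzinfo is None:
        # ASSUMPTION: dates without a zone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```

Returning None for anything unparseable leaves it to the caller to decide whether to skip the entry or fall back to the fetch time.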

Some RSS feeds don't have a pubDate tag at all. Instead they have a date tag which, instead of RFC822, contains an RFC3339 formatted date string. This is actually much nicer, since it's a slightly more sane format and is the same one used in Atom feeds, so we already have code for handling it.
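For comparison, parsing an RFC3339 date string takes very little code. A small Python sketch (not the mod_virgule implementation; parse_rfc3339 is a hypothetical name) that handles the common forms with a trailing Z or a numeric offset:

```python
from datetime import datetime, timezone


def parse_rfc3339(text):
    """Parse an RFC 3339 date-time string into an aware UTC datetime.

    Assumes the string carries a zone (Z or a numeric offset), as RFC 3339
    requires, and has no unusual fractional-second precision.
    """
    text = text.strip()
    if text[-1:] in ("Z", "z"):
        # Older Python versions do not accept the Z suffix directly.
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text).astimezone(timezone.utc)
```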

Speaking of Atom, the mod_virgule aggregator now supports the old, deprecated Atom v0.3 feeds in addition to the current Atom v1.0 standard.

So here's what we support right now:

  • Atom 0.3
  • Atom 1.0
  • RSS 0.91 *(only if optional pubDate or date tags are included)
  • RSS 0.92 *(only if optional pubDate or date tags are included)
  • RSS 2.0
  • RDF Site Summary 0.9 *(untested)
  • RDF Site Summary 1 *(all variants seen so far work)
  • RDF Site Summary 1.1 *(untested)

I wish I could support the RSS 0.91/0.92 feeds that don't have any sort of time or date stamps at all but it would require some reworking of the code in the aggregator that sorts out which posts are new and which have been seen before. In most cases RSS 0.91/0.92 allows the use of both date and pubDate, so if you make sure those tags are included, things should work fine. Otherwise, your best bet is to use something a little more recent like RSS 2.0 or Atom 1.0.

The other update this week was a performance improvement. Each hour the trust metric and blog interest eigenvector ratings are recalculated. The eigenvector recalculation takes several minutes to complete. In the past the process held a read lock on the XML database for the whole run, preventing any other process from taking a write lock. This caused some operations on Advogato to block (such as clicking on the "Read more..." link of articles, which writes an update to the user's "last read" pointers). This problem is now fixed. The eigenvector processing now releases the read lock and gives up its time slice, then re-acquires the lock on each iteration. The total processing time is slightly longer (from 3 minutes to 3.25 minutes) but during that time the site can be used normally, so it should seem significantly less sluggish at the top of the hour when the update runs.
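The locking pattern can be sketched in a few lines of Python. This is an illustration of the idea only, not the mod_virgule code: a plain mutex stands in for the database read lock, a simple power iteration stands in for the real eigenvector calculation, and all names are hypothetical.

```python
import threading
import time

# Stand-in for the XML database lock (a plain mutex, not a read/write lock).
db_lock = threading.Lock()


def power_iteration(matrix, iterations=50):
    """Approximate the dominant eigenvector, locking once per iteration."""
    n = len(matrix)
    vec = [1.0 / n] * n
    for _ in range(iterations):
        with db_lock:
            # Hold the lock only while doing one iteration's worth of work.
            nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = sum(abs(x) for x in nxt)
            vec = [x / norm for x in nxt]
        # Lock released: yield the time slice so a waiting writer can run.
        time.sleep(0)
    return vec


result = power_iteration([[2.0, 0.0], [0.0, 1.0]])
```

The trade-off is the one described above: acquiring and releasing the lock on every pass adds a little overhead, but no writer ever has to wait longer than a single iteration.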

Advogato Status Report

A new rev of mod_virgule code went live Wednesday, with some additional fixes going live last night. See the changelog for the details.

This release adds support to the aggregator for blog entry updates via syndicated feeds. As far as I can tell, only Atom supports updates in any obvious way. In theory, it should be possible to detect updates to RSS or RDF Site Summary feeds by doing a diff on the content of the entry in the feed against the local copy or by making some other type of guess but it didn't seem worth the trouble right now (patches accepted, of course). Meanwhile, updates should work fine if you're using Atom. For an example see Zaitcev's blog. The Advogato date stamp and "updated" date stamp reflect the time at which the original post and the update respectively hit Advogato. The date stamps in the syndication link at the bottom of the entry reflect the times claimed in the Atom feed for the original post and update. All times have been converted to server local time (currently CST but I feel a change coming...).
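As a rough sketch of both approaches (Python, with hypothetical field and function names): an Atom entry can be classified by its id and updated stamp, while the "diff" idea for RSS could be approximated by fingerprinting the entry content. The content-hash branch is only the guess described above, not something mod_virgule does.

```python
import hashlib


def fingerprint(entry):
    """Return a marker that changes whenever the entry changes."""
    if entry.get("updated"):
        # Atom: the feed states explicitly when the entry was last updated.
        return "updated:" + entry["updated"]
    # RSS / RDF guess: hash the content and compare against the stored hash.
    content = entry.get("content", "")
    return "sha256:" + hashlib.sha256(content.encode("utf-8")).hexdigest()


def classify(entry, store):
    """Classify an entry as 'new', 'updated' or 'unchanged'.

    store maps entry id -> fingerprint from the previous fetch.
    """
    marker = fingerprint(entry)
    previous = store.get(entry["id"])
    store[entry["id"]] = marker
    if previous is None:
        return "new"
    return "unchanged" if previous == marker else "updated"
```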

It looks like we've now got 10 ex-Advogatoans who've returned to the recentlog via the syndication feature. Hopefully more will follow as word gets out that it's available.

As Professor Farnsworth likes to say, "Good news, everyone". The mod_virgule codebase is now in a Subversion repository. The latest changelog can be found in mod_virgule/trunk/ChangeLog. If you want to submit any patches, make them against the code in mod_virgule/trunk. Release versions can be found in mod_virgule/tags. To check out the latest development code:

svn checkout http://svn.dprg.org/repos/mod_virgule/trunk

Or to get the current release:

svn checkout http://svn.dprg.org/repos/mod_virgule/tags/1.41-20061201

Advogato Status Report

A new rev of mod_virgule code went live today. See the changelog for the details. This release includes slightly refactored diary (blog) code that does two things: 1) it can display permalinks and timestamps gathered from syndicated blog entries and 2) it reduces the amount of code by providing a single function to render blog entries.

If you look at an Advogato blog entry posted via syndication, you'll notice the new features. The blog entry's title is shown in bold and a permalink to the original posting is appended in grey at the end of the post.

I've also added support for more variants of RSS.

I think the blog aggregation code is solid enough now to let people know about it. It would be nice if someone whose blog is syndicated over at Planet (former) Advogato could post something about it. Any ex-Advogato users who'd like to see their blog return to the Advogato recentlog need only log into their account, check the "Syndicate your blog from another site" box, and add the URL of an Atom, RSS, or RDF Site Summary feed.

I filed a bug report and patch for the UTF8ToHtml() function in libxml2 to correct the handling of UTF8 characters like the Chinese Han ideographs here on Advogato. DV indicated he'd accept the patch pending regression testing.

17 Nov 2006 (updated 17 Nov 2006 at 06:44 UTC) »

Advogato Status Report

Okay, I think we have a fix for badvogato's Chinese character problem. I've posted four test cases below. Remember that even with mod_virgule working 100%, some browsers may not have a UTF-8 font that will render every possible character correctly. If your UTF-8 font is missing a character it will normally display a little box with the character code in it.

This one was a brain teaser. Turns out the problem has been there (in my codebase) for well over a year and was never noticed because most bloggers at robots.net post in English. I added the accept-charset="UTF-8" to all the forms generated by mod_virgule sometime back as part of an attempt to make it more UTF-8 friendly. As it turns out, one of the older mod_virgule functions, virgule_nice_htext(), is not UTF-8 safe. It assumes the input is ASCII or, at least, something where one byte = one character. UTF-8 characters that were multiple bytes were getting mangled, leading to undesirable results.

Initially I thought a fix would be as simple as passing the form data through the libxml2 function UTF8ToHtml() which should convert UTF-8 to ASCII + encoded entities. Many hours later, I figured out this just doesn't work. Due to what I believe is a bug in UTF8ToHtml(), it fails on valid UTF-8 strings that contain characters for which there is not a named HTML entity value. That means it fails on almost all UTF-8 strings that contain anything other than common European variants of ASCII characters. A Latin character with an acute or a circumflex is converted correctly but, for example, a Chinese ideograph would cause the conversion process to terminate with an error.

In the end, I patched UTF8ToHtml() to use numerical entities in this case and now all seems to be well. I'll run this by DV and see if incorporating the patch upstream is warranted.
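The behaviour the patch aims for can be illustrated in Python (this is not the libxml2 code; utf8_to_html is a hypothetical name): decode the bytes as UTF-8 first so multi-byte characters stay whole, use a named entity where HTML defines one, and fall back to a numerical entity for everything else.

```python
from html.entities import codepoint2name


def utf8_to_html(data):
    """Convert UTF-8 bytes to ASCII text plus HTML entities."""
    # Decode first so multi-byte sequences are never split byte by byte.
    text = data.decode("utf-8")
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)  # plain ASCII passes through unchanged
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")  # e.g. &eacute;
        else:
            out.append("&#%d;" % cp)  # numerical entity fallback
    return "".join(out)
```

Note that this sketch does no escaping of markup characters such as < or &; it only deals with the non-ASCII range.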

UTF-8 Tests

1. Problematic Han ideographs as mentioned in the Chinese XML FAQ:

兡也包因沘氓侷柵苗孫孫財 崧淫設弼琶跑愍窟榜蒸奭稽 霄瓢館縲擻鼕孃魔釁佉沎岠 狋垚柛胅娭涘罞偟惈牻荺傒 焱菏酡廅滘絺赩塴榗箂踃嬁 澕蓴醊獧螗餟燱螬駸礑鎞瀧 鄿瀯騬醹躕鱕

2. Cut-and-paste sample from hjclub.com website:

今天在海归网上浏览,发现一个贴子:《[保陈良宇的出笼新解释]胡 锦涛被套牢 陈良宇是赢家不是输家?》 (海纳百川 www.hjclub.com)

粗读了一下,觉得这篇文章大有深意,跟党中央不太一致是肯定的。 我看了一下别的网站,文学城、万维都登了。但海归网是商业网站, 不能成为政治斗争的牺牲品。海归网的版主因为国庆长假,未必会上 网看着。所以我就顺手删去了这个贴子。我删贴其实没有什么用处, 因为这个贴子在海外已经广泛流传。 (海纳百川 www.hjclub.com)

3. Sample from badvogato's blog

情不知所起,一往而深.

生者可以死,死可以生,

生而不可与死,死而不可复生者,

皆非情之至也.

梦中之情,何必非真,天下岂少梦中人耶?

4. Cut-and-paste from Wikipedia language menu:

# العربية # Bahasa Indonesia # Български # Català # Česky # Dansk # Deutsch # Eesti # Español # Esperanto # Français # עברית # Hrvatski # Italiano # Nederlands # 日本語 # 한국어 # Lietuvių # Magyar # Norsk (bokmål) # Polski # Português # Română # Русский # Slovenščina # Slovenčina # Српски / Srpski # Suomi # Svenska # తెలుగు # Türkçe # Українська # 中文

