Advogato: Blog for robogato

Advogato Status Report

Okay, I think we have a fix for badvogato's Chinese character problem. I've posted four test cases below. Remember that even with mod_virgule working 100%, some browsers may not have a font that covers every possible character. If your font is missing a character, the browser will normally display a little box with the character code in it.

This one was a brain teaser. It turns out the problem has been there (in my codebase) for well over a year and was never noticed because most bloggers at robots.net post in English. Some time back I added accept-charset="UTF-8" to all the forms generated by mod_virgule as part of an attempt to make it more UTF-8 friendly. As it turns out, one of the older mod_virgule functions, virgule_nice_htext(), is not UTF-8 safe. It assumes the input is ASCII or, at least, an encoding where one byte = one character. Multi-byte UTF-8 characters were getting mangled as a result.
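To make the one-byte-per-character assumption concrete, here is a minimal sketch in C. It is not code from mod_virgule; the helper names utf8_seq_len() and utf8_char_count() are made up for illustration. It shows that strlen() counts bytes, while a UTF-8-aware loop has to read the lead byte to find out how many bytes belong to each character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence introduced
 * by lead byte `b`, or 0 if `b` cannot start a sequence. */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;            /* 0xxxxxxx: plain ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                          /* continuation or invalid byte */
}

/* Hypothetical helper: count characters (code points) instead of bytes.
 * Returns (size_t)-1 if the string is malformed or truncated.
 * This sketch does not reject overlong encodings or surrogates. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    while (*s != '\0') {
        size_t len = utf8_seq_len((unsigned char)*s);
        size_t i;

        if (len == 0) return (size_t)-1;
        /* Every following byte must look like 10xxxxxx. */
        for (i = 1; i < len; i++) {
            if (((unsigned char)s[i] & 0xC0) != 0x80) return (size_t)-1;
        }
        s += len;
        count++;
    }
    return count;
}
```

For the ideograph 情 (U+60C5, bytes E6 83 85), strlen("\xE6\x83\x85") is 3 but utf8_char_count("\xE6\x83\x85") is 1. Any code that edits or truncates the string one byte at a time can split that sequence in the middle, which is the kind of mangling described above.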

Initially I thought a fix would be as simple as passing the form data through the libxml2 function UTF8ToHtml(), which should convert UTF-8 to ASCII plus HTML entities. Many hours later, I figured out that this just doesn't work. Due to what I believe is a bug in UTF8ToHtml(), it fails on valid UTF-8 strings that contain characters for which there is no named HTML entity. That means it fails on almost all UTF-8 strings that contain anything beyond ASCII and the common accented Latin characters. A Latin character with an acute or a circumflex is converted correctly but, for example, a Chinese ideograph causes the conversion to terminate with an error.

In the end, I patched UTF8ToHtml() to emit numeric entities for characters that have no named entity, and now all seems to be well. I'll run this by DV and see if incorporating the patch upstream is warranted.
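For readers who want to see the idea of a numeric-entity fallback, here is a rough standalone sketch in C. It is not the actual patch and does not use libxml2 at all; the functions decode_utf8() and utf8_to_numeric_entities() are hypothetical names invented for this example. It also simplifies things by writing every non-ASCII character as a numeric entity rather than preferring named entities where they exist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: decode one UTF-8 sequence at `s` into *cp.
 * Returns the number of bytes consumed, or 0 on malformed input.
 * This sketch does not reject overlong encodings or surrogates. */
size_t decode_utf8(const unsigned char *s, unsigned long *cp)
{
    size_t len, i;
    unsigned long c;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* also catches a early NUL */
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Hypothetical function: copy `in` to `out`, keeping ASCII bytes as they
 * are and writing every other character as a numeric entity "&#N;".
 * Markup characters such as '<' and '&' are NOT escaped here.
 * Returns 0 on success, -1 on malformed input or if `out` is too small. */
int utf8_to_numeric_entities(const char *in, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t used = 0;

    while (*p != '\0') {
        unsigned long cp;
        size_t n = decode_utf8(p, &cp);

        if (n == 0) return -1;
        if (cp < 0x80) {
            /* Leave room for this byte plus the terminating NUL. */
            if (used + 1 >= outsize) return -1;
            out[used++] = (char)cp;
        } else {
            int w = snprintf(out + used, outsize - used, "&#%lu;", cp);
            if (w < 0 || (size_t)w >= outsize - used) return -1;
            used += (size_t)w;
        }
        p += n;
    }
    if (used >= outsize) return -1;
    out[used] = '\0';
    return 0;
}
```

With this sketch, the input "A" followed by 情 (bytes E6 83 85) comes out as A&#24773; and é (bytes C3 A9) comes out as &#233;. The result is pure ASCII, so later code that assumes one byte = one character can no longer split a character in half.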

UTF-8 Tests

1. Problematic Han ideographs as mentioned in the Chinese XML FAQ:

兡也包因沘氓侷柵苗孫孫財崧淫設弼琶跑愍窟榜蒸奭稽霄瓢館縲擻鼕孃魔釁佉沎岠狋垚柛胅娭涘罞偟惈牻荺傒焱菏酡廅滘絺赩塴榗箂踃嬁澕蓴醊獧螗餟燱螬駸礑鎞瀧鄿瀯騬醹躕鱕

2. Cut-and-paste sample from hjclub.com website:

今天在海归网上浏览，发现一个贴子：《[保陈良宇的出笼新解释]胡锦涛被套牢陈良宇是赢家不是输家？》 (海纳百川 www.hjclub.com)

粗读了一下，觉得这篇文章大有深意，跟党中央不太一致是肯定的。我看了一下别的网站，文学城、万维都登了。但海归网是商业网站，不能成为政治斗争的牺牲品。海归网的版主因为国庆长假，未必会上网看着。所以我就顺手删去了这个贴子。我删贴其实没有什么用处，因为这个贴子在海外已经广泛流传。 (海纳百川 www.hjclub.com)

3. Sample from badvogato's blog

情不知所起,一往而深.

生者可以死,死可以生,

生而不可与死,死而不可复生者,

皆非情之至也.

梦中之情,何必非真,天下岂少梦中人耶?

4. Cut-and-paste from Wikipedia language menu:

# العربية # Bahasa Indonesia # Български # Català # Česky # Dansk # Deutsch # Eesti # Español # Esperanto # Français # עברית # Hrvatski # Italiano # Nederlands # 日本語 # 한국어 # Lietuvių # Magyar # Norsk (bokmål) # Polski # Português # Română # Русский # Slovenščina # Slovenčina # Српски / Srpski # Suomi # Svenska # తెలుగు # Türkçe # Українська # 中文

17 Nov 2006 robogato » (Master)