17 Nov 2006
(updated 17 Nov 2006 at 06:44 UTC)
Advogato Status Report
Okay, I think we have a fix for badvogato's
Chinese character problem. I've posted four test cases
below. Remember that even with mod_virgule working 100%,
your browser may not have a font that can render
every possible character correctly. If your font is
missing a character, the browser will normally display a
little box with the character code in it.
This one was a brain teaser. Turns out the problem has been
there (in my codebase) for well over a year and was never
noticed because most bloggers at robots.net post in English.
I added the accept-charset="UTF-8" attribute to all the forms
generated by mod_virgule some time back as part of an attempt
to make it more UTF-8 friendly. As it turns out, one of the
older mod_virgule functions, virgule_nice_htext(), is not
UTF-8 safe. It assumes the input is ASCII or, at least,
something where one byte = one character. UTF-8 characters
that span multiple bytes were getting mangled, leading to
undesirable results.
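To make the "one byte = one character" assumption concrete, here is a
minimal sketch in C. It is not mod_virgule code; the function name
utf8_codepoints() is made up for illustration, and it assumes the input
is already valid UTF-8. It counts characters by skipping continuation
bytes (those of the form 10xxxxxx), which is exactly the distinction a
byte-at-a-time function misses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of mod_virgule.
 * Counts code points in a NUL-terminated UTF-8 string by skipping
 * continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Every byte that is NOT a continuation byte starts a character. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For example, the Han character U+4E2D is the three bytes E4 B8 AD in
UTF-8: a byte count gives 3, while utf8_codepoints() gives 1. Code that
splits or escapes such a sequence one byte at a time will corrupt it.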
Initially I thought a fix would be as simple as passing the
form data through the libxml2 function UTF8ToHtml(), which
should convert UTF-8 to ASCII + encoded entities. Many hours
later, I figured out this just doesn't work. Due to what I
believe is a bug in UTF8ToHtml(), it fails on valid UTF-8
strings that contain characters for which there is no named
HTML entity. That means it fails on almost all UTF-8 strings
that contain anything other than the accented Latin letters
common in European languages. A Latin character with an acute
or a circumflex is converted correctly but, for example, a
Chinese ideograph would cause the conversion process to
terminate with an error.
In the end, I patched UTF8ToHtml() to fall back to numeric
entities in this case, and now all seems to be well. I'll run
this by DV and see whether incorporating the patch
upstream is warranted.
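The numeric-entity fallback can be sketched in standalone C as follows.
This is not the actual libxml2 patch, and the names utf8_decode() and
utf8_to_ascii_entities() are hypothetical. As a simplification it
writes every non-ASCII character as a decimal reference (&#NNNN;),
whereas UTF8ToHtml() uses named entities where they exist. ASCII bytes,
including markup characters such as < and &, are passed through
unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with at
 * most len bytes available. Stores the code point in *cp and returns the
 * number of bytes consumed, or 0 if the sequence is truncated, overlong,
 * a surrogate, or beyond U+10FFFF. */
static size_t utf8_decode(const unsigned char *s, size_t len,
                          unsigned long *cp)
{
    size_t need, i;
    unsigned long c, min;

    if (len == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                      /* stray continuation or bad lead byte */

    if (len < need) return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return need;
}

/* Hypothetical helper: convert UTF-8 to ASCII, writing each non-ASCII
 * character as a decimal numeric reference. Returns the output length
 * (excluding the NUL), or -1 on invalid input or if out is too small. */
static int utf8_to_ascii_entities(const char *in, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t left = strlen(in), used = 0;

    assert(out != NULL && outsize > 0);
    while (left > 0) {
        unsigned long cp;
        size_t n = utf8_decode(p, left, &cp);
        int w;

        if (n == 0) return -1;
        if (cp < 0x80) {
            if (used + 1 >= outsize) return -1;   /* keep room for the NUL */
            out[used++] = (char)cp;
        } else {
            w = snprintf(out + used, outsize - used, "&#%lu;", cp);
            if (w < 0 || (size_t)w >= outsize - used) return -1;
            used += (size_t)w;
        }
        p += n;
        left -= n;
    }
    out[used] = '\0';
    return (int)used;
}
```

With this sketch, the three bytes E4 B8 AD (U+4E2D) become &#20013;
and an e with an acute accent (C3 A9) becomes &#233;, so a Chinese
ideograph no longer aborts the conversion.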
1. Problematic Han ideographs as mentioned in the Chinese XML
2. Cut-and-paste sample from hjclub.com
3. Sample from badvogato's
4. Cut-and-paste from Wikipedia
# Bahasa Indonesia
# Norsk (bokmål)
# Српски / Srpski