Met with Rasmus, JByers,
and a few others in San Francisco (at LinuxCare) to discuss
internationalization in PHP. The PHPi home is on SourceForge
at http://php-i18n.sourceforge.net/, but it's just getting
off the ground. There are a few
efforts out there that have started internationalization on
different levels. Hiro (can't remember his last name =( ),
while working for a Far East web portal, added support for
JIS and other Japanese encodings to PHP so that it could
accept form POST vars. It seems like it would be a good
starting point to see what problems he ran into.
On the other hand, an IBM project called ICU exists as an
Apache/PHP module. It seems quite messy, written in C++ and
prone to bringing down the Apache thread if not handled with
care. Carl, the contact at IBM, said that it was under a
sort of BSD license, so hopefully we can fix up whatever is
wrong with it and see what it affords us. They seem to have
much of the VERY specific work done, including sorting
charts, multi-character glyph grouping, etc. It was done
using a collate function that normalizes the input string to
separate out diacritical marks (accents) and group
characters, and then runs it through various levels of
sorting (exact, whitespace insensitive, case insensitive,
etc.). It looks very useful, but it also looks like more
than we would need.
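To make the collate idea a bit more concrete, here is a
rough sketch in Python (not PHP, and not ICU's actual API).
The function name collation_key and the three level names
are made up for illustration. It only assumes that
"normalizing" means Unicode decomposition (NFD), which
splits an accented letter into a base letter plus a
combining mark. Real ICU collation does far more than this.

```python
import unicodedata

# Hypothetical level names, loosely following the ones mentioned above.
LEVELS = ("exact", "whitespace_insensitive", "case_insensitive")

def collation_key(text, level="exact"):
    """Build a sort key: base letters first, accents only as a tie-breaker."""
    if level not in LEVELS:
        raise ValueError("unknown level: " + level)
    # NFD splits e.g. U+00E9 into 'e' followed by combining acute U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = "".join(ch for ch in decomposed if unicodedata.combining(ch))
    if level != "exact":
        # Drop all whitespace before comparing.
        base = "".join(base.split())
    if level == "case_insensitive":
        # casefold() is a more aggressive lower() meant for comparisons.
        base = base.casefold()
    return (base, marks)

words = ["zebra", "\u00c9mile", "eclair", "\u00e9clair"]
ordered = sorted(words, key=lambda w: collation_key(w, "case_insensitive"))
# ordered is ['eclair', 'éclair', 'Émile', 'zebra']
```

A plain byte-wise or code-point sort would put the accented
words after "zebra", which is exactly the kind of thing the
collate function is there to avoid.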
The final debate was on how to handle the differences
between UTF-8 and UCS-2, and how to tell either of them
apart from high ASCII. There seems to be no good way at all
(is a form being submitted in multibyte Japanese, or is it a
JPEG?). And when we do a strlen() on it, do we get the
number of bytes or the number of characters? Hopefully
someone has some magic solution to this one.
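The bytes-versus-characters problem is easy to show with a
small Python sketch (Python only because it keeps text and
bytes as separate types; this is not how PHP behaves). The
helper looks_like_utf8 is a made-up name, and it is a
heuristic, not a detector: it can say "this is not valid
UTF-8", but it cannot say what the data actually is.

```python
text = "日本語"  # three characters

utf8 = text.encode("utf-8")
# For these characters UTF-16 big-endian is byte-for-byte the same as UCS-2.
ucs2 = text.encode("utf-16-be")
sjis = text.encode("shift_jis")

print(len(text))  # 3 characters
print(len(utf8))  # 9 bytes
print(len(ucs2))  # 6 bytes

def looks_like_utf8(data):
    """Heuristic only: True if the bytes happen to decode as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# The first four bytes of a typical JPEG (JFIF) file.
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0])

print(looks_like_utf8(utf8))         # True
print(looks_like_utf8(jpeg_header))  # False
print(looks_like_utf8(sjis))         # False
```

The last two lines are the whole problem in miniature:
Shift_JIS text and a JPEG both fail the UTF-8 check, so the
check alone cannot tell multibyte Japanese from binary data.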