Nowadays, free software increasingly tries to speak foreign languages, reflecting the fact that the free software development community is spread across dozens of countries around the world. For instance, with the environment variable LANG set to fr_FR on a Red Hat 6 machine, I get the following behavior:
[monniaux@quatramaran monniaux]$ ls --help
Usage: ls [OPTION]... [FICHIER]...
Afficher les informations au sujet des FICHIERS (du répertoire
courant par défaut). Trier les entrées alphabétiquement si aucune
des options -cftuSUX ou --sort n'est utilisée.
-a, --all afficher les noms cachés débutant par .
...
Most users in non-English-speaking countries will be delighted to see computers at least trying to speak their mother tongue. Some people, particularly in the United States, tend to believe that most people around the world speak English, at least as a foreign language. This is a myth. The reality is rather the following:
I shall not deal here with the technical issues involved, like the use of Gettext. I will rather try to focus on simple acts that can make internationalization and localization better. Internationalization (i18n) is the process of preparing a program for localization (l10n): it isolates all the country- or language-dependent parts of a program, so that l10n can then adapt them for particular countries or languages.
I see quite a few thorny issues, on which I am going to give my opinion:
postal addresses and phone numbers are often formatted somewhat differently between countries. I have seen all too many Web sites refusing French postal addresses for want of a "state" field, even though France does not use states for postal addressing, and even though those sites claimed to serve international customers. Furthermore, French postal codes go before the name of the city, not after. For instance, a French mail address might look like:
M. Henri Martin
Martineaud SA
13, rue du Moulin Vert
75361 PARIS cedex 11
Now imagine that that person gets his own address refused by the program because he did not specify a state!
I do not advocate putting a database of postal address and phone number formats in programs. Instead, programs or WWW sites should accept free-form addresses. Let us stop second-guessing users and assume that they can at least write addresses correctly.
Another country-specific trait is the use of pictograms or jokes. Pictograms often refer to cultural traits (like gestures meaning "it's ok"; the gesture meaning "it's ok" for Americans can mean "it's a zero, absolute crap" to the French) or to road signs (which are different between Europe and the US). Puns cannot be translated easily. Furthermore, jokes often make references to events and people totally unknown outside the country of the author: would an English-speaking Canadian understand that "eat apples" is a reference to a joke about the French presidential campaign in 1995? Likewise, references such as "beam me up!" are meaningful only to those who have watched an English version of Star Trek.
Locales have tried to deal with the currency unit. Even more annoying are length units and paper formats. Inches and the US Letter paper format are unknown to many people around the globe, who use centimeters and A4. Not only should programs be able to accommodate both standards, but this should be a customizable item. Even better, the default value should be locale-dependent.
All too often, mysterious vocabulary, coined terms and the like prevent translators from working efficiently. They also often prevent even native speakers from understanding what is being dealt with. If specialized terms and coined words are necessary, they should be explained in the documentation.
Translators should try to find the commonest translation of the word, possibly looking at the major commercial products in the same field. When the foreign version of the word is more common, they should stick to it, no matter the official version.
Translators should really be careful not to translate text into gibberish. The GNU libc French locale contained (or maybe still contains) some ridiculously translated strings. I think that, when unsure, one should abstain from translating. Translation is best left to those who master the target language and have a good command of the source language.
I would be very much obliged for some constructive comments, because I am quite sure I have overlooked many other thorny aspects. Notably, the use of Unicode and composite characters looks like a must for future developments. I would like people speaking Japanese, Korean or other languages with large scripts to give their insights on this subject.
I agree with you on the point that including databases of postal codes in programs is bogus. I'd just like to point out, though, that for the most part, the reason address fields are usually so chopped up, with input required in all of them, is the way the information is stored. It's not that the programmer doesn't trust the user; it's just much easier to deal with the address internally if you break it up into a bunch of pieces and store it that way, as in a relational database, rather than putting it into one "block" of information and then going through the added pain of extracting pieces.
I think most of the negative aspects of the way things like that are done are due to the limitations in the program put there (not necessarily on purpose) by the programmer - not that the programmer doesn't trust the user.
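The trade-off described in this comment can be sketched quickly: keep the address as one free-form text block and separate out only the fields that are genuinely universal. This is an illustrative Python sketch; the field names are invented for the example, not taken from any real schema.

```python
from dataclasses import dataclass

@dataclass
class Address:
    country: str   # one of the few fields that really is universal
    text: str      # free-form lines, stored exactly as the user wrote them

# The French address from the article fits without a "state" field.
addr = Address(
    country="FR",
    text="M. Henri Martin\n13, rue du Moulin Vert\n75361 PARIS cedex 11",
)
print(addr.text)
```

The relational cost is one text column instead of five varchar columns; the benefit is that no valid address anywhere in the world gets rejected for failing to fit a US-shaped schema.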
As a general followup both to the Advogato's number article, which touched on this subject, and to this, I think you're trying to make this problem much bigger than it is. And the main reason is, I believe, that you're French.
Basically, France has a foreign language problem. There is no other country in western Europe where comprehension of English is so poor, and resistance to learning foreign languages is so strong. There are doubtlessly historical, political and cultural reasons for this, but the fact remains: French programmers are handicapped in the international community, because they're unable to write and comment code in the language the rest of the community uses. I'm not a native English speaker, I'm Norwegian, but both in Norway, and here in Mexico (a country notorious for how small a percentage of the population is fluent in English) programmers tend to know enough English to write and comment code "properly", and also comprehend English documentation and tools. However, a French programmer living in Mexico whom we were working with, was unable to do this.
Similarly, my mother, who is 50 years old, manages to use English language Win98 on her home computer, and most office workers in Norway run at least a few English language programs on their computers, seemingly without problems.
So I must repeat my earlier opinion: Making free software understand and produce international scripts, through projects like Pango is much more important than actual translation efforts. As for the French, I hope and trust they will at some point see that their xenophobia is becoming a liability, and change the attitude. I'm afraid it'll take time, though, and until then, French programmers and French software industry will be at a disadvantage.
I do agree that there is nothing more annoying than input forms that have no idea of cultural backgrounds different from the writer's own. In Ireland we just do not have postal codes outside of Dublin, and there the numbers range from 1 to 16 (or thereabouts). Again and again you hit this problem, especially on the Web, where a form will reject an address because the concept of not having a postal code breaks its logic; and of course the state issue is another nonsensical problem.
Another thing about free software and localization is a thought I had some time back while reading about Iceland, where the population is considered too small for many commercial companies to provide translations of their software into the native language, which the Icelanders are trying hard to maintain. It strikes me that free software makes it very easy for speakers of smaller and more ignored languages to do their own localization, bringing native-language versions of software to speakers of Tamil and Thai, to name two that have caught my attention. (Aside: I have my doubts about the common belief that the Web and greater connectivity in general will finally wipe out all the smaller languages.) But there is the other issue mentioned in the article: the problem of translating when an appropriate word or phrase is missing. There appear to be two reasons this might happen
I strongly agree. I often have addresses rejected because I need to input a 'state' or I have to insert a 'zip code' (mail gets delivered in Egypt WITHOUT zip codes - this is related to illiteracy, by the way).
Recently I also encountered another problem. They expect '10-digit' phone numbers (intl. code + number). This is a problem because different countries have different lengths (some small countries have 6 digits + a 3-digit intl. code). My mobile has a 2-digit country code, a 2-digit city code and then 7 digits - a total of 11. I noticed this on the TWA site, by the way.
There are also two types of computer users (generally): 'developers' and 'users'. (Actually, I could classify computer users into much more specific types, and I think I could write an essay/article about that - but let's assume that everything falls under these two broad categories.) Developers need to know English. No doubt about that. They might need to know English /very/ well to be good programmers too. All computer languages are in English or are based on English. A lot of computer terms also stem from English, and everything else is a translation (buffer, core, RAM (Random Access Memory), etc.)
Users don't care what the developer does. They want a LOCALIZED version. This is a two-step process: first i18n, then l10n (by different people/groups, usually). This is VERY important in certain areas. I firmly believe that Linux has not gotten widespread popularity in the Middle East because of this. Companies/governments/shops/businesses - they all want ARABIC stuff. The official language is ARABIC, so all their legal stuff is in Arabic. A /HUGE/ selling point of Linux and other open source stuff is that it's /FREE/ - you can download all programs off the net. They love this here - but then you have to tell them "You can't have Arabic" - so they stick with 'Arabic Windows'. The fundamental problem with i18n/l10n is DIFFERENT SCRIPTS. The English alphabet is used in A LOT of languages (i.e. even if you don't have those little accents on top of the characters, you can still generally understand French). But you can't write Arabic in English. Actually, the IRC world has pioneered this department - they have invented ways of writing Arabic in English :) For example, they use English equivalents of letters (to make the sound) and use numbers to represent the 'missing' letters - the two most popular ones are '3' and '7' (since they kind of 'look' like the Arabic characters). Obviously, though, this is not a REAL solution (i.e. you can't have your official business or government documentation in that, and you need to know English (at least the alphabet) to understand it).
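The digit convention this comment describes can be sketched as a simple substitution table. Assumption, hedged: '3' is commonly used for the letter ain (U+0639) and '7' for hah (U+062D); the two-entry table below is only a tiny illustrative sample of the full chat alphabet.

```python
# IRC-style Arabic transliteration: digits stand in for Arabic letters
# that have no Latin-alphabet equivalent. Illustrative sample only.
CHAT_TO_ARABIC = {
    "3": "\u0639",  # ARABIC LETTER AIN
    "7": "\u062D",  # ARABIC LETTER HAH
}

def dechat(s):
    """Replace chat digits with the Arabic letters they stand for."""
    return "".join(CHAT_TO_ARABIC.get(ch, ch) for ch in s)
```

As the comment says, this is a workaround born of missing script support, not a real solution: the output still has to be rendered by a toolkit that can shape and join Arabic script.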
I believe Pango is the /BIGGEST/ step in this department. Being able to NATIVELY show different scripts is a big plus. I know you can do CJK right now in X Windows (is it possible on the console?) - you can't do Arabic in X Windows. (There are a few very limited applications that can: AraMosaic (an Arabic web browser), and a proprietary X application (free beer) - a word processor. They also sell a 'toolkit' of some sort so you can write Arabic applications. I think the word processor is statically linked to the toolkit, but you (developer and user) need to buy the toolkit for other applications.) Anyway, Arabic also has an 'arabized' console called 'acon' - you run it, and this software lies between your terminal and your screen. It recognizes certain (Arabic) codes on your screen and renders them appropriately. So I think you can mix & match Arabic/English on the console thanks to acon. I also believe it's free. Arabic is sorely missing a 'free' (libre) renderer in X.
The answer to my comments is: PANGO :)
"As for the French, I hope and trust they will at some point see that their xenophobia is becoming a liability, and change the attitude."
Ever been in the metro in Paris? You would then notice that most signs are subtitled in two languages, including English. Even the conservative Senate has a Web site available in several languages. I do not, therefore, think there is some kind of concerted effort to ward off English. The problem is perhaps in the way foreign languages are taught; pretending to have magic recipes to solve such complex educational problems is pretentious at best.
My experience is that most people with a reasonable level of education (my mother, for instance) can deal with application menus in English, as long as the vocabulary does not become obscure and/or very specialized. That is why I recommend that even in the English version, obscure terminology should be kept to a minimum. If for the sake of brevity or precision some obscure terminology should be used, its meaning should be recalled by an easily accessible online help.
However, even people who can cope with an application in English do not necessarily want to read all the documentation in English. Declaring that free software should be available only to those who master English is akin to declaring that it should be reserved for those Unix-savvy enough to know how to set up custom initialization scripts. Proprietary software vendors have long understood that users want software that does its task without requiring too much learning. That includes not requiring users to know too much about the internals of the system, and not requiring them to learn foreign languages.
I agree that translation is not the only answer. Making the English version clearer could be very valuable too. All too often, programs and documentation contain abbreviations, jokes and allusions that are not essential to the use of the system; such "clever" language greatly hampers comprehension. Please note that having learnt a foreign language at school does not imply having seen 15 years of TV in that language.
For further reading, Jakob Nielsen has written some interesting papers on internationally-oriented WWW sites and international usability testing.
English is the language of business.
Programming is very much a business.
Internationalization is a neat idea; it works well if your mother tongue has words that can be exchanged without losing meaning, especially where syntax is concerned.
My thoughts on developing? Screw it, you'd best learn English if you're going to do programming on a computer. You can better communicate ideas in English to most of us who are developers anyhow. Not to mention that 99.9% of programming languages are based on English words and syntax.
As far as the French language in particular goes, it might be really great for law, because it is very concise and clear. It has no place in something that I personally regard as abstract art. The art of programming is very abstract. It almost requires an abstract language and should continue to do so. For example, if it were all concise, it would stagnate very quickly. Abstract thought is what fuels new ideas and concepts. Sometimes a little confusion is good, because it forces others to spawn entirely new projects.
Arabic is a mystery to me, so I have no thoughts in regard to that - but hey, English does borrow your symbols and concepts for number representation... :-)
Enuff jabbering, I have a very small footprint, highly portable TCP/IP stack to complete.
ajk
I see that some of the replies posted here tend to confuse two categories of users: the average user (think about your mother) and the programmers.
I think that it is obvious that the programmers must know English. Furthermore, I firmly believe that a (very) good knowledge of English is a prerequisite for becoming a good programmer. See my comment (#9) attached to this previous story for more details about this. In summary, any student in computer science can learn the basics without knowing English, but it is not possible to go very far like that, because many APIs are written in English, as well as many of the best reference books and most of Open Source code, which is probably the best way to learn new tricks.
But things are different for the users who do not intend to write any code and who use the computer as a simple tool to get their job done or to have some fun (usually without knowing what is really inside the beast). As others have pointed out, most of these users have a very limited knowledge of English, if they know it at all.
I live in Belgium, a country that has three official languages: Flemish (Dutch), French, and a bit of German (again, see my previous comment for more background info about me). Many young children start by learning one of the other official languages of the country before learning English. Some of them learn English afterwards and manage to get quite good at it, but others don't. So, contrary to what some people might think, it is not true that almost all kids nowadays learn English at school. Why should they, anyway? If they intend to get a job as an accountant, social worker, or anything that involves contact with the local population but not with foreigners, it is more important to know the languages spoken locally than to know English (except when you go on vacation, but then knowing Spanish or Portuguese might be equally important). In many cases, this is actually required: the majority of administrative jobs in Belgium require the knowledge of Flemish and French only. This is true for countries that have more than one official language (like Belgium or Switzerland) and for countries that have strong local dialects sufficiently different from the official language.
If English is your third or fourth language, chances are that you cannot understand it very well. And nobody should blame you for that.
Even for those who understand English reasonably well, this often involves some extra effort: a novice user has to remember that the "Save" option under the "File" menu is the thing that makes sure that her work is not lost. People who are not so familiar with English must in addition remember what these words mean. I took a simple example, but think about the meaning of options such as "merge visible layers" or "round rectangular selection" in the GIMP. Their meaning should be obvious for a native English speaker, but not for someone who has to translate word for word. Even if they know the individual words, this implies an extra memorization effort that makes the tool harder to use.
Besides, if everything I wrote above were complete nonsense, why would all commercial software companies invest so much in translating their programs into as many languages as possible?
A very well written, easy to read, and to-the-point article. Having applications show up in your native tongue is a very important issue, and I think those of us who speak English and are used to using English apps don't realize what we take for granted.
Boot up your system with LANG=fr for a day and realize that anything you can read in English, whoever is really using that tool cannot.
I've tried turning on a different language, namely fr_FR, to see how well the tools that I use on a regular basis are translated. What I noticed is that in many GNOME applications, the translated text is not really a phrase or sentence in French. I know that the official French language does not have anywhere near the number of words that English has, but this is ridiculous.
Not to pick on Balsa (many other GNOME apps are just as bad), but many of the preferences options are quick phrases that are not easily translated to other languages. The same thing goes for the text and hints for toolbars, and menus. I think that application writers need to make sure that they are clear in what they are trying to communicate to the user.
First of all, reading some comments I fear I was not clear enough about the software parts I want internationalized and localized. I think that
From the feedback I've received on Advogato and by mail, it seems indeed that, apart from language issues, there are lots of annoying behaviors caused by software assuming that some country-specific or language-specific standards are universal:
Myself and others have mentioned the stupid behavior of software and WWW sites enforcing US-style addresses (requiring state & ZIP code for instance, but this is not the only issue) or US-style phone numbers.
Eric Moreau reports that French-language Windows sets the keyboard as AZERTY (standard in France) even though Quebec uses special QWERTY keyboards. All the same, French-language versions of some software assume that the paper size is ISO A4 (international standard, used by nearly everybody outside North America) while Quebec uses US Letter.
The issue here is that locale-specific information is not only a question of language, but also of many "cultural" issues and local standards. The locale for French-speaking Canada is fr_CA, and differences such as the ones cited above have to be taken into account. There are similar issues between the United States (en_US) and the United Kingdom (en_GB).
I think that defaults for paper size, measurement units (length, temperature...) and the like should be set according to the locale. Be sure to consider the full locale, not only the language part. However, such settings should also be fully customizable. There are some peculiar situations (for instance, people from one country working temporarily in another country and using an operating system fitted for their language) where some settings must be taken from the locale and some others overridden. Imagine an American working in France: he wants his software in English, but he also wants A4 paper to fit the printer.
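The resolution order just described (explicit user setting first, then the full locale including territory, then a sensible fallback) can be sketched in a few lines. This is illustrative Python; the territory table is a tiny invented sample, not a complete database.

```python
# Territories that default to US Letter; everyone else gets A4.
# Illustrative sample only -- a real table would be far larger.
LETTER_TERRITORIES = {"US", "CA"}

def default_paper(locale_name, user_setting=None):
    """Pick a paper size: explicit setting wins, else use the territory."""
    if user_setting is not None:
        return user_setting
    # "fr_CA.UTF-8" -> territory "CA"; unparseable names fall back to A4
    _lang, _sep, rest = locale_name.partition("_")
    territory = rest.split(".")[0]
    return "Letter" if territory in LETTER_TERRITORIES else "A4"
```

Note that fr_CA correctly yields Letter (Quebec) while fr_FR yields A4, which is exactly why the territory part of the locale must not be ignored; and the American in France simply passes an explicit A4 override.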
There have been some comments on Arabic. I'd be delighted to hear more details about Arabic (apart from the fact that it's written right-to-left). Namely:
Arabic is relatively simple when it comes to keyboard stuff. We have 28 letters - you guys have 26. There is a slight problem, though: our alphabet is in script format, so it makes a big difference whether a letter appears at the start, middle, or end of a word. Actually, the software handles that, but there are a few exceptions (and thus those keys are on the keyboard). We also have something called "tashkeel": symbols that are put on letters to change the way they are pronounced (like accents in French). Anyway, the keyboards sold here basically have both the English letters and the Arabic letters on the keys, using the space of some symbols for letters (i.e. if you want the symbols, go into English mode). What Microsoft does is have a little 'docklet' thingy (next to the time) that says 'En'; you click on it (or right-click) and you can choose Arabic.
As for direction changes in the middle of text - well, that is a surprisingly complex issue for such a simple idea :) There is a whole ruleset that does it. Furthermore, each ruleset does it slightly differently. There is a Unicode standard, and I believe Microsoft does its own thing; Netscape does its own thing too, I think (although Mozilla might be Unicode compliant). By the way, even though the Unicode thing is the theoretical "standard", I find that it does not make sense (but apparently it does to everyone else). KDE had screenshots a while back (at mosfet.org) that show Hebrew/English text directions.
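The Unicode ruleset mentioned here (the bidirectional algorithm) starts from a per-character direction class, and that first step is easy to see for yourself: Python's standard unicodedata module exposes the class of each character. The full reordering algorithm is far more involved; this sketch only shows the classification it builds on.

```python
import unicodedata

# Per-character bidirectional classes, the input to the bidi algorithm:
# 'L' = left-to-right letter, 'AL' = Arabic (right-to-left) letter,
# 'EN'/'AN' = European/Arabic-Indic digits.
for ch in ["a", "\u0627", "1", "\u0661"]:
    print("U+%04X -> %s" % (ord(ch), unicodedata.bidirectional(ch)))
```

Mixing an 'L' run with an 'AL' run is what forces the renderer to decide on a display order, which is where the implementations the comment lists start to disagree.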
Furthermore, Arabic has another problem :) Here we have an Islamic calendar (which is not really used to 'schedule' stuff). It is lunar based - but it is /worse/ than the Hebrew calendar (also lunar based). The Hebrew calendar can be determined /algorithmically/ - ours can't: it is based on MOON SIGHTINGS. I.e., if there is a very cloudy end of month or something, the date can shift. There is no way you can determine the months of the Islamic calendar in advance - there is always the possibility of error :) Thankfully, though, they only "check" the moon sightings twice a year (during the religious festivals (Eid)). I'm gonna try and work on it after Eid this year (which falls at the exact same time as GUADEC, by the way) so the calendar can be 'stable'. See the calendar-list@gnome.org archives for more info :)
I think it speaks volumes that everyone is discussing Western European or Central European languages. We have a lot to do beyond that. Different writing directions and fonts are an obvious beginning.
Also, calling something a 'French problem' neither makes it go away in France nor solves it. It isn't a problem for just France. Icelanders are very proud of their language. "Use English" isn't an acceptable answer. It's a deep cultural thing.
And once you get to Japan, English is a big barrier. I'm happy that I have folks translating some of my articles into Japanese. There is a HUGE barrier between Japan and Europe/USA in free software. I only really got a feel for the scale of it when I got the PC110 Japanese palmtop. You try finding and configuring a special X server when the notes are in a language you cannot read and the links aren't all obvious to follow.
Similarly, we have a Linux/PC98 porting project. Anyone here heard of it? Probably not - why? Because you need to speak Japanese to follow it.
I would really love to have EN<->JP autotranslation tools, even ones as bad as Babelfish.
Alan
While I agree fully that writing internationalized software is important (indeed, it forces you to think about layering and abstraction issues you probably ought to be thinking about anyway), this thread has brought up a different interesting issue.
When we write code, we are not actually writing "in English". We are writing in some programming language, which has bits of English and bits of discrete math and type theory strewn through it. I have found it pleasantly surprising that in some cases I can exchange code fragments with a person with whom I cannot discuss, say, the weather (albeit with some difficulty translating verbs and nouns - at least you can look such things up in dictionaries). This phenomenon was even more pronounced when reading math texts in another language, where the mechanism of proof is so similar and the notation so precise and terse that not speaking the language of the narration seems like much less of a problem than when, say, one tries to read a newspaper.
Comparing this situation to the deep culture shock of being in a society where one cannot read or hear anything, it seems clear that the programming community has a bit of a leg up in getting to know one another cross-culturally since we can at very least get code ideas from one another.
I wonder in which cultures, natural languages and programming languages people have found the language barrier mitigated by the ability to express things in code. Code has a much simpler grammar and vocabulary; if you can translate the identifiers a person is using, it seems possible to use the code as a makeshift basic communication system. Are certain languages better or worse for this? I recall reading a page which stated that FORTH was a very "Korean-friendly" language due to its postfix form, but that seems somewhat of a surface issue. Has anyone here ever programmed in a language written/designed by a culture you consider "foreign"? Western Europeans and Americans invented most of the languages I use commonly, I think, but I don't know for certain.
I agree with Alan that Japanese is indeed difficult to handle:
I would like to hear some details about these two points. I know the Emacs-MULE kanji and kana input system. I know that other applications must use software such as kinput2. How are such things handled in Gtk+?
This leads me to another issue: documentation. A Japanese-speaking friend asked me to install a Japanese-aware Linux. The problem was the same as Alan's: nearly all the available documentation for this is in Japanese! I don't speak Japanese, and my friend doesn't know computer terms, even when they are English words in Japanese pronunciation.
The lesson to be retained from this problem is that even though some software is meant for users speaking a certain language, documentation (at least for installation) should also be available in English. Two reasons:
Sorry to be so verbose these days.
The quality of some free software translations is not adequate. It seems that translations, including those in software such as GNU libc, GNOME and KDE, suffer from the following problem:
Even though a bilingual dictionary says a word can be translated to another, those words are not necessarily equivalent. For instance, some software (GNU info, if I remember correctly) says "Fouille infructueuse" when a search has failed. Fouille indeed means search, but in the sense of an archeological dig or of police searching a suspect; the correct translation is recherche.
Ridiculous translations such as the one quoted above can often be avoided by following simple rules:
Being a native English speaker (well at least in most eyes, some people aren't sure that my Birmingham accent counts as anything but Caveman) I normally get to avoid translation problems
Reading SuSE manuals is something I recommend for people who want to get an idea of what it must be like. These are good translations, but you will find random quirks in them, and stray bits of German or German screenshots.
Like many native English speakers (in the UK, at least), I have a very meagre grasp of other languages. I have recently got into attempts to document things. I have learned some things: bad puns do not translate; using a word with two meanings is a mistake unless you clarify which meaning you intend; and simple grammar is good. (Yes, I know I'm bad at the last one.)
One thing I do wonder: how much does misspelling matter? I can follow some French and the occasional tiny piece of German. This usually involves heavy use of a dictionary. When people use contractions, I find it hard. But sometimes they use words which are not in my dictionary at all. I am never sure whether they're real words that my dictionary is too polite to mention. Or whether they're misspellings.
Is the same true for other languages? I imagine it must be. How much of a problem is that? I know I sometimes have trouble with some writing in English. But am I patronising other people (who tend to be a lot better at my language than I am at theirs), or is it important to check spelling first?
Personally speaking, I hate bad spelling and grammar from people who should know better, but that's not really the question. The question is how bad it makes it for people who aren't native speakers. Or whether it does at all?
It would be good if there were more documentation available on i18n issues. Tomohiro KUBOTA wrote an excellent description of Japanese issues, now part of the Debian documentation project. I'd suggest that anyone here who's familiar with a particular locale/language contribute a section; so far there are only Japanese and Spanish. The current document is available through the Debian documentation project. I think it would be especially valuable to have sections on languages with non-Roman scripts.
There's also a website, i18nlinux.org, that came out of the same effort, but it seems to be stalled at this point.
On an unrelated note, I can really relate to monniaux's complaint about documentation. I've had the same problem installing Chinese support on our lab computers. The Debian packages presumably work, but I can't even tell if the help file is being properly displayed! Fortunately, these things are easy to solve through collaboration with a literate speaker.
As a point of comparison, I tried out the Arabic, Chinese and Japanese support on a friend's iMac last week. Much as has been described for Win32, there's an extra menu that lets you select your input method (this applies to switching Roman keyboard layouts as well). This didn't seem to switch the localization of any of the applications; it just let you enter text in a different language, and you can mix scripts freely. There were some input-specific menus which were localized - and for some languages also available in English - things like character dictionaries and the ability to choose among various input methods and (I think) character encodings. Generally I found it much easier to deal with than the options under Linux, but still far from ideal.
Moving even further afield, does anyone know about handwriting recognition? I've seen little 3x5 cm drawing tablets sold around here, apparently specifically for Chinese input. The character dictionary is of course much bigger, but I'd think the well-defined order of strokes would help a lot with recognition. Can anyone confirm this? OTOH, a cursive script like Arabic is probably much harder than Roman, where we can at least print.
Being of a foreign persuasion (Swedish) and fully functional in at least Swedish and hopefully English, I can say there are a few pitfalls and annoyances to think of when translating anything (especially from English to Swedish, that being what I am familiar with):
I have a feeling that a good way of doing a software translation is to first make a quick-and-dirty dictionary atta^wtranslation, then hand it over to several people who are bi-lingual in the source and target languages.
And, to really make my life hard, I hereby volunteer to do what I can, if someone needs help in translating to Swedish.
To those complaining about phone numbers: the 11-digit international phone number is an international standard. If you expect to be able to make/receive international phone calls without problems, you ought to observe it. That's one reason why some countries which used ad-hoc schemes for cellular phone numbers later changed them to conform to the 11-digit universal number.
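For the record, the standard this comment alludes to is ITU-T E.164, which actually caps an international number (country code plus national number) at 15 digits total, with no fixed split between the parts and no single universal length. So a form should validate a length range, not an exact count. A hedged Python sketch (the lower bound of 7 is an assumption for illustration, not taken from the standard):

```python
import re

# International number: optional "+", then 7 to 15 digits.
# E.164's real constraint is the 15-digit maximum; the minimum of 7
# here is only an illustrative sanity check.
E164 = re.compile(r"\+?\d{7,15}$")

def looks_like_e164(number):
    """Loosely check a phone number after stripping common punctuation."""
    digits = re.sub(r"[ \-().]", "", number)
    return bool(E164.match(digits))
```

This accepts both the Egyptian commenter's 11-digit mobile number and shorter numbers from small countries, while still rejecting obviously malformed input.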