Nowadays, free software increasingly tries to speak foreign languages, reflecting the fact that the free software development community is spread across dozens of countries around the world. For instance, with the environment variable LANG set to fr_FR on a Red Hat 6 machine, I get the following behavior:
[monniaux@quatramaran monniaux]$ ls --help
Usage: ls [OPTION]... [FICHIER]...
Afficher les informations au sujet des FICHIERS (du répertoire
courant par défaut). Trier les entrées alphabétiquement si aucune
des options -cftuSUX ou --sort n'est utilisée.
-a, --all afficher les noms cachés débutant par .
...
Most users in non-English-speaking countries will be delighted to see computers at least trying to speak their mother tongue. Some people, particularly in the United States, tend to believe that most people around the world speak English, at least as a foreign language. This is a myth. The reality is rather the following:
I shall not deal here with the technical issues involved, like the use of Gettext. I will rather try to focus on simple acts that can make internationalization and localization better. Internationalization (i18n) is the process of preparing a program for localization (l10n): it isolates all the country- or language-dependent parts of a program, so that l10n can then adapt them for particular countries or languages.
I see quite a few thorny issues, on which I am going to give my opinion:
postal addresses and phone numbers are often formatted somewhat differently between countries. I have seen all too many Web sites refusing French postal addresses for want of a "state" field, even though France does not use states for postal addressing, and even though those sites claimed to serve international customers. Furthermore, French postal codes go before the name of the city, not after. For instance, a French mail address might look like:
M. Henri Martin
Martineaud SA
13, rue du Moulin Vert
75361 PARIS cedex 11
Now imagine that that person gets his own address refused by the program because he did not specify a state!
I do not advocate putting a database of postal address and phone number formats in programs. Instead, programs or WWW sites should accept free-form addresses. Let us stop second-guessing users and assume that they can at least write addresses correctly.
Another country-specific trait is the use of pictograms or jokes. Pictograms often refer to cultural traits (like gestures meaning "it's ok"; the gesture meaning "it's ok" for Americans can mean "it's a zero, absolute crap" to the French) or to road signs (which are different between Europe and the US). Puns cannot be translated easily. Furthermore, jokes often make references to events and people totally unknown outside the country of the author: would an English-speaking Canadian understand that "eat apples" is a reference to a joke about the French presidential campaign in 1995? Likewise, references such as "beam me up!" are meaningful only to those who have watched an English version of Star Trek.
Locales have tried to deal with the currency unit. Even more annoying are length units and paper formats. Inches and the US Letter paper format are unknown to many people around the globe, who use centimeters and A4. Not only should programs be able to accommodate both standards, but this should be a customizable item. Even better, the default value should be locale-dependent.
All too often, mysterious vocabulary, coined terms and the like prevent translators from working efficiently. They also often prevent even native speakers from understanding what is being dealt with. If specialized terms and coined words are necessary, they should be explained in the documentation.
Translators should try to find the commonest translation of the word, possibly looking at the major commercial products in the same field. When the foreign version of the word is more common, they should stick to it, no matter the official version.
Translators should really be careful not to translate text into gibberish. The GNU libc French locale contained (or maybe still contains) some ridiculously translated strings. I think that, when unsure, one should abstain from translating. Translation is best left to those who master the target language and have a good command of the source language.
I would be very much obliged for some constructive comments, because I am quite sure I have overlooked many other thorny aspects. Notably, the use of Unicode and composite characters looks like a must for future developments. I would like people speaking Japanese, Korean or other languages with large scripts to give their insights on this subject.
I agree with you on the point that including databases of postal codes in programs is bogus. I'd just like to point out, though, that for the most part, the reason address fields are usually so chopped up, with input required in all of them, is the way the information is stored. It's not that the programmer doesn't trust the user; it's just much easier to deal with the address internally if you break it up into a bunch of pieces and store it that way, as in a relational database, rather than putting it into one "block" of information and then going through the added pain of extracting pieces.
I think most of the negative aspects of the way things like that are done are due to the limitations in the program put there (not necessarily on purpose) by the programmer - not that the programmer doesn't trust the user.
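The trade-off described in this comment can be sketched quickly: keep the address as one free-form text block and separate out only the fields that are genuinely universal. This is an illustrative Python sketch; the field names are invented for the example, not taken from any real schema.

```python
from dataclasses import dataclass

@dataclass
class Address:
    country: str   # one of the few fields that really is universal
    text: str      # free-form lines, stored exactly as the user wrote them

# The French address from the article fits without a "state" field.
addr = Address(
    country="FR",
    text="M. Henri Martin\n13, rue du Moulin Vert\n75361 PARIS cedex 11",
)
print(addr.text)
```

The relational cost is one text column instead of five varchar columns; the benefit is that no valid address anywhere in the world gets rejected for failing to fit a US-shaped schema.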
As a general followup both to the Advogato's number article, which touched on this subject, and to this, I think you're trying to make this problem much bigger than it is. And the main reason is, I believe, that you're French.
Basically, France has a foreign language problem. There is no other country in western Europe where comprehension of English is so poor, and resistance to learning foreign languages is so strong. There are doubtlessly historical, political and cultural reasons for this, but the fact remains: French programmers are handicapped in the international community, because they're unable to write and comment code in the language the rest of the community uses. I'm not a native English speaker, I'm Norwegian, but both in Norway, and here in Mexico (a country notorious for how small a percentage of the population is fluent in English) programmers tend to know enough English to write and comment code "properly", and also comprehend English documentation and tools. However, a French programmer living in Mexico whom we were working with, was unable to do this.
Similarly, my mother, who is 50 years old, manages to use English language Win98 on her home computer, and most office workers in Norway run at least a few English language programs on their computers, seemingly without problems.
So I must repeat my earlier opinion: Making free software understand and produce international scripts, through projects like Pango is much more important than actual translation efforts. As for the French, I hope and trust they will at some point see that their xenophobia is becoming a liability, and change the attitude. I'm afraid it'll take time, though, and until then, French programmers and French software industry will be at a disadvantage.
I do agree that there is nothing more annoying than input forms that have no idea of cultural backgrounds different from the writer's own. In Ireland we just do not have postal codes outside of Dublin, and there the numbers range from 1 to 16 (or thereabouts). Again and again you hit this problem, especially on the Web, where a form will reject an address because the concept of not having a postal code breaks its logic; and of course the state issue is another nonsensical problem.
Another thing about free software and localization is a thought I had some time back while reading about Iceland, where the population is considered too small for many commercial companies to provide translations of their software into the native language, which the Icelanders are trying hard to maintain. It strikes me that free software makes it very easy for speakers of smaller and more ignored languages to do their own localization, bringing native-language versions of software to speakers of Tamil and Thai, to name two that have caught my attention. (Aside: I have my doubts about the common belief that the Web and greater connectivity in general will finally wipe out all the smaller languages.) But there is the other issue mentioned in the article: the problem of translating when an appropriate word or phrase is missing. There appear to be two reasons this might happen
I strongly agree. I often have addresses rejected because I need to input a 'state' or I have to insert a 'zip code' (mail gets delivered in Egypt WITHOUT zip codes - this is related to illiteracy, by the way).
Recently I also encountered another problem. They expect '10-digit' phone numbers (intl. code + number). This is a problem because different countries have different lengths (some small countries have 6 digits + a 3-digit intl. code). My mobile has a 2-digit country code, a 2-digit city code and then 7 digits - a total of 11. I noticed this on the TWA site, by the way.
There are also two types of computer users (generally): 'developers' and 'users'. (Actually, I could classify computer users into much more specific types, and I think I could write an essay/article about that - but let's assume that everything falls under these two broad categories.) Developers need to know English. No doubt about that. They might need to know English /very/ well to be good programmers too. All computer languages are in English or are based on English. A lot of computer terms also stem from English, and everything else is a translation (buffer, core, RAM (Random Access Memory), etc.)
Users don't care what the developer does. They want a LOCALIZED version. This is a two-step process: first i18n, then l10n (by different people/groups, usually). This is VERY important in certain areas. I firmly believe that Linux has not gotten widespread popularity in the Middle East because of this. Companies/governments/shops/businesses - they all want ARABIC stuff. The official language is ARABIC, so all their legal stuff is in Arabic. A /HUGE/ selling point of Linux and other open source stuff is that it's /FREE/ - you can download all programs off the net. They love this here - but then you have to tell them "You can't have Arabic" - so they stick with 'Arabic Windows'. The fundamental problem with i18n/l10n is DIFFERENT SCRIPTS. The English alphabet is used in A LOT of languages (i.e. even if you don't have those little accents on top of the characters, you can still generally understand French). But you can't write Arabic in English. Actually, the IRC world has pioneered this department - they have invented ways of writing Arabic in English :) For example, they use English equivalents of letters (to make the sound) and use numbers to represent the 'missing' letters - the two most popular ones are '3' and '7' (since they kind of 'look' like the Arabic characters). Obviously, though, this is not a REAL solution (i.e. you can't have your official business or government documentation in that, and you need to know English (at least the alphabet) to understand it).
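The digit convention this comment describes can be sketched as a simple substitution table. Assumption, hedged: '3' is commonly used for the letter ain (U+0639) and '7' for hah (U+062D); the two-entry table below is only a tiny illustrative sample of the full chat alphabet.

```python
# IRC-style Arabic transliteration: digits stand in for Arabic letters
# that have no Latin-alphabet equivalent. Illustrative sample only.
CHAT_TO_ARABIC = {
    "3": "\u0639",  # ARABIC LETTER AIN
    "7": "\u062D",  # ARABIC LETTER HAH
}

def dechat(s):
    """Replace chat digits with the Arabic letters they stand for."""
    return "".join(CHAT_TO_ARABIC.get(ch, ch) for ch in s)
```

As the comment says, this is a workaround born of missing script support, not a real solution: the output still has to be rendered by a toolkit that can shape and join Arabic script.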
I believe Pango is the /BIGGEST/ step in this department. Being able to NATIVELY show different scripts is a big plus. I know you can do CJK right now in X Windows (is it possible on the console?) - you can't do Arabic in X Windows. (There are a few very limited applications that can: AraMosaic (an Arabic web browser), and a proprietary X application (free beer) - a word processor. They also sell a 'toolkit' of some sort so you can write Arabic applications. I think the word processor is statically linked to the toolkit, but you (developer and user) need to buy the toolkit for other applications.) Anyway, Arabic also has an 'arabized' console called 'acon' - you run it, and this software lies between your terminal and your screen. It recognizes certain (Arabic) codes on your screen and renders them appropriately. So I think you can mix & match Arabic/English on the console thanks to acon. I also believe it's free. Arabic is sorely missing a 'free' (libre) renderer in X.
The answer to my comments is: PANGO :)
"As for the French, I hope and trust they will at some point see that their xenophobia is becoming a liability, and change the attitude."
Ever been in the metro in Paris? You would then notice that most signs are subtitled in two languages, including English. Even the conservative Senate has a Web site available in several languages. I do not, therefore, think there is some kind of concerted effort to ward off English. The problem is perhaps in the way foreign languages are taught; pretending to have magic recipes to solve such complex educational problems is pretentious at best.
My experience is that most people with a reasonable level of education (my mother, for instance) can deal with application menus in English, as long as the vocabulary does not become obscure and/or very specialized. That is why I recommend that even in the English version, obscure terminology should be kept to a minimum. If for the sake of brevity or precision some obscure terminology should be used, its meaning should be recalled by an easily accessible online help.
However, even people who can cope with an application in English do not necessarily want to read all the documentation in English. Declaring that free software should be available only to those who master English is akin to declaring that it should be reserved for those Unix-savvy enough to know how to set up custom initialization scripts. Proprietary software vendors have long understood that users want software that does its task without requiring too much learning. That includes not requiring users to know too much about the internals of the system, and not requiring them to learn foreign languages.
I agree that translation is not the only answer. Making the English version clearer could be very valuable too. All too often, programs and documentation contain abbreviations, jokes and allusions that are not essential to the use of the system; such "clever" language greatly hampers comprehension. Please note that having learnt a foreign language at school does not imply having seen 15 years of TV in that language.
For further reading, Jakob Nielsen has written some interesting papers on internationally-oriented WWW sites and international usability testing.
English is the language of business.
Programming is very much a business.
Internationalization is a neat idea; it works well if your mother tongue has words that can be exchanged without losing meaning, especially where syntax is concerned.
My thoughts on developing? Screw it, you'd best learn English if you're going to do programming on a computer. You can better communicate ideas in English to most of us who are developers anyhow. Not to mention that 99.9% of programming languages are based on English words and syntax.
As far as the French language in particular goes, it might be really great for law, because it is very concise and clear. It has no place in something that I personally regard as abstract art. The art of programming is very abstract. It almost requires an abstract language and should continue to do so. For example, if it were all concise, it would stagnate very quickly. Abstract thought is what fuels new ideas and concepts. Sometimes a little confusion is good, because it forces others to spawn entirely new projects.
Arabic is a mystery to me, so I have no thoughts in regard to that - but hey, English does borrow your symbols and concepts for number representation... :-)
Enuff jabbering, I have a very small footprint, highly portable TCP/IP stack to complete.
ajk
I see that some of the replies posted here tend to confuse two categories of users: the average user (think about your mother) and the programmers.
I think that it is obvious that the programmers must know English. Furthermore, I firmly believe that a (very) good knowledge of English is a prerequisite for becoming a good programmer. See my comment (#9) attached to this previous story for more details about this. In summary, any student in computer science can learn the basics without knowing English, but it is not possible to go very far like that, because many APIs are written in English, as well as many of the best reference books and most of Open Source code, which is probably the best way to learn new tricks.
But things are different for the users who do not intend to write any code and who use the computer as a simple tool to get their job done or to have some fun (usually without knowing what is really inside the beast). As others have pointed out, most of these users have a very limited knowledge of English, if they know it at all.
I live in Belgium, a country that has three official languages: Flemish (Dutch), French, and a bit of German (again, see my previous comment for more background info about me). Many young children start by learning one of the other official languages of the country before learning English. Some of them learn English afterwards and manage to get quite good at it, but others don't. So, contrary to what some people might think, it is not true that almost all kids nowadays learn English at school. Why should they, anyway? If they intend to get a job as an accountant, social worker, or anything that involves contact with the local population but not with foreigners, it is more important to know the languages spoken locally than to know English (except when you go on vacation, but then knowing Spanish or Portuguese might be equally important). In many cases, this is actually required: the majority of administrative jobs in Belgium require the knowledge of Flemish and French only. This is true for countries that have more than one official language (like Belgium or Switzerland) and for countries that have strong local dialects sufficiently different from the official language.
If English is your third or fourth language, chances are that you cannot understand it very well. And nobody should blame you for that.
Even for those who understand English reasonably well, this often involves some extra effort: a novice user has to remember that the "Save" option under the "File" menu is the thing that makes sure that her work is not lost. People who are not so familiar with English must in addition remember what these words mean. I took a simple example, but think about the meaning of options such as "merge visible layers" or "round rectangular selection" in the GIMP. Their meaning should be obvious for a native English speaker, but not for someone who has to translate word for word. Even if they know the individual words, this implies an extra memorization effort that makes the tool harder to use.
Besides, if everything I wrote above were complete nonsense, why would all commercial software companies invest so much in translating their programs into as many languages as possible?
A very well written, easy to read, and to-the-point article. Having applications show up in your native tongue is a very important issue, and I think those of us who speak English and are used to using English apps don't realize what we take for granted.
Boot up your system with LANG=fr for a day and realize that anything you can read in English, whoever is really using that tool cannot.
I've tried turning on a different language, namely fr_FR, to see how well the tools that I use on a regular basis are translated. What I noticed is that in many GNOME applications, the translated text is not really a phrase or sentence in French. I know that the official French language does not have anywhere near the number of words that English has, but this is ridiculous.
Not to pick on Balsa (many other GNOME apps are just as bad), but many of the preferences options are quick phrases that are not easily translated to other languages. The same thing goes for the text and hints for toolbars, and menus. I think that application writers need to make sure that they are clear in what they are trying to communicate to the user.
First of all, reading some comments I fear I was not clear enough about the software parts I want internationalized and localized. I think that
From the feedback I've received on Advogato and by mail, it seems indeed that, apart from language issues, there are lots of annoying behaviors caused by software assuming that some country-specific or language-specific standards are universal:
Myself and others have mentioned the stupid behavior of software and WWW sites enforcing US-style addresses (requiring state & ZIP code for instance, but this is not the only issue) or US-style phone numbers.
Eric Moreau reports that French-language Windows sets the keyboard as AZERTY (standard in France) even though Quebec uses special QWERTY keyboards. All the same, French-language versions of some software assume that the paper size is ISO A4 (international standard, used by nearly everybody outside North America) while Quebec uses US Letter.
The issue here is that locale-specific information is not only a question of language, but also of many "cultural" issues and local standards. The locale for French-speaking Canada is fr_CA, and differences such as the ones cited above have to be taken into account. There are similar issues between the United States (en_US) and the United Kingdom (en_GB).
I think that defaults for paper size, measurement units (length, temperature...) and the like should be set according to the locale. Be sure to consider the full locale, not only the language part. However, such settings should also be fully customizable. There are some peculiar situations (for instance, people from one country working temporarily in another country and using an operating system fitted for their language) where some settings must be taken from the locale and some others overridden. Imagine an American working in France: he wants his software in English, but he also wants A4 paper to fit the printer.
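The resolution order just described (explicit user setting first, then the full locale including territory, then a sensible fallback) can be sketched in a few lines. This is illustrative Python; the territory table is a tiny invented sample, not a complete database.

```python
# Territories that default to US Letter; everyone else gets A4.
# Illustrative sample only -- a real table would be far larger.
LETTER_TERRITORIES = {"US", "CA"}

def default_paper(locale_name, user_setting=None):
    """Pick a paper size: explicit setting wins, else use the territory."""
    if user_setting is not None:
        return user_setting
    # "fr_CA.UTF-8" -> territory "CA"; unparseable names fall back to A4
    _lang, _sep, rest = locale_name.partition("_")
    territory = rest.split(".")[0]
    return "Letter" if territory in LETTER_TERRITORIES else "A4"
```

Note that fr_CA correctly yields Letter (Quebec) while fr_FR yields A4, which is exactly why the territory part of the locale must not be ignored; and the American in France simply passes an explicit A4 override.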
There have been some comments on Arabic. I'd be delighted to hear more details about Arabic (apart from the fact that it's written right-to-left). Namely:
Arabic is relatively simple when it comes to keyboard stuff. We have 28 letters - you guys have 26. There is a slight problem, though: our alphabet is in script format, so it makes a big difference whether a letter appears at the start, middle, or end of a word. Actually, the software handles that, but there are a few exceptions (and thus those keys are on the keyboard). We also have something called "tashkeel": symbols that are put on letters to change the way they are pronounced (like accents in French). Anyway, the keyboards sold here basically have both the English letters and the Arabic letters on the keys, using the space of some symbols for letters (i.e. if you want the symbols, go into English mode). What Microsoft does is have a little 'docklet' thingy (next to the time) that says 'En'; you click on it (or right-click) and you can choose Arabic.
As for direction changes in the middle of text - well, that is a surprisingly complex issue for such a simple idea :) There is a whole ruleset that does it. Furthermore, each ruleset does it slightly differently. There is a Unicode standard, and I believe Microsoft does its own thing; Netscape does its own thing too, I think (although Mozilla might be Unicode compliant). By the way, even though the Unicode thing is the theoretical "standard", I find that it does not make sense (but apparently it does to everyone else). KDE had screenshots a while back (at mosfet.org) that show Hebrew/English text directions.
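The Unicode ruleset mentioned here (the bidirectional algorithm) starts from a per-character direction class, and that first step is easy to see for yourself: Python's standard unicodedata module exposes the class of each character. The full reordering algorithm is far more involved; this sketch only shows the classification it builds on.

```python
import unicodedata

# Per-character bidirectional classes, the input to the bidi algorithm:
# 'L' = left-to-right letter, 'AL' = Arabic (right-to-left) letter,
# 'EN'/'AN' = European/Arabic-Indic digits.
for ch in ["a", "\u0627", "1", "\u0661"]:
    print("U+%04X -> %s" % (ord(ch), unicodedata.bidirectional(ch)))
```

Mixing an 'L' run with an 'AL' run is what forces the renderer to decide on a display order, which is where the implementations the comment lists start to disagree.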
Furthermore, Arabic has another problem :) Here we have an Islamic calendar (which is not really used to 'schedule' stuff). It is lunar based - but it is /worse/ than the Hebrew calendar (also lunar based). The Hebrew calendar can be determined /algorithmically/ - ours can't: it is based on MOON SIGHTINGS. I.e., if there is a very cloudy end of month or something, the date can shift. There is no way you can determine the months of the Islamic calendar in advance - there is always the possibility of error :) Thankfully, though, they only "check" the moon sightings twice a year (during the religious festivals (Eid)). I'm gonna try and work on it after Eid this year (which falls at the exact same time as GUADEC, by the way) so the calendar can be 'stable'. See the calendar-list@gnome.org archives for more info :)
I think it speaks volumes that everyone is discussing Western European or Central European languages. We have a lot to do beyond that. Different writing directions and fonts are an obvious beginning.
Also, calling something a 'French problem' neither makes it go away in France nor solves it. It isn't a problem for just France. Icelanders are very proud of their language. "Use English" isn't an acceptable answer. It's a deep cultural thing.
And once you get to Japan, English is a big barrier. I'm happy that I have folks translating some of my articles into Japanese. There is a HUGE barrier between Japan and Europe/USA in free software. I only really got a feel for the scale of it when I got the PC110 Japanese palmtop. You try finding and configuring a special X server when the notes are in a language you cannot read and the links aren't all obvious to follow.
Similarly, we have a Linux/PC98 porting project. Anyone here heard of it? Probably not - why? Because you need to speak Japanese to follow it.
I would really love to have EN<->JP autotranslation tools, even ones as bad as Babelfish.
Alan
While I agree fully that writing internationalized software is important (indeed, it forces you to think about layering and abstraction issues you probably ought to be thinking about anyway), this thread has brought up a different interesting issue.
When we write code, we are not actually writing "in English". We are writing in some programming language, which has bits of English and bits of discrete math and type theory strewn through it. I have found it pleasantly surprising that in some cases I can exchange code fragments with a person with whom I cannot discuss, say, the weather (albeit with some difficulty translating verbs and nouns - at least you can look such things up in dictionaries). This phenomenon was even more pronounced when reading math texts in another language, where the mechanism of proof is so similar and the notation so precise and terse that not speaking the language of the narration seems like much less of a problem than when, say, one tries to read a newspaper.
Comparing this situation to the deep culture shock of being in a society where one cannot read or hear anything, it seems clear that the programming community has a bit of a leg up in getting to know one another cross-culturally since we can at very least get code ideas from one another.
I wonder in which cultures, natural languages and programming languages people have found the language barrier mitigated by the ability to express things in code. Code has a much simpler grammar and vocabulary; if you can translate the identifiers a person is using, it seems possible to use the code as a makeshift basic communication system. Are certain languages better or worse for this? I recall reading a page which stated that FORTH was a very "Korean-friendly" language due to its postfix form, but that seems somewhat of a surface issue. Has anyone here ever programmed in a language written/designed by a culture you consider "foreign"? Western Europeans and Americans invented most of the languages I use commonly, I think, but I don't know for certain.
I agree with Alan that Japanese is indeed difficult to handle:
I would like to hear some details about these two points. I know the Emacs-MULE kanji and kana input system. I know that other applications must use software such as kinput2. How are such things handled in Gtk+?
This leads me to another issue: documentation. A Japanese-speaking friend asked me to install a Japanese-aware Linux. The problem was the same as Alan's: nearly all the available documentation for this is in Japanese! I don't speak Japanese, and my friend doesn't know computer terms, even when they are English words in Japanese pronunciation.
The lesson to be retained from this problem is that even though some software is meant for users speaking a certain language, documentation (at least for installation) should also be available in English. Two reasons:
Sorry to be so verbose these days.
The quality of some free software translations is not adequate. It seems that translations, including those in software such as GNU libc, GNOME and KDE, suffer from the following problem:
Even though a bilingual dictionary says a word can be translated to another, those words are not necessarily equivalent. For instance, some software (GNU info, if I remember correctly) says "Fouille infructueuse" when a search has failed. Fouille indeed means search, but in the sense of an archeological dig or of police searching a suspect; the correct translation is recherche.
Ridiculous translations such as the one quoted above can often be avoided by following simple rules:
Being a native English speaker (well at least in most eyes, some people aren't sure that my Birmingham accent counts as anything but Caveman) I normally get to avoid translation problems
Reading SuSE manuals is something I recommend for people who want to get an idea of what it must be like. These are good translations, but you will find random quirks in them, and stray bits of German or German screenshots.
Like many native English speakers (in the UK, at least), I have a very meagre grasp of other languages. I have recently got into attempts to document things. I have learned some things: bad puns do not translate; using a word with two meanings is a mistake unless you clarify which meaning you intend; and simple grammar is good. (Yes, I know I'm bad at the last one.)
One thing I do wonder: how much does misspelling matter? I can follow some French and the occasional tiny piece of German. This usually involves heavy use of a dictionary. When people use contractions, I find it hard. But sometimes they use words which are not in my dictionary at all. I am never sure whether they're real words that my dictionary is too polite to mention. Or whether they're misspellings.
Is the same true for other languages? I imagine it must be. How much of a problem is that? I know I sometimes have trouble with some writing in English. But am I patronising other people (who tend to be a lot better at my language than I am at theirs), or is it important to check spelling first?
Personally speaking, I hate bad spelling and grammar from people who should know better, but that's not really the question. The question is how bad it makes it for people who aren't native speakers. Or whether it does at all?
It would be good if there were more documentation available on i18n issues. Tomohiro KUBOTA wrote an excellent description of Japanese issues, now part of the Debian documentation project. I'd suggest that anyone here who's familiar with a particular locale/language contribute a section; so far there are only Japanese and Spanish. The current document is available through the Debian documentation project. I think it would be especially valuable to have sections on languages with non-Roman scripts.
There's also a website, i18nlinux.org, that came out of the same effort, but it seems to be stalled at this point.
On an unrelated note, I can really relate to monniaux's complaint about documentation. I've had the same problem installing Chinese support on our lab computers. The Debian packages presumably work, but I can't even tell if the help file is being properly displayed! Fortunately, these things are easy to solve through collaboration with a literate speaker.
As a point of comparison, I tried out the Arabic, Chinese and Japanese support on a friend's iMac last week. Much as has been described for Win32, there's an extra menu that lets you select your input method (this applies to switching Roman keyboard layouts as well). This didn't seem to switch the localization of any of the applications; it just let you enter text in a different language, and you can mix scripts freely. There were some input-specific menus which were localized - and for some languages also available in English - things like character dictionaries and the ability to choose among various input methods and (I think) character encodings. Generally I found it much easier to deal with than the options under Linux, but still far from ideal.
Moving even further afield, does anyone know about handwriting recognition? I've seen little 3x5 cm drawing tablets sold around here, apparently specifically for Chinese input. The character dictionary is of course much bigger, but I'd think the well-defined order of strokes would help a lot with recognition. Can anyone confirm this? OTOH, a cursive script like Arabic is probably much harder than Roman, where we can at least print.
Being of a foreign persuasion (Swedish) and fully functional in at least Swedish and hopefully English, I can say there are a few pitfalls and annoyances to think of when translating anything (especially from English to Swedish, that being what I am familiar with):
I have a feeling that a good way of doing a software translation is to first make a quick-and-dirty dictionary atta^wtranslation, then hand it over to several people who are bi-lingual in the source and target languages.
And, to really make my life hard, I hereby volunteer to do what I can, if someone needs help in translating to Swedish.
To those complaining about phone numbers: the 11-digit international phone number is an international standard. If you expect to be able to make/receive international phone calls without problems, you ought to observe it. That's one reason why some countries which used ad-hoc schemes for cellular phone numbers later changed them to conform to the 11-digit universal number.
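For the record, the standard this comment alludes to is ITU-T E.164, which actually caps an international number (country code plus national number) at 15 digits total, with no fixed split between the parts and no single universal length. So a form should validate a length range, not an exact count. A hedged Python sketch (the lower bound of 7 is an assumption for illustration, not taken from the standard):

```python
import re

# International number: optional "+", then 7 to 15 digits.
# E.164's real constraint is the 15-digit maximum; the minimum of 7
# here is only an illustrative sanity check.
E164 = re.compile(r"\+?\d{7,15}$")

def looks_like_e164(number):
    """Loosely check a phone number after stripping common punctuation."""
    digits = re.sub(r"[ \-().]", "", number)
    return bool(E164.match(digits))
```

This accepts both the Egyptian commenter's 11-digit mobile number and shorter numbers from small countries, while still rejecting obviously malformed input.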