Older blog entries for roozbeh (starting at number 158)

Fonts and Languages: I was repackaging my fonts for Fedora 11 when something caught my attention. The font packaging policy asked for the list of languages my font package supported. But it was a font with a wide range of Latin and Cyrillic glyphs, and it probably supported dozens of languages. At the same time, I found that Fedora 11 is considering supporting automatic font installation. Among other things, this means that we need to know which fonts support which languages.

Font files don’t have that information directly. How would a font designer know that his font supports Aruban Papiamento just fine, which uses a different orthography than Papiamento as written in the Netherlands Antilles, for example? What about African or Native American languages? Or Mongolian? Or Kurdish? He just designs and tests glyphs for the characters and languages he is interested in. If the resulting font happens to support Filipino too, good for him and his users; if it doesn’t, he may not care. At best, a list of the languages the font designer believes the font supports may be found somewhere in the documentation.

In the present freedesktop stack, the language support detection task is done by fontconfig. When an application, like Firefox, wants to display text in some language, a text layout engine, like Pango, will ask fontconfig for a font that supports displaying text in that language (possibly with some other properties, like the font being bold and sans serif). fontconfig then uses its various font suggestion rules and orthography files to give the best font it can find back to the engine. If fontconfig doesn't know anything about the language, or has wrong information, it may give you something totally off, like a Latin or Devanagari font for a language written in the Arabic script.

What font designers may not know (or care about), fontconfig needs to know. The usual way of knowing, especially for not-very-famous fonts or languages, is through orthography files. These files contain a list of Unicode characters that play a letter-like role in the language. For example, for French, it is a list of basic Latin letters plus all the ligatures (like œ) and accented letters (like ï). fontconfig runs the list through each font installed on your machine and sees if it has glyphs for all the characters listed. If it does, the font is assumed to support the language.
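The check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not fontconfig's actual code: the function names, the two made-up "fonts" (represented simply as sets of characters), and the partial French character list are all assumptions for the sake of the example.

```python
import string

# Hypothetical, partial stand-in for an orthography file: the basic Latin
# letters plus a few of the accented letters and ligatures French uses.
FRENCH_EXTRAS = "àâçéèêëîïôùûüÿœæ"
FRENCH = set(string.ascii_lowercase) | set(FRENCH_EXTRAS)

def supports_language(font_chars, orthography):
    """Return True if the font has a glyph for every listed character."""
    return orthography <= font_chars

def missing_characters(font_chars, orthography):
    """Return the listed characters the font lacks, sorted by code point."""
    return sorted(orthography - font_chars)

# Two made-up fonts, described only by the set of characters they cover.
ascii_only_font = set(string.ascii_letters)
wider_latin_font = set(string.ascii_letters) | set(FRENCH_EXTRAS)

print(supports_language(ascii_only_font, FRENCH))
print(supports_language(wider_latin_font, FRENCH))
print(missing_characters(ascii_only_font, FRENCH))
```

The all-or-nothing subset test is the important part: a single missing character (say, œ) is enough for the font to be treated as not supporting the language, which is why a wrong or overly long orthography list matters so much.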

Getting back to my own story, I thought of checking orthography files to see which languages my packaged fonts support. But when I looked into a few, I found several bugs and unsupported languages. Behdad encouraged me to fix them early, for a chance to get the fixes into fontconfig 2.7.

During the past few weeks, I’ve been trying to hunt things down and fix them in my free time. I achieved my first target of matching glibc locales (those without ‘@’). I’m now on my second target of matching languages with two-letter codes; remaining are: Akan, Avestan, Cree, Ewe, Herero, Sichuan Yi, Javanese, Kanuri, Kongo, Kuanyama, Luba-Katanga, Nauru, Navajo, North Ndebele, Ndonga, Ojibwa, Pali, Quechua, Rundi, Sango, Shona, Sundanese, Tahitian, and Zhuang. After that, there are thousands of languages with three-letter codes, which would need an army the size of SIL International.

Everything I did is in my git tree here. If you want to help, file bugs with your findings at http://bugs.freedesktop.org/. You can also check out the existing orthography bugs to avoid duplication.

29 Jan 2009 (updated 29 Jan 2009 at 09:29 UTC) »
These Iranian government officials: I can’t stop laughing.

I was just reading an article (in Persian) about the registration of the 100,000th domain in “.ir”. There’s been an event, with a long list of speakers that includes quite a few Iranian politicians involved in linguistic or Information Technology issues.

The best quote ever is from the highest ranking government official in charge of IT issues: “Engineer Rezaee, the Secretary of the Supreme Council of Information Technology, [...] expressed his gratitude toward the people responsible in the institute [in charge of .ir] for their vigilance in selecting the domain name .ir for Iran, and added that if the choice had not happened in time, other countries like Ireland or Iraq may have chosen it for themselves”. That’s all that is quoted from him, which tells me the rest of his speech was probably worse...

The poor guy probably doesn’t know about standards, and I’m quite sure no one corrected him by pointing to ISO 3166, first published in 1974, years before the founding of the institute in 1989. Even those codes were based on the codes introduced in the 1949 Geneva Convention on Road Traffic. When “IR” was first internationally introduced for Iran, Siavash Shahshahani, the gentleman in charge of .ir’s growth, was seven years old!

Update: According to this Wikipedia page, “IR” has been in use for Iranian cars since 1936 (interesting date, since until early 1935, Iran was internationally called “Persia”). But the article does not cite its sources, so I can’t really confirm it. Still, even if it came into use in 1936, it was definitely not standardized internationally until 1949.

Arabic in movies: I’ve been watching some 24, which is so full of stereotypical “terrorists”. Most of them are Middle Eastern of course. To try to get “balanced”, in a few episodes they go and add a few “good” Muslims or Middle Easterners, probably to protect themselves. Sometimes it gets pretty funny too. To prove the innocence of some Muslim US government agent, someone says “But she’s even a registered Republican!” I really don’t know if they knew it’s funny... Anyway, that’s not what I want to talk about.

What’s really annoying is that to someone who knows a bit about Middle Eastern culture and language, a lot of things are very phony. These are some random things from 24 that I found. (Note: I am not a native speaker of Arabic. I just learned some in school.)

  • There is a hostage execution scene, with the captors talking in front of a black background with Arabic text on it. Guess what the text says: “الموت لأمريكيين”, which means “Death to Americans”! I’m quite sure no “terrorist” would want to say that. “Death to America”, they may say.
  • The names of some Middle Easterners are pretty made up. There is this family, named “Araz”. Now that’s an Azerbaijani name, and no one would really be named Araz if he’s not an ethnic Azerbaijani or from the Caucasus. But guess what? Their first names are very Arab first names (not even names common in non-Arab Muslim world), and their son has a very Persian first name (Behrooz)! A totally impossible combination.
  • The writers seem to have taken “terrorist” names from whatever was at hand. There are two minor terrorists, Arabs in appearance, whose names are mentioned almost next to each other in the same episodes. Guess what their last names are? The first is named “Khatami”, the second “Ardakani”. Where are these names coming from? They come from the full name of the very popular reformist former President of Iran, Seyyed Mohammad Khatami Ardakani. Interestingly, that full name is rarely mentioned, except in one place, an old version of the CIA’s World Factbook. The writers simply got their hands on whatever they could find about “terrorist” regimes, and took the smiling president’s name. They didn’t know that Ardakan is the name of a small city in central Iran, and Arabs would probably not name themselves after that city.
  • Arabic text is not what it looks like in the real world at all. The letters are usually disjoint, each letter on its own, instead of contextual shaping. In some cases, it’s even both left-aligned and left-to-right.

Of course, 24 is famous for showing torture to be working sometimes, depicting huge conspiracies, showing government officials on very foolish errands and breaking laws left and right, and very interestingly, a Democratic Chief of Staff becoming a Republican Chief of Staff in the next administration. (All in all, I really think the world of 24 is a parallel universe. Fun to watch, but not much connection to real world.)

The disjoint Arabic phenomenon is not unique to 24, of course. Even better-produced shows like Lost do it. In Season 4, Episode 9, a TV news program is shown, supposedly in Tunisia, broadcasting something happening in Iraq. The Arabic text is totally disjoint, and unacceptable to anybody who knows anything about the language or script.

I suppose the producers pay people to translate the text into Arabic. Can’t they also make sure the software they use to render the text also displays it fine? If it doesn’t, why bother? Just show some squiggles!

Tintin did it much better, with much lower budget, I guess.

New world: It’s still a couple of months until the beginning of spring, the time we Persians celebrate as our New Year, Nowrooz, the time the world renews itself.

But I think the world renewed itself earlier this year.

But today, I witnessed a new US president, clearly wise, clearly intelligent, and clearly a thinker. I was longing for the day to hear such a thing as “we reject as false the choice between our safety and our ideals” from a US president. Or pearls of wisdom like “know that your people will judge you on what you can build, not what you destroy” or “we can no longer afford indifference to the suffering outside our borders, nor can we consume the world's resources without regard to effect”.

I am so happy to be in this country at such a time as this. And I am surprised at myself for having considered him my ideal US candidate for president ever since I found out about him back in 2004. I didn’t think he would run, I didn’t think he would win, but I followed all his moves. All this time, I cried, laughed, drank, read, informed, and debated. Back home in Iran, in transit, and here in California. I could not vote for him, and would not be able to vote for him in 2012 either, but as a fellow citizen of the world, he has my support.

Congratulations, World! Or should I say, Happy New World!

Fedora: The other weekend, I flew to Boston for FUDCon F11. I mostly did it to reboot myself back into free software contribution, something I hadn't done a lot last year (because of settling in California and various other stressful and depressing situations).

I saw interesting stuff and boring stuff, but the best thing that happened was meeting "spot". He spent a couple of hours with me over drinks, providing free wisdom (and selling me ideas?). He’s so amazing!

Wikipedia: Back in August, there was an article in Washington Post about me and my dear Ahmadinejad, including references to this weblog.

The reporter did not really contact me after the interviews, so I thought the article was cancelled. Apparently it was not.

He called it “Word War III”. It gives interesting insight towards the lifetime of a Wikipedia article. It also has some quotes from me that I find a bit funny now. It’s like the missing piece of this weblog. Read it!

AWOL: I’ve been AWOL for the most of 2008. Sorry guys. Now I’m trying to be back.

The short story is I moved to California in early February, and settling in the western world proved to be harder than it seemed at my age of almost thirty. I am working at HighTech Passport (HTP), an internationalization and localization company based in San Jose. I am HTP’s only “Internationalization Specialist”.

It was also very hard to get here. It all started when the now famous Mahmoud Ahmadinejad was elected as the President of Iran. I wrote a blog post, explaining my understanding of the situation and asking readers to point a way out to me.

Lots of friends and acquaintances wrote to me, with comforting advice and encouragement. But the most useful proved to come from Razvan Vilt, who I believe read my post from either Planet Fedora or Planet GNOME.

Razvan suggested a position at Bucharest, Romania, at the European branch of HighTech Passport. But after lots of paperwork, it proved impossible to get a Romanian work visa for me (there was no clear path). But after a short and depressing hiatus, the US headquarters came up with an offer for what I am doing now as my day job.

It took ages for everything to go through. Being an Iranian was another problem: the usual path for Iranians to go to the US was either to go as a student or to win the green card lottery. Direct go-to-US-for-work cases are very rare for Iranians.

The process of applying for my H1-B visa started in November 2006. In February HTP applied to the US Department of Labour, and in April 2007, to US Citizenship and Immigration Services (USCIS). I was a happy winner of the first ever H1-B lottery, and USCIS answered in June 2007. After finding that the US Consulate at Dubai had no free appointment time in the next three months, I applied at the US Embassy at Ankara in late August. (The trip proved to be an adrenalin-heavy headache, which started because of a travel agent losing our reservation, and ended in pink yoghurt all over a plane, but that’s a story for another day.)

If all went well, I was supposed to start working in San Jose in October 2007. After all, a German colleague was going through the same procedures, and got her visa the next day. But of course, being Iranian complicates everything.

At the embassy, they applied for a Security Advisory Opinion for me and Elnaz, which is basically permission from several US federal agencies to issue the visa (any single one of which can make the procedure very long). We went back to Iran, to wait for it to get ready.

Elnaz's acceptance came in two weeks. But week after week we checked the embassy’s website, and there was no news of mine. The problem was that Elnaz, as a dependent of me, could not travel to the US before me, and her clearance was only valid for three months. And well, it happened: hers expired in December 2007. We were reserving new flights to Ankara and cancelling the previous reservation almost every week.

We reapplied for a new clearance for her, and waited for mine to come. When mine finally arrived in January 2008 (after about four and a half months), it was now her turn. We had assumed that it would be shorter the second time: it wasn’t. After a few weeks we bit the bullet and decided that I should travel to the US sooner: there was a chance that my clearance would expire before hers came, putting us into a forever-repeating loop.

I flew to Ankara again, got my visa, and flew to the US. The next day, in early February 2008, almost three years after I started seeking a job outside Iran, I started working at my new job. Elnaz got her visa and arrived four or five weeks later, with quite a few horror stories on the way.

...

Settling in the US has proved much harder than it seemed, and I plan to tell the stories here some day.

The best news is that I finally have a laptop, a Lenovo X300 that arrived yesterday. It’s quite comfortable: the keyboard is a wonder, and the whole thing is so light, one can mistake it for a book. Fedora 10 is quite fast on it too (although a bit buggy).

Mentioned in Knuth: My name is now mentioned in Knuth!

It is in pre-fascicle 1a, Bitwise Tricks and Techniques (PostScript, 1.1MiB): check the index for "Pournader, Roozbeh".

My contribution is a very small improvement on a UTF-8 bit manipulation trick I talked about in a former blog post.

17 Jan 2008 (updated 17 Jan 2008 at 15:57 UTC) »
Bit manipulation: Ages ago, in October 2005, Federico asked for improvements to a certain g_utf8_offset_to_pointer() function.

This resulted in an optimization match by various people, which Behdad has somehow summarized here (read also the comments).

Fast forward to December 2006, when I was going over the new Unicode book and was trying to make sure GNOME and friends are Unicode compliant. One of the bugs I filed was this one, and some of the answers I received discouraged me from continuing the effort, which basically led me to stop the whole thing. (The bug is about getting rid of legacy support for an old version of UTF-8 which is now considered by the Unicode Standard to be a security problem.)

Then, last month I was reading some draft material Donald Knuth is putting online, for his infamous Volume 4 of The Art of Computer Programming. One of the pre-fascicles he has put online is about Bitwise Tricks and Techniques, which I really enjoyed reading. Knuth, being a Unicode fan, had inserted some interesting exercises regarding UTF-8 and UTF-16.

One of the exercises included a magic (!) formula to replace the utf8_skip_data array (see Federico's post again). It is provided in exercise 197.

Knuth's formula not only needs no memory reference, it's also branch-free (which is considered very good for many modern CPU architectures). The formula does it with four operations, which would become five when adapted to the present formulation used in glib. The only problem is that it only works for proper UTF-8, the version the Unicode Standard requires, but not glib's UTF-8.

I tried to extend Knuth's formula to glib's UTF-8, and did it on paper with two more operations (seven instead of five), using 64-bit boolean arithmetic.

After chatting with Behdad, he told me it's not really worth it to replace the array with the formula (I cannot understand the reasons well enough to explain them here, but I trust him), but he was interested in seeing my extended formula.

So last night, I tried to make sure my formula works fine before emailing it to Behdad. And I found a bug, which meant that I needed to add two more operations to get it done properly, a total of nine operations.

This is my new formula, which is tested and works fine. It may not provide exactly the same results as the utf8_skip_data array for all values, but many of the array's cells are redundant. For necessary cells, it provides the same results:


def utf8_skipper(c):
  t = (c >> 1)^0x7F
  return ((0x924900009201B128 >> ((t & ~(t >> 1))*3)) & 7)+1
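
For anyone who wants to try the challenge, here is a small test harness as a sketch. The formula is copied unchanged from above; `reference_skip` and `count_chars` are helper names made up for this example, and `reference_skip` encodes my assumption of what the skip table holds for each byte value (1 for ASCII, continuation bytes, 0xFE and 0xFF; 2 to 6 for the lead bytes of the longer, legacy-style sequences).

```python
def utf8_skipper(c):
    # The nine-operation formula from the post, unchanged.
    t = (c >> 1) ^ 0x7F
    return ((0x924900009201B128 >> ((t & ~(t >> 1)) * 3)) & 7) + 1

def reference_skip(c):
    # Assumed contents of the skip table, written as plain comparisons.
    if c < 0xC0:
        return 1  # ASCII and continuation bytes
    if c < 0xE0:
        return 2
    if c < 0xF0:
        return 3
    if c < 0xF8:
        return 4
    if c < 0xFC:
        return 5  # legacy 5-byte lead
    if c < 0xFE:
        return 6  # legacy 6-byte lead
    return 1      # 0xFE and 0xFF never start a sequence

def count_chars(data):
    # Walk a bytes object one character at a time using the formula.
    i = n = 0
    while i < len(data):
        i += utf8_skipper(data[i])
        n += 1
    return n

# Byte values where the formula and the assumed table disagree.
mismatches = [c for c in range(256) if utf8_skipper(c) != reference_skip(c)]
print(mismatches)

sample = "naïve œuvre €"
print(count_chars(sample.encode("utf-8")), len(sample))
```

One caveat about this sketch: it leans on Python's arbitrary-precision integers. For ASCII and continuation bytes the computed shift amount is larger than 63, and Python simply returns 0 for such a shift; a port to fixed-width 64-bit integers would need to check how its language and CPU treat over-wide shifts.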

Can you do it in fewer than nine operations? Or with 32-bit boolean arithmetic only? [With no branching or memory access, of course.]

This may just be a mental exercise, but please email me if you could, as I'm starting to feel an affection towards the problem!

The Middle Eastern view: The stand-up comedian Maz Jobrani talks about the Middle Eastern view in his Axis of Evil Comedy Tour. He says that he knew what would happen when he watched Zinedine Zidane headbutt Materazzi: He knew that some French people who considered Zidane one of themselves would suddenly start to say: “This fucking guy’s Algerian!”

So, I just wish to share my own Middle Eastern view. I just read the story on Slashdot about a guy buying a hard drive, finding that it contains bathroom tiles, and having problems returning it. What I immediately thought was: “He did it himself!” I read more, and I saw that many people consider him honest and a victim, which was also the conclusion I came to after reading the comments.

The sad point is, unfortunately, lots of Iranian Muslims lie very easily, and even enjoy it, although it’s something that’s very frowned upon in the scripture and the sayings. Even very religious Muslims who own shops in the bazaar and contribute heavily to religious causes like building mosques lie very easily about everything and even pose as victims. So in this case, I automatically thought that the victim had done it himself, because I had got used to such deceptive self-victimizations. Ah, before I forget, I believe President Ahmadinejad leads them all!

Not that the story has much similarity with Maz Jobrani’s, just that suddenly a Middle Easterner’s view of an event or a story may be so different from the rest of the world’s.

By the way, I also found a wonderful quote in the slashdot comments: “I was tired of North Korea’s harsh penalties for being a citizen. That’s why I moved to Iran!”
