Recent blog entries for roozbeh

Unicode 6.0 was released today. Here is the link to the announcement: http://www.unicode.org/press/pr-6.0.html

The following changes should be interesting to the Persian and Iranianist computing community (based on an original post to the Persian Computing list):

  • Sixteen symbols have been encoded in the Arabic Presentation Forms-A block for use in pedagogical materials and documents discussing the features of the Arabic script.

    Please note that these are not combining characters but stand-alone symbols. These should only be used to display the dots and diacritics in isolation, and not for making new letters. For example, one can *not* use a Seen and add U+FBB6 Arabic Symbol Three dots Above to get a Sheen. If you type that, you will get a Seen followed by three dots. According to the standard, "These are spacing symbols representing Arabic letter diacritics considered in isolation, as for example as in discussions about the Arabic script."

    Updated Unicode chart:
    http://www.unicode.org/charts/PDF/Unicode-6.0/U60-FB50.pdf

  • The Qur'anic character U+06DE ARABIC START OF RUB EL HIZB has had its glyph and properties changed.

    For some unknown historical reason, the character was mistakenly classified as a combining character instead of just a symbol, which made it unusable. The character is now a normal spacing symbol and is usable as originally intended.

    Background document for the change (which I authored):
    http://unicode.org/review/pr-171-rub-el-hizb.pdf

  • Two characters have been encoded in the Arabic script block for use in Kashmiri, one of the official languages of Jammu and Kashmir, the Indian-administered part of Kashmir. The language is written in both Arabic and Devanagari, along religious lines of Muslims and Hindus.

    The two new characters are U+0620 Arabic Letter Kashmiri Yeh and U+065F Arabic Wavy Hamza Below. Also, U+0673 Arabic Letter Alef With Wavy Hamza Below has been deprecated (the first Arabic script character to ever get deprecated in Unicode), and the character sequence <U+0627, U+065F> should be used instead of it.

    Unicode proposal (I'm a coauthor):
    http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3673.pdf

    Updated Unicode chart:
    http://www.unicode.org/charts/PDF/Unicode-6.0/U60-0600.pdf

  • Mandaic has been encoded. Mandaic is the script used by the Mandaeans (mostly living in southern Iraq and southwestern Iran, especially Khouzestan) for liturgical purposes. This is the community that some people believe the Qur'an refers to as Sabians, the third member group of the People of the Book (next to Jews and Christians).

    Michael Everson's proposal:
    http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3485.pdf

    Unicode chart:
    http://www.unicode.org/charts/PDF/U0840.pdf

  • Brahmi is also encoded, which is of use to Iranianists (some Iranian languages like Khotanese have been written in Brahmi).

    The most detailed proposal (although not the final one that got encoded):
    http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3491.pdf

    Final Unicode chart:
    http://www.unicode.org/charts/PDF/U11000.pdf

  • Unicode Standard Annex #9, The Unicode Bidirectional Algorithm, has been updated to include more information and some clarifications. Note that the algorithm has not changed. The update just explains the original intentions in more detail. For the list of informational changes to the text, see the following link (Behdad Esfahbod and I have contributed to this and previous versions of the standard annex):
    http://www.unicode.org/reports/tr9/tr9-23.html#Modifications

  • A new data file has been added to the Unicode character database, listing some characters that are used with several scripts (and which scripts those are). For example, from the data file one can learn that the Arabic Tatweel and some of the Arabic harakat are also used with the Syriac script, the Arabic-Indic digits are also used with Thaana, and the Arabic comma, semicolon, and question mark are also used with both Syriac and Thaana:
    http://www.unicode.org/Public/UNIDATA/ScriptExtensions.txt

  • More than a thousand new symbols have been added, including lots of symbols that you can find on electronics, maps, menus, signs, etc. Most of these were added to support Emoji, symbols mostly used on Japanese mobile phones for text messages, emails, chat, and even cellphone novels:
    http://en.wikipedia.org/wiki/Emoji
    http://www.unicode.org/faq/emoji_dingbats.html

    For you chart browsers out there, here are some of the blocks that contain the new symbols (color-coded yellow):
    http://www.unicode.org/charts/PDF/Unicode-6.0/U60-2300.pdf
    http://www.unicode.org/charts/PDF/Unicode-6.0/U60-2600.pdf
    http://www.unicode.org/charts/PDF/Unicode-6.0/U60-2700.pdf
    http://www.unicode.org/charts/PDF/Unicode-6.0/U60-1F0A0.pdf (playing cards)
    http://www.unicode.org/charts/PDF/Unicode-6.0/U60-1F100.pdf
    http://www.unicode.org/charts/PDF/Unicode-6.0/U60-1F300.pdf (lots of interesting new symbols, including symbols for beverage containers)
    http://www.unicode.org/charts/PDF/Unicode-6.0/U60-1F600.pdf (emoticons, also known as smileys)
    http://www.unicode.org/charts/PDF/Unicode-6.0/U60-1F680.pdf (transport and map symbols)

    Please note that Unicode encodes beverage containers, but not alcoholic beverages (I personally made sure of that, to reduce possible objections). For example, there is no BEER encoded, but only BEER MUG (which is also used for non-alcoholic beer, among other uses).

    Religiously devout people who may object to some game characters or musical instruments getting encoded should note that Unicode implementations are not required to support any specific character, and are allowed to choose their own set of characters to support. The game symbols are encoded only for the sake of Unicode implementations (especially those in East Asia) that need them to support their users.

  • And finally, the official details of additions and changes to the standard, for the hardcore:
    http://www.unicode.org/versions/Unicode6.0.0/
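
The distinction made above between stand-alone symbols and combining characters can be checked against the Unicode character database. The following is a minimal sketch using Python's standard unicodedata module. It assumes a Python build whose bundled database is Unicode 6.0 or later, and the helper name is_combining_mark is made up for this illustration:

```python
import unicodedata


def is_combining_mark(ch: str) -> bool:
    """True if the general category of ch is a mark (Mn, Mc, or Me)."""
    return unicodedata.category(ch).startswith("M")


SEEN = "\u0633"               # ARABIC LETTER SEEN
SHEEN = "\u0634"              # ARABIC LETTER SHEEN
THREE_DOTS_ABOVE = "\uFBB6"   # ARABIC SYMBOL THREE DOTS ABOVE
RUB_EL_HIZB = "\u06DE"        # ARABIC START OF RUB EL HIZB
WAVY_HAMZA_BELOW = "\u065F"   # ARABIC WAVY HAMZA BELOW

# Print the name, general category, and mark status of each character.
for ch in (THREE_DOTS_ABOVE, RUB_EL_HIZB, WAVY_HAMZA_BELOW):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"category={unicodedata.category(ch)}, "
          f"combining mark={is_combining_mark(ch)}")

# Seen followed by the spacing symbol does not compose into Sheen:
# normalization leaves the two characters as they are.
composed = unicodedata.normalize("NFC", SEEN + THREE_DOTS_ABOVE)
print(composed == SHEEN)
```

With a recent enough database, the two symbols should be reported as non-marks and the wavy hamza as a combining mark, which is why only the latter can attach to a preceding Alef.
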
Ahmadinejad: I will be in New York next week, with thousands of other Iranians and non-Iranians, to show my opposition to Ahmadinejad’s being internationally recognized as Iran’s president. He stole the election, and he helped get several of my people killed, raped, and tortured. He is not Iran’s president; he is just another liar, thief, and murderer.

If you wish to join us, information on events is at http://voices4iran.org/.

29 Jun 2009 (updated 29 Jun 2009 at 05:46 UTC) »
Calendrical calculations: For anyone who may be computing Singapore holidays in the future: Singapore’s Vesak Day (Buddha’s birthday) holiday does not follow the Buddhist calendar or the recommendation of the first Conference of the World Fellowship of Buddhists, held in Sri Lanka in 1950 (which recommended the first full moon in May). It is calculated using the Chinese calendar, but it does not fall on the 8th day of the 4th moon, when the Chinese and the Koreans celebrate it. It falls seven days later, on the 15th day (the calendrical full moon) of the 4th moon.

I lost at least three hours today finding out about this, and I found out about it by accident, because I had Calendrical Tabulations at hand and happened to look at the Chinese calendar column. There are several conflicting pieces of information scattered around the internet, which confused me to the point that I thought the actual algorithm was not publicly available.
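
Once the first day of the Chinese 4th month is known, the rule described above is plain date arithmetic. Here is a small Python sketch of it. The function names are made up for this illustration, and the first day of the month has to come from a Chinese calendar table or library (the sketch does not compute it). The 2009 start date used below is my own assumed input, not something taken from an official holiday list:

```python
from datetime import date, timedelta


def day_of_chinese_month(first_day: date, day_number: int) -> date:
    """Gregorian date of a 1-based day of a Chinese month, given the
    Gregorian date on which that month begins."""
    if not 1 <= day_number <= 30:
        raise ValueError("a Chinese month has at most 30 days")
    return first_day + timedelta(days=day_number - 1)


def singapore_vesak_day(fourth_month_start: date) -> date:
    """15th day (calendrical full moon) of the 4th Chinese month."""
    return day_of_chinese_month(fourth_month_start, 15)


def eighth_day_observance(fourth_month_start: date) -> date:
    """8th day of the 4th Chinese month, for comparison."""
    return day_of_chinese_month(fourth_month_start, 8)


# Assumed input: first day of the 4th Chinese month in 2009.
start = date(2009, 4, 25)
print(singapore_vesak_day(start))
print(eighth_day_observance(start))
```

The two results are always exactly seven days apart, whatever the start date is.
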

Losing weight: I just saw arc’s post on losing weight.

Just wanted to share a bit of my own experience with being overweight, losing a lot of it, and then gaining some of it back:

  • One may have misconceptions about how weight is lost and gained. Specifically, one may think that “by eating only what my body needs and some exercise, I can lose weight”. That’s rarely true.
  • You need to understand how diets work. Generally, one doesn’t really need nutritionists. But it’s important to understand the simple science behind dieting, in order to make the whole thing effective and avoid just putting the weight back on.
  • The personal psychology of dieting is important. You need to know why you are doing it, and care about it.
  • You don’t need to spend time thinking about the diet, following it, or even exercising. There are good ways to lose weight without the usual obsessions associated with diets, such as those of the Atkins diet.

I highly recommend The Hacker’s Diet, available online for free. It is written by John Walker, of AutoCAD fame.

The very short book helped me lose about 15 kilos easily (and with no exercising) a few years ago. I have started to diet again these days, with a goal of losing about 30 pounds (almost the same amount, but I now live in the US).

Even if you hate diets and diet books, still read it. I would recommend reading it even if you are not overweight!

Footnote: The author of the book has made all the code he used in the book (with several updates) available as public domain code online. He also runs a server with the tools installed for public use, if you are the lazy type, like me. It's all here.

6 Mar 2009 (updated 6 Mar 2009 at 03:19 UTC) »
Unicode: I am thinking again about the brilliant Joe Becker. I met the gentleman last October in San Jose, when everyone was celebrating twenty years of Unicode. His short 1988 article, titled Unicode 88, is amazing. It is interesting that a lot of Unicode principles remain the same, after twenty years.
Fonts and Languages: I was repackaging my fonts for Fedora 11 when something caught my attention. The font packaging policy involved the list of languages my font package supported. But it was a font with a wide range of Latin and Cyrillic glyphs, and it probably supported dozens of languages. Around the same time, I found out that Fedora 11 is considering supporting automatic font installation. Among other things, this means that we need to know which fonts support which languages.

Font files don’t have that information directly. How would a font designer know that his font supports Aruban Papiamento just fine, which uses a different orthography than Papiamento as written in the Netherlands Antilles, for example? What about African or native American languages? Or Mongolian? Or Kurdish? He just designs and tests glyphs for characters and languages he is interested in. If the resulting font happens to support Filipino too, good for him and his users; if it doesn’t, he may not care. At best, a list of the languages the font designer believes the font supports may be found somewhere in the documentation.

In the present freedesktop stack, the language support detection task is done by fontconfig. When an application, like Firefox, wants to display text in some language, a text layout engine, like Pango, will ask fontconfig for a font that supports displaying text in the language (possibly with some other properties, like the font being bold and sans serif). fontconfig then uses its various font suggestion rules and orthography files to give the best font it can find back to the engine. If fontconfig doesn't know anything about the language, or has wrong information, it may give you something totally off, like a Latin or Devanagari font for a language written in the Arabic script.

What font designers may not know (or care about), fontconfig needs to know. The usual way of knowing, especially for not-very-famous fonts or languages, is through orthography files. These files contain a list of Unicode characters that play a letter-like role in the language. For example, for French, it is a list of basic Latin letters plus all the ligatures (like œ) and accented letters (like ï). fontconfig runs the list through each font installed on your machine and sees if it has glyphs for all the characters listed. If it does, the font is assumed to support the language.
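
The coverage test described above boils down to a subset check. Below is a simplified Python sketch of that idea only. It is not fontconfig's actual code or file format. The function names, the heavily abbreviated French character list, and the two imaginary fonts are all made up for illustration:

```python
def missing_characters(font_codepoints, orthography):
    """Code points required by the orthography that the font lacks."""
    return sorted(set(orthography) - set(font_codepoints))


def supports_language(font_codepoints, orthography):
    """A font is assumed to support a language when nothing is missing."""
    return not missing_characters(font_codepoints, orthography)


# Hypothetical, abbreviated orthography for French (lowercase only).
FRENCH_SAMPLE = {ord(c) for c in "abcdefghijklmnopqrstuvwxyzàâçéèêëîïôœùûü"}

# Two imaginary fonts, described by the code points they have glyphs for.
ascii_only_font = set(range(0x20, 0x7F))
latin_font = ascii_only_font | set(range(0xA0, 0x180))

print(supports_language(latin_font, FRENCH_SAMPLE))
print(supports_language(ascii_only_font, FRENCH_SAMPLE))
print([f"U+{cp:04X}" for cp in missing_characters(ascii_only_font, FRENCH_SAMPLE)])
```

Listing the missing code points, as the last line does, is also a handy way to see why a font is not being picked for a language.
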

Getting back to my own story, I thought of checking orthography files to see which languages my packaged fonts support. But when I looked into a few, I found several bugs and unsupported languages. Behdad encouraged me to fix them early, so that the fixes would have a chance of getting into fontconfig 2.7.

During the past few weeks, I’ve been trying to hunt things down and fix them during my free time. I achieved my first target of matching glibc locales (those without ‘@’). I’m now on my second target of matching languages with two-letter codes; remaining are: Akan, Avestan, Cree, Ewe, Herero, Sichuan Yi, Javanese, Kanuri, Kongo, Kuanyama, Luba-Katanga, Nauru, Navajo, North Ndebele, Ndonga, Ojibwa, Pali, Quechua, Rundi, Sango, Shona, Sundanese, Tahitian, and Zhuang. After that, there are thousands of languages with three-letter codes, which would need an army the size of SIL International.

Everything I did is in my git tree here. If you want to help, file bugs with your findings at http://bugs.freedesktop.org/. You can also check out the existing orthography bugs to avoid duplication.

29 Jan 2009 (updated 29 Jan 2009 at 09:29 UTC) »
These Iranian government officials: I can’t stop laughing.

I was just reading an article (in Persian) about the registration of the 100,000th domain in “.ir”. There’s been an event, with a long list of speakers that includes quite a few Iranian politicians involved in linguistic or Information Technology issues.

The best quote ever is from the highest ranking government official in charge of IT issues: “Engineer Rezaee, the Secretary of the Supreme Council of Information Technology, [...] expressed his gratitude toward the people responsible in the institute [in charge of .ir] for their vigilance in selecting the domain name .ir for Iran, and added that if the choice had not happened in time, other countries like Ireland or Iraq may have chosen it for themselves”. That’s all that is quoted from him, which suggests the rest of his speech was probably worse...

The poor guy probably doesn’t know about standards, and I’m quite sure no one corrected him by pointing to ISO 3166, first published in 1974, years before the founding of the institute in 1989. Even those codes were based on the codes introduced in the 1949 Geneva Convention on Road Traffic. When “IR” was first internationally introduced for Iran, Siavash Shahshahani, the gentleman in charge of .ir’s growth, was seven years old!

Update: According to this Wikipedia page, “IR” has been in use for Iranian cars since 1936 (interesting date, since until early 1935, Iran was internationally called “Persia”). But the article does not cite its sources, so I can’t really confirm it. Still, even if it came into use in 1936, it was definitely not standardized internationally until 1949.

Arabic in movies: I’ve been watching some 24, which is so full of stereotypical “terrorists”. Most of them are Middle Eastern of course. To try to get “balanced”, in a few episodes they go and add a few “good” Muslims or Middle Easterners, probably to protect themselves. Sometimes it gets pretty funny too. To prove the innocence of some Muslim US government agent, someone says “But she’s even a registered Republican!” I really don’t know if they knew it’s funny... Anyway, that’s not what I want to talk about.

What’s really annoying is that to someone who knows a bit about Middle Eastern culture and language, a lot of things are very phony. These are some random things from 24 that I found. (Note: I am not a native speaker of Arabic. I just learned some in school.)

  • There is a hostage execution scene, with the captors talking in front of a black background with Arabic text on it. Guess what the text says: “الموت لأمريكيين”, which means “Death to Americans”! I’m quite sure no “terrorist” would want to say that. “Death to America”, they may say.
  • The names of some Middle Easterners are pretty made up. There is this family, named “Araz”. Now that’s an Azerbaijani name, and no one would really be named Araz if he’s not an ethnic Azerbaijani or from the Caucasus. But guess what? Their first names are very Arab first names (not even names common in the non-Arab Muslim world), and their son has a very Persian first name (Behrooz)! A totally impossible combination.
  • The writers seem to have taken “terrorist” names from whatever was at hand. There are two minor terrorists, Arabs in appearance, whose names are mentioned almost next to each other in the same episodes. Guess what their last names are? The first is named “Khatami”, the second “Ardakani”. Where are these names coming from? They come from the full name of the very popular reformist former President of Iran, Seyyed Mohammad Khatami Ardakani. Interestingly, that full name is rarely mentioned, except in one place, an old version of the CIA’s World Factbook. The writers simply got their hands on whatever they could find about “terrorist” regimes, and took the smiling president’s name. They didn’t know that Ardakan is the name of a small city in central Iran, and Arabs would probably not name themselves after that city.
  • Arabic text is not what it looks like in the real world at all. The letters are usually disjoint, each letter on its own, instead of contextual shaping. In some cases, it’s even both left-aligned and left-to-right.

Of course, 24 is famous for showing torture to be working sometimes, depicting huge conspiracies, showing government officials on very foolish errands and breaking laws left and right, and very interestingly, a Democratic Chief of Staff becoming a Republican Chief of Staff in the next administration. (All in all, I really think the world of 24 is a parallel universe. Fun to watch, but not much connection to the real world.)

The disjoint Arabic phenomenon is not unique to 24, of course. Even better-produced shows like Lost do it. In Season 4, Episode 9, a TV news program is shown, supposedly in Tunisia, broadcasting something happening in Iraq. The Arabic text is totally disjoint, and unacceptable to anybody who knows anything about the language or script.

I suppose the producers pay people to translate the text into Arabic. Can’t they also make sure the software they use to render the text also displays it fine? If it doesn’t, why bother? Just show some squiggles!

Tintin did it much better, with much lower budget, I guess.

New world: It’s still a couple of months until the beginning of spring, the time we Persians celebrate as our New Year, Nowrooz, the time the world renews itself.

But I think the world renewed itself earlier this year.

Today, I witnessed a new US president, clearly wise, clearly intelligent, and clearly a thinker. I was longing for the day to hear such a thing as “we reject as false the choice between our safety and our ideals” from a US president. Or pearls of wisdom like “know that your people will judge you on what you can build, not what you destroy” or “we can no longer afford indifference to the suffering outside our borders, nor can we consume the world's resources without regard to effect”.

I am so happy to be in this country at such a time as this. And I am surprised at myself for considering him my ideal US candidate for president since I found out about him back in 2004. I didn’t think he would run, I didn’t think he would win, but I followed all his moves. All this time, I cried, laughed, drank, read, informed, and debated. Back home in Iran, in transit, and here in California. I could not vote for him, and would not be able to vote for him in 2012 either, but as a fellow citizen of the world, he has my support.

Congratulations, World! Or should I say, Happy New World!

Fedora: The other weekend, I flew to Boston for FUDCon F11. I mostly did it to reboot myself back into free software contribution, something I hadn't done a lot last year (because of settling in California and various other stressful and depressing situations).

I saw interesting stuff and boring stuff, but the best thing that happened was meeting "spot". He spent a couple of hours with me over drinks, providing free wisdom (and selling me ideas?). He’s so amazing!
