20 Aug 2009 fxn   » (Master)

Ruby Regexps and Unicode

In Ruby 1.8 strings have no associated encoding; from Ruby's point of view they are just a sequence of bytes. Regexps are agnostic in the same sense: they match bytes against bytes, unless you pass one of the flags /u for UTF8, /s for SJIS, or /e for EUC-JP. By the way, note that /s in Ruby has a different meaning than in Perl, and it is not the only flag that conflicts.
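To make the "just bytes" view concrete, here is a small sketch that prints what is really inside a UTF8 string. The helper names are made up for illustration; String#unpack with the "C*" and "U*" directives is standard Ruby and should behave the same way in 1.8 and later versions.

```ruby
# encoding: utf-8

# Hypothetical helper: the raw bytes, which is all Ruby 1.8 sees.
def byte_values(str)
  str.unpack("C*")
end

# Hypothetical helper: decode the bytes as UTF8 into code points.
def code_points(str)
  str.unpack("U*")
end

byte_values("ò")  # => [195, 178]  (two bytes, 0xC3 0xB2)
code_points("ò")  # => [242]       (one character, U+00F2)
```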

If you set $KCODE to "u" then the source code itself is assumed to be UTF8 and Ruby turns the /u flag on. Ruby on Rails has done that since version 1.2, for example.

AFAICT it is not clearly defined what support Ruby 1.8 provides for Unicode in regexps. For example, Flanagan & Matz say little about it beyond some vague descriptions. You could say it is just not supported, but some things do work. For example, it is a known trick that counting /./ matches (with /u enabled) gives you the length in characters of a UTF8 string, whereas #length returns the number of bytes.
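The counting trick can be sketched like this. The method names are mine, not part of any library, and the byte count goes through unpack so that the example does not depend on what #length means in your Ruby version.

```ruby
# encoding: utf-8

# Hypothetical helper: count characters by counting /./u matches.
# Note that . does not match newlines unless you add the /m flag.
def char_count(str)
  str.scan(/./u).size
end

# Hypothetical helper: count the underlying bytes.
def byte_count(str)
  str.unpack("C*").size
end

char_count("¿cómo?")  # => 6
byte_count("¿cómo?")  # => 8  ("¿" and "ó" take two bytes each)
```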

Two important bits that are only partially supported are the character classes \w and \s (and thus their negations \W and \S).

In general, the definition of a word char depends on the locale. In Catalan "ò" is a word char. Regexp engines are often locale-aware, and then the meaning of \w depends on the locale. That is, \w is equivalent to [a-zA-Z0-9_] only in ASCII-like locales. In Ruby, if the source code is UTF8 and /u is enabled, "ò" matches \w.

That's important, of course: a Rails application that validates domain or account names against \w, for example, is permitting accented letters. If they should not be allowed, you need to write the character class explicitly: [a-zA-Z0-9_].
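A minimal sketch of such a validation, with the class spelled out. The constant and method names are hypothetical; the point is only that an explicit class does not depend on $KCODE or /u.

```ruby
# encoding: utf-8

# Explicit ASCII word characters. \A and \z anchor the whole string;
# ^ and $ would only anchor lines in Ruby.
ASCII_NAME = /\A[a-zA-Z0-9_]+\z/

# Hypothetical helper: true only for non-empty, ASCII-only names.
def ascii_name?(str)
  !!(str =~ ASCII_NAME)
end

ascii_name?("fxn_42")  # => true
ascii_name?("jordò")   # => false
```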

On the other hand, since "ò" and friends match \w, you could be tempted to validate Unicode text against \w; I certainly have been more than tempted :-). Wrong! There are characters that match but shouldn't, for example "¿", "¡", or "·".

Whitespace is also poorly supported. NEL (U+0085) is Unicode whitespace and should belong to \s, but it doesn't in Ruby 1.8. A string that consists of NELs is not only not blank in Rails, it also matches \w in Ruby 1.8! Two gotchas for the price of one!
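If you cannot rely on \s, one workaround is to skip regexps and test code points directly. This is only a sketch: the names are hypothetical, and the ranges are my transcription of the characters with the Unicode White_Space property, so double-check them against the Unicode data files before depending on it.

```ruby
# Code point ranges assumed to carry the Unicode White_Space property.
UNICODE_WHITESPACE = [
  0x0009..0x000D, 0x0020..0x0020, 0x0085..0x0085, 0x00A0..0x00A0,
  0x1680..0x1680, 0x2000..0x200A, 0x2028..0x2029, 0x202F..0x202F,
  0x205F..0x205F, 0x3000..0x3000
]

# Hypothetical helper: true if every character of a UTF8 string is
# Unicode whitespace. The empty string counts as blank.
def unicode_blank?(str)
  str.unpack("U*").all? do |cp|
    UNICODE_WHITESPACE.any? { |range| range.include?(cp) }
  end
end

nel = [0x85].pack("U")       # NEL encoded as UTF8
unicode_blank?(nel * 3)      # => true
unicode_blank?(" a ")        # => false
```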

If you need proper Unicode support, among other goodies, you can switch to Oniguruma. That's the regexp engine used in Ruby 1.9, and it is available for 1.8 as a gem:

    sudo gem install oniguruma

The gem needs a C library, which is available as a tarball and is also packaged for Ubuntu (at least):

    sudo apt-get install libonig-dev

The API is here.
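To give an idea of what the newer engine buys you, here is a sketch that assumes Ruby 1.9 or later, where the engine is built in and regexp literals understand Unicode properties such as \p{L} (letters). It does not use the gem's own API, which I am not reproducing here, and the names are hypothetical.

```ruby
# encoding: utf-8

# Letters of any script, and nothing else. Requires Ruby 1.9 or later.
LETTERS_ONLY = /\A\p{L}+\z/

# Hypothetical helper: true only for non-empty strings made of letters.
def letters_only?(str)
  !!(str =~ LETTERS_ONLY)
end

letters_only?("jordò")  # => true
letters_only?("¿")      # => false
letters_only?("·")      # => false
```

Unlike \w under /u in 1.8, this accepts "ò" and rejects the punctuation characters mentioned above.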
