I was going to blog about the fact that using org-mode, referred to in my previous diary entry, made the important but non-urgent tasks more visible. I was going to use sorting out backups for interesting data (say, my e-mail archives) as an example, and discuss the solution I came up with, but I have just realised that the trust model is exactly backwards (my server trusts root on the backup machine). This is annoying, because I thought I'd got it right, and because getting it right would have been equally easy.
Oh well. So, instead, I'll return sorting out backups to the TODO (or maybe STARTED) state, and (prompted by some recent discussion on #lisp IRC) I'll blog about SBCL's interpretation of Unicode characters, with the up-front caveat that I'm Not An Expert in all this languages, glyphs, graphemes and characters jazz.
Common Lisp's string type is defined to be a vector specialized to hold only characters or a subtype thereof. This definition is already hard to wrap your head around, and has amusing consequences documented here in the past, but I don't want to get into that too much here; suffice it to say that this definition already restricts, to a fairly large extent, the possible implementation strategies for supporting Unicode.
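The definition can be poked at from the REPL; this little sketch should hold in any conforming implementation:

```lisp
;; A string is a vector specialized to hold characters (or a subtype):
(typep (make-array 5 :element-type 'character :initial-element #\x) 'string) ; => T

;; (vector character) is one of the types making up the string union type:
(subtypep '(vector character) 'string) ; => T, T
```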
Why so? Because in Unicode there are several notions of ‘character’, and we have to decide which of them we're going to use as our Lisp character type (and use as string constituents). The simple answer from the implementation point of view (and the route that SBCL currently takes) is to define a Lisp character as an entity corresponding directly to a Unicode code point. This is simple and straightforward to implement, but unfortunately has the side effect of making various Common Lisp string functions less useful to the user.
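Under this scheme, char-code and code-char translate directly between Lisp characters and Unicode code points; a quick illustration (printed representations as on a recent SBCL):

```lisp
;; Each Lisp character corresponds directly to one Unicode code point:
(char-code #\é)    ; => 233, i.e. the code point U+00E9
(code-char #x00E9) ; => #\LATIN_SMALL_LETTER_E_WITH_ACUTE
```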
How so? Well, consider the string comparison functions, such as string=. As specified, string= compares two strings, character by character. In SBCL, then, this compares two sequences of Unicode code points, character by character, for equality. The problem is that this operation doesn't in general have the semantics of ‘string equality’, because in Unicode there is more than one way to encode the same abstract character: for example, the ‘abstract character’ (or possibly ‘grapheme’) e-acute, which is usually displayed as ‘é’, can be represented either as the single code point U+00E9, or as the combining character sequence U+0065 U+0301.
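To see the mismatch in action, this sketch compares the precomposed and decomposed forms of e-acute, which denote the same abstract character but are different code-point sequences (the character names here use SBCL's #\ reader syntax):

```lisp
(let ((precomposed (string #\LATIN_SMALL_LETTER_E_WITH_ACUTE))            ; U+00E9
      (decomposed  (coerce (list #\e #\COMBINING_ACUTE_ACCENT) 'string))) ; U+0065 U+0301
  ;; both display as "é", but codepoint-by-codepoint they differ:
  (string= precomposed decomposed)) ; => NIL
```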
So, that's OK; the Unicode FAQ on Combining Marks says that characters and combining character sequences are different, and even implies that programmers should be dealing with Unicode code points (SBCL characters). Unfortunately, Lisp has been around for longer than Unicode, and code has been written essentially assuming that string= compares strings-as-text for equality rather than codepoint-by-codepoint, simply because (pre-Unicode) these two concepts were conflated.
What about the alternative? We could try defining Lisp characters to be abstract characters, represented as combining character sequences. One problem with this idea is that there's the char-code function to implement: for every Lisp character there must be a corresponding unique integer. That's not so much a problem – Lisp has bignums after all – but it will make char-code-limit surprisingly large (in principle, I think every combining mark could be applied to a given base character). This means that we'd lose the ability to represent an arbitrary character as an immediate object, meaning that accessing characters from strings would in general cause heap allocation, and lead to surprises elsewhere in the system.
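For comparison, with the code-point interpretation char-code-limit stays small enough for every character to be an immediate object; in SBCL it is simply the number of code points:

```lisp
char-code-limit ; => 1114112 in SBCL, i.e. #x110000, one past U+10FFFF
```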
So, given that we stay with a Lisp character corresponding to a Unicode code point, what other pitfalls and details are there to consider? The memory representation of strings of type (simple-array character (*)) is worth mentioning; because there's a fairly strong cultural expectation of O(1) access time in vectors, we don't do any compression, but simply store each Unicode codepoint in a 32-bit cell. SBCL has a separate base-string representation, where each ASCII codepoint is stored in an 8-bit cell; a long time ago I gave a talk about this.
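The two representations can be distinguished at the REPL (type names as printed by SBCL):

```lisp
;; Full character strings store one 32-bit cell per code point:
(type-of (make-string 3 :initial-element #\é)) ; => (SIMPLE-ARRAY CHARACTER (3))

;; base-strings store each character in an 8-bit cell:
(type-of (make-string 3 :element-type 'base-char :initial-element #\x))
; => (SIMPLE-BASE-STRING 3)
```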
Also, the interpretation of the contents of strings has caused some confusion recently. Granted that a string is a vector of (effectively) code points, what does that mean for strings containing surrogate characters (code points in the range U+D800–U+DFFF)? These code points do not denote any abstract characters directly; instead, pairs of surrogates are (in certain Unicode encodings, such as UTF-16) interpreted as characters beyond the Basic Multilingual Plane. Some Lisp implementations go so far to resolve this ambiguity as to forbid the creation of a Lisp character with a surrogate codepoint. In SBCL, however, we take the view that those characters exist, but should not be interpreted in any way; a string containing surrogate pairs should be considered to have individual surrogate characters in it, and no attempt should be made to combine them. If there is data in an encoding which uses surrogate pairs (such as UTF-16), that data should be read in using the :utf-16 external format, so that no surrogates are present at the Lisp level; an attempt to write out a surrogate Lisp character in a Unicode encoding should signal an error. (NB: not all of this is implemented yet).
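A sketch of the intended behaviour, assuming an SBCL recent enough to have the :utf-16le external format (and bearing in mind the caveat that not all of this is implemented):

```lisp
;; Surrogate code points are legal Lisp characters, just uninterpreted:
(code-char #xD800) ; => a lone surrogate character, printed #\UD800

;; Decoding UTF-16 combines surrogate pairs at the encoding boundary, so
;; no surrogates appear in the resulting Lisp string: the pair D83D DC00
;; (little-endian bytes below) decodes to the single code point U+1F400.
(sb-ext:octets-to-string
 (coerce '(#x3D #xD8 #x00 #xDC) '(vector (unsigned-byte 8)))
 :external-format :utf-16le)
;; => a one-character string containing (code-char #x1F400)
```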
All of this merely scratches the surface of Unicode support; I'm hoping to find time to implement better support for finding properties of the Unicode Character Database, and to implement Unicode algorithms for normalization, collation and so on; I'm also planning to tighten up support for the Unicode encodings (to address the potential security issues that exist from nonconforming decoders) and generally to improve support for doing useful things with non-ASCII. As usual, there's likely to be a significant lag between planning and doing...