I've been thinking about trees a lot lately, and yesterday I had an epiphany about why: I'm actually going to need a high-tech tree implementation for Fitz (the merger of Libart and the Ghostscript graphics library). This tree will hold the graphics objects to be displayed or printed. It's known in the graphics field as a "display list". The PDF 1.4 imaging model is inherently tree structured, so it is most natural for it to be a tree.
The goals are simple. I want a nice, abstract interface, insulated from the implementation details. This interface should support dom-like tree navigation and mutation. Each node will also have a bounding box. The most important query on a tree node is to iterate through all children intersecting a given bounding box.
Also, I want the tree to support a nice Model/View architecture. Mutating the tree (the "model") notifies the view that something has changed. The view then queues a region of the screen for redisplay. A background thread (or, more likely, idle handler) empties this queue, rerendering these regions and drawing them to the screen. To the client, the effect is that the screen updates automagically.
Behind that interface, I want an efficient implementation. The main criterion is RAM usage, because this will go into things like printers, where RAM is far from free. Using an in-memory object per tree node is too heavy, as my experience with both DOM and the Gnome Canvas proved. Of course, it also has to be fast, so things like linear scanning through a serialization are out.
I now have a design that I believe will meet these goals. Here is the broad outline.
The API is fairly similar to DOM, but with one subtle and important difference: instead of Node objects for each tree node, I plan to use "node ids". These node ids are transient, even if the node itself is persistent. When the client is done with a node id, it explicitly lets go its reference. DOM, by contrast, generally relies on the Java garbage collector. I expect that the number of active node ids (ie, those for which the client holds a reference) is small compared to the size of the tree.
Obviously, this API admits a DOM-style implementation. In this case, there is a one-to-one mapping between node id's and node objects. Admitting such an implementation is a good thing, because it should be relatively simple to code up and thus useful as a prototype.
However, the real goal is to admit a more compact implementation. In this implementation, a serialization of the tree is held in a B-tree. Thus, there are two trees. The structure of these two trees need not bear any relation to each other, though.
The alphabet of this serialization consists of left parenthesis, right parenthesis, and atoms. Let's simplify things a lot by treating atoms as if they are one byte in size. That way, counting is easy, and we also don't need to worry about splitting an atom across more than one (fixed size) B-tree block. Thus, a complete depth-2 binary tree is represented by this serialization: "((AB)(CD))"
For each B-tree node, we store a little bit of extra information about the parentheses. Starting at 0, incrementing for each '(', and decrementing for each ')', we record the minimum and ending values. For example, if our B-tree block size is 4, then our blocks are "((AB", ")(CD", and "))". The summary tuples are (0, 2), (-1, 0), and (-2, -2), respectively. These summaries are stored for both leaf and interior B-tree nodes.
Now, there are two notions of node id's, which I will christen "sliding" and "stable". A "sliding node id" is essentially the tuple of a B-tree (leaf) node identifier and an offset within the node. Thus, in our example, if the id's of the three B-tree leaf nodes are x, y, and z, then the root node of our tree is (x, 0), its first child is (x, 1), and its second child is (y, 1).
Given a sliding node id, it is pretty easy to see how the standard DOM methods of first child, next sibling, and previous sibling can be implemented efficiently. In particular, you can jump over irrelevant B-tree nodes without having to dive into them. The details are left as an exercise for the reader.
Finding the parent of a node is a little trickier, but only a little. Basically, you walk down from the root node, and remember where you were. If this operation is important, then it probably makes sense to have a cache. You can also populate this cache when you do a "first child" operation, using the result (the child) as the cache key.
This concept of a sliding node id works well if the tree is read-only, or even in the (interesting) special case where you're appending to the end. But I'm also interested in the general case where you are mutating the tree.
The basic idea is to maintain a mapping from stable node id's to sliding node id's. When the tree is mutated, update this mapping. For example, assume that we have a node id pointing to "B". The sliding node id is (x, 3). Now, let's insert an "E" between "A" and "B", resulting in the following B-tree leaf nodes: x = "((A", w = "EB", y = ")(CD", and z = "))". We update the sliding node id in the stable-to-sliding mapping to (w, 1). Note that the node id for the root's second child stays at (y, 1). In fact, we expect most node id's to remain stable unless that neighborhood of the tree is being mutated.
This design is quite a bit less ambitious than, say, IronDoc. I'm not concerned with updating over a network, doing transactions, or generalized undo.
Messaging without email
David McCusker has also been thinking a lot about trees. Reading over his site has provided much of the inspiration for the above design.
I realized recently that he and I are conducting a full scale dialog through our respective weblogs. This has traditionally been the domain of email.
I feel like I've found a kindred soul. We've spent a lot of time thinking in the same space, and come up with parallel conclusions. I can also relate to much of the more personal info that David blogs - for example, when we were trying to live with another couple in an extended family, we had money issues similar to those that David describes. For us, those ended when we split back into our respective nuclear families.
I get the feeling that I would enjoy meeting David and hanging out with him. The normal thing to do would be to email him, but I'm a bit reluctant to break the blog-not-email pattern.
Yesterday, Roger Dingledine, who's in town from Boston, invited me to go to dinner with him. On the spur of the moment, I hopped over to San Francisco (where the cfp2002 conference is being held. It was great fun. In addition to Roger, I got to meet a number of people I hadn't seen in a while, including adam and Paul Syverson. I also got to meet some people I knew only online, including Len Sassaman and Lance Cottrell. Also, meeting George Danezis was a special treat. Bram stopped by late, but I had to leave to catch Bart by midnight.
These personal connections are vitally important. I'm happy that I was able to do this.
Someone (sorry I don't remember who) recommended Forth. I'm happy that he likes it, but my experience is otherwise. PostScript is actually one of the more powerful Forth dialects (lists and maps as first-class objects, etc), but I avoid it as much as possible.
It's awesome that projects like Parrot and Psyco are underway. Unfortunately, I fear that the goal of getting good performance may impose subtle but deep changes on the languages implemented. Python, in particular, is radically dynamic, much more so than a typical Lisp.
I also think it's great that we have different competing approaches. We don't yet know the Best Way to efficiently implement highly dynamic languages. My personal guess is that Arc will kick the ass of its competitors, by virtue of drawing on the deep knowledge well of efficiently implementing Lisp, but I would be happy to be proved wrong.
My main frustration, as I've written before, is my lack of good choices among stable languages. But this will inevitably happen in time. If there are free software values (in analogy to the extreme programming values), then I'm sure that patience is among them.