Trees
I've been thinking about trees a lot lately, and
yesterday I
had an
epiphany about why: I'm actually going to need a high-tech
tree
implementation for Fitz (the merger of Libart and the
Ghostscript
graphics library). This tree will hold the graphics objects
to be
displayed or printed. It's known in the graphics field as a
"display
list". The PDF 1.4 imaging model is inherently tree
structured, so
it is most natural for it to be a tree.
The goals are simple. I want a nice, abstract interface,
insulated
from the implementation details. This interface should
support
dom-like tree navigation and mutation. Each node will also
have a
bounding box. The most important query on a tree node is to
iterate
through all children intersecting a given bounding box.
Also, I want the tree to support a nice Model/View
architecture.
Mutating the tree (the "model") notifies the view that
something has
changed. The view then queues a region of the screen for
redisplay. A
background thread (or, more likely, idle handler) empties
this queue,
rerendering these regions and drawing them to the screen. To
the
client, the effect is that the screen updates automagically.
Behind that interface, I want an efficient
implementation.
The main
criterion is RAM usage, because this will go into things
like
printers, where RAM is far from free. Using an in-memory
object per
tree node is too heavy, as my experience with both DOM and
the Gnome
Canvas proved. Of course, it also has to be fast, so things
like
linear scanning through a serialization are out.
I now have a design that I believe will meet these
goals.
Here is the
broad outline.
The API is fairly similar to DOM, but with one subtle
and
important
difference: instead of Node objects for each tree node, I
plan to use
"node ids". These node ids are transient, even if the node
itself is
persistent. When the client is done with a node id, it
explicitly lets
go its reference. DOM, by contrast, generally relies on the
Java
garbage collector. I expect that the number of active node
ids (ie,
those for which the client holds a reference) is small
compared to the
size of the tree.
Obviously, this API admits a DOM-style implementation.
In
this case,
there is a one-to-one mapping between node id's and node
objects.
Admitting such an implementation is a good thing, because it
should be
relatively simple to code up and thus useful as a prototype.
However, the real goal is to admit a more compact
implementation. In
this implementation, a serialization of the tree is held in
a B-tree.
Thus, there are two trees. The structure of these two trees
need not
bear any relation to each other, though.
The alphabet of this serialization consists of left
parenthesis, right
parenthesis, and atoms. Let's simplify things a lot by
treating atoms
as if they are one byte in size. That way, counting is easy,
and we
also don't need to worry about splitting an atom across more
than one
(fixed size) B-tree block. Thus, a complete depth-2 binary
tree is
represented by this serialization: "((AB)(CD))"
For each B-tree node, we store a little bit of extra
information about
the parentheses. Starting at 0, incrementing for each '(',
and
decrementing for each ')', we record the minimum and ending
values.
For example, if our B-tree block size is 4, then our blocks
are
"((AB", ")(CD", and "))". The summary tuples are (0, 2),
(-1, 0), and
(-2, -2), respectively. These summaries are stored for both
leaf and
interior B-tree nodes.
Now, there are two notions of node id's, which I
will
christen
"sliding" and "stable". A "sliding node id" is essentially
the tuple
of a B-tree (leaf) node identifier and an offset within the
node. Thus, in our example, if the id's of the three B-tree
leaf nodes
are x, y, and z, then the root node of our tree is (x, 0),
its first
child is (x, 1), and its second child is (y, 1).
Given a sliding node id, it is pretty easy to see how
the
standard DOM
methods of first child, next sibling, and previous sibling
can be
implemented efficiently. In particular, you can jump over
irrelevant
B-tree nodes without having to dive into them. The details
are left as
an exercise for the reader.
Finding the parent of a node is a little trickier, but
only
a little.
Basically, you walk down from the root node, and remember
where you
were. If this operation is important, then it probably makes
sense to
have a cache. You can also populate this cache when you do a
"first
child" operation, using the result (the child) as the cache
key.
This concept of a sliding node id works well if the tree
is
read-only,
or even in the (interesting) special case where you're
appending to
the end. But I'm also interested in the general case where
you are
mutating the tree.
The basic idea is to maintain a mapping from stable node
id's to
sliding node id's. When the tree is mutated, update this
mapping. For
example, assume that we have a node id pointing to "B". The
sliding
node id is (x, 3). Now, let's insert an "E" between "A" and
"B",
resulting in the following B-tree leaf nodes: x = "((A", w =
"EB", y =
")(CD", and z = "))". We update the sliding node id in the
stable-to-sliding mapping to (w, 1). Note that the node id
for the
root's second child stays at (y, 1). In fact, we expect most
node id's
to remain stable unless that neighborhood of the tree is
being
mutated.
This design is quite a bit less ambitious than, say,
IronDoc. I'm not
concerned with updating over a network, doing transactions,
or
generalized undo.
Messaging without email
David McCusker has also been thinking a lot about trees.
Reading over
his site
has
provided much of the inspiration for the above design.
I realized recently that he and I are conducting a full
scale dialog
through our respective weblogs. This has traditionally been
the domain
of email.
I feel like I've found a kindred soul. We've spent a lot
of
time
thinking in the same space, and come up with parallel
conclusions. I
can also relate to much of the more personal info that David
blogs -
for example, when we were trying to live with another couple
in an
extended family, we had money issues similar to those that
David
describes. For us, those ended when we split back into our
respective
nuclear families.
I get the feeling that I would enjoy meeting David and
hanging out
with him. The normal thing to do would be to email him, but
I'm a
bit reluctant to break the blog-not-email pattern.
Community
Yesterday, Roger Dingledine, who's in town from Boston,
invited me to
go to dinner with him. On the spur of the moment, I hopped
over to San
Francisco (where the cfp2002
conference is being held. It was great fun. In addition to
Roger, I
got to meet a number of people I hadn't seen in a while,
including
adam and Paul Syverson. I also
got to meet some
people I
knew only online, including Len Sassaman and Lance Cottrell.
Also,
meeting George Danezis was a special treat. Bram
stopped
by late, but I had to leave to catch Bart by midnight.
These personal connections are vitally important. I'm
happy
that I
was able to do this.
Languages
Someone (sorry I don't remember who) recommended Forth.
I'm
happy that
he likes it, but my experience is otherwise. PostScript is
actually
one of the more powerful Forth dialects (lists and maps as
first-class
objects, etc), but I avoid it as much as possible.
It's awesome that projects like Parrot and Psyco are
underway.
Unfortunately, I fear that the goal of getting good
performance may
impose subtle but deep changes on the languages implemented.
Python,
in particular, is radically dynamic, much more so than a
typical Lisp.
I also think it's great that we have different competing
approaches.
We don't yet know the Best Way to efficiently implement
highly dynamic
languages. My personal guess is that Arc will kick the ass
of its
competitors, by virtue of drawing on the deep knowledge well
of
efficiently implementing Lisp, but I would be happy to be
proved
wrong.
My main frustration, as I've written before, is my lack
of
good
choices among stable languages. But this will
inevitably happen
in time. If there are free software values (in analogy to
the extreme
programming values), then I'm sure that patience is
among them.