13 Jun 2002 raph   » (Master)

Caching in Trees

This will be another fairly technical entry on trees. It's been in the queue for a few days. The focus is on the Fitz display tree, for which memory efficiency is a major factor. Tonight we look at caching.

If you have a display tree, there are lot of things you might want to cache: intermediate renderings, bounding boxes and other geometry information, etc. For each Bezier curve, for example, you might want to cache a decomposition into triangles. The memory footprint for these cached objects might be significant in size compared to the original tree. This is why we want a cache rather than simply annotating the tree nodes with the extra info.

Mutating the tree can invalidate cached data, as well. In some cases, the relationship between the mutation and the cache is nontrivial. For example, if you change the color of a Bezier shape, you invalidate an intermediate RGBA rendering, but the triangle decomposition remains valid.

We're not going to pin down the exact representation of the tree. One in-memory object per node is the simplest way, and should work. Another approach that should work is storing a serialization of the tree in a btree-like structure. In this case, our node id is effectively a file offset to the beginning of the serialization. Thus, the id of a node can change as the tree is mutated.

Dealing with "sliding" node id's is probably too hard for clients of the tree, so we have an additional concept of "node reference", which is an in-memory object that essentially wraps a node id. When a node id moves, the tree implementation updates the corresponding node reference. This way, clients holding node references don't have to worry about them moving around.

Node reference might take dozens of bytes of RAM each, but node id's are essentially weightless. We hope that tree clients hold a relatively small number of node references, even as the size of the tree scales up.

Now we get to tonight's central design question: what should the key of our various caches be? A node id? A node reference? Something else? Here, we consider some alternatives.

Persistent id

A very common pattern in databases is to add a persistent id to each node. The value is somewhat arbitrary, but must be unique for each node in the tree. If we had persistent id's, then it would make good sense to use them as the cache keys. The problem is the extra storage cost. We're trying to keep that down to the bone.

Node id

It's tempting to use node id's (ie file offsets in the btree case) as cache keys. The problem is that if the node id moves, the cache key needs to be updated. Keeping the inverse map from node id to cache keys has nontrivial storage costs, also.

A more subtle, but important, argument against is that updating them may be very rare in some usage scenarios. Thus, there is a risk that the update machinery won't be adequately tested.

Node reference

The cache key could simply be a pointer to a node reference object. If the tree moves the corresponding node id, it updates the internals of the node reference, but the pointer remains constant.

Cache in tree

A rather different approach is to insert the cached values into the tree. The advantage is that the RAM costs can be very low (near-zero if the tree is stored on disk as a btree). The disadvantage is that computing and evicting cached values now requires traffic with the tree, with attendant performance and fragility problems.

Also, if there are multiple caches, then they'll need to be properly

multiplexed so values from the caches don't interfere with each other.

Cache in node reference

This is something of a hybrid of the above three approaches. Instead of the cache being represented as a hash table to the side of the tree, the cache entry is an extra field in the node reference. If there is a single cache (or small, bounded number of caches), then this approach is appealing. Otherwise, you have to do multiplexing as above.

I think you see most of these approaches in real systems. For one example, the Gnome Canvas includes a bounding box in all nodes, and also an SVP (sorted vector path) in all Bezier shape nodes (thus, I claim, it represents "cache in tree"). Unfortunately, it never evicts any elements from the cache, so the memory requirements can become quite painful.

For Fitz, I now think I have an answer. Most caches will use node references as keys. However, I may treat bounding boxes specially, and store them in node references. Saving a hashtable lookup may be a significant win, and it also helps that the value is of small constant size.

Also, note that navigational links (parent, first child, next sibling), which are internal to the tree implementation, can similarly be cached in the node reference. The basic rendering traversal is: given a bounding box, find all child nodes intersecting that bbox. If the nodes are in-cache, it should be possible to do this very quickly.

I'm happy with this, but still don't feel I've come to terms with change notification. I'll blog that in the next few days, and probably lose my remaining two readers.

Kids

Alan started his Korean Royal Court Martial Arts (Koong Joong Mu Sool) classes this week. It seems to be a perfect match for him - he's in much better shape.

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!