19 May 2002 raph   » (Master)

Trees

Why am I so fascinated by trees?

One of my pie-in-the-sky projects is Athshe, a somewhat nebulous data format for trees. The writeup isn't that good, apologies. In any case, I've been kicking the ideas around in my head for quite some time now, and I still think it's worthwhile.

I'll probably write about some of the more detailed pieces, but I want to get some sense of the large scale across this time.

I think formats for tree-structured data have many intertwingled aspects. It might help to consider XML and filesystems as two special cases.

Abstract tree language

This is basically the language of what's allowed. Obviously, there are trees. The children of a node may be primarily sequential (as in XML), or primarily mapped by name (as in filesystems). What's in a leaf is also important - Unicode text in XML, arbitrary binary in filesystems.

It's good for the abstract tree language to be simple. Often, if you want more complex stuff, you find you can map it into simpler types fairly easily.

Athshe has exactly four types in the language: list, map, bytes, and tag (similar to MIME Content-Type). It's an appealing set, but I'm not sure it's quite right. In particular, I get the feeling that references (perhaps analogous to the # part of URL's) may be important too.

Serialization

Once you have a tree in hand, you want to serialize it out as a file, so you can store it in a file or move it over the net as one chunk. Also, because parsing is so well understood, it's often more useful to talk about the serialization than the abstract tree language.

For a given tree language, there may be many different serializations. In the case of filesystems, there's tar, pax, zip, and many others. XML is mostly just XML, but there's stuff like wbxml too. ASN.1 is another example of an abstract tree language with many different encodings.

Access API

If you want to actually process trees, you'll need an API to access the tree data. In the case of XML, the API of choice is DOM. In the case of filesystems, it's the traditional Unix open/read/write/close API, or a variation.

The access API has many implications for performance. For one, grain size is important. For another, it's important for nodes to be very, very cheap, but active node references (open file descriptors) can be much less cheap. DOM doesn't quite get this right.

Another important API consideration is how calls map across the process boundary. For files, most calls cross the process boundary, which generally implies that files are not good for very small grains. However, note that fread() and fwrite() manage to cache a goodly fraction of file accesses within the process.

This is a good place for complexity to set in. There are all kinds of important issues in real systems only incidentally related to the abstract data language, such as asynchrony (nonblocking I/O), integrity (see fsync), access control, and so on. Applications may also want generalized undo (versioning), copy-on-write, transactions, and more.

Finally, the access API is a reasonably good place for change notification to happen. Unix has been traditionally deficient in this area; it's a standard feature in NT. In the case of XML, DOM has it, but with flaws.

Protocol

It used to be enough to access a tree local to disk or memory. Now, you usually want to do it over the network too. For this, you need a protocol.

Protocols for remote access to filesystems are a very rich, fertile, and increasingly mature area. People use this stuff heavily, though not quite yet universally. Performance is important, and there are lots of clever ideas. A major performance goal is to cut down on the number of network boundary crossings.

In the XML world, the most important protocol is HTTP, which basically gives you the entire tree as one chunk. You can also run DOM through over CORBA, as described in the DOM spec, but the performace would be horrible. There's no good general way to access a large tree incrementally.

Random access file format

Filesystem trees typically don't get manipulated as tarfiles. Usually, the filesystem implementation stores the tree on disk in some random access format. Only a small part needs to be in RAM, and it's pretty quick to get to an arbitrary place in the tree. Again, this is a mature area. In the case of filesystems, it's almost a solved problem.

XML doesn't really have a random access file format. There are such things as "hierarchical databases", but they are not popular.

The main idea of Athshe is that a simple abstract data language makes the problem of building a high-performance suite of tools more tractable. The major performance theme is scalability. Operations on a small local tree should be almost as fast as a pure in-memory implementation. On the other end of the scale, it should be possible to access huge remote trees efficiently. Chunk size in your process and network boundary crossings should match the chunk size of the application; in particular, traversing a complete subtree should resemble transferring it as a serialization.

I find it very difficult to communicate why this is such a good thing. But see a recent thread on wmf's discussion page, which reminded me of my Athshe ideas.

MLP

AaronSw's ETCon schedule with weblog links is useful and interesting. One day soon, most conferences will be annotated like this, probably in realish time.

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!