13 Apr 2002 raph   » (Master)

Languages

This is well worth reading. There are some problem domains in which performance does matter. It used to be that performance mattered in most problem domains; no longer. But the performance-critical domains tend to be the ones that interest me personally.

Languages such as Python and Perl have become popular for good reasons - it really is possible to express concepts in these languages more succinctly than in lower-level languages like C. However, this power generally comes at the price of a dramatic loss of performance. I don't agree that there necessarily has to be a tradeoff between beauty and performance; quite the contrary.

I guess what bothers me most is that the technology to extract good performance from highly dynamic languages is well known by now, but the implementors of Python and friends have largely chosen to ignore this art.

Fortunately, future implementors of popular scripting languages are as free to reject low-performance implementations as past implementors of popular scripting languages have been to reject high-performance implementations. From the looks of it, that's precisely what Paul Graham intends to do. If so, it would rock.

blobs

One of the things David McCusker has going at his website is an ongoing discussion of a data structure/algorithm he's hacking, which he calls blobs. From the outside, a blob looks like a file (i.e. a sequence of bytes), but one for which inserting and deleting in the middle are efficient. In most file implementations, only manipulating the end is efficient.

Under the hood, it's basically a b-tree. There seems to be a lot more to it than that - David is obviously a very smart guy, and equally obviously, he's put a lot of thought over a long time into the algorithms. Fortunately, as he points out, you don't necessarily have to understand the implementation of blobs to use them effectively. I think this is true of a lot of the best technology.
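To make the interface concrete, here's a minimal Python sketch of the operations I have in mind (the names are my own, not David's code, and the naive bytearray backing makes every middle edit O(n); the point of the real b-tree implementation is to make these edits roughly logarithmic):

    class Blob:
        """Toy blob: a file-like byte sequence that supports edits in the
        middle.  A real implementation would index chunks of bytes with a
        b-tree so insert/delete at any offset is O(log n); this backing
        is naive and O(n)."""

        def __init__(self, data=b""):
            self._buf = bytearray(data)

        def __len__(self):
            return len(self._buf)

        def read(self, offset, length):
            """Return length bytes starting at offset."""
            return bytes(self._buf[offset:offset + length])

        def insert(self, offset, data):
            """Insert data before position offset."""
            self._buf[offset:offset] = data

        def delete(self, offset, length):
            """Remove length bytes starting at offset."""
            del self._buf[offset:offset + length]

    b = Blob(b"hello world")
    b.insert(5, b", cruel")       # edit in the middle, not just at the end
    b.delete(0, 1)
    print(b.read(0, len(b)))      # b'ello, cruel world'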

I can think of a few applications in my own experience where blobs would be useful. The most obvious is text editors. There is also a large class of binary file formats that can usefully be edited in this way - PDF definitely comes to mind, and many producers/editors of PDF files seem to go through all kinds of contortions to deal with variable length editing.

But, thinking about this a bit today on BART, I'm more intrigued with the idea of using this data structure for generic tree-structured data. In some serializations of trees, the basic tree editing operations map well to edits of the serialization. XML is one such serialization; ASN.1 is not, because it includes length fields.

The basic tree edits I have in mind are inserting and deleting nodes, and moving nodes up or down. In XML, the first two map to simply inserting or deleting the XML representations of the nodes. The latter two are not much harder - you delete or insert (respectively) matching start and end tags.
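A toy illustration (my own example, using made-up helpers rather than a real blob API) of how these edits reduce to byte-range inserts and deletes on the serialized XML:

    def insert(buf, offset, data):
        return buf[:offset] + data + buf[offset:]

    def delete(buf, offset, length):
        return buf[:offset] + buf[offset + length:]

    doc = b"<doc><a/><c/></doc>"

    # Insert a node: one insert of its full serialization.
    doc = insert(doc, doc.index(b"<c/>"), b"<b/>")
    print(doc)   # b'<doc><a/><b/><c/></doc>'

    # Move nodes "down" (wrap <b/> and <c/> in a new parent): insert a
    # matching start and end tag.  Doing the end tag first keeps the
    # earlier offset from shifting.
    start = doc.index(b"<b/>")
    end = doc.index(b"</doc>")
    doc = insert(doc, end, b"</g>")
    doc = insert(doc, start, b"<g>")
    print(doc)   # b'<doc><a/><g><b/><c/></g></doc>'

    # Move them back "up": delete the matching start and end tags.
    doc = delete(doc, doc.index(b"<g>"), len(b"<g>"))
    doc = delete(doc, doc.index(b"</g>"), len(b"</g>"))
    print(doc)   # b'<doc><a/><b/><c/></doc>'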

You, faithful reader, probably think I'm out of my gourd at this point. "Doesn't he realize", you might say, "that he's proposing to use a tree structure to store a byte sequence representing a tree? Why not just use a tree to store a tree, like DOM does?" Most of the answer is: performance.

Most DOM implementations take on the order of a hundred bytes or so per tree node. To me, this is clearly ridiculous. Even XML, which is a fairly redundant and bloated representation, takes a dozen bytes or so per node, assuming concise tags. It's not hard to imagine more compact tree serializations that take just a few bytes of overhead per node to store the tree structure. Lisp S-expressions, which aren't even binary, are a good example.
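A quick back-of-the-envelope check on that claim (rough arithmetic only, not a measurement):

    sexpr = b"(doc (a) (b (c) (d)))"
    nodes = sexpr.count(b"(")                                   # 5 nodes
    overhead = sexpr.count(b"(") + sexpr.count(b")") + sexpr.count(b" ")
    print(nodes, overhead, overhead / nodes)                    # 5 14 2.8

Call it about three bytes of structural overhead per node, against the hundred or so a typical DOM node costs.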

With a blob, the amount of memory taken by the data structure is only a tiny amount more than the serialization itself. So it seems like you might get an order of magnitude or so improvement without even trying very hard.

But RAM usage isn't the only reason this is interesting. Manipulating byte ranges is actually a lot simpler than doing edits on a tree. This can be interesting for a number of reasons.

Let's say you're trying to keep two copies of a tree sync'ed up, over some kind of network connection. If you were to do something really stupid, like DOM over CORBA, you'd generally get one method invocation per node. Since DOM over CORBA incurs a round-trip latency for every method invocation, performance would be amazingly bad. You also need to preserve a great deal of DOM complexity so you can express the edits to the tree in rich enough detail.

By contrast, it's very easy to imagine an efficient implementation based on byte range edits. One major win is that you can batch up operations on as many nodes as you like into a single byte range edit. Another major win is the dramatic simplification of the protocol: just insert and delete.
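Here's a rough sketch of what such a protocol could look like - a hypothetical wire format of my own, not anything that exists. The whole batch is one message, so any number of node edits costs a single round trip:

    def apply_edits(buf, edits):
        """Apply a batch of ('ins', offset, data) / ('del', offset, length)
        edits to a bytes buffer and return the new buffer."""
        for edit in edits:
            if edit[0] == "ins":
                _, offset, data = edit
                buf = buf[:offset] + data + buf[offset:]
            elif edit[0] == "del":
                _, offset, length = edit
                buf = buf[:offset] + buf[offset + length:]
            else:
                raise ValueError("unknown edit %r" % (edit,))
        return buf

    replica = b"<doc><a/><c/></doc>"

    # One batch carries any number of node edits: add <b/>, then drop <c/>.
    batch = [("ins", 9, b"<b/>"), ("del", 13, 4)]
    replica = apply_edits(replica, batch)
    print(replica)   # b'<doc><a/><b/></doc>'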

This is just one example; another is undo functionality. It's possible, but painful, to implement a generic undo facility in DOM. Basically, what you'd need to do is provide a proxy DOM implementation that records the inverse of each tree edit into a log. When the client inserts, you log "delete". When the client moves a node from A to B, you log "move from B to A". Then, to undo, you play the log back. The devil is in the details - there are lots of operations, and a vanilla DOM is hard enough to implement without having to worry about maintaining the integrity of such a log. It's also difficult to see how you'd aggregate changes, for example undoing back to a blank document in a single step rather than replaying the entire tree edit history.

In a byte range implementation, you just have insert and delete. These operations are each other's inverses. Further, it's pretty easy to reduce a sequence of such edits if need be.
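A sketch of how that might look (again, toy code of my own; the inverse of each edit is logged as it's applied, and undo just replays the log backwards):

    def apply_and_log(buf, edit, log):
        op, offset = edit[0], edit[1]
        if op == "ins":
            data = edit[2]
            log.append(("del", offset, len(data)))        # inverse of insert
            return buf[:offset] + data + buf[offset:]
        else:  # "del"
            length = edit[2]
            removed = buf[offset:offset + length]
            log.append(("ins", offset, removed))          # inverse of delete
            return buf[:offset] + buf[offset + length:]

    def undo_all(buf, log):
        while log:
            buf = apply_and_log(buf, log.pop(), [])       # discard the redo log
        return buf

    doc, log = b"<doc><a/></doc>", []
    doc = apply_and_log(doc, ("ins", 9, b"<b/>"), log)
    doc = apply_and_log(doc, ("del", 5, 4), log)
    print(doc)                   # b'<doc><b/></doc>'
    print(undo_all(doc, log))    # b'<doc><a/></doc>'  -- back where we started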

Undo, of course, has lots of applications, including aborting transactions in a transactional system. You can also easily imagine this kind of thing being used for copy-on-write, version control, etc.

I'm not saying that this approach solves every problem. Holding on to a reference to a node as it slides around, which is fairly easy in DOM, becomes more challenging.

I spent a lot of time thinking about how DOM might be implemented efficiently when I was working on SVG. I've basically come to the conclusion that it can't be. But I do think that it's possible to implement something DOM-like that's a lot simpler, a lot more powerful, and a lot more efficient. Such a thing, I believe, would be very useful in a lot of interesting applications.
