7 Feb 2003 cuenca   » (Master)


I've been studying trees lately. The goal that I pursuit is to do a data structure/algorithms that make editing operations a O(log n), where n is the number of previous changes. In AbiWord we have O(n) algorithms, so the O(log n) should provide a good boost to the performance.

I will present a somewhat simplified vision the current data structure. It is composed of 2 buffers. One of them is read only (the buffer "0"), and contains the original text of the file that you're editing. The second one is append only (buffer "1"), and at the beginning is empty. We also have a linked list, whose nodes contain 3 fields: Buffer, Offset and Size. Each node represents a piece of text. At first, the linked list only contains a node, with has Buffer 0, Offset 0, and Size equal to the size of the read only buffer.

To explain the insert operation I will use an example. Let's say that you have a document with "ABCDEFGH" and you want to insert the 'a' letter at the 4th position in the document.

Buffer 0: ABCDEFGH
Buffer 1: [empty]
Linked list: <0, 0, 8>

To do that, you should append the 'a' letter to the end of the Buffer 1, and you split the node on the linked list and insert a new node to point to your letter. It ends like that:

Buffer 0: ABCDEFGH Buffer 1: a Linked list: <0, 0, 4> => <1, 0, 1> => <0, 4, 4>

In general, insertion is: search the right node on the linked list (O(n) operation), and split the node. Of course, there are some edge cases, when you insert at the beginning or at the end of a piece you don't need to split it.

The delete operation works much in the same way. If you want to delete de 'B' in our example's document, you don't have to touch any of the buffers, and you change the linked list to:

Linked list: <0, 0, 1> => <0, 2, 2> => <1, 0, 1> => <0, 4, 4>

To get the document contents, you only have to walk the linked list, and get the characters that the nodes point to.

I wanted to change the linked list to a balanced tree, to make lookup, insert and delete operations O(log n) instead of O(n). So I did it, and I put the results at http://e98cuenc.free.fr/wordprocessor/piecetable.html (along with a regression test suite).

You may be surprised that I choose a red-black tree instead some simpler structure (as a skip list) or some more efficient structure (as a b-tree). It was just because I already had a red-black tree that I implemented a while ago, so I just took the first balanced tree that I had on my hard disk. It was enough to show my idea, but before putting it in production code I want to convert it to a b-tree. The b-tree variant that I will try to implement is a fractal prefetching b+-tree (http://www.pdl.cmu.edu/ftp/Database/fpbtree.pdf).

The interesting part of changing from a linked list to a balanced tree was how to pass from a document position to a node on log(n). To do that I store the size of the children of each node in the node itself. Of course, it was interesting because I didn't know anything about the Enfilade theory, nor David McCusker blobs.

Using the enfilade terminology that Raph wrote about, the size of the nodes is a "wid" property.

David McCusker's blobs seems to share more or less the same goals as this data structure. Insertion, delete, and lookup are O(log n) operations. The structure is undo friendly, as you only have to store the operation (insert or delete) and at which point it happens to be able to undo it. David has obviously put a lot of work on blobs, and it seems to implement several optimization over the "piece table with a balanced tree" that I've implemented (as copy-on-write). I will have to look deeper inside Blobs to see how is it implemented, as right now I've only had a quick look at them.

I guess that all that should look like baby steps, as I've just started to look into the subject, and the trees field is a very mature one...

Btw, Raph, your tree representation for Athshe is beautiful :-)

Batch formatting

I'm kind of on a middle point between Raph and cinamod, but more on the cinamod side than on the Raph side :-) on the batch formatting tool vs. improve one of the current word processors.

I can see the Raph point of one batch formatting tool being easier to do than a full word processor. Raph says that you don't need to care about all the GUI, but you also don't need very complex data structures to hold your data. The text is not going to be inserted, nor deleted, so these operations don't need to be fast, as they don't exist. You don't need, either, undo.

I don't think that it will be possible to get a perfect word conversor if its layout engine is not done with the goal of duplicating the MS Word one. For instance, TeX will do a full paragraph justification, while word does not, so if you put TeX in the middle of the chain you will never get an identical result.

Also, I think that a fundamental part of a batch conversor will be a regression test suite that it should have if it wants any remote possibility of getting a perfect output.

But there is a lot of common code between a batch formatting and a word processor to just let them go through separate paths. I guess that the only added complexity due to be in a interactive word processor that will impact the batch formatting code is due to the above mentioned data structures to hold your data, and implementing them raises the usefulness of such a program a magnitude order. IMO the usefulness/complexity ratio is worth making the batch formatting tool "edition capable".

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!