Trees
I've been studying trees lately. The goal that I pursuit is to do a
data structure/algorithms that make editing operations a O(log n),
where n is the number of previous changes. In AbiWord we have
O(n) algorithms, so the O(log n) should provide a good boost to the
performance.
I will present a somewhat simplified vision the current data structure.
It is composed of 2 buffers. One of them is read only (the buffer "0"),
and contains the original text of the file that you're editing. The
second one is append only (buffer "1"), and at the beginning is empty.
We also have a linked list, whose nodes contain 3 fields: Buffer, Offset
and Size. Each node represents a piece of text. At first, the linked
list only contains a node, with has Buffer 0, Offset 0, and Size equal
to the size of the read only buffer.
To explain the insert operation I will use an example. Let's say that
you have a document with "ABCDEFGH" and you want to insert the 'a' letter
at the 4th position in the document.
Buffer 0: ABCDEFGH
Buffer 1: [empty]
Linked list: <0, 0, 8>
To do that, you should append the 'a' letter to the end of the Buffer 1,
and you split the node on the linked list and insert a new node to point
to your letter. It ends like that:
Buffer 0: ABCDEFGH
Buffer 1: a
Linked list: <0, 0, 4> => <1, 0, 1> => <0, 4, 4>
In general, insertion is: search the right node on the linked list (O(n)
operation), and split the node. Of course, there are some edge cases,
when you insert at the beginning or at the end of a piece you don't need
to split it.
The delete operation works much in the same way. If you want to delete
de 'B' in our example's document, you don't have to touch any of the
buffers, and you change the linked list to:
Linked list: <0, 0, 1> => <0, 2, 2> => <1, 0, 1> => <0, 4, 4>
To get the document contents, you only have to walk the linked list, and
get the characters that the nodes point to.
I wanted to change the linked list to a balanced tree, to make lookup,
insert and delete operations O(log n) instead of O(n). So I did it, and
I put the results at http://e98cuenc.free.fr/wordprocessor/piecetable.html
(along with a regression test suite).
You may be surprised that I choose a red-black tree instead some simpler
structure (as a skip list) or some more efficient structure (as a b-tree).
It was just because I already had a red-black tree that I implemented a
while ago, so I just took the first balanced tree that I had on my hard disk.
It was enough to show my idea, but before putting it in production code
I want to convert it to a b-tree. The b-tree variant that I will try to
implement is a fractal prefetching b+-tree
(http://www.pdl.cmu.edu/ftp/Database/fpbtree.pdf).
The interesting part of changing from a linked list to a balanced tree
was how to pass from a document position to a node on log(n). To do that
I store the size of the children of each node in the node itself. Of
course, it was interesting because I didn't know anything about the
Enfilade theory, nor David McCusker blobs.
Using the enfilade terminology that Raph wrote about, the size of the nodes
is a "wid" property.
David McCusker's blobs seems to share more or less the same goals as this
data structure. Insertion, delete, and lookup are O(log n) operations.
The structure is undo friendly, as you only have to store the operation
(insert or delete) and at which point it happens to be able to undo it.
David has obviously put a lot of work on blobs, and it seems to implement
several optimization over the "piece table with a balanced tree" that
I've implemented (as copy-on-write). I will have to look deeper inside
Blobs to see how is it implemented, as right now I've only had a quick
look at them.
I guess that all that should look like baby steps, as I've just started
to look into the subject, and the trees field is a very mature one...
Btw, Raph, your tree representation for Athshe is beautiful :-)
Batch formatting
I'm kind of on a middle point between Raph and cinamod, but more on the
cinamod side than on the Raph side :-) on the batch formatting tool vs.
improve one of the current word processors.
I can see the Raph point of one batch formatting tool being easier to do
than a full word processor. Raph says that you don't need to care about
all the GUI, but you also don't need very complex data structures to
hold your data. The text is not going to be inserted, nor deleted, so these
operations don't need to be fast, as they don't exist. You don't need, either,
undo.
I don't think that it will be possible to get a perfect word conversor
if its layout engine is not done with the goal of duplicating the MS Word
one. For instance, TeX will do a full paragraph justification, while word
does not, so if you put TeX in the middle of the chain you will never get
an identical result.
Also, I think that a fundamental part of a batch conversor will be a
regression test suite that it should have if it wants any remote possibility
of getting a perfect output.
But there is a lot of common code between a batch formatting and a word processor
to just let them go through separate paths. I guess that the only added
complexity due to be in a interactive word processor that will impact the
batch formatting code is due to the above mentioned data structures to
hold your data, and implementing them raises the usefulness of such a
program a magnitude order. IMO the usefulness/complexity ratio is worth
making the batch formatting tool "edition capable".