Recent blog entries for cuenca

Chema

I will miss you, dude.

I still remember the first time I meet Chema, at the first GUADEC. There was not a lot of people there, and the spanish speakers where a little bit packed together. I got thus specially tied with Rodrigo, alo, and Chema.

I still remember when Chema and me where alone together in the underground at Paris. Chema said me how exciting was to meet people that shared a common passion. He said me that he had not yet done anything in GNOME, but he wanted to help with gnome-print. He was almost a linux newbee.

A linux newbee, a future gnome hacker, that just wanted to help with gnome-print, just travelled half the planet to meet other gnome hackers. Man, *THAT* was passion. He not only helped with gnome-print, but it became its maintainer, along with gedit, glade3, gst, ...

As I write this, I'm seeing Chema giving away Ximian t-shirts at a Copenhagen's pub, or just explaining frenetically its ideas about GST.

I will miss you, dude.

Trees

I've been reading about David's blobs in more deep these days. I still don't fully understand them, but I think that I understand them well enough to start comparing them to a piece table with a tree.

The two structures are nearly identical. The only difference is on the leafs of the b-tree. From what I've read, it seems that a blob will store the text directly on the leafs, while a piece table will only store a pointer to the real text. This pointer is composed of the buffer (the read-only or the append-only one, see my previous diary entry), the offset and the size of the piece of text.

This extra indirection has several advantages.

First, the number of leafs is no more dependent on the size of the text, but on the number of pieces in your piece table. That's in theory dependent only on the number of changes that the piece table has suffered since the file was first read. In the real world, people likes to (over?) use the pieces to store the format of the text, so the pieces have the additional restriction of being contiguous chunks of text with the same format. But even with this additional restriction, there should be less nodes in the tree.

Second, the text is never, ever, really deleted. I don't find the delete algorithms of blobs, so maybe that's also a blob's feature, but I don't think so (my guess is only funded on the blob's data structure). That may seem at first like bloat. After all, why do you need to keep around even the deleted text? To be able to undo. In fact, the piece table enforces an infinite undo as a trivial feature.

The piece table also shares the strong points of blobs. Copy-on-write, for instance, can also be implemented more or less the same way as with blobs.

In short, I'm maybe missing something, but I still prefer a piece table with a auto-balanced tree than a blob.

7 Feb 2003 (updated 7 Feb 2003 at 14:37 UTC) »
Trees

I've been studying trees lately. The goal that I pursuit is to do a data structure/algorithms that make editing operations a O(log n), where n is the number of previous changes. In AbiWord we have O(n) algorithms, so the O(log n) should provide a good boost to the performance.

I will present a somewhat simplified vision the current data structure. It is composed of 2 buffers. One of them is read only (the buffer "0"), and contains the original text of the file that you're editing. The second one is append only (buffer "1"), and at the beginning is empty. We also have a linked list, whose nodes contain 3 fields: Buffer, Offset and Size. Each node represents a piece of text. At first, the linked list only contains a node, with has Buffer 0, Offset 0, and Size equal to the size of the read only buffer.

To explain the insert operation I will use an example. Let's say that you have a document with "ABCDEFGH" and you want to insert the 'a' letter at the 4th position in the document.

Buffer 0: ABCDEFGH
Buffer 1: [empty]
Linked list: <0, 0, 8>

To do that, you should append the 'a' letter to the end of the Buffer 1, and you split the node on the linked list and insert a new node to point to your letter. It ends like that:

Buffer 0: ABCDEFGH Buffer 1: a Linked list: <0, 0, 4> => <1, 0, 1> => <0, 4, 4>

In general, insertion is: search the right node on the linked list (O(n) operation), and split the node. Of course, there are some edge cases, when you insert at the beginning or at the end of a piece you don't need to split it.

The delete operation works much in the same way. If you want to delete de 'B' in our example's document, you don't have to touch any of the buffers, and you change the linked list to:

Linked list: <0, 0, 1> => <0, 2, 2> => <1, 0, 1> => <0, 4, 4>

To get the document contents, you only have to walk the linked list, and get the characters that the nodes point to.

I wanted to change the linked list to a balanced tree, to make lookup, insert and delete operations O(log n) instead of O(n). So I did it, and I put the results at http://e98cuenc.free.fr/wordprocessor/piecetable.html (along with a regression test suite).

You may be surprised that I choose a red-black tree instead some simpler structure (as a skip list) or some more efficient structure (as a b-tree). It was just because I already had a red-black tree that I implemented a while ago, so I just took the first balanced tree that I had on my hard disk. It was enough to show my idea, but before putting it in production code I want to convert it to a b-tree. The b-tree variant that I will try to implement is a fractal prefetching b+-tree (http://www.pdl.cmu.edu/ftp/Database/fpbtree.pdf).

The interesting part of changing from a linked list to a balanced tree was how to pass from a document position to a node on log(n). To do that I store the size of the children of each node in the node itself. Of course, it was interesting because I didn't know anything about the Enfilade theory, nor David McCusker blobs.

Using the enfilade terminology that Raph wrote about, the size of the nodes is a "wid" property.

David McCusker's blobs seems to share more or less the same goals as this data structure. Insertion, delete, and lookup are O(log n) operations. The structure is undo friendly, as you only have to store the operation (insert or delete) and at which point it happens to be able to undo it. David has obviously put a lot of work on blobs, and it seems to implement several optimization over the "piece table with a balanced tree" that I've implemented (as copy-on-write). I will have to look deeper inside Blobs to see how is it implemented, as right now I've only had a quick look at them.

I guess that all that should look like baby steps, as I've just started to look into the subject, and the trees field is a very mature one...

Btw, Raph, your tree representation for Athshe is beautiful :-)

Batch formatting

I'm kind of on a middle point between Raph and cinamod, but more on the cinamod side than on the Raph side :-) on the batch formatting tool vs. improve one of the current word processors.

I can see the Raph point of one batch formatting tool being easier to do than a full word processor. Raph says that you don't need to care about all the GUI, but you also don't need very complex data structures to hold your data. The text is not going to be inserted, nor deleted, so these operations don't need to be fast, as they don't exist. You don't need, either, undo.

I don't think that it will be possible to get a perfect word conversor if its layout engine is not done with the goal of duplicating the MS Word one. For instance, TeX will do a full paragraph justification, while word does not, so if you put TeX in the middle of the chain you will never get an identical result.

Also, I think that a fundamental part of a batch conversor will be a regression test suite that it should have if it wants any remote possibility of getting a perfect output.

But there is a lot of common code between a batch formatting and a word processor to just let them go through separate paths. I guess that the only added complexity due to be in a interactive word processor that will impact the batch formatting code is due to the above mentioned data structures to hold your data, and implementing them raises the usefulness of such a program a magnitude order. IMO the usefulness/complexity ratio is worth making the batch formatting tool "edition capable".

The university is over, na na na na :)

Well, the uni has been over (for this year) for 3 weeks, now, but I´ve been busy searching a job, getting back at home, etc...

And *NOW*, I have a Real Computer(TM). A pentium 100Mz with 64Mb... better than my french computer (only 16Mb).

Finally, I managed to learn a bit of perl (It has been in my TODO for ~two years).

And, the hottest news!! Now, I even have Inet connection!!! (who said that Spain is not a developped country? :)

Yesterday I started to hack the toolbars of AbiWord (they are too long, and they use only one toolbar per band...), translating some stuff, and played a bit more with Dia... Dude, that´s a KICK ASS app!

I´ve been playing with the idea of making a parser that generates a UML Dia diagram of a bunch of C++ source files (to do all the classes by hand is too cumbersome).

I miss the other GNOME guys, specially Chema and Rodrigo (and I want to meet acs!). I´m looking forward for GUADEC 2 or something...

All the day wasted in a stupid program for BD2. And *all the day* is *all the day*. 24h/24h. I'm tired and I want to sleep.

BD sucks.

Well, it was a long time since my last entry...

In the lasts two weeks I've been doing my homeworks: I've been working in hdoc (a C++ documentation extractor, a la javadoc or headerdoc), in IT, GL2, BD2, etc...

The university is *TOO* boring.

I've don't touched a single line of code of Abiword in the last two weeks, and it sucks too much.

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!