21 Mar 2005 msevior   » (Master)

Performance Improvements

The interesting discussion with Rick Shaut, Microsoft Word Developer for the Mac, has got Tomas Frydrych and I thinking the various performance bottlenecks in Abiword. Tomas, being a biblical scholar by profession, was able to decipher the following text describing the MS Word binary format.

"The piece table relates a logical character number, called a CP (Character Position), to a physical location within a Word file (an FC). The array of CPs in the plcfpcd defines a partitioning of the Word document into disjoint pieces. The second array is an array of PCDs (Piece Descriptors) which is in 1-to-1 correspondence to the array of CPs that records the physical location in the Word file where the corresponding piece begins. To find the physical location of a particular logical character ... [find] the index of the largest CP in the array of CPs that is less than the character CP. Then reference the PCD with that index in the array of PCDs. The FC stored in the PCD gives the position of the beginning of the piece in the file. Finally, add the offset of the desired character from the beginning of its piece to the FC of the beginning of the piece. This gives the actual file offset of the character."

Acording to Tomas this means that MS Word stores content in data structures indexed by two arrays. One array indexes the paragraph of a piece of text. The other is an offset of the array from where it would otherwise be if no editing had taken place in the document.

Acording to Tomas this means that MS Word stores content in data structures indexed by two arrays. One array indexes the paragraph of a piece of text. The other is an offset of the array from where it would otherwise be if no editing had taken place in the document.

There are two good things about this structure. The first is that searches for the location of a particular point in the document to map to the document structure is fast, order Log(N), because the paragraphs, (blocks), are ordered and hence can be binary searched.

The other is that for insertions or deletions to the document, the offsets of the different character fragments within the blocks do not need to be updated as they are positioned relative to the offsets of the blocks.

The fragments outside the block do not need to be altered since they are postioned relative to the block.

Update

Tomas has told me that this was his initial idea for how we could use the MS Word double array technique to speed up insertions and deletions in Abiword and that MS Word is likely not do this.

In contrast Abiword has a doubly linked-list of fragments which can be any sort of document structure including paragraph blocks and text structures. Each fragment knows it's position in the document and has links to the next and previous fragment.

To speed up searches for character positions we have a vector of fragment positions. We can also do binary searches of this to give us a search complexity of log(N).

However upon insertions or deletions in the piecetable we need to update every fragment downstream of the change. This operation is order (N) where N is the number of fragments.

Tomas's initial wrinkle has a considerable speed advantage over the current scheme. The update time is Order(N)*(Number Blocks)/(Number text fragments) which is typically a factor 5-10 smaller. For a typical 300 page document Tomas estimates AbiWord does about 16000 updates for each insertion/deletion event. The double array PT would require only about 1600.

However as you can see the overall complexity for the double array structure still grows as Order(N).

Two years ago Joaquin Cuenca Abela developed a prototype Piece Table based on a balanced Red/Black tree structure where all operations (search,insert,delete) require Order(Log(N)) operations.

http://e98cuenc.free.fr/wordprocessor/piecetable.html

I estimate that Tomas' example only requires 30 operations for Joaquin's Piece Table. This is one of the major performance improvements we can make to AbiWord.

I'll talk about other improvements later.

You can post a comment on this post here.

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!