5 Jul 2006 pesco   » (Journeyer)

Structured text in Haskell redux, or "Posting to Advogato ASCII style"

A few weeks ago I posted my initial thoughts about how to represent a structured text document in Haskell. As often with initial thoughts, they were flawed. ;-) In particular, I wrote:

I've tried to design the above types in such a way as to minimize the possibility of forming non-sensical or ambiguous documents. That's why there is such deep nesting of different types instead of just one big algebraic data type with constructors for concatenation, paragraph and section breaks, etc.

Without going into too much detail, I stumbled over the problem in the representation of block quotes: The corresponding constructor expected a full Doc and I couldn't break sufficiently small fragments out of those for quoting. For example, had I wanted to quote one item out of a numbered list, it would have gotten the number "1" again in the quote, because it would have been the first item within that document (fragment)...

So, I thought, should I make one big data type for the whole document after all? But still, I didn't want to include too much possibilities for such senseless combinations as bulleted lists and blockquotes within section headings. After some pondering I've decided on using a two-level structure, with one data type representing documents (or document fragments) at the block level (paragraphs, sections, etc.) and a second one for in-line structures only. In particular, a document is taken to be a list of "block-level tokens" ('Btok's):

type Doc = [Btok]
Analogously, (logical) lines are represented by line-level tokens:
type Line = [Ltok]
Apart from plain strings, these consist of things like emphasis, code spans, etc.:
data Ltok =  St  String           -- character string
          |  Em  Line             -- emphasis
          |  Co  String           -- inline code
          ...
Note that under their intended interpretation, these constructors form homomorphisms into the monoid of Ltok-lists (wrapping their result in a single-element list): e.g. [St a] ++ [St b] is expected to be equal to [St (a++b)], [Em a] ++ [Em b] == [Em (a++b)], etc. This enables one to split a line in two at an arbitrary point without "losing any information". :-)

I noticed that if something like the above could hold for the block-level Doc type as well, my block quote problem would be solved pretty elegantly. And indeed, I found the following representation:

data Btok =  TEXT  [Line]    -- text block (logical lines)
 
          |  PARA  Line      -- new paragraph (w. opt. title)
          |  SECT  Int Line  -- new section (of given level)
 
          |  QUOT  Doc       -- blockquote
          |  BULL  [Doc]     -- bulleted list
          ...
Notice that the PARA and SECT constructors do not wrap the corresponding paragraph or section. Their meaning is analogous to that of a new-line character in plain-text: they start a new structure which spans everything up to the next break. This is exactly what makes QUOT work: We can break the document apart in the middle of a section or paragraph and rightfully consider the resulting parts fragments in the common sense that they can be glued back together to form the original. Or to put it differently, a quoted fragment carries all of its "own" structural information (where the meaning of "own" also results from the data type's structure).
"Wait, there's more fun to be had!"

Remember that the whole point of my foray into structured documents originated in my little project of posting to my Advogato diary via email. So, say I had a document to be posted in memory as a value of type Doc as above. I'd want to convert it to HTML (or Advogato's subset thereof). While the representation of section breaks as stand-alone SECTs within a stream of block tokens closely fits with the <h1-6> tags to appear in the HTML, paragraphs are supposed to be wrapped in <p> tags. One obvious (but annoying) solution to this discrepancy would be to build a second data type, probably a little more like my original idea again, and convert to HTML via the detour of this intermediate type.

Luckily I remembered that Oleg[1] wrote something about folds being superior to cursors[2] as interfaces to data collections, so I implemented a function of the following humiliating type signature. ;)

-- Fold over a Doc's block-level structure.
folddoctree
  ::  (Biblio -> Line -> line)     -- logical line
 
      -> ([line] -> tok)           -- text block
      -> (String -> tok)           -- code block
      -> (doc -> tok)              -- blockquote
      -> ([doc] -> tok)            -- bulleted list
      -> (Int -> [doc] -> tok)     -- numbered list
      -> ([(String,line)] -> tok)  -- bibliography
 
      -> (  line                   -- para. title
            -> [tok]               -- para. content
            -> para  )             -- paragraph
 
      -> (  Int                    -- sect. level
            -> line                -- sect. title
            -> doc                 -- sect. content
            -> sect  )             -- section
 
      -> (  [para]                 -- initial paragraphs
            -> [sect]              -- following sections
            -> doc  )              -- document
 
      -> Doc -> doc
As you can see, it takes ten functions as parameters, each telling it what to do with a certain part of the document, which it then transforms accordingly. Given this function, an HTML export for Docs is straight-forward to define.

NB. Notice how the different type variables are threaded through the type signature. It feels almost like LISP, in the sense that the functions can return any type they want, as long as everything fits. But unlike in a dynamically-typed language, the typechecker will tell you up-front if you make a mistake. :-P

"Show me the code!"

Ah right, most important! The full definitions of the Doc type, the folddoctree function, and some further utilities can be seen at the following address:

http://www.khjk.org/~sm/code/advopost/Doc.lhs

Enjoy! :-)

References:

[1]
Oleg Kiselyov, Haskell (and Scheme (and probably more)) guru. See http://okmij.org/.
[2]
Oleg Kiselyov: "Towards the best collection traversal interface". http://okmij.org/ftp/Computation/Continuations.html#enumerator-stream

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!