A few weeks ago I posted my initial thoughts about how to represent a structured text document in Haskell. As often with initial thoughts, they were flawed. ;-) In particular, I wrote:
I've tried to design the above types in such a way as to minimize the possibility of forming non-sensical or ambiguous documents. That's why there is such deep nesting of different types instead of just one big algebraic data type with constructors for concatenation, paragraph and section breaks, etc.
Without going into too much detail, I stumbled over the problem in the representation of block quotes: The corresponding constructor expected a full Doc and I couldn't break sufficiently small fragments out of those for quoting. For example, had I wanted to quote one item out of a numbered list, it would have gotten the number "1" again in the quote, because it would have been the first item within that document (fragment)...
So, I thought, should I make one big data type for the whole document after all? But still, I didn't want to include too much possibilities for such senseless combinations as bulleted lists and blockquotes within section headings. After some pondering I've decided on using a two-level structure, with one data type representing documents (or document fragments) at the block level (paragraphs, sections, etc.) and a second one for in-line structures only. In particular, a document is taken to be a list of "block-level tokens" ('Btok's):
type Doc = [Btok]Analogously, (logical) lines are represented by line-level tokens:
type Line = [Ltok]Apart from plain strings, these consist of things like emphasis, code spans, etc.:
data Ltok = St String -- character string
| Em Line -- emphasis
| Co String -- inline code
...
Note that under their intended interpretation, these constructors form
homomorphisms into the monoid of Ltok-lists (wrapping their
result in a single-element list): e.g. [St a] ++ [St b] is expected
to be equal to [St (a++b)], [Em a] ++ [Em b] == [Em (a++b)], etc.
This enables one to split a line in two at an arbitrary point without
"losing any information". :-)
I noticed that if something like the above could hold for the block-level Doc type as well, my block quote problem would be solved pretty elegantly. And indeed, I found the following representation:
data Btok = TEXT [Line] -- text block (logical lines)
| PARA Line -- new paragraph (w. opt. title)
| SECT Int Line -- new section (of given level)
| QUOT Doc -- blockquote
| BULL [Doc] -- bulleted list
...
Notice that the PARA and SECT constructors do not wrap the
corresponding paragraph or section. Their meaning is analogous to
that of a new-line character in plain-text: they start a new structure
which spans everything up to the next break. This is exactly what makes
QUOT work: We can break the document apart in the middle of a
section or paragraph and rightfully consider the resulting parts
fragments in the common sense that they can be glued back together to
form the original. Or to put it differently, a quoted fragment
carries all of its "own" structural information (where the meaning of
"own" also results from the data type's structure).
Remember that the whole point of my foray into structured documents originated in my little project of posting to my Advogato diary via email. So, say I had a document to be posted in memory as a value of type Doc as above. I'd want to convert it to HTML (or Advogato's subset thereof). While the representation of section breaks as stand-alone SECTs within a stream of block tokens closely fits with the <h1-6> tags to appear in the HTML, paragraphs are supposed to be wrapped in <p> tags. One obvious (but annoying) solution to this discrepancy would be to build a second data type, probably a little more like my original idea again, and convert to HTML via the detour of this intermediate type.
Luckily I remembered that Oleg[1] wrote something about folds being superior to cursors[2] as interfaces to data collections, so I implemented a function of the following humiliating type signature. ;)
-- Fold over a Doc's block-level structure.
folddoctree
:: (Biblio -> Line -> line) -- logical line
-> ([line] -> tok) -- text block
-> (String -> tok) -- code block
-> (doc -> tok) -- blockquote
-> ([doc] -> tok) -- bulleted list
-> (Int -> [doc] -> tok) -- numbered list
-> ([(String,line)] -> tok) -- bibliography
-> ( line -- para. title
-> [tok] -- para. content
-> para ) -- paragraph
-> ( Int -- sect. level
-> line -- sect. title
-> doc -- sect. content
-> sect ) -- section
-> ( [para] -- initial paragraphs
-> [sect] -- following sections
-> doc ) -- document
-> Doc -> doc
As you can see, it takes ten functions as parameters, each telling it
what to do with a certain part of the document, which it then
transforms accordingly. Given this function, an HTML export for
Docs is straight-forward to define.
NB. Notice how the different type variables are threaded through the type signature. It feels almost like LISP, in the sense that the functions can return any type they want, as long as everything fits. But unlike in a dynamically-typed language, the typechecker will tell you up-front if you make a mistake. :-P
Ah right, most important! The full definitions of the Doc type, the folddoctree function, and some further utilities can be seen at the following address:
http://www.khjk.org/~sm/code/advopost/Doc.lhs
Enjoy! :-)
