2 Sep 2004 simos   » (Journeyer)

DocBook XML and creating printable documents (like PDF).

Is that an interesting topic? Well, it sure is. I'll go into detail, in layman's terms, so it's approachable.

XML is a versatile markup language that you can use to represent almost any information. You typically enclose pieces of data in tags, as in <name>Simos</name>. These tags are custom and signify what they contain. Because XML is this versatile, you need a so-called "schema", a description of the available tags, for the type of document you want to represent.
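To make the idea of custom tags concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The <person> and <site> tags are made up for illustration (only <name>Simos</name> comes from the text above), and no schema is checked here; the parser only verifies that the tags are well-formed.

```python
import xml.etree.ElementTree as ET

# A tiny document with custom tags. <person> and <site> are hypothetical
# names chosen for this example; XML itself does not define them.
SOURCE = "<person><name>Simos</name><site>Advogato</site></person>"


def describe(xml_text):
    """Return a (tag, text) pair for each direct child of the root element."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]


if __name__ == "__main__":
    for tag, text in describe(SOURCE):
        print(f"{tag}: {text}")
```

Nothing here tells the parser which tags are allowed or what they mean. That is the job of the schema.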

There is a standardisation process of schemata (plural of schema) for different domains at xml.org and specifically at their registry page.

XML is used in many places in open-source software, and the most common use is documentation. Here, DocBook XML is used. For example, see The Linux Documentation Project (TLDP), which has standardised on DocBook XML (if you remember, it used to be LinuxDoc some years back).

Suppose you have a document written in DocBook XML. With tools you can convert it to other presentation formats such as plaintext, HTML (+variants), PostScript, PDF and so on.

For the first two, the process is quite easy, as the tags are either stripped (plaintext) or converted to other tags (HTML). Your text editor or Web browser can be used to display these, and they do a good job of displaying Unicode characters as well.
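The two easy conversions can be sketched in a few lines of Python with the standard library. This is only an illustration under simplifying assumptions: the sample fragment is made up, the tag mapping covers just four DocBook elements, and real conversions are done with stylesheets rather than a hand-written table.

```python
import xml.etree.ElementTree as ET

# A made-up DocBook-like fragment with some non-Latin text in it.
SAMPLE = (
    "<article><title>Hello</title>"
    "<para>Some <emphasis>bold</emphasis> text, αβγ.</para></article>"
)

# Hypothetical, very partial mapping from DocBook tags to HTML tags.
TAG_MAP = {"article": "div", "title": "h1", "para": "p", "emphasis": "em"}


def to_plaintext(xml_text):
    """Strip all tags, keeping one line of text per top-level block."""
    root = ET.fromstring(xml_text)
    blocks = ("".join(child.itertext()) for child in root)
    # Collapse runs of whitespace inside each block.
    return "\n".join(" ".join(block.split()) for block in blocks)


def to_html(xml_text):
    """Rename known tags to their HTML counterparts and serialise."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = TAG_MAP.get(element.tag, element.tag)
    # encoding="unicode" returns a str and keeps non-ASCII characters as-is.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_plaintext(SAMPLE))
    print(to_html(SAMPLE))
```

In both cases the Greek letters pass through untouched, because the hard part (choosing fonts and laying out glyphs) is left to the editor or browser that displays the result.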

For PostScript or PDF the story is a bit different. It works relatively well with Latin-based scripts. For example, see Docbook bits, which shows how to set up your system with Fedora Core 2. No need for compilation; simply install the available RPM packages. For non-Latin scripts it's not so easy.

To convert from DocBook XML to PDF you need two programs: one that takes your DocBook XML source file and applies a stylesheet, producing a Formatting Objects (FO) intermediate file that contains both content and presentation information, and another that takes the FO file and converts it to PDF.

The first program is an XSLT engine and the second an FO engine. There are several engines for each role, listed at XSL Engines. We mentioned Docbook bits above; it uses xsltproc to convert DocBook XML to FO and then passivetex to convert FO to PDF (or PostScript). Another combination is Xalan and FOP (example). A third option is xmlroff, which can do both jobs: it starts from the XML source and a stylesheet and produces PDF. xmlroff is interesting because it uses Pango (yeah!) to render fonts (example with sample text in Greek, Russian, Arabic and Tamil).
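The two-step pipeline can be sketched as a small Python wrapper around the command-line tools. This is a sketch under assumptions, not a tested recipe: it pairs xsltproc with FOP (a combination chosen here for illustration, not one of the setups linked above), the file names are hypothetical, and the stylesheet path depends on where your distribution installs the DocBook XSL stylesheets.

```python
import subprocess
from pathlib import Path


def build_commands(source, stylesheet, pdf):
    """Return the two command lines for the XML -> FO -> PDF pipeline."""
    # The intermediate FO file sits next to the source, e.g. guide.xml -> guide.fo
    fo = str(Path(source).with_suffix(".fo"))
    return [
        # Step 1: the XSLT engine applies the stylesheet and writes the FO file.
        ["xsltproc", "--output", fo, stylesheet, source],
        # Step 2: the FO engine renders the FO file to PDF.
        ["fop", "-fo", fo, "-pdf", pdf],
    ]


def convert(source, stylesheet, pdf):
    """Run both steps in order, stopping at the first failure."""
    for command in build_commands(source, stylesheet, pdf):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names; print the commands instead of running them.
    for command in build_commands("guide.xml", "fo/docbook.xsl", "guide.pdf"):
        print(" ".join(command))
```

Keeping the command construction separate from the execution makes it easy to swap in a different engine for either step, which is the point of having an intermediate FO file.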

To sum up, what the community needs is a way to create quality PDF and PostScript files from DocBook XML for any language (assuming there is a font). The process should be easy to follow (like Docbook bits), and distributions should have the necessary tools available as packages (RPM, DEB, etc.).
