Advogato: Source code is to object format as XML is to what?

We once had a problem in compiler-land that shared much of what XML is trying to address, and we also had most of the problems that we still have with XML. We have ended up (in the fullness of time) with a lot more helpful results than we have yet for XML.

The problem with XML is that everything that operates on it is huge and complicated. That seems, in practice, to make most things that only use the format huge and complicated too. The reason for the complexity, if I understand it correctly, is unavoidable: everything that in compiler-land we do at compile-time, with XML we must do at runtime, and so to get tolerable performance demands great cleverness. For any generic XML utility, or library, there's no getting around that — they have to read the DTD and adapt. Most programs that we want to have use XML formats, though, don't have that problem.

Generating XML to any particular DTD is always trivial. The complexity is in parsing efficiently and generically. The key insight is that most applications that actually use data that can usefully be XML-formatted might just as well have the DTD compiled in. But while a Yacc equivalent for DTDs might speed up some programs, I doubt it would really reduce the complexity of programs built with it.

I'm thinking, instead, about a standard intermediate format, something that fills the role of ELF in compiler-land — albeit without the nightmares peculiar to ELF and its ilk. An OS program loader doesn't know about every possible CPU architecture, and it doesn't compile every possible programming language. It knows only enough to map in code, initialize data segments, and connect up symbol references. The code that does that is almost generic among all OSes, CPUs, and languages, yet is remarkably simple (modulo the familiar nightmares.)

The idea here is that any particular program, much like the OS's program loader, implicitly knows only one DTD, and need have only gross sanity checking built in. Instead of reading XML itself, it reads a sanitized, pre-parsed "object file" format that has been generated by some other, generic program — in compiler-land, the compiler and linker. This intermediate form would be a read-only format: you would always regenerate it from XML, rather than editing it directly.

As in compiler-land, this standard intermediate format standardizes only (1) gross structural details and (2) just a few important semantic abstractions. (For XML data, (1) might be tree-structure markers and attribute lists, and (2) might be an interned-symbol table, some kind of annotation for cross-references within the tree, and maybe a table of contents using those.) Since whatever reads it can assume it is all well-formed, the format can be optimized for compactness and easy random access (unlike XML!).

We don't build a C compiler, assembler, and linker into every program and our kernel. (It is common to build Lisp compilers into everything Lispy, and in UI-land we seem to build mail clients into everything; e.g. Emacs and Netscape. But many consider that tendency a Bad Thing.) Why should we build all this XML handling into everything when it might just as well be sequestered in a few generic programs? Libraries encapsulate complexity, but enormous-library dependencies constitute a complexity of their own, with endemic versionitis and (too often) cross-language integration ugliness. A library to help read only this intermediate format would be much smaller and simpler than any XML-parsing library, and correspondingly more stable and maintainable.

Maybe this intermediate format already exists somewhere.

Source code is to object format as XML is to what?

Posted 1 Jun 2002 at 09:03 UTC by ncm

WBXML?, posted 1 Jun 2002 at 11:55 UTC by tk » (Observer)

I think you overestimate the complexity of an XML parser., posted 1 Jun 2002 at 12:23 UTC by egnor » (Journeyer)

xml is just another file format, posted 1 Jun 2002 at 14:48 UTC by graydon » (Master)

A Database, posted 1 Jun 2002 at 18:56 UTC by neil » (Master)

Simple, posted 1 Jun 2002 at 19:48 UTC by ncm » (Master)

Unfortunate Practicality, posted 1 Jun 2002 at 21:58 UTC by Bram » (Master)

Complexity, Apples, Oranges and Design Antipatterns., posted 2 Jun 2002 at 15:57 UTC by bjf » (Journeyer)

It doesn't have to be complex..., posted 2 Jun 2002 at 22:44 UTC by simonstl » (Master)

XML complicated?, posted 3 Jun 2002 at 03:10 UTC by mx » (Journeyer)

I said "simple"., posted 3 Jun 2002 at 05:31 UTC by ncm » (Master)

Alternatives to XML, posted 3 Jun 2002 at 06:09 UTC by robla » (Master)

Yet another wire format, posted 3 Jun 2002 at 19:15 UTC by Bram » (Master)

Handcranked XMLish, posted 3 Jun 2002 at 23:36 UTC by whytheluckystiff » (Master)

Parser State?, posted 4 Jun 2002 at 01:24 UTC by idcmp » (Journeyer)

parsers are simple in real languages, posted 4 Jun 2002 at 18:56 UTC by splork » (Master)

I don't think it's a good idea, posted 5 Jun 2002 at 08:42 UTC by raph » (Master)

I have been thinking about this too., posted 5 Jun 2002 at 11:55 UTC by sarum » (Journeyer)

Missed the point, I think., posted 6 Jun 2002 at 06:28 UTC by ncm » (Master)