Source code is to object format as XML is to what?

Posted 1 Jun 2002 at 09:03 UTC by ncm

raph wrote: "it's an XML format file, so we'd need to link in an XML parser just to figure out what fonts are installed."

Raph's is a universal complaint. I'm going to riff on an idea for how to alleviate it once and for all. (I have a vague recollection that something like this came up before, but I think I'm going somewhere else with it this time.)

We once had a problem in compiler-land that shared much with what XML is trying to address, and we also had most of the problems that we still have with XML. We have ended up (in the fullness of time) with far more helpful results than we yet have for XML.

The problem with XML is that everything that operates on it is huge and complicated. That seems, in practice, to make most things that only use the format huge and complicated too. The reason for the complexity, if I understand it correctly, is unavoidable: everything that in compiler-land we do at compile-time, with XML we must do at runtime, and so to get tolerable performance demands great cleverness. For any generic XML utility, or library, there's no getting around that — they have to read the DTD and adapt. Most programs that we want to have use XML formats, though, don't have that problem.

Generating XML to any particular DTD is always trivial. The complexity is in parsing efficiently and generically. The key insight is that most applications that actually use data that can usefully be XML-formatted might just as well have the DTD compiled in. But while a Yacc equivalent for DTDs might speed up some programs, I doubt it would really reduce the complexity of programs built with it.

I'm thinking, instead, about a standard intermediate format, something that fills the role of ELF in compiler-land — albeit without the nightmares peculiar to ELF and its ilk. An OS program loader doesn't know about every possible CPU architecture, and it doesn't compile every possible programming language. It knows only enough to map in code, initialize data segments, and connect up symbol references. The code that does that is almost generic among all OSes, CPUs, and languages, yet is remarkably simple (modulo the familiar nightmares).

The idea here is that any particular program, much like the OS's program loader, implicitly knows only one DTD, and need have only gross sanity checking built in. Instead of reading XML itself, it reads a sanitized, pre-parsed "object file" format that has been generated by some other, generic program — in compiler-land, the compiler and linker. This intermediate form would be a read-only format: you would always regenerate it from XML, rather than editing it directly.

As in compiler-land, this standard intermediate format standardizes only (1) gross structural details and (2) just a few important semantic abstractions. (For XML data, (1) might be tree-structure markers and attribute lists, and (2) might be an interned-symbol table, some kind of annotation for cross-references within the tree, and maybe a table of contents using those.) Since whatever reads it can assume it is all well-formed, the format can be optimized for compactness and easy random access (unlike XML!).
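For concreteness, here is one way the fixed structures of such a format might look in C. This is a sketch under my own assumptions: the names (abc_header and friends), the 32-bit file offsets, and the particular fields are invented for illustration, not part of any existing format.

    /* Hypothetical on-disk layout for the intermediate format sketched
       above.  Cross-references are file offsets rather than pointers,
       so the file is valid no matter where it is mapped. */
    #include <stdint.h>

    struct abc_header {
        uint32_t magic;          /* identifies the format, as \177ELF does */
        uint32_t version;        /* reject files chewed by the wrong tool */
        uint32_t symtab_offset;  /* interned-symbol (string) table */
        uint32_t symtab_count;
        uint32_t root_offset;    /* root node of the element tree */
    };

    struct abc_node {            /* one element in the tree */
        uint32_t name;           /* index into the symbol table */
        uint32_t first_child;    /* file offset, 0 if none */
        uint32_t next_sibling;   /* file offset, 0 if none */
        uint32_t attr_offset;    /* this node's attribute list */
        uint32_t attr_count;
    };

    struct abc_attr {
        uint32_t name;           /* symbol-table index */
        uint32_t value;          /* symbol-table index */
    };

Interning every name and attribute string in one table is what makes the cross-reference annotations and table of contents cheap: a reference is just an integer.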

We don't build a C compiler, assembler, and linker into every program and our kernel. (It is common to build Lisp compilers into everything Lispy, and in UI-land we seem to build mail clients into everything; e.g. Emacs and Netscape. But many consider that tendency a Bad Thing.) Why should we build all this XML handling into everything when it might just as well be sequestered in a few generic programs? Libraries encapsulate complexity, but enormous-library dependencies constitute a complexity of their own, with endemic versionitis and (too often) cross-language integration ugliness. A library to help read only this intermediate format would be much smaller and simpler than any XML-parsing library, and correspondingly more stable and maintainable.

Maybe this intermediate format already exists somewhere.


WBXML?, posted 1 Jun 2002 at 11:55 UTC by tk » (Observer)

Does WBXML cut it?

I think you overestimate the complexity of an XML parser., posted 1 Jun 2002 at 12:23 UTC by egnor » (Journeyer)

I also think you underestimate the value of having these data files always available in a standard, human-readable format. I also don't think the compiler metaphor is a good one.

I don't understand what you are railing against, nor do I really understand your proposed solution. Are you suggesting that XML parser libraries are large and complex? (They are not, by almost any measure.) Are you suggesting that the XML processing code in the application itself is complex? It's usually no worse than walking any other kind of tree and looking for what you want.

Are you proposing that we use a single binary format which is semantically equivalent to XML but somehow simpler to parse (and with tools "xml2bin" and "bin2xml")? I'm sure that's a bad idea. Or are you proposing a different binary format for every application, which would somehow reflect that application's needs, and an application-specific tool to compile the source XML into that application's particular format and decompile it? That, too, sounds like a bad idea, which will increase complexity and create headache for very little gain.

But perhaps I misunderstand you entirely.

xml is just another file format, posted 1 Jun 2002 at 14:48 UTC by graydon » (Master)

I think a simpler goal is to stop treating "use of xml" as a "taken as read" design aspect of any piece of software. it's a file format. it makes sense when you need all of:

  • unicode
  • mixed text / tree nodes (not the same as "some text nodes")
  • the need to embed / extend / interoperate with an existing xml format
  • primarily a "marked up text" application

it is not a good choice if you want a "simple format for storing tree-structured data". an xml document is already two overlaid trees (elements and attributes) with unsatisfactory semantics for each, a difficult parsing task, poor facilities for embedding data blocks, complex normalization rules, etc. the supposed reuse you gain from "not having to write your own parser" is not that great, especially if your data model is simple and doesn't fit xml well. writing parsers for small languages isn't tremendously hard.

that said, some programming systems automatically provide tools to serialize objects or data structures as xml or sexps or whatever. if it's a built-in feature of the programming system, well, you'd be silly not to use such a thing, whatever its format is.

A Database, posted 1 Jun 2002 at 18:56 UTC by neil » (Master)

An XML file is a verbose, human readable and editable store of data that feels inefficient to work with programmatically. A Database is a binary store of data that would be a pain to work with by hand but is supposed to be efficient when you use the provided database API. So there you go.

Simple, posted 1 Jun 2002 at 19:48 UTC by ncm » (Master)

egnor wrote:

Are you suggesting that XML parser libraries are large and complex? (They are not, by almost any measure.)

Maybe. I can't believe how huge and fragile (and, often, slow) the XML libraries I know about are. Without reading the huge things I can only speculate about where the bloat comes from. It does seem pervasive, though, and everybody complains about it. (Everybody complains about binutils, too, of course, but at least you don't have to link most of it to every program, and there's only one, not one or two for each language.)

Many of us would prefer to leave the bloat in a separate program, written in a language ideally suited to the purpose, rather than linked into our program and run again every time it reads anything. I don't want to put words into Raph's mouth, but that's how I read his plaint.

Are you proposing that we use a single binary format which is semantically equivalent to XML but somehow simpler to parse (and with tools "xml2bin" and "bin2xml")? ... Or are you proposing a different binary format for every application, which would somehow reflect that application's needs, and an application-specific tool to compile the source XML into that application's particular format and decompile it? ... But perhaps I misunderstand you entirely.

Neither. I'm proposing a format that can usefully be mmapped. The library that reads it does that, and then just patches up pointers. (Probably it mmaps two parts of it separately, the bulk of it shared, and the patched-up part privately.) While it might be possible, in principle, to decompile such a file, no sensible person would. (Who decompiles object files?) The file ends up as a native data structure in memory with hardly any work and with a tiny library interface to make it happen, more or less like dlopen(). There is only one xml2bin program, and probably a style sheet for each application, and a tiny libxmlbin library.
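A loader for such a format, in the spirit of dlopen(), might be no more than the following sketch. Everything here is hypothetical (abc_open is an invented name, error handling is minimal, and it maps the whole file in one private mapping rather than the two mappings suggested above); the point is how little of it there is.

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    void *abc_open(const char *path, size_t *len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return NULL;
        }

        /* MAP_PRIVATE: any pages we patch are copied on write for this
           process only; untouched pages stay shared with every other
           reader of the same file. */
        void *base = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE, fd, 0);
        close(fd);                  /* the mapping keeps the file alive */
        if (base == MAP_FAILED)
            return NULL;

        /* Pointer patch-up, if the format uses pointers rather than
           offsets, would walk a patch table here. */
        *len = (size_t)st.st_size;
        return base;
    }

If the file stores offsets instead of raw pointers, the patch-up pass disappears entirely, at the cost of one addition per dereference.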

graydon wrote:

I think a simpler goal is to stop treating "use of xml" as a "taken as read" design aspect...

I'm very sympathetic to that view, but there seem to be lots of [what might even be good] reasons for wanting to use standard tools to operate on external file representations. Anyway it always seems to be somebody else who decides we're supposed to "do" XML. XML is an awful format, but it's one that people seem to be able to agree on, and agreement is a scarce resource.

Unfortunate Practicality, posted 1 Jun 2002 at 21:58 UTC by Bram » (Master)

There's a good essay on Why XML is technologically terrible, but you have to use it anyway

Complexity, Apples, Oranges and Design Antipatterns., posted 2 Jun 2002 at 15:57 UTC by bjf » (Journeyer)

Here are some ideas I'd like to contribute to this discussion :)

I don't believe that XML should be rejected entirely just because some implementations are unwieldy and complex. If you can do without DTD/Schema validation, XPath, and so on, you could probably write a simple SAX-based parser in C in a couple of hundred lines of code. You don't have to go the whole hog to make effective use of XML. It has its place, and best of all, it does not have to be complicated. Hey, would you rather parse an XML document that comes with a DTD, or manage with something like ASN.1 or COM compound documents? Give me the former any day, please.

BTW, are you suggesting here that, to handle the complexities of parsing and validating an XML document against a given DTD, you add even more complexity by having some sort of tool generate a custom parser? This problem has already been solved, for Java at least. There are tools that take care of the details of doing a validated parse of a given schema into Java data structures.

By the way, I think that trying to view the problems of manipulating semi-structured data from a compiler-writer's perspective is a bad fit. Thinking about semi-structured data is probably better approached from a database person's background instead. Much of the theoretical work on semi-structured data that I've come across approaches the problems as an extension and outgrowth of database theory and data modelling. As I understand it, an XML DTD/Schema is a data model (like, say, UML, or a souped-up conceptual schema) and XML data is data conforming to that model. In the programming language/compiler writer's world, a computer program, whether in source form, AST, or whatever, is an instance of a context-free or regular language. Sorry, but I cannot see a great deal of overlap between, say, a context-free grammar and a UML-like data model. It's like trying to compare apples and oranges.

One should try to avoid approaching complex problems like this by looking at them at too low a level. When we're discussing general concepts like the best and computationally most efficient way to work with semi-structured data like XML (amongst others), one should avoid making high-level design decisions based on low-level details like "in the implementation, I wanna be able to mmap() this and malloc() that." This way of thinking gratuitously introduces bad design decisions that may cripple a given technology years down the track. mbp notes that Microsoft, when designing certain network protocols, was guilty of this particular "antipattern". Well, Martin's example was a little different, in that he cited how certain network protocols were based on straightforward serializations of binary data structures out of laziness/convenience, rather than being properly designed and reasoned about at a higher level. The message is essentially the same, though.

Donald Knuth, in a Dr Dobbs Journal article several years back, said that one trait of a good computer scientist is the ability to think at both the high level and the low level. I think this is excellent advice.

It doesn't have to be complex..., posted 2 Jun 2002 at 22:44 UTC by simonstl » (Master)

Though the W3C really seems to be trying to build as much junk as possible into XML, there are pretty regular revolts against the ever-growing complexity.

For one (old) take, see Common XML.

On the tools side, it's plenty possible to use XML without having to go too deep into XML-specific code. Python and Perl are shaping up as good candidates for this, though Java and C++ offer some decent facilities.

That said, I'm not going to push XML as the answer to all problems. I don't think XML is necessarily what programmers want, and I blame a lot of the complexity it has acquired since XML 1.0 on programmers trying to twist XML into doing things for which it's just not well-suited.

XML complicated?, posted 3 Jun 2002 at 03:10 UTC by mx » (Journeyer)

I've used XML in a number of production projects and am not sure why the author would classify it as complex. Schema validation isn't required in all circumstances, just as non-xml apps do not always (or often) validate the layout of their data formats. XML is just a markup for data represented in text.

Even selecting a parser for XML is not complicated, and depends on how the data is to be used. Some relatively simple Perl code will go a long way toward grepping chunks of data from XML. A SAX-like parser does well in quasi real-time, and a DOM does fine for document-like uses. It is just a matter of applying the bits that fit well.

Now, I'd agree that XML is not the end of the universe. I'd even venture so far as to say that it is about as attractive as a rotted potato. But it does work, it is human readable, and it makes the pointy-haired souls smile (slurpage points). Not a best-fit solution, but not so complicated that managers are unable to see some potential.

Just think of it in a zen sort of way: it is a tool that has some use - use it where it fits. Complication is in the eye of the beholder. Of course, it could be better ... but then who would standardize it?

I said "simple"., posted 3 Jun 2002 at 05:31 UTC by ncm » (Master)

bjf wrote:
BTW, are you suggesting here that to handle the complexities of parsing and validating an XML document against a given DTD, that you add even more complexity by having some sort of tool to generate a custom parser?

Of course I am not suggesting any such thing — as I already said in the original article, and explained further in my addendum. Instead of repeating it all here, I will just refer you to the text above.

Thinking about semi-structured data is probably better approached from a database person's background instead.

When you're after simplicity, approaching anything from a "database person's background" is the last thing any sensible person would do. A database approach is a good way to add two orders of magnitude extra cruft, overhead, and fragility. (That's not to say there is no place for those things. I just don't see any place for them here.) We're not talking about anything that needs transactions and rollback. Anyway, traditional databases are awful at handling graph-structured data.

Sorry, but I cannot see a great deal of overlap between, say, a context-free grammar and a UML-like data model.

I'm afraid you are all alone in that view. Everybody else seems to find the close analogy between XML and S-expressions (which represent Lisp programs) natural and compelling.

mx wrote:

I've used XML in a number of production projects and am not sure why the author would classify it as complex.

I don't classify XML as complex. I classify XML-handling libraries as complex. They are enormous, and do far more than any sensible person who just wants to read a file into memory wants to do or have done. For most programs that might use them, they are the single greatest source of instability in the system.

Why should every program in the universe be obliged to parse things that could better be parsed (and validated, and minced) once and for all by a program specialized for the job?

To help focus discussion, consider this. Suppose it became necessary to communicate data that is normally represented in XML to your favorite OS kernel — e.g., some funny network-routing or firewall configuration. Would you suggest that the kernel have your favorite XML parser/validator/slicer-and-dicer library linked into it? Of course not. You would pre-chew the XML in user-space, and just show the kernel the results.

Now, suppose you found lots of different kinds of data, formatted in XML, to tell the kernel about. You could write a whole bunch of programs to chew them all for the kernel, or you could write just one that generates a file the kernel can map in and use, no matter the subject. What I want is that program. Call it xml2abc, and the mmappable format ABC, for "already been chewed".
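To show what "no parsing at all" looks like in practice, here is a hypothetical lookup against the abc_node layout sketched earlier: finding a named child is an integer comparison and a couple of offset hops, and only the pages actually visited ever get faulted in.

    #include <stddef.h>
    #include <stdint.h>

    /* Assumes the hypothetical abc_node from the earlier sketch. */
    const struct abc_node *abc_child(const void *base,
                                     const struct abc_node *n,
                                     uint32_t name_sym)  /* interned name */
    {
        for (uint32_t off = n->first_child; off != 0; ) {
            const struct abc_node *c =
                (const struct abc_node *)((const char *)base + off);
            if (c->name == name_sym)
                return c;            /* found: no text was ever scanned */
            off = c->next_sibling;
        }
        return NULL;
    }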

Alternatives to XML, posted 3 Jun 2002 at 06:09 UTC by robla » (Master)

I tend to think that just using XML is the right answer for a large number of problems. However, it has been interesting keeping track of the alternative formats.

Here's a few that I know of:

  • WBXML - WAP Binary XML format. This was the WAP Forum's answer to the problem of "wah, XML is too big for our little cell phones". I hadn't realized, before this evening, that the WAP Forum had submitted this to the W3C.
  • YAML - YAML Ain't Markup Language. This was explicitly designed to give text representations to data structures. It claims to shed XML's legacy problem, and provides a direct mapping into datatypes commonly supported in programming languages (see the short example after this list).
  • BLOB - From the abstract: Binary Low-Overhead Block (BLOB) protocol for on-the-wire presentation of data in the context of higher-level protocols. BLOB is designed to encode and decode data with low overhead on most CPUs, to be reasonably space-efficient, and for its representation to be sufficiently precise that it is suitable as a canonical format for digital signatures.
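
To give a flavor of the YAML entry above, here is a small snippet (the data is invented for illustration); it maps directly onto a dictionary holding a list of dictionaries, with no markup beyond indentation.

    fonts:
      - name: Foo Sans
        weight: 400
      - name: Foo Serif
        weight: 700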

I'm glad there's thought going into alternate formats. While I think XML will reign supreme for some time to come, I do think there are a lot of "nails" being beaten by the XML hammer that don't look like nails to me.

Yet another wire format, posted 3 Jun 2002 at 19:15 UTC by Bram » (Master)

Yet another data format is my own bencode - nested dictionaries and lists, containing strings, integers, and nulls, and that's it.
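bencode is simple enough to show whole. The sketch below emits a two-entry dictionary in C, following the bencode rules as I understand them from BitTorrent (strings as <length>:<bytes>, integers as i<n>e, dictionaries as d...e with keys in sorted order); the nulls Bram mentions are omitted here.

    #include <stdio.h>
    #include <string.h>

    static void benc_string(FILE *out, const char *s)
    {
        fprintf(out, "%zu:%s", strlen(s), s);   /* <length>:<bytes> */
    }

    static void benc_int(FILE *out, long n)
    {
        fprintf(out, "i%lde", n);               /* i<n>e */
    }

    int main(void)
    {
        /* {"name": "example", "size": 42} encodes as
           d4:name7:example4:sizei42ee */
        fputc('d', stdout);
        benc_string(stdout, "name");
        benc_string(stdout, "example");
        benc_string(stdout, "size");
        benc_int(stdout, 42);
        fputc('e', stdout);
        fputc('\n', stdout);
        return 0;
    }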

XML has several bad properties which simply can't be fixed by wrapping it in a good library. It will never be canonical. It will never support adding new fields to messages in a clean, consistent way. It will always be a performance pig (Sorry, it will. It may sometimes not be a bottleneck but it will always be a pig.) And the sheer complexity of the spec ensures that there will always be some (albeit hopefully not very painful) version problems. Thank you W3C.

Handcranked XMLish, posted 3 Jun 2002 at 23:36 UTC by whytheluckystiff » (Master)

This article reminds me of this old one from the article archive:

RFC: Binary Markup Language

I'd really like to see a stripped-down XML standardized, but I think that the standardization process (with everyone getting a fair say!) would kill the whole idea.

Hand-editing XML is a pain. On the Psychoo project, we use a simple XML-ish format which is a breeze to parse. It's nothing more than a dictionary really. Works great. It's hard to resist the temptation to extend it, but it's important to bridle that specific passion. The W3C is gushing with those sorts of hormones.

Parser State?, posted 4 Jun 2002 at 01:24 UTC by idcmp » (Journeyer)

Why not allocate all the memory your XML parser needs in an mmap'd file and then just reload it to give a "preparsed" version of the file?

parsers are simple in real languages, posted 4 Jun 2002 at 18:56 UTC by splork » (Master)

All high-level languages have a good set of XML tools, ranging from the grossly complex to the very sweet and simple; Python, Java and Perl in particular. So don't whine about using it there; for most uses, just use a simple DOM-style parser that returns a data structure and ignore all of the silly DTD and document-correctness stuff.

I liked Bram's "why you have to use it anyways" link. Very humorous and a decent quick overview of what is out there at that.

As for protocol encodings, don't use XML. SOAP is ridiculous because of this, but it'll keep multi-GHz CPU vendors and Redmond software companies in business for another few years just trying to parse it, before they push for an even worse standard. Encodings like the above-mentioned bencode and BLOB are ideal for protocols moving useful amounts of arbitrary data in a canonical form with negligible overhead.

I don't think it's a good idea, posted 5 Jun 2002 at 08:42 UTC by raph » (Master)

A "chewed" form of XML will inherit the limitations of XML's data model, and have new disadvantages to boot: it won't be human readable, and you'll have version skew problems keeping the binary format in sync with the XML source. Anyone who's ever messed with virtusertable configuration in sendmail can appreciate the potential for nasty real-world problems.

I don't think these problems are worth the (relatively thin, imho) advantages of more compact representation and simpler parsing code. In fact, a read-only parser for XML is not that big a deal. Expat is one such, and available under the MIT license to boot.
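For a sense of scale, a complete (if trivial) streaming parse with Expat fits in a screenful. The handler names and the sample document below are mine; the API calls are Expat's.

    #include <expat.h>
    #include <stdio.h>

    static void XMLCALL on_start(void *ud, const XML_Char *name,
                                 const XML_Char **attrs)
    {
        int *depth = ud;
        (void)attrs;
        printf("%*s<%s>\n", (*depth)++ * 2, "", name);
    }

    static void XMLCALL on_end(void *ud, const XML_Char *name)
    {
        int *depth = ud;
        (void)name;
        --*depth;
    }

    int main(void)
    {
        static const char doc[] = "<fonts><font name='Foo'/></fonts>";
        int depth = 0;

        XML_Parser p = XML_ParserCreate(NULL);
        XML_SetUserData(p, &depth);
        XML_SetElementHandler(p, on_start, on_end);
        if (XML_Parse(p, doc, sizeof doc - 1, 1) == XML_STATUS_ERROR)
            fprintf(stderr, "parse error: %s\n",
                    XML_ErrorString(XML_GetErrorCode(p)));
        XML_ParserFree(p);
        return 0;
    }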

There are other interesting problems I brought up in my diary that chewed XML would not seem to solve. High on this list is change notification.

I still think that XML is not the best choice for fontconfig, but I also don't think it's a bad choice either. The excellent presentation recommended by Bram explains why quite well.

I have been thinking about this too., posted 5 Jun 2002 at 11:55 UTC by sarum » (Journeyer)

I have started work on an XDR data file to allow XML to be transmitted around PrintExpress in a transparent fashion, so that modules that are not interested in the loading and dumping of XML data do not need to understand it. This is as far as I got before I had to look at something else.

I am pretty sure it is not complete, and it definitely can't handle namespaces or schemas yet.

Suggestions on where to go from here are welcome :)

%// *****************************************************************
%// $Id: xdr/px_xml.x 2603.2 2002/05/03 15:09:59BST sarum Exp  $
%// -----------------------------------------------------------------
%// This attempts to allow the XML Datastruct format to be
%// passed via XDR in a simple for PrintExpress to parse format
%// *****************************************************************

#ifdef JAVA
#else
%
% struct xml_element;
% struct xml_param;
%
#endif

/* --------------------------------------------------------------- */
/* This is the definition of an XML Element's payload type          */
/* --------------------------------------------------------------- */

enum xml_type {
    xml_null,     /* there is no payload          */
    xml_scaler,   /* there is only one element    */
    xml_list,     /* there is a list of elements  */
    xml_unknown   /* there is some random data    */
};

union xml_payload switch ( xml_type type ) {
case xml_null:
    void;
case xml_scaler:
    string scaler<>;
case xml_list:
    xml_element* list;
case xml_unknown:
    opaque unknown<>;
};

/* --------------------------------------------------------------- */
/* This is the definition of an XML Element                         */
/* --------------------------------------------------------------- */

struct xml_element {
    xml_element* next;
    string name<>;
    xml_param* params;
    xml_payload payload;
};

/* --------------------------------------------------------------- */
/* This is the definition of a Tag Parameter                        */
/* --------------------------------------------------------------- */

struct xml_param {
    xml_param* next;
    string name<>;
    string value<>;
};

/* --------------------------------------------------------------- */
/* XML Document                                                     */
/* --------------------------------------------------------------- */

struct xml_doc {
    xml_param* prolog;
    xml_param* params;
    xml_element* doc;
};

%// *****************************************************************
%// $Log: xdr/px_xml.x $
%// Revision 2603.2 2002/05/03 15:09:59BST sarum
%//   add prologs.
%// Revision 2603.1 2002/05/02 14:59:30BST sarum
%// Revision 1.2 2002/05/01 13:19:44BST sarum
%//   Start of the new PX/GenApp interface
%// Revision 1.1 2002/04/22 17:24:54BST sarum
%//   Initial revision
%// *****************************************************************

Missed the point, I think., posted 6 Jun 2002 at 06:28 UTC by ncm » (Master)

I'm afraid raph, like many who replied, missed some of the point of the proposal. First, a binary format is Good. You could certainly represent the fonts themselves in XML, but it seems advisable to have a binary form for them. (Would that they were usefully mmappable.) If you're not going to validate when you read, you need some assurance that the file hasn't been tinkered with since the last time it was validated. The binary form offers that assurance. The right time to validate, and chew, would be at font installation time.

More importantly, the idea isn't to have something "easier" to parse, it's to avoid doing any parsing at all. Furthermore, using mmap means you have to look at — page in — only the parts of the file you care about, instead of having to scan the whole thing sequentially before you can do anything. In my work we routinely handle XML files of multiple gigabytes, so the difference between scanning and mmapping is counted in minutes.

That's not to say that mmapping is the answer to all of life's problems. While I hasten to note that change notification first came up after my original posting, you can unmap the old version and map in a new one at any time, if you know when to do it. A protocol that just says "change now" can be a lot simpler than one that says "change the following ... ". (Old-timers are familiar with the former implemented using just a signal. (What, sendmail again?))
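
Here is a sketch of that signal-driven "change now" protocol in C, reusing the hypothetical abc_open() from the earlier sketch. The handler only raises a flag; the actual remap happens at a safe point in the main loop.

    #include <signal.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    extern void *abc_open(const char *path, size_t *len); /* earlier sketch */

    static volatile sig_atomic_t remap_wanted;

    static void on_sighup(int sig)
    {
        (void)sig;
        remap_wanted = 1;            /* async-signal-safe: just set a flag */
    }

    int main(void)
    {
        size_t len;
        void *base = abc_open("fonts.abc", &len);

        signal(SIGHUP, on_sighup);
        for (;;) {
            pause();                     /* wait for a signal */
            if (remap_wanted) {
                remap_wanted = 0;
                munmap(base, len);                   /* drop the old version */
                base = abc_open("fonts.abc", &len);  /* map in the new one */
            }
            /* ... consult the mapped data as usual ... */
        }
    }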

As for versionitis, the sendmail versions I knew statted file timestamps, and rebuilt the "chewed" file when needed. (I assume they invoked a separate program for it.)

Finally, any argument against binary files works as an argument against visible object-code files as well. The author of Goo might argue that way; we should see how well his experiment works out.
