Advogato: Where Should XML Go?

The World Wide Web Consortium published the Extensible Markup Language (XML) as a Recommendation in 1998. We envisioned use cases primarily in technical documentation, although a number of academic text-based projects were also very significant. Motivation for XML had come from a number of sources:

The late Yuri Rubinksy, then president of SoftQuad Inc, a visionary in the field of structured (semantic) markup, and also a champion for assistive technologies, had given a number of talks about the importance of sharing meaning;
C. Michael Sperberg-McQueen gave a talk at SGML '95 about SGML (the Standard Generalized Markup Language) as infrastructure; he suggested that we needed marked up information to become part of the invisible infrastructure of computing, much like the many service tunnels under the city of Chicago;
SoftQuad Inc. had shipped (in 1994) a Netscape browser plugin that displayed SGML documents, downloading additional definitions and components from the remote Web server if necessary, and also providing a way for people to share sets of annotations and superimposed links.

Panorama was generating a lot of interest and attention in the technical documentation world as people started to understand that the World Wide Web could be used as an instant delivery mechanism.
Jon Bosak was at Novell, where he was responsible for managing over a hundred thousand pages of documentation. He was an early adopter of Panorama, but was worried about committing to a proprietary technology: although SGML was a published standard, and all of Novell's documentation was already in SGML, the way that Panorama supported only a subset of SGML, and the way it was deployed on the Web, were not standard.

Jon searched for a group to work standardise "Web SGML", but unfortunately Yuri had just died, and it seemed that there wasn't anyone else who could persuade the ISO SGML Committee to look at this problem: perhaps they were still busy hoping the Web would go away.
The W3C had published the HTML specification as a Recommendation, and had a number of people (including Dan Connolly) who werefamiliar with SGML and the tools around it; Jon took his problem to the W3C and a new Working Group was born.

So you can see that there's a history of writing, of documentation; if I were to introduce more of the people involved in the early days of XML, and more of the projects, you'd see this even more strongly.

Three main specifications were envisioned, with a fourth following close on their heels: a way to style XML (XSL and XSLT, like SGML's DSSSL); a way to link within and between XML documents (XLink, a tiny subset of SGML HyTime), a way to search and query XML (XQuery, another tiny subset of SGML HyTime) and also a way to constrain the structure and content of XML documents so you could tell if a document conformed to a predefined set of expectations (XML Schema, like an SGML DTD).

W3C has since published XSL, XSLT, Xlink and Schema, and is working on a suite of related specifications with XQuery - Xpath 2, XSLT 2, XML Schema 1.1 and of course XML Query 1.0. There's also an XSL-FO 1.1 in the works (XSL-FO is the formatting part of XSL, as opposed to XSLT, the transformation part).

The big question is this: what should we do next?

Technical documentation was once a primary use case for the World Wide Web, but that's no longer true. It's no longer primary for XML either, although it's still very important. Instead, technical documentation is now one of a great many uses of XML. A Web service to provide a current stock price has, at first glance, very little in common with a 150,000-page aircraft repair manual.

As the uses of XML have spread, limitations and weaknesses have become apparent. Many of those limitations apply to technical documents even though it took widely differing applications to give us the perspective needed to understand them.

The natural verbosity of XML is excellent for robustness. This is essential for situations where the "correctness" of structured data cannot be automatically verified: a mismatched end-tag must be repaired by hand, because a computer can't generally tell if author or title was intended.

With verboseness, however, comes higher bandwidth usage and also greater time to read, write and process.
XML is a textual format, and is designed to be processed from start to end. This makes it difficult or impossible to start in the middle (for example, with a continuous news feed) or to jump directly to the Nth page of a large book.
XML has a number of features and constraints on those features that can be a pain to implement but that are rarely used. Some of them, like Notations, are not a good fit with the World Wide Web, and echo XML's SGML pre-Internet heritage. Others, like parameter entities, can be difficult to understand and use,and have an arcane syntax.

There's a cost to such features. XML is already substantially simpler than SGML, but it could be even simpler.

There are many other such items. The W3C has a Working Group currently devoted to working out whether a more efficient way to transfer XML between applications or systems should be published at W3C. This is sometimes called bianry XML although that's a misleading name for a number of reasons, and I personally prefer efficient interchange.

If W3C does publish a new way to interchange XML, we risk damaging the story that every processor can understand every document. Strictly speaking this is already a fiction because of encodings, and because of XML 1.1, so perhaps this isn't such a big deal as it might sound.

One way to introduce an efficient interchange format might be to publish an XML 2.0 with two separate syntaxes: the human-editable textual format and the more efficient and probably binary format. But if we do that we've changed XML. No XML processor today can handle an XML 2.0 document, since there isn't such a thing.

Should we change XML?

XML doesn't moo at taxis: it's not sacred. It's a spec that we should keep around as long as it's useful. But if we change it we have to wonder what other changes should we make. To determine that, we have to ask people who are using XML today what changes they would like to see, and also ask people not using XML today exactly why that's so.

So here are my questions for the Open Source and Free Software community:

People writing software and representing structured information (whether it's a configuration file or documentation or data) - if you're not using XML, what's stopping you?
People using XML: what are the edge cases, the limits, the places where you've tried to push XML and failed?
What (if anything) should we change?

Finally, I should note that I'm not trying to push XML as a single solution for all problems. Rather, I want to discover places where it's almost a solution: places where you think it's the right answer but you can't use it for some reason. Or reasons not to make changes, of course.

Where Should XML Go?

Posted 2 Feb 2005 at 11:29 UTC by Ankh

XML for configurations? No, not that, posted 2 Feb 2005 at 18:53 UTC by gwolf » (Journeyer)

Re: XML for configurations? No, not that, posted 2 Feb 2005 at 20:37 UTC by Ankh » (Master)

XML for configuration and data, posted 3 Feb 2005 at 04:14 UTC by tk » (Observer)

re: XML for configuration and data, posted 3 Feb 2005 at 06:22 UTC by jamesh » (Master)

Get rid of features that nobody uses, posted 3 Feb 2005 at 11:43 UTC by tjansen » (Journeyer)

Re: Get rid of features that nobody uses, posted 3 Feb 2005 at 14:01 UTC by Ankh » (Master)

Xml Core Parser, posted 3 Feb 2005 at 16:37 UTC by tjansen » (Journeyer)

Re: Xml Core Parser, posted 6 Feb 2005 at 04:50 UTC by Ankh » (Master)

DOM and SAX, posted 6 Feb 2005 at 15:56 UTC by jef » (Master)

Re: Xml Core Parser, posted 7 Feb 2005 at 00:56 UTC by tjansen » (Journeyer)

Mixed Content, posted 7 Feb 2005 at 16:43 UTC by Ankh » (Master)

My Suggestions, posted 14 Feb 2005 at 22:42 UTC by johnnyb » (Journeyer)

Re: My Suggestions, posted 17 Feb 2005 at 04:42 UTC by Ankh » (Master)

GJXDM, posted 9 Mar 2005 at 17:59 UTC by badvogato » (Master)

Fully integrate selected related standards, posted 14 Mar 2005 at 00:46 UTC by jrobbins » (Master)

Configuration in XML: a blessing, posted 1 Apr 2005 at 05:31 UTC by MartySchrader » (Journeyer)