Markup Abuse: some comments on the XML panacea
Posted 12 Mar 2000 at 05:24 UTC by graydon
XML is presently hailed as a universal encoding format, yet the
"extensions" (DTDs) of XML have nothing in common with one another. This
article addresses some existing DTDs, discusses the psychology of
text-encoded standards, and makes a case for standards
bodies returning to the task of agreeing on minimal, reasonable features
and practical
solutions rather than waving their hands and pointing at a DTD.
The full article is posted here; it is a little long to re-post
in its entirety.
I agree. A DTD is only good for checking the low-level syntax of a
document, and you almost invariably need higher-level tools (some of
which W3C tries to standardize) to check the correctness (not to mention
the meaning) of the document. Thus the DTD alone is almost no
use, and since it's rather tedious to write, you could just do without
it and move all the checks to the semantic analysis.
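To make that concrete, here is a minimal sketch in Python (the element names are invented for the example): the document below would satisfy any structural constraint a DTD can state, yet the check that actually matters only exists as ordinary code in the semantic analysis.

```python
# Hypothetical example: a DTD could require that <person> contains an
# <age> element, but not that the age is sensible. That rule lives here.
import xml.etree.ElementTree as ET

doc = "<person><name>Ada</name><age>-5</age></person>"
root = ET.fromstring(doc)  # structural level: this is well-formed XML

def semantically_valid(person):
    # a constraint no DTD can express: age must be non-negative
    return int(person.findtext("age")) >= 0

# structurally fine, semantically nonsense
assert not semantically_valid(root)
```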
The only benefit from vanilla XML that I can see is that it
is a standardized syntax for representing general structural data, and
you can find prewritten tools for reading and writing it. This is, to my
mind, not a very great feat. Parsers are relatively easy to write
anyway.
Frankly, I think XML should have had a much more
semantically-oriented approach in the first place. Now things like XML Schema are just
optional sugar that came as an afterthought, and horribly complex to my
mind.
I can only hope that some day we will get a relatively simple markup
language that attempts from the start to model the semantics of
documents, not just their external structure. Programming languages have
wonderfully advanced type systems these days (real programming languages,
that is) that can express the "meaning" of data at quite a high level.
It would be wonderful to see such techniques used for pure data
structure specification, as well.
Once the semantic model were clear, the actual textual representation
for such structures would be trivial to define.
In the meanwhile, if you just want a simple format for structured
data, use S-expressions,
like I do. Any lisp or scheme interpreter can read or write them (though
there are some dialect differences), and they are much more concise than
XML with its mandatory start and end tags and other standardized cruft.
S-expressions are even a bit more expressive than XML: A number is a
number, not a string of numeric characters. This is not really
a new idea; see LAML for a
practical example of this approach. I'm working on something similar.
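The typed-ness point is easy to demonstrate with Python's bundled XML parser (the <point> element is made up for the example): every leaf comes back as character data, and turning it into a number is always the application's job.

```python
import xml.etree.ElementTree as ET

point = ET.fromstring("<point><x>3</x><y>4</y></point>")
x = point.findtext("x")
assert x == "3" and x != 3   # XML hands you the characters, not a number
assert int(x) + 1 == 4       # the conversion is left to the application
```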
Sexpy MF, posted 12 Mar 2000 at 16:37 UTC by dan »
(Master)
It's all true. XML is no more a solution to communicating between
heterogeneous systems than a JVM with no standard classes is to running
code on them.
Interesting pointer to LAML. From my reading they actually define functions
with the same name as each markup element, though: I'm wondering why
they don't just keep it as data. As long as you have a tree, you can
process it with any tree-processing functions you like.
Plug time: this is what Araneida
does.
Look at
the counter-handler function in http://araneida.telent.net/examples/main.lisp. For non-Lispers,
the features that need explaining are
- html is a function that converts sexpr notation into angle brackets:
(html `(p "paragraph with " (b "bold") " word"))
=> "<P>paragraph with <B>bold</B> word</P>"
- ` and , are used for generating "semi-literal" data: ` introduces
literal data, and , escapes out of it so that computed data can be
inserted. Thus,
(html `(p "1+1=" ,(+ 1 1))) => "<P>1+1=2</P>"
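For non-Lispers, the same trick is easy to approximate in Python (a rough analogue for illustration, not Araneida's actual code), with nested tuples standing in for s-expressions:

```python
# Convert a tuple-based "sexpr" into angle-bracket markup.
def html(form):
    if isinstance(form, tuple):
        tag, *body = form
        inner = "".join(html(part) for part in body)
        return "<%s>%s</%s>" % (tag.upper(), inner, tag.upper())
    return str(form)  # strings and numbers print as-is

assert html(("p", "paragraph with ", ("b", "bold"), " word")) == \
    "<P>paragraph with <B>bold</B> word</P>"
assert html(("p", "1+1=", 1 + 1)) == "<P>1+1=2</P>"  # computed data mixes in
```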
Apologies for the formatting. advogato won't let me have
<pre>
and
<tt> doesn't actually show up any different from normal text.
Lisp was not meant for proportional fonts.
I did preview! Honest!
I've only started using XML recently, and I have to say that it is far
more than a fad. XML gives you a standard way to write your own format -
an angle bracket opens a tag, and so on; everyone
understands the basic syntax if they know XML or even HTML. Another
convenience provided by XML is that you don't have to write, test, and
publish your own parser. libxml (gnome-xml) is simple and sweet. Using
XML means that if you want to import multiple file formats you don't
need different parsers as you would if these were binary formats.
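That reuse argument in miniature, using Python's bundled ElementTree rather than libxml (the file format here is invented): reading a new format takes a few lines, with no parser of your own to write, test, or publish.

```python
import xml.etree.ElementTree as ET

config = """<settings>
  <window width="800" height="600"/>
  <recent><file>a.txt</file><file>b.txt</file></recent>
</settings>"""

root = ET.fromstring(config)
width = root.find("window").get("width")
files = [f.text for f in root.iter("file")]
assert width == "800"
assert files == ["a.txt", "b.txt"]
```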
In the free software world, there is a lot of participation from other
people who did not necessarily design the file format of your
application. Sure, documenting your file format is a good thing to do,
but is everyone going to do it? XML does not eliminate the need for file
format documentation, but it makes it pretty darn obvious. I'm not going
to paste some sample XML code from my application because I'm afraid
your browsers will ignore it as undefined HTML tags and I'm too lazy to
get it to display right in HTML, but I doubt that anyone who understood
the function of the application would have any questions after looking
at a sample document.
I can think of numerous other reasons why XML is superior to binary file
formats unless you really need to pack the data in tightly, but my favorite
is the HTML compatibility. IIRC you can take an AbiWord XML file (not
exporting to HTML - taking a native AbiWord XML file) and view it in a
web browser and it will be readable although not all of the formatting
will be preserved. Try THAT with a MSWord binary file!
missed point, posted 12 Mar 2000 at 19:46 UTC by graydon »
(Master)
I'm not debating that XML "makes sense" as document markup (it does),
but rather that in cases where you're storing semantically rich data
with a lot of weird structure, being able to "view it in a web browser"
is completely unimportant. Take one of the earliest uses: CML. Chemical
Markup Language. Unless you have a plugin for your browser which can
model bond lengths and dihedral angles (nope, it's not in CSS3 even)
viewing it in a web browser shows you a big meaningless pile of numbers.
The document has no meaningful use-case in any "standard" XML tool, only
in specialized CML tools, therefore CML is no better than any other open
encoding of dihedral angles and bond lengths, and is potentially
worse because it suggests to people, as you are saying, that
somehow all the reading, writing and interpreting software already
exists. The gnome-xml parser isn't going to write any CML manipulation
software for you. It does nothing but basic tokenizing and parsing into
a tree. We've had both textual (sexps) and binary (ANDF,HDF) ways of
doing this for a while.
Hmm. HDF seems interesting.
Never knew of it before. Rather low-level, though.
What all of us seem to want is a standard format for defining the
semantics of data, while XML only provides syntax.
In my mind, semantics equates to types. Indeed, when we use (say) simple
C structs and arrays and pointers, we can already define quite complex
data structures, and definitely more conveniently than with DTDs. It
appears to me that what we need is a data-centric type system. The
problem with type systems of programming languages is that in
lower-level languages they are oriented more towards specifying
representation than meaning. And of course there's non-data stuff like
functions and objects and abstract data types in there too.
I once had the idea that one could use (the non-OO subset of) CORBA IDL for
defining data structures language-independently. CORBA even defines a
standard binary format (CDR) for transmitting this data. What you can
transfer you can also store, so this would be one approach for
standardizing a data format. However, I think that IDL is still rather
limited. IIRC, you can't do non-tree graphs very straightforwardly.
So perhaps we really need a nice language-independent high-level type
system for pure data (no code or objects). We could then have multiple
written formats for this data, say, xml, sexps and binary, as well as
compilers to convert the data definitions into types in different
programming languages. With proper tools, all you'd need to do is write
a definition for the data, and you could get an autogenerated parser
that converted the data into properly typed objects in your programming
language.
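A toy version of that pipeline can be sketched in Python (every name here is invented for the example): one declarative definition yields both a native type and a loader that produces properly typed values from the XML encoding.

```python
import xml.etree.ElementTree as ET
from dataclasses import make_dataclass

def compile_definition(name, fields):
    """fields: list of (field_name, type). Returns (class, loader)."""
    cls = make_dataclass(name, fields)
    def load(text):
        elem = ET.fromstring(text)
        return cls(**{f: t(elem.findtext(f)) for f, t in fields})
    return cls, load

Point, load_point = compile_definition("Point", [("x", float), ("y", float)])
p = load_point("<Point><x>1.5</x><y>2.5</y></Point>")
assert isinstance(p, Point) and (p.x, p.y) == (1.5, 2.5)  # typed, not strings
```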
A while ago I entertained the idea of starting such a project, but
still haven't done so. Perhaps there really would be a need for it...
Why? The whole point is that, no matter how you specify the
layout of the data, that data is still meaningless until
interpreted. Unless your "Typed Document Definition" is
actually a complete library for operating on the data, there is no sense
in defining it. You would gain nothing above what starting from scratch
would give you, and it would blind you into designing a format
out-of-touch with the data it actually carries.
The advantage of having relatively standardized data formats is so that
when a new data format is needed, less effort is required to write
loaders and savers for the format, both for
the original implementer and for people doing ports.
Several languages already support some concept of
serialization
(e.g.
Java's Serializable interface; Python's pickles; Perl's Storable
module), but there is no standardization between these methodologies,
and in any case they're based on the representation of the data, which
is likely to be dependent on the implementation language (e.g. explicit
reference in Perl versus implicit reference in Python or Java).
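Python's pickle, one of the mechanisms mentioned above, shows both the convenience and the problem in a few lines: the round trip is free, but the byte stream is meaningful only to Python.

```python
import pickle

data = {"name": "node", "children": [1, 2, 3]}
blob = pickle.dumps(data)          # representation-level, Python-specific bytes
assert pickle.loads(blob) == data  # round-trips, but only within Python
```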
This is somewhat related to a point Jim Blandy raised at dinner a few
nights ago: take the C data structure:
struct node {
    struct node *left, *right;
    DataType d;
};
What is this? The answer is, you don't know.
It might be a doubly linked list. It might also be a binary tree. It
could be any of a number of other things. Without knowing more,
specifically the invariants that apply to left and right, you can't
intuit anything. Jim was talking about this
in the context of making gdb more useful, but the same point applies to
attempting to serialize data structures.
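Jim's point restates neatly in executable form (a hypothetical sketch, not his example): the same node shape serves two different structures, and only the invariants, which live outside the declaration, tell them apart.

```python
class Node:
    def __init__(self, d, left=None, right=None):
        self.d, self.left, self.right = d, left, right

# Used as a doubly linked list: left/right are prev/next.
a, b = Node(1), Node(2)
a.right, b.left = b, a
assert b.left.right is b        # the list invariant holds

# Used as a binary tree: left/right are children, never back-pointers.
root = Node(1, Node(2), Node(3))
assert root.left.left is None   # the tree invariant: leaves end the structure
```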
In any case, it's clear that XML does nothing to solve this problem;
all
it does is create a new (and, IMO, excessively cumbersome) language to
do serialization in. I think the real point of rampant XMLification is
to increase bandwidth demands. :)
Why XML?, posted 13 Mar 2000 at 21:03 UTC by Ankh »
(Master)
You're right, Graydon, that the world has changed since we first started working on XML.
Originally, the project was SGML on the Web, and was intended to allow people with
large structured documents to share them. Examples of structured documents include
150,000 pages of computer manuals at Novell, or the Solaris documentation; library
records about books (currently using a number of incompatibly different variations of MARC);
encyclopaedia entries; dictionaries; metadata (such as the Dublin Core interoperability set),
and even music.
Many of the people involved at the start of XML already had SGML documents -- Michael Sperberg-McQueen,
then from the University of Illinois at Chicago and representing the Text Encoding Initiative, Jon Bosak,
first at Novell and later at Sun; Tim Bray, who had worked with the 500MByte Oxford English Dictionary,
amongst other projects -- and others had SGML software.
None of us predicted what would happen. In fact, Jon had a tremendous struggle at first to get people at
the World Wide Web Consortium to take any interest.
but look at it a different way...
HTML is defined in terms of SGML. Not many people know that, and still fewer understand that
every <P> tag has a corresponding </P>, whether or not the author supplies it. The programmers
here at Advogato are likely to know that parsing HTML is a pain in the ankle, mostly because authors
in practice leave off quotes, think <! starts a comment, or generally don't bother to check syntax.
Imagine using a C compiler that automatically and silently corrected errors. If you left off a },
it inserted it, usually in the right place but not always. Then imagine having to port that C code to another platform
where the C compiler corrected a different set of errors, also silently. An HTML browser is like that.
A second problem with HTML is that it has a fixed set of elements. You can't legally add a <partNo> element
of your own because if you do, it isn't HTML any more.
Newer browsers let you apply cascading style sheets (CSS) to such elements, as long as they have content; there's
no way to add your own IMG or BR elements, because there is no syntax available in HTML to say that an element is
empty -- the list is hard-wired into the browser source code.
So XML addresses two important problems with HTML: the syntax is both rigorous and extensible.
But why not s-expressions? LISP is more popular than C and COBOL and more fun than elbow sex!
XML has an important property that s-expressions lack: it allows error detection. In environments where automatic
checking of semantics isn't possible, this is very important. And for what it's worth, the XML working group did measurements
of marked-up data to see if the text in the end-tags was significant. After compression, it isn't significant. But the ability to
match <etymology><lexeme> to </lexeme><derivation><etymology> and see what's wrong is
a major benefit in document-oriented markup.
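That error-detection property is easy to see with any conforming parser; here is mismatched nesting of that kind, fed to Python's ElementTree:

```python
import xml.etree.ElementTree as ET

bad = "<etymology><lexeme>wrong nesting</etymology></lexeme>"
try:
    ET.fromstring(bad)
    detected = False
except ET.ParseError:   # the redundant end tags make the slip detectable
    detected = True
assert detected
```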
but I'm not marking up documents, I'm using XML to describe pixels on 3-D surfaces of planets!
More fool you :-) -- Graydon is right that XML isn't good for everything. It's better than a lot of things, and where
you have structured, mostly textual information, it's generally a good first bet.
The XML Schema group has been working on better constraints; if they've made it too complex, and also left out too many
features (?!), then let them know. Well-reasoned and informed messages from a couple of thousand people would certainly
be noticed.
You are right that people are using DTDs that are incompatible. You can't load molecular data into an e-commerce credit card
validator. Good thing. As implementation experience builds, perhaps more DTD design patterns will emerge. Dr Ian Graham and I
published a few design patterns on the
companion web site for our book (The XML Specification Guide, Wiley, 1999).
You have to remember that XML is to a small extent, for a human reader, self-describing, or can be: if I use an
element called partNumber, you can deduce what it is just from the data, which is difficult to do with a binary interface.
There are many alternatives for specific domains; like most standards, XML is a bunch of compromises, and ends up
being good enough to use, for most purposes within its intended domain. Use XML for configuration files instead of writing
your own parser, and you probably get a small but non-zero improvement. But the more tools that use XML, the bigger
the improvement. So let's praise XML for its strengths, and try to understand how to live with its weaknesses.
Thanks, graydon, for articulating this unease with the rush to XML-ize
everything. I'll just add my two cents.
XML is a good way to do Unicode
One thing I haven't seen mentioned here is that XML is based on Unicode.
This is a Good Thing. The handling of non-Latin1 text with pre-XML tools
is so incredibly arbitrary it's not funny. Ever gotten an email in
Russian? More often than not, the Content-Type says 'text/plain;
charset=iso-8859-1', but the text is actually in KOI-8. Broken, broken,
broken.
I don't expect everyone to do Unicode right in XML (Advogato still doesn't, for
example), but at least it says pretty clearly what the right thing is.
XML is not "as simple as possible, but no simpler"
While XML is a dramatic simplification over the large, vague, and
ambiguous SGML universe that preceded it, it still has a bunch of cruft
in it. A lot of it has to do with backwards compatibility with SGML,
which didn't end up working out that well anyway (SGML ended up needing
some changes to bring it in line with XML). When XML was being designed,
one of the goals was "a programmer should be able to do an XML parser in
a week." Well, that hasn't happened anyway.
My single biggest gripe with XML complexity is entities. My guess is
that very few people use these in any effective way. External unparsed
entities are another kettle of fish altogether. As far as I can tell,
they are a completely inferior way of doing <img src="url" />.
I'll be willing to bet that 9 out of 10 deployments of XML choke on
them, too.
DTDs represent cruft too. Most XML users will either ignore them
altogether or go to schemas. That leaves a lot of vestigial syntax and
semantics.
But, overall, I don't want to bitch too much. XML is reasonably
simple, and adding it to a project does not spell doom and disaster
because of the added complexity.
XML is not a markup language
Graydon explained this very well already, but I'll make another argument
that may resonate differently. Some people describe XML as a successor
to HTML. However, this is not true. While HTML browsers are extremely
useful tools, there is, and can be, no such thing as an XML browser[1].
This is because XML is not a markup language, as is HTML. Rather, it's a
metaset of possible markup languages, each of which requires its own
support in a browser.
[1] Ok, you can have a generic XML viewer or editor that displays the
structure of the document with tags exposed, but that isn't what I was
talking about.
Suburban sprawl: greater metropolitan XML
XML seems to induce a strong temptation to drink the rest of the
Kool-Aid. While XML itself is a reasonably simple thing, by the time you
add XML namespaces, XPath, XLink, XPointer, schemas, RDF, DOM, CSS, and
XSLT, you've got a scary beast[2].
In general, the amount of payoff you get from adding these additional
layers dwindles as the complexity shoots up. This is, I think,
particularly true if you're using XML for things other than textual
documents.
[2] So it's a mixed metaphor. They're fun metaphors, don't you think?
Maybe the beast drinks the jug of Kool-Aid. In LA.
Tools, tools, tools
Support for XML in tools is getting very good. In fact, in just about
any modern programming environment, there's an XML parser handy, and you
can traverse and manipulate the structure just as easily as car and cdr
in your lisp days.
This is probably the single best reason to use XML. Just as Perl regexps
are a powerful tool for fiddling with plain text files, the suite of XML
tools becoming available makes XML relatively fun and easy to deal with,
even in cases where a custom binary file format might seem simpler and
more direct.
One of the constraints we had with XML was that every valid XML document be a valid SGML document. For this to be
sensible, you have to think of a world in which there are terabytes of SGML (as there still are), with no way of
interchanging that data over the web except with SoftQuad Panorama, and even then only a subset.
Very late in the XML process, the ISO SGML WG agreed to make some changes to accommodate XML.
These changes were a very great help, but several earlier decisions of the XML group were not reconsidered, and
should have been, I think. If anyone is really interested, email me; it's basically history now. There are efforts (such as
SML) to develop a cleaner subset of XML, but they're a little hampered, in turn, by their requirement that every
SML document be well-formed XML.
In The XML Specification Guide, I basically tell people not to use unparsed external entities, or, if they do,
to use a MIME content media type (or list thereof) as the identifier for the notation. A few SGML die-hards implemented
an XML version of the SGML Catalog, for mapping PUBLIC identifiers, but use of it is actually forbidden by
the XML spec, so that also has to be ignored.
In practice, a few of the wacky features have survived into common use (CDATA sections and processing instructions
are two that come to mind) but not too many. And you know, C became popular despite the GECOS Constant feature
and the "entry" keyword, and despite the way i = -3; used to decrement i by 3, not set i to -3, in V6 Unix.
Most standards are about politics, about getting people to agree on a compromise. I think XML is an OK compromise.
I'm less happy with DOM, and a lot unhappier with SVG -- which was brought up on the xml-dev list recently, and I sent
a fairly strongly worded message that seems to have stopped discussion altogether :-) . No, I don't want to
use DOM to access windowing system events and spline points in (unhinted) fonts. I'd rather have PostScript.
Yes, the world is rushing to XML, and on the whole I think it's good, even if some groups seem to be building
edifices that make a merger between C++, Ada, ICMP and HTTP look minimalist.
So let's be glad XML is OK, and try and help people use it more effectively. Do we need an XML Traps and Pitfalls
book, like Andrew Koenig's for C?
Hmm, there are bugs in Advogato's html parser, doesn't like </p> very much. Another argument for XML
:-)
-- barefoot in Toronto
XML Everywhere!, posted 16 Mar 2000 at 02:24 UTC by dwaite »
(Journeyer)
While I agree that XML is being touted as the solution for every
problem known to man, we all know what these people are trying to sell
us.
However, the point of XML is that it is an eXtensible Markup
Language.
It does not validate the meaning any more than a spellchecker or
grammar checker would validate the phrase "The cheese is flying over
Egypt." Correct grammar, correct spelling. Means nothing. What you can
do is validate that it does have correct form, which is the point of a
DTD.
There are several types of data that should never be XMLized.
Video,
audio, images and other binary data first. Second would be that
Spaceship markup language - that is valid XML, but it is definitely not
done in the spirit of human readability.
In other words, there is no miracle cure. But it really sounds like
you
are talking bad about XML because it _isn't_ the cure, which isn't
quite fair. It was never meant to hold logic to validate that a set of
data contained within made sense, just that the actual format in which
the data was received is valid.
Entity's text tag, posted 16 Mar 2000 at 06:41 UTC by mwimer »
(Journeyer)
I get the feeling that some people don't follow graydon's drift in this
article... I have an example that I think should clarify the issues
presented in the article. Entity (a project listed here at advogato)
uses XML to lay out GUIs and control user interaction. It does a very
good job of laying out the window for the user and feels quite a bit
like writing a web site in html, using only slightly different tags.
XML is probably the best tool to date for this kind of work. Let me go
out on a limb and say that graydon would agree that XML is the
tool for this job.
But there is an issue with the <text> tag. The text tag needs
to have markup of the data inside the tag. Example:
<object>
  <window>
    <text>
      I am text that needs to be marked up! :(
    </text>
  </window>
</object>
Sure we could invent all sorts of new tags to define the text inside the
tag but we would pollute the tag namespace of Entity. At this point
XML as a markup language breaks down and we would only create more
work for ourselves and users by implementing a new text markup system.
The article mentions binary data and other human-unreadable content;
here we can see that XML is at times inappropriate even for human-readable data.
As a solution I would like to see our text tag accept html, tex, pod,
and .... These are better tools to mark text up, so we should probably
use them and not force XML to do work that is better left to other
tools.
Hopefully I cleared up some misunderstanding about the article
and drew a concise picture of the issues surrounding the topic. And maybe
sparked a few people's curiosity about Entity, enough to have them
download a newish copy and give it a whirl.
Consider mwimer's example,
<object> <window> <text> I am text that needs to be marked up! :( </text> </window>
</object>
Markup within the text element could be approached in three basic ways:
- Extending the schema (or DTD)
- Using CDATA sections
- Using XML Namespaces
Of these, the first does indeed generate a cluttered namespace, since XML only has global
declarations. If you allow <w> here, then <w> is visible to the declaration of all elements in the DTD
or Schema, and has the same allowed content everywhere, which is pretty poor.
The second is spectacularly ugly, but works:
<text>
<![CDATA[ I am <NOUN>text</NOUN> that needs to be marked up! ]]></text>
Well, I did say it was ugly.
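Ugly, but a parser does the right thing with it; the bracketed tags come back as plain character data, as this ElementTree sketch shows:

```python
import xml.etree.ElementTree as ET

doc = ("<text><![CDATA[ I am <NOUN>text</NOUN> "
       "that needs to be marked up! ]]></text>")
elem = ET.fromstring(doc)
assert "<NOUN>text</NOUN>" in elem.text  # literal text, not markup
assert len(elem) == 0                    # <text> has no child elements
```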
The third also introduces some syntax, but this time it's syntax that adds meaning, so I resent it less:
<text xmlns="http://www.w3.org/1999/xhtml">
I am <i>XHTML</i> text now!
</text>
An XML 1.0 processor that is not namespace-aware will choke on this unless you do some
fancy DTD work, declaring all the HTML elements you want to use. But the newer XML parsers are
namespace-aware.
You can also mix namespaces, to use both MathML and HTML, for example:
<text
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:maths="http://www.w3.org/1998/Math/MathML"
>
I am <html:i>XHTML</html:i> text with <maths:xxx>stuff</maths:xxx>!!
</text>
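A namespace-aware parser really does keep the two vocabularies apart; with Python's ElementTree, for instance, each tag is expanded to carry its namespace URI, so html:i and maths:xxx can never collide:

```python
import xml.etree.ElementTree as ET

doc = """<text
  xmlns:html="http://www.w3.org/1999/xhtml"
  xmlns:maths="http://www.w3.org/1998/Math/MathML">
I am <html:i>XHTML</html:i> text with <maths:xxx>stuff</maths:xxx>!!
</text>"""

root = ET.fromstring(doc)
tags = [child.tag for child in root]
assert tags == ["{http://www.w3.org/1999/xhtml}i",
                "{http://www.w3.org/1998/Math/MathML}xxx"]
```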
Did that address your problem?
Ankh,
This is quite interesting. I didn't realize this sort of markup
was possible. But it still doesn't address non-XML-based markups like
latex or ps. I would personally like to see something like:
<object>
<window>
<text markup="c">
#include <stdio.h>
int main (int argc, char** argv)
{
printf ("hello world");
}
</text>
</window>
</object>
non-xml markup, posted 17 Mar 2000 at 08:02 UTC by Ankh »
(Master)
One of the big goals of XML was addressing the plight of the
"desperate perl hacker" -- suppose you've got 150,000 pages of
documentation, and part number
1998 has been changed to 2041, so you do a search and replace. You can
do that
reliably in Perl with a simple regexp, because of the 'well-formedness'
rules.
CDATA sections broke that somewhat, sadly, but it's still much better
in
XML than it was in full SGML, where <i/this/ is a legal shortcut for
<i>this</i> and there are complex rules about when you can
omit end tags.
You can't include arbitrary text directly in XML, though.
<stdio.h> is
going to be a problem outside a CDATA section.
It's interesting that people often want to do this. C programmers
don't
want to say char *data = [[ stuff with " and ' and \ and [ and ] in it
]];, and
LISPers don't ask for (literal any " thing ) you ( like here); I think
the reason is
that XML seems to promise more, with the longer end tag.
It's best to get used to escaping stuff, because there is always
something you
need to escape, whether it's ) or " in LISP/Scheme, or </socks> in
XML.
Entity looks really really neat, btw.
Hmm, I didn't even notice my inclusion of <stdio.h> as part
of the example. Now that I think about it,
there doesn't seem to be any technical reason the
xml parser and the c grammar can't work together.
So when the c grammar/parser thinks it's done, it
can pass execution back to the xml parser. Sure, I don't think
I'll be implementing this feature anytime soon, but
it looks like it's a solution to the issues of
escaping in xml.
On topic; at work we are storing n by m matrices of pure
mathematical data points in an sql db. To me it seems
silly to put these matrices into the db a cell at a time,
wasting space and munging our effectively binary data
into bloated tables with only:
[matrixKey, mKey, nKey, nmCellVal]
as the table data.
This same data can be laid out with xml much more effectively,
and will still be quite a bit more bloated than using a gdbm, or just
an indexed flatfile.
Example:
<matrix name="one" dim="2x2">
<cell val=".3"/>
<cell val=".4"/>
<cell val=".5"/>
<cell val=".6"/>
</matrix>
This looks like a flat file db with extra formatting, and
you still have to rearrange the layout so that it becomes
a matrix in memory. Basically graydon's message is well
taken: don't munge your data into xml just because
everyone else is doing it. It's best to leave your data in
machine readable form so the machine can pump out
information from the data, and put your information
into xml or some other markup language.
I had some very strong anti-XML views until I started using it. Now I'm
uneasy about it in a more abstract way. I still stick to the view that
if you're going to do something you do it properly, and use SGML as God
intended. I'm also extremely uneasy about all these XML manipulation
languages that are partially or fully specified in XML themselves; I
have a worry they are about to disappear up their own arses.
On the other hand, I'm currently working on a package management system
for CTAN (TeX) packages, and the existing CTAN catalogue is written in
XML - without a DTD. Working with it is fairly easy, although only as easy as
working in SGML. The advantage of both cases over, say, plain text is
that all the hard work is done for me - I don't have to write myself a
parser for whatever breed of markup I would use to structure plain text.
Sure, for a lot of cases, you don't need the structure; a set of
data points has no implicit structure, so it's just as sensible to say
10:20:30:Something
as:
<coordinatesystem>
<x>10</x>
<y>20</y>
<z>30</z>
<data>Something</data>
</coordinatesystem>
and in that case, it's easier and faster to parse the colon
separated
version.
I'd say XML has a place in structuring primarily textual data which is
designed to be read and written by computers but can be understandable
to humans. However, I'd also say that in most cases, you either don't
need the structure, or you should write a proper SGML DTD. It's not as
if it's difficult, once you've done it once...
simon's SGML, posted 19 Mar 2000 at 04:26 UTC by Ankh »
(Master)
I was going to reply by email, but then I decided there was an
implication
that should be corrected.
There was a widespread rumour in the SGML community at one point
that XML was in some way less "rigorous" than SGML, and did not support
the notion of a DTD.
In fact, XML is more mathematically rigorous, and
does support validation by a DTD.
The main differences between SGML and XML are as follows.
1. SGML has a 200-page specification that you have to buy.
2. The SGML specification is written in the language of 1960s
computer documentation, and edited by a lawyer. SGML gurus (including me)
still don't completely agree on how to implement all of SGML's features.
As a result,
there are a number of serious interoperability problems.
3. XML is slightly less expressive than SGML: in particular,
XML lacks inclusions, exclusions and and-groups.
Unfortunately,
these three features are difficult to implement, and interact in
spectacularly unpleasant ways with SGML's arcane whitespace rules, so we
removed them from
XML for a reason.
If there is another reason, besides ignorance :-), prejudice
or being stuck with
old tools, for preferring SGML over XML, I'm interested to hear it.
Probably not many others are, so email me, liam at holoweb.net, if you
prefer.
Someone complained that XML does not provide semantics. Well, good. Because providing semantics is by no means a small
feat. There is intention, which KQML handles, and knowledge, handled by KIF, which requires Ontologies in much the same
way XML uses DTDs. Semantics is not simple.
Interested parties can check The Logic Group home page.
IMHO, the great advantage of XML is that it blurs the distinction
between text and data. This is very hard to do in other serialization
syntaxes. (With the exception of SGML, of course.)
I don't think that XML is needed if all you want to express is a
triple of numbers, say. But XML really shines if you have data which
contains textual parts as well as `fact-like' parts. Such as the
abstract of an article in a library system. Or the product
description in a product catalog.
Consider an arts information system which describes works of art as
well as artists. An artist has a vita. Suppose we also want to say
which places an artist has visited, or which other people they have
met. The fulltext markup that's possible with XML is ideal for this:
you just provide a `visited-place' element which is used in the
natural language text of the vita, and, lo! you can search for all
artists who have visited Paris.
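That search is a one-liner once the markup is in place; a small sketch (the element names here are invented for the example):

```python
import xml.etree.ElementTree as ET

vita = """<artists>
  <artist><name>A</name>
    <vita>Studied in <visited-place>Paris</visited-place> for a year.</vita>
  </artist>
  <artist><name>B</name>
    <vita>Never left <visited-place>Vienna</visited-place>.</vita>
  </artist>
</artists>"""

root = ET.fromstring(vita)
paris = [a.findtext("name") for a in root.iter("artist")
         if any(p.text == "Paris" for p in a.iter("visited-place"))]
assert paris == ["A"]  # markup inside running text, queried as data
```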
Adding a list of places visited to the data model is not
necessarily the right solution, because quite probably the amount of
information available about visited places is so different between
different artists that this does not make sense. But in the fulltext,
the markup can conveniently be used.
That's my view of where XML shines.