Markup Abuse: some comments on the XML panacea

Posted 12 Mar 2000 at 05:24 UTC by graydon

XML is presently hailed as a universal encoding format, yet the "extensions" (DTDs) of XML have nothing in common with one another. This article addresses some existing DTDs, discusses the psychology of text-encoded standards, and makes a case for standards bodies returning to the task of agreeing on minimal, reasonable features and practical solutions rather than waving their hands and pointing at a DTD.

The article is posted here; it is a little long to re-post in its entirety.


For now, use S-expressions, posted 12 Mar 2000 at 12:19 UTC by nether » (Journeyer)

I agree. A DTD is only good for checking the low-level syntax of a document, and you almost invariably need higher-level tools (some of which W3C tries to standardize) to check the correctness (not to mention the meaning) of the document. Thus the DTD alone is almost no use, and since it's rather tedious to write, you could just do without it and move all the checks to the semantic analysis.

The only benefit from vanilla XML that I can see is that it is a standardized syntax for representing general structural data, and you can find prewritten tools for reading and writing it. This is, to my mind, not a very great feat. Parsers are relatively easy to write anyway.

Frankly, I think XML should have had a much more semantically-oriented approach in the first place. Now things like XML Schema are just optional sugar that came as an afterthought, and horribly complex to my mind.

I can only hope that some day we will get a relatively simple markup language that attempts from the start to model the semantics of documents, not just their external structure. Programming languages have wonderfully advanced type systems these days (real programming languages, that is) that can express the "meaning" of data at quite a high level. It would be wonderful to see such techniques used for pure data structure specification as well. Once the semantic model were clear, the actual textual representation for such structures would be trivial to define.

In the meanwhile, if you just want a simple format for structured data, use S-expressions, like I do. Any lisp or scheme interpreter can read or write them (though there are some dialect differences), and they are much more concise than XML with its mandatory start and end tags and other standardized cruft. S-expressions are even a bit more expressive than XML: A number is a number, not a string of numeric characters. This is not really a new idea, see LAML for a practical example of this approach. I'm working on something similar.
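To make the weight difference concrete, here is one possible encoding of the same small record both ways (the element names and values are invented for the example):

<person id="42">
  <name>Ada Lovelace</name>
  <born>1815</born>
</person>

(person (id 42)
  (name "Ada Lovelace")
  (born 1815))

In the S-expression, 42 and 1815 read back directly as numbers; in the XML they are character data until some application code decides otherwise.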

Sexpy MF, posted 12 Mar 2000 at 16:37 UTC by dan » (Master)

It's all true. XML is no more a solution to communicating

Interesting pointer to LAML. From my reading they actually define functions with the same name as each markup element, though: I'm wondering why they don't just keep it as data. As long as you have a tree, you can process it with any tree-processing functions you like.

Plug time: this is what Araneida does. Look at the counter-handler function in http://araneida.telent.net/examples/main.lisp. For non-Lispers, the features that need explaining are

  • html is a function that converts sexpr notation into angle brackets:
    (html `(p "paragraph with " (b "bold") " word"))
    => "<P>paragraph with <B>bold</B> word</P>"
  • ` and , are used for generating "semi-literal" data: ` introduces literal data, and , escapes out of it so that computed data can be inserted. Thus,
    (html `(p "1+1=" ,(+ 1 1))) => "<P>1+1=2</P>"

Apologies for the formatting. advogato won't let me have <pre> and <tt> doesn't actually show up any different from normal text. Lisp was not meant for proportional fonts.

"... no more a solution to communicating", posted 12 Mar 2000 at 16:39 UTC by dan » (Master)

... between heterogeneous systems than a JVM with no standard classes is to running code on them.

I did preview! Honest!

XML makes sense for most applications, posted 12 Mar 2000 at 19:24 UTC by aaronl » (Master)

I've only started using XML recently, and I have to say that it is far more than a fad. XML gives you a standard way to write your own format - an angle bracket opens a tag, and so on... everyone understands the basic syntax if they know XML or even HTML. Another convenience provided by XML is that you don't have to write, test, and publish your own parser. libxml (gnome-xml) is simple and sweet. Using XML means that if you want to import multiple file formats you don't need different parsers, as you would if these were binary formats.
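To give a rough sense of how little of the parsing work lands on the application, here is a minimal sketch, assuming the libxml2 flavour of the API (the older gnome-xml calls differ slightly); the file name and the child walk are invented for the example, and error handling is mostly omitted:

#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

int main(void)
{
    xmlDocPtr doc;
    xmlNodePtr root, cur;

    doc = xmlParseFile("document.xml");   /* parse the whole file into a tree */
    if (doc == NULL)
        return 1;                         /* not well-formed, or unreadable */

    root = xmlDocGetRootElement(doc);
    if (root != NULL)
        for (cur = root->children; cur != NULL; cur = cur->next)
            if (cur->type == XML_ELEMENT_NODE)
                printf("child element: %s\n", (const char *) cur->name);

    xmlFreeDoc(doc);
    return 0;
}

Everything format-specific - what those elements actually mean - is still yours to write, but the tokenizing and tree-building come for free.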

In the free software world, there is a lot of participation from other people who did not necessarily design the file format of your application. Sure, documenting your file format is a good thing to do, but is everyone going to do it? XML does not eliminate the need for file format documentation, but it makes it pretty darn obvious. I'm not going to paste some sample XML code from my application because I'm afraid your browsers will ignore it as undefined HTML tags and I'm too lazy to get it to display right in HTML, but I doubt that anyone who understood the function of the application would have any questions after looking at a sample document.

I can think of numerous other reasons why XML is superior to binary file formats unless you really need to pack data in tightly, but my favorite is the HTML compatibility. IIRC you can take an AbiWord XML file (not exporting to HTML - taking a native AbiWord XML file) and view it in a web browser and it will be readable, although not all of the formatting will be preserved. Try THAT with a MSWord binary file!

missed point, posted 12 Mar 2000 at 19:46 UTC by graydon » (Master)

I'm not debating that XML "makes sense" as document markup (it does), but rather that in cases where you're storing semantically rich data with a lot of weird structure, being able to "view it in a web browser" is completely unimportant. Take one of the earliest uses: CML, the Chemical Markup Language. Unless you have a plugin for your browser which can model bond lengths and dihedral angles (nope, it's not even in CSS3), viewing it in a web browser shows you a big meaningless pile of numbers. The document has no meaningful use-case in any "standard" XML tool, only in specialized CML tools. Therefore CML is no better than any other open encoding of dihedral angles and bond lengths, and is potentially worse, because it suggests to people, as you are saying, that somehow all the reading, writing and interpreting software already exists. The gnome-xml parser isn't going to write any CML manipulation software for you; it does nothing but basic tokenizing and parsing into a tree. We've had both textual (sexps) and binary (ANDF, HDF) ways of doing this for a while.

Semantics == types, posted 12 Mar 2000 at 21:03 UTC by nether » (Journeyer)

Hmm. HDF seems interesting. Never knew of it before. Rather low-level, though.

What all of us seem to want is a standard format for defining the semantics of data, while XML only provides syntax.

In my mind, semantics equates to types. Indeed, when we use (say) simple C structs and arrays and pointers, we can already define quite complex data structures, and definitely more conveniently than with DTDs. It appears to me that what we need is a data-centric type system. The problem with the type systems of programming languages is that with lower-level languages they are oriented more towards specifying representation than meaning. And of course there's non-data stuff like functions and objects and abstract data types in there too.

I once had the idea that one could use (the non-OO subset of) CORBA IDL for defining data structures language-independently. CORBA even defines a standard binary format (CDR) for transmitting this data. What you can transfer you can also store, so this would be one approach for standardizing a data format. However, I think that IDL is still rather limited. IIRC, you can't do non-tree graphs very straightforwardly.
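As a sketch of what that non-OO subset might look like (the module and type names are invented for the example), a language-neutral data definition in IDL could be:

module Catalog {
    struct Author {
        string name;
        short  born;
    };

    typedef sequence<long> YearList;

    struct Book {
        string   title;
        Author   author;
        YearList editions;   // one entry per printing
    };
};

The definition says nothing about representation; the standard language mappings turn it into structs in C, classes in Java, and so on, and CDR gives a wire format. But, as noted above, a Book that needed to refer to an arbitrary other Book would have to do so through an explicit key rather than a real reference.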

So perhaps we really need a nice language-independent high-level type system for pure data (no code or objects). We could then have multiple written formats for this data, say, xml, sexps and binary, as well as compilers to convert the data definitions into types in different programming languages. With proper tools, all you'd need to do is write a definition for the data, and you could get an autogenerated parser that converted the data into properly typed objects in your programming language.

A while ago I entertained the idea of starting such a project, but still haven't done so. Perhaps there really would be a need for it...

re: Semantics == Types, posted 12 Mar 2000 at 23:01 UTC by fatjim » (Journeyer)

Why? The whole point is that, no matter how you specify the layout of the data, that data is still meaningless until interpreted. Unless your "Typed Document Definition" is actually a complete library for operating on the data, there is no sense in defining it. You would gain nothing above what starting from scratch would give you, and it would blind you into designing a format out of touch with the data it actually carries.

Why standardized data formats?, posted 13 Mar 2000 at 00:58 UTC by kelly » (Master)

The advantage of having relatively standardized data formats is so that when a new data format is needed, less effort is required to write loaders and savers for the format, both for the original implementer and for people doing ports.

Several languages already support some concept of serialization (e.g. Java's Serializable interface; Python's pickles; Perl's Storable module), but there is no standardization between these methodologies, and in any case they're based on the representation of the data, which is likely to be dependent on the implementation language (e.g. explicit reference in Perl versus implicit reference in Python or Java).

This is somewhat related to a point Jim Blandy raised at dinner a few nights ago: take the C data structure:

struct node {
    struct node *left, *right;
    DataType d;
};
What is this? The answer is, you don't know. It might be a doubly linked list. It might also be a binary tree. It could be any of a number of other things. Without knowing more, specifically the invariants that apply to left and right, you can't intuit anything. Jim was talking about this in the context of making gdb more useful, but the same point applies to attempting to serialize data structures.
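For instance, here is a sketch of two sets of invariants over that identical layout (DataType is narrowed to int, and the function names are invented for the example). A serializer that only sees the struct cannot tell which set of rules the pointers obey:

/* same layout as above, DataType narrowed to int for the sketch */
struct node {
    struct node *left, *right;
    int d;
};

/* Invariant A: doubly linked list -- left is "previous", right is "next". */
void list_append(struct node *tail, struct node *n)
{
    tail->right = n;
    n->left = tail;
    n->right = NULL;
}

/* Invariant B: binary search tree -- smaller keys go left, larger go right. */
struct node *tree_insert(struct node *root, struct node *n)
{
    if (root == NULL) {
        n->left = n->right = NULL;
        return n;
    }
    if (n->d < root->d)
        root->left = tree_insert(root->left, n);
    else
        root->right = tree_insert(root->right, n);
    return root;
}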

In any case, it's clear that XML does nothing to solve this problem; all it does is create a new (and, IMO, excessively cumbersome) language to do serialization in. I think the real point of rampant XMLification is to increase bandwidth demands. :)

Why XML?, posted 13 Mar 2000 at 21:03 UTC by Ankh » (Master)

You're right, Graydon, that the world has changed since we first started working on XML.

Originally, the project was SGML on the Web, and was intended to allow people with large structured documents to share them. Examples of structured documents include 150,000 pages of computer manuals at Novell, for example, or the Solaris documentation; library records about books (currently using a number of incompatibly different variations of MARC); encyclopaedia entries; dictionaries; metadata (such as the Dublin Core interoperability set), and even music.

Many of the people involved at the start of XML already had SGML documents -- Michael Sperberg-McQueen, then from the University of Illinois at Chicago and representing the Text Encoding Initiative, Jon Bosak, first at Novell and later at Sun; Tim Bray, who had worked with the 500MByte Oxford English Dictionary, amongst other projects -- and others had SGML software.

None of us predicted what would happen. In fact, Jon had a tremendous struggle at first to get people at the World Wide Web Consortium to take any interest.

but look at it a different way. . .
HTML is defined in terms of SGML. Not many people know that, and still fewer understand that every <P> tag has a corresponding </P>, whether or not the author supplies it. The programmers here at Advogato are likely to know that parsing HTML is a pain in the ankle, mostly because authors in practice leave off quotes, think <! starts a comment, or generally don't bother to check syntax.

Imagine using a C compiler that automatically and silently corrected errors. If you left off a }, it inserted it, usually in the right place but not always. Then imagine having to port that C code to another platform where the C compiler corrected a different set of errors, also silently. An HTML browser is like that.

A second problem with HTML is that it has a fixed set of elements. You can't legally add a <partNo> element of your own because if you do, it isn't HTML any more.

Newer browsers let you apply cascading style sheets (CSS) to such elements, as long as they have content; there's no way to add your own IMG or BR elements, because there is no syntax available in HTML to say that an element is empty -- the list is hard-wired into the browser source code.

So XML addresses two important problems with HTML: the syntax is both rigorous and extensible.

But why not s-expressions? LISP is more popular than C and COBOL and more fun than elbow sex!
XML has an important property that s-expressions lack: it allows error detection. In environments where automatic checking of semantics isn't possible, this is very important. And for what it's worth, the XML working group did measurements of marked-up data to see if the text in the end-tags was significant. After compression, it isn't significant. But the ability to match <etymology><lexeme> to </lexeme><derivation><etymology> and see what's wrong is a major benefit in document-oriented markup.

but I'm not marking up documents, I'm using XML to describe pixels on 3-D surfaces of planets!
More fool you :-) -- Graydon is right that XML isn't good for everything. It's better than a lot of things, and where you have structured, mostly textual information, it's generally a good first bet.

The XML Schema group has been working on better constraints; if they've made it too complex, and also left out too many features (?!), then let them know. Well-reasoned and informed messages from a couple of thousand people would certainly be noticed.

You are right that people are using DTDs that are incompatible. You can't load molecular data into an e-commerce credit card validator. Good thing. As implementation experience builds, perhaps more DTD design patterns will emerge. Dr Ian Graham and I published a few design patterns on the companion web site for our book (The XML Specification Guide, Wiley, 1999).

You have to remember that XML is, to a small extent, for a human reader, self-describing, or can be: if I use an element called partNumber, you can deduce what it is just from the data, which is difficult to do with a binary interface.

There are many alternatives for specific domains; like most standards, XML is a bunch of compromises, and ends up being good enough to use, for most purposes within its intended domain. Use XML for configuration files instead of writing your own parser, and you probably get a small but non-zero improvement. But the more tools that use XML, the bigger the improvement. So let's praise XML for its strengths, and try to understand how to live with its weaknesses.

-- Ankh / Liam Quin

XML strengths and weaknesses, posted 13 Mar 2000 at 22:41 UTC by raph » (Master)

Thanks, graydon, for articulating this unease with the rush to XML-ize everything. I'll just add my two cents.

XML is a good way to do Unicode

One thing I haven't seen mentioned here is that XML is based on Unicode. This is a Good Thing. The handling of non-Latin1 text with pre-XML tools is so incredibly arbitrary it's not funny. Ever gotten an email in Russian? More often than not, the Content-Type says 'text/plain; charset=iso-8859-1', but the text is actually in KOI-8. Broken, broken, broken.

I don't expect everyone to do Unicode right in XML (Advogato still doesn't, for example), but at least it says pretty clearly what the right thing is.

XML is not "as simple as possible, but no simpler"

While XML is a dramatic simplification over the large, vague, and ambiguous SGML universe that preceded it, it still has a bunch of cruft in it. A lot of it has to do with backwards compatibility with SGML, which didn't end up working out that well anyway (SGML ended up needing some changes to bring it in line with XML). When XML was being designed, one of the goals was "a programmer should be able to do an XML parser in a week." Well, that hasn't happened anyway.

My single biggest gripe with XML complexity is entities. My guess is that very few people use these in any effective way. External unparsed entities are another kettle of fish altogether. As far as I can tell, they are a completely inferior way of doing <img src="url" />. I'd be willing to bet that 9 out of 10 deployments of XML choke on them, too.
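For anyone who has never met one, an unparsed external entity looks roughly like this (the names are invented for the example): the document never carries the image data, only an attribute whose declared type is ENTITY, plus a NOTATION telling the application what kind of data it is.

<!DOCTYPE report [
  <!NOTATION png SYSTEM "image/png">
  <!ENTITY photo SYSTEM "photo.png" NDATA png>
  <!ELEMENT report (figure)>
  <!ELEMENT figure EMPTY>
  <!ATTLIST figure img ENTITY #REQUIRED>
]>
<report>
  <figure img="photo"/>
</report>

Compared with <img src="photo.png" />, the application still has to fetch and interpret the bytes itself; the indirection through the DTD just adds machinery.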

DTDs represent cruft too. Most XML users will either ignore them altogether or go to schemas. That leaves a lot of vestigial syntax and semantics.

But, overall, I don't want to bitch too much. XML is reasonably simple, and adding it to a project does not spell doom and disaster because of the added complexity.

XML is not a markup language

Graydon explained this very well already, but I'll make another argument that may resonate differently. Some people describe XML as a successor to HTML. However, this is not true. While HTML browsers are extremely useful tools, there is, and can be, no such thing as an XML browser[1]. This is because XML is not a markup language the way HTML is. Rather, it's a metaset of possible markup languages, each of which requires its own support in a browser.

[1] Ok, you can have a generic XML viewer or editor that displays the structure of the document with tags exposed, but that isn't what I was talking about.

Suburban sprawl: greater metropolitan XML

XML seems to induce a strong temptation to drink the rest of the Kool-Aid. While XML itself is a reasonably simple thing, by the time you add XML namespaces, XPath, XLink, XPointer, schemas, RDF, DOM, CSS, and XSLT, you've got a scary beast[2].

In general, the amount of payoff you get from adding these additional layers dwindles as the complexity shoots up. This is, I think, particularly true if you're using XML for things other than textual documents.

[2] So it's a mixed metaphor. They're fun metaphors, though, don't you think? Maybe the beast drinks the jug of Kool-Aid. In LA.

Tools, tools, tools

Support for XML in tools is getting very good. In fact, in just about any modern programming environment, there's an XML parser handy, and you can traverse and manipulate the structure just as easily as car and cdr in your lisp days.

This is probably the single best reason to use XML. Just as Perl regexps are a powerful tool for fiddling with plain text files, the suite of XML tools becoming available makes XML relatively fun and easy to deal with, even in cases where a custom binary file format might seem simpler and more direct.

Not as simple as possible, posted 14 Mar 2000 at 01:15 UTC by Ankh » (Master)

One of the constraints we had with XML was that every valid XML document be a valid SGML document. For this to be sensible, you have to think of a world in which there are terabytes of SGML (as there still are), with no way of interchanging that data over the web except with SoftQuad Panorama, and even then only a subset.

Very late in the XML process, the ISO SGML WG agreed to make some changes to accommodate XML. These changes were a very great help, but several earlier decisions of the XML group were not reconsidered, and should have been, I think. If anyone is really interested, email me; it's basically history now. There are efforts (such as SML) to develop a cleaner subset of XML, but they're a little hampered, in turn, by their requirement that every SML document be well-formed XML.

In The XML Specification Guide, I basically tell people not to use unparsed external entities, or, if they do, to use a MIME content media type (or list thereof) as the identifier for the notation. A few SGML die-hards implemented an XML version of the SGML Catalog, for mapping PUBLIC identifiers, but use of it is actually forbidden by the XML spec, so that also has to be ignored.

In practice, a few of the wacky features have survived into common use (CDATA sections and processing instructions are two that come to mind) but not too many. And you know, C became popular despite the GECOS Constant feature and the "entry" keyword, and despite the way i = -3; used to decrement i by 3, not set i to -3, in V6 Unix.

Most standards are about politics, about getting people to agree on a compromise. I think XML is an OK compromise.

I'm less happy with DOM, and a lot unhappier with SVG -- which was brought up on the xml-dev list recently, and I sent a fairly strongly worded message that seems to have stopped discussion altogether :-) . No, I don't want to use DOM to access windowing system events and spline points in (unhinted) fonts. I'd rather have PostScript. Yes, the world is rushing to XML, and on the whole I think it's good, even if some groups seem to be building edifices that make a merger between C++, Ada, ICMP and HTTP look minimalist.

So let's be glad XML is OK, and try and help people use it more effectively. Do we need an XML Traps and Pitfalls book, like Andrew Koenig's for C?

Hmm, there are bugs in Advogato's html parser, doesn't like </p> very much. Another argument for XML :-)

-- barefoot in Toronto

XML Everywhere!, posted 16 Mar 2000 at 02:24 UTC by dwaite » (Journeyer)

While I agree that XML is being touted as the solution for every problem known to man, we all know what these people are trying to sell us.

However, the point of XML is that it is an eXtensible Markup Language. It does not validate the meaning any more than a spellchecker or grammar checker would validate the phrase "The cheese is flying over Egypt." Correct grammar, correct spelling. Means nothing. What you can do is validate that it does have correct form, which is the point of a DTD.

There are several types of data that should never be XMLized. Video, audio, images and other binary data first. Second would be that Spaceship markup language - that is valid XML, but it is definitely not done in the spirit of human readability.

In other words, there is no miracle cure. But it really sounds like you are talking bad about XML because it _isn't_ the cure, which isn't quite fair. It was never meant to hold logic to validate that a set of data contained within made sense, just that the actual format in which the data was received is valid.

Entity's text tag, posted 16 Mar 2000 at 06:41 UTC by mwimer » (Journeyer)

I get the feeling that some people don't follow graydon's drift for this article... I have an example that i think should clarify the issues presented in the article. Entity (a project listed here at advogato) uses XML to lay out GUIs and control user interaction. It does a very good job of laying out the window for the user and feels quite a bit like writing a web site in html, using only slightly different tags. XML is probably the best tool to date for this kind of work. Let me go out on a limb and say that graydon would agree that XML is the tool for this job.

But there is an issue with the <text> tag. The text tag needs to have markup of the data inside the tag. Example:

<object> <window> <text> I am text that needs to be marked up! :( </text> </window> </object>

Sure we could invent all sorts of new tags to define the text inside the tag, but we would pollute the tag namespace of Entity. At this point XML as a markup language breaks down, and we would only create more work for ourselves and users by implementing a new text markup system. The article mentions binary data and other human-unreadable content; here we can see that XML is at times inappropriate even for human-readable data.

As a solution i would like to see our text tag accept html, tex, pod, and .... These are better tools to mark text up so we should probably use them and not force XML to do work that is better left to other tools.

Hopefully i cleared up some misunderstanding about the article and drew a concise picture of the issues surrounding the topic. And, maybe sparked a few people's curiosity about Entity, enough to have them download a newish copy and give it a whirl.

Entity, Markup and Namespaces, posted 16 Mar 2000 at 17:01 UTC by Ankh » (Master)

Consider mwimer's example,

<object> <window> <text> I am text that needs to be marked up! :( </text> </window> </object>

Markup within the text element could be approached in three basic ways:

  1. Extending the schema (or DTD)
  2. Using CDATA sections
  3. Using XML Namespaces

Of these, the first does indeed generate a cluttered namespace, since XML only has global declarations. If you allow <w> here, then <w> is visible to the declaration of all elements in the DTD or Schema, and has the same allowed content everywhere, which is pretty poor.

The second is spectacularly ugly, but works:

<text> <![CDATA[ I am <NOUN>text</NOUN> that needs to be marked up! ]]></text>

Well, I did say it was ugly.

The third also introduces some syntax, but this time it's syntax that adds meaning, so I resent it less:

<text xmlns="http://www.w3.org/1999/xhtml"> I am <i>XHTML</i> text now! </text>

An XML 1.0 processor that is not namespace-aware will choke on this unless you do some fancy DTD work, declaring all the HTML elements you want to use. But the newer XML parsers are namespace-aware.

You can also mix namespaces, to use both MathML and HTML, for example:

<text
     xmlns:html="http://www.w3.org/1999/xhtml"
     xmlns:maths="http://www.w3.org/1998/Math/MathML"
>
      I am <html:i>XHTML</html:i> text with <maths:xxx>stuff</maths:xxx>!!
</text>

Did that address your problem?

Non xml based markup inside xml, posted 17 Mar 2000 at 02:15 UTC by mwimer » (Journeyer)

Ankh,

This is quite interesting. I didn't realize this sorta markup was possible. But it still doesn't address non-XML-based markups like LaTeX or PS. I would personally like to see something like:

<object>
<window>
<text markup="c">

#include <stdio.h>
int main (int argc, char** argv)
{
printf ("hello world");
}

</text>
</window>
</object>

non-xml markup, posted 17 Mar 2000 at 08:02 UTC by Ankh » (Master)

One of the big goals of XML was addressing the plight of the "desperate perl hacker" -- suppose you've got 150,000 pages of documentation, and part number 1998 has been changed to 2041, so you do a search and replace. You can do that reliably in Perl with a simple regexp, because of the 'well-formedness' rules.

CDATA sections broke that somewhat, sadly, but it's still much better in XML than it was in full SGML, where <i/this/ is a legal shortcut for <i>this</i> and there are complex rules about when you can omit end tags.

You can't include arbitrary text directly in XML, though. <stdio.h> is going to be a problem outside a CDATA section.

It's interesting that people often want to do this. C programmers don't want to say char *data = [[ stuff with " and ' and \ and [ and ] in it ]];, and LISPers don't ask for (literal any " thing ) you ( like here); I think the reason is that XML seems to promise more, with the longer end tag.

It's best to get used to escaping stuff, because there is always something you need to escape, whether it's ) or " in LISP/Scheme, or </socks> in XML.

Entity looks really really neat, btw.

Re: non-xml markup, posted 17 Mar 2000 at 15:20 UTC by mwimer » (Journeyer)

Hmm, i didn't even notice my inclusion of <stdio.h> as part of the example. Now that i think about it, there doesn't seem to be any technical reason the xml parser and the c grammar can't work together.

So when the c grammar/parser thinks it's done, it can pass execution back to the xml parser. Sure, i don't think i'll be implementing this feature anytime soon, but it looks like it's a solution to the issues of escaping in xml.

On topic; at work we are storing n by m matrices of pure mathematical data points in an sql db. To me it seems silly to put these matrices into the db a cell at a time, wasting space and munging our effectively binary data into bloated tables with only:
[matrixKey, mKey, nKey, nmCellVal] as the table data.

This same data can be laid out with xml much more effectively, and will still be quite a bit more bloated than using a gdbm, or just an indexed flat file.

Example:

<matrix name="one" dim="2x2">
<cell val=".3"/>
<cell val=".4"/>

<cell val=".5"/>
<cell val=".6"/>
</matrix>

This looks like a flat file db with extra formatting, and you still have to rearrange the layout so that it becomes a matrix in memory. Basically graydon's message is well taken: don't munge your data into xml just because everyone else is doing it. It's best to leave your data in machine readable form so the machine can pump out information from the data, and put your information into xml or some other markup language.

Not a panacea - but can be useful, posted 18 Mar 2000 at 12:30 UTC by Simon » (Master)

I had some very strong anti-XML views until I started using it. Now I'm uneasy about it in a more abstract way. I still stick to the view that if you're going to do something you do it properly, and use SGML as God intended. I'm also extremely uneasy about all these XML manipulation languages that are partially or fully specified in XML themselves; I have a worry they are about to disappear up their own arses.

On the other hand, I'm currently working on a package management system for CTAN (TeX) packages, and the existing CTAN catalogue is written in XML - without a DTD. Working with it is fairly easy, although only about as easy as working in SGML. The advantage of both cases over, say, plain text is that all the hard work is done for me - I don't have to write myself a parser for whatever breed of markup I would use to structure plain text.

Sure, for a lot of cases, you don't need the structure; a set of data points has no implicit structure, so it's just as sensible to say
10:20:30:Something

as:


<coordinatesystem>
<x>10</x> <y>20</y> <z>30</z> <data>Something</data>
</coordinatesystem>

and in that case, it's easier and faster to parse the colon separated version.

I'd say XML has a place in structuring primarily textual data which is designed to be read and written by computers but can be understandable to humans. However, I'd also say that in most cases, you either don't need the structure, or you should write a proper SGML DTD. It's not as if it's difficult, once you've done it once...

simon's SGML, posted 19 Mar 2000 at 04:26 UTC by Ankh » (Master)

I was going to reply by email, but then I decided there was an implication that should be corrected.

There was a widespread rumour in the SGML community at one point that XML was in some way less "rigorous" than SGML, and did not support the notion of a DTD.

In fact, XML is more mathematically rigorous, and does support validation by a DTD.

The main differences between SGML and XML are as follows.

1. SGML has a 200-page specification that you have to buy.

2. The SGML specification is written in the language of 1960s computer documentation, and edited by a lawyer. SGML gurus (including me) still don't completely agree on how to implement all of SGML's features. As a result, there are a number of serious interoperability problems.

3. XML is slightly less expressive than SGML: in particular, XML lacks inclusions, exclusions and and-groups. Unfortunately, these three features are difficult to implement, and interact in spectacularly unpleasant ways with SGML's arcane whitespace rules, so we removed them from XML for a reason.

If there is another reason, besides ignorance :-), prejudice, or being stuck with old tools, for preferring SGML over XML, I'm interested to hear it. Probably not many others are, so email me, liam at holoweb.net, if you prefer.

Ankh

Semantics??? Don't go that way..., posted 28 Mar 2000 at 02:57 UTC by dcs » (Master)

Someone complained that XML does not provide semantics. Well, good. Because providing semantics is by no means a small feat. There is intention, which KQML handles, and knowledge, handled by KIF, which requires ontologies much in the same way XML uses DTDs. Semantics is not simple.

Interested parties can check The Logic Group home page.

XML integrates structure and full-text, posted 8 Apr 2000 at 19:28 UTC by kaig » (Journeyer)

IMHO, the great advantage of XML is that it blurs the distinction between text and data. This is very hard to do in other serialization syntaxes. (With the exception of SGML, of course.)

I don't think that XML is needed if all you want to express is a triple of numbers, say. But XML really shines if you have data which contains textual parts as well as `fact-like' parts. Such as the abstract of an article in a library system. Or the product description in a product catalog.

Consider an arts information system which describes works of art as well as artists. An artist has a vita. Suppose we also want to say which places an artist has visited, or which other people they have met. The fulltext markup that's possible with XML is ideal for this: you just provide a `visited-place' element which is used in the natural language text of the vita, and, lo! you can search for all artists who have visited Paris.
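Concretely, a fragment of such a vita might look like this (the surrounding element names are invented for the example; only visited-place matters):

<vita>
  <p>After leaving the academy she travelled widely, spending two years
  in <visited-place>Paris</visited-place> before settling in
  <visited-place>Vienna</visited-place>.</p>
</vita>

The running text stays readable as prose, and a query over visited-place elements picks out the artists who have been to Paris.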

Adding a list of places visited to the data model is not necessarily the right solution, because quite probably the amount of information available about visited places differs so much from artist to artist that a structured list does not make sense. But in the fulltext, the markup can conveniently be used.

That's my view of where XML shines.
