Older blog entries for simonstl (starting at number 35)

19 Dec 2002 (updated 19 Dec 2002 at 17:50 UTC)
Content negotiations

If there's a slowly ticking time bomb in the architecture of the Web, I suspect the best candidate is URI references.

Uniform Resource Identifiers are plagued by a circular philosophy constrained only lightly by a standard syntax, but at least plain old URIs are really only identifiers, with no strong bonds between the structure of URIs and the resources they identify. This loose connection (sometimes valued as opacity) and the circularity have made it possible for URI supporters to brush away all kinds of objections to their scheme-based scheme for years, and URIs themselves seem safely inert.

URIs have given developers a lot of freedom to create flexible systems. One of the coolest features this permits is content negotiation. Because the resource is separated from any single representation, it's possible, for instance, to visit "http://example.com/" and get back a result in any format under the sun, depending on what your browser is configured to ask for.

Typically, people expect HTML, but it could also plausibly return SVG, SMIL, Flash, RDF, a JPEG, or whatever. MIME media types are the largest part of this negotiation, but there are also possibilities for negotiating language, character set, and anything else you can describe easily in a header. This isn't arcane functionality supported only by a privileged few - it's built into pretty much every Web server out there now, and it's not that hard to configure. (The browser side is messier, but I can call that an interface problem.)
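To make the mechanics concrete, here's a minimal sketch of the client side of that conversation - plain Java using HttpURLConnection, with example.com standing in for a server that's actually configured to negotiate. The only thing that changes between the two requests is the Accept header; the Content-Type on the response tells you which representation the server chose.

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class Negotiate {
        public static void main(String[] args) throws IOException {
            // Same URL, two requests: only the Accept header changes.
            for (String type : new String[] {"text/html", "image/svg+xml"}) {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL("http://example.com/").openConnection();
                conn.setRequestProperty("Accept", type);
                // The server picks the representation; Content-Type reports what we got.
                System.out.println(type + " -> " + conn.getContentType());
                conn.disconnect();
            }
        }
    }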

So where does my supposed "ticking time bomb" appear? It's not in the URI itself, but rather in a key set of features that extend URIs, most particularly in the fragment identifier portion of URI references. Section 4.1 of RFC 2396 states:

When a URI reference is used to perform a retrieval action on the identified resource, the optional fragment identifier, separated from the URI by a crosshatch ("#") character, consists of additional reference information to be interpreted by the user agent after the retrieval action has been successfully completed....

The semantics of a fragment identifier is a property of the data resulting from a retrieval action, regardless of the type of URI used in the reference. Therefore, the format and interpretation of fragment identifiers is dependent on the media type [RFC2046] of the retrieval result. The character restrictions described in Section 2 for URI also apply to the fragment in a URI-reference. Individual media types may define additional restrictions or structure within the fragment for specifying different types of "partial views" that can be identified within that media type.

A fragment identifier is only meaningful when a URI reference is intended for retrieval and the result of that retrieval is a document for which the identified fragment is consistently defined.

Unpacking all that produces a fairly simple processing model. Fragment identifiers are separate from the information sent to the server during retrieval. That retrieval uses the URI portion of the URI reference and gets back a representation of the resource. Once the user agent has this representation, it applies the fragment identifier to it, and the application (hopefully) does something interesting with the fragment or fragments returned.

The problem with that process is the gap between the retrieval process and fragment identifier processing. The retrieval process is subject to content negotiation (very cool), but there's no mechanism for fragment identifiers to communicate their expectations for that negotiation. As RFC 2396 makes explicit, "the format and interpretation of fragment identifiers is dependent on the media type of the retrieval result". If the media type returned from the URI is different from the media type the creator of the URI reference expected, then fragment identifier processing will quite likely fail.
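Here's a rough sketch of that processing model in Java, with a made-up negotiated media type standing in for a real retrieval. The point is simply that the fragment never reaches the server, and that what the fragment means swings entirely on whichever media type comes back:

    import java.net.URI;
    import java.net.URISyntaxException;

    public class FragmentModel {
        public static void main(String[] args) throws URISyntaxException {
            URI reference = new URI("http://example.com/report#summary");

            // Step 1: only the URI portion goes to the server;
            // the fragment stays behind with the user agent.
            URI retrieval = new URI(reference.getScheme(),
                                    reference.getSchemeSpecificPart(), null);
            String fragment = reference.getFragment();
            System.out.println("Retrieve: " + retrieval);

            // Step 2: pretend content negotiation produced some media type.
            String negotiatedType = "image/svg+xml"; // could just as easily be text/html

            // Step 3: interpreting the fragment depends entirely on that media type.
            if ("text/html".equals(negotiatedType)) {
                System.out.println("Scroll to the anchor or id named '" + fragment + "'");
            } else if ("image/svg+xml".equals(negotiatedType)) {
                System.out.println("'" + fragment
                        + "' has to mean something in SVG terms, not an HTML anchor");
            } else {
                System.out.println("No fragment semantics defined for " + negotiatedType
                        + "; '" + fragment + "' means nothing here");
            }
        }
    }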

Roy Fielding got me thinking about this with a post that pretty much blasted XPointer. Given that Fielding is one of the authors of RFC 2396, and is in fact starting up a revision process, his opinion clearly matters - but the processing model described above seems to have driven him into a fit of conservatism about the nature of fragment identifiers. In a follow-up message he writes:

However, URI and fragment identifiers are not media type specific, and in fact do not allow media type concerns to be interleaved with identification.

ID is a reasonable solution, but one that existed prior to XPointer. Other ways of identifying content independent of media type include search terms, paragraph text, and regular expressions.

I worry that this first paragraph is largely wrong: RFC 2396 and existing MIME media type registration practices explicitly define fragment identifiers for given media types, and because of that, fragment identifiers quite certainly do mix media type concerns with identification. (URIs themselves quite clearly do not.)

The second paragraph is perhaps the most important, however, as it suggests at least one route out of this problem. It's pretty much what we've done for HTML, for instance. Unfortunately, IDs are a fairly messy issue in XML (see this algorithm for figuring out what's an ID). While text and regular expressions are both great ideas (I'm working on Internet-Drafts for XPointer schemes which do those), they also don't work so well with things like SVG, where identifying a particular view of a drawing may be more important. Reconciling a conservative approach that produces consistent results with a more liberal one that takes greater advantage of the diversity of media types will take a long while.
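For a feel of how differently fragments read across media types, compare a few forms (the first comes from HTML, the second from the XPointer element() scheme, the third from SVG's view specifications):

    http://example.com/report#intro
        (HTML: the element whose id or name is "intro")
    http://example.com/report.xml#element(/1/3/2)
        (XPointer element() scheme: a walk by child position from the root)
    http://example.com/drawing.svg#svgView(viewBox(0,0,100,100))
        (SVG: a particular view of the drawing, not a text anchor)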

(I did post an Internet-Draft that supports different media types through fallthrough, but it's far from clear that it's a useful approach.)

There's a lot of work yet to be done here. My current conclusion is that URI references should be considered abbreviations, and that developers who want more control than the current framework for processing URI references can provide should start thinking hard about these problems. The XPointer Framework, which will probably be the flashpoint for discussions in this area to come, has a lot of useful ideas in it. We need to sort out whether URI reference fragment identifiers are the right home for all of those ideas, and how best to integrate those ideas with the Web.

Open Source, Open Data

Eric van der Vlist's posted a piece on Microsoft's Office 11 announcements at XML 2002. I see that it's been noted on an Advogato diary already, so maybe I won't get shot at for suggesting people take a look, but this is some very interesting work, well worth examining whatever your feelings about its creators. Eric does a nice job of contemplating the consequences for better and worse.

I gave a presentation at the Open Source Conference two years ago called "Open Source, Open Data: What XML has to offer Open Source". At the time, .NET (and the now-quiet "Hailstorm") was the main Microsoft data opening, but I think a lot of it applies as much or more to the Office story. There should be a lot more "free love" out there in the near future, though that's certainly different from "free speech" in the Open Source sense.

Markup and code

I've had some strange discussions lately, both at XML 2002 and outside, where people seem to think that programming code matters and markup is just an accident, something that really doesn't matter much. Programs are the ones creating and interpreting all that markup, right? So why don't the markup people shut up, roll over, and let the programmers do the real work?

It gets pretty depressing sometimes. There are certainly people out there who grasp the difference between "data" and "code" and understand why the constraints on the two and their respective practices are very different, right?

Getting the details right in your markup should mean that writing the code to process it will be easy, whatever the environment. Instead, a lot of people seem to look at how they write code and assume that what's easy for them is easy for everyone else, so they let their code assumptions flow into their markup design - making their markup easy for them but not necessarily for anyone else.

Fortunately, there are other reasons out there to be optimistic about the future of markup.

I've gone a little wacky writing five Internet-Drafts in a week. Or rather, one I-D and four documents that would be Internet-Drafts if I'd remembered the cutoff date for drafts was yesterday, 9am EST. They're all about various schemes for XPointer XML fragment identifiers.

Of course, they may not matter at all if Roy Fielding's opinion on XPointer carries the day.

Finally wrote some code again - release 0.05 of Gorille, a Java library for testing character conformance in XML documents. I changed the rules file format to use elements instead of attributes, and updated it generally for the candidate recommendation of XML 1.1.

If you're one of the twenty people who really cares about URIs, you may not like what I've posted here or here.

I've just posted Making Web Services Part of the Web, a set of pretty general suggestions for what I'd consider "Web Services" in a sense that actually takes the Web seriously.

Can't say I expect the Web Services folks to like it, but since I don't like most of what they're doing, that's just fine.

Now that I've emptied my head a bit, maybe I can focus on converting the Tiny API for Markup parser to support my Markup Object Events framework, and write a lot more documentation.

Taking a break

If you haven't heard from me in a while, it's not unusual. I've decided to slow down a bit, get through some projects at work, and figure out where I really want to go with my markup work. XML seems to be getting more and more cluttered with junk and buried behind the Web Services hype, which itself seems to be dying.

It's time to cool out for a bit, write some code, and recharge the batteries for some serious thoughts on information modeling. In the meantime Monastic XML can sum up where I'm at with markup pretty nicely.

I've just been to a couple of conferences - the O'Reilly Open Source Conference (OSCON) and Extreme Markup Languages. OSCON went pretty well, and I got to meet Raph, an extra bonus.

At OSCON, I expressed some concern about XML's declining direction, so far as I can see it, and the W3C's lack of understanding of the technology it supposedly stewards. My doubts unfortunately continue to grow.

Except for one moment - a presentation by the brilliant Liam Quin that was a masterpiece of the very kind of URI-poisoning resting at the heart of my concerns about the W3C and XML - Extreme was a showcase of the possibilities that lie ahead for markup.

Extreme is wonderfully refreshing because it tends to be filled with people who understand markup and care about markup, rather than with people who have problems they think markup might solve sort of adequately. Extreme is also filled with people who incorrigibly think for themselves, so questions are hard-hitting and ideas are fast-flowing, with little deference to institutional authority. Makes for a great set of conversations.

I put up a poster on Monastic XML, and presented on my Out-of-line (Ool) markup work. There were some great pieces on comparing markup, approaches to creating overlapping markup, extending the reach of regular expressions, and building systems for autonomous processing of markup.

Phew!

Is closed poisonous to open?

I've been thinking a lot about various architectures and approaches to creating them, and I'm concluding that closed systems can benefit from open systems but closed systems are generally detrimental to open systems.

In the XML world, I'm thinking about things like W3C XML Schema and XQuery, which seem to pile on truckloads of features rather than building the smallest core possible (the approach XML 1.0 took, or tried to). As a result, these things are huge. Developers building closed systems can use or ignore whatever features they want, so more features is good for them. Developers building open systems - well, they don't know what's coming, so they have to be prepared for anything legal. In these cases, that's a lot.

In the standards world, I'm thinking about the W3C and its confidentiality rules. Despite the W3C's general interest in doing the right thing by openly publishing royalty-free standards, the organization is controlled by its members. The membership is mostly vendors, with a few large customers and the occasional individual willing to shell out $5K/year. The W3C has become a little more accustomed to doing work in public (with things like the TAG and most of the work on the Semantic Web, not to mention XML Protocol, aka SOAP). Overall, though, it's still pretty much a pay-to-play organization, and some recent blowups, described in this diary, illustrate that the W3C's role as keeper of the open Web and its role as a vendor consortium don't always go well together.

In the worlds of open source and free software, it's fairly clear that those who make their living with proprietary approaches see free software as a threat, though they're less hostile to things like BSD - since heck, they can just grab the code, tinker with it, and use it. (I publish my work under the MPL, but definitely lean toward the GPL side, not the BSD side. And yet I'm writing this on Mac OS X. Brief interlude while I slap myself a few times and resolve to write some GPL code as penance.)

I dunno. I can't say I'm feeling optimistic, as the money generally seems to live on the closed side, and its approach to openness is, well, opportunistic.
