22 Feb 2010 dwmw2   » (Master)

My God, I've been vaguely aware of the HTML5 video train wreck but I hadn't realised just how much of a fucking abortion the rest of the HTML5 'standard' is.

I had the misfortune to read the section on character encodings over the weekend, and it almost made me lose my lunch.

Not only does it codify the crappy and unreliable practice of applying heuristics to guess character encodings, it also requires that a user agent deliberately ignore the explicitly specified character set in some cases — for example, text explicitly labelled as US-ASCII or ISO8859-1 MUST be rendered as if it were Windows-1252!
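
To make the difference concrete, here is a minimal sketch in Python (my choice of language for illustration, not something the spec uses). It decodes the same three bytes under both labels, using Python's built-in codecs. The byte 0x80 is a C1 control character in ISO8859-1 but the Euro sign in Windows-1252.

```python
# Three bytes: 0x80 followed by the ASCII digits "1" and "0".
raw = bytes([0x80, 0x31, 0x30])

# Decoded as labelled: 0x80 becomes the C1 control character U+0080.
as_latin1 = raw.decode("iso-8859-1")

# Decoded the way the spec demands: 0x80 becomes the Euro sign U+20AC.
as_cp1252 = raw.decode("windows-1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

The two encodings agree on every byte except the range 0x80 to 0x9F, which is exactly where mislabelled content goes unnoticed.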

It justifies this idiocy, which it admits is a 'willful violation', on the basis that it aids compatibility with legacy content. By which of course it means "broken content", since this was never actually necessary for anyone who published content correctly even with older versions of HTML.

But that doesn't make any sense — surely legacy content won't be identifying itself as HTML5? It might be reasonable to do these stupid things for legacy content, but not HTML5. The complete mess we have with charset labelling is a prime example of where the RFC1122 §1.2.2 approach of being lenient in what you accept has turned out to be massively counter-productive — if we'd simply refused to make stupid guesses about character sets in the first place, then people would have actually started getting the labelling right.

The sensible approach to take with HTML5 would just have been to say "All content which identifies itself as HTML5 MUST be in the UTF-8 character encoding. A conforming user agent MUST NOT attempt to interpret content as if it has any other encoding; any invalid UTF-8 byte sequences MUST be shown using the Unicode replacement character U+FFFD (�) or equivalent."
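
As a rough sketch of what that rule would mean for a decoder (Python again; the function name `decode_html5_strict` is made up for illustration), the built-in UTF-8 codec with `errors="replace"` already does the job: no guessing, and invalid sequences come out as U+FFFD.

```python
def decode_html5_strict(body: bytes) -> str:
    """Hypothetical helper: treat the body as UTF-8 and nothing else.

    Invalid byte sequences are replaced with U+FFFD rather than being
    reinterpreted in some guessed legacy encoding.
    """
    return body.decode("utf-8", errors="replace")


# Correctly encoded UTF-8 passes through untouched.
good = decode_html5_strict("caf\u00e9".encode("utf-8"))

# The same word saved as ISO8859-1 contains a lone 0xE9 byte, which is
# not valid UTF-8, so the publisher sees a replacement character.
bad = decode_html5_strict(b"caf\xe9")

print(good)
print(bad)
```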

Or, if we really must continue to permit the legacy crap 8-bit character sets, it should have said that the content MUST be in the character set specified in the HTTP Content-Type: header or equivalent <META> tag.

Keep the stupid heuristics for legacy content by all means, but it should be forbidden to render HTML5 content in a character set other than the one it is labelled with, and all invalid characters (including the C1 control characters in ISO8859-1 which in Windows-1252 would map to extra printable characters like the Euro sign) MUST be shown as U+FFFD (�). And then the people who publish broken crap would see that they're publishing broken crap, rather than thinking it's OK because the browser they use just happens to assume the same character set as the system they're publishing from.
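
A minimal sketch of that behaviour, assuming the label is a charset name Python's codecs recognise (the function name `decode_as_labelled` is hypothetical): decode with exactly the labelled charset, then turn any C1 control characters (U+0080 to U+009F) into U+FFFD instead of quietly mapping them to Windows-1252 glyphs.

```python
REPLACEMENT = "\ufffd"


def decode_as_labelled(body: bytes, charset: str) -> str:
    """Hypothetical helper: honour the label, and make breakage visible."""
    # Bytes that are invalid in the labelled charset become U+FFFD.
    text = body.decode(charset, errors="replace")
    # C1 controls have no business in a web page; show them as U+FFFD too.
    return "".join(
        REPLACEMENT if 0x80 <= ord(ch) <= 0x9F else ch for ch in text
    )


# A "Euro sign" typed on a Windows-1252 system but labelled ISO8859-1.
mislabelled = decode_as_labelled(b"\x80" + b"10", "iso-8859-1")

# The same bytes labelled US-ASCII: 0x80 is simply invalid.
not_ascii = decode_as_labelled(b"\x80" + b"10", "us-ascii")

print(mislabelled)
print(not_ascii)
```

Either way the publisher gets a visible replacement character instead of a Euro sign that only appears by accident.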

To me, HTML5 looks less like a standard and more like a set of broken hackish kludges to work around the fact that people out there aren't actually capable of following a standard.
