13 Jan 2003 Omnifarious   » (Journeyer)

I'm adding a very simple, UTF-8 only XML parser to the StreamModule system.

The feature I've worked hardest on is having the lexer portion report the positions of tokens to the parser. This allows me a cascade of things:

  • It enables me to write a parser that can build up an internal structure representing the XML that references the original XML.
    • This allows me to pass XML through my StreamModule system without modifying it, or only modifying those exact sections I choose to.
      • Which is vital if portions of the XML are signed. Converting to and from a canonical format is a horrible thing to do if you need to preserve message integrity at the byte level, especially if you don't have control over all the implementations that may be creating or consuming messages.
  • It enables me to minimize copying.
  • It makes it easy to have the lexer and parser skip quickly over large sections of XML that the application doesn't care about.
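A minimal sketch of the position-reporting idea, written in Python for brevity. It is not the actual StreamModule code: the names `Token`, `lex`, and `replace_span` are hypothetical, and the tag pattern is deliberately naive. The point is only that tokens are byte offsets into the original buffer, so untouched sections pass through byte for byte.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str   # "tag" or "text"
    start: int  # byte offset of the first byte of the token
    end: int    # byte offset one past the last byte of the token


# Hypothetical, naive pattern: everything from '<' to the next '>'.
# It ignores comments, CDATA sections, and '>' inside attribute values.
_TAG = re.compile(rb"<[^>]*>")


def lex(buf: bytes) -> Iterator[Token]:
    """Yield tokens as offsets into buf; the lexer itself copies no data."""
    pos = 0
    for m in _TAG.finditer(buf):
        if m.start() > pos:
            yield Token("text", pos, m.start())
        yield Token("tag", m.start(), m.end())
        pos = m.end()
    if pos < len(buf):
        yield Token("text", pos, len(buf))


def replace_span(buf: bytes, tok: Token, new: bytes) -> bytes:
    """Rewrite exactly one token; all other bytes are passed through as-is."""
    return buf[:tok.start] + new + buf[tok.end:]


doc = b"<msg><to>alice</to><body>  signed   bytes </body></msg>"
tokens = list(lex(doc))

# Change only the first text token; the odd spacing in <body> is untouched.
first_text = next(t for t in tokens if t.kind == "text")
rewritten = replace_span(doc, first_text, b"bob")
```

Because each token carries only offsets, a parser built on this can skip a large section by jumping to the end offset of its closing tag, and it never needs to re-serialize sections it did not touch.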

The parser will have some shortcomings. It doesn't allow non-ASCII tag names, and it doesn't treat non-ASCII whitespace as whitespace. It also has no support for entities right now, though that is planned for the future.
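To make the ASCII-only restrictions concrete, here is a rough Python sketch. The helper names `is_ascii_name` and `skip_ws` are hypothetical, and the exact set of characters the real parser accepts in tag names is an assumption.

```python
import re

# Assumed rule: tag names are limited to ASCII letters, digits, and a few
# punctuation characters; any byte >= 0x80 makes the name invalid.
_ASCII_NAME = re.compile(rb"[A-Za-z_:][A-Za-z0-9_:.\-]*")

# Only these four ASCII bytes count as whitespace.
ASCII_WS = b" \t\r\n"


def is_ascii_name(name: bytes) -> bool:
    """Return True if name is a tag name made only of allowed ASCII bytes."""
    return _ASCII_NAME.fullmatch(name) is not None


def skip_ws(buf: bytes, pos: int) -> int:
    """Return the offset of the first non-whitespace byte at or after pos."""
    while pos < len(buf) and buf[pos] in ASCII_WS:
        pos += 1
    return pos
```

Under these rules a tag name such as `café` encoded as UTF-8 is rejected, and a UTF-8 encoded no-break space (U+00A0) is treated as ordinary character data rather than skipped.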

I'm writing it as part of a system I'm designing to route XML messages in a P2P framework. Speed, lack of copying, and the ability to ignore message bodies were my primary needs.
