I'm adding a very simple, UTF-8 only XML parser to the StreamModule system.
The feature I've worked hardest on is having the lexer report the positions of tokens to the parser. This enables a cascade of things:
- It enables me to write a parser that can build up an internal structure representing the XML that references the original XML.
- This allows me to pass XML through my StreamModule system without modifying it, or only modifying those exact sections I choose to.
- That is vital if portions of the XML are signed. Converting to and from a canonical format is a horrible thing to do if you need to preserve message integrity at the byte level, especially if you don't control all the implementations that may be creating or consuming messages.
- It enables me to minimize copying.
- It makes it easy to have the lexer and parser skip quickly over large sections of XML that the application doesn't care about.
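To make the idea of position-reporting tokens concrete, here is a minimal sketch in Python. It is not the StreamModule code: the names `Token`, `lex`, and `replace_span` are hypothetical, and the lexer is deliberately naive (it assumes `>` never appears inside attribute values and ignores comments, CDATA, and processing instructions). It only illustrates tokens that are byte offsets into the original buffer rather than copies of it.

```python
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str   # "tag" or "text" (hypothetical token kinds for this sketch)
    start: int  # byte offset of the token's first byte in the original buffer
    end: int    # byte offset one past the token's last byte


def lex(buf: bytes) -> Iterator[Token]:
    """Yield tokens as offsets into buf; no token bytes are copied."""
    pos = 0
    n = len(buf)
    while pos < n:
        if buf[pos] == 0x3C:  # ASCII "<" starts a tag
            close = buf.find(b">", pos)
            if close == -1:
                raise ValueError(f"unterminated tag at byte {pos}")
            yield Token("tag", pos, close + 1)
            pos = close + 1
        else:
            # Everything up to the next "<" (or end of input) is text.
            nxt = buf.find(b"<", pos)
            if nxt == -1:
                nxt = n
            yield Token("text", pos, nxt)
            pos = nxt


def replace_span(buf: bytes, tok: Token, new: bytes) -> bytes:
    """Rewrite only the bytes covered by tok; all other bytes pass through."""
    return buf[:tok.start] + new + buf[tok.end:]


doc = b"<msg><to>alice</to><body>signed payload</body></msg>"
tokens = list(lex(doc))

# Change only the first text span; the signed body is never touched.
first_text = next(t for t in tokens if t.kind == "text")
patched = replace_span(doc, first_text, b"bob")
```

Because every token is just a pair of offsets, concatenating the token spans in order reproduces the input byte for byte, which is the property that matters for signed sections. A caller that wants to look at a token without copying it can take `memoryview(doc)[tok.start:tok.end]`, and a parser that doesn't care about a subtree can simply advance past its tokens without ever materializing them.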
The parser will have some shortcomings. It doesn't allow non-ASCII tag names, and it doesn't treat non-ASCII whitespace as whitespace. It also has no support for entities right now, though that is planned for the future.
I'm writing it as part of a system I'm designing to route XML messages in a P2P framework. Speed, lack of copying, and the ability to ignore message bodies were my primary needs.
