19 Dec 2004 titus   » (Journeyer)

Spent a bit of time on Python bugs today, as per Martin v. Löwis's suggestion -- building up karma for my own (5 line) patch to sgmllib ;). Took a lot more time than I thought, sigh; I only took a look at one. Here are my notes: <hr>

patch 755660, fixes bug 736428. Comments on bug 917188 (closed) are relevant. May also fix or at least allow amelioration of behavior in bug 683938 (assigned to frdrake) and bug 699079 (closed).

I don't understand why in the comments on bug 736428 it says that the "patch in bug 917188 (closed) may be better" because there's no patch attached. Perhaps kingwood means that you need to pay attention to markupbase.py, too?

My comments:

This patch allows developers to override the behavior of HTMLParser
when parsing malformed HTML.  Normally HTMLParser calls the function
self.error(), which raises an exception.  This patch adds appropriate
return values for situations where self.error has been redefined in
subclasses to *not* raise an exception.

It does not change the default behavior of HTMLParser and so presents no backwards compatibility issues.

The patch itself consists of an added comment and two added lines of code that call 'return' with appropriate values after a self.error call. Nothing wrong with 'em. I can't verify that the "junk characters" error call will leave the parser in a good state, though, if execution returns from error().

The library documentation could be updated to reflect the ability to override error() behavior; I've written a short patch, available at

http://issola.caltech.edu/~t/transfer/HTMLParser-doc-error.patch

More problems exist with markupbase.py, upon which HTMLParser is based. markupbase calls error() as well, and has some stickier situations. See comments in bug 917188 as well.

Comments in 683938 and 699079 suggest that raising an exception is the correct response to the parse errors. I recommend application of the patch anyway, because it (a) doesn't change any behavior by default and (b) may solve some problems for people.

An alternative would be to distinguish between unrecoverable errors and recoverable errors by having two different functions, e.g. error() (for recoverable errors) and _fail() (for unrecoverable errors). By default error() would call _fail() and internal code could be changed to call _fail() where recovery is impossible. This might alter behavior in situations where subclasses override error() but then again that's not legitimate to do anyway, at least not at the moment -- error() isn't in the docs ;).

If nothing done, at least close patch 755660 and bug 736428 with a comment saying that this behavior will not be addressed ;).

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!