Spent a bit of time on Python bugs today, as per Martin v. Löwis's suggestion -- building up karma for my own (5 line) patch to sgmllib ;). Took a lot more time than I thought, sigh; I only took a look at one. Here are my notes: <hr>
patch 755660, fixes bug 736428. Comments on bug 917188 (closed) are relevant. May also fix or at least allow amelioration of behavior in bug 683938 (assigned to frdrake) and bug 699079 (closed).
I don't understand why in the comments on bug 736428 it says that the "patch in bug 917188 (closed) may be better" because there's no patch attached. Perhaps kingwood means that you need to pay attention to markupbase.py, too?
My comments:
This patch allows developers to override the behavior of HTMLParser when parsing malformed HTML. Normally HTMLParser calls the function self.error(), which raises an exception. This patch adds appropriate return values for situations where self.error has been redefined in subclasses to *not* raise an exception.It does not change the default behavior of HTMLParser and so presents no backwards compatibility issues.
The patch itself consists of an added comment and two added lines of code that call 'return' with appropriate values after a self.error call. Nothing wrong with 'em. I can't verify that the "junk characters" error call will leave the parser in a good state, though, if execution returns from error().
The library documentation could be updated to reflect the ability to override error() behavior; I've written a short patch, available at
http://issola.caltech.edu/~t/transfer/HTMLParser-doc-error.patch
More problems exist with markupbase.py, upon which HTMLParser is based. markupbase calls error() as well, and has some stickier situations. See comments in bug 917188 as well.
Comments in 683938 and 699079 suggest that raising an exception is the correct response to the parse errors. I recommend application of the patch anyway, because it (a) doesn't change any behavior by default and (b) may solve some problems for people.
An alternative would be to distinguish between unrecoverable errors and recoverable errors by having two different functions, e.g. error() (for recoverable errors) and _fail() (for unrecoverable errors). By default error() would call _fail() and internal code could be changed to call _fail() where recovery is impossible. This might alter behavior in situations where subclasses override error() but then again that's not legitimate to do anyway, at least not at the moment -- error() isn't in the docs ;).
If nothing done, at least close patch 755660 and bug 736428 with a comment saying that this behavior will not be addressed ;).