17 Dec 2004 titus   » (Journeyer)

Python problems: sgmllib/htmllib vs HTMLParser

While playing with PBP, I noticed that tag attributes weren't being correctly parsed. For example,

<option value="Small (10&quot;)"> Small (10&quot;)

was coming through as

<option value="Small (10&quot;)"> Small (10")

This caused problems in two areas. First, trying to set the value of the associated select widget failed unless the entity-encoded string was used (Small (10&quot;) instead of Small (10")). Second, it caused problems on submission of the form to the Web server: the still-escaped value was encoded once more for HTTP transmission, so cgi.FieldStorage would decode it on the server side and set the select widget value to Small (10&quot;). So overall badness happened on both client and server sides.
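To make the server-side half concrete, here is a minimal sketch of the round trip in present-day Python 3. The field name `size` is made up for illustration, and `urllib.parse` is only a stand-in for the form encoding done on the client and the decoding that cgi.FieldStorage does on the server; this is not the PBP or ClientForm code.

```python
from urllib.parse import urlencode, parse_qs

# What the client ends up holding when the attribute is not unescaped.
escaped_value = "Small (10&quot;)"
# What the page author actually meant.
intended_value = 'Small (10")'

# URL-encoding for transmission percent-encodes '&' and ';' like any other
# character; it knows nothing about HTML entities.
body = urlencode({"size": escaped_value})  # "size" is a hypothetical field name

# Server-side decoding undoes only the URL-encoding, not the entity.
received = parse_qs(body)["size"][0]

print(body)
print(received)
print(received == intended_value)
```

The entity survives the trip untouched, so the server sees the escaped string rather than the value with a literal double quote.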

I dug deeply into PBP, which led me to mechanize, which in turn led me to ClientForm, which led me to htmllib.HTMLParser. The trail finally ended in sgmllib. Long story short: there are two HTML parsing classes in Python, htmllib.HTMLParser (derived from sgmllib.SGMLParser) and HTMLParser.HTMLParser, which is more-or-less standalone. mechanize can use either, but prefers htmllib because it is present in older versions of Python. And here's the essential clue: the problem goes away if you switch to using HTMLParser.HTMLParser instead of htmllib.HTMLParser.
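For anyone who wants to see the "good" behavior, here is a small sketch. It assumes Python 3, where the HTMLParser module lives on as `html.parser` and sgmllib/htmllib are no longer in the standard library, so it can only show the unescaping side. `AttrCollector` is a hypothetical helper class, not something from mechanize or ClientForm.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attributes of each start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs as handed over by the parser.
        self.seen.append((tag, attrs))


parser = AttrCollector()
parser.feed('<option value="Small (10&quot;)">Small (10&quot;)</option>')
parser.close()

print(parser.seen)
```

The value arrives in handle_starttag with a literal double quote, which is what the form-handling code expects.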

Once I figured this out, the root cause was easy to find: sgmllib.SGMLParser (and therefore htmllib.HTMLParser) does not unescape tag attributes, while HTMLParser.HTMLParser does. Oddly enough, HTMLParser.HTMLParser doesn't use handle_entityref to unescape tag attributes; it uses string.replace to handle a small number of specific entity refs. I'm not sure if this is correct, but it's easy to move the same code over to sgmllib.py.
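The replace-based approach is simple enough to sketch as a standalone function. This is an illustration in Python 3, not the library source verbatim: it handles only the five named entities, and `unescape_amp_first` is a deliberately wrong variant included just to show why the ampersand replacement has to come last.

```python
def unescape(s):
    """Replace the five named entities using plain string replacement."""
    if "&" not in s:
        return s
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    s = s.replace("&amp;", "&")  # must be last
    return s


def unescape_amp_first(s):
    """Wrong ordering, for comparison only: '&amp;' is replaced first."""
    s = s.replace("&amp;", "&")
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    return s


print(unescape("Small (10&quot;)"))
print(unescape("&amp;lt;"))            # an escaped "&lt;" stays as text
print(unescape_amp_first("&amp;lt;"))  # double-unescaped by mistake
print(unescape("&#34;"))               # numeric references are left alone
```

Note the limitation: anything outside those five names, including numeric character references, passes through unchanged.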

The diff to sgmllib.py is below. It's pretty small; I'll send it out to the comp.lang.python newsgroup and see what people think, before I waste the time of Python maintainers specifically. It sure is nice to dig deeply into the code and find such a simple fix ;).

--- sgmllib.py  2004-12-16 23:30:51.000000000 -0800
***************
*** 272,277 ****
--- 272,278 ----
              elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                   attrvalue[:1] == '"' == attrvalue[-1:]:
                  attrvalue = attrvalue[1:-1]
+                 attrvalue = self.unescape(attrvalue)
              attrs.append((attrname.lower(), attrvalue))
              k = match.end(0)
          if rawdata[j] == '>':
***************
*** 414,419 ****
--- 415,432 ----
      def unknown_charref(self, ref): pass
      def unknown_entityref(self, ref): pass

+     # Internal -- helper to remove special character quoting
+     def unescape(self, s):
+         if '&' not in s:
+             return s
+         s = s.replace("&lt;", "<")
+         s = s.replace("&gt;", ">")
+         s = s.replace("&apos;", "'")
+         s = s.replace("&quot;", '"')
+         s = s.replace("&amp;", "&") # Must be last
+
+         return s
+

class TestSGMLParser(SGMLParser):

g'nite.

--titus
