Older blog entries for titus (starting at number 23)

Web browsing via Python, refreshed.

The saga continues... after many patches, and diversions, and attempts to grok, I finally got PBP to work for a variety of purposes. In particular, I can now:

  • Use maxq to record browsing sessions to PBP scripts, and run those scripts successfully in PBP;
  • Use an only slightly altered HTMLParser.HTMLParser class to parse the moderately crappy SourceForge/mailman output;
  • Grok RFC 2965 cookies well enough to actually log into and play with mailman admindb pages;
  • Save, load, and view cookies in PBP.

I've submitted a bunch of patches to PBP & hopefully Cory can take a look at them in the next few weeks.

HTML/HTTP/URL support in Python's base is surprisingly messy, given my past experience with Python modules. I look forward to the day when Python 2.4 or above is the standard and 2.3 is no longer supported -- it will make cookie and URL handling much simpler!

I may spend some quality time refitting the HTMLParser classes in the htmllib and HTMLParser modules; as-is, they don't have good failure modes.

--titus

"He realized the fastest way to change is to laugh at your own
folly -- then you can let go and quickly move on."
-- via Dossy Shiobara
20 Dec 2004 (updated 28 Dec 2004 at 23:10 UTC) »
The joys, trials, and tribulations of open-source work

This is becoming a bad joke.

The story starts innocently enough: when playing with PBP, I found a bug in Python's sgmllib.py. This bug caused PBP (via ClientForm) to fail on my test case, the Quixote demo application. So I went and fixed that bug and submitted a patch to the Python developers.

Then I diverted myself and decided to take up the gauntlet thrown by Martin v. Löwis, in order to see how quickly I could get my patch reviewed. I proceeded to work through first one and then nine add'l Python patches. My python-dev post still hasn't gone through so nothing further has happened there.

Today I put myself back on track and spent some time with maxq, a Java HTTP proxy recording system that outputs Jython code tracking a Web session. I added a script generator module for PBP to maxq so that it would output PBP scripts as well.

In the process I ran across two additional problems in PBP. First I found a bug in the way PBP used shlex (PBP patch 274). Then maxq's behavior of recording all form variables, even hidden ones, led me to discover that PBP crashed when trying to change hidden variables. I changed this to a warning (PBP patch 273). (Not strictly a PBP problem, but the crash behavior was probably inappropriate.)

So, to try out one package (PBP) and modify another to fit my needs (maxq), I ended up crawling through a bunch of nifty packages (mechanize, ClientForm, and four or five Python modules I'd never used before), submitting three different patches to two different projects (not including revised patches I contributed to Python), and writing a small chunk of new code in Java. whee!

The worst part is that one of the PBP patches I'm submitting contains this:

 ...See http://issola.caltech.edu/~t/transfer/quixote-demo.pbp for an
operative example, although you will not be able to run it without
changing ClientForm.py to use the XHTMLCompatibleFormParser, because
of a bug in sgmllib.py (Python patch 1087808).
Yep -- to test this patch, go apply this other patch to the language you're using...

O well. At least now I can get around to the original purpose of writing a testing suite for Cartwheel.

My wife and cat are upset with the amount of time I've spent on this, that's for sure... ;) Of course, my wife doesn't get a vote because she went skiing today while I abused sea urchins. (The cat doesn't get a vote either: he sleeps all day.)

On the plus side, at least I could fix the software myself, so everything works now. I'd be SOL if any of this stuff had been closed-source.

Continued karma-grubbing.

My advice for the Python patches neophyte: start high. I foolishly started in the low-numbered patches (#755660) & spent over an hour trying to figure out the code (which was easy) and the various patches and counterarguments and ... That part was less easy and nowhere near as pleasant. Folks, there's a reason why those patches and bug reports have been lying around for over a year -- no one else wants to deal with them.

Another piece of advice is to stick with modules you already know about or are interested in learning about. In my case I'd just gone through the HTMLParser stuff to find the sgmllib bug, so I decided to focus on HTML/CGI/URL stuff.

Anyhoo, worked my way through 9 add'l patches. Great! Together with my comment on patch 755660, I ran through 10 different patches. I may now pray to The Gods That Be to review patch 1087808, my sgmllib fix.

Here's a list:

  • patch 1055159, a simple docstring/doc patch to CGIHTTPServer. (This was really a documentation "enhancement request" masquerading as a patch, so I verified the behavior described & then wrote the doc string appropriately.)

  • patch 755670, a patch to make HTMLParser parse invalid HTML. Recommended not applying it.

    (I also put in a comment that was meant for a different patch (755660). Very embarrassing. Sigh. Tabbed browsing is dangerous in my hands. ;)

  • patch 1037974, a patch to fix HTTP digest authentication when accessing LiveJournal feeds (and any other feeds that don't listen to RFC 2617, which considers 'Algorithm' notification optional).

    Had to sign up for a livejournal account to test this, too...

    This is my first time dealing with digest authentication, but as I understand it the new behavior is merely verbose and a bit redundant. It certainly shouldn't break anything. Recommend apply.

  • patch 1028908, a bunch of small stuff by John J. Lee of wwwsearch. Apparently much of the code he modified was originally written by him anyway (?) and the regression tests passed, so *shrug*... recommend application.

  • patch 901480, fixing bug 735248. This fixes a bug in the way urllib2.parse_http_list parses unquoted elements. Recommend application, although I submitted a slightly modified patch (against the current CVS) that fixed a doctest string.

  • patch 827559 fixes SimpleHTTPServer to add a trailing '/' to directory names. Recommend application; it does fix the behavior, and I think it's a reasonable way to treat directories, too.

    An analogy: Links to 'http://some.place.or.other:port' get rewritten to 'http://some.place.or.other:port/', so links to '/this/dir' should get rewritten to '/this/dir/'.

  • patch 810023, a very nice patch to fix some reporthook behavior in urllib. The submitter, J. Lewis Muir, wrote regression tests to show that his new urllib worked, so testing this one was pretty easy. Recommend apply, based on the regression test behavior failed by the current source tree.

  • patch 893642, which adds an optional allow_none argument to SimpleXMLRPCServer and classes that use it. I updated the patch & added some documentation.

  • patch 1067760, to bug 1067728 (closed). This patch changes the behavior of seek() on file objects to do float --> long conversion instead of float --> int conversion. This allows 2.0**62 to be used in a seek, just like 2 ** 62. I recommended applying it because it shouldn't lead to new bugs. I'm probably missing something.
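To illustrate what the parse_http_list function touched by patch 901480 is for: it splits a comma-separated HTTP header value while leaving commas inside quoted strings alone. This is a small sketch that assumes Python 3, where the function can be imported from urllib.request rather than urllib2; the header value is made up for the example.

```python
from urllib.request import parse_http_list

# A made-up, digest-style header value with one quoted and two unquoted elements.
header_value = 'realm="a, b", nonce=abc123, qop=auth'

# The comma inside the quoted string does not split the first element.
items = parse_http_list(header_value)
print(items)
```

The quoted element keeps its quotes; stripping them is left to the caller.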

Spent a bit of time on Python bugs today, as per Martin v. Löwis's suggestion -- building up karma for my own (5 line) patch to sgmllib ;). Took a lot more time than I thought, sigh; I only took a look at one. Here are my notes:

patch 755660, fixes bug 736428. Comments on bug 917188 (closed) are relevant. May also fix or at least allow amelioration of behavior in bug 683938 (assigned to frdrake) and bug 699079 (closed).

I don't understand why in the comments on bug 736428 it says that the "patch in bug 917188 (closed) may be better" because there's no patch attached. Perhaps kingwood means that you need to pay attention to markupbase.py, too?

My comments:

This patch allows developers to override the behavior of HTMLParser
when parsing malformed HTML.  Normally HTMLParser calls the function
self.error(), which raises an exception.  This patch adds appropriate
return values for situations where self.error has been redefined in
subclasses to *not* raise an exception.

It does not change the default behavior of HTMLParser and so presents no backwards compatibility issues.

The patch itself consists of an added comment and two added lines of code that call 'return' with appropriate values after a self.error call. Nothing wrong with 'em. I can't verify that the "junk characters" error call will leave the parser in a good state, though, if execution returns from error().

The library documentation could be updated to reflect the ability to override error() behavior; I've written a short patch, available at

http://issola.caltech.edu/~t/transfer/HTMLParser-doc-error.patch

More problems exist with markupbase.py, upon which HTMLParser is based. markupbase calls error() as well, and has some stickier situations. See comments in bug 917188 as well.

Comments in 683938 and 699079 suggest that raising an exception is the correct response to the parse errors. I recommend application of the patch anyway, because it (a) doesn't change any behavior by default and (b) may solve some problems for people.

An alternative would be to distinguish between unrecoverable errors and recoverable errors by having two different functions, e.g. error() (for recoverable errors) and _fail() (for unrecoverable errors). By default error() would call _fail() and internal code could be changed to call _fail() where recovery is impossible. This might alter behavior in situations where subclasses override error() but then again that's not legitimate to do anyway, at least not at the moment -- error() isn't in the docs ;).
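A minimal sketch of that error()/_fail() split, written as a toy parser in modern Python. Everything here is hypothetical (StrictParser, TolerantParser, ParseFailure, feed_chunks, and the 'junk'/'fatal' chunk convention); it only illustrates the idea and is not the HTMLParser or markupbase code.

```python
class ParseFailure(Exception):
    """Raised when the parser cannot continue."""


class StrictParser:
    """Toy parser: each chunk is ordinary text, 'junk' (recoverable) or 'fatal'."""

    def __init__(self):
        self.seen = []

    def _fail(self, message):
        # Unrecoverable: always raises, and is not meant to be overridden.
        raise ParseFailure(message)

    def error(self, message):
        # Recoverable: by default this still gives up, as it does today.
        self._fail(message)

    def feed_chunks(self, chunks):
        for chunk in chunks:
            if chunk == "fatal":
                self._fail("cannot continue past %r" % chunk)
            elif chunk == "junk":
                self.error("junk characters: %r" % chunk)
                continue  # only reached if error() returned
            self.seen.append(chunk)


class TolerantParser(StrictParser):
    """Subclass that records recoverable errors instead of raising."""

    def __init__(self):
        super().__init__()
        self.warnings = []

    def error(self, message):
        self.warnings.append(message)


parser = TolerantParser()
parser.feed_chunks(["ok", "junk", "ok"])
print(parser.seen, parser.warnings)
```

With this arrangement a subclass can soften recoverable errors, while the internal code keeps a way to stop that an override cannot silence.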

If nothing is done, at least close patch 755660 and bug 736428 with a comment saying that this behavior will not be addressed ;).

Python problems: sgmllib/htmllib vs HTMLParser

While playing with PBP, I noticed that tag attributes weren't being correctly parsed. For example,

<option value="Small (10&quot;)"> Small (10&quot;)

was coming through as

<option value="Small (10&quot;)"> Small (10")

This caused problems in two areas. First, trying to set the value of the associated select widget failed unless the entity-encoded string was used (Small (10&quot;) instead of Small (10")). Second, this in turn caused problems on submission of the form to the Web server, because the still entity-encoded value was encoded once more for HTTP transmission. cgi.FieldStorage would decode it on the server side and set the select widget value to Small (10&quot;). So overall badness happened on both client and server sides.
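The round trip can be reproduced with the modern standard library. This is only a sketch under assumptions: it uses Python 3's urllib.parse in place of the 2004-era modules (cgi.FieldStorage on the server side), and the field name `size` and the helper `round_trip` are made up for the illustration.

```python
from urllib.parse import urlencode, parse_qs


def round_trip(value):
    """URL-encode a form value as a client would, then decode it as a server would."""
    body = urlencode({"size": value})
    return parse_qs(body)["size"][0]


# What the form parser handed back: the attribute was never HTML-unescaped.
escaped = 'Small (10&quot;)'
# What a browser would actually send.
plain = 'Small (10")'

print(round_trip(escaped))
print(round_trip(plain))
```

URL encoding round-trips whatever string it is given, so the entity reference survives all the way to the server; the unescaping has to happen when the form is parsed.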

I dug deeply into PBP, which led me to mechanize, which in turn led me to ClientForm, which led me to htmllib.HTMLParser. The trail finally ended in sgmllib. Long story short: there are two HTML parsing classes in Python, htmllib.HTMLParser (derived from sgmllib.SGMLParser) and HTMLParser.HTMLParser, which is more-or-less standalone. mechanize can use either, but prefers htmllib because it is present in older versions of Python. And here's the essential clue: the problem goes away if you switch to using HTMLParser.HTMLParser instead of htmllib.HTMLParser.

Once I figured this out, the root cause was easy to find: sgmllib.SGMLParser (and therefore htmllib.HTMLParser) does not unescape tag attributes, while HTMLParser.HTMLParser does. Oddly enough it doesn't use handle_entityref to unescape tag attributes; it uses string.replace to handle a small number of specific entity refs. I'm not sure if this is correct, but it's easy to move the same code over to sgmllib.py.
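For comparison, here is how the surviving parser treats the same markup today. The sketch assumes Python 3, where HTMLParser.HTMLParser lives on as html.parser.HTMLParser (sgmllib and htmllib are gone); the AttrCollector class is hypothetical.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect the attributes of every start tag and the text between tags."""

    def __init__(self):
        super().__init__()
        self.attrs = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

    def handle_data(self, data):
        self.text.append(data)


parser = AttrCollector()
parser.feed('<option value="Small (10&quot;)"> Small (10&quot;)')
parser.close()

# Both the attribute value and the text arrive with the entity already decoded.
print(parser.attrs)
print("".join(parser.text))
```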

The diff to sgmllib.py is below. It's pretty small; I'll send it out to the comp.lang.python newsgroup and see what people think, before I waste the time of Python maintainers specifically. It sure is nice to dig deeply into the code and find such a simple fix ;).

--- sgmllib.py  2004-12-16 23:30:51.000000000 -0800
***************
*** 272,277 ****
--- 272,278 ----
              elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                   attrvalue[:1] == '"' == attrvalue[-1:]:
                  attrvalue = attrvalue[1:-1]
+                 attrvalue = self.unescape(attrvalue)
              attrs.append((attrname.lower(), attrvalue))
              k = match.end(0)
          if rawdata[j] == '>':
***************
*** 414,419 ****
--- 415,432 ----
      def unknown_charref(self, ref): pass
      def unknown_entityref(self, ref): pass

+     # Internal -- helper to remove special character quoting
+     def unescape(self, s):
+         if '&' not in s:
+             return s
+         s = s.replace("&lt;", "<")
+         s = s.replace("&gt;", ">")
+         s = s.replace("&apos;", "'")
+         s = s.replace("&quot;", '"')
+         s = s.replace("&amp;", "&") # Must be last
+
+         return s
+

class TestSGMLParser(SGMLParser):
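Why the ampersand replacement has to come last is easiest to see with a standalone version of the helper. This is a sketch in Python 3; `unescape_amp_first` is a deliberately wrong variant written for this illustration and is not code from either module.

```python
def unescape(s):
    """Replace the five entity references handled by the patch; '&amp;' goes last."""
    if '&' not in s:
        return s
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    s = s.replace("&amp;", "&")  # must be last
    return s


def unescape_amp_first(s):
    """Wrong order: decoding '&amp;' first lets its output be decoded a second time."""
    s = s.replace("&amp;", "&")
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    return s


# The source text means the literal characters '&quot;', not a quote mark.
source = "type &amp;quot; for a quote"
print(unescape(source))            # type &quot; for a quote
print(unescape_amp_first(source))  # type " for a quote
```

Doing the ampersand first turns an escaped ampersand into the start of a new entity reference, which the later replacements then decode again.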

g'nite.

--titus

16 Dec 2004 (updated 16 Dec 2004 at 09:20 UTC) »
dcoombs -- have you tried NJAMD? I've had moderately good luck with it...

Testing Web sites

Revisited Cory Dodt's Python Browser Poseur (PBP) today. This is one of those projects that frequently pops into my head as something worth investigating, but I've never actually looked at it seriously. (And the last time I looked, there were still some fairly obvious broken bits that prevented me from making use of it -- but there's a new version...)

PBP is the best (simplest + easiest) way I've seen to test dynamic Web sites. It's based on mechanize, a Python version of WWW::Mechanize, and it provides a simple scripting ability to automate Web site browsing. Even someone without extensive Python experience can write scripts for it, which is an advantage for groups that aren't all programmers. I haven't tried extending it but I doubt it's that difficult; the package code looks clean & is relatively short.

PBP is relatively simple, at least on the surface: here's the example script from their site.

go http://mailinator.com
code 200
find "property of Outsc.*me"
showform
formvalue search email pbp.berlios.be
submit search
code 200
find "NO MESSAGES"

When executed by pbpscript, this script goes to mailinator.com, searches for the regexp "Outsc.*me" (which matches "...is property of Outscheme, Inc"), and then checks e-mail for the pbp.berlios.be@mailinator.com e-mail address. If there are any messages, the script fails. (Try changing '.be' to '.de' if you want to see -- sorry, I screwed up the example on the Web page by stupidly sending e-mail to pbp.berlios.de@mailinator.com.)
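To make the command format concrete, here is a rough sketch of how such a script could be split into commands and how the find step matches. It is not PBP's implementation: `parse_script` is a hypothetical helper, and the page text is just the snippet quoted above rather than a live fetch.

```python
import re
import shlex

SCRIPT = '''\
go http://mailinator.com
code 200
find "property of Outsc.*me"
formvalue search email pbp.berlios.be
submit search
'''


def parse_script(text):
    """Split each non-blank line into (command, [arguments]), honoring quotes."""
    commands = []
    for line in text.splitlines():
        words = shlex.split(line)
        if words:
            commands.append((words[0], words[1:]))
    return commands


commands = parse_script(SCRIPT)
for name, args in commands:
    print(name, args)

# The 'find' argument is a regular expression searched for in the page text.
pattern = dict(commands)["find"][0]
page_text = "...is property of Outscheme, Inc"
print(bool(re.search(pattern, page_text)))
```

Quoting matters here: without the double quotes, the find pattern would be split into three separate arguments.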

This is cool.

I'm puzzled that neither mechanize nor PBP is better known (as in, I haven't seen it mentioned anywhere but in c.l.p.announce on the occasion of a new release). I don't monitor freshmeat, which is another place it's been posted. Apart from that, a Google search doesn't turn up much mention. Am I missing a wealth of similar software that is better? What do people use to test Web sites, anyway? A currently-defunct list archive has a reference to HttpUnit, which is a nice-looking Java framework. Unfortunately I doubt it's as Python-extensible as PBP ;). John Lee of mechanize also points out webunit, by Richard Jones (also author of Roundup). I may have to take a look at that. Anything else?

In the interests of exercising PBP a bit, I wrote a simple PBP script (note: transient link) to run through my WSGI adapter interface for the Quixote demo. You can try it out if you want; the site just runs quixote.demo through a CGI-->WSGI -->QWIP bridge. And yes, it's veeeeeeery slow.

I ran into only one real problem with PBP: HTML encoded form values. In the Quixote widget demo, there's a select widget that takes pizza sizes with inch units, e.g. 'Medium (10")'. The mechanize ClientForm is returning this in HTML-encoded form, 'Medium (10&quot;)', and PBP demands that it be set to this value. However, Quixote barfs on this because it is expecting 'Medium (10")' -- which is in fact what Quixote sees from browsers. There may be some invisible layers of encoding/decoding going on; Quixote uses cgi.FieldStorage which presumably decodes a properly-encoded string from the browser. I think the appropriate thing to do here is to change mechanize's behavior, but I will ask Cory what he thinks first; I haven't dealt with this aspect of HTML forms before, having been spoiled by nice libraries ;).

Next I'll have to try extending PBP from Python & vice versa. Anon...

--titus

"Consider the situation of two trauma surgeons arriving at an accident scene. The patient is bleeding profusely. If surgeons were like programmers, they'd leave the patient to bleed out in order to have a really satisfying argument over the merits of two different kinds of tourniquet." -- Philip Greenspun.

16 Dec 2004 (updated 5 Jan 2005 at 00:43 UTC) »

a whole entry vanished!

14 Dec 2004 (updated 14 Dec 2004 at 18:49 UTC) »
haruspex... the problem is that the right-wing nutsos are saying "let's burn all the oil before we do anything else" and the left-wing nutsos are saying "nothing but non-nuclear renewable energy will do", leaving me with the currently unimplementable centrist view of "let's switch over to nukes, while exploring alternative energy options and weaning ourselves from fossil fuels".

Unfortunately for Americans (I'm in the US) where we have these things called "elections", we do often have to choose between only two real options. In this case I'm not even sure what Kerry's standpoint was on the environment, but I didn't like his views on the Patriot Act (he voted for it) or his views on the Iraq invasion (he voted for giving Bush the power to do it & then made a U-turn for political reasons -- which inconsistency I despise). I still voted for him because I despised Bush more ;).

So while I agree with you in theory, in the real world it's different. I can't stand Bush, but I also find most of his real political opposition to be anti-reason. Who do I support? Kucinich? Or Dean? Or Al Sharpton, who makes an awful lot of sense? I would have voted for McCain just based on consistency, but Bush managed to torpedo him in 2000... so I'm left with whomever the Democrats support. Which unfortunately was Kerry.

All of which is beside the point, eh? I think Crichton is dangerous but not necessarily wrong. And I really like your face-hugger image!

tk, I'd go even further and say only people following the scientific methodology are scientists, whatever the others may call themselves.

On a side note, I wish I didn't have to post a full diary entry to respond personally to you folks. Is Advogato undergoing much development these days? It would be nice to add a comment ability.

--titus

14 Dec 2004 (updated 14 Dec 2004 at 07:23 UTC) »
haruspex and tk are getting personal, o my!

I hadn't seen Crichton's latest before today, but the Caltech talk I mentioned in my own little screed is available online.

Honestly, I'm not sure what to make of it all. I stand by what I said before: science -- not religion, nor "public policy debates", nor extreme left-wing environmentalism -- is going to give us facts. Crichton's attitude is that scientists are polarized towards the left and biasing their results & discussion in that direction, and this needs to be corrected. Unfortunately he fails to note that, historically, only scientists actually correct science; it may take a while, but the truth will out. This omission, combined with the prevailing political climate in our federal gov't, means that he is simply playing into the hands of people that are at least as illogical as (and much less interested in objective truth than) any scientist.

There's an interesting article in Science Magazine (the top journal in scientific research) that might be useful reading. E-mail me if you can't get access.

I have to agree with tk that the Greenpeace style of environmentalists are somewhat idiotic. There's a reason why China may establish the first really large-scale use of nuclear reactors -- which is probably the only medium-term hope for decreasing fossil-fuel use. It's sad that we have to choose between right-wing nutsos and left-wing nutsos on issues like this. (Side anecdote: a few years back, my advisor was in Germany. He saw a political protest against GM foods where the German Green Party was chanting "Food without genes!". Hmmm...)

You guys should both read Neal Stephenson's Zodiac. Fantastic book that spares no one -- and a crackin' good read, much like Snow Crash but without the long-winded ending ;).

Peace out,

--titus

Hey, berend, your "Mars global warming" reference was pulled by the paper that published it -- and o look, it's not been published anywhere else! Looks like the Denver Post got taken... but that's beside the point: it's not exactly hard to measure the Sun's energy output, and I'm pretty sure we'd have noticed if it was going up substantially. It <ahem> hasn't.

Gravity's just a theory, too.

Creationists and those who firmly believe climate change isn't driven by humans miss the point: science isn't about providing certainty. It's about providing uncertainty.

Take gravity. Gravity is something that we can observe pretty easily just by dropping an apple. We can note correlations (massier planets seem to have larger gravitational fields, for example). We can guess that, since the flux per unit area through the surface of a sphere decreases as the inverse square of the sphere's radius, gravity is subject to the inverse square law. We can even posit underlying mechanisms linking gravity to a specific particle, like the Higgs boson. What we can't do is prove that we understand how gravity works, except in terms of other theories (like particle theory and general relativity). We also can't guarantee that gravity functions the same way (or at all!) in places out of our direct experimental reach -- we can just show that the cosmological motions we see match our expectations were gravity to work the same.

These are the same objections that people bring to evolution and climatology: we don't understand much about the underlying mechanisms in either area. We can't show that the same rules that we see operating today are the rules that operated 2,000 or 4,000 or 500,000 years ago. We can say that what we see in the fossil record and among living organisms today strongly suggests a single common ancestor for all life on earth; but we can't rule out the theory that God created the earth 6,000 years ago, because we don't have any objective observers from that time. We certainly can't demonstrate that human activity has caused climate warming, although there do seem to be significant correlations between human activity and climate change. (Note that correlation does not imply causation, though.)

So, why is gravity undisputed (except by Flat Earth people)? And why are climate change and evolution such hot topics? I'm not sure, but I can suggest a few reasons.

Gravity is undisputed today partly because no religion has made the precise mechanism a point of recent dispute. It used to be in dispute, though; remember Galileo? That, ultimately, was a dispute about gravity on the scale of our solar system. Yet no edicts about the Higgs boson, or general relativity, have emanated from the Catholic Church, and Bush doesn't seem to care about gravity.

Another reason that people don't argue much about gravity is that the theory of gravity is predictive. Given a comet's position and momentum, we can tell you pretty much where it's going to go. It's a little harder in atmosphere, but we do it very well -- think ballistic missiles, for example. This predictive power goes a long way towards quieting dissent with the theory, because if you can predict something people will generally believe you understand it pretty well. (We'll come back to this.)

Evolution, for better or for worse, is not in the same position. It's a major point of dispute in at least a few places, and it's not predictive in the least. Even worse, it can't be very specific in predictions, because it's a stochastic theory that is subject to historical contingency. We will never be able to predict what mutations will arise randomly, and we will probably never be able to predict what effect those mutations will have on ecosystems. We might be able to predict general trends, but that is still far away from being an exact science.

Climatology is a much younger science than either the physics of gravity or the study of evolution. Like evolution, and unlike gravity, it seems to be very sensitive to certain kinds of perturbations -- that is, it's "chaotic". Very small changes may have large effects elsewhere. Moreover we don't understand many of the basic processes very well, and we don't have good ways to measure even relatively simple things like energy input from the sun, much less complicated things like CO2 consumption. Climatology is certainly not a predictive science in general, although some things can be predicted, just like in evolution: if you know where a hurricane is today, you can guess pretty well where it's going to be tomorrow.

Climatology is also a big point of contention for economic reasons: global warming, in particular. Corporations don't want to reduce the emissions of greenhouse gasses because they believe that it will have a negative economic impact on them. Therefore they (or their proxies) attack global warming as an unproven theory, in order to undermine its impact on public policy. As with the religiously motivated attacks on evolution, this is definitely bad for science.

If we could predict climate, or predict the effects of evolution, presumably people would regard these theories as being more credible than they are now. Unfortunately it's impossible to turn evolution into a predictive theory, and it's going to be a while before we get a predictive handle on climatology. So both theories are amenable to attack on the charges of being "unproven".

And here we come to the nut: the scientific method can't prove anything, in general. It is much, much better at disproving theories than it is at confirming them; any working scientist will agree with that! All that an honest scientist can say about gravity, or evolution, or global warming, is that they haven't been disproven yet. There are reasons to believe that gravity and evolution are pretty good theories, scientifically speaking, because they've withstood the test of time. I'm not very knowledgeable about climatology but I do know it's quite a bit shakier in its underpinnings. But attacking any of these theories for not having provided proof is missing the whole point of science, which is to disprove as much as possible.

People -- even many intelligent people who should know better -- frequently get this wrong. Michael Crichton, the prolific author of (among other books) Jurassic Park, gave an interesting lecture at Caltech where he talked about scientists' involvement in political debates on public policy. Nuclear winter and global warming were two examples where a strongly biased view has been pushed hard and publicly by a relatively small cadre of scientists. Crichton's view seemed to be that scientists were no less fallible than anyone else, which is undeniable (though unpopular among scientists ;). What he missed, and what I think many scientists fail to emphasize, is that thus far the scientific method -- with objective measurements and peer review, in particular -- is the only proven method of discovery known to mankind. We ignore it at our peril.

Scientists can do their part by proudly admitting ignorance. It's not pleasant, but it's undeniable: did you know, for example, that the underlying mechanism by which evolutionary novelty arises is still in dispute? Yep! We still don't really understand how new traits arise! And did you know that the precise reflectivity of the earth -- which is a major determinant of energy input into our climate, and is directly linked to the "greenhouse effect" -- is still not easily measurable? Yep! No long-term trends available! And these are just two things I've worked on -- I'm sure there's an ocean of ignorance out there, just waiting to be publicized. That's science!

The flip side of the coin is that those who critically examine scientific theories should apply the same level of critical analysis to their own beliefs. This applies to postmodern lit-crit as much as it applies to religious believers -- and I think it's as important as science is, as a method for making public policy.

Note to readers: I've been thinking about writing something like this for a while. It's an ongoing project, so please e-mail me at titus@caltech.edu if you have thoughts, criticisms, or suggestions.
