Older blog entries for titus (starting at number 25)

3 Jan 2005 (updated 3 Jan 2005 at 20:35 UTC) »
Is development of new software tools "moribund"?

I've always enjoyed Philip Greenspun's take on things; about 50% of what he says goes straight to the point. Unfortunately the other 50% seems to be complete crud. (I'm never sure how seriously he's taking himself, so it's hard to know if he's just trying to be thought-provoking. Even if not, separating out the gems from the crap is enjoyably difficult and hence worthwhile.)

His most recent weird post talked about a fairly simple Perl script to spam friends with invitations. In the comments, someone tweaked Philip about being a "Perl whore" despite his lauding of AOLserver, a Tcl-only (well, mostly) Web server. Philip responded that he had good reasons for liking AOLserver (connection pooling &c.), and since it happened to use Tcl, he used Tcl himself. Philip then felt compelled to say that he thought software tools were moribund, because his friends coding in .NET, his students working with Java, and people working with PHP weren't as productive as people who used AOLserver -- a technology designed over 10 years ago.

In the comments, I remarked that whinging about how Web software development tools haven't moved on since 1992 by citing Tcl, .NET, Java, and PHP was silly. Why not look at languages that support (and support well) a variety of techniques? Python (and by reputation Perl and Ruby) tools have come a loooong way in the last few years. In Python you can now choose between at least 5 different frameworks, all of them at least moderately mature. All of the popular Web programming techniques are represented between the various frameworks: the object publishing paradigm (Zope/Quixote), the object-kit templating paradigm (WebWare/CherryPy), and of course the utility belt that is Twisted.

The real question is, why not use Tcl, PHP, .NET, and Java for Web programming? It comes down to two conflicting considerations:

  1. the "each page is a class method" model encouraged by Java is overwrought for the simple pages that make up 20%-80% of any Web site (depending on the site, of course); it seems to be great for more complicated sites, though.

  2. the "each page is a string of spaghetti code" model encouraged by Tcl and PHP is great for the simple pages (20%-80% of any site) but disastrous for more complicated sites.

Or, to put it another way, straight scripting languages are great for constructing simple pages, while higher-level abstractions are needed for more complex pages. Most Web sites require both kinds of pages, but many languages do not support both kinds of programming well.

Python offers a great intermediate: it is a scripting language that deals well with outputting strings, but which offers nice higher-level ways of building out a framework. (This is likely true of other scripting languages that support object-oriented programming, such as Perl, [incr Tcl], Ruby, OCaml, ... what am I missing? VBA?)

Bas Scheffers pointed out (in the comments, again) that "it's not the tool, it's the programmer". Well, yes, this is the usual last resort of any language anti-advocate: heck, they're all identical at some level of abstraction, right? And you can write spaghetti code in any language, right? So how dare you say that language A is better than language B? Well, my experience (and that of other people you should trust more) tells me that Python is better than both Tcl and Java for many things. And so the question is, why? I'm sure there are many considerations that make sense to people, but one that I haven't seen mentioned before is how the examples are coded.

Since most programmers liberally "appropriate" code -- from books, from open-source programs, and especially from cookbooks and examples -- the quality of that code has a lot to do with the way they write programs. It also has a huge effect on the way they learn to code in the language. And I think this effect is grossly underestimated.

For Python, there's a nice tutorial, a variety of books (caveat: most of which I haven't read), and many, many examples from both the comp.lang.python newsgroup and the Python cookbook. By and large, the code contained in examples is clean, simple, short, and documented. It also lives at a level of abstraction that fits: classes/objects used when appropriate, and not used when not appropriate. Yes, there's plenty of gunk in Python, but it's not what you first encounter & it's easy to find pretty examples.

For other languages, example code is often much nastier. Spaghetti-code style may not be required by Tcl and PHP, but most of the example code I've seen can't be classed any other way. Overwrought object-oriented gunk may not be required by Java, but most of the example code I've seen certainly fits the description.

It could be that simple: if you write uncomplicated yet useful code examples, and people learn your language from them, then in the end your language will be used to write prettier code. And there will be less unmaintainable spaghetti code, or ugly overcomplex OOgunk, written in that language. And, if you're lucky & you've figured out how to grow your language over time with the help of the community, your language will continually move towards supporting that kind of coding.

Why Python examples are pretty is a different question, and the answer may be more sociological than technical. Perhaps it's as simple as Python fitting my brain better than other languages do, and it's not true for everyone else. Or, perhaps it's more than that -- for example, our beloved BDFL may be particularly good at designing a certain type of language.

In the end, supporting good mechanisms of abstraction may be necessary for good programming, but it is obviously not sufficient. It doesn't do much good to have a language that supports a bunch of mechanisms that don't cleanly fit into example code. Nor will it do the language any good in the long run if the example code is poorly constructed.

So, I don't think that the development of software tools is moribund -- but it may be time to move on from Tcl for doing Web programming.

Philip Greenspun's company died a horrible death several years ago partly because they were trying to transition a hideously complex mess o' Tcl into Java. I wonder what would have happened if they'd chosen to use something more scriptable than Java but a little more supportive of abstraction than Tcl?

--titus

p.s. I'm pretty supportive of using Java for other things, such as GUIs. I'm just not smart enough to make it do data reduction fast... so I switched to C++ / FLTK.

28 Dec 2004 (updated 28 Dec 2004 at 23:31 UTC) »

Odds and ends today...

PBP & SF hackage

SourceForge announce lists full of spam? Try this PBP script, with the rematch.py extension:

pyload rematch.py

go http://lists.sourceforge.net/lists/admindb/listname
fv 1 adminpw pass
submit 1

do set_match --form 1 "\\\\d+$" 3
submit 1

'twill flush out all messages in your queue...

Decorators in Quixote

Kevin Dangoor queried about decorators in Quixote, and here are implementations of his two suggested decorators. They seem to make sense syntactically.

First, one to restrict access to the decorated function to logged-in users:

from quixote.errors import AccessError

def require_login(func):
    """
    decorator: require login to run decorated function.
    """
    def wrapper(request):
        if not request.session.user:
            raise AccessError("you must be logged in!")
        return func(request)

    return wrapper

Use like so:

@require_login
def func(request):
   ...
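To see the decorator do its job without a running Quixote app, here's a rough, self-contained sketch. The AccessError class, FakeSession, FakeRequest, secret_page, and try_page are all stand-ins I made up for the demo (the real AccessError lives in quixote.errors); only the require_login logic follows the code above.

```python
# Stand-ins so the sketch runs on its own; in a real application
# AccessError would be imported from quixote.errors.
class AccessError(Exception):
    pass

class FakeSession(object):
    # hypothetical session object with just a 'user' attribute
    def __init__(self, user=None):
        self.user = user

class FakeRequest(object):
    # hypothetical request object carrying a session
    def __init__(self, user=None):
        self.session = FakeSession(user)

def require_login(func):
    """decorator: require login to run decorated function."""
    def wrapper(request):
        if not request.session.user:
            raise AccessError("you must be logged in!")
        return func(request)
    return wrapper

@require_login
def secret_page(request):
    return "hello, %s" % request.session.user

def try_page(request):
    # turn the exception into a string so both outcomes can be printed
    try:
        return secret_page(request)
    except AccessError as exc:
        return "denied: %s" % exc

print(try_page(FakeRequest("titus")))   # hello, titus
print(try_page(FakeRequest()))          # denied: you must be logged in!
```

The wrapper only checks the session and then hands the request straight through, so the decorated function never has to know about logins at all.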

A slightly more complex case follows: here are two functions to export names for publication by Quixote.

def export(func):
    """
    decorator; export decorated function under its __name__.
    """
    _q_exports = func.func_globals['_q_exports']
    _q_exports.append(func.__name__)

    return func

def export_names(*names):
    """
    decorator; export decorated function under all given names.
    """

    # build a new function to return; this is what will be called on
    # the following function.
    def export_func(func, names=names):
        _q_exports = func.func_globals['_q_exports']
        for name in names:
            _q_exports.append((name, func.__name__,))

        return func

    return export_func

Use these like so:

@export
def func(request):
   ...

@export_names("name1", "name2")
def func2(request):
   ...

It's a little bit irritating that you have to grab _q_exports from func_globals but *shrug* that's scoping for ya! (If you don't do this, then you can't import the decorators from another module.)

--titus

Web browsing via Python, refreshed.

The saga continues... after many patches, and diversions, and attempts to grok, I finally got PBP to work for a variety of purposes. In particular, I can now:

  • Use maxq to record browsing sessions to PBP scripts, and run those scripts successfully in PBP;
  • Use an only slightly altered HTMLParser.HTMLParser class to parse the moderately crappy SourceForge/mailman output;
  • Grok RFC 2965 cookies well enough to actually log into and play with mailman admindb pages;
  • Save, load, and view cookies in PBP.

I've submitted a bunch of patches to PBP & hopefully Cory can take a look at them in the next few weeks.

HTML/HTTP/URL support in Python's base is surprisingly messy, given my past experience with Python modules. I look forward to the day when Python 2.4 or above is the standard and 2.3 is no longer supported -- it will make cookie and URL handling much simpler!

I may spend some quality time refitting the HTMLParser classes in the htmllib and HTMLParser modules; as-is, they don't have good failure modes.

--titus

"He realized the fastest way to change is to laugh at your own
folly -- then you can let go and quickly move on."
-- via Dossy Shiobara

20 Dec 2004 (updated 28 Dec 2004 at 23:10 UTC) »
The joys, trials, and tribulations of open-source work

This is becoming a bad joke.

The story starts innocently enough: when playing with PBP, I found a bug in Python's sgmllib.py. This bug caused PBP (via ClientForm) to fail on my test case, the Quixote demo application. So I went and fixed that bug and submitted a patch to the Python developers.

Then I diverted myself and decided to take up the gauntlet thrown by Martin v. Löwis, in order to see how quickly I could get my patch reviewed. I proceeded to work through first one and then nine add'l Python patches. My python-dev post still hasn't gone through so nothing further has happened there.

Today I put myself back on track and spent some time with maxq, a Java HTTP proxy recording system that outputs Jython code tracking a Web session. I added a script generator module for PBP to maxq so that it would output PBP scripts as well.

In the process I ran across two additional problems in PBP. First I found a bug in the way PBP used shlex (PBP patch 274). Then maxq's behavior of recording all form variables, even hidden ones, led me to discover that PBP crashed when trying to change hidden variables. I changed this to a warning (PBP patch 273). (Not strictly a PBP problem, but the crash behavior was probably inappropriate.)

So, to try out one package (PBP) and modify another to fit my needs (maxq), I ended up crawling through a bunch of nifty packages (mechanize, ClientForm, and four or five Python modules I'd never used before), submitting three different patches to two different projects (not including revised patches I contributed to Python), and writing a small chunk of new code in Java. whee!

The worst part is that one of the PBP patches I'm submitting contains this:

...See http://issola.caltech.edu/~t/transfer/quixote-demo.pbp for an
operative example, although you will not be able to run it without
changing ClientForm.py to use the XHTMLCompatibleFormParser, because
of a bug in sgmllib.py (Python patch 1087808).

Yep -- to test this patch, go apply this other patch to the language you're using...

O well. At least now I can get around to the original purpose of writing a testing suite for Cartwheel.

My wife and cat are upset with the amount of time I've spent on this, that's for sure... ;) Of course, my wife doesn't get a vote because she went skiing today while I abused sea urchins. (The cat doesn't get a vote either: he sleeps all day.)

On the plus side, at least I could fix the software myself, so everything works now. I'd be SOL if any of this stuff had been closed-source.

Continued karma-grubbing.

My advice for the Python patches neophyte: start high. I foolishly started in the low-numbered patches (#755660) & spent over an hour trying to figure out the code (which was easy) and the various patches and counterarguments and ... That part was less easy and nowhere near as pleasant. Folks, there's a reason why those patches and bug reports have been lying around for over a year -- no one else wants to deal with them.

Another piece of advice is to stick with modules you already know about or are interested to learn about. In my case I'd just gone through the HTMLParser stuff to find the sgmllib bug, so I decided to focus on HTML/CGI/URL stuff.

Anyhoo, worked my way through 9 add'l patches. Great! Together with my comment on patch 755660, I ran through 10 different patches. I may now pray to The Gods That Be to review patch 1087808, my sgmllib fix.

Here's a list:

  • patch 1055159, a simple docstring/doc patch to CGIHTTPServer. (This was really a documentation "enhancement request" masquerading as a patch, so I verified the behavior described & then wrote the doc string appropriately.)

  • patch 755670, a patch to make HTMLParser parse invalid HTML. Recommended not applying it.

    (I also put in a comment that was meant for a different patch (755660). Very embarrassing. Sigh. Tabbed browsing is dangerous in my hands. ;)

  • patch 1037974, a patch to fix HTTP digest authentication when accessing LiveJournal feeds (and any other feeds that don't listen to RFC 2617, which considers 'Algorithm' notification optional).

    Had to sign up for a livejournal account to test this, too...

    This is my first time dealing with digest authentication, but as I understand it the new behavior is merely verbose and a bit redundant. It certainly shouldn't break anything. Recommend apply.

  • patch 1028908, a bunch of small stuff by John J. Lee of wwwsearch. Apparently much of the code he modified was originally written by him anyway (?) and the regression tests passed, so *shrug*... recommend application.

  • patch 901480, fixing bug 735248. This fixes a bug in the way urllib2.parse_http_list parses unquoted elements. Recommend application, although I submitted a slightly modified patch (against the current CVS) that fixed a doctest string.

  • patch 827559 fixes SimpleHTTPServer to add a trailing '/' to directory names. Recommend application; it does fix the behavior, and I think it's a reasonable way to treat directories, too.

    An analogy: Links to 'http://some.place.or.other:port' get rewritten to 'http://some.place.or.other:port/', so links to '/this/dir' should get rewritten to '/this/dir/'.

  • patch 810023, a very nice patch to fix some reporthook behavior in urllib. The submitter, J. Lewis Muir, wrote regression tests to show that his new urllib worked, so testing this one was pretty easy. Recommend apply, since the current source tree fails his regression tests and the patched one passes.

  • patch 893642, which adds an optional allow_none argument to SimpleXMLRPCServer and classes that use it. I updated the patch & added some documentation.

  • patch 1067760, to bug 1067728 (closed). This patch changes the behavior of seek() on file objects to do float --> long conversion instead of float --> int conversion. This allows 2.0**62 to be used in a seek, just like 2 ** 62. I recommended applying it because it shouldn't lead to new bugs. I'm probably missing something.

Spent a bit of time on Python bugs today, as per Martin v. Löwis's suggestion -- building up karma for my own (5 line) patch to sgmllib ;). Took a lot more time than I thought, sigh; I only took a look at one. Here are my notes:

patch 755660, fixes bug 736428. Comments on bug 917188 (closed) are relevant. May also fix or at least allow amelioration of behavior in bug 683938 (assigned to frdrake) and bug 699079 (closed).

I don't understand why in the comments on bug 736428 it says that the "patch in bug 917188 (closed) may be better" because there's no patch attached. Perhaps kingwood means that you need to pay attention to markupbase.py, too?

My comments:

This patch allows developers to override the behavior of HTMLParser
when parsing malformed HTML.  Normally HTMLParser calls the function
self.error(), which raises an exception.  This patch adds appropriate
return values for situations where self.error has been redefined in
subclasses to *not* raise an exception.

It does not change the default behavior of HTMLParser and so presents no backwards compatibility issues.

The patch itself consists of an added comment and two added lines of code that call 'return' with appropriate values after a self.error call. Nothing wrong with 'em. I can't verify that the "junk characters" error call will leave the parser in a good state, though, if execution returns from error().

The library documentation could be updated to reflect the ability to override error() behavior; I've written a short patch, available at

http://issola.caltech.edu/~t/transfer/HTMLParser-doc-error.patch

More problems exist with markupbase.py, upon which HTMLParser is based. markupbase calls error() as well, and has some stickier situations. See comments in bug 917188 as well.

Comments in 683938 and 699079 suggest that raising an exception is the correct response to the parse errors. I recommend application of the patch anyway, because it (a) doesn't change any behavior by default and (b) may solve some problems for people.

An alternative would be to distinguish between unrecoverable errors and recoverable errors by having two different functions, e.g. error() (for recoverable errors) and _fail() (for unrecoverable errors). By default error() would call _fail() and internal code could be changed to call _fail() where recovery is impossible. This might alter behavior in situations where subclasses override error() but then again that's not legitimate to do anyway, at least not at the moment -- error() isn't in the docs ;).
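A toy sketch of that error()/_fail() split, to make the idea concrete. This is not HTMLParser or markupbase code: ToyParser, LenientParser, ParseError, and the "!junk"/"ABORT" conventions are all invented for illustration.

```python
class ParseError(Exception):
    """Raised when parsing cannot continue."""

class ToyParser(object):
    """Toy word 'parser'; a stand-in, not the real HTMLParser."""

    def __init__(self):
        self.words = []

    def _fail(self, message):
        # unrecoverable errors: always raise
        raise ParseError(message)

    def error(self, message):
        # recoverable errors: by default, treated as failures too
        self._fail(message)

    def feed(self, data):
        for word in data.split():
            if word == "ABORT":
                # nothing sensible can be done: bypass error() entirely
                self._fail("cannot continue past %r" % word)
            elif word.startswith("!"):
                self.error("junk characters: %r" % word)
                continue  # only reached if a subclass made error() return
            self.words.append(word)

class LenientParser(ToyParser):
    """Subclass that records recoverable errors instead of raising."""

    def __init__(self):
        ToyParser.__init__(self)
        self.warnings = []

    def error(self, message):
        self.warnings.append(message)

lenient = LenientParser()
lenient.feed("a !b c")
print(lenient.words)          # ['a', 'c']
print(len(lenient.warnings))  # 1
```

With this layout the default behavior is unchanged (every error still raises), but a subclass that overrides error() can't accidentally swallow the cases where the parser really has no way forward.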

If nothing is done, at least close patch 755660 and bug 736428 with a comment saying that this behavior will not be addressed ;).

Python problems: sgmllib/htmllib vs HTMLParser

While playing with PBP, I noticed that tag attributes weren't being correctly parsed. For example,

<option value="Small (10&quot;)"> Small (10&quot;)

was coming through as

<option value="Small (10&quot;)"> Small (10")

This caused problems in two areas: first, trying to set the value of the associated select widget failed unless the entity-encoded string was used (Small (10&quot;) instead of Small (10")). This in turn caused problems on submission of the form to the Web server, because the value was encoded once more for HTTP transmission. cgi.FieldStorage would decode it on the server side and set the select widget value to Small (10&quot;). So overall badness happened on both client and server sides.
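The round trip can be sketched without any of the real libraries involved. Assumptions: this uses the Python 3 spellings urllib.parse.urlencode and urllib.parse.parse_qs as stand-ins for what the client and cgi.FieldStorage do, and the field name "size" and the server_sees helper are made up.

```python
from urllib.parse import urlencode, parse_qs

# What the unfixed parser hands back for the option's value attribute...
option_attr = 'Small (10&quot;)'
# ...versus what a real browser would submit for the same option.
browser_value = 'Small (10")'

def server_sees(value):
    # client side: encode the form field for HTTP transmission
    body = urlencode({"size": value})
    # server side: decode it again, roughly as a CGI library would
    return parse_qs(body)["size"][0]

print(server_sees(browser_value))  # Small (10")
print(server_sees(option_attr))    # Small (10&quot;)
```

URL encoding and decoding cancel each other out, so whatever string the client starts with is exactly what the server ends up holding. If the entity was never unescaped on the client, the server gets the literal &quot; text.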

I dug deeply into PBP, which led me to mechanize, which in turn led me to ClientForm, which led me to htmllib.HTMLParser. The trail finally ended in sgmllib. Long story short: there are two HTML parsing classes in Python, htmllib.HTMLParser (derived from sgmllib.SGMLParser) and HTMLParser.HTMLParser, which is more-or-less standalone. mechanize can use either, but prefers htmllib because it is present in older versions of Python. And here's the essential clue: the problem goes away if you switch to using HTMLParser.HTMLParser instead of htmllib.HTMLParser.

Once I figured this out, the root cause was easy to find: sgmllib.SGMLParser (and therefore htmllib.HTMLParser) does not unescape tag attributes, while HTMLParser.HTMLParser does. Oddly enough it doesn't use handle_entityref to unescape tag attributes; it uses string.replace to handle a small number of specific entity refs. I'm not sure if this is correct, but it's easy to move the same code over to sgmllib.py.
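As a standalone sketch, the replace-based approach looks something like this. The five named entities are the ones I'd expect from HTMLParser's helper of that era; treat the exact set as an assumption rather than gospel.

```python
def unescape(s):
    """Replace a handful of named entities with their characters."""
    if '&' not in s:
        return s
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    # &amp; must go last, otherwise "&amp;lt;" would first become
    # "&lt;" and then wrongly get turned into "<"
    s = s.replace("&amp;", "&")
    return s

print(unescape('Small (10&quot;)'))  # Small (10")
print(unescape('&amp;lt;'))         # &lt;
```

It's crude (no numeric character references, no other named entities), but it covers the characters that have to be escaped inside attribute values.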

The diff to sgmllib.py is below. It's pretty small; I'll send it out to the comp.lang.python newsgroup and see what people think, before I waste the time of Python maintainers specifically. It sure is nice to dig deeply into the code and find such a simple fix ;).

--- sgmllib.py  2004-12-16 23:30:51.000000000 -0800
***************
*** 272,277 ****
--- 272,278 ----
              elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                   attrvalue[:1] == '"' == attrvalue[-1:]:
                  attrvalue = attrvalue[1:-1]
+                 attrvalue = self.unescape(attrvalue)
              attrs.append((attrname.lower(), attrvalue))
              k = match.end(0)
          if rawdata[j] == '>':
***************
*** 414,419 ****
--- 415,432 ----
      def unknown_charref(self, ref): pass
      def unknown_entityref(self, ref): pass

+     # Internal -- helper to remove special character quoting
+     def unescape(self, s):
+         if '&' not in s:
+             return s
+         s = s.replace("&lt;", "<")
+         s = s.replace("&gt;", ">")
+         s = s.replace("&apos;", "'")
+         s = s.replace("&quot;", '"')
+         s = s.replace("&amp;", "&") # Must be last
+ 
+         return s
+ 

class TestSGMLParser(SGMLParser):

g'nite.

--titus

16 Dec 2004 (updated 16 Dec 2004 at 09:20 UTC) »
dcoombs -- have you tried NJAMD? I've had moderately good luck with it...

Testing Web sites

Revisited Cory Dodt's Python Browser Poseur (PBP) today. This is one of those projects that frequently pops into my head as something worth investigating, but I've never actually looked at it seriously. (And the last time I looked, there were still some fairly obvious broken bits that prevented me from making use of it -- but there's a new version...)

PBP is the best (simplest + easiest) way I've seen to test dynamic Web sites. It's based on mechanize, a Python version of WWW::Mechanize, and it provides a simple scripting ability to automate Web site browsing. Even someone without extensive Python experience can write scripts for it, which is an advantage for groups that aren't all programmers. I haven't tried extending it but I doubt it's that difficult; the package code looks clean & is relatively short.

PBP is relatively simple, at least on the surface: here's the example script from their site.

go http://mailinator.com
code 200
find "property of Outsc.*me"
showform
formvalue search email pbp.berlios.be
submit search
code 200
find "NO MESSAGES"

When executed by pbpscript, this script goes to mailinator.com, searches for the regexp "Outsc.*me" (which matches "...is property of Outscheme, Inc"), and then checks e-mail for the pbp.berlios.be@mailinator.com e-mail address. If there are any messages, the script fails. (Try changing '.be' to '.de' if you want to see -- sorry, I screwed up the example on the Web page by stupidly sending e-mail to pbp.berlios.de@mailinator.com.)
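For anyone who thinks better in Python than in script commands, here's roughly what the 'code' and 'find' steps amount to. This is emphatically not PBP's implementation: FakeResponse and both helper functions are hypothetical, and the page bodies are canned strings so that nothing touches the network.

```python
import re

class FakeResponse(object):
    # hypothetical stand-in for a fetched page
    def __init__(self, status, body):
        self.status = status
        self.body = body

def code(response, expected):
    # like 'code 200': fail unless the HTTP status matches
    if response.status != expected:
        raise AssertionError("expected HTTP %d, got %d"
                             % (expected, response.status))

def find(response, pattern):
    # like 'find "..."': fail unless the regexp matches somewhere in the page
    if re.search(pattern, response.body) is None:
        raise AssertionError("pattern %r not found" % pattern)

# canned front page and canned empty inbox
front_page = FakeResponse(200, "...is property of Outscheme, Inc")
code(front_page, 200)
find(front_page, "property of Outsc.*me")

inbox = FakeResponse(200, "NO MESSAGES")
code(inbox, 200)
find(inbox, "NO MESSAGES")

print("script passed")
```

Each command is just an assertion against the current page, which is why the script fails as soon as the inbox contains anything other than "NO MESSAGES".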

This is cool.

I'm puzzled that neither mechanize nor PBP is better known (as in, I haven't seen either mentioned anywhere but in c.l.p.announce on the occasion of a new release). I don't monitor freshmeat, which is another place it's been posted. Apart from that, a Google search doesn't turn up much mention. Am I missing a wealth of similar software that is better? What do people use to test Web sites, anyway? A currently-defunct list archive has a reference to HttpUnit, which is a nice-looking Java framework. Unfortunately I doubt it's as Python-extensible as PBP ;). John Lee of mechanize also points out webunit, by Richard Jones (also author of Roundup). I may have to take a look at that. Anything else?

In the interests of exercising PBP a bit, I wrote a simple PBP script (note: transient link) to run through my WSGI adapter interface for the Quixote demo. You can try it out if you want; the site just runs quixote.demo through a CGI-->WSGI -->QWIP bridge. And yes, it's veeeeeeery slow.

I ran into only one real problem with PBP: HTML encoded form values. In the Quixote widget demo, there's a select widget that takes pizza sizes with inch units, e.g. 'Medium (10")'. The mechanize ClientForm is returning this in HTML-encoded form, 'Medium (10&quot;)', and PBP demands that it be set to this value. However, Quixote barfs on this because it is expecting 'Medium (10")' -- which is in fact what Quixote sees from browsers. There may be some invisible layers of encoding/decoding going on; Quixote uses cgi.FieldStorage which presumably decodes a properly-encoded string from the browser. I think the appropriate thing to do here is to change mechanize's behavior, but I will ask Cory what he thinks first; I haven't dealt with this aspect of HTML forms before, having been spoiled by nice libraries ;).

Next I'll have to try extending PBP from Python & vice versa. Anon...

--titus

"Consider the situation of two trauma surgeons arriving at an accident scene. The patient is bleeding profusely. If surgeons were like programmers, they'd leave the patient to bleed out in order to have a really satisfying argument over the merits of two different kinds of tourniquet." -- Philip Greenspun.

16 Dec 2004 (updated 5 Jan 2005 at 00:43 UTC) »

a whole entry vanished!

14 Dec 2004 (updated 14 Dec 2004 at 18:49 UTC) »
haruspex... the problem is that the right-wing nutsos are saying "let's burn all the oil before we do anything else" and the left-wing nutsos are saying "nothing but non-nuclear renewable energy will do", leaving me with the currently unimplementable centrist view of "let's switch over to nukes, while exploring alternative energy options and weaning ourselves from fossil fuels".

Unfortunately for Americans (I'm in the US) where we have these things called "elections", we do often have to choose between only two real options. In this case I'm not even sure what Kerry's standpoint was on the environment, but I didn't like his views on the Patriot Act (he voted for it) or his views on the Iraq invasion (he voted for giving Bush the power to do it & then made a U-turn for political reasons -- which inconsistency I despise). I still voted for him because I despised Bush more ;).

So while I agree with you in theory, in the real world it's different. I can't stand Bush, but I also find most of his real political opposition to be anti-reason. Who do I support? Kucinich? Or Dean? Or Al Sharpton, who makes an awful lot of sense? I would have voted for McCain just based on consistency, but Bush managed to torpedo him in 2000... so I'm left with whomever the Democrats support. Which unfortunately was Kerry.

All of which is beside the point, eh? I think Crichton is dangerous but not necessarily wrong. And I really like your face-hugger image!

tk, I'd go even further and say only people following the scientific methodology are scientists, whatever the others may call themselves.

On a side note, I wish I didn't have to post a full diary entry to respond personally to you folks. Is Advogato undergoing much development these days? It would be nice to add a comment ability.

--titus

