Older blog entries for titus (starting at number 30)

11 Jan 2005 (updated 11 Jan 2005 at 08:49 UTC) »
paircomp 0.9 (rc)

hooray. docs, tar.gz. It's only a small & simple comparative sequence analysis library for DNA, but it's been broken for about a month. (Asinine data structure, refactored yo' ass.)

Summary: complete reimplementation, now with regression testing. C++ library completely rewritten using the STL. 95% of the code is now tested via the Python API in one big mongo test script.

One more brick in the wall...

Ryan Tomayko comments on GPL vs the Python community. My work-related libraries are LGPLed, and my GUIs and Web interfaces are GPLed. Why? I'm an academic programmer, and my code is owned by Caltech. Neither I nor Caltech depend on income from these programs. However, I do intend to take them from job to job, and the GPL protects that. The L/GPL also protects Caltech. win/win.

--titus

9 Jan 2005 (updated 9 Jan 2005 at 18:38 UTC) »
Testing is addictive

After my many travails with PBP/maxq/sgmllib/HTMLParser/htmllib I finally sat down to work on my actual Web application, Cartwheel.

Cartwheel is a bioinformatics system that lets biologists upload sequences, analyze them, and export their analyses to a GUI, FamilyRelations II. Cartwheel itself is entirely written in Python, and FRII is written in C++ using FLTK -- a fine combination so far. I use a simple XML-RPC API to export the data & most of the internal communication between Cartwheel components is done in PostgreSQL. The system as a whole has been used by a few hundred people to do bioinformatics work and in general it's fairly robust. It's been around for several years, and I'm pretty much the only steady developer.

Normally I test Cartwheel's Web interface by roaming around in it with a Web browser and paying special attention to things I've changed. Until recently, I had no automated way to test it, and so in general I've been assuming it's mostly ok if no users yell at me after I post an update. (Known as the "Microsoft test method"... ;)

The mass of little bug reports recently reached a critical point, and so I started to fix them today. One of the problems had to do with some naively implemented search code in the Web interface, and so I set out to define the problem by changing the names of a bunch of the form variables. (This also fixed the bug, which tells you something about the code...) I had to edit files all over the place & quickly lost track of what code still needed to be patched.

So, I backed out all of my changes & used maxq to record a Web session that ran through all of the places where the search functionality was used. I saved the resulting PBP scripts and broke them into setup, test, and teardown scripts. I then went through the code base, made my changes, and re-ran the tests and fixed bugs caused by oversight until the tests all succeeded.

Another cool thing is that with the scripts separated into setup, tests, and teardown categories, I can also test my database export and import code quite easily:

setup-test-db
run-all-tests

export-db clear-db import-db

export-db clear-db import-db

run-all-tests

With only a few assumptions about what the setup script does to the DB (basically, that it's complete), this will tell me if my import/export scripts are catching everything.

Overall I probably spent about 3x the amount of time necessary to fix the bug on generating the tests in this manner, but now that I have a (simple) framework set up to do it, it should go faster... One thing is for sure: writing PBP tests without maxq would be painful!

A while back I asked about other Web testing tools, and John J. Lee recently responded with this link: http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html. The Zope link is buggered, but overall I get the impression that there simply aren't many general Web testing tools for Python.

A few other Web testing link collections are on java-source.net and c2.com. JJL also pointed me towards opensourcetesting.org.

I'm interested in finding out about others, please let me know if you find any.

--titus

8 Jan 2005 (updated 9 Jan 2005 at 02:19 UTC) »
"Batteries included" may be a bad term

After some of the recent types kerfuffle, I decided I didn't like the "batteries included" description of Python. Let me explain.

Python-the-language is great, and fits most of my needs. Heck, it's done that since before 2.0 came out. It's always nice to discover that some nifty new feature (like list comprehensions) has been added, and often I can take advantage of them to write tighter/neater/more understandable code. Nonetheless, I'm fundamentally a systems & application programmer, and I don't generally write language frameworks, toolkits, etc., so these new features aren't usually all that critical to my work.

What is critical is the fantastic standard library that comes with Python. It has been steadily expanded in the years that I've been a Python programmer, to the point that when I'm looking around on the 'net for some functionality it's already in the Python lib more often than not. This saves me a lot of time & trouble installing new products. (I've also said that I think that the state of the included Python code, as well as the quality of the documentation & example code, plays a big part in educating new Pythonistas about acceptable coding practices with Python. The Python cookbook is a particularly nice collection of code, too. These both play a big role in the success of Python, IMO.)

So, batteries are included, if you think of the library as the "batteries" -- something essential to function, yet not terribly interesting in its own right. Looking at that description of batteries, though, it sounds more like it fits my conception of the Python language, though. After all, the language that things are written in is essential, yet not always all that interesting; doing things in the real world is more interesting. For that you usually need to interact with users or protocols, and that means some sort of interface, and ... you end up with a library of code that enables a lot of standard interactions. In Python, that code is distributed with the interpreter, which enables a stunningly wide range of functionality in your basic Python installation.

In other words, I think the included library & modules are more interesting than the word "batteries" might imply ;).

There may be a deeper schism lurking behind the recent decorator & types controversy: language vs tools. It's very sexy to work on extending the language: there are plenty of deep problems to be tackled, much thought must be expended, and only really really smart thoughtful people can do it well. As Iwan van der Kleyn pointed out, though, there are other things to do that will extend Python's reach, power, and utility. It might be worth the core team having a look at those, if their goal is to advance Python as a whole. If, instead, the goal is to advance Python-the-language, then I think they're doing a fantastic job of it & should keep it up. Personally I'm more worried about the future of the library.

UPDATE: AMK posted about this months ago, and got a lot of responses. I think that's what got me started...

And, before the naysayers get up & yell at me for my multitude of sins, I do have some specific work and proposals in mind. It's hard to grasp the library as a whole, though. (It's easier to write a criticism than it is to do the work, too, that's for sure.) Watch This Space.

As for the long-term future of Python, Guido's first priority clearly needs to be to grow a beard.

--titus

Python's future -- language, library, or tools?

There's been quite some kerfuffle about Guido van Rossum's proposal to add optional static typing to Python. In particular, Iwan van der Kleyn pointed out that Python seems to be caught between language extenders and application programmers, and that at the moment the language extenders are winning. Iwan points out that for application programmers, there are several library or tool improvements that are probably as, or more, important than any language addition. While I don't agree with all of his feature/app requests, I do agree with the general idea that Python would benefit hugely from investment in something other than optional static typing. Or decorators. Or (insert feature of the week).

There was an ACM Queue article a while back (if someone has the link, I'd appreciate it!) that discussed the divide between language power users & tool power users. As I recall, the article described the split between developers who invest time in learning new and more powerful languages, and developers who invest time in learning IDEs and language-specific tools. What seems to be happening in Python is that GvR & the core team consist of very smart people who believe the language should be extended -- and are busy doing that. IDEs and tool extension is left to others & their efforts are patched into the library system where appropriate.

The real question about language extensions like optional type checking and decorators is how contaminatory they are. Istvan Albert believes that "optional" type checking will soon become effectively non-optional; while he mentions only the social (workplace) pressures, I'm more worried about whether or not the language will require it in places where I don't care to use it. If a library module uses type checking, will any code that uses that library module be required to adhere to the type behavior specified?

I can see three scenarios.

The first scenario is that any code that uses type-checked code will be required to adhere to the type spec. In the medium term, type checking will become mandatory throughout the codebase because people will start using it in the standard library.

The second scenario is that "duck typing" will magically solve the problem & allow untyped code to use typed code. This seems to me to be the best option, and I bet it's what GvR has in mind.

The third scenario is that there will simply be command-line arguments to Python (or internal configuration directives) that disable type checking completely, allowing schlubs like me to go about our day and ignore large portions of the language.

I'd be ok with either #2 or #3. #1 sounds like a disaster to me. Either way, I expect a large amount of energy to be expended on this set of features that, frankly, is not so important to me. (Maybe it should be, but I'm not betting on it... I like Bruce Eckel's take on static typing.)

Regardless, there's the question of whether the core team is focused too much on language development. Is the relentless drive for new, immensely powerful language features detracting from the stability and usability of the language?

It may be time to consider how to "fork" Python development so that it works more like Linux. Why not keep a 2.x branch going that focuses on long-term stability & library expansion, while building a 3.x branch with new language features but fairly complete backwards compatibility? In my traipses through the library it doesn't seem like the library really takes advantage of the new language features in 2.4, which (if true) in itself is a telling trend that suggests that language development and library development are orthogonal. Even if they aren't orthogonal, maybe they could be made more so, and this could be a worthwhile endeavor.

The balance between language stability and new features seems like it's tough to maintain. GvR et al. have done a fantastic job so far. Here's hoping they can keep doing a fantastic job!

--titus

4 Jan 2005 (updated 5 Jan 2005 at 00:41 UTC) »
OCaml and string theory -- how cool is that?!

Open-source vs closed-source Web dev

On our trip to Mammoth over New Year's, we had an interesting discussion about Web development frameworks, specifically open-source vs closed-source. One of the things that made it especially interesting was that apart from me & my wife (your standard OSS advocates) there was an MBA student & software developer as well as an engineer working in management at a construction company. So rather than discussing things with the usual group of my academic friends who buy into OSS by default, I had to actually come up with business arguments for OSS Web frameworks ;).

There were a two requirements that came up:

  • The engineer wanted accountability: if the company they hired to build a Web site or CMS defaulted or screwed up, he wanted to be able to sue them and/or recover their assets. Beyond that, he didn't really care whether or not they used OSS.

  • The MBA/programmer wanted better overall documentation. His experience with the Java-based ArsDigita CMS that RedHat bought was extremely frustrating because the documentation was essentially nonexistent. I and others have had similar experiences with Python CMSs.

Beyond these two considerations, I don't think either one had a problem with open-source vs closed source, which was nice to hear. In particular, there was no mention of OSS lacking features, security, or stability -- it seems like FUD on this account isn't reaching these two people.

My responses were basically this:

  • Re accountability, it seems to be difficult to recover assets no matter what. Large companies (e.g. Oracle and MS) that sell software tend to have EULA that nix accountability at this level. Consulting companies are either heavyweights that have lots of lawyers (e.g. IBM) or small businesses that have no money anyway.

    They both more-or-less agreed to this point & redefined their requirements to be that there was a company, with a reputation to manage, behind the software. When I pointed out that there were now many companies that were building on OSS, they shrugged & said that would be fine.

  • On the better documentation issue I agreed but pointed out that (a) a lot of closed-source software doesn't have great documentation either, and (b) most companies aren't going to care what documentation there is for their CMS if they have a consulting firm running it. Since it's kinda silly for a small non-IT firm to do more than manage their content, this point doesn't really matter for the end-user.

I added that one strong reason to go with an OSS Web CMS supported by multiple companies is company failure. A continual problem (exemplified by ArsDigita) is that companies go out of business & leave their customers high & dry. Maybe the base CMS is stable enough to work without modification, but any future customization or extension will be impossible with a closed CMS if the owners go out of business. With Zope or Plone, other companies (or even individual consultants) can pick up the pieces, if the company you had hired fails.

This seemed to resonate with both of our friends.

Anyway... 'nuff for now.

--titus

3 Jan 2005 (updated 3 Jan 2005 at 20:35 UTC) »
Is development of new software tools "moribund"?

I've always enjoyed Philip Greenspun's take on things; about 50% of what he says goes straight to the point. Unfortunately the other 50% seems to be complete crud. (I'm never sure how seriously he's taking himself, so it's hard to know if he's just trying to be thought-provoking. Even if not, separating out the gems from the crap is enjoyably difficult and hence worthwhile.)

His most recent weird post talked about a fairly simple Perl script to spam friends with invitations. In the comments, someone tweaked Philip about being a "Perl whore" despite his lauding of AOLserver, a Tcl-only (well, mostly) Web server. Philip responded that he had good reasons for liking AOLserver (connection pooling &c.), and since it happened to use Tcl, he used Tcl himself. Philip then felt compelled to say that he thought software tools were moribund, because his friends coding in .NET, his students working with Java, and people working with PHP weren't as productive as people who used AOLserver -- a technology designed over 10 years ago.

In the comments, I remarked that whinging about how Web software development tools haven't moved on since 1992 by citing Tcl, .NET, Java, and PHP was silly. Why not look at languages that support (and support well) a variety of techniques? Python (and by reputation Perl and Ruby) tools have come a loooong way in the last few years. In Python you can now choose between at least 5 different frameworks, all of them at least moderately mature. All of the popular Web programming techniques are represented between the various frameworks: the object publishing paradigm (Zope/Quixote), the object-kit templating paradigm (WebWare/CherryPy), and of course the utility belt that is Twisted.

The real question is, why not use Tcl, PHP, .NET, and Java for Web programming? It comes down to two conflicting considerations:

  1. the "each page is a class method" model encouraged by Java is overwrought for the simple pages that make up 20%-80% of any Web site (depending on the site, of course); it seems to be great for more complicated sites, though.

  2. the "each page is a string of spaghetti code" model encouraged by Tcl and PHP is great for the simple pages (20%-80% of any site) but disastrous for more complicated sites.

Or, to put it another way, straight scripting languages are great for constructing simple pages, while higher-level abstractions are needed for more complex pages. Most Web sites require both kinds of pages, but many languages do not support both kinds of programming well.

Python offers a great intermediate: it is a scripting language that deals well with outputting strings, but which offers nice higher-level ways of building out a framework. (This is likely true of other scripting languages that support object-oriented programming, such as Perl, [incr Tcl], Ruby, OCaml, ... what am I missing? VBA?)

Bas Scheffers pointed out (in the comments, again) that "it's not the tool, it's the programmer". Well, yes, this is the usual last resort of any language anti-advocate: heck, they're all identical at some level of abstraction, right? And you can write spaghetti code in any language, right? So how dare you say that language A is better than language B? Well, my experience tells me (and other people that you should trust more) that Python is better than both Tcl and Java for many things. And so the question is, why? I'm sure there are many considerations that make sense to people, but one that I haven't seen mentioned before is that of how examples are coded.

Since most programmers liberally "appropriate" code -- from books, from open-source programs, and especially from cookbooks and examples -- the quality of that code has a lot to do with the way they write programs. It also has a huge effect on the way they learn to code in the language. And I think this effect is grossly underestimated.

For Python, there's a nice tutorial, a variety of books (caveat: most of which I haven't read), and many, many examples from both the comp.lang.python newsgroup and the Python cookbook. By and large, the code contained in examples is clean, simple, short, and documented. It also lives at a level of abstraction that fits: classes/objects used when appropriate, and not used when not appropriate. Yes, there's plenty of gunk in Python, but it's not what you first encounter & it's easy to find pretty examples.

For other languages, example code is often much nastier. Spaghetti-code style may not be required by Tcl and PHP, but most of the example code I've seen can't be classed any other way. Overwrought object-oriented gunk may not be required by Java, but most of the example code I've seen certainly fits the description.

It could be that simple: if you write uncomplicated yet useful code examples, and people learn your language from them, then in the end your language will be used to write prettier code. And there will be less unmaintainable spaghetti code, or ugly overcomplex OOgunk, written in that language. And, if you're lucky & you've figured out how to grow your language over time with the help of the community, your language will continually move towards supporting that kind of coding.

Why Python examples are pretty is a different question, and the answer may be more sociological than technical. Perhaps it's as simple as Python fitting my brain better than other languages do, and it's not true for everyone else. Or, perhaps it's more than that -- for example, our beloved BDFL may be particularly good at designing a certain type of language.

In the end, supporting good mechanisms of abstraction may be necessary for good programming, but it is obviously not sufficient. It doesn't do much good to have a language that supports a bunch of mechanisms that don't cleanly fit into example code. Nor will it do the language any good in the long run if the example code is poorly constructed.

So, I don't think that the development of software tools is moribund -- but it may be time to move on from Tcl for doing Web programming.

Philip Greenspun's company died a horrible death several years ago partly because they were trying to transition a hideously complex mess o' Tcl into Java. I wonder what would have happened if they'd chosen to use something more scriptable than Java but a little more supportive of abstraction than Tcl?

--titus

p.s. I'm pretty supportive of using Java for other things, such as GUIs. I'm just not smart enough to make it do data reduction fast... so I switched to C++ / FLTK.

28 Dec 2004 (updated 28 Dec 2004 at 23:31 UTC) »

Odds and ends today...

PBP & SF hackage

SourceForge announce lists full of spam? Try this PBP script, with the rematch.py extension:

pyload rematch.py

go http://lists.sourceforge.net/lists/admindb/listname fv 1 adminpw pass submit 1

do set_match --form 1 "\\\\d+$" 3 submit 1

'twil flush out all messages in your queue...

Decorators in Quixote

Kevin Dangoor queried about decorators in Quixote, and here are implementations of his two suggested decorators. They seem to make sense syntactically.

First, one to restrict access to the decorated function to logged-in users:

from quixote.errors import AccessError

def require_login(func): """ decorator: require login to run decorated function. """ def wrapper(request): if not request.session.user: raise AccessError("you must be logged in!") return func(request)

return wrapper

Use like so:

@require_login
def func(request):
   ...

A slightly more complex case follows: here are two functions to export names for publication by Quixote.

def export(func):
    """
    decorator; export decorated function under its __name__.
    """
    _q_exports = func.func_globals['_q_exports']
    _q_exports.append(func.__name__)

return func

def export_names(*names): """ decorator; export decorated function under all given names. """

# build a new function to return; this is what will be called on # the following function. def export_func(func, names=names): _q_exports = func.func_globals['_q_exports'] for name in names: _q_exports.append((name, func.__name__,))

return func

return export_func

Use these like so:

@export
def func(request):
   ...

@export_names("name1", "name2") def func2(request): ...

It's a little bit irritating that you have to grab _q_exports from func_globals but *shrug* that's scoping for ya! (If you don't do this, then you can't import the decorators from another module.)

--titus

Web browsing via Python, refreshed.

The saga continues... after many patches, and diversions, and attempts to grok, I finally got PBP to work for a variety of purposes. In particular, I can now:

  • Use maxq to record browsing sessions to PBP scripts, and run those scripts successfully in PBP;
  • Use an only slightly altered HTMLParser.HTMLParser class to parse the moderately crappy SourceForge/mailman output;
  • Grok RFC 2965 cookies well enough to actually log into and play with mailman admindb pages;
  • Save, load, and view cookies in PBP.

I've submitted a bunch of patches to PBP & hopefully Cory can take a look at them in the next few weeks.

HTML/HTTP/URL support in Python's base is surprisingly messy, given my past experience with Python modules. I look forward to the day when Python 2.4 or above is the standard and 2.3 is no longer supported -- it will make cookie and URL handling much simpler!

I may spend some quality time refitting the HTMLParser classes in the htmllib and HTMLParser modules; as-is, they don't have good failure modes.

--titus

"He realized the fastest way to change is to laugh at your own
folly -- then you can let go and quickly move on."
-- via Dossy Shiobara
20 Dec 2004 (updated 28 Dec 2004 at 23:10 UTC) »
The joys, trials, and tribulations of open-source work

This is becoming a bad joke.

The story starts innocently enough: when playing with PBP, I found a bug in Python's sgmllib.py. This bug caused PBP (via ClientForm) to fail on my test case, the Quixote demo application. So I went and fixed that bug and submitted a patch to the Python developers.

Then I diverted myself and decided to take up the gauntlet thrown by Martin v. Löwis, in order to see how quickly I could get my patch reviewed. I proceeded to work through first one and then nine add'l Python patches. My python-dev post still hasn't gone through so nothing further has happened there.

Today I put myself back on track and spent some time with maxq, a Java HTTP proxy recording system that outputs Jython code tracking a Web session. I added a script generator module for PBP to maxq so that it would output PBP scripts as well.

In the process I ran across two additional problems in PBP. First I found a bug in the way PBP used shlex (PBP patch 274). Then maxq's behavior of recording all form variables, even hidden ones, led me to discover that PBP crashed when trying to change hidden variables. I changed this to a warning (PBP patch 273). (Not strictly a PBP problem, but the crash behavior was probably inappropriate.)

So, to try out one package (PBP) and modify another to fit my needs (maxq), I ended up crawling through a bunch of nifty packages (mechanize, ClientForm, and four or five Python modules I'd never used before), submitting three different patches to two different projects (not including revised patches I contributed to Python), and writing a small chunk of new code in Java. whee!

The worst part is that one of the PBP patches I'm submitting contains this:

 ...See http://issola.caltech.edu/~t/transfer/quixote-demo.pbp for an
operative example, although you will not be able to run it without
changing ClientForm.py to use the XHTMLCompatibleFormParser, because
of a bug in sgmllib.py (Python patch 1087808).
Yep -- to test this patch, go apply this other patch to the language you're using...

O well. At least now I can get around to the original purpose of writing a testing suite for Cartwheel.

My wife and cat are upset with the amount of time I've spent on this, that's for sure... ;) Of course, my wife doesn't get a vote because she went skiing today while I abused sea urchins. (The cat doesn't get a vote either: he sleeps all day.)

On the plus side, at least I could fix the software myself, so everything works now. I'd be SOL if any of this stuff had been closed-source.</a></a>

Continued karma-grubbing.

My advice for the Python patches neophyte: start high. I foolishly started in the low-numbered patches (#755660) & spent over an hour trying to figure out the code (which was easy) and the various patches and counterarguments and ... That part was less easy and nowhere near as pleasant. Folks, there's a reason why those patches and bug reports have been lying around for over a year -- no one else wants to deal with them.

Another piece of advice is to stick with modules you already know about or are interested to learn about. In my case I'd just gone through the HTMLParser stuff to find the sgmllib bug, so I decided to focus on HTML/CGI/URL stuff.

Anyhoo, worked my way through 9 add'l patches. Great! Together with my comment on patch 755660, I ran through 10 different patches. I may now pray to The Gods That Be to review patch 1087808, my sgmllib fix.

Here's a list:

  • patch 1055159, a simple docstring/doc patch to CGIHTTPServer. (This was really a documentation "enhancement request" masquerading as a patch, so I verified the behavior described & then wrote the doc string appropriately.)

  • patch 755670, a patch to make HTMLParser parse invalid HTML. Recommended not applying it.

    (I also put in a comment that was meant for a different patch (755660). Very embarrassing. Sigh. Tabbed browsing is dangerous in my hands. ;)

  • patch 1037974, a patch to fix HTTP digest authentication when accessing LiveJournal feeds (and any other feeds that don't listen to RFC 2617, which considers 'Algorithm' notification optional).

    Had to sign up for a livejournal account to test this, too...

    This is my first time dealing with digest authentication, but as I understand it the new behavior is merely verbose and a bit redundant. It certainly shouldn't break anything. Recommend apply.

  • patch 1028908, a bunch of small stuff by John J. Lee of wwwsearch. Apparently much of the code he modified was originally written by him anyway (?) and the regression tests passed, so *shrug*... recommend application.

  • patch 901480, fixing bug 735248. This fixes a bug in the way urllib2.parse_http_list parses unquoted elements. Recommend application, although I submitted a slightly modified patch (against the current CVS) that fixed a doctest string.

  • patch 827559 fixes SimpleHTTPServer to add a trailing '/' to directory names. Recommend application; it does fix the behavior, and I think it's a reasonable way to treat directories, too.

    An analogy: Links to 'http://some.place.or.other:port' get rewritten to 'http://some.place.or.other:port/', so links to '/this/dir' should get rewritten to '/this/dir/'.

  • patch 810023, a very nice patch to fix some reporthook behavior in urllib. The submitter, J. Lewis Muir, wrote regression tests to show that his new urllib worked, so testing this one was pretty easy. Recommend apply, based on the regression test behavior failed by the current source tree.

  • patch 893642, which adds an optional allow_none argument to SimpleXMLRPCServer and classes that use it. I updated the patch & added some documentation.

  • patch 1067760, to bug 1067728 (closed). This patch changes the behavior of seek() on file objects to do float --> long conversion instead of float --> int conversion. This allows 2.0**62 to be used in a seek, just like 2 ** 62. I recommended applying it because it shouldn't lead to new bugs. I'm probably missing something.

21 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!