Older blog entries for titus (starting at number 127)

25 Nov 2005 (updated 26 Nov 2005 at 03:48 UTC) »
For the sloooooow Thanksgiving holiday... Happy TG, everyone in the US!

Object-relational stuff, revisited

After I dissed on Jeff Watkins' ORM assumptions & logic, Sean Jensen-Gray staged an intervention & basically told me I was acting like a git. He's undoubtedly right, and I'm continuing that part of the conversation off-line where it belongs.

However, in the name of separating the smoke from the fire, here's some more discussion about ORMs.

First of all, here's my ORM, cucumber2, just so y'all know where I'm coming from. I make no claims about generality or quality or goodness, except to say that I like it & have been using its predecessor for over 4 yrs now. Works great. cucumber2 is some of the nicer bits of concept code I've ever written; it's definitely on my refrigerator. (YMMV...)

Based on my relatively minimal experience with ORMs, then, here are some of my own beliefs about ORM writing in Python.

  • Use "magic".

    Properties, metaclasses, introspection, dynamic code generation, and "under-cover twiddling" can all help make a clean piece of code. Not using them can hurt by making your code over-verbose and cluttering your APIs with information not relevant to the task at hand.

    Document your use, test your use, sure -- but use them.

  • Object-relational impedance mismatch is a big issue.

    Do I need to say more? Just think: how do you encode a collection in a database? (Make sure you're maintaining referential integrity in your answer...) How do you encode an inheritance hierarchy? These are simple examples of a serious mismatch between the relational model and the object model. This is the problem that new ORMs should try to solve, IMO.

  • Don't start out to write a database-generic ORM.

    There's lots of discussion about using database-specific features in the SQL world (although my google-fu is failing me...), so I won't rehash that. I come down solidly on the side of committing yourself to a specific database. I think it's particularly important in the case of an ORM, which may use *very* database-specific stuff to work its magic (e.g. cucumber2 and the PostgreSQL ORDBMS features). Porting this magic between databases is likely to get very hairy & involve lots of additional complexity.

    The attempt to make your ORM generic to multiple databases may well be a specific case of premature optimization (below); it seems like over-reaching oneself by attempting to encompass database-generic issues prior to settling on a good, clean API.

  • Make sure you can still use straight SQL.

    Do you have specialized metainfo that will break SQL queries/inserts/etc. that don't know about this information? If so, this seriously reduces the utility of your databases: you can't use external tools any more, without adding in ORM-specific awareness.

    Even if you can hack this in with triggers and VIEWs, you're adding a whole 'nother layer of complexity. Bad.

  • Premature optimization is the root of all evil. (Hoare via Knuth)

    (Ironically, the first few google hits seem to be dedicated to discussing when this rule doesn't apply...)

    This covers things like caching and cache invalidation code, which in my experience is difficult to handle generically (although possible, esp. if you only allow transaction-wrapped access). Also, SQL query optimization is tough to do in SQL, much less in a layer wrapped around SQL. In many cases, you should consider optimizing by writing app/data-model-specific SELECT statements that integrate with your ORM interface.
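To make the "Use magic" point a bit more concrete, here's a minimal sketch of the kind of under-cover twiddling I mean. It is not cucumber2 code; the names Column, ModelMeta, Model and Person are made up for illustration, and the generated SQL is deliberately naive. A descriptor records each column, and a metaclass collects them so the class body reads like a schema:

```python
class Column:
    """Descriptor that maps a Python attribute onto a table column."""

    def __init__(self, sql_type):
        self.sql_type = sql_type
        self.name = None

    def __set_name__(self, owner, name):
        # Called automatically at class creation; no need to repeat the name.
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name)

    def __set__(self, obj, value):
        obj.__dict__[self.name] = value


class ModelMeta(type):
    """Metaclass that gathers the Column attributes declared on a class."""

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        # Only columns declared directly on this class; inheritance is ignored here.
        cls._columns = [v for v in namespace.values() if isinstance(v, Column)]
        cls._table = name.lower()
        return cls


class Model(metaclass=ModelMeta):
    def __init__(self, **values):
        for key, value in values.items():
            setattr(self, key, value)

    @classmethod
    def create_table_sql(cls):
        cols = ", ".join(f"{c.name} {c.sql_type}" for c in cls._columns)
        return f"CREATE TABLE {cls._table} ({cols})"


class Person(Model):
    name = Column("TEXT")
    age = Column("INTEGER")


person = Person(name="titus", age=30)
print(Person.create_table_sql())
```

The point is the last class: nothing in Person mentions table names, SQL, or bookkeeping, yet the class knows enough to describe itself to the database.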

Most of all, think of your ORM like an object database. Layering a procedural interface on top of an SQL database isn't building an ORM -- it's building a library that talks to an SQL database. Useful, but probably not new. If you solve a hard problem -- even poorly -- that's new.

For example, one of my absolute requirements: can you determine the class of an "object" (row, tuple, whatever) in the database without using metadata that's stored external to the database (like, say, in your Python object)? I think that's a pretty ORMy requirement, myself, and it helps to not violate condition #4 (straight SQL) above. Another requirement: can you store object hierarchies straightforwardly? Again, seems ORMy to me, but it speaks to the impedance mismatch problem -- it's a tough requirement.
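Here's one way those two requirements might look in practice. This is only a sketch, with SQLite standing in for a real database; cucumber2 leans on PostgreSQL-specific features instead, and the tables (animal, dog) and helpers (subclass_tables, class_of) are hypothetical. Each class gets its own table, a subclass table points at its parent with a foreign key, and the class of a row is worked out from the schema and the data alone:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE animal (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE dog (id INTEGER PRIMARY KEY REFERENCES animal(id), breed TEXT);
""")


def subclass_tables(conn, base):
    """Tables whose 'id' column is a foreign key to base.id, read from the schema."""
    tables = [row[0] for row in
              conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
    found = []
    for table in tables:
        # Table names come from the schema, not from user input, so quoting suffices.
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            # fk = (id, seq, referenced table, from column, to column, ...)
            if fk[2] == base and fk[3] == "id":
                found.append(table)
    return found


def class_of(conn, base, row_id):
    """Return the most-derived table holding row_id, or None if it isn't there."""
    hit = conn.execute(f'SELECT 1 FROM "{base}" WHERE id = ?', (row_id,)).fetchone()
    if hit is None:
        return None
    for sub in subclass_tables(conn, base):
        derived = class_of(conn, sub, row_id)
        if derived is not None:
            return derived
    return base


conn.execute("INSERT INTO animal (id, name) VALUES (1, 'Rex')")
conn.execute("INSERT INTO dog (id, breed) VALUES (1, 'collie')")
conn.execute("INSERT INTO animal (id, name) VALUES (2, 'Generic')")

print(class_of(conn, "animal", 1))
print(class_of(conn, "animal", 2))
```

Plain SQL tools still work on these tables, and the foreign key keeps a dog row from existing without its animal row. The price is a join (or several lookups) per object, which is part of why I call this a tough requirement.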

Looking over this list, I think these are all pretty tough requirements. You would be justified in asking "well, why not just use an object database, then?"

There are a few obvious reasons.

  • Requirements. Maybe you have to (or really really prefer to) use an SQL database. Your support staff only understands SQL; your SQL backups are automated; you really like SELECT queries and the command-line interface; or your boss tells you you have to.

  • Language neutrality. Say what you will, but SQL databases are admirably language neutral... suppose you have to access the database from multiple languages, like Java, Python, Ruby, and Perl. Most object databases are language-specific (for obvious reasons...) so you're stuck with a relational DBMS.

  • Maturity. I personally dislike this argument, but: SQL databases like Oracle, PostgreSQL, MySQL, etc. have a long history and the flaws are well known. Not so with ODBMSs.

  • Teamwork. You work with people that only grok SQL. I am sympathetic to this argument, coming from an academic environment with moderately high turnover and people who have relatively little software engineering background.

  • Query performance. If a lot of your data is fundamentally organized in relational ways, I bet your SELECT statements can be heavily optimized in ways that no object database can match.

  • Support. Lots of companies support SQL databases. Not so many support object databases.

OK, so what use is an ORM? I'm assuming anybody who's made it this far is already sold on ORMs, but just in case, here are a couple of my reasons:

  • Impedance mismatch. Object-oriented languages organize data differently than the normal SQL data-model. You really want to be able to take advantage of both. (Or at least I do.)

  • Programming reliability and security. There are a number of mistakes -- some obvious, some not so much -- that can be made by SQL programmers. Hell, you're generating SQL code in another language -- how can this not be problematic? (It's largely solved by using appropriate libraries for SQL access, mind you.)

  • Joins. I don't know about you, but I'm not smart enough to understand LEFT OUTER JOINs. (Could someone else please write a library to do it for me, intelligently?)
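On the reliability and security point, a small illustration of what goes wrong when you build SQL by hand, and of how the library-level fix looks. It uses Python's sqlite3 module with an in-memory database; the users table and the two find_* helpers are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 1), ("bob", 0)])


def find_unsafe(conn, name):
    # BAD: the value is pasted into the SQL text, so it can change the query.
    return conn.execute(
        f"SELECT name FROM users WHERE name = '{name}'").fetchall()


def find_safe(conn, name):
    # Better: the driver passes the value separately from the SQL text.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)).fetchall()


hostile = "nobody' OR '1'='1"
print(len(find_unsafe(conn, hostile)))  # every row comes back
print(len(find_safe(conn, hostile)))    # no rows: nobody has that literal name
```

An ORM (or any decent SQL access library) takes the second path for you every time, which is most of what I mean by "largely solved".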

You would now be even more justified in calling me somewhat nuts. I have strict requirements for an ORM that are nigh impossible to meet, and lots of reasons why you might be stuck with an SQL database. Yet I've also given a few good reasons to use an ORM. What to do?

My first point: it's not an easy problem. That's why seriously smart people -- much smarter than me -- have thought deeply on the matter and come up with very little.

My second point: it's worth tackling. 'nuff said, here; I think the benefits are obvious.

My third point: I personally guess that there are solutions to most of the problems that I lay out for ORMs, and these solutions lie in the dynamic nature of languages like Python (and probably languages like Ruby and Perl). Certainly I can easily do interesting things in Python that are tremendously difficult to do in Java, although many of these things use the "black magic" of metaclasses.

OK, there's no real conclusion here.

I'm at least minimally satisfied with the approach I've taken in cucumber2. Again, YMMV. Apart from polishing and optimizing the code a bit, I'm thinking about taking a pyparsing-style approach to SELECTs. More on that next time I get the yen to hack on something other than twill ;).

I hope you're at least mildly entertained by my wild-eyed ORM discussion, and I look forward to the horde of disapproving comments. (Luckily, I've disabled comments on this blog, so I won't have to make them public if I don't want to. [0])

cheers,
--titus

p.s. apologies for the weird formatting... advogato *shrug*

[0] I feel compelled to point out that this sentence is a joke.

OCaml the Python way

Cute.

Ooh, ooh, I've got an ORM!

I'm deeply skeptical of Jeff Watkins' approach to ORMs, revealed via the Daily Python-URL.

For example, why the heck would "No Magic" be on the list of anyone using Python? Things like metaclasses and properties are effin' great at making code act like it should act based on how it reads. Sure, they can be abused -- but what can't? I've used metaclasses for several projects now -- twill and cucumber2, in particular -- and I'm extraordinarily happy with them. "Twiddling under the covers" can make things very neat. And, done properly, how can it not be Pythonic?

Then there's this. The logic appears to be: I hate COM, but must use it. It doesn't garbage collect (<-- inferred). Therefore I like languages that do garbage collect, like Java (blech) and JavaScript (blech**2). That's not logic, man -- that's faulty generalization, at least the way it's written.

Oh, and rather than half-baked CherryPy, you might check out Quixote. You'll have to roll your own more of the time, but I guarantee you you'll get consistency ;).

I definitely recommend the Python-Is-Not-Java and Java-Is-Not-Python, Either posts; some of the code you're posting looks awfully Java-like, and I'd be wary of porting ideas straight across.

--titus

Unmaintainable code

OK, this is brilliant.

Obviously I should rewrite my "open-source project truisms" post in this manner... another time, perhaps.

Speaking of which:

Open-source project truisms: feedback

Marius Gedminas suggests also making your source code repository browseable through the Web, and Scott Lamb boils much of my advice down to "use Trac". That way, you get a source-code browser & a Wiki, too, so that people can correct things on the site themselves. Michele Simionato came out against Wikis (in a separate exchange a few days ago) but I'm on the fence & slipping towards the side of the Wiki... I think they do more good than harm ;).

Marius also suggested putting up screenshots -- that's not applicable to developer libraries, but is a great idea for anything a non-developer/user is expected to touch.

On a somewhat separate note... Python docs

Submitting patches to Python is a huge pain, because of all of the overhead and review process. (And the sourceforge bug/patch system is really, really lousy.) This process is certainly legitimate for code, but I think it's overkill for documentation -- especially when you consider how the standard library has grown, and how poor the documentation is in some of the places I frequent (urllib2, anyone?)

One possible solution?

I've been thinking about setting up a Trac site and a darcs repository for editing Python documentation. The idea would be to use tailor to maintain a synced darcs version of the CVS/SVN documentation, and then build additional documentation off of that darcs base. The additional docs could thus be maintained in concert with the standard docs, but without the hassle of making them formal. Heck, with proper integration, the Trac wiki system could augment the whole thing, too.

I'm balking because I've had some recent experiences with darcs that suggest that it may not be ready for this kind of role. More anon. Thoughts welcome, however...

--titus

A few open-source project truisms

I'm sure these are all obvious, but I thought I'd write them down as part of my Saturday procrastination.

  • Make your source code repository available to anonymous users without any setup on your part. Don't hide it behind firewalls, passwords, or make it available "upon request".

    Why? That way, casual developers can drop in & visit your code. If your code is nice or your project is useful, you may even attract their input. They can also submit patches based on your latest code, rather than whatever release you last bothered to bundle.

    Corollaries: have a comprehensible build system that "just works"; if you have tests, same. This lowers the barrier to entry for developers.

    Pet peeve: academic projects tend to hide their source code, I think because they're afraid that someone will come along and steal their brilliant idea. Wrong, wrong, wrong -- unless you have some super-serious mojo hidden in there, and someone spends the time to dig it out, nobody will care about your project. So if you intend to make it open source, open it up immediately -- who knows, you might even get some participation!

  • Unless you're going to be very responsive and accommodating about patches from other developers, use a distributed version control system like arch or darcs.

    Why? This way, casual developers can (a) change hardcoded configuration settings without losing the ability to integrate your updates; (b) make sure that their own fixes jibe with your patches; (c) work with your package without the need for accounts on your development machine.

    Pet peeve: most open source projects in existence ignore this. It may not have been an option until relatively recently, but these distributed systems are now very usable. I don't know too many developers who are happy giving CVS write access to Joe Blow, but distributed version control obviates this.

  • Remember that getting users will generally not get you developers, but getting developers will get you users. (Maybe not as many as you want, admittedly, but you've got to start somewhere!) So, when starting out, cater to the developers. Unless you intend to go it alone & produce a completely usable system with a GUI all on your own, you want to start with the developers and finish with the users...

    Nobody has time to work on projects that aren't ultimately going to be useful to them, so you can be sure that active developers are using your code. "Just users", however, can be a drain on a project, especially when it's just starting out.

    Pet peeve: many projects make it abysmally difficult to play with them. The source is poorly organized or in a nonstandard layout, the dependencies are unclear, and there may even be patched versions of dependencies lying around on the main developer's machine. (Guilty, your honor...) The barrier to entry for developers has to be really low for most people I know to even glance at a project.

  • You need a Web page and a mailing list (preferably with a public archive).

    Why? How the hell else are people supposed to find your project and appreciate your brilliance? Seriously, my goal for most open source projects is to get people hooked & contributing, if only by complaining about bugs. They won't do that if they can't find it, figure out what it does, and download it. (They might very well be able to install it, especially if it uses a normal layout with configure & make/Makefile.PL/setup.py.) And they'll be able to see that other people are using it by looking at the mailing list; plus, google will help them find solutions posted by other people.

    What with sourceforge, berlios, and <insert your favorite hosting site here> you really have no excuse for running a project without a Web site and mailing list.

    Pet peeve: developers who think that someone's going to download their .tar.gz and read the documentation in order to figure out if a project is worthwhile. Guys, unless they already think the project is worthwhile, they're not going to download it...

Just my 2 cents. I'd appreciate counterarguments and additions; send 'em to me.

Pretty pictures

Street painting.

Droplets.

--titus

19 Nov 2005 (updated 19 Nov 2005 at 00:44 UTC) »
Unit tests save my butt

Just updated twill to the latest versions of the wwwsearch/mechanize code. Because twill reaches moderately far into the wwwsearch code, some of the internal changes that John Lee made to mechanize affected the functioning of twill adversely.

Conveniently, my unit tests caught many problems and I could iterate through & fix them one by one.

Dunno what I would have done without 'em... probably slapped a "beta" sticker on it and asked my users to test it for me ;).

Web testing links

Grig's recent roundup of Web testing tools missed a few. (OK, to be fair not all of them are "recently updated"!) I've put 'em all into the twill README. It's quite a list!

Good experiences with Python embedding

For you planetpython people,

Iago Rubio talks about how wonderful Python embedding can be...

--titus

Delusions of grandeur

Sparked by Ian Bicking's simple implementation of the twill language in JavaScript, as well as Robert Marchetti's Python-IE bridge project, PAMIE, I googled about and found a very recent article on PyXPCOM, too.

It should be relatively easy to build a common API layer for JavaScript, mechanize/mechanoid/zope.testbrowser, PAMIE, and PyXPCOM that supports the twill language. Then you'd be able to have a single twill script that runs in-browser & from the command line, and also manipulates PAMIE and PyXPCOM. Wouldn't that be nice?
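Roughly what I have in mind, as a toy sketch rather than a design: one small backend interface, and a script runner that only ever talks to that interface. BrowserBackend, DictBackend and run_script are hypothetical names, only three twill commands (go, follow, showlinks) are handled, and "follow" does a plain substring match here to keep things short. A real backend would wrap mechanize, PAMIE, PyXPCOM, or a JavaScript bridge instead of a dictionary.

```python
from abc import ABC, abstractmethod


class BrowserBackend(ABC):
    """Hypothetical interface that each browser engine would implement."""

    @abstractmethod
    def go(self, url):
        """Load url and return the URL actually reached."""

    @abstractmethod
    def links(self):
        """Return [(text, url), ...] for the current page."""


class DictBackend(BrowserBackend):
    """Toy backend over an in-memory {url: [(text, url), ...]} map."""

    def __init__(self, pages):
        self.pages = pages
        self.current = None

    def go(self, url):
        if url not in self.pages:
            raise LookupError("no such page: " + url)
        self.current = url
        return url

    def links(self):
        return list(self.pages[self.current])


def run_script(backend, script):
    """Run a tiny subset of the twill language against any backend."""
    output = []
    for line in script.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        command, _, arg = line.partition(" ")
        arg = arg.strip()
        if command == "go":
            output.append("==> at " + backend.go(arg))
        elif command == "follow":
            for text, url in backend.links():
                if arg in text:
                    output.append("==> at " + backend.go(url))
                    break
            else:
                raise LookupError("no link matching " + arg)
        elif command == "showlinks":
            for i, (text, url) in enumerate(backend.links()):
                output.append(f"{i}. {text} ==> {url}")
        else:
            raise ValueError("unsupported command: " + command)
    return output


pages = {"/": [("People", "/person/")], "/person/": []}
for out_line in run_script(DictBackend(pages), "go /\nshowlinks\nfollow People"):
    print(out_line)
```

The script text never changes; only the backend object does.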

Just a thought.

--titus

E-mail to my titus at caltech.edu address is failing. If, for some reason, you feel the need to contact me ... use titus at idyll.org.

For now I've had to set my Reply-To header to that, too. How frustrating; I'm sure I've lost e-mail, but I have no idea what, and I have no ability to fix the problem. I feel like a user. Argh.

(Nothing to see here. Continue about your business.)

--titus

A Plea for Python

Seth Vidal of fedoraproject.org has a request -- stop asking for PHP apps, and start using Python!

To quote,

With that in mind I'd like to make a request out to the python web programmers in the fedora-verse. We need some people who are willing to work on web applications either in zope or turbogears. Let's see some options and ways to progress. We have a lot of python programmers who can help audit the code and contribute sections as we build up our module-base.

Yo. What he said.

Interactive exploration of Web apps

One (apparently) whiz-bang feature of PBP, and now twill, is the ability to interactively browse your Web apps from the command line. For example,

>> go http://www.advogato.org/
==> at http://www.advogato.org/
current page: http://www.advogato.org/
>> showlinks
Links:

0. Articles ==> /article/
1. Account ==> /acct/
2. People ==> /person/
3. Projects ==> /proj/
4. Omnifarious ==> /person/Omnifarious/
5. Read more... ==> /article/862.html
6. tarzeau ==> /person/tarzeau/
...
current page: http://www.advogato.org/
>> follow tarzeau
==> at http://www.advogato.org/person/tarzeau/
current page: http://www.advogato.org/person/tarzeau/
>> showlinks
Links:

0. http://www.linuks.mine.nu/ ==> http://www.linuks.mine.nu/
1. freshmeat.net/~tarzeau/ ==> http://freshmeat.net/~tarzeau/
2. iBackup ==> /proj/iBackup/
3. AmigaSHELL ==> /proj/AmigaSHELL/
...

This inevitably receives the most "oohs" and "ahhs" the times I've shown twill to people sitting next to me. Apparently none of the other Web testing tools out there do this.

Anyway, I mention it because with the in-process WSGI testing patch, you can now browse your WSGI apps interactively without going through the Internet.

That strikes me as pretty nifty, if only in a Python-geek kind of way.

Go about your business. These are not the comments you're looking for.

--titus

13 Nov 2005 (updated 13 Nov 2005 at 20:28 UTC) »
twill & in-process testing of WSGI apps

Ian Bicking's Best of the web app test frameworks? sparked an interesting discussion (read down to the comments). Of particular interest to me was Ian's suggestion that twill (or, really, urllib2/httplib) be modified to send requests directly to a WSGI application without going through a TCP connection.

A few hours of hacking later, I've got a simple implementation that works (inside of twill). Briefly:

  • I replace the HTTPHandler.http_open method with a call to my own HTTPConnection class, "myhttplib.MyHTTPConnection".

  • myhttplib.MyHTTPConnection overrides the 'connect()' function so that for *specific* host/port connections only, a fake socket is created.

  • this fake socket, an object of type 'wsgi_fake_socket', intercepts the HTTP traffic and behaves like a WSGI server, calling the app object with the appropriate translated environment & catching the response.

  • this response is then passed back up to the HTTPResponse object, and all is copacetic.

This solution is generic to httplib. It should be easy to pop it into anything that uses urllib2 to talk via HTTP... which means that virtually any Python Web testing code can use this kind of thing to talk directly to any Python WSGI application.
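If the socket trickery obscures the idea, the heart of it is just "call the app object with an environ and catch what it hands back". Here is a minimal modern-Python sketch of that one step, with none of the httplib plumbing. It is not the code in myhttplib.py; demo_app and call_wsgi_app are illustrative names, and the environ is filled in by the standard library's wsgiref.util.setup_testing_defaults rather than by anything twill does.

```python
import io
from wsgiref.util import setup_testing_defaults


def demo_app(environ, start_response):
    """A trivial WSGI app standing in for the application under test."""
    body = ("hello from " + environ["PATH_INFO"]).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_wsgi_app(app, path="/", method="GET"):
    """Invoke a WSGI app in-process and return (status, headers, body)."""
    environ = {
        "REQUEST_METHOD": method,
        "SCRIPT_NAME": "",
        "PATH_INFO": path,
        "wsgi.input": io.BytesIO(b""),
    }
    setup_testing_defaults(environ)  # fills in the remaining required keys

    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers
        return lambda data: None  # legacy write() callable, unused here

    result = app(environ, start_response)
    try:
        body = b"".join(result)
    finally:
        if hasattr(result, "close"):
            result.close()
    return captured["status"], captured["headers"], body


status, headers, body = call_wsgi_app(demo_app, "/collar/")
print(status, body)
```

Everything else in the intercept is about getting from an HTTP request on a (fake) socket to that app() call and back again.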

OK, so does it work? Yes!

For example, here's a simple script testing my conference submission system over TCP:

% ./twill-sh -u http://issola.caltech.edu/collar/ tst2
>> EXECUTING FILE tst2
==> at http://issola.caltech.edu/collar/
==> at http://issola.caltech.edu/collar/submit/
Note: submit is using submit button: name="view_status", value="view status"
Note: submit is using submit button: name="view", value="view paper"
--
1 of 1 files SUCCEEDED.

Here's the same script running. This time it's pointed at a host/port that's diverted to WSGI:

% ./twill-sh -u http://floating.caltech.edu/collar/ tst2
>> EXECUTING FILE tst2
INTERCEPTING call to floating.caltech.edu:80
==> at http://floating.caltech.edu/collar/
INTERCEPTING call to floating.caltech.edu:80
==> at http://floating.caltech.edu/collar/submit/
Note: submit is using submit button: name="view_status", value="view status"
INTERCEPTING call to floating.caltech.edu:80
INTERCEPTING call to floating.caltech.edu:80
Note: submit is using submit button: name="view", value="view paper"
INTERCEPTING call to floating.caltech.edu:80
INTERCEPTING call to floating.caltech.edu:80
--
1 of 1 files SUCCEEDED.

GET/POST, redirects, cookies, etc. are handled as they should be.

If you want to play with the code, it's available via darcs:

darcs get http://issola.caltech.edu/~t/twill/

You can also just view myhttplib.py, which contains the entire implementation.

Some notes:

  • The code is ugly; I know that. There are a number of things I could do to make it nicer, but at the moment I'm going to rest on my laurels ;). Feel free to critique and/or patch.

  • The 'make_environ' function is one of the two weakest links: it needs to behave like a "real" Web server with respect to filling out the environment dictionary, and that's tough. Right now I think it's handling cookies right, but not much else.

  • The other weak link is in the interplay between the read/write in the wsgi_fake_socket. I'm assuming an awful lot here... in any case, it should be possible to write a 2-way FIFO that properly mimics an open socket. (Then HTTP/1.1 connections would work, too.)

  • It should be fairly easy to pop in your own WSGI app, just for grins. You just need to modify myhttplib.wsgi_intercept appropriately, at any time before your first call to grab a URL.
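To give a feel for why 'make_environ' is a weak link, here's a rough sketch of what such a function has to do: turn the raw bytes of an HTTP request into a WSGI environment dictionary. This is written from scratch in modern Python for illustration and is not the function in myhttplib.py; the signature and defaults are my assumptions, and it skips plenty that a real server handles (repeated headers, absolute URLs in the request line, chunked bodies, and so on).

```python
import io
import sys
from urllib.parse import unquote


def make_environ(raw_request, host="localhost", port=80):
    """Build a WSGI environ dict from the raw bytes of one HTTP request."""
    head, _, body = raw_request.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    method, target, protocol = lines[0].split(" ", 2)
    path, _, query = target.partition("?")

    environ = {
        "REQUEST_METHOD": method,
        "SCRIPT_NAME": "",
        "PATH_INFO": unquote(path, encoding="iso-8859-1"),
        "QUERY_STRING": query,
        "SERVER_NAME": host,
        "SERVER_PORT": str(port),
        "SERVER_PROTOCOL": protocol,
        "wsgi.version": (1, 0),
        "wsgi.url_scheme": "http",
        "wsgi.input": io.BytesIO(body),
        "wsgi.errors": sys.stderr,
        "wsgi.multithread": False,
        "wsgi.multiprocess": False,
        "wsgi.run_once": False,
    }

    for line in lines[1:]:
        name, _, value = line.partition(":")
        key = name.strip().upper().replace("-", "_")
        value = value.strip()
        # These two headers are stored without the HTTP_ prefix.
        if key in ("CONTENT_TYPE", "CONTENT_LENGTH"):
            environ[key] = value
        else:
            environ["HTTP_" + key] = value
    return environ


raw = (b"GET /collar/submit/?id=3 HTTP/1.0\r\n"
       b"Host: floating.caltech.edu\r\n"
       b"Cookie: session=abc\r\n"
       b"\r\n")
env = make_environ(raw, host="floating.caltech.edu", port=80)
print(env["PATH_INFO"], env["QUERY_STRING"], env["HTTP_COOKIE"])
```

Cookies fall out of the generic header loop as HTTP_COOKIE, which is probably why they were the first thing to work; the fiddly parts are everything the loop doesn't cover.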

Have fun,
--titus

RELEASE: twill 0.7.4

Following the diktat of my users... a new twill release!

Links: Cheese Shop entry, announcement, ChangeLog, and download.

There are lots of bug fixes -- cold filtered for a smooth browsing experience. The main innovation in this release is the unit testing: now, twill uses nose-based unit testing, and it also provides some simple commands to help users use twill to unit test their own apps.

Sample unit-test code:

    import os
    import testlib
    import twill.unit
    import twilltestserver
    from quixote.server.simple_server import run as quixote_run

    def test():
        # port to run the server on
        PORT = 8090

        # create a function to run the server
        def run_server_fn():
            quixote_run(twilltestserver.create_publisher, port=PORT)

        # abspath to the script
        script = os.path.join(testlib.testdir, 'test-unit-support.twill')

        # create test_info object
        test_info = twill.unit.TestInfo(script, run_server_fn, PORT)

        # run tests!
        twill.unit.run_test(test_info)

(See unit testing with twill for more info.)

Comments welcome...

--titus
