25 Nov 2005 titus   » (Journeyer)

For the sloooooow Thanksgiving holiday... Happy TG, everyone US!

Object-relational stuff, revisited

After I dissed on Jeff Watkin's ORM assumptions & logic, Sean Jensen-Gray staged an intervention & basically told me I was acting like a git. He's undoubtedly right, and I'm continuing that part of the conversation off-line where it belongs.

However, in the name of separating the smoke from the fire, here's some more discussion about ORMs.

First of all, here's my ORM, cucumber2, just so y'all know where I'm coming from. I make no claims about generality or quality or goodness, except to say that I like it & have been using its predecessor for over 4 yrs now. Works great. cucumber2 is some of the nicer bits of concept code I've ever written; it's definitely on my refrigerator. (YMMV...)

Based on my relatively minimal experience with ORMs, then, here are some of my own beliefs about ORM writing in Python.

  • Use "magic".

    Properties, metaclasses, introspection, dynamic code generation, and "under-cover twiddling" can all help make a clean piece of code. Not using them can hurt by making your code over-verbose and cluttering your APIs with information not relevant to the task at hand.

    Document your use, test your use, sure -- but use them.

  • Object-relational impedance mismatch is a big issue.

    Do I need to say more? Just think: how do you encode a collection in a database? (Make sure you're maintaining referential integrity in your answer...) How do you encode an inheritance hierarchy? These are simple examples of a serious mismatch between the relational model and the object model. This is the problem that new ORMs should try to solve, IMO.

  • Don't start out to write a database-generic ORM.

    There's lots of discussion about using database-specific features in the SQL world (although my google-fu is failing me...), so I won't rehash that. I come down solidly on the side of committing yourself to a specific database. I think it's particularly important in the case of an ORM, which may use *very* database-specific stuff to work its magic (e.g. cucumber2 and the PostgreSQL ORDBMS features). Porting this magic between databases is likely to get very hairy & involve lots of additional complexity.

    The attempt to make your ORM generic to multiple databases may well be a specific case of premature optimization (below); it seems like over-reaching oneself by attempting to encompass database-generic issues prior to settling on a good, clean API.

  • Make sure you can still use straight SQL.

    Do you have specialized metainfo that will break SQL queries/inserts/etc. that don't know about this information? If so, this seriously reduces the utility of your databases: you can't use external tools any more, without adding in ORM-specific awareness.

    Even if you can hack this in with triggers and VIEWs, you're adding a whole 'nother layer of complexity. Bad.

  • Premature optimization is the root of all evil. (Hoare via Knuth)

    (Ironically, the first few google hits seem to be dedicated to discussing when this rule doesn't apply...)

    This covers things like caching and cache invalidation code, which in my experience is difficult to handle generically (although possible, esp. if you only allow transaction-wrapped access). Also, SQL query optimization is tough to do in SQL, much less in a layer wrapped around SQL. In many cases, you should consider optimizing by writing app/data-model-specific SELECT statements that integrate with your ORM interface.

Most of all, think of your ORM like an object database. Layering a procedural interface on top of an SQL database isn't building an ORM -- it's building a library that talks to an SQL database. Useful, but probably not new. If you solve a hard problem -- even poorly -- that's new.

For example, one of my absolute requirements: can you determine the class of an "object" (row, tuple, whatever) in the database without using metadata that's stored external to the database (like, say, in your Python object)? I think that's a pretty ORMy requirement, myself, and it helps to not violate condition #4 (straight SQL) above. Another requirement: can you store object hierarchies straightforwardly? Again, seems ORMy to me, but it speaks to the impedance mismatch problem -- it's a tough requirement.

Looking over this list, I think these are all pretty tough requirements. You would be justified in asking "well, why not just use an object database, then?"

There are a few obvious reasons.

  • Requirements. Maybe you have to (or really really prefer to) use an SQL database. Your support staff only understands SQL; your SQL backups are automated; you really like SELECT queries and the command-line interface; or your boss tells you you have to.

  • Language neutrality. Say what you will, but SQL databases are admirably language neutral... suppose you have to access the database from multiple languages, like Java, Python, Ruby, and Perl. Most object databases are language-specific (for obvious reasons...) so you're stuck with a relational DBMS.

  • Maturity. I personally dislike this argument, but: SQL databases like Oracle, PostgreSQL, MySQL, etc. have a long history and the flaws are well known. Not so with ODBMSs.

  • Teamwork. You work with people that only grok SQL. I am sympathetic to this argument, coming from an academic environment with moderately high turnover and people who have relatively little software engineering background.

  • Query performance. If a lot of your data is fundamentally organized in relational ways, I bet your SELECT statements can be heavily optimized in ways that no object database can match.

  • Support. Lots of companies support SQL databases. Not so many support object databases.

OK, so what use is an ORM? I'm assuming anybody who's made it this far is already sold on ORMs, but just in case, here are a couple of my reasons:

  • Impedance mismatch. Object-oriented languages organize data differently than the normal SQL data-model. You really want to be able to take advantage of both. (Or at least I do.)

  • Programming reliability and security. There are a number of mistakes -- some obvious, some not so much -- that can be made by SQL programmers. Hell, you're generating SQL code in another language -- how can this not be problematic? (It's largely solved by using appropriate libraries for SQL access, mind you.)

  • Joins. I don't know about you, but I'm not smart enough to understand LEFT OUTER JOINs. (Could someone else please write a library to do it for me, intelligently?)

You would now be even more justified in calling me somewhat nuts. I have strict requirements for an ORM that are nigh impossible to meet, and lots of reasons why you might be stuck with an SQL database. Yet I've also given a few good reasons to use an ORM. What to do?

My first point: it's not an easy problem. That's why seriously smart people -- much smarter than me -- have thought deeply on the matter and come up with very little.

My second point: it's worth tackling. 'nuff said, here; I think the benefits are obvious.

My third point: I personally guess that there are solutions to most of the problems that I lay out for ORMs, and these solutions lie in the dynamic nature of languages like Python (and probably languages like Ruby and Perl). Certainly I can easily do interesting things in Python that are tremendously difficult to do in Java, although many of these things use the "black magic" of metaclasses.

OK, there's no real conclusion here.

I'm at least minimally satisfied with the approach I've taken in cucumber2. Again, YMMV. Apart from polishing and optimizing the code a bit, I'm thinking about taking a pyparsing-style approach to SELECTs. More on that next time I get the yen to hack on something other than twill ;).

I hope you're at least mildly entertained by my wild-eyed ORM discussion, and I look forward to the horde of disapproving comments. (Luckily, I've disabled comments on this blog, so I won't have to make them public if I don't want to. [0])

cheers,
--titus

p.s. apologies for the weird formatting... advogato *shrug*

[0] I feel compelled to point out that this sentence is a joke.

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!