Older blog entries for titus (starting at number 130)

twill 0.8

"85% unit tested". ;)

PyPi entry, announcement, ChangeLog.

--titus

Tidy

I spent an hour or two on Sunday adding a tidy preprocessor into twill.

There are a lot of tidy Python implementations out there: ElementTree tidylib, ElementTree TidyTools, pyTidy, mxTidy, and utidylib. Some of them (elementtree) are part of other packages or require stuff that I don't want to bundle or require (utidylib requires ctypes); most of them require the tidylib binary and then interface with it. Because I want the twill distro to be cross-platform, I decided to go with the approach taken in ElementTree TidyTools, which relies only on the command-line binary. Inspection of the code revealed that it simply executed os.system, without much in the way of error trapping, so I ended up rolling my own (search for 'run_tidy'). Whee.
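For illustration only, here is a rough sketch of what "call the command-line binary, but trap errors" can look like. It is not twill's actual run_tidy: the function signature, the tidy_cmd parameter, and the timeout are assumptions made for this example, and it relies on the conventional HTML Tidy exit codes (0 clean, 1 warnings, 2 errors).

```python
import subprocess


def run_tidy(html, tidy_cmd=("tidy", "-q"), timeout=30):
    """Pipe `html` through an external tidy-like command.

    Returns (cleaned_html, messages). cleaned_html is None when the
    command could not be run or reported hard errors, so the caller
    can fall back to the original, untidied page.
    """
    try:
        proc = subprocess.run(
            list(tidy_cmd),
            input=html,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Binary missing, not executable, or hung: report, don't crash.
        return None, "could not run tidy: %s" % exc

    # Assumed convention: 0 = clean, 1 = warnings only, 2 = errors.
    if proc.returncode not in (0, 1):
        return None, proc.stderr
    return proc.stdout, proc.stderr
```

A caller would typically do something like `cleaned, messages = run_tidy(page)` and keep using `page` itself whenever `cleaned` is None; an empty `messages` string is one plausible way to check that a page produced no warnings.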

So, in the next release of twill, it will automatically preprocess stuff with tidy unless you turn it off; you can also assert that pages have no 'tidy' warnings.

Eggz Rock

The (imminent) next release of twill, twill 0.8, will include support for Python Eggs.

When I started, I was worried about a few technical issues: for example, I include pyparsing and mechanize/ClientForm/ClientCookie/pullparser within the twill distribution, and then munge sys.path to load them first. How would this work with eggs? No problem; the same path-munging code works whether I'm loading from a directory or a zip file. (I just use os.path.join.)
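A minimal sketch of that kind of path munging, assuming the bundled packages live in a subdirectory next to the package's __init__.py. The helper name prefer_bundled and the default directory name are made up for this example and are not taken from the twill source.

```python
import os
import sys


def prefer_bundled(package_file, subdir="other_packages"):
    """Put <directory of package_file>/<subdir> at the front of sys.path.

    Only string manipulation is involved, so the same code applies
    whether the package sits in a plain directory or inside a zipped
    egg: the import system accepts sys.path entries that point inside
    a zip archive.
    """
    package_dir = os.path.dirname(os.path.abspath(package_file))
    bundled = os.path.join(package_dir, subdir)
    if bundled not in sys.path:
        # Index 0 so the bundled copies win over any installed versions.
        sys.path.insert(0, bundled)
    return bundled
```

Inside a package this would be called as `prefer_bundled(__file__)` before importing any of the bundled modules.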

Version numbering: would upgrading etc. work nicely? Yep. The pkg_resources version handling is so smart, it's not even inspired. (...by which I mean that it's brilliantly simple.)

As a bonus, it will be even easier to distribute "development" versions of twill. I can just build an egg with an alpha version number, e.g. '0.8.1a1' or '0.8.1a2', link to 'em on a page, and then point people to that page. easy_install will do the rest. In fact, I don't even need to build the page manually: I can just tell Apache to make my development dist/ directory available to the public via "Options +Indexes".

For example, typing

easy_install -f http://issola.caltech.edu/~t/twill-dist/ twill

will automatically scan for the latest version and install it. Nifty.
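To make the "alpha version number" idea concrete, here is a toy ordering function. It is emphatically not the pkg_resources algorithm, just a simplified stand-in that handles strings shaped like '0.8.1a2' and shows why alpha builds sort after the previous release but before the final one. The names version_key and latest are invented for this sketch.

```python
import re

# Dotted numbers, optionally followed by a pre-release tag such as a1 or rc2.
_VERSION_RE = re.compile(r"^(\d+(?:\.\d+)*)(?:(a|b|rc)(\d+))?$")

# A final release (no tag) outranks any pre-release of the same number.
_STAGE_RANK = {"a": 0, "b": 1, "rc": 2, None: 3}


def version_key(version):
    """Return a tuple that sorts version strings in release order."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError("unsupported version string: %r" % version)
    release, stage, stage_num = match.groups()
    numbers = tuple(int(part) for part in release.split("."))
    return (numbers, _STAGE_RANK[stage], int(stage_num or 0))


def latest(versions):
    """Pick the newest version string from an iterable."""
    return max(versions, key=version_key)
```

With this ordering, '0.8' sorts before '0.8.1a1', which sorts before '0.8.1a2', which sorts before '0.8.1'. Real version schemes have many more cases (for instance, this toy treats '0.8' and '0.8.0' as different versions), so use a proper library for anything serious.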

So far, my main gripe? 'ez_setup' is an ugly name, and it's an ugly file to have sitting around in my main development directory. (You may recall that I dislike cluttering up the main directory. So call me picky ;).

--titus

Today is a day for... Miscellany!

ORMs

Re my long post on object-relational mappers:

Jonathan Ellis points me towards a fairly negative post on PostgreSQL table inheritance, which cucumber2 uses. The thread basically states that no one is maintaining table inheritance and that only inertia is keeping it in the code. My impression was somewhat the opposite: I've seen statements that table inheritance will not be taken out, because there are people using it. *shrug* It's a neat feature, IMO.

Jonathan also points me towards PyDO2, which seems to have good documentation and a philosophy that supports working on the database with other tools. I've seen PyDO before but never had a chance to play with it seriously. I like the look of the code, though, on a cursory inspection.

Runar's Blog (written by Runar?) has a long post on relational model vs Python. Haven't finished digesting it yet. One particularly interesting link (broken in that article) is to SQLAlchemy.

An Open-Source Story: Producing Error-Free Software is Hard

Via RISKS, this story on an optimization bug in gcc (or so I infer) that affected X, and perhaps many other pieces of code. Whoo.

Python Docs

Stephen Ferg e-mailed me about my earlier post on Python docs. He pointed me towards a long, fascinating thread on Python doc updates.

I'll go into this more later, but it's worth mentioning that anecdotal evidence from genome annotation suggests that the PHP model (of allowing at least somewhat uncontrolled posting of information to docs) elicits far more contributions than rigorous up-front quality control. The reason? Experts won't go out of their way to add information on something they understand well, but they will put in the time to correct something that's just plain wrong. So you've just got to put in mechanisms to facilitate this kind of interaction.

Using arch/darcs from Windows

In a response to my open source project truisms page, Moof points out that darcs and arch don't work very well for Windows. I'm sure he's right: I tend to forget about that platform; when I do have to develop for it, I try to use cygwin. So to get Windows developers you've got to use something like svn or CVS. And, as he points out, there are a lot more Windows developers out there than developers for any other platform... so you want to get Windows developers.

Are there any darcs competitors out there for TortoiseSVN?

Moof also echoes Marius Gedminas's point that Trac is something worth keeping an eye on. Trac is dangerously close to becoming "SourceForge in a box", which would be a good answer to most of my suggestions on how to run an OSS project.

In other news, it may be time to go get a blog that allows comments ;).

--titus

25 Nov 2005 (updated 26 Nov 2005 at 03:48 UTC)
For the sloooooow Thanksgiving holiday... Happy TG, everyone in the US!

Object-relational stuff, revisited

After I dissed on Jeff Watkins' ORM assumptions & logic, Sean Jensen-Gray staged an intervention & basically told me I was acting like a git. He's undoubtedly right, and I'm continuing that part of the conversation off-line where it belongs.

However, in the name of separating the smoke from the fire, here's some more discussion about ORMs.

First of all, here's my ORM, cucumber2, just so y'all know where I'm coming from. I make no claims about generality or quality or goodness, except to say that I like it & have been using its predecessor for over 4 yrs now. Works great. cucumber2 is some of the nicer bits of concept code I've ever written; it's definitely on my refrigerator. (YMMV...)

Based on my relatively minimal experience with ORMs, then, here are some of my own beliefs about ORM writing in Python.

  • Use "magic".

    Properties, metaclasses, introspection, dynamic code generation, and "under-cover twiddling" can all help make a clean piece of code. Not using them can hurt by making your code over-verbose and cluttering your APIs with information not relevant to the task at hand.

    Document your use, test your use, sure -- but use them.

  • Object-relational impedance mismatch is a big issue.

    Do I need to say more? Just think: how do you encode a collection in a database? (Make sure you're maintaining referential integrity in your answer...) How do you encode an inheritance hierarchy? These are simple examples of a serious mismatch between the relational model and the object model. This is the problem that new ORMs should try to solve, IMO.

  • Don't start out to write a database-generic ORM.

    There's lots of discussion about using database-specific features in the SQL world (although my google-fu is failing me...), so I won't rehash that. I come down solidly on the side of committing yourself to a specific database. I think it's particularly important in the case of an ORM, which may use *very* database-specific stuff to work its magic (e.g. cucumber2 and the PostgreSQL ORDBMS features). Porting this magic between databases is likely to get very hairy & involve lots of additional complexity.

    The attempt to make your ORM generic to multiple databases may well be a specific case of premature optimization (below); it seems like over-reaching oneself by attempting to encompass database-generic issues prior to settling on a good, clean API.

  • Make sure you can still use straight SQL.

    Do you have specialized metainfo that will break SQL queries/inserts/etc. that don't know about this information? If so, this seriously reduces the utility of your databases: you can't use external tools any more, without adding in ORM-specific awareness.

    Even if you can hack this in with triggers and VIEWs, you're adding a whole 'nother layer of complexity. Bad.

  • Premature optimization is the root of all evil. (Hoare via Knuth)

    (Ironically, the first few google hits seem to be dedicated to discussing when this rule doesn't apply...)

    This covers things like caching and cache invalidation code, which in my experience is difficult to handle generically (although possible, esp. if you only allow transaction-wrapped access). Also, SQL query optimization is tough to do in SQL, much less in a layer wrapped around SQL. In many cases, you should consider optimizing by writing app/data-model-specific SELECT statements that integrate with your ORM interface.
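The "how do you encode a collection?" question from the impedance-mismatch item above can be made concrete with a small sketch. It uses the standard-library sqlite3 module purely so that it is self-contained; the playlist/track schema and the helper names are invented for illustration and have nothing to do with cucumber2 or PostgreSQL.

```python
import sqlite3


def connect():
    """Open an in-memory database holding an ordered one-to-many collection."""
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign-key enforcement off unless asked.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE playlist (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE track (
            id          INTEGER PRIMARY KEY,
            playlist_id INTEGER NOT NULL
                        REFERENCES playlist(id) ON DELETE CASCADE,
            position    INTEGER NOT NULL,
            title       TEXT NOT NULL,
            UNIQUE (playlist_id, position)
        );
    """)
    return conn


def save_list(conn, name, titles):
    """Store a Python list as child rows, keeping the list order."""
    with conn:  # one transaction: the owner and its items, or nothing
        cur = conn.execute("INSERT INTO playlist (name) VALUES (?)", (name,))
        playlist_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO track (playlist_id, position, title) VALUES (?, ?, ?)",
            [(playlist_id, pos, title) for pos, title in enumerate(titles)],
        )
    return playlist_id


def load_list(conn, playlist_id):
    """Rebuild the Python list from its rows."""
    rows = conn.execute(
        "SELECT title FROM track WHERE playlist_id = ? ORDER BY position",
        (playlist_id,),
    )
    return [title for (title,) in rows]
```

Even this trivial case needs a second table, an explicit position column, and a foreign key just to represent what Python spells as a list attribute, which is the mismatch in miniature.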

Most of all, think of your ORM like an object database. Layering a procedural interface on top of an SQL database isn't building an ORM -- it's building a library that talks to an SQL database. Useful, but probably not new. If you solve a hard problem -- even poorly -- that's new.

For example, one of my absolute requirements: can you determine the class of an "object" (row, tuple, whatever) in the database without using metadata that's stored external to the database (like, say, in your Python object)? I think that's a pretty ORMy requirement, myself, and it helps to not violate condition #4 (straight SQL) above. Another requirement: can you store object hierarchies straightforwardly? Again, seems ORMy to me, but it speaks to the impedance mismatch problem -- it's a tough requirement.
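One way to picture the "class is recoverable from the database alone" requirement is sketched below. Note the swap: this uses a plain discriminator column in SQLite, not the PostgreSQL table inheritance that cucumber2 relies on, and every name in it (Animal, Dog, Cat, REGISTRY, save, load) is hypothetical.

```python
import sqlite3


class Animal:
    def __init__(self, name):
        self.name = name


class Dog(Animal):
    pass


class Cat(Animal):
    pass


# Maps the value stored in the row back to a Python class.
REGISTRY = {cls.__name__: cls for cls in (Animal, Dog, Cat)}

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE animal (
        id   INTEGER PRIMARY KEY,
        kind TEXT NOT NULL CHECK (kind IN ('Animal', 'Dog', 'Cat')),
        name TEXT NOT NULL
    )
""")


def save(conn, obj):
    """Insert an object; its class name travels with the row."""
    cur = conn.execute(
        "INSERT INTO animal (kind, name) VALUES (?, ?)",
        (type(obj).__name__, obj.name),
    )
    return cur.lastrowid


def load(conn, row_id):
    """Rebuild an object using only what is stored in the row."""
    row = conn.execute(
        "SELECT kind, name FROM animal WHERE id = ?", (row_id,)
    ).fetchone()
    if row is None:
        raise KeyError(row_id)
    kind, name = row
    return REGISTRY[kind](name)
```

Because the class lives in an ordinary, constrained column, a row written by a plain SQL INSERT from some other tool loads just as well as one written through save().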

Looking over this list, I think these are all pretty tough requirements. You would be justified in asking "well, why not just use an object database, then?"

There are a few obvious reasons.

  • Requirements. Maybe you have to (or really really prefer to) use an SQL database. Your support staff only understands SQL; your SQL backups are automated; you really like SELECT queries and the command-line interface; or your boss tells you you have to.

  • Language neutrality. Say what you will, but SQL databases are admirably language neutral... suppose you have to access the database from multiple languages, like Java, Python, Ruby, and Perl. Most object databases are language-specific (for obvious reasons...) so you're stuck with a relational DBMS.

  • Maturity. I personally dislike this argument, but: SQL databases like Oracle, PostgreSQL, MySQL, etc. have a long history and the flaws are well known. Not so with ODBMSs.

  • Teamwork. You work with people that only grok SQL. I am sympathetic to this argument, coming from an academic environment with moderately high turnover and people who have relatively little software engineering background.

  • Query performance. If a lot of your data is fundamentally organized in relational ways, I bet your SELECT statements can be heavily optimized in ways that no object database can match.

  • Support. Lots of companies support SQL databases. Not so many support object databases.

OK, so what use is an ORM? I'm assuming anybody who's made it this far is already sold on ORMs, but just in case, here are a couple of my reasons:

  • Impedance mismatch. Object-oriented languages organize data differently than the normal SQL data-model. You really want to be able to take advantage of both. (Or at least I do.)

  • Programming reliability and security. There are a number of mistakes -- some obvious, some not so much -- that can be made by SQL programmers. Hell, you're generating SQL code in another language -- how can this not be problematic? (It's largely solved by using appropriate libraries for SQL access, mind you.)

  • Joins. I don't know about you, but I'm not smart enough to understand LEFT OUTER JOINs. (Could someone else please write a library to do it for me, intelligently?)
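The reliability-and-security item above is easy to demonstrate. The sketch below contrasts SQL assembled by string formatting with a parameterized query, using the standard-library sqlite3 module; the users table and the function names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])


def find_unsafe(conn, name):
    # Don't do this: the value is pasted straight into the SQL text,
    # so quote characters in `name` become part of the statement.
    query = "SELECT name FROM users WHERE name = '%s'" % name
    return conn.execute(query).fetchall()


def find_safe(conn, name):
    # The driver passes `name` as data; it is never parsed as SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


hostile = "x' OR '1'='1"
```

Calling find_unsafe(conn, hostile) matches every row in the table, while find_safe(conn, hostile) matches none, which is the behaviour you actually wanted.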

You would now be even more justified in calling me somewhat nuts. I have strict requirements for an ORM that are nigh impossible to meet, and lots of reasons why you might be stuck with an SQL database. Yet I've also given a few good reasons to use an ORM. What to do?

My first point: it's not an easy problem. That's why seriously smart people -- much smarter than me -- have thought deeply on the matter and come up with very little.

My second point: it's worth tackling. 'nuff said, here; I think the benefits are obvious.

My third point: I personally guess that there are solutions to most of the problems that I lay out for ORMs, and these solutions lie in the dynamic nature of languages like Python (and probably languages like Ruby and Perl). Certainly I can easily do interesting things in Python that are tremendously difficult to do in Java, although many of these things use the "black magic" of metaclasses.

OK, there's no real conclusion here.

I'm at least minimally satisfied with the approach I've taken in cucumber2. Again, YMMV. Apart from polishing and optimizing the code a bit, I'm thinking about taking a pyparsing-style approach to SELECTs. More on that next time I get the yen to hack on something other than twill ;).

I hope you're at least mildly entertained by my wild-eyed ORM discussion, and I look forward to the horde of disapproving comments. (Luckily, I've disabled comments on this blog, so I won't have to make them public if I don't want to. [0])

cheers,
--titus

p.s. apologies for the weird formatting... advogato *shrug*

[0] I feel compelled to point out that this sentence is a joke.

OCaml the Python way

Cute.

Ooh, ooh, I've got an ORM!

I'm deeply skeptical of Jeff Watkins' approach to ORMs, revealed via the Daily Python-URL.

For example, why the heck would "No Magic" be on the list of anyone using Python? Things like metaclasses and properties are effin' great at making code act like it should act based on how it reads. Sure, they can be abused -- but what can't? I've used metaclasses for several projects now -- twill and cucumber2, in particular -- and I'm extraordinarily happy with them. "Twiddling under the covers" can make things very neat. And, done properly, how can it not be Pythonic?
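As a rough illustration of the kind of "magic" being defended here, the sketch below uses a metaclass to turn declarative field markers into type-coercing properties. It is written for modern Python 3 and is not code from twill or cucumber2; Field, RecordMeta, Record, and Person are names invented for the example.

```python
class Field:
    """Marks a class attribute as a column; `kind` coerces assigned values."""

    def __init__(self, kind):
        self.kind = kind


def _make_property(name, kind):
    """Build a property that stores a coerced value in self._values."""

    def getter(self):
        return self._values.get(name)

    def setter(self, value):
        self._values[name] = kind(value)

    return property(getter, setter)


class RecordMeta(type):
    """Collect Field declarations and replace them with properties."""

    def __new__(mcs, clsname, bases, namespace):
        fields = {}
        for base in bases:
            # Inherit fields declared on parent record classes.
            fields.update(getattr(base, "_fields", {}))
        for attr, value in list(namespace.items()):
            if isinstance(value, Field):
                fields[attr] = value
                namespace[attr] = _make_property(attr, value.kind)
        namespace["_fields"] = fields
        return super().__new__(mcs, clsname, bases, namespace)


class Record(metaclass=RecordMeta):
    def __init__(self, **values):
        self._values = {}
        for name, value in values.items():
            if name not in self._fields:
                raise TypeError("unknown field: %s" % name)
            setattr(self, name, value)


class Person(Record):
    name = Field(str)
    age = Field(int)
```

The class body of Person reads like a schema, and `Person(name="Titus", age="30").age` comes back as the integer 30; all the bookkeeping stays out of the user-facing API, which is the argument for using such features in the first place.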

Then there's this. The logic appears to be: I hate COM, but must use it. It doesn't garbage collect (<-- inferred). Therefore I like languages that do garbage collect, like Java (blech) and JavaScript (blech**2). That's not logic, man -- that's faulty generalization, at least the way it's written.

Oh, and rather than half-baked CherryPy, you might check out Quixote. You'll have to roll your own more of the time, but I guarantee you you'll get consistency ;).

I definitely recommend the Python-Is-Not-Java and Java-Is-Not-Python, Either posts; some of the code you're posting looks awfully Java-like, and I'd be wary of porting ideas straight across.

--titus

Unmaintainable code

OK, this is brilliant.

Obviously I should rewrite my "open-source project truisms" post in this manner... another time, perhaps.

Speaking of which:

Open-source project truisms: feedback

Marius Gedminas suggests also making your source code repository browseable through the Web, and Scott Lamb boils much of my advice down to "use Trac". That way, you get a source-code browser & a Wiki, too, so that people can correct things on the site themselves. Michele Simionato came out against Wikis (in a separate exchange a few days ago) but I'm on the fence & slipping towards the side of the Wiki... I think they do more good than harm ;).

Marius also suggested putting up screenshots -- that's not applicable to developer libraries, but is a great idea for anything a non-developer/user is expected to touch.

On a somewhat separate note... Python docs

Submitting patches to Python is a huge pain, because of all of the overhead and review process. (And the sourceforge bug/patch system is really, really lousy.) This process is certainly legitimate for code, but I think it's overkill for documentation -- especially when you consider how the standard library has grown, and how poor the documentation is in some of the places I frequent (urllib2, anyone?)

One possible solution?

I've been thinking about setting up a Trac site and a darcs repository for editing Python documentation. The idea would be to use tailor to maintain a synced darcs version of the CVS/SVN documentation, and then build additional documentation off of that darcs base. The additional docs could thus be maintained in concert with the standard docs, but without the hassle of making them formal. Heck, with proper integration, the Trac wiki system could augment the whole thing, too.

I'm balking because I've had some recent experiences with darcs that suggest that it may not be ready for this kind of role. More anon. Thoughts welcome, however...

--titus

A few open-source project truisms

I'm sure these are all obvious, but I thought I'd write them down as part of my Saturday procrastination.

  • Make your source code repository available to anonymous users without any setup on your part. Don't hide it behind firewalls, passwords, or make it available "upon request".

    Why? That way, casual developers can drop in & visit your code. If your code is nice or your project is useful, you may even attract their input. They can also submit patches based on your latest code, rather than whatever release you last bothered to bundle.

    Corollaries: have a comprehensible build system that "just works"; if you have tests, same. This lowers the barrier to entry for developers.

    Pet peeve: academic projects tend to hide their source code, I think because they're afraid that someone will come along and steal their brilliant idea. Wrong, wrong, wrong -- unless you have some super-serious mojo hidden in there, and someone spends the time to dig it out, nobody will care about your project. So if you intend to make it open source, open it up immediately -- who knows, you might even get some participation!

  • Unless you're going to be very responsive and accommodating about patches from other developers, use a distributed version control system like arch or darcs.

    Why? This way, casual developers can (a) change hardcoded configuration settings without losing the ability to integrate your updates; (b) make sure that their own fixes jibe with your patches; (c) work with your package without the need for accounts on your development machine.

    Pet peeve: most open source projects in existence ignore this. It may not have been an option until relatively recently, but these distributed systems are now very usable. I don't know too many developers who are happy giving CVS write access to Joe Blow, but distributed version control obviates this.

  • Remember that getting users will generally not get you developers, but getting developers will get you users. (Maybe not as many as you want, admittedly, but you've got to start somewhere!) So, when starting out, cater to the developers. Unless you intend to go it alone & produce a completely usable system with a GUI all on your own, you want to start with the developers and finish with the users...

    Nobody has time to work on projects that aren't ultimately going to be useful to them, so you can be sure that active developers are using your code. "Just users", however, can be a drain on a project, especially when it's just starting out.

    Pet peeve: many projects make it abysmally difficult to play with them. The source is poorly organized or in a nonstandard layout, the dependencies are unclear, and there may even be patched versions of dependencies lying around on the main developer's machine. (Guilty, your honor...) The barrier to entry for developers has to be really low for most people I know to even glance at a project.

  • You need a Web page and a mailing list (preferably with a public archive).

    Why? How the hell else are people supposed to find your project and appreciate your brilliance? Seriously, my goal for most open source projects is to get people hooked & contributing, if only by complaining about bugs. They won't do that if they can't find it, figure out what it does, and download it. (They might very well be able to install it, especially if it uses a normal layout with configure & make/Makefile.PL/setup.py.) And they'll be able to see that other people are using it by looking at the mailing list; plus, google will help them find solutions posted by other people.

    What with sourceforge, berlios, and <insert your favorite hosting site here> you really have no excuse for running a project without a Web site and mailing list.

    Pet peeve: developers who think that someone's going to download their .tar.gz and read the documentation in order to figure out if a project is worthwhile. Guys, unless they already think the project is worthwhile, they're not going to download it...

Just my 2 cents. I'd appreciate counterarguments and additions; send 'em to me.

Pretty pictures

Street painting.

Droplets.

--titus

19 Nov 2005 (updated 19 Nov 2005 at 00:44 UTC)
Unit tests save my butt

Just updated twill to the latest versions of the wwwsearch/mechanize code. Because twill reaches moderately far into the wwwsearch code, some of the internal changes that John Lee made to mechanize affected the functioning of twill adversely.

Conveniently, my unit tests caught many problems and I could iterate through & fix them one by one.

Dunno what I would have done without 'em... probably slapped a "beta" sticker on it and asked my users to test it for me ;).

Web testing links

Grig's recent roundup of Web testing tools missed a few. (OK, to be fair not all of them are "recently updated"!) I've put 'em all into the twill README. It's quite a list!

Good experiences with Python embedding

For you planetpython people,

Iago Rubio talks about how wonderful Python embedding can be...

--titus

Delusions of grandeur

Sparked by Ian Bicking's simple implementation of the twill language in JavaScript, as well as Robert Marchetti's Python-IE bridge project, PAMIE, I googled about and found a very recent article on PyXPCOM, too.

It should be relatively easy to build a common API layer for JavaScript, mechanize/mechanoid/zope.testbrowser, PAMIE, and PyXPCOM that supports the twill language. Then you'd be able to have a single twill script that runs in-browser & from the command line, and also manipulates PAMIE and PyXPCOM. Wouldn't that be nice?
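A very rough sketch of what such a common layer might look like: one small backend interface, and a script runner that only talks to that interface. Everything here is hypothetical (BrowserBackend, FakeBackend, run_script); only the command names go, code, and find are borrowed from the twill language, and real backends for mechanize, PAMIE, or PyXPCOM would have to be written separately.

```python
import re
from abc import ABC, abstractmethod


class BrowserBackend(ABC):
    """The minimum a backend must offer to run the commands below."""

    @abstractmethod
    def go(self, url):
        """Load the given URL."""

    @abstractmethod
    def get_code(self):
        """Return the HTTP status code of the current page."""

    @abstractmethod
    def get_html(self):
        """Return the current page source as a string."""


class FakeBackend(BrowserBackend):
    """In-memory stand-in backend, used here so the sketch is runnable."""

    def __init__(self, pages):
        self.pages = pages
        self.url = None

    def go(self, url):
        self.url = url

    def get_code(self):
        return 200 if self.url in self.pages else 404

    def get_html(self):
        return self.pages.get(self.url, "")


def run_script(script, backend):
    """Run a tiny subset of twill-style commands against any backend."""
    for lineno, line in enumerate(script.splitlines(), 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        command, _, arg = line.partition(" ")
        arg = arg.strip()
        if command == "go":
            backend.go(arg)
        elif command == "code":
            if backend.get_code() != int(arg):
                raise AssertionError("line %d: code is %s, expected %s"
                                     % (lineno, backend.get_code(), arg))
        elif command == "find":
            if not re.search(arg, backend.get_html()):
                raise AssertionError("line %d: no match for %r" % (lineno, arg))
        else:
            raise ValueError("line %d: unknown command %r" % (lineno, command))
```

The point of the design is that run_script never imports a browser library; swapping the command-line backend for an in-browser one would mean supplying a different BrowserBackend subclass and nothing else.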

Just a thought.

--titus

E-mail to my titus at caltech.edu address is failing. If, for some reason, you feel the need to contact me ... use titus at idyll.org.

For now I've had to set my Reply-To header to that, too. How frustrating; I'm sure I've lost e-mail, but I have no idea what, and I have no ability to fix the problem. I feel like a user. Argh.

(Nothing to see here. Continue about your business.)

--titus
