"85% unit tested". ;)
utidylib. Some of them (elementtree) are part of other packages or require stuff that I don't want to bundle or require (utidylib requires ctypes); most of them require the tidylib binary and then interface with it. Because I want the twill distro to be cross-platform, I decided to go with the approach taken in ElementTree TidyTools, which relies only on the command-line binary. Inspection of the code revealed that it simply executed os.system, without much in the way of error trapping, so I ended up rolling my own (search for 'run_tidy'). Whee.
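For illustration, here is a minimal sketch of what calling the command-line binary with some error trapping might look like. This is not twill's actual 'run_tidy': the function body, the "tidy" command name, the timeout, and the return convention are all assumptions, and it uses the modern subprocess module rather than os.system.

```python
import subprocess

def run_tidy(html, tidy_cmd="tidy", timeout=30):
    """Pipe `html` through the tidy binary.

    Returns (clean_html, warnings) on success, or (None, message) if the
    binary is missing, times out, or reports hard errors.
    Hypothetical helper; not the function shipped in twill.
    """
    try:
        proc = subprocess.run(
            [tidy_cmd, "-q"],       # -q: quiet, suppress the banner
            input=html,             # feed the page on stdin
            capture_output=True,    # collect stdout (HTML) and stderr (warnings)
            text=True,
            timeout=timeout,
        )
    except OSError as exc:
        # Covers a missing or non-executable binary.
        return None, "could not run %s: %s" % (tidy_cmd, exc)
    except subprocess.TimeoutExpired:
        return None, "%s timed out after %s seconds" % (tidy_cmd, timeout)

    # tidy conventionally exits with 0 (clean), 1 (warnings), 2 (errors).
    if proc.returncode not in (0, 1):
        return None, proc.stderr
    return proc.stdout, proc.stderr
```

The point of the wrapper is that a missing binary becomes an ordinary return value the caller can check, rather than a silently ignored exit status.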
So, in the next release of twill, it will automatically preprocess stuff with tidy unless you turn it off; you can also assert that pages have no 'tidy' warnings.
The (imminent) next release of twill, twill 0.8, will include support for Python Eggs.
When I started, I was worried about a few technical issues: for example, I include pyparsing and mechanize/ClientForm/ClientCookie/pullparser within the twill distribution, and then munge sys.path to load them first. How would this work with eggs? No problem; the same path-munging code works whether I'm loading from a directory or a zip file. (I just use os.path.join.)
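A rough sketch of that kind of path munging, assuming the bundled packages live in a subdirectory next to the package's __init__.py. The helper name and the "other_packages" directory name are hypothetical and only stand in for whatever layout the package really uses.

```python
import os
import sys

def prepend_bundled_path(package_file, subdir="other_packages"):
    """Put <package dir>/<subdir> at the front of sys.path.

    `package_file` would normally be the package's own __file__. When the
    package is imported from a zipped egg, __file__ points inside the
    archive, and the joined path is still a usable sys.path entry because
    the zip importer accepts "archive/sub/dir" style paths.
    """
    base = os.path.dirname(package_file)
    bundled = os.path.join(base, subdir)
    if bundled not in sys.path:
        sys.path.insert(0, bundled)   # first entry wins over site-packages
    return bundled
```

Called as prepend_bundled_path(__file__) from the package's __init__.py, the same line serves both the unpacked-directory and the zipped-egg case.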
Version numbering: would upgrading etc. work nicely? Yep. The pkg_resources version handling is so smart, it's not even inspired. (...by which I mean that it's brilliantly simple.)
As a bonus, it will be even easier to distribute "development" versions of twill. I can just build an egg with an alpha version number, e.g. '0.8.1a1' or '0.8.1a2', link to 'em on a page, and then point people to that page. easy_install will do the rest. In fact, I don't even need to build the page manually: I can just tell Apache to make my development dist/ directory available to the public via "Options +Indexes".
For example, typing
easy_install -f http://issola.caltech.edu/~t/twill-dist/ twill
will automatically scan for the latest version and install it. Nifty.
So far, my main gripe? 'ez_setup' is an ugly name, and it's an ugly file to have sitting around in my main development directory. (You may recall that I dislike cluttering up the main directory. So call me picky ;).
Today is a day for... Miscellany!
Re my long post on object-relational mappers:
Jonathan Ellis points me towards a fairly negative post on PostgreSQL table inheritance, which cucumber2 uses. The thread basically states that no one is maintaining table inheritance and that only inertia is keeping it in the code. My impression was somewhat the opposite: I've seen statements that table inheritance will not be taken out, because there are people using it. *shrug* It's a neat feature, IMO.
Jonathan also points me towards PyDO2, which seems to have good documentation and a philosophy that supports working on the database with other tools. I've seen PyDO before but never had a chance to play with it seriously. I like the look of the code, though, on a cursory inspection.
An Open-Source Story: Producing Error-Free Software is Hard
Via RISKS, this story on an optimization bug in gcc (or so I infer) that affected X, and perhaps many other pieces of code. Whoo.
I'll go into this more later, but it's worth mentioning that anecdotal evidence from genome annotation suggests that the PHP model (of allowing at least somewhat uncontrolled posting of information to docs) elicits far more contributions than rigorous up-front quality control. The reason? Experts won't go out of their way to add information on something they understand well, but they will put in the time to correct something that's just plain wrong. So you've just got to put in mechanisms to facilitate this kind of interaction.
Using arch/darcs from Windows
In a response to my open source project truisms page, Moof points out that darcs and arch don't work very well for Windows. I'm sure he's right: I tend to forget about that platform; when I do have to develop for it, I try to use cygwin. So to get Windows developers you've got to use something like svn or CVS. And, as he points out, there are a lot more Windows developers out there than developers for any other platform... so you want to get Windows developers.
Are there any darcs competitors out there for TortoiseSVN?
Moof also echoes Marius Gedminas's point that Trac is something worth keeping an eye on. Trac is dangerously close to becoming "SourceForge in a box", which would be a good answer to most of my suggestions on how to run an OSS project.
In other news, it may be time to go get a blog that allows comments ;).
Object-relational stuff, revisited
After I dissed Jeff Watkins' ORM assumptions & logic, Sean Jensen-Gray staged an intervention & basically told me I was acting like a git. He's undoubtedly right, and I'm continuing that part of the conversation off-line where it belongs.
However, in the name of separating the smoke from the fire, here's some more discussion about ORMs.
First of all, here's my ORM, cucumber2, just so y'all know where I'm coming from. I make no claims about generality or quality or goodness, except to say that I like it & have been using its predecessor for over 4 yrs now. Works great. cucumber2 is some of the nicer bits of concept code I've ever written; it's definitely on my refrigerator. (YMMV...)
Based on my relatively minimal experience with ORMs, then, here are some of my own beliefs about ORM writing in Python.
Properties, metaclasses, introspection, dynamic code generation, and "under-cover twiddling" can all help make a clean piece of code. Not using them can hurt by making your code over-verbose and cluttering your APIs with information not relevant to the task at hand.
Document your use, test your use, sure -- but use them.
Do I need to say more? Just think: how do you encode a collection in a database? (Make sure you're maintaining referential integrity in your answer...) How do you encode an inheritance hierarchy? These are simple examples of a serious mismatch between the relational model and the object model. This is the problem that new ORMs should try to solve, IMO.
There's lots of discussion about using database-specific features in the SQL world (although my google-fu is failing me...), so I won't rehash that. I come down solidly on the side of committing yourself to a specific database. I think it's particularly important in the case of an ORM, which may use *very* database-specific stuff to work its magic (e.g. cucumber2 and the PostgreSQL ORDBMS features). Porting this magic between databases is likely to get very hairy & involve lots of additional complexity.
The attempt to make your ORM generic to multiple databases may well be a specific case of premature optimization (below); it seems like over-reaching oneself by attempting to encompass database-generic issues prior to settling on a good, clean API.
Do you have specialized metainfo that will break SQL queries/inserts/etc. that don't know about this information? If so, this seriously reduces the utility of your databases: you can't use external tools any more, without adding in ORM-specific awareness.
Even if you can hack this in with triggers and VIEWs, you're adding a whole 'nother layer of complexity. Bad.
This covers things like caching and cache invalidation code, which in my experience is difficult to handle generically (although possible, esp. if you only allow transaction-wrapped access). Also, SQL query optimization is tough to do in SQL, much less in a layer wrapped around SQL. In many cases, you should consider optimizing by writing app/data-model-specific SELECT statements that integrate with your ORM interface.
Most of all, think of your ORM like an object database. Layering a procedural interface on top of an SQL database isn't building an ORM -- it's building a library that talks to an SQL database. Useful, but probably not new. If you solve a hard problem -- even poorly -- that's new.
For example, one of my absolute requirements: can you determine the class of an "object" (row, tuple, whatever) in the database without using metadata that's stored external to the database (like, say, in your Python object)? I think that's a pretty ORMy requirement, myself, and it helps to not violate condition #4 (straight SQL) above. Another requirement: can you store object hierarchies straightforwardly? Again, seems ORMy to me, but it speaks to the impedance mismatch problem -- it's a tough requirement.
Looking over this list, I think these are all pretty tough requirements. You would be justified in asking "well, why not just use an object database, then?"
There are a few obvious reasons.
OK, so what use is an ORM? I'm assuming anybody who's made it this far is already sold on ORMs, but just in case, here are a couple of my reasons:
You would now be even more justified in calling me somewhat nuts. I have strict requirements for an ORM that are nigh impossible to meet, and lots of reasons why you might be stuck with an SQL database. Yet I've also given a few good reasons to use an ORM. What to do?
My first point: it's not an easy problem. That's why seriously smart people -- much smarter than me -- have thought deeply on the matter and come up with very little.
My second point: it's worth tackling. 'nuff said, here; I think the benefits are obvious.
My third point: I personally guess that there are solutions to most of the problems that I lay out for ORMs, and these solutions lie in the dynamic nature of languages like Python (and probably languages like Ruby and Perl). Certainly I can easily do interesting things in Python that are tremendously difficult to do in Java, although many of these things use the "black magic" of metaclasses.
OK, there's no real conclusion here.
I'm at least minimally satisfied with the approach I've taken in cucumber2. Again, YMMV. Apart from polishing and optimizing the code a bit, I'm thinking about taking a pyparsing-style approach to SELECTs. More on that next time I get the yen to hack on something other than twill ;).
I hope you're at least mildly entertained by my wild-eyed ORM discussion, and I look forward to the horde of disapproving comments. (Luckily, I've disabled comments on this blog, so I won't have to make them public if I don't want to. )
p.s. apologies for the weird formatting... advogato *shrug*
 I feel compelled to point out that this sentence is a joke.
Ooh, ooh, I've got an ORM!
I'm deeply skeptical of Jeff Watkins' approach to ORMs, revealed via the Daily Python-URL.
For example, why the heck would "No Magic" be on the list of anyone using Python? Things like metaclasses and properties are effin' great at making code act like it should act based on how it reads. Sure, they can be abused -- but what can't? I've used metaclasses for several projects now -- twill and cucumber2, in particular -- and I'm extraordinarily happy with them. "Twiddling under the covers" can make things very neat. And, done properly, how can it not be Pythonic?
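To make the "twiddling under the covers" point concrete, here is a small sketch of a metaclass that turns a declared list of columns into properties. It is not code from twill or cucumber2; the class names, the `columns` attribute, and the dirty-tracking are all made up for illustration.

```python
class RecordMeta(type):
    """Turn a class-level `columns` tuple into one property per column."""

    def __new__(mcls, name, bases, namespace):
        for column in namespace.get("columns", ()):
            namespace[column] = mcls._make_property(column)
        return super().__new__(mcls, name, bases, namespace)

    @staticmethod
    def _make_property(column):
        def getter(self):
            return self._values[column]

        def setter(self, value):
            self._values[column] = value
            self._dirty.add(column)   # remember what needs writing back

        return property(getter, setter)


class Record(metaclass=RecordMeta):
    columns = ()

    def __init__(self, **values):
        self._values = dict.fromkeys(self.columns)  # every column starts as None
        self._values.update(values)
        self._dirty = set()


class Person(Record):
    # The only thing a user of the hypothetical ORM has to write.
    columns = ("name", "email")
```

With this in place, `p = Person(name="Ada")` followed by `p.email = "ada@example.org"` reads like plain attribute access, while the record quietly tracks that "email" has changed.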
OK, this is brilliant.
Obviously I should rewrite my "open-source project truisms" post in this manner... another time, perhaps.
Speaking of which:
Open-source project truisms: feedback
Marius Gedminas suggests also making your source code repository browseable through the Web, and Scott Lamb boils much of my advice down to "use Trac". That way, you get a source-code browser & a Wiki, too, so that people can correct things on the site themselves. Michele Simionato came out against Wikis (in a separate exchange a few days ago) but I'm on the fence & slipping towards the side of the Wiki... I think they do more good than harm ;).
Marius also suggested putting up screenshots -- that's not applicable to developer libraries, but is a great idea for anything a non-developer/user is expected to touch.
On a somewhat separate note... Python docs
Submitting patches to Python is a huge pain, because of all of the overhead and review process. (And the sourceforge bug/patch system is really, really lousy.) This process is certainly legitimate for code, but I think it's overkill for documentation -- especially when you consider how the standard library has grown, and how poor the documentation is in some of the places I frequent (urllib2, anyone?)
One possible solution?
I've been thinking about setting up a Trac site and a darcs repository for editing Python documentation. The idea would be to use tailor to maintain a synced darcs version of the CVS/SVN documentation, and then build additional documentation off of that darcs base. The additional docs could thus be maintained in concert with the standard docs, but without the hassle of making them formal. Heck, with proper integration, the Trac wiki system could augment the whole thing, too.
I'm balking because I've had some recent experiences with darcs that suggest that it may not be ready for this kind of role. More anon. Thoughts welcome, however...
I'm sure these are all obvious, but I thought I'd write them down as part of my Saturday procrastination.
Why? That way, casual developers can drop in & visit your code. If your code is nice or your project is useful, you may even attract their input. They can also submit patches based on your latest code, rather than whatever release you last bothered to bundle.
Corollaries: have a comprehensible build system that "just works"; if you have tests, same. This lowers the barrier to entry for developers.
Pet peeve: academic projects tend to hide their source code, I think because they're afraid that someone will come along and steal their brilliant idea. Wrong, wrong, wrong -- unless you have some super-serious mojo hidden in there, and someone spends the time to dig it out, nobody will care about your project. So if you intend to make it open source, open it up immediately -- who knows, you might even get some participation!
Why? This way, casual developers can (a) change hardcoded configuration settings without losing the ability to integrate your updates; (b) make sure that their own fixes jibe with your patches; (c) work with your package without the need for accounts on your development machine.
Pet peeve: most open source projects in existence ignore this. It may not have been an option until relatively recently, but these distributed systems are now very usable. I don't know too many developers who are happy giving CVS write access to Joe Blow, but distributed version control obviates this.
Nobody has time to work on projects that aren't ultimately going to be useful to them, so you can be sure that active developers are using your code. "Just users", however, can be a drain on a project, especially when it's just starting out.
Pet peeve: many projects make it abysmally difficult to play with them. The source is poorly organized or in a nonstandard layout, the dependencies are unclear, and there may even be patched versions of dependencies lying around on the main developer's machine. (Guilty, your honor...) The barrier to entry for developers has to be really low for most people I know to even glance at a project.
Why? How the hell else are people supposed to find your project and appreciate your brilliance? Seriously, my goal for most open source projects is to get people hooked & contributing, if only by complaining about bugs. They won't do that if they can't find it, figure out what it does, and download it. (They might very well be able to install it, especially if it uses a normal layout with configure & make/Makefile.PL/setup.py.) And they'll be able to see that other people are using it by looking at the mailing list; plus, google will help them find solutions posted by other people.
What with sourceforge, berlios, and <insert your favorite hosting site here> you really have no excuse for running a project without a Web site and mailing list.
Pet peeve: developers who think that someone's going to download their .tar.gz and read the documentation in order to figure out if a project is worthwhile. Guys, unless they already think the project is worthwhile, they're not going to download it...
Just my 2 cents. I'd appreciate counterarguments and additions; send 'em to me.
Just updated twill to the latest versions of the wwwsearch/mechanize code. Because twill reaches moderately far into the wwwsearch code, some of the internal changes that John Lee made to mechanize affected the functioning of twill adversely.
Conveniently, my unit tests caught many problems and I could iterate through & fix them one by one.
Dunno what I would have done without 'em... probably slapped a "beta" sticker on it and asked my users to test it for me ;).
Web testing links
Good experiences with Python embedding
For you planetpython people,
Iago Rubio talks about how wonderful Python embedding can be...
Just a thought.
E-mail to my titus at caltech.edu address is failing. If, for some reason, you feel the need to contact me ... use titus at idyll.org.
For now I've had to set my Reply-To header to that, too. How frustrating; I'm sure I've lost e-mail, but I have no idea what, and I have no ability to fix the problem. I feel like a user. Argh.
(Nothing to see here. Continue about your business.)