I should mention that python is now my main language of choice - however, the choice of programming language is irrelevant: this article is mostly about SQL - decent, mature SQL servers (so that excludes MySQL) which have had Views, Stored Procedures, Constraints and Foreign Keys for over five years. So, before we begin, there's a very important issue that I should clear up RIGHT away:
Why should I avoid MySQL like the Black Plague?
I should mention that although up until a year ago I thought MySQL was great, and PostgreSQL was 'too complicated' and 'slower', I've since worked on large projects, and the lack of Views and Foreign Keys in MySQL makes it an absolute nightmare. I cannot emphasise this enough - and there are several articles which outline it: if you believe that you can do a decent job with MySQL, then you do not know enough about databases to do a decent job. stop. now. quit. go do something else: learn a new trade. I hear that flower-arranging is popular these days.
What's that, you say: MySQL 5 has views and even transactions? wow. great. give it another four years to mature as much as PostgreSQL before even considering using it. What's that, you say: you can't wait that long? Well then - use PostgreSQL instead.
Glad we've got that over with: normal service will now be resumed...
What is 'dynamic SQL'?
Dynamic SQL is where you write a code library that generates SQL and, in some cases, puts an object or class wrapper around the data in the SQL database. In the extreme cases (such as SQLobject) you never actually get to know that you're accessing a database at all: you see a list of objects, which you can create, access, update and delete (including tables themselves).
For example, SQLobject's object-orientated interface makes it quite tricky to do relational-database access, so you really have to know what you are doing; the best approach is to put all the really _really_ difficult stuff into SQL Views, if you do not want to become intimately familiar with SQLbuilder. SQLobject can handle simple joins and relationships automatically - it's only when you get more involved that you have to start digging.
Any decent programmer who has worked with SQL for several months will, if they didn't start out using Dynamic SQL generation, soon wish that they _had_ - no matter what programming language they are using.
SQL generators must be one of the most reinvented libraries ever. As soon as you start to write your own SQL "escape" function (converting quotes, newlines, tabs and carriage-returns) - the first thing you will need to do to ensure that users cannot perform SQL injection attacks - you are immediately in trouble. The next thing you will write (if you are doing MSSQL) is a 'bracketer' function which puts square brackets round every field name and table name. Soon after that, you will write a WHERE clause creator - usually involving a list of statements to be ANDed together - e.g. 'AND\t'.join(sql_statement_list). Some time after _that_, you will consider writing an 'update' joiner function, and maybe one which creates the 'INSERT' syntax, so that everything can receive variable-length arguments (in python, a dictionary; or its equivalent in your programming language of choice).
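The helpers described above look something like this - a toy sketch, with names and details invented for illustration (real code should prefer parameterised queries for values rather than hand-escaping):

```python
def sql_escape(value):
    # the classic hand-rolled escape: double up single quotes so a
    # user-supplied string cannot terminate the SQL string literal
    return str(value).replace("'", "''")

def bracket(name):
    # MSSQL-style 'bracketer': wrap every field and table name
    return "[%s]" % name

def make_where(conditions):
    # WHERE clause creator: AND together a list of SQL fragments
    return " AND ".join(conditions)

def make_update(table, fields):
    # 'update' joiner: build an UPDATE from a dictionary of field/value pairs
    sets = ", ".join("%s='%s'" % (bracket(k), sql_escape(v))
                     for k, v in sorted(fields.items()))
    return "UPDATE %s SET %s" % (bracket(table), sets)

def make_insert(table, fields):
    # and the INSERT equivalent, from the same dictionary
    items = sorted(fields.items())
    cols = ", ".join(bracket(k) for k, _ in items)
    vals = ", ".join("'%s'" % sql_escape(v) for _, v in items)
    return "INSERT INTO %s (%s) VALUES (%s)" % (bracket(table), cols, vals)

print(make_update("users", {"name": "O'Brien", "city": "Cork"}))
# UPDATE [users] SET [city]='Cork', [name]='O''Brien'
```

Four tiny functions - and, as the article says, you are already most of the way to a dynamic SQL generator.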
aaaand.... congratulations: you are now well on the way to reinventing a dynamic SQL generator - exactly like I and pretty much every software engineer who's been involved in SQL did, 5 years ago.
The code that I wrote is called pysqldb - and it's pretty clunky, but it works. I've recently discovered that MSSQL has a simple means to query relationships (foreign keys) and also primary keys, and so I'm in the middle of giving pysqldb something of an overhaul. Why? Because I'm familiar with it, that's why.
What the heck are Views?
For the MySQL brain-dead programmers out there, allow me to explain what even MS-Access users have known about for years: a concept called 'Views'. Views are basically a SELECT statement that appears to be a read-only table. You can then use that View in other SELECT statements to make JOINs against it with other Tables and Views (including, if you have a particularly sophisticated SQL server, in sub-SELECTs). The one thing that you cannot do with a VIEW (also known as a Query in MS-Access) is write to it. In other words - you can't do a DELETE, UPDATE or INSERT on a View.
There are circumstances where you absolutely HAVE to use VIEWs. If you do an aggregation (using COUNT or MAX) with a HAVING clause _and_ a WHERE clause _and_ a GROUP BY clause, it is impossible to JOIN the aggregated results with another table.
Some SQL servers will allow you to do the above complex Query as a sub-SELECT, which you will then be able to JOIN against other tables (or other sub-SELECTs)... but all a sub-SELECT really is is a VIEW without a name...
... it's a long story. Suffice to say that VIEWs are extremely useful, especially if you have programmers whose knowledge of SQL is limited: you can tell them 'go access this View, get your results, stop freaking out'.
Also, there is a significant advantage to VIEWs: if your entire programming team exclusively uses VIEWs to read the data, then you can change the ENTIRE database model around - without impacting a single line of their code.
What are 'Stored Procedures'?
Stored Procedures, stored procedures... god, these are loved and hated - and I hate them (for very specific reasons). Stored procedures are basically the way to get round the fact that VIEWs are read-only: a Stored Procedure is written in a special programming language that is usually unique to the database you are using. Oracle, MSSQL and even PostgreSQL each have their own. Not surprisingly, MSSQL's stored procedure programming language looks like BASIC.
Some stored procedures can be called specifically when certain events occur - INSERTs, DELETEs or UPDATEs on specific tables: these have a special name called 'Triggers'.
I'm not going to go into great detail, here, other than to briefly mention that they exist: if you want more details, there are plenty of articles, books and resources on the Internet about Stored Procedures and Triggers.
Why would you use Stored Procedures?
Triggers are a good example for when you have, say, several programmers all accessing the database, using different programming languages: not everybody is going to be bright and clever. Let's imagine that you have 100 tables, and there is a rule that every table must have a 'created' date-time stamp field and also a 'lastmodified' field. Expecting several teams of programmers to all ensure that, on an INSERT, they set both the created and lastmodified timestamps correctly - when the users' computers are going to be in random timezones - and likewise that they set the lastmodified field on an UPDATE, is simply unrealistic.
So the alternative is to create a 'Trigger' which 'captures' any INSERT or UPDATE event, and modifies the created and lastmodified fields for you. In this way, you only need to ensure that the SQL server's timezone is set to the required zone, and you know that the timestamps are going to be consistent.
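A sketch of exactly that Trigger, in SQLite syntax via python's sqlite3 (MSSQL's trigger syntax differs, but the idea is identical; the table name is invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE invoices (
        id INTEGER PRIMARY KEY,
        total INTEGER,
        created TEXT,
        lastmodified TEXT
    );

    -- stamp both fields server-side on INSERT, so no client can get it wrong
    CREATE TRIGGER invoices_insert AFTER INSERT ON invoices
    BEGIN
        UPDATE invoices
        SET created = datetime('now'), lastmodified = datetime('now')
        WHERE id = NEW.id;
    END;

    -- and refresh lastmodified whenever the real data changes
    CREATE TRIGGER invoices_update AFTER UPDATE OF total ON invoices
    BEGIN
        UPDATE invoices SET lastmodified = datetime('now') WHERE id = NEW.id;
    END;
""")

# the client INSERTs only the data it knows about...
db.execute("INSERT INTO invoices (total) VALUES (100)")
# ...and the timestamps appear anyway, in the server's timezone
created, modified = db.execute(
    "SELECT created, lastmodified FROM invoices WHERE id = 1").fetchone()
print(created is not None and modified is not None)  # True
```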
Another good reason is that, just like with VIEWs for reading data, if you have Stored Procedures to perform INSERTs, UPDATEs and DELETEs instead of directly manipulating the database tables, then the database design can be changed at will WITHOUT a nightmare round of meetings with all of the programming teams for the different projects using the database: upgrades and changes to the underlying database design should, in theory, go without a hitch...
... that's the theory, anyway. More on this later, because it's not as straightforward as it sounds.
There is, however, one circumstance where Stored Procedures come into their own, and it's when you design nearly the ENTIRE project in the language of the SQL server you are using. It has been known for people who don't know any other programming language to write a web service entirely inside the stored procedures: the result returned from the stored procedure is a string containing the HTML to be spewed out, and the inputs are the HTTP POST or GET arguments and the HTTP URL.
Crazed - but true.
Another less extreme but still hair-raising example I have heard about was a project which was equivalent to MySQLadmin. The advantage of this project was of course that it had direct access to the database table design (via the SQL programming language) and so could write code which output add, edit and view HTML (GET and POST) for all tables, along with a means to handle simple subsets of table columns. The stored procedures would even analyse a field's data type and output an HTML textarea or input field as appropriate - with the right size, too, by looking at the length of the field.
These last two examples should not give you a warm fuzzy feeling about Stored Procedures - which is actually the ultimate aim of this article.
When that warm fuzzy feeling has turned to lead and is giving you heart murmurs and palpitations at the mention of the word 'Stored Procedure', then I will rest happy.
Why you should not use Stored Procedures
Of the examples above, there is one - Triggers - which I cannot argue with: that is an extremely effective use of Stored Procedures. The other examples, however, should scare you silly: here's why.
Firstly - the "Create a matching INSERT, matching UPDATE and matching DELETE stored procedure to manage the data, just like views, so that the underlying database design can change" example. When first considered, it sounds great. However, it didn't take me very long to come across an article which described how much of a nightmare this is, on a large database project.
The clue is in the bit about one stored procedure per INSERT, UPDATE and DELETE. That means three stored procedures. Per 'object' that you want to modify in some way. If you have 100 tables, it means that you need a full-time junior programmer whose sole, soul-destroying job is to write brain-dead stored procedures - anything up to four hundred of them.
Additionally, as you are no doubt aware, no project EVER stays still: it doesn't take much imagination to realise that you've made the Stored Procedures a critical part of the project's development and maintenance cycle.
Whereas you thought it was going to make your life easier, using Stored Procedures for this purpose actually introduces nightmare maintenance and human resource synchronisation issues: if the main SQL designer, _or_ the junior programmer, _or_ the coders _happen_ not to be available, then your project cannot proceed.
Additionally, the programming language in most SQL databases was typically designed 15 to 20 years ago. Data types like Lists and Dictionaries and Objects, which are the staple diet of modern programming languages, simply don't exist - to the extent where (god help us) microsoft is now allowing .NET programs to run inside MSSQL as Stored Procedures and Triggers.
For example, in my current project, I hesitated to use pysqldb for several months. Instead, I wrote a table_update() function which took the name of a table, and a python dictionary of field-value pairs from which to generate an UPDATE statement. The number of arguments is entirely variable, and the number of possible field combinations is 2^N, where N is the number of fields in the table you want to update.
When I considered writing a Stored Procedure to match this function, I naively expected MSSQL's stored procedure language to have a similar concept - accepting a dictionary of field-value pairs as one of its arguments. No such luck. I would therefore have had to go through ALL of my code, looking for places where table_update() was used - ripping them all out, and creating ONE hard-coded stored procedure PER update.
Complete nightmare. Naturally, I am not using Stored Procedures.
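To make the contrast concrete, here is a sketch of such a dict-driven update function (sqlite3 used for illustration; this is my own toy, not the actual pysqldb code) - one function, any combination of fields, no per-combination stored procedure:

```python
import sqlite3

def table_update(db, table, key, fields):
    # build an UPDATE from an arbitrary dictionary of field/value pairs,
    # using DB-API parameter substitution for the values
    # (table and column names must still be trusted or whitelisted)
    items = sorted(fields.items())
    sets = ", ".join("%s = ?" % name for name, _ in items)
    params = [value for _, value in items] + [key]
    db.execute("UPDATE %s SET %s WHERE id = ?" % (table, sets), params)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, room TEXT)")
db.execute("INSERT INTO staff VALUES (1, 'alice', 'A1')")

# any subset of fields works - this is the flexibility a hard-coded
# stored procedure per update cannot give you
table_update(db, "staff", 1, {"room": "B2"})
table_update(db, "staff", 1, {"name": "bob", "room": "C3"})
print(db.execute("SELECT name, room FROM staff WHERE id = 1").fetchone())
# ('bob', 'C3')
```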
The other two examples are fairly self-explanatory: if you have the expertise, and there isn't too much dependence on other programming teams - go for it: write the entire program in the SQL programming language. In the case of the Web-server-in-SQL-language example, the programmer was tasked with a simple job: make a really reliable and extremely fast web service. As he only knew one programming language (SQL stored procedures) the choice was a no-brainer.
So what am I going to do??? HELP!
By now, you should be scared shitless of using Stored Procedures on large projects involving multiple programming teams and multiple programming languages - but quite happy to use Triggers for very specific data-consistency purposes and other essential integrity checks, such as not allowing invoices to be deleted unless they have been marked as archived and the issue date is over seven years old. (In cases where you _don't_ have such a trigger: do ALL of your programming teams obey this critical rule - one over which your company could be sued to the bedrock for not keeping proper data audit trails?)
You should also be drumming it into your programmers that they should have, buy, or be developing a Dynamic SQL generator and/or SQL object-class accessor in the programming language of their choice (I hear that .NET now comes with such a SQL-accessing class by default).
Writing even the most basic of Dynamic SQL-accessing libraries is something that doesn't take more than 1,000 lines of code in a decent Object-Orientated programming language: even if it can only do insert, update, delete and select on individual tables, you can still write Views and then use your basic SQL-accessing library to read each View as if it were a table. Perhaps the reason _why_ there are so many reinvented Dynamic SQL libraries is _because_ it's not exactly rocket science, and you end up with (relatively speaking) so little code.
The reason why SQLobject is so large is because it supports so many of the available SQL constructs - DISTINCT, COUNT, MAX etc. - and even automatically adds in LIMIT and OFFSET on demand when a list is part-accessed (e.g. for row in results[start:start+pagesize]).
Oh - and it has caching, and copes with Transactions as well, keeping the cache consistent even across Transactions. Oh - and it copes with six different database back-ends, too, including SQLite, MySQL, PostgreSQL and MSSQL.
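The slicing-to-LIMIT/OFFSET trick is worth seeing in miniature. This is a toy illustration of the mechanism - not SQLobject's real code - showing how __getitem__ can translate a slice into SQL at the last moment:

```python
class SelectResults:
    # toy result-set wrapper: holds the query unexecuted, and only
    # decides LIMIT/OFFSET when the caller slices it
    def __init__(self, query):
        self.query = query

    def __getitem__(self, s):
        # results[start:stop] becomes LIMIT (stop-start) OFFSET start
        limit = s.stop - s.start
        return "%s LIMIT %d OFFSET %d" % (self.query, limit, s.start)

results = SelectResults("SELECT * FROM person")
print(results[20:30])  # SELECT * FROM person LIMIT 10 OFFSET 20
```

A real implementation would execute the limited query and yield rows, of course; the point is that paging falls out of ordinary python slicing syntax.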
In essence: large projects that involve SQL are hard work, and there aren't too many 'magic bullets'. Stored Procedures is a bullet all right: a hollow-tipped depleted-uranium high-calibre round designed for maximum wounding so that your severely crippled project ends up in intensive care - and stays there, instead of being effective.
So why the heck are you writing this article?
Basically it's to give you the benefit of some of the stories I've encountered - but also it's to mention a project that I'm working on, and been thinking about for at least four years. If you've ever used MS-Access 'Forms', or used OpenOffice 2.0 'Base', then you will have a good idea of what I am going to say.
I am presently writing (in python) some infrastructure to manage databases, programmatically, in a similar way to what MS-Access and Base provide to ordinary users. Ultimately, though, on top of this infrastructure, I want to provide a web service which allows you to do the same thing as Forms and Base.
All of it will be done with Dynamic SQL generation: nothing will be in Views or Stored Procedures. No matter how complex the database design, it should be a simple matter of chaining together some tables and the fields that you want (using the primary keys and the relationships - foreign keys - by default unless specifically told otherwise), and you will be presented with a means (first programmatically and then ultimately with 'Forms' via a web service) to view, edit and add data.
If you have an associative table and a lookup table, then a multi-select HTML drop-down box will be populated from the lookup table, with the selected items picked from the associative table; when you click 'save' or 'add', the associative-table values are updated accordingly - alongside, but separate from, the rest of the form input, which goes into the _main_ table with which the multi-select input is associated.
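Saving that multi-select boils down to a set difference between what is stored in the associative table and what was submitted - a sketch (the function name and data are my own invention):

```python
def sync_associative(current, selected):
    # current:  lookup ids already in the associative table for this record
    # selected: lookup ids ticked in the multi-select box on submission
    to_delete = set(current) - set(selected)   # un-ticked: rows to DELETE
    to_insert = set(selected) - set(current)   # newly ticked: rows to INSERT
    return to_delete, to_insert

# record currently tagged with lookups 1,2,3; the user now ticks 2,3,4:
print(sync_associative([1, 2, 3], [2, 3, 4]))  # ({1}, {4})
```

Everything that is in both sets is left alone, which keeps the associative-table churn (and any triggers on it) to a minimum.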
If you have a one-to-one relationship between two tables, where one table 'extends' another in an object-orientated fashion, it will be possible to generate a single form which can do view, update and add into the two tables and even delete - as if they were one table (as if the two tables were a writeable VIEW, effectively).
It's possibly the most ambitious programming project I've ever taken on, and that's saying something.
What about moving between different SQL databases?

I'm glad you asked the question, because if you use a dynamic SQL generator layer, and no Views and no Stored Procedures, the chances are that your application - which uses no hard-coded SQL statements - can very easily be moved from one SQL database to another with little or no change, even if the database is as dumb as MySQL.
There are a couple of main gotchas:
1) Insert IDs.
In MySQL, you can use the insert_id() function on the cursor which was used to perform the INSERT operation: the unique, auto-generated auto-increment id will be returned to you.
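The same idea surfaces in python's DB-API as the cursor's lastrowid attribute (MySQLdb's insert_id() is the underlying call); here's a sqlite3 sketch, since it behaves the same way:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY AUTOINCREMENT, "
           "body TEXT)")

cur = db.execute("INSERT INTO notes (body) VALUES ('hello')")
# the auto-generated id from *this* cursor's INSERT - crucially, not
# affected by inserts happening on other connections
print(cur.lastrowid)  # 1
```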
I never quite got the hang of the way that PostgreSQL handles this: it turns out to be done with sequences - a SERIAL column creates one for you behind the scenes, and calling currval() on that sequence after the INSERT returns the id.
In MSSQL, there are four main ways - one of them quite obscure. The first way is to reference an internal variable called @@IDENTITY: you can do SELECT @@IDENTITY AS id and then return cursor.fetchrow()['id'] as the result. The second way is to use the built-in function IDENT_CURRENT('tablename') - do a SELECT IDENT_CURRENT(<sometable>) AS id and obtain the result in the same way.
However, in MSSQL you can use GUIDs (aka UUIDs) - 16-byte unique identifiers - as the primary key. If you try to obtain the last id inserted into the table with those two methods, they return NULL - because there _is_ no auto-increment number being used: there is instead a built-in function called NEWID() which auto-generates a UUID for you.
So you have to use a slightly different method: Recordsets, via the ADO interface. What you do is create a new Recordset, set all the fields in it, then call the Update() function - and once Update() has been called, the Recordset is still open, and all fields, _including_ the auto-generated UUID, are still available, business as usual. All you have to do is copy this field before you close the Recordset.
If you can't use ADO - and so can't use Recordsets - then there is one final method: you generate the GUID on the client. If you do this, it is imperative that you have access to a REALLY good UUID generator. It's a standard now (an IETF RFC; the original DCE specification can be found on the opengroup.org web site) and has many successful and commonly-used implementations - not to mention that it's a standard part of the Win32 API.
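In python, the standard-library uuid module is such a generator (it implements the IETF UUID RFC), so client-side generation is a one-liner:

```python
import uuid

# client-generated primary key: a version-4 (random) UUID from the
# standard library, rather than anything hand-rolled
pk = uuid.uuid4()
print(pk)             # e.g. 6fa459ea-... (different every run)
print(len(str(pk)))   # 36: 32 hex digits plus four dashes
```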
This final method has distinct advantages when you have a horrendously complex set of INSERTs to do, with constraints and foreign keys that could even have circular dependencies: under such circumstances you have no choice but to generate the primary keys of the records in advance.
GUIDs, although slower, have a distinct advantage when it comes to distributing databases across many machines: you can easily consider synchronising the tables without having to worry about conflicting auto-increment primary keys, because the GUIDs are... well... globally unique! duh.
2) Constraints and foreign keys
The point of constraints and foreign keys is to ensure that if you have a lookup table, and some records which refer to that lookup table, the data stays consistent: specifically, that you cannot delete a lookup entry which is still being referred to from a related table, and that you cannot create a record which refers to a non-existent entry in a lookup table.
If you have designed your application with a shit database such as MySQL, which has no constraints, then the order of insertion into the lookup table - and of the records which use that lookup table - is irrelevant.
If you were to move to a database which cares about such things, you are in deep shit.
The moral of the story is: don't use a shit database.
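Both halves of that consistency rule can be demonstrated in a few lines of python's sqlite3 (note that SQLite keeps foreign-key enforcement off by default; the schema here is invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite ships with this switched off
db.executescript("""
    CREATE TABLE colour (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE flower (id INTEGER PRIMARY KEY,
                         colour_id INTEGER REFERENCES colour(id));
    INSERT INTO colour VALUES (1, 'red');
    INSERT INTO flower VALUES (1, 1);
""")

# deleting a lookup entry which is still referred to must fail...
try:
    db.execute("DELETE FROM colour WHERE id = 1")
    delete_refused = False
except sqlite3.IntegrityError:
    delete_refused = True

# ...and so must referring to a non-existent lookup entry
try:
    db.execute("INSERT INTO flower VALUES (2, 99)")
    insert_refused = False
except sqlite3.IntegrityError:
    insert_refused = True

print(delete_refused, insert_refused)  # True True
```

An application developed against a database that never raises these errors will happily insert in any order - and then fall over the first time it meets a database that enforces them.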
3) Locking and transactions
Here I cannot strictly advise you, because I will be honest: I am not familiar enough with database locking and transactions to do so.
What I do know is that MySQL has a means to lock entire tables for read and for read-write, and that you have to call UNLOCK TABLES afterwards.
Decent databases, which have to handle thousands to tens of thousands of requests and upwards, have transactions, and cannot POSSIBLY entertain the stupid notion of locking entire tables: at most, they allow per-row locking. They also have a concept of roll-back: if something went wrong, everything you tried to do gets "undone".
As far as I am concerned, right now, it's all a bit of voodoo magic, which I will need to investigate over the next few months.
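The roll-back idea, at least, is easy to demystify with sqlite3 (the account-transfer scenario is my own illustration): either both UPDATEs happen, or - on any error - neither does.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
db.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
db.commit()

def transfer(db, amount, fail=False):
    # move money between two rows atomically: commit only if every
    # step succeeded, otherwise roll the whole lot back
    try:
        db.execute("UPDATE accounts SET balance = balance - ? "
                   "WHERE name = 'alice'", (amount,))
        if fail:
            raise RuntimeError("simulated crash half-way through")
        db.execute("UPDATE accounts SET balance = balance + ? "
                   "WHERE name = 'bob'", (amount,))
        db.commit()
    except RuntimeError:
        db.rollback()  # everything we tried to do gets "undone"

transfer(db, 60, fail=True)   # rolled back: balances untouched
print(db.execute("SELECT balance FROM accounts ORDER BY name").fetchall())
# [(100,), (0,)]

transfer(db, 60)              # committed: both rows change together
print(db.execute("SELECT balance FROM accounts ORDER BY name").fetchall())
# [(40,), (60,)]
```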
There are, I am sure, other key differences: if you are particularly interested, I thoroughly recommend investigating SQLobject, because it has six main database back-ends, and copes admirably with them all.