# Older blog entries for nconway (starting at number 30)

3 Oct 2006 (updated 3 Oct 2006 at 20:04 UTC) »
Long time, no blog

Life's been keeping me busy. I'll post an update soon. Briefly: finished my summer job for the Amalgamated Insight (née TelegraphCQ) folks. I'm now back at Queen's for the last year of my undergrad.

Interesting math

An assignment in one of my classes included an interesting bonus problem. It is very simple, but I confess I got it completely wrong before I saw the solution. Maybe one of you bright folks is smarter than I:

Let the alphabet A = {a, b, c, ..., z} (A is the set of 26 lowercase letters of the English alphabet). Let S1(w) be true iff the string w over alphabet A contains the substring aaa; let S2(w) be true iff the string w contains the substring abc.

Suppose we choose a w of 10 characters; each character in w is selected randomly and independently.

Let P1 be the probability that S1(w) is true, and let P2 be the probability that S2(w) is true. Is P1 > P2, P1 < P2, or P1 = P2? Give a justification for your answer. (Hint: P1 != P2).

If you think you know the answer, email me — I'll post the answer later (neilc A T samurai D O T com). Obviously, the gist is in the justification, not which alternative you think is true.

Hat Tip: Prof. Kai Salomaa for showing me the problem.
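If you'd rather check your intuition empirically before emailing, here's a quick Monte Carlo sketch (deliberately not annotated with the answer):

```python
import random
import string

def estimate(sub, length=10, trials=200_000):
    """Estimate the probability that a random string of the given
    length over {a..z} contains `sub` as a substring."""
    hits = 0
    for _ in range(trials):
        w = ''.join(random.choices(string.ascii_lowercase, k=length))
        if sub in w:
            hits += 1
    return hits / trials

p1 = estimate("aaa")   # estimate of P1
p2 = estimate("abc")   # estimate of P2
print(p1, p2)
```

Fair warning: both probabilities are small and close together, so you'll need a very large number of trials to distinguish them reliably — the analytical argument is more satisfying anyway.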

I couldn't make it to OSCON this year (again!), but I noticed that Theo Schlossnagle has posted his slides for his OSCON talk about using PostgreSQL for terabyte-scale data warehousing.

I've posted the material from the "Introduction to Hacking PostgreSQL" tutorial that Gavin Sherry and I gave at the PostgreSQL Summit online, at http://neilconway.org/talks/hacking/. The slides, example patch, and handout are there -- I've only posted the PDFs for now, but I'll add the LaTeX source when I get a chance to clean it up.

Robert: When I was helping Josh with the scheduling, we were initially going to make sure that the hacking tutorial ran concurrently with non-technical talks. However, on reflection this seemed like precisely the wrong thing to do: experienced Postgres hackers won't have much to learn from the hacking tutorial, because it will focus on fairly elementary topics. So we decided to put some of the more hardcore technical talks, such as the talks on multi-threading and SMP scalability, alongside the hacking tutorial, so that the hardcore folks can see the technical talks, and the relative novices can attend the tutorial.
6 Apr 2006 (updated 6 Apr 2006 at 01:10 UTC) »

I'd like to highlight Tom Lane's recent commit that fixes a major defect in the implementation of domains in PostgreSQL. Domains can have CHECK or NOT NULL constraints, but prior to Tom's patch, these constraints were not enforced by PL/PgSQL, or as the return value of a procedural language function. (That meant you could have values of a domain type that violated the domain's own constraints!) This has been a known problem for quite some time, but it was somewhat tricky to fix.

In Postgres, each type has an associated "input function" that takes a string and produces a value of the type. Prior to Tom's commit, a domain type's input function was just the input function of its base type. As a result, just invoking a type's input function (which is done in several places throughout the backend) wasn't enough to check the constraints on a value of a domain type—you also had to explicitly look up any associated domain constraints and check them. So in all the places where we were invoking input functions, we'd need to add some additional code to explicitly check domain constraints. Needless to say, that would be pretty ugly -- it's just a few additional function calls, but it's really not something we ought to be doing at every call site of a type's input function.

Worse still, looking up the constraints associated with a domain is relatively expensive (it requires a non-syscache'd catalog lookup). To actually evaluate a CHECK constraint you need to evaluate an expression, which requires instantiating a bunch of executor machinery, which is also not that cheap. So in all the places where we'd need to add checks of domain constraints, we'd also need to think about how to efficiently load and cache the domain constraints and executor machinery, and when to invalidate/release them.

I did some work to add domain constraint checking to the return value in PL/PgSQL. I never applied the patch, partly because there were some implementation details that were tricky to resolve, but mostly because it just seemed like the wrong approach.

Tom's fix is much cleaner: by providing a separate input function for all domain types and doing the constraint checking there, we're guaranteed to check domain constraints at the appropriate time, without the need to clutter each call-site of a type input function ([1], [2]). It would be worthwhile to investigate whether this results in a performance regression, though: there's no easy way to cache the executor machinery needed to evaluate a CHECK constraint in this design, whereas the prior design allowed each call-site to implement its own EState caching.

Nice work, Tom!

There's an interesting article on Economist.com about "open-source business", and whether open-source-inspired techniques can be applied to other industries. The article talks about MySQL AB as an example of a hybrid between a community-driven open-source project and a proprietary software company. There's one particularly amusing quote: MySQL AB "rarely accepts code from outside developers", apparently because "the complexity of database software makes it less amenable to being independently cobbled together."

13 Mar 2006 (updated 13 Mar 2006 at 20:24 UTC) »
Initializing variables

While reviewing the updatable views patch that Bernd and Jaime recently posted (more on that later), I ran into a few coding patterns that annoy me. One of those is the pointless initialization of block-local variables, as in:

T *
foo_func(...)
{
    T *v = NULL;

    v = palloc(sizeof(*v));
    /* initialize v */

    return v;
}

Some programmers are in the habit of initializing all block-local variables when they define them — even in the example above, where the initial value of v is never referenced. So, is it good style to always initialize v when it is defined? IMHO, no.

I suppose the justification for this technique is that it ensures that v always has a well-defined value: if we didn't initialize v, it would initially be undefined. At first glance, that sounds reasonable: reading an undefined variable can have unpredictable consequences, whereas dereferencing a NULL pointer will, on any mainstream platform, crash the program immediately (strictly speaking it's undefined behavior, but in practice it's an easy crash to diagnose).

I think this reasoning is flawed for a few reasons:

1. It hides errors: if you leave v undefined and then try to use its value before it has been assigned to, most modern compilers will warn you about your mistake: reading the value of an undefined variable is almost certainly not what the programmer intended. If you initialize v to NULL, the compiler can't detect this, which makes the error harder to find. (You won't get a compiler error in either case, but if you're ignoring or not enabling compiler warnings, you have bigger problems.)

2. It is confusing: the only reason to initialize v is if its value might be used without any subsequent assignment. That is, this is perfectly reasonable code:

T *
foo_func(...)
{
    T *v = NULL;

    if (some_condition)
    {
        v = palloc(sizeof(*v));
        /* initialize v */
    }

    return v;
}

In this example, initializing v to NULL indicates that there are some code paths where the initial value of v will not be overwritten. This is a valuable hint, because it tells you something non-trivial about the way in which v is used. Adding a pointless initialization of v, as in the first example, makes it more difficult to distinguish between these two cases, for no gain.

3. It is redundant: There's no need for the initialization, so by Ockham's razor we ought to eliminate it: if you can remove a construct from your code without harming readability, functionality or performance, you probably should.

(An explicit initialization might also cause the compiler to generate more code, but a decent optimizing compiler should usually be able to recognize that the initial value of v is not needed and thus can be eliminated — and in any case the performance difference is unlikely to be significant.)

If anyone can provide a good justification for this coding pattern, I'd be curious to hear it.

3 Mar 2006 (updated 3 Mar 2006 at 15:31 UTC) »

A brief followup to my earlier post on random generation of SQL queries for testing purposes: Don Slutz from MSR has an interesting paper called "Massive Stochastic Testing of SQL" that describes a stochastic SQL query generator called RAGS. One of the problems with randomized testing that I noted earlier is that it is difficult to distinguish between successes and failures: if the backend crashes while executing a query it is very likely a defect of some kind, but if no crash occurs, how do we know that the backend has produced the correct result set? Slutz writes:

If a SQL Select executes without errors, there is no easy method to validate the returned values by observing only the values, the query, and the database state. Our approach is to simply execute the same query on multiple vendors' DBMSs and then compare the results.

As the author notes, this isn't perfect:

The use of database systems from different vendors for output validation proved to be extremely useful for the SQL common to many vendors. The down side is that the common SQL subset is relatively small and changes with each release. We also found the differences in NULL and character string handling and numeric type coercion in expressions to be particularly problematic (these are also portability issues).
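The comparison step itself is easy to sketch. Here's a toy version in Python, using two in-memory SQLite databases as stand-ins for different vendors' systems (in practice you'd connect to genuinely different DBMSs):

```python
import sqlite3

def differential_check(query, connections):
    """Run the same query on several databases and verify that they all
    agree. Results are compared as sorted lists, since a SQL result set
    is unordered in the absence of an ORDER BY clause."""
    results = [sorted(conn.execute(query).fetchall()) for conn in connections]
    return all(r == results[0] for r in results[1:])

# Two "vendors" loaded with identical schema and data (stand-ins only)
conns = [sqlite3.connect(":memory:") for _ in range(2)]
for conn in conns:
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

print(differential_check("SELECT x FROM t WHERE x > 1", conns))  # → True
```

Of course, the hard part is everything this sketch glosses over: the NULL-handling, string-collation, and type-coercion differences Slutz mentions mean that "agreement" is fuzzier than a simple list comparison.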

The paper's approach is to randomly generate SQL queries using an SQL grammar. The grammar defines the set of legal SQL statements, so it can be used to generate members from that set. For example, a trivial SELECT grammar might be:

SelectStmt: SELECT t_list FROM relation

The SQL generator would generate values for the "t_list" and "relation" randomly: it would generate a randomly-sized list of random target list expressions for "t_list", and a random relation name for "relation". The paper's SQL generator allows for a few simple configuration settings: generate between n and m joins, generate between x and y elements in the target list, and so on. William McKeeman's "Differential Testing for Software" explores some similar ideas for doing stochastic testing of C compilers. Both papers have been discussed by other bloggers ([1], [2]) in the past.
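To make the grammar-expansion idea concrete, here's a minimal sketch of a generator for that trivial SELECT grammar (the schema, table, and column names are invented for illustration; RAGS itself is far more sophisticated):

```python
import random

# Hypothetical schema: relation name -> column names
SCHEMA = {
    "orders": ["id", "customer_id", "total"],
    "customers": ["id", "name", "region"],
}

def gen_select(max_targets=3):
    """Expand  SelectStmt: SELECT t_list FROM relation  by choosing a
    random relation and a random-length list of its columns."""
    relation = random.choice(sorted(SCHEMA))
    n = random.randint(1, max_targets)
    t_list = ", ".join(random.choice(SCHEMA[relation]) for _ in range(n))
    return "SELECT %s FROM %s" % (t_list, relation)

print(gen_select())
```

Every string this emits is well-formed and references only real columns, but — as discussed below — nothing makes the queries resemble anything a user would actually run.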

One problem with this approach is that it often generates unrealistic queries. Using an SQL grammar ensures the randomly-generated queries will be well-formed, but it doesn't provide any guarantees about whether the queries will be sensible, or bear any relation to the sorts of queries users are actually running. The relatively coarse-grained tuning knobs provided by RAGS also don't allow test suites to be written to validate the behavior of the database for specific classes of queries: you can't generate a specialized set of SQL queries to test a particular part of the query optimizer, for example. It would be nice to be able to take the query workload from a production database, and write a high-level constraint that captures the notion of "queries that look like that." I'm still hopeful a DSL could provide a nice solution to both problems, but I haven't had a chance to really think about it yet.

Mark Shuttleworth seems fairly pleased with Postgres. Money quote:

Postgres is a truly awesome database. When we started working on Launchpad I wasn't sure if it would be up to the job. I was so wrong. It's been robust, fast, and *professional* in every regard.
6 Feb 2006 (updated 6 Feb 2006 at 07:14 UTC) »

Brain dump: I've heard of QuickCheck before, but I only found time to look at it in some depth earlier today. It's a very simple but powerful tool: rather than writing unit test cases by hand, the tester just specifies a set of constraints that the function being tested must satisfy. QuickCheck then randomly generates test cases and checks that the constraints hold. A trivial example of a constraint given in the 2000 QuickCheck paper is for an insertion function into ordered lists:

prop_Insert :: Int -> [Int] -> Property
prop_Insert x xs =
    ordered xs ==> ordered (insert x xs)

That is: given any integer x and any ordered list of integers xs, insert x xs should produce an ordered list. QuickCheck takes care of generating the necessary random integers automatically. The programmer can view the distribution of random values generated by QuickCheck, or provide their own data generator.
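The core idea is easy to prototype in any language with a random number generator. Here's a stripped-down Python version of the insertion example (insert and ordered are my own toy definitions, not from the paper):

```python
import random

def insert(x, xs):
    """Insert x into the sorted list xs, preserving order."""
    ys = list(xs)
    for i, y in enumerate(ys):
        if x <= y:
            ys.insert(i, x)
            return ys
    ys.append(x)
    return ys

def ordered(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

def prop_insert():
    """Property: inserting into an ordered list yields an ordered list."""
    x = random.randint(-100, 100)
    xs = sorted(random.randint(-100, 100)
                for _ in range(random.randint(0, 10)))
    return ordered(insert(x, xs))

# Check the property over many random cases
assert all(prop_insert() for _ in range(1000))
```

Real QuickCheck does much more than this — notably, it shrinks failing inputs down to minimal counterexamples — but the sketch captures the basic workflow.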

I think it would be interesting to apply this to PostgreSQL. For example, I find it frustrating to try to manually test the impact of changes to the query optimizer: it's difficult to determine whether a local change to the optimizer is correct for all possible query trees. Worse yet, a planner change might optimize some queries and pessimize others — unless the developer realizes this while implementing the change, we often don't hear about these sorts of performance regressions until the beta period or even later.

It might be interesting to try to define a domain-specific language that would allow you to construct random SQL queries that satisfy certain high-level constraints, and then verify that the backend executes those queries correctly. For example, a statement in the DSL might be "for all SQL queries q such that q contains 0 < n < 10 joins, check that q is executed without error." Checking that queries can be executed without error is fairly trivial; perhaps we could check more advanced properties, like "executes within a certain amount of time" or "produces a result set that satisfies the constraint: [...]".
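A constraint like "0 < n < 10 joins" is straightforward to turn into a generator. A minimal sketch, with hypothetical tables t0..tn that each have an id column:

```python
import random

def gen_join_query(min_joins=1, max_joins=9):
    """Generate a SELECT containing a random number of joins in
    [min_joins, max_joins] over hypothetical tables t0..tn."""
    n = random.randint(min_joins, max_joins)
    tables = ["t%d" % i for i in range(n + 1)]
    joins = " ".join(
        "JOIN %s ON %s.id = %s.id" % (tables[i], tables[i - 1], tables[i])
        for i in range(1, n + 1))
    return "SELECT * FROM %s %s" % (tables[0], joins)

print(gen_join_query())
```

The interesting part of the DSL would be composing many such constraints (join count, predicate shape, aggregate usage, and so on) rather than hand-writing a generator per constraint, as this sketch does.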

That said, there is plenty of low-hanging fruit when it comes to improving the state of testing for Postgres. For example, the ability to generate arbitrary n-relation joins of certain kinds (e.g. star schema) would be useful. Another simple improvement would be to convert the TPC-D query workload into something that could be used to isolate changes in the optimizer's performance for that set of (important) queries.
