Older blog entries for nconway (starting at number 32)


I'll be away until January 5th. I should still be reachable via email.

Database Architecture

The Revolution In Database Architecture is a nice short paper that outlines Jim Gray's thoughts on future directions for database systems. A few samples:

Random access is a hundred times slower than sequential. These changing ratios require new algorithms that intelligently use multi-processors sharing a massive main memory, and intelligently use precious disk bandwidth. The database engines need to overhaul their algorithms to deal with the fact that main memories are huge (billions of pages, trillions of bytes). The era of main-memory databases has finally arrived.
Cost-based static-plan optimizers continue to be the mainstay for simple queries that run in seconds. But, for complex queries, the query optimizer must adapt to current workloads, must adapt to data skew and statistics, and must plan in a much more dynamic way – changing plans as the system load and data statistics change. For petabyte-scale databases it seems the only solution is to run continuous data scans and let queries piggyback on the scans. Teradata pioneered that mechanism, and it is likely to become more common in the future.
The database community has found a very elegant way to embrace and extend machine learning technology like clustering, decision trees, Bayes nets, neural nets, time series analysis, etc... The key idea is to create a learning table T; telling the system to learn columns x, y, z from attributes a, b, c (or to cluster attributes a, b, c, or to treat a as the time stamp for b.) Then one inserts training data into the table T, and the data mining algorithm builds a decision tree or Bayes net or time series model for the data .... After the training phase, the table T can be used to generate synthetic data; given a key a,b,c it can return the likely x,y,z values of that key along with the probabilities. Equivalently, T can evaluate the probability that some value is correct. The neat thing about this is that the framework allows you to add your own machine-learning algorithms to this framework. This gives the machine-learning community a vehicle to make their technology accessible to a broad user base.
I'm not convinced that such a model is really the right way to integrate data mining into the DBMS, but it's an interesting (and easily implemented) idea. You could build such a system using Postgres' extensibility features pretty easily, although backend modifications would probably be necessary for proper integration. PL/R would definitely be a useful tool.

Real Life

Sorry for not posting the solution to the math problem I blogged about a few months (!) ago -- I'll get around to that shortly.

3 Oct 2006 (updated 3 Oct 2006 at 20:04 UTC) »
Long time, no blog

Life's been keeping me busy. I'll post an update soon. Briefly: finished my summer job for the Amalgamated Insight (née TelegraphCQ) folks. I'm now back at Queen's for the last year of my undergrad.

Interesting math

An assignment in one of my classes included an interesting bonus problem. It is very simple, but I confess I got it completely wrong before I saw the solution. Maybe one of you bright folks is smarter than I:

Let the alphabet A = {a, b, c, ..., z} (A is the set of 26 lowercase letters of the English alphabet). Let S1(w) be true iff the string w over alphabet A contains the substring aaa; let S2(w) be true iff the string w contains the substring abc.

Suppose we choose a w of 10 characters; each character in w is selected randomly and independently.

Let P1 be the probability that S1(w) is true, and let P2 be the probability that S2(w) is true. Is P1 > P2, P1 < P2, or P1 = P2? Give a justification for your answer. (Hint: P1 != P2).

If you think you know the answer, email me — I'll post the answer later (neilc A T samurai D O T com). Obviously, the gist is in the justification, not which alternative you think is true.

Hat Tip: Prof. Kai Salomaa for showing me the problem.

I couldn't make it to OSCON this year (again!), but I noticed that Theo Schlossnagle has posted his slides for his OSCON talk about using PostgreSQL for tera-byte scale data warehousing.

I've posted the material from the "Introduction to Hacking PostgreSQL" tutorial that Gavin Sherry and I gave at the PostgreSQL Summit online, at http://neilconway.org/talks/hacking/. The slides, example patch, and handout are there -- I've only posted the PDFs for now, but I'll add the LaTeX source when I get a chance to clean it up.

Robert: When I was helping Josh with the scheduling, we were initially going to make sure that the hacking tutorial ran concurrently with non-technical talks. However, on reflection this seemed like precisely the wrong thing to do: experienced Postgres hackers won't have much to learn from the hacking tutorial, because it will focus on fairly elementary topics. So we decided to put some of the more hardcore technical talks, such as the talks on multi-threading and SMP scalability, alongside the hacking tutorial, so that the hardcore folks can see the technical talks, and the relative novices can attend the tutorial.
6 Apr 2006 (updated 6 Apr 2006 at 01:10 UTC) »

I'd like to highlight Tom Lane's recent commit that fixes a major defect in the implementation of domains in PostgreSQL. Domains can have CHECK or NOT NULL constraints, but prior to Tom's patch, these constraints were not enforced by PL/PgSQL, or as the return value of a procedural language function. (That meant you could have values of a domain type that violated the domain's own constraints!) This has been a known problem for quite some time, but it was somewhat tricky to fix.

In Postgres, each type has an associated "input function" that takes a string and produces a value of the type. Prior to Tom's commit, a domain type's input function was just the input function of its base type. As a result, just invoking a type's input function (which is done in several places throughout the backend) wasn't enough to check the constraints on a value of a domain type—you also had to explicitly lookup any associated domain constraints and check them. So in all the places where we were invoking input functions we'd need to add some additional code to explicitly check domain constraints. Needless to say, that would be pretty ugly -- it's just a few additional function calls, but it's really not something ought to be doing at every callsite of a type's input function.

Worse still, looking up the constraints associated with a domain is relatively expensive (it requires a non-syscache'd catalog lookup). To actually evaluate a CHECK constraint you need to evaluate an expression, which requires instantiating a bunch of executor machinery, which is also not that cheap. So in all the places where we'd need to add checks of domain constraints, we'd also need to think about how to efficiently load and cache the domain constraints and executor machinery, and when to invalidate/release them.

I did some work to add domain constraint checking to the return value in PL/PgSQL. I never applied the patch, partly because there were some implementation details that were tricky to resolve, but mostly because it just seemed like the wrong approach.

Tom's fix is much cleaner: by providing a separate input function for all domain types and doing the constraint checking there, we're guaranteed to check domain constraints at the appropriate time, without the need to clutter each call-site of a type input function ([1], [2]). It would be worthwhile to investigate whether this results in a performance regression, though: there's no easy way to cache the executor machinery needed to evaluate a CHECK constraint in this design, whereas the prior design allowed each call-site to implement its own EState caching.

Nice work, Tom!

There's an interesting article on Economist.com about "open-source business", and whether open-source-inspired techniques can be applied to other industries. The article talks about MySQL AB as an example of a hybrid between a community-driven open-source project and proprietary software company. There's one particularly amusing quote: MySQL AB "rarely accepts code from outside developers", apparently because "the complexity of database software makes it less amenable to being independently cobbled together."

13 Mar 2006 (updated 13 Mar 2006 at 20:24 UTC) »
Initializing variables

While reviewing the updatable views patch that Bernd and Jaime recently posted (more on that later), I ran into a few coding patterns that annoy me. One of those is the pointless initialization of block-local variables, as in:

T *
    T *v = NULL;
    v = palloc(sizeof(*v));
    /* initialize v */
    return v;

Some programmers are in the habit of initializing all block-local variables when they define them — even in the example above, where the initial value of v is never referenced. So, is it good style to always initialize v when it is defined? IMHO, no.

I suppose the justification for this technique is that it ensures that v always has a well-defined value: if we didn't initialize v, it would initially be undefined. At first glance, that sounds reasonable: dereferencing an undefined variable can have unpredictable consequences, whereas dereferencing the NULL pointer is guaranteed to halt the program.

I think this reasoning is flawed for a few reasons:

  1. It hides errors: if you leave v undefined and then try to use its value before it has been assigned to, most modern compilers will warn you about your mistake: obviously, reading the value of an undefined variable is probably not what the programmer intended to do. If you initialize v to NULL, the compiler can't tell this, which makes this error harder to detect. (While you won't get a compiler error in any case, if you are ignoring or not enabling compiler warnings you are probably a bad programmer in any case.)

  2. It is confusing: the only reason to initialize v is if its value might be used without any subsequent assignment. That is, this is perfectly reasonable code:

    T *
        T *v = NULL;
        if (some_condition)
            v = palloc(sizeof(*v));
            /* initialize v */
        return v;

    In this example, initializing v to NULL indicates that there are some code paths where the initial value of v will not be overwritten. This is a valuable hint, because it tells you something non-trivial about the way in which v is used. By adding a pointless initialization of v in the first example, it makes it more difficult to distinguish between these two cases for no gain.

  3. It is redundant: There's no need for the initialization, so by Ockham's razor we ought to eliminate it: if you can remove a construct from your code without harming readability, functionality or performance, you probably should.

    (An explicit initialization might also cause the compiler to generate more code, but a decent optimizing compiler should usually be able to recognize that the initial value of v is not needed and thus can be eliminated — and in any case the performance difference is unlikely to be significant.)

If anyone can provide a good justification for this coding pattern, I'd be curious to hear it.

3 Mar 2006 (updated 3 Mar 2006 at 15:31 UTC) »

A brief followup to my earlier post on random generation of SQL queries for testing purposes: Don Slutz from MSR has an interesting paper called "Massive Stochastic Testing of SQL" that describes a stochastic SQL query generator called RAGS. One of the problems with randomized testing that I noted earlier is that it is difficult to distinguish between successes and failures: if the backend crashes while executing a query it is very likely a defect of some kind, but if no crash occurs, how do we know that the backend has produced the correct result set? Slutz writes:

If a SQL Select executes without errors, there is no easy method to validate the returned values by observing only the values, the query, and the database state. Our approach is to simply execute the same query on multiple vendors' DBMSs and then compare the results.

As the author notes, this isn't perfect:

The use of database systems from different vendors for output validation proved to be extremely useful for the SQL common to many vendors. The down side is that the common SQL subset is relatively small and changes with each release. We also found the differences in NULL and character string handling and numeric type coercion in expressions to be particularly problematic (these are also portability issues).

The paper's approach is to randomly generate SQL queries using an SQL grammar. The grammar defines the set of legal SQL statements, so it can be used to generate members from that set. For example, a trivial SELECT grammar might be:

    SelectStmt: SELECT t_list FROM relation

The SQL generator would generate values for the "t_list" and "relation" randomly: it would generate a randomly-sized list of random target list expressions for "t_list", and a random relation name for "relation". The paper's SQL generator allows for a few simple configuration settings: generate between n and m joins, generate between x and y elements in the target list, and so on. William McKeeman's "Differential Testing for Software" explores some similar ideas for doing stochastic testing of C compilers. Both papers have been discussed by other bloggers ([1], [2]) in the past.

One problem with this approach is that it often generates unrealistic queries. Using an SQL grammar ensures the randomly-generated queries will be well-formed, but it doesn't provide any guarantees about whether the queries will be sensible, or bear any relation to the sorts of queries users are actually running. The relatively coarse-grained tuning knobs provided by RAGS also don't allow test suites to be written to validate the behavior of the database for specific classes of queries: you can't generate a specialized set of SQL queries to test a particular part of the query optimizer, for example. It would be nice to be able to take the query workload from a production database, and write a high-level constraint that captures the notion of "queries that look like that." I'm still hopeful a DSL could provide a nice solution to both problems, but I haven't had a chance to really think about it yet.

23 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!