Older blog entries for nconway (starting at number 20)

Via Zack, there's an interesting entry on Xooglers from an engineer on the Google AdWords team, about the experience of using MySQL initially and then switching to an unnamed commercial database. (Of course, there is the usual uninformed proselytizing in the comments: "you should have used x, idiots!")

Two impressions from the story:

  • MySQL works just fine for a lot of applications. While admittedly I often speak with pro-Postgres people, there's a segment of the user base that has come to regard using MySQL in almost any situation as something to be embarrassed about. I've never understood that attitude (or why SQLite is not regarded in the same way): there are many applications for which MySQL is a good choice or even the best choice. AdWords may well be one of them—I won't claim to know.
  • Technical decisions made at the insistence of management often end badly:
    After AdWords launched, Jane, the ads group manager, decided that now would be a good time to switch over to a "real" database. "Real" is one of those words that Doug ought to add to his list of words. It means "expensive". Many managers seem to have this idea that it is invariably true that you get what you pay for, and that therefore nothing that is available for free can possibly be any good....

    To make a long story short, it was an unmitigated disaster.

I noticed this interesting story on FeedLounge moving to Postgres, from MySQL.

I recently stumbled upon an online copy of Claude Shannon's groundbreaking 1950 paper, "Programming a Computer for Playing Chess". It's definitely worth a read -- it's amazing to think that Shannon had already tackled this sort of problem as early as 1950.

16 Oct 2005 (updated 16 Oct 2005 at 02:23 UTC) »


Recently I've had to use Java for some assignments at school. It's not as unpleasant as I recalled (generics in Java 5.0 help a lot), but I continue to be annoyed by Java's checked exceptions. I realized why: they tend to disrupt the way that I write code. I'll often write the code for some method to sketch out how to implement something, without necessarily wanting to get all the details right the first time through. Clean error handling can be tricky to get right; before deciding how to handle errors, I often like to get some experience using the class elsewhere in the project. Java makes this difficult: I either need to litter my code with empty try ... catch blocks, or else clutter up my method signatures with ugly exception specifications. I like the fact that Java forces the programmer to handle error conditions, but it gets the timing all wrong: I'm eventually going to handle errors, but Java forces me to worry about it much earlier than I would prefer to.


For 8.2, I've decided that I definitely want to fix the issues that LISTEN/NOTIFY have with MVCC bloat (since each NOTIFY updates a row in pg_listener, applications that use notifications frequently need to VACUUM the system catalogs frequently to clean up the resulting MVCC bloat; worse, pg_listener is always sequentially scanned, so performance degrades linearly). This is particularly important since Slony I uses notifications, so a lot of people will be using them whether they know they are or not.

Fixing the problem basically requires rewriting the way that notifications are stored; the idea is to store notifications in shared memory rather than in the system catalogs. The problem is that there is only a static amount of shared memory in which to store a potentially unbounded number of notifications, so it's not an easy problem to solve. I sent a few ideas about how to fix this to the hackers list, but Tom pointed out a few problems, so it is back to the drawing board. Alvaro and I have been discussing how to use the SLRU stuff to store notifications in shared memory with a disk backing store, but I haven't found a design I'm happy with. Hopefully more on this in a few days...

30 Sep 2005 (updated 30 Sep 2005 at 10:27 UTC) »


I was browsing the program from this year's VLDB. I noticed a few papers that I thought were interesting: "Getting Priorities Straight: Improving Linux Support for Database I/O" (PDF), which is worth a read -- the authors are apparently collaborating with MySQL AB to improve MySQL's performance under Linux, including modifying MySQL to use AIO. I also liked two papers (one by Microsoft and one by IBM) on XML integration / XQuery optimization issues.

20 Jul 2005 (updated 20 Jul 2005 at 02:23 UTC) »
Robert: Most of the issues the Coverity tool found were actually in pg_dump and ecpg; it actually only found a single bug in the Postgres backend, which I'm somewhat embarrassed to admit I introduced during the 8.0 development cycle. Which is both impressive and suspicious -- a single bug in 250,000-odd lines of C is too good to be believed. As it turns out, there is a good reason for this: the Stanford checker was already run against the Postgres tree a few year ago (in the 7.3 timeframe, IIRC), and the bugs it identified were fixed. Since the Coverity tool is an improved, commercialized version of the original Stanford checker, it's not too surprising that the tool didn't find very many new bugs in the backend.

Another factor is that the way that memory management is done in Postgres makes static analysis a little more difficult. For example, given the code:

void foo(char *arg)
    char *ptr = strdup(arg);
    if (!ptr)
    printf("Hello, world: %s\n", ptr);

it is relatively easy to determine via static analysis that the strdup leaks memory when ptr goes out of scope. However, malloc and friends are rarely used directly in the backend; we use region-based memory allocation instead. In other words, each allocation is done in a "memory context"; an allocation can be manually released via pfree, or all the allocations in a context can be released at once (via MemoryContextReset and related functions). Contexts are arranged into a hierarchy -- when a context is reset, all of its child contexts are also reset. This significantly simplifies memory management and error recovery: when we're done processing a query, we can just reset the memory context associated with that query, and we can be pretty confident all associated memory will be released. If a transaction is aborted, we can longjmp back to the abort error handler and then reset the transaction's context -- this cleans up all memory allocated during the transaction.

On the other hand, this technique makes static analysis of memory leaks a little more difficult:

void foo(char *arg)
    char *ptr = pstrdup(arg);
    printf("Hello, world: %s\n", ptr);

pstrdup allocates memory in the CurrentMemoryContext; when that context is reset or deleted, the memory will be released. Some variant of the above code could be perfectly valid: if the code knows that CurrentMemoryContext will always be a short-lived memory context (i.e. one that will be reset "soon"), it needn't bother explicitely freeing all its allocations. This improves readability (by not cluttering the code with pointless frees) as well as performance (it is faster to reset an entire context at once than to release each individual allocation).

So in some sense, memory leaks are impossible: allocated memory will never be unrecoverably lost until process exit. However, you can still have problems -- this bug in 8.0 is a typical example of the kinds of memory leaks you can get with region-based allocation. This code is rewriting a table, doing essentially:

for each tuple in the old table
    evaluate any new default expressions
    recheck any applicable constraints
    insert the tuple in the new table

We need to do this when adding a column with a non-NULL default value to a table, for example. The problem was that when evaluating default expressions, we invoked various functions that performed allocations in the CurrentMemoryContext. These functions effectively assumed that CurrentMemoryContext was short-lived -- but in this case, we didn't reset the context for each iteration of the loop. Again, there is no "memory leak" here, strictly speaking -- the CurrentMemoryContext would eventually be reset. But if the loop iterated a few hundred million times on a large table, we would quickly exhaust all available memory.

So how do you detect these sorts of pseudo-leaks via static analysis? I don't really see an effective way, although it's an interesting problem.

Anyway, my point is that using metrics like "tool X found Y bugs per Z lines of code" as an indication of software quality is pretty dubious -- and it's even more dubious to try to use this metric to compare the quality of two different software products.

27 Jun 2005 (updated 27 Jun 2005 at 08:31 UTC) »

Postgres: We're making steady progress toward the 8.1 feature freeze. Perhaps I'll jinx it by writing this, but it seems that people are paying more attention to the feature freeze date (July 1st) this time around. There is the usual flood of large patches as the freeze date approaches, but it seems a bit less hectic than it has been in years past.

Slony: I don't have anything I need to get in by the freeze, although there are a few patches I need to review. I'm spending most of my time these days working on Slony II (the synchronous multi-master database replication system for Postgres that I'm working on with a few other folks). We haven't been talking much about Slony II -- not because there isn't work being done on it, but I think there's a consensus that we don't want to hype something that hasn't even been completely designed yet, let alone implemented. Still, I'm hoping to get a prototype of the system finished by the end of July that can do multi-master replication of arbitrary DML in serializable isolation level. Hopefully the prototype will also include some basic recovery functionality as well ("recovery" is how we synchronize the state of a joining node with the current state of the cluster). Once the system gets a little closer to reality, I'll post some more about the design of the system.

Robert: Yeah, the Postgres-R (SI) work is one of the papers we have been considering for Slony II (there's a copy you can get here that doesn't require IEEE membership). The current prototype we're working on will be fairly similar to what Prof. Kemme proposes, although I don't think we've decided that this is definitely the right way to go.

13 May 2005 (updated 13 May 2005 at 06:09 UTC) »
Hash indexes

I think it is somewhat common knowledge, at least among Postgres geeks, that the current implementation of on-disk linear hash indexes in PG is pretty poor. Some of the problems include:

  • no write-ahead logging — if a system crash occurs, the index state may be inconsistent, so you'll need to REINDEX
  • poor bulk-loading performance — creating a hash index on a few gigabytes of data takes a long time
  • doesn't support unique indexes or multi-column indexes

But more importantly, the current hash index code doesn't outperform our b+-tree implementation even for scalar equality (e.g. WHERE x = 'foo'). So there hasn't been much motivation for folks to hack on hash indexes, as few people are actually using them.

In theory, though, hash indexes could be useful: equality scans are very common, and a hash index should be faster than a b+-tree for these queries at least in non-concurrent situations (since a b+-tree needs to navigate the internal nodes of the tree for a few levels before reaching the leaf level; a hash index can jump right to the appropriate hash bucket).

This topic was raised on the pgsql-general list, and a pretty interesting discussion ensued. We came up with two simple ways to improve hash indexes:

  • at present, the content of a hash bucket is unordered. That means the index can use the hash to select the right bucket to scan, but needs to do a linear scan over all the bucket's pages once it gets there, running the equality function of the index's operator class for each the entry in the bucket.

    It would be a lot faster to just binary search within a page, but to do that we need to either define an ordering over the index keys (which we may not be able to do for a user-defined type), or sort the keys by their hash value. If we do the latter, we probably need to store the full hash value of each key in the index. Storing the full hash value is useful for other reasons, as well: for example, if an entry's hash value does not match the hash of the scankey, we needn't evaluate the opclass's equality function, since this tuple cannot match the scan.

  • if we're going to be storing the hash of each key in the index, it would sometimes be a good idea to only store the hash of the key, not the key's value itself. To avoid hash collisions, we'll need to recheck the key value against the heap tuple, but we need to do that anyway (PostgreSQL only stores MVCC information in the heap, so we need to check that the tuple pointed-to by the index entry actually exists).

Which was all well and good in theory, but after implementing it, it turns out that I've yet to find a case in which the performance is noticeably improved :-\ I could definitely construct a situation where the patch would be a win (e.g. a user-defined type with a complex equality operator, or an index on a very wide text field), but it is frustrating that it doesn't appear to be a win in the common case (e.g. a hash index on an int4 field with a reasonably random distribution of values).

I'm still trying to figure out if there's merit to pursuing these improvements, and if so, why I haven't seen the performance improvement I expected. Anyway, this will teach me to look at profiling data before sitting down to code...

17 Mar 2005 (updated 17 Mar 2005 at 12:26 UTC) »
grep on steroids

Recently, I've been spending some time poking around in the internals of the PostgreSQL query planner. One of the projects I've been thinking about is fixing a long-standing ugly part of the planner: the fact that it scribbles on its input.

(Context: the planner takes a Query, which is essentially an annotated, rewritten abstract syntax tree describing the query to be executed. It produces a Plan, which describes the particular way to execute the query that the planner believes to be most efficient. The problem is that the planner modifies various parts of the input Query. This is a problem for a few reasons; besides being ugly, (a) it doesn't clearly distinguish between the parser's input and its working state (b) it makes it impossible to pass the same Query to the planner multiple times without copying it each time. I'm a little embarrassed to admit we actually do make extra copies of a Query in a few places in the backend, for precisely this reason.)

So, the first thing I wanted to determine was: in what situations does the planner modify one of the fields of the input Query?

Um, so it seems this isn't a trivial problem to solve. The planner is about 30,000LOC, and the input Query is used all over the place. I also want to find indirect assignments to the input Query — for example, if the planner passes a pointer to a field of the input Query to a function, and the function then procedes to write through the pointer. (In this particular example, I have a pretty good idea where the most egregious offenders are, so I can probably solve the problem well-enough via simple text search using grep, glimpse, and the like — but a general-case solution would be interesting, I think.)

It might be plausible to do this via gdb, valgrind or the like (waiting until a write is actually made to the input Query at runtime and noting the call site then). But this only catches the modifications that happen to be made by a particular invocation of the planner, not all the modifications that might occur. It is also a hack: this information is statically derivable from the source code, so a static analysis seems much nicer than solving the problem at runtime.

Text search that does not incorporate knowledge of the syntax of the source code simply doesn't cut it. One example among many is:

void some_function(SomeType *st) {
    st->field = xxx;
Plan *planner(Query *query) {

Solving the problem in the general case also requires considering aliasing; alias analysis (in C anyway) is a rather complex problem to solve effectively, even in a compiler.

Besides this, there are interesting queries about source code that can't easily be expressed via searching for patterns in text — "show me the functions where function X is called before function Y has been called", "show me the functions that are only invoked by other functions defined in the same file (i.e. these would be candidates for being marked static)", and so on. ISTM a syntax-aware source code search tool like this would be an interesting thing to write (of course, if anyone knows of an F/OSS tool that already does this, let me know).

This thread on llvm-dev is worth reading if you're interested in compiler internals.

11 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!