Recent blog entries for nconway

New Blog

I've started a new blog, Everything is Data, to talk about data management, distributed systems, and to shamelessly flog my own research :) My first post is up, discussing the recent Stonebraker et al. SIGMOD paper comparing parallel DBs with MapReduce.

Unfortunately, the demands of grad school are such that I don't have much (any) time to work on Postgres at the moment, but if you're interested, please subscribe -- I'll probably discontinue my blog on Advogato in the future. I've imported all my old posts from this blog (using the XML-RPC interfaces provided by both WordPress and Advogato -- it was only mildly painful).

25 Feb 2009 (updated 25 Feb 2009 at 04:55 UTC) »
Serializable Snapshot Isolation

This semester at Berkeley, I'm taking CS286, the graduate-level data management course. In today's class, we discussed a paper that I thought might be of particular interest to Postgres hackers: "Serializable Isolation for Snapshot Databases", by Cahill, Rohm and Fekete from SIGMOD 2008.

The paper addresses a well-known problem with snapshot isolation (SI), which is the isolation level that Postgres actually provides when you ask for "SERIALIZABLE" isolation. SI basically means that a transaction sees a version of database state that corresponds to the effects of all the transactions that were committed before it began; it also sees the effects of its own updates. This is not equivalent to true serializability, however: that is, the database system can provide snapshot isolation, and yet still allow a concurrent transaction schedule that is not equivalent to some serial (one-at-a-time) transaction schedule.

To see why this is true, consider two concurrent transactions that both examine the state of the database, and then perform a write operation that reflects the values that they just read. The paper provides a simple example: suppose we have a database that describes the doctors in a hospital. We have a program that wants to move doctors from "on-call" to "off-duty", as long as there is at least one other doctor that is on-call. It's easy to see that if there are two doctors and we run two instances of the program concurrently, under SI rules we could end up with zero doctors on duty. This violates serializability: there's no serial schedule of these two transactions that could yield this erroneous database state.

The paper proposes a relatively simple modification to snapshot isolation that avoids this situations, by detecting a superset of the dangerous situations and aborting one of the transactions involved. I'll leave the details of their technique and the underlying theory to the paper, but it's very readable.

So, should we implement their technique in Postgres? It's an interesting idea, but the implementation cost would be very non-trivial. Despite the paper's claim that it imposes relatively little overhead on a traditional SI implementation, it would require basically tracking the set of rows each transaction has read, and keeping that information around for a bounded time period after the transaction has committed. I think that the performance costs of doing that naively would be too expensive for this to be feasible. Perhaps a cheaper implementation is possible (e.g. by tracking page-level reads rather than record-level reads)?

As an aside, it is somewhat bogus for PostgreSQL to provide snapshot isolation when the user asks for serializability; it is also a violation of the SQL standard, I believe. That said, Oracle does the same thing, so at least we're not alone, and it's hard to see a practical improvement. The relevant section of the docs could certainly make this point clearer, however.

SIGMOD 2009 Programming Contest

I just noticed that there's a programming contest at SIGMOD this year. The problem is relatively simple and tractable, although there are some interesting wrinkles:

  • The data is inserted "online" by 50 concurrent threads, which means there is no opportunity to do offline bulk build/reorganization of the index.
  • Solutions need to provide serializability, including avoiding the "phantom problem" (although that shouldn't be too hard: next-key locking should work).
  • Solutions are also penalized when they fail to meet a response-time SLA, which makes good performance about more than merely maximizing throughput.
CIDR

This weekend, I'll be at the CIDR 2009, the biennial Conference on Innovative Data Systems Research. It's an interesting conference: not as formal or as high-pedigree as the prestigious database conferences (SIGMOD and VLDB), but the papers are usually interesting and provocative. There is one track of peer reviewed papers and one of track of "Perspectives" that are selected by the program committee to spark a discussion. I'm one of the authors on a Perspectives track paper, "Continuous Analytics: Rethinking Query Processing in a Network-Effect World" — which is essentially a fancy title for the thesis that stream processing techniques are more widely applicable to mainstream business analytics than most people seem to think.

If you'll be there, say hi. In the near future, I hope to post more about the paper, and the rest of the research I've been doing so far at school.

Public Talk on Facebook Hive

The Berkeley DB group is hosting a talk about Facebook Hive this Thursday, at Soda Hall in UC Berkeley. Details and the abstract are below -- it should be an interesting talk! I'd encourage anyone in the area to attend -- if you need directions / parking suggestions / etc., just drop me a line.

Thursday, October 16th, 2008
606 Soda Hall, UC Berkeley
10-11am

Title: Hive: Data Warehousing using Hadoop

Abstract: Hive is an open-source data warehousing infrastructure built on top of Hadoop that allows SQL like queries along with abilities to add custom transformation scripts in different stages of data processing. It includes language constructs to import data from various sources, support for object oriented data types and a metadata repository that structures hadoop directories into relational tables and partitions with typed columns. Facebook uses this system for variety of tasks - classic log aggregation, graph mining, text analysis and indexing.

In this talk we will give an overview of the Hive system, the data model, query language compilation and execution and the metadata store. We will also discuss our near term roadmap and avenues for significant contributions in terms of query optimization, execution speed and data compression amongst others. We will also present some statistics on usage within Facebook and outline some of the challenges in operating Hive/Hadoop in a utility computing model in fast growing environment.

Bio: Joydeep Sensarma has been working in the Facebook Data Team for the last 1+ year where he's taken turns coding up Hive, keeping Hadoop running, eating and sleeping in that order. He's really glad he no longer works on closed source file and database systems like he did for the last ten years.

Zheng Shao has worked in Facebook Data Team on Hadoop and Hive for about 6 months. Before that he worked in the Yahoo web search team which heavily uses Hadoop.

Namit Jain has been working in the Facebook Data team with Hive for about 6 months. Before that he was in the database and application server groups at Oracle for about 10 years.

1 Sep 2008 (updated 1 Sep 2008 at 00:58 UTC) »
System R

One of the classes I'm taking at Berkeley this fall is CS262a, which is the first part of their graduate-level introductory "systems" class -- looking at great papers and common threads among operating systems, networking, databases, and the like. One of the first papers we're going to discuss is "A History And Evaluation of System R", which describes the seminal DBMS built by a team of 15 PhDs at IBM Research from 1974 to ~1980. The paper is a great read, especially if you're interested in database internals. (If you're going to read the paper, I suggest Joe Hellerstein's annotated version, which contains a number of insightful comments.)

A few comments of my own:

  • The scope of the project goals and the completeness of the implementation is remarkable, considering the time period and the lack of other production-quality RDBMS implementations at the time. System R included a cost-based query optimizer, joins, subqueries, updateable views, log-based crash recovery, granular locking, authentication and authorization, a relational system catalog, prepared queries, and other sophisticated features. In fact, System R even had the ability to automatically invalidate and replan prepared queries when their dependent objects changed, a feature Postgres didn't add until 8.3 (and we still don't have native support for updateable views).
  • People often complain that SQL is a poorly-designed language. In many respects that may be true, but it's not because the design of the language itself was neglected: even in 1975, the System R team gave "considerable thought ... to the human factors aspects of the SQL language, and an experimental study was conducted on the learnability and usability of SQL." While the goal of having secretaries and other non-technical staff writing SQL queries was perhaps not achieved, SQL wasn't a hackishly-designed language, even if it sometimes feels that way :)
  • The initial System R prototype supported subqueries, but not joins. That seems an unusual order in which to implement features, although it does make some sense (JMH points out that neglecting joins makes the optimizer search strategy much simpler).
  • One interesting design choice is that System R generated machine code from the query plan, rather than having the executor walk the plan tree at runtime. While this design sounded exotic to me at first glance, it actually makes sense: on the hardware of the time, queries were much more likely to be CPU bound than they are today.

The notes from the 1995 System R reunion are also an interesting read, if you'd like to learn more about the politics and history of the project.

Grad School

I've decided to go back to school — I'm excited to report that I'll be starting at the PhD program in computer science at UC Berkeley in the fall, working with Mike Franklin and Joe Hellerstein in the Berkeley Database Group. I'm not sure yet if this means I'll have more or less time to work on community Postgres stuff.

AsterDB and Postgres?

Aster Data Systems are a database startup that have received a bunch of press recently. I've now heard from two different people that Aster are built upon Postgres, but their website is still pretty content-free, so it's hard to be sure. I wouldn't be surprised, though: it's hard to make the case for building a database system from scratch in 2008, especially in a startup environment.

10 May 2008 (updated 10 May 2008 at 06:15 UTC) »
The End of Moore's Law

I was reading "The Problem with Threads" by Prof. Ed Lee, and noticed the following claim right on the first page:

Many technologists predict that the end of Moore’s Law will be answered with increasingly parallel computer architectures (multicore or chip [multiprocessors], CMPs) [15].

This quote confuses me, because, to the best of my knowledge, Moore's Law has not ended, and the industry's move to multicore/manycore processors is not directly related to the imminent demise of Moore's Law. Moore's Law is the claim that transistor density in integrated circuits approximately doubles every two years. As far as I know, that remains basically true for the time being, and current speculation is that it will continue to hold for at least 10 years.

What is driving the move to multicore designs is that we can no longer effectively use those extra transistors to increase the speed of a single sequential instruction stream. Ramping up clock speed increases heat dissipation, and doesn't improve performance very much if memory latency doesn't significantly change. Techniques like caching, pipelining, and superscalar execution help, but only to an extent. Hence the move to multicore designs and chip-level parallelism.

That said, I'm definitely not a hardware guy, and doubtless Prof. Lee has forgotten more about processor design than I am ever likely to know. And when Moore's Law ends, that may well encourage the multicore trend even more—but my understanding is that the eventual demise of Moore's Law and the current move to multicore architectures are not directly related. I'm curious to know if I'm mistaken.

(As an aside, text quoted above cites "Multicore CPUs for the Masses" in ACM Queue as support for the claim that the industry is moving toward multicore designs. While that is true, the article makes no mention of Moore's Law.)

SciDBMS

I noticed that the final report from the Science Database Research Meeting was released a little while ago. Worth reading if you're interested in how database technology can be applied to managing scientific data — they have some interesting ideas about both what problems need to be solved, but also how to develop those solutions into a product that scientists can use (via both an open source project and a startup company).

15 Apr 2008 (updated 15 Apr 2008 at 08:38 UTC) »
Kickfire and "Stream Processing"

I noticed Robert's post about the Kickfire launch. He mentioned Truviso — for whom I work — so I thought I'd add my two cents.

Kickfire is the company previous known as "C2App". I'm not familiar with the details of their technology, but the basic idea is to use custom hardware to accelerate data warehousing queries (this blog post has some more details). Using custom hardware is not a new idea — Netezza have been doing something superficially similar for years, with considerable success. In addition to custom hardware, Kickfire apparently use a few other data warehousing techniques that have recently come back in vogue (e.g. column-wise storage with compression, coupled with the ability to do query execution over compressed data). As an aside, I think that building a data warehousing product using MySQL is a fairly surprising technical decision.

One thing I did notice is that Kickfire's PR mentions "stream processing" repeatedly, and Robert's post suggests that the sort of stream processing done by Kickfire is similar to what Truviso does. This is not the case: the two companies and their products are very different. I'd guess that Kickfire are using the term because it's become something of a buzzword.

I'd like to talk more about Truviso on this blog in the future, but the basic idea behind data stream processing is to allow analysis queries to be performed over live streams of data, as the data arrives at the system. In traditional databases, in order to apply a query to a piece of data, you first need to insert the data item into the database, wait for it to be committed to disk (force-write the write-ahead log), and then finally run a query on it from scratch. When data arrives at a rapid pace and you need low-latency query results, this "store-and-query" model has terrible performance; it's also an unnatural way to structure a client application (you're essentially polling for results). Instead, a data stream query processor allows the user to define a set of long-running continuous queries that represent the conditions of interest over the incoming data streams. As new live data arrives, the data is applied to the queries to incrementally update their results; client applications can simply consume new query results as soon as they become available. This allows you to get query results that are always up-to-date, without the need to first write data to disk (the data can either be discarded, or else written to disk asynchronously). For certain domains, such as algorithmic trading, network and environment monitoring, fraud detection, and real-time reporting, the data stream approach often yields much better performance and a more natural programming model. For more info, see the talk on data stream query processing I gave at last year's PgCon.

So what does this have to do with using custom hardware to accelerate data warehousing queries? Not a whole lot. I'm guessing that Kickfire have co-opted the "stream processing" label because they push analysis queries down to the custom chip, and then "stream" the stored data over the chip, to compute multiple queries in a single pass. If you squint at it right, there are some similarities to stream query processing (in both cases, you only want to take one pass over the data), but fundamentally, Kickfire is trying to solve a very different problem, and using a very different set of technologies. Data warehouse engines like Kickfire (and Greenplum) are complements to data stream systems like Truviso (and Streambase, Coral8, and others), not supplements or competitors.

55 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!