Name: Neil Conway
Member since: 2001-07-11 16:27:29
Last Login: 2009-05-04 21:45:18
Homepage: http://neilconway.org
Notes: I'm currently a graduate student in computer science at UC Berkeley. I received my undergraduate degree in CS from Queen's University in Kingston, Ontario. I'm originally from Toronto. In the past, I've maintained the Ruby/GTK+ bindings, and contributed to various other bits of F/OSS. Most of the open source work I do these days is on PostgreSQL.
I've started a new blog, Everything is Data, to talk about data management and distributed systems, and to shamelessly flog my own research :) My first post is up, discussing the recent Stonebraker et al. SIGMOD paper comparing parallel DBs with MapReduce.
Unfortunately, the demands of grad school are such that I don't have much (any) time to work on Postgres at the moment, but if you're interested, please subscribe -- I'll probably discontinue my blog on Advogato in the future. I've imported all my old posts from this blog (using the XML-RPC interfaces provided by both WordPress and Advogato -- it was only mildly painful).
25 Feb 2009 (updated 25 Feb 2009 at 04:55 UTC)
This semester at Berkeley, I'm taking CS286, the graduate-level data management course. In today's class, we discussed a paper that I thought might be of particular interest to Postgres hackers: "Serializable Isolation for Snapshot Databases", by Cahill, Rohm and Fekete from SIGMOD 2008.
The paper addresses a well-known problem with snapshot isolation (SI), which is the isolation level that Postgres actually provides when you ask for "SERIALIZABLE" isolation. SI basically means that a transaction sees a version of database state that corresponds to the effects of all the transactions that were committed before it began; it also sees the effects of its own updates. This is not equivalent to true serializability, however: that is, the database system can provide snapshot isolation, and yet still allow a concurrent transaction schedule that is not equivalent to some serial (one-at-a-time) transaction schedule.
To see why this is true, consider two concurrent transactions that both examine the state of the database, and then perform a write operation that reflects the values that they just read. The paper provides a simple example: suppose we have a database that describes the doctors in a hospital. We have a program that wants to move a doctor from "on-call" to "off-duty", as long as there is at least one other doctor who is on-call. It's easy to see that if there are two doctors and we run two instances of the program concurrently, one for each doctor, under SI rules we could end up with zero doctors on call: each instance reads a snapshot in which the other doctor is still on-call, and the two instances update different rows, so neither write conflicts with the other. This violates serializability: there's no serial schedule of these two transactions that could yield this erroneous database state.
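The anomaly can be reproduced with a toy model. The sketch below is only an illustration, not how Postgres implements snapshots: `SnapshotDB`, `Txn`, `go_off_duty` and the doctor names are all made up for this example, and the only conflict check is the usual "abort if someone else committed a write to the same row since I started" rule.

```python
class SnapshotDB:
    """Toy in-memory store with snapshot isolation (hypothetical, for illustration)."""

    def __init__(self, rows):
        self.rows = dict(rows)                   # latest committed state
        self.version = {key: 0 for key in rows}  # commit time of each row's last write
        self.clock = 0                           # number of commits so far

    def begin(self):
        return Txn(self)


class Txn:
    def __init__(self, db):
        self.db = db
        self.snapshot = dict(db.rows)  # committed state as of transaction start
        self.start = db.clock
        self.writes = {}

    def keys(self):
        return list(self.snapshot)

    def read(self, key):
        # A transaction sees its own writes, otherwise its snapshot.
        return self.writes.get(key, self.snapshot[key])

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Abort only on a write-write conflict with a transaction
        # that committed after this one started.
        for key in self.writes:
            if self.db.version[key] > self.start:
                raise RuntimeError("write-write conflict on " + key)
        self.db.clock += 1
        for key, value in self.writes.items():
            self.db.rows[key] = value
            self.db.version[key] = self.db.clock


def go_off_duty(txn, doctor):
    """Take `doctor` off call only if another doctor remains on call."""
    on_call = [d for d in txn.keys() if txn.read(d) == "on-call"]
    if doctor in on_call and len(on_call) >= 2:
        txn.write(doctor, "off-duty")


def run_concurrently():
    db = SnapshotDB({"alice": "on-call", "bob": "on-call"})
    t1, t2 = db.begin(), db.begin()  # both snapshots see two doctors on call
    go_off_duty(t1, "alice")
    go_off_duty(t2, "bob")
    t1.commit()
    t2.commit()                      # different rows, so no conflict is detected
    return db.rows


def run_serially():
    db = SnapshotDB({"alice": "on-call", "bob": "on-call"})
    t1 = db.begin()
    go_off_duty(t1, "alice")
    t1.commit()
    t2 = db.begin()                  # this snapshot already sees alice off duty
    go_off_duty(t2, "bob")
    t2.commit()
    return db.rows
```

In this model `run_concurrently()` leaves both doctors off duty, while `run_serially()` (in either order) always leaves one doctor on call.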
The paper proposes a relatively simple modification to snapshot isolation that avoids these situations, by detecting a superset of the dangerous situations and aborting one of the transactions involved. I'll leave the details of their technique and the underlying theory to the paper, but it's very readable.
So, should we implement their technique in Postgres? It's an interesting idea, but the implementation cost would be substantial. Despite the paper's claim that it imposes relatively little overhead on a traditional SI implementation, it would essentially require tracking the set of rows each transaction has read, and keeping that information around for a bounded time period after the transaction has committed. I think a naive implementation of that would be too expensive to be feasible. Perhaps a cheaper implementation is possible (e.g. by tracking page-level reads rather than record-level reads)?
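To make the record-level vs. page-level tradeoff concrete, here is a rough sketch. It is not the paper's algorithm and not anything Postgres does; `ROWS_PER_PAGE`, `page_of` and the row ids are assumptions invented for the example. It only shows the bookkeeping difference: coarser tracking stores fewer entries but reports conflicts that record-level tracking would not.

```python
ROWS_PER_PAGE = 4  # hypothetical page size, chosen only for illustration


def page_of(row_id):
    """Map a row id to the (made-up) page that holds it."""
    return row_id // ROWS_PER_PAGE


def rw_conflict_rows(reads, writes):
    """True if some written row is in the reader's record-level read set."""
    return bool(set(reads) & set(writes))


def rw_conflict_pages(reads, writes):
    """True if some written row lives on a page the reader touched."""
    read_pages = {page_of(r) for r in reads}
    return any(page_of(w) in read_pages for w in writes)


def tracking_entries(reads):
    """Entries kept per transaction under (record-level, page-level) tracking."""
    return len(set(reads)), len({page_of(r) for r in reads})


t1_reads = [0, 1]   # rows read by one transaction
t2_writes = [1]     # a concurrent write to a row t1 actually read
t3_writes = [2]     # a concurrent write to a different row on the same page

real = rw_conflict_rows(t1_reads, t2_writes)         # genuine overlap
row_level = rw_conflict_rows(t1_reads, t3_writes)    # no overlap at row level
page_level = rw_conflict_pages(t1_reads, t3_writes)  # flagged at page level
```

Since the paper's approach already aborts on a superset of the dangerous cases, a false positive from page-level tracking would be safe but would cost an unnecessary abort; whether that is a good trade depends on the workload.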
As an aside, it is somewhat bogus for PostgreSQL to provide snapshot isolation when the user asks for serializability; it is also a violation of the SQL standard, I believe. That said, Oracle does the same thing, so at least we're not alone, and it's hard to see a practical improvement. The relevant section of the docs could certainly make this point clearer, however.
I just noticed that there's a programming contest at SIGMOD this year. The problem is relatively simple and tractable, although there are some interesting wrinkles:
This weekend, I'll be at CIDR 2009, the biennial Conference on Innovative Data Systems Research. It's an interesting conference: not as formal or as high-pedigree as the prestigious database conferences (SIGMOD and VLDB), but the papers are usually interesting and provocative. There is one track of peer-reviewed papers and one track of "Perspectives" that are selected by the program committee to spark a discussion. I'm one of the authors on a Perspectives track paper, "Continuous Analytics: Rethinking Query Processing in a Network-Effect World", which is essentially a fancy title for the thesis that stream processing techniques are more widely applicable to mainstream business analytics than most people seem to think.
If you'll be there, say hi. In the near future, I hope to post more about the paper, and the rest of the research I've been doing so far at school.
The Berkeley DB group is hosting a talk about Facebook Hive this Thursday, at Soda Hall in UC Berkeley. Details and the abstract are below -- it should be an interesting talk! I'd encourage anyone in the area to attend -- if you need directions / parking suggestions / etc., just drop me a line.
Thursday, October 16th, 2008
606 Soda Hall, UC Berkeley
10-11am
Title: Hive: Data Warehousing using Hadoop
Abstract: Hive is an open-source data warehousing infrastructure built on top of Hadoop that allows SQL like queries along with abilities to add custom transformation scripts in different stages of data processing. It includes language constructs to import data from various sources, support for object oriented data types and a metadata repository that structures hadoop directories into relational tables and partitions with typed columns. Facebook uses this system for variety of tasks - classic log aggregation, graph mining, text analysis and indexing.
In this talk we will give an overview of the Hive system, the data model, query language compilation and execution and the metadata store. We will also discuss our near term roadmap and avenues for significant contributions in terms of query optimization, execution speed and data compression amongst others. We will also present some statistics on usage within Facebook and outline some of the challenges in operating Hive/Hadoop in a utility computing model in fast growing environment.
Bio: Joydeep Sensarma has been working in the Facebook Data Team for the last 1+ year where he's taken turns coding up Hive, keeping Hadoop running, eating and sleeping in that order. He's really glad he no longer works on closed source file and database systems like he did for the last ten years.
Zheng Shao has worked in Facebook Data Team on Hadoop and Hive for about 6 months. Before that he worked in the Yahoo web search team which heavily uses Hadoop.
Namit Jain has been working in the Facebook Data team with Hive for about 6 months. Before that he was in the database and application server groups at Oracle for about 10 years.