nconway is currently certified at Master level.

Name: Neil Conway
Member since: 2001-07-11 16:27:29
Last Login: 2009-05-04 21:45:18


Homepage: http://neilconway.org

Notes:

I'm currently a graduate student in computer science at UC Berkeley. I received my undergraduate degree in CS from Queen's University in Kingston, Ontario. I'm originally from Toronto. In the past, I've maintained the Ruby/GTK+ bindings, and contributed to various other bits of F/OSS. Most of the open source work I do these days is on PostgreSQL.


Recent blog entries by nconway

New Blog

I've started a new blog, Everything is Data, to talk about data management and distributed systems, and to shamelessly flog my own research :) My first post is up, discussing the recent Stonebraker et al. SIGMOD paper comparing parallel databases with MapReduce.

Unfortunately, the demands of grad school are such that I don't have much (any) time to work on Postgres at the moment, but if you're interested, please subscribe -- I'll probably discontinue my blog on Advogato in the future. I've imported all my old posts from this blog (using the XML-RPC interfaces provided by both WordPress and Advogato -- it was only mildly painful).

25 Feb 2009 (updated 25 Feb 2009 at 04:55 UTC)
Serializable Snapshot Isolation

This semester at Berkeley, I'm taking CS286, the graduate-level data management course. In today's class, we discussed a paper that I thought might be of particular interest to Postgres hackers: "Serializable Isolation for Snapshot Databases", by Cahill, Röhm, and Fekete, from SIGMOD 2008.

The paper addresses a well-known problem with snapshot isolation (SI), which is the isolation level that Postgres actually provides when you ask for "SERIALIZABLE" isolation. Under SI, a transaction sees a version of the database state that reflects all the transactions that committed before it began, plus the effects of its own updates. This is not equivalent to true serializability: a system can provide snapshot isolation and yet still allow a concurrent transaction schedule that is not equivalent to any serial (one-at-a-time) schedule.

To see why, consider two concurrent transactions that both examine the state of the database and then perform a write that depends on the values they just read. The paper gives a simple example: suppose we have a database describing the doctors in a hospital, and a program that moves a doctor from "on-call" to "off-duty" as long as at least one other doctor remains on call. If there are two doctors and we run two instances of the program concurrently, under SI rules we can end up with zero doctors on call. This violates serializability: no serial schedule of these two transactions could yield this erroneous database state.
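To make this concrete, here's a minimal sketch of the anomalous schedule in Postgres syntax (the doctors table and its contents are hypothetical, invented just for illustration):

    -- Hypothetical schema: two doctors, both on call.
    CREATE TABLE doctors (name text PRIMARY KEY, on_call boolean NOT NULL);
    INSERT INTO doctors VALUES ('alice', true), ('bob', true);

    -- T1 and T2 run concurrently, in separate sessions.
    BEGIN ISOLATION LEVEL SERIALIZABLE;           -- T1
    SELECT count(*) FROM doctors WHERE on_call;   -- T1 sees 2

    BEGIN ISOLATION LEVEL SERIALIZABLE;           -- T2
    SELECT count(*) FROM doctors WHERE on_call;   -- T2 also sees 2

    -- Each transaction concludes it is safe to go off call:
    UPDATE doctors SET on_call = false WHERE name = 'alice';  -- in T1
    UPDATE doctors SET on_call = false WHERE name = 'bob';    -- in T2

    COMMIT;  -- T1
    COMMIT;  -- T2 also succeeds: the write sets don't overlap, so SI
             -- raises no conflict. Result: zero doctors on call, a
             -- state no serial execution of T1 and T2 could produce.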

The paper proposes a relatively simple modification to snapshot isolation that avoids this situation by detecting a superset of the dangerous cases and aborting one of the transactions involved. I'll leave the details of their technique and the underlying theory to the paper, which is very readable.

So, should we implement their technique in Postgres? It's an interesting idea, but the implementation cost would be decidedly non-trivial. Despite the paper's claim that it imposes relatively little overhead on a traditional SI implementation, it would essentially require tracking the set of rows each transaction has read, and keeping that information around for a bounded period after the transaction commits. I suspect the performance cost of doing that naively would make it infeasible. Perhaps a cheaper implementation is possible (e.g., tracking page-level rather than record-level reads)?
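In the meantime, the standard way to dodge this particular anomaly in Postgres today is to make the read/write dependency visible to the lock manager with explicit row locks. A minimal sketch, reusing the hypothetical doctors table from above:

    -- SELECT ... FOR UPDATE locks the rows the decision depends on
    -- (FOR UPDATE can't be combined with aggregates, so we fetch the
    -- rows and count them on the client side).
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    SELECT name FROM doctors WHERE on_call FOR UPDATE;
    UPDATE doctors SET on_call = false WHERE name = 'alice';
    COMMIT;
    -- A concurrent copy of this transaction blocks on the row locks
    -- and, once the first commits, aborts with a serialization error
    -- rather than silently producing the write-skew anomaly.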

As an aside, it is somewhat bogus for PostgreSQL to provide snapshot isolation when the user asks for serializability; I believe it is also a violation of the SQL standard. That said, Oracle does the same thing, so at least we're not alone, and it's hard to see a practical alternative. The relevant section of the docs could certainly make this point clearer, however.

SIGMOD 2009 Programming Contest

I just noticed that there's a programming contest at SIGMOD this year. The problem is relatively simple and tractable, although there are some interesting wrinkles:

  • The data is inserted "online" by 50 concurrent threads, which means there is no opportunity to do offline bulk build/reorganization of the index.
  • Solutions need to provide serializability, including avoiding the "phantom problem" (although that shouldn't be too hard: next-key locking should work; see the sketch after this list).
  • Solutions are also penalized when they fail to meet a response-time SLA, which makes good performance about more than merely maximizing throughput.
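For anyone who hasn't run into the phantom problem before, here's a quick sketch in SQL (the orders table is hypothetical). Under weak isolation, re-evaluating a predicate can return rows that didn't exist the first time:

    BEGIN;  -- READ COMMITTED, Postgres's default
    SELECT count(*) FROM orders WHERE amount > 100;  -- returns, say, 10

    -- Meanwhile, another session commits:
    --   INSERT INTO orders (amount) VALUES (250);

    SELECT count(*) FROM orders WHERE amount > 100;  -- now 11: a phantom
    COMMIT;

Next-key locking prevents this by locking not only the index entries a scan reads but also the gaps between them, so a concurrent insert into the scanned key range must block until the reader finishes.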
CIDR

This weekend, I'll be at CIDR 2009, the biennial Conference on Innovative Data Systems Research. It's an interesting conference: not as formal or as high-pedigree as the prestigious database conferences (SIGMOD and VLDB), but the papers are usually interesting and provocative. There is one track of peer-reviewed papers and one track of "Perspectives" papers, selected by the program committee to spark discussion. I'm one of the authors on a Perspectives-track paper, "Continuous Analytics: Rethinking Query Processing in a Network-Effect World", which is essentially a fancy title for the thesis that stream processing techniques are more widely applicable to mainstream business analytics than most people seem to think.

If you'll be there, say hi. In the near future, I hope to post more about the paper, and the rest of the research I've been doing so far at school.

Public Talk on Facebook Hive

The Berkeley database group is hosting a talk about Facebook's Hive this Thursday in Soda Hall at UC Berkeley. Details and the abstract are below -- it should be an interesting talk! I'd encourage anyone in the area to attend; if you need directions, parking suggestions, etc., just drop me a line.

Thursday, October 16th, 2008
606 Soda Hall, UC Berkeley
10-11am

Title: Hive: Data Warehousing using Hadoop

Abstract: Hive is an open-source data warehousing infrastructure built on top of Hadoop that allows SQL-like queries, along with the ability to add custom transformation scripts at different stages of data processing. It includes language constructs to import data from various sources, support for object-oriented data types, and a metadata repository that structures Hadoop directories into relational tables and partitions with typed columns. Facebook uses this system for a variety of tasks: classic log aggregation, graph mining, text analysis, and indexing.

In this talk we will give an overview of the Hive system: the data model, query language, compilation and execution, and the metadata store. We will also discuss our near-term roadmap and avenues for significant contributions, such as query optimization, execution speed, and data compression. We will also present some statistics on usage within Facebook and outline some of the challenges of operating Hive/Hadoop as a utility computing service in a fast-growing environment.

Bio: Joydeep Sensarma has been working on the Facebook Data Team for the last year-plus, where he's taken turns coding up Hive, keeping Hadoop running, eating, and sleeping -- in that order. He's really glad he no longer works on closed-source file and database systems, as he did for the previous ten years.

Zheng Shao has worked on the Facebook Data Team's Hadoop and Hive efforts for about six months. Before that he worked on the Yahoo! web search team, which is a heavy user of Hadoop.

Namit Jain has been working with Hive on the Facebook Data Team for about six months. Before that he spent about ten years in the database and application server groups at Oracle.
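If you haven't seen Hive before, queries look a great deal like SQL. Here's a rough sketch of the kind of log aggregation and custom transformation the abstract describes (the table, columns, and script are hypothetical):

    -- Aggregate a hypothetical page-view log, one partition per day:
    SELECT page_url, count(1) AS views
    FROM page_views
    WHERE dt = '2008-10-01'
    GROUP BY page_url;

    -- User-supplied scripts can be spliced into the query pipeline:
    SELECT TRANSFORM (user_id, page_url)
        USING 'python sessionize.py'    -- hypothetical user script
        AS (user_id, session_id)
    FROM page_views
    WHERE dt = '2008-10-01';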


nconway certified others as follows:

  • nconway certified nconway as Journeyer
  • nconway certified garym as Master
  • nconway certified dvl as Apprentice
  • nconway certified Marcus as Master
  • nconway certified louie as Master
  • nconway certified graydon as Master
  • nconway certified whytheluckystiff as Journeyer
  • nconway certified bryanf as Journeyer
  • nconway certified karimlakhani as Apprentice
  • nconway certified robertc as Master
  • nconway certified tcopeland as Journeyer
  • nconway certified Xorian as Master
  • nconway certified ncm as Master
  • nconway certified alan as Master
  • nconway certified alvherre as Master
  • nconway certified movement as Master
  • nconway certified lkcl as Apprentice

Others have certified nconway as follows:

  • nconway certified nconway as Journeyer
  • garym certified nconway as Master
  • zbowling certified nconway as Master
  • alvherre certified nconway as Master
  • kjw certified nconway as Master
  • jarod certified nconway as Master
  • mgonzalez certified nconway as Master
  • chriscog certified nconway as Master


