Older blog entries for nconway (starting at number 56)


I noticed that the final report from the Science Database Research Meeting was released a little while ago. Worth reading if you're interested in how database technology can be applied to managing scientific data — they have some interesting ideas about both what problems need to be solved, but also how to develop those solutions into a product that scientists can use (via both an open source project and a startup company).

15 Apr 2008 (updated 15 Apr 2008 at 08:38 UTC) »
Kickfire and "Stream Processing"

I noticed Robert's post about the Kickfire launch. He mentioned Truviso — for whom I work — so I thought I'd add my two cents.

Kickfire is the company previous known as "C2App". I'm not familiar with the details of their technology, but the basic idea is to use custom hardware to accelerate data warehousing queries (this blog post has some more details). Using custom hardware is not a new idea — Netezza have been doing something superficially similar for years, with considerable success. In addition to custom hardware, Kickfire apparently use a few other data warehousing techniques that have recently come back in vogue (e.g. column-wise storage with compression, coupled with the ability to do query execution over compressed data). As an aside, I think that building a data warehousing product using MySQL is a fairly surprising technical decision.

One thing I did notice is that Kickfire's PR mentions "stream processing" repeatedly, and Robert's post suggests that the sort of stream processing done by Kickfire is similar to what Truviso does. This is not the case: the two companies and their products are very different. I'd guess that Kickfire are using the term because it's become something of a buzzword.

I'd like to talk more about Truviso on this blog in the future, but the basic idea behind data stream processing is to allow analysis queries to be performed over live streams of data, as the data arrives at the system. In traditional databases, in order to apply a query to a piece of data, you first need to insert the data item into the database, wait for it to be committed to disk (force-write the write-ahead log), and then finally run a query on it from scratch. When data arrives at a rapid pace and you need low-latency query results, this "store-and-query" model has terrible performance; it's also an unnatural way to structure a client application (you're essentially polling for results). Instead, a data stream query processor allows the user to define a set of long-running continuous queries that represent the conditions of interest over the incoming data streams. As new live data arrives, the data is applied to the queries to incrementally update their results; client applications can simply consume new query results as soon as they become available. This allows you to get query results that are always up-to-date, without the need to first write data to disk (the data can either be discarded, or else written to disk asynchronously). For certain domains, such as algorithmic trading, network and environment monitoring, fraud detection, and real-time reporting, the data stream approach often yields much better performance and a more natural programming model. For more info, see the talk on data stream query processing I gave at last year's PgCon.

So what does this have to do with using custom hardware to accelerate data warehousing queries? Not a whole lot. I'm guessing that Kickfire have co-opted the "stream processing" label because they push analysis queries down to the custom chip, and then "stream" the stored data over the chip, to compute multiple queries in a single pass. If you squint at it right, there are some similarities to stream query processing (in both cases, you only want to take one pass over the data), but fundamentally, Kickfire is trying to solve a very different problem, and using a very different set of technologies. Data warehouse engines like Kickfire (and Greenplum) are complements to data stream systems like Truviso (and Streambase, Coral8, and others), not supplements or competitors.

DBMS Internals for Undergrad Students

I noticed an interesting short paper on "Exposing Undergraduate Students to Database System Internals". Written by Joe Hellerstein at UC Berkeley and Anastasia Ailamaki at CMU, it describes their experience using PostgreSQL to teach courses that introduce undergraduate students to DBMS internals. This provides some context for the student projects on hash-based aggregation and other topics that have been occasionally mentioned on -hackers in the past (e.g. here).

10 Mar 2008 (updated 10 Mar 2008 at 21:09 UTC) »
Monitoring query progress

Monitoring the progress of a long-running analysis query is a cool problem -- it's been discussed on -hackers a few times in the past (e.g. by Greg Stark). In that thread, I pointed to some Wisconsin research on this topic (2004, 2006). That work was prototyped with Postgres. I just noticed that there's apparently a new project at the DB group at U of T that is tackling similar problems: ConEx. Apparently they are also using Postgres to build their prototype, which is always cool to see.

Semantic Web SIG Meeting

There's an interesting talk in Palo Alto on Wednesday: "Are Scalable Graph Data Applications Possible?". Speakers will include Sam Madden from MIT, Andy Palmer (one of the founders of Vertica), and someone from Franz Inc -- who are apparently selling an RDF database implementation, in addition to their long-standing Lisp-related products.

21 Feb 2008 (updated 21 Feb 2008 at 08:38 UTC) »
Data Management for RDF

I was talking to a database researcher recently about why the artificial intelligence community and the database community haven't historically seen eye-to-eye. The researcher's opinion was that AI folks tend to regard databases as hopelessly limited in their expressive power, whereas DB folks tend to view AI data models as hopelessly difficult to implement efficiently. There is probably some truth to both views.

I was reminded of this when doing some reading about data management techniques for RDF (the proposed data model for the Semantic Web). Abadi et al.'s "Scalable Semantic Web Data Management Using Vertical Partitioning" is a nice paper from VLDB 2007, and appears to be one of a relatively small group of papers that approach the Semantic Web from a database systems perspective. The paper proposes a new model for storing RDF data, which essentially applies the column-store ideas from the C-Store and Vertica projects. Sam Madden and Daniel Abadi talk about their ideas more in a blog entry at The Database Column.

Planet PostgreSQL readers might be interested in this observation in the paper:

We chose Postgres as the row-store to experiment with because Beckmann et al. experimentally showed that it was by far more efficient dealing with sparse data than commercial database products. Postgres does not waste space storing NULL data: every tuple is preceded by a bit-string of cardinality equal to the number of attributes, with '1's at positions of the non-NULL values in the tuple. NULL data is thus not stored; this is unlike commercial products that waste space on NULL data. Beckmann et al. show that Postgres queries over sparse data operate about eight times faster than commercial systems

(A minor nitpick: Postgres will omit the per-tuple NULL bitmap when none of the attributes of a tuple are NULL, so it is not quite true that "every tuple is preceded by a bit-string".)

The cited Beckman et al. paper is "Extending RDBMSs To Support Sparse Datasets Using An Interpreted Attribute Storage Format".

It's interesting that none of the leading commercial systems seem to use exactly the same NULL bitmap approach that Postgres does. The tradeoff appears to be of storage against computation time: eliding the NULL values from the on-disk tuple reduces storage requirements, but makes it more expensive to find the offset within a tuple at which an attribute begins, if the attribute is preceded by one or more (elided) NULL values. If NULL values were stored in the on-disk tuple (and no variable-width attributes are used), the offset of an attribute can be found more efficiently.

In practice, Postgres implements another optimization that mitigates this problem to some extent: as tuples are passed around the executor and attributes are "extracted" from the on-disk tuple representation, they are effectively cached using the TupleTableSlot mechanism. This means that the computation to find the right offset for an attribute in the presence of NULLs is typically only done at most once per attribute of a tuple.

Nice DBMS Internals Overview Paper

I noticed that Joe Hellerstein, Mike Stonebraker, and James Hamilton (DBMS luminaries all) have published a nice, reasonably high-level paper describing the architecture and design principles of a typical database management system: "Architecture of a Database System".

PostgreSQL Mailing List Archives

MarkMail is now indexing all 630,000+ messages from the PostgreSQL mailing list archives. If, like me, you've been frustrated when trying to use the search engine and archives at archives.postgresql.org, I suggest checking out MarkMail. It's been working very well for me so far.

Signed overflow in C

Ian Lance Taylor's blog has an interesting post on signed overflow behavior in C. According to the C standard, integer overflow results in undefined behavior, and modern versions of GCC take advantage of this to generate more efficient code. This topic was raised on -hackers by Tom a few years ago — at the time, only the -fwrapv flag was implemented by GCC. Now that GCC 4.2 provides -Wstrict-overflow, this might be worth investigating further.

The broader point here is that while this optimization is completely legal according to the C standard, it is inconsistent with the traditional C semantics, and runs the risk of breaking code that depends on integer overflow having the expected behavior. At least GCC now provides a flag to emit warnings for potentially broken code, which IMHO is a prerequisite for doing aggressive optimizations of this type. There's another interesting post on Ian Lance Taylor's blog that discusses this situation in general (e.g. alias optimizations are another instance where the C standard contradicts the traditional expectations of C programmers).

Filesystem Replication and Software Quality

A few years ago, I did a summer internship with a group at Microsoft that was building a multimaster filesystem replication product. This was a very rewarding experience for several reasons. Now that the replication product has been shipped (in Windows 2003-R2, Vista, and Windows Live Messenger), I was happy to see that my mentor for that summer, Nikolaj Bjørner, has published a paper containing "lessons learned" from the project: "Models and Software Model Checking of a Distributed File Replication System". The paper is worth reading, for a few reasons:

  1. Why is filesystem replication such a hard problem, particularly in the asynchronous, multi-master case?
    The paper talks about the basic problem and the approach the group took to solving it.
  2. Perhaps more interestingly, how do you go about constructing a high-quality implementation of such a product?
    I was impressed by the group's emphasis on correctness. Nikolaj and Dan (the technical lead for the group) both had a CS theory background, so this is perhaps not surprising -- but it's interesting to see some of the practical techniques that they used to ensure they built a correct replication system:
    • A detailed specification (on the order of a few hundred pages)
    • A prototype of the system in OCaml, written concurrently with the specification but before the real implementation work began
    • A high-level, executable specification of the replication protocol in AsmL. This served as both a readable description of the protocol, as well as a way to automatically generate useful test cases.
    • Using model checking to verify the correctness of certain particularly complex aspects of the protocol (distributed garbage collection, conflict resolution).
    • A "simulator" that walked a random tree of filesystem operations, pausing after each node to verify that the system had correctly replicated the resulting filesystem state. Once a leaf node in the tree was reached, the simulator then backtracked, exploring another branch of the tree. The simulator was also clearly inspired by model checking techniques. By replacing certain components of the real system with virtualized ones (e.g. using a toy in-memory database), this tool could be used to test large numbers of scenarios very quickly.
    • Exhaustive testing. Using the simulator and a cluster of test machines, more than 500 billion test cases were examined.

47 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!