<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Advogato blog for nconway</title>
    <link>http://www.advogato.org/person/nconway/</link>
    <description>Advogato blog for nconway</description>
    <language>en-us</language>
    <generator>mod_virgule</generator>
    <pubDate>Sun, 19 May 2013 19:58:50 GMT</pubDate>
    <item>
      <pubDate>Mon, 4 May 2009 21:19:11 GMT</pubDate>
      <title>4 May 2009</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=64</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=64</guid>
      <description>&lt;b&gt;New Blog&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; I've started a new blog, &lt;a href="http://everythingisdata.wordpress.com" &gt;Everything is Data&lt;/a&gt;, to &#xD;
talk &#xD;
about data management, distributed systems, and to shamelessly flog my &#xD;
own &#xD;
research :) My &lt;a href="http://everythingisdata.wordpress.com/2009/05/04/mapreduce-vs-parallel-dbs/" &gt;first post&lt;/a&gt; is up, discussing the recent Stonebraker et al. &#xD;
SIGMOD paper comparing parallel DBs with MapReduce.&#xD;
&#xD;
&lt;p&gt; Unfortunately, the demands of grad school are such that I don't have much &#xD;
(any) time to work on Postgres at the moment, but &#xD;
if you're interested, please subscribe -- I'll probably discontinue my blog on &#xD;
Advogato in the future. I've imported all my old posts from this blog (using &#xD;
the XML-RPC &#xD;
interfaces provided by both WordPress and Advogato -- it was only mildly &#xD;
painful).</description>
    </item>
    <item>
      <pubDate>Wed, 25 Feb 2009 04:52:06 GMT</pubDate>
      <title>25 Feb 2009</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=63</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=63</guid>
      <description>&lt;b&gt;Serializable Snapshot Isolation&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; This semester at Berkeley, I'm &#xD;
taking &lt;a href="http://sites.google.com/a/cs.berkeley.edu/cs286-sp09/" &gt;CS286&lt;/a&gt;, the graduate-level data management course. In today's &#xD;
class, we discussed a paper that I thought might &#xD;
be of particular interest to Postgres hackers: "&lt;a href="http://www.cse.ust.hk/~yjrobin/reading_list/%5BTransaction%5DSerializable%20isolation%20for%20snapshot%20databases.pdf" &gt;Serializable &#xD;
Isolation for Snapshot Databases&lt;/a&gt;", by Cahill, Rohm and Fekete from &#xD;
SIGMOD 2008.&#xD;
&#xD;
&lt;p&gt; The paper addresses a well-known problem with snapshot isolation (SI), &#xD;
which is the isolation level that Postgres actually provides when you ask for &#xD;
"SERIALIZABLE" isolation. SI basically means that a transaction sees a &#xD;
version of database state that corresponds to the effects of all the &#xD;
transactions that were committed before it began; it also sees the effects of &#xD;
its own updates. This is &lt;i&gt;not&lt;/i&gt; equivalent to true serializability, &#xD;
however: that is, the database system can provide snapshot isolation, and yet &#xD;
still allow a concurrent transaction schedule that is not equivalent to &#xD;
some serial (one-at-a-time) transaction schedule.&#xD;
&#xD;
&lt;p&gt; To see why this is true, consider two concurrent transactions that both &#xD;
examine the state of the database, and then perform a write operation that &#xD;
reflects the values that they just read. The paper provides a simple example: &#xD;
suppose we have a database that describes the doctors in a hospital. We have &#xD;
a program that wants to move doctors from "on-call" to "off-duty", as long as &#xD;
there is at least one other doctor that is on-call. It's easy to see that if there &#xD;
are two doctors and we run two instances of the program concurrently, under &#xD;
SI rules we could end up with zero doctors on call. This violates &#xD;
serializability: there's no serial schedule of these two transactions that could &#xD;
yield this erroneous database state.&#xD;
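
&lt;p&gt; The anomaly is easy to reproduce in a toy model. The following Python sketch is not how Postgres implements snapshots: the Database and Transaction classes, the doctor names, and the "first committer wins" check are all simplifying assumptions made up for illustration. Each transaction reads from a private copy of the committed state taken when it starts, and commit only rejects write-write conflicts, which is the essence of SI.

&lt;pre&gt;
```python
# Toy model of snapshot isolation (SI). All names here are hypothetical.
class Database:
    def __init__(self, rows):
        self.rows = dict(rows)                  # committed state
        self.version = {key: 0 for key in rows} # commits seen per row


class Transaction:
    def __init__(self, db):
        self.db = db
        self.snapshot = dict(db.rows)           # reads only ever see this copy
        self.start_version = dict(db.version)
        self.writes = {}

    def read(self, key):
        # A transaction sees its own writes on top of its snapshot.
        return self.writes.get(key, self.snapshot[key])

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # SI checks write-write conflicts only; reads are never validated.
        for key in self.writes:
            if self.db.version[key] != self.start_version[key]:
                raise RuntimeError("write-write conflict on " + key)
        for key, value in self.writes.items():
            self.db.rows[key] = value
            self.db.version[key] += 1


def go_off_duty(txn, doctor):
    # Leave only if at least one other doctor is still on call.
    others = [d for d in txn.snapshot
              if d != doctor and txn.read(d) == "on-call"]
    if others:
        txn.write(doctor, "off-duty")


def run_concurrently():
    db = Database({"alice": "on-call", "bob": "on-call"})
    t1, t2 = Transaction(db), Transaction(db)   # both snapshots taken up front
    go_off_duty(t1, "alice")
    go_off_duty(t2, "bob")
    t1.commit()
    t2.commit()                                 # disjoint rows, so no conflict
    return db.rows


def run_serially():
    db = Database({"alice": "on-call", "bob": "on-call"})
    t1 = Transaction(db)
    go_off_duty(t1, "alice")
    t1.commit()
    t2 = Transaction(db)                        # sees t1's committed update
    go_off_duty(t2, "bob")
    t2.commit()
    return db.rows
```
&lt;/pre&gt;

&lt;p&gt; Run concurrently, both transactions see two on-call doctors in their snapshots and write disjoint rows, so neither commit is rejected and both doctors end up off-duty. Run one after the other, the second transaction sees the first one's update and leaves its doctor on call.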
&#xD;
&lt;p&gt; The paper proposes a relatively simple modification to snapshot isolation that &#xD;
avoids these anomalies by detecting a superset of the dangerous situations &#xD;
and aborting one of the transactions involved. I'll leave the details of their &#xD;
technique and the underlying theory to the paper, but it's very readable.&#xD;
&#xD;
&lt;p&gt; So, should we implement their technique in Postgres? It's an interesting idea, &#xD;
but the implementation cost would be far from trivial. Despite the paper's &#xD;
claim that it imposes relatively little overhead on a traditional SI &#xD;
implementation, it would essentially require tracking the set of rows each &#xD;
transaction has read, and keeping that information around for a bounded &#xD;
time period after the transaction has committed. I think that doing this naively &#xD;
would be too expensive to be feasible. Perhaps a &#xD;
cheaper implementation is possible (e.g. by tracking page-level reads rather &#xD;
than record-level reads)?&#xD;
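
&lt;p&gt; To make that trade-off concrete, here is a rough Python sketch. It is only an illustration, not a proposal for the Postgres implementation: ROWS_PER_PAGE, the integer row ids, and all of the function names are assumptions. Tracking pages keeps the read set small, at the price of reporting conflicts between transactions that merely touched the same page.

&lt;pre&gt;
```python
ROWS_PER_PAGE = 100  # assumed page capacity, chosen only for illustration


def row_level_read_set(row_ids):
    # One entry per row the transaction has read.
    return set(row_ids)


def page_level_read_set(row_ids):
    # One entry per page; many rows collapse into a single page id.
    return {row_id // ROWS_PER_PAGE for row_id in row_ids}


def conflicts_row_level(read_rows, written_rows):
    # True only if a row that was read has also been written.
    return not row_level_read_set(read_rows).isdisjoint(written_rows)


def conflicts_page_level(read_rows, written_rows):
    # True if any written row lives on a page that was read from.
    written_pages = {row_id // ROWS_PER_PAGE for row_id in written_rows}
    return not page_level_read_set(read_rows).isdisjoint(written_pages)
```
&lt;/pre&gt;

&lt;p&gt; With these assumptions, a scan of rows 0 through 999 needs 1000 read-set entries at row level but only 10 at page level. On the other hand, a transaction that read rows 1000 through 1049 is flagged as conflicting with a write to row 1050 under page-level tracking, even though it never read that row.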
&#xD;
&lt;p&gt; As an aside, it is somewhat bogus for PostgreSQL to provide snapshot &#xD;
isolation when the user asks for serializability; it is also a violation of the SQL &#xD;
standard, I believe. That said, Oracle does the &#xD;
same thing, so at least we're not alone, and it's hard to see a practical &#xD;
improvement. The &lt;a href="http://developer.postgresql.org/pgdocs/postgres/transaction-iso.html#XACT-SERIALIZABLE" &gt;relevant section&lt;/a&gt; of the docs could &#xD;
certainly make this point clearer, however.</description>
    </item>
    <item>
      <pubDate>Tue, 6 Jan 2009 02:09:32 GMT</pubDate>
      <title>6 Jan 2009</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=62</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=62</guid>
      <description>&lt;b&gt;SIGMOD 2009 Programming Contest&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; I just noticed that there's a &lt;a href="http://db.csail.mit.edu/sigmod09contest/" &gt;programming&#xD;
contest&lt;/a&gt; at SIGMOD this year. The problem&#xD;
is relatively simple and tractable, although there are some&#xD;
interesting wrinkles:&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;The data is inserted "online" by 50 concurrent threads,&#xD;
which means there is no opportunity to do offline bulk&#xD;
build/reorganization of the index.&#xD;
&lt;li&gt;Solutions need to provide serializability, including&#xD;
avoiding the "phantom problem" (although that shouldn't be&#xD;
too hard: next-key locking should work).&#xD;
&lt;li&gt;Solutions are also penalized when they fail to meet a&#xD;
response-time SLA, which makes good performance about more&#xD;
than merely maximizing throughput.&#xD;
&lt;/ul&gt;</description>
    </item>
    <item>
      <pubDate>Fri, 2 Jan 2009 17:43:45 GMT</pubDate>
      <title>2 Jan 2009</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=61</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=61</guid>
      <description>&lt;b&gt;CIDR&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; This weekend, I'll be at &lt;a href="http://www-db.cs.wisc.edu/cidr/cidr2009/index.html" &gt;CIDR&#xD;
2009&lt;/a&gt;, the biennial Conference on Innovative Data Systems&#xD;
Research. It's an interesting conference: not as formal or&#xD;
as high-pedigree as the prestigious database conferences&#xD;
(SIGMOD and VLDB), but the papers are usually interesting&#xD;
and provocative. There is one track of peer-reviewed papers&#xD;
and one track of "Perspectives" papers that are selected by the&#xD;
program committee to spark a discussion. I'm one&#xD;
of the authors on a Perspectives track paper, "Continuous&#xD;
Analytics: Rethinking Query Processing in a Network-Effect&#xD;
World" &amp;mdash; which is essentially a fancy title for the&#xD;
thesis that stream processing techniques are more widely&#xD;
applicable to mainstream business analytics than most people&#xD;
seem to think.&#xD;
&#xD;
&lt;p&gt; If you'll be there, say hi. In the near future, I hope to&#xD;
post more about the paper, and the rest of the research I've&#xD;
been doing so far at school.</description>
    </item>
    <item>
      <pubDate>Tue, 14 Oct 2008 22:16:44 GMT</pubDate>
      <title>14 Oct 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=60</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=60</guid>
      <description>&lt;b&gt;Public Talk on Facebook Hive&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; &lt;blockquote&gt;The Berkeley DB group is hosting a talk about&#xD;
Facebook Hive this Thursday, at Soda Hall in UC Berkeley.&#xD;
Details and the abstract are below -- it should be an&#xD;
interesting talk! I'd encourage anyone in the area to attend&#xD;
-- if you need directions / parking suggestions / etc., just&#xD;
drop me a line.&lt;/blockquote&gt;&#xD;
&#xD;
&lt;p&gt; Thursday, October 16th, 2008&lt;br&gt;&#xD;
606 &lt;a href="http://maps.google.com/maps?f=q&amp;hl=en&amp;geocode=&amp;q=soda+hall,+berkeley&amp;ie=UTF8&amp;ll=37.881095,-122.257769&amp;spn=0.013397,0.026457&amp;z=16&amp;iwloc=A" &gt;Soda&#xD;
Hall&lt;/a&gt;, UC Berkeley&lt;br&gt;&#xD;
10-11am&#xD;
&#xD;
&lt;p&gt; &lt;b&gt;Title:&lt;/b&gt; Hive: Data Warehousing using Hadoop&#xD;
&#xD;
&lt;p&gt; &lt;b&gt;Abstract:&lt;/b&gt;&#xD;
Hive is an open-source data warehousing infrastructure built&#xD;
on top of Hadoop that allows SQL like queries along with&#xD;
abilities to add custom transformation scripts in different&#xD;
stages of data processing. It includes language constructs&#xD;
to import data from various sources, support for object &#xD;
oriented data types and a metadata repository that&#xD;
structures hadoop  directories into relational tables and&#xD;
partitions with typed columns. Facebook uses this system for&#xD;
variety of tasks - classic log aggregation, graph mining,&#xD;
text analysis and indexing.&#xD;
&#xD;
&lt;p&gt; In this talk we will give an overview of the Hive system,&#xD;
the data model, query language compilation and execution and&#xD;
the metadata store. We will also discuss our near term&#xD;
roadmap and avenues for significant contributions in terms&#xD;
of query optimization, execution speed and data compression&#xD;
amongst others. We will also present some statistics on&#xD;
usage within Facebook and outline some of the challenges in&#xD;
operating Hive/Hadoop in a utility computing model in fast&#xD;
growing environment.&#xD;
&#xD;
&lt;p&gt; &lt;b&gt;Bio:&lt;/b&gt;&#xD;
Joydeep Sensarma has been working in the Facebook Data Team&#xD;
for the last 1+ year where he's taken turns coding up Hive,&#xD;
keeping Hadoop running, eating and sleeping in that order.&#xD;
He's really glad he no longer works on closed source file&#xD;
and database systems like he did for the last ten years.&#xD;
&#xD;
&lt;p&gt; Zheng Shao has worked in Facebook Data Team on Hadoop and&#xD;
Hive for about 6 months. Before that he worked in the Yahoo&#xD;
web search team which heavily uses Hadoop.&#xD;
&#xD;
&lt;p&gt; Namit Jain has been working in the Facebook Data team with&#xD;
Hive for about 6 months. Before that he was in the database&#xD;
and application server groups at Oracle for about 10 years.</description>
    </item>
    <item>
      <pubDate>Mon, 1 Sep 2008 00:34:10 GMT</pubDate>
      <title>1 Sep 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=59</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=59</guid>
      <description>&lt;b&gt;System R&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; &lt;p&gt;One of the classes I'm taking at Berkeley this fall is&#xD;
CS262a, which is the first part of their graduate-level&#xD;
introductory "systems" class -- looking at great papers and&#xD;
common threads among operating systems, networking,&#xD;
databases, and the like. One of the first papers we're going&#xD;
to discuss is "&lt;a href="http://www.cs.berkeley.edu/%7Ebrewer/cs262/SystemR.pdf" &gt;A&#xD;
History And Evaluation of System R&lt;/a&gt;", which&#xD;
describes the seminal DBMS built by a team of 15 PhDs at IBM&#xD;
Research from 1974 to ~1980. The paper is a great read,&#xD;
especially if you're interested in database internals. (If&#xD;
you're going to read the paper, I suggest Joe Hellerstein's&#xD;
&lt;a href="http://db.cs.berkeley.edu/cs262/SystemR-annotated.pdf" &gt;annotated&#xD;
version&lt;/a&gt;, which contains a number of insightful comments.)&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; A few comments of my own:&#xD;
&#xD;
&lt;p&gt; &lt;p&gt;&lt;ul&gt;&#xD;
&lt;li&gt;The scope of the project goals and the completeness of&#xD;
the implementation are remarkable, considering the time&#xD;
period and the lack of other production-quality RDBMS&#xD;
implementations at the time. System R included a cost-based&#xD;
query&#xD;
optimizer, joins, subqueries, updateable views, log-based crash&#xD;
recovery, granular locking, authentication and&#xD;
authorization, a relational system catalog, prepared&#xD;
queries, and other sophisticated features. In fact, System R&#xD;
even had the ability to automatically invalidate and replan&#xD;
prepared&#xD;
queries when their dependent objects changed, a feature&#xD;
Postgres didn't add until 8.3 (and we still don't have&#xD;
native support for updateable views).&#xD;
&lt;li&gt;People often complain that SQL is a poorly-designed&#xD;
language. In many respects that may be true, but it's not&#xD;
because the design of the language itself was neglected:&#xD;
even in 1975, the System R team gave "considerable thought&#xD;
... to the human factors aspects of the SQL language, and an&#xD;
experimental study was conducted on the learnability and&#xD;
usability of SQL." While the goal of having secretaries and&#xD;
other non-technical staff writing SQL queries was perhaps&#xD;
not achieved, SQL wasn't a hackishly-designed language, even&#xD;
if it sometimes feels that way :)&#xD;
&lt;li&gt;The initial System R prototype supported subqueries, but&#xD;
not joins. That seems an unusual order in which to implement&#xD;
features, although it does make some sense (JMH points out&#xD;
that neglecting joins makes the optimizer search strategy&#xD;
much simpler).&#xD;
&lt;li&gt;One interesting design choice is that System R generated&#xD;
machine code from the query plan, rather than having the&#xD;
executor walk the plan tree at runtime. While this design&#xD;
sounded exotic to me at first glance, it actually makes&#xD;
sense: on the hardware of the time, queries were much more&#xD;
likely to be CPU bound than they are today.&#xD;
&lt;/ul&gt;&#xD;
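
&lt;p&gt; The last point can be sketched in a few lines of Python. This is only an analogy: System R emitted machine code, whereas this sketch generates Python source, and the tuple-based plan format and every function name here are invented for illustration. The interpreter walks the plan tree each time the query runs; the compiled version walks it once, up front, and afterwards runs straight-line generated code.

&lt;pre&gt;
```python
# A plan is a nested tuple; this format is hypothetical.
PLAN = ("project", "name", ("filter", "dept", "db", ("scan",)))


def interpret(plan, rows):
    # Walk the plan tree at runtime, dispatching on each operator.
    op = plan[0]
    if op == "scan":
        return list(rows)
    if op == "filter":
        _, column, value, child = plan
        return [r for r in interpret(child, rows) if r[column] == value]
    if op == "project":
        _, column, child = plan
        return [r[column] for r in interpret(child, rows)]
    raise ValueError("unknown operator: " + op)


def generate_source(plan):
    # Turn the plan into a Python expression over a variable named rows.
    op = plan[0]
    if op == "scan":
        return "rows"
    if op == "filter":
        _, column, value, child = plan
        return "[r for r in {} if r[{!r}] == {!r}]".format(
            generate_source(child), column, value)
    if op == "project":
        _, column, child = plan
        return "[r[{!r}] for r in {}]".format(column, generate_source(child))
    raise ValueError("unknown operator: " + op)


def compile_plan(plan):
    # Generate the code once; the returned function never inspects the plan.
    source = "lambda rows: " + generate_source(plan)
    return eval(compile(source, "generated_plan", "eval"))
```
&lt;/pre&gt;

&lt;p&gt; Both interpret(PLAN, rows) and compile_plan(PLAN)(rows) return the same answer for a list of dict rows; the difference is only in when the operator dispatch happens.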
&#xD;
&lt;p&gt; &lt;p&gt;The notes from the &lt;a href="http://www.mcjones.org/System_R/SQL_Reunion_95/sqlr95.html" &gt;1995&#xD;
System R reunion&lt;/a&gt; are also an interesting read, if you'd&#xD;
like to&#xD;
learn more about the politics and history of the project.</description>
    </item>
    <item>
      <pubDate>Tue, 27 May 2008 22:43:31 GMT</pubDate>
      <title>27 May 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=58</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=58</guid>
      <description>&lt;b&gt;Grad School&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; I've decided to go back to school &amp;mdash; I'm excited to&#xD;
report that I'll be starting&#xD;
the PhD program in computer science at UC Berkeley in the&#xD;
fall, working with &lt;a href="http://www.cs.berkeley.edu/~franklin/" &gt;Mike&#xD;
Franklin&lt;/a&gt; and &lt;a href="http://db.cs.berkeley.edu/jmh/" &gt;Joe Hellerstein&lt;/a&gt; in&#xD;
the &lt;a href="http://db.cs.berkeley.edu/" &gt;Berkeley Database&#xD;
Group&lt;/a&gt;. I'm not sure yet if this means I'll have more or less&#xD;
time to work on community Postgres stuff.&#xD;
&#xD;
&lt;p&gt; &lt;b&gt;AsterDB and Postgres?&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; &lt;a href="http://www.asterdata.com/" &gt;Aster Data Systems&lt;/a&gt;&#xD;
are a database startup that have received a bunch of &lt;a href="http://anand.typepad.com/datawocky/2008/05/why-the-world-needs-a-new-database-system.html" &gt;press&lt;/a&gt;&#xD;
recently. I've now heard from two different people that&#xD;
Aster are built upon Postgres, but their website is still&#xD;
pretty content-free, so it's hard to be sure. I wouldn't be&#xD;
surprised, though: it's hard to make the case for building a&#xD;
database system from scratch in 2008, especially in a&#xD;
startup environment.</description>
    </item>
    <item>
      <pubDate>Sat, 10 May 2008 06:02:53 GMT</pubDate>
      <title>10 May 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=57</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=57</guid>
      <description>&lt;b&gt;The End of Moore's Law&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; I was reading "&lt;a href="http://ptolemy.eecs.berkeley.edu/publications/papers/06/problemwithThreads/" &gt;The&#xD;
Problem with Threads&lt;/a&gt;" by Prof. &lt;a href="http://ptolemy.eecs.berkeley.edu/~eal/" &gt;Ed Lee&lt;/a&gt;,&#xD;
and noticed the following claim right on the&#xD;
first page:&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; &lt;blockquote&gt;&#xD;
Many technologists predict that the end of Moore&amp;rsquo;s&#xD;
Law will be answered with increasingly parallel computer&#xD;
architectures (multicore or chip [multiprocessors], CMPs)&#xD;
[&lt;a href="http://www.acmqueue.com/modules.php?name=Content&amp;pa=printer_friendly&amp;pid=333&amp;page=1" &gt;15&lt;/a&gt;].&#xD;
&lt;/blockquote&gt;&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; This quote&#xD;
confuses me, because, to the best of my knowledge, &lt;i&gt;Moore's&#xD;
Law has not ended&lt;/i&gt;, and the industry's move to&#xD;
multicore/manycore processors is not directly related to the&#xD;
imminent demise of &lt;a href="http://en.wikipedia.org/wiki/Moore%27s_law" &gt;Moore's&#xD;
Law&lt;/a&gt;. Moore's Law is the claim that transistor density in&#xD;
integrated circuits approximately doubles every two years.&#xD;
As far as I know, that remains basically &lt;a href="http://en.wikipedia.org/wiki/Transistor_count" &gt;true&#xD;
for the time being&lt;/a&gt;, and current speculation is that it&#xD;
will continue to hold for &lt;a href="http://www.news.com/2100-1001-984051.html" &gt;at least 10&#xD;
years&lt;/a&gt;.&#xD;
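
&lt;p&gt; Taking the two-year doubling period at face value, the compounding is simple to work out. The helper below is a hypothetical one-liner, and it assumes perfectly regular doubling, which is of course an idealization of the real trend.

&lt;pre&gt;
```python
def density_growth(years, doubling_period=2.0):
    # Growth factor after the given number of years, assuming transistor
    # density doubles exactly once per doubling period.
    return 2 ** (years / doubling_period)
```
&lt;/pre&gt;

&lt;p&gt; Ten more years at that pace is five doublings, so density_growth(10) is a factor of 32.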
&#xD;
&lt;p&gt; &lt;p&gt; What &lt;i&gt;is&lt;/i&gt; driving the move to multicore designs is that&#xD;
we can no longer effectively use those extra transistors to&#xD;
increase the speed of a single sequential instruction&#xD;
stream. Ramping up clock speed increases heat&#xD;
dissipation, and doesn't improve performance very much if&#xD;
memory latency doesn't significantly change. Techniques like&#xD;
caching, pipelining, and superscalar execution help, but&#xD;
only to an extent. Hence the move to multicore designs and&#xD;
chip-level parallelism.&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; That said, I'm definitely not a hardware guy,&#xD;
and doubtless Prof. Lee has forgotten more about processor&#xD;
design than I am ever likely to know. And when Moore's Law&#xD;
ends, that may well&#xD;
encourage the multicore trend even more&amp;mdash;but my&#xD;
understanding is that the&#xD;
eventual demise of Moore's Law and the current move to multicore&#xD;
architectures are not directly related. I'm curious to know&#xD;
if I'm mistaken. &#xD;
&#xD;
&lt;p&gt; &lt;p&gt; (As an aside, the text quoted above cites "&lt;a href="http://www.acmqueue.com/modules.php?name=Content&amp;pa=printer_friendly&amp;pid=333&amp;page=1" &gt;Multicore&#xD;
CPUs for the Masses&lt;/a&gt;" in &lt;i&gt;ACM Queue&lt;/i&gt; as support for&#xD;
the claim that the industry is moving toward multicore&#xD;
designs. While that is true, the article makes no mention of&#xD;
Moore's Law.)</description>
    </item>
    <item>
      <pubDate>Fri, 2 May 2008 18:00:40 GMT</pubDate>
      <title>2 May 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=56</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=56</guid>
      <description>&lt;b&gt;SciDBMS&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; I noticed that the &lt;a href="http://xldb.slac.stanford.edu/download/attachments/4784226/sciDB2008_report.pdf" &gt;final&#xD;
report&lt;/a&gt; from the &lt;a href="http://confluence.slac.stanford.edu/display/XLDB/Science+-+DB+Research+Meeting" &gt;Science&#xD;
Database Research Meeting&lt;/a&gt; was released a little while&#xD;
ago. Worth reading if you're interested in how database&#xD;
technology can be applied to managing scientific data&#xD;
&amp;mdash; they have some interesting ideas about both what&#xD;
problems need to be solved and how to develop those&#xD;
solutions into a product that scientists can use (via both&#xD;
an open source project and a startup company).</description>
    </item>
    <item>
      <pubDate>Tue, 15 Apr 2008 08:35:36 GMT</pubDate>
      <title>15 Apr 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=55</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=55</guid>
      <description>&lt;b&gt;Kickfire and "Stream Processing"&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; I noticed &lt;a href="http://people.planetpostgresql.org/xzilla/index.php?/archives/339-guid.html" &gt;Robert's&#xD;
post&lt;/a&gt; about the Kickfire launch. He mentioned &lt;a href="http://www.truviso.com" &gt;Truviso&lt;/a&gt; &amp;mdash; for whom I&#xD;
work &amp;mdash; so I thought I'd&#xD;
add my two cents.&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; Kickfire is the company previously known as "&lt;a href="http://www.c2app.com" &gt;C2App&lt;/a&gt;". I'm not familiar&#xD;
with the details of their technology, but the basic idea is&#xD;
to use custom hardware to accelerate data warehousing&#xD;
queries (this &lt;a href="http://www.xaprb.com/blog/2008/04/14/kickfire-relational-algebra-in-a-chip/" &gt;blog&#xD;
post&lt;/a&gt; has some more details). Using custom hardware is&#xD;
not a new idea &amp;mdash;&#xD;
Netezza have been&#xD;
doing something superficially similar for years, with&#xD;
considerable success. In addition to custom hardware,&#xD;
Kickfire apparently use a few other data warehousing&#xD;
techniques that have recently come back in vogue&#xD;
(e.g. column-wise storage with compression, coupled with the&#xD;
ability to do query execution over compressed data). As an&#xD;
aside, I think&#xD;
that building a data warehousing product using MySQL is a&#xD;
fairly surprising technical decision.&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; One thing I did notice is that Kickfire's PR mentions&#xD;
"stream processing" repeatedly, and Robert's post suggests&#xD;
that the sort of stream processing done by Kickfire is&#xD;
similar to what Truviso does. This&#xD;
is not the case: the two companies and their products are&#xD;
&lt;i&gt;very&lt;/i&gt; different. I'd guess that Kickfire are using the&#xD;
term because it's become something of a buzzword.&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; I'd like to talk more about Truviso on this blog in the&#xD;
future, but the basic idea behind data stream processing is&#xD;
to allow &#xD;
analysis queries to be performed over &lt;i&gt;live&lt;/i&gt; streams of&#xD;
data, as the data arrives at the system. In traditional&#xD;
databases, in order to apply a query to a piece of&#xD;
data, you first&#xD;
need to insert the data item into the database, wait for it&#xD;
to be committed to disk (force-write the write-ahead log),&#xD;
and then finally&#xD;
run a query on it from scratch. When data arrives at a rapid&#xD;
pace and you need low-latency query results, this&#xD;
"store-and-query" model has terrible performance; it's also&#xD;
an unnatural way to structure a client application (you're&#xD;
essentially polling for results). Instead,&#xD;
a data stream&#xD;
query processor allows the user to define a set of&#xD;
long-running &lt;i&gt;continuous queries&lt;/i&gt; that represent the&#xD;
conditions of interest over the incoming data streams. As&#xD;
new live data arrives, the data is applied to the queries to&#xD;
incrementally update their results; client applications can&#xD;
simply consume new query results as soon as they become&#xD;
available. This allows you to get&#xD;
query results that are always up-to-date, without the need&#xD;
to first&#xD;
write data to disk (the data can either be discarded, or&#xD;
else written to disk asynchronously). For certain domains,&#xD;
such as algorithmic&#xD;
trading, network and environment monitoring, fraud&#xD;
detection, and real-time reporting, the data stream approach&#xD;
often yields much better performance and a more natural&#xD;
programming model. For more info, see the &lt;a href="http://neilconway.org/talks/stream_intro.pdf" &gt;talk on&#xD;
data stream query processing&lt;/a&gt; I gave at last year's PgCon.&#xD;
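
&lt;p&gt; As a rough illustration of the difference between the two models, here is a small Python sketch. It is not Truviso's API or query language: the ContinuousMaxPrice class, the callback, and the sample trades are all hypothetical. The continuous query keeps a little state, updates it as each tuple arrives, and pushes changed results to the client; the store-and-query version rescans everything stored so far each time it is asked.

&lt;pre&gt;
```python
class ContinuousMaxPrice:
    """Continuous query: highest price seen so far, per symbol."""

    def __init__(self, on_result):
        self.best = {}
        self.on_result = on_result  # client callback for changed results

    def on_arrival(self, symbol, price):
        # Incrementally fold the new tuple into the existing result.
        previous = self.best.get(symbol)
        updated = price if previous is None else max(previous, price)
        if updated != previous:
            self.best[symbol] = updated
            self.on_result(symbol, updated)  # pushed; the client never polls


def max_price_by_rescan(stored):
    # Store-and-query: recompute the whole answer from the stored data.
    best = {}
    for symbol, price in stored:
        best[symbol] = max(best.get(symbol, price), price)
    return best


def demo():
    stream = [("IBM", 101.0), ("ORCL", 20.5), ("IBM", 99.0), ("IBM", 103.5)]
    pushed = []
    query = ContinuousMaxPrice(lambda symbol, price: pushed.append((symbol, price)))
    for symbol, price in stream:
        query.on_arrival(symbol, price)
    return pushed, query.best, max_price_by_rescan(stream)
```
&lt;/pre&gt;

&lt;p&gt; In demo(), the client is notified three times (the third IBM trade does not change the maximum, so nothing is pushed for it), and the final state matches what a full rescan of the stored trades would compute.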
&#xD;
&lt;p&gt; &lt;p&gt; So what does this have to do with using custom hardware to&#xD;
accelerate data warehousing queries? Not a whole lot. I'm&#xD;
guessing that Kickfire have co-opted the "stream processing"&#xD;
label because they push analysis queries down to the custom&#xD;
chip, and then "stream" the stored data over the chip, to&#xD;
compute multiple queries in a single pass. If you squint at&#xD;
it right, there are some similarities to stream query&#xD;
processing (in both cases, you only want to take one pass&#xD;
over the data), but fundamentally, Kickfire is trying to&#xD;
solve a very different problem, and using a very different&#xD;
set of technologies. Data warehouse engines like Kickfire&#xD;
(and Greenplum) are&#xD;
complements to data stream systems like Truviso (and&#xD;
&lt;a href="http://www.streambase.com/" &gt;Streambase&lt;/a&gt;, &lt;a href="http://www.coral8.com/" &gt;Coral8&lt;/a&gt;, and others), not&#xD;
substitutes or competitors.</description>
    </item>
  </channel>
</rss>
