<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Advogato blog for nconway</title>
    <link>http://www.advogato.org/person/nconway/</link>
    <description>Advogato blog for nconway</description>
    <language>en-us</language>
    <generator>mod_virgule</generator>
    <pubDate>Thu, 28 Aug 2008 16:37:35 GMT</pubDate>
    <item>
      <pubDate>Tue, 27 May 2008 22:43:31 GMT</pubDate>
      <title>27 May 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=58</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=58</guid>
      <description>&lt;b&gt;Grad School&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; I've decided to go back to school &amp;mdash; I'm excited to&#xD;
report that I'll be starting&#xD;
the PhD program in computer science at UC Berkeley in the&#xD;
fall, working with &lt;a href="http://www.cs.berkeley.edu/~franklin/" &gt;Mike&#xD;
Franklin&lt;/a&gt; and &lt;a href="http://db.cs.berkeley.edu/jmh/" &gt;Joe Hellerstein&lt;/a&gt; in&#xD;
the &lt;a href="http://db.cs.berkeley.edu/" &gt;Berkeley Database&#xD;
Group&lt;/a&gt;. I'm not sure yet if this means I'll have more or less&#xD;
time to work on community Postgres stuff.&#xD;
&#xD;
&lt;p&gt; &lt;b&gt;AsterDB and Postgres?&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; &lt;a href="http://www.asterdata.com/" &gt;Aster Data Systems&lt;/a&gt;&#xD;
are a database startup that have received a bunch of &lt;a href="http://anand.typepad.com/datawocky/2008/05/why-the-world-needs-a-new-database-system.html" &gt;press&lt;/a&gt;&#xD;
recently. I've now heard from two different people that&#xD;
Aster are built upon Postgres, but their website is still&#xD;
pretty content-free, so it's hard to be sure. I wouldn't be&#xD;
surprised, though: it's hard to make the case for building a&#xD;
database system from scratch in 2008, especially in a&#xD;
startup environment.</description>
    </item>
    <item>
      <pubDate>Sat, 10 May 2008 06:02:53 GMT</pubDate>
      <title>10 May 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=57</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=57</guid>
      <description>&lt;b&gt;The End of Moore's Law&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; I was reading "&lt;a href="http://ptolemy.eecs.berkeley.edu/publications/papers/06/problemwithThreads/" &gt;The&#xD;
Problem with Threads&lt;/a&gt;" by Prof. &lt;a href="http://ptolemy.eecs.berkeley.edu/~eal/" &gt;Ed Lee&lt;/a&gt;,&#xD;
and noticed the following claim right on the&#xD;
first page:&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; &lt;blockquote&gt;&#xD;
Many technologists predict that the end of Moore&amp;rsquo;s&#xD;
Law will be answered with increasingly parallel computer&#xD;
architectures (multicore or chip [multiprocessors], CMPs)&#xD;
[&lt;a href="http://www.acmqueue.com/modules.php?name=Content&amp;pa=printer_friendly&amp;pid=333&amp;page=1" &gt;15&lt;/a&gt;].&#xD;
&lt;/blockquote&gt;&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; This quote&#xD;
confuses me, because, to the best of my knowledge, &lt;i&gt;Moore's&#xD;
Law has not ended&lt;/i&gt;, and the industry's move to&#xD;
multicore/manycore processors is not directly related to the&#xD;
imminent demise of &lt;a href="http://en.wikipedia.org/wiki/Moore%27s_law" &gt;Moore's&#xD;
Law&lt;/a&gt;. Moore's Law is the claim that transistor density in&#xD;
integrated circuits approximately doubles every two years.&#xD;
As far as I know, that remains basically &lt;a href="http://en.wikipedia.org/wiki/Transistor_count" &gt;true&#xD;
for the time being&lt;/a&gt;, and current speculation is that it&#xD;
will continue to hold for &lt;a href="http://www.news.com/2100-1001-984051.html" &gt;at least 10&#xD;
years&lt;/a&gt;.&#xD;
&#xD;
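&lt;p&gt; As a back-of-the-envelope check, here is the arithmetic behind&#xD;
that claim. This sketch assumes an idealized, exact two-year doubling&#xD;
period (the real trend is only approximate), and the function name is&#xD;
made up for illustration. Under that assumption, ten more years of&#xD;
Moore's Law would mean five doublings, or roughly a 32-fold increase in&#xD;
transistor density.&#xD;
&#xD;
&lt;pre&gt;
```python
def density_multiplier(years, doubling_period=2.0):
    """Growth factor in transistor density, assuming idealized exact doubling."""
    return 2 ** (years / doubling_period)


# Ten years at one doubling every two years is five doublings.
ten_year_growth = density_multiplier(10)
```
&lt;/pre&gt;
&#xD;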
&lt;p&gt; &lt;p&gt; What &lt;i&gt;is&lt;/i&gt; driving the move to multicore designs is that&#xD;
we can no longer effectively use those extra transistors to&#xD;
increase the speed of a single sequential instruction&#xD;
stream. Ramping up clock speed increases heat&#xD;
dissipation, and doesn't improve performance very much if&#xD;
memory latency doesn't significantly change. Techniques like&#xD;
caching, pipelining, and superscalar execution help, but&#xD;
only to an extent. Hence the move to multicore designs and&#xD;
chip-level parallelism.&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; That said, I'm definitely not a hardware guy,&#xD;
and doubtless Prof. Lee has forgotten more about processor&#xD;
design than I am ever likely to know. And when Moore's Law&#xD;
ends, that may well&#xD;
encourage the multicore trend even more&amp;mdash;but my&#xD;
understanding is that the&#xD;
eventual demise of Moore's Law and the current move to multicore&#xD;
architectures are not directly related. I'm curious to know&#xD;
if I'm mistaken. &#xD;
&#xD;
&lt;p&gt; &lt;p&gt; (As an aside, the text quoted above cites "&lt;a href="http://www.acmqueue.com/modules.php?name=Content&amp;pa=printer_friendly&amp;pid=333&amp;page=1" &gt;Multicore&#xD;
CPUs for the Masses&lt;/a&gt;" in &lt;i&gt;ACM Queue&lt;/i&gt; as support for&#xD;
the claim that the industry is moving toward multicore&#xD;
designs. While that is true, the article makes no mention of&#xD;
Moore's Law.)</description>
    </item>
    <item>
      <pubDate>Fri, 2 May 2008 18:00:40 GMT</pubDate>
      <title>2 May 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=56</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=56</guid>
      <description>&lt;b&gt;SciDBMS&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; I noticed that the &lt;a href="http://xldb.slac.stanford.edu/download/attachments/4784226/sciDB2008_report.pdf" &gt;final&#xD;
report&lt;/a&gt; from the &lt;a href="http://confluence.slac.stanford.edu/display/XLDB/Science+-+DB+Research+Meeting" &gt;Science&#xD;
Database Research Meeting&lt;/a&gt; was released a little while&#xD;
ago. It's worth reading if you're interested in how database&#xD;
technology can be applied to managing scientific data:&#xD;
they have some interesting ideas about both what&#xD;
problems need to be solved and how to develop those&#xD;
solutions into a product that scientists can use (via both&#xD;
an open source project and a startup company).</description>
    </item>
    <item>
      <pubDate>Tue, 15 Apr 2008 08:35:36 GMT</pubDate>
      <title>15 Apr 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=55</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=55</guid>
      <description>&lt;b&gt;Kickfire and "Stream Processing"&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; I noticed &lt;a href="http://people.planetpostgresql.org/xzilla/index.php?/archives/339-guid.html" &gt;Robert's&#xD;
post&lt;/a&gt; about the Kickfire launch. He mentioned &lt;a href="http://www.truviso.com" &gt;Truviso&lt;/a&gt; &amp;mdash; for whom I&#xD;
work &amp;mdash; so I thought I'd&#xD;
add my two cents.&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; Kickfire is the company previously known as "&lt;a href="http://www.c2app.com" &gt;C2App&lt;/a&gt;". I'm not familiar&#xD;
with the details of their technology, but the basic idea is&#xD;
to use custom hardware to accelerate data warehousing&#xD;
queries (this &lt;a href="http://www.xaprb.com/blog/2008/04/14/kickfire-relational-algebra-in-a-chip/" &gt;blog&#xD;
post&lt;/a&gt; has some more details). Using custom hardware is&#xD;
not a new idea &amp;mdash;&#xD;
Netezza have been&#xD;
doing something superficially similar for years, with&#xD;
considerable success. In addition to custom hardware,&#xD;
Kickfire apparently use a few other data warehousing&#xD;
techniques that have recently come back in vogue&#xD;
(e.g. column-wise storage with compression, coupled with the&#xD;
ability to do query execution over compressed data). As an&#xD;
aside, I think&#xD;
that building a data warehousing product using MySQL is a&#xD;
fairly surprising technical decision.&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; One thing I did notice is that Kickfire's PR mentions&#xD;
"stream processing" repeatedly, and Robert's post suggests&#xD;
that the sort of stream processing done by Kickfire is&#xD;
similar to what Truviso does. This&#xD;
is not the case: the two companies and their products are&#xD;
&lt;i&gt;very&lt;/i&gt; different. I'd guess that Kickfire are using the&#xD;
term because it's become something of a buzzword.&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; I'd like to talk more about Truviso on this blog in the&#xD;
future, but the basic idea behind data stream processing is&#xD;
to allow &#xD;
analysis queries to be performed over &lt;i&gt;live&lt;/i&gt; streams of&#xD;
data, as the data arrives at the system. In traditional&#xD;
databases, in order to apply a query to a piece of&#xD;
data, you first&#xD;
need to insert the data item into the database, wait for it&#xD;
to be committed to disk (force-write the write-ahead log),&#xD;
and then finally&#xD;
run a query on it from scratch. When data arrives at a rapid&#xD;
pace and you need low-latency query results, this&#xD;
"store-and-query" model has terrible performance; it's also&#xD;
an unnatural way to structure a client application (you're&#xD;
essentially polling for results). Instead,&#xD;
a data stream&#xD;
query processor allows the user to define a set of&#xD;
long-running &lt;i&gt;continuous queries&lt;/i&gt; that represent the&#xD;
conditions of interest over the incoming data streams. As&#xD;
new live data arrives, the data is applied to the queries to&#xD;
incrementally update their results; client applications can&#xD;
simply consume new query results as soon as they become&#xD;
available. This allows you to get&#xD;
query results that are always up-to-date, without the need&#xD;
to first&#xD;
write data to disk (the data can either be discarded, or&#xD;
else written to disk asynchronously). For certain domains,&#xD;
such as algorithmic&#xD;
trading, network and environment monitoring, fraud&#xD;
detection, and real-time reporting, the data stream approach&#xD;
often yields much better performance and a more natural&#xD;
programming model. For more info, see the &lt;a href="http://neilconway.org/talks/stream_intro.pdf" &gt;talk on&#xD;
data stream query processing&lt;/a&gt; I gave at last year's PgCon.&#xD;
&#xD;
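&lt;p&gt; To make the contrast concrete, here is a minimal Python sketch&#xD;
of the two models. It is an illustration only, not how Truviso or any&#xD;
other product works: the &lt;tt&gt;ContinuousAverage&lt;/tt&gt; class, the&#xD;
&lt;tt&gt;store_and_query&lt;/tt&gt; function, and the sample ticker data are all&#xD;
hypothetical. The continuous query keeps a small amount of state and&#xD;
updates its result as each item arrives, while the store-and-query&#xD;
version rescans everything stored so far.&#xD;
&#xD;
&lt;pre&gt;
```python
from collections import defaultdict


class ContinuousAverage:
    """Hypothetical continuous query: running average price per symbol."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def on_arrival(self, symbol, price):
        # Incrementally update the query state; nothing stored is rescanned.
        self.count[symbol] += 1
        self.total[symbol] += price
        return symbol, self.total[symbol] / self.count[symbol]


def store_and_query(table, symbol):
    # Traditional model: recompute the answer from scratch over stored rows.
    prices = [p for s, p in table if s == symbol]
    return sum(prices) / len(prices)


# Made-up sample stream of (symbol, price) items.
stream = [("ACME", 10.0), ("ACME", 14.0), ("INIT", 5.0), ("ACME", 12.0)]

query = ContinuousAverage()
table = []
results = []
for symbol, price in stream:
    # The client consumes a fresh result as soon as each item arrives.
    results.append(query.on_arrival(symbol, price))
    # Storing the item is optional here and could happen asynchronously.
    table.append((symbol, price))
```
&lt;/pre&gt;
&#xD;
&lt;p&gt; Both models produce the same average for a given symbol; the&#xD;
difference is in how much work is done per arriving item, and in whether&#xD;
the client has to poll for the answer.&#xD;
&#xD;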
&lt;p&gt; &lt;p&gt; So what does this have to do with using custom hardware to&#xD;
accelerate data warehousing queries? Not a whole lot. I'm&#xD;
guessing that Kickfire have co-opted the "stream processing"&#xD;
label because they push analysis queries down to the custom&#xD;
chip, and then "stream" the stored data over the chip, to&#xD;
compute multiple queries in a single pass. If you squint at&#xD;
it right, there are some similarities to stream query&#xD;
processing (in both cases, you only want to take one pass&#xD;
over the data), but fundamentally, Kickfire is trying to&#xD;
solve a very different problem, and using a very different&#xD;
set of technologies. Data warehouse engines like Kickfire&#xD;
(and Greenplum) are&#xD;
complements to data stream systems like Truviso (and&#xD;
&lt;a href="http://www.streambase.com/" &gt;Streambase&lt;/a&gt;, &lt;a href="http://www.coral8.com/" &gt;Coral8&lt;/a&gt;, and others), not&#xD;
substitutes or competitors.</description>
    </item>
    <item>
      <pubDate>Tue, 8 Apr 2008 00:45:57 GMT</pubDate>
      <title>8 Apr 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=54</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=54</guid>
      <description>&lt;b&gt;DBMS Internals for Undergrad Students&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; I noticed an interesting short paper on "&lt;a href="http://www.acm.org/sigmod/record/issues/0309/4.JHdbcourseS03.pdf" &gt;Exposing&#xD;
Undergraduate Students to Database System Internals&lt;/a&gt;".&#xD;
Written by Joe Hellerstein at UC Berkeley and Anastasia&#xD;
Ailamaki at CMU, it describes their experience using&#xD;
PostgreSQL to teach courses that introduce undergraduate&#xD;
students to DBMS internals. This provides some context for&#xD;
the student projects on hash-based aggregation and other&#xD;
topics that have been occasionally mentioned on -hackers in&#xD;
the past (e.g. &lt;a href="http://markmail.org/message/umawlu45yhdvftrr" &gt;here&lt;/a&gt;).</description>
    </item>
    <item>
      <pubDate>Mon, 10 Mar 2008 21:09:39 GMT</pubDate>
      <title>10 Mar 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=53</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=53</guid>
      <description>&lt;b&gt;Monitoring query progress&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; Monitoring the progress of a long-running analysis query is&#xD;
a cool problem -- it's been discussed on &lt;tt&gt;-hackers&lt;/tt&gt; a&#xD;
few times in the past (e.g. &lt;a href="http://markmail.org/message/rpcdtr4qhbixz66w" &gt;by Greg&#xD;
Stark&lt;/a&gt;). In that thread, I pointed to some Wisconsin&#xD;
research on this topic (&lt;a href="http://www.cs.wisc.edu/~gangluo/interface.pdf" &gt;2004&lt;/a&gt;,&#xD;
 &lt;a href="http://www.cs.wisc.edu/~gangluo/workload_final.pdf" &gt;2006&lt;/a&gt;).&#xD;
That work was prototyped with Postgres. I just noticed that&#xD;
there's apparently a new project at the DB group at U of T&#xD;
that is tackling similar problems: &lt;a href="http://queens.db.toronto.edu/project/conex/" &gt;ConEx&lt;/a&gt;.&#xD;
Apparently they are also using Postgres to build their&#xD;
prototype, which is always cool to see.</description>
    </item>
    <item>
      <pubDate>Sun, 2 Mar 2008 23:18:01 GMT</pubDate>
      <title>2 Mar 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=52</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=52</guid>
      <description>&lt;b&gt;Semantic Web SIG Meeting&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; There's an interesting talk in Palo Alto on Wednesday: "&lt;a href="http://upcoming.yahoo.com/event/429508/" &gt;Are Scalable&#xD;
Graph Data Applications Possible?&lt;/a&gt;". Speakers will&#xD;
include Sam Madden from MIT, Andy Palmer (one of the&#xD;
founders of Vertica), and someone from Franz Inc -- who are&#xD;
apparently selling an RDF database implementation, in&#xD;
addition to their long-standing Lisp-related products.</description>
    </item>
    <item>
      <pubDate>Thu, 21 Feb 2008 08:34:29 GMT</pubDate>
      <title>21 Feb 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=51</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=51</guid>
      <description>&lt;b&gt;Data Management for RDF&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; I was talking to a database researcher recently about why&#xD;
the artificial intelligence community and the database&#xD;
community haven't historically seen eye-to-eye. The&#xD;
researcher's opinion was that AI folks tend to regard&#xD;
databases as hopelessly limited in their expressive power,&#xD;
whereas DB folks tend to view AI data models as hopelessly&#xD;
difficult to implement efficiently. There is probably some&#xD;
truth to both views.&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; I was reminded of this when doing some reading about data&#xD;
management techniques for &lt;a href="http://en.wikipedia.org/wiki/Resource_Description_Framework" &gt;RDF&lt;/a&gt;&#xD;
(the proposed data model for the Semantic Web). Abadi et&#xD;
al.'s "&lt;a href="http://db.csail.mit.edu/pubs/abadirdf.pdf" &gt;Scalable&#xD;
Semantic Web Data Management Using Vertical&#xD;
Partitioning&lt;/a&gt;" is a nice paper from VLDB 2007, and&#xD;
appears to be one of&#xD;
a relatively small group of papers that approach the&#xD;
Semantic Web from a database systems perspective. The paper&#xD;
proposes a new model for storing RDF data, which essentially&#xD;
applies the column-store ideas from the &lt;a href="http://db.lcs.mit.edu/projects/cstore/" &gt;C-Store&lt;/a&gt;&#xD;
and Vertica&#xD;
projects. Sam Madden and Daniel Abadi talk about their ideas&#xD;
more in a &lt;a href="http://www.databasecolumn.com/2008/01/databases-and-rdf.html" &gt;blog&#xD;
entry&lt;/a&gt; at The Database Column.&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; Planet PostgreSQL readers might be interested in this&#xD;
observation in the paper:&#xD;
&lt;blockquote&gt;&#xD;
We chose Postgres as the row-store to experiment with&#xD;
because &lt;a href="http://pages.cs.wisc.edu/~naughton/includes/papers/sparsedatasets.pdf" &gt;Beckmann&#xD;
et al.&lt;/a&gt; experimentally showed that it was by&#xD;
far more efficient dealing with sparse data than commercial&#xD;
database products. Postgres does not waste space storing&#xD;
NULL data: every tuple is preceded by a bit-string of&#xD;
cardinality equal to the number of attributes, with '1's at&#xD;
positions of the non-NULL values in the tuple. NULL data is&#xD;
thus not stored; this is unlike commercial products that&#xD;
waste space on NULL data. Beckmann et al. show that Postgres&#xD;
queries over sparse data operate about eight times faster&#xD;
than commercial systems&#xD;
&lt;/blockquote&gt;&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; (A minor nitpick: Postgres will omit the per-tuple NULL&#xD;
bitmap when none of the attributes of a tuple are NULL, so&#xD;
it is not quite true that "&lt;i&gt;every&lt;/i&gt; tuple is preceded by&#xD;
a bit-string".)&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; The cited Beckmann et al. paper is "&lt;a href="http://pages.cs.wisc.edu/~naughton/includes/papers/sparsedatasets.pdf" &gt;Extending&#xD;
RDBMSs To Support Sparse Datasets Using An Interpreted&#xD;
Attribute Storage Format&lt;/a&gt;".&#xD;
&#xD;
&lt;p&gt; &lt;p&gt; It's interesting that none of the leading commercial systems&#xD;
seem to use exactly the same NULL bitmap approach that&#xD;
Postgres does. The tradeoff appears to be between storage and&#xD;
computation time: eliding the NULL values from the on-disk&#xD;
tuple reduces storage requirements, but makes it more&#xD;
expensive to find&#xD;
the offset within a tuple at which an attribute begins, if&#xD;
the attribute is preceded by one or more (elided) NULL&#xD;
values. If NULL values&#xD;
were stored in the on-disk tuple (and no variable-width&#xD;
attributes were used), the offset of an attribute could be&#xD;
found more efficiently.&#xD;
&#xD;
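&lt;p&gt; The tradeoff can be sketched in a few lines of Python. This is a&#xD;
simplified model, not the actual Postgres code: it assumes only&#xD;
fixed-width attributes, ignores alignment padding, and the&#xD;
&lt;tt&gt;WIDTHS&lt;/tt&gt; table and function names are invented for the&#xD;
example. As in the quoted description, a '1' bit marks a non-NULL&#xD;
attribute.&#xD;
&#xD;
&lt;pre&gt;
```python
# Hypothetical fixed-width attributes, in bytes (alignment padding ignored).
WIDTHS = [4, 8, 4, 2]


def offset_with_elided_nulls(not_null_bits, attnum):
    """Walk the bitmap: only non-NULL attributes before attnum take up space."""
    offset = 0
    for i in range(attnum):
        if not_null_bits[i]:
            offset += WIDTHS[i]
    return offset


# If NULLs occupied space on disk, offsets would be the same for every
# tuple and could be computed once per table.
FIXED_OFFSETS = [sum(WIDTHS[:i]) for i in range(len(WIDTHS))]


def offset_with_stored_nulls(attnum):
    """Constant-time lookup, at the cost of storing space for NULLs."""
    return FIXED_OFFSETS[attnum]


# Example tuple: the second attribute is NULL, so it is elided on disk.
bits = [1, 0, 1, 1]
elided_offset = offset_with_elided_nulls(bits, 3)
stored_offset = offset_with_stored_nulls(3)
```
&lt;/pre&gt;
&#xD;
&lt;p&gt; In this model the elided layout saves the 8 bytes of the NULL&#xD;
attribute, but finding the last attribute requires a walk over the&#xD;
bitmap rather than a single table lookup.&#xD;
&#xD;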
&lt;p&gt; &lt;p&gt; In practice, Postgres implements another optimization that&#xD;
mitigates this problem to some extent: as tuples are passed&#xD;
around the executor and attributes are "extracted" from the&#xD;
on-disk tuple representation, they are effectively cached&#xD;
using the &lt;tt&gt;TupleTableSlot&lt;/tt&gt; mechanism. This means that&#xD;
the computation to find the right offset for an attribute in&#xD;
the presence of NULLs is typically done at most once&#xD;
per attribute of a tuple.</description>
    </item>
    <item>
      <pubDate>Tue, 19 Feb 2008 20:02:54 GMT</pubDate>
      <title>19 Feb 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=50</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=50</guid>
      <description>&lt;b&gt;Nice DBMS Internals Overview Paper&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; I noticed that Joe Hellerstein, Mike Stonebraker, and James&#xD;
Hamilton (DBMS luminaries all) have published a nice,&#xD;
reasonably high-level paper describing the architecture and&#xD;
design principles of a typical database management system: "&lt;a href="http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf" &gt;Architecture&#xD;
of a Database System&lt;/a&gt;".</description>
    </item>
    <item>
      <pubDate>Fri, 25 Jan 2008 01:30:03 GMT</pubDate>
      <title>25 Jan 2008</title>
      <link>http://www.advogato.org/person/nconway/diary.html?start=49</link>
      <guid>http://www.advogato.org/person/nconway/diary.html?start=49</guid>
      <description>&lt;b&gt;PostgreSQL Mailing List Archives&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; &lt;a href="http://markmail.org" &gt;MarkMail&lt;/a&gt; is now indexing&#xD;
all 630,000+ messages from the &lt;a href="http://postgresql.markmail.org/" &gt;PostgreSQL mailing&#xD;
list archives&lt;/a&gt;. If, like me, you've been frustrated when&#xD;
trying to use the search engine and archives at &lt;a href="http://archives.postgresql.org" &gt;archives.postgresql.org&lt;/a&gt;,&#xD;
I suggest checking out MarkMail. It's been working very well&#xD;
for me so far.</description>
    </item>
  </channel>
</rss>
