Older blog entries for nconway (starting at number 35)

Algorithmic Information Theory

"Randomness and Proof" is a nice, introductory-level article on algorithmic information theory. It's by G. J. Chaitin, who has personally done a lot of the foundational work in this area. A random, interesting snippet:

Solomonoff represented a scientist's observations as a series of binary digits. The scientist seeks to explain these observations through a theory, which can be regarded as an algorithm capable of generating the series and extending it, that is, predicting future observations. For any given series of observations there are always several competing theories, and the scientist must choose among them. The model demands that the smallest algorithm, the one consisting of the fewest bits, be selected. Stated another way, this rule is the familiar formulation of Occam's razor: Given differing theories of apparently equal merit, the simplest is to be preferred.

Thus in the Solomonoff model a theory that enables one to understand a series of observations is seen as a small computer program that reproduces the observations and makes predictions about possible future observations. The smaller the program, the more comprehensive the theory and the greater the degree of understanding. Observations that are random cannot be reproduced by a small program and therefore cannot be explained by a theory. In addition the future behavior of a random system cannot be predicted. For random data the most compact way for the scientist to communicate his observations is for him to publish them in their entirety.

I'm writing a paper on Kolmogorov complexity for a class in complexity theory I'm taking this semester -- if I get a chance, I'll try to blog about some of the fascinating ideas in this field.
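
In the meantime, a cheap way to get some intuition for the incompressibility idea is to use an off-the-shelf compressor as a (very loose) stand-in for "shortest program": structured data compresses dramatically, while random data essentially doesn't. A rough Python sketch -- zlib is just a convenient proxy here, since true Kolmogorov complexity is uncomputable:

    import os
    import zlib

    def description_size(data: bytes) -> int:
        # The length of the compressed output is a crude upper bound on the
        # length of the shortest "program" (description) for the data.
        return len(zlib.compress(data, 9))

    structured = b"ab" * 5000        # highly regular: a tiny theory explains it
    random_data = os.urandom(10000)  # no structure for the compressor to exploit

    print(description_size(structured))   # a few dozen bytes
    print(description_size(random_data))  # roughly 10000 bytes -- incompressible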

Neat hack

/proc/sys/vm/drop_caches could be useful — to clean the kernel's buffer cache between runs of a benchmark, for example. Available in recent Linux kernels (>= 2.6.16).
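
The usual recipe is to sync first (only clean pages can be dropped) and then, as root, write 3 to that file. Wrapped up for a benchmark harness, it might look something like this minimal sketch:

    import os

    def drop_caches() -> None:
        # Flush dirty pages first; drop_caches only discards clean cache.
        os.sync()
        with open("/proc/sys/vm/drop_caches", "w") as f:
            # 1 = page cache, 2 = dentries and inodes, 3 = both
            f.write("3\n")

    # Call drop_caches() between benchmark runs so each run starts cold.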

1 Feb 2007 (updated 1 Feb 2007 at 01:33 UTC) »
pgmemcache

I'm now maintaining pgmemcache. I'm happy to announce that a 1.1 beta release is now available. It contains a number of bug fixes, including the changes needed to compile against PostgreSQL 8.1 and 8.2. It also no longer has a build dependency on pmk. I've labelled it a "beta" only because I haven't done much testing of it — if it seems okay, I'll put up a final release shortly. Thanks to the good folks at the Open Technology Group, Inc. for sponsoring my work on pgmemcache.

Download (source .tar.gz)
Release Notes

Europe

I'll be away until January 5th. I should still be reachable via email.

Database Architecture

The Revolution In Database Architecture is a nice short paper that outlines Jim Gray's thoughts on future directions for database systems. A few samples:

Random access is a hundred times slower than sequential. These changing ratios require new algorithms that intelligently use multi-processors sharing a massive main memory, and intelligently use precious disk bandwidth. The database engines need to overhaul their algorithms to deal with the fact that main memories are huge (billions of pages, trillions of bytes). The era of main-memory databases has finally arrived.
...
Cost-based static-plan optimizers continue to be the mainstay for simple queries that run in seconds. But, for complex queries, the query optimizer must adapt to current workloads, must adapt to data skew and statistics, and must plan in a much more dynamic way – changing plans as the system load and data statistics change. For petabyte-scale databases it seems the only solution is to run continuous data scans and let queries piggyback on the scans. Teradata pioneered that mechanism, and it is likely to become more common in the future.
...
The database community has found a very elegant way to embrace and extend machine learning technology like clustering, decision trees, Bayes nets, neural nets, time series analysis, etc... The key idea is to create a learning table T; telling the system to learn columns x, y, z from attributes a, b, c (or to cluster attributes a, b, c, or to treat a as the time stamp for b.) Then one inserts training data into the table T, and the data mining algorithm builds a decision tree or Bayes net or time series model for the data .... After the training phase, the table T can be used to generate synthetic data; given a key a,b,c it can return the likely x,y,z values of that key along with the probabilities. Equivalently, T can evaluate the probability that some value is correct. The neat thing about this is that the framework allows you to add your own machine-learning algorithms to this framework. This gives the machine-learning community a vehicle to make their technology accessible to a broad user base.
I'm not convinced that such a model is really the right way to integrate data mining into the DBMS, but it's an interesting (and easily implemented) idea. You could build such a system using Postgres' extensibility features pretty easily, although backend modifications would probably be necessary for proper integration. PL/R would definitely be a useful tool.
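
Just to make the shape of that interface concrete, the train-then-predict semantics Gray describes boils down to something like the following toy sketch. No DBMS is involved, and the LearningTable class and its nearest-neighbour "model" are invented purely for illustration:

    class LearningTable:
        # Toy stand-in for Gray's learning table T: training data goes in via
        # insert(), and predict() returns likely values for an unseen key.
        def __init__(self):
            self.rows = []  # list of (key, value) training examples

        def insert(self, key, value):
            self.rows.append((key, value))

        def predict(self, key):
            # Trivial nearest-neighbour "model"; a real system would fit a
            # decision tree, Bayes net, or time-series model at this point.
            def dist(k):
                return sum((a - b) ** 2 for a, b in zip(k, key))
            best_key, best_value = min(self.rows, key=lambda r: dist(r[0]))
            return best_value, 1.0 / (1.0 + dist(best_key))

    t = LearningTable()
    t.insert((1, 2, 3), (10.0,))  # learn x from attributes a, b, c
    t.insert((4, 5, 6), (40.0,))
    print(t.predict((1, 2, 4)))   # -> ((10.0,), 0.5)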

Real Life

Sorry for not posting the solution to the math problem I blogged about a few months (!) ago -- I'll get around to that shortly.

3 Oct 2006 (updated 3 Oct 2006 at 20:04 UTC) »
Long time, no blog

Life's been keeping me busy. I'll post an update soon. Briefly: finished my summer job for the Amalgamated Insight (née TelegraphCQ) folks. I'm now back at Queen's for the last year of my undergrad.

Interesting math

An assignment in one of my classes included an interesting bonus problem. It is very simple, but I confess I got it completely wrong before I saw the solution. Maybe one of you bright folks is smarter than I:

Let the alphabet A = {a, b, c, ..., z} (A is the set of 26 lowercase letters of the English alphabet). Let S1(w) be true iff the string w over alphabet A contains the substring aaa; let S2(w) be true iff the string w contains the substring abc.

Suppose we choose a string w of 10 characters; each character in w is selected uniformly at random, independently of the others.

Let P1 be the probability that S1(w) is true, and let P2 be the probability that S2(w) is true. Is P1 > P2, P1 < P2, or P1 = P2? Give a justification for your answer. (Hint: P1 != P2).

If you think you know the answer, email me — I'll post the answer later (neilc A T samurai D O T com). Obviously, the gist is in the justification, not which alternative you think is true.
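
If you want to check a guess numerically before emailing, both probabilities can be computed exactly with a small dynamic program over a KMP-style pattern automaton. A rough Python sketch (it only tells you which inequality holds -- the justification is still the fun part):

    from fractions import Fraction

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def prob_contains(pattern: str, length: int = 10) -> Fraction:
        m = len(pattern)

        def step(state: int, ch: str) -> int:
            # Longest prefix of `pattern` that is a suffix of the text seen so
            # far, given that `state` characters were matched before reading ch.
            s = pattern[:state] + ch
            for k in range(min(len(s), m), 0, -1):
                if s.endswith(pattern[:k]):
                    return k
            return 0

        p = Fraction(1, len(ALPHABET))
        dist = {0: Fraction(1)}   # probability of being in each automaton state
        matched = Fraction(0)     # probability the pattern has already appeared
        for _ in range(length):
            new = {}
            for state, pr in dist.items():
                for ch in ALPHABET:
                    nxt = step(state, ch)
                    if nxt == m:
                        matched += pr * p      # absorb: pattern has appeared
                    else:
                        new[nxt] = new.get(nxt, Fraction(0)) + pr * p
            dist = new
        return matched

    print(float(prob_contains("aaa")))  # P1
    print(float(prob_contains("abc")))  # P2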

Hat Tip: Prof. Kai Salomaa for showing me the problem.

I couldn't make it to OSCON this year (again!), but I noticed that Theo Schlossnagle has posted the slides for his OSCON talk about using PostgreSQL for terabyte-scale data warehousing.

I've posted the material from the "Introduction to Hacking PostgreSQL" tutorial that Gavin Sherry and I gave at the PostgreSQL Summit online, at http://neilconway.org/talks/hacking/. The slides, example patch, and handout are there -- I've only posted the PDFs for now, but I'll add the LaTeX source when I get a chance to clean it up.

Robert: When I was helping Josh with the scheduling, we were initially going to make sure that the hacking tutorial ran concurrently with non-technical talks. However, on reflection this seemed like precisely the wrong thing to do: experienced Postgres hackers won't have much to learn from the hacking tutorial, because it will focus on fairly elementary topics. So we decided to put some of the more hardcore technical talks, such as the talks on multi-threading and SMP scalability, alongside the hacking tutorial, so that the hardcore folks can see the technical talks, and the relative novices can attend the tutorial.

6 Apr 2006 (updated 6 Apr 2006 at 01:10 UTC) »

I'd like to highlight Tom Lane's recent commit that fixes a major defect in the implementation of domains in PostgreSQL. Domains can have CHECK or NOT NULL constraints, but prior to Tom's patch, these constraints were not enforced by PL/PgSQL, nor on the return value of a procedural language function. (That meant you could have values of a domain type that violated the domain's own constraints!) This has been a known problem for quite some time, but it was somewhat tricky to fix.
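
For concreteness, the kind of case the fix closes looks roughly like this (a sketch using psycopg2; the domain, function, and connection string are made up for illustration). On a server that includes the fix, the final query should fail with a check-constraint violation instead of quietly returning -1:

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # hypothetical scratch database
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("CREATE DOMAIN positive_int AS integer CHECK (VALUE > 0)")
    cur.execute("""
        CREATE FUNCTION broken() RETURNS positive_int AS $$
        BEGIN
            RETURN -1;  -- violates the domain's CHECK constraint
        END
        $$ LANGUAGE plpgsql
    """)

    # Before the fix this quietly returned -1; with constraint checking done in
    # the domain's input function, it should now raise an error instead.
    cur.execute("SELECT broken()")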

In Postgres, each type has an associated "input function" that takes a string and produces a value of the type. Prior to Tom's commit, a domain type's input function was just the input function of its base type. As a result, merely invoking a type's input function (which is done in several places throughout the backend) wasn't enough to check the constraints on a value of a domain type—you also had to explicitly look up any associated domain constraints and check them. So at every place where we invoke an input function, we'd need to add some additional code to explicitly check domain constraints. Needless to say, that would be pretty ugly -- it's just a few additional function calls, but it's really not something we ought to be doing at every call site of a type's input function.

Worse still, looking up the constraints associated with a domain is relatively expensive (it requires a non-syscache'd catalog lookup). And actually evaluating a CHECK constraint means evaluating an expression, which requires instantiating a bunch of executor machinery -- also not cheap. So in all the places where we'd need to add checks of domain constraints, we'd also need to think about how to efficiently load and cache the domain constraints and executor machinery, and when to invalidate/release them.

I did some work to add domain constraint checking to the return value in PL/PgSQL. I never applied the patch, partly because there were some implementation details that were tricky to resolve, but mostly because it just seemed like the wrong approach.

Tom's fix is much cleaner: by providing a separate input function for all domain types and doing the constraint checking there, we're guaranteed to check domain constraints at the appropriate time, without the need to clutter each call-site of a type input function ([1], [2]). It would be worthwhile to investigate whether this results in a performance regression, though: there's no easy way to cache the executor machinery needed to evaluate a CHECK constraint in this design, whereas the prior design allowed each call-site to implement its own EState caching.

Nice work, Tom!
