Older blog entries for cbbrowne (starting at number 23)

Fast COUNT(*) in PostgreSQL

One of the frequently-asked questions about PostgreSQL is “why is SELECT COUNT(*) FROM some_table doing a slow sequential scan?”

This has been asked repeatedly on mailing lists everywhere, and the common answer in the FAQ provides a fine explanation which I shall not repeat. There is some elaboration on slow counting.

Regrettably, the proposed alternative solutions aren’t always quite so fine. The one most typically pointed out is Tracking the row count.

How Tracking the row count works

The idea is fine, at least at first blush:

  • Set up a table that captures row counts
CREATE TABLE rowcounts (
  table_name text not null primary key,
  total_rows bigint);
  • Initialize row counts for the desired tables
DELETE FROM rowcounts WHERE table_name = 'my_table';
INSERT INTO ROWCOUNTS (table_name, total_rows) SELECT 'my_table', count(*) from my_table;
  • Establish trigger function on my_table which has the following logic
if tg_op = 'INSERT' then
   update rowcounts set total_rows = total_rows + 1
     where table_name = 'my_table';
elsif tg_op = 'DELETE' then
   update rowcounts set total_rows = total_rows - 1
     where table_name = 'my_table';
end if;
  • If you want to know the size of my_table, then query
SELECT total_rows FROM rowcounts WHERE table_name = 'my_table';

The problem with this approach

On the face of it, it looks fine, but regrettably, it doesn’t work out happily under conditions of concurrency. If there are multiple connections trying to INSERT or DELETE on my_table, concurrently, then all require an exclusive lock on the tuple in rowcounts for my_table, and there is a risk (heading towards unity) of:

  1. Deadlock, if different connections access data in incompatible orderings
  2. Lock contention, leading to delays
  3. If some of the connections are running in SERIALIZABLE mode, rollbacks due to inability to serialize this update

So, there is risk of delay, or, rather worse, that this counting process causes otherwise perfectly legitimate transactions to fail. Eek!

A non-locking solution

I suggest a different approach, which eliminates the locking problem, in that:

  • The triggers are set up to only ever INSERT into the rowcounts
  • An asynchronous process does summarization, to shorten rowcounts
  • I’d be inclined to use a stored function to query rowcounts

Table definition

CREATE TABLE rowcounts (
    table_name text not null,
    total_rows bigint,
    id serial primary key);
create index rc_by_table on rowcounts(table_name);

I add the id column for the sake of nit-picking normalization, so that anyone that demands a primary key gets what they demand. I’d not be hugely uncomfortable with leaving it off.

Trigger strategy

The triggers have the following form:

if tg_op = 'INSERT' then
   insert into rowcounts(table_name,total_rows) values ('my_table',1);
elsif tg_op = 'DELETE' then
   insert into rowcounts(table_name,total_rows) values ('my_table',-1);
end if;

Note that since the triggers only ever INSERT into rowcounts, they no longer interact with one another in a way that would lead to locks or deadlocks.

Function to return row count

create or replace function row_count(i_table text) returns integer as $$
begin
   -- Add up every delta recorded for this table
   return sum(total_rows) from rowcounts where table_name = i_table;
end
$$ language plpgsql;

It would be tempting to have this function itself do a “shortening” of the table, but, that would reintroduce into the application the locking that we were wanting to avoid. So DELETE/UPDATE are still deferred.
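The insert-only triggers and the summing query can be tried end to end in a small simulation. What follows is only a sketch under loud assumptions: it uses Python's sqlite3 module and SQLite triggers as a stand-in for PostgreSQL and PL/pgSQL, the payload column is invented for illustration, and the Python helper row_count merely mirrors the function above.

```python
import sqlite3

def make_db():
    """Build an in-memory database with my_table and insert-only count triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE my_table (id INTEGER PRIMARY KEY, payload TEXT);
        CREATE TABLE rowcounts (
            id INTEGER PRIMARY KEY,
            table_name TEXT NOT NULL,
            total_rows INTEGER);
        CREATE INDEX rc_by_table ON rowcounts(table_name);
        -- Each trigger only ever INSERTs a delta row; nothing is updated in place
        CREATE TRIGGER my_table_ins AFTER INSERT ON my_table
        BEGIN
            INSERT INTO rowcounts(table_name, total_rows) VALUES ('my_table', 1);
        END;
        CREATE TRIGGER my_table_del AFTER DELETE ON my_table
        BEGIN
            INSERT INTO rowcounts(table_name, total_rows) VALUES ('my_table', -1);
        END;
    """)
    return conn

def row_count(conn, table_name):
    """Sum all recorded deltas for one table (0 if nothing was recorded)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(total_rows), 0) FROM rowcounts WHERE table_name = ?",
        (table_name,),
    ).fetchone()
    return row[0]

conn = make_db()
conn.executemany("INSERT INTO my_table(payload) VALUES (?)",
                 [("a",), ("b",), ("c",)])
conn.execute("DELETE FROM my_table WHERE payload = 'b'")
print(row_count(conn, "my_table"))
```

Three inserts and one delete leave four delta rows in rowcounts, and their sum matches what a real count over my_table would return. The simulation runs on a single connection, so it shows the bookkeeping only, not the concurrency behaviour.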

Function to clean up row counts table

This function needs to be run once in a while to summarize the table contents.

create or replace function rowcount_cleanse() returns integer as $$
declare
   prec record;
begin
   -- Collapse each table's delta rows into a single summary row
   for prec in select table_name, sum(total_rows) as sum, count(*) as count from rowcounts group by table_name loop
       if prec.count > 1 then
          delete from rowcounts where table_name = prec.table_name;
          insert into rowcounts (table_name, total_rows) values (prec.table_name, prec.sum);
       end if;
   end loop;
   return 0;
end
$$ language plpgsql;
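To see what the summarization does to the table, here is a hedged sketch of the same idea, again assuming Python's sqlite3 module as a stand-in for PostgreSQL. The Python function rowcount_cleanse and the sample rows are illustrative only; unlike the PL/pgSQL version, it returns the number of tables it collapsed.

```python
import sqlite3

def rowcount_cleanse(conn):
    """Collapse each table's delta rows into one summary row.

    Returns the number of tables that were collapsed.
    """
    groups = conn.execute(
        "SELECT table_name, SUM(total_rows), COUNT(*) "
        "FROM rowcounts GROUP BY table_name"
    ).fetchall()
    collapsed = 0
    with conn:  # commit on success, roll back on error
        for table_name, total, n in groups:
            if n > 1:
                conn.execute("DELETE FROM rowcounts WHERE table_name = ?",
                             (table_name,))
                conn.execute(
                    "INSERT INTO rowcounts(table_name, total_rows) VALUES (?, ?)",
                    (table_name, total),
                )
                collapsed += 1
    return collapsed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rowcounts ("
             "id INTEGER PRIMARY KEY, table_name TEXT NOT NULL, total_rows INTEGER)")
# Sample data: four deltas for my_table, one summary row for other_table
conn.executemany(
    "INSERT INTO rowcounts(table_name, total_rows) VALUES (?, ?)",
    [("my_table", 1), ("my_table", 1), ("my_table", 1), ("my_table", -1),
     ("other_table", 10)],
)
collapsed = rowcount_cleanse(conn)
rows = conn.execute(
    "SELECT table_name, total_rows FROM rowcounts ORDER BY table_name"
).fetchall()
print(collapsed, rows)
```

The totals are unchanged by the cleanse; only the number of rows that a later sum has to read goes down.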

Initializing rowcounts for a table that is already populated

Nothing has yet been mentioned that would cause an initial entry to go into rowcounts for an already-populated table.

create or replace function rowcount_new_table(i_table text) returns integer as $$
declare
   query text;
begin
   delete from rowcounts where table_name = i_table;
   -- quote_literal/quote_ident keep the table name safely quoted in the dynamic query
   query := 'insert into rowcounts(table_name, total_rows) select ' || quote_literal(i_table) || ', count(*) from ' || quote_ident(i_table) || ';';
   execute query;
   return total_rows from rowcounts where table_name = i_table;
end
$$ language plpgsql;

If a table already has data in it, then it’s necessary to populate rowcounts with an initial count; the function above does so, and returns the count that it recorded.

Further enhancements possible

It is possible to shift some of the maintenance back into the row_count() function, if we do some exception handling.

create or replace function row_count(i_table text) returns integer as $$
declare
   prec record;
begin
   begin
      lock table rowcounts nowait;
      select sum(total_rows) as sum, count(*) as count into prec from rowcounts where table_name = i_table;
      if prec.count > 1 then
          delete from rowcounts where table_name = i_table;
          insert into rowcounts (table_name, total_rows) values (i_table, prec.sum);
      end if;
      return prec.sum;
   exception
      -- Could not get the lock immediately: just sum the rows, as before
      when lock_not_available then
         return sum(total_rows) from rowcounts where table_name = i_table;
   end;
end
$$ language plpgsql;

This is more than a little risky: if this function wins the lock, it will block other processes that wish to access row counts until its transaction completes. It likely isn’t a worthwhile exercise.

Syndicated 2011-03-04 16:34:00 from linuxdatabases.info

Please Send A Patch

Recent Debian blog entries with this title (by Lucas Nussbaum, Matt Palmer) point out assortedly that:

  • Existing developers frequently know the code base so much better than newcomers that they’re likely way more effective at improving things than some callow newcomer.
  • Taking those developers’ time to do your pet thing instead of something they find useful mayn’t be more effective.

Both points are quite valid, and recent PostgreSQL CommitFest activity suggests a way to at least try to evaluate things.

The PostgreSQL project has a number of committers that are unusually productive developers (-1 from me, Tom? :-) ), and there have certainly been times when the “best” outcome has been for someone to come in suggesting ideas, and for one of the notably productive folk to implement it.

But there has been some debate surrounding the 2011-01 CommitFest, which consists of some 98 proposed patches, all of which require review. These are all, in fact, patches that came as some sort of response to Please send a patch :-) . The trouble with this particular CommitFest is that the patches have been overwhelming the reviewers in terms of sheer volume. Developers that should be considering working on their own “pet features” have been drawn into the review process to look at others’ features instead. None of these results are inherently a bad thing, except for the aggregate that falls out, which is that there’s so much stuff outstanding that it’s tough to get them all properly reviewed.

If a project is busy and vital, it’s pretty necessary for people to do a fair bit of “scratching their own itches” (in keeping with Matt Palmer’s comment) in order to grow the community of people capable of giving real assistance to managing the code base.

“Growing community” requires that some people struggle with the code base a bit so that they become familiar enough to become effective in the future.

Syndicated 2011-02-15 16:18:00 from linuxdatabases.info

NoSQL’s next step – stored procedures

The latest discovery is that the “bad old stored procedures” of SQL… Are what NoSQL needs… http://highscalability.com/blog/2010/11/1/hot-trend-move-behavior-to-data-for-a-new-interactive-applic.html

They’re calling them coprocessors or plugins, and it’s truly not terribly surprising. The High Scalability article makes a Battlestar Galactica joke, of http://en.wikipedia.org/wiki/Eternal_return. The BSG line that kept coming back over and over was: All this has happened before, and all this will happen again. There’s a rather depressing possibility that people will consider coprocessors to be the greatest thing ever, not realizing that a substantial chunk of the issues that hold true (for better and worse) for SQL stored procedures will also hold true for coprocessors, which they may learn (or fail to learn!) from scratch.

The notion is that you colocate, along with your database, some kind of “coprocessor engine” that can run code locally, which solves a number of problems, some not new, but some somewhat unique to key/value stores:

Connectivity

You’re running your application in the cloud and have somewhat spotty connectivity between the place where your application logic runs and the database where the data is stored. A coprocessor brings logic right near the database, resolving this problem.

Bulk data transfer

A difference between SQL and key/value stores is that SQL is quite happy shovelling sets of data back and forth, whereas key/value stores are all about singular key/value pairs. An SQL request readily “scales” by transferring data in bulk, whereas key/value can get bogged down by there being a zillion network round trips. A coprocessor can keep a bunch of those “round trips” inside the database layer, which will be a big win.
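The round-trip argument can be made concrete with a toy model. This is a minimal sketch, not any real key/value store's API: ToyKVStore, run_coprocessor and the order:N keys are all hypothetical names, and a counter stands in for actual network traffic.

```python
class ToyKVStore:
    """Hypothetical in-process stand-in for a remote key/value store."""

    def __init__(self, data):
        self._data = dict(data)
        self.round_trips = 0  # counts simulated client/server exchanges

    def get(self, key):
        # One key fetched per simulated network round trip
        self.round_trips += 1
        return self._data[key]

    def run_coprocessor(self, func, keys):
        # The function runs "next to the data": one round trip in total
        self.round_trips += 1
        return func([self._data[k] for k in keys])

keys = [f"order:{i}" for i in range(1, 101)]
store = ToyKVStore({f"order:{i}": i * 10 for i in range(1, 101)})

# Client-side aggregation: one round trip per key
client_total = sum(store.get(k) for k in keys)
client_trips = store.round_trips

# Coprocessor-style aggregation: the summing happens inside the store
store.round_trips = 0
server_total = store.run_coprocessor(sum, keys)
server_trips = store.round_trips

print(client_total, client_trips, server_total, server_trips)
```

Both approaches compute the same total, but the client-side loop pays one simulated round trip per key while the coprocessor call pays one overall.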

Goodbye, foreign keys, hello, um, ???

You may be able to shove some combination of logic maintenance and such into the coprocessor area, thereby gaining back some of the things lost when NoSQL eschewed SQL foreign key references and triggers.

Data normalization analysis returns

One of the typical things to do with NoSQL is to “shard” the database so each database server only has part of the data, and may operate independently of other database servers.

Coprocessor use will require that all the data that is to be used is on the local server, otherwise you head back to the problem of shovelling tuples back and forth between DB servers with the zillions of network roundtrips problem.

To guard against that, the data needs to be normalized in such a way that the data relevant to the coprocessors is available locally. (Perhaps not exclusively, but generally so. A few round trips may be OK, but not zillions.)

It seems to me that people have been excited by NoSQL in part because they could get away from all that irritating SQL normalization rules stuff. But this bit implies that this benefit was something of a mirage. Perhaps the precise rules of Boyce-Codd Normal Form are no longer crucial, but you’ll still need to have some kind of calculus to ascertain which divisions work and which don’t.

Things still not clear about this…

Managing the coprocessors

One of the challenges faced in SQL systems that use a lot of stored procedures is that of managing these procedures, complete with versioning (because what goes into production on day #1 isn’t what will be there forever, right?).

Windows always used to suffer (may still suffer, for all I know) from dependency hell, where different applications may need competing versions of libraries. (Entertainment of the week was seeing that the Haskell folks http://www.haskell.org/pipermail/haskell-cafe/2010-April/076164.html are, of late, running into this. Not intended as insult; it’s a problem that is nontrivial to avoid.)

It’s surely needful to have some kind of coprocessor dictionary to keep this sort of thing under some control. It’s never been trivial for any system, so there’s room for:

  • Repeating yesteryear’s errors
  • Learning from other systems’ mistakes
  • Discovering brand new kinds of mistakes

How rich should the coprocessor environment be?

On the powerful side, http://nodejs.org surely is neat, but having the ability to run arbitrary code there is risky…

How auditable will these systems be?

On the positive side, it’s presumably plausible to add auditing coprocessors to capture interesting information for regulatory purposes.

On the other hand, arbitrarily powerful things like node.js might make it arbitrarily easy to evade regulation.

There aren’t necessarily easy answers to that.

Aside: org2blog mode is pretty nifty…  Made it pretty easy to build this without much tagging effort…

Syndicated 2011-01-27 22:40:00 from linuxdatabases.info

Trying Out org2blog

Hmm. Let’s see how https://github.com/punchagan/org2blog works.

It requires xml-rpc.el; el-get knows about that… Splendid!

I can login to my blog… It takes a very little bit of URL surgery to figure out the apropos URL…

I think I overdid the default categories, but that’s not a huge problem.

Now, let’s see if it’ll publish the entry…

Hey, that worked fine! Cool, I can publish blog entries without looking for my web browser. Now, let’s see if I can get it to stow the password for my website in the encrypted .authinfo file that Emacs likes…

Nope, the .authinfo extension is a Gnus thing, so that possibly goes further than we can readily get. But the author’s amenable to taking a peek at it :-) .

Syndicated 2011-01-19 17:14:00 from linuxdatabases.info

PostgreSQL 9.0 released!

A new release of the most advanced open source database is now available!

As always, as a new major release, there are great gobs of little features that have been added, most of which, individually, likely don’t matter to any particular individual. (For instance, there are a couple dozen enhancements to ECPG, and if you don’t know you’re using that, you almost certainly aren’t, and so those changes likely don’t affect you.)

But there are plenty that are liable to matter, and, indeed, to help improve behaviour of one’s streams of queries, often without even needing any changes to applications.

See also the official release notice, for “markety-speak.”

And see official release notes (that are part of the documentation tree) for deeper details of all the changes in the new release.

Syndicated 2010-09-20 15:14:36 from linuxdatabases.info

Gnus, Dovecot, OfflineIMAP

This is a followup, effectively, to Roland Mas’ article Gnus, Dovecot, OfflineIMAP, search: a HOWTO .

I went thru Roland’s HOWTO, and have a few comments on variances that I noticed:

  1. I first installed OfflineIMAP; this worked pretty much fine as described. I didn’t bother adding the extra Python code for propagating Gnus expiry material, as I generally don’t use it.
  2. I had a couple problems setting up Dovecot:
    1. By default, Dovecot uses Maildir++ folder handling, which isn’t consistent with how OfflineIMAP stores folders. There’s an additional option needed to cope with this:
      mail_location = maildir:~/Maildir:LAYOUT=fs
    2. Perhaps because of the above, I couldn’t readily get Gnus to talk over a pipe to a Dovecot process. Not a big deal – I have Gnus speak to Dovecot via talking to the socket, which is the usual thing one would do with Dovecot anyways.
  3. It seems to me as though Gnus should be able to talk directly to Maildir. It does, after all, have a protocol for it (nnmaildir). I couldn’t struggle my way thru the Gnus documentation to properly set up a virtual server for nnmaildir to do this.

    This would be pretty valuable in that it would eliminate the need for Dovecot altogether. Perhaps it’s a documentation problem that nobody seems to know how to do this.

Syndicated 2010-09-09 18:53:57 from linuxdatabases.info

Android Security

The Android permissions model is, to my mind, a goodly improvement over pretty well any of the alternatives out there at present, in that it at least declares what capabilities any given application demands and expects you to grant.

Applications are unfortunately quite readily able to abuse this a fair bit; a (recent, as of August 2010) example being Evernote.

Evernote, and Why You Need to Think About Permissions describes the problem:

The Evernote app requests a fair number of permissions. Some make sense, such as the INTERNET permission (kinda important for a Web service). Some are a bit dubious, such as needing both coarse and fine location data.

It definitely demands too much permission, with two cross-sections that are troublesome:

  • It asks for “the world” up front
  • It asks for permissions it shouldn’t need. For instance, it shouldn’t need access to contacts – it should merely offer to share data, which pushes data to a boundary where the user, at run time, can choose whether or not to allow the data out.

In addition, some of the permissions ought to be optional.

  1. If you want to record locations on your notes, then granting access to location data may be a reasonable thing to do.
  2. If you don’t want to record locations, then Evernote doesn’t need that access.

Unfortunately, at present, you don’t have any of those shadings; your options are mighty binary:

  1. Grant Evernote all the capabilities requested
  2. Reject the access, and don’t install it.

I suggest that there is another shading that would be useful, notably for INTERNET access (and probably also for filesystem access), which is to “tie down” what places the application can go.

  • Evernote probably only needs to access evernote.com
  • Twitter only needs access to twitter.com
  • Shuffle (a GTD-like application) may access a domain of the user’s choice to synchronize data.
  • Web Browser needs the “wide open” Internet.

I expect that filesystem access could similarly be tied down:

  • A file browser (such as Astro) might legitimately access “everything”
  • Most applications should be restricted to their own directory
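The “tie down” idea for INTERNET access amounts to checking each destination against a per-application list of domains. The following is purely an illustrative sketch of such a policy check, written in Python rather than against any real Android API; host_allowed and the sample domain lists are hypothetical.

```python
from urllib.parse import urlsplit

def host_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlsplit(url).hostname
    if host is None:  # no host component at all, e.g. a malformed URL
        return False
    host = host.lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

# Hypothetical per-application allowlists, following the examples above
evernote_domains = ["evernote.com"]
twitter_domains = ["twitter.com"]

print(host_allowed("https://www.evernote.com/api", evernote_domains))
print(host_allowed("https://evernote.com.evil.example/", evernote_domains))
print(host_allowed("https://twitter.com/home", evernote_domains))
```

Matching on the full host name, or on a suffix that begins with a dot, avoids accepting look-alike hosts that merely contain the allowed domain somewhere in their name.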

Syndicated 2010-08-12 15:47:45 from linuxdatabases.info

Farewell, Solaris, we hardly knew ye

I had long had on my low level “to do” list to consider trying out OpenSolaris, likely either in the form of Nexenta or as Debian/OpenSolaris (nearest link: OpenSolaris @ CSC).

Alas, I didn’t get around to it in time for the license change which essentially eliminates interest in it. The precis of the change: You’re free to download it, and use it for as long as 90 days, but then, you’re expected to pay Oracle for a service contract.

I guess the good news is that I didn’t waste any time on something I’d have to be “sunsetting” by the end of June 2010.

Nope, not “April Fools.”

Syndicated 2010-04-01 20:03:22 from linuxdatabases.info

Helicopters and the Budget

The city of Indianapolis recently announced that they were cancelling use of police helicopters, to save $1.4-ish millions.

Locals complained that this is terrible and demonstrates that the city does not care about public safety.

I suggest that this is not nearly as obvious as it might seem.

By all means, helicopters are “sexy”, but that certainly isn’t good enough to justify it!

Helicopters can help solve some specific problems quickly, but there are perhaps three metrics by which they mayn’t actually be worthwhile.

  • Do they solve more crimes? If not, then that is a strike against choppers.
  • Do they merely catch some perps more quickly? Is faster truly worth the money? Do faster catches save them from extra crimes being committed? That may be nice for would-be victims… How does it actually affect the budget?
  • What would be the expected outcome from the addition or loss of the equivalent money spent on cops on the ground?

After all, it may be that a dozen extra guys (and ladies) walking or driving beats, 8 hours a day, 200-some days per year, may do more good than an aircraft sprinting around for a couple hours a day.

The answers are in the details…

Syndicated 2010-03-09 19:00:25 from linuxdatabases.info

My recent disappointment is that the 43 Folders Wiki is evidently down for extended maintenance. They had claimed it was down for a couple days, to be back July 6th; the "return date" seems to have gotten more nebulous :-(.

They had been suffering quite a bit from "spam," as it were; people logging in (possibly as scripts) to deface the site by adding links to link farms (e.g. - for viagra and the likes). Perhaps the intent is to do a more substantial upgrade to the MediaWiki instance, as there is rumour that modern versions can be set up to be pretty resistant to such attacks.

Regrettably, it means I don't get my "fix" of links to productivity changes for a while yet...
