Older blog entries for cbbrowne (starting at number 30)

Mailman subscriber lists

As part of “due diligence” for some mailing lists I am involved with (for Slony, see slony-backups ), I discovered the need to dump out Mailman mailing list subscribers.

There is a script to do this, written in Python, mentioned on the Mailman wiki, accessible as mailman-subscribers.py

I’d kind of rather have something a bit more version-tracked, so I poked around at GitHub, and found larsks / mailman-subscribers

That was a little out of date; the last code was from a couple of years ago, so I forked, updated to the latest, and suggested that “larsks” pull it, which he did, quite quickly.

The “kudos” bit is that I noticed a bit of a blemish, in that the mailing list password was required to be on the command line, thereby making it visible to anyone with access to /usr/bin/ps on one’s system. I submitted a feature request, and Lars was so kind as to have this feature added so quickly that by the time I had the prototype of my Slony “subscriber backup” script working, I immediately needed to change it to make use of the lovely new password-in-file feature. Nice!
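
The general idea of keeping a secret off the command line can be sketched in a few lines of Python. This is only an illustration of the technique, not the code in larsks / mailman-subscribers; the helper name `read_password_file` and its permission check are assumptions made for the example.

```python
import os
import stat


def read_password_file(path):
    """Return the password stored on the first line of `path`.

    Hypothetical helper: it refuses files that group or others can read,
    since such a file would be little better than a visible argv.
    """
    mode = os.stat(path).st_mode
    if mode & (stat.S_IRGRP | stat.S_IROTH):
        raise PermissionError(f"{path} is readable by group or others")
    with open(path, encoding="utf-8") as handle:
        return handle.readline().rstrip("\n")
```

A script would then take the file name as its argument, so that only the path (and never the password) shows up in the process list.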

Syndicated 2013-02-27 18:32:42 from linuxdatabases.info

Installing git-annex from Debian unstable

I happen to be a supporter of Joey Hess’ Git Annex Kickstarter project; no big bucks, but it seemed a good thing to help out.

I received the stickers that were my “project reward,” and figured I should start playing with the new results. I’m particularly keen on the planned Android client, but I should make some use of it before that becomes available.

There’s good news, and bad news:

Good news
He has added an assistant to provide interactive help in setting up repositories. It’s included in Debian unstable, in a version released September 24th.
Bad news
I generally prefer using packages from Debian testing, and it has a version released July 24th, well before any of this, and without any of Joey’s recent enhancements.

Fortunately, drawing in the September (unstable) version isn’t too terribly difficult. My /etc/apt/preferences.d/simple configuration has Pin-Priority values that prefer stable over testing, testing over unstable, and unstable over experimental (where enormous potential for breakage lies!).

As a consequence, installing the unstable version is pretty easy, albeit involving an option I had to go looking for:

root@cbbrowne:~# apt-get -t unstable install git-annex
... leads to loading ...
Get:1 http://ftp.us.debian.org/debian/ unstable/main git-annex amd64 3.20120924 [7,411 kB]

And, with a run of % git annex webapp, it’s up and running!

Syndicated 2012-10-12 15:06:31 from linuxdatabases.info

Netboot via PXE

Netboot via PXE 2012-03-13 Tue

Some notes

To get this to work, you need…

BIOS ROM that supports PXE
True for most modern motherboards and/or NICs
DHCP server
To manage passing out configuration such as IP addresses and the next-server attribute.
TFTP server
With images
???
It looks for images based on most-to-least specific configuration
  • MAC address
  • IP subnet
  • Default
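
The notes above do not name a bootloader. Assuming PXELINUX and its documented file-naming convention, the most-to-least-specific search order can be sketched as below. The function name is made up for illustration, and the optional client-UUID lookup that PXELINUX can try first is left out.

```python
import ipaddress


def pxe_config_candidates(mac, ip):
    """List config file names from most to least specific.

    Assumption: the PXELINUX convention, i.e. ARP type 01 plus the MAC
    address, then the IPv4 address in upper-case hex shortened one digit
    at a time, then "default".
    """
    mac_name = "01-" + mac.lower().replace(":", "-")
    ip_hex = format(int(ipaddress.IPv4Address(ip)), "08X")
    by_ip = [ip_hex[:length] for length in range(8, 0, -1)]
    return [mac_name] + by_ip + ["default"]


for name in pxe_config_candidates("88:99:AA:BB:CC:DD", "192.0.2.91"):
    print(name)
```

Shortening the hex address one digit at a time is what lets a single file cover a whole block of addresses, which is roughly the “IP subnet” level in the list above.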

Some things PXE doesn’t support

It was created as a standard in 1999, and hasn’t been updated much since, so there are things that postdate it, and that are thus not supported.

WIFI
Likely to be troublesome anyways, as you surely want some authentication to get onto a WIFI network
IPv6
It wasn’t clear that it yet mattered in 1999…
DNS
It works with IP addresses only

DHCP discussion

  • Go look for next-server attribute
  • Some discussion of handling sharing subnets across a redundant set of DHCP servers

More worth looking at

Inquisitor
OSS hardware testing tool that’s better than memtest
gPXE
OSS bootloader
  • Supports DNS, so can forward requests broadly potentially anywhere
  • Can transfer data across additional protocols, such as HTTP, HTTPS, SAN (iSCSI, AoE)
  • Can support WIFI
  • Possibly IPv6

Syndicated 2012-03-14 19:47:00 from linuxdatabases.info

Subversion “deprecation”

I was a bit tickled by the characterization I saw today in the new Subversion release, describing the deprecation of version 1.5:

The Subversion 1.5.x line is no longer supported. This doesn't mean
that your 1.5 installation is doomed; if it works well and is all you
need, that's fine. "No longer supported" just means we've stopped
accepting bug reports against 1.5.x versions, and will not make any
more 1.5.x bugfix releases.

They aren’t telling us the world will end for anyone using version 1.5, just that they don’t intend to provide support anymore.

Which seems like a fine thing. Version 1.5 is 3 years old, and, when they seem to be releasing about a version per year (1.0 in 2004, 1.7 in 2011), 3 years of backwards support doesn’t seem dramatically insufficient. Particularly if, when support goes away, you’re not inherently doomed!

Syndicated 2011-10-11 19:55:00 from linuxdatabases.info

PostgreSQL 9.1 now available

Making for some reasonably good news on 9/11, the next version of PostgreSQL, version 9.1, has been released.

Major enhancements include:

Synchronous replication
continuing the enhancements to built-in WAL-based replication
Per-column collations
to support linguistically-correct sorting down to the column level
Unlogged tables
improving performance for the handling of ephemeral data (e.g. caches)
K-Nearest-Neighbor Indexing
indexing on distances for geographical and text-search queries
Serializable Snapshot Isolation
implementing “true serializability”
Writable Common Table Expressions
recursive and similar queries can now update data
Security Enhanced Postgres
Similar to SE-Linux, providing Mandatory Access Controls for higher grade security
Foreign Data Wrappers
attach to other databases and data sources
Extensions
managing deployment of additional database features

Many of these continue the trend of enhancing features added in earlier versions (e.g. synchronous replication, KNN, Writable CTEs).

Some introduce new kinds of functionality (e.g. SE-Postgres, FDW, Extensions), where new seeds are sown that we may expect to flower into further new features in future versions.

Syndicated 2011-09-12 17:00:00 from linuxdatabases.info

Music Playing

My latest “musical experiment” is with Clementine, which was recently added to Debian.

I should note things that I have used in the past, and some areas of past pain:

XMMS
Which has often been nice enough, but which has grown long in the tooth.
XMMS2
Which takes the desirable step of being a client/server system which admits the availability of a bunch of backends. I have, when using it, tended to prefer the shell backend.
Amarok
An “all singing, all dancing” option…
  • It uses KDE, which I’m historically not terribly keen on
  • It has libraries that are evidently clever enough to pull music off my iPod Touch as long as it’s plugged into a USB dock
  • It has the “KDE integration” that seems to want to have widgets integrating into some “KDE-compliant” window manager. I’m running StumpWM, which is decidedly not a KDE thing, so controlling Amarok always seems like a bit of a crapshoot…
  • I have played a bit with the “playlist” functionality; it hasn’t yet agreed with me…

At any rate, I saw Clementine listed as “new in Debian,” so thought I’d take a peek. I’m liking what I see thus far:

  • Onscreen widgets for all the sorts of things that need to be controlled, including
    • Managing music library, so as to add things
  • Like Amarok, it can see my iPod whenever it’s plugged in, and can play that music through the computer
  • It easily grabbed album covers (I’m not sure what service it’s using) for most of my music
  • Onscreen controls seem pretty reasonable, though I kind of wish the volume control was larger, as that’s something one wants most frequently to fiddle with.
  • There’s a cool visualization widget (think “equalizer”)

Seems pretty likable thus far…

Syndicated 2011-06-15 21:46:00 from linuxdatabases.info

What’s Up Lately With Slony?

What’s up Lately? 2011-04-12 Tue

Git Changeover

In July 2010, we switched over to use Git, which has been working out quite fine so far. The official repository is at git.postgresql.org; note that some developers are publishing their repositories publicly at GitHub.

You can find details at those “private” repositories of branches that the developers have opened to work on various bug fixes and features.

The next big version

We have been working on what seems most likely to be called the “2.1 release.”

  • There are quite a lot of fixes and enhancements already in place. We have been quite faithful about integrating release notes in as changes are made, so Master RELEASE notes should be quite accurate in representing what has changed. Some highlights include:
    • Changes to queries against sl_log_* tables improve performance when undergoing large backlog
    • Slonik now supports commands to bulk-add tables and sequences
    • Integration of clustertest framework that does rather more sophisticated tests, obsolescing previous “ducttape” and shell script tests.
    • Cleanup of a bunch of things
      • Use named parameters in all functions.
      • Dropped SNMP support that doesn’t seem to run anymore, and which was never part of any regression tests.
  • It is unlikely that it will get dubbed “version 3,” as there aren’t the sorts of deep changes that would warrant such.
    • The database schema has not materially changed in any way that would warrant re-initializing clusters, as was the case between version 1.2 and 2.0.
    • The changes generally aren’t really huge, with the exceptions of a couple features that aren’t quite ready yet (which deserves its own separate discussion)

Still Outstanding

There are two features being worked on, which we hoped would be ready around the time of PGCon 2011:

Implicit WAIT FOR EVENT
This feature causes most Slonik commands to wait for whatever event responses should be received before they may be considered properly finished. For instance SUBSCRIBE SET would wait until the subscription has been completed before proceeding.
Multinode FAIL OVER
For clusters where there are multiple origins for different sets, this allows reshaping the entire cluster properly, which has historically been rather more troublesome than people usually were able to recognize.

Unfortunately, neither of these is quite ready yet. It is conceivable that the automatic waiting may be mostly ready, but complications and interruptions have gotten in the way of completion of multinode failover.

When will 2.1 be ready?

Three possibilities seem to present themselves:

  1. Release what we’ve got as 2.1, let the outstanding items arrive in a future version. Unfortunately, this would seem to dictate that we support a “version 2.1” for an extended period of time, complete with the trouble and effort of backpatching. It’s not very attractive.
  2. Draw in Implicit WAIT FOR EVENT, which would make for a substantially more featureful 2.1, and let multinode FAIL OVER come along later. We had been hoping that there would be common functionality between these two features, so had imagined it a bad idea to do one without the other. But perhaps that’s wrong, and Implicit WAIT FOR EVENT doesn’t need multinode failover to be meaningful. That does seem like it may be true.

    There is still the same issue as with 1. above, that this would mean having an extra version of Slony to support, which isn’t something anyone is too keen on.

  3. Wait until it’s all ready. This gets rid of the version proliferation problem, but means that it’s going to be a while (several months, perhaps quite a few) before users may benefit from any of these enhancements.

    Development of the failover facility seems like it will be bottlenecked for a while on Jan, so this suggests that it may be timely to solicit features that Steve and I might work on concurrently in the interim.

So, what might still go into 2.1?

  • We periodically get bug reports from people about this and that, and minor things will certainly get drawn in, particularly if they represent incorrect behaviour.
  • ABORT script: I plan to send a note out soon describing my thoughts thus far.
  • Cluster Analysis Tooling: I think it would be pretty neat to connect to a Slony cluster, pull out some data, and generate some web pages and GraphViz diagrams to characterize the status and health of the cluster.
  • There was evidently discussion at PGEast about trying to get the altperl scripts improved/cleaned up. My personal opinion (cbbrowne) is that they’re not quite general enough, and that making them so would be more trouble than it’s worth, so my “vote” would be to deprecate them.

    But that is certainly not the only opinion out there – there are apparently others that regularly use them.

    While I’m not keen on putting effort into them, if there is some consensus on what to do, I’d go along with it. That might include:

    • Adding scripts to address slonik features that have not thus far been included in altperl.
    • Integrating tests into the set of tests run using the clustertest framework, so that we have some verification that this stuff works properly.
  • Insert Your Pet Feature Here? Maybe there’s some low hanging fruit that we’re not aware of that’s worth poking at.

Syndicated 2011-04-12 19:26:00 from linuxdatabases.info

Fast COUNT(*) in PostgreSQL

One of the frequently-asked questions about PostgreSQL is “why is SELECT COUNT(*) FROM some_table doing a slow sequential scan?”

This has been asked repeatedly on mailing lists everywhere, and the common answer in the FAQ provides a fine explanation which I shall not repeat. There is some elaboration on slow counting.

Regrettably, the proposed alternative solutions aren’t always quite so fine. The one that is most typically pointed out is this one, Tracking the row count

How Tracking the row count works

The idea is fine, at least at first blush:

  • Set up a table that captures row counts
CREATE TABLE rowcounts (
  table_name text not null primary key,
  total_rows bigint);
  • Initialize row counts for the desired tables
DELETE FROM rowcounts WHERE table_name = 'my_table';
INSERT INTO ROWCOUNTS (table_name, total_rows) SELECT 'my_table', count(*) from my_table;
  • Establish trigger function on my_table which has the following logic
if tg_op = 'INSERT' then
   update rowcounts set total_rows = total_rows + 1
     where table_name = 'my_table';
elsif tg_op = 'DELETE' then
   update rowcounts set total_rows = total_rows - 1
     where table_name = 'my_table';
end if;
  • If you want to know the size of my_table, then query
SELECT total_rows FROM rowcounts WHERE table_name = 'my_table';

The problem with this approach

On the face of it, it looks fine, but regrettably, it doesn’t work out happily under conditions of concurrency. If there are multiple connections trying to INSERT or DELETE on my_table, concurrently, then all require an exclusive lock on the tuple in rowcounts for my_table, and there is a risk (heading towards unity) of:

  1. Deadlock, if different connections access data in incompatible orderings
  2. Lock contention, leading to delays
  3. If some of the connections are running in SERIALIZABLE mode, rollbacks due to inability to serialize this update

So, there is risk of delay, or, rather worse, that this counting process causes otherwise perfectly legitimate transactions to fail. Eek!

A non-locking solution

I suggest a different approach, which eliminates the locking problem, in that:

  • The triggers are set up to only ever INSERT into the rowcounts
  • An asynchronous process does summarization, to shorten rowcounts
  • I’d be inclined to use a stored function to query rowcounts

Table definition

CREATE TABLE rowcounts (
    table_name text not null,
    total_rows bigint,
    id serial primary key);
create index rc_by_table on rowcounts(table_name);

I add the id column for the sake of nit-picking normalization, so that anyone that demands a primary key gets what they demand. I’d not be hugely uncomfortable with leaving it off.

Trigger strategy

The triggers have the following form:

if tg_op = 'INSERT' then
   insert into rowcounts(table_name,total_rows) values ('my_table',1);
elsif tg_op = 'DELETE' then
   insert into rowcounts(table_name,total_rows) values ('my_table',-1);
end if;

Note that since the triggers only ever INSERT into rowcounts, they no longer interact with one another in a way that would lead to locks or deadlocks.
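
The bookkeeping can be tried out without a PostgreSQL server. The sketch below uses Python’s built-in sqlite3 module as a stand-in, so it only illustrates the insert-only pattern; SQLite’s locking is quite different from PostgreSQL’s, and the concurrency benefit described here is not something this example demonstrates. The trigger names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (payload TEXT);
    CREATE TABLE rowcounts (
        id INTEGER PRIMARY KEY,
        table_name TEXT NOT NULL,
        total_rows INTEGER);
    -- Each change appends a +1 or -1 delta instead of updating a shared row.
    CREATE TRIGGER my_table_ins AFTER INSERT ON my_table BEGIN
        INSERT INTO rowcounts (table_name, total_rows) VALUES ('my_table', 1);
    END;
    CREATE TRIGGER my_table_del AFTER DELETE ON my_table BEGIN
        INSERT INTO rowcounts (table_name, total_rows) VALUES ('my_table', -1);
    END;
""")


def row_count(table_name):
    """Sum the deltas recorded for `table_name` (0 if there are none)."""
    cur = conn.execute(
        "SELECT COALESCE(SUM(total_rows), 0) FROM rowcounts WHERE table_name = ?",
        (table_name,))
    return cur.fetchone()[0]


conn.executemany("INSERT INTO my_table VALUES (?)", [("a",), ("b",), ("c",)])
conn.execute("DELETE FROM my_table WHERE payload = 'b'")
print(row_count("my_table"))  # three inserts and one delete: prints 2
```

Note that rowcounts now holds four rows for a table of two rows, which is why the periodic summarization step is needed.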

Function to return row count

create or replace function row_count(i_table text) returns integer as $$
begin
   return sum(total_rows) from rowcounts where table_name = i_table;
end
$$ language plpgsql;

It would be tempting to have this function itself do a “shortening” of the table, but, that would reintroduce into the application the locking that we were wanting to avoid. So DELETE/UPDATE are still deferred.

Function to clean up row counts table

This function needs to be run once in a while to summarize the table contents.

create or replace function rowcount_cleanse() returns integer as $$
declare
   prec record;
begin
   -- Collapse each table's accumulated deltas into a single summary row.
   for prec in select table_name, sum(total_rows) as sum, count(*) as count from rowcounts group by table_name loop
       if prec.count > 1 then
          delete from rowcounts where table_name = prec.table_name;
          insert into rowcounts (table_name, total_rows) values (prec.table_name, prec.sum);
       end if;
   end loop;
   return 0;
end;
$$ language plpgsql;
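
The same summarization step can be sketched in Python with sqlite3, again purely as a stand-in for PostgreSQL. One deliberate difference from the function above: this version returns the number of rows it removed rather than always returning 0, which is an assumption made to keep the example easy to check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rowcounts ("
    " id INTEGER PRIMARY KEY, table_name TEXT NOT NULL, total_rows INTEGER)")
conn.executemany(
    "INSERT INTO rowcounts (table_name, total_rows) VALUES (?, ?)",
    [("my_table", 1), ("my_table", 1), ("my_table", -1), ("other", 1)])


def rowcount_cleanse():
    """Collapse each table's deltas into one row; return rows removed."""
    removed = 0
    with conn:  # one transaction: commit on success, roll back on error
        groups = conn.execute(
            "SELECT table_name, SUM(total_rows), COUNT(*) FROM rowcounts"
            " GROUP BY table_name").fetchall()
        for table_name, total, entries in groups:
            if entries > 1:
                conn.execute(
                    "DELETE FROM rowcounts WHERE table_name = ?", (table_name,))
                conn.execute(
                    "INSERT INTO rowcounts (table_name, total_rows) VALUES (?, ?)",
                    (table_name, total))
                removed += entries - 1
    return removed
```

The sum per table is unchanged by the cleanup; only the number of rows that has to be scanned to compute it shrinks.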

Initializing rowcounts for a table that is already populated

Nothing has yet been mentioned that would cause an initial entry to go into rowcounts for an already-populated table.

create or replace function rowcount_new_table(i_table text) returns integer as $$
declare
   query text;
begin
   delete from rowcounts where table_name = i_table;
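   -- quote_literal() turns the table name into a properly quoted string literal.
   -- The table name in the FROM clause is interpolated as-is, so pass only trusted names.
   query := 'insert into rowcounts(table_name, total_rows) select ' || quote_literal(i_table) || ', count(*) from ' || i_table || ';';
   execute query;
   return total_rows from rowcounts where table_name = i_table;
end;
$$ language plpgsql;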

If a table already has data in it, then it’s necessary to populate rowcounts with an initial count, which is what the function above does.

Further enhancements possible

It is possible to shift some of the maintenance back into the row_count() function, if we do some exception handling.

create or replace function row_count(i_table text) returns integer as $$
declare
   prec record;
begin
   begin
      -- NOWAIT raises lock_not_available instead of waiting for the lock.
      lock table rowcounts nowait;
      select sum(total_rows) as sum, count(*) as count into prec from rowcounts where table_name = i_table;
      if prec.count > 1 then
          delete from rowcounts where table_name = i_table;
          insert into rowcounts (table_name, total_rows) values (i_table, prec.sum);
      end if;
      return prec.sum;
   exception
      when lock_not_available then
         -- Someone else holds the lock: report the sum and skip the cleanup.
         return sum(total_rows) from rowcounts where table_name = i_table;
   end;
end;
$$ language plpgsql;

This is more than a little risky: if this function wins the lock, it will block other processes that wish to access row counts until it’s done. This likely isn’t a worthwhile exercise.

Syndicated 2011-03-04 16:34:00 from linuxdatabases.info

Please Send A Patch

Recent Debian blog entries with this title (by Lucas Nussbaum, Matt Palmer) point out assortedly that:

  • Existing developers frequently know the code base so much better than newcomers that they’re likely way more effective at improving things than some callow newcomer.
  • Taking those developers’ time to do your pet thing instead of something they find useful mayn’t be more effective.

Both points are quite valid, and recent PostgreSQL CommitFest activity suggests a way to at least try to evaluate things.

The PostgreSQL project has a number of committers that are unusually productive developers (-1 from me, Tom? :-) ), and there have certainly been times when the “best” outcome has been for someone to come in suggesting ideas, and for one of the notably productive folk to implement it.

But there has been some debate surrounding the 2011-01 CommitFest, which consists of some 98 proposed patches, all of which require review. These are all, in fact, patches that came as some sort of response to Please send a patch :-) . The trouble with this particular CommitFest is that the patches have been overwhelming the reviewers in terms of sheer volume. Developers that should be considering working on their own “pet features” have been drawn into the review process to look at others’ features instead. None of these results are inherently a bad thing, except for the aggregate that falls out, which is that there’s so much stuff outstanding that it’s tough to get them all properly reviewed.

If a project is busy and vital, it’s pretty necessary for people to do a fair bit of “scratching their own itches” (in keeping with Matt Palmer’s comment) in order to grow the community of people capable of giving real assistance to managing the code base.

“Growing community” requires that some people struggle with the code base a bit so that they become familiar enough to become effective in the future.

Syndicated 2011-02-15 16:18:00 from linuxdatabases.info

NoSQL’s next step – stored procedures

The latest discovery is that the “bad old stored procedures” of SQL… Are what NoSQL needs… http://highscalability.com/blog/2010/11/1/hot-trend-move-behavior-to-data-for-a-new-interactive-applic.html

They’re calling them coprocessors or plugins, and it’s truly not terribly surprising. The High Scalability article makes a Battlestar Galactica joke, of http://en.wikipedia.org/wiki/Eternal_return. The BSG line that kept coming back over and over was: All this has happened before, and all this will happen again. There’s a rather depressing possibility that people will consider coprocessors to be the greatest thing ever, not realizing that a substantial chunk of the issues that hold true (for better and worse) for SQL stored procedures will also hold true for coprocessors, issues that they may learn (or fail to learn!) from scratch.

The notion is that you colocate, along with your database, some kind of “coprocessor engine” that can run code locally, which solves a number of problems, some not new, but some somewhat unique to key/value stores:

Connectivity

You’re running your application in the cloud and have somewhat spotty connectivity between the place where your application logic runs and the database where the data is stored. A coprocessor brings logic right near the database, resolving this problem.

Bulk data transfer

A difference between SQL and key/value stores is that SQL is quite happy shovelling sets of data back and forth, whereas key/value stores are all about singular key/value pairs. An SQL request readily “scales” by transferring data in bulk, whereas key/value can get bogged down by there being a zillion network round trips. A coprocessor can keep a bunch of those “round trips” inside the database layer, which will be a big win.
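
A toy model makes the round-trip arithmetic concrete. Everything here is hypothetical: `ToyKeyValueStore` is an in-memory stand-in, not the API of any real NoSQL product, and a “round trip” is just a counter.

```python
class ToyKeyValueStore:
    """In-memory stand-in for a remote key/value store (hypothetical)."""

    def __init__(self, data):
        self._data = dict(data)
        self.round_trips = 0

    def get(self, key):
        """Fetch one value: one network round trip per key."""
        self.round_trips += 1
        return self._data[key]

    def run_coprocessor(self, func, keys):
        """Run `func` next to the data: one round trip, only the result returns."""
        self.round_trips += 1
        return func(self._data[key] for key in keys)


store = ToyKeyValueStore({f"order:{n}": n * 10 for n in range(1000)})
keys = [f"order:{n}" for n in range(1000)]

client_side = sum(store.get(key) for key in keys)
trips_client_side = store.round_trips      # one trip per key

store.round_trips = 0
server_side = store.run_coprocessor(sum, keys)
trips_server_side = store.round_trips      # a single trip

print(trips_client_side, trips_server_side)
```

Both approaches compute the same total; the difference is a thousand round trips versus one, which is the “big win” described above.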

Goodbye, foreign keys, hello, um, ???

You may be able to shove some combination of logic maintenance and such into the coprocessor area, thereby gaining back some of the things lost when NoSQL eschewed SQL foreign key references and triggers.

Data normalization analysis returns

One of the typical things to do with NoSQL is to “shard” the database so each database server only has part of the data, and may operate independently of other database servers.

Coprocessor use will require that all the data that is to be used is on the local server, otherwise you head back to the problem of shovelling tuples back and forth between DB servers with the zillions of network roundtrips problem.

To guard against that, the data needs to be normalized in such a way that the data relevant to the coprocessors is available locally. (Perhaps not exclusively, but generally so. A few round trips may be OK, but not zillions.)

It seems to me that people have been excited by NoSQL in part because they could get away from all that irritating SQL normalization rules stuff. But this bit implies that this benefit was something of a mirage. Perhaps the precise rules of Boyce-Codd Normal Form are no longer crucial, but you’ll still need to have some kind of calculus to ascertain which divisions work and which don’t.

Things still not clear about this…

Managing the coprocessors

One of the challenges faced in SQL systems that use a lot of stored procedures is that of managing these procedures, complete with versioning (because what goes into production on day #1 isn’t what will be there forever, right?).

Windows always used to suffer (may still suffer, for all I know) from dependency hell, where different applications may need competing versions of libraries. (Entertainment of the week was seeing that the Haskell folks http://www.haskell.org/pipermail/haskell-cafe/2010-April/076164.html are, of late, running into this. Not intended as insult; it’s a problem that is nontrivial to avoid.)

It’s surely needful to have some kind of coprocessor dictionary to keep this sort of thing under some control. It’s never been trivial for any system, so there’s room for:

  • Repeating yesteryear’s errors
  • Learning from other systems’ mistakes
  • Discovering brand new kinds of mistakes

How rich should the coprocessor environment be?

On the powerful side, http://nodejs.org surely is neat, but having the ability to run arbitrary code there is risky…

How auditable will these systems be?

On the positive side, it’s presumably plausible to add auditing coprocessors to capture interesting information for regulatory purposes.

On the other hand, arbitrarily powerful things like node.js might make it arbitrarily easy to evade regulation.

There aren’t necessarily easy answers to that.

Aside: org2blog mode is pretty nifty…  Made it pretty easy to build this without much tagging effort…

Syndicated 2011-01-27 22:40:00 from linuxdatabases.info
