Older blog entries for robertc (starting at number 164)

minimising downtime for schema changes with postgresql

Two years ago Launchpad did schema changes once a month. Everyone would cross their fingers and hope while the system administrators took all the application servers offline, patched the database with a month's worth of work and brought up the servers again running the new QA'd codebase.

This had two problems:

  1. Due to the complexity of the system – something like 300 processes have to be stopped or inhibited to take everything offline – the downtime was often about 90 minutes, irrespective of how long the schema patch itself took. [Some of the processes don't like being interrupted at all.]
  2. We simply could not deliver any change in less than a week, and the average latency for something that jumped all the queues was still two weeks.

About a year ago we wanted to increase the rate at which schema changes could be carried out – the efforts to speed Launchpad up had consumed most of the low hanging fruit, and more and more schema patches were required. We didn't want to introduce additional 90 minute downtime windows though. Adopting incremental migrations – the sort of change process described in various places on the internet – seemed like a good way to apply schema changes without the slow shutdown-and-restart step, which was only needed because the pre-patch codebase couldn't speak to the new schema. We could optimise each patch to be very fast by avoiding anything that causes a full table scan or table rewrite (such as adding indices, or adding columns with a non-NULL default value). That would let us avoid the 90 minutes of downtime caused by stopping and restarting everything.

However, that wasn't sufficient – the reason Launchpad ended up doing monthly downtime is that previous attempts to do more frequent schema changes had too high a failure rate. A key reason patch deployment time blew out when everything wasn't shut down is that Launchpad is a very busy system – with the use of Slony, schema changes require an exclusive lock on all tables. [More recent versions of Slony only lock some tables, but it still requires very widespread locks for most DDL operations.] We're doing nearly 10 thousand transactions per minute, so at any point in time there are locks open on some table in the system: it was highly improbable and effectively impossible for slonik to get an exclusive lock on all tables in a reasonable timeframe. Background tasks that take many minutes to complete exacerbate this – we can't just block new transactions long enough to deliver all the in-flight web pages and let locks clear that way.

PGBouncer turns out to be an ideal tool here. If you route all your connections through PGBouncer, you have a single point you can deliberately interrupt to clear all database locks in a second or so (it takes a moment for all the backends to notice that their clients have gone).

So we combined these things to get what we called ‘Fast Down Time’ or FDT.  We set the following rules for developers:

  1. Any schema patch had to complete in <= 15 seconds in our schema staging environment (which has a full copy of the production DB), or we’d roll it back and redesign.
  2. Any patch could change either code or schema, never both. Schema patches were to land on a separate branch and would be promoted to trunk only after deployment. That branch also receives automated merges from trunk after every commit, so it's always running the latest code.

This meant that we could be confident in QA: we would QA the new schema and the patch application process against the current live code (we deploy trunk multiple times a day). We published some documentation about how to write fast schema patches to help socialise the approach.
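
To make that concrete, here's a small sketch (not the actual Launchpad patch tooling; the table, column and index names are illustrative) contrasting a fast patch with the kind of operation we had to keep out of the patch window:

# Sketch only: contrasting a fast, catalog-only change with operations that
# scan or rewrite the table and so blow the 15 second budget.
import psycopg2

conn = psycopg2.connect("dbname=launchpad_staging")  # hypothetical DSN
cur = conn.cursor()

# Fast: a nullable column with no default is a catalog-only change, so it
# completes in milliseconds regardless of table size.
cur.execute("ALTER TABLE bugtask ADD COLUMN importance_explanation text")

# Slow (keep out of the patch window): a non-NULL default rewrites every row,
# and a plain CREATE INDEX scans and locks the whole table (use
# CREATE INDEX CONCURRENTLY as a separate live step instead).
# cur.execute(
#     "ALTER TABLE bugtask ADD COLUMN priority integer NOT NULL DEFAULT 0")
# cur.execute("CREATE INDEX bugtask__priority__idx ON bugtask (priority)")

conn.commit()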

Then we wrote an automated tool that would:

  1. Check for known fragile processes and abort if any were found.
  2. Check for very long transactions and abort if any were found.
  3. Shut down pgbouncer, disconnecting all clients instantly.
  4. Use slonik to apply one or more schema patches.
  5. Start pgbouncer back up again.

The code for this (call it FDTv1) is in the Launchpad source code history – it's pretty entangled but it's there for grabbing if you need it. Read on to see why it's only available in the history :)
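
In outline the flow was roughly the following – a simplified sketch with illustrative names and commands, not the real tool (the real FDTv1 does considerably more checking, including looking for known-fragile processes before proceeding):

# Simplified sketch of the FDT flow; names and commands are illustrative.
import subprocess
import sys

import psycopg2

def long_transactions(conn, limit_seconds=15):
    # Anything holding a transaction open this long would block the DDL locks.
    cur = conn.cursor()
    cur.execute(
        "SELECT procpid, now() - xact_start FROM pg_stat_activity "
        "WHERE xact_start < now() - interval '%d seconds'" % limit_seconds)
    return cur.fetchall()

def apply_patches(patch_scripts):
    conn = psycopg2.connect("dbname=launchpad_prod")  # hypothetical DSN
    if long_transactions(conn):
        sys.exit("long running transactions found - aborting")
    conn.close()
    # Disconnect every client at once by stopping the connection pooler.
    subprocess.check_call(["sudo", "service", "pgbouncer", "stop"])
    try:
        for script in patch_scripts:
            # slonik applies each DDL script as a replicated event.
            subprocess.check_call(["slonik", script])
    finally:
        subprocess.check_call(["sudo", "service", "pgbouncer", "start"])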

The result was wonderful – we were immediately able to deploy schema changes with <= 90 seconds of downtime, which was significantly less than the 5 minutes our stakeholders had agreed to as a benchmark – if we were under 5 minutes, we could schedule downtime once a day rather than once a month. We had to fix some API client code to retry more reliably, and likewise fix a few minor bugs in the database connection handling logic in the appservers, but all in all it was a pretty smooth project. Along the way we spun off a small python helper to run and control pgbouncer, which let us write effective tests for the connection handling code paths.

This gave us the following workflow for making schema changes:

  1. Land and deploy an incremental schema change.
  2. Land and deploy any indices that need to be added – these are deployed live using CREATE INDEX CONCURRENTLY.
  3. Land and deploy code changes to populate any additional fields/tables from both the application servers and from cron – a bulk backfill that does many small transactions while walking over the entire dataset that needs to be updated or populated.
  4. Land and deploy code changes to drop references to the old schema, whatever it was.
  5. Land and deploy an incremental schema change to finalise the change – such as making a new column NOT NULL once the backfill is complete.

This looks long and unwieldy but it's worth noting that it's actually just repeated applications of a smaller primitive:

  1. Make a schema change that is fast and compatible with existing code.
  2. Change code to take advantage of the changed schema.

Pretty much any change that is desired can be done using this single primitive.
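
For example, the bulk backfill from step 3 of the workflow above is just a loop of small transactions. A sketch, with illustrative table and column names:

# Sketch of an incremental backfill: many small transactions walking the
# whole dataset, so no single statement holds locks for long.
import psycopg2

conn = psycopg2.connect("dbname=launchpad_prod")  # hypothetical DSN
cur = conn.cursor()
batch_size = 1000
while True:
    cur.execute(
        "UPDATE bugtask SET importance_explanation = '' "
        "WHERE id IN (SELECT id FROM bugtask "
        "             WHERE importance_explanation IS NULL LIMIT %s)",
        (batch_size,))
    conn.commit()          # keep each transaction short
    if cur.rowcount == 0:  # nothing left to backfill
        break
# Once the backfill is complete, a later fast patch can finalise it, e.g.:
#   ALTER TABLE bugtask ALTER COLUMN importance_explanation SET NOT NULL;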

We wanted to go further though – the multiple stages required for complex migrations became a burden with only one change a day. Fortunately PostgreSQL now includes its own replication engine, which ships the write-ahead log (WAL) rather than installing triggers on all tables the way Slony does.

Stuart, our intrepid DBA, migrated Launchpad to PostgreSQL 9.1, updated the FDT tool to work with native replication, and migrated Launchpad off of Slony. The result is again wonderful – the overhead in doing a schema patch, with all the protection I described above, is now ~5 seconds. We can do incremental changes in less time than it takes your browser to figure out that a given server is offline. We're now negotiating with the Launchpad stakeholders to get multiple downtime windows each day, with this almost unnoticeable, super reliable process in place.

Reliability wise, FDT has been superb. We've had 2 failures: one where we believe we encountered a bug in Slony when dropping two tables at once, and one where we landed a patch that worked on staging but led to lock contention in production – so the patch applied, but the system was very unhealthy after that until we fixed it. That's after doing approximately 60 patches over a one-year period.

We’re partway through extracting the patching logic from Launchpad’s code base into a reusable tool, but the basic principles will apply to any PostgreSQL environment.


Syndicated 2012-08-13 07:17:38 from Code happens

Reprap driver pinouts

This is largely a memo-to-my-future self, but it may save some time for someone else facing what I was last weekend.

I’ve been putting together a Reprap recently, seeded by the purchase of a partially assembled one from someone local who was leaving town and didn’t want to take it with them.

One of the issues it had was that 2 of the stepstick driver boards it uses were burnt out, and in NZ there are no local suppliers – that I could find. There is however a supplier of Easydriver driver boards, which are apparently compatible. (The Reprap electronics is a Sanguinololu, which has a fitted strip that exactly matches stepstick (or pololu) driver boards.) The Easydrivers are not physically compatible, but they should be pin compatible… no?

I mapped across all the pins carefully, and the only issues were: there are three GNDs on the Easydriver vs 2 on the stepstick, and the PFD pin isn't exposed on the stepstick board so it can't be mapped across.

I ended up with this mapping (I'm not sure where pin 1 is *meant* to be on the stepstick, so I'm calling VMOT pin 1 – the anti-clockwise corner pin on the same side as the 2B/2A/1A/1B pins, when looking down on an installed board – and going clockwise from there).

Stepstick – Easydriver

VMOT – M+
GND – GND
2B – B2
2A – A2
1A – A1
1B – B1
VDD – +5V
GND – GND
Dir – Dir
Step – Step
Slp – Slp
Rst – Rst
Ms3 – Nothing
Ms2 – Ms2
Ms1 – Ms1
En – Enable

But, when I tried to use this, the motor just jammed up solid.

A bit of debugging and trial and error later, I figured it out. The right mapping for the motor pins:

2B – B2
2A – B1
1A – A1
1B – A2

That's right, the two boards have chosen opposite conventions for labelling the motor coil pins – on the stepstick, 1/2 refers to the coil and A/B to the two ends that need to have voltage put across them; on the Easydriver, A/B refers to the coil and 1/2 to the two ends…

Super confusing, especially as I haven’t been doing much electronics for oh, a decade or so.

I’m reminded very strongly of Rusty’s scale of interface usability here.


Syndicated 2012-07-07 04:07:51 from Code happens

Running juju against a private openstack instance.

My laptop has somewhat less than half the grunt of my desktop at home, but I prefer to work on it as I can go sit in the sun etc – very hard to do that with a mini tower case :)

However, running everything through ssh to another machine makes editing and iterating more clumsy; I need to do agent forwarding etc – not terribly hard, but not free either – and particularly when I travel, I need to remember to sync my source trees back to my laptop. So I prefer to live on my laptop and use my desktop for compute power.

I had a couple of Juju charms I wanted to investigate, but they needed enough compute power to make my laptop really quite warm – so I thought it was time to update my local cloud provider from Eucalyptus to Openstack. This was easy enough, until I came to run Juju. Turns out that Juju's commands really want to talk to the public DNS name of the instance (in order to SSH tunnel a connection to Zookeeper).

But! Openstack returns DNS names like 'Server-3', and if you think about a home network, it's fairly rare to have a local DNS server *anyway*, so putting a suffix on names like that won't help at all: you either need to use a DNS naming provider (Openstack ships with an LDAP provider, which adds even more complexity) and configure your clients to know how to find it, or you need to use the public IP addresses (which default to the FlatNetwork, which is routable within a home LAN by simply adding a route to 10.0.0.0/8 to your wifi interface). Adding to the confusion, some wifi routers fail to forward avahi messages, which is a) terrible and b) breaks the only obvious way of doing no-config local DNS :(

So, I did some yak shaving this morning. Turns out other folk have already run into this and filed a Juju bug and a supporting txaws bug. The txaws bug was fixed, but just missed the release of Precise. Clint Byrum is going to SRU it this week though, so we’ll have it soon. I’ve put a patch up to address the Juju side, which is now pending review. Running the two together works very happily for me. \o/


Syndicated 2012-06-24 23:24:54 from Code happens

Less SPOFs: pyjunitxml, testscenarios

I’ve made the Testtools committers team own both the project and the trunk branch for both pyjunitxml and testscenarios. This removes me as a SPOF if anything needs doing in those projects – any Testtools committer can now do it. (Including code review and landing). If you are a testtools committer and need PyPI release rights, ping me and I’ll add you. (I wish PyPI had group management).


Syndicated 2012-04-24 05:00:12 from Code happens

Reading list

I've recently caught up on a bunch of reading, some of which is worth commending.

  • Switch – documents the factors that cause changes to fail (both in organisations and in personal life), and provides a recipe for ensuring you have addressed those factors in any change you are planning.
  • The Lean Startup – applies Lean principles to learning what customers respond well to – in the same way that Lean removes waste from the process of building some X, this removes waste from the process of determining what that X should be.
  • The Innovator's Solution – pop science report of research done on why disruptive innovation at existing companies fails; covers structure, management, funding, market analysis, and has recommendations to remove these sure-fail cases.
  • The Innovator's DNA – pop science report of research done into how people innovate: turns out that there are a lot of things one can do to be a better innovator.

Read them all, or none. I enjoyed them all.


Syndicated 2012-04-22 03:15:13 from Code happens

Public service announcement: signals implies reentrant code even in Python

This is a tiny PSA prompted by my digging into a deadlock condition in the Launchpad application servers.

We were observing a small number of servers stopping cold when we did log rotation, with no particular rhyme or reason.

tl;dr: do not call any non-reentrant code from a Python signal handler. This includes the signal handler itself, queueing tools, multiprocessing, anything with locks (including RLock).

Tracking this down I found we were using an RLock from within the signal handler (via a library…) – so I filed a bug upstream: http://bugs.python.org/issue13697

Some quick background: when a signal is received by Python, the VM sets a status flag saying that signal X has been received and returns. The next chance that thread 0 gets to run bytecode (and it's always thread 0), the signal handler in Python itself runs. For builtin handlers this is pretty safe – e.g. for SIGINT a KeyboardInterrupt is raised. For custom signal handlers, the current frame is pushed and a new stack frame created, which is used to execute the signal handler.

Now this means that the previous frame has been interrupted without regard for your code: it might be part way through evaluating a multi-condition if statement, or between receiving the result of a function and storing it in a variable. It's just suspended.

If the code you call somehow ends up calling that suspended function (or other methods on the same object, or variations on this theme), there is no guarantee about the state of the object; it becomes very hard to reason about.

Consider, for instance, a writelines() call, which you might think is safe. If the internal implementation is 'for line in lines: foo.write(line)', then a signal handler which also calls writelines could have its output appear between any two of the lines written by the interrupted call.
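
Here's a contrived sketch of that hazard; the pure-Python writelines stands in for any non-reentrant code:

# Contrived sketch: a signal handler reenters a pure-Python writelines-alike,
# so its output can land between any two lines of the interrupted call.
import signal
import sys

def writelines(stream, lines):
    for line in lines:          # the handler can run between any two writes
        stream.write(line)

def handler(signum, frame):
    writelines(sys.stdout, ["<interrupting line>\n"])

signal.signal(signal.SIGALRM, handler)
signal.setitimer(signal.ITIMER_REAL, 0.001, 0.001)  # fire every millisecond
writelines(sys.stdout, ["line %d\n" % i for i in range(100000)])
signal.setitimer(signal.ITIMER_REAL, 0)             # cancel the timer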

True reentrancy is a step up from multithreading in terms of nastiness, primarily because guarding against it is very hard: a non-reentrant lock around the area needing guarding will force either a deadlock or an exception from your reentered code; a reentrant lock around it will provide no protection. Both of these things apply because the reentering occurs within the same thread – kind of like a generator, but without any control or influence on what happens.

Safe things to do are:

  • Calling code which is threadsafe, where only other threads will be calling it concurrently.
  • Performing 'atomic' (any C function is atomic as far as signal handling in Python is concerned) operations such as list.append, or 'foo = 1'. (Note the use of a constant: anything obtained by reading can be subject to reentrancy races [unless you take care :) ])

In Launchpad's case, we will be setting a flag variable unconditionally from the signal handler, and the next log write that occurs will lock out other writers, consult the flag, and if needed do a rotation, resetting the flag. Writes after the rotation signal which don't see the new flag would be ok. This is the only possible race – a write to the flag that isn't yet seen by an in-progress or other-thread log write.
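
In sketch form, the shape of that fix looks something like this (not the actual Launchpad logging code; the path and names are illustrative):

# Sketch: the handler only assigns a flag; the rotation itself happens in the
# normal, locked write path.
import signal
import threading

_rotate_requested = False
_write_lock = threading.Lock()
_logfile = open("/tmp/example.log", "a")   # hypothetical log file

def handle_sighup(signum, frame):
    global _rotate_requested
    _rotate_requested = True               # plain assignment: safe in a handler

def write_log(line):
    global _rotate_requested, _logfile
    with _write_lock:                      # only ever taken outside the handler
        if _rotate_requested:
            _rotate_requested = False
            _logfile.close()               # "rotate": close and reopen the file
            _logfile = open("/tmp/example.log", "a")
        _logfile.write(line)

signal.signal(signal.SIGHUP, handle_sighup)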

That is all.


Syndicated 2012-01-06 04:38:06 from Code happens

dmraid (fakeraid) mirror + striped

While some folk look down on fakeraid (that is BIOS based RAID-until-OS-takes-over) solutions, I think they are pretty neat: they let a user get many of the benefits of dedicated controller cards at a fraction of the cost. The benefits include the usual ones for RAID – more spindles to handle IO, tolerance of disk failures. And unlike pure LVM solutions, you can boot from a degraded RAID 1 / 5 / 10 set because the BIOS knows how.

In some ways this is better than dedicated cards, because we have the software take over, so we can change the algorithms for IO dispatch all the way down to the individual devices :)

However, these RAID volumes are in a pretty awkward spot for installers and bootloaders: inside a running Linux environment they look like software RAID which cannot be depended on for booting, but at boot time they look like hard disks which cannot be looked under the hood.

I recently got a new desktop machine which has one of these motherboards, and fortuitously my old desktop I was replacing had the same size disks – so I had 4 disks and the option of using a RAID setup. Apparently I'm a sucker for punishment, because I went for a RAID 10: that is, two RAID volumes each made up of a two-disk mirror (the RAID 1 component), with those two volumes then combined via striping (the RAID 0 component). This has the potential for pretty nice performance: in principle any read can come from one of 2 disks, and every 64KB (the stripe size) of linear data will switch to the other mirror set, giving a nice boost. Writes need to write to 2 disks always, but every 64KB worth of data will alternate mirror sets, also giving a boost.
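
As a back-of-the-envelope sketch of that mapping (real dmraid layouts may differ in detail, e.g. where the data area starts):

# Which mirror pair serves a given byte offset, with 64KB stripes over two
# two-disk mirrors (plain RAID 0 over RAID 1).
STRIPE = 64 * 1024

def mirror_set_for(offset):
    return (offset // STRIPE) % 2

for kb in (0, 64, 128, 200):
    print("offset %dKB -> mirror set %d" % (kb, mirror_set_for(kb * 1024)))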

Sadly we (Ubuntu) aren't ready for this yet: there are two key bugs that make this layout almost impossible to install into. This blog post is for my exo-memory – I want to be able to figure out what I did next time around :)

Firstly parted_devices, a helper used by Ubiquity and debian-installer to determine which block devices are actually disk drives that one can partition and install onto, has a confused heuristic – when dealing with dmraid it looks for devices which are not layered on other dmraid devices. This handily excludes partitions, but has the undesirable effect of excluding that striped device – because it is layered on the two mirrored devices. Bug 560748 was filed about that, and I've added a workaround to it – basically disabling the filtering, so it's not suitable as a long term fix, but it will let one select the RAID volume correctly.

Secondly, grub2, which needs to figure out what the name of the RAID volume will be at boot time, currently gets confused. I don't know enough to really explain – and be correct in my explanation – but I do have a fugly patch which worked for me. Bug 803658 tracks this defect. The basic approach I took was to say that dmraid devices should be an abstraction layer we don't peek under: if it claims to be a disk, well then it's a disk. As grub does actually work that way – it talks to INT 13h – the BIOS support for booting off of the RAID volume is entirely sufficient.

Sadly neither bug is at the point where the patches can be rolled into Ubuntu itself, but the workaround should let folk get up and running.

In both cases, build the package locally in the installer, install it, then after that run ubiquity and things should install.

After the install, you will need to reapply the patch in the resulting installed environment, or things like update-grub will die on you!

(huge thanks to cjwatson and ev for giving me some tips while I investigated this)


Syndicated 2011-06-30 01:28:42 from Code happens

justworks-hardware-vendors

Ok, so micro rant time: this is the effect of not taking things upstream: hardware doesn’t work Out Of The Box.

Very briefly, I purchased a Vodafone prepaid mobile broadband package today, which comes with a modem and SIM. The modem is a K3571-Z, and Ubuntu *thinks* it knows how they work (it doesn’t). So it fails to connect in NetworkManager with a rather opaque ‘NO CARRIER’ message.

Thanks to excellent assistance from Matt Trudel, we tracked this down to a theory that perhaps modemmanager is using the wrong serial port – and voila, it is. From there, the config file (/lib/udev/rules.d/77-mm-zte-port-types.rules) was an obvious next step – and indeed there is no entry in there for the 19d2:1010 – the K3571-Z. Google found one immediately though, on a Vodafone research site.

The awful shame is this: that was committed to the bcm project in March this year. If Vodafone had shipped off a patch to modemmanager, we could have had that in 10.10, and possibly even in 10.04. There are plenty of users having trouble on Whirlpool etc with this model who would have had a better experience – helping Vodafone’s users be happier.

All it would have taken is an email :(

I'm sure Vodafone want a great experience for their users, but I think they're failing to separate out platform improvements – which should be share and share alike – from branding and custom facilities. The net impact is harmful, not helpful.

Anyhow, Natty will support this modem.


Syndicated 2010-12-02 05:48:27 from Code happens

testrepository iteration for python projects

Testrepository has a really nice workflow for fixing a set of failing tests:

  1. Tell it about the failing tests (e.g. by doing a full test run, or running a single known failing test)
  2. Run just the known failing tests (testr run --failing)
  3. Make a change
  4. Goto step 2

As you fix up the tests, testr will just give your test runner a smaller and smaller list of tests to run.

However I haven’t been able to use that feature when developing (most) Python programs.

Today though, I added the necessary support to testtools, and as a result subunit (which inherits its thin test runner shim from testtools) now supports --load-list. With this a simple .testr.conf can support this lovely workflow. This is the one used in testrepository itself: it runs the testrepository tests, which are regular unittest tests, using subunit.run – this gives it subunit output, and tells testrepository how to run a subset of tests.

[DEFAULT]
test_command=python -m subunit.run $IDOPTION testrepository.tests.test_suite
test_id_option=--load-list $IDFILE


Syndicated 2010-11-30 06:14:00 from Code happens

Maintainable pyunit test suites – fixtures

So a while back I blogged about maintainable test suites. One of the things I’ve been doing since is fiddling with the heart of the fixtures concept.

To refresh your memory, I'm defining a fixture as some basic state you want to reach as part of doing a test. For instance, when you've mocked out 2 system calls in preparation for some test code – that represents a state you want to reach. When you've loaded sample data into a database before running the actual code you want to make assertions about – that also represents a state you want to reach. So does simply combining three or four objects so you can run some code.

Now, there are existing frameworks in python for this sort of thing. testresources and testscenarios both go some way towards this (and I am to blame for them :) ), so does the zope testrunner with layers, and the testfixtures project has some lovely stuff as well. And this is without even mentioning py.test!

There are a few things that you need from the point of view of running a test and establishing this state:

  • You need to be able to describe the state (e.g. using python code) that you wish to achieve.
  • The test framework needs to be able to put that state into place when running the test. (And not before because that might interfere with other tests)
  • And the state needs to be able to be cleaned up.

Large test suites or test suites dealing with various sorts of external facilities will also often want to optimise this process and put the same state into place for many tests. The (and I’m not exaggerating) terrible setUpClass and setUpModule and other similar helpers are often abused for this.

Why are they terrible? They are terrible because they are fragile; there is no (defined in the contract) way to check that the state is valid for the next test, and it's common to see false passes and false failures in tests using setUpClass and similar.

So we also need some way to reuse such expensive things while still having a way to check that test isolation hasn’t been compromised.

Having looked around, I've come to the conclusion we'll all benefit if there is a single core protocol for doing these things, something that can be used and built on in many different ways for many different purposes. There was nothing (that I found) that actually met all these requirements and was also tasteful enough that folk might really like using it.

I give you 'fixtures'. Or on Launchpad. This small API is intended to be a common contract that all sorts of different higher level test libraries can build on. As such it has little to no policy or syntactic sugar.

It does have a nice core, integration with pyunit.TestCase, and I'm going to add a library of useful generic fixtures (like temporary directories, environment isolators and so on) to it. I'd be delighted to add more committers to the project, and intend to have it be both Python 2.x and 3.x compatible (if it's not already – my CI machine isn't back online after the move yet, and I'm short of round tuits).
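
As a taste of what such a generic fixture might look like (a sketch only; the eventual library version may differ in name and detail):

# Sketch of a generic temporary-directory fixture.
import shutil
import tempfile

import fixtures

class TempDir(fixtures.Fixture):

    def setUp(self):
        super(TempDir, self).setUp()
        self.path = tempfile.mkdtemp()
        self.addCleanup(shutil.rmtree, self.path)

A test that mixes in fixtures.TestWithFixtures can then call self.useFixture(TempDir()) and read the path attribute.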

Now, if you’re writing some code like:

class MyTest(TestCase):
    def setUp(self):
        foo = Foo()
        bar = Bar()
        self.quux = Quux(foo, bar)
        self.addCleanup(self.quux.done)

You can make it reusable across your code base simply by moving it into a fixture like this:

class QuuxFixture(fixtures.Fixture):
    def setUp(self):
        super(QuuxFixture, self).setUp()
        foo = Foo()
        bar = Bar()
        self.quux = Quux(foo, bar)
        self.addCleanup(self.quux.done)

class MyTest(TestCase, fixtures.TestWithFixtures):
    def setUp(self):
        self.useFixture(QuuxFixture())

I do hope that the major frameworks (nose, py.test, unittest2, twisted) will include the useFixture glue themselves shortly; I will offer it as a patch to the code after giving it some time to settle. Further possibilities include declared fixtures for tests, and we should be able to make setUpClass better by letting fixtures installed during it get reset between tests.


Syndicated 2010-09-18 06:48:23 from Code happens
