Public service announcement: signals imply reentrant code, even in Python
This is a tiny PSA prompted by my digging into a deadlock condition in the Launchpad application servers.
We were observing a small number of servers stopping cold when we did log rotation, with no particular rhyme or reason.
tl;dr: do not call any non-reentrant code from a Python signal handler. This includes the signal handler itself, queueing tools, multiprocessing, anything with locks (including RLock).
Tracking this down I found we were using an RLock from within the signal handler (via a library…) – so I filed a bug upstream: http://bugs.python.org/issue13697
Some quick background: when a signal is received by Python, the VM sets a status flag saying that signal X has been received and returns. The next chance that thread 0 gets to run bytecode (and it's always thread 0, the main thread), the signal handler in Python itself runs. For builtin handlers this is pretty safe – e.g. for SIGINT a KeyboardInterrupt is raised. For custom signal handlers, the current frame is pushed and a new stack frame is created, which is used to execute the signal handler.
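A minimal sketch of that dispatch, assuming a Unix system and Python 3. SIGUSR1 and the names on_usr1, send_signal and ran_in_main_thread are mine, purely for illustration. A worker thread sends the signal, yet the Python-level handler should still run in the main thread:

```python
import os
import signal
import threading
import time

ran_in_main_thread = []

def on_usr1(signum, frame):
    # Record whether the Python-level handler is running in the main thread.
    ran_in_main_thread.append(
        threading.current_thread() is threading.main_thread()
    )

previous = signal.signal(signal.SIGUSR1, on_usr1)

def send_signal():
    # Deliver SIGUSR1 to our own process from a non-main thread.
    os.kill(os.getpid(), signal.SIGUSR1)

worker = threading.Thread(target=send_signal)
worker.start()
worker.join()

# The handler only runs once the main thread gets back to executing
# bytecode, so poll briefly rather than assume it has already happened.
deadline = time.monotonic() + 5.0
while not ran_in_main_thread and time.monotonic() < deadline:
    time.sleep(0.01)

signal.signal(signal.SIGUSR1, previous)  # put the old handler back
print(ran_in_main_thread)
```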
Now this means that the previous frame has been interrupted without regard for your code: it might be part way through evaluating a multi-condition if statement, or between receiving the result of a function and storing it in a variable. It's just suspended.
If the code you call somehow ends up calling that suspended function (or other methods on the same object, or variations on this theme), there is no guarantee about the state of the object; it becomes very hard to reason about.
Consider, for instance, a writelines() call, which you might think is safe. If the internal implementation is ‘for line in lines: foo.write(line)’, then a signal handler which also calls writelines could have its output appear between any two of the lines being written by the interrupted call.
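Here is a small sketch of that interleaving, assuming Unix and Python 3.8 or later (for signal.raise_signal). NaiveLog is a hypothetical stand-in for a file-like object whose writelines is a plain Python loop, and the generator raises the signal itself only to simulate one arriving part way through the call:

```python
import signal

class NaiveLog:
    """Hypothetical writer whose writelines() is a plain Python loop."""

    def __init__(self):
        self.written = []

    def write(self, line):
        self.written.append(line)

    def writelines(self, lines):
        for line in lines:
            self.write(line)

log = NaiveLog()

def on_usr1(signum, frame):
    # The handler reenters writelines() on the same object.
    log.writelines(["from handler\n"])

previous = signal.signal(signal.SIGUSR1, on_usr1)

def main_lines():
    yield "main 1\n"
    # Stand-in for a signal that happens to arrive mid-writelines().
    signal.raise_signal(signal.SIGUSR1)
    yield "main 2\n"

log.writelines(main_lines())
signal.signal(signal.SIGUSR1, previous)  # put the old handler back

# The handler's line ends up in the middle of the interrupted call's output.
print(log.written)
```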
True reentrancy is a step up from multithreading in terms of nastiness, primarily because guarding against it is very hard: a non-reentrant lock around the area needing guarding will force either a deadlock, or an exception from your reentered code; a reentrant lock around it will provide no protection. Both of these things apply because the reentering occurs within the same thread – kind of like a generator but without any control or influence on what happens.
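Both lock behaviours can be seen in a short sketch, again assuming Unix and Python 3.8 or later. The names plain_lock, reentrant_lock and outcome are illustrative. The handler uses non-blocking acquires so that the example reports what would happen instead of actually hanging:

```python
import signal
import threading

plain_lock = threading.Lock()
reentrant_lock = threading.RLock()
outcome = {}

def on_usr1(signum, frame):
    # We are in the main thread, on top of the frame that holds both locks.
    # A blocking plain_lock.acquire() here could never succeed, because the
    # holder is the very code we interrupted. Probe without blocking instead.
    got_plain = plain_lock.acquire(blocking=False)
    outcome["lock"] = got_plain
    if got_plain:
        plain_lock.release()

    # The RLock sees the same owning thread and lets us straight in, so it
    # guards nothing against this kind of reentry.
    got_reentrant = reentrant_lock.acquire(blocking=False)
    outcome["rlock"] = got_reentrant
    if got_reentrant:
        reentrant_lock.release()

previous = signal.signal(signal.SIGUSR1, on_usr1)

with plain_lock, reentrant_lock:
    # Stand-in for a signal arriving while the guarded section is running.
    signal.raise_signal(signal.SIGUSR1)

signal.signal(signal.SIGUSR1, previous)  # put the old handler back
print(outcome)
```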
Safe things to do are:
- Calling code which is threadsafe and which only other threads will be calling concurrently.
- Performing ‘atomic’ (any C function is atomic as far as signal handling in Python is concerned) operations such as list.append, or ‘foo = 1’. (Note the use of a constant: anything obtained by reading can be subject to reentrancy races [unless you take care].)
In Launchpad’s case, we will be setting a flag variable unconditionally from the signal handler, and the next log write that occurs will lock out other writers, consult the flag, and if needed do a rotation, resetting the flag. Writes that happen after the rotation signal but don’t see the new flag are ok: the rotation just happens on a later write. That is the only possible race: a write to the variable not being seen by an in-progress or other-thread log write.
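A rough sketch of that pattern, not Launchpad's actual code: RotatingLog, open_stream and request_rotation are hypothetical names, and in-memory streams stand in for real log files. The handler does a single store of a constant, and all the real work happens under a lock in the normal write path. Resetting the flag before reopening is my choice here, so that a signal arriving mid-rotation asks for another rotation instead of being lost.

```python
import io
import signal
import threading

class RotatingLog:
    """Hypothetical log writer that rotates lazily on the next write."""

    def __init__(self, open_stream):
        self._open_stream = open_stream  # factory returning a fresh stream
        self._stream = open_stream()
        self._lock = threading.Lock()
        self.rotate_requested = False

    def request_rotation(self, signum=None, frame=None):
        # Signal handler: one unconditional store of a constant.
        # No locks, no reads, no calls back into the writer.
        self.rotate_requested = True

    def write(self, text):
        with self._lock:  # locks out other writer threads only
            if self.rotate_requested:
                self.rotate_requested = False
                # A real implementation would close the old file here.
                self._stream = self._open_stream()
            self._stream.write(text)

streams = []

def open_stream():
    # Stand-in for reopening the log file; keeps each stream for inspection.
    stream = io.StringIO()
    streams.append(stream)
    return stream

log = RotatingLog(open_stream)
previous = signal.signal(signal.SIGUSR1, log.request_rotation)

log.write("before\n")
signal.raise_signal(signal.SIGUSR1)  # stand-in for the rotation signal
log.write("after\n")

signal.signal(signal.SIGUSR1, previous)  # put the old handler back
print([s.getvalue() for s in streams])
```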
That is all.