On selecting good timeouts
Timeouts are a common design choice or implementation detail in any computer system, but they are particularly popular in High-Availability clusters (such as those built with the SUSE Linux High-Availability Extension and other stacks that are similarly based on corosync and pacemaker).
They seem to be a straightforward way to detect faults: if a task does not complete within N seconds, it is considered failed, and recovery is attempted. (The task could be anything from a network messaging protocol to a database starting under the cluster's control, any IO, or a number of other cases.)
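As a minimal sketch of this detection scheme, the following Python snippet runs a monitor command and classifies the result. The function name `run_monitor` and the commands used are hypothetical illustrations; this is not how pacemaker itself invokes resource agents.

```python
import subprocess
import sys


def run_monitor(command, timeout_s):
    """Run a monitor command and report "ok", "failed", or "timed out"."""
    try:
        result = subprocess.run(command, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        # The command neither succeeded nor failed within timeout_s seconds;
        # subprocess.run has already killed the child process at this point.
        return "timed out"
    return "ok" if result.returncode == 0 else "failed"


# A stand-in for a hung task: it sleeps far longer than the timeout allows.
hung_task = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_monitor(hung_task, timeout_s=0.5))
```

Note that "timed out" only says the task did not finish in time; it cannot tell a hung task from one that is merely slow, which is the crux of the discussion that follows.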
However, selecting a good value for a timeout is less straightforward than it may seem; more often than not, the chosen values are much too short. This seems to stem from the belief that a fast response to failures is unconditionally a good thing: the system will perform better if timeouts are shorter. This is not quite true, though.
To illustrate, assume two scenarios:
- First, that the system has failed in such a way that it will not respond with a failed response to a monitor task immediately, but instead runs indefinitely unless aborted by the timeout.
- Second, that the system is operating fine, but experiencing a brief period of stress, where responses are delayed, just to the edge of the timeout value.
Now, let us explore the impact of a timeout that is one second "too long"; and then, one that is one second "too short".
For a too long timeout, the failure in the first scenario is detected one second later, adding one second to the recovery time. In the second scenario, no timeout occurs, and the system continues as normal.
For the too short timeout, the first scenario is recovered one second faster; the second scenario causes an unnecessary recovery, probably incurring a real service outage in the attempt to restart the application, or at least a brief period without service!
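The asymmetry can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that a restart costs 60 seconds of service and that the stressed system answers after 30 seconds; neither number comes from a real cluster.

```python
RESTART_OUTAGE_S = 60.0  # assumed cost of restarting the application


def outage_on_hang(timeout_s):
    """Scenario one: a real failure is only detected when the timeout fires."""
    return timeout_s + RESTART_OUTAGE_S


def outage_on_slow_response(timeout_s, response_s):
    """Scenario two: a healthy but slow system is only restarted if the timeout fires first."""
    return RESTART_OUTAGE_S if response_s > timeout_s else 0.0


edge_s = 30.0  # hypothetical response time under stress
for label, timeout_s in (("too long", edge_s + 1.0), ("too short", edge_s - 1.0)):
    hang = outage_on_hang(timeout_s)
    slow = outage_on_slow_response(timeout_s, edge_s)
    print(f"{label}: hang costs {hang} s, slow response costs {slow} s")
```

Under these assumptions, the shorter timeout saves two seconds in the first scenario (89 versus 91 seconds) but costs a full 60 seconds in the second, where the longer timeout costs nothing.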
Another problem arises from how timeouts are often chosen. Of course, if they were obviously too short, administrators would notice immediately, since their system would never get off the ground at all and would start spewing errors right away. Instead, the timeouts are usually adequate for the tested scenario (note that you can use the pacemaker monitoring tools to look at the actual runtime of operations). If your test load exceeds the load of your live system, raise your hand; more often than not, it does not.
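One way to turn observed runtimes into a timeout with headroom is sketched below. The function `suggest_timeout`, the safety factor of 5, and the 20-second floor are all assumptions chosen for illustration, not recommended values; the point is only that the timeout should sit well above the worst runtime seen in testing.

```python
import math


def suggest_timeout(observed_runtimes_s, safety_factor=5.0, minimum_s=20.0):
    """Derive a timeout from measured runtimes, leaving generous headroom.

    The worst observed runtime is multiplied by safety_factor (an assumed
    margin for production load exceeding test load), rounded up to a whole
    second, and never allowed to drop below minimum_s.
    """
    if not observed_runtimes_s:
        raise ValueError("need at least one observed runtime")
    worst = max(observed_runtimes_s)
    return float(max(minimum_s, math.ceil(worst * safety_factor)))


# Hypothetical monitor runtimes (in seconds) collected during testing.
test_runtimes_s = [1.0, 1.5, 2.0, 6.5]
print(suggest_timeout(test_runtimes_s))  # 33.0
```

Basing the value on the worst case rather than the average reflects the argument above: the runtimes that matter are the ones seen under stress, and those are usually missing from test data.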
Under stress or peak load, system response times tend to degrade disproportionately rather than linearly; the system will not just slow down by ten percent, but by thirty. If this scenario gets treated as a failure, the likelihood that the fail-over system will experience the same level of stress is high. Worse, requests may have queued up, and if the system did not shut down cleanly (due to the stress, remember), an application-internal recovery phase will compound the effect.
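To illustrate how quickly response times can grow under load, the sketch below uses the textbook single-server queueing model (M/M/1), where the mean response time is the service time divided by (1 - utilization). This model is an assumption introduced here for illustration; real applications will behave differently in detail, but the general shape is similar.

```python
def mean_response_time(service_time_s, utilization):
    """Mean response time of an M/M/1 queue: service_time / (1 - utilization)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time_s / (1.0 - utilization)


# Hypothetical request that takes one second on an otherwise idle system.
for utilization in (0.50, 0.80, 0.90, 0.95):
    response_s = mean_response_time(1.0, utilization)
    print(f"utilization {utilization:.0%}: about {response_s:.1f} s per request")
```

In this model, going from 90 to 95 percent utilization roughly doubles the response time, so a timeout that looks comfortable at normal load can be exceeded by a fairly small increase in load.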
Monitoring application performance for load distribution is quite a different task from monitoring application correctness. The former is important, and a performance degradation may also imply a violation of service level agreements; however, initiating recovery through a restart is unlikely to alleviate the problem. (In a pacemaker cluster, this would best be monitored externally and fed into the utilization constraints of the resources and nodes.)
In summary, a timeout that is too short is the worse choice; it is safer to make hard timeouts large enough beyond reasonable doubt. Yes, this will slow down fail-over and recovery slightly, but at least it will not cause them by mistake.
(For a rather excellent and exhaustive treatment of this subject matter, see K. Wolter, “Stochastic Models for Fault Tolerance: Restart, Rejuvenation and Checkpointing,” Habilitation Thesis, Humboldt-University, 2007.)