Better automated detection of Firefox performance regressions
Last spring I spent some of my spare time improving the automated script that detects regressions in Talos and other Firefox performance data. I’m finally writing up some of that work in case it’s useful or interesting to anyone else.
Talos is a system for running performance benchmarks; we use it to run a suite of benchmarks every time a change is pushed to the Firefox source repository. The Talos test harness reports these results to the graph server which stores them and can plot the recorded data to show how it changes over time.
Like most performance measurements, Talos benchmarks can be noisy. We need to use statistics to separate signal from noise. To determine whether a change to the source code caused a change in the benchmark results, an automated script takes multiple datapoints from before and after each push. It computes the average and standard deviation of the “before” datapoints and the “after” datapoints, and uses a Student’s t-test to estimate the likelihood that the datasets are significantly different. If the t-test exceeds a certain threshold, the script sends email to the author(s) of the responsible patches and to the dev-tree-management mailing list.
By nature, these statistical estimates can never be 100% certain. However, we need to make sure that they are correct as often as possible. False negatives mean that real regressions go undetected. But false positives will generate extra work, and may cause developers to ignore future regression alerts. I started inspecting graph server data and regression alerts by hand, recording and investigating any false negatives or false positives I found, and filed bugs to fix the causes of those errors.
Some of these were straightforward implementation bugs, like one where an infinite t-test score (near certain likelihood of regression) was treated as a zero score (no regression at all). Others involved tuning the number of datapoints and the threshold for sending alerts.
Some fixes required more involved changes to the analysis. For example, if one code change actually caused a regression, the pushes right before or after that change will also appear somewhat likely to be responsible for the regression (because they will also have large differences in average score between their “before” and “after” windows). If multiple pushes in a row had t-test scores over the threshold, the script used to send an alert for the first of those pushes, even if it was not the most likely culprit. Now the script blames the push with the highest t-test score, which is almost always the guilty party. This change had the biggest impact in reducing incorrect regression alerts.
After those changes, there was still one common cause of false alarms that I found. The regression analyzer compares the 12 datapoints before each push to the 12 datapoints after it. But these 12-point moving averages could change suddenly not just at the point where a regression happened, but also at an unrelated point that happens to be 12 pushes earlier or later. This caused spooky action at a distance where a regression in one push would cause a false alarm in a completely different push. To fix this, we now compute weighted averages with “triangular” weighting functions that give more weight to the point being analyzed, and fall off gradually with increasing distance from that point. This smooths out changes at the opposite ends of the moving windows.
There are still occasional errors in regression detection, but as far as I can tell most of them are caused by genuinely misleading random noise or bimodal data. If you see any problems with regression emails, please file a bug (and CC :mbrubeck) and we’ll take a look at it.