7 Nov 2006 (updated 8 Nov 2006 at 02:40 UTC)
»
ncm
You were in Boston before, right? Welcome to the Bay Area...
robogato
Thank you! In response to your solving one of my long-standing
complaints
about the recentlog, I'll try to
post something interesting to it.
Load test grapher
Background: I've been slowly playing with a suite of software for
looking at performance. It's one of those problems that I think few people
look at, because even though there's real computer science involved, any
project relating to the word "test" is sneered off by most people as QA work.
(Another project I've been playing with is a monitoring system for sysadmins;
that's basically the same
way. Sysadmin tools suck.) It's possible there's a little more computer science
involved than
necessary because lately I've been wanting to do a little less software
engineering and a little more computer science...
No release, documentation, or even a name yet, but there's a little bit of
code
in my Subversion repository here.
Today's problem: I made a graph of transaction duration vs. time of a
quick Apache load test.
It's worthless. You have basically no idea how long the average
transaction
takes because the scale is blown by a few outliers. I tried a couple obvious
things to solve this:
- switching to log scale
- setting ymax to be some function of the mean and standard deviation
but wasn't happy with the result. What I really wanted was what network
people do for billing: look at the 95th percentile. The obvious way to do that
is to sort the data and pick element n/.95. But my program works with huge
data sets - this graph was generated from 1.3 million transactions.
Everything else uses (nearly) linear time and not much disk. I don't want to
sort stuff.
Off to the ACM Digital Library:
- The P^2 algorithm
for dynamic calculation of quantiles and histograms without storing
observations.
- Space-efficient
online computation of quantile summaries
These are really cool algorithms to estimate quantiles like the median or
95th
percentile. The first has no defined error bounds but uses O(1) space and
time. The second returns a value whose rank r' is within [r-εN, r
+εN] (with ε
chosen in advance). It runs in a single pass using O(1/ε log
(εN)) space.
Neither requires you to know N ahead of time.
I implemented the first one. A really simple test:
def testShuffledUniformSequence(self):
r = random.Random()
e = QuantileEstimator(0.5)
numbers = range(100)
r.shuffle(numbers)
for i in numbers:
e.append(i)
self.assertTrue(40 < e.get() < 60)
Though there aren't well-defined error bounds, it seems to do quite well
on my data set. That range of 40 to 60 in the test case is crazy liberal; I just
wanted to catch my blatant implementation errors.
I fed that into my graph tool to bound the graph at five times the 95%.
Much better.
Now I have a general-purpose statistics class I can use all over the
place. Maybe I'll plug it into Axamol SQL Library's SQL statement timing to get
the median and 95%.