#### 7 Nov 2006 slamb»(Journeyer)

ncm

You were in Boston before, right? Welcome to the Bay Area...

Thank you! In response to your solving one of my long-standing complaints about the recentlog, I'll try to post something interesting to it.

Background: I've been slowly playing with a suite of software for looking at performance. It's one of those problems that I think few people look at, because even though there's real computer science involved, any project relating to the word "test" is sneered off by most people as QA work. (Another project I've been playing with is a monitoring system for sysadmins; that's basically the same way. Sysadmin tools suck.) It's possible there's a little more computer science involved than necessary because lately I've been wanting to do a little less software engineering and a little more computer science...

No release, documentation, or even a name yet, but there's a little bit of code in my Subversion repository here.

Today's problem: I made a graph of transaction duration vs. time of a quick Apache load test.

It's worthless. You have basically no idea how long the average transaction takes because the scale is blown by a few outliers. I tried a couple obvious things to solve this:

• switching to log scale
• setting ymax to be some function of the mean and standard deviation

but wasn't happy with the result. What I really wanted was what network people do for billing: look at the 95th percentile. The obvious way to do that is to sort the data and pick element n/.95. But my program works with huge data sets - this graph was generated from 1.3 million transactions. Everything else uses (nearly) linear time and not much disk. I don't want to sort stuff.

Off to the ACM Digital Library:

These are really cool algorithms to estimate quantiles like the median or 95th percentile. The first has no defined error bounds but uses O(1) space and time. The second returns a value whose rank r' is within [r-εN, r +εN] (with ε chosen in advance). It runs in a single pass using O(1/ε log (εN)) space. Neither requires you to know N ahead of time.

I implemented the first one. A really simple test:

```
def testShuffledUniformSequence(self):
r = random.Random()
e = QuantileEstimator(0.5)
numbers = range(100)
r.shuffle(numbers)
for i in numbers:
e.append(i)
self.assertTrue(40 < e.get() < 60)
```

Though there aren't well-defined error bounds, it seems to do quite well on my data set. That range of 40 to 60 in the test case is crazy liberal; I just wanted to catch my blatant implementation errors.

I fed that into my graph tool to bound the graph at five times the 95%.

Much better.

Now I have a general-purpose statistics class I can use all over the place. Maybe I'll plug it into Axamol SQL Library's SQL statement timing to get the median and 95%.