4 Mar 2007 fejj   » (Master)

Like Merge Sort, Quick Sort promises O(n lg n) time in the average case, and, like Merge Sort, is also a recursive algorithm. However, it is important to be aware that Quick Sort is O(n * n) worst case, where, ironically, the input is already mostly sorted in either direction.

The way Quick Sort works is that it first chooses a pivot value with which to partition the working sequence into two segments. Next, it puts all items with values smaller than the pivot value into the first segment and all items with values greater than the pivot value into the second segment. Recursively repeat the process on each segment.

Lets put this into code (keeping our same call signature from previous sorting posts):

static int
QuickSortPartiton (int a[], int low, int high)
{
    int pivot, lo, hi;
    int tmp;

hi = high - 1; lo = low;

pivot = a[lo];

while (1) { while (lo < high && a[lo] < pivot) lo++;

while (hi > low && a[hi] >= pivot) hi--;

if (lo < hi) { /* swap */ tmp = a[lo]; a[lo] = a[hi]; a[hi] = tmp; hi--; } else { return hi + 1; } } }

static void QuickSortRecurse (int a[], int low, int high) { int mid;

mid = QuickSortPartition (a, low, high); if (mid < high) { QuickSortRecurse (a, low, mid); QuickSortRecurse (a, mid, high); } }

void QuickSort (int a[], int n) { if (n < 2) return;

QuickSortRecurse (a, 0, n); }

Before we go any further, lets see how this compares to our Merge Sort implementation from earlier.

Since I no longer have the same machine that I timed Merge Sort with, I had to re-time it on my laptop (which is where I've been doing this since I can hack while laying back on my comfortable couch). My laptop is an IBM T40 ThinkPad which sports an Intel Pentium M Processor rated at 1600 MHz.

Since I left off sorting 10,000,000 items in the Merge Sort post, I figured I'd use the same array size here.

For Merge Sort, I'm consistently getting around 6.96s. For our Quick Sort implementation, I'm getting about 5.40s.

Not bad... but lets see if we can speed it up a bit more. The easiest change to make is to combine QuickSortPartition() into QuickSortRecurse() in order to try and reduce extra function-call overhead. The resulting code is as follows:

static void
QuickSortRecurse (int a[], int low, int high)
{
    int pivot, lo, hi;
    int tmp;

hi = high - 1; lo = low;

pivot = a[lo];

while (1) { while (lo < high && a[lo] < pivot) lo++;

while (hi > low && a[hi] >= pivot) hi--;

if (lo < hi) { /* swap */ tmp = a[lo]; a[lo] = a[hi]; a[hi] = tmp; hi--; } else { hi++; break; } }

if (hi < high) { QuickSortRecurse (a, low, hi); QuickSortRecurse (a, hi, high); } }

void QuickSort (int a[], int n) { if (n < 2) return;

QuickSortRecurse (a, 0, n); }

This new version gets times closer to 5.20s on average.

It would be nice if we could get rid of the function call overhead of recursion like we did with Binary Insertion Sort, as we've seen that getting rid of the need to call out to QuickSortPartition() improved the performance a bit (not much, but every little bit counts, right?).

We're going to have to get crafty in order to work around the need to make our function recursive. What we can do is keep our own stack, but growing it dynamically could potentially kill any performance gains we could hope for... so, lets see what Donald Knuth has to say about the mathematical properties of this sort algorithm.

An auxiliary stack with at most [lg N] entries is needed for temporary storage.

As it happens, log base 2 of the max unsigned value of any integer type is the number of bits in said integer type. This means that on my system, where integers are 32bit, I'll need a stack size of 32.

What do we store on our stack? Well, what variables do we need for QuickSortRecurse()? We need a[], high, and low. If we're going to make QuickSort() non-recursive, then that means we'll always have access to a[] which means the only variables we need to hold on our stack are high and low.

So here it is, your Moment of Zen:

typedef struct qstack {
    size_t lo;
    size_t hi;
} qstack_t;

void QuickSort (int a[], size_t n) { qstack_t stack[32], *sp; register size_t lo, hi; size_t low, high; int pivot, tmp;

if (n < 2) return;

/* push our initial values onto the stack */ sp = stack; sp->lo = 0; sp->hi = n; sp++;

while (sp > stack) { /* pop lo and hi off the stack */ sp--; high = sp->hi; low = sp->lo;

hi = high - 1; lo = low;

pivot = a[lo];

while (1) { while (lo < high && a[lo] < pivot) lo++;

while (hi > low && a[hi] >= pivot) hi--;

if (lo < hi) { /* swap */ tmp = a[lo]; a[lo] = a[hi]; a[hi] = tmp; hi--; } else { hi++;

if (hi == high) { /* done with this segment */ break; }

/* push the larger segment onto the * stack and continue sorting the * smaller segment. */ if ((hi - low) > (high - hi)) { sp->lo = low; sp->hi = hi; sp++;

hi = high; low = lo; } else { sp->hi = high; sp->lo = hi; sp++;

high = hi; lo = low; }

pivot = a[lo]; hi--; } } } }

Once again, we see a slight improvement in execution time: 4.59s, just a hair under 5.00s, down from ~5.20s.

Just a bit faster, eh?

Out of my own interest, I've been reimplementing all of these sorting algorithms to target glibc's qsort() API, so as to make any one of them a drop-in replacement (not that it's likely I'd do that, since until now, it's unlikely any of them could have even approached the speed of qsort()... ).

Without further ado, here it is, another Moment of Zen:

static void
memswap (void *a, void *b, size_t n)
{
    register unsigned int *ai, *bi;
    unsigned char *ac, *bc, *an;
    register unsigned int i;
    unsigned char c;

an = (unsigned char *) a + n; ai = (unsigned int *) a; bi = (unsigned int *) b;

while (((unsigned char *) (ai + 1)) <= an) { i = *ai; *ai++ = *bi; *bi++ = i; }

ac = (unsigned char *) ai; bc = (unsigned char *) bi;

while (ac < an) { c = *ac; *ac++ = *bc; *bc++ = c; } }

typedef struct qstack { unsigned char *lo; unsigned char *hi; } qstack_t;

void QuickSort (void *base, size_t nmemb, size_t size, int (* compare) (const void *, const void *)) { register unsigned char *lo, *hi, *pivot; qstack_t stack[32], *sp = stack; unsigned char *high, *low;

if (nmemb < 2) return;

/* push our initial values onto the stack */ sp->lo = (unsigned char *) base; sp->hi = sp->lo + (nmemb * size); sp++;

while (sp > stack) { /* pop lo and hi off the stack */ sp--; high = sp->hi; low = sp->lo;

hi = high - size; lo = low;

pivot = lo;

while (1) { while (lo < high && compare (lo, pivot) < 0) lo += size;

while (hi > low && compare (hi, pivot) >= 0) hi -= size;

if (lo < hi) { /* swap */ memswap (lo, hi, size);

if (lo == pivot) pivot = hi; else if (hi == pivot) pivot = lo;

hi -= size; } else { hi += size;

if (hi == high) { /* done with this segment */ break; }

/* push the larger segment onto the * stack and continue sorting the * smaller segment. */ if ((hi - low) > (high - hi)) { sp->lo = low; sp->hi = hi; sp++;

hi = high; low = lo; } else { sp->hi = high; sp->lo = hi; sp++;

high = hi; lo = low; }

pivot = lo; hi -= size; } } } }

It turns out that my implementation is neck-and-neck with glibc's qsort() compiled at -O0, but if I compile with -Os, my implementation consistently takes the lead:

QuickSort() takes on average 12.82s while qsort() takes about 12.98s. However, the integer tailored version of our QuickSort() only takes 6.46s on average.

If your program makes heavy use of a sorting routine, you may want to consider implementing a tailored sort function rather than just using qsort().

To prove the point further, if we restrict the input array to be made up of pointers (rather than arbitrarily sized elements), we can replace the memswap() routine with a simple pointer swap. This simple change shaves a whole second off the execution time!

Not only can you tailor your sort implementation to be able to do faster swaps/comparisons, but you can also tailor it to be a "stable sort" (as I have done in these implementations), which is something that the libc qsort() does not guarantee.

As Michael Abrash mentions in one of his books, there are a number of fallacies that programmers use to justify not hand-writing some of their low-level routines, one of which is the assumption that the C library is written in optimized assembly and therefore it's unlikely that you could possibly write a replacement routine faster yourself.

As I've just proven, this is not the case. And I haven't even bothered to re-implement my Quick Sort routine in highly optimized assembly yet!

Latest blog entries     Older blog entries

New Advogato Features

FOAF updates: Trust rankings are now exported, making the data available to other users and websites. An external FOAF URI has been added, allowing users to link to an additional FOAF file.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!