# Old Lessons Relearned

## Posted 27 Oct 2001 at 15:28 UTC by apm

Sometimes even the master forgets his lessons and has to be re-taught
them by the school of hard knocks. The lesson in this case is
to always benchmark before trying to optimize - don't assume you know
where the bottleneck is. And don't forget that the algorithm is often
the most important speed factor.

For a while now we've been planning to speed up the version control
tool in Suneido. So the other day I grabbed Jeff
(jferguson) for some pair programming. I was sure the
speed issue was in the actual diff code, and that the answer was to
rewrite it from Suneido code to C++. It wasn't too hard since Suneido
code is fairly similar to C++ and we finished up in 3 or 4 hours of
pair programming spread over two days. Jeff hasn't done any C++
programming for a while, so part of the intent was to teach him some
C++. Of course, what I should have been teaching him was to benchmark
to find bottlenecks *before* diving into optimizing.

Needless to say, when we ran the new version it made no noticeable
difference. More detailed timing showed that we had, in fact, doubled
the speed of the diff function. But since it only accounted for about
10% of the time, we'd only cut 5% off the total time. Argh! Better
late than never, we identified the real culprit as the method we were
using to *display* the diff results. It took us less than an hour
to revise the display to be about 10 times faster. (Without resorting
to any C++ code.)
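A minimal sketch of the kind of measurement that would have caught this up front, using Python's `cProfile`. The function names and workload are invented stand-ins, not Suneido's actual code; the point is only that a profiler attributes time to each phase before you commit to rewriting one:

```python
import cProfile
import io
import pstats

# Hypothetical stand-ins for the two phases of the real tool.
def compute_diff(old, new):
    # Cheap set-based "diff" used purely as a placeholder workload.
    old_set = set(old)
    return [line for line in new if line not in old_set]

def display_diff(diff):
    # Deliberately slow per-line display path (quadratic string building),
    # simulating the kind of hidden cost the article describes.
    out = ""
    for line in diff:
        out = out + line + "\n"
    return out

old = [f"line {i}" for i in range(2000)]
new = [f"line {i}" for i in range(0, 2000, 2)] + ["extra"] * 2000

profiler = cProfile.Profile()
profiler.enable()
diff = compute_diff(old, new)
text = display_diff(diff)
profiler.disable()

# Print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Run against a realistic workload, a report like this makes it obvious which phase actually owns the time, before any rewrite begins.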

After making this improvement, we found the new bottleneck was the
client server protocol the version control was using. It was
transferring the code a line at a time. It took about 10 minutes to
rewrite this to send and receive the code as one block. This turned
out to be about 5 times faster. With these improvements, the version
control was starting to seem positively speedy.
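The round-trip arithmetic behind that fix can be sketched as follows. The cost constants and function names are made up for illustration (the article doesn't give the real protocol's numbers); what matters is that a fixed per-request cost is paid once per line in one scheme and once per file in the other:

```python
# Abstract cost model: each request/response pays a fixed round-trip
# cost plus a small per-byte transfer cost.  Values are illustrative.
ROUND_TRIP_COST = 1.0
PER_BYTE_COST = 0.001

def cost_line_at_a_time(lines):
    # One round trip per line.
    return sum(ROUND_TRIP_COST + PER_BYTE_COST * len(line) for line in lines)

def cost_one_block(lines):
    # A single round trip for the whole file.
    block = "\n".join(lines)
    return ROUND_TRIP_COST + PER_BYTE_COST * len(block)

lines = ["some source code"] * 1000
print(cost_line_at_a_time(lines))  # dominated by 1000 round trips
print(cost_one_block(lines))       # one round trip, same bytes
```

With any plausible constants the batched transfer wins as soon as the per-request cost dwarfs the per-byte cost, which is exactly the situation over a network.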

In the end we're not even sure we're going to use the C++ version of
the diff function. The algorithm is obviously efficient enough that it
isn't that critical.

The moral of the story?

"Measurement is a crucial component of performance
improvement since reasoning and intuition are fallible guides and must
be supplemented with tools like timing commands and profilers."

-- *The Practice of Programming*, Brian W. Kernighan and Rob Pike

You'd think after 25 years of programming I would have learned this
lesson once and for all. I guess one thing I learned is that it's
always easy to jump to conclusions. And it's always easy to think
*you* don't need to check your assumptions.

Reference: "An O(ND) Difference Algorithm and Its Variations" by
Eugene W. Myers, Dept. of Computer Science, University of Arizona.

Andrew McKinlay

www.suneido.com

For years now, I've been advocating a simple phrase to people who are
thinking about optimization:

90% of the time, you're *wrong* about where you spend 90% of the
time.

Net result is that you *always* profile before starting any work
on optimization. If you're doing network programming, then try to
profile across network operations, too. If you spend 95% of the time
sending data over the network, then you just don't have to worry about
optimizing your request handler :-)

Regarding algorithms: yes, algorithms are important. After programming
for many years, however, they will become less and less of an issue.
You will simply select the right one for the job up front. Invariably,
your program will have the right/optimal algorithms to begin with, so
your optimizations will be fine-tuning rather than algorithmic changes.
What *does* happen to experienced programmers is that the code
will evolve over time (typically due to adding new features), and the
end result will use non-optimal algorithms for the overall feature set.
For an evolved codebase, it is quite common to rethink the problem
and revamp large portions to use new algorithms: satisfying current
feature requirements and performance requirements in one shot.

From my own experience, not with low level benchmarking of code (I can
write semi-competent Perl, and basic C, but will never be a programmer
to match those assembled here) but with benchmarking and tuning web
servers and the applications served from them, the importance of
*never* assuming that the bottleneck will be in the place that
seems obvious cannot be stressed too highly.

Tuning tiers of webservers, application servers, and backend databases
is a task that can too easily lead the experienced to leap to
conclusions. "We're running [insert foo-Java-servlet-runner] here,
and performance is, frankly, abysmal" is often met with "Oh,
that's Java or [foo-JVM] for you. You'll want to use
[foo-other-Java-servlet-runner] or [foo-flavour-of-month-JVM] or recode
this critical bit of your application using FastCGI- it'll
solve all your problems", only for one or more of the
recommendations to be taken up, and the performance increase to be
negligible.

Then, and too often, only then, does someone step up and say "Hmm,
let's have a closer look at the system", and after a few hours of
load tests, and much `netstat`ing, `strace`ing and
digging around in `/proc`, realise that the real bottleneck is
the connection between the web server and the application tier
(untuned OS thread limits, perhaps, or class-based queuing issues),
or misconfiguration of the web server such that it's needlessly
passing all requests for static content to the servlet runner, or
poorly tuned indexes in the database, or running the JVM with the
paltry default heap size, leaving most of the available 1 GB of RAM
unused, or...

And then a simple tuning fix produces the 200-300% (sometimes even as
much as 2000%) improvement that was originally being looked for, instead
of the 5-10% the knee-jerk reaction obtained (at much time and expense
to the overworked sysadmin).

I wrote an article called "Use Validators and Load Generators to Test
Your Web Applications" after I
got some consulting work where the client wanted to know why his
application fell over with only 50 users after they brought it live. It
had worked fine for them in development.

Something important to note about tuning algorithms. I disagree with
the way algorithm analysis is almost always taught. Or rather, I don't
think the way to do the analysis is ever taught completely.

Traditional algorithm analysis teaches "Big-O" notation, where you
determine how the algorithm scales with the increase in the size of the
input. It is often emphasized that tweaking the code isn't the right
way to approach optimization, rather that one must find an algorithm
that scales to a lower order - n log n instead of n squared, for
example. We are taught to ignore the constant factor in the algorithm.

This is all well and good in the case we are handling just a few data
sets which are large, but it doesn't cover the case of handling large
numbers of independent data sets. We may know the fastest way to sort a
file with a million records, but what is the fastest way to sort 200,000
independent files with 5 records apiece?

In such a case the constant factor can come to outweigh the growth of
the algorithm with the size of the input. It is certainly worthwhile in
such a case to tweak the code, and it may even be appropriate to use a
higher-order algorithm because it offers a lower setup and teardown time.
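One way to see the constant factor win is to time two pure-Python sorts on many tiny lists (built-in sorts are implemented in C and would hide the effect; this is a sketch, not a claim about any particular library). Insertion sort is O(n^2) but has almost no per-call overhead, while a recursive mergesort pays for allocation and recursion on every one of the 5-element inputs:

```python
import random
import timeit

def insertion_sort(a):
    # O(n^2) worst case, but a tiny constant factor: no recursion,
    # no allocation beyond the copy.
    a = list(a)
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

def merge_sort(a):
    # O(n log n), but with slicing, allocation, and recursion
    # overhead on every call - the "setup and teardown" cost.
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

# 200 independent 5-element "files", sorted repeatedly.
tiny = [[random.random() for _ in range(5)] for _ in range(200)]
t_ins = timeit.timeit(lambda: [insertion_sort(x) for x in tiny], number=50)
t_mrg = timeit.timeit(lambda: [merge_sort(x) for x in tiny], number=50)
print(f"insertion: {t_ins:.3f}s  mergesort: {t_mrg:.3f}s")
```

On tiny inputs the lower-overhead quadratic sort typically wins, which is the point: below some crossover size, the constant factor dominates the asymptotic class.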

By the way, I learned to profile after I talked my boss into buying a
$5000 floating point accelerator board for our Sun 3/160, only to find
that it sped up the time to process one of our aerial photos from 90
seconds to 85. After I learned to profile (desperately, in an all-night
hacking session), I had the code running in 10 seconds without the
accelerator and 5 seconds with - so hardware acceleration of floating
point lived up to its claims, but only when we weren't wasting time.

In this particular case, a subroutine was being called for each pixel of
the image to determine which stdio FILE to read the next pixel from, and
then the pixel was obtained with getc(). Changing this to determine the
input just once, then read the whole image in a single read() system
call provided the greatest speedup.
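A Python analogue of that fix (the original was C with getc() and read(); the file and its size here are invented for illustration) shows the same shape of overhead, one call per "pixel" versus one call for the whole image:

```python
import os
import tempfile
import time

# Create a throwaway "image" file of 200,000 bytes.
payload = b"\x00" * 200_000
fd, path = tempfile.mkstemp()
os.write(fd, payload)
os.close(fd)

# Slow path: one read call per byte, like calling getc() per pixel.
start = time.perf_counter()
with open(path, "rb") as f:
    data = bytearray()
    while True:
        b = f.read(1)
        if not b:
            break
        data += b
per_byte = time.perf_counter() - start

# Fast path: one bulk read for the whole file.
start = time.perf_counter()
with open(path, "rb") as f:
    bulk = f.read()
one_call = time.perf_counter() - start

os.remove(path)
assert bytes(data) == bulk  # same bytes, very different call counts
print(f"per-byte: {per_byte:.4f}s  bulk: {one_call:.4f}s")
```

The per-byte loop does 200,000 calls where the bulk path does one; even with stdio-style buffering underneath, the call overhead alone is usually a large multiple.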

goingware, I disagree about the Big-O thing.

Algorithm analysis works pretty well for everything. It just has to be understood right. Let's say we have two algorithms *A* and *B*, where *A* runs in *O(n^2)* and *B* runs in *O(n^3)*. Then, true, *A* is called "faster than" *B*.

This is true. Given that your code must handle arbitrary values for *n*, you should choose algorithm *A*.

But if you know that *n* is not free, but rather limited (say, *n<10*), you can just fill it in. You get that *A* runs in *O(100)* and *B* runs in *O(1000)*, and you know that *O(100)=O(1000)=O(1)*. That tells you that for your problem both algorithms have the same runtime complexity, and neither of the two should be chosen on the basis of the *O* alone.

Now, Big-O is a mathematical concept that hides some facts from you. If you did your runtime analysis carefully, you first calculated the correct runtime functions (for *A* it was, say, *17n^2+26n+1*, and for *B* some other term), then derived the Big-O's from those terms. Given that those Big-O's gave you "same runtime complexity", you'd now insert *n<10* into your original terms and get upper bounds for *A* and *B* applied to your problem, in single step counts. If you then get that *A* runs in fewer than 556 steps and *B* in fewer than 30, the complexity analysis gives you the correct result, namely that *B* is the faster algorithm for your specific problem.

If you got taught to **always** ignore the constant factors and the remaining term, you had the wrong teacher.

p.s.: Why can't I use <sup> here? It would look so much better...

ali, you are of course
correct here.

But I'm pretty sure almost everything I've seen recently published, and
what I got taught in school, did say to forget all but the highest order
terms. I
particularly remember this from the graduate algorithm analysis course
at UCSC.

My hazy recollection though, is that in Knuth's art of computer
programming, he did do such a detailed analysis of the complexity for
all his algorithms. One reason for inventing the MIX assembly code was
so that he would have a standard processor to measure complexity with.

It happens that when I first decided to study programming in a serious
way, I read Knuth. It was a little too deep for a beginner, but I did
remember this way of looking at things.

Um. Sorry for my previous post. It wasn't meant to be as offensive as
it turned out. I just wrote it before the day's first coffee, when
everything I say sounds offensive. Of course the initial "I disagree"
is completely garbage, because afterwards I state that I agree with what
you said. Damn. (Note to self: Never write anything while
Caffeine-shortage-alarm is active.)

I also had some "lecture" on algorithms and complexity, and it taught
the same. The reason is quite simple: If you have CS students which
panic when they see polynomials, you cannot go deeply into complexity
theory.

This is very annoying. To really understand it, you obviously need
some knowledge in both mathematics and computer theory (which many CS
students neither have nor want).

You can do a simple survey to see that problem: Ask some people what
the Big-O complexity for the following problem is: "Given a set of
*n* numbers, find the biggest one." Almost everyone will answer
"*O(n)*", which is basically wrong (Or they'll gasp
"Big-WHAT?").

On the other hand, a seriously done complexity analysis is
practically impossible for most things. Nearly everyone does it using
additional assumptions and simplifications. But if you know what it's
really about, you can at least derive correct results from the
analysis.

So, my(!) final conclusion is: If it must handle arbitrary data, use
the mighty *O*, otherwise code multiple possibilities and use a
profiler.

**Yes, Knuth**, posted 28 Oct 2001 at 15:11 UTC by roundeye (Journeyer)

goingware, I was just going to reply
that an over-dependence upon big-O notation for choosing algorithms is
an artifact
of the way run-time analysis is being taught. While I've never seen
big-O taught
without a formal definition which includes examples about how big-O
comparable
algorithms may vary widely in performance (current fastest matrix
multiplication algorithms
being a great expository case), it seems that without a solid grounding
in low-level
performance analysis that most students end up taking home that ignoring
the big-O
constants is just fine... which is limiting and probably just plain Bad.

I was thinking that Knuth's ACP is a great introduction to performance
analysis. After
spending a couple of semesters adding up operations in MIX a student
would have a
feel for algorithmic comparison, at which point big-O analysis could be
taught without
fear of the student forgetting about the constants indefinitely.

While I don't expect the normal academic curriculum to reroute itself to
start with Knuth,
it certainly wouldn't hurt. I'd say to aspiring programmers wanting to
learn about
performance and complexity, go to Knuth first and then come back to
big-O analysis.

One of the common mistakes I run into is people not recognizing the
difference between, for example:

```
for each record in "select * from customers"
    if (record.city is "Saskatoon")
        process(record)
```

and

```
for each record in "select * from customers where city = Saskatoon"
    process(record)
```

If there are a lot of customers, and there's an index on city, then
there's a huge difference, but a lot of inexperienced people just don't
see it. There are two different algorithms here, but it's not
*obvious*.
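A runnable sketch of the two approaches using Python's `sqlite3` (the table, column, and index names are invented for the example; the original is pseudocode, not any particular database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.execute("CREATE INDEX idx_city ON customers (city)")
rows = [(f"cust{i}", "Saskatoon" if i % 100 == 0 else "Elsewhere")
        for i in range(10_000)]
conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

# Approach 1: drag every row across to the client, filter there.
slow = [r for r in conn.execute("SELECT * FROM customers")
        if r[1] == "Saskatoon"]

# Approach 2: push the filter into the query so the database can
# use the index on city and return only matching rows.
fast = list(conn.execute(
    "SELECT * FROM customers WHERE city = ?", ("Saskatoon",)))

# Same answer, very different amounts of work and data transfer.
assert sorted(slow) == sorted(fast)
print(len(fast))
```

Both versions produce the same 100 rows; the difference is whether 10,000 rows or 100 rows cross the database boundary, and whether the index gets used at all.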

**I/O**, posted 29 Oct 2001 at 06:57 UTC by jmason (Master)

goingware said:

In this particular case, a subroutine was being called for each pixel
of the image to determine which stdio FILE to read the next pixel
from, and then the pixel was obtained with getc(). Changing this to
determine the input just once, then read the whole image in a single
read() system call provided the greatest speedup.

Which raises a very good point. When you're profiling, be sure to take
a good look at `strace` and see what system calls you're making; a
highly efficient algorithm can be totally defeated by inefficient I/O
(such as the above).

**Big-O notation**, posted 29 Oct 2001 at 16:08 UTC by forrest (Journeyer)

I apologize if this is getting off-topic, but my curiosity was piqued
when ali said

You can do a simple survey to see that problem: Ask some people what the
Big-O complexity for the following problem is: "Given a set of n
numbers, find the biggest one." Almost everyone will answer "O(n)",
which is basically wrong (Or they'll gasp "Big-WHAT?").

I certainly would answer "O(n)".

To find the largest of n numbers, you cannot escape the need to examine
the value of each and every one. In the most straightforward algorithm
I can think of, you go through your list of numbers, comparing each
number with the largest-found-so-far. That takes n-1 comparisons.

Is the algorithm I've just described not O(n)? Is there some more
efficient algorithm to solve this problem?
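The straightforward algorithm described above, as a short sketch:

```python
def find_max(numbers):
    # One pass, n-1 comparisons: compare each element against the
    # largest seen so far.  Raises StopIteration on empty input.
    it = iter(numbers)
    largest = next(it)
    for x in it:
        if x > largest:
            largest = x
    return largest

print(find_max([3, 141, 59, 26]))  # → 141
```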

**Re: Big-O notation**, posted 29 Oct 2001 at 18:10 UTC by ali (Apprentice)

No, it isn't possible to do it faster than *O(n)*: unless the numbers are ordered in some way, you at least have to **look** at (or "read") each of them, which takes at least *O(1)* apiece.

The problem is that reading and comparing numbers does not run in *O(1)*. Given that (as an example) you use binary-encoded integers, it takes *O(m)*, where *m* is the number of binary digits. The common (but false if talking about general algorithms) assumption is that *m<=32* (or 64, or whatever), in which case it reduces to *O(1)*. But generally speaking, finding the highest number in a set of just one number needs *O(m)* (with *m* as above), not *O(1)*.

Because of that, real complexity analysis defines this magic *n* appearing in all those O's as the **required space** for the algorithm's input, measured in bits (for example).

That's what I meant with "it's hardly ever done correctly" - nobody wants to deal with that. You wouldn't get anywhere with reasonable effort. But it's a problem that nobody teaches it that way, so people cannot remember it when this distinction suddenly counts.

An (incorrect but simple) example of "it counts" is strings as identifiers. A lot of libraries provide algorithms with documented runtime complexities that rely on string comparison for key/value pairs or something similar. All those complexities are wrong, because they don't take into account that a user-provided arbitrary string may be 10 GByte long and isn't generally comparable in *O(1)*. (This example is slightly incorrect because one can argue that memory isn't infinitely large. On a 32-bit OS, the whole memory is limited to 2^32 bytes (I think...), so all memory operations are limited by that factor, and thus string comparison runs in *O(1)*. But then you can also argue that find-the-largest is limited by that too, runs in *O(1)*, and so on. But that doesn't lead to anything...)

Hope this helped.

ali, I think the point on which we're all ultimately
agreeing
can be summarized by saying that complexity analysis is usually sloppy,
and is
unfortunately taught that way (maybe the last part's not fair to
educators who
usually do teach rigorous concepts, but not in a way where students have
to take
home a rigorous understanding...). The corollary to this is that not only
does it
often not matter whether analysis is sloppy, but sometimes it's a
downright annoyance. When it does matter we should have the training,
rigor, and tools to do it right.

The issues you brought up with the O(n) max problem have to do with
sloppiness
in analysis or specification. In the original form of the question we
just know that
we want the big-O complexity of finding the largest of a set of n
numbers. We either
have to ask for clarification, which is usually unnecessary and annoying
(i.e., when it
is without cause), or make a number of assumptions about the nature of the
sloppily-stated problem. We assume things about the model of
computation, the
storage model, complexity of numeric comparisons, the state of the data
provided, etc.
This is not to say that the analyst doesn't understand the impact of
variations in these
constraints, simply that he has to make assumptions.

In a computation model where the n numbers are represented by n sticks
of spaghetti
with lengths proportional to their values, the max problem can be solved in constant
time by
grabbing them in one hand and pushing them end-down on a flat surface
and picking
out the one which sticks up highest. This model is limited by the size
of the hand, coefficients of friction, the ability to discriminate
between close lengths, and a number of
other factors.

If I were to say that analysis is hardly ever done correctly because
it's rare to see
anyone bringing up the spaghetti sorting model of computation (or
whatever you'd like
to call it) I'd be a bit extremist. Unless it really matters it doesn't
really frigging matter. :-)

roundeye, it's obviously true that it *usually* doesn't matter. But sometimes it does. So, it's okay to say that find-the-highest runs in O(n) because it's clear what assumptions are underlying it.

Besides that, your spaghetti model (albeit being yummy) is a very nice example of when most assumptions break. Given *n* positive numbers, the largest being *m*, it takes *O(log m)* to compare two of them if they're binary encoded. Now, your spaghetti model is well-known for being the ugliest problem in complexity theory: it's equivalent to using 1-ary digits (that is, "20" is represented by 20 copies of the same digit followed by a "stop" digit). In this case (because *m=exp(log(m))*), you are shifting nearly every algorithm into exponential time.

(Um, the "look at your hand" trick fails if I give you three spaghetti, one 1 inch long, and the other two with one end on your table and the other ends leaned against Jupiter, their lengths differing by 2 inches. Now find the largest in *O(3)*...)

So, the common assumption is that you're using a k-ary alphabet with *k>1*.

All of this is not important if you want to solve simple problems, like our find-the-largest, and it's okay to completely neglect encodings, space requirements and everything.

It does however matter when you, for example, want to say something about P=NP, want to invent a crypto algorithm, or whatever. If you're discussing RSA (a prime-number based crypto algorithm), this encoding stuff gets massively important. If you play around with 32-bit-numbers, you're lost. For the usual limited-size problems, complexity theory only gives you hints. It's really meant for arbitrary-sized problems.

I forgot the name, but there existed a crypto algorithm that was called "safe" because it was based on (I think) the stable set problem, which is NP-hard. Now, people started using it, and everyone assumed that using NP-hard problems guaranteed safety. But, unfortunately, the inventors fell into exactly this trap: they used fixed sizes for the keys, which made the problem easily crackable. (Additionally they had some structure in the key generation, which reduced the problem to a subset of P-solvable problems.) So, that's the perfect example of "You might neglect it, but you have to know when it counts, and what changes then." (Does anyone know what it was called?)