12 Sep 2002 mbp

testing and proof

raph refers to Dijkstra's famous quote that "testing can only show the presence of bugs, never their absence."

This rather reminds me of Karl Popper's sensible and insightful arguments that scientific theories can never be proved, only disproved. No number of experiments can prove that F = mg, or that there are no bugs in TeX. However, one (valid) case in which gravity is not proportional to mass demands that the theory be discarded or improved.

This leads, at first, to a kind of intellectual nihilism: we can never know anything about the outside physical world with absolute certainty.

However, as Popper pointed out, it is reasonable to gradually gain confidence in a theory as people try and fail to disprove it. People attacking a theory have several possible strategies: they can find experimental conditions under which it fails, or they can demonstrate that the theory is inconsistent with itself or some other theory, or they can perhaps demonstrate that the theory is "unfalsifiable": not amenable to reasonable testing. The proposition that there is an undetectable invisible tiger in my study falls into this category: it may or may not be true, but it is certainly not scientific.

Quantifying the level of confidence is hard, but broadly this method seems to work reasonably well. The more bona-fide attacks that a theory survives, the stronger it appears -- but of course never above reproach.

So Dijkstra and Popper would agree that we can never prove that a program has no bugs. But (from my limited knowledge of them) Popper seems to have some more useful ideas about how, as non-omniscient humans, we can rationally gain increased confidence.

Neither empirical methods (e.g. testing) nor theoretical methods (e.g. proofs) ought to be privileged. In either case, it is possible that the trial might be carried out incorrectly, or that it may not be as strong an attack as was thought. Both have their place. In addition, the boundary between them is not entirely clear: experiments are formulated from theory, and theory is based on previous experiments. So, for example, in software we can annotate code with preconditions and postconditions, as in Eiffel. This stands neatly between empirical measurement and reasoning.
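
To make that concrete, here is a rough sketch of the idea in Python rather than Eiffel, using plain assertions; sqrt_floor is just an invented example, not code from any real project:

    def sqrt_floor(n):
        # Precondition: the function is only defined for non-negative integers.
        assert n >= 0, "precondition violated: n must be non-negative"
        r = 0
        while (r + 1) * (r + 1) <= n:
            r += 1
        # Postcondition: r is the largest integer whose square does not exceed n.
        assert r * r <= n < (r + 1) * (r + 1), "postcondition violated"
        return r

The annotations are reasoning written down as theory, but every run of the program is also a small experiment that can falsify them.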

Theoretical proofs have the advantage that they build upon extremely well-established mathematical theorems. But they do perhaps have the disadvantage that they can't test that the program fulfils its real-world requirements in the same way that, say, interactive free-form testing can.

Tests, obviously, ought to be designed to prove that the program is *incorrect*. It's too easy to construct experiments which confirm a theory, but Popper would say that they are worthless, particularly if designed by the theory's author.
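
A small illustration, reusing the sqrt_floor sketch above: the first test merely confirms the case the author already had in mind, while the second deliberately goes looking for the inputs most likely to break the code.

    # A confirming test: exercises the one case the author already believes works.
    def test_sqrt_floor_typical():
        assert sqrt_floor(9) == 3

    # Falsifying tests: probe the places where the code is most likely to be
    # wrong -- zero, either side of a perfect square, the smallest inputs.
    def test_sqrt_floor_edges():
        assert sqrt_floor(0) == 0
        assert sqrt_floor(1) == 1
        assert sqrt_floor(8) == 2     # just below a perfect square
        assert sqrt_floor(10) == 3    # just above a perfect square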

So it seems to me the crucial question of testing is "how can we economically produce the most excruciating tests?" I think there is a body of knowledge about this which is not really completely captured by either the XP people or the QA theory people.

One well-known approach is to write test cases for bugs when they're reported and perhaps before they're completely diagnosed. We know that there is some kind of weakness in that area: possibly trying out different tests will discover a related fault.
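
For instance (a hypothetical bug report against a hypothetical parse_size function -- the names are invented, not from any real project):

    # Regression test written straight from the bug report, before the fault
    # has been diagnosed: "parse_size('10k') returns 10 instead of 10240".
    def test_bug_size_suffix():
        assert parse_size("10k") == 10240

    # While we're in the neighbourhood, poke at related inputs: the weakness
    # may well extend beyond the single reported case.
    def test_bug_size_suffix_related():
        assert parse_size("1m") == 1048576
        assert parse_size("0k") == 0
        assert parse_size("10") == 10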

Another one I have been thinking about recently is "defect injection". Suppose you intentionally (and temporarily) change a line of code to be incorrect -- an off-by-one, for example. Do your tests catch the mistake? If not, they're obviously too weak. If you injected 100 mistakes, how many would be caught? That gives some idea of the strength of the testing. If you have everything in CVS, and a good "make check", then doing this is actually pretty easy.
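
Here is a very rough sketch, in Python, of how that might be automated; it assumes a "make check" target that exits non-zero on failure, and it simply rewrites the file and restores it afterwards (with everything in CVS you could equally revert from the repository):

    import random, re, subprocess, sys

    # Crude defect injection: flip one "+ 1" in a source file to "- 1" (an
    # off-by-one), run the test suite, and report whether it noticed.
    def inject_and_test(path):
        original = open(path).read()
        candidates = [m.start() for m in re.finditer(r"\+ 1", original)]
        if not candidates:
            print("no obvious place to inject a defect in", path)
            return
        pos = random.choice(candidates)
        open(path, "w").write(original[:pos] + "- 1" + original[pos + 3:])
        try:
            caught = subprocess.call(["make", "check"]) != 0
            print("injected defect was", "caught" if caught else "MISSED: tests too weak")
        finally:
            open(path, "w").write(original)   # put the original code back

    if __name__ == "__main__":
        inject_and_test(sys.argv[1])

Doing this for a hundred randomly chosen mutations, rather than one, is what would give the percentage-caught figure mentioned above.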

Now this is all very well in theory, but in practice humans get attached to their ideas, and don't want to see evidence that contradicts them. This applies to programmers and scientists alike.

Knuth is supposed to have had a comment along the lines of

(* WARNING: I have only proved this code correct, not tested it. *)

gc

raph, Java's bytecode system allows the implementation to use any garbage collection system it wants, from never-free through refcounting through to a mark-and-sweep gc. Various different ones might be appropriate depending on the target: for short-lived tasks, or v0.1 interpreters, being able to use a simpleminded gc is very nice. The "generational gc" that was current last time I looked behaves a lot like Apache pools.

All of this is done with no compile-time knowledge, and with a C API that looks a lot like Python's.

(Perhaps I'm wrong, it is a long time since I looked.)
