Older blog entries for graydon (starting at number 99)


mbp speaks of the inability to prove that a program is free from bugs. this is broadly asserted, yet I feel a little twinge of annoyance every time I hear it, perhaps because it is so imprecise.

a bug is a subset of machine state space which you do not want to be reachable by implication from your program's text. a test is a single path through machine state space, either leading into a bug region (a test you want to see fail) or avoiding bug regions (a test you want to see pass).

there are far more possible tests than you can ever write or run, so you must choose "good" paths towards "important" bugs. the problem of a test writer is to determine which areas are important, and it can require deep insight into what the program is intended for.

but note: a bug, a test, and a program are all completely formal objects from this viewpoint.
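
to make that concrete, here is a minimal sketch of the framing in ocaml; the toy "machine", its single transition rule, and all the names are mine, purely illustrative:

    (* a toy machine state space; real ones are astronomically larger *)
    type state = { pc : int; x : int }

    (* the "program": one transition rule over states *)
    let step s = { pc = s.pc + 1; x = s.x * 2 }

    (* a bug: a region of state space you never want to reach *)
    let bug s = s.x < 0

    (* a test: one path through state space, n steps from a chosen start *)
    let test start n =
      let rec go s i = bug s || (i < n && go (step s) (i + 1)) in
      if go start 0 then `Hit_bug else `Avoided_bug

    (* one path among a huge space of starts and lengths. on a 32-bit
       ocaml this particular path wraps negative and hits the bug; on a
       64-bit one it does not. no finite pile of such tests settles the
       question for all machines, but a short proof about int width can *)
    let result = test { pc = 0; x = 1 } 40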

a proof, on the other hand, is another formal object in a much larger "space" (a linguistic space in which machine state spaces, and some parts of your preferred maths or logics are basic terms). the sort of proof you are interested in is one which relates your program's text residing in memory at one point, to a (good or bad) region of machine state space which is implied by the program, via a logic whose rules you like.

any test can be translated into a proof in a silly logic easily: the proof is simply the trace of your processor executing your program's code on your test's input, and the logic is one in which each machine transition that happened is an axiom. but that proof is boring.
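
in the toy machine above, that silly-logic proof is literally just the trace (again purely illustrative):

    (* the trace of a run: every transition taken is an "axiom" of the
       silly logic, and the whole list is the (boring) proof *)
    let rec trace s n = if n = 0 then [ s ] else s :: trace (step s) (n - 1)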

an interesting proof is one which is much smaller, when written down, than the sum of all the tests which you would need to write to fill the machine state space bounded by the proof. in this sense, I really believe Chaitin is onto something when he talks about proofs as nothing more than a form of "higher order data compression", and the value of a given formal system as the amount of compression it commonly admits over interesting data.

so, getting back to mbp's comment, certainly you can produce a very large set of (formal) bugs which nobody's compressed inside a proof yet, for any program you care to mention. but I do not think this means that all, or even a sizeable majority of those bugs will admit "no further compression" if the author puts their mind to it.

raph's suggestion that we design programs "the way we'd want to prove things about them" is, seen in this light, a suggestion that we design programs "in a way which admits a lot of compression". I think this is at the heart of the programmer aesthetic for "small and simple" programs. "larger" often means "less provable", or equivalently the feeling that any single test explores too little of the state space to be helpful. you lose confidence that a large system will do what you want, unless you can carve up its state space with type systems, modules, processes, etc.

as a final note: people often try to express test quality in terms of "code coverage", but this is only slightly useful, and less so as the quality of the program being tested improves. good programs parameterize much of their behavior on their input. their machine state space massively outweighs their program text, so knowing that all the code was exercised tells you very little about which machine states can be exercised.

refinements on a theme

if you set out to prove "predicates" about a program, there are two (common) ways to go about it.

the first way involves writing a program, and then, either in the language of the program or some other "more logical" language, writing down some formula about the program and trying to construct a proof. proofs performed in this order are constructed by repeatedly reducing the terms of the proof towards axioms. you are given a list of permitted transformations on the formula, with the understanding that rewriting a formula in some sense preserves "truth", or at least logical implication. once you arrive at a formulation which is all axioms or instances of abstract axioms, your proof is done. sometimes you have tool support to automate reducing the tedious parts of your formulas to axioms. formal systems which are intended to work this way include ACL2, Coq, HOL, NuPRL, Isabelle, Twelf, and many other logic and meta-logic frameworks.
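
a toy illustration of this direction, in hoare-triple style (my notation, not that of any one of those tools): start from the program and the formula, and apply the assignment rule backwards until only arithmetic remains:

    goal:   { true }  y := x + 1 ; y := y + 1  { y = x + 2 }
    step:   { true }  y := x + 1  { y + 1 = x + 2 }        -- assignment rule, backwards
    step:   true  =>  (x + 1) + 1 = x + 2                  -- assignment rule again
    done:   an arithmetic axiom (ignoring overflow, which a real logic must face)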

the second way involves writing a formula expressing the conjunction of facts you eventually want your program to satisfy, and transforming that formula into the program by successive refinements (rewrites). you are given, as in the first case, a list of the sorts of syntactic transformations which constitute a legal refinement, with the understanding that a refinement of a term "means the same thing", only more concretely than the term it refines. when you have nothing left but terms in your abstract machine language, you have completed the refinement and you can probably run the resulting program. sometimes you have tool support to automate performing tedious refinements of formulas into machine code. formal systems which are intended to work this way include Z, VDM, RAISE, B Method, Refinement Calculus, and Hehner's "practical theory of programming".
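
and the same toy example run in the refinement direction (again my notation): start from a formula relating the pre-state x to the post-state y', and rewrite until only machine terms remain:

    spec:   y' = x + 2  /\  x' = x
    step:   y' = (x + 1) + 1  /\  x' = x                   -- rewrite, meaning preserved
    step:   y := x + 1 ; y := y + 1                        -- all terms now machine terms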

typically the underlying logics of members of either such family permit their use in one approach or the other, but the tools and logics may display the preference of their creators.

formalities and formalisms

one way of viewing programs is as "active" objects, about which certain assertions may hold from time to time. in this view, people say things like "what does this program do?" the program text is viewed as suspicious, vulgar; to do "real" reasoning, you need to relate the program text to something clean like mathematics or formal logic, where there are many good theorems (never mind the ambiguity and duplication in their notations).

another way of viewing programs is to say that programs do nothing; computers do things and programs are simply formal suggestions to the computer. a program in a high level language is a formal suggestion to another program (a compiler, expressed in a low-level language such as machine code) about a sequence of i/o operations which may result in the production of another program in a low level language. in all cases, however, a program is a completely static entity; a single formula with rules which may apply to describe legal transformations of it.

in this latter view, the most important thing a program suggests is a relationship between pre- and post- states of a computer. this "large" relationship is stated as a sequence of instructions, but each instruction suggests at a fundamental level nothing more than a relationship of the pre- and post- states of a computer (including i/o of course). the post-state of one instruction is the pre-state of the next. this is a simplification of modern instruction pipelines but it is often a sufficient level of detail for reasoning.

so suppose you write down an instruction using the "static" view of a program. say:

1:  x <- k * v

if instead you take the first view, you may then feel the need to "annotate" this with the "formal" property that the post-state of the computer contains a variable whose arithmetic value is the product of two others:

// annotation: { post[1][x] = pre[1][k] * pre[1][v] }
1:  x <- k * v

it may strike you as suspicious at this point that you've just written down two nearly identical expressions, but recall that in this view the program is an "active" object (subject to wild fits of unpredictability) and the annotation spec, in some more "mathematical" semantics, is the only "formal" object.

what this means, when you say that an expression is "formal", is that the formal object is the only one whose reasoning and transformation you have any confidence in. for example, say you later have an instruction:

// annotation: { post[5][y] = pre[5][x] / pre[5][v] }
5:  y <- x / v

you can then perform the annotation reasoning (assuming x and v are untouched between instructions 1 and 5, and that v is nonzero):

post[5][y] = pre[5][x] / pre[5][v]
           = post[1][x] / pre[1][v]
           = (pre[1][k] * pre[1][v]) / pre[1][v]
           = pre[1][k]

and then you can (somehow) relate this back to the formal text of the program to see that it's legal to rewrite instruction 5 as:

5:  y <- k

but notice that to perform this reasoning, you had to perform a "round trip" in and out of the "formal" system. after all, you cannot reason about filthy active objects like programs; they might do anything! and of course in order to faithfully do the "translation" back and forth between the "active" and "formal" parts of the program, you need a formal relationship between your "formal" notation and the program text. oops! you've just accidentally raised the program text to the level of a "formal" system anyways!

this is the convoluted thinking (unfortunately common, in "formal methods" circles) that Hehner sets out to change. he thinks that the program is a formal expression of a relationship between pre- and post- states of a computer, and if you tidy up common programming languages you will find buried inside them a simple boolean expression logic which uses the programming language as its fundamental terms. otherwise you spend all your time translating back and forth between the program text and your "real" formalism, and you get so much clutter that any reasonable programmer gives up in disgust.
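
to make that concrete (my paraphrase, not a quotation from his book): in hehner's theory an assignment simply is a boolean expression relating pre-state variables to post-state (primed) variables, and sequencing is relational composition:

    x := k * v      means      x' = k * v  /\  v' = v  /\  k' = k
    P ; Q           means      some middle state satisfies the post of P and the pre of Q

    so:  (x := k * v ; y := x / v)  implies  y' = k        -- provided v <> 0

the rewrite of instruction 5 above is then a one-step boolean implication inside a single formalism; no round trip required.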

lest you think this is completely implausible, have a look at ACL2. it is a fully functioning, quite usable system which performs exactly this sort of "reasoning about the program text" for a fully executable lisp language.

put my OLS slides online, if anyone's curious.

local and remote

citeseer surfing this evening turned up a more detailed note on the matter of keeping the difference between local and remote resources distinct, in a program.

I sort of wish I'd read this before spending all that time working on berlin. though perhaps the lesson is best understood when experienced first hand.

proofs and programs

it is good to think of programs as logical terms, because you can then consider the "specification" or "interface" of a program as some logical term which is implied by another term, say the implementation of the program or its assembly language. these are all terms within the same logic, some using a restricted vocabulary, but all logically related. the act of "writing a program" is then viewed as "locating a more syntactically restricted term which logically implies the more abstract specification".

this is a useful view since it eliminates the pointless distinction between specification and programming language. you view the programming language as the subset of the specification language that you happen to have hardware which can follow.

for example, say I have a specification / program of the form:

	for i from 1 to 10: x[i] := x[i] + 1;

there is no "direct" encoding of that sequence of symbols as something my hardware will try to follow. but there is another program which logically implies the first one:

	load r1,1		; r1 holds i, starting at 1
top:	load r2,r1
	subtract r2,r2,11	; exit once i reaches 11, so that i = 10 is still processed
	branch-if-zero r2,end
	load r3,x		; r3 holds the base address of x
	load r2,r1(r3)		; r2 = x[i]
	increment r2
	store r1(r3),r2		; x[i] = x[i] + 1
	increment r1
	jump top
end:

and this, given a small change-of-encoding into machine words, is a specification my hardware will follow. but note the relationship: the pseudo-assembly implies the pseudo-C, not the other way around. it implies it, not in the informal sense of "it's a reasonable translation", but rather in the formal sense of a computer with memory that can assume various states, and control that passes from instruction to instruction, with the post-state of one instruction being the pre-state of the next. what if your logic doesn't have the symbols "jump" or "for" or ":=" in it? well, it's time to change to a logic that does.
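
written out in that spirit (informal notation, mine): the assembly, read as a predicate relating pre- and post- states, implies the loop read the same way, and not vice versa, since the loop admits many implementations:

    asm(pre, post)  =>  (forall i in 1..10: post.x[i] = pre.x[i] + 1)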

document formats

three things worth pointing out wrt. document formats:

  1. editability is a factor worth paying close attention to. many users want to be able to "freeze" a document and make it non-editable, either in the name of preserving "pristine content" or (more benevolently) as a side effect of composing content in a higher-level abstraction than the delivery format.

    this has similar moral character to compiling source code, insofar as derived documents are very difficult to produce. many people think that derived documents never happen in the real world; I would advise such people to consider how much worse the web would be without "view source".

  2. machine-readability is also an important factor. if you assume many document retrieval tasks in the future will be partially machine-assisted (beyond mere presentation), then you've got to keep things like citeseer, google, and glimpse in mind when working on document formats.

  3. for non-editable, mostly presentation-oriented documents, the djvu format is very good.

I'm working fully-remote now, which is an interesting change, but really has just served to demonstrate the flexibility of free software to me; with a little bit of glue here and there, the system feels exactly as if I were in the office. minus all the socializing :(

ported some bits of slib to ocaml and built a quick sqlite binding. I'm finding that ocaml just keeps growing on me as a "serious hacking" language.

otherwise pretty relaxed. I'll be heading up to OLS in a couple weeks to give a talk on building ridiculously small domain specific compilers. and then I'm taking a fantastic vacation through the western end of the country. very much looking forward to it.


  • I think that eval, in its lisp form, is another of those sorts of things the ML/haskell world usually frowns on as "a bit too dynamic". you don't even know when an eval'ed sexp is well typed. that's scary. that said, there is still dynamic caml should the need arise.
  • TCL is actually an interesting case. I used to think it was just a complete joke, what with basically no datatypes, but I'm finding myself occasionally forced to use it (sigh.. work), and each time I am impressed with how far you can go with a clever quoting system and a single datatype. it's further than you think.
  • with respect to camlp4, I do think the macros offered are "as powerful" as those in lisp, by any reasonable measure. consider (and see the hedged sketch after this list):
    • you can write your own lexer. you don't have to shoehorn your macro notation into sexps
    • you have explicit or implicit control over source coordinates
    • you can programmatically construct your results, or construct them through quotations, or both
    • you can name and layer quotation expansions of different notations
    • you can replace the core grammar of ocaml, essentially re-tasking the ocaml compiler to a "new" language with ocaml semantics.
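
as promised, a sketch of the hook itself, against the old-style camlp4 Quotation api as I remember it; the "words" quotation and its expander are my own illustrative names, not anything shipped, and you should treat the exact ExStr shape as an assumption on my part. writing <:words< foo bar baz >> in a source file would expand to ["foo"; "bar"; "baz"]:

    (* illustrative only: turn the raw text of a quotation into the
       concrete syntax of an ocaml string-list literal *)
    let expand_words _in_expr_position s =
      let ws = Str.split (Str.regexp "[ \t]+") s in
      "[" ^ String.concat "; " (List.map (fun w -> "\"" ^ w ^ "\"") ws) ^ "]"

    (* hook the expander into the evaluator under the name "words";
       I believe this is the Quotation.add / ExStr shape, but treat it
       as a sketch, not gospel *)
    let () = Quotation.add "words" (Quotation.ExStr expand_words)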

I think you are right about the scene graph location. I was initially quite a fan of "location transparency", but have come to feel that there are always at least 2 locations in any computer program: here and elsewhere. Here is at least host-local, probably user-local, process-local, and even thread-local. Really, I don't think a display server ought to have more than one thread anymore either. I think Miguel was quite right in that argument we had years ago. But that's a different kettle of fish.

In any case, collapsing here and elsewhere into just one case leads to, well, corba. Increasingly, I don't think corba is wise at all; remote resources are just too intrinsically different from pointers.

One might be motivated, then, to wonder what remote resources are like. Obviously with some sequence number trickery, remote resources can be organized into things like streams (hence TCP socket ~~ file descriptor), but can they be made into something even more structured? 9p thinks remote resources can appear like a VFS tree: open, read, write, close, and walk. Seems a useful, and mostly harmless extension. But I think it's important to keep open vs. closed, and explicit i/o, as part of the equation. Trying to do those things transparently is trouble.
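
To be concrete about that shape, a minimal sketch of such an interface in ocaml (my names, not any real 9p binding); note that open and explicit, bounded i/o stay visible in the types:

    module type RESOURCE =
    sig
      type fid                                  (* a handle to one node in the tree *)
      val walk  : fid -> string list -> fid     (* move through the hierarchy by name *)
      val open_ : fid -> [ `Read | `Write ] -> unit
      val read  : fid -> int -> string          (* explicit i/o, bounded by a count *)
      val write : fid -> string -> int          (* returns how much was accepted *)
      val close : fid -> unit
    end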

I wish paul graham would stop publishing articles claiming lisp has the only useful syntactic metaprogramming system. it's especially irritating since he seems like a smart guy, and should know better. you don't need car and cdr to do good syntactic metaprogramming (indeed, alone they are awkward); you just need a quasi-quote notation and a place to hook your extensions into the evaluator.

hell, even TCL satisfies the menu here, though few people bother to learn it. unless he's been hiding under a rock for a couple years, I can't believe he's seen and understood camlp4, and still thinks defmacro is "unique to lisp".

