Much of the time when I'm looking at proof systems, I feel like a
novice. There's so much highly technical stuff, and it's not easy for
me to figure out which of it is even worth learning.

Even so, occasionally I see clear glimpses. Sometimes, these glimpses
don't even seem to be written down anywhere else. That's what I'll try
to communicate tonight.

**Formal logical systems**

Here's a very simplified summary of formal logical systems. I strongly
encourage you to get familiar with a real system, but if not, this
will have to do.

Each such system has a language in which you can make statements. In a
proof checker, you want to accept a lot of these statements
(corresponding to true mathematical statements), but reject all those
that correspond to false mathematical statements. Of course, there
will always be *plenty* of statements which you can neither prove
nor disprove, but we won't worry about those here.

In any case, most formal systems work like this: there is a small
number of axioms considered true without needing proof, and then a
small number of inference rules, which take one or more existing
statements considered true, and spit out a new one. A particularly
important inference rule is the "modus ponens". If you've proved A and
proved that A implies B, then the modus ponens gives you a proof of B.
In many systems, the modus ponens is what gives proofs their branching
tree structure, and is also the best way to take "large leaps", since
the inference rules themselves tend to be quite primitive.

There's a lot of detail we'll skip over here. I'll mention the fact
that many axioms are actually axiom schema, and the fact that you have
to watch out for free vs. bound variables (or, in Metamath, distinct
variable conditions).

**The de Bruijn criterion**

At least, the outline above is true for systems satisfying the "de
Bruijn criterion", which Freek Wiedijk defines as having the
correctness of the system as a whole depends on the correctness of a
very small "kernel". There are also a bunch of other systems that do
more stuff in the proof steps, such as computations and so on.

Failure to satisfy the de Bruijn criterion has a number of
disadvantages. Many people talk about the philosophical issue of
knowing whether your results are really correct. Instead, I'll stress
the issue of translating proofs from one system to another. If your
steps are kind of big and ad hoc, it's much harder to know how to do
this.

You can see why people want to relax a bit from purely formal rigor,
but in fact systems satisfying the de Bruijn principle tend to admit
proofs just as concise as those that don't. See Freek's
"Comparing Mathematical Provers" for evidence of this claim.

So I'm going to assume that the de Bruijn principle is a good idea.
In a larger ecology of proof systems, those that don't satisfy it will
need translation into one that does, but not the converse.

**The Poincare principle**

On the other hand, doing things like numeric calculations by hand is
quite tedious. Freek considers systems that can prove the correctness
of calculations automatically as satisfying the "Poincare principle".

This principle raises the question of exactly what kind of software
you've got for doing these proofs. You can do an extremely simple
program that just checks proofs. Such a program won't satisfy the
Poincare principle. So-called "interactive theorem provers" such as
HOL have lots of subroutines that crank through a potentially large
numbers of primitive inference rules. These systems are inevitably
more complex, but do satisfy the Poincare principle.

Even so, automatic calculation, while very good for saving time when
trying to make proofs, is not good for clarity and transparency. In
particular, it's very difficult to know when these subroutines
("tactics" in HOL-speak) will succeed and when they'll fail. In
particular, a lot of the things you want to automate are undecidable
in general.

Automation also increases the risk of "version skew". Assuming that
the technology is steadily improving, then you can't use an older
system to check a proof script made with a newer system, even if it's
perfectly capable of following the primitive inference steps generated
by the latter.

This version skew is just one aspect of proof scripts being bound too
tightly to short-lived software, rather than based in eternal
mathematical truth. In particular, the way scripts are done in HOL
seem especially hackish to me, with a combination of commands which
modify the global state, output of half-chewed proofs as ML programs
in ASCII format, and interactions with a make system that, among other
things, leaves a lot of x86 machine code lying around. It seems to
work reasonably well in practice, but it can't possibly be the Right
Way.

**The way out**

The way out seems obvious to me: separate the proof generation and
proof checking tools. The checker can be very small and stable, while
proof generator can ride the feature treadmill just like most other
software.

Further, you can have a diversity of tools for proof generation. Some
might look just like HOL. Others might have a GUI, or perhaps Web,
front-end. You might well do small, specialized tools for generating
proofs in particular domains.

The interface between the tools is a proof file format, strongly
analogous to object code. Most people think of object code as an
unpleasant but necessary step in the process to get programs executed
efficiently, but I would rather like to sing its praises. In
particular, object code is key in binding together components written
in different programming langauges. I can easily see the same thing
happening in proofs.

In my opinion, one of the strongest advantages is that you'd be able
to explore new "source" languages for writing proofs, and still be
able to bind with existing results. There's a *lot* of stuff
that's been proved in mechanically-checkable form. Having to redo it
all every time you want to play with a new proof language would be a
crippling limitation.

**Notes on computational complexity**

The relationship between code and proof can now be seen as one of
straightforward engineering tradeoffs, rather than a deeply
philosophical inquiry into the nature of proof.

A big engineering question is: how long do these calculations take
when done in a proof system. Harrison's thesis
discusses this question fairly deeply. He put some engineering work
into efficient representations (using bit sequences rather than
iterated succession for the naturals, for example), so that
computations would be reasonably efficient. Even so, they're enough
slower so as not to be competitive with, say, a good bignum library.

In HOL, the cost of a primitive inference step is a certain amount of
CPU time. In the system I'm talking about, it's also a certain amount
of space on disk for the "proof object" file, and, probably more
interestingly, bandwidth when you're moving these things around the
Web.

Given that a good bignum library is a few orders of magnitude faster
than doing calculation in an HOL system, there's an overwhelming
temptation to just call the bignum library instead when you need
calculations. That's how the de Bruijn criterion gets violated :)

But we can look at the question: just how much are we losing by
forcing everything through primitive inference rules. Let's say you
have a program which does calculations, and also a proof that the
program gives the correct answers. How long does it take to "execute"
each step in the program? In the case of HOL, there's an upper-bound
answer of about eight orders of magnitude (based on the ARM model
included in HOL). Of course, in practice you'll usually be able to
simplify the proof a lot.

Similarly, the question of file size is deeply related to
compressibility. Kolmogorov complexity is the obvious relevant theory
here.

**So shall we run arbitrary code to check proofs?**

Given the obvious usefulness of code, it's tempting to put a virtual
machine into the proof checker and proof object file format. A proof
can include code when it also includes a proof that the code is
correct. Such a proof will also imply that the code is safe, by the
way. The checker will verify the proof of the code first, then
assuming it passes, just run it. The output can be entered into the
database of statements proved correct, just as if it had been spit out
from an inference rule.

The consequence, I think, is a potentially significant performance
boost, but not an increase in the logical power of the system. In all
cases where you'd execute code, you could just interpret the virtual
machine using inference rules instead.

It may be that this performance increase becomes worthwhile. Right
now, it doesn't seem to be. In fact, proof checkers on modern machines
seem to be almost ridiculously fast. Freek's Automath implementation
checks the entire Grundlagen der Analysis
in under a second (roughly double the time taken to interpret a single
ARM instruction in HOL). Similarly, Metamath takes about a second to
verify all of set.mm, including Hilbert spaces and other nontrivial
results.

So I think code is not worth it until we start building a library of
slow-running proofs, and that doing an efficient virtual machine once
becomes a better investment than optimizing each of the proofs.