Much of the time when I'm looking at proof systems, I feel like a novice. There's so much highly technical stuff, and it's not easy for me to figure out which of it is even worth learning.
Even so, occasionally I see clear glimpses. Sometimes, these glimpses don't even seem to be written down anywhere else. That's what I'll try to communicate tonight.
Formal logical systems
Here's a very simplified summary of formal logical systems. I strongly encourage you to get familiar with a real system, but if not, this will have to do.
Each such system has a language in which you can make statements. In a proof checker, you want to accept a lot of these statements (corresponding to true mathematical statements), but reject all those that correspond to false mathematical statements. Of course, there will always be plenty of statements which you can neither prove nor disprove, but we won't worry about those here.
In any case, most formal systems work like this: there is a small number of axioms considered true without needing proof, and then a small number of inference rules, which take one or more existing statements considered true, and spit out a new one. A particularly important inference rule is the "modus ponens". If you've proved A and proved that A implies B, then the modus ponens gives you a proof of B. In many systems, the modus ponens is what gives proofs their branching tree structure, and is also the best way to take "large leaps", since the inference rules themselves tend to be quite primitive.
There's a lot of detail we'll skip over here. I'll mention the fact that many axioms are actually axiom schema, and the fact that you have to watch out for free vs. bound variables (or, in Metamath, distinct variable conditions).
The de Bruijn criterion
At least, the outline above is true for systems satisfying the "de Bruijn criterion", which Freek Wiedijk defines as having the correctness of the system as a whole depends on the correctness of a very small "kernel". There are also a bunch of other systems that do more stuff in the proof steps, such as computations and so on.
Failure to satisfy the de Bruijn criterion has a number of disadvantages. Many people talk about the philosophical issue of knowing whether your results are really correct. Instead, I'll stress the issue of translating proofs from one system to another. If your steps are kind of big and ad hoc, it's much harder to know how to do this.
You can see why people want to relax a bit from purely formal rigor, but in fact systems satisfying the de Bruijn principle tend to admit proofs just as concise as those that don't. See Freek's "Comparing Mathematical Provers" for evidence of this claim.
So I'm going to assume that the de Bruijn principle is a good idea. In a larger ecology of proof systems, those that don't satisfy it will need translation into one that does, but not the converse.
The Poincare principle
On the other hand, doing things like numeric calculations by hand is quite tedious. Freek considers systems that can prove the correctness of calculations automatically as satisfying the "Poincare principle".
This principle raises the question of exactly what kind of software you've got for doing these proofs. You can do an extremely simple program that just checks proofs. Such a program won't satisfy the Poincare principle. So-called "interactive theorem provers" such as HOL have lots of subroutines that crank through a potentially large numbers of primitive inference rules. These systems are inevitably more complex, but do satisfy the Poincare principle.
Even so, automatic calculation, while very good for saving time when trying to make proofs, is not good for clarity and transparency. In particular, it's very difficult to know when these subroutines ("tactics" in HOL-speak) will succeed and when they'll fail. In particular, a lot of the things you want to automate are undecidable in general.
Automation also increases the risk of "version skew". Assuming that the technology is steadily improving, then you can't use an older system to check a proof script made with a newer system, even if it's perfectly capable of following the primitive inference steps generated by the latter.
This version skew is just one aspect of proof scripts being bound too tightly to short-lived software, rather than based in eternal mathematical truth. In particular, the way scripts are done in HOL seem especially hackish to me, with a combination of commands which modify the global state, output of half-chewed proofs as ML programs in ASCII format, and interactions with a make system that, among other things, leaves a lot of x86 machine code lying around. It seems to work reasonably well in practice, but it can't possibly be the Right Way.
The way out
The way out seems obvious to me: separate the proof generation and proof checking tools. The checker can be very small and stable, while proof generator can ride the feature treadmill just like most other software.
Further, you can have a diversity of tools for proof generation. Some might look just like HOL. Others might have a GUI, or perhaps Web, front-end. You might well do small, specialized tools for generating proofs in particular domains.
The interface between the tools is a proof file format, strongly analogous to object code. Most people think of object code as an unpleasant but necessary step in the process to get programs executed efficiently, but I would rather like to sing its praises. In particular, object code is key in binding together components written in different programming langauges. I can easily see the same thing happening in proofs.
In my opinion, one of the strongest advantages is that you'd be able to explore new "source" languages for writing proofs, and still be able to bind with existing results. There's a lot of stuff that's been proved in mechanically-checkable form. Having to redo it all every time you want to play with a new proof language would be a crippling limitation.
Notes on computational complexity
The relationship between code and proof can now be seen as one of straightforward engineering tradeoffs, rather than a deeply philosophical inquiry into the nature of proof.
A big engineering question is: how long do these calculations take when done in a proof system. Harrison's thesis discusses this question fairly deeply. He put some engineering work into efficient representations (using bit sequences rather than iterated succession for the naturals, for example), so that computations would be reasonably efficient. Even so, they're enough slower so as not to be competitive with, say, a good bignum library.
In HOL, the cost of a primitive inference step is a certain amount of CPU time. In the system I'm talking about, it's also a certain amount of space on disk for the "proof object" file, and, probably more interestingly, bandwidth when you're moving these things around the Web.
Given that a good bignum library is a few orders of magnitude faster than doing calculation in an HOL system, there's an overwhelming temptation to just call the bignum library instead when you need calculations. That's how the de Bruijn criterion gets violated :)
But we can look at the question: just how much are we losing by forcing everything through primitive inference rules. Let's say you have a program which does calculations, and also a proof that the program gives the correct answers. How long does it take to "execute" each step in the program? In the case of HOL, there's an upper-bound answer of about eight orders of magnitude (based on the ARM model included in HOL). Of course, in practice you'll usually be able to simplify the proof a lot.
Similarly, the question of file size is deeply related to compressibility. Kolmogorov complexity is the obvious relevant theory here.
So shall we run arbitrary code to check proofs?
Given the obvious usefulness of code, it's tempting to put a virtual machine into the proof checker and proof object file format. A proof can include code when it also includes a proof that the code is correct. Such a proof will also imply that the code is safe, by the way. The checker will verify the proof of the code first, then assuming it passes, just run it. The output can be entered into the database of statements proved correct, just as if it had been spit out from an inference rule.
The consequence, I think, is a potentially significant performance boost, but not an increase in the logical power of the system. In all cases where you'd execute code, you could just interpret the virtual machine using inference rules instead.
It may be that this performance increase becomes worthwhile. Right now, it doesn't seem to be. In fact, proof checkers on modern machines seem to be almost ridiculously fast. Freek's Automath implementation checks the entire Grundlagen der Analysis in under a second (roughly double the time taken to interpret a single ARM instruction in HOL). Similarly, Metamath takes about a second to verify all of set.mm, including Hilbert spaces and other nontrivial results.
So I think code is not worth it until we start building a library of slow-running proofs, and that doing an efficient virtual machine once becomes a better investment than optimizing each of the proofs.