A memory trick
I don't remember where I saw it but someone wrote that using CPS style prolog implementation is costly. And It is a little indeed. It will punish you with a factor of about 3x and the irregularities of garbage collection will hit you with a factor of 6x or more if it is done in passes. At least this is the timings I get on a prolog example written in C using a tail call trampoline and proper closures allocated from the heap. Now the same limiting feature would be present in a implementation in a Lisp or any other higher level language with proper closures and a working GC. The cool thing with the closures needed to do backtracking are that they don't need to allocate the closures on the heap. In stead one can maintain a stack to do the trick. I did implement a quick version using this method and the figures above came as a conclusion. So in principle there is a 4-5 fold of savings possible to go from the VM to Naitive compilation. A note here are that the overhead of the VM masks GC issues and there is not much gained from trying to squeeze extra vm instructions or compiler directives into generating special op-codes to allocate the closures from the stack in a special manner.
So how did I implement this. Well actually I have a small compiler written in scheme that output's c-code from a scheme like language, taking advantage of Alex Shinn's excellent fmt library. In this package there are element's to hide tail call trampoline semantics and closure generation to achieve close to scheme semantics. So in order to allocate the closere from a stack I needed an allocated array and companion pointer lam-stack, I added two extra macros, s-lambda, f-lambda. s-lambda allocates just the closure from the next bytes from the stack and are used in closures typically generated by (%and a b ...). f-lambda are the lambda that defines the next continuation at a junction. for this closure we can allocate it again from the stack and then when the lambda is run we simply set the stack pointer lam-stack to the next byte of the
end of the chunk of environment data for the closure that it needs to express the continuation. (that data is copied over, a strategy that works because we do not set! variables) In all, f-lambda semantics imply that we do not consume the stack and a quite small stack is enough for
many implementations.
The careful reader will recognize that this strategy will consume much more stack then needed. f-lambda sematics can be improved to do much less copying by noticing that the variables carried over to the next continuation follows a total order on the set of subsets of variables needed in the sequence of closures at a junction. This means that one can copy the variables only ones and then reuse the stackframe (so it is apparent that the current versions will need quite a lot of stack space at each junction. Also for the s-lambdas one can save some space by similar techniques. This can decrease the amount of copying and needed stack space.
How does the prolog macro package work? Here is %and%:
(define-syntax :and:
(syntax-rules ()
((_ w x) (Y w x))
((_ (cc cut fail s) (:cut:) . l)
(:and: (cc cut cut s) . l))
((_ (cc cut fail s) x . l)
(:let: ((ccc (:s-lambda: s (fail) (:and: (cc cut fail s) . l))))
(Y (ccc cut fail s) x)))))
and %or%:
(define-syntax :or:
(syntax-rules ()
((_ w x) (Y w x))
((_ (cc cut fail s) x . l)
(:let*: ((P (:scm-call: c_newframe))
(f (:f-lambda: s ()
(:scm-call: c_unwind P)
(or-work P (cc cut fail s) . l))))
(Y (cc cut f s) x)))))
Note here that cc is the continuation cut is a failure of an earlier junction. And here is how one defines a prolog function.
(:define: lam-stack (queens3 UnplacedQs SafeQs Qs)
;(:pp: "queens3 Unplaced ~a -> ~a" UnplacedQs SafeQs)
(:match: (UnplacedQs)
( _ (:var: (Q UnplacedQs1)
(:and: (:call: selectq Q UnplacedQs UnplacedQs1)
(:not: (:call: attack Q SafeQs))
(:call: queens3 UnplacedQs1
(:scm-call: c_cons Q SafeQs) Qs))))
(() (:=: SafeQs Qs))))
And we see that there are overhead and some carefulness in the coding but it's a lot of help over trying to write c-code that consists of 2000 lines of code compared to the 80 lines used to write the application. N.B. I didn't find a way to format the code snippets in a good way, sorry!
There are two more way's that can save time. Inlining and using fixnum arithmetics in stead of SCM based arithmetics. Also many new bottlenecks start to appear like the output printing routine. Anyway the conclusion are that a CPS + tail call trampoline is not to bad as a tool to do backtracking searches if done with the right compiler.
Have fun