Right now I am working on taking a specification of virtual operations, the Guile RTL VM, and translating it to assembler code. The idea is to stack the resulting assembler instructions and compile them to machine code instead of using named gotos in C. This gives less execution overhead but more code.
Some tests show that simple loops, like incrementing a counter and summing it, as well as simple list operations, get a 3-4x speedup from this. Of course, if expensive VM operations dominate the run time, you will not gain much from such an approach.
The drawback of the method is that, for example, a simple addition may look like this:
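(define-vm-inst vm-add 79 ((U8_U8_U8_U8 dst x y))
  ;; load both operands
  (inst mov call-1 (local-ref x))
  (inst mov call-2 (local-ref y))
  ;; take the slow path unless bit 1 (the fixnum tag) is set in both
  (inst test call-1 2)
  (inst jmp #:z slow:)
  (inst test call-2 2)
  (inst jmp #:z slow:)
  ;; fast path: add, bail out on overflow, remove the extra tag
  (inst add call-2 call-1)
  (inst jmp #:o slow:)
  (inst sub call-2 2)
  (inst mov (local-ref dst) call-2)
  (inst jmp out:)
  ;; slow path (target of slow:): call the generic C addition
  (inst mov call-2 (Q rsp))
  (c-call scm_sum call-1 call-2)
  (inst mov (local-ref dst) rax))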
Disclaimer: this is not debugged yet, but you get the picture: 14 instructions!
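To make the fast and slow paths of that listing easier to follow, here is a small model of the same logic in Python rather than in the assembler DSL. It is only a sketch under assumptions: fixnums are taken to be tagged as 4*n + 2 (which is what the `test ... 2` and `sub ... 2` steps suggest), registers are taken to be 64-bit signed, and `generic_add`, `make_fixnum` and `fixnum_value` are hypothetical helpers, with `generic_add` standing in for the `scm_sum` call.

```python
TAG = 2  # assumed fixnum tag: low two bits are binary 10
INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1


def make_fixnum(n):
    """Tag a small integer: n is stored as 4*n + 2 (assumption)."""
    return (n << 2) + TAG


def fixnum_value(word):
    """Recover the integer from a tagged word."""
    return word >> 2


def generic_add(x, y):
    """Hypothetical stand-in for the slow C call (scm_sum in the listing)."""
    return ("slow", x, y)


def vm_add(x, y):
    # test call-1 2 / test call-2 2: both words must carry the fixnum tag
    if not (x & TAG) or not (y & TAG):
        return generic_add(x, y)
    total = x + y  # add call-2 call-1
    # jmp #:o slow: Python ints do not overflow, so check the range by hand
    if not (INT64_MIN <= total <= INT64_MAX):
        return generic_add(x, y)
    # sub call-2 2: the sum carries two tags, remove one of them
    return total - TAG
```

For example, `vm_add(make_fixnum(3), make_fixnum(4))` stays on the fast path and yields the tagged word for 7, while a word without the tag bit, or a sum outside the 64-bit range, ends up in `generic_add`.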
The problem with verbose code like the vm-add expansion above is that it increases the compilation time and the size of the resulting programs. What I would like to have is the possibility of macro instructions that get translated to the specified code directly in hardware, so that one would just need to write:
(inst mov call-1 (ref rbp 12))
(inst mov call-2 (ref rbp 13))
(inst macrocall 134)
(inst mov (ref rbp 14) rax)
i.e. just like a C call on amd64 (Linux), but the macrocall would be a hardware expansion on the CPU to the code found in slot 134. In all, a good 3x reduction in the number of instructions, and much less complexity (no jumps).
A problem with this setup is that, to use the machine registers effectively, you may want to specify which registers the macrocall should use, like:
(inst macrocall 134 call-1 call-2)
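The macrocall idea, both with fixed and with explicitly named registers, can be modelled in software as a lookup in a slot table. The sketch below is a Python model, not a proposal for how real hardware would do it; `MACRO_TABLE`, `define_macro`, `macrocall` and `add_expansion` are hypothetical names, the register file is just a dict, and the expansion only covers the fast path of the addition (both operands assumed to be tagged fixnums, no overflow).

```python
# Slot number -> expansion. An expansion receives the register file (a dict)
# and the names of its two operand registers, and leaves its result in rax.
MACRO_TABLE = {}


def define_macro(slot, expansion):
    """Install an expansion in the given slot."""
    MACRO_TABLE[slot] = expansion


def macrocall(slot, regs, src1="call-1", src2="call-2"):
    """Run the expansion in `slot`; the operand registers default to
    call-1 and call-2, as in the fixed-register form of macrocall."""
    MACRO_TABLE[slot](regs, src1, src2)


def add_expansion(regs, src1, src2):
    # Fast path only (assumption): tagged add, then drop the extra tag.
    regs["rax"] = regs[src1] + regs[src2] - 2


define_macro(134, add_expansion)

# Fixed registers: 3 and 4 tagged as 4*n + 2 are 14 and 18.
regs = {"call-1": 14, "call-2": 18, "rax": 0}
macrocall(134, regs)

# Explicit registers: the same slot, but reading from rbx and rcx.
other = {"rbx": 6, "rcx": 10, "rax": 0}
macrocall(134, other, "rbx", "rcx")
```

The point of the second call is that the expansion itself does not change; only the operand names passed along with the slot number do.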
Another issue is that the same expansion code sits in the CPU for all processes, while different processes might want to have their own expansion code in there. I am not sure how to solve this. Maybe each process could have an id that describes which set of expansions it uses; when a new process starts to execute on the core, that key can be used to recognize whether a different set of macro code should be loaded into the CPU when it is asked for. Of course this can lead to expensive context switches. But I find it an interesting feature.
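The id idea can be sketched as a lazy reload: the core remembers which expansion set it currently holds and only swaps it when a process with a different id actually issues a macrocall. Again this is a Python model under assumptions, with hypothetical names throughout (`Core`, `Process`, `EXPANSION_SETS`, and the two example sets); the `reloads` counter is there only to make the cost of switching visible.

```python
from collections import namedtuple

# A process only needs a name and the id of the expansion set it uses.
Process = namedtuple("Process", ["name", "expansion_id"])

# Hypothetical expansion sets: id -> {slot: expansion}.
EXPANSION_SETS = {
    "guile": {134: lambda a, b: a + b - 2},  # tagged fixnum add, fast path only
    "other": {134: lambda a, b: a * b},      # some other runtime's slot 134
}


class Core:
    def __init__(self):
        self.loaded_id = None  # id of the expansion set currently on the core
        self.table = {}
        self.reloads = 0       # how many times a set had to be loaded

    def macrocall(self, process, slot, *args):
        # Reload lazily: only when a macrocall is asked for and the
        # running process uses a different set than the loaded one.
        if self.loaded_id != process.expansion_id:
            self.table = dict(EXPANSION_SETS[process.expansion_id])
            self.loaded_id = process.expansion_id
            self.reloads += 1
        return self.table[slot](*args)
```

Two processes that share the same id run back to back without a reload, while alternating between processes with different ids reloads on every switch, which is the expensive case mentioned above.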