hot corn, cold corn
In this article, we'll walk through some pieces of the LLInt. But first, we need to describe the problem that the LLInt solves.
Web browsers are like little operating systems unto themselves, into which you as a user install and uninstall hundreds of programs a day. The vast majority of code that a web browser sees is only executed once or twice, so for cold code, the name of the game is to do as little work as possible.
For cold code, a good bytecode interpreter can beat even a very fast native compiler. Bytecode can be more compact than executable code, and quicker to produce.
All of this is a long introduction to what the LLInt actually is: it's a better bytecode interpreter for JSC.
Before the introduction of the LLInt, the so-called "classic interpreter" was just a C++ function with a bunch of labels. Every kind of bytecode instruction had a corresponding label in the function.
As a hack, in the classic interpreter, bytecode instructions are actually wide enough to hold a pointer. The opcode itself takes up an entire word, and the arguments also take up entire words. It is strange to have such bloated bytecode, but there is one advantage, in that the opcode word can actually store the address of the label that implements the opcode. This is called direct threading, and presumably its effects on the branch prediction table are sufficiently good so as to be a win.
Anyway, it means that Interpreter.cpp has a method that looks like this:
// vPC is pronounced "virtual program counter".
JSValue Interpreter::execute(void **vPC)
{
    goto *vPC;

    // add dst op1 op2: Add two values together.
    op_add: {
        int dst = (int)vPC[1];
        int op1 = (int)vPC[2];
        int op2 = (int)vPC[3];
        fp[dst] = add(fp[op1], fp[op2]);
        vPC += 4;
        goto *vPC;
    }

    // jmp offset: Unconditional branch.
    op_jmp: {
        ptrdiff_t offset = (ptrdiff_t)vPC[1];
        vPC += offset;
        goto *vPC;
    }
}
It's a little different in practice, but the essence is there.
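To make the threading part concrete, here is a minimal, self-contained sketch of the technique, using GCC's labels-as-values extension. It is not JSC's code -- the toy instruction set and all the names are mine -- but it shows how a load-time pass rewrites each opcode word into the address of the label that implements it, after which dispatch is just an indirect goto:

#include <cstddef>
#include <cstdint>
#include <cstdio>

// A toy direct-threaded interpreter (GCC/Clang computed goto).  Illustrative
// only: the opcodes and names are made up, not JSC's.
enum { op_push, op_add, op_print, op_halt, op_count };

static void run(intptr_t *code, size_t len)
{
    static void *labels[op_count] = { &&push, &&add, &&print, &&halt };
    static const size_t lengths[op_count] = { 2, 1, 1, 1 };

    // The "threading" pass: overwrite each opcode word with the address of
    // the label that implements it.  JSC does this when bytecode is produced.
    for (size_t i = 0; i < len; ) {
        intptr_t op = code[i];
        code[i] = (intptr_t)labels[op];
        i += lengths[op];
    }

    intptr_t stack[64], *sp = stack;
    intptr_t *vPC = code;
    goto *(void *)*vPC;

    push:  *sp++ = vPC[1];               vPC += 2; goto *(void *)*vPC;
    add:   sp[-2] += sp[-1]; --sp;       vPC += 1; goto *(void *)*vPC;
    print: printf("%ld\n", (long)*--sp); vPC += 1; goto *(void *)*vPC;
    halt:  return;
}

int main()
{
    intptr_t program[] = { op_push, 1, op_push, 2, op_add, op_print, op_halt };
    run(program, sizeof(program) / sizeof(program[0]));
    return 0;
}

Note how the same word has to be able to hold either a small opcode index or a label address: that is exactly why the classic interpreter's bytecode words are pointer-wide.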
OK, so what's the problem? Well, readers who have been with me for a while might recall my article on static single assignment form, used as an internal representation by compilers like GCC and LLVM. One conclusion that I had was that SSA is a great first-order IR, but that its utility for optimizing higher-order languages is less clear. However, this computed-goto thing that I showed above (goto *vPC) is a form of higher-order programming!
Indeed, as the GCC internals manual notes:
Computed jumps contain edges to all labels in the function referenced from the code. All those edges have EDGE_ABNORMAL flag set. The edges used to represent computed jumps often cause compile time performance problems, since functions consisting of many taken labels and many computed jumps may have very dense flow graphs, so these edges need to be handled with special care.
Basically, GCC doesn't do very well at compiling interpreter loops. At -O3, it does get everything into registers on my machine, but it residualizes 54 KB of assembly, whereas I only have 64 KB of L1 instruction cache. Many other machines just have 32 KB of L1 icache, so for those machines, this interpreter is a lose. If I compile with -Os, I get it down to 32 KB, but that's still a tight fit.
Besides that, implementing an interpreter from C++ means that you can't use the native stack frame to track the state of a computation. Instead, you have two stacks: one for the interpreter, and one for the program being interpreted. Keeping them in sync has an overhead. Consider what GCC does for op_jmp at -O3:
; int currentOffset = vPC - bytecode + 1
; callFrame[CurrentOffset] = currentOffset
; ptrdiff_t offset = (ptrdiff_t)vPC[1]
; vPC += offset
; goto *vPC
First there is this strange prelude, which effectively stores the current vPC into an address on the stack. To be fair to GCC, this prelude is part of the DEFINE_OPCODE macro, and is not an artifact of compilation. Its purpose is to let other parts of the runtime see where a computation is. I tried, but I can't convince myself that it is necessary to do this at the beginning of every instruction, so maybe this is just some badness that should be fixed, if the classic interpreter is worth keeping.
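I can only guess at the shape of that macro, but the effect would be something like this sketch -- everything here is invented for illustration; it is not JSC's actual DEFINE_OPCODE:

// Hypothetical sketch: before each opcode body runs, publish the current
// bytecode offset into a known slot of the interpreted frame, so that other
// parts of the runtime (GC, debugger, exception unwinding) can tell where
// the computation is.
enum { kCurrentOffsetSlot = 0 };                // invented slot index

#define DEFINE_OPCODE(opcode) \
    opcode: \
        callFrame[kCurrentOffsetSlot] = (int)(vPC - bytecodeBegin) + 1;

// Inside the interpreter loop this would wrap each opcode body, e.g.:
//
//   DEFINE_OPCODE(op_jmp)
//   {
//       ptrdiff_t offset = (ptrdiff_t)vPC[1];
//       vPC += offset;
//       goto *vPC;
//   }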
The rest of the opcode is similar to the version that the LLInt produces, as we will see, but it is less compact.
getting the compiler to make the code you want
The goal of compilation is to produce good code. It can often be a good technique to start from the code you would like to have, and then go backwards and see how to produce it. It's also a useful explanatory technique: we can take a look at the machine code of the LLInt, and use it to put the LLInt's source in context.
In that light, here is the code corresponding to op_jmp, as part of the LLInt:
; offset += bytecode[offset + 1]
add    0x8(%r8,%r9,8),%r9d
; jmp *bytecode[offset]
jmpq   *(%r8,%r9,8)
That's it! Just 9 bytes. There is the slight difference that the LLInt moves around an offset instead of a pointer into the bytecode. The whole LLInt is some 14 KB, which fits quite comfortably into icache, even on mobile devices.
This assembly code is part of the LLInt as compiled for my platform, x86-64. The source of the LLInt is written in a custom low-level assembly language. Here is the source for the jmp opcode:
dispatchInt(8[PB, PC, 8])
You can define macros in this domain-specific language. Here's the definition of dispatchInt:
macro dispatchInt(advance)
    addi advance, PC
    jmp [PB, PC, 8]
end
Incidentally, the "offlineasm" compiler that actually produces the native code is written in Ruby. It's clear proof that Ruby can be fast, as long as you don't load it at runtime ;-)
but wait, there's more
We take for granted that low-level programs should be able to determine how their data is represented. C and C++ are low-level languages, to the extent that they offer this kind of control. But what is not often remarked is that for a virtual machine, its code is also data that needs to be examined at runtime.
I have written before about JSC's optimizing DFG compiler. In order to be able to optimize a computation that is in progress, or to abort an optimization because a speculation failed -- in short, to be able to use OSR to tier up or down -- you need to be able to replace the current stack frame. Well, with the classic interpreter written in C++, you can't do that, because you have no idea what's in your stack frame! In contrast, with the LLInt, JSC has precise control over the stack layout, allowing it to tier up and down directly from the interpreter.
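To give a flavor of what that control buys you, here is a deliberately simplified, made-up frame layout -- not JSC's -- in which every tier agrees on where each piece of interpreter state lives, so that a live frame written by one tier can be decoded and rebuilt by another:

#include <cstdint>

// Illustrative only: a frame layout shared by the interpreter and the
// optimizing compiler.  Because every slot sits at a known offset, the
// runtime can read a live interpreter frame and build the equivalent
// optimized frame (tiering up), or the reverse when a speculation fails
// (tiering down).
struct InterpreterFrame {
    InterpreterFrame* callerFrame;     // link to the caller's frame
    void*             returnPC;        // where to resume in the caller
    void*             codeBlock;       // which function this frame is running
    int32_t           bytecodeOffset;  // where we are in that function
    int32_t           argumentCount;
    uint64_t          registers[1];    // virtual registers; really variable-length
};

// Any tier (or the garbage collector, or a debugger) can decode a frame this
// way; the classic interpreter's native C++ frame offers no such guarantee.
inline uint64_t readVirtualRegister(const InterpreterFrame* frame, int i)
{
    return frame->registers[i];
}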
This can obviate the need for the baseline JIT. Currently, WebKitGTK+ and Apple builds still include the quick-and-dirty baseline JIT as an intermediate tier between the LLInt and the DFG JIT. I don't know the long-term plans for sure, but I would speculate that recent work on the DFG JIT by Filip Pizlo and others has had the goal of obsoleting the baseline JIT.
Note that in order to tier directly to the optimizing compiler, you need type information. Building the LLInt with the DFG optimizer enabled causes the interpreter to be instrumented to record value profiles. These profiles record the types of values seen by instructions that load and store values from memory. Unlike V8, which stores this information in executable code as part of the inline caches, in the LLInt these value profiles are in non-executable memory.
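As an illustration of the idea -- this is not JSC's actual ValueProfile -- a profile can be as simple as a slot of ordinary memory that the instrumented opcode overwrites with the last value it saw, and that the optimizing tier inspects when it decides how to specialize:

#include <cstdint>

// Sketch only: a profile bucket living in plain, non-executable memory.
// The instrumented load/store opcodes do one store here per execution.
struct ValueProfileBucket {
    uint64_t lastSeenValue;   // boxed JS value, in whatever encoding the VM uses
};

enum class Speculation { Int32, Double, Cell };

// Interpreter side: record what we saw.
inline void recordValue(ValueProfileBucket& bucket, uint64_t boxedValue)
{
    bucket.lastSeenValue = boxedValue;
}

// Compiler side: classify the recorded value when tiering up.  The tag tests
// below are invented; a real VM would use its own boxing scheme here.
inline Speculation classify(const ValueProfileBucket& bucket)
{
    uint64_t v = bucket.lastSeenValue;
    if (v & 1) return Speculation::Int32;    // pretend "integer" tag bit
    if (v & 2) return Speculation::Double;   // pretend "double" tag bit
    return Speculation::Cell;                // otherwise assume a heap object
}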
Spiritually speaking, I am all for run-time code generation. But it must be admitted that web browsers are a juicy target, and while it is unfair to make safe languages pay for bugs in C++ programs, writable text is an attack surface. Indeed in some environments, JIT compilation is prohibited by the operating system -- and in that sort of environment, a fast interpreter like the LLInt is a big benefit.
There are a few more LLInt advantages, like being able to use the CPU's overflow flags, doing better register allocation, and retaining less garbage (JSC scans the stack conservatively). But the last point I want to mention is memory: an interpreter doesn't have to generate executable code just to display a web page. Obviously, you need to complement the LLInt with an optimizing compiler for hot code, but laziness up front can improve real-world page load times.
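To unpack the overflow-flags point: a hand-written interpreter can do an integer add and branch to its slow path directly on the CPU's overflow flag, two instructions in total; in C++ you have to spell the check out and hope the compiler gets there. A sketch, using the GCC/Clang __builtin_add_overflow builtin:

#include <cstdint>

// The fast path of a JS "+" on two int32 operands.  In LLInt-style assembly
// this is an add followed by a jump-on-overflow to the slow path; here it is
// approximated with a compiler builtin.
inline bool addInt32Fast(int32_t a, int32_t b, int32_t& result)
{
    if (__builtin_add_overflow(a, b, &result))
        return false;   // overflowed: slow path, e.g. fall back to a double
    return true;
}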
in your webs, rendering your kittens
So that's the LLInt: a faster bytecode interpreter for JSC.