Older blog entries for wingo (starting at number 372)

to guadec!

I'm very happy to say that I'm going to Berlin on Friday, for GUADEC, the GNOME conference.

This will be my seventh GUADEC, and my first as an Igalian. I'm very much looking forward to seeing old friends, as well as to meeting new folks.

I'm especially looking forward to soaking in the energy of GNOME 3. GNOME 3 is the best user-facing project that Free Software has going right now -- it's a dynamic group of creative, talented people, making users' experiences with computers beautiful and natural.

The only other Free Software project that is focused on users in the same way is Android, and it is its own corporate kettle of fish. I hope that it's GNOME that ultimately wins, against the hoarders of the software world, from the workstation down to the phone, because GNOME brings with it our participatory culture of software as a means for human emancipation: that we can work together to make a better world. We've been working on this Free Software thing for some 30 years now; we're not a fad, or a hype: we are people working together, making our world. We can win, and humanity deserves it.

. . .

This year GNOME is again sharing a conference with KDE, another "desktop environment", to use the outdated term. I have pleasant memories of hanging out with KDE folk at the last joint conference we had, but I don't think it's healthy. Last time we were together because it was convenient for Nokia. Now that Nokia sold us both out, it doesn't make sense.

I do wish we could collaborate in meaningful ways -- and we do, in lower levels in the stack, and partly at the window manager level -- but we're working in the same space with two different cultures, and colocation ultimately dilutes the impact these conferences can have among the members of these two projects. So my KDE friends, let's have one last fling this weekend in Berlin, and thereafter see each other in Brussels, at FOSDEM :)

My seven GUADECs marks me as some kind of old fart now, I suppose, and it's true that my own work with GNOME is mostly in the past. I used to work on GStreamer, but don't anymore; Guile is mostly where I'm at, free-software-wise now. I wrote some bindings for the GNOME libraries back in the 2.x days but have not had the time to update them to the modern 3.x world. I fear that I'll just have to pass that project on, though there does not appear to be much hacker interest in maintaining API compatibility.

Anyway, as I said, I mostly work with Guile now. If you're interested at all in Guile, please pull me aside, and I'll be happy to talk your ear off ;-) I've also been doing work with the V8 JavaScript implementation recently, for work, and I'd also be happy to talk about that: compilers, optimization, and what-not. But kids, remember: JavaScript is a fad. Guile is the official scripting language of GNOME.1

See you in Berlin!

1;-)

Syndicated 2011-08-03 21:19:02 from wingolog

a closer look at crankshaft, v8's optimizing compiler

Continuing in my series of articles on V8, in this article I'd like to take a closer look at V8's optimizing compiler, with a focus on the Hydrogen intermediate language.

brief summary

As my readers might recall, Crankshaft is the marketing name of V8's optimizing compiler. It complements the "full" compiler by optimizing hot functions, while ignoring uncommon cases. It translates the JavaScript abstract syntax tree to a high-level static single-assignment (SSA) representation called Hydrogen, and tries to optimize that Hydrogen graph. Then the Hydrogen is translated to the machine-specific Lithium low-level language, which facilitates register allocation and, finally, code generation.

The point of the Hydrogen translation is threefold:

  1. To permit inlining. Inlining is truly the mother of all optimizations, in that it permits more optimizations to occur.

  2. To permit temporary values to be represented as untagged integer or double values. In JavaScript, you need type feedback to do this with any degree of success. Unboxing reduces memory pressure and permits Crankshaft to use efficient native machine instructions. But as you would expect, this unboxing needs to be done carefully.

  3. To facilitate loop-invariant code motion and common subexpression elimination.

The Hydrogen SSA representation facilitates all of these operations. This article takes a closer look at how it does so.

static single assignment


(click for accessible SVG)

Hydrogen consists of values--instructions and phis--contained in basic blocks, which themselves are contained in a graph.

The standard SSA notation associates a named value with every expression:

t1 = x + y;
t2 = 3;
t3 = t1 + t2;
...

It's useful to write it out this way, but these temporary names are not strictly necessary: since each instruction defines a temporary, we can use the identity (address) of each instruction as its name, for the purpose of SSA.

A later pass will allocate storage for all of these temporaries -- on the stack, on the heap, or unboxed in general-purpose or SSA/Neon registers. At first I was quite confused about the term "instruction", as we functional programmers tend to use "expression", but it's something you get used to after a while, and it's useful when you use it to refer to the named definition in addition to the expression that produces it.

An instruction is a kind of value -- an entity that defines a name. There other kind of value is the phi. Phis occur at control-flow joins, and are logically associated with the beginning of a block. My previous article discusses phis in detail.

All values have a pointer to their containing block and to their uses. A use is when a later instruction uses the value computed (defined) by an earlier instruction. Values also have a pointer to the next instruction in their block. It is cheaper to traverse instructions forward, in the direction of control and data flow, than it is to go backward. Values also have fields for their type, their representation, a flags bitfield, and a pointer to a range structure. We will look at these in more detail later.

The last instruction in a block is a control instruction. No one uses the value defined by a control expression -- their purpose is to transfer control to another basic block. The set of possible blocks that may follow a block in control-flow order are the successors of a block. Preceding blocks are predecessors. Every block has an array of pointers to its predecessors and successors, and also to its dominator block, and to all the blocks that it dominates.

Now, friends, when I first saw this I was disgusted. What a rat's nest of pointers! But then I read Kennedy's great Compiling With Continuations, Continued paper and it became clear that it is very difficult to have an efficient functional-style intermediate representation. The diagrams above are from a part of his paper in which he suggests a mutable IR. In my opinion, if he's gone that far, he might as well finish the job and eliminate temporary identifiers entirely, as Hydrogen does.

electrolysis

Crankshaft has a funny job to do. Not only does it have to optimize JavaScript, which is hard enough; it also has to do so in a way that permits on-stack replacement to optimize a long-running loop, without exiting the loop.

In practice what this means is that Crankshaft needs to walk the abstract syntax tree (AST) to emit Hydrogen code in the same way as the full compiler. It needs to simulate what the full compiler is doing: what is on the stack, which variables are heap-allocated and which are not, and, for named local variables, what HValue is bound to that variable. This simulation is kept in an environment.

This AST-to-Hydrogen translation process also handles phi insertion. So unlike the textbook dominance-frontier algorithm for optimal phi insertion, the HGraphBuilder simply adds phi values whenever a block has more than one predecessor, for every variable in the environment that is not always bound to the same value. For loops, a phi is added for every variable live in the environment at that time. A later pass attempts to prune the number of phis, where it can.

type feedback

The AST that was given to the crankshaft compiler has integer identifiers assigned to each node. These identifiers are assigned predictably, so they are the same identifiers that the "full-codegen" compiler saw, and they are embedded in the inline caches used by the full-codegen code. This correspondence allows crankshaft to know the types of the values that have been seen at any given call site, property access, etc.

This information is used by the graph builder to initialize the type field of each HValue, to help in type inference. Additionally, property accesses and function calls can record the precise type of object being accessed, or the function that was called, so that Crankshaft has an opportunity to inline these function and method calls.

inlining

When the graph builder reaches a call, it can try to inline it. Note that this happens very early, before any optimizations are made on the source code. This allows the optimizations to have more visibility, and allows for more code motion. (In general, V8 will not move instructions across a call boundary.)

There are lots of heuristics and conditions that affect inlining, and these are in flux. Some conditions that must currently be met for inlining to proceed are:

  1. The function to be inlined is less than 600 characters long.

  2. Neither the inner nor the outer functions may contain heap-allocated variables. (In practice, this means that neither function may contain lexical closures.)

  3. Inlining of for-in, with, and some other expression types is not currently supported.

  4. There is a limit on the maximum inlining depth.

  5. A function's call to itself will not be inlined.

  6. There is a limit to the number of nodes added to the AST.

source-to-source optimization passes

As parsing proceeds, Crankshaft generates specific HInstruction nodes and packs them into HBasicBlocks. There is a large set of about 120 instructions, which allows Crankshaft to reason precisely about the various operations and their representations.

At this point, inlining is complete, so Crankshaft moves to focus on representation of temporary values. The goal is to unbox as many values as possible, allowing the use of native machine instructions, and decreasing the allocation rate of tagged doubles. The Hydrogen SSA language directly facilitates this control-flow and data-flow analysis.

The following sections discuss the various source-to-source passes that are made on the Hydrogen IR. It's a lot of detail, but I'm sure that someone will find it useful.

block reordering

Crankshaft reorders the array of blocks in the HGraph to appear in reverse post-order, equivalent to a topological sort. In this way, iterating the array of blocks from beginning to end traverses them in data-flow order.

dominator assignment

The reverse post-order sort from the previous step facilitates calculation of the dominator for each block. The entry block dominates its successors. Other blocks are iterated over, in order, assigning them the dominator of their predecessors, or the common dominator of their predecessors if there is more than one.

mark dead subgraphs

It can often be the case that a function being optimized does not have full type information. Some branches in the original function may never have been executed, and so code on those branches does not have any type feedback. If the parser reaches such a branch -- and it will know at parse-time, because that's when it has the type-feedback information -- then it inserts an unconditional soft deoptimization, to force the collection of more type information, should that branch be reached in optimized code.

However, instead of causing a branch to a terminal deoptimization block, the parse continues, presumably because Crankshaft needs to proceed with its abstract interpretation of the AST. For that reason, the soft deoptimization appears in the instruction stream, not as the last instruction in the block.

To avoid the compiler wasting its time trying to optimize loops and move code in parts of the graph that have never been reached, this dead-subgraph pass puts a mark on all blocks dominated by a block containing an unconditional soft deoptimization. As you can see, this analysis is facilitated by the SSA dominator relation.

redundant phi elimination

The naïve phi placement algorithm detailed above can put phis where they are not needed. This pass eliminates phi values whose input is always the same, replacing the uses with the use of the value.

Like a number of phi-centered optimizations, elimination of one phi can expose other phis for elimination. For this reason, this and many other algorithms use a worklist.

In this case, all phis are placed on a work list. Then, while there are any phis on the worklist, a phi is taken off the list, and if it is eliminatable, it is replaced. Now here's the trick: if the phi was in turn used by any other phi, the phi uses are pushed onto the list, if they weren't there already. Iteration proceeds until the worklist is empty.

To keep the big-O order of worklist operations down, you need a flag for each element indicating whether the element is already in the list. The bit should either be in the objects being traversed, or in a separate bitvector. In this particular case -- and in this case only, as far as I can tell -- there is no such bit, so the algorithm can take too much time, in certain cases.

dead phi elimination, and phi collection

Some phi values are dead, having no real uses, neither direct nor indirect (through some other phi). This pass eliminates any phi that doesn't have a real use. It's somewhat like a garbage collection algorithm, and also uses a worklist.

Once this pass has run, we have minimized the set of phis, so we can collect them all into an array and assign them indices. These indices can be used in the future bitvectors, to contain worklist-driven algorithms.

As the phis are collected, if any undefined value is found to reach a phi, or arguments can reach a phi only via some inputs, then optimization is aborted entirely.

representation inference

Every HValue has a type field, representing a coarse approximation of the types seen at a particular AST node, and a representation field, representing a specific choice of how to represent a value. The goal, as I have said, is to represent temporaries as untagged int32 or double values. Type and representation are related, but they are treated separately in Crankshaft. This pass is about representation inference.

Firstly, Crankshaft looks at all of the phis, and sees which ones are connected. This is an N2 fixed-point iteration. Then, for each phi, if any operand of the phi or any connected phi may not be converted to an int32, Crankshaft pessimistically marks the phi as not convertible to an int32. This pass tries to prevent too many representation changes for phi values.

Finally, for all values with flexible representation--phis, Math.abs, and binary bitwise and arithmetic operations--Crankshaft determines the representation to use for that value. If the representation can be inferred directly from the inputs, then that representation is used. Otherwise, some heuristics are run on the representations needed by the use sites of the value. If all uses are untagged, then the representation will be untagged, as an int32 if possible, unless any use treats the value as a double.

The end result is that, hopefully, many values have been unboxed to int32 or double values.

range analysis

This pass looks at each value, and tries to determine its range, if Crankshaft was actually able to allocate a specialized representation for it. In practice this is mainly used for values represented as integers. It allows HInstructions to assert various properties about the instruction, such as lack of overflow, or lack of negative zero values. These properties can influence code generation.

The algorithm propagates constraints forward on all control paths, so that control instructions can produce different ranges for the same value on different control paths. Ranges reaching a phi node are unioned together.

type inference and canonicalization

I mentioned that representation was related to type. Here I should be more specific. In Crankshaft, types are logically associated with objects. One can treat any object as an array, for the purposes of a calculation, but that treatment does not coerce the object itself to another type. Representation is about how to use an object; type is about the object itself.

Type inference can help eliminate runtime checks. If we can infer that a certain value is not only a tagged value, but is not a smi (small integer--a fixnum), then we can access its map directly without having to do the smi check. If we are storing an element in an array, and we know it is a smi, we don't need to emit a write barrier. So, type inference is another forward control-flow propagation algorithm, with a fixed-point iteration for phis in loop headers.

After type inference is run, each instruction in the whole graph is asked to canonicalize itself, which basically eliminates useless instructions: HToInt32 of a value with int32 representation, HCheckNonSmi of a value with non-smi type, etc.

deoptimization marks

When can you coerce undefined to an unboxed double? When you need a double for math, but you aren't going to eventually compare it to another double. The problem is that undefined coerces to NaN, but undefined == undefined and NaN != NaN. So, crankshaft marks any value used by a numeric compare as deoptimize-on-undefined. If the value is a phi, all connected phis must check for undefined, which can be a pretty big lose. But hey, that's JavaScript!

representation changes

Up to this point we've been dealing with values on a fairly high level, assuming that all of the definitions will be compatible with the uses. But as we chose to represent some temporaries as untagged values, there will need to be explicit conversions between representations -- hopefully not too many, but they will be needed. This pass inserts these representation changes.

This pass uses the truncating-to-int32 and deoptimize-on-undefined flags calculated in previous phases, to tell the changes what conditions they need to watch out for. Representation changes for phis happen just before branching to the block with the phi -- effectively, just before calling the block.

minus zero checks

The previous pass probably inserted a number of HChange instructions. For any HChange from int32 (which must be to tagged or double), Crankshaft traces dataflow backwards to mark the HChange(s) that produced the int32 as deoptimizing on negative zero. This allows parts of the graph to use fast int32 operations, while preserving the semantics of -0 should it appear, albeit more slowly (via a deoptimization).

stack check elimination

Loops need to be interruptable, and V8 does so by placing a stack check at the beginning of each loop iteration. If the runtime wants to interrupt a loop, it resets the stack limit of the process, and waits for the process's next stack check.

However if a call dominates all backward branches of a loop, then the loop can be interrupted by the stack check in the callee's prologue, so the stack check in the loop itself is unneeded and can be removed.

global value numbering: licm

I've been trying to get here for all this long page. Thanks to readers that have kept with it!

At this point, Hydrogen has served two of its three purposes: inlining and unboxing. The third is loop-invariant code motion (LICM) and common subexpression elimination (CSE). These important optimizations are performed by the perhaps misnamed global value numbering (GVN) pass.

LICM is about code motion: moving code out of hot loops. So what prevents an HInstruction from being moved? Two things: data dependencies, and side effects.

The LICM algorithm takes care of both of these. For every instruction in a loop, if it only depends on values defined outside the loop, and the side effects of the loop as a whole do not prevent the move, then LICM hoists the instruction to the loop header. If you do this in reverse post-order, you can hoist whole chains of expressions out of loops.

Every hydrogen instruction has a bitfield indicating the side effects that it causes, and a corresponding bitfield for the side effects that can affect it. In addition there is a flag indicating whether code motion is allowed for that instruction or not. These flags constitute a simple effects analysis -- if instruction B directly follows A, and A does not cause any side effects that B depends on, and B does not use the value from A, then B may be moved up before A.

Since the effects fields are simple bitfields, one can easily compute the side effects of a whole block by iterating over the instructions, doing a bitwise OR at each step. This is what LICM does -- it computes the side effects for the entire loop, and moves any instructions that it can to the loop header.

global value numbering: cse

Global value numbering, as implemented in V8, uses this technique in the opposite direction: instead of trying to hoist expressions up, it propagates them down the control-flow graph, trying to kill common subexpressions. A common subexpression is a Hydrogen value B that is "equal" in some sense to a previous instruction A, and whose value will not be affected by any side effects that occur on the control-flow paths between A and B. In this context, "value numbering" essentially means "define appropriate hash and equality functions and stick all the instructions you've seen in a hash table."

As GVN goes through the instructions of each block, in reverse-post-order, it puts them into this hash table (HValueMap). When GVN reaches an instruction that causes side effects, it flushes out the entries that depend on those side effects from the table. In that way, if ever it sees an instruction B that is equal to an instuction A in the table, GVN can remove B from its block, and replace every use of B with a use of A.

In standard SSA terminology you can consider instructions to be "definitions", because each instruction defines a value. Global value numbering works across basic blocks, hence the "global" in its name. Specifically, it eliminates equivalent definitions. One of the invariants of SSA is that "definitions dominate uses". So the part of the graph that should be searched for equivalent definitions is the part that is dominated by the original definition.

checked value replacement

In optimized code, accessing an array out of its bounds will cause a deoptimization. So the AST-to-Hydrogen translation will not emit a use of an index directly -- it inserts a bounds check, and has the array access use the result of the bounds check. But the purpose of this instruction is just to cause the side effect of deoptimization, and the resulting value is just a copy of the incoming index, which increases register allocation pressure for no reason. So now that code motion is complete, Crankshaft replaces uses of HBoundsCheck with the instruction that defines the index value, relying on the HBoundsCheck's position in the instruction stream to cause deopt as needed.

conclusions

In the end, the practical goals of Crankshaft are simple, and the optimizations are too, for the most part. Java people like to trot out all their gerundic compiler passes (see slide 7) and declare victory, and it is true that V8 and other modern JavaScript implementations will be beaten by the HotSpot JVM -- once it is allowed to warm up, of course. (V8's 5ms startup is particularly impressive, here.) But Crankshaft definitely has the basics, and throws in some range analysis to squeeze out a few more percent on top.

Finally, if you've gotten here, thanks again for reading this far. Comments and corrections are most welcome.

I have enjoyed being able to do this serious code reading, and some small bit of V8 hacking, as part of my work at Igalia. I think I'll have one more article about the platform-specific Lithium language, and its register allocator, and then see where I should move on from there. Happy hacking to all, and to all a good hack!

Syndicated 2011-08-02 21:21:01 from wingolog

static single assignment for functional programmers

Friends, I have an admission to make: I am a functional programmer.

By that I mean that lambda is my tribe. And you know how tribalism works: when two tribes meet, it's usually to argue and not to communicate.

So it is that I've been well-indoctrinated in the lore of the lambda calculus, continuation-passing style intermediate languages, closure conversion and lambda lifting. But when it comes to ideas from outside our tribe, we in the lambda tribe tune out, generally.

At the last Scheme workshop in Montreal, some poor fellow had the temerity to mention SSA on stage. (SSA is what the "machine tribe" uses as an intermediate langauge in their compilers.) I don't think the "A" was out of his mouth before Olin Shivers' booming drawl started, "d'you mean CPS?" (CPS is what "we" use.) There were titters from the audience, myself included.

But there are valuable lessons to be learned from SSA language and the optimizations that it enables, come though it may from another tribe. In this article I'd like to look at what the essence of SSA is. To do so, I'll start with explaining the functional programming story on intermediate languages, as many of my readers are not of my tribe. Then we'll use that as a fixed point against which SSA may be compared.

the lambda tribe in two sentences

In the beginning was the lambda. God saw it, realized he didn't need anything else, and stopped there.


functional programmers got god's back

Hey, it's true, right? The lambda-calculus is great because of its expressivity and precision. In that sense this evaluation is a utilitarian one: the lambda-calculus allows us to reason about computation with precision, so it is worth keeping around.

I don't think that Church was thinking about digital computers when he came up with the lambda-calculus back in the 1930s, given that digital computers didn't exist yet. Nor was McCarthy thinking about computers when he came up with Lisp in the 1960s. But one of McCarthy's students did hack it up, and that's still where we are now: translating between the language of the lambda-calculus and machine language.

This translation process is compilation, of course. For the first 20 years or so of practicing computer science, compilers (and indeed, languages) were very ad-hoc. In the beginning they didn't exist, and you just wrote machine code directly, using switches on a control panel or other such things, and later, assembly language. But eventually folks figured out parsing, and you get the first compilers for high-level languages.

I've written before about C not being a high-level assembly language, but back then, FORTRAN was indeed such a language. There wasn't much between the parser and the code generator. Everyone knows how good compilers work these days: you parse, you optimize, then you generate code. The medium in which you do your work is your intermediate language. A good intermediate language should be simple, so your optimizer can be simple; expressive, so that you can easily produce it from your source program; and utilitarian, in that its structure enables the kinds of optimizations that you want to make.

The lambda tribe already had a good intermediate language in this regard, in the form of the lambda-calculus itself. In many ways, solving a logic problem in the lambda-calculus is a lot like optimizing a program. Copy propagation is beta-reduction. Inlining is copy propagation extended to lambda expressions. Eta-conversion of continuations eliminates "forwarding blocks" -- basic blocks which have no statements, and just jump to some other continuation. Eta-conversion of functions eliminates functional trampolines.

continuation-passing style

But I'm getting ahead of myself. In the lambda tribe, we don't actually program in the lambda-calculus, you see. If you read any of our papers there's always a section in the beginning that defines the language we're working in, and then defines its semantics as a translation to the lambda-calculus.

This translation is always possible, for any programming language, and indeed Peter Landin did so in 1965 for Algol. Landin's original translations used his "J operator" to capture continuations, allowing a more direct translation of code into the lambda-calculus.

I wrote more on Landin, letrec, and the Y combinator a couple of years ago, but I wanted to mention one recent paper that takes a modern look at J, A Rational Deconstruction of Landin's J Operator. This paper is co-authored by V8 hacker Kevin Millikin, and cites work by V8 hackers Mads Sig Ager and Lasse R. Nielsen. Indeed all three seem to have had the privilege of having Olivier Danvy as PhD advisor. That's my tribe!

Anyway, J was useful in the context of Landin's abstract SECD machine, used to investigate the semantics of programs and programming languages. However it does not help the implementor of a compiler to a normal machine, and intermediate languages are all about utility. The answer to this problem, for functional programmers, was to convert the source program to what is known as continuation-passing style (CPS).

With CPS, the program is turned inside out. So instead of (+ 1 (f (+ 2 3))), you would have:

lambda return
  let c1 = 1
    letcont k1 t1 = _+ return c1 t1
      letcont k2 t2 = f k1 t2
        let c2 = 2, c3 = 3
          _+ k2 c2 c3

Usually the outer lambda is left off, as it is implicit. Every call in a CPS program is a tail call, for the purposes of the lambda calculus. Continuations are explicitly represented as lambda expressions. Every function call or primitive operation takes the continuation as an argument. Papers in this field usually use Church's original lambda-calculus notation instead of the ML-like notation I give here. Continuations introduced by a CPS transformation are usually marked as such, so that they can be efficiently compiled later, without any flow analysis.

Expressing a program in CPS has a number of practical advantages:

  • CPS is capable of expressing higher-order control-flow, for languages in which functions may be passed as values.

  • All temporary values are named. Unreferenced names represent dead code, or code compiled for effect only. Referenced names are the natural input to a register allocator.

  • Continuations correspond to named basic blocks. Their names in the source code correspond to a natural flow analysis simply by tracing the definitions and uses of the names. Flow analysis enables more optimizations, like code motion.

  • Full beta-reduction is sound on terms of this type, even in call-by-value languages.

Depending on how you implement your CPS language, you can also attach notes to different continuations to help your graph reduce further: this continuation is an effect context (because its formal parameter is unreferenced in its body, or because you knew that when you made it), so its caller can be processed for effect and not for value; this one is of variable arity (e.g. can receive one or two values), so we could jump directly to the right handler, depending on what we want; etc. Guile's compiler is not in CPS right now, but I am thinking of rewriting it for this reason, to allow more transparent handling of control flow.

Note that nowhere in there did I mention Scheme's call-with-current-continuation! For me, the utility of CPS is in its explicit naming of temporaries, continuations, and its affordances for optimization. Call/cc is a rare construct in Guile Scheme, that could be optimized better with CPS, but one that I don't care a whole lot about, because it's often impossible to prove that the continuation doesn't escape, and in that case you're on the slow path anyway.

So that's CPS. Continuations compile to jumps within a function, and functions get compiled to closures, or labels for toplevel functions. The best reference I can give on it is Andrew Kennedy's 2007 paper, Compiling With Continuations, Continued. CWCC is a really fantastic paper and I highly recommend it.

a digression: anf

CPS fell out of favor in the nineties, in favor of what became known as Administrative Normal Form, or ANF. ANF is like CPS except instead of naming the continuations, the code is left in so-called "direct-style", in which the continuations are implicit. So my previous example would look like this:

let c2 = 2, c3 = 3
  let t2 = + 2 3
    let t1 = f t1
      let c1 = 1
        + c1 t1

There are ANF correspondences for CPS reductions, like the beta-rule. See the Essence of Compiling With Continuations paper, which introduced ANF and sparked the decline of the original CPS formulation, for more.

This CPS-vs-ANF discussion still goes on, even now in 2011. In particular, Kennedy's CWCC paper is quite compelling. But the debate has been largely mooted by the advances made by the machine tribe, as enabled by their SSA intermediate language.

the machine tribe in two sentences

In the beginning was the
Segmentation fault (core dumped)

(Just kidding, guys & ladies!)

Instead of compiling abstract ideas of naming and control to existing hardware, as the lambda tribe did, the machine tribe took as a given the hardware available, and tries to expose the capabilities of the machine to the programmer.

The machine tribe doesn't roll with closures, continuations, or tail calls. But they do have loops, and they crunch a lot of data. The most important thing for a compiler of a machine-tribe language like C is to produce efficient machine code for loops.

Clearly, I'm making some simplifications here. But if you look at a machine-tribe language like Java, you will be dealing with many control-flow constructs that are built-in to the language (for, while, etc.) instead of layered on top of recursion like loops in Scheme. What this means is that large, important parts of your program have already collapsed to a first-order control-flow graph problem. Layering other optimizations on top of this like inlining (the mother of all optimizations) only expands this first-order flow graph. More on "first-order" later.

So! After decades of struggling with this problem, after having abstracted away from assembly language to three-address register transfer language, finally the machine folks came up with something truly great: static single-assignment (SSA) form. The arc here is away from the machine, and towards more abstraction, in order to be able to optimize better, and thus generate better code.

It's precisely for this utilitarian reason that SSA was developed. Consider one of the earliest SSA papers, Global Value Numbers and Redundant Comparisons by Rosen, Wegman, and Zadeck. Rosen et al were concerned about being able to move invariant expressions out of loops, extending the "value numbering" technique to operate across basic blocks. But the assignment-oriented intermediate languages that they had been using were getting in the way of code motion.

To fix this issue, Rosen et al switched from the assignment-oriented model of the machine tribe to the binding-oriented model of the lambda tribe.

In SSA, variables are never mutated (assigned); they are bound once and then left alone. Assignment to a source-program variable produces a new binding in the SSA intermediate language.

For the following function:

function clamp (x, lower, upper) {
  if (x < lower)
    x = lower;
  else if (x > upper)
    x = upper;
  return x;
}  

The SSA translation would be:

entry:
  x0, lower0, upper0 = args;
  goto b0;

b0:
  t0 = x0 < lower0;
  goto t0 ? b1 : b2;

b1:
  x1 = lower0;
  goto exit;

b2:
  t1 = x0 > upper0;
  goto t1 ? b3 : exit;
  
b3:
  x2 = upper0;
  goto exit;
  
exit:
  x4 = phi(x0, x1, x2);
  return x4;

SSA form breaks down a procedure into basic blocks, each of which ends with a branch to another block, either conditional or unconditional. Usually temporary values receive their own names as well, as it facilitates optimization.

phony functions

The funny thing about SSA is the last bit, the "phi" function. Phi functions are placed at control-flow joins. In our case, the value of x may be proceed from the argument or from the assignment in the first or second if statement. The phi function indicates that.

But you know, lambda tribe, I didn't really get what this meant. What is a phi function? It doesn't help to consider where the name comes from, that the original IBM compiler hackers put in a "phony" function to merge the various values, but considered that "phi" was a better name if they wanted to be taken seriously by journal editors.

Maybe phi functions are intuitive to the machine tribe; I don't know. I doubt it. But fortunately there is another interpretation: that each basic block is a function, and that a phi function indicates that the basic block has an argument.

Like this:

entry x lower upper =
  letrec b0 = let t0 = x0 < lower
                if t0 then b1() else b2()
         b1 = let x = lower
                exit(x);
         b2 = let t1 = x > upper0
                if t1 then b3() else exit()
         b3 = let x = upper
                exit(x);
         exit x = x
    b0()

Here I have represented basic blocks as named functions instead of labels. Instead of phi functions, we allow the blocks to take a number of arguments; the call sites determine the values that the phi function may take on.

Note that all calls to blocks are tail calls. Reminds you of CPS, doesn't it? For more, see Richard Kelsey's classic paper, A Correspondence Between Continuation-Passing Style and Static Single Assignment Form, or my earlier article about Landin, Steele, letrec, and labels.

But for a shorter, readable, entertaining piece, see Appel's SSA is Functional Programming. I agree with Appel that we in the lambda-tribe get too hung up on our formalisms, when sometimes the right thing to do is draw a flow-graph.

so what's the big deal?

If it were only this, what I've said up to now, then SSA would not be as interesting as CPS, or even ANF. But SSA is not just about binding, it is also about control flow. In order to place your phi functions correctly, you need to build what is called a dominator tree. One basic block is said to dominate another if all control paths must pass through the first before reaching the second.

For example, the entry block always dominates the entirety of a function. In our example above, b0 also dominates every other block. However though b1 does branch to exit, it does not dominate it, as exit may be reached on other paths.

It turns out that you need to place phi functions wherever a definition of a variable meets a use of the variable that is not strictly dominated by the definition. In our case, that means we place a phi node on exit. The dominator tree is a precise, efficient control-flow analysis that allows us to answer questions like this one (where do I place a phi node?).

For more on SSA and dominators, see the very readable 1991 paper by Cytron, Ferrante, Rosen, Wegman, and Zadeck, Efficiently Computing Static Single Assignment Form and the Control Dependence Graph.

Typical implementations of SSA embed in each basic block pointers to the predecessors and successors of the blocks, as well as the block's dominators and (sometimes) post-dominators. (A predecessor is a block that precedes the given node in the control-flow graph; a successor succeeds it. A post-dominator is like a dominator, but for the reverse control flow; search the tubes for more.) There are well-known algorithms to calculate these links in linear time, and the SSA community has developed a number of optimizations on top of this cheap flow information.

In contrast, the focus in the lambda tribe has been more on interprocedural control flow, which -- as far as I can tell -- no one does in less than O(N2) time, which is, as my grandmother would say, "just turrible".

I started off with a mention of global value numbering (GVN) on purpose. This is still, 20+ years later, the big algorithm for code motion in JIT compilers. HotSpot C1 and V8 both use it, and it just landed in IonMonkey. GVN is well-known, well-studied, and it works. It results in loop-invariant code motion: if an invariant definition reaches a loop header, it can be hoisted out of the loop. In contrast I don't know of anything from the lambda tribe that really stacks up. There probably is something, but it's certainly not as well-studied.

why not ssa?

Peoples of the machine tribe, could you imagine returning a block as a value? Didn't think so. It doesn't make sense to return a label. But that's exactly what the lambda-calculus is about. One may represent blocks as functions, and use them as such, but one may also pass them as arguments and return them as values. Such blocks are of a higher order than the normal kind of block that is a jump target. Indeed it's the only way to express recursion in the basic lambda calculus.

That's what I mean when I say that CPS is good as a higher-order intermediate language, and when I say that SSA is a good first-order intermediate language.

If you have a fundamentally higher-order language, one in which you need to loop by recursion, then you have two options: do whole-program analysis to aggressively closure-convert your program to be first-order, and then you can use SSA, or use a higher-order IL, and use something more like CPS.

MLton is an example of a compiler that does the former. Actually, MLton's SSA implementation is simply lovely. They do represent blocks as functions with arguments instead of labels and phi functions.

But if you can't do whole-program analysis -- maybe because you want to extend your program at runtime, support separate compilation, or whatever -- then you can't use SSA as a global IL. That's not to say that you shouldn't identify first-order segments of your program and apply SSA-like analysis and optimization on them, of course! That's really where the lambda tribe should go.

finally

I wrote this because I was in the middle of V8's Crankshaft compiler and realized I didn't understand some of the idioms, so I went off to read a bunch of papers. At the same time, I wanted to settle the CPS-versus-ANF question for my Guile work. (Guile currently has a direct-style compiler, for which there are precious few optimizations; this fact is mostly a result of being difficult to work with the IL.)

This post summarizes my findings, but I'm sure I made a mistake somewhere. Please note any corrections in the comments.

Syndicated 2011-07-12 16:39:50 from wingolog

v8: a tale of two compilers

Regular readers will have noticed my fascination with the V8 JavaScript implementation. It is indeed an impressive piece of engineering.

When V8 was originally announced, Lars Bak wrote:

I hope the web community will adopt the code and the ideas we have developed to advance the performance of JavaScript. Raising the performance bar of JavaScript is important for continued innovation of web applications.

How right he was! V8 has succeeded admirably, not only in adoption, but also in "raising the performance bar" among all JavaScript implementations.

But as William Gibson says, "The future is already here — it's just not very evenly distributed." Many parts of V8 are not documented at all, and perhaps understandably, given the pace at which things are changing. So as I'm getting up to speed with V8 for my work with Igalia, I've been trying to document the interesting things that I find, so that all JavaScript implementations can learn and improve.

And indeed, with my selfish Guile hat on, this study in V8 is giving me lots of ideas and motivation. So perhaps V8's new motto should be, "shaming the world into faster code, one compiler at a time".

compiler the first: full-codegen

V8 compiles all JavaScript to native code. V8 has two compilers: one that runs fast and produces generic code, and one that doesn't run as fast but does try to produce optimized code.

The quick-and-simple compiler is known internally as the "full-codegen" compiler. It takes as its input the abstract syntax tree (AST) of a function, walks over the nodes in the AST, and emits calls to a macroassembler directly. Here's a picture:

The boxes represent the flow of data in the compilation process. There's only two boxes because, as we said, it's a simple compiler. All local variables are stored either on the stack or on the heap, not in registers. Any variable referenced by a nested function is stored on the heap, in the context object associated with the function in which the variable was defined.

The compiler will emit loads and stores to pull these values into registers to actually do the work. The top of the stack of temporaries is cached in a register. Complicated cases are handled by emitting calls to runtime procedures. The compiler does track the context in which an expression is being evaluated, so that tests can just jump directly to the consequent blocks, instead of pushing a value, testing if it's zero or not, then branching. Small integer arithmetic is usually inlined. And that's all there is!

Actually I should mention one important optimization that is done even with the full-codegen compiler, and that is inline caching. See the Hölzle, Chambers, and Ungar paper I link to there for all of the details. Inline caches are used for assignments, unary and binary operations, function calls, property accesses, and comparisons.

The inline caches also serve as sources of type information, used by the optimizing compiler. In the case of some statement types, like assignments, the only purpose of the IC is to record type information.

Here are some links to the source:

ast.h
The abstract syntax tree. Note that though the comment says that ASTs are allocated in separate zones, for constant-time deallocation, in practice the AST is kept around as long as the JavaScript functions are alive, for use as input to the optimizing compiler. Zone allocation is more of a win for the optimizing compiler, which produces more ephemeral garbage.
full-codegen.h
full-codegen.cc
full-codegen-ia32.cc
The full-codegen compiler. Most of the meat of the full-codegen compiler is in the target-specific directory (4257 lines vs 769+1323 lines). Currently supported targets are ia32, x64, arm, and mips.

type feedback

The first time V8 sees a function, it parses it to the AST but doesn't actually do anything with it. It only runs the full-codegen compiler when the function is first run. How's that for lazy! But after things have started up, it kicks off a profiler thread to see how things are going, and what functions are hot.

This lazy, sit-back-and-watch approach gives V8 time to record the type information flowing through it. So by the time it has decided that a function is hot, and could use a little more help, it has type information to give to the compiler.

Runtime type feedback information is recorded by and stored in the inline caches (ICs). Types are expressed internally as a 8-bit value constructed in such a way that it can detect a hierarchy of types, with a simple bitmask. At this point the best I can do is show the artwork in the source:

//         Unknown
//           |   \____________
//           |                |
//      Primitive       Non-primitive
//           |   \_______     |
//           |           |    |
//        Number       String |
//         /   \         |    |
//    Double  Integer32  |   /
//        |      |      /   /
//        |     Smi    /   /
//        |      |    / __/
//        Uninitialized.

Everytime an IC stub sees a new kind of value, it computes the type of that value, and bitwise-ands it with the old type. The initial type value is uninitialized. So if an IC only sees integers in the Smi (small integer) range, the recorded type will indicate that. But as soon as it sees a double value, the type becomes Number; and if it then sees an object, the type becomes Unknown. A non-primitive IC will necessarily store the map of the receiver type in the IC, for dispatch purposes. Type feedback can parse the IC stub to get at this map, when needed.

Type feedback information is associated with a particular AST node (the assignment, property load, etc.). An integer identifier for the node is serialized into the IC so that when V8 decides that a function is hot, it can parse out the recorded type information from the full-codegen code, and associate it with the AST node.

Now, this process is a bit complicated. It needs support up and down in your compiler stack. You need to have inline caches. Your inline caches need to have support for the type information, both for operands and for the result. You need to be able to walk this data to find the values. Then you need to link it back to the AST, so that when you pass the AST to your optimizing compiler, the compiler is able to ask the right questions.

The concrete strategy that V8 takes is to parse the data into a TypeFeedbackOracle object, associating information with particular AST nodes. Then V8 visits all of the AST nodes with this oracle, and the nodes themselves parse out the data that they might find useful from the oracle.

In the end one is able to, for example, ask a Property node if it is monomorphic, and in any case what are the receiver types at that node. It seems that this works well for V8, because it reduces the number of moving parts in the optimizing compiler, given that it doesn't need to have the TypeFeedbackOracle itself.

So, source links.

type-info.h
The TypeInfo 8-bit data type, and the TypeFeedbackOracle declaration. I have to admit, I really like the use of C++ within V8. It's a nasty tool, but they wield it well.
type-info.cc
The implementation of TypeFeedbackOracle. See ProcessTarget down at the bottom of the file for where the bulk of the work happens.

Also check the ast.h link before to see see how type feedback ties into the AST itself.

crankshaft = type feedback + hydrogen + lithium

Once V8 has identified that a function is hot and has collected some type feedback information, it tries to run the augmented AST through an optimizing compiler. This optimizing compiler is called Crankshaft in the marketing materials, though that name rarely appears in the source itself.

Instead, in the source, Crankshaft consists of the Hydrogen high-level intermediate representation (IR), the Lithium low-level IR, and their associated compilers. Like this:

(You probably caught on from my heavy-handed repetition above, but I believe that the names Hydrogen and Lithium come from High- and Low-level, respectively.)

It depends on your background, but you might have seen a diagram like that before:

Indeed I believe that Crankshaft was highly influenced by the changes that Sun introduced into the Hotspot Client compiler in Java 6. Let me quote from an illuminating 2008 paper by Kotzmann et al, Design of the HotSpot Client Compiler for Java 6:

First, a high-level intermediate representation (HIR) of the compiled method is built via an abstract interpretation of the bytecodes. It consists of a control-flow graph (CFG), whose basic blocks are singly linked lists of instructions. The HIR is in static single- assignment (SSA) form, which means that for every variable there is just a single point in the program where a value is assigned to it. An instruction that loads or computes a value represents both the operation and its result, so that operands can be represented as pointers to previous instructions. Both during and after generation of the HIR, several optimizations are performed, such as constant folding, value numbering, method inlining, and null check elimination. They benefit from the simple structure of the HIR and the SSA form.

The back end of the compiler translates the optimized HIR into a low-level intermediate representation (LIR). The LIR is conceptually similar to machine code, but still mostly platform-independent. In contrast to HIR instructions, LIR operations operate on virtual registers instead of references to previous instructions. The LIR facilitates various low-level optimizations and is the input for the linear scan register allocator, which maps virtual registers to physical ones.

This statement describes Crankshaft very neatly, and the rest of section 2 of that paper applies in a general sense. Of course there are some differences. Crankshaft starts with the AST, not bytecodes. The HotSpot client runtime does not use type-feedback to help its compiler, as it is less necessary for Java, though it would still be helpful. Crankshaft doesn't do very much with exception handlers.

But the similarities are such that V8 can actually produce traces that can be read by c1visualizer (docs), a program used to visualize the internals of the HotSpot client compiler. (The client compiler appears to be known internally as c1; the server compiler appears to be opto).

in our next installment

This post is getting quite long. I'm going to have to send my mom some flowers or something, because I'm sure she does check these things out, and it must be brain-numbing to slog through. Another paragraph? When will he stop?

So if you, dear reader, find yourself in this category, you're in good company, because my mom is a lovely lady.

But if you're a compiler geek, I think I have at least couple more articles to write on Crankshaft before I can get back to other hacks. Specifically, we'll go into the optimizations that Crankshaft does on the Hydrogen IR, and take a look on how it does the translation to Lithium. But until then, happy hacking!

Syndicated 2011-07-05 14:31:49 from wingolog

guile 2.0.2: building the guildhall

Hey tubes, we made another Guile release!

what is guile

Guile is the GNU extension language. It's an implementation of Scheme. It's pretty fun to hack in, and on!

guile 2.0.2: even faster!

Here I'd like to highlight the thing that I really like about this release: performance. Relative to 2.0.1:

  • The VM is 20% faster

  • Compiled .go files are half the size

  • Startup time is down from 16.2 milliseconds to 11.7 milliseconds (on my machine)

  • Base memory use down from 3.8 MB of private dirty memory to 2.9MB (measured on a guile -c '(sleep 100)')

In short, thundercats are totally on the loose! See the release announcement for all the details.

where we're going

It's hard to tell sometimes where this jalopy is headed. Sometimes the vehicle just drives itself, like the recent speed improvements, which were totally unplanned.

But that said, there are a few stops we want to make on this journey, in the near- and mid-term. Let's take a look.

2.0: building the guildhall

On the 2.0 branch, the current stable series, the big feature that we'd like to land soon will be a CPAN-like repository of installable Scheme modules.

We agonized over the name for a while. It needs to be short and memorable, so it can be used as a command-line tool. (The Git experience is pretty definitive here.) At the same time it needs to be linked to Guile, and follow some sort of theme.

What we've done now is to take our guile-tools script runner (equivalent to the git wrapper) and rename it to guild. Currently it's mostly used for compiling files, as in guild compile -o foo.go foo.scm, but in the future we want to use it as a part of this system for Guile wizards and journeyfolk to band together to share code---to form a guild, so to speak.

And so the package archive itself will be the "guild hall" (or "guildhall"). We'll see if we can get guildhall.gnu.org, as a place to host the packages.

As far as code, protocol, and formats go, we are looking at porting Andreas Rottmann's Dorodango Scheme package manager to be Guile-specific, with guild commands. So guild update, guild upgrade, guild install xml-rpc, etc.

Dorodango is basically a port of Debian's apt to Scheme. It even has ports of the same dependency resolution algorithms, which is fun. It also allows for multiple sources, so we can have our Guile guildhall archive, but also have support for portable R6RS Scheme archives from Dorodango.

2.2: bit mucking

Guile is getting pretty good now, but there is more to be done before it is as cheap to run a program written in Scheme as it is to run a program written in C.

The first thing that we will do to help this is to switch to ELF as the format of our compiled code. We will do this on all platforms, and even before we do native compilation. The bytecode will simply be stored in an ELF section. This will also allow us to statically allocate more data, decreasing memory usage and increasing locality. Also our debugging information is currently interspersed with code -- not in the execution path, mind you, but it does make the memory profile of the code look more like emmental than comté cheese. Moving debug info out to a separate section will help with that.

ELF will also add flexibility. We can add sections to express dependencies of compiled code on source code. We can strip debug info and other things from compiled code. We can link multiple modules into one file, to avoid stat penalties on startup.

In addition ELF will allow us to add native code, in a separate section. Of course we have to actually produce native code, so that's the next big thing for Guile 2.2. The plan is to write an ahead-of-time code generator for x86 and x86-64 systems, while keeping the bytecode interpreter as well. Then once we know what we want, we can go and talk to the GCC folks and see if our use case can be supported there.

Anyway, that's the plan. I expect 2.2 to take another year or so. By the time we have native compilation, I hope also our Elisp implementation will be significantly faster than the one in Emacs, so it will really become attractive to people to finish the work that we have done to make the switch there.

Syndicated 2011-07-04 09:47:25 from wingolog

security implications of jit compilation

Periodically the tubes alight with statements that just-in-time (JIT) compilers are somehow bad for security. At Igalia we work with many companies that have put or are interested in putting JIT compilers on mobile devices, and obviously security is one of their concerns. In this article I'd like to take a closer look at the risks and myths surrounding JIT and security.

executive summary

JIT compilers do not cause a significant security risk in and of themselves.

However, JIT compilers do allow flaws in unsafe native libraries to be more easily exploited.

The problem is not JIT compilation. The problem is code written in unsafe languages.

a background: interpreters and security

It used to be that JavaScript implementations all used interpreters to execute JavaScript code, usually in bytecode form. An interpreter is a program, usually written in C or C++, which interprets program code as data.

(I will take JavaScript as the example in this article, though similar considerations apply to other language implementations.)

JavaScript implementations also includes a language runtime, which implements all of the "primitive" objects of the language. In this case of JavaScript this might include Date objects and such. The runtime also includes a garbage collector, a parser, and other such things.

Sometimes interpreters are used to run "external" code, as is the case with JavaScript. In that case the language implementation is a fairly clear attack target: the attacker tries to produce an input which triggers a flaw in the language implementation, allowing takeover of the target system.

Note that in this case, the interpreter itself is only part of the target. Indeed the interpreter is typically only a small part of a language implementation, and is very well-tested. Usually the vulnerability is in some bit of the runtime, and often stems from having the runtime written in an unsafe language like C or C++.

So the interpreter part itself typically imposes no risk, beyond the obvious ones of interpreting data provided by an attacker.

compilers have more moving pieces

Instead of interpreting a program as data, compilers produce native code. A compiler is bigger than an interpreter, and so exposes a larger attack surface. A compiler usually has more corner cases than an interpreter, which often are not tested properly. Compilers are usually specific to particular target architectures, which makes bugs in less-used architectures more likely. Additionally, by generating native code, a compiler operates outside whatever meager safety guarantees are provide by the implementation language (C or C++).

So yes, compilers are a risk, JIT or otherwise. However with proper testing it is possible to have confidence in a compiler. First of all, a compiler is written by compiler writers, not ordinary programmers, and not attackers. Secondly a compiler is generally a functional process, so it's easy to test. Modern JS implementations have thousands and thousands of tests. It's still a cross-your-fingers-and-pray thing, usually, but the fact is that exploits these days come from bugs present in the source of unsafe languages, not bugs in code generated from safe languages.

jit compilation allows performant mobile code

JIT compilation can actually increase the overall security of a system, if one of the design goals of that system is user-installable "apps". For example, Android apps are (typically) written in Java, targetting the Dalvik VM. This choice of safe language and safe runtime largely eliminates the need to make a tight sandbox around apps. If it weren't for the presence of unsafe code in the libraries available to an Android app, things would be even easier.

Note that Dalvik's JIT has other positive security characteristics, in that its generated code can be verified against the bytecodes that it is supposed to implement. I recommend this video for a good discussion; thanks to my colleague Xan for pointing it out.

In contrast, Apple's decision to allow natively-compiled applications to be installed by users leaves them scrambling to try to mitigate the effects of what happens when an app has a buffer overflow. Again, the problem is unsafe code, not JIT compilation.

attempts to contain security vulnerabilities in unsafe code

That said, unsafe languages are part of the sordid reality, so let's look at how attackers compromise such applications.

In the old days, once you got a stack buffer overflow, you just wrote whatever code you wanted onto the stack, and arranged for a function return to execute that code. Smashing the stack for fun and profit, you see.

These days it is more difficult. One of the things that makes it more difficult is making it impossible to write code which is later executed. This technique goes under various names: the NX bit, DEP, W⊕X, etc. Basically it means you can't write shellcode. But, you can return to libc, which uses existing functions in libc (like system).

Some systems also prevent you from running code on disk unless it is signed. Dion Blazakis wrote a nice overview on the sandbox in iOS devices (subtitle: using Scheme for evil). Besides not letting you write code for your own device -- a terrible thing IMO, but not related to security -- this technique does prevent attackers from getting around the NX restriction by writing code to disk, then executing it.

But one need not write code to perform arbitrary computation. I assume everyone's seen the following papers, and if not, I highly recommend them:

Finally, instead of loading libc always at the same place, the latest thing to happen is to randomize the positions at which the various segments are loaded in memory. This is called Address Space Layout Randomization (ASLR). To effectively execute a return to libc under ASLR, you have to figure out where libc is first; not a trivial task.

I think I need to mention again that these are not security measures per se. If you are really interested in security, don't write systems software in unsafe languages. These measures are designed to contain the effects of a vulnerability.

I should also note that all of these measures are actively harmful to people writing code. ASLR increases load time by disallowing prelinking of various sorts, and forcing the generation of position-independent code. Disallowing writable, executable memory means that you can't do JIT compilation, not to mention quajects. Disallowing unsigned binaries means that you can't run your code on your own device.

jit compilation is not compatible with these restrictions

Having a JIT compiler by definition means that you are writing code and then going to run it. This makes it easier for an attacker who has found a bug in a native library to exploit it.

These exploits do happen, though currently not very often. A good example of such a paper is Interpreter Exploitation: Pointer Inference And JIT Spraying by Dion Blazakis.

Blazakis uses characteristics of the language runtime (Flash, in that case) to allow him to determine the address ranges to write his code. The actual code is written by taking advantage of the predictable output of the particular JIT in question to spray a big NOPsled onto the heap, which if I understand it correctly, is implemented by starting the IP in the middle of x86 instructions, in lovely contrapuntal style. Very clever stuff.

But again, the vulnerability is present not in code related to the JIT, but in native code. Having executable pages makes the vulnerability easier to exploit, but is not itself the security risk.

specific jit implementations

I called out Dalvik as an example of a JIT implementation that is particularly security-conscious. As much as I like V8, though, I cannot recommend it particularly, from the security perspective. It is simply changing too fast. Practically every commit touches code generators. The individual architectures share some pieces but duplicate a lot. I cannot speak for the other JavaScript implementations, as I do not know them as well.

V8's primary use case in the Chromium browser does have some security safeguards though, benefitting from the process sandbox and rapid release cycle. But I would be hesitant to put it in a device, lacking the same safeguards.

mitigations for jit vulnerabilities

These days all signs are pointing to the end of sandboxing as a security strategy, replacing it instead with verification and software fault isolation (SFI). Google's Native Client group is leading the way here. For more, see Adapting software fault isolation to contemporary CPU architectures by Sehr et al. But it does seem a poor thing to try to turn an unsafe language into a "fault-isolated" runtime; better to start with safe code to begin with.

Adapting SFI to JIT compilers could protect against flaws in JIT compilers, but it has abysmal results. The authors of that paper adapted V8 to produce the code sequences needed to preserve the SFI invariants, and got killed in speed. They somehow managed to claim success though, which is beyond me.

in closing

If you've already decided that you're going to ship remotely exploitable libraries written in unsafe languages, disallowing JIT can indeed reduce your attack surface. I understand it in some cases. But it is a shame to make safe languages pay the sins of their unsafe brethren. If you can, be Android, and not iOS.

Syndicated 2011-06-21 17:00:25 from wingolog

on-stack replacement in v8

A recent post looked at what V8 does with a simple loop:

function g () { return 1; }
function f () { 
  var ret = 0;
  for (var i = 1; i < 10000000; i++) {
    ret += g ();
  }
  return ret;
}

It turns out that calling f actually inlines g into the body, at runtime. But how does V8 do that, and what was the dead code about? V8 hacker Kasper Lund kindly chimed in with an explanation, which led me to write this post.

When V8 compiled an optimized version of f with g, it decided to do so when f was already in the loop. It needed to stop the loop in order to replace the slow version with the fast version. The mechanism that it uses is to reset the stack limit, causing the overflow handler in g to be called. So the computation is stuck in a call to g, and somehow must be transformed to a call to a combined f+g.

The actual mechanics are a bit complicated, so I'm going to try a picture. Here we go:

As you can see, V8 replaces the top of the current call stack with one in which the frames have been merged together. This is called on-stack replacement (OSR). OSR takes the contexts of some contiguous range of function applications and the variables that are live in those activations, and transforms those inputs into a new context or contexts and live variable set, and splats the new activations on the stack.

I'm using weasel words here because OSR can go both ways: optimizing (in which N frames are combined to 1) and deoptimizing (typically the other way around), as Hölzle notes in "Optimizing Dynamically-Dispatched Calls with Run-Time Type Feedback". More on deoptimization some other day, perhaps.

The diagram above mentions "loop restart entry points", which is indeed the "dead code" that I mentioned previously. That code forms part of an alternate entry point to the function, used only once, when the computation is restarted after on-stack replacement.

details, godly or otherwise

Given what I know now about OSR, I'm going to try an alternate decompilation of V8's f+g optimized function. This time I am going to abuse C as a high-level assembler. (I know, I know. Play along with me :))

Let us start by defining our types.

typedef union { uint64_t bits; } Value;
typedef struct { Value map; uint64_t data[]; } Object;

All JavaScript values are of type Value. Some Values encode small integers (SMIs), and others encode pointers to Object. Here I'm going to assume we are on a 64-bit system. The data member of Object is a tail array of slots.

inline bool val_is_smi (Value v) 
{ return !(v.bits & 0x1); }

inline Value val_from_smi (int32 i)
{ return (Value) { ((uint64_t)i) << 32 }; }

inline int32 val_to_smi (Value v)
{ return v.bits >> 32; }

Small integers have 0 as their least significant bit, and the payload in the upper 32 bits. Did you see the union literal syntax here? (Value){ ((uint64_t)i) << 32 }? It's a C99 thing that most folks don't know about. Anyway.

inline bool val_is_obj (Value v)
{ return !val_is_smi (v); }

inline Value val_from_obj (Object *p)
{ return (Value) { ((uint64_t)p) + 1U }; }

inline Object* val_to_obj (Value v)
{ return (Object*) (v.bits - 1U); }

Values that are Object pointers have 01 as their least significant bits. We have to mask off those bits to get the pointer to Object.

Numbers that are not small integers are stored as Object values, on the heap. All Object values have a map field, which points to a special object that describes the value: what type it is, how its fields are laid out, etc. Much like GOOPS, actually, though not as reflective.

In V8, the double value will be the only slot in a heap number. Therefore to access the double value of a heap number, we simply check whether the Object's map is the heap number map, and in that case return the first slot, as a double.

There is a complication though; what is value of the heap number map? V8 actually doesn't encode it into the compiled function directly. I'm not sure why it doesn't. Instead it stores the heap number map in a slot in the root object, and stores a pointer into the middle of the root object in r13. It's as if we had a global variable like this:

Value *root;    // r13

const int STACK_LIMIT_IDX = 0;
const int HEAP_NUMBER_MAP_IDX = -7; // Really.

inline bool obj_is_double (Object* o)
{ return o->map == root[HEAP_NUMBER_MAP_IDX]; }

Indeed, the offset into the root pointer is negative, which is a bit odd. But hey, details! Let's add the functions to actually get a double from an object:

union cvt { double d; uint64_t u; };

inline double obj_to_double (Object* o)
{ return ((union cvt) { o->data[0] }).d; }

inline Object* double_to_obj (double d)
{
  Object *ret = malloc (sizeof Value * 2);
  ret->map = root[HEAP_NUMBER_MAP_IDX];
  ret->data[0] = ((union cvt) { d }).u;
  return ret;
}

I'm including double_to_obj here just for completeness. Also did you enjoy the union literals in this block?

So, we're getting there. Perhaps the reader recalls, but V8's calling convention specifies that the context and the function are passed in registers. Let's model that in this silly code with some global variables:

Value context;  // rsi
Value function; // rdi

Recall also that when f+g is called, it needs to check that g is actually bound to the same value. So here we go:

const Value* g_cell;
const Value expected_g = (Value) { 0x7f7b205d7ba1U };

Finally, f+g. I'm going to show the main path all in one go. Take a big breath!

Value f_and_g (Value receiver)
{
  // Stack slots (5 of them)
  Value osr_receiver, osr_unused, osr_i, osr_ret, arg_context;
  // Dedicated registers
  register int64_t i, ret;

  arg_context = context;

  // Assuming the stack grows down.
  if (&arg_context > root[STACK_LIMIT_IDX])
    // The overflow handler knows how to inspect the stack.
    // It can longjmp(), or do other things.
    handle_overflow ();

  i = 0;
  ret = 0;
  
restart_after_osr:
  if (*g_cell != expected_g)
    goto deoptimize_1;

  while (i < 10000000)
    {
      register uint64_t tmp = ret + 1;
      if ((int64_t)tmp < 0)
        goto deoptimize_2;

      i++;
      ret = tmp;
    }

  return val_from_smi (ret);

And exhale. The receiver object is passed as an argument. There are five stack slots, none of which are really used in the main path. Two locals are allocated in registers: i and ret, as we mentioned before. There's the stack check, locals initialization, the check for g, and then the loop, and the return. The test in the loop is intended to be a jump-on-overflow check.

restart_after_osr what?

But what about OSR, and what's that label about? What happens is that when f+g is replaced on the stack, the computation needs to restart. It does so from the osr_after_inlining_g label:

osr_after_inlining_g:
  if (val_is_smi (osr_i))
    i = val_to_smi (osr_i);
  else
    {
      if (!obj_is_double (osr_i))
        goto deoptimize_3;

      double d = obj_to_double (val_to_obj (osr_i));

      i = (int64_t)trunc (d);
      if ((double) i != d || isnan (d))
        goto deoptimize_3;
    }

Here we take the value for the loop counter, as stored on the stack by OSR, and unpack it so that it is stored in the right register.

Unpacking a SMI into an integer register is simple. On the other hand unpacking a heap number has to check that the number has an integer value. I tried to represent that here with the C code. The actual assembly is just as hairy but a bit more compact; see the article for the full assembly listing.

The same thing happens for the other value that was live at OSR time, ret:

  if (val_is_smi (osr_ret))
    ret = val_to_smi (osr_ret);
  else
    {
      if (!obj_is_double (osr_ret))
        goto deoptimize_4;

      double d = obj_to_double (val_to_obj (osr_ret));

      ret = (int64_t)trunc (d);
      if ((double) ret != d || isnan (d))
        goto deoptimize_4;

      if (ret == 0 && signbit (d))
        goto deoptimize_4;
    }
  goto restart_after_osr;

Here we see the same thing, except that additionally there is a check for -0.0 at the end, because V8 could not prove to itself that ret would not be -0.0. But if all succeeded, we jump back to the restart_after_osr case, and the loop proceeds.

Finally we can imagine some deoptimization bailouts, which would result in OSR, but for deoptimization instead of optimization. They are implemented as tail calls (jumps), so we can't represent them properly in C.

deoptimize_1:
  return deopt_1 ();
deoptimize_2:
  return deopt_2 ();
deoptimize_3:
  return deopt_3 ();
deoptimize_4:
  return deopt_4 ();
}

and that's that.

I think I've gnawed all the meat that's to be had off of this bone, so hopefully that's the last we'll see of this silly loop. Comments and corrections very much welcome. Next up will probably be a discussion of how V8's optimizer works. Happy hacking.

Syndicated 2011-06-20 13:14:03 from wingolog

V8 is faster than GCC

Do you like the linkbait title? Neither do I, but it is true in this case. Check it out.

After my last post, Benjamin noted that GCC would reduce my simple test to a mov rax, $10000000; ret sequence. Well yes, that's true, and GCC does do that: but only if GCC is able and allowed to do the inlining. So the equivalent of the test, for GCC, is to compile the g in a separate file:

unsigned int g (void) { return 1; }

Then in the main file, we have:

extern unsigned int g (void);

int main (int argc, char *argv) {
  unsigned int i, ret = 0;
  for (i = 0; i < NU; i++)
    ret += g ();
  return ret;
}

Here we replace N with the number of iterations we're interested in testing.

I decided to run such a test; the harness is here. I ran both GCC and V8 for the above program, for N from 20 to 231, and plotted the total time taken by V8 as compared to GCC.

Naturally this includes the GCC compile times, for which we expect GCC to lose at low iterations, but the question is, when does GCC start being faster than V8?

answer: never!

That's right, V8 is always faster than GCC, right up to the point at which its fixnums start failing. For the record, only the points on the right of the graph are really worth anything, as the ones on the left only run for a few milliseconds. Raw data here.

Why is V8 faster, even for significant iteration counts? Because it can inline hot function calls at runtime. It can see more than GCC can, and it takes advantage of it: but still preserving the dynamic characteristics of JavaScript. In this context the ELF symbol interposition rules that GCC has to deal with are not such a burden!

Syndicated 2011-06-10 15:57:58 from wingolog

what does v8 do with that loop?

Hi!

I've spent the last month or so swimming in V8. Not the tomato juice, mind you, though that would be pleasant, but rather the JavaScript implementation.

In my last dispatch I made the assertion that the current JavaScript implementations are totally undocumented, but in the case of V8 that's not precisely true. There were at least two PhD dissertations written on it, along with a few technical reports and conference proceedings.

I refer of course to the Self work at Stanford and Sun, back in the late eighties and early nineties. I had read a number of their papers, but hadn't seen Hölzle's thesis yet. That one, for me, is the best of all, because it explains the Self implementation on an engineering level. The V8 implementation, I mean to say, because the vocabulary is entirely the same: maps, on-stack replacement, lazy deoptimization, etc. These are all parts of V8 now.

And it's no wonder, really. Lars Bak, the lead developer of V8, was there back then, hacking on Self, and then Strongtalk, then HotSpot, then a little startup built around virtual machines on mobile devices. So you see there's a reason why V8 doesn't have very much big-picture internal documentation -- Bak has been writing the same program for 20 years now; he knows how it works.

(Note however that Bak's name never appears in the V8 commit logs. You can see his hand at work, but never his face. Like Dr. Claw. Actually not very much like Dr. Claw but I just wanted to put that out there. Is Lars Bak Dr. Claw? Is he?)

enough with the personification

As you might recall, V8 always compiles JavaScript to native code. The first time V8 sees a piece of code, it compiles it quickly but without optimizing it. The initial unoptimized code is fully general, handling all of the various cases that one might see, and also includes some type-feedback code, recording what types are being seen at various points in the procedure.

At startup, if the "Crankshaft" optimizing compiler is enabled, V8 spawns off a profiling thread. If it notices that a particular unoptimized procedure is hot, it collects the recorded type feedback data for that procedure and uses it to compile an optimized version of the procedure. The old unoptimized code is then replaced with the new optimized code, and the process continues. (This on-stack replacement (OSR) process is covered in Hölzle's thesis.)

The optimized procedure does not actually cover all cases. If a callee procedure starts returning floating-point values where it was returning integer values before, the optimized procedure is deoptimized -- a relatively expensive process of recompiling the original procedure to the unoptimized form, replacing the optimized function on the stack with the unoptimized version, and then continuing the computation. This requires that the compiler keep around additional information about how to continue the computation -- you could be in the middle of a loop, after all.

Deoptimization is particularly tricky if the optimizing process inlined a callee procedure, and thus has to de-inline, replacing the activation of one optimized procedure call with two or more unoptimized calls. Again, Hölzle's thesis discusses this in depth. Inlining is such a win though that the complexity appears to be worth it.

assembly

I wanted to see what V8 does with a simple loop, but one for which lexical inlining isn't possible. Like this:

function g () { return 1; }
function f () { 
  var ret = 0;
  for (var i = 1; i < 10000000; i++) {
    ret += g ();
  }
  return ret;
}

Pretty simple: adding 1 to a counter, 10 million times, but attempting to foil statically apparent inlining possibilities. I entered these two definitions at the d8 shell, invoked with the --code_comments --print_code options.

Note that running V8 in this way is going to spew a lot on your console, as V8 itself warms up. Startup is quick, though. On my modern laptop with an SSD, the debugging shell takes about 17ms to start up. The standard shell takes about 5ms to start. Both of these numbers are with snapshots on; without snapshots, the numbers are more like 32ms and 18ms, respectively. Just for comparison, the JavaScriptCore shell (jsc) takes about 12ms to start here.

Interestingly, V8's profiler decides that the best thing to do here is not to optimize g -- which it actually can't, as it's so small the unoptimized code is already optimal -- but to inline g into f, and optimize f.

V8 is able to do this inlining because it keeps around the parsed AST of every piece of code it sees. It needs some high-level internal representation, and it turns out that the AST is the easiest one: it's fairly small, it's already the input format to the "full-codegen" unoptimized compiler, and it also has all of the lexical information necessary to do good inlining analysis. Much easier than trying to decompile bytecode, it seems to me.

I did say that I was going to show some assembly, so here we go. This is what d8 prints out when evaluating f(). I've trimmed the output a bit. The comments on the right are my annotations.

Instructions (size = 466)
  0  push rbp           ;; Save the frame pointer.
  1  movq rbp,rsp       ;; Set the new frame pointer.
  4  push rsi           ;; Save the callee's "context object".
  5  push rdi           ;; Save the callee's JSFunction object.
  6  subq rsp,0x28      ;; Reserve space for 5 locals.

Here we have a standard prelude. The JavaScript calling conventions in V8 pass arguments on the stack, using rbp and rsp as callee-saved stack and frame pointers. Additionally, information associated with the function itself is passed in rsi and rdi: the context, and the function object itself. The context is an array optimized for looking up various information that the function needs at runtime, mostly free variables (lexical and global).

In this case it's redundant to take the context out of the function, but it does allow for faster access to the global scope object that was current when the function was defined. In the case of closures, every time the function() expression is evaluated, a new context will be created with a new set of free variables.

Anyway! Moving along:

 10  movq rax,rsi        ;; Store the context in a scratch register.
 13  movq [rbp-0x38],rax ;; And save it in the last (4th) stack slot.
 17  cmpq rsp,[r13+0x0]  ;; Check for stack overflow
 21  jnc 28              ;; If we overflowed,
 23  call 0x7f7b20f40a00 ;; Call the overflow handler.

Here we store the context again, in the last stack slot, and we check for overflow. On x86-64, V8 reserves r13 for a pointer to somewhere in the middle of a "global object", holding the GC roots for a given JavaScript execution context. There is a cell in the root list that holds the stack limit, which V8 abuses for other purposes: interrupting loops, breakpoints, deoptimization, etc. Again, Hölzle's thesis has some details on this technique.

The "somewhere in the middle" bit is arranged so that a simple dereference of r13 will allow us to get at the stack limit. V8 will reset the stack limit whenever it needs to interrupt a procedure.

Having passed the stack overflow check, we initialize local variables:

 28  movq rax,[rbp+0x10] ;; Receiver object to rcx (unused).
 32  movq rcx,rax        ;;
 35  movq rdx,[rbp-0x38] ;; Global object to rdx.
 39  movl rbx,(nil)      ;; Loop counter (i) to 0.
 44  movl rax,(nil)      ;; Accumulator (ret) to 0.
 49  jmp 97              ;; Jump over some stuff.

Note that the receiver object (the this object) is passed as an argument, and thus is above rbp.

Following this bit we have some code that appears to be completely dead. I'll include it for completeness, but unless the deoptimization bailouts jump back here, I don't know why it's there.

 54  movq rax,rsi        ;; Dead code.
 57  movq rbx,[rbp-0x28]
 61  testb rbx,0x1
 64  jnz 189  (0x7f7b20fa2a7d)
 70  shrq rbx,32
 74  movq rdx,[rbp-0x30]
 78  testb rdx,0x1
 81  jnz 237  (0x7f7b20fa2aad)
 87  shrq rdx,32
 91  movq rcx,[rbp-0x18]
 95  xchgq rax, rdx

We'll just forget that happened, shall we? However we are getting closer to the firm tofu of the matter, the loop. First one more check:

 97  movq rdx,[rsi+0x2f] ;; Slot 6 of the context: the global object.
101  movq rdi,0x7f7b20e401e8 ;; Location of cell holding `g'
111  movq rdi,[rdi]      ;; Dereference cell
114  movq r10,0x7f7b205d7ba1 ;; The expected address of `g'
124  cmpq rdi,r10        ;; If they're not the same...
127  jnz 371             ;; Deoptimization bailout 2

Here we see if the current definition of g, the function that we inlined below, is actually the same as when the inlining was performed.

Note that on line 97, since pointers in V8 are low-bit tagged with 01, to access slot 6 (0-based) of the context object, we only need to add 0x2f instead of 0x30. Clever, right? But we don't actually need the global object here in the main loop, so we could have delayed this load until finding out that deoptimization was necessary. Perhaps it was needed though for GC reasons.

133  movq rdx,[rdx+0x27] ;; Another redundant load.
137  cmpl rbx,0x989680   ;; 10000000, you see.
143  jge 178             ;; If i >= 10000000, break.
149  movq rdx,rax        ;; tmp = ret
152  addl rdx,0x1        ;; tmp += 1
155  jo 384              ;; On overflow, deoptimize.
161  addl rbx,0x1        ;; i++
164  movq rax,rdx        ;; ret = tmp
167  cmpq rsp,[r13+0x0]  ;; Reload stack limit.
171  jnc 137             ;; Loop if no interrupt,
173  jmp 306             ;; Otherwise bail out.
178  shlq rax,32         ;; Tag rax as a small integer.
182  movq rsp,rbp        ;; Restore stack pointer.
185  pop rbp             ;; Restore frame pointer.
186  ret 0x8             ;; Return, popping receiver.

And that's it! It's a fairly tight loop: g is inlined of course, its return value is untagged to a native int32, as are the accumulator and loop counter variables. Of course, improvements are possible -- the loop could be unrolled a few times, range analysis could avoid the deoptimization check, the overflow check could possibly be cheaper, and indeed the whole thing could be folded, but all in all, good job, MAD kittens!

Note the interesting approach to tagging: instead of storing the integer in the lower 32 bits, shifted by one, it is stored in the upper 32 bits, without a shift.

Actually there's some more junk after all of this. Another dead code block, apparently meaning to deal with floating-point values, but totally unreferenced:

189  movq r10,[r13-0x38]
193  cmpq [rbx-0x1],r10
197  jnz 397
203  movsd xmm0,[rbx+0x7]
208  cvttsd2sil rbx,xmm0
212  cvtsi2sd xmm1,rbx
216  ucomisd xmm0,xmm1
220  jnz 397
226  jpe 397
232  jmp 74
237  movq r10,[r13-0x38]
241  cmpq [rdx-0x1],r10
245  jnz 410
251  movsd xmm0,[rdx+0x7]
256  cvttsd2sil rdx,xmm0
260  cvtsi2sd xmm1,rdx
264  ucomisd xmm0,xmm1
268  jnz 410
274  jpe 410
280  testl rdx,rdx
282  jnz 301
288  movmskpd xmm2,xmm0
292  andl rdx,0x1
295  jnz 410
301  jmp 91

And then there's our stack check handler, saving all of the registers of interest, and calling out:

306  push rax
307  push rcx
308  push rdx
309  push rbx
310  push rsi
311  push rdi
312  push r8
314  push r9
316  push r11
318  push r14
320  push r15
322  leaq rsp,[rsp-0x28]
327  movq rsi,[rbp-0x8]
331  xorl rax,rax
333  leaq rbx,[r13-0x20a4b70]
340  call 0x7f7b20fa25c0
345  leaq rsp,[rsp+0x28]
350  pop r15
352  pop r14
354  pop r11
356  pop r9
358  pop r8
360  pop rdi
361  pop rsi
362  pop rbx
363  pop rdx
364  pop rcx
365  pop rax
366  jmp 137

And finally the deoptimization bailouts:

371  movq r10,0x7f7b20f7c054
381  jmp r10
384  movq r10,0x7f7b20f7c05e
394  jmp r10
397  movq r10,0x7f7b20f7c068
407  jmp r10
410  movq r10,0x7f7b20f7c072
420  jmp r10

Whew! I'd say that's long enough for today. I wanted to establish a fixed point on the low level though, so that I could write in the future about how it is that V8 compiles down to this code, what deoptimization is about, and such; filling in the middle, between the assembly and the JavaScript source. Corrections and suggestions in the comments please. Happy hacking!

Syndicated 2011-06-08 14:33:55 from wingolog

value representation in javascript implementations

In my last note I mentioned that I had been doing a lot of reading of JavaScript implementation code. One point that I didn't mention is that the state of the art is completely undocumented. So to work on rectifying that, here's the first in what might be a series of articles about the internals of JavaScript implementations.

tagging

Initially, all JavaScript implementations used tagged pointers to represent JS values. This is a old trick that comes from the observation that allocated memory takes up at least 4 or 8 bytes, and are aligned in such a way that the least significant bit or three will be zero.

Here's an old part of the Guile manual that explains the technique. For example you could take values whose lowest bit is 1 to be a pointer to the heap. Any value that ends in 0 could be an integer, represented directly in the bits of the pointer. (If your pointer tag is not 0, you will have to mask off the low bits before dereferencing a pointer..)

Google's V8 JavaScript engine still does it this way. This yields 31-bit immediate signed integers; you can just do a signed right-shift to get the value. They call them "smi" values, for "small integers"; they are more commonly called "fixnums". There are other fun tricks you can do to avoid shifts for some common operations, like addition. Note that they also take advantage of the four-byte alignment (in their case) to encode another immediate value, "failure objects", using the other low bit.

The size of a tagged pointer is the size of a pointer, so 32 bits on a 32-bit machine, and 64 on a 64-bit machine. V8 doesn't actually use 63-bit smi values on 64-bit machines, AFAIK.

nan-boxing

JavaScriptCore, the implementation used by WebKit, switched from tagged pointers to a "nan-boxed" format. The best explanation of this strategy is given by Rob Sayre of Mozilla. I believe the technique was first made popular by LuaJIT, though I would appreciate additional references.

Sayre does a good job of explaining things, but to recap, there are about 253 bit patterns that represent not-a-number values for IEEE-754 double-precision floats, but that only one of those is actually emitted by current hardware and software. So cheeky implementations are free to use the other patterns for their own purposes.

What I always wondered when I heard about this strategy was how 53 bits were possibly enough, for a 64-bit machine. How are you going to fit a pointer in that space?

Well it turns out that on x64-64 systems, you only have a 48-bit address space, currently anyway. The shape of that space is quite peculiar, too. I don't know whether Linux gives out addresses in the upper half, but Windows does not, though Solaris does.

Anyway, the other way of looking at NaN-boxing is that there are 264 - 248 values in a pointer that aren't being used, and we might as well stuff something in them, and hey, the whole space of valid double precision numbers fits!

It's convenient too that numbers in JS are specified as doubles. (Even still, all implementations define separate "small integer" types, for use in loops and indexing and such; integer operations are faster when you can use them, too.)

JSC's implementation of nan-boxing is described here, and you can see some of the neat tricks they do here. It actually makes me envious of C++, as a C programmer!

nun-boxing

So, when you choose to do nan-boxing, you basically choose to do one of two things: you favor pointers, or you favor doubles. To favor pointers means that you recognize pointers as having initial (most-significant) 0 bits; if the initial bits are not 0, then it's a double, and you have to add or subtract a bit pattern to get to the double value.

Favoring doubles means that pointers are left as NaN values, so their initial bits are all ones (or the sign bit can be unset, it doesn't matter), and to unpack a pointer, you have to rotate the double space around.

Amusingly, there is a third option as well. For 32-bit machines, you can address the second word in the double directly, so there is no need to mask off anything. This is the JSVALUE32_64 case mentioned in the JSValue code I linked to above.

JSC chose to favor pointers, and as the first JS implementation to nan-box, got to call their choice "nan-boxing". Mozilla chose to favor doubles, and so made up the silly name "nun-boxing".

get thee to a nun boxery?

So you're implementing a language. Should you nan-box or not? I can't say in your case but I can give a couple of examples.

NaN-boxing has the obvious advantage of not allocating doubles on the heap. This reduces cache pressure, GC pressure, and such. That's why Moz and JSC chose it.

V8 on the other hand has not chosen it, at least not yet, anyway. I think that one reason is because especially on embedded devices, and to an extent on ia32, passing around 64-bit values is a big lose. It's so bad that I understand that Mozilla actually passes these values by reference instead of by value in some places, on 32-bit systems only.

But the real reason that V8 doesn't nan-box I think is that they have a compiler that is able to unbox values in registers and in temporary stack locations, both as int32 values and as double values. This works for loops and such things in a function, and is obviously quite efficient, as you don't need to box and unbox at all. Passing doubles as arguments or return values however does allocate them on the heap, as far as I can tell anyway.

Also there is the hackery that with NaN-boxing, you assume things about your OS's memory management. A language implementation should be able to get around this with use of mmap at specific addresses, but still, it's tricky.

I looked at doing nan-boxing for Guile, and eventually decided against it. Guile has a few more constraints that V8 does not have. Firstly it is very portable, so it can't rely the address range constraints that x86-64 has, not without some hackery. It's also portable to the low end, like V8, so 64-bit values are a lose there.

But the killer is the conservative GC on 32-bit systems. If you represent integers with the tag in the high bits, as you would with nan-boxing, then the lower 32 bits are not distinguishable from a pointer, so they can cause the GC to retain heap objects for longer than it should. With tagged pointers, the low-bit masking lets the GC know that an integer is not a pointer. If you were to reintroduce low-bit tagging to a nan-boxed system, you've lost much of the advantage of it. Furthermore if you want to support more than 32 bits of precision in a fixnum, then on a 32-bit system you need to play games with sign extension of a 48-bit value to a 64-bit value, and for all word sizes it looks like the overhead will be significantly higher than tagging.

Finally I think that if we're really going to shoot for the moon, we should implement something like V8's Crankshaft compiler that does decent type inference, to allow unboxed values in registers and on the stack. (For readers of source code, "crankshaft" is basically the "hydrogen" flow-graph code plus a "lithium" assembler. Confusing, right?)

Well that's all for now. If you corrections, or an idea for a further topic, let me know in the comments. Thanks!

Syndicated 2011-05-18 16:57:41 from wingolog

363 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!