Older blog entries for wingo (starting at number 374)

from ssa to native code: v8's lithium language

So, friends, about a month ago I posted a look at the Hydrogen intermediate language, as part of an ongoing series on the internals of the V8 JavaScript implementation. Hydrogen is the high-level part of V8's Crankshaft optimizing compiler, where the optimizations happen. This article tells the story of Lithium, Crankshaft's low-level, machine-dependent intermediate language.

totally metal

So here's the deal: you parsed your JavaScript functions to AST objects (parse all the functions!). You emitted some simple native code, quickly but not optimized, and you sat back and waited. After a little while, your profiling told you that some function was hot, so you re-parsed it, generated a graph of Hydrogen instructions and optimized that graph. Now you need to generate code. And that's where we are! Generate all the code!

But that's not what V8 does here. Instead of going directly to native code, it translates the HGraph to a chunk of Lithium code, then emits machine code from those Lithium instructions. Unlike Hydrogen, which is an SSA form in which instructions implicitly define values, the Lithium form is closer to three-address code, with labels and gotos, and explicitly names its operands (if any) and results (if any).

These temporary values are initially declared with a number of constraints, such as "must be in a double-precision register". Later, a register allocator considers the constraints and liveness ranges, allocating each value to a specific register. Code generation happens after register allocation.

an example

Let us take a look at an example to see how these things interact. Given that Lithium is target-specific, we'll just choose one of the targets; ia32 and x64 are the most advanced, so we'll just take x64. (That's what they call it.)

LInstruction* LChunkBuilder::DoAdd(HAdd* instr) {
  if (instr->representation().IsInteger32()) {
    ASSERT(instr->left()->representation().IsInteger32());
    ASSERT(instr->right()->representation().IsInteger32());
    LOperand* left =
        UseRegisterAtStart(instr->LeastConstantOperand());
    LOperand* right =
        UseOrConstantAtStart(instr->MostConstantOperand());
    LAddI* add = new LAddI(left, right);
    LInstruction* result = DefineSameAsFirst(add);
    if (instr->CheckFlag(HValue::kCanOverflow)) {
      result = AssignEnvironment(result);
    }
    return result;
  } else if (instr->representation().IsDouble()) {
    return DoArithmeticD(Token::ADD, instr);
  } else {
    ASSERT(instr->representation().IsTagged());
    return DoArithmeticT(Token::ADD, instr);
  }
  return NULL;
}

As an aside, this is fairly typical for Crankshaft code: tight, lots of asserts, custom type predicates instead of RTTI, self-explanatory enumerated values, and much more oriented towards function calls than in-place mutations. The new objects are created in a "zone", under the covers, so that once the compiler is done, it can free the whole zone at once. I reflowed the left and right lines; otherwise it really is a paragraph of text.

Unfortunately there are not a lot of big-picture comments, hence this blog series.

Anyway, here we have three cases. The first two are for when the HValue has been given an unboxed representation, as an int32 or as a double. The last is for tagged values, which can be small integers (Smi values) or heap-allocated doubles. In the double and tagged cases, we dispatch to the DoArithmeticD and DoArithmeticT helpers, which create generic LArithmeticD and LArithmeticT instructions. But in the integer case, we see that there is a special LAddI instruction.

When there is a special case in Crankshaft, it either means that we are working around some strange corner of JavaScript, or that something is being optimized. In this case, it is an optimization, reflecting the importance of integer addition. The concrete optimizations are:

  1. Immediate operands. If one of the addends is a constant at compile-time, it can be encoded into the instruction stream directly.

  2. Overflow check removal. If the range analysis (see my previous article) showed that the value would not overflow, then no overflow check is emitted. (Otherwise, an overflow check will be inserted, branching to a deoptimization bailout. The bailout will need to know what variables are live, in that case, hence the AssignEnvironment call; more on that later.)

The UseRegisterAtStart helper call indicates that the operand should be in a register, and that it will be live at the start of the instruction but not at the end. This latter constraint is needed because x86 instructions typically clobber the first operand. UseOrConstantAtStart only allocates a register if the operand is not a compile-time constant; a constant can be encoded directly into the instruction as an immediate. Finally DefineSameAsFirst constrains the output register to be the same as the first operand. All of these calls additionally inform the register allocator of the uses and definitions, in SSA parlance, so that the allocator can build accurate liveness ranges.

Just to be complete, the double arithmetic case (not shown) is a bit more straightforward, in that there are no overflow checks or immediates. Tagged arithmetic, on the other hand, always dispatches to an inline cache. (I find this to be fascinating. Typically language implementations inline fixnum operations, but V8 does not, because their inline caches also perform type feedback, allowing unboxing optimizations where possible. In V8, tagged integer arithmetic is never inlined, by default.)

allocate all the registers?

Why bother with this Lithium thing? Why not just generate code directly? The answer is that having a target-specific Lithium language allows target-specific code generation and temporary allocation, while permitting the register allocator itself to be generic. As an example, see the ARM DoAdd implementation.

Most everyone knows this already, but the idea of register allocation is to store your temporaries in machine registers instead of in memory, so that things run fast. It's tricky though, because nested procedure calls can trash your registers, so you need to "spill" them to memory in certain cases. You also need to avoid thrashing values back and forth between registers and memory. V8 also tries to use registers when calling internal runtime helpers, which can result in a fair amount of register shuffling.

That's all traditional compiler stuff. But in a garbage-collected language, we also have the complication of knowing which registers hold garbage-collected values and which don't. For every point in the code in which GC could run, the collector will need to mark those pointers, and possibly relocate them. This complicates the allocator.

Another complication is that besides the normal entry point, optimized functions often have an additional entry point, due to on-stack replacement. That entry point has to take values from memory and pack them into registers in a way that lets the computation proceed.

Finally, the allocator plays a role in permitting so-called "lazy deoptimization". V8 hacker Søren Gjesse writes, in a rare but pleasantly exegetic message to the list:

Lazy deoptimization is what happens when all code in a context is forced to be deoptimized either due to starting debugging or due to a global assumption on the optimized code that no longer holds. The reason for the term "lazy" is that actual deoptimization will not happen until control returns to the optimized function. The way we do this lazy deoptimization is somewhat fragile and we have had quite a few bugs in that code. It is based on destructively patching the optimized code placing a call just after each call forcing deoptimization. And actually the patching is not right after the return point but after the gap code inserted at the return point. For the deferred stack check the code up to restoring the registers (popad in IA-32) is considered the gap code.

gaps?

The gap code that Søren referred to is the set of moves that shuffles registers back into place after a call. Translation of Hydrogen to Lithium inserts an LGap instruction between each pair of LInstructions. When the allocator decides that it needs to spill, restore, or shuffle some registers, it records those moves on an LGap.

A later pass serializes these parallel moves into a specific order, trying to use the minimum number of temporaries. I link to Xavier Leroy's paper above, not because V8 uses Coq, but because of the nice introduction to the problem.
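
To make the problem concrete, here is a little sketch in Scheme -- illustrative only, not V8's code -- that serializes a set of parallel moves, breaking cycles with a scratch register that I've just called temp:

(use-modules (srfi srfi-1))

;; Each move is a pair (DST . SRC); registers are named by symbols.
;; Returns a list of sequential moves with the same effect as doing
;; all of the input moves at once.
(define (serialize-parallel-moves moves)
  (let loop ((pending moves) (out '()))
    (cond
     ((null? pending) (reverse out))
     ;; A move whose destination is not read by any other pending
     ;; move can be emitted now; it clobbers nothing still needed.
     ((find (lambda (m)
              (not (any (lambda (other)
                          (and (not (eq? m other))
                               (eq? (car m) (cdr other))))
                        pending)))
            pending)
      => (lambda (m)
           (loop (delq m pending) (cons m out))))
     ;; Otherwise every pending destination is still needed as a
     ;; source: a cycle.  Break it by saving one source in the
     ;; scratch register and redirecting its readers there.
     (else
      (let ((src (cdr (car pending))))
        (loop (map (lambda (m)
                     (if (eq? (cdr m) src) (cons (car m) 'temp) m))
                   pending)
              (cons (cons 'temp src) out)))))))

;; Swapping two registers, for example:
;; (serialize-parallel-moves '((rax . rbx) (rbx . rax)))
;; => ((temp . rbx) (rbx . rax) (rax . temp))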

Anyway, the actual algorithm used by the Lithium register allocator appears to be a version of linear-scan, with live range splitting. This part of V8 actually is well-commented, so if you can understand the Sarkar paper, then head on over to ye olde google code and check it out.
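
For a feel of the linear-scan idea itself, here is a toy allocator in Scheme -- a sketch only, with none of the live-range splitting or spill heuristics that the real allocator has, and with made-up interval and register names:

(use-modules (srfi srfi-1))

;; An interval is a list (NAME START END); REGISTERS is a list of
;; register names.  Returns an alist mapping each NAME to a register,
;; or to the symbol `stack' if we ran out of registers.
(define (linear-scan intervals registers)
  (let loop ((pending (sort intervals
                            (lambda (a b) (< (cadr a) (cadr b)))))
             (active '())        ; elements: (END . REG)
             (free registers)
             (result '()))
    (if (null? pending)
        (reverse result)
        (let* ((name (car (car pending)))
               (start (cadr (car pending)))
               (end (caddr (car pending)))
               ;; Expire active intervals that end before this one
               ;; starts, returning their registers to the free pool.
               (expired (filter (lambda (a) (< (car a) start)) active))
               (active (remove (lambda (a) (< (car a) start)) active))
               (free (append (map cdr expired) free)))
          (if (null? free)
              ;; Out of registers.  A real allocator would spill the
              ;; interval with the furthest end point; we just put
              ;; this one on the stack.
              (loop (cdr pending) active free
                    (cons (cons name 'stack) result))
              (loop (cdr pending)
                    (cons (cons end (car free)) active)
                    (cdr free)
                    (cons (cons name (car free)) result)))))))

;; (linear-scan '((a 0 4) (b 1 2) (c 3 6)) '(rax rbx))
;; => ((a . rax) (b . rbx) (c . rbx))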

generate all the code!

The result of register allocation is that all LOperands have been assigned specific machine registers, and some moves have been recorded in the LGap instructions. Now, finally, we generate code: each Lithium instruction is traversed in order, and they all write assembly into a buffer. See the LAddI code generator, as an example.

All of the JavaScript engines generate code using assemblers written in C++. Some people still assume that compilers must generate assembly, in text, but it's much easier and faster just to call methods on an object that writes bytes into a buffer. This approach also makes it easy to add macro instructions, just by subclassing the assembler. Finally, since modern JS implementations have to muck around with the bits all the time (disassembly, source debugging, backtraces, garbage collection (tracing and moving), deoptimization, OSR, inline caches, type feedback), it's easiest to maintain your invariants if you control the bit generation.

V8's assembler originally came from the Strongtalk project, where some of the V8 hackers came from, long ago. This assembler writes code into memory, going forward, and then when it's finished, it appends "relocation information" to the code, written backwards. Relocation information lets V8 mark interesting places in the generated code: pointers that might need to be relocated (after garbage collection, for example), correspondences between the machine program counter and source locations, etc.

As anyone who's hacked on debuggers knows, the problem with debug information is that it is large. Without clever encodings, the amount of debugging data generated by a compiler can kill runtime performance, simply because of the amount of data that needs to be paged into memory. Of course this concern also applies at compile-time, and especially so for JIT compilers. The Strongtalk assembler's main contribution to V8 (I think) is a complicated bit-encoding that tries to compress this data as far as possible. The assembler also provides iterators for traversing this data at runtime.

finish line

And now, dear readers, I believe that we have come full circle on this look at V8. We started low, looking at the assembly generated by V8 for a simple loop. We elaborated this low-level view with a look at on-stack replacement, though without understanding how such information was available. So we jumped back to the top, taking a high-level look at V8's compilers. We drilled down to Hydrogen, and here we covered Lithium, register allocation, and code generation.

I hope it's been an entertaining series to read; it has certainly been enlightening to write. Thanks very much to my employer Igalia for supporting my work on this. Comments and corrections are welcome, as always, and happy hacking.

Syndicated 2011-09-05 11:25:05 from wingolog

the gnu extension language

I'm sitting on a train waiting to leave Paris for Barcelona, returning from the 2011 GNU Hackers Meeting. It was fantastic, clearly the best we have had yet. Thanks a lot to Ludovic Courtès for organizing it, and to IRILL and Sylvestre Ledru for hosting. I hope to write more about it in the future, but this essay will be long enough :)

I gave a talk entitled "The User in the Loop", which made the perhaps obvious argument that "extensibility is good & stuff". I hope that it did so in an entertaining and illuminating fashion. It also argued that Guile is a great fit for the extensibility needs of the GNU project. The video will be out shortly. Slides are available here, though you probably just want the notes instead.

going meta: goals and strategies

Guile is the GNU extension language. This is the case because Richard Stallman said so, 17 years ago. Beyond being a smart guy, Richard is powerfully eloquent: his "let there be Guile" proclamation was sufficient to activate the existing efforts to give GNU a good extension language. These disparate efforts became a piece of software, a community of hackers and users, and an idea in peoples' heads, on their lips, and at their fingertips.

So, Guile is a great language. But is it still the right extension language for GNU? At this GHM, Jim Blandy brought up the good point that we should revisit our past decisions periodically, to see if they still fit with our goals and with the world. "X is the GNU Y" is still a powerful, generative proclamation, so we should be careful with that power, to make sure that we are generating the right world. In that spirit, I would like to re-make the argument for Guile as the GNU extension language.

goal

Why care about extension languages? My presentation deals with this issue at length, but a nice summary can be found in the Guile manual:

Guile was conceived by the GNU Project following the fantastic success of Emacs Lisp as an extension language within Emacs. Just as Emacs Lisp allowed complete and unanticipated applications to be written within the Emacs environment, the idea was that Guile should do the same for other GNU Project applications. This remains true today.

The idea of extensibility is closely related to the GNU project's primary goal, that of promoting software freedom. Software freedom means that people receiving a software package can modify or enhance it to their own desires, including in ways that may not have occurred at all to the software's original developers. For programs written in a compiled language like C, this freedom covers modifying and rebuilding the C code; but if the program also provides an extension language, that is usually a much friendlier and lower-barrier-of-entry way for the user to start making their own changes.

-- Guile and the GNU Project

Although GNU maintainers are free to do as they wish with their packages, we are working on making an integrated system, and a system that mostly uses one extension language is better than one that uses many. It is reasonable to promote one language as the default choice, while not forbidding others, of course.

I think that Guile is a great strategy to achieve this goal. Guile is good for the programs it extends, it is good for users, and it is good for GNU as a whole. The rest of this article argues these points in more detail.

languages do matter

An extension language is first and foremost a language: a vehicle for expression and for computation. Guile is an implementation of Scheme, one of the most expressive languages out there.

In his 2002 essay, Revenge of the Nerds, Paul Graham argues (somewhat polemically) that some languages are better than others, and that you should use the better languages. Of course he identifies "better" with "closer to Common Lisp", which is somewhat naïve, but the general point still stands.

At the time when Guile was started, it was quite clear that Scheme was indeed more expressive than Bourne shell, or than Tcl. Since then, many languages have come and gone, and most of those that have stuck around do have many of the features of Scheme: garbage collection, closures, lexical scope, etc.

But still, there are many areas in which Guile Scheme remains more powerful than other languages.

macros: language extensibility

First and foremost, Scheme's macros have no comparison in any other language. Macros let users extend their language. Macros allow a language to adapt to the times, to the different needs of the future (unevenly distributed as it is).

I realize that this broad argument will not speak to people that do not use macros, so here is a concrete example.

Some languages have pattern-matching facilities, like Erlang. Pattern-matching goes like this: you have a datum, and a set of expected patterns. You go over the patterns in order, seeing if the datum matches the pattern. If it matches, some matching part of the datum is extracted and bound to local variables, and a consequent expression is evaluated.

With most languages, either you have pattern matching, because Joe Armstrong put it there, or you don't, in which case you are either blissfully ignorant or terminally frustrated. But in languages with macros, like Scheme, you can extend your programming language to give it pattern-matching features.

I'll show an implementation here, to keep things concrete. Here's a definition of a match macro. If you are unfamiliar with Scheme macros, do take a look at the relevant section in Guile's manual.

(define-syntax match
  (syntax-rules (else)
    ;; If we get to an `else' clause, evaluate its body.
    ((match v (else e0 e ...))
     (begin e0 e ...))
    ;; Otherwise use the `pat' helper macro to see if the first
    ;; clause matches, falling back to the remaining clauses if not.
    ((match v (p e0 e ...) cs ...)
     (let ((fk (lambda () (match v cs ...))))
       (pat v p (begin e0 e ...) (fk))))))

There are two rules in this macro. The first matches an else clause at the end. The second hands off the real work to a helper macro, pat.

(define-syntax pat
  (syntax-rules (_ unquote)
    ;; () succeeds if the datum is null, and fails
    ;;  otherwise.
    ((pat v () kt kf)
     (if (null? v) kt kf))
    ;; _ matches anything and succeeds.
    ((pat v _ kt kf)
     kt)
    ;; ,VAR binds VAR to the datum and succeeds.
    ((pat v (unquote var) kt kf)
     (let ((var v)) kt))
    ;; If the pattern is a pair, succeed if the datum
    ;; is a pair and its car and cdr match.
    ((pat v (x . y) kt kf)
     (if (pair? v)
         (let ((vx (car v)) (vy (cdr v)))
           (pat vx x (pat vy y kt kf) kf))
         kf))
    ;; Literals in a pattern match themselves.
    ((pat v lit kt kf)
     (if (eq? v (quote lit)) kt kf))))

The pat helper macro is written in so-called continuation-passing style. This means that pat receives the consequent expressions to evaluate as arguments: one if the pattern matches (kt), and the other if it doesn't (kf). I have commented the clauses to help readers that are unfamiliar with syntax-rules.

I hope that you find this implementation to be elegant, but that's not really the important thing. The important thing is that if we put this code in a module and export match, then you have provided pattern matching to a language that didn't have it. You, as a user, have extended your programming language.

This facility is quite useful. Let us consider the case of some HTML, which you have parsed to an SXML representation:

(define html '(a (@ (href "http://gnu.org/")) "GNU"))

We can use the match form to get the URL for this link:

(define (link-target x)
  (match x
    ((a (@ (href ,target)) . _) target)
    (else #f)))

(link-target html) => "http://gnu.org/"
(link-target '(p "not a link")) => #f

Sometimes people say that macros make code hard to read, because (they say) everyone makes up their own language. I hope that this example demonstrates that this is not the case: link-target above is eminently more readable, robust, and efficient than rolling your own matcher by hand. Good programmers use macros to bring the language up to the level of the problem they are solving.

Note that you didn't need to build consensus in the ECMAScript committees to add match to your language. Nor do you have to be satisfied with Joe Armstrong's particular implementation of matching, as you do in Erlang. Guile's default matcher has a number of bells and whistles, but you are free to write custom matchers for other formats, extending matchers to protocols and binary records of various types.

Finally, this same argument applies to embedded parsers, domain-specific languages, dynamic binding, and a whole host of other things. You don't have to wait for a language like Python to finally add a with statement; you can add it yourself. To paraphrase Alan Perlis, Guile is an extensible extension language.

delimited continuations

Delimited continuations let a user extend their code with novel, expressive control structures that compose well with existing code.

I'm going to give another extended example. I realize that there is a risk of losing you in the details, dear reader, but the things to keep in mind are: (1) most other programming languages don't let you do what I'm about to show you; and (2) this has seriously interesting practical applications.

So, let's take coroutines as the concrete example. Guile doesn't define a standard coroutine abstraction yet, though it probably should. Here is one implementation. We begin by defining call-with-yield:

(define (call-with-yield proc)
  (let ((tag (make-prompt-tag)))
    (define (handler k . args)
      (define (resume . args)
        (call-with-prompt tag
                          (lambda () (apply k args))
                          handler))
      (apply values resume args))

    (call-with-prompt
     tag
     (lambda ()
       (let ((yield (lambda args
                      (apply abort-to-prompt tag args))))
         (proc yield)))
     handler)))

The guts of this function are in the call-with-prompt call, which makes a yield procedure and passes it to proc. If yield is called, it will jump back up the stack, passing control to handler.

The handler builds a resume function, and returns it along with the values that were yielded. The user can then call the resume function to continue computation, possibly passing it some values.

If we would like to add some syntactic sugar to make this form easier to read, we can do so:

(define-syntax with-yield
  (syntax-rules ()
    ((_ yield exp exp* ...)
     (call-with-yield
      (lambda (yield) exp exp* ...)))))

As an example, here is a function that turns a traversal operator into a generator:

(define (generate traverse collection)
  (with-yield yield
    (traverse yield collection)))

Let's see a use of it.

> (generate for-each '(three blind mice))
$1 = #<procedure resume args>
$2 = three
> ($1)
$3 = #<procedure resume args>
$4 = blind
> ($3)
$5 = #<procedure resume args>
$6 = mice
> ($1)
$7 = #<procedure resume args>
$8 = blind
> ($5)
> 

This is fascinating stuff. Normally when you call for-each, you lose control of your execution. for-each is going to call your function as many times as it wants, and that's that. But here we have inverted control-flow so that we're back in charge again.

Note that we did this without any cooperation from for-each! In Python, if you want to compose a number of its generators into a coroutine, every procedure in the call stack must cooperate. In fact until recently this wasn't even possible to implement efficiently. But in Guile, delimited continuations allow us to remove the unnecessary restrictions that make yield from seem like a good idea. Our with-yield composes with the other parts of our language.

Also note that these generators are functional: instead of mutating an object, causing the next call to it to proceed, we return the resume procedure as a value. It's not always what you want, and indeed you could make different choices, but this does have the advantage that you can rewind your traversal to different points, as I did above by calling $1 after calling $3. You can use this strategy to implement other advanced control-flow operators, like amb.

One can use this approach in other inversion-of-control contexts. Node.js people are all excited about their asynchronous idioms, but they have no idea how much better it could be if JavaScript actually supported asynchronous idioms on a language level.

For more on delimited continuations, see the readable 1993 paper by Dorai Sitaram, Handling Control.

the next 700 language features

Guile has a number of other language features that other extension languages do not. Here's a short list.

goops

GOOPS is Guile's Object-Oriented Programming System. It implements OOP like Common Lisp's CLOS does.

What this means to most programmers is that GOOPS is a little different. In Smalltalk-style OOP, the focus is on classes. In Self-style OOP, the focus is on prototypes. But in CLOS-style OOP, the focus is on generic functions. A good introduction to this topic can be found in Gregor Kiczales' beautiful book, The Art of the Metaobject Protocol. At OOPSLA 1997, Alan Kay called it "the best book written in computing in ten years".

Anyway, what I mean to say here is Guile does support object-oriented programming, and although it is not the same as (for example) Python's model, it is quite capable and expressive. Some of its features, like class redefinition, are not found in any of the more common OO systems.
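
To make the generic-function flavor concrete, here is a small GOOPS sketch; the classes and methods are of course invented for the example:

(use-modules (oop goops))

;; In GOOPS the behavior hangs off generic functions, not classes.
(define-class <shape> ())
(define-class <circle> (<shape>)
  (radius #:init-keyword #:radius #:getter radius))
(define-class <rectangle> (<shape>)
  (width #:init-keyword #:width #:getter width)
  (height #:init-keyword #:height #:getter height))

;; `area' is a generic function; each define-method adds a method
;; specialized on its argument's class.
(define-method (area (c <circle>))
  (* 3.14159 (radius c) (radius c)))
(define-method (area (r <rectangle>))
  (* (width r) (height r)))

(area (make <circle> #:radius 2))               ; => 12.56636
(area (make <rectangle> #:width 3 #:height 4))  ; => 12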

parallel computation

Guile supports POSIX threads, with no global interpreter lock. Coordination between threads is managed via the normal primitives like mutexes and condition variables, though there are higher-level forms like with-mutex.

Pthreads provide concurrency. Higher-level constructs provide structured parallelism: futures, for fine-grained parallelism, and other forms like par-map.
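
A small taste of those APIs, assuming a Guile 2.0 recent enough to have the (ice-9 futures) module; the slow-square procedure is just a stand-in for real work:

(use-modules (ice-9 threads)     ; par-map, with-mutex, ...
             (ice-9 futures))    ; future, touch

(define (slow-square x)
  (usleep 1000)                  ; pretend this is expensive
  (* x x))

;; Structured parallelism: map over a list, processing the elements
;; in parallel.
(par-map slow-square '(1 2 3 4 5))
;; => (1 4 9 16 25)

;; Futures: kick off a computation now, collect its value later.
(define f (future (slow-square 42)))
;; ... do other things ...
(touch f)
;; => 1764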

I'm going to go out on a limb here and say that managing concurrency is the major problem of programming today. Guile's existing facilities are OK, but are not as convincing as those of Clojure or D. Is STM the right answer? Or is it shared-nothing by default, like Erlang? I don't know, but I'm pretty sure that we can build the right answer in Guile, whatever the right answer is.

the numeric tower

Guile implements the full numeric tower: integers (of any precision), real numbers, rationals, complex numbers, and the strange IEEE-754 values like negative zero, not-a-number, and the infinities. Arithmetic works with them all. In addition, there are some plans afoot to add support for optional multi-precision floating-point and complex numbers.

I shouldn't have to say this, but in Guile, (/ 12 10) is not 1. Nor is it a double-precision approximation, for that matter; it is the rational number, 6/5.
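
A few more interactions along the same lines:

(/ 12 10)             ; => 6/5, an exact rational
(exact->inexact 6/5)  ; => 1.2
(expt 2 80)           ; => 1208925819614629174706176, a bignum
(/ 1.0 0.0)           ; => +inf.0
(- 0.0)               ; => -0.0
(magnitude 3+4i)      ; => 5.0, computed from a complex number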

data types

Guile comes with a wealth of built-in data types: bytevectors, uniform arrays (for example, a vector of packed unsigned 16-bit integers), multidimensional arrays (possibly of packed values), bitvectors, strings consisting of characters (as opposed to bytes), records, etc. There are also the standard Lisp data types like symbols, pairs, vectors, characters, and so on.

You can define your own data types, using GOOPS, using the record facility, or the lower-level structs. Of course you can build ad-hoc data structures using the other predefined data types; after all, as Alan Perlis says, "it is better to have 100 functions operate on one data structure than 10 functions on 10 data structures."
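
A small sampler, just to make it concrete:

(use-modules (rnrs bytevectors))

;; A raw byte buffer.
(define bv (make-bytevector 4 0))
(bytevector-u8-set! bv 0 255)
(bytevector-u8-ref bv 0)         ; => 255

;; A 2-by-3 array of packed unsigned 16-bit integers.
(define a (make-typed-array 'u16 0 2 3))
(array-set! a 7 1 2)
(array-ref a 1 2)                ; => 7

;; Strings are made of characters, not bytes.
(string-length "λx.x")           ; => 4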

functional programming

Scheme is a multi-paradigm language, supporting many kinds of programming techniques. It has excellent support for the functional style, with its lexical scope, closures, and proper tail calls. Also, unlike their call/cc cousins, Guile's delimited continuations are practical to use in a functional style, as shown above in the generator example.

Finally, I should mention that all of these features are designed to work well together. Macros compose with modules to preserve lexical scoping. Delimited continuations can transfer control through functional combinators while preserving referential transparency. Dynamic bindings behave sanely when continuations are unwound and reinstated. Scheme is a language that "hangs together": its features are carefully chosen to produce sound semantics when used in combination.

implementation features

So much for the Scheme language, as implemented by Guile. On a related but not identical tack, I would like to argue that Guile's implementation is a good practical fit for GNU.

development experience

Guile has a fantastic developer experience. Two factors contribute to this. One is that it ships with a full-featured, pleasant REPL, with tab-completion, readline, and a debugger. The second factor is Guile's reflective capabilities, coupled with the excellent Geiser Emacs mode.

As Geiser hacker Jao says, when it comes to debugging, you can be a mortician or you can be a physician. Core dumps are corpses. The REPL is the land of the living. If you run Guile with the --listen argument, it will spawn off a REPL thread, listening on a port. Geiser can telnet into that port to work with your program as it is running.

See Using Guile in Emacs, for more on how to hook this up.

multiple languages

It's pretty clear to me that we're going to nail the Elisp / Scheme integration. BT Templeton has been doing some great work on that this summer, and we are closer than ever. Admittedly some folks have doubts about how we will pull this off, but I am convinced. I will have to defer detailed arguments to some other forum, though.

Now, when Richard first promoted Guile as the GNU extension language, I think it was because he loved Lisp, but he did consider that not every user would share this perspective. For that reason he stressed the multi-lingual characteristics of Guile, which until now have not really come to fruition. However it does seem that in addition to Scheme and Emacs, with time we can probably get Lua and JavaScript working well, to the point that users will feel happy making small- and medium-sized programs and extensions in these languages.

It's true that if we decided that our best strategy would be to make JavaScript hackers happy, then we should adopt V8 instead, or something like that. But I don't think that is our best strategy. I will come back to this point in the conclusion.

health

Guile is some 16 years old now, and time has winnowed its interface to something that's mostly clean and orthogonal. There are still some crufty pieces, but we have a good deprecation story, both for Scheme code and for C code linking to libguile.

Despite its maturity, Guile is a very healthy project. Check the Ohloh page for details, but note that the activity statistics are based on the master branch, whereas most commits have been landing on stable-2.0 since 2.0.0 was released in February. We'll move back into a development phase at some point in the next few months.

And on the level of users, things seem to be going well -- it's not the most-used language out there, but it is gaining in traction and respectability. The number of contributors is going up, and the various forums are active.

choosing a strategy to achieve our goals

I have argued, I hope convincingly, that Guile is a good choice for an extension language. Now I would like to put the decision itself in a context. As Noam Chomsky keeps reminding us, we need to judge decisions by their expected consequences, not their proclaimed intentions. Does Guile generate the right future for GNU?

Let me begin by arguing this point from the contrapositive: that choosing another language is not the right thing for GNU. Let us take the example of JavaScript, a very popular language these days. Jim Blandy notes that the real measure of an extension language is that, when you want to implement a feature, you want to implement it in the extension language rather than in the original language (C, usually). So let us assume that JavaScript triumphs in this regard in the future, that the GNU system ends up mostly implemented in JavaScript, perhaps via the V8 implementation. Is this the right future to generate?

For me, the answer is no. First of all, what does it mean to have a GNU system based on a non-GNU technology? What would be left of GNU? It might be lack of imagination on my part, but such a future makes me think of country folk come to the city and losing sight of who they are. GNU is a social organization that makes software, with particular social characteristics. Mixing GNU with a foreign language community sounds dangerous to GNU, to me. See Kent Pitman's Lambda, the Ultimate Political Party for similar musings.

Beyond the cultural issues -- which to me are the ones that really matter -- there are some important technical points to consider. Right now GNU is mostly written in C, and controls its implementation of C, and the libraries that it depends on. Would GNU be able to exercise such control over JavaScript? Can GNU build in new concurrency primitives? Could we add new numeric types? Probably not. JavaScript is a language with a platform, and that platform is the browser. Its needs are not very related to the rest of GNU software. And if we did adapt JavaScript to POSIX-like platforms, GNU would lose many of the advantages of choosing a popular language, because we would be maintaining a second-class, foreign platform.

We should also consider the costs of using hastily designed languages. JavaScript has some crazy bad stuff, like with, var hoisting, a poor numeric model, dynamic this scoping, lack of modularity regarding binding lookup (witness the recent spate of browser bugs regarding prototype lookup on user objects), the strange arguments object, and the lack of tail recursion. Then there are the useful features that JavaScript lacks, like macros, modules, and delimited continuations. Using JavaScript has a cost, and its total cost in GNU would have to be integrated over its lifespan, probably 20 years or so. These limitations might be acceptable now, but not in the context of the larger programs of the future. (This, by the way, was the essence of Richard's argument against Tcl.)

Finally, we have the lifespan issue. If GNU had chosen Tcl because it was popular, we would have a mass of dead code. (You can like Tcl and still admit that we are past its prime.) Python is now, I think, at the height of its adoption curve, and beginning its descent. JavaScript is the next thing, and still on the uptick. But JavaScript will fall out of favor, and there will be a new paradigm, and a next language. The future of computing will not be the same as the present. So how will JavaScript adapt to these changes? We can't tell right now, but the difficulty of changing simple things -- like making 010 parse as 10 and not 8 -- indicates that at some point it will stagnate. But Scheme will still be with us, because its parts are well thought-out, and because it is a language that is inherently more adaptable.

Paul Graham concludes Revenge of the Nerds with the following advice:

"If you start a startup, don't design your product to please VCs or potential acquirers. Design your product to please the users. If you win the users, everything else will follow. And if you don't, no one will care how comfortingly orthodox your technology choices were."

In the case of a GNU extension language, our users are developers, so we do need to work hard to please them. But free software has a different life cycle than a VC-backed startup; instead of racing to the flip, we have to please our users now and in ten years. Viewed from the long now, the future is not simply a first-order extrapolation of the present: it is something that we generate. Guile is still the right choice for the GNU extension language because it generates the right future: a healthy, sustainable, powerful GNU project that adapts to the changing needs of time.

Syndicated 2011-08-30 18:15:29 from wingolog

to guadec!

I'm very happy to say that I'm going to Berlin on Friday, for GUADEC, the GNOME conference.

This will be my seventh GUADEC, and my first as an Igalian. I'm very much looking forward to seeing old friends, as well as to meeting new folks.

I'm especially looking forward to soaking in the energy of GNOME 3. GNOME 3 is the best user-facing project that Free Software has going right now -- it's a dynamic group of creative, talented people, making users' experiences with computers beautiful and natural.

The only other Free Software project that is focused on users in the same way is Android, and it is its own corporate kettle of fish. I hope that it's GNOME that ultimately wins, against the hoarders of the software world, from the workstation down to the phone, because GNOME brings with it our participatory culture of software as a means for human emancipation: that we can work together to make a better world. We've been working on this Free Software thing for some 30 years now; we're not a fad, or a hype: we are people working together, making our world. We can win, and humanity deserves it.

. . .

This year GNOME is again sharing a conference with KDE, another "desktop environment", to use the outdated term. I have pleasant memories of hanging out with KDE folk at the last joint conference we had, but I don't think it's healthy. Last time we were together because it was convenient for Nokia. Now that Nokia sold us both out, it doesn't make sense.

I do wish we could collaborate in meaningful ways -- and we do, in lower levels in the stack, and partly at the window manager level -- but we're working in the same space with two different cultures, and colocation ultimately dilutes the impact these conferences can have among the members of these two projects. So my KDE friends, let's have one last fling this weekend in Berlin, and thereafter see each other in Brussels, at FOSDEM :)

My seven GUADECs marks me as some kind of old fart now, I suppose, and it's true that my own work with GNOME is mostly in the past. I used to work on GStreamer, but don't anymore; Guile is mostly where I'm at, free-software-wise now. I wrote some bindings for the GNOME libraries back in the 2.x days but have not had the time to update them to the modern 3.x world. I fear that I'll just have to pass that project on, though there does not appear to be much hacker interest in maintaining API compatibility.

Anyway, as I said, I mostly work with Guile now. If you're interested at all in Guile, please pull me aside, and I'll be happy to talk your ear off ;-) I've also been doing work with the V8 JavaScript implementation recently, for work, and I'd also be happy to talk about that: compilers, optimization, and what-not. But kids, remember: JavaScript is a fad. Guile is the official scripting language of GNOME.[1]

See you in Berlin!

[1] ;-)

Syndicated 2011-08-03 21:19:02 from wingolog

a closer look at crankshaft, v8's optimizing compiler

Continuing in my series of articles on V8, in this article I'd like to take a closer look at V8's optimizing compiler, with a focus on the Hydrogen intermediate language.

brief summary

As my readers might recall, Crankshaft is the marketing name of V8's optimizing compiler. It complements the "full" compiler by optimizing hot functions, while ignoring uncommon cases. It translates the JavaScript abstract syntax tree to a high-level static single-assignment (SSA) representation called Hydrogen, and tries to optimize that Hydrogen graph. Then the Hydrogen is translated to the machine-specific Lithium low-level language, which facilitates register allocation and, finally, code generation.

The point of the Hydrogen translation is threefold:

  1. To permit inlining. Inlining is truly the mother of all optimizations, in that it permits more optimizations to occur.

  2. To permit temporary values to be represented as untagged integer or double values. In JavaScript, you need type feedback to do this with any degree of success. Unboxing reduces memory pressure and permits Crankshaft to use efficient native machine instructions. But as you would expect, this unboxing needs to be done carefully.

  3. To facilitate loop-invariant code motion and common subexpression elimination.

The Hydrogen SSA representation facilitates all of these operations. This article takes a closer look at how it does so.

static single assignment


[figure: diagrams of the intermediate-representation data structures, taken from the Kennedy paper discussed below; the original post links to an accessible SVG]

Hydrogen consists of values--instructions and phis--contained in basic blocks, which themselves are contained in a graph.

The standard SSA notation associates a named value with every expression:

t1 = x + y;
t2 = 3;
t3 = t1 + t2;
...

It's useful to write it out this way, but these temporary names are not strictly necessary: since each instruction defines a temporary, we can use the identity (address) of each instruction as its name, for the purpose of SSA.

A later pass will allocate storage for all of these temporaries -- on the stack, on the heap, or unboxed in general-purpose or floating-point (SSE / Neon) registers. At first I was quite confused about the term "instruction", as we functional programmers tend to use "expression", but it's something you get used to after a while, and it's useful when you use it to refer to the named definition in addition to the expression that produces it.

An instruction is a kind of value -- an entity that defines a name. The other kind of value is the phi. Phis occur at control-flow joins, and are logically associated with the beginning of a block. My previous article discusses phis in detail.

All values have a pointer to their containing block and to their uses. A use is when a later instruction uses the value computed (defined) by an earlier instruction. Values also have a pointer to the next instruction in their block. It is cheaper to traverse instructions forward, in the direction of control and data flow, than it is to go backward. Values also have fields for their type, their representation, a flags bitfield, and a pointer to a range structure. We will look at these in more detail later.

The last instruction in a block is a control instruction. No one uses the value defined by a control instruction -- its purpose is to transfer control to another basic block. The set of blocks that may follow a block in control-flow order are the successors of that block. Preceding blocks are its predecessors. Every block has an array of pointers to its predecessors and successors, and also to its dominator block, and to all the blocks that it dominates.

Now, friends, when I first saw this I was disgusted. What a rat's nest of pointers! But then I read Kennedy's great Compiling With Continuations, Continued paper and it became clear that it is very difficult to have an efficient functional-style intermediate representation. The diagrams above are from a part of his paper in which he suggests a mutable IR. In my opinion, if he's gone that far, he might as well finish the job and eliminate temporary identifiers entirely, as Hydrogen does.

electrolysis

Crankshaft has a funny job to do. Not only does it have to optimize JavaScript, which is hard enough; it also has to do so in a way that permits on-stack replacement to optimize a long-running loop, without exiting the loop.

In practice what this means is that Crankshaft needs to walk the abstract syntax tree (AST) to emit Hydrogen code in the same way as the full compiler. It needs to simulate what the full compiler is doing: what is on the stack, which variables are heap-allocated and which are not, and, for named local variables, what HValue is bound to that variable. This simulation is kept in an environment.

This AST-to-Hydrogen translation process also handles phi insertion. So unlike the textbook dominance-frontier algorithm for optimal phi insertion, the HGraphBuilder simply adds phi values whenever a block has more than one predecessor, for every variable in the environment that is not always bound to the same value. For loops, a phi is added for every variable live in the environment at that time. A later pass attempts to prune the number of phis, where it can.

type feedback

The AST that was given to the Crankshaft compiler has integer identifiers assigned to each node. These identifiers are assigned predictably, so they are the same identifiers that the "full-codegen" compiler saw, and they are embedded in the inline caches used by the full-codegen code. This correspondence allows Crankshaft to know the types of the values that have been seen at any given call site, property access, etc.

This information is used by the graph builder to initialize the type field of each HValue, to help in type inference. Additionally, property accesses and function calls can record the precise type of object being accessed, or the function that was called, so that Crankshaft has an opportunity to inline these function and method calls.

inlining

When the graph builder reaches a call, it can try to inline it. Note that this happens very early, while the graph is still being built, before any optimization passes have run. This allows the optimizations to have more visibility, and allows for more code motion. (In general, V8 will not move instructions across a call boundary.)

There are lots of heuristics and conditions that affect inlining, and these are in flux. Some conditions that must currently be met for inlining to proceed are:

  1. The function to be inlined is less than 600 characters long.

  2. Neither the inner nor the outer functions may contain heap-allocated variables. (In practice, this means that neither function may contain lexical closures.)

  3. Inlining of for-in, with, and some other expression types is not currently supported.

  4. There is a limit on the maximum inlining depth.

  5. A function's call to itself will not be inlined.

  6. There is a limit to the number of nodes added to the AST.

source-to-source optimization passes

As parsing proceeds, Crankshaft generates specific HInstruction nodes and packs them into HBasicBlocks. There is a large set of about 120 instructions, which allows Crankshaft to reason precisely about the various operations and their representations.

At this point, inlining is complete, so Crankshaft moves to focus on representation of temporary values. The goal is to unbox as many values as possible, allowing the use of native machine instructions, and decreasing the allocation rate of tagged doubles. The Hydrogen SSA language directly facilitates this control-flow and data-flow analysis.

The following sections discuss the various source-to-source passes that are made on the Hydrogen IR. It's a lot of detail, but I'm sure that someone will find it useful.

block reordering

Crankshaft reorders the array of blocks in the HGraph to appear in reverse post-order, equivalent to a topological sort. In this way, iterating the array of blocks from beginning to end traverses them in data-flow order.
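
In case reverse post-order is unfamiliar, here is the textbook depth-first computation, sketched in Scheme over an entry node and a successors procedure; it is not V8's code:

(define (reverse-post-order entry successors)
  (let ((visited '())
        (order '()))
    (define (visit node)
      (if (not (memq node visited))
          (begin
            (set! visited (cons node visited))
            (for-each visit (successors node))
            ;; A node is emitted only after everything reachable from
            ;; it: that is post-order, and consing as we finish
            ;; reverses it.
            (set! order (cons node order)))))
    (visit entry)
    order))

;; For the diamond a -> b, a -> c, b -> d, c -> d:
;; (reverse-post-order 'a (lambda (n)
;;                          (case n
;;                            ((a) '(b c)) ((b) '(d)) ((c) '(d))
;;                            (else '()))))
;; => (a c b d), a topological order of the graph.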

dominator assignment

The reverse post-order sort from the previous step facilitates calculation of the dominator for each block. The entry block dominates its successors. Other blocks are iterated over, in order, assigning them the dominator of their predecessors, or the common dominator of their predecessors if there is more than one.
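
Sketched in Scheme, using the usual trick of finding the common dominator of two blocks by walking up their dominator chains until they meet, that pass might look something like this (illustrative only, not V8's code):

(use-modules (srfi srfi-1))

;; BLOCKS is a list of blocks in reverse post-order, entry first;
;; RPO-NUMBER and PREDECESSORS are procedures on blocks.  Returns a
;; hash table mapping each block to its immediate dominator.
(define (assign-dominators blocks rpo-number predecessors)
  (let ((idoms (make-hash-table)))
    (define (common-dominator a b)
      ;; Walk up the two dominator chains until they meet.
      (cond ((eq? a b) a)
            ((> (rpo-number a) (rpo-number b))
             (common-dominator (hashq-ref idoms a) b))
            (else
             (common-dominator a (hashq-ref idoms b)))))
    ;; The entry block dominates everything, including itself.
    (hashq-set! idoms (car blocks) (car blocks))
    ;; Every other block gets the common dominator of those of its
    ;; predecessors already processed; because of the reverse
    ;; post-order, at least one of them always has been.
    (for-each (lambda (block)
                (let ((preds (filter (lambda (p) (hashq-ref idoms p))
                                     (predecessors block))))
                  (hashq-set! idoms block
                              (reduce common-dominator #f preds))))
              (cdr blocks))
    idoms))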

mark dead subgraphs

It can often be the case that a function being optimized does not have full type information. Some branches in the original function may never have been executed, and so code on those branches does not have any type feedback. If the parser reaches such a branch -- and it will know at parse-time, because that's when it has the type-feedback information -- then it inserts an unconditional soft deoptimization, to force the collection of more type information, should that branch be reached in optimized code.

However, instead of causing a branch to a terminal deoptimization block, the parse continues, presumably because Crankshaft needs to proceed with its abstract interpretation of the AST. For that reason, the soft deoptimization appears in the instruction stream, not as the last instruction in the block.

To avoid the compiler wasting its time trying to optimize loops and move code in parts of the graph that have never been reached, this dead-subgraph pass puts a mark on all blocks dominated by a block containing an unconditional soft deoptimization. As you can see, this analysis is facilitated by the SSA dominator relation.

redundant phi elimination

The naïve phi placement algorithm detailed above can put phis where they are not needed. This pass eliminates phi values whose inputs are all the same value, replacing their uses with uses of that value.

Like a number of phi-centered optimizations, elimination of one phi can expose other phis for elimination. For this reason, this and many other algorithms use a worklist.

In this case, all phis are placed on a worklist. Then, while there are any phis on the worklist, a phi is taken off the list, and if it can be eliminated, it is replaced. Now here's the trick: if the eliminated phi was in turn used by any other phi, those phis are pushed onto the list, if they weren't there already. Iteration proceeds until the worklist is empty.

To keep the big-O order of worklist operations down, you need a flag for each element indicating whether the element is already in the list. The bit should either be in the objects being traversed, or in a separate bitvector. In this particular case -- and in this case only, as far as I can tell -- there is no such bit, so the algorithm can take too much time, in certain cases.
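
A generic worklist loop with such a membership flag might be sketched like this, assuming items are identified by small integer indices and that process returns the indices of items to reconsider; it is illustrative only, not V8's code:

(define (run-worklist initial-items item-count process)
  (let ((scheduled? (make-bitvector item-count #f))
        (worklist initial-items))
    (for-each (lambda (i) (bitvector-set! scheduled? i #t))
              initial-items)
    (let loop ()
      (if (null? worklist)
          'done
          (let ((item (car worklist)))
            (set! worklist (cdr worklist))
            (bitvector-set! scheduled? item #f)
            ;; Re-enqueue affected items, but only if they are not
            ;; already scheduled; this flag is what keeps the total
            ;; number of worklist operations down.
            (for-each (lambda (i)
                        (if (not (bitvector-ref scheduled? i))
                            (begin
                              (bitvector-set! scheduled? i #t)
                              (set! worklist (cons i worklist)))))
                      (process item))
            (loop))))))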

dead phi elimination, and phi collection

Some phi values are dead, having no real uses, neither direct nor indirect (through some other phi). This pass eliminates any phi that doesn't have a real use. It's somewhat like a garbage collection algorithm, and also uses a worklist.

Once this pass has run, we have minimized the set of phis, so we can collect them all into an array and assign them indices. These indices can later be used to index into bitvectors, for worklist-driven algorithms.

As the phis are collected, if any undefined value is found to reach a phi, or if the arguments object can reach a phi via only some of its inputs, then optimization is aborted entirely.

representation inference

Every HValue has a type field, representing a coarse approximation of the types seen at a particular AST node, and a representation field, representing a specific choice of how to represent a value. The goal, as I have said, is to represent temporaries as untagged int32 or double values. Type and representation are related, but they are treated separately in Crankshaft. This pass is about representation inference.

Firstly, Crankshaft looks at all of the phis, and sees which ones are connected. This is an O(n²) fixed-point iteration. Then, for each phi, if any operand of the phi or any connected phi may not be converted to an int32, Crankshaft pessimistically marks the phi as not convertible to an int32. This pass tries to prevent too many representation changes for phi values.

Finally, for all values with flexible representation--phis, Math.abs, and binary bitwise and arithmetic operations--Crankshaft determines the representation to use for that value. If the representation can be inferred directly from the inputs, then that representation is used. Otherwise, some heuristics are run on the representations needed by the use sites of the value. If all uses are untagged, then the representation will be untagged, as an int32 if possible, unless any use treats the value as a double.

The end result is that, hopefully, many values have been unboxed to int32 or double values.

range analysis

This pass looks at each value, and tries to determine its range, if Crankshaft was actually able to allocate a specialized representation for it. In practice this is mainly used for values represented as integers. It allows HInstructions to assert various properties about the instruction, such as lack of overflow, or lack of negative zero values. These properties can influence code generation.

The algorithm propagates constraints forward on all control paths, so that control instructions can produce different ranges for the same value on different control paths. Ranges reaching a phi node are unioned together.

type inference and canonicalization

I mentioned that representation was related to type. Here I should be more specific. In Crankshaft, types are logically associated with objects. One can treat any object as an array, for the purposes of a calculation, but that treatment does not coerce the object itself to another type. Representation is about how to use an object; type is about the object itself.

Type inference can help eliminate runtime checks. If we can infer that a certain value is not only a tagged value, but is not a smi (small integer--a fixnum), then we can access its map directly without having to do the smi check. If we are storing an element in an array, and we know it is a smi, we don't need to emit a write barrier. So, type inference is another forward control-flow propagation algorithm, with a fixed-point iteration for phis in loop headers.

After type inference is run, each instruction in the whole graph is asked to canonicalize itself, which basically eliminates useless instructions: HToInt32 of a value with int32 representation, HCheckNonSmi of a value with non-smi type, etc.

deoptimization marks

When can you coerce undefined to an unboxed double? When you need a double for math, but you aren't going to eventually compare it to another double. The problem is that undefined coerces to NaN, but undefined == undefined and NaN != NaN. So, Crankshaft marks any value used by a numeric compare as deoptimize-on-undefined. If the value is a phi, all connected phis must check for undefined, which can be a pretty big lose. But hey, that's JavaScript!

representation changes

Up to this point we've been dealing with values on a fairly high level, assuming that all of the definitions will be compatible with the uses. But as we chose to represent some temporaries as untagged values, there will need to be explicit conversions between representations -- hopefully not too many, but they will be needed. This pass inserts these representation changes.

This pass uses the truncating-to-int32 and deoptimize-on-undefined flags calculated in previous phases, to tell the changes what conditions they need to watch out for. Representation changes for phis happen just before branching to the block with the phi -- effectively, just before calling the block.

minus zero checks

The previous pass probably inserted a number of HChange instructions. For any HChange from int32 (which must be to tagged or double), Crankshaft traces dataflow backwards to mark the HChange(s) that produced the int32 as deoptimizing on negative zero. This allows parts of the graph to use fast int32 operations, while preserving the semantics of -0 should it appear, albeit more slowly (via a deoptimization).

stack check elimination

Loops need to be interruptible, and V8 makes them so by placing a stack check at the beginning of each loop iteration. If the runtime wants to interrupt a loop, it resets the stack limit of the process, and waits for the process's next stack check.

However if a call dominates all backward branches of a loop, then the loop can be interrupted by the stack check in the callee's prologue, so the stack check in the loop itself is unneeded and can be removed.

global value numbering: licm

I've been trying to get here for all this long page. Thanks to readers that have kept with it!

At this point, Hydrogen has served two of its three purposes: inlining and unboxing. The third is loop-invariant code motion (LICM) and common subexpression elimination (CSE). These important optimizations are performed by the perhaps misnamed global value numbering (GVN) pass.

LICM is about code motion: moving code out of hot loops. So what prevents an HInstruction from being moved? Two things: data dependencies, and side effects.

The LICM algorithm takes care of both of these. For every instruction in a loop, if it only depends on values defined outside the loop, and the side effects of the loop as a whole do not prevent the move, then LICM hoists the instruction to the loop header. If you do this in reverse post-order, you can hoist whole chains of expressions out of loops.

Every Hydrogen instruction has a bitfield indicating the side effects that it causes, and a corresponding bitfield for the side effects that can affect it. In addition there is a flag indicating whether code motion is allowed for that instruction or not. These flags constitute a simple effects analysis -- if instruction B directly follows A, and A does not cause any side effects that B depends on, and B does not use the value from A, then B may be moved up before A.

Since the effects fields are simple bitfields, one can easily compute the side effects of a whole block by iterating over the instructions, doing a bitwise OR at each step. This is what LICM does -- it computes the side effects for the entire loop, and moves any instructions that it can to the loop header.
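
As a sketch of those two checks -- not V8's code, and the field names are invented -- using plain integers as the effect bitfields:

(use-modules (srfi srfi-1))

;; An "instruction" here is just an alist with entries for the
;; effects it causes, the effects it depends on, its operands, and
;; whether code motion is allowed for it at all.
(define (loop-side-effects instructions)
  ;; The side effects of a whole loop: the bitwise OR of the effects
  ;; caused by each of its instructions.
  (fold logior 0 (map (lambda (insn) (assq-ref insn 'changes))
                      instructions)))

(define (hoistable? insn loop-effects defined-outside?)
  (and (assq-ref insn 'allow-code-motion?)
       ;; Nothing the loop does can change what this instruction
       ;; reads...
       (zero? (logand (assq-ref insn 'depends-on) loop-effects))
       ;; ...and all of its operands are defined outside the loop.
       (every defined-outside? (assq-ref insn 'operands))))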

global value numbering: cse

Global value numbering, as implemented in V8, uses this technique in the opposite direction: instead of trying to hoist expressions up, it propagates them down the control-flow graph, trying to kill common subexpressions. A common subexpression is a Hydrogen value B that is "equal" in some sense to a previous instruction A, and whose value will not be affected by any side effects that occur on the control-flow paths between A and B. In this context, "value numbering" essentially means "define appropriate hash and equality functions and stick all the instructions you've seen in a hash table."

As GVN goes through the instructions of each block, in reverse-post-order, it puts them into this hash table (HValueMap). When GVN reaches an instruction that causes side effects, it flushes out the entries that depend on those side effects from the table. In that way, if ever it sees an instruction B that is equal to an instruction A in the table, GVN can remove B from its block, and replace every use of B with a use of A.
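As a sketch of the idea (V8's HValueMap is a real hash table and considerably more careful), one could imagine something like this, where "equal" means same opcode and same operand value numbers, and entries are flushed whenever an instruction causes a side effect they depend on:

#include <stdint.h>

typedef uint32_t Effects;   /* the same kind of bitfield as in the LICM sketch */

typedef struct {
  int     opcode;
  int     left, right;      /* value numbers of the operands */
  int     value_number;     /* the earlier instruction this entry stands for */
  Effects depends_on;
} Entry;

#define MAX_ENTRIES 256
static Entry table[MAX_ENTRIES];
static int n_entries;

/* Returns the value number to use for this instruction: either an earlier
   equivalent instruction's number, or its own if there is no match. */
int value_number (int opcode, int left, int right,
                  Effects depends_on, int own_number)
{
  for (int i = 0; i < n_entries; i++)
    if (table[i].opcode == opcode
        && table[i].left == left
        && table[i].right == right)
      return table[i].value_number;          /* common subexpression found */

  if (n_entries < MAX_ENTRIES)
    table[n_entries++] =
      (Entry) { opcode, left, right, own_number, depends_on };
  return own_number;
}

/* Called when an instruction that causes side effects is encountered. */
void flush_entries (Effects caused)
{
  int kept = 0;
  for (int i = 0; i < n_entries; i++)
    if ((table[i].depends_on & caused) == 0)
      table[kept++] = table[i];
  n_entries = kept;
}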

In standard SSA terminology you can consider instructions to be "definitions", because each instruction defines a value. Global value numbering works across basic blocks, hence the "global" in its name. Specifically, it eliminates equivalent definitions. One of the invariants of SSA is that "definitions dominate uses". So the part of the graph that should be searched for equivalent definitions is the part that is dominated by the original definition.

checked value replacement

In optimized code, accessing an array out of its bounds will cause a deoptimization. So the AST-to-Hydrogen translation will not emit a use of an index directly -- it inserts a bounds check, and has the array access use the result of the bounds check. But the purpose of this instruction is just to cause the side effect of deoptimization, and the resulting value is just a copy of the incoming index, which increases register allocation pressure for no reason. So now that code motion is complete, Crankshaft replaces uses of HBoundsCheck with the instruction that defines the index value, relying on the HBoundsCheck's position in the instruction stream to cause deopt as needed.

conclusions

In the end, the practical goals of Crankshaft are simple, and the optimizations are too, for the most part. Java people like to trot out all their gerundic compiler passes (see slide 7) and declare victory, and it is true that V8 and other modern JavaScript implementations will be beaten by the HotSpot JVM -- once it is allowed to warm up, of course. (V8's 5ms startup is particularly impressive, here.) But Crankshaft definitely has the basics, and throws in some range analysis to squeeze out a few more percent on top.

Finally, if you've gotten here, thanks again for reading this far. Comments and corrections are most welcome.

I have enjoyed being able to do this serious code reading, and some small bit of V8 hacking, as part of my work at Igalia. I think I'll have one more article about the platform-specific Lithium language, and its register allocator, and then see where I should move on from there. Happy hacking to all, and to all a good hack!

Syndicated 2011-08-02 21:21:01 from wingolog

static single assignment for functional programmers

Friends, I have an admission to make: I am a functional programmer.

By that I mean that lambda is my tribe. And you know how tribalism works: when two tribes meet, it's usually to argue and not to communicate.

So it is that I've been well-indoctrinated in the lore of the lambda calculus, continuation-passing style intermediate languages, closure conversion and lambda lifting. But when it comes to ideas from outside our tribe, we in the lambda tribe tune out, generally.

At the last Scheme workshop in Montreal, some poor fellow had the temerity to mention SSA on stage. (SSA is what the "machine tribe" uses as an intermediate language in their compilers.) I don't think the "A" was out of his mouth before Olin Shivers' booming drawl started, "d'you mean CPS?" (CPS is what "we" use.) There were titters from the audience, myself included.

But there are valuable lessons to be learned from SSA language and the optimizations that it enables, come though it may from another tribe. In this article I'd like to look at what the essence of SSA is. To do so, I'll start with explaining the functional programming story on intermediate languages, as many of my readers are not of my tribe. Then we'll use that as a fixed point against which SSA may be compared.

the lambda tribe in two sentences

In the beginning was the lambda. God saw it, realized he didn't need anything else, and stopped there.


functional programmers got god's back

Hey, it's true, right? The lambda-calculus is great because of its expressivity and precision. In that sense this evaluation is a utilitarian one: the lambda-calculus allows us to reason about computation with precision, so it is worth keeping around.

I don't think that Church was thinking about digital computers when he came up with the lambda-calculus back in the 1930s, given that digital computers didn't exist yet. Nor was McCarthy thinking about computers when he came up with Lisp in the late 1950s. But one of McCarthy's students did hack it up, and that's still where we are now: translating between the language of the lambda-calculus and machine language.

This translation process is compilation, of course. For the first 20 years or so of practicing computer science, compilers (and indeed, languages) were very ad-hoc. In the beginning they didn't exist, and you just wrote machine code directly, using switches on a control panel or other such things, and later, assembly language. But eventually folks figured out parsing, and you get the first compilers for high-level languages.

I've written before about C not being a high-level assembly language, but back then, FORTRAN was indeed such a language. There wasn't much between the parser and the code generator. Everyone knows how good compilers work these days: you parse, you optimize, then you generate code. The medium in which you do your work is your intermediate language. A good intermediate language should be simple, so your optimizer can be simple; expressive, so that you can easily produce it from your source program; and utilitarian, in that its structure enables the kinds of optimizations that you want to make.

The lambda tribe already had a good intermediate language in this regard, in the form of the lambda-calculus itself. In many ways, solving a logic problem in the lambda-calculus is a lot like optimizing a program. Copy propagation is beta-reduction. Inlining is copy propagation extended to lambda expressions. Eta-conversion of continuations eliminates "forwarding blocks" -- basic blocks which have no statements, and just jump to some other continuation. Eta-conversion of functions eliminates functional trampolines.

continuation-passing style

But I'm getting ahead of myself. In the lambda tribe, we don't actually program in the lambda-calculus, you see. If you read any of our papers there's always a section in the beginning that defines the language we're working in, and then defines its semantics as a translation to the lambda-calculus.

This translation is always possible, for any programming language, and indeed Peter Landin did so in 1965 for Algol. Landin's original translations used his "J operator" to capture continuations, allowing a more direct translation of code into the lambda-calculus.

I wrote more on Landin, letrec, and the Y combinator a couple of years ago, but I wanted to mention one recent paper that takes a modern look at J, A Rational Deconstruction of Landin's J Operator. This paper is co-authored by V8 hacker Kevin Millikin, and cites work by V8 hackers Mads Sig Ager and Lasse R. Nielsen. Indeed all three seem to have had the privilege of having Olivier Danvy as PhD advisor. That's my tribe!

Anyway, J was useful in the context of Landin's abstract SECD machine, used to investigate the semantics of programs and programming languages. However it does not help the implementor of a compiler to a normal machine, and intermediate languages are all about utility. The answer to this problem, for functional programmers, was to convert the source program to what is known as continuation-passing style (CPS).

With CPS, the program is turned inside out. So instead of (+ 1 (f (+ 2 3))), you would have:

lambda return
  let c1 = 1
    letcont k1 t1 = _+ return c1 t1
      letcont k2 t2 = f k1 t2
        let c2 = 2, c3 = 3
          _+ k2 c2 c3

Usually the outer lambda is left off, as it is implicit. Every call in a CPS program is a tail call, for the purposes of the lambda calculus. Continuations are explicitly represented as lambda expressions. Every function call or primitive operation takes the continuation as an argument. Papers in this field usually use Church's original lambda-calculus notation instead of the ML-like notation I give here. Continuations introduced by a CPS transformation are usually marked as such, so that they can be efficiently compiled later, without any flow analysis.

Expressing a program in CPS has a number of practical advantages:

  • CPS is capable of expressing higher-order control-flow, for languages in which functions may be passed as values.

  • All temporary values are named. Unreferenced names represent dead code, or code compiled for effect only. Referenced names are the natural input to a register allocator.

  • Continuations correspond to named basic blocks. Their names in the source code enable a natural flow analysis: simply trace the definitions and uses of the names. Flow analysis enables more optimizations, like code motion.

  • Full beta-reduction is sound on terms of this type, even in call-by-value languages.

Depending on how you implement your CPS language, you can also attach notes to different continuations to help your graph reduce further: this continuation is an effect context (because its formal parameter is unreferenced in its body, or because you knew that when you made it), so its caller can be processed for effect and not for value; this one is of variable arity (e.g. can receive one or two values), so we could jump directly to the right handler, depending on what we want; etc. Guile's compiler is not in CPS right now, but I am thinking of rewriting it for this reason, to allow more transparent handling of control flow.

Note that nowhere in there did I mention Scheme's call-with-current-continuation! For me, the utility of CPS is in its explicit naming of temporaries and continuations, and in its affordances for optimization. Call/cc is a rare construct in Guile Scheme that could be optimized better with CPS, but one that I don't care a whole lot about, because it's often impossible to prove that the continuation doesn't escape, and in that case you're on the slow path anyway.

So that's CPS. Continuations compile to jumps within a function, and functions get compiled to closures, or labels for toplevel functions. The best reference I can give on it is Andrew Kennedy's 2007 paper, Compiling With Continuations, Continued. CWCC is a really fantastic paper and I highly recommend it.

a digression: anf

CPS fell out of favor in the nineties, in favor of what became known as Administrative Normal Form, or ANF. ANF is like CPS except instead of naming the continuations, the code is left in so-called "direct-style", in which the continuations are implicit. So my previous example would look like this:

let c2 = 2, c3 = 3
  let t2 = + c2 c3
    let t1 = f t2
      let c1 = 1
        + c1 t1

There are ANF correspondences for CPS reductions, like the beta-rule. See the Essence of Compiling With Continuations paper, which introduced ANF and sparked the decline of the original CPS formulation, for more.

This CPS-vs-ANF discussion still goes on, even now in 2011. In particular, Kennedy's CWCC paper is quite compelling. But the debate has been largely mooted by the advances made by the machine tribe, as enabled by their SSA intermediate language.

the machine tribe in two sentences

In the beginning was the
Segmentation fault (core dumped)

(Just kidding, guys & ladies!)

Instead of compiling abstract ideas of naming and control to existing hardware, as the lambda tribe did, the machine tribe takes the available hardware as a given, and tries to expose the capabilities of the machine to the programmer.

The machine tribe doesn't roll with closures, continuations, or tail calls. But they do have loops, and they crunch a lot of data. The most important thing for a compiler of a machine-tribe language like C is to produce efficient machine code for loops.

Clearly, I'm making some simplifications here. But if you look at a machine-tribe language like Java, you will be dealing with many control-flow constructs that are built-in to the language (for, while, etc.) instead of layered on top of recursion like loops in Scheme. What this means is that large, important parts of your program have already collapsed to a first-order control-flow graph problem. Layering other optimizations on top of this like inlining (the mother of all optimizations) only expands this first-order flow graph. More on "first-order" later.

So! After decades of struggling with this problem, after having abstracted away from assembly language to three-address register transfer language, finally the machine folks came up with something truly great: static single-assignment (SSA) form. The arc here is away from the machine, and towards more abstraction, in order to be able to optimize better, and thus generate better code.

It's precisely for this utilitarian reason that SSA was developed. Consider one of the earliest SSA papers, Global Value Numbers and Redundant Comparisons by Rosen, Wegman, and Zadeck. Rosen et al were concerned about being able to move invariant expressions out of loops, extending the "value numbering" technique to operate across basic blocks. But the assignment-oriented intermediate languages that they had been using were getting in the way of code motion.

To fix this issue, Rosen et al switched from the assignment-oriented model of the machine tribe to the binding-oriented model of the lambda tribe.

In SSA, variables are never mutated (assigned); they are bound once and then left alone. Assignment to a source-program variable produces a new binding in the SSA intermediate language.

For the following function:

function clamp (x, lower, upper) {
  if (x < lower)
    x = lower;
  else if (x > upper)
    x = upper;
  return x;
}  

The SSA translation would be:

entry:
  x0, lower0, upper0 = args;
  goto b0;

b0:
  t0 = x0 < lower0;
  goto t0 ? b1 : b2;

b1:
  x1 = lower0;
  goto exit;

b2:
  t1 = x0 > upper0;
  goto t1 ? b3 : exit;
  
b3:
  x2 = upper0;
  goto exit;
  
exit:
  x4 = phi(x0, x1, x2);
  return x4;

SSA form breaks down a procedure into basic blocks, each of which ends with a branch to another block, either conditional or unconditional. Usually temporary values receive their own names as well, as it facilitates optimization.

phony functions

The funny thing about SSA is the last bit, the "phi" function. Phi functions are placed at control-flow joins. In our case, the value of x may proceed from the argument or from the assignment in the first or second if statement. The phi function indicates that.

But you know, lambda tribe, I didn't really get what this meant. What is a phi function? It doesn't help to consider where the name comes from, that the original IBM compiler hackers put in a "phony" function to merge the various values, but considered that "phi" was a better name if they wanted to be taken seriously by journal editors.

Maybe phi functions are intuitive to the machine tribe; I don't know. I doubt it. But fortunately there is another interpretation: that each basic block is a function, and that a phi function indicates that the basic block has an argument.

Like this:

entry x lower upper =
  letrec b0 = let t0 = x < lower
                if t0 then b1() else b2()
         b1 = let x = lower
                exit(x);
         b2 = let t1 = x > upper
                if t1 then b3() else exit(x)
         b3 = let x = upper
                exit(x);
         exit x = x
    b0()

Here I have represented basic blocks as named functions instead of labels. Instead of phi functions, we allow the blocks to take a number of arguments; the call sites determine the values that the phi function may take on.

Note that all calls to blocks are tail calls. Reminds you of CPS, doesn't it? For more, see Richard Kelsey's classic paper, A Correspondence Between Continuation-Passing Style and Static Single Assignment Form, or my earlier article about Landin, Steele, letrec, and labels.

But for a shorter, readable, entertaining piece, see Appel's SSA is Functional Programming. I agree with Appel that we in the lambda-tribe get too hung up on our formalisms, when sometimes the right thing to do is draw a flow-graph.

so what's the big deal?

If it were only this, what I've said up to now, then SSA would not be as interesting as CPS, or even ANF. But SSA is not just about binding, it is also about control flow. In order to place your phi functions correctly, you need to build what is called a dominator tree. One basic block is said to dominate another if all control paths must pass through the first before reaching the second.

For example, the entry block always dominates the entirety of a function. In our example above, b0 also dominates every other block. However, though b1 does branch to exit, it does not dominate it, as exit may be reached on other paths.

It turns out that you need to place phi functions wherever a definition of a variable meets a use of the variable that is not strictly dominated by the definition. In our case, that means we place a phi node on exit. The dominator tree is a precise, efficient control-flow analysis that allows us to answer questions like this one (where do I place a phi node?).

For more on SSA and dominators, see the very readable 1991 paper by Cytron, Ferrante, Rosen, Wegman, and Zadeck, Efficiently Computing Static Single Assignment Form and the Control Dependence Graph.

Typical implementations of SSA embed in each basic block pointers to the predecessors and successors of the blocks, as well as the block's dominators and (sometimes) post-dominators. (A predecessor is a block that precedes the given node in the control-flow graph; a successor succeeds it. A post-dominator is like a dominator, but for the reverse control flow; search the tubes for more.) There are well-known algorithms to calculate these links in linear time, and the SSA community has developed a number of optimizations on top of this cheap flow information.
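For a feel of what computing dominators involves, here is a toy C version of the classic iterative data-flow formulation (deliberately not one of the fast algorithms just mentioned): dominator sets shrink by intersecting over predecessors until nothing changes. Each set is a bitmask, so this sketch only handles up to 32 blocks.

#include <stdbool.h>
#include <stdint.h>

#define MAX_BLOCKS 32

/* preds[b] is a bitmask of b's predecessors; block 0 is the entry.
   On return, dom[b] is a bitmask of the blocks that dominate b. */
void compute_dominators (uint32_t dom[MAX_BLOCKS],
                         const uint32_t preds[MAX_BLOCKS],
                         int n_blocks)
{
  dom[0] = 1u << 0;                          /* the entry is dominated only by itself */
  for (int b = 1; b < n_blocks; b++)
    dom[b] = ~0u;                            /* start from "everything dominates b" */

  bool changed = true;
  while (changed)
    {
      changed = false;
      for (int b = 1; b < n_blocks; b++)
        {
          uint32_t d = ~0u;
          for (int p = 0; p < n_blocks; p++)
            if (preds[b] & (1u << p))
              d &= dom[p];                   /* intersect over all predecessors */
          d |= 1u << b;                      /* a block always dominates itself */
          if (d != dom[b])
            {
              dom[b] = d;
              changed = true;
            }
        }
    }
}

bool dominates (const uint32_t dom[], int a, int b)
{
  return (dom[b] >> a) & 1;                  /* "a dominates b" is a bit test */
}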

In contrast, the focus in the lambda tribe has been more on interprocedural control flow, which -- as far as I can tell -- no one does in less than O(N²) time, which is, as my grandmother would say, "just turrible".

I started off with a mention of global value numbering (GVN) on purpose. This is still, 20+ years later, the big algorithm for code motion in JIT compilers. HotSpot C1 and V8 both use it, and it just landed in IonMonkey. GVN is well-known, well-studied, and it works. It results in loop-invariant code motion: if an invariant definition reaches a loop header, it can be hoisted out of the loop. In contrast I don't know of anything from the lambda tribe that really stacks up. There probably is something, but it's certainly not as well-studied.

why not ssa?

Peoples of the machine tribe, could you imagine returning a block as a value? Didn't think so. It doesn't make sense to return a label. But that's exactly what the lambda-calculus is about. One may represent blocks as functions, and use them as such, but one may also pass them as arguments and return them as values. Such blocks are of a higher order than the normal kind of block that is a jump target. Indeed it's the only way to express recursion in the basic lambda calculus.

That's what I mean when I say that CPS is good as a higher-order intermediate language, and when I say that SSA is a good first-order intermediate language.

If you have a fundamentally higher-order language, one in which you need to loop by recursion, then you have two options: either do whole-program analysis to aggressively closure-convert your program to be first-order, so that you can use SSA; or use a higher-order IL, something more like CPS.

MLton is an example of a compiler that does the former. Actually, MLton's SSA implementation is simply lovely. They do represent blocks as functions with arguments instead of labels and phi functions.

But if you can't do whole-program analysis -- maybe because you want to extend your program at runtime, support separate compilation, or whatever -- then you can't use SSA as a global IL. That's not to say that you shouldn't identify first-order segments of your program and apply SSA-like analysis and optimization on them, of course! That's really where the lambda tribe should go.

finally

I wrote this because I was in the middle of V8's Crankshaft compiler and realized I didn't understand some of the idioms, so I went off to read a bunch of papers. At the same time, I wanted to settle the CPS-versus-ANF question for my Guile work. (Guile currently has a direct-style compiler, for which there are precious few optimizations; this is mostly because the IL is difficult to work with.)

This post summarizes my findings, but I'm sure I made a mistake somewhere. Please note any corrections in the comments.

Syndicated 2011-07-12 16:39:50 from wingolog

v8: a tale of two compilers

Regular readers will have noticed my fascination with the V8 JavaScript implementation. It is indeed an impressive piece of engineering.

When V8 was originally announced, Lars Bak wrote:

I hope the web community will adopt the code and the ideas we have developed to advance the performance of JavaScript. Raising the performance bar of JavaScript is important for continued innovation of web applications.

How right he was! V8 has succeeded admirably, not only in adoption, but also in "raising the performance bar" among all JavaScript implementations.

But as William Gibson says, "The future is already here — it's just not very evenly distributed." Many parts of V8 are not documented at all, and perhaps understandably, given the pace at which things are changing. So as I'm getting up to speed with V8 for my work with Igalia, I've been trying to document the interesting things that I find, so that all JavaScript implementations can learn and improve.

And indeed, with my selfish Guile hat on, this study in V8 is giving me lots of ideas and motivation. So perhaps V8's new motto should be, "shaming the world into faster code, one compiler at a time".

compiler the first: full-codegen

V8 compiles all JavaScript to native code. V8 has two compilers: one that runs fast and produces generic code, and one that doesn't run as fast but does try to produce optimized code.

The quick-and-simple compiler is known internally as the "full-codegen" compiler. It takes as its input the abstract syntax tree (AST) of a function, walks over the nodes in the AST, and emits calls to a macroassembler directly. Here's a picture:

The boxes represent the flow of data in the compilation process. There's only two boxes because, as we said, it's a simple compiler. All local variables are stored either on the stack or on the heap, not in registers. Any variable referenced by a nested function is stored on the heap, in the context object associated with the function in which the variable was defined.

The compiler will emit loads and stores to pull these values into registers to actually do the work. The top of the stack of temporaries is cached in a register. Complicated cases are handled by emitting calls to runtime procedures. The compiler does track the context in which an expression is being evaluated, so that tests can just jump directly to the consequent blocks, instead of pushing a value, testing if it's zero or not, then branching. Small integer arithmetic is usually inlined. And that's all there is!

Actually I should mention one important optimization that is done even with the full-codegen compiler, and that is inline caching. See the Hölzle, Chambers, and Ungar paper I link to there for all of the details. Inline caches are used for assignments, unary and binary operations, function calls, property accesses, and comparisons.

The inline caches also serve as sources of type information, used by the optimizing compiler. In the case of some statement types, like assignments, the only purpose of the IC is to record type information.

Here are some links to the source:

ast.h
The abstract syntax tree. Note that though the comment says that ASTs are allocated in separate zones, for constant-time deallocation, in practice the AST is kept around as long as the JavaScript functions are alive, for use as input to the optimizing compiler. Zone allocation is more of a win for the optimizing compiler, which produces more ephemeral garbage.
full-codegen.h
full-codegen.cc
full-codegen-ia32.cc
The full-codegen compiler. Most of the meat of the full-codegen compiler is in the target-specific directory (4257 lines vs 769+1323 lines). Currently supported targets are ia32, x64, arm, and mips.

type feedback

The first time V8 sees a function, it parses it to the AST but doesn't actually do anything with it. It only runs the full-codegen compiler when the function is first run. How's that for lazy! But after things have started up, it kicks off a profiler thread to see how things are going, and what functions are hot.

This lazy, sit-back-and-watch approach gives V8 time to record the type information flowing through it. So by the time it has decided that a function is hot, and could use a little more help, it has type information to give to the compiler.

Runtime type feedback information is recorded by and stored in the inline caches (ICs). Types are expressed internally as an 8-bit value, constructed in such a way that a hierarchy of types can be detected with a simple bitmask. At this point the best I can do is show the artwork in the source:

//         Unknown
//           |   \____________
//           |                |
//      Primitive       Non-primitive
//           |   \_______     |
//           |           |    |
//        Number       String |
//         /   \         |    |
//    Double  Integer32  |   /
//        |      |      /   /
//        |     Smi    /   /
//        |      |    / __/
//        Uninitialized.

Every time an IC stub sees a new kind of value, it computes the type of that value, and bitwise-ands it with the old type. The initial type value is uninitialized. So if an IC only sees integers in the Smi (small integer) range, the recorded type will indicate that. But as soon as it sees a double value, the type becomes Number; and if it then sees an object, the type becomes Unknown. A non-primitive IC will necessarily store the map of the receiver type in the IC, for dispatch purposes. Type feedback can parse the IC stub to get at this map, when needed.
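Here is a hedged C sketch of the trick -- the bit values are made up for illustration and do not match V8's actual TypeInfo encoding. Each type's mask contains all the bits of the types above it in the lattice, so ANDing two observations yields their closest common ancestor.

#include <stdint.h>

enum {
  kUnknown       = 0x00,
  kNonPrimitive  = 0x40,
  kPrimitive     = 0x80,
  kNumber        = kPrimitive | 0x10,
  kString        = kPrimitive | 0x20,
  kDouble        = kNumber | 0x01,
  kInteger32     = kNumber | 0x02,
  kSmi           = kInteger32 | 0x04,
  kUninitialized = 0xFF       /* below everything: the first AND is a no-op */
};

static uint8_t record_type (uint8_t recorded, uint8_t observed)
{
  return recorded & observed; /* climbs monotonically towards kUnknown */
}

/* record_type (kUninitialized, kSmi)   == kSmi
   record_type (kSmi, kDouble)          == kNumber
   record_type (kNumber, kNonPrimitive) == kUnknown */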

Type feedback information is associated with a particular AST node (the assignment, property load, etc.). An integer identifier for the node is serialized into the IC so that when V8 decides that a function is hot, it can parse out the recorded type information from the full-codegen code, and associate it with the AST node.

Now, this process is a bit complicated. It needs support up and down in your compiler stack. You need to have inline caches. Your inline caches need to have support for the type information, both for operands and for the result. You need to be able to walk this data to find the values. Then you need to link it back to the AST, so that when you pass the AST to your optimizing compiler, the compiler is able to ask the right questions.

The concrete strategy that V8 takes is to parse the data into a TypeFeedbackOracle object, associating information with particular AST nodes. Then V8 visits all of the AST nodes with this oracle, and the nodes themselves parse out the data that they might find useful from the oracle.

In the end one is able to, for example, ask a Property node if it is monomorphic, and in any case what are the receiver types at that node. It seems that this works well for V8, because it reduces the number of moving parts in the optimizing compiler, given that it doesn't need to have the TypeFeedbackOracle itself.

So, source links.

type-info.h
The TypeInfo 8-bit data type, and the TypeFeedbackOracle declaration. I have to admit, I really like the use of C++ within V8. It's a nasty tool, but they wield it well.
type-info.cc
The implementation of TypeFeedbackOracle. See ProcessTarget down at the bottom of the file for where the bulk of the work happens.

Also check the ast.h link from before to see how type feedback ties into the AST itself.

crankshaft = type feedback + hydrogen + lithium

Once V8 has identified that a function is hot and has collected some type feedback information, it tries to run the augmented AST through an optimizing compiler. This optimizing compiler is called Crankshaft in the marketing materials, though that name rarely appears in the source itself.

Instead, in the source, Crankshaft consists of the Hydrogen high-level intermediate representation (IR), the Lithium low-level IR, and their associated compilers. Like this:

(You probably caught on from my heavy-handed repetition above, but I believe that the names Hydrogen and Lithium come from High- and Low-level, respectively.)

It depends on your background, but you might have seen a diagram like that before:

Indeed I believe that Crankshaft was highly influenced by the changes that Sun introduced into the Hotspot Client compiler in Java 6. Let me quote from an illuminating 2008 paper by Kotzmann et al, Design of the HotSpot Client Compiler for Java 6:

First, a high-level intermediate representation (HIR) of the compiled method is built via an abstract interpretation of the bytecodes. It consists of a control-flow graph (CFG), whose basic blocks are singly linked lists of instructions. The HIR is in static single-assignment (SSA) form, which means that for every variable there is just a single point in the program where a value is assigned to it. An instruction that loads or computes a value represents both the operation and its result, so that operands can be represented as pointers to previous instructions. Both during and after generation of the HIR, several optimizations are performed, such as constant folding, value numbering, method inlining, and null check elimination. They benefit from the simple structure of the HIR and the SSA form.

The back end of the compiler translates the optimized HIR into a low-level intermediate representation (LIR). The LIR is conceptually similar to machine code, but still mostly platform-independent. In contrast to HIR instructions, LIR operations operate on virtual registers instead of references to previous instructions. The LIR facilitates various low-level optimizations and is the input for the linear scan register allocator, which maps virtual registers to physical ones.

This statement describes Crankshaft very neatly, and the rest of section 2 of that paper applies in a general sense. Of course there are some differences. Crankshaft starts with the AST, not bytecodes. The HotSpot client runtime does not use type-feedback to help its compiler, as it is less necessary for Java, though it would still be helpful. Crankshaft doesn't do very much with exception handlers.

But the similarities are such that V8 can actually produce traces that can be read by c1visualizer (docs), a program used to visualize the internals of the HotSpot client compiler. (The client compiler appears to be known internally as c1; the server compiler appears to be opto).

in our next installment

This post is getting quite long. I'm going to have to send my mom some flowers or something, because I'm sure she does check these things out, and it must be brain-numbing to slog through. Another paragraph? When will he stop?

So if you, dear reader, find yourself in this category, you're in good company, because my mom is a lovely lady.

But if you're a compiler geek, I think I have at least a couple more articles to write on Crankshaft before I can get back to other hacks. Specifically, we'll go into the optimizations that Crankshaft does on the Hydrogen IR, and take a look at how it does the translation to Lithium. But until then, happy hacking!

Syndicated 2011-07-05 14:31:49 from wingolog

guile 2.0.2: building the guildhall

Hey tubes, we made another Guile release!

what is guile

Guile is the GNU extension language. It's an implementation of Scheme. It's pretty fun to hack in, and on!

guile 2.0.2: even faster!

Here I'd like to highlight the thing that I really like about this release: performance. Relative to 2.0.1:

  • The VM is 20% faster

  • Compiled .go files are half the size

  • Startup time is down from 16.2 milliseconds to 11.7 milliseconds (on my machine)

  • Base memory use down from 3.8 MB of private dirty memory to 2.9 MB (measured on a guile -c '(sleep 100)')

In short, thundercats are totally on the loose! See the release announcement for all the details.

where we're going

It's hard to tell sometimes where this jalopy is headed. Sometimes the vehicle just drives itself, like the recent speed improvements, which were totally unplanned.

But that said, there are a few stops we want to make on this journey, in the near- and mid-term. Let's take a look.

2.0: building the guildhall

On the 2.0 branch, the current stable series, the big feature that we'd like to land soon will be a CPAN-like repository of installable Scheme modules.

We agonized over the name for a while. It needs to be short and memorable, so it can be used as a command-line tool. (The Git experience is pretty definitive here.) At the same time it needs to be linked to Guile, and follow some sort of theme.

What we've done now is to take our guile-tools script runner (equivalent to the git wrapper) and rename it to guild. Currently it's mostly used for compiling files, as in guild compile -o foo.go foo.scm, but in the future we want to use it as a part of this system for Guile wizards and journeyfolk to band together to share code---to form a guild, so to speak.

And so the package archive itself will be the "guild hall" (or "guildhall"). We'll see if we can get guildhall.gnu.org, as a place to host the packages.

As far as code, protocol, and formats go, we are looking at porting Andreas Rottmann's Dorodango Scheme package manager to be Guile-specific, with guild commands. So guild update, guild upgrade, guild install xml-rpc, etc.

Dorodango is basically a port of Debian's apt to Scheme. It even has ports of the same dependency resolution algorithms, which is fun. It also allows for multiple sources, so we can have our Guile guildhall archive, but also have support for portable R6RS Scheme archives from Dorodango.

2.2: bit mucking

Guile is getting pretty good now, but there is more to be done before it is as cheap to run a program written in Scheme as it is to run a program written in C.

The first thing that we will do to help this is to switch to ELF as the format of our compiled code. We will do this on all platforms, and even before we do native compilation. The bytecode will simply be stored in an ELF section. This will also allow us to statically allocate more data, decreasing memory usage and increasing locality. Also our debugging information is currently interspersed with code -- not in the execution path, mind you, but it does make the memory profile of the code look more like emmental than comté cheese. Moving debug info out to a separate section will help with that.

ELF will also add flexibility. We can add sections to express dependencies of compiled code on source code. We can strip debug info and other things from compiled code. We can link multiple modules into one file, to avoid stat penalties on startup.

In addition ELF will allow us to add native code, in a separate section. Of course we have to actually produce native code, so that's the next big thing for Guile 2.2. The plan is to write an ahead-of-time code generator for x86 and x86-64 systems, while keeping the bytecode interpreter as well. Then once we know what we want, we can go and talk to the GCC folks and see if our use case can be supported there.

Anyway, that's the plan. I expect 2.2 to take another year or so. By the time we have native compilation, I hope also our Elisp implementation will be significantly faster than the one in Emacs, so it will really become attractive to people to finish the work that we have done to make the switch there.

Syndicated 2011-07-04 09:47:25 from wingolog

security implications of jit compilation

Periodically the tubes alight with statements that just-in-time (JIT) compilers are somehow bad for security. At Igalia we work with many companies that have put or are interested in putting JIT compilers on mobile devices, and obviously security is one of their concerns. In this article I'd like to take a closer look at the risks and myths surrounding JIT and security.

executive summary

JIT compilers do not cause a significant security risk in and of themselves.

However, JIT compilers do allow flaws in unsafe native libraries to be more easily exploited.

The problem is not JIT compilation. The problem is code written in unsafe languages.

a background: interpreters and security

It used to be that JavaScript implementations all used interpreters to execute JavaScript code, usually in bytecode form. An interpreter is a program, usually written in C or C++, which interprets program code as data.

(I will take JavaScript as the example in this article, though similar considerations apply to other language implementations.)

A JavaScript implementation also includes a language runtime, which implements all of the "primitive" objects of the language. In the case of JavaScript this might include Date objects and such. The runtime also includes a garbage collector, a parser, and other such things.

Sometimes interpreters are used to run "external" code, as is the case with JavaScript. In that case the language implementation is a fairly clear attack target: the attacker tries to produce an input which triggers a flaw in the language implementation, allowing takeover of the target system.

Note that in this case, the interpreter itself is only part of the target. Indeed the interpreter is typically only a small part of a language implementation, and is very well-tested. Usually the vulnerability is in some bit of the runtime, and often stems from having the runtime written in an unsafe language like C or C++.

So the interpreter part itself typically imposes no risk, beyond the obvious ones of interpreting data provided by an attacker.

compilers have more moving pieces

Instead of interpreting a program as data, compilers produce native code. A compiler is bigger than an interpreter, and so exposes a larger attack surface. A compiler usually has more corner cases than an interpreter, which often are not tested properly. Compilers are usually specific to particular target architectures, which makes bugs in less-used architectures more likely. Additionally, by generating native code, a compiler operates outside whatever meager safety guarantees are provided by the implementation language (C or C++).

So yes, compilers are a risk, JIT or otherwise. However with proper testing it is possible to have confidence in a compiler. First of all, a compiler is written by compiler writers, not ordinary programmers, and not attackers. Secondly a compiler is generally a functional process, so it's easy to test. Modern JS implementations have thousands and thousands of tests. It's still a cross-your-fingers-and-pray thing, usually, but the fact is that exploits these days come from bugs present in the source of unsafe languages, not bugs in code generated from safe languages.

jit compilation allows performant mobile code

JIT compilation can actually increase the overall security of a system, if one of the design goals of that system is user-installable "apps". For example, Android apps are (typically) written in Java, targeting the Dalvik VM. This choice of safe language and safe runtime largely eliminates the need to make a tight sandbox around apps. If it weren't for the presence of unsafe code in the libraries available to an Android app, things would be even easier.

Note that Dalvik's JIT has other positive security characteristics, in that its generated code can be verified against the bytecodes that it is supposed to implement. I recommend this video for a good discussion; thanks to my colleague Xan for pointing it out.

In contrast, Apple's decision to allow natively-compiled applications to be installed by users leaves them scrambling to try to mitigate the effects of what happens when an app has a buffer overflow. Again, the problem is unsafe code, not JIT compilation.

attempts to contain security vulnerabilities in unsafe code

That said, unsafe languages are part of the sordid reality, so let's look at how attackers compromise such applications.

In the old days, once you got a stack buffer overflow, you just wrote whatever code you wanted onto the stack, and arranged for a function return to execute that code. Smashing the stack for fun and profit, you see.

These days it is more difficult. One of the things that makes it more difficult is making it impossible to write code which is later executed. This technique goes under various names: the NX bit, DEP, W⊕X, etc. Basically it means you can't write shellcode. But, you can return to libc, which uses existing functions in libc (like system).

Some systems also prevent you from running code on disk unless it is signed. Dion Blazakis wrote a nice overview on the sandbox in iOS devices (subtitle: using Scheme for evil). Besides not letting you write code for your own device -- a terrible thing IMO, but not related to security -- this technique does prevent attackers from getting around the NX restriction by writing code to disk, then executing it.

But one need not write code to perform arbitrary computation. I assume everyone's seen the following papers, and if not, I highly recommend them:

Finally, instead of loading libc always at the same place, the latest thing to happen is to randomize the positions at which the various segments are loaded in memory. This is called Address Space Layout Randomization (ASLR). To effectively execute a return to libc under ASLR, you have to figure out where libc is first; not a trivial task.

I think I need to mention again that these are not security measures per se. If you are really interested in security, don't write systems software in unsafe languages. These measures are designed to contain the effects of a vulnerability.

I should also note that all of these measures are actively harmful to people writing code. ASLR increases load time by disallowing prelinking of various sorts, and forcing the generation of position-independent code. Disallowing writable, executable memory means that you can't do JIT compilation, not to mention quajects. Disallowing unsigned binaries means that you can't run your code on your own device.

jit compilation is not compatible with these restrictions

Having a JIT compiler by definition means that you are writing code and then going to run it. This makes it easier for an attacker who has found a bug in a native library to exploit it.

These exploits do happen, though currently not very often. A good example of such a paper is Interpreter Exploitation: Pointer Inference And JIT Spraying by Dion Blazakis.

Blazakis uses characteristics of the language runtime (Flash, in that case) to determine the address ranges in which to write his code. The actual code is written by taking advantage of the predictable output of the particular JIT in question to spray a big NOPsled onto the heap, which, if I understand it correctly, is implemented by starting the IP in the middle of x86 instructions, in lovely contrapuntal style. Very clever stuff.

But again, the vulnerability is present not in code related to the JIT, but in native code. Having executable pages makes the vulnerability easier to exploit, but is not itself the security risk.

specific jit implementations

I called out Dalvik as an example of a JIT implementation that is particularly security-conscious. As much as I like V8, though, I cannot recommend it particularly, from the security perspective. It is simply changing too fast. Practically every commit touches code generators. The individual architectures share some pieces but duplicate a lot. I cannot speak for the other JavaScript implementations, as I do not know them as well.

V8's primary use case in the Chromium browser does have some security safeguards though, benefitting from the process sandbox and rapid release cycle. But I would be hesitant to put it in a device, lacking the same safeguards.

mitigations for jit vulnerabilities

These days all signs are pointing to the end of sandboxing as a security strategy, replacing it instead with verification and software fault isolation (SFI). Google's Native Client group is leading the way here. For more, see Adapting software fault isolation to contemporary CPU architectures by Sehr et al. But it does seem a poor thing to try to turn an unsafe language into a "fault-isolated" runtime; better to start with safe code to begin with.

Adapting SFI to JIT compilers could protect against flaws in JIT compilers, but it has abysmal results. The authors of that paper adapted V8 to produce the code sequences needed to preserve the SFI invariants, and got killed in speed. They somehow managed to claim success though, which is beyond me.

in closing

If you've already decided that you're going to ship remotely exploitable libraries written in unsafe languages, disallowing JIT can indeed reduce your attack surface. I understand it in some cases. But it is a shame to make safe languages pay the sins of their unsafe brethren. If you can, be Android, and not iOS.

Syndicated 2011-06-21 17:00:25 from wingolog

on-stack replacement in v8

A recent post looked at what V8 does with a simple loop:

function g () { return 1; }
function f () { 
  var ret = 0;
  for (var i = 1; i < 10000000; i++) {
    ret += g ();
  }
  return ret;
}

It turns out that calling f actually inlines g into the body, at runtime. But how does V8 do that, and what was the dead code about? V8 hacker Kasper Lund kindly chimed in with an explanation, which led me to write this post.

When V8 compiled an optimized version of f with g inlined, it decided to do so while f was already running, in the middle of the loop. It needed to stop the loop in order to replace the slow version with the fast version. The mechanism that it uses is to reset the stack limit, causing the overflow handler in g to be called. So the computation is stuck in a call to g, and somehow must be transformed to a call to a combined f+g.

The actual mechanics are a bit complicated, so I'm going to try a picture. Here we go:

As you can see, V8 replaces the top of the current call stack with one in which the frames have been merged together. This is called on-stack replacement (OSR). OSR takes the contexts of some contiguous range of function applications and the variables that are live in those activations, and transforms those inputs into a new context or contexts and live variable set, and splats the new activations on the stack.

I'm using weasel words here because OSR can go both ways: optimizing (in which N frames are combined to 1) and deoptimizing (typically the other way around), as Hölzle notes in "Optimizing Dynamically-Dispatched Calls with Run-Time Type Feedback". More on deoptimization some other day, perhaps.

The diagram above mentions "loop restart entry points", which is indeed the "dead code" that I mentioned previously. That code forms part of an alternate entry point to the function, used only once, when the computation is restarted after on-stack replacement.

details, godly or otherwise

Given what I know now about OSR, I'm going to try an alternate decompilation of V8's f+g optimized function. This time I am going to abuse C as a high-level assembler. (I know, I know. Play along with me :))

Let us start by defining our types.

typedef union { uint64_t bits; } Value;
typedef struct { Value map; uint64_t data[]; } Object;

All JavaScript values are of type Value. Some Values encode small integers (SMIs), and others encode pointers to Object. Here I'm going to assume we are on a 64-bit system. The data member of Object is a tail array of slots.

inline bool val_is_smi (Value v) 
{ return !(v.bits & 0x1); }

inline Value val_from_smi (int32 i)
{ return (Value) { ((uint64_t)i) << 32 }; }

inline int32 val_to_smi (Value v)
{ return v.bits >> 32; }

Small integers have 0 as their least significant bit, and the payload in the upper 32 bits. Did you see the union literal syntax here? (Value){ ((uint64_t)i) << 32 }? It's a C99 thing that most folks don't know about. Anyway.

inline bool val_is_obj (Value v)
{ return !val_is_smi (v); }

inline Value val_from_obj (Object *p)
{ return (Value) { ((uint64_t)p) + 1U }; }

inline Object* val_to_obj (Value v)
{ return (Object*) (v.bits - 1U); }

Values that are Object pointers have 01 as their least significant bits. We have to mask off those bits to get the pointer to Object.
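As a quick usage example of the sketch so far (not V8 code; it assumes the definitions above are in scope and that stack objects are 8-byte aligned), the encodings round-trip like so:

#include <assert.h>

int main (void)
{
  Value v = val_from_smi (42);
  assert (val_is_smi (v) && val_to_smi (v) == 42);

  Object o;
  Value p = val_from_obj (&o);
  assert (val_is_obj (p) && val_to_obj (p) == &o);
  return 0;
}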

Numbers that are not small integers are stored as Object values, on the heap. All Object values have a map field, which points to a special object that describes the value: what type it is, how its fields are laid out, etc. Much like GOOPS, actually, though not as reflective.

In V8, the double value will be the only slot in a heap number. Therefore to access the double value of a heap number, we simply check whether the Object's map is the heap number map, and in that case return the first slot, as a double.

There is a complication though; what is the value of the heap number map? V8 actually doesn't encode it into the compiled function directly. I'm not sure why it doesn't. Instead it stores the heap number map in a slot in the root object, and stores a pointer into the middle of the root object in r13. It's as if we had a global variable like this:

Value *root;    // r13

const int STACK_LIMIT_IDX = 0;
const int HEAP_NUMBER_MAP_IDX = -7; // Really.

inline bool obj_is_double (Object* o)
{ return o->map.bits == root[HEAP_NUMBER_MAP_IDX].bits; }

Indeed, the offset into the root pointer is negative, which is a bit odd. But hey, details! Let's add the functions to actually get a double from an object:

union cvt { double d; uint64_t u; };

inline double obj_to_double (Object* o)
{ return ((union cvt) { .u = o->data[0] }).d; }

inline Object* double_to_obj (double d)
{
  Object *ret = malloc (sizeof (Value) * 2);
  ret->map = root[HEAP_NUMBER_MAP_IDX];
  ret->data[0] = ((union cvt) { d }).u;
  return ret;
}

I'm including double_to_obj here just for completeness. Also did you enjoy the union literals in this block?

So, we're getting there. Perhaps the reader recalls, but V8's calling convention specifies that the context and the function are passed in registers. Let's model that in this silly code with some global variables:

Value context;  // rsi
Value function; // rdi

Recall also that when f+g is called, it needs to check that g is actually bound to the same value. So here we go:

const Value* g_cell;
const Value expected_g = (Value) { 0x7f7b205d7ba1U };

Finally, f+g. I'm going to show the main path all in one go. Take a big breath!

Value f_and_g (Value receiver)
{
  // Stack slots (5 of them)
  Value osr_receiver, osr_unused, osr_i, osr_ret, arg_context;
  // Dedicated registers
  register int64_t i, ret;

  arg_context = context;

  // Assuming the stack grows down.
  if (&arg_context > root[STACK_LIMIT_IDX])
    // The overflow handler knows how to inspect the stack.
    // It can longjmp(), or do other things.
    handle_overflow ();

  i = 0;
  ret = 0;
  
restart_after_osr:
  if (*g_cell != expected_g)
    goto deoptimize_1;

  while (i < 10000000)
    {
      register uint64_t tmp = ret + 1;
      if ((int64_t)tmp < 0)
        goto deoptimize_2;

      i++;
      ret = tmp;
    }

  return val_from_smi (ret);

And exhale. The receiver object is passed as an argument. There are five stack slots, none of which are really used in the main path. Two locals are allocated in registers: i and ret, as we mentioned before. There's the stack check, locals initialization, the check for g, and then the loop, and the return. The test in the loop is intended to be a jump-on-overflow check.

restart_after_osr what?

But what about OSR, and what's that label about? What happens is that when f+g is replaced on the stack, the computation needs to restart. It does so from the osr_after_inlining_g label:

osr_after_inlining_g:
  if (val_is_smi (osr_i))
    i = val_to_smi (osr_i);
  else
    {
      if (!obj_is_double (osr_i))
        goto deoptimize_3;

      double d = obj_to_double (val_to_obj (osr_i));

      i = (int64_t)trunc (d);
      if ((double) i != d || isnan (d))
        goto deoptimize_3;
    }

Here we take the value for the loop counter, as stored on the stack by OSR, and unpack it so that it is stored in the right register.

Unpacking a SMI into an integer register is simple. On the other hand unpacking a heap number has to check that the number has an integer value. I tried to represent that here with the C code. The actual assembly is just as hairy but a bit more compact; see the article for the full assembly listing.

The same thing happens for the other value that was live at OSR time, ret:

  if (val_is_smi (osr_ret))
    ret = val_to_smi (osr_ret);
  else
    {
      if (!obj_is_double (osr_ret))
        goto deoptimize_4;

      double d = obj_to_double (val_to_obj (osr_ret));

      ret = (int64_t)trunc (d);
      if ((double) ret != d || isnan (d))
        goto deoptimize_4;

      if (ret == 0 && signbit (d))
        goto deoptimize_4;
    }
  goto restart_after_osr;

Here we see the same thing, except that additionally there is a check for -0.0 at the end, because V8 could not prove to itself that ret would not be -0.0. But if all succeeded, we jump back to the restart_after_osr case, and the loop proceeds.

Finally we can imagine some deoptimization bailouts, which would result in OSR, but for deoptimization instead of optimization. They are implemented as tail calls (jumps), so we can't represent them properly in C.

deoptimize_1:
  return deopt_1 ();
deoptimize_2:
  return deopt_2 ();
deoptimize_3:
  return deopt_3 ();
deoptimize_4:
  return deopt_4 ();
}

and that's that.

I think I've gnawed all the meat that's to be had off of this bone, so hopefully that's the last we'll see of this silly loop. Comments and corrections very much welcome. Next up will probably be a discussion of how V8's optimizer works. Happy hacking.

Syndicated 2011-06-20 13:14:03 from wingolog

V8 is faster than GCC

Do you like the linkbait title? Neither do I, but it is true in this case. Check it out.

After my last post, Benjamin noted that GCC would reduce my simple test to a mov rax, $10000000; ret sequence. Well yes, that's true, and GCC does do that: but only if GCC is able and allowed to do the inlining. So the equivalent of the test, for GCC, is to compile the g in a separate file:

unsigned int g (void) { return 1; }

Then in the main file, we have:

extern unsigned int g (void);

int main (int argc, char *argv[]) {
  unsigned int i, ret = 0;
  for (i = 0; i < NU; i++)
    ret += g ();
  return ret;
}

Here we replace N with the number of iterations we're interested in testing.

I decided to run such a test; the harness is here. I ran both GCC and V8 for the above program, for N from 2⁰ to 2³¹, and plotted the total time taken by V8 as compared to GCC.

Naturally this includes the GCC compile times, for which we expect GCC to lose at low iterations, but the question is, when does GCC start being faster than V8?

answer: never!

That's right, V8 is always faster than GCC, right up to the point at which its fixnums start failing. For the record, only the points on the right of the graph are really worth anything, as the ones on the left only run for a few milliseconds. Raw data here.

Why is V8 faster, even for significant iteration counts? Because it can inline hot function calls at runtime. It can see more than GCC can, and it takes advantage of it: but still preserving the dynamic characteristics of JavaScript. In this context the ELF symbol interposition rules that GCC has to deal with are not such a burden!

Syndicated 2011-06-10 15:57:58 from wingolog
