Older blog entries for wingo (starting at number 379)

JavaScriptCore, the Webkit JS implementation

My readers will know that I have recently had the pleasure of looking into the V8 JavaScript implementation, from Google. I'm part of a small group in Igalia doing compiler work, and it's clear that in addition to being lots of fun, JavaScript implementations are an important part of the compiler market today.

But V8 is not the only JS implementation in town. Besides Mozilla's SpiderMonkey, which you probably know, there is another major Free Software JS implementation that you might not have even heard of, at least not by its proper name: JavaScriptCore.

jsc: js for webkit

JavaScriptCore (JSC) is the JavaScript implementation of the WebKit project.

In the beginning, JavaScriptCore was a simple tree-based interpreter, as Mozilla's SpiderMonkey was. But then in June of 2008, a few intrepid hackers at Apple wrote a compiler and bytecode interpreter for JSC, threw away the tree-based interpreter, and called the thing SquirrelFish. This was eventually marketed as "Nitro" inside Apple's products[0].

JSC's bytecode interpreter was great, and is still pretty interesting. I'll go into some more details later in this article, because its structure conditions the rest of the implementation.

But let me continue here with this historical sketch by noting that later in 2008, the WebKit folks added inline caches, a regular expression JIT, and a simple method JIT, and then called the thing SquirrelFish Extreme. Marketers called this Nitro Extreme. (Still, the proper name of the engine is JavaScriptCore; Wikipedia currently gets this one wrong.)

One thing to note here is that the JSC folks were doing great, well-factored work. It was so good that SpiderMonkey hackers at Mozilla adopted JSC's regexp JIT compiler and their native-code assembler directly.

As far as I can tell, for JSC, 2009 and 2010 were spent in "consolidation". By that I mean that JSC had a JIT and a bytecode interpreter, and they wanted to maintain them both, and so there was a lot of refactoring and tweaking to make them interoperate. This phase consolidated the SFX gains on x86 architectures, while also adding implementations for ARM and other architectures.

But with the release of V8's Crankshaft in late 2010, the JS performance bar had been lowered again (assuming it is a limbo), and so JSC folks started working on what they call their "DFG JIT" (DFG for "data-flow graph"), which aims be more like Crankshaft, basically.

It's possible to configure a JSC with all three engines: the interpreter, the simple method JIT, and the DFG JIT. In that case there is tiered compilation between the three forms: initial parsing and compilation produces bytecode, that can be optimized with the method JIT, that can be optimized by the DFG JIT. In practice, though, on most platforms the interpreter is not included, so that all code runs through the method JIT. As far as I can tell, the DFG JIT is shipping in Mac OS X Lion's Safari browser, but it is not currently enabled on any platform other than 64-bit Mac. (I am working on getting that fixed.)

a register vm

The interpreter has a number of interesting pieces, but it is important mostly for defining the format of bytecode. Bytecode is effectively the high-level intermediate representation (IR) of JSC.

To put that into perspective, in V8, the high-level intermediate representation is the JS source code itself. When V8 first sees a piece of code, it pre-parses it to raise early syntax errors. Later when it needs to analyze the source code, either for the full-codegen compiler or for Hydrogen, it re-parses it to an AST, and then works on the AST.

In contrast, in JSC, when code is first seen, it is fully parsed to an AST and then that AST is compiled to bytecode. After producing the bytecode, the source text isn't needed any more, and so it is forgotten. The interpreter interprets the bytecode directly. The simple method JIT compiles the bytecode directly. The DFG JIT has to re-parse the bytecode into an SSA-style IR before optimizing and producing native code, which is a bit more expensive but worth it for hot code.

As you can see, bytecode is the common language spoken by all of JSC's engines, so it's important to understand it.

Before really getting into things, I should make an aside about terminology here. Traditionally, at least in my limited experience, a virtual machine was considered to be a piece of software that interprets sequences of virtual instructions. This would be in contrast to a "real" machine, that interprets sequences of "machine" or "native" instructions in hardware.

But these days things are more complicated. A common statement a few years ago would be, "is JavaScript interpreted or compiled?" This question is nonsensical, because "interpreted" or "compiled" are properties of implementations, not languages. Furthermore the implementation can compile to bytecode, but then interpret that bytecode, as JSC used to do.

And in the end, if you compile all the bytecode that you see, where is the "virtual machine"? V8 hackers still call themselves "virtual machine engineers", even as there is no interpreter in the V8 sources (not counting the ARM simulator; and what of a program run under qemu?).

All in all though, it is still fair to say that JavaScriptCore's high-level intermediate language is a sequence of virtual instructions for an abstract register machine, of which the interpreter and the simple method JIT are implementations.

When I say "register machine", I mean that in contrast to a "stack machine". The difference is that in a register machine, all temporary values have names, and are stored in slots in the stack frame, whereas in a stack machine, temporary results are typically pushed on the stack, and most instructions take their operands by popping values off the stack.

(Incidentally, V8's full-codegen compiler operates on the AST in a stack-machine-like way. Accurately modelling the state of the stack when switching from full-codegen to Crankshaft has been a source of many bugs in V8.)

Let me say that for an interpreter, I am totally convinced that register machines are the way to go. I say this as a Guile co-maintainer, which has a stack VM. Here are some reasons.

First, stack machines penalize named temporaries. For example, consider the following code:

(lambda (x)
  (* (+ x 2)
     (+ x 2)))

We could do common-subexpression elimination to optimize this:

(lambda (x)
  (let ((y (+ x 2)))
    (* y y)))

But in a stack machine is this really a win? Consider the sequence of instructions in the first case:

; stack machine, unoptimized
0: local-ref 0      ; x
1: make-int8 2
2: add
3: local-ref 0      ; x
4: make-int8 2
5: add
6: mul
7: return

Contrast this to the instructions for the second case:

; stack machine, optimized
0: local-ref 0      ; push x
1: make-int8 2      ; push 2
2: add              ; pop x and 2, add, and push sum
3: local-set 1      ; pop and set y
4: local-ref 1      ; push y
5: local-ref 1      ; push y
6: mul              ; pop y and y, multiply, and push product
7: return           ; pop and return

In this case we really didn't gain anything, because the storing values to locals and loading them back to the stack take up separate instructions, and in general the time spent in a procedure is linear in the number of instructions executed in the procedure.

In a register machine, on the other hand, things are easier, and CSE is definitely a win:

0: add 1 0 0           ; add x to x and store in y
1: mul 2 1 1           ; multiply y and y and store in z
2: return 2            ; return z

In a register machine, there is no penalty to naming a value. Using a register machine reduces the push/pop noise around the instructions that do the real work.

Also, because they include the names (or rather, locations) of their operands within the instruction, register machines also take fewer instructions to do the job. This reduces dispatch cost.

In addition, with a register VM, you know the size of a call frame before going into it, so you can avoid overflow checks when pushing values in the function. (Some stack machines also have this property, like the JVM.)

But the big advantage of targeting a register machine is that you can take advantage of traditional compiler optimizations like CSE and register allocation. In this particular example, we have used three virtual registers, but in reality we only need one. The resulting code is also closer to what real machines expect, and so is easier to JIT.

On the down side, instructions for a register machine typically occupy more memory than instructions for a stack machine. This is particularly the case for JSC, in which the opcode and each of the operands takes up an entire machine word. This was done to implement "direct threading", in which the opcodes are not indexes into jump tables, but actually are the addresses of the corresponding labels. This might be an acceptable strategy for an implementation of JS that doesn't serialize bytecode out to disk, but for anything else the relocations are likely to make it a lose. In fact I'm not sure that it's a win for JSC even, and perhaps the bloat was enough of a lose that the interpreter was turned off by default.

Stack frames for the interpreter consist of a six-word frame, the arguments to the procedure, and then the locals. Calling a procedure reserves space for a stack frame and then pushes the arguments on the stack -- or rather, sets them to the last n + 6 registers in the stack frame -- then slides up the frame pointer. For some reason in JSC the stack is called the "register file", and the frame pointer is the "register window". Go figure; I suppose the names are as inscrutable as the "activation records" of the stack world.

jit: a jit, a method jit

I mention all these details about the interpreter and the stack (I mean, the register file), because they apply directly to the method JIT. The simple method JIT (which has no name) does the exact same things that the bytecode interpreter does, but it does them via emitted machine instructions instead of interpreting virtual instructions.

There's not much to say here; jitting the code has the result you would expect, reducing dispatching overhead, while at the same time allowing some context-specific compilation, like when you add a constant integer to a variable. This JIT is really quick-and-dirty though, so you don't get a lot of the visibility benefits traditionally associated with method JITs like what HotSpot's C1 or C2 currently have. Granted, the register VM bytecode does allow for some important optimizations to happen, but JSC currently doesn't do very much in the way of optimizing bytecode, as far as I can tell.

Thinking more on the subject, I suspect that for Javascript, CSE isn't even possible unless you know the types, as a valueOf() callback could have side effects.

an interlude of snarky footnotes

Hello, reader! This is a long article, and it's a bit dense. I had some snarky footnotes that I enjoyed writing, but it felt wrong to put them at the end, so I thought it better to liven things up in the middle here. The article continues in the next section.

0. In case you didn't know, compilers are approximately 37% composed of marketing, and rebranding is one of the few things you can do to a compiler, marketing-wise, hence the name train: SquirrelFish, Nitro, SFX, Nitro Extreme...[1] As in, in response to "I heard that Nitro is slow.", one hears, "Oh, man they totally fixed that in SquirrelFish Extreme. It's blazingly fast![2]"

1. I don't mean to pick on JSC folks here. V8 definitely has this too, with their "big reveals". Firefox people continue to do this for some reason (SpiderMonkey, TraceMonkey, JaegerMonkey, IonMonkey). I expect that even they have forgotten the reason at this point. In fact the JSC marketing has lately been more notable in its absence, resulting in a dearth of useful communication. At this point, in response to "Oh, man they totally are doing a great job in JavaScriptCore", you're most likely to hear, "JavaScriptCore? Never heard of it. Kids these days hack the darndest things."

2. This is the other implement in the marketer's toolbox: "blazingly fast". It means, "I know that you don't understand anything I'm saying, but I would like for you to repeat this phrase to your colleagues please." As in, "LLVM does advanced TBAA on the SSA IR, allowing CSE and LICM while propagating copies to enable SIMD loop vectorization. It is blazingly fast."

dfg: a new crankshaft for jsc?

JavaScriptCore's data flow graph (DFG) JIT is work by Gavin Barraclough and Filip Pizlo to enable speculative optimizations for JSC. For example, if you see the following code in JS:

a[i++] = 0.7*x;

then a is probably an array of floating-point numbers, and i is probably an integer. To get great performance, you want to use native array and integer operations, so you speculatively compile a version of your code that makes these assumptions. If the assumptions don't work out, then you bail out and try again with the normal method JIT.

The fact that the interpreter and simple method JIT have a clear semantic model in the form of bytecode execution makes it easy to bail out: you just reconstruct the state of the virtual registers and register window, then jump back into the code. (V8 calls this process "deoptimization"; the DFG calls it "speculation failure".)

You can go the other way as well, switching from the simple JIT to the optimized DFG JIT, using on-stack replacement. The DFG JIT does do OSR. I hear that it's needed if you want to win Kraken, which puts you in lots of tight loops that you need to be able to optimize without relying on being able to switch to optimized code only on function re-entry.

When the DFG JIT is enabled, the interpreter (if present) and the simple method JIT are augmented with profiling information, to record what types flow through the various parts of the code. If a loop is executed a lot (currently more than 1000 times), or a function is called a lot (currently about 70 times), the DFG JIT kicks in. It parses the bytecode of a function into an SSA-like representation, doing inlining and collecting type feedback along the way. This probably sounds very familiar to my readers.

The difference between JSC and Crankshaft here is that Crankshaft parses out type feedback from the inline caches directly, instead of relying on in-code instrumentation. I think Crankshaft's approach is a bit more elegant, but it is prone to lossage when GC blows the caches away, and in any case either way gets the job done.

I mentioned inlining before, but I want to make sure that you noticed it: the DFG JIT does do inlining, and does so at parse-time, like HotSpot does. The type profiling (they call it "value profiling") combined with some cheap static analysis also allows the DFG to unbox int32 and double-precision values.

One thing that the DFG JIT doesn't do, currently, is much in the way of code motion. It does do some dead-code elimination and common-subexpression elimination, and as I mentioned before, you need the DFG's value profiles in order to be able to do this correctly. But it does not, as far as I can tell, do much in the way of code motion, like loop-invariant code motion.

Also, the DFG's register allocator is not as good as Crankshaft's, yet. It is hampered in this regard by the JSC assembler that I praised earlier; while indeed a well-factored, robust piece of code, JSC's assembler has a two-address interface instead of a three-address interface. This means that instead of having methods like add(dest, op1, op2), it has methods like add(op1, op2), where the operation implicitly stores its result in its first operand. Though it does correspond to the x86 instruction set, this sort of interface is not great for systems where you have more registers (like on x86-64), and forces the compiler to shuffle registers around a lot.

The counter-based optimization triggers do require some code to run that isn't strictly necessary for the computation of the results, but this stratey does have the nice property that the DFG performance is fairly predictable, and measurable. Crankshaft, on the other hand, being triggered by a statistical profiler, has statistically variable performance.

And speaking of performance, AWFY on the mac is really where it's at for JSC right now. Since the DFG is only enabled by default on recent Mac OS 64-bit builds, you need to be sure you're benchmarking the right thing.

Looking at the results, I think we can say that JSC's performance on the V8 benchmark is really good. Also it's interesting to see JSC beat V8 on SunSpider. Of course, there are lots of quibbles to be had as to the validity of the various benchmarks, and it's also clear that V8 is the fastest right now once it has time to warm up. But I think we can say that JSC is doing good work right now, and getting better over time.

future

So that's JavaScriptCore. The team -- three people, really -- is mostly focusing on getting the DFG JIT working well right now, and I suspect they'll be on that for a few months. But we should at least get to the point where the DFG JIT is working and enabled by default on free systems within a week or two.

The one other thing that's in the works for JSC is a new generational garbage collector. This is progressing, but slowly. There are stubs in the code for card-marking write barriers, but currently there is no such GC implementation, as far as I can tell. I suspect that someone has a patch they're cooking in private; we'll see. At least JSC does have a Handle API, unlike SpiderMonkey.

conclusion

So, yes, in summary, JavaScriptCore is a fine JS implementation. Besides being a correct implementation for real-world JS -- something that is already saying quite a lot -- it also has good startup speed, is fairly robust, and is working on getting an optimizing compiler. There's work to do on it, as with all JS implementations, but it's doing fine.

Thanks for reading, if you got this far, though I understand if you skipped some parts in the middle. Comments and corrections are most welcome, from those of you that actually read all the way through, of course :). Happy hacking!

Syndicated 2011-10-28 15:51:24 from wingolog

the user in the loop

The videos from this year's GNU Hackers Meeting in Paris are up. All the videos are linked from that page. There were technical problems with a couple of them, and we didn't get BT Templeton's presentation on Emacs Lisp in Guile (using delimited dynamic bindings to implement buffer-local variables!), but all in all it's a good crop.

I gave a talk entitled "The User in the Loop" that argued for the place of extensibility in the GNU project. I also pointed out some ways in which Guile fulfills those needs, but that was not the central point to the talk. I argued that point at more length in a previous article.

Anyway, let's give that newfangled HTML5 video tag a go here. It starts with the cut-off statement, "We all know that code motion is very important to efficiency, so..."

Alternately you can download the video directly (~350MB, 52 minutes). There are notes too, a superset of the slides from the talk.

Syndicated 2011-10-19 14:32:34 from wingolog

what's your C migration plan?

I spend a lot of time in my neighborhood café. It has character. Some of it is generic: like in many other bars, the round tables are marble, the chairs are rickety, the customers and the barmen all shout at each other.

But it has its own thing. The owner inherited it from his father, but he was never sure whether he wanted a record store or a bar, so it's half and half. In practice the CDs in the changer don't change very often, and the rest are all in a side room that is otherwise packed full of junk.

In the morning, everyone gets tallats. Old ladies get theirs with pastries. Old men get theirs amb gotes, with liquor. Morning workers rush in, slam one back, slap coins on the counter and get out. Adéu!

Those with more time, like myself, get a cafè amb llet. Most read the paper. I hack. I spend a couple mornings a week there. It's right pleasant to work there, without internet. My most productive hours of the week are there in the bar.

I do chitchat a bit, though: with designers, mothers, the barmen, retirees, random folk. There's one fellow in particular I like talking with, a designer. Turns out he wants to learn how to program. He told me yesterday that he wanted to learn C.

on c

Now, I spend a lot of time in C. I've written loads of it. I continue to write it at times. I'm fond of it, and it has served me well.

But I have come to believe that writing new programs in C is the hacker equivalent of malpractice. C is just too dangerous to use. The risks are not worth the benefits.

Let's be clear about the benefits of writing in C, before looking at its flaws. I want to be really fair here. C is fast. It has great compilers, everywhere. And they are everywhere: C is ubiquitous. It is very flexible, also. You can do anything in C. It is great for programming drivers. It is possible to write big e-mail clients in it. It has great low-level bit operations (though C++ is getting better at it, with its value types). It is a power tool, and it puts you in control. There are loads of libraries written in it, and for it. It encourages minimalism.

In short, C has been a wild success, and with good reason.

But it's time to stop using it.

The essence of the problem with C is that the behavior of erroneous programs is not specified. This doesn't sound like a big deal, but it is. Let's make an example. What is the meaning of this Python program?

def array_set(a, n, x):
  a[n] = x

I think we would all say that it sets the nth element of a to x. But what about this one in C:

void array_set (void **a, size_t n, void *x) {
  a[n] = x;
}

This one, you really can't tell. You can't use Python's definition, because there are no guarantees about the valid indices to set in the array. In fact there is no array, just a pointer.

In the Python case, if you pass an out-of-bounds array index, you will get an exception. In the C case, you get undefined behavior. Lately what "undefined behavior" means is that the state or organized crime gets to take control of your computer: read your email, log your keystrokes, take pictures of you.

I'm deadly serious. Let's take a better example. Last year Liu Xiaobo won the Nobel Peace Prize for "his long and non-violent struggle for fundamental human rights in China." A couple weeks later, it was found that someone cracked the Nobel Prize website and was serving a Firefox zero-day that took over users' machines.

Apparently, someone was interested in what was on the computers of visitors to the Nobel prize site -- interested enough to use a fresh zero-day, something that probably sells on the black market for some $20,000 or more.

Lately I've been quoting Eben Moglen a lot, especially from the talk he gave at this year's FOSDEM conference. One thing he said there is that "software is the steel of the 21st century". As steel shaped the social relations of the last century, what we hack now forms the material conditions of tomorrow: the next Arab Spring, or the next Green Scare.

Almost all software is connected to the net these days, and so it is all under attack. If you write software in C these days -- software with bugs, like any software -- you are writing software with undefined behavior, and thus, software that enables powerful state and organized crime actors to take advantage of your users.

That sounds a bit exaggerated, no? It is true, though. Look at what is happening with browser vulnerabilities, or what is the same, PDF or Flash vulnerabilities, or PNG or MP4, or what-have-you. The commonality here is that powerful interests exploit unsuspecting users due to flaws in the C and C++ language. Writing more C these days is malpractice.

migration

I still write C. I work on implementations of safe languages -- languages that don't have the same kinds of fundamental vulnerabilities that C and C++ have. Eventually the amount of C in the world will stop growing, and decline as pieces that are now written in C will be written in Python, in JavaScript, in Guile: in short, in languages that don't launch the missiles when you try to write beyond the end of an array.

C has had a great run; we should celebrate it. But its time has passed. What is your migration strategy? How are you going to stop writing C?

Syndicated 2011-10-13 20:44:55 from wingolog

partial evaluation in guile

Friends, something awesome just happened: Guile just got itself a respectable inliner.

I have said before on this blog, quoting commenter Rémi Forax, that "inlining is the mother of all optimizations". It is true that inlining opens up space for code motion, constant folding, dead code elimination, and loop optimizations. However, credit might be better laid at the feet of partial evaluation, the mother of all inlining algorithms.

Partial evaluation is a source-to-source transformation that takes your program and produces a better one: one in which any computation that can be done at compile-time is already made, leaving only those computations that need to done at run-time.

For example, the application

(+ 2 3)

can clearly be evaluated at compile-time. We say that the source expression (+ 2 3) reduces to 5 via constant folding. The result, 5 in this case, is the residual expression.

A more complicated example would look like:

(let ((string->chars
       (lambda (s)
         (define char-at
           (lambda (n) (string-ref s n)))
         (define len
           (lambda () (string-length s)))
         (let loop ((i 0))
           (if (< i (len))
               (cons (char-at i)
                     (loop (1+ i)))
               '())))))
  (string->chars "yo"))
=> (list #\y #\o)

Here when I write =>, you should read it as, "residualizes at compile-time to". In this case our input program residualized, at compile-time, to a simple list construction. The loop was totally unrolled, the string-refs folded, and all leaf procedures were inlined.

Neat, eh?

optimization enables expressiveness

If the partial evaluator does its job right, the residual program will run faster. However this isn't the real reason that I'm so pleased with it; rather, it's that it lets me write different programs.

You see, I hack on Guile's compiler and VM and all that. When I write code, I know what Guile is going to do with it. Unfortunately, this caused my programs to be uglier than necessary, because I knew that Guile wasn't going to inline some important things for me. I wrote at a lower level of abstraction, because I couldn't trust the compiler.

Now, with the partial evaluator, I'm happy to use helper functions, even higher-order helpers, with the knowledge that Guile will mostly do the right thing. This is particularly important in the context of languages that support syntactic abstraction, like Scheme. If you're a Schemer and haven't seen Kent Dybvig's Macro Writers' Bill of Rights talk (slides), do check it out.

Incidentally, there was a sad moment in JSConf.eu a couple weekends ago when Andreas Gal (of all people!) indicated that he had to manually inline some functions in PDF.js in order to get adequate speed. More on JavaScript a little later, though.

about partial evaluation

A partial evaluator looks a lot like a regular meta-circular evaluator. It's a recursive function that takes an expression and an environment and yields a value. Guile's partial evaluator, peval, builds up lexical environments when it sees let and other binding constructs, and tries to propagate copies when it sees lexical references.

Inlining is facilitated by copy-propagation of lambda expressions. Just as the initial value 0 in the example above propagates through the lexical variable i to reach (< i (len)), (lambda () (string-length s)) propagates to len. Application of a lambda expression reduces to the equivalent of a let binding. So for the first iteration of loop above, we have:

(< i (len))
;; copy propagation
=> (< 0 ((lambda () (string-length s))))
;; beta-reduction
=> (< 0 (string-length s))
;; copy-propagation
=> (< 0 (string-length "yo"))
;; constant-folding
=> (< 0 2)
;; constant-folding
=> #t

In this case the condition folded to a constant, so we know at compile-time which branch to take. The second branch is dead, so we eliminate it. The process continues until we finally produce the resulting list.

down the rabbit hole

Up to here things are easy: we have a simple, well-typed example that terminates. But to be part of a real-world compiler, a partial evaluator needs to handle real-world code: accessors for mutable data, access to mutable bindings (lexical and global), indefinite recursion, unbound variables, and poorly-typed programs. In addition, a real-world inliner needs to run quickly and avoid producing bloated residual code.

I should take a moment and note that statically-typed, functional languages can avoid a number of these problems, simply by defining them away. It is no wonder that compiler people tend towards early binding. Scheme does exhibit a fair amount of early binding through its use of lexical scope, but it is not a pure functional language. Working on this peval was the first time that I wished for immutable pairs in Scheme, as promoted by Racket and R6RS.

Anyway, having mutability in your language isn't so bad. You do miss some optimization opportunities, but that is OK. What is not OK in a production peval is spending too much time on an expression.

Guile's solution, following Waddell and Dybvig's excellent Fast and Effective Procedure Inlining, is to simply count the number of times through the inliner. Each inlining attempt gets a fresh counter, and any work performed within an inlining attempt decrements the counter. When the counter reaches zero, the inlining attempt is aborted, and a call is residualized instead. Since the number of call sites in the program is fixed, and there is a maximum amount of work that will be done at each call site, the resulting algorithm is O(N) in the size of the source program.

Guile's partial evaluator also uses the on-demand, online strategy of Waddell and Dybvig, to allow definitions to be processed in their use contexts. For example, (cons 1 2) may be reduced to #t when processed as a test, in a conditional. If, after processing the body of a let, a binding is unreferenced, then it is processed for effect. Et cetera.

With the effort counter in place, Guile simply tries to inline every call site in the program, knowing that it will bail out if things don't work. It sounds a little crazy, but it works, as Waddell and Dybvig show. The effort counter also serves to limit code growth, though it is a bit crude. In any case I got less than a percent of code growth when optimizing the psyntax expander that Guile uses, which is a win in my book.

caveats

Partial evaluation can only propagate bindings whose definitions are known. In the case of Guile, then, that restricts inlining to lexical references and primitive references, and notably excludes global references and module imports, or fields of mutable objects. So this does not yet give us cross-module inlining, beyond the hacks that abuse the macro expander.

This observation has a correlary, in that some languages promote a style of programming that is difficult to analyze. I'm really talking about object-oriented languages here, and the dynamic ones in particular. When you see o.foo() in Java, there is at least the possibility that foo is a final method, so you know you can inline it if you choose to. But in JavaScript if you see o.foo(), you don't know anything: the set of properties of o can and does vary at runtime as people monkey-patch the object o, its prototype, or Object.prototype. You can even change o.__proto__ in most JS implementations. Even if you can see that your o.foo() call is dominated by a o.foo = ... assignment, you still don't know anything in ES5, as o could have a setter for the foo property.

This situation is mitigated in the JavaScript world by a couple of things.

First of all, you doesn't have to program this way: you can use lexical scoping in a more functional style. Coupled with strict mode, this allows a compiler to see that a call to foo can be inlined, as long as foo isn't mutated in the source program. That is a property that is cheap to prove statically.

However, as Andreas Gal found out, this isn't something that the mainstream JS implementations do. It is really a shame, and it has lasting impacts on programmers.

I even heard a couple people say that in JS, you should avoid deep lexical bindings, because the access time depends on the binding depth. While this is true for current implementations, it is a property of the implementations and not of the language. Absent with and eval-introduced bindings, a property that is true in strict-mode code, it is possible to quickly compute the set of free variables for every function expression. When the closure is made, instead of grabbing a handle on some sort of nested scope object, a JS implementation can just copy the values of the free variables, and store them in a vector associated with the function code. (You see, a closure is code with data.) Then any accesses to those variables go through the vector instead of the scope.

For assigned variables -- again, a property that can be proven statically -- you put the variables in a fresh "box", and rewrite accesses to those variables to go through that box. Capturing a free variable copies the box instead of its value.

There is nothing new about this technique; Cardelli and Dybvig (and probably others) discovered it independently in the 80s.

This point about closure implementation is related to partial evaluation: people don't complain much about the poor static inliners of JS, because the generally poor closure implementations penalize lexical abstraction. Truly a shame!

* * *

It seems I have digressed. Sorry about that!

I spoke about closures and lexical scope, properties of the JS language that can enable static inlining. The second (and more important) way that JS implementations can support inlining is dynamically. I trolled about that some months ago. Dynamic inlining is fantastic, when it works, though there are limiting heuristics (scroll down to "inlining", and note that the exact set of heuristics have changed in the intervening months).

So my last point was about something that Guile does well that JS implementations do poorly, and it's fair that this point should be the reverse. I would like to be able to dynamically inline, but this would mean associating the intermediate representation with Scheme functions. As Guile can compile code ahead-of-time, this means we would have to serialize the IR out to disk, in much the same way as GCC's new link-time optimizer (LTO) does. But I would like to put that off until we change the format of compiled Guile code to be ELF. Otherwise we run the risk of bloating our runtime memory size.

try it out

Guile's partial evaluator was joint work between myself and my fellow Guile maintainer Ludovic Courtès, and was inspired by a presentation by William Cook at DSL 2011, along with the Waddell and Dybvig's Fast and Effective Procedure Inlining.

This code is currently only in the development Guile tree, built from git. Barring problems, it will be part of Guile 2.0.3, which should be out in a couple weeks.

You can check out what the optimizer does at the command prompt:

>,optimize (let ((x 13)) (* x x))
$1 = 169
>,optimize (let ((x 13)) (* x foo))
$2 = (* 13 foo)

Have fun, and send bugs to bug-guile@gnu.org.

Syndicated 2011-10-11 10:01:30 from wingolog

a schemer at jsconf.eu

Yow! I am just back from JSConf.eu, a lovely & lively gathering of JavaScript hackers in sunny Berlin. And it was sunny, this foreign land, and foreign indeed: as my long-time readers will know, I am not a JavaScript hacker, but a Schemer with a C problem. JSConf.eu was a pleasant place to visit.

It turns out that I wasn't the only one there in the tourist condition; all of the other implementors seemed to be in a similar situation. The Mozilla folks look at themselves as "platform engineers". V8 hacker Vyacheslav Egorov has a thing for Oberon; I'm sure he'll tell you about it if you ask. And so on. Indeed I think that of the folks that were there, Brendan is the only one implementing JavaScript in JavaScript. Do You Speak It!

great job, organizers!

But besides giving me the opportunity to gossip about garbage collection and type feedback, JSConf.eu was a well-organized, energetic event, and I say that as one who has been to dozens of conferences in the last few years.

There was great sound and video, of course. It looked like they did a pretty good job with the recording, so the videos should be online at some point. In the downstairs room, they had Anna Lena Schiller drawing live "graphic recordings" as the speakers spoke, on large sheets of glossy paper, in color, illustrating the main ideas of the talks. As the weekend progressed, all of these pieces were mounted on the wall for people to see. In the end they will be scanned and uploaded, and the originals mailed to the speakers. It was amazing, and an impressive amount of work.

It's silly, but let me also mention the food: there were fresh-made frozen-yogurt smoothies with brownie toppings. There were fresh-made croissants and bread, and a plate of some dozen or so French cheeses, and made-to-order breakfast, and plates of tarts, and the bar was open all day long, with apfelschorle and club-mate and beer and coffee, and there was an espresso stand, and everything was free. There was lunch and dinner at the venue and free parties before, during, and after. The only caveat was that the vegetarian options were relatively few. Still, though way to raise the bar, JSConf.eu!

hypersensitivity?

I could be overly sensitive about this, but I would like to put one thing out there. There were a number of sexuality-related aspects of this event that I did not feel comfortable with. There were a couple speakers that made live-chat applications with node.js, and demoed them to the room, and a few anonymous penis jokes ended up on them. There was a speaker that made a sexual joke about Allow: GET, HEAD in HTTP headers. Most of all though, they had the (very talented) caberet singer Mandy Lauderdale there, who started with a "Brendan Eich Fanclub" thing, then put on a show the next night at the party.

Please understand me when I say that I have no problem with sexuality in general -- it is fun, subversive, and in any case a part of life. However in the programming world we have a gender-balance problem, and I don't think that incorporating sexuality into our culture helps that. In the particular case of Mandy's (talented and sometimes funny) performance, I get the feeling that we're promoting a "boy's club" atmosphere where it becomes OK to make penis jokes.

Like I said, I'm just putting that out there. Feel free to disagree, politely of course.

talks

The first talk I saw was by Dean McNamee of plask, a non-browser JS environment for making art. Plask helps make graphical art by providing 2D and 3D drawing primitives to JS. It also integrates with other devices via MIDI and OSC, and can be sequenced over MIDI (or OSC, I presume).

Dean's message was that we need to stop caring so much about the "how" and care more about the "what" and "when". There was no code in the presentation, only demos of what Plask has done; and indeed Dean was quite honest that it was the crappiest code he has ever written. But the demos were pretty awesome.

There are obviously some problems with that approach to software, that it is difficult to share or to form a piece of someone else's work. However it is equally true that many of us spend a lot of time working on things that people will never see, and that sometimes we should just take shortcuts and move on.

Another interesting talk was by Peter van der Zee on JavaScript tools. Peter wrote a JS parser in JS, and an in-browser editor to overlay that tool information and utility onto a piece of JS code. It can provide many warnings, do type inference, and similar things. It can do some automated refactorings, like var hoisting, extract-method, and minification. I wonder if some more principled analysis algorithms like kCFA would provide any features above what he is able to do with his fixed-point constraint solver. Still, neat stuff; check it out on github.

David Mandelin and David Anderson of Mozilla gave a nice introduction to what JIT compilers can do these days in JavaScript, at least at a very local level. The presentation reminded me of Mandelin's old pjit project, to figure out what sorts of optimizations would be necessary to yield the smallest assembly code. My only criticism is that some of their assumptions were specific to the nan-boxing approach. They did include a humble nod at the end that there was still a ways for IonMonkey to go in order to reach what Crankshaft does, which was appropriate IMO.

Brendan Eich gave a talk at the end of the day about the coming ES6. Frankly I have no idea how people could understand the things he spoke of without following es-discuss, and most people I asked afterwards were really skeptical about such things as the "monocle-mustache" operator (that's .{ for those of you following along at home). That said, I do like the triangle operator (|>) and the potential block-lambda syntax. Slides are not up yet, unfortunately.

. . .

The next day, I was pleased to find myself sitting with GNOME hackers Henri Bergius and Garrett LeSage. We all agreed that it was great to be in such a dynamic, energetic conference, and that JSConf.eu was much more lively than recent free software conferences we had been to. However it must be said that JS programmers are at an early stage of free software development in many ways, in that there are many more single-person projects than we have in GNOME (for example). It could be said that our use of C in GNOME forces you to collaborate because you can't build very much (hah!), but large, stable projects with many contributors are relatively rare.

Also, I thought there is a general low esteem among the JS folk for copyleft and privacy, and for software freedom in general. There are exceptions of course, but JS is closely tied to environments with lots of venture capital, so it is to be expected that we get divergence of interests between users and developers. I think this is an area in which GNU folk have something important to say: that it's not just about the tech, it's about the freedom too.

On that note, my Igalia colleague Eduardo Lima's talk on JavaScript in GNOME was well-received. Eduardo talked about the pieces that allow a good JavaScript hacking experience to be had in GNOME, and even did some live-coding. Hopefully he'll blog about his talk soon :-)

I really enjoyed Tom Robinson's talk about compilers and JavaScript. It was about compilers that produce or consume JavaScript. I suppose that in some ways compiling DSLs to JavaScript in an in-browser compiler that passes the result to eval is somewhat like macros, though it is hard to compose with other parts of the target language like lexical scope. Interesting stuff though, and I hope he puts up his slides soon.

Finally, and we're getting to the end here -- I did not intend to write so much! -- V8 hacker Erik Corry had a great talk about garbage collection. V8 always had a generational, moving collector, which is great for most purposes, but struggles with large heaps, because collecting the old generation required a stop-the-world mark and sweep. Their new GC can incrementally collect the old generation, leading to reductions in pause times and higher sustained allocation rates for some problems (like the splay benchmark). There's lots of tricky write-barrier stuff going on there, and they're just now working out the details, but it will arrive to a Chrome near you shortly.

As part of Erik's work, he had enhanced the V8 engine to output heap profiling information on a TCP port. He could then run a node.js process to consume this information and output a graphical representation of the heap in a browser. It shows the new space filling up, scavenging, and promoting data into various parts of the old space. It also shows the old space being incrementally marked and swept, and has a visualization of time spent in the collector versus the mutator. This last visualization nicely showed that the new GC allows the splay benchmark to make forward progress even with a large rate of tenuring into the old generation.

I'm sure the Google folks will make their press release about this as soon as things settle down, but for now you can check it out in the bleeding_edge of V8. Have fun!

fin

A gist of all the slides is up on github, and I suppose it will be updated as new slides are uploaded and videos come online.

There were a number of other talks I didn't get to mention, like Andreas Gal on PDF.js, or Alon Zakai on Emscripten. SQLite compiled from C to JS for the browser? What a world. Thanks again to the JSConf.eu folks for such a great conference, and to Igalia for sending me. Happy hacking!

Syndicated 2011-10-04 18:23:09 from wingolog

from ssa to native code: v8's lithium language

So, friends, about a month ago I posted a look at the Hydrogen intermediate language, as part of an ongoing series on the internals of the V8 JavaScript implementation. Hydrogen is the high-level part of V8's Crankshaft optimizing compiler, where the optimizations happen. This article tells the story of Lithium, Crankshaft's low-level, machine-dependent intermediate language.

totally metal

So here's the deal: you parsed your JavaScript functions to AST objects (parse all the functions!). You emitted some simple native code, quickly but not optimized, and you sat back and waited. After a little while, your profiling told you that some function was hot, so you re-parsed it, generated a graph of Hydrogen instructions and optimized that graph. Now you need to generate code. And that's where we are! Generate all the code!

But that's not what V8 does here. Instead of going directly to native code, it translates the HGraph to a chunk of Lithium code, then emits machine code from those Lithium instructions. Unlike Hydrogen, which is an SSA form in which instructions implicitly define values, the Lithium form is closer to three-address code, with labels and gotos, and explicitly names its operands (if any) and results (if any).

These temporary values are initially declared with a number of constraints, such as "must be in a double-precision register". Later, a register allocator considers the constraints and liveness ranges, allocating each value to a specific register. Code generation happens after register allocation.

an example

Let us take a look at an example to see how these things interact. Given that Lithium is target-specific, we'll just choose one of the targets; ia32 and x64 are the most advanced, so we'll just take x64. (That's what they call it.)

LInstruction* LChunkBuilder::DoAdd(HAdd* instr) {
  if (instr->representation().IsInteger32()) {
    ASSERT(instr->left()->representation().IsInteger32());
    ASSERT(instr->right()->representation().IsInteger32());
    LOperand* left =
        UseRegisterAtStart(instr->LeastConstantOperand());
    LOperand* right =
        UseOrConstantAtStart(instr->MostConstantOperand());
    LAddI* add = new LAddI(left, right);
    LInstruction* result = DefineSameAsFirst(add);
    if (instr->CheckFlag(HValue::kCanOverflow)) {
      result = AssignEnvironment(result);
    }
    return result;
  } else if (instr->representation().IsDouble()) {
    return DoArithmeticD(Token::ADD, instr);
  } else {
    ASSERT(instr->representation().IsTagged());
    return DoArithmeticT(Token::ADD, instr);
  }
  return NULL;
}

As an aside, this is fairly typical for Crankshaft code: tight, lots of asserts, custom type predicates instead of RTTI, self-explanatory enumerated values, and much more oriented towards function calls than in-place mutations. The new objects are created in a "zone", under the covers, so that once the compiler is done, it can free the whole zone at once. I reflowed the left and right lines; otherwise it really is a paragraph of text.

Unfortunately there are not a lot of big-picture comments, hence this blog series.

Anyway, here we have three cases. The first two are for when the HValue has been has been given an unboxed representation, as an int32 or as a double. The last is for tagged values, which can be small integers (Smi values) or heap-allocated doubles. In these latter cases, we dispatch to the DoArithmeticD and DoArithmeticT helpers, which create generic LArithmeticD and LArithmeticT instructions. But we see that in the first case, there is a special LAddI instruction.

When there is a special case in Crankshaft, it either means that we are working around some strange corner of JavaScript, or that something is being optimized. In this case, it is an optimization, reflecting the importance of integer addition. The concrete optimizations are:

  1. Immediate operands. If one of the addends is a constant at compile-time, it can be encoded into the instruction stream directly.

  2. Overflow check removal. If the range analysis (see my previous article) showed that the value would not overflow, then no overflow check is emitted. (Otherwise, an overflow check will be inserted, branching to a deoptimization bailout. The bailout will need to know what variables are live, in that case, hence the AssignEnvironment call; more on that later.)

The UseRegisterAtStart helper call indicates that the operand should be in a register, and that it will be live at the start of the instruction but not the end. This latter constraint is needed because x86 instructions typically clobber the first operand. UseRegisterOrConstant only allocates a register if the operand is not constant and so cannot be immediate. Finally DefineSameAsFirst constrains the output register to be the same as the first operand. All of these calls additionally inform the register allocator of the uses and definitions, in SSA parlance, so that the allocator can build accurate liveness ranges.

Just to be complete, the double arithmetic case (not shown) is bit more straightforward, in that there are no overflow checks or immediates. Tagged arithmetic, on the other hand, always dispatches to an inline cache. (I find this to be fascinating. Typically language implementations inline fixnum operations, but V8 does not because their inline caches also perform type feedback, allowing unboxing optimizations where possible. In V8, tagged integer arithmetic is never inlined, by default.)

allocate all the registers?

Why bother with this Lithium thing? Why not just generate code directly? The answer is that having a target-specific Lithium language allows target-specific code generation and temporary allocation, while permitting the register allocator itself to be generic. As an example, see the ARM DoAdd implementation.

Most everyone knows, but the idea of register allocation is to store your temporaries in machine registers instead of in memory, so that things run fast. It's tricky though, because nested procedure calls can trash your registers, so you need to "spill" them to memory in certain cases. You also need to avoid thrashing values between registers and memory. V8 also tries to use registers when calling internal runtime helpers, which can result in a fair amount of register shuffling.

That's all traditional compiler stuff. But in a garbage-collected language, we also have the complication of knowing which registers hold garbage-collected values and which don't. For every point in the code in which GC could run, the collector will need to mark those pointers, and possibly relocate them. This complicates the allocator.

Another complication is that besides the normal entry point, optimized functions often have an additional entry point, due to on-stack replacement. That entry point has to take values from memory and pack them into registers in a way that lets the computation proceed.

Finally, the allocator plays a role in permitting so-called "lazy deoptimization". V8 hacker Søren Gjesse writes, in a rare but pleasantly exegetic message to the list:

Lazy deoptimization is what happens when all code in a context is forced to be deoptimized either due to starting debugging or due to a global assumption on the optimized code that no longer holds. The reason for the term "lazy" is that actual deoptimization will not happen until control returns to the optimized function. The way we do this lazy deoptimization is somewhat fragile and we have had quite a few bugs in that code. It is based on destructively patching the optimized code placing a call just after each call forcing deoptimization. And actually the patching is not right after the return point but after the gap code inserted at the return point. For the deferred stack check the code up to restoring the registers (popad in IA-32) is considered the gap code.

gaps?

The gap code that Søren referred to is the set of moves that shuffles registers back into place after a call. Translation of Hydrogen to Lithium inserts LGap instructions between each LInstruction. When the allocator decides that it needs to spill, restore, or shuffle some registers, it records these moves on an LGap.

A later pass serializes these parallel moves into a specific order, trying to use the minimum number of temporaries. I link to Xavier Leroy's paper above, not because V8 uses Coq, but because of the nice introduction to the problem.

Anyway, the actual algorithm used by the Lithium register allocator appears to be a version of linear-scan, with live range splitting. This part of V8 actually is well-commented, so if you can understand the Sarkar paper, then head on over to ye olde google code and check it out.

generate all the code!

The result of register allocation is that all LOperands have been assigned specific machine registers, and some moves have been recorded in the LGap instructions. Now, finally, we generate code: each Lithium instruction is traversed in order, and they all write assembly into a buffer. See the LAddI code generator, as an example.

All of the JavaScript engines generate code using assemblers written in C++. Some people still assume that compilers must generate assembly, in text, but it's much easier and faster just to call methods on an object that writes bytes into a buffer. This approach also makes it easy to add macro instructions, just by subclassing the assembler. Finally, since modern JS implementations have to muck around with the bits all the time (disassembly, source debugging, backtraces, garbage collection (tracing and moving), deoptimization, OSR, inline caches, type feedback), it's easiest to maintain your invariants if you control the bit generation.

V8's assembler originally came from the Strongtalk project, where some of the V8 hackers came from, long ago. This assembler writes code into memory, going forward, and then when it's finished, it appends "relocation information" to the code, written backwards. Relocation information lets V8 mark interesting places in the generated code: pointers that might need to be relocated (after garbage collection, for example), correspondences between the machine program counter and source locations, etc.

As anyone who's hacked on debuggers knows, the problem with debug information is that it is large. Without clever encodings, the amount of debugging data generated by a compiler can kill runtime performance, simply because of the amount of data that needs to be paged into memory. Of course this concern also applies at compile-time, and especially so for JIT compilers. The Strongtalk assembler's main contribution to V8 (I think) is a complicated bit-encoding that tries to compress this data as far as possible. The assembler also provides iterators for traversing this data at runtime.

finish line

And now, dear readers, I believe that we have come full circle on this look at V8. We started low, looking at the assembly generated by V8 for a simple loop. We elaborated this low-level view with a look at on-stack replacement, though without understanding how such information was available. So we jumped back to the top, taking a high-level look at V8's compilers. We drilled down to Hydrogen, and here we covered Lithium, register allocation, and code generation.

I hope it's been an entertaining series to read; it has certainly been enlightening to write. Thanks very much to my employer Igalia for supporting my work on this. Comments and corrections are welcome, as always, and happy hacking.

Syndicated 2011-09-05 11:25:05 from wingolog

the gnu extension language

I'm sitting on a train waiting to leave Paris for Barcelona, returning from the 2011 GNU Hackers Meeting. It was fantastic, clearly the best we have had yet. Thanks a lot to Ludovic Courtès for organizing it, and to IRILL and Sylvestre Ledru for hosting. I hope to write more about it in the future, but this essay will be long enough :)

I gave a talk entitled "The User in the Loop", which made the perhaps obvious argument that "extensibility is good & stuff". I hope that it did so in an entertaining and illuminating fashion. It also argued that Guile is a great fit for the extensibility needs of the GNU project. The video will be out shortly. Slides are available here, though you probably just want the notes instead.

going meta: goals and strategies

Guile is the GNU extension language. This is the case because Richard Stallman said so, 17 years ago. Beyond being a smart guy, Richard is powerfully eloquent: his "let there be Guile" proclamation was sufficient to activate the existing efforts to give GNU a good extension language. These disparate efforts became a piece of software, a community of hackers and users, and an idea in peoples' heads, on their lips, and at their fingertips.

So, Guile is a great language. But is it still the right extension language for GNU? At this GHM, Jim Blandy brought up the good point that we should revisit our past decisions periodically, to see if they still fit with our goals and with the world. "X is the GNU Y" is still a powerful, generative proclamation, so we should be careful with that power, to make sure that we are generating the right world. In that spirit, I would like to re-make the argument for Guile as the GNU extension language.

goal

Why care about extension languages? My presentation deals with this issue at length, but a nice summary can be found in the Guile manual:

Guile was conceived by the GNU Project following the fantastic success of Emacs Lisp as an extension language within Emacs. Just as Emacs Lisp allowed complete and unanticipated applications to be written within the Emacs environment, the idea was that Guile should do the same for other GNU Project applications. This remains true today.

The idea of extensibility is closely related to the GNU project's primary goal, that of promoting software freedom. Software freedom means that people receiving a software package can modify or enhance it to their own desires, including in ways that may not have occurred at all to the software's original developers. For programs written in a compiled language like C, this freedom covers modifying and rebuilding the C code; but if the program also provides an extension language, that is usually a much friendlier and lower-barrier-of-entry way for the user to start making their own changes.

-- Guile and the GNU Project

Although GNU maintainers are free to do as they wish with their packages, we are working on making an integrated system, and a system that mostly uses one extension language is better than one with many implementations. It is reasonable to promote one language as the default choice, while not forbidding others, of course.

I think that Guile is a great strategy to achieve this goal. Guile is good for the programs it extends, it is good for users, and it is good for GNU as a whole. The rest of this article argues these points in more detail.

languages do matter

An extension language is first and foremost a language: a vehicle for expression and for computation. Guile is an implementation of Scheme, one of the most expressive languages out there.

In his 2002 essay, Revenge of the Nerds, Paul Graham argues (somewhat polemically) that some languages are better than others, and that you should use the better languages. Of course he identifies "better" with "closer to Common Lisp", which is somewhat naïve, but the general point still stands.

At the time when Guile was started, it was quite clear that Scheme was indeed more expressive than Bourne shell, or than Tcl. Since then, many languages have come and gone, and most of those that have stuck around do have many of the features of Scheme: garbage collection, closures, lexical scope, etc.

But still, there are many areas in which Guile Scheme remains more powerful than other languages.

macros: language extensibility

First and foremost, Scheme's macros have no comparison in any other language. Macros let users extend their language. Macros allow a language be adapt to time, to the different needs of the future (unevenly distributed as it is).

I realize that this broad argument will not speak to people that do not use macros, so here is a concrete example.

Some languages have pattern-matching facilities, like Erlang. Pattern-matching goes like this: you have a datum, and a set of expected patterns. You go over the patterns in order, seeing if the datum matches the pattern. If it matches, some matching part of the datum is extracted and bound to local variables, and a consequent expression is evaluated.

With most languages, either you have pattern matching, because Joe Armstrong put it there, or you don't, in which case you are either blissfully ignorant or terminally frustrated. But in languages with macros, like Scheme, you can extend your programming language to give it pattern-matching features.

I'll show an implementation here, to keep things concrete. Here's a definition of a match macro. If you are unfamiliar with Scheme macros, do take a look at the relevant section in Guile's manual.

(define-syntax match
  (syntax-rules (else)
    ;; Use `pat' helper macro to see if a clause matches.
    ((match v (p e0 e ...) cs ...)
     (let ((fk (lambda () (match v cs ...))))
       (pat v p (begin e0 e ...) (fk))))
    ;; If we get to an `else' clause, evaluate its body.
    ((match v (else e0 e ...))
     (begin e0 e ...))))

There are two rules in this macro. The first rule hands off the real work to a helper macro, pat. The second just matches an else clause at the end.

(define-syntax pat
  (syntax-rules (_ unquote)
    ;; () succeeds if the datum is null, and fails
    ;;  otherwise.
    ((pat v () kt kf)
     (if (null? v) kt kf))
    ;; _ matches anything and succeeds.
    ((pat v _ kt kf)
     kt)
    ;; ,VAR binds VAR to the datum and succeeds.
    ((pat v (unquote var) kt kf)
     (let ((var v)) kt))
    ;; If the pattern is a pair, succeed if the datum
    ;; is a pair and its car and cdr match.
    ((pat v (x . y) kt kf)
     (if (pair? v)
         (let ((vx (car v)) (vy (cdr v)))
           (pat vx x (pat vy y kt kf) kf))
         kf))
    ;; Literals in a pattern match themselves.
    ((pat v lit kt kf)
     (if (eq? v (quote lit)) kt kf))))

The pat helper macro is written in so-called continuation-passing style. This means that pat receives the consequent expressions to evaluate as arguments: one if the pattern matches (kt), and the other if it doesn't (kf). I have commented the clauses to help readers that are unfamiliar with syntax-rules.

I hope that you find this implementation to be elegant, but that's not really the important thing. The important thing is that if we put this code in a module and export match, then you have provided pattern matching to a language that didn't have it. You, as a user, have extended your programming language.

This facility is quite useful. Let us consider the case of some HTML, which you have parsed to an SXML representation:

(define html '(a (@ (href "http://gnu.org/")) "GNU"))

We can use the match form to get the URL for this link:

(define (link-target x)
  (match x
    ((a (@ (href ,target)) . _) target)
    (else #f)))

(link-target html) => "http://gnu.org"
(link-target '(p "not a link")) => #f

Sometimes people say that macros make code hard to read, because (they say) everyone makes up their own language. I hope that this example demonstrates that this is not the case: link-target above is eminently more readable, robust, and efficient than rolling your own matcher by hand. Good programmers use macros to bring the language up to the level of the problem they are solving.

Note that you didn't need to build consensus in the ECMAScript committees to add match to your language. Nor do you have to be satisfied with Joe Armstrong's particular implementation of matching, as you do in Erlang. Guile's default matcher has a number of bells and whistles, but you are free to write custom matchers for other formats, extending matchers to protocols and binary records of various types.

Finally, this same argument applies to embedded parsers, domain-specific languages, dynamic binding, and a whole host of other things. You don't have to wait for a language like Python to finally add a with statement; you can add it yourself. To paraphrase Alan Perlis, Guile is an extensible extension language.

delimited continuations

Delimited continuations let a user extend their code with novel, expressive control structures that compose well with existing code.

I'm going to give another extended example. I realize that there is a risk of losing you in the details, dear reader, but the things to keep in mind are: (1) most other programming languages don't let you do what I'm about to show you; and (2) this has seriously interesting practical applications.

So, let's take coroutines as the concrete example. Guile doesn't define a standard coroutine abstraction yet, though it probably should. Here is one implementation. We begin by defining call-with-yield:

(define (call-with-yield proc)
  (let ((tag (make-prompt-tag)))
    (define (handler k . args)
      (define (resume . args)
        (call-with-prompt tag
                          (lambda () (apply k args))
                          handler))
      (apply values resume args))

    (call-with-prompt
     tag
     (lambda ()
       (let ((yield (lambda args
                      (apply abort-to-prompt tag args))))
         (proc yield)))
     handler)))

The guts of this function is the call-with-prompt call, which makes a yield procedure, and passes it to proc. If the yield is called, it will jump back up the stack, passing control to handler.

The handler builds a resume function, and returns it along with the values that were yielded. The user can then call the resume function to continue computation, possibly passing it some values.

If we would like to add some syntactic sugar to make this form easier to read, we can do so:

(define-syntax with-yield
  (syntax-rules ()
    ((_ yield exp exp* ...)
     (call-with-yield
      (lambda (yield) exp exp* ...)))))

As an example, here is a function that turns a traversal operator into a generator:

(define (generate traverse collection)
  (with-yield yield
    (traverse yield collection)))

Let's see a use of it.

> (generate for-each '(three blind mice))
$1 = #<procedure resume args>
$2 = three
> ($1)
$3 = #<procedure resume args>
$4 = blind
> ($3)
$5 = #<procedure resume args>
$6 = mice
> ($1)
$7 = #<procedure resume args>
$8 = blind
> ($5)
> 

This is fascinating stuff. Normally when you call for-each, you lose control of your execution. for-each is going to call your function as many times as it wants, and that's that. But here we have inverted control-flow so that we're back in charge again.

Note that we did this without any cooperation from for-each! In Python, if you want to compose a number of its generators into a coroutine, every procedure in the call stack must cooperate. In fact until recently this wasn't even possible to implement efficiently. But in Guile, delimited continuations allow us to remove the unnecessary restrictions that make yield from seem like a good idea. Our with-yield composes with the other parts of our language.

Also note that these generators are functional: instead of mutating an object, causing the next call to it to proceed, we return the resume procedure as a value. It's not always what you want, and indeed you could make different choices, but this does have the advantage that you can rewind your traversal to different points, as I did above by calling $1 after calling $3. You can use this strategy to implement other advanced control-flow operators, like amb.

One can use this approach in other inversion-of-control contexts. Node.js people are all excited about their asynchronous idioms, but they have no idea how much better it could be if JavaScript actually supported asynchronous idioms on a language level.

For more on delimited continuations, see the readable 1993 paper by Dorai Sitaram, Handling Control.

the next 700 language features

Guile has a number of other language features that other extension languages do not. Here's a short list.

goops

GOOPS is Guile's Object-Oriented Programming System. It implements OOP like Common Lisp's CLOS does.

What this means to most programmers is that GOOPS is a little different. In Smalltalk-style OOP, the focus is on classes. In Self-style OOP, the focus is on prototypes. But in CLOS-style OOP, the focus is on generic functions. A good introduction to this topic can be found in Gregor Kiczales' beautiful book, The Art of the Metaobject Protocol. At OOPSLA 1997, Alan Kay called it "the best book written in computing in ten years".

Anyway, what I mean to say here is Guile does support object-oriented programming, and although it is not the same as (for example) Python's model, it is quite capable and expressive. Some of its features, like class redefinition, are not found in any of the more common OO systems.

parallel computation

Guile supports POSIX threads, with no global interpreter lock. Coordination between threads is managed via the normal primitives like mutexes and conditional variables, though there are higher-level forms like with-mutex.

Pthreads provide concurrency. Higher-level constructs provide structured parallelism, like futures, for fine-grained parallelism, and other forms like par-map.

I'm going to go out on a limb here and say that managing concurrency is the major problem of programming today. Guile's existing facilities are OK, but are not as convincing as those of Clojure or D. Is STM the right answer? Or is it shared-nothing by default, like Erlang? I don't know, but I'm pretty sure that we can build the right answer in Guile, whatever the right answer is.

the numeric tower

Guile implements the full numeric tower: integers (of any precision), real numbers, rationals, complex numbers, and the strange IEEE-754 values like negative zero, not-a-number, and the infinities. Arithmetic works with them all. In addition, there are some plans afoot to add support for optional multi-precision floating-point and complex numbers.

I shouldn't have to say this, but in Guile, (/ 12 10) is not 1. Nor is it a double-precision approximation, for that matter; it is the rational number, 6/5.

data types

Guile comes with a wealth of built-in data types: bytevectors, uniform arrays (for example, a vector of packed unsigned 16-bit integers), multidimensional arrays (possibly of packed values), bitvectors, strings consisting of characters (as opposed to bytes), records, etc. There are also the standard Lisp data types like symbols, pairs, vectors, characters, and so on.

You can define your own data types, using GOOPS, using the record facility, or the lower-level structs. Of course you can build ad-hoc data structures using the other predefined data types; after all, as Alan Perlis says, "it is better to have 100 functions operate on one data structure than 10 functions on 10 data structures."

functional programming

Scheme is a multi-paradigm language, supporting many kinds of programming techniques. It has excellent support for the functional style, with its lexical scope, closures, and proper tail calls. Also, unlike their call/cc cousins, Guile's delimited continuations are practical to use in a functional style, as shown above in the generator example.

Finally, I should mention that all of these features are designed to work well together. Macros compose with modules to preserve lexical scoping. Delimited continuations can transfer control through functional combinators while preserving referential transparency. Dynamic bindings behave sanely when continuations are unwound and reinstated. Scheme is a language that "hangs together": its features are carefully chosen to produce sound semantics when used in combination.

implementation features

So much for the Scheme language, as implemented by Guile. On a related but not identical tack, I would like to argue that Guile's implementation is a good practical fit for GNU.

development experience

Guile has a fantastic developer experience. Two factors contribute to this. One is that it ships with a full-featured, pleasant REPL, with tab-completion, readline, and a debugger. The second factor is Guile's reflective capabilities, coupled with the excellent Geiser Emacs mode.

As Geiser hacker Jao says, when it comes to debugging, you can be a mortician or you can be a physician. Core dumps are corpses. The REPL is the land of the living. If you run Guile with the --listen argument, it will spawn off a REPL thread, listening on a port. Geiser can telnet into that port to work with your program as it is running.

See Using Guile in Emacs, for more on how to hook this up.

multiple languages

It's pretty clear to me that we're going to nail the Elisp / Scheme integration. BT Templeton has been doing some great work on that this summer, and we are closer than ever. Admittedly some folks have doubts about how we will pull this off, but I am convinced. I will have to defer detailed arguments to some other forum, though.

Now, when Richard first promoted Guile as the GNU extension language, I think it was because he loved Lisp, but he did consider that not every user would share this perspective. For that reason he stressed the multi-lingual characteristics of Guile, which until now have not really come to fruition. However it does seem that in addition to Scheme and Emacs, with time we can probably get Lua and JavaScript working well, to the point that users will feel happy making small- and medium-sized programs and extensions in these languages.

It's true that if we decided that our best strategy would be to make JavaScript hackers happy, then we should adopt V8 instead, or something like that. But I don't think that is our best strategy. I will come back to this point in the conclusion.

health

Guile is some 16 years old now, and time has winnowed its interface to something that's mostly clean and orthogonal. There are still some crufty pieces, but we have a good deprecation story, both for Scheme code and for C code linking to libguile.

Despite its maturity, Guile is a very healthy project. Check the Ohloh page for details, but note that the activity statistics are based on the master branch, whereas most commits have been landing on stable-2.0 since 2.0.0 was released in February. We'll move back into a development phase at some point in the next few months.

And on the level of users, things seem to be going well -- it's not the most-used language out there, but it is gaining traction and in respectability. The number of contributors is going up, and the various forums are active.

choosing a strategy to achieve our goals

I have argued, I hope convincingly, that Guile is a good choice for an extension language. Now I would like to put the decision itself in a context. As Noam Chomsky keeps reminding us, we need to judge decisions by their expected consequences, not their proclaimed intentions. Does Guile generate the right future for GNU?

Let me begin by arguing this point from the contrapositive: that choosing another language is not the right thing for GNU. Let us take the example of JavaScript, a very popular language these days. Jim Blandy notes that the real measure of an extension language is that, when you want to implement a feature, that you want to implement it in the extension language rather than the original language (C, usually). So let us assume that JavaScript triumphs in this regard in the future, that the GNU system ends up mostly implemented in JavaScript, perhaps via the V8 implementation. Is this the right future to generate?

For me, the answer is no. First of all, what does it mean to have a GNU system based on a non-GNU technology? What would be left of GNU? It might be lack of imagination on my part, but such a future makes me think of country folk come to the city and losing sight of who they are. GNU is a social organization that makes software, with particular social characteristics. Mixing GNU with a foreign language community sounds dangerous to GNU, to me. See Kent Pitman's Lambda, the Ultimate Political Party for similar musings.

Beyond the cultural issues -- which to me are the ones that really matter -- there are some important technical points to consider. Right now GNU is mostly written in C, and controls its implementation of C, and the libraries that it depends on. Would GNU be able to exercise such control on JavaScript? Can GNU build in new concurrency primitives? Could we add new numeric types? Probably not. JavaScript is a language with a platform, and that platform is the browser. Its needs are not very related to the rest of GNU software. And if we did adapt JavaScript to POSIX-like platforms, GNU would lose some of the advantages of choosing a popular language, by making a second-class, foreign platform.

We should also consider the costs of using hastily designed languages. JavaScript has some crazy bad stuff, like with, var hoisting, a poor numeric model, dynamic this scoping, lack of modularity regarding binding lookup (witness the recent spate of browser bugs regarding prototype lookup on user objects), the strange arguments object, and the lack of tail recursion. Then there are the useful features that JavaScript lacks, like macros, modules, and delimited continuations. Using JavaScript has a cost, and its total cost in GNU would have to be integrated over its lifespan, probably 20 years or so. These limitations might be acceptable now, but not in the context of the larger programs of the future. (This, by the way, was the essence of Richard's argument against Tcl.)

Finally, we have the lifespan issue. If GNU had chosen Tcl because it was popular, we would have a mass of dead code. (You can like Tcl and still admit that we are past its prime.) Python is now, I think, at the height of its adoption curve, and beginning its descent. JavaScript is the next thing, and still on the uptick. But JavaScript will fall out of favor, and there will be a new paradigm, and a next language. The future of computing will not be the same as the present. So how will JavaScript adapt to these changes? We can't tell right now, but given the difficulty in changing simple things like making 010 parse as 10 and not 8 indicates that at some point it will stagnate. But Scheme will still be with us, because its parts are well thought-out, and because it is a language that is inherently more adaptable.

Paul Graham concludes Revenge of the Nerds with the following advice:

"If you start a startup, don't design your product to please VCs or potential acquirers. Design your product to please the users. If you win the users, everything else will follow. And if you don't, no one will care how comfortingly orthodox your technology choices were."

In the case of a GNU extension language, our users are developers, so we do need to work hard to please them. But free software has a different life cycle than a VC-backed startup; instead of racing to the flip, we have to please our users now and in ten years. Viewed from the long now, the future is not simply a first-order extrapolation of the present: it is something that we generate. Guile is still the right choice for the GNU extension language because it generates the right future: a healthy, sustainable, powerful GNU project that adapts to the changing needs of time.

Syndicated 2011-08-30 18:15:29 from wingolog

to guadec!

I'm very happy to say that I'm going to Berlin on Friday, for GUADEC, the GNOME conference.

This will be my seventh GUADEC, and my first as an Igalian. I'm very much looking forward to seeing old friends, as well as to meeting new folks.

I'm especially looking forward to soaking in the energy of GNOME 3. GNOME 3 is the best user-facing project that Free Software has going right now -- it's a dynamic group of creative, talented people, making users' experiences with computers beautiful and natural.

The only other Free Software project that is focused on users in the same way is Android, and it is its own corporate kettle of fish. I hope that it's GNOME that ultimately wins, against the hoarders of the software world, from the workstation down to the phone, because GNOME brings with it our participatory culture of software as a means for human emancipation: that we can work together to make a better world. We've been working on this Free Software thing for some 30 years now; we're not a fad, or a hype: we are people working together, making our world. We can win, and humanity deserves it.

. . .

This year GNOME is again sharing a conference with KDE, another "desktop environment", to use the outdated term. I have pleasant memories of hanging out with KDE folk at the last joint conference we had, but I don't think it's healthy. Last time we were together because it was convenient for Nokia. Now that Nokia sold us both out, it doesn't make sense.

I do wish we could collaborate in meaningful ways -- and we do, in lower levels in the stack, and partly at the window manager level -- but we're working in the same space with two different cultures, and colocation ultimately dilutes the impact these conferences can have among the members of these two projects. So my KDE friends, let's have one last fling this weekend in Berlin, and thereafter see each other in Brussels, at FOSDEM :)

My seven GUADECs marks me as some kind of old fart now, I suppose, and it's true that my own work with GNOME is mostly in the past. I used to work on GStreamer, but don't anymore; Guile is mostly where I'm at, free-software-wise now. I wrote some bindings for the GNOME libraries back in the 2.x days but have not had the time to update them to the modern 3.x world. I fear that I'll just have to pass that project on, though there does not appear to be much hacker interest in maintaining API compatibility.

Anyway, as I said, I mostly work with Guile now. If you're interested at all in Guile, please pull me aside, and I'll be happy to talk your ear off ;-) I've also been doing work with the V8 JavaScript implementation recently, for work, and I'd also be happy to talk about that: compilers, optimization, and what-not. But kids, remember: JavaScript is a fad. Guile is the official scripting language of GNOME.1

See you in Berlin!

1;-)

Syndicated 2011-08-03 21:19:02 from wingolog

a closer look at crankshaft, v8's optimizing compiler

Continuing in my series of articles on V8, in this article I'd like to take a closer look at V8's optimizing compiler, with a focus on the Hydrogen intermediate language.

brief summary

As my readers might recall, Crankshaft is the marketing name of V8's optimizing compiler. It complements the "full" compiler by optimizing hot functions, while ignoring uncommon cases. It translates the JavaScript abstract syntax tree to a high-level static single-assignment (SSA) representation called Hydrogen, and tries to optimize that Hydrogen graph. Then the Hydrogen is translated to the machine-specific Lithium low-level language, which facilitates register allocation and, finally, code generation.

The point of the Hydrogen translation is threefold:

  1. To permit inlining. Inlining is truly the mother of all optimizations, in that it permits more optimizations to occur.

  2. To permit temporary values to be represented as untagged integer or double values. In JavaScript, you need type feedback to do this with any degree of success. Unboxing reduces memory pressure and permits Crankshaft to use efficient native machine instructions. But as you would expect, this unboxing needs to be done carefully.

  3. To facilitate loop-invariant code motion and common subexpression elimination.

The Hydrogen SSA representation facilitates all of these operations. This article takes a closer look at how it does so.

static single assignment


(click for accessible SVG)

Hydrogen consists of values--instructions and phis--contained in basic blocks, which themselves are contained in a graph.

The standard SSA notation associates a named value with every expression:

t1 = x + y;
t2 = 3;
t3 = t1 + t2;
...

It's useful to write it out this way, but these temporary names are not strictly necessary: since each instruction defines a temporary, we can use the identity (address) of each instruction as its name, for the purpose of SSA.

A later pass will allocate storage for all of these temporaries -- on the stack, on the heap, or unboxed in general-purpose or SSA/Neon registers. At first I was quite confused about the term "instruction", as we functional programmers tend to use "expression", but it's something you get used to after a while, and it's useful when you use it to refer to the named definition in addition to the expression that produces it.

An instruction is a kind of value -- an entity that defines a name. There other kind of value is the phi. Phis occur at control-flow joins, and are logically associated with the beginning of a block. My previous article discusses phis in detail.

All values have a pointer to their containing block and to their uses. A use is when a later instruction uses the value computed (defined) by an earlier instruction. Values also have a pointer to the next instruction in their block. It is cheaper to traverse instructions forward, in the direction of control and data flow, than it is to go backward. Values also have fields for their type, their representation, a flags bitfield, and a pointer to a range structure. We will look at these in more detail later.

The last instruction in a block is a control instruction. No one uses the value defined by a control expression -- their purpose is to transfer control to another basic block. The set of possible blocks that may follow a block in control-flow order are the successors of a block. Preceding blocks are predecessors. Every block has an array of pointers to its predecessors and successors, and also to its dominator block, and to all the blocks that it dominates.

Now, friends, when I first saw this I was disgusted. What a rat's nest of pointers! But then I read Kennedy's great Compiling With Continuations, Continued paper and it became clear that it is very difficult to have an efficient functional-style intermediate representation. The diagrams above are from a part of his paper in which he suggests a mutable IR. In my opinion, if he's gone that far, he might as well finish the job and eliminate temporary identifiers entirely, as Hydrogen does.

electrolysis

Crankshaft has a funny job to do. Not only does it have to optimize JavaScript, which is hard enough; it also has to do so in a way that permits on-stack replacement to optimize a long-running loop, without exiting the loop.

In practice what this means is that Crankshaft needs to walk the abstract syntax tree (AST) to emit Hydrogen code in the same way as the full compiler. It needs to simulate what the full compiler is doing: what is on the stack, which variables are heap-allocated and which are not, and, for named local variables, what HValue is bound to that variable. This simulation is kept in an environment.

This AST-to-Hydrogen translation process also handles phi insertion. So unlike the textbook dominance-frontier algorithm for optimal phi insertion, the HGraphBuilder simply adds phi values whenever a block has more than one predecessor, for every variable in the environment that is not always bound to the same value. For loops, a phi is added for every variable live in the environment at that time. A later pass attempts to prune the number of phis, where it can.

type feedback

The AST that was given to the crankshaft compiler has integer identifiers assigned to each node. These identifiers are assigned predictably, so they are the same identifiers that the "full-codegen" compiler saw, and they are embedded in the inline caches used by the full-codegen code. This correspondence allows crankshaft to know the types of the values that have been seen at any given call site, property access, etc.

This information is used by the graph builder to initialize the type field of each HValue, to help in type inference. Additionally, property accesses and function calls can record the precise type of object being accessed, or the function that was called, so that Crankshaft has an opportunity to inline these function and method calls.

inlining

When the graph builder reaches a call, it can try to inline it. Note that this happens very early, before any optimizations are made on the source code. This allows the optimizations to have more visibility, and allows for more code motion. (In general, V8 will not move instructions across a call boundary.)

There are lots of heuristics and conditions that affect inlining, and these are in flux. Some conditions that must currently be met for inlining to proceed are:

  1. The function to be inlined is less than 600 characters long.

  2. Neither the inner nor the outer functions may contain heap-allocated variables. (In practice, this means that neither function may contain lexical closures.)

  3. Inlining of for-in, with, and some other expression types is not currently supported.

  4. There is a limit on the maximum inlining depth.

  5. A function's call to itself will not be inlined.

  6. There is a limit to the number of nodes added to the AST.

source-to-source optimization passes

As parsing proceeds, Crankshaft generates specific HInstruction nodes and packs them into HBasicBlocks. There is a large set of about 120 instructions, which allows Crankshaft to reason precisely about the various operations and their representations.

At this point, inlining is complete, so Crankshaft moves to focus on representation of temporary values. The goal is to unbox as many values as possible, allowing the use of native machine instructions, and decreasing the allocation rate of tagged doubles. The Hydrogen SSA language directly facilitates this control-flow and data-flow analysis.

The following sections discuss the various source-to-source passes that are made on the Hydrogen IR. It's a lot of detail, but I'm sure that someone will find it useful.

block reordering

Crankshaft reorders the array of blocks in the HGraph to appear in reverse post-order, equivalent to a topological sort. In this way, iterating the array of blocks from beginning to end traverses them in data-flow order.

dominator assignment

The reverse post-order sort from the previous step facilitates calculation of the dominator for each block. The entry block dominates its successors. Other blocks are iterated over, in order, assigning them the dominator of their predecessors, or the common dominator of their predecessors if there is more than one.

mark dead subgraphs

It can often be the case that a function being optimized does not have full type information. Some branches in the original function may never have been executed, and so code on those branches does not have any type feedback. If the parser reaches such a branch -- and it will know at parse-time, because that's when it has the type-feedback information -- then it inserts an unconditional soft deoptimization, to force the collection of more type information, should that branch be reached in optimized code.

However, instead of causing a branch to a terminal deoptimization block, the parse continues, presumably because Crankshaft needs to proceed with its abstract interpretation of the AST. For that reason, the soft deoptimization appears in the instruction stream, not as the last instruction in the block.

To avoid the compiler wasting its time trying to optimize loops and move code in parts of the graph that have never been reached, this dead-subgraph pass puts a mark on all blocks dominated by a block containing an unconditional soft deoptimization. As you can see, this analysis is facilitated by the SSA dominator relation.

redundant phi elimination

The naïve phi placement algorithm detailed above can put phis where they are not needed. This pass eliminates phi values whose input is always the same, replacing the uses with the use of the value.

Like a number of phi-centered optimizations, elimination of one phi can expose other phis for elimination. For this reason, this and many other algorithms use a worklist.

In this case, all phis are placed on a work list. Then, while there are any phis on the worklist, a phi is taken off the list, and if it is eliminatable, it is replaced. Now here's the trick: if the phi was in turn used by any other phi, the phi uses are pushed onto the list, if they weren't there already. Iteration proceeds until the worklist is empty.

To keep the big-O order of worklist operations down, you need a flag for each element indicating whether the element is already in the list. The bit should either be in the objects being traversed, or in a separate bitvector. In this particular case -- and in this case only, as far as I can tell -- there is no such bit, so the algorithm can take too much time, in certain cases.

dead phi elimination, and phi collection

Some phi values are dead, having no real uses, neither direct nor indirect (through some other phi). This pass eliminates any phi that doesn't have a real use. It's somewhat like a garbage collection algorithm, and also uses a worklist.

Once this pass has run, we have minimized the set of phis, so we can collect them all into an array and assign them indices. These indices can be used in the future bitvectors, to contain worklist-driven algorithms.

As the phis are collected, if any undefined value is found to reach a phi, or arguments can reach a phi only via some inputs, then optimization is aborted entirely.

representation inference

Every HValue has a type field, representing a coarse approximation of the types seen at a particular AST node, and a representation field, representing a specific choice of how to represent a value. The goal, as I have said, is to represent temporaries as untagged int32 or double values. Type and representation are related, but they are treated separately in Crankshaft. This pass is about representation inference.

Firstly, Crankshaft looks at all of the phis, and sees which ones are connected. This is an N2 fixed-point iteration. Then, for each phi, if any operand of the phi or any connected phi may not be converted to an int32, Crankshaft pessimistically marks the phi as not convertible to an int32. This pass tries to prevent too many representation changes for phi values.

Finally, for all values with flexible representation--phis, Math.abs, and binary bitwise and arithmetic operations--Crankshaft determines the representation to use for that value. If the representation can be inferred directly from the inputs, then that representation is used. Otherwise, some heuristics are run on the representations needed by the use sites of the value. If all uses are untagged, then the representation will be untagged, as an int32 if possible, unless any use treats the value as a double.

The end result is that, hopefully, many values have been unboxed to int32 or double values.

range analysis

This pass looks at each value, and tries to determine its range, if Crankshaft was actually able to allocate a specialized representation for it. In practice this is mainly used for values represented as integers. It allows HInstructions to assert various properties about the instruction, such as lack of overflow, or lack of negative zero values. These properties can influence code generation.

The algorithm propagates constraints forward on all control paths, so that control instructions can produce different ranges for the same value on different control paths. Ranges reaching a phi node are unioned together.

type inference and canonicalization

I mentioned that representation was related to type. Here I should be more specific. In Crankshaft, types are logically associated with objects. One can treat any object as an array, for the purposes of a calculation, but that treatment does not coerce the object itself to another type. Representation is about how to use an object; type is about the object itself.

Type inference can help eliminate runtime checks. If we can infer that a certain value is not only a tagged value, but is not a smi (small integer--a fixnum), then we can access its map directly without having to do the smi check. If we are storing an element in an array, and we know it is a smi, we don't need to emit a write barrier. So, type inference is another forward control-flow propagation algorithm, with a fixed-point iteration for phis in loop headers.

After type inference is run, each instruction in the whole graph is asked to canonicalize itself, which basically eliminates useless instructions: HToInt32 of a value with int32 representation, HCheckNonSmi of a value with non-smi type, etc.

deoptimization marks

When can you coerce undefined to an unboxed double? When you need a double for math, but you aren't going to eventually compare it to another double. The problem is that undefined coerces to NaN, but undefined == undefined and NaN != NaN. So, crankshaft marks any value used by a numeric compare as deoptimize-on-undefined. If the value is a phi, all connected phis must check for undefined, which can be a pretty big lose. But hey, that's JavaScript!

representation changes

Up to this point we've been dealing with values on a fairly high level, assuming that all of the definitions will be compatible with the uses. But as we chose to represent some temporaries as untagged values, there will need to be explicit conversions between representations -- hopefully not too many, but they will be needed. This pass inserts these representation changes.

This pass uses the truncating-to-int32 and deoptimize-on-undefined flags calculated in previous phases, to tell the changes what conditions they need to watch out for. Representation changes for phis happen just before branching to the block with the phi -- effectively, just before calling the block.

minus zero checks

The previous pass probably inserted a number of HChange instructions. For any HChange from int32 (which must be to tagged or double), Crankshaft traces dataflow backwards to mark the HChange(s) that produced the int32 as deoptimizing on negative zero. This allows parts of the graph to use fast int32 operations, while preserving the semantics of -0 should it appear, albeit more slowly (via a deoptimization).

stack check elimination

Loops need to be interruptable, and V8 does so by placing a stack check at the beginning of each loop iteration. If the runtime wants to interrupt a loop, it resets the stack limit of the process, and waits for the process's next stack check.

However if a call dominates all backward branches of a loop, then the loop can be interrupted by the stack check in the callee's prologue, so the stack check in the loop itself is unneeded and can be removed.

global value numbering: licm

I've been trying to get here for all this long page. Thanks to readers that have kept with it!

At this point, Hydrogen has served two of its three purposes: inlining and unboxing. The third is loop-invariant code motion (LICM) and common subexpression elimination (CSE). These important optimizations are performed by the perhaps misnamed global value numbering (GVN) pass.

LICM is about code motion: moving code out of hot loops. So what prevents an HInstruction from being moved? Two things: data dependencies, and side effects.

The LICM algorithm takes care of both of these. For every instruction in a loop, if it only depends on values defined outside the loop, and the side effects of the loop as a whole do not prevent the move, then LICM hoists the instruction to the loop header. If you do this in reverse post-order, you can hoist whole chains of expressions out of loops.

Every hydrogen instruction has a bitfield indicating the side effects that it causes, and a corresponding bitfield for the side effects that can affect it. In addition there is a flag indicating whether code motion is allowed for that instruction or not. These flags constitute a simple effects analysis -- if instruction B directly follows A, and A does not cause any side effects that B depends on, and B does not use the value from A, then B may be moved up before A.

Since the effects fields are simple bitfields, one can easily compute the side effects of a whole block by iterating over the instructions, doing a bitwise OR at each step. This is what LICM does -- it computes the side effects for the entire loop, and moves any instructions that it can to the loop header.

global value numbering: cse

Global value numbering, as implemented in V8, uses this technique in the opposite direction: instead of trying to hoist expressions up, it propagates them down the control-flow graph, trying to kill common subexpressions. A common subexpression is a Hydrogen value B that is "equal" in some sense to a previous instruction A, and whose value will not be affected by any side effects that occur on the control-flow paths between A and B. In this context, "value numbering" essentially means "define appropriate hash and equality functions and stick all the instructions you've seen in a hash table."

As GVN goes through the instructions of each block, in reverse-post-order, it puts them into this hash table (HValueMap). When GVN reaches an instruction that causes side effects, it flushes out the entries that depend on those side effects from the table. In that way, if ever it sees an instruction B that is equal to an instuction A in the table, GVN can remove B from its block, and replace every use of B with a use of A.

In standard SSA terminology you can consider instructions to be "definitions", because each instruction defines a value. Global value numbering works across basic blocks, hence the "global" in its name. Specifically, it eliminates equivalent definitions. One of the invariants of SSA is that "definitions dominate uses". So the part of the graph that should be searched for equivalent definitions is the part that is dominated by the original definition.

checked value replacement

In optimized code, accessing an array out of its bounds will cause a deoptimization. So the AST-to-Hydrogen translation will not emit a use of an index directly -- it inserts a bounds check, and has the array access use the result of the bounds check. But the purpose of this instruction is just to cause the side effect of deoptimization, and the resulting value is just a copy of the incoming index, which increases register allocation pressure for no reason. So now that code motion is complete, Crankshaft replaces uses of HBoundsCheck with the instruction that defines the index value, relying on the HBoundsCheck's position in the instruction stream to cause deopt as needed.

conclusions

In the end, the practical goals of Crankshaft are simple, and the optimizations are too, for the most part. Java people like to trot out all their gerundic compiler passes (see slide 7) and declare victory, and it is true that V8 and other modern JavaScript implementations will be beaten by the HotSpot JVM -- once it is allowed to warm up, of course. (V8's 5ms startup is particularly impressive, here.) But Crankshaft definitely has the basics, and throws in some range analysis to squeeze out a few more percent on top.

Finally, if you've gotten here, thanks again for reading this far. Comments and corrections are most welcome.

I have enjoyed being able to do this serious code reading, and some small bit of V8 hacking, as part of my work at Igalia. I think I'll have one more article about the platform-specific Lithium language, and its register allocator, and then see where I should move on from there. Happy hacking to all, and to all a good hack!

Syndicated 2011-08-02 21:21:01 from wingolog

static single assignment for functional programmers

Friends, I have an admission to make: I am a functional programmer.

By that I mean that lambda is my tribe. And you know how tribalism works: when two tribes meet, it's usually to argue and not to communicate.

So it is that I've been well-indoctrinated in the lore of the lambda calculus, continuation-passing style intermediate languages, closure conversion and lambda lifting. But when it comes to ideas from outside our tribe, we in the lambda tribe tune out, generally.

At the last Scheme workshop in Montreal, some poor fellow had the temerity to mention SSA on stage. (SSA is what the "machine tribe" uses as an intermediate langauge in their compilers.) I don't think the "A" was out of his mouth before Olin Shivers' booming drawl started, "d'you mean CPS?" (CPS is what "we" use.) There were titters from the audience, myself included.

But there are valuable lessons to be learned from SSA language and the optimizations that it enables, come though it may from another tribe. In this article I'd like to look at what the essence of SSA is. To do so, I'll start with explaining the functional programming story on intermediate languages, as many of my readers are not of my tribe. Then we'll use that as a fixed point against which SSA may be compared.

the lambda tribe in two sentences

In the beginning was the lambda. God saw it, realized he didn't need anything else, and stopped there.


functional programmers got god's back

Hey, it's true, right? The lambda-calculus is great because of its expressivity and precision. In that sense this evaluation is a utilitarian one: the lambda-calculus allows us to reason about computation with precision, so it is worth keeping around.

I don't think that Church was thinking about digital computers when he came up with the lambda-calculus back in the 1930s, given that digital computers didn't exist yet. Nor was McCarthy thinking about computers when he came up with Lisp in the 1960s. But one of McCarthy's students did hack it up, and that's still where we are now: translating between the language of the lambda-calculus and machine language.

This translation process is compilation, of course. For the first 20 years or so of practicing computer science, compilers (and indeed, languages) were very ad-hoc. In the beginning they didn't exist, and you just wrote machine code directly, using switches on a control panel or other such things, and later, assembly language. But eventually folks figured out parsing, and you get the first compilers for high-level languages.

I've written before about C not being a high-level assembly language, but back then, FORTRAN was indeed such a language. There wasn't much between the parser and the code generator. Everyone knows how good compilers work these days: you parse, you optimize, then you generate code. The medium in which you do your work is your intermediate language. A good intermediate language should be simple, so your optimizer can be simple; expressive, so that you can easily produce it from your source program; and utilitarian, in that its structure enables the kinds of optimizations that you want to make.

The lambda tribe already had a good intermediate language in this regard, in the form of the lambda-calculus itself. In many ways, solving a logic problem in the lambda-calculus is a lot like optimizing a program. Copy propagation is beta-reduction. Inlining is copy propagation extended to lambda expressions. Eta-conversion of continuations eliminates "forwarding blocks" -- basic blocks which have no statements, and just jump to some other continuation. Eta-conversion of functions eliminates functional trampolines.

continuation-passing style

But I'm getting ahead of myself. In the lambda tribe, we don't actually program in the lambda-calculus, you see. If you read any of our papers there's always a section in the beginning that defines the language we're working in, and then defines its semantics as a translation to the lambda-calculus.

This translation is always possible, for any programming language, and indeed Peter Landin did so in 1965 for Algol. Landin's original translations used his "J operator" to capture continuations, allowing a more direct translation of code into the lambda-calculus.

I wrote more on Landin, letrec, and the Y combinator a couple of years ago, but I wanted to mention one recent paper that takes a modern look at J, A Rational Deconstruction of Landin's J Operator. This paper is co-authored by V8 hacker Kevin Millikin, and cites work by V8 hackers Mads Sig Ager and Lasse R. Nielsen. Indeed all three seem to have had the privilege of having Olivier Danvy as PhD advisor. That's my tribe!

Anyway, J was useful in the context of Landin's abstract SECD machine, used to investigate the semantics of programs and programming languages. However it does not help the implementor of a compiler to a normal machine, and intermediate languages are all about utility. The answer to this problem, for functional programmers, was to convert the source program to what is known as continuation-passing style (CPS).

With CPS, the program is turned inside out. So instead of (+ 1 (f (+ 2 3))), you would have:

lambda return
  let c1 = 1
    letcont k1 t1 = _+ return c1 t1
      letcont k2 t2 = f k1 t2
        let c2 = 2, c3 = 3
          _+ k2 c2 c3

Usually the outer lambda is left off, as it is implicit. Every call in a CPS program is a tail call, for the purposes of the lambda calculus. Continuations are explicitly represented as lambda expressions. Every function call or primitive operation takes the continuation as an argument. Papers in this field usually use Church's original lambda-calculus notation instead of the ML-like notation I give here. Continuations introduced by a CPS transformation are usually marked as such, so that they can be efficiently compiled later, without any flow analysis.

Expressing a program in CPS has a number of practical advantages:

  • CPS is capable of expressing higher-order control-flow, for languages in which functions may be passed as values.

  • All temporary values are named. Unreferenced names represent dead code, or code compiled for effect only. Referenced names are the natural input to a register allocator.

  • Continuations correspond to named basic blocks. Their names in the source code correspond to a natural flow analysis simply by tracing the definitions and uses of the names. Flow analysis enables more optimizations, like code motion.

  • Full beta-reduction is sound on terms of this type, even in call-by-value languages.

Depending on how you implement your CPS language, you can also attach notes to different continuations to help your graph reduce further: this continuation is an effect context (because its formal parameter is unreferenced in its body, or because you knew that when you made it), so its caller can be processed for effect and not for value; this one is of variable arity (e.g. can receive one or two values), so we could jump directly to the right handler, depending on what we want; etc. Guile's compiler is not in CPS right now, but I am thinking of rewriting it for this reason, to allow more transparent handling of control flow.

Note that nowhere in there did I mention Scheme's call-with-current-continuation! For me, the utility of CPS is in its explicit naming of temporaries, continuations, and its affordances for optimization. Call/cc is a rare construct in Guile Scheme, that could be optimized better with CPS, but one that I don't care a whole lot about, because it's often impossible to prove that the continuation doesn't escape, and in that case you're on the slow path anyway.

So that's CPS. Continuations compile to jumps within a function, and functions get compiled to closures, or labels for toplevel functions. The best reference I can give on it is Andrew Kennedy's 2007 paper, Compiling With Continuations, Continued. CWCC is a really fantastic paper and I highly recommend it.

a digression: anf

CPS fell out of favor in the nineties, in favor of what became known as Administrative Normal Form, or ANF. ANF is like CPS except instead of naming the continuations, the code is left in so-called "direct-style", in which the continuations are implicit. So my previous example would look like this:

let c2 = 2, c3 = 3
  let t2 = + 2 3
    let t1 = f t1
      let c1 = 1
        + c1 t1

There are ANF correspondences for CPS reductions, like the beta-rule. See the Essence of Compiling With Continuations paper, which introduced ANF and sparked the decline of the original CPS formulation, for more.

This CPS-vs-ANF discussion still goes on, even now in 2011. In particular, Kennedy's CWCC paper is quite compelling. But the debate has been largely mooted by the advances made by the machine tribe, as enabled by their SSA intermediate language.

the machine tribe in two sentences

In the beginning was the
Segmentation fault (core dumped)

(Just kidding, guys & ladies!)

Instead of compiling abstract ideas of naming and control to existing hardware, as the lambda tribe did, the machine tribe took as a given the hardware available, and tries to expose the capabilities of the machine to the programmer.

The machine tribe doesn't roll with closures, continuations, or tail calls. But they do have loops, and they crunch a lot of data. The most important thing for a compiler of a machine-tribe language like C is to produce efficient machine code for loops.

Clearly, I'm making some simplifications here. But if you look at a machine-tribe language like Java, you will be dealing with many control-flow constructs that are built-in to the language (for, while, etc.) instead of layered on top of recursion like loops in Scheme. What this means is that large, important parts of your program have already collapsed to a first-order control-flow graph problem. Layering other optimizations on top of this like inlining (the mother of all optimizations) only expands this first-order flow graph. More on "first-order" later.

So! After decades of struggling with this problem, after having abstracted away from assembly language to three-address register transfer language, finally the machine folks came up with something truly great: static single-assignment (SSA) form. The arc here is away from the machine, and towards more abstraction, in order to be able to optimize better, and thus generate better code.

It's precisely for this utilitarian reason that SSA was developed. Consider one of the earliest SSA papers, Global Value Numbers and Redundant Comparisons by Rosen, Wegman, and Zadeck. Rosen et al were concerned about being able to move invariant expressions out of loops, extending the "value numbering" technique to operate across basic blocks. But the assignment-oriented intermediate languages that they had been using were getting in the way of code motion.

To fix this issue, Rosen et al switched from the assignment-oriented model of the machine tribe to the binding-oriented model of the lambda tribe.

In SSA, variables are never mutated (assigned); they are bound once and then left alone. Assignment to a source-program variable produces a new binding in the SSA intermediate language.

For the following function:

function clamp (x, lower, upper) {
  if (x < lower)
    x = lower;
  else if (x > upper)
    x = upper;
  return x;
}  

The SSA translation would be:

entry:
  x0, lower0, upper0 = args;
  goto b0;

b0:
  t0 = x0 < lower0;
  goto t0 ? b1 : b2;

b1:
  x1 = lower0;
  goto exit;

b2:
  t1 = x0 > upper0;
  goto t1 ? b3 : exit;
  
b3:
  x2 = upper0;
  goto exit;
  
exit:
  x4 = phi(x0, x1, x2);
  return x4;

SSA form breaks down a procedure into basic blocks, each of which ends with a branch to another block, either conditional or unconditional. Usually temporary values receive their own names as well, as it facilitates optimization.

phony functions

The funny thing about SSA is the last bit, the "phi" function. Phi functions are placed at control-flow joins. In our case, the value of x may be proceed from the argument or from the assignment in the first or second if statement. The phi function indicates that.

But you know, lambda tribe, I didn't really get what this meant. What is a phi function? It doesn't help to consider where the name comes from, that the original IBM compiler hackers put in a "phony" function to merge the various values, but considered that "phi" was a better name if they wanted to be taken seriously by journal editors.

Maybe phi functions are intuitive to the machine tribe; I don't know. I doubt it. But fortunately there is another interpretation: that each basic block is a function, and that a phi function indicates that the basic block has an argument.

Like this:

entry x lower upper =
  letrec b0 = let t0 = x0 < lower
                if t0 then b1() else b2()
         b1 = let x = lower
                exit(x);
         b2 = let t1 = x > upper0
                if t1 then b3() else exit()
         b3 = let x = upper
                exit(x);
         exit x = x
    b0()

Here I have represented basic blocks as named functions instead of labels. Instead of phi functions, we allow the blocks to take a number of arguments; the call sites determine the values that the phi function may take on.

Note that all calls to blocks are tail calls. Reminds you of CPS, doesn't it? For more, see Richard Kelsey's classic paper, A Correspondence Between Continuation-Passing Style and Static Single Assignment Form, or my earlier article about Landin, Steele, letrec, and labels.

But for a shorter, readable, entertaining piece, see Appel's SSA is Functional Programming. I agree with Appel that we in the lambda-tribe get too hung up on our formalisms, when sometimes the right thing to do is draw a flow-graph.

so what's the big deal?

If it were only this, what I've said up to now, then SSA would not be as interesting as CPS, or even ANF. But SSA is not just about binding, it is also about control flow. In order to place your phi functions correctly, you need to build what is called a dominator tree. One basic block is said to dominate another if all control paths must pass through the first before reaching the second.

For example, the entry block always dominates the entirety of a function. In our example above, b0 also dominates every other block. However though b1 does branch to exit, it does not dominate it, as exit may be reached on other paths.

It turns out that you need to place phi functions wherever a definition of a variable meets a use of the variable that is not strictly dominated by the definition. In our case, that means we place a phi node on exit. The dominator tree is a precise, efficient control-flow analysis that allows us to answer questions like this one (where do I place a phi node?).

For more on SSA and dominators, see the very readable 1991 paper by Cytron, Ferrante, Rosen, Wegman, and Zadeck, Efficiently Computing Static Single Assignment Form and the Control Dependence Graph.

Typical implementations of SSA embed in each basic block pointers to the predecessors and successors of the blocks, as well as the block's dominators and (sometimes) post-dominators. (A predecessor is a block that precedes the given node in the control-flow graph; a successor succeeds it. A post-dominator is like a dominator, but for the reverse control flow; search the tubes for more.) There are well-known algorithms to calculate these links in linear time, and the SSA community has developed a number of optimizations on top of this cheap flow information.

In contrast, the focus in the lambda tribe has been more on interprocedural control flow, which -- as far as I can tell -- no one does in less than O(N2) time, which is, as my grandmother would say, "just turrible".

I started off with a mention of global value numbering (GVN) on purpose. This is still, 20+ years later, the big algorithm for code motion in JIT compilers. HotSpot C1 and V8 both use it, and it just landed in IonMonkey. GVN is well-known, well-studied, and it works. It results in loop-invariant code motion: if an invariant definition reaches a loop header, it can be hoisted out of the loop. In contrast I don't know of anything from the lambda tribe that really stacks up. There probably is something, but it's certainly not as well-studied.

why not ssa?

Peoples of the machine tribe, could you imagine returning a block as a value? Didn't think so. It doesn't make sense to return a label. But that's exactly what the lambda-calculus is about. One may represent blocks as functions, and use them as such, but one may also pass them as arguments and return them as values. Such blocks are of a higher order than the normal kind of block that is a jump target. Indeed it's the only way to express recursion in the basic lambda calculus.

That's what I mean when I say that CPS is good as a higher-order intermediate language, and when I say that SSA is a good first-order intermediate language.

If you have a fundamentally higher-order language, one in which you need to loop by recursion, then you have two options: do whole-program analysis to aggressively closure-convert your program to be first-order, and then you can use SSA, or use a higher-order IL, and use something more like CPS.

MLton is an example of a compiler that does the former. Actually, MLton's SSA implementation is simply lovely. They do represent blocks as functions with arguments instead of labels and phi functions.

But if you can't do whole-program analysis -- maybe because you want to extend your program at runtime, support separate compilation, or whatever -- then you can't use SSA as a global IL. That's not to say that you shouldn't identify first-order segments of your program and apply SSA-like analysis and optimization on them, of course! That's really where the lambda tribe should go.

finally

I wrote this because I was in the middle of V8's Crankshaft compiler and realized I didn't understand some of the idioms, so I went off to read a bunch of papers. At the same time, I wanted to settle the CPS-versus-ANF question for my Guile work. (Guile currently has a direct-style compiler, for which there are precious few optimizations; this fact is mostly a result of being difficult to work with the IL.)

This post summarizes my findings, but I'm sure I made a mistake somewhere. Please note any corrections in the comments.

Syndicated 2011-07-12 16:39:50 from wingolog

370 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!