
effects analysis in guile

OK kids, so I had a bit of time recently and have been hacking on Guile's new CPS-based compiler slated for stable release in a few months. I have a few things to write about but today's article is on effects analysis.

what it is, yo

The job of the optimization phase of a compiler is to remove, replace and reorder the sub-expressions of a program. The optimizer recognizes ways in which the program can be better, and then needs to check if those transformations are valid. A transformation is valid if a program exhibits the same behavior both before and after the transformation.

Effects analysis is a proof technique that builds a conservative model of how expressions can affect each other. It can be used to prove the validity of some transformations. For example, if we determine that an expression A reads the first field in a vector, and expression B sets the second field in a vector, then we can use effects analysis to show that the expressions don't affect each other's values and can be freely reordered, for example to hoist either one out of a loop.

In effects analysis, we model the program state coarsely and conservatively with a limited set of categories. The precise set of categories depends on the domain. In graphics, for example, you might have a state bit for the current coordinate system, the current color, the current fill mode, etc. In general-purpose compilation, the effects are mostly about memory and (sometimes) exceptions.

In Guile we model four kinds of effects: type checks (T), allocations (A), reads (R), and writes (W). Each of these effects is allocated to a bit. An expression can have any or none of these effects.

For the purposes of Guile, a "type check" is the possibility that this expression will throw an exception if its arguments are somehow out of range. For example, cons will never throw an exception, except in out-of-memory situations, but + may throw an exception if its arguments aren't numbers. However, if an expression is dominated by a copy of itself -- that is, if we see that (+ a b) (which may throw an exception) is dominated by (+ a b), perhaps after CSE -- then we can determine that only the first will exhibit the type-check effect, and that the second will not.
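
To make this concrete, consider this illustrative function (my example, not Guile's code). After CSE determines that the second addition is dominated by the first, only the first carries the type-check effect:

(define (sum-twice a b)
  (let* ((x (+ a b))   ;; may throw: a or b might not be a number
         (y (+ a b)))  ;; dominated copy: its type check is redundant
    (cons x y)))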

Allocation indicates that an expression allocates a fresh object. In Scheme (and many other languages), each allocated object has its own identity: (eq? (cons 1 2) (cons 1 2)) must be false. Note that in Scheme this restriction does not apply to constant literals; (eq? '(1 . 2) '(1 . 2)) may or may not be true. In Guile both results are possible: the literals are the same object when compiled (and thus deduplicated), but different when interpreted, as then they are just the objects returned from `read', which are distinct. Anyway, we use the allocation bit to indicate that an expression allocates a fresh object with a fresh identity.

The remaining effect kinds are "read" and "write", which indicate reads or writes to memory. So there are just 4 kinds of effects.

Allocation, read, and write effects are associated at run-time with particular memory locations. At compile-time we cannot in general know which of these locations are distinct, and which are actually the same. To simplify the problem, we simply record the type of the object, knowing that (say) a pair and a vector will never be at the same location. We also record the field in the object in the case of objects with multiple fields. There are special catch-all values to indicate "all memory kinds", when we really don't know what an expression will do (which is the case for all expression kinds without specific support in the effect analyzer), and for "all fields" when we don't know which field we are accessing.
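
The actual bit assignments live in Guile's effects-analysis module; purely to make the bit-twiddling below concrete, a compatible but hypothetical layout might look like this (these values are illustrative, not Guile's):

;; Hypothetical encoding, for illustration only.  The low bits hold
;; the effect kinds; above them, a shifted-off payload holds the
;; memory kind in its low byte, and the field (negative meaning
;; "the whole object") above that.
(define &type-check (ash 1 0))
(define &allocation (ash 1 1))
(define &read (ash 1 2))
(define &write (ash 1 3))
(define &effect-kind-bits 4)
(define &memory-kind-mask #xff)
(define &unknown-memory-kinds #xff)  ;; catch-all: "all memory kinds"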

One example of the use of this analysis is in common subexpression elimination (CSE). If at a program point P we have a set of available expressions A, then to determine which members of A are still available after the expression at P, you subtract members of A that are clobbered by P. Computation of A at each P plus value numbering is most of CSE; but more on that in some later dispatch. Anyway, here's the definition of effect-clobbers?.

(define (effect-clobbers? a b)
  "Return true if A clobbers B.  This is the case
if A might be a write, and B might be a read or a
write to the same location as A."
  (define (locations-same?)
    (let ((a (ash a (- &effect-kind-bits)))
          (b (ash b (- &effect-kind-bits))))
      (or (eqv? &unknown-memory-kinds
                (logand a &memory-kind-mask))
          (eqv? &unknown-memory-kinds
                (logand b &memory-kind-mask))
          (and (eqv? (logand a &memory-kind-mask)
                     (logand b &memory-kind-mask))
               ;; A negative field indicates "the
               ;; whole object".  Non-negative fields
               ;; indicate only part of the object.
               (or (< a 0) (< b 0) (= a b))))))
  (and (not (zero? (logand a &write)))
       (not (zero? (logand b (logior &read &write))))
       (locations-same?)))
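
As a sketch of how this predicate slots into CSE (the helper and its data layout are illustrative, not Guile's actual code): given the effects of the expression at P, prune the available-expression set.

;; AVAILABLE is an alist mapping expressions to their effects;
;; P-EFFECTS are the effects of the expression at point P.  Keep
;; only the entries that P does not clobber.
(define (prune-available available p-effects)
  (filter (lambda (entry)
            (not (effect-clobbers? p-effects (cdr entry))))
          available))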

This analysis is simple, small, and fast. It's also coarse and imprecise -- if you are reading from and writing to two vectors at once, you're almost sure to miss some optimization opportunities, as accesses to all vectors are conflated into one bit. Oh well. If you get into this situation, you'll know it, and be able to invest a bit more time into alias analysis; there's lots of literature out there. A simple extension would be to have alias analysis create another mapping from expression to equivalence class, and to use those equivalence classes in the locations-same? check above.

Of course this assumes that expressions just access one object. This is the case for Guile's CPS intermediate language, because in CPS, as in SSA or ANF, expressions don't have subexpressions.

This contrasts with direct-style intermediate languages, in which expressions may have nested subexpressions. If you are doing effects analysis on such a language, it's probably more appropriate to allocate a bit to each kind of effect on each kind of object, so that you can usefully union effects for a tree of expressions, as in the sketch below. Since we don't have to do this for CPS, we can spend a fixed bit-budget on more precision as to which field of an object is being accessed. One of the problems with Guile's old CSE pass was that its direct-style IL made it impossible to be this precise.
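
Such a union might look like this; local-effects and subexpressions are hypothetical helpers for some direct-style IL, not Guile functions:

(use-modules (srfi srfi-1))

;; In a direct-style IL, an expression's effects are its own local
;; effects unioned (bitwise-or) with those of all its subexpressions.
(define (expression-effects exp)
  (fold logior
        (local-effects exp)
        (map expression-effects (subexpressions exp))))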

Finally, a note about type checks. Guile includes type checks as part of the effects analysis for two reasons. The first is the obvious case of asking whether an expression is effect-free or not, which can lead to some optimizations in other parts of the compiler. The other is to express the potential for elision of duplicate expressions, if one dominates the other. But it's also possible to remove type checks in more cases than that: one can run a type inference pass to remove type-check effects if we can prove that the arguments to an expression are in range. Obviously this is more profitable for dynamically-typed languages, but the same considerations apply to any language with sum types.
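
As a small illustration of that last point (again a made-up example): the + below can lose its type-check effect, because vector-length always returns a non-negative fixnum, so type inference can prove both arguments are numbers in range.

(define (total-length v w)
  (+ (vector-length v) (vector-length w)))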

Guile's effects analysis pass is here. V8 seems to have two effects analysis passes; one is in effects.h and typing.cc, and operates over the direct-style AST, and the other is in the value numbering pass (hydrogen-instructions.h and hydrogen-gvn.h; search for GVNFlag).

More on how effects analysis is used in Guile in a future missive. Until then, happy hacking.

Syndicated 2014-05-18 19:19:29 from wingolog

stack overflow

Good morning, gentle hackers. Today's article is about stack representation, how stack representations affect programs, what it means to run out of stack, and that kind of thing. I've been struggling with the issue for a while now in Guile and finally came to a nice solution. But I'm getting ahead of myself; read on for some background on the issue, and details on what Guile 2.2 will do.

stack limits

Every time a program makes a call that is not a tail call, it pushes a new frame onto the stack. Returning a value from a function pops the top frame off the stack. Stack frames take up memory, and as nobody has an infinite amount of memory, deep recursion could cause your program to run out of memory. Running out of stack memory is called stack overflow.

Most languages have a terrible stack overflow story. For example, in C, if you use too much stack, your program will exhibit "undefined behavior". If you are lucky, it will crash your program; if you are unlucky, it could crash your car. It's especially bad in C, as you neither know ahead of time how much stack your functions use, nor the stack limit imposed by the user's system, and the stack limit is often quite small relative to the total memory size.

Things are better, but not much better, in managed languages like Python. Stack overflow is usually assumed to throw an exception (though I couldn't find the specification for this), but actually making that happen is tricky enough that simple programs can cause Python to abort and dump core. And like C, Python and most other dynamic languages have a fixed stack size limit that is usually much smaller than the heap.

Arbitrary stack limits would have an unfortunate effect on Guile programs. For example, the following implementation of the inner loop of map is clean and elegant:

(define (map f l)
  (if (pair? l)
      (cons (f (car l))
            (map f (cdr l)))
      '()))

However, if there were a stack limit, that would limit the size of lists that can be processed with this map. Eventually, you would have to rewrite it to use iteration with an accumulator:

(define (map f l)
  (let lp ((l l) (out '()))
    (if (pair? l)
        (lp (cdr l) (cons (f (car l)) out))
        (reverse out))))

This second version is sadly not as clear, and it also allocates twice as much heap memory (once to build the list in reverse, and then again to reverse the list). You would be tempted to use the destructive linear-update reverse! to save memory and time, but then your code would not be continuation-safe -- if f returned again after the map had finished, it would see an out list that had already been reversed. (If you're interested, you might like this little Scheme quiz.) The recursive map has none of these problems.
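For concreteness, that tempting reverse! variant would be (a sketch): reverse! mutates the accumulated list in place, which is why a continuation captured in f and resumed after the map returns would see out already destructively reversed.

(define (map f l)
  (let lp ((l l) (out '()))
    (if (pair? l)
        (lp (cdr l) (cons (f (car l)) out))
        (reverse! out))))
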

a solution?

Guile 2.2 will have no stack limit for Scheme code.

When a thread makes its first Guile call, a small stack is allocated -- just one page of memory. Whenever that memory limit would be reached, Guile arranges to grow the stack by a factor of two.
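
The core of the growth policy is just a doubling loop, something like this sketch (the real implementation lives in Guile's C runtime):

;; Double the stack size until the frame we need to push fits.
(define (new-stack-size current-size needed-size)
  (let lp ((size current-size))
    (if (< size needed-size)
        (lp (* size 2))
        size)))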

Ideally, stack growth happens via mremap, and ideally at the same address in memory, but it might happen via mmap or even malloc of another memory block. If the stack moves to a different address, we fix up the frame pointers. Recall that right now Guile runs on a virtual machine, so this is a stack just for Scheme programs; we'll talk about the OS stack later on.

Being able to relocate the stack was not an issue for Guile, as we already needed relocatable stacks to implement delimited continuations. However, relocation on stack overflow did cause some tricky bugs in the VM, as relocation could happen at more places. In the end it was OK. Each stack frame in Guile has a fixed size, and includes space to make any nested calls; check my earlier article on the Guile 2.2 VM for more. The entry point of a function handles allocation of space for the function's local variables, and that's basically the only point the stack can overflow. The few things that did need to point into the stack were changed to be an offset from the stack base instead of a raw pointer.

Even when you grow a stack by a factor of 2, that doesn't mean you immediately take up twice as much memory. Operating systems usually commit memory to a process on a page-by-page granularity, which is usually around 4 kilobytes. Once accessed, this memory is always a part of your process's memory footprint. Guile mitigates this memory usage, however: because it has to check for stack overflow anyway, it records a "high-water mark" of stack usage since the last garbage collection. When garbage collection happens, Guile arranges to return the unused part of the stack to the operating system (using MADV_DONTNEED), but without causing the stack to shrink. In this way, the stack can grow to consume up to all memory available to the Guile process, and when the recursive computation eventually finishes, that stack memory is returned to the system.

You might wonder, why not just allocate enormous stacks, relying on the kernel to page them in lazily as needed? The biggest part of the answer is that we need to still be able to target 32-bit platforms, and this isn't a viable strategy there. Even on 64-bit, whatever limit you choose is still a limit. If you choose 4 GB, what if you want to map over a larger list? It's admittedly extreme, given Guile's current GC, but not unthinkable. Basically, your stack should be able to grow as big as your heap could grow. The failure mode for the huge-stack case is also pretty bad; instead of getting a failure to grow your stack, which you can handle with an exception, you get a segfault as the system can't page in enough memory.

The other common strategy is "segmented stacks", but the above link covers the downsides of that in Go and Rust. It would also complicate the multiple-value return convention in Guile, where currently multiple values might temporarily overrun the receiver's stack frame.

exceptional situations

Of course, it's still possible to run out of stack memory. Usually this happens because of a program bug that results in unbounded recursion, as in:

(define (faulty-map f l)
  (if (pair? l)
      (cons (f (car l)) (faulty-map f l))
      '()))

Did you spot the bug? The recursive call to faulty-map recursed on l, not (cdr l). Running this program would cause Guile to use up all memory in your system, and eventually Guile would fail to grow the stack. At that point you have a problem: Guile needs to raise an exception to unwind the stack and return memory to the system, but the user might have throw handlers in place that want to run before the stack is unwound, and we don't have any stack in which to run them.

Therefore in this case, Guile throws an unwind-only exception that does not run pre-unwind handlers. Because this is such an odd case, Guile prints out a message on the console, in case the user was expecting to be able to get a backtrace from any pre-unwind handler.

runaway recursion

Still, this failure mode is not so nice. If you are running an environment in which you are interactively building a program while it is running, such as at a REPL, you might want to impose an artificial stack limit on the part of your program that you are building to detect accidental runaway recursion. For that purpose, there is call-with-stack-overflow-handler. You run it like this:

(call-with-stack-overflow-handler 10000
  (lambda ()              ; body
    (faulty-map (lambda (x) x) '(1 2 3)))
  (lambda ()              ; handler
    (error "Stack overflow!")))

→ ERROR: Stack overflow!

The body procedure is called in an environment in which the stack limit has been reduced to some number of words (10000, in the above example). If the limit is reached, the handler procedure will be invoked in the dynamic environment of the error. For the extent of the call to the handler, the stack limit and handler are restored to the values that were in place when call-with-stack-overflow-handler was called.

Unlike the unwind-only exception that is thrown if Guile is unable to grow its stack, any exception thrown by a stack overflow handler might invoke pre-unwind handlers. Indeed, the stack overflow handler is itself a pre-unwind handler of sorts. If the code imposing the stack limit wants to protect itself against malicious pre-unwind handlers from the inner thunk, it should abort to a prompt of its own making instead of throwing an exception that might be caught by the inner thunk. (Overflow on unwind via inner dynamic-wind is not a problem, as the unwind handlers are run with the inner stack limit.)

Usually, the handler should raise an exception or abort to an outer prompt. However, if the handler does return, it should return a number of additional words of stack space to grant to the inner environment, as in the sketch below. A stack overflow handler may only ever "credit" the inner thunk with stack space that was available when the handler was instated. When Guile first starts, there is no stack limit in place, so the outer handler may allow the inner thunk an arbitrary amount of space, but any nested stack overflow handler will not be able to consume more than its limit.
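
For example, a handler that returns can grant the inner computation one extension before giving up; deep-compute here is a stand-in for whatever code you are running:

(define granted? #f)
(call-with-stack-overflow-handler 10000
  (lambda ()              ; body
    (deep-compute))
  (lambda ()              ; handler
    (if granted?
        (error "Stack overflow, even after extension")
        (begin
          (set! granted? #t)
          100000))))      ; grant 100000 more words and continue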

I really, really like Racket's notes on iteration and recursion, but treating stack memory just like any other kind of memory isn't always what you want. It doesn't make sense to throw an exception on an out-of-memory error, but it does make sense to do so on stack overflow -- and you might want to do some debugging in the context of the exception to figure out what exactly ran away. It's easy to attribute blame for stack memory use, but it's not so easy for heap memory. And throwing an exception will solve the problem of too much stack usage, but it might not solve runaway memory usage. I prefer the additional complexity of having stack overflow handlers, as it better reflects the essential complexity of resource use.

os stack usage

It is also possible for Guile to run out of space on the "C stack" -- the stack that is allocated to your program by the operating system. If you call a primitive procedure which then calls a Scheme procedure in a loop, you will consume C stack space. Guile tries to detect excessive consumption of C stack space, throwing an error when you have hit 80% of the process' available stack (as allocated by the operating system), or 160 kilowords in the absence of a strict limit.

For example, looping through call-with-vm, a primitive that calls a thunk, gives us the following:

(use-modules (system vm vm))

(let lp () (call-with-vm lp))

→ ERROR: Stack overflow

Unfortunately, that's all the information we get. Overrunning the C stack will throw an unwind-only exception, because it's not safe to do very much when you are close to the C stack limit.

If you get an error like this, you can either try rewriting your code to use less stack space, or you can increase Guile's internal C stack limit. Unfortunately this is a case in which the existence of a limit affects how you would write your programs. The best thing is to have your code avoid consuming so much OS stack, by avoiding loops through C trampolines.

I don't know what will happen when Guile starts to do native compilation. Obviously we can't relocate the C stack, so lazy stack growth and relocation isn't a viable strategy if we want to share the C and Scheme stacks. Still, we need to be able to relocate stack segments for delimited continuations, so perhaps there will still be two stacks, even with native compilation. We will see.

Well, that's all the things about stacks. Until next time, happy recursing!

Syndicated 2014-03-17 11:40:42 from wingolog

es6 generator and array comprehensions in spidermonkey

Good news, everyone: ES6 generator and array comprehensions just landed in SpiderMonkey!

Let's take a quick look at what comprehensions are, then talk about what just landed in SpiderMonkey and when you'll see it in a Firefox release. Thanks to Bloomberg for sponsoring this work.

comprendes, mendes

Comprehensions are syntactic sugar for iteration. Unlike for-of, which processes its body for side effects, an array comprehension processes its body for its values, collecting them into a new array. Like this:

// Before (by hand)
var foo = (function(){
             var result = [];
             for (var x of y)
               result.push(x*x);
             return result;
           })();

// Before (assuming y has a map() method)
var foo = y.map(function(x) { return x*x });

// After
var foo = [for (x of y) x*x];

As you can see, array comprehensions are quite handy. They're expressions, not statements, and so their result can be passed directly to whatever code needs it. This can make your program more clear, because you aren't forced to give names to intermediate values, like result. At the same time, they work on any iterable, so you can use them on more kinds of data than just arrays. Because array comprehensions don't make a new closure, you can access arguments and this, and even yield, from within the comprehension tail.

Generator comprehensions are also nifty, but for a different reason. Let's look at an example first:

// Before
var bar = (function*(){ for (var x of y) yield x })();

// After
var bar = (for (x of y) x);

As you can see the syntactic win here isn't that much, compared to just writing out the function* and invoking it. The real advantage of generator comprehensions is their similarity to array comprehensions, and that often you can replace an array comprehension by a generator comprehension. That way you never need to build the complete list of values in memory -- you get laziness for free, just by swapping out those square brackets for the comforting warmth of parentheses.

Both kinds of comprehension can contain multiple levels of iteration, with embedded conditionals as you like. You can do [for (x of y) for (z of x) if (z % 2) z + 1] and all kinds of related niftiness. Comprehensions are almost always more concise than map and filter, with the added advantage that they are usually more efficient.

what happened

SpiderMonkey has had comprehensions for a while now, but only as a non-standard language extension you have to opt into. Now that comprehensions are in the draft ES6 specification, we can expose them to the web as a whole, by default.

Of course, the comprehensions that ES6 specified aren't quite the same as the ones that were in SM. The obvious difference is that SM's legacy comprehensions were written the other way around: [x for (x of y)] instead of the new [for (x of y) x]. There were also a number of other minor differences, which I'll list here for posterity:

  • ES6 comprehensions create one scope per "for" node -- not one for the comprehension as a whole.

  • ES6 comprehensions can have multiple "if" components, which may be followed by other "for" or "if" components.

  • ES6 comprehensions should make a fresh binding on each iteration of a "for", although Firefox currently doesn't do this (bug 449811). Incidentally, for-of in Firefox has this same problem.

  • ES6 comprehensions only do for-of iteration, not for-in iteration.

  • ES6 generator comprehensions always need parentheses around them. (The parentheses were optional in some cases for SM's old generator comprehensions.)

  • ES6 generator comprehensions are ES6 generators (returning {value, done} objects), not legacy generators (StopIteration).

I should note in particular that the harmony wiki is out of date, as the feature has moved into the spec proper: array comprehensions, generator comprehensions.

For another fine article on ES6 generators, check out Ariya Hidayat's piece on comprehensions from about a year ago.

So, ES6 comprehensions just landed in SpiderMonkey today, which means it should be part of Firefox 30, which should reach "beta" in April and become a stable release in June. You can try it out tomorrow if you use a nightly build, provided it doesn't cause some crazy breakage tonight. As of this writing, Firefox will be the first browser to ship ES6 array and generator comprehensions.

colophon

I had a Monday of despair: hacking at random on something that didn't solve my problem. But I had a Tuesday morning of pleasure, when I realized that my Monday's flounderings could be cooked into a delicious mid-week bisque; the hack was obvious and small and would benefit the web as a whole. (Wednesday was for polish and Thursday on another bug, and Friday on a wild parser-to-OSR-to-assembly-and-back nailbiter; but in the end all is swell.)

Thanks again to Bloomberg for this opportunity to build out the web platform, and to Mozilla for their quality browser wares (and even better community of hackers).

This has been an Igalia joint. Until next time!

Syndicated 2014-03-07 21:11:55 from wingolog

compost, a leaf function compiler for guile

What's that out by the woodshed? It's a steaming pile -- it's full of bugs -- it's compost, a leaf function compiler for Guile!

Around this time last year, a few of us cooked up some hack-dishes to bring to a potluck for Guile 2.0's release anniversary. Mine was a little OpenGL particle demo.

That demo was neat but it couldn't be as big as I would have liked it to be because it was too slow. So, this year when the potluck season rolled around again I sat down to make a little compiler for the subset of Scheme that you see in inner numeric loops -- bytevector access, arithmetic, and loops.

The result is compost. Compost compiles inner loops into native x86-64 machine code that operates on unboxed values.

As you would imagine, compost-compiled code is a lot faster than code interpreted by Guile's bytecode interpreter. I go from being able to compute and render 5K particles at 60 fps up to 400K particles or so -- an 80-fold improvement. That's swell but it gets sweller. The real advantage is that with fast inner loops, I can solve more complicated problems.

Like this one!

Last year's demo hard-coded a gravitational attractor at (0, 0, 0). This one has no hard-coded attractor -- instead, each particle attracts each other. This so-called n-body simulation is an n-squared problem, so you need to be really efficient with the primitives to scale up, and even then the limit approaches quickly.

With compost, I can get to about 1650 particles at 60 frames per second, using 700% CPU on this 4-core 2-thread-per-core i7-3770 machine, including display with the free software radeon drivers. Without compost -- that is to say, just with Guile's bytecode virtual machine -- I max out at something more like 120 particles, and only 200% CPU.

The rest of this post describes how compost works. If compilers aren't your thing, replace the rest of the words with cat noises.

meow meow meow meow meow meow meow meow

The interface to compost is of course a macro, define/compost. Here's a little loop to multiply two vectors into a third, element-wise:

(use-modules (compost syntax) (rnrs bytevectors))
(define/compost (multiply-vectors (dst bytevector?)
                                  (a bytevector?)
                                  (b bytevector?)
                                  (start exact-integer?)
                                  (end exact-integer?))
  (let lp ((n start))
    (define (f32-ref bv n)
      (bytevector-ieee-single-native-ref bv (* n 4)))
    (define (f32-set! bv n val)
      (bytevector-ieee-single-native-set! bv (* n 4) val))
    (when (< n end)
      (f32-set! dst n (* (f32-ref a n) (f32-ref b n)))
      (lp (1+ n)))))

It's low-level but that's how we roll. If you evaluate this form and all goes well, it prints out something like this at run-time:

;;; loading /home/wingo/.cache/guile/compost/rmYZoT-multiply-vectors.so

This indicates that compost compiled your code into a shared object at macro-expansion time, and then printed out that message when it loaded it at runtime. If composting succeeds, compost writes out the compiled code into a loadable shared object (.so file). It then residualizes a call to dlopen to load that file at run-time, followed by code to look up the multiply-vectors symbol and create a foreign function. If composting fails, it prints out a warning and falls back to normal Scheme (by residualizing a plain lambda).
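
In rough outline, the residualized success path does something like the following sketch, using Guile's dynamic FFI. The path and argument types shown are illustrative only; compost's actual calling convention differs.

(use-modules (system foreign))

;; Link the compiled shared object at run-time and wrap the exported
;; symbol as a foreign procedure.
(define multiply-vectors
  (let* ((lib (dynamic-link
               "/home/wingo/.cache/guile/compost/rmYZoT-multiply-vectors.so"))
         (ptr (dynamic-func "multiply-vectors" lib)))
    (pointer->procedure void ptr (list '* '* '* int64 int64))))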

In the beginning of the article, I called compost a "leaf function compiler". Composted functions should be "leaf functions" -- they shouldn't call other functions. This restriction applies only to the low-level code, however. The first step in composting is to run the function through Guile's normal source-to-source optimizers, resulting in a CPS term. The upshot is that you can use many kinds of abstraction inside the function, like the little f32-ref/f32-set! helpers above, but in the end Guile should have inlined or contified them all away. It's a restriction, but hey, this is just a little hack.

Let's look at some assembly. We could get disassembly just by calling objdump -d /home/wingo/.cache/guile/compost/rmYZoT-multiply-vectors.so, but let's do it a different way. Let's put that code into a file, say "/tmp/qux.scm", and add on this code at the end:

(define size #e1e8) ;; 100 million
(define f32v (make-f32vector size 2.0))
(multiply-vectors f32v f32v f32v 0 size)

OK. Now we run Guile under GDB:

$ gdb --args guile /tmp/qux.scm 
(gdb) b 'multiply-vectors'
Function "multiply-vectors" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 ('multiply-vectors') pending.
(gdb) r
Starting program: /opt/guile/bin/guile /tmp/qux.scm
[New Thread 0x7ffff604b700 (LWP 13729)]
;;; loading /home/wingo/.cache/guile/compost/Kl0Xpc-multiply-vectors.so

Breakpoint 1, 0x00007ffff5322000 in multiply-vectors () from /home/wingo/.cache/guile/compost/Kl0Xpc-multiply-vectors.so
(gdb) step
multiply-vectors () at /tmp/qux.scm:12
12	    (when (< n end)

Word! GDB knows about the symbol, multiply-vectors. That's top. We are able to step into it, and it prints Scheme code!

Both of these swell things are because compost residualizes its compiled code as ELF files, and actually re-uses Guile's linker. The ELF that we generate can be loaded by dlopen, and its symbol tables and DWARF debugging information are known to GDB.

(In my last article on ELF I mentioned that Guile had no plans to use the system dynamic library loader (dlopen). That's still true; Guile has its own loader. I used the system loader in this place, though, just because I thought it was a neat hack.)

We can tell GDB to disassemble the next line:

(gdb) set disassemble-next-line on
(gdb) step
9	      (bytevector-ieee-single-native-ref bv (* n 4)))
=> 0x00007ffff532201d <multiply-vectors+29>:	4c 0f af c9	imul   %rcx,%r9
(gdb) 
13	      (f32-set! dst n (* (f32-ref a n) (f32-ref b n)))
=> 0x00007ffff532203b <multiply-vectors+59>:	f2 0f 59 c1	mulsd  %xmm1,%xmm0
   0x00007ffff532203f <multiply-vectors+63>:	49 b9 04 00 00 00 00 00 00 00	movabs $0x4,%r9

GDB does OK with these things, but it doesn't have special support for Scheme, and really you would want column pointers, not just lines. That data is in the DWARF but it's not processed by GDB. Anyway here's the disassembly:

(gdb) disassemble
Dump of assembler code for function multiply-vectors:
   0x00007ffff5322000 <+0>:	push   %rbx
   0x00007ffff5322001 <+1>:	push   %rbp
   0x00007ffff5322002 <+2>:	push   %r12
   0x00007ffff5322004 <+4>:	push   %r13
   0x00007ffff5322006 <+6>:	push   %r14
   0x00007ffff5322008 <+8>:	push   %r15
   0x00007ffff532200a <+10>:	cmp    %r8,%rcx
   0x00007ffff532200d <+13>:	jge    0x7ffff5322060 <multiply-vectors+96>
   0x00007ffff5322013 <+19>:	movabs $0x4,%r9
   0x00007ffff532201d <+29>:	imul   %rcx,%r9
   0x00007ffff5322021 <+33>:	cvtss2sd (%rsi,%r9,1),%xmm0
   0x00007ffff5322027 <+39>:	movabs $0x4,%r9
   0x00007ffff5322031 <+49>:	imul   %rcx,%r9
   0x00007ffff5322035 <+53>:	cvtss2sd (%rdx,%r9,1),%xmm1
=> 0x00007ffff532203b <+59>:	mulsd  %xmm1,%xmm0
   0x00007ffff532203f <+63>:	movabs $0x4,%r9
   0x00007ffff5322049 <+73>:	imul   %rcx,%r9
   0x00007ffff532204d <+77>:	cvtsd2ss %xmm0,%xmm15
   0x00007ffff5322052 <+82>:	movss  %xmm15,(%rdi,%r9,1)
   0x00007ffff5322058 <+88>:	inc    %rcx
   0x00007ffff532205b <+91>:	jmpq   0x7ffff532200a <multiply-vectors+10>
   0x00007ffff5322060 <+96>:	movabs $0x804,%rdi
   0x00007ffff532206a <+106>:	mov    %rdi,%rax
   0x00007ffff532206d <+109>:	pop    %r15
   0x00007ffff532206f <+111>:	pop    %r14
   0x00007ffff5322071 <+113>:	pop    %r13
   0x00007ffff5322073 <+115>:	pop    %r12
   0x00007ffff5322075 <+117>:	pop    %rbp
   0x00007ffff5322076 <+118>:	pop    %rbx
   0x00007ffff5322077 <+119>:	retq   
End of assembler dump.
(gdb) 

Now if you know assembly, this is pretty lame stuff -- it saves registers it doesn't use, it multiplies instead of adds to get the bytevector indexes, it loads constants many times, etc. It's a proof of concept. Sure beats heap-allocated floating-point numbers, though.

safety and semantics

Compost's goal is to match Guile's semantics, while processing values with native machine operations. This means that it needs to assign concrete types and representations to all values in the function. To do this, it uses the preconditions, return types from primitive operations, and types from constant literals to infer other types in the function. If it succeeds, it then chooses representations (like "double-precision floating point") and assigns values to registers. If the types don't check out, or something is unsupported, compilation bails and runtime will fall back on Guile's normal execution engine.

There are a couple of caveats, however.

One is that compost assumes that small integers do not promote to bignums. We could remove this assumption with better range analysis. Compost does do some other analysis, like sign analysis to prove that the result of sqrt is real. Complex numbers will cause compost to bail.

Compost also doesn't check bytevector bounds at run-time. This is terrible. I didn't do it though because to do this nicely you need to separate the bytevector object into two variables: the pointer to the contents and the length. Both should be register-allocated separately, and range analysis would be nice too. Oh well!

Finally, compost is really limited in terms of the operations it supports. In fact, if register allocation would spill on the function, it bails entirely :)

the source

If it's your thing, have fun over on yon gitorious. Compost needs Guile from git, and the demos need Figl, the GL library. For me this project is an ephemeral thing; a trial for future work, with some utility now, but not something I really want to support. Still, if it's useful to you, have at it.

coda

I woke up this morning at 5 thinking about the universe, and how friggin big it is. I couldn't go back to sleep. All these particles swirling and interacting in parallel, billions upon zillions, careening around, somehow knowing what forces are on them, each somehow a part of the other. I studied physics and I never really boggled at the universe in the way I did this morning thinking about this n-body simulation. Strange stuff.

I remember back in college when I was losing my catholicism and I would be at a concert or a dance show or something and I would think "what is this, this is just particles vibrating, bodies moving in the nothing; there is no meaning here." It was a thought of despair and unmooring. I don't suffer any more over it, but mostly because I don't think about it any more. I still don't believe in some omniscient skydude or skylady, but if I did, I know he or she has got a matrix somewhere with every particle's position and velocity.

Swirl on, friends, and until next time, happy hacking.

Syndicated 2014-02-18 20:21:18 from wingolog

elf in guile

Good evening, gentle hackfolk!

Today I'd like to wrap up my three-part series of articles on what's new in Guile 2.2's compiler and runtime. I talked about the virtual machine a couple months ago, and the compiler internals just last Sunday. Today's article is about the object file format.

Sounds boring, right? Well, probably for most humans. But hackers, Guile compiles to ELF. Pretty rad, amirite? I thought so too. Read on, nerdy readers, read on!

object files

So let's consider the problem: Guile compiles to bytecode for a custom virtual machine. In the future we want to do native compilation. In both cases we'll need to write bytes out to disk in some format, and be able to load that code back into Guile. What should be the format for those bytes? For our purposes, a good object file format has a number of characteristics:

  • Above all else, it should be very cheap to load a compiled file.

  • It should be possible to statically allocate constants in the file. For example, a bytevector literal in source code can be emitted directly into the object file.

  • The compiled file should enable maximum code and data sharing between different processes.

  • The compiled file should contain debugging information, such as line numbers, but that information should be separated from the code itself. It should be possible to strip debugging information if space is tight.

These characteristics are not specific to Scheme. So why not just steal from the C and C++ people? They use the flexible ELF object file format, and so does Guile.

Note that although Guile uses ELF on all platforms, we do not use platform support for ELF. Guile implements its own linker and loader. The advantage of using ELF is not sharing code, but sharing ideas. ELF is simply a well-designed object file format.

An ELF file has two meta-tables describing its contents. The first meta-table is for the loader, and is called the program table or sometimes the segment table. The program table divides the file into big chunks that should be treated differently by the loader. Mostly the difference between these segments is their permissions.

Typically all segments of an ELF file are marked as read-only, except the part that represents modifiable static data or static data that needs load-time initialization. Loading an ELF file is as simple as mmapping the thing into memory with read-only permissions, then using the segment table to mark a small sub-region of the file as writable. This writable section is typically added to the root set of the garbage collector as well.

The other meta-table in an ELF file is the section table. Whereas the program table divides an ELF file into big chunks for the loader, the section table specifies small sections for use by introspective tools like debuggers or the like. One segment (program table entry) typically contains many sections. There may be sections outside of any segment, as well.
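
You can poke at these tables from Guile itself. Here's a sketch that lists an ELF file's sections, assuming the (system vm elf) module exports parse-elf and elf-sections-by-name, as in Guile 2.2:

(use-modules (system vm elf) (ice-9 binary-ports))

;; Parse an ELF file with Guile's own ELF reader and print the name
;; of each section.
(define (list-section-names filename)
  (let* ((bv (call-with-input-file filename get-bytevector-all
                                   #:binary #t))
         (elf (parse-elf bv)))
    (for-each (lambda (name+section)
                (display (car name+section))
                (newline))
              (elf-sections-by-name elf))))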

I know, this is a bit dry. So I made you a picture; or rather, a program that makes pictures. Here's an example of one of the pictures:

What we see here is a page map of assembler.go, a module in Guile's compiler. The upper part shows each section in a different color, in the place it appears in the file. Each horizontal line is one four-kilobyte page.

As you can see, some pages contain multiple sections or parts of sections; that is the case for pages that have the same permissions (read-only versus read-write), and which don't have special alignment requirements. About halfway down, after the small red .dynamic section, there is a gap, indicating that the next section, .data, needs to start on a separate page -- in this case because it is writable.

This page map shows the sections of the ELF file. (There are only three segments: the first read-only part; a small "dynamic" segment holding the .dynamic section, used when the file is loaded; and the final read-write segment. You can't see this from the visualization, but actually everything after .data is in no segment at all -- because it's not strictly necessary at run-time. That's the ELF convention.)

(Why is this file so big, you ask? It's a complicated answer. Part of it is because much of the assembler is itself a generated program; it uses Scheme procedural macros to define emit-foo procedures for each kind of instruction, based on information extracted from the VM code. Guile doesn't do phasing, so these macros are residualized into the object file, and because of datum->syntax, macros have a lot of associated constant literals. That probably explains the huge size of .data, as syntax objects contain vectors and lists and symbols needing run-time relocation. I would have to check, but I think .rtl-text is big mostly because of a number of dynamic type checks in this particular module that the optimizer is (rightly) unable to elide. Alack. Surely both of these can be fixed eventually.)

But enough of problems. Did I mention that I made a program? And how! I think this may be the only blog post you have ever read that has a form in it, but that's how I am rolling today. Select an ELF file on your system, click submit, and you can have your very own page map. It works on any ELF file, not just Guile's files, so you can send your libc.so or whatever.

If you don't see a form above, your blog reader must have stripped it out. Click through to the blog post itself, or go visit the tool directly. And of course the program is written in Guile -- the file upload handling, the ELF parsing, and the pixel munging (using Cairo and a homebrew charting library).

Anyway, here's another example, this time of a small file:

Here we see that this file has the same sections as our earlier bloated example, only they are smaller.

Allow me to point out some details.

Firstly, the .rtl-text section holds the bytecode. Usually this would be called .text, but I didn't want to mess with people's expectations, and the new VM used to be called the "RTL" VM. (No longer, that was a silly name.)

The .data section holds data that needs initialization, or which may be modified at runtime. .rodata holds statically allocated data that needs no run-time initialization, and which therefore can be shared between processes. The initializations themselves, like relocations, are compiled to a procedure in the .rtl-text section, linked to from the dynamic section.

And that's all the sections! Except for the introspection and debugging-related sections, that is. These sections are used by the Guile runtime to be able to answer questions like "what is the name of this function?" or "what is the source position corresponding to bytecode position X?" or "what is the documentation string for this function?" or, well, I think you get the point. None of this is usually needed at runtime, and because it is all allocated at the end of the file, that means that usually none of it is ever paged into memory.

Note that for some of this metadata we use the standard DWARF format. I think Guile is one of the few dynamic language runtimes that do this nifty thing.

Also, all read-only data in these ELF files is ripe for sharing between processes, paging out to disk if under memory pressure, etc. For example, if I look on the smaps file for one of the two web processes running on this server, I see:

/opt/guile-2.2/lib/guile/2.2/ccache/web/server/http.go
Size:                132 kB
Rss:                 124 kB
Pss:                  62 kB

meaning that for this particular file, almost all of it is not only shareable but being shared. Good times.

Finally, all of this has a positive impact on start-up time. While I can get Guile 2.0 to start up in 11 milliseconds, with Guile 2.2 I am down to 8 milliseconds. Likewise guile -c '(sleep 100)' in Guile 2.0 uses 3144 kB of private dirty memory, compared to 1852 kB with Guile 2.2. There are still improvements to be made, but things are going well.

Well, again I find myself rambling. Check out the little ELF mapper tool I have above and let me know of any curious results. Do send your questions as well; though I've been derelict at responding, who knows, new year, new leaf, right? Until next time, happy hacking.

Syndicated 2014-01-19 19:55:14 from wingolog

a continuation-passing style intermediate language for guile

Happy new year's, hackfolk!

A few weeks ago I wrote about the upcoming Guile 2.2 release, and specifically about its new register virtual machine. Today I'd like to burn some electrons on another new part in Guile 2.2, its intermediate language.

To recap, we switched from a stack machine to a register machine because, among other reasons, register machines can consume and produce named intermediate results in fewer instructions than stack machines, and that makes things faster.

To take full advantage of this new capability, it is appropriate to switch at the same time from the direct-style intermediate language (IL) that we had to an IL that names all intermediate values. This lets us effectively reason about each subexpression that goes into a computation, for example in common subexpression elimination.

As far as intermediate languages go, basically there are two choices these days: something SSA-based, or something CPS-based. I wrote an article on SSA, ANF and CPS a few years ago; if you aren't familiar with these or are feeling a little rusty, I suggest you go and take a look.

In Guile I chose a continuation-passing style language. I still don't know if I made the right choice. I'll go ahead and describe Guile's IL and then follow up with some reflections. The description below is abbreviated from a more complete version in Guile's manual.

guile's cps language

Guile's CPS language is composed of terms, expressions, and continuations. It was heavily inspired by Andrew Kennedy's Compiling with Continuations, Continued paper.

A term can either evaluate an expression and pass the resulting values to some continuation, or it can declare local continuations and contain a sub-term in the scope of those continuations.

$continue k src exp
Evaluate the expression exp and pass the resulting values (if any) to the continuation labelled k.
$letk conts body
Bind conts in the scope of the sub-term body. The continuations are mutually recursive.

Additionally, the early stages of CPS allow for a set of mutually recursive functions to be declared as a term via a $letrec term. A later pass will attempt to transform the functions declared in a $letrec into local continuations. Any remaining functions are later lowered to $fun expressions. More on "contification" later.

Here is an inventory of the kinds of expressions in Guile's CPS language. Recall that all expressions are wrapped in a $continue term which specifies their continuation.

$void
Continue with the unspecified value.
$const val
Continue with the constant value val.
$prim name
Continue with the procedure that implements the primitive operation named by name.
$fun src meta free body
Continue with a procedure. body is the $kentry $cont of the procedure entry. free is a list of free variables accessed by the procedure. Early CPS uses an empty list for free; only after closure conversion is it correctly populated.
$call proc args
Call proc with the arguments args, and pass all values to the continuation. proc and the elements of the args list should all be variable names. The continuation identified by the term's k should be a $kreceive or a $ktail instance.
$primcall name args
Perform the primitive operation identified by name, a well-known symbol, passing it the arguments args, and pass all resulting values to the continuation.
$values args
Pass the values named by the list args to the continuation.
$prompt escape? tag handler
Push a prompt on the stack identified by the variable name tag and continue with zero values. If the body aborts to this prompt, control will proceed at the continuation labelled handler, which should be a $kreceive continuation. Prompts are later popped by pop-prompt primcalls.

The remaining element of the CPS language in Guile is the continuation. In CPS, all continuations have unique labels. Since this aspect is common to all continuation types, all continuations are contained in a $cont instance:

$cont k cont
Declare a continuation labelled k. All references to the continuation will use this label.

The most common kind of continuation binds some number of values, and then evaluates a sub-term. $kargs is this kind of simple lambda.

$kargs names syms body
Bind the incoming values to the variables syms, with original names names, and then evaluate the sub-term body.

Variables (e.g., the syms of a $kargs) should be globally unique. To bind the result of an expression to a variable and then use that variable, you would continue from the expression to a $kargs that declares one variable. The bound value would then be available for use within the body of the $kargs.
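
For example, here is a schematic rendering (illustrative notation, not Guile's actual constructor syntax) of computing (+ a b), binding the result to v, and returning it through the function's tail continuation:

($letk (($cont k ($kargs ('v) (v)
                   ($continue ktail src ($values (v))))))
  ($continue k src ($primcall '+ (a b))))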

$kif kt kf
Receive one value. If it is a true value, branch to the continuation labelled kt, passing no values; otherwise, branch to kf.

Non-tail function calls should continue to a $kreceive continuation in order to adapt the returned values to their uses in the calling function, if any.

$kreceive arity k
Receive values from a function return. Parse them according to arity, and then proceed with the parsed values to the $kargs continuation labelled k.

$arity is a helper data structure used by $kreceive and also by $kclause, described below.

$arity req opt rest kw allow-other-keys?
A data type declaring an arity. See Guile's manual for details.

Additionally, there are three specific kinds of continuations that can only be declared at function entries.

$kentry self tail clauses
Declare a function entry. self is a variable bound to the procedure being called, and which may be used for self-references. tail declares the $cont wrapping the $ktail for this function, corresponding to the function's tail continuation. clauses is a list of $kclause $cont instances.
$ktail
A tail continuation.
$kclause arity cont
A clause of a function with a given arity. Applications of a function with a compatible set of actual arguments will continue to cont, a $kargs $cont instance representing the clause body.

reflections

Before starting Guile's compiler rewrite, I had no real-world experience with CPS-based systems. I had worked with a few SSA-based systems, and a few more direct-style systems. I had most experience with the previous direct-style system that Guile had, but never had to seriously design another kind of IL, so basically I was ignorant. It shows, I think; but time will tell if it came out OK anyway. At this point I am cautiously optimistic.

As far as fitness for purpose goes, the CPS IL works in the sense that it is part of a self-hosting compiler. I'll say no more on that point other than to mention that it has explicit support for a number of Guile semantic features: multiple-value returns; optional, rest, and keyword arguments; cheap delimited continuations; Guile-native constant literals.

Why not ANF instead? If you recall from my SSA and CPS article, I mentioned that ANF is basically CPS with fewer labels. It tries to eliminate "administrative" continuations, whereas Guile's CPS labels everything. There is no short-hand let form.

ANF proponents tout its parsimony as a strength, but I do not understand this argument. I like having labels for everything. In CPS, I have as many labels as there are expressions, plus a few for continuations that don't contain terms. I use them directly in the bytecode compiler; the compiler never has to generate a fresh label, as they are part of the CPS itself.

More importantly, labelling every control-flow point allows me to reason precisely about control flow. For example, if a function is always called with the same continuation, it can be incorporated in the flow graph of the calling function. This is called "contification". It is not the same thing as inlining, as it works for sets of recursive functions as well, and never increases code size. This is a crucial transformation for a Scheme compiler to make, as it turns function calls into gotos, and self-function calls into loop back-edges.
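
As a source-level example, consider this counting loop. The named-let loop lp compiles to a recursive function, but because lp is always called with the same continuation -- the return of count-to -- the contifier can absorb it into count-to's flow graph: no closure, no calls, just a loop back-edge.

(define (count-to n)
  (let lp ((i 0))
    (if (< i n)
        (lp (1+ i))
        i)))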

Guile's previous compiler did a weak form of contification, but because we didn't have names for all control points it was gnarly and I was afraid to make it any better. Now its contifier is optimal. See Fluet and Weeks' Contification using Dominators and Kennedy's CWCC, for more on contification.

One more point in favor of labelling all continuations. Many transformations can be best cast as a two-phase process, in which you first compute a set of transformations to perform, and then you apply them. Dead-code elimination works this way; first you find the least fixed-point of live expressions, and then you residualize only those expressions. Without names, how are you going to refer to an expression in the first phase? It's nasty; everything is much cleaner with the ubiquitous, thorough labelling that CPS provides.

So I am happy with CPS, relative to ANF. But what about SSA? In my previous article, I asked SSA proponents to imagine returning a block from a block. Of course it doesn't make any sense; SSA is a first-order language. But modern CPS is also first-order, is the thing! Modern CPS distinguishes "continuations" syntactically from functions, which is exactly the same as SSA's distinction between basic blocks and functions. CPS and SSA really are the same on this level.

The fundamental CPS versus SSA difference is, as Stephen Weeks noted a decade ago, one of data structures: do you group your expressions into basic blocks stored in a vector (SSA), or do you nest them into a scope tree (CPS)? It's not clear that I made the correct choice.

In practice with Guile's CPS you end up building graphs on the side that describe some aspect of your term. For example you can build a reverse-post-ordered control flow analysis that linearizes your continuations, and assigns them numbers. Then you can compute a bitvector for each continuation representing each one's reachable continuations. Then you can use this reachability analysis to determine the extent of a prompt's body, for example.

But this analysis is all on the side and not really facilitated by the structure of CPS itself; the CPS facilities that it uses are the globally unique continuation and value names of the CPS, and the control-flow links. Once the transformation is made, all of the analysis is thrown away.

Although this seems wasteful, the SSA approach of values having "implicit" names by their positions in a vector (instead of explicit ephemeral name-to-index mappings) is terrifying to me. Any graph transformation could renumber things, or leave holes, or cause vectors to expand... dunno. Perhaps I am too shy of the mutation foot-gun. I find comfort in CPS's minimalism.

One surprise I have found is that I haven't needed to do any dominator-based analysis in any of the paltry CPS optimizations I have made so far. I expect to do so once I start optimizing loops; here we see the cultural difference with SSA I guess, loops being the dominant object of study there. On the other hand I have had to solve flow equations on a few occasions, which was somewhat surprising, though enjoyable.

The optimizations I have currently implemented for CPS are fairly basic. Contification was tricky. One thing I did recently was to make all non-tail $call nodes require $kreceive continuations; if, as in the common case, extra values were unused, that was reflected in an unused rest argument. This required a number of optimizations to clean up and remove the extra rest arguments for other kinds of source expressions: dead-code elimination, the typical beta/eta reduction, and some code generation changes. It was worth it though, and now with the optimization passes things are faster than they were before.

Well, I find that I am rambling now. I know this is a lot of detail, but I hope that it helps some future compiler hacker understand more about intermediate language tradeoffs. I have been happy with CPS, but I'll report back if anything changes :) And if you are actually hacking on Guile, check the in-progress manual for all the specifics.

Happy hacking to all, and to all a good hack!

Syndicated 2014-01-12 21:58:00 from wingolog

optimizing let in spidermonkey

Peoples! Firefox now optimizes let-bound variables!

What does this mean, you ask? Well, as you nerdy wingolog readers probably know, the new ECMAScript 6 standard is coming soon. ES6 has new facilities that make JavaScript more expressive. At the same time, though, many ES6 features stress parts of JavaScript engines that are currently ignored.

One of these areas is lexical scope. Until now, with two exceptions, all local variables in JavaScript have been "hoisted" to the function level. (The two exceptions are the exception in a catch clause, and the name of a named function expression.) Therefore JS engines have rightly focused on compiling methods at a time, falling back to slower execution strategies when lexical scopes are present.

The presence of let in ES6 changes this. It was a hip saying a couple years ago that "let is the new var", but in reality no one uses let -- not only because engines didn't yet implement let without feature flags, if they implemented it at all, but because let was slow. Because let-bound variables were on the fallback path, they were unoptimized.

But that is changing. Firefox now optimizes many kinds of let-bound variables, making "let is the new var" much closer to being an actual thing. V8 has made some excellent steps in similar directions as well. I'll focus on the Firefox bit here as that's what I've been working on, then go on to mention the work that the excellent V8 hackers have done.

implementing scope in javascript

JavaScript the language is defined in terms of a "scope chain", with respect to which local variables are looked up. Each link in the chain binds some set of variables. Lookup of a name traverses the chain from tip to tail looking for a binding.

Usually it's possible to know at compile-time where a binding will be in the chain. For example, you can know that the "x" reference in the returned function here:

function () {
  var x = 3;
  return function () { return x; };
}

will be one link up in the scope chain, and will be stored in the first named slot in that link. This is a classic implementation of lexical scope, and it plays very well with the semantics of JavaScript as specified. All engines do something like this.

Another thing to note in this case is that we don't need to store the name for "x" anywhere, as we can see through all of its uses at compile-time. In practice however, as JavaScript functions are typically compiled lazily on first call, we do store the association "x -> slot 0" in the outer function's environment.

(Could you just propagate the constant, you ask? Of course you could. But since JS compilation typically occurs one function at a time, lazily, no engine does this initially. Of course later when the optimizing compiler kicks in, it does get propagated, but that usually doesn't avoid the scope chain node allocation.)

Let's take another case.

function foo() { var x = 3; return x; }

In this case we know that the "x" reference can be found in the top link of the chain, in the first slot. Semantically speaking, that is. No JS engine implements things this way. Instead we take advantage of noticing that "x" cannot be captured by any other function. Therefore we are free to assign "x" to a slot in a stack frame, and refer to it by index directly -- without allocating a scope chain node on the heap. All engines do this.

And of course later if this function is hot, the constant exists only in a register, probably inlined to its caller. For this reason scope hasn't had a lot of work put into it. The unit of optimization -- the function -- is the same as the unit of scope. If the implementation details of scope get costly, we take advantage of run-time optimization to speculatively remove allocations, loads, and stores.

OK. What if you have some variables captured, and some not captured?

function foo() { var x = 3; var y = 4; return function () { return x; } }

Here "x" is captured, but "y" is not. Usually engines will allocate "x" on the scope chain, and "y" as a local. (JavaScriptCore is an interesting exception; it uses a strategy I haven't seen elsewhere called "lazy tear-off". Basically all variables are on the stack for the dynamic extent of the scope, whether they are captured or not. If needed, potentially lazily, an object is pushed on the scope chain that aliases the stack slots. If a scope is pushed on the scope chain, when the scope ends the current values are "torn off" the stack and copied to the heap, and the slots pointer of the scope object updated to point to the heap.)

I digress. "x" on the scope chain, "y" not. (I regress: why allocate "y" at all, isn't it dead? Yes it is dead. Optimizing compilers kill it. Baseline compilers don't bother.) So typically access to "x" goes through the scope chain, and access to "y" does not.

Calculating the set of variables that can be captured is one of the few static analyses done by JS baseline compilers. The other ones are whether we are in strict mode or not, whether nested scopes contain "with" or "eval" (thus forcing all locals to be on the scope chain), and whether the "arguments" object is used or not. There may be more, but that is the lowest common denominator of efficiency.

implementing let

let is part of ES6, and all browsers will eventually implement it.

Let me give you an insight as to the mindset of a browser maker. What a browser maker does when they see a feature is first to try to ignore it -- all features have a cost and not all of them pay for themselves. Next you try to do the minimum correct thing -- the thing that passes all the test suites, but imposes a minimal burden on the rest of the system. Usually this means that the feature probably works, except for the corner cases which users will file bugs for, but it is slow. Finally when there is either an internal use the browser maker cares about (Google, Mozilla to an extent) or you want to heat up a benchmark war (everyone else), you start to optimize.

The state of let was until recently between "ignore" and "slow". "Slow" means different things to different browsers.

The engines that have started to implement let are V8 (Chrome) and SpiderMonkey (Firefox). In Chrome, until recently using let in a function was an easy way of preventing that function from ever being optimized by Crankshaft. Good times? When writing this article I was going to claim this was still the case, but I see that in fact this is happily not true. You can Crankshaft a function that has let bindings. I believe, with a cursory glance, that locals that are not captured are still allocated on a scope chain -- but perhaps I am misreading things.

Firefox, on the other hand, would not Ion-compile any function with a let. You would think that this would not be the case, given that Firefox has had let for many years. If you thought this, you misunderstand how browser development works. Basically let me lay it out for you:

class struggle : marxism :: benchmarks : browsers

Get it? Benchmarks are the motor of browser history. If this bloviation has any effect, please let it be to inspire you to go and make benchmarks!

Anyway, Firefox. No Ion for let. The reason why you would avoid optimizing these things is clear -- breaking the "optimization unit == scope unit" equivalence has a cost. However the time has come.

What I have done is to recognize let scopes that have no captured locals. This class of let scope will now be optimized by Ion. So for example in this case:

function sumto(n) {
  let sum = 0;
  for (let i=0; i<n; i++)
    sum += i;
  return sum;
}

Previously this would allocate a block scope around the "for". (Incidentally, Firefox borks block scoping in this case; each iteration should produce a fresh binding. But this is historical, will be fixed, and I digress.) The operations that created and pushed block scopes were not optimized. Avoiding pushing and popping scopes from the scope chain avoids this limitation, and thus we have awesome speed!

The effect of simply turning on Ion cannot be denied. In this case, the runtime of a sum up to 1e9 goes from 8.8 seconds (!) to 0.8 seconds. Obviously it's a little micro-benchmark but that's how I'm rolling today. The biggest optimization is to stop deoptimization.

At the same time, we are still able to reify a parallel chain of "debug scopes" that correspond to the scope chain as the specification would see it. This was the most challenging part of optimizing block scope in Firefox -- not the actual optimization, which was trivial, but optimization while retaining introspection and debuggability.

future work

Unhappily, scopes with let-bound variables that are captured by nested scopes ("aliased", in spidermonkey parlance) are not yet optimized. So not only do they cause scope chain allocation, but they also don't benefit from Ion. Boo. Bug 942810.

I should also mention that the let supported by Firefox is not the let specified in ES6. In ES6 there is a thing called a "temporal dead zone" whereby it is invalid to access the value of a let before its initialization. This is like Scheme's "letrec restriction", and has to be enforced dynamically for similar reasons. V8 does a great job in actually implementing this, and Firefox should do it soon.
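
The Scheme analogue, for the curious: under the letrec restriction it is an error to evaluate a variable before its initializer has run, even though it is fine to close over it.

;; Fine: the lambda captures b without evaluating it.
(letrec ((a (lambda () b))
         (b 42))
  (a))                      ; => 42

;; Error: evaluating b's initializer calls (a), which reads b before
;; it has been initialized -- the letrec restriction, enforced
;; dynamically, just like a TDZ access.
(letrec ((a (lambda () b))
         (b (a)))
  b)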

Of course, it's not clear to me how let can actually be deployed without breaking the tubes. I think it probably can somehow but I haven't checked the latest TC39 cogitations in that regard.

twisted paths

It's been a super-strange end-of-year. I was sure I would be optimizing SpiderMonkey generators and moving on to other things, but I just got caught up with stuff -- the for-of bits took approximately forever and then the amount of state carried in SpiderMonkey stack frames filled me with despair. So it was that this hack, while assisting me in that goal, was not actually a planned thing.

See, SpiderMonkey used to actually reserve a stack slot for the "block chain". No, they aren't using your browser to mine for Bitcoins, though that would be a neat hack. The "block chain" was a simulation of the spec-mandated scope chain. Of course that might not reflect the actual, optimized behavior -- though one still wants to map the two to each other for debugging purposes. It was a mess.

The resulting changeset to fix this ended up so large that it took weeks to land. Well, live and learn, right? I remember Michael Starzinger telling me the same thing about V8 -- you just have to keep your patches small, as small as possible, and always working. Words to the wise indeed.

happy days

But in the end we at least have some juice from this citric fruit. This has been an Igalia joint. Thanks very much to Mozilla's Luke Wagner for suffering through the reviews.

Thanks especially to Bloomberg for making this work possible; you folks are swell. Forward ES6!

Syndicated 2013-12-18 20:00:23 from wingolog

a register vm for guile

Greetings, hacker comrades! Tonight's epistle is gnarly nargery of the best kind. See, we just landed a new virtual machine, compiler, linker, loader, assembler, and debugging infrastructure in Guile, and stories like that don't tell themselves. Oh no. I am a firm believer in Steve Yegge's Big Blog Theory. There are nitties and gritties and they need explication.

a brief brief history

As most of you know, Guile is an implementation of Scheme. It started about 20 years ago as a fork of SCM.

I think this lines-of-code graph pretty much sums up the history:

[figure: lines of code in Guile over time, C versus Scheme]

That's from the Ohloh, in case you were wondering. Anyway the story is that in the beginning it was all C, pretty much: Aubrey Jaffer's SCM, just packaged as a library. And it was C people making it, obviously. But Scheme is a beguiling language, and over time Guile has had a way of turning C hackers into Scheme hackers.

I like to think of this graph as showing my ignorance. I started using Guile about 10 years ago, and hacking on it in 2008 or so. In the beginning I was totally convinced by the "C for speed, Scheme for flexibility" thing -- to the extent that I was willing to write off Scheme as inevitably slow. But that's silly of course, and one needs no more proof than the great performance JavaScript implementations have these days.

In 2009, we merged in a bytecode VM and a compiler written in Scheme itself. All that is pretty nifty stuff. We released that version of Guile as 2.0 in 2011, and that's been good times. But it's time to move onward and upward!

A couple of years ago I wrote an article on JavaScriptCore, and in it I spoke longingly of register machines. I think that's probably when I started to make sketches towards Guile 2.2, after having spent time with JavaScriptCore's bytecode compiler and interpreter.

Well, it took a couple of years, but Guile 2.2 is finally a thing. No, we haven't even made any prereleases yet, but the important bits have landed in master. This is the first article about it.

trashing your code

Before I start trashing Guile 2.0, I think it's important to say what it does well. It has a great inlining pass -- better than any mainstream language, I think. Its startup time is pretty good -- around 13 milliseconds on my machine. It runs faster than other "scripting language" implementations like Python (CPython) or Ruby (MRI). The debugging experience is delightful. You get native POSIX threads. Plus you get all the features of a proper Scheme, like macros and delimited continuations and all of that!

But the Guile 2.0 VM is a stack machine. That means that its instructions usually take their values from the stack, and produce values (if appropriate) by pushing values onto the stack.

The problem with stack machines is that they penalize named values. If I realize that a computation is happening twice and I factor it out to a variable, that means in practice that I allocate a stack frame slot to the value. So far so good. However, to use the value, I have to emit an instruction to fetch the value for use by some other instruction; and to store it, I likewise have to have another instruction to do that.

For example, in Guile 2.0, check out the bytecode produced for this little function:

scheme@(guile-user)> ,disassemble (lambda (x y)
                                    (let ((z (+ x y)))
                                      (* z z)))

   0    (assert-nargs-ee/locals 10)     ;; 2 args, 1 local
   2    (local-ref 0)                   ;; `x'
   4    (local-ref 1)                   ;; `y'
   6    (add)
   7    (local-set 2)                   ;; `z'
   9    (local-ref 2)                   ;; `z'
  11    (local-ref 2)                   ;; `z'
  13    (mul)
  14    (return)

This is silly. There are seven instructions in the body of this procedure, not counting the prologue and epilogue, and only two of them are needed. The cost of interpreting a bytecode is largely dispatch cost, which is linear in the number of instructions executed, and we see here that we could be some 7/2 = 3.5 times as fast if we could somehow make the operations reference their operands by slot directly.

register vm to the rescue

The solution to this problem is to use a "register machine". I use scare quotes because in fact this is a virtual machine, so unlike a CPU, the number of "registers" is unlimited, and in fact they are just stack slots accessed by index.

So in Guile 2.2, our silly procedure produces the following code:

scheme@(guile-user)> ,disassemble (lambda (x y)
                                    (let ((z (+ x y)))
                                      (* z z)))

   0    (assert-nargs-ee/locals 3 1)    ;; 2 args, 1 local
   1    (add 3 1 2)
   2    (mul 3 3 3)
   3    (return 3)

This is optimal! There are four things that need to happen, and there are four opcodes that do them. Receiving operands and sending values is essentially free -- they are indexed accesses off of a pointer stored in a hardware register, into memory that is in cache.

This is a silly little example, but especially in loops, Guile 2.2 stomps Guile 2.0. A simple count-up-to-a-billion test runs in 9 seconds on Guile 2.2, compared to 24 seconds in Guile 2.0. Let's make a silly graph!

[figure: count-up-to-a-billion runtimes: Guile 2.0, 24 seconds; Guile 2.2, 9 seconds]

Of course if we compare to V8 for example we find that V8 does a loop-to-a-billion in about 1 second, or 9 times faster. There is some way to go. There are a couple of ways that I could generate better bytecode for this loop, for another 30% speed boost or so, but ultimately we will have to do native compilation. And we will! But that is another post.

gritties

Here's the VM. It's hairy in the prelude, and the whole thing is #included twice in another C file (for a debugging and a non-debugging mode; terrible), but I think it's OK for being in C. (If it were in C++ it could be nicer in various ways.)

The calling convention for this VM is that when a function is called, it receives its arguments on the stack. The stack frame looks like this:

   /------------------\
   | Local N-1        | <- sp
   | ...              |
   | Local 1          |
   | Local 0          | <- fp
   +==================+
   | Return address   |
   | Dynamic link     |
   +==================+
   :                  :

Local 0 holds the procedure being called. Free variables, if any, are stored inline with the (flat) closure. You know how many arguments you get by the difference between the stack pointer (SP) and the frame pointer (FP). There are a number of opcodes to bind optional arguments, keyword arguments, rest arguments, and to skip to other case-lambda clauses.

After deciding that a given clause applies to the actual arguments, a prelude opcode will reset the SP to have enough space to hold all locals. In this way the SP is only manipulated in function prologues and epilogues, and around calls.

Guile's stack is expandable: it is originally only a page or two, and it expands (via mremap if possible) by a factor of two on every overflow, up to a configurable maximum. At expansion you have to rewrite the saved FP chain, but nothing else points in, so it is safe to move the stack.

To call a procedure, you put it and its arguments in contiguous slots, with no live values below them, and two empty slots for the saved instruction pointer (IP) and FP. Getting this right requires some compiler sophistication. Then you reset your SP to hold just the arguments. Then you branch to the procedure's entry, potentially bailing out to a helper if it's not a VM procedure.

To return values, a procedure shuffles the return values down to start from slot 1, resets the stack pointer to point to the last return value, and then restores the saved FP and IP. The calling function knows how many values are returned by looking at the SP. There are convenience instructions for returning and receiving a single value. Multiple values can be returned on the stack easily and efficiently.

Each operation in Guile's VM consists of a number of 32-bit words. The lower 8 bits of the first word indicate the opcode. The width and layout of the operands depend on the operation. For example, MOV takes two 12-bit operands. Of course, 4096 locals may not be enough. For that reason there is also LONG-MOV, which has two words and takes two 24-bit operands. In LONG-MOV there are 8 bits of wasted space, but I decided to limit the local frame address space to 24 bits.

In general, most operations cannot address the full 24-bit space. For example, there is ADD, which takes two 8-bit operands and one 8-bit destination. The plan is to have the compiler emit some shuffles in this case, but I haven't hit it yet, and it was too tricky to try to get right in the bootstrapping phase.
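
As a sketch of what decoding such a word might look like -- the opcode numbers below are invented for the example, not Guile's actual assignments:

;; Unpack the first 32-bit word of an operation.  The low 8 bits name
;; the opcode; MOV then has two 12-bit operands, ADD three 8-bit ones.
(define (decode-op word)
  (case (logand word #xff)
    ;; hypothetical opcode numbers, for illustration only
    ((1) ; MOV dst:12 src:12
     (list 'mov
           (logand (ash word -8) #xfff)
           (logand (ash word -20) #xfff)))
    ((2) ; ADD dst:8 a:8 b:8
     (list 'add
           (logand (ash word -8) #xff)
           (logand (ash word -16) #xff)
           (logand (ash word -24) #xff)))
    (else (list 'unknown word))))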

JavaScriptCore avoids the address space problem by having all operands be one full pointer wide. This wastes a lot of memory, but they lazily compile and can throw away bytecode and reparse from source as needed, neither of which is true for Guile. We aim to do good ahead-of-time compilation, to enable self-hosting of the compiler.

JSC's pointer-wide operands do provide the benefit of allowing the "opcode" word to actually hold the address of the label, instead of an index to a table of addresses. This is a great trick, but again it's not applicable to Guile as we don't want to relocate bytecode that we load from disk.

Relative jumps in Guile's VM are 24 bits wide, and are measured in 32-bit units, giving us effectively a 26 bit jump space. Relative references -- references to static data, or other procedures -- are 32 bits wide. I certainly hope that four gigabytes in a compilation unit is enough! By the time it is a problem, hopefully we will be doing native compilation.

Well, those are the basics of Guile's VM. There's more to say, but I already linked to the source, so that should be good enough :) In some future dispatch, we'll talk about the other parts of Guile 2.2. Until then!

Syndicated 2013-11-26 22:07:55 from wingolog

scheme quiz time!

Scheme quiz time!

Consider the following two functions:

(define (test1 get)
  (let ((v (make-vector 2 #f)))
    (vector-set! v 0 (get 0))
    (vector-set! v 1 (get 1))
    v))

(define (test2 get)
  (let* ((a (get 0))
         (b (get 1)))
    (vector a b)))

Assume the usual definitions for all of the free variables like make-vector and so on. These functions both create a vector with two elements. The first element is the result of a call to (get 0), where get is a function the user passes in as an argument. Likewise the second comes from (get 1).

(test1 (lambda (n) n)) => #(0 1)
(test2 (lambda (n) n)) => #(0 1)

So the functions are the same.

Or are they?

Your challenge: write a standard Scheme function discriminate that, when passed either test1 or test2 as an argument, can figure out which one it is.

. . .

Ready? If you know Scheme, you should think on this a little bit before looking at the answer. I'll wait.

. . .

OK!

We know that in both functions, two calls are made to the get function, in the same order, and so really there should be no difference whatsoever.

However there is a difference in the continuations of the get calls. In test1, the continuation includes the identity of the result vector -- because the vector was allocated before the get calls. On the other hand test2 only allocates the result after the calls to get. So the trick is just to muck around with continuations so that you return twice from a call to the test function, and see if both returns are the same or not.

(define (discriminate f)
  (let ((get-zero-cont #f)
        (first-result #f))
    (define (get n)
      (when (zero? n)
        (call/cc (lambda (k)
                   (set! get-zero-cont k))))
      n)
    (let ((result (f get)))
      (cond
       (first-result
        (eq? result first-result))
       (else
        (set! first-result result)
        (get-zero-cont))))))

In the call to f, we capture the continuation of the entry to the (get 0) call. Then later we re-instate that continuation, making the call to f return for a second time. Then we see if both return values are the same object.

(discriminate test1) => #t
(discriminate test2) => #f

If they are the same object, then the continuation captured the identity of the result vector -- and if not, the result was only allocated after the get calls.

so what?

Unhappily, this has practical ramifications. In many compilers it would be advantageous to replace calls to vector with calls to make-vector plus a series of vector-set! operations. Such a transformation lowers the live variable pressure. If you have a macro that generates a bison-like parser whose parse table is built by a call to vector with 400 arguments -- this happens -- you'd rather not have 400 live variables in the function that builds that table. But this isn't a safe transformation to make, unless you can prove that no argument captures the current continuation. Happily, for the parser generator macro this is the case, but it's not something to bet on.
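
To spell out the transformation in question, on a three-element call (the e's stand for arbitrary argument expressions):

;; Before: e0, e1, and e2 are all live until the call to vector.
(vector e0 e1 e2)

;; After: each value can be stored, and die, as soon as it is computed.
(let ((v (make-vector 3 #f)))
  (vector-set! v 0 e0)
  (vector-set! v 1 e1)
  (vector-set! v 2 e2)
  v)

And the rewritten form is exactly what test1 does: it exposes the vector's identity to any continuation captured by one of the argument expressions.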

It gets worse, though. Since test1 returns the same object, it is possible to use continuations to mutate previous return values, with nary a vector-set! in sight!

(define (discriminate2 f)
  (let ((get-zero-cont #f)
        (escape #f))
    (define (get n)
      (case n
        ((0) (call/cc (lambda (k)
                        (set! get-zero-cont k)
                        0)))
        ((1) (if escape
                 (escape)
                 1))))
    (let ((result (f get)))
      (call/cc
       (lambda (k)
         (set! escape k)
         (get-zero-cont 42)))
      result)))

(discriminate2 test1) => #(42 1)
(discriminate2 test2) => #(0 1)

This... this is crazy.

story time

Now it's story time. Guile has a bytecode VM, and usually all code is compiled to that VM. But it also has an interpreter, for various purposes, and that interpreter is fairly classic: it's a recursive function that takes a "memoized expression" and an environment as parameters. Only, the environment was silly -- it was just a list of values. Before evaluating, a "memoizer" runs to resolve lexical references to indexes in that list, and entering a new lexical contour conses on that list.

Well of course that makes lexical variable lookup expensive. It usually doesn't matter as everything is compiled, but it's a bit shameful, so I rewrote it recently to use two-dimensional environments. Let me drop some ASCII on you:

   +------+------+------+------+------+
   | prev |slot 0|slot 1| ...  |slot N|
   +------+------+------+------+------+
      \/
   +------+------+------+------+------+
   | prev |slot 0|slot 1| ...  |slot N|
   +------+------+------+------+------+
      \/
     ...
      \/
   toplevel

It's a chain of vectors, linked through their first elements. Resolving a lexical in this environment has two dimensions, the depth and the width.
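
Here is a sketch of accessors consistent with that layout and with the eval clause below -- `make-env' and `env-set!' are called there, and `env-ref' is the obvious read counterpart; the real definitions may differ.

;; Slot 0 of each rib links to the enclosing environment; slots
;; 1..width hold the values bound in this contour.
(define (make-env width init prev)
  (let ((env (make-vector (1+ width) init)))
    (vector-set! env 0 prev)
    env))

;; Walk out DEPTH links, then index WIDTH slots into that rib.
(define (env-ref env depth width)
  (if (zero? depth)
      (vector-ref env (1+ width))
      (env-ref (vector-ref env 0) (1- depth) width)))

(define (env-set! env depth width val)
  (if (zero? depth)
      (vector-set! env (1+ width) val)
      (env-set! (vector-ref env 0) (1- depth) width val)))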

Vectors.

You see where I'm going with this?

I implemented the "push-new-environment" operation as a sequence of make-vector and an eval-and-vector-set! loop. Here's the actual clause that implements this:

(define (eval exp env)
  (match exp
    ...
    (('let (inits . body))
     (let* ((width (vector-length inits))
            (new-env (make-env width #f env)))
       (let lp ((i 0))
         (when (< i width)
           (env-set! new-env 0 i (eval (vector-ref inits i) env))
           (lp (1+ i))))
       (eval body new-env)))
    ...))

This worked fine. It was fast, and correct. Or so I thought. I used this interpreter to bootstrap a fresh Guile compile and all was good. Until running those damned test suites that use call/cc to return multiple times from let initializers, as in my discriminate2 test. While the identity of env isn't visible to a program as such, the ability of call/cc to peel apart allocation and initialization of the environment vector makes this particular implementation strategy not viable.

In the end I'll inline a few arities, and then have a general case that allocates heap storage for the temporaries:

(case (vector-length inits)
  ((0) (vector env))
  ((1) (vector env (eval (vector-ref inits 0) env)))
  ...
  (else
   (list->vector
    (cons env
          (map (lambda (x) (eval x env))
               (vector->list inits))))))

Of course I'd use a macro to generate that. It's terrible, but oh well. It is what it is.

Syndicated 2013-11-02 13:42:06 from wingolog

es6 generators and iteration in spidermonkey

It is my great pleasure to announce that SpiderMonkey, the JavaScript engine of the Firefox browser, just got support for ES6 generators and iteration!

the chase

As I have written before, generators are functions that can suspend. SpiderMonkey has had generators since Firefox 2 (really!), but the new thing is that now their generators are standard, and available by default. No need to futz with Firefox-specific "language versions": there is just 1JS now.

My last article extolled the virtues of generators for asynchronous programs, and since then there have been a number of great articles and talks around the web on that topic. Generators are a kind of coroutine, and coroutines work well for asynchronous computation.

But that's not all that coroutines are good for. Another great application of generators is in iteration, where you have a producer that knows how to produce values, and a consumer that knows how to process them. Being able to write the producer as a coroutine is a great win.

words, words, words

This is a bit abstract, so let me show some code. Let's say you are implementing an auto-complete widget, so you use a trie. Here is a simple trie:

function Trie() {
  this.kids = Object.create(null);
  this.terminal = false;
}

Trie.prototype.insert = function (str) {
  if (str.length == 0) {
    this.terminal = true;
  } else {
    var node = this.kids[str[0]];
    if (!node)
      this.kids[str[0]] = node = new Trie;
    node.insert(str.slice(1));
  }
}

This kind of data structure is nice because it encodes the minimal amount of information. Unlike some of the examples on Wikipedia, the information from the prefix is not contained in the suffix.

Let's add some words:

var db = new Trie;
['fee', 'fie', 'fo', 'foo'].forEach(db.insert.bind(db));

Now, how do we get the completions?

Trie.prototype.completions = function (prefix) {
  // ?
}

It would be nice to avoid making an array of all of the possible completions, because who knows how many we're actually going to show. One of the whole points of using a trie was to avoid storing all of the strings into memory.

Generators offer a nice solution to this problem:

Trie.prototype.strings = function* () {
  if (this.terminal) yield '';
  for (var k in this.kids)
    for (var suffix of this.kids[k].completions(''))
      yield k + suffix;
}

Trie.prototype.completions = function* (prefix) {
  if (prefix.length) {
    var node = this.kids[prefix[0]];
    if (node !== undefined)
      for (var suffix of node.completions(prefix.slice(1)))
        yield prefix[0] + suffix;
  } else {
    yield* this.strings();
  }
}

Here you see that the information that is part of the prefix is now present implicitly, as part of the control stack. Operations on recursive data structures are naturally expressed with recursive functions.

Note that if I call db.strings(), nothing happens -- not even if there are millions of strings in the trie. Values get produced by the coroutine only as the consumer asks for them.

Speaking of consumers, "for-of" is usually the consumer that you want, as it is the built-in ES6 language construct for iteration. For example:

js> for (var word of db.completions('fo')) print(word)       
fo
foo
js> for (var word of db.completions('f')) print(word)  
fee
fie
fo
foo
js> for (var word of db.completions('q')) print(word)
js> 

This code is not just part of some future standard: it's available now in Firefox nightlies, out of the box. It's also available in Chrome, if you enable "experimental features".

niceties

What is also in Firefox, but not yet in Chrome, is support for custom iterators. Wouldn't it be nice if you could just say:

for (var word of db)
  print(word);

Well you can! But here it gets complicated.

The right-hand side of the for-of statement, in this case db, is the iterable. The iterator is derived from the iterable by calling the iterable's special @@iterator method. The @@ indicates that this method is keyed by a "well-known symbol".

Symbols are a new kind of data type in ES6, and their salient characteristic is that they can be used as unique property keys. You can imagine that in the future there is some piece of the JavaScript engine that, on startup, does something like this:

function Symbol() {}
Symbol.iterator = new Symbol;

and then uses that Symbol.iterator value as a property key.

However, Firefox does not implement symbols yet. In the meantime Firefox uses a special string as a stand-in for @@iterator. This name is not publicised anywhere, but you can get the value of @@iterator using this future-proof code:

const iterator = (function() {
  try {
    for (var _ of Proxy.create({get: function(_, name) { throw name; } }))
      break;
  } catch (name) {
    return name;
  }
  throw 'wat';
})();

However2! V8 implements for-of but not the @@iterator protocol, so if you target V8 also, you might find this gist to be useful.

Given iterator, we can now make Trie iterable directly:

Trie.prototype[iterator] = Trie.prototype.strings;

js> for (var word of db)
      print(word);
fee
fie
fo
foo

how did we get here?

That's the status: Firefox does ES6 generators and iterators. Excellent!

So, I have to toot some horns and ring some bells here. I implemented ES6 generators and iteration in SpiderMonkey as part of my work with Igalia, your friendly neighborhood browser consultancy. I did the same in V8 a few months ago.

It seems odd, on the face of it, to hack on core components of two major browser engines, while not being employed by either associated company. We think of Google, Mozilla, Apple, and Microsoft as the ones that are really "driving the web", either because it's strategic for them, or because they are shamed by their competitors into doing so. But that's not the totality of the browser ecosystem.

To develop the web platform, there is a simple condition: you have to ship a browser, or components of a browser (e.g. a JS engine). If you do that, you can usefully invest in improving your tools, to make them better for you in some way. The magical part is that when you improve a browser, you improve the web.

Since three of the four main browser engines are Free Software, it's possible for third parties that ship browsers to meaningfully contribute to them -- and happily, this improves not only their products but also the web as a whole.

So for ES6 generators and iteration, big ups to Bloomberg, who sponsored this work. More expressive language features help them better develop their products, both on the server and client side. Because they deploy V8 and SpiderMonkey components, they were in this magical position of being able to usefully invest in the web platform, making it better for all of us. Thanks!

Thanks also to the great Mozilla folks, who were very kind, helpful, and forgiving of my mozilla-rookie mistakes. It was also a privilege to be able to get such a close insight into the workings of Firefox development. Thanks especially to Jason Orendorff and Jeff Walden for their detailed patch reviews, and to Till Schneidereit for enduring many stupid questions over IRC :) Y'all are great!

next

I have a richly variegated set of minor bugs to fix up with this work, so while I've still got this all paged in, do be sure to give a try to the Firefox nightlies and see if this works for you. File bugs in Bugzilla and copy me (:wingo) on them. Thanks in advance for submitting bugs there and not in the blog comments :) Happy hacking!

Syndicated 2013-10-07 15:17:56 from wingolog
