Older blog entries for wingo (starting at number 368)

guile 2.0.2: building the guildhall

Hey tubes, we made another Guile release!

what is guile

Guile is the GNU extension language. It's an implementation of Scheme. It's pretty fun to hack in, and on!

guile 2.0.2: even faster!

Here I'd like to highlight the thing that I really like about this release: performance. Relative to 2.0.1:

  • The VM is 20% faster

  • Compiled .go files are half the size

  • Startup time is down from 16.2 milliseconds to 11.7 milliseconds (on my machine)

  • Base memory use down from 3.8 MB of private dirty memory to 2.9MB (measured on a guile -c '(sleep 100)')

In short, thundercats are totally on the loose! See the release announcement for all the details.

where we're going

It's hard to tell sometimes where this jalopy is headed. Sometimes the vehicle just drives itself, like the recent speed improvements, which were totally unplanned.

But that said, there are a few stops we want to make on this journey, in the near- and mid-term. Let's take a look.

2.0: building the guildhall

On the 2.0 branch, the current stable series, the big feature that we'd like to land soon will be a CPAN-like repository of installable Scheme modules.

We agonized over the name for a while. It needs to be short and memorable, so it can be used as a command-line tool. (The Git experience is pretty definitive here.) At the same time it needs to be linked to Guile, and follow some sort of theme.

What we've done now is to take our guile-tools script runner (equivalent to the git wrapper) and rename it to guild. Currently it's mostly used for compiling files, as in guild compile -o foo.go foo.scm, but in the future we want to use it as a part of this system for Guile wizards and journeyfolk to band together to share code---to form a guild, so to speak.

And so the package archive itself will be the "guild hall" (or "guildhall"). We'll see if we can get guildhall.gnu.org, as a place to host the packages.

As far as code, protocol, and formats go, we are looking at porting Andreas Rottmann's Dorodango Scheme package manager to be Guile-specific, with guild commands. So guild update, guild upgrade, guild install xml-rpc, etc.

Dorodango is basically a port of Debian's apt to Scheme. It even has ports of the same dependency resolution algorithms, which is fun. It also allows for multiple sources, so we can have our Guile guildhall archive, but also have support for portable R6RS Scheme archives from Dorodango.

2.2: bit mucking

Guile is getting pretty good now, but there is more to be done before it is as cheap to run a program written in Scheme as it is to run a program written in C.

The first thing that we will do to help this is to switch to ELF as the format of our compiled code. We will do this on all platforms, and even before we do native compilation. The bytecode will simply be stored in an ELF section. This will also allow us to statically allocate more data, decreasing memory usage and increasing locality. Also our debugging information is currently interspersed with code -- not in the execution path, mind you, but it does make the memory profile of the code look more like emmental than comté cheese. Moving debug info out to a separate section will help with that.

ELF will also add flexibility. We can add sections to express dependencies of compiled code on source code. We can strip debug info and other things from compiled code. We can link multiple modules into one file, to avoid stat penalties on startup.

In addition ELF will allow us to add native code, in a separate section. Of course we have to actually produce native code, so that's the next big thing for Guile 2.2. The plan is to write an ahead-of-time code generator for x86 and x86-64 systems, while keeping the bytecode interpreter as well. Then once we know what we want, we can go and talk to the GCC folks and see if our use case can be supported there.

Anyway, that's the plan. I expect 2.2 to take another year or so. By the time we have native compilation, I hope also our Elisp implementation will be significantly faster than the one in Emacs, so it will really become attractive to people to finish the work that we have done to make the switch there.

Syndicated 2011-07-04 09:47:25 from wingolog

security implications of jit compilation

Periodically the tubes alight with statements that just-in-time (JIT) compilers are somehow bad for security. At Igalia we work with many companies that have put or are interested in putting JIT compilers on mobile devices, and obviously security is one of their concerns. In this article I'd like to take a closer look at the risks and myths surrounding JIT and security.

executive summary

JIT compilers do not cause a significant security risk in and of themselves.

However, JIT compilers do allow flaws in unsafe native libraries to be more easily exploited.

The problem is not JIT compilation. The problem is code written in unsafe languages.

a background: interpreters and security

It used to be that JavaScript implementations all used interpreters to execute JavaScript code, usually in bytecode form. An interpreter is a program, usually written in C or C++, which interprets program code as data.

(I will take JavaScript as the example in this article, though similar considerations apply to other language implementations.)

JavaScript implementations also includes a language runtime, which implements all of the "primitive" objects of the language. In this case of JavaScript this might include Date objects and such. The runtime also includes a garbage collector, a parser, and other such things.

Sometimes interpreters are used to run "external" code, as is the case with JavaScript. In that case the language implementation is a fairly clear attack target: the attacker tries to produce an input which triggers a flaw in the language implementation, allowing takeover of the target system.

Note that in this case, the interpreter itself is only part of the target. Indeed the interpreter is typically only a small part of a language implementation, and is very well-tested. Usually the vulnerability is in some bit of the runtime, and often stems from having the runtime written in an unsafe language like C or C++.

So the interpreter part itself typically imposes no risk, beyond the obvious ones of interpreting data provided by an attacker.

compilers have more moving pieces

Instead of interpreting a program as data, compilers produce native code. A compiler is bigger than an interpreter, and so exposes a larger attack surface. A compiler usually has more corner cases than an interpreter, which often are not tested properly. Compilers are usually specific to particular target architectures, which makes bugs in less-used architectures more likely. Additionally, by generating native code, a compiler operates outside whatever meager safety guarantees are provide by the implementation language (C or C++).

So yes, compilers are a risk, JIT or otherwise. However with proper testing it is possible to have confidence in a compiler. First of all, a compiler is written by compiler writers, not ordinary programmers, and not attackers. Secondly a compiler is generally a functional process, so it's easy to test. Modern JS implementations have thousands and thousands of tests. It's still a cross-your-fingers-and-pray thing, usually, but the fact is that exploits these days come from bugs present in the source of unsafe languages, not bugs in code generated from safe languages.

jit compilation allows performant mobile code

JIT compilation can actually increase the overall security of a system, if one of the design goals of that system is user-installable "apps". For example, Android apps are (typically) written in Java, targetting the Dalvik VM. This choice of safe language and safe runtime largely eliminates the need to make a tight sandbox around apps. If it weren't for the presence of unsafe code in the libraries available to an Android app, things would be even easier.

Note that Dalvik's JIT has other positive security characteristics, in that its generated code can be verified against the bytecodes that it is supposed to implement. I recommend this video for a good discussion; thanks to my colleague Xan for pointing it out.

In contrast, Apple's decision to allow natively-compiled applications to be installed by users leaves them scrambling to try to mitigate the effects of what happens when an app has a buffer overflow. Again, the problem is unsafe code, not JIT compilation.

attempts to contain security vulnerabilities in unsafe code

That said, unsafe languages are part of the sordid reality, so let's look at how attackers compromise such applications.

In the old days, once you got a stack buffer overflow, you just wrote whatever code you wanted onto the stack, and arranged for a function return to execute that code. Smashing the stack for fun and profit, you see.

These days it is more difficult. One of the things that makes it more difficult is making it impossible to write code which is later executed. This technique goes under various names: the NX bit, DEP, W⊕X, etc. Basically it means you can't write shellcode. But, you can return to libc, which uses existing functions in libc (like system).

Some systems also prevent you from running code on disk unless it is signed. Dion Blazakis wrote a nice overview on the sandbox in iOS devices (subtitle: using Scheme for evil). Besides not letting you write code for your own device -- a terrible thing IMO, but not related to security -- this technique does prevent attackers from getting around the NX restriction by writing code to disk, then executing it.

But one need not write code to perform arbitrary computation. I assume everyone's seen the following papers, and if not, I highly recommend them:

Finally, instead of loading libc always at the same place, the latest thing to happen is to randomize the positions at which the various segments are loaded in memory. This is called Address Space Layout Randomization (ASLR). To effectively execute a return to libc under ASLR, you have to figure out where libc is first; not a trivial task.

I think I need to mention again that these are not security measures per se. If you are really interested in security, don't write systems software in unsafe languages. These measures are designed to contain the effects of a vulnerability.

I should also note that all of these measures are actively harmful to people writing code. ASLR increases load time by disallowing prelinking of various sorts, and forcing the generation of position-independent code. Disallowing writable, executable memory means that you can't do JIT compilation, not to mention quajects. Disallowing unsigned binaries means that you can't run your code on your own device.

jit compilation is not compatible with these restrictions

Having a JIT compiler by definition means that you are writing code and then going to run it. This makes it easier for an attacker who has found a bug in a native library to exploit it.

These exploits do happen, though currently not very often. A good example of such a paper is Interpreter Exploitation: Pointer Inference And JIT Spraying by Dion Blazakis.

Blazakis uses characteristics of the language runtime (Flash, in that case) to allow him to determine the address ranges to write his code. The actual code is written by taking advantage of the predictable output of the particular JIT in question to spray a big NOPsled onto the heap, which if I understand it correctly, is implemented by starting the IP in the middle of x86 instructions, in lovely contrapuntal style. Very clever stuff.

But again, the vulnerability is present not in code related to the JIT, but in native code. Having executable pages makes the vulnerability easier to exploit, but is not itself the security risk.

specific jit implementations

I called out Dalvik as an example of a JIT implementation that is particularly security-conscious. As much as I like V8, though, I cannot recommend it particularly, from the security perspective. It is simply changing too fast. Practically every commit touches code generators. The individual architectures share some pieces but duplicate a lot. I cannot speak for the other JavaScript implementations, as I do not know them as well.

V8's primary use case in the Chromium browser does have some security safeguards though, benefitting from the process sandbox and rapid release cycle. But I would be hesitant to put it in a device, lacking the same safeguards.

mitigations for jit vulnerabilities

These days all signs are pointing to the end of sandboxing as a security strategy, replacing it instead with verification and software fault isolation (SFI). Google's Native Client group is leading the way here. For more, see Adapting software fault isolation to contemporary CPU architectures by Sehr et al. But it does seem a poor thing to try to turn an unsafe language into a "fault-isolated" runtime; better to start with safe code to begin with.

Adapting SFI to JIT compilers could protect against flaws in JIT compilers, but it has abysmal results. The authors of that paper adapted V8 to produce the code sequences needed to preserve the SFI invariants, and got killed in speed. They somehow managed to claim success though, which is beyond me.

in closing

If you've already decided that you're going to ship remotely exploitable libraries written in unsafe languages, disallowing JIT can indeed reduce your attack surface. I understand it in some cases. But it is a shame to make safe languages pay the sins of their unsafe brethren. If you can, be Android, and not iOS.

Syndicated 2011-06-21 17:00:25 from wingolog

on-stack replacement in v8

A recent post looked at what V8 does with a simple loop:

function g () { return 1; }
function f () { 
  var ret = 0;
  for (var i = 1; i < 10000000; i++) {
    ret += g ();
  return ret;

It turns out that calling f actually inlines g into the body, at runtime. But how does V8 do that, and what was the dead code about? V8 hacker Kasper Lund kindly chimed in with an explanation, which led me to write this post.

When V8 compiled an optimized version of f with g, it decided to do so when f was already in the loop. It needed to stop the loop in order to replace the slow version with the fast version. The mechanism that it uses is to reset the stack limit, causing the overflow handler in g to be called. So the computation is stuck in a call to g, and somehow must be transformed to a call to a combined f+g.

The actual mechanics are a bit complicated, so I'm going to try a picture. Here we go:

As you can see, V8 replaces the top of the current call stack with one in which the frames have been merged together. This is called on-stack replacement (OSR). OSR takes the contexts of some contiguous range of function applications and the variables that are live in those activations, and transforms those inputs into a new context or contexts and live variable set, and splats the new activations on the stack.

I'm using weasel words here because OSR can go both ways: optimizing (in which N frames are combined to 1) and deoptimizing (typically the other way around), as Hölzle notes in "Optimizing Dynamically-Dispatched Calls with Run-Time Type Feedback". More on deoptimization some other day, perhaps.

The diagram above mentions "loop restart entry points", which is indeed the "dead code" that I mentioned previously. That code forms part of an alternate entry point to the function, used only once, when the computation is restarted after on-stack replacement.

details, godly or otherwise

Given what I know now about OSR, I'm going to try an alternate decompilation of V8's f+g optimized function. This time I am going to abuse C as a high-level assembler. (I know, I know. Play along with me :))

Let us start by defining our types.

typedef union { uint64_t bits; } Value;
typedef struct { Value map; uint64_t data[]; } Object;

All JavaScript values are of type Value. Some Values encode small integers (SMIs), and others encode pointers to Object. Here I'm going to assume we are on a 64-bit system. The data member of Object is a tail array of slots.

inline bool val_is_smi (Value v) 
{ return !(v.bits & 0x1); }

inline Value val_from_smi (int32 i)
{ return (Value) { ((uint64_t)i) << 32 }; }

inline int32 val_to_smi (Value v)
{ return v.bits >> 32; }

Small integers have 0 as their least significant bit, and the payload in the upper 32 bits. Did you see the union literal syntax here? (Value){ ((uint64_t)i) << 32 }? It's a C99 thing that most folks don't know about. Anyway.

inline bool val_is_obj (Value v)
{ return !val_is_smi (v); }

inline Value val_from_obj (Object *p)
{ return (Value) { ((uint64_t)p) + 1U }; }

inline Object* val_to_obj (Value v)
{ return (Object*) (v.bits - 1U); }

Values that are Object pointers have 01 as their least significant bits. We have to mask off those bits to get the pointer to Object.

Numbers that are not small integers are stored as Object values, on the heap. All Object values have a map field, which points to a special object that describes the value: what type it is, how its fields are laid out, etc. Much like GOOPS, actually, though not as reflective.

In V8, the double value will be the only slot in a heap number. Therefore to access the double value of a heap number, we simply check whether the Object's map is the heap number map, and in that case return the first slot, as a double.

There is a complication though; what is value of the heap number map? V8 actually doesn't encode it into the compiled function directly. I'm not sure why it doesn't. Instead it stores the heap number map in a slot in the root object, and stores a pointer into the middle of the root object in r13. It's as if we had a global variable like this:

Value *root;    // r13

const int STACK_LIMIT_IDX = 0;
const int HEAP_NUMBER_MAP_IDX = -7; // Really.

inline bool obj_is_double (Object* o)
{ return o->map == root[HEAP_NUMBER_MAP_IDX]; }

Indeed, the offset into the root pointer is negative, which is a bit odd. But hey, details! Let's add the functions to actually get a double from an object:

union cvt { double d; uint64_t u; };

inline double obj_to_double (Object* o)
{ return ((union cvt) { o->data[0] }).d; }

inline Object* double_to_obj (double d)
  Object *ret = malloc (sizeof Value * 2);
  ret->map = root[HEAP_NUMBER_MAP_IDX];
  ret->data[0] = ((union cvt) { d }).u;
  return ret;

I'm including double_to_obj here just for completeness. Also did you enjoy the union literals in this block?

So, we're getting there. Perhaps the reader recalls, but V8's calling convention specifies that the context and the function are passed in registers. Let's model that in this silly code with some global variables:

Value context;  // rsi
Value function; // rdi

Recall also that when f+g is called, it needs to check that g is actually bound to the same value. So here we go:

const Value* g_cell;
const Value expected_g = (Value) { 0x7f7b205d7ba1U };

Finally, f+g. I'm going to show the main path all in one go. Take a big breath!

Value f_and_g (Value receiver)
  // Stack slots (5 of them)
  Value osr_receiver, osr_unused, osr_i, osr_ret, arg_context;
  // Dedicated registers
  register int64_t i, ret;

  arg_context = context;

  // Assuming the stack grows down.
  if (&arg_context > root[STACK_LIMIT_IDX])
    // The overflow handler knows how to inspect the stack.
    // It can longjmp(), or do other things.
    handle_overflow ();

  i = 0;
  ret = 0;
  if (*g_cell != expected_g)
    goto deoptimize_1;

  while (i < 10000000)
      register uint64_t tmp = ret + 1;
      if ((int64_t)tmp < 0)
        goto deoptimize_2;

      ret = tmp;

  return val_from_smi (ret);

And exhale. The receiver object is passed as an argument. There are five stack slots, none of which are really used in the main path. Two locals are allocated in registers: i and ret, as we mentioned before. There's the stack check, locals initialization, the check for g, and then the loop, and the return. The test in the loop is intended to be a jump-on-overflow check.

restart_after_osr what?

But what about OSR, and what's that label about? What happens is that when f+g is replaced on the stack, the computation needs to restart. It does so from the osr_after_inlining_g label:

  if (val_is_smi (osr_i))
    i = val_to_smi (osr_i);
      if (!obj_is_double (osr_i))
        goto deoptimize_3;

      double d = obj_to_double (val_to_obj (osr_i));

      i = (int64_t)trunc (d);
      if ((double) i != d || isnan (d))
        goto deoptimize_3;

Here we take the value for the loop counter, as stored on the stack by OSR, and unpack it so that it is stored in the right register.

Unpacking a SMI into an integer register is simple. On the other hand unpacking a heap number has to check that the number has an integer value. I tried to represent that here with the C code. The actual assembly is just as hairy but a bit more compact; see the article for the full assembly listing.

The same thing happens for the other value that was live at OSR time, ret:

  if (val_is_smi (osr_ret))
    ret = val_to_smi (osr_ret);
      if (!obj_is_double (osr_ret))
        goto deoptimize_4;

      double d = obj_to_double (val_to_obj (osr_ret));

      ret = (int64_t)trunc (d);
      if ((double) ret != d || isnan (d))
        goto deoptimize_4;

      if (ret == 0 && signbit (d))
        goto deoptimize_4;
  goto restart_after_osr;

Here we see the same thing, except that additionally there is a check for -0.0 at the end, because V8 could not prove to itself that ret would not be -0.0. But if all succeeded, we jump back to the restart_after_osr case, and the loop proceeds.

Finally we can imagine some deoptimization bailouts, which would result in OSR, but for deoptimization instead of optimization. They are implemented as tail calls (jumps), so we can't represent them properly in C.

  return deopt_1 ();
  return deopt_2 ();
  return deopt_3 ();
  return deopt_4 ();

and that's that.

I think I've gnawed all the meat that's to be had off of this bone, so hopefully that's the last we'll see of this silly loop. Comments and corrections very much welcome. Next up will probably be a discussion of how V8's optimizer works. Happy hacking.

Syndicated 2011-06-20 13:14:03 from wingolog

V8 is faster than GCC

Do you like the linkbait title? Neither do I, but it is true in this case. Check it out.

After my last post, Benjamin noted that GCC would reduce my simple test to a mov rax, $10000000; ret sequence. Well yes, that's true, and GCC does do that: but only if GCC is able and allowed to do the inlining. So the equivalent of the test, for GCC, is to compile the g in a separate file:

unsigned int g (void) { return 1; }

Then in the main file, we have:

extern unsigned int g (void);

int main (int argc, char *argv) {
  unsigned int i, ret = 0;
  for (i = 0; i < NU; i++)
    ret += g ();
  return ret;

Here we replace N with the number of iterations we're interested in testing.

I decided to run such a test; the harness is here. I ran both GCC and V8 for the above program, for N from 20 to 231, and plotted the total time taken by V8 as compared to GCC.

Naturally this includes the GCC compile times, for which we expect GCC to lose at low iterations, but the question is, when does GCC start being faster than V8?

answer: never!

That's right, V8 is always faster than GCC, right up to the point at which its fixnums start failing. For the record, only the points on the right of the graph are really worth anything, as the ones on the left only run for a few milliseconds. Raw data here.

Why is V8 faster, even for significant iteration counts? Because it can inline hot function calls at runtime. It can see more than GCC can, and it takes advantage of it: but still preserving the dynamic characteristics of JavaScript. In this context the ELF symbol interposition rules that GCC has to deal with are not such a burden!

Syndicated 2011-06-10 15:57:58 from wingolog

what does v8 do with that loop?


I've spent the last month or so swimming in V8. Not the tomato juice, mind you, though that would be pleasant, but rather the JavaScript implementation.

In my last dispatch I made the assertion that the current JavaScript implementations are totally undocumented, but in the case of V8 that's not precisely true. There were at least two PhD dissertations written on it, along with a few technical reports and conference proceedings.

I refer of course to the Self work at Stanford and Sun, back in the late eighties and early nineties. I had read a number of their papers, but hadn't seen Hölzle's thesis yet. That one, for me, is the best of all, because it explains the Self implementation on an engineering level. The V8 implementation, I mean to say, because the vocabulary is entirely the same: maps, on-stack replacement, lazy deoptimization, etc. These are all parts of V8 now.

And it's no wonder, really. Lars Bak, the lead developer of V8, was there back then, hacking on Self, and then Strongtalk, then HotSpot, then a little startup built around virtual machines on mobile devices. So you see there's a reason why V8 doesn't have very much big-picture internal documentation -- Bak has been writing the same program for 20 years now; he knows how it works.

(Note however that Bak's name never appears in the V8 commit logs. You can see his hand at work, but never his face. Like Dr. Claw. Actually not very much like Dr. Claw but I just wanted to put that out there. Is Lars Bak Dr. Claw? Is he?)

enough with the personification

As you might recall, V8 always compiles JavaScript to native code. The first time V8 sees a piece of code, it compiles it quickly but without optimizing it. The initial unoptimized code is fully general, handling all of the various cases that one might see, and also includes some type-feedback code, recording what types are being seen at various points in the procedure.

At startup, if the "Crankshaft" optimizing compiler is enabled, V8 spawns off a profiling thread. If it notices that a particular unoptimized procedure is hot, it collects the recorded type feedback data for that procedure and uses it to compile an optimized version of the procedure. The old unoptimized code is then replaced with the new optimized code, and the process continues. (This on-stack replacement (OSR) process is covered in Hölzle's thesis.)

The optimized procedure does not actually cover all cases. If a callee procedure starts returning floating-point values where it was returning integer values before, the optimized procedure is deoptimized -- a relatively expensive process of recompiling the original procedure to the unoptimized form, replacing the optimized function on the stack with the unoptimized version, and then continuing the computation. This requires that the compiler keep around additional information about how to continue the computation -- you could be in the middle of a loop, after all.

Deoptimization is particularly tricky if the optimizing process inlined a callee procedure, and thus has to de-inline, replacing the activation of one optimized procedure call with two or more unoptimized calls. Again, Hölzle's thesis discusses this in depth. Inlining is such a win though that the complexity appears to be worth it.


I wanted to see what V8 does with a simple loop, but one for which lexical inlining isn't possible. Like this:

function g () { return 1; }
function f () { 
  var ret = 0;
  for (var i = 1; i < 10000000; i++) {
    ret += g ();
  return ret;

Pretty simple: adding 1 to a counter, 10 million times, but attempting to foil statically apparent inlining possibilities. I entered these two definitions at the d8 shell, invoked with the --code_comments --print_code options.

Note that running V8 in this way is going to spew a lot on your console, as V8 itself warms up. Startup is quick, though. On my modern laptop with an SSD, the debugging shell takes about 17ms to start up. The standard shell takes about 5ms to start. Both of these numbers are with snapshots on; without snapshots, the numbers are more like 32ms and 18ms, respectively. Just for comparison, the JavaScriptCore shell (jsc) takes about 12ms to start here.

Interestingly, V8's profiler decides that the best thing to do here is not to optimize g -- which it actually can't, as it's so small the unoptimized code is already optimal -- but to inline g into f, and optimize f.

V8 is able to do this inlining because it keeps around the parsed AST of every piece of code it sees. It needs some high-level internal representation, and it turns out that the AST is the easiest one: it's fairly small, it's already the input format to the "full-codegen" unoptimized compiler, and it also has all of the lexical information necessary to do good inlining analysis. Much easier than trying to decompile bytecode, it seems to me.

I did say that I was going to show some assembly, so here we go. This is what d8 prints out when evaluating f(). I've trimmed the output a bit. The comments on the right are my annotations.

Instructions (size = 466)
  0  push rbp           ;; Save the frame pointer.
  1  movq rbp,rsp       ;; Set the new frame pointer.
  4  push rsi           ;; Save the callee's "context object".
  5  push rdi           ;; Save the callee's JSFunction object.
  6  subq rsp,0x28      ;; Reserve space for 5 locals.

Here we have a standard prelude. The JavaScript calling conventions in V8 pass arguments on the stack, using rbp and rsp as callee-saved stack and frame pointers. Additionally, information associated with the function itself is passed in rsi and rdi: the context, and the function object itself. The context is an array optimized for looking up various information that the function needs at runtime, mostly free variables (lexical and global).

In this case it's redundant to take the context out of the function, but it does allow for faster access to the global scope object that was current when the function was defined. In the case of closures, every time the function() expression is evaluated, a new context will be created with a new set of free variables.

Anyway! Moving along:

 10  movq rax,rsi        ;; Store the context in a scratch register.
 13  movq [rbp-0x38],rax ;; And save it in the last (4th) stack slot.
 17  cmpq rsp,[r13+0x0]  ;; Check for stack overflow
 21  jnc 28              ;; If we overflowed,
 23  call 0x7f7b20f40a00 ;; Call the overflow handler.

Here we store the context again, in the last stack slot, and we check for overflow. On x86-64, V8 reserves r13 for a pointer to somewhere in the middle of a "global object", holding the GC roots for a given JavaScript execution context. There is a cell in the root list that holds the stack limit, which V8 abuses for other purposes: interrupting loops, breakpoints, deoptimization, etc. Again, Hölzle's thesis has some details on this technique.

The "somewhere in the middle" bit is arranged so that a simple dereference of r13 will allow us to get at the stack limit. V8 will reset the stack limit whenever it needs to interrupt a procedure.

Having passed the stack overflow check, we initialize local variables:

 28  movq rax,[rbp+0x10] ;; Receiver object to rcx (unused).
 32  movq rcx,rax        ;;
 35  movq rdx,[rbp-0x38] ;; Global object to rdx.
 39  movl rbx,(nil)      ;; Loop counter (i) to 0.
 44  movl rax,(nil)      ;; Accumulator (ret) to 0.
 49  jmp 97              ;; Jump over some stuff.

Note that the receiver object (the this object) is passed as an argument, and thus is above rbp.

Following this bit we have some code that appears to be completely dead. I'll include it for completeness, but unless the deoptimization bailouts jump back here, I don't know why it's there.

 54  movq rax,rsi        ;; Dead code.
 57  movq rbx,[rbp-0x28]
 61  testb rbx,0x1
 64  jnz 189  (0x7f7b20fa2a7d)
 70  shrq rbx,32
 74  movq rdx,[rbp-0x30]
 78  testb rdx,0x1
 81  jnz 237  (0x7f7b20fa2aad)
 87  shrq rdx,32
 91  movq rcx,[rbp-0x18]
 95  xchgq rax, rdx

We'll just forget that happened, shall we? However we are getting closer to the firm tofu of the matter, the loop. First one more check:

 97  movq rdx,[rsi+0x2f] ;; Slot 6 of the context: the global object.
101  movq rdi,0x7f7b20e401e8 ;; Location of cell holding `g'
111  movq rdi,[rdi]      ;; Dereference cell
114  movq r10,0x7f7b205d7ba1 ;; The expected address of `g'
124  cmpq rdi,r10        ;; If they're not the same...
127  jnz 371             ;; Deoptimization bailout 2

Here we see if the current definition of g, the function that we inlined below, is actually the same as when the inlining was performed.

Note that on line 97, since pointers in V8 are low-bit tagged with 01, to access slot 6 (0-based) of the context object, we only need to add 0x2f instead of 0x30. Clever, right? But we don't actually need the global object here in the main loop, so we could have delayed this load until finding out that deoptimization was necessary. Perhaps it was needed though for GC reasons.

133  movq rdx,[rdx+0x27] ;; Another redundant load.
137  cmpl rbx,0x989680   ;; 10000000, you see.
143  jge 178             ;; If i >= 10000000, break.
149  movq rdx,rax        ;; tmp = ret
152  addl rdx,0x1        ;; tmp += 1
155  jo 384              ;; On overflow, deoptimize.
161  addl rbx,0x1        ;; i++
164  movq rax,rdx        ;; ret = tmp
167  cmpq rsp,[r13+0x0]  ;; Reload stack limit.
171  jnc 137             ;; Loop if no interrupt,
173  jmp 306             ;; Otherwise bail out.
178  shlq rax,32         ;; Tag rax as a small integer.
182  movq rsp,rbp        ;; Restore stack pointer.
185  pop rbp             ;; Restore frame pointer.
186  ret 0x8             ;; Return, popping receiver.

And that's it! It's a fairly tight loop: g is inlined of course, its return value is untagged to a native int32, as are the accumulator and loop counter variables. Of course, improvements are possible -- the loop could be unrolled a few times, range analysis could avoid the deoptimization check, the overflow check could possibly be cheaper, and indeed the whole thing could be folded, but all in all, good job, MAD kittens!

Note the interesting approach to tagging: instead of storing the integer in the lower 32 bits, shifted by one, it is stored in the upper 32 bits, without a shift.

Actually there's some more junk after all of this. Another dead code block, apparently meaning to deal with floating-point values, but totally unreferenced:

189  movq r10,[r13-0x38]
193  cmpq [rbx-0x1],r10
197  jnz 397
203  movsd xmm0,[rbx+0x7]
208  cvttsd2sil rbx,xmm0
212  cvtsi2sd xmm1,rbx
216  ucomisd xmm0,xmm1
220  jnz 397
226  jpe 397
232  jmp 74
237  movq r10,[r13-0x38]
241  cmpq [rdx-0x1],r10
245  jnz 410
251  movsd xmm0,[rdx+0x7]
256  cvttsd2sil rdx,xmm0
260  cvtsi2sd xmm1,rdx
264  ucomisd xmm0,xmm1
268  jnz 410
274  jpe 410
280  testl rdx,rdx
282  jnz 301
288  movmskpd xmm2,xmm0
292  andl rdx,0x1
295  jnz 410
301  jmp 91

And then there's our stack check handler, saving all of the registers of interest, and calling out:

306  push rax
307  push rcx
308  push rdx
309  push rbx
310  push rsi
311  push rdi
312  push r8
314  push r9
316  push r11
318  push r14
320  push r15
322  leaq rsp,[rsp-0x28]
327  movq rsi,[rbp-0x8]
331  xorl rax,rax
333  leaq rbx,[r13-0x20a4b70]
340  call 0x7f7b20fa25c0
345  leaq rsp,[rsp+0x28]
350  pop r15
352  pop r14
354  pop r11
356  pop r9
358  pop r8
360  pop rdi
361  pop rsi
362  pop rbx
363  pop rdx
364  pop rcx
365  pop rax
366  jmp 137

And finally the deoptimization bailouts:

371  movq r10,0x7f7b20f7c054
381  jmp r10
384  movq r10,0x7f7b20f7c05e
394  jmp r10
397  movq r10,0x7f7b20f7c068
407  jmp r10
410  movq r10,0x7f7b20f7c072
420  jmp r10

Whew! I'd say that's long enough for today. I wanted to establish a fixed point on the low level though, so that I could write in the future about how it is that V8 compiles down to this code, what deoptimization is about, and such; filling in the middle, between the assembly and the JavaScript source. Corrections and suggestions in the comments please. Happy hacking!

Syndicated 2011-06-08 14:33:55 from wingolog

value representation in javascript implementations

In my last note I mentioned that I had been doing a lot of reading of JavaScript implementation code. One point that I didn't mention is that the state of the art is completely undocumented. So to work on rectifying that, here's the first in what might be a series of articles about the internals of JavaScript implementations.


Initially, all JavaScript implementations used tagged pointers to represent JS values. This is a old trick that comes from the observation that allocated memory takes up at least 4 or 8 bytes, and are aligned in such a way that the least significant bit or three will be zero.

Here's an old part of the Guile manual that explains the technique. For example you could take values whose lowest bit is 1 to be a pointer to the heap. Any value that ends in 0 could be an integer, represented directly in the bits of the pointer. (If your pointer tag is not 0, you will have to mask off the low bits before dereferencing a pointer..)

Google's V8 JavaScript engine still does it this way. This yields 31-bit immediate signed integers; you can just do a signed right-shift to get the value. They call them "smi" values, for "small integers"; they are more commonly called "fixnums". There are other fun tricks you can do to avoid shifts for some common operations, like addition. Note that they also take advantage of the four-byte alignment (in their case) to encode another immediate value, "failure objects", using the other low bit.

The size of a tagged pointer is the size of a pointer, so 32 bits on a 32-bit machine, and 64 on a 64-bit machine. V8 doesn't actually use 63-bit smi values on 64-bit machines, AFAIK.


JavaScriptCore, the implementation used by WebKit, switched from tagged pointers to a "nan-boxed" format. The best explanation of this strategy is given by Rob Sayre of Mozilla. I believe the technique was first made popular by LuaJIT, though I would appreciate additional references.

Sayre does a good job of explaining things, but to recap, there are about 253 bit patterns that represent not-a-number values for IEEE-754 double-precision floats, but that only one of those is actually emitted by current hardware and software. So cheeky implementations are free to use the other patterns for their own purposes.

What I always wondered when I heard about this strategy was how 53 bits were possibly enough, for a 64-bit machine. How are you going to fit a pointer in that space?

Well it turns out that on x64-64 systems, you only have a 48-bit address space, currently anyway. The shape of that space is quite peculiar, too. I don't know whether Linux gives out addresses in the upper half, but Windows does not, though Solaris does.

Anyway, the other way of looking at NaN-boxing is that there are 264 - 248 values in a pointer that aren't being used, and we might as well stuff something in them, and hey, the whole space of valid double precision numbers fits!

It's convenient too that numbers in JS are specified as doubles. (Even still, all implementations define separate "small integer" types, for use in loops and indexing and such; integer operations are faster when you can use them, too.)

JSC's implementation of nan-boxing is described here, and you can see some of the neat tricks they do here. It actually makes me envious of C++, as a C programmer!


So, when you choose to do nan-boxing, you basically choose to do one of two things: you favor pointers, or you favor doubles. To favor pointers means that you recognize pointers as having initial (most-significant) 0 bits; if the initial bits are not 0, then it's a double, and you have to add or subtract a bit pattern to get to the double value.

Favoring doubles means that pointers are left as NaN values, so their initial bits are all ones (or the sign bit can be unset, it doesn't matter), and to unpack a pointer, you have to rotate the double space around.

Amusingly, there is a third option as well. For 32-bit machines, you can address the second word in the double directly, so there is no need to mask off anything. This is the JSVALUE32_64 case mentioned in the JSValue code I linked to above.

JSC chose to favor pointers, and as the first JS implementation to nan-box, got to call their choice "nan-boxing". Mozilla chose to favor doubles, and so made up the silly name "nun-boxing".

get thee to a nun boxery?

So you're implementing a language. Should you nan-box or not? I can't say in your case but I can give a couple of examples.

NaN-boxing has the obvious advantage of not allocating doubles on the heap. This reduces cache pressure, GC pressure, and such. That's why Moz and JSC chose it.

V8 on the other hand has not chosen it, at least not yet, anyway. I think that one reason is because especially on embedded devices, and to an extent on ia32, passing around 64-bit values is a big lose. It's so bad that I understand that Mozilla actually passes these values by reference instead of by value in some places, on 32-bit systems only.

But the real reason that V8 doesn't nan-box I think is that they have a compiler that is able to unbox values in registers and in temporary stack locations, both as int32 values and as double values. This works for loops and such things in a function, and is obviously quite efficient, as you don't need to box and unbox at all. Passing doubles as arguments or return values however does allocate them on the heap, as far as I can tell anyway.

Also there is the hackery that with NaN-boxing, you assume things about your OS's memory management. A language implementation should be able to get around this with use of mmap at specific addresses, but still, it's tricky.

I looked at doing nan-boxing for Guile, and eventually decided against it. Guile has a few more constraints that V8 does not have. Firstly it is very portable, so it can't rely the address range constraints that x86-64 has, not without some hackery. It's also portable to the low end, like V8, so 64-bit values are a lose there.

But the killer is the conservative GC on 32-bit systems. If you represent integers with the tag in the high bits, as you would with nan-boxing, then the lower 32 bits are not distinguishable from a pointer, so they can cause the GC to retain heap objects for longer than it should. With tagged pointers, the low-bit masking lets the GC know that an integer is not a pointer. If you were to reintroduce low-bit tagging to a nan-boxed system, you've lost much of the advantage of it. Furthermore if you want to support more than 32 bits of precision in a fixnum, then on a 32-bit system you need to play games with sign extension of a 48-bit value to a 64-bit value, and for all word sizes it looks like the overhead will be significantly higher than tagging.

Finally I think that if we're really going to shoot for the moon, we should implement something like V8's Crankshaft compiler that does decent type inference, to allow unboxed values in registers and on the stack. (For readers of source code, "crankshaft" is basically the "hydrogen" flow-graph code plus a "lithium" assembler. Confusing, right?)

Well that's all for now. If you corrections, or an idea for a further topic, let me know in the comments. Thanks!

Syndicated 2011-05-18 16:57:41 from wingolog

javascript in 2011

Good evening tubes,

This hacklog needs an update, and how! I meant to write some stuff about Guile, which recently got some pretty great performance improvements, but instead bits on JavaScript were what trickled out of my fingers onto this Emacs. So here we go!

kung fu

I took a fairly deep dive into the world of JavaScript implementations recently, and man, what a world: a rotated-double-space punboxed jaeger-lithium crankshafted nitro-assembling trace stack of bit-froth. Yow! Now that's a fat value!

That probably makes no sense, right, but you have to understand that I've been jacked into this matrix for like three weeks now, and they speak a different language out there. If I were to draw some preliminary conclusions about the state of the art in JavaScript implementations, they would be these:

  • Trace looked fun, but it seems that it's possible to get trace-level performance with more traditional compilation techniques. (The trick appears to be to treat tagging, untagging and type checks as nodes in your control-flow graph.) V8 already has an SSA-based intermediate language, JavaScriptCore (the WebKit JavaScript implementation) is moving to one, and the above bug is from the Mozilla folk.

  • As a correlary to the previous point, data representation matters. It seems to me that JavaScriptCore's solid improvements have all been centered on cleanliness: keeping it simple, direct, and clean leads to speed. Right now JSC is trailing the pack again, but with a good SSA compiler, they could reach V8 again. Inner loops are important, but that's not where big applications spend most of their time.

  • The JS benchmarks are terrible, but everyone is using them to optimize their implementations. There's a serious danger of garbage-in, garbage-out here.

    To my eye the V8 ones are the best general-purpose benchmarks: the ones that represent not just the inner loops, but bigger programs. But hey, maybe that's just the Schemer in me amused at seeing cross-compiled Lisp benchmarks in their suite.

  • Tool support is terrible.

    All of these projects have to implement their own profilers, disassemblers, debuggers, etc, and for all platforms, in theory. In practice, yuck. GDB just got support for JIT-generated code, and it's terrible: you have to generate whole object files (ELF on ELF architectures, presumably mach-O or whatever on Darwin, and who knows on windows!), and if you want debugging information, you have to generate DWARF.

    Of course oprofile couldn't use the same format, it has its own thing, and it's not the same as the lovely but ungooglable perf, which appears to only support map files generated by some JVM no one has heard of, but has no support for garbage-collecting executable code or for line numbers. Aaargh!!!

    Incidentally, my colleague Xan appears to have run into these issues independently.

I've been joking recently that I've been coming up like Neo, in the training sequence, eyes open, drawn breath: "whoa: I know kung-funun-boxing."

All in all JavaScript in mid-2011 is looking much better than JavaScript in mid-2010 -- humbling, even -- and I look forward to hacking more on it. Actually, simply "hacking on it", as I have been taking uncomfortably long to get up to speed on things. (No one has yet said "Really? Show me." Ha ha.)

I suspect that tools is where I'll start. It sure would be nice to have line-level profiling interleaved with disassembly for JavaScript, somehow. (I hope it would be perf, but they seem an opinionated bunch. I hope you don't have to be Peter Zijlstra to get something into it.)

forward-looking statements

I try not to make promises about the future, so consider it simply a prediction, that the next entry will feature more parentheses than this one. Until then, happy hacking!

Syndicated 2011-05-15 17:53:21 from wingolog

meta: per-tag feeds

Just a brief meta-note: I've added per-tag feeds to tekuti, based on a patch kindly provided by Brian Gough, and unkindly neglected by me these last four months.

Anyway, you can now access http://wingolog.org/feed/atom?with=guile&without=web&with=scheme and other such URLs. Should be good for planets.

Let me know if there are any problems. Thanks!

Syndicated 2011-04-25 18:13:27 from wingolog

new beginnings

Friday is my last day at Oblong. It's been good times: my colleagues have all been generous, intelligent, respectful folk. It has been a real privilege to work with them for these few years, and I wish them all the best. Oblong is in a great position to emancipate both pixels and people from the proxy world of mice and windows.

At Oblong I was mostly focused on video, but if you're a regular reader of this electro-rag, you'd see that my passion is more on the side of compilers and free software, especially Guile. So when the opportunity presented itself to do something more directly aligned with those interests, I let myself be tempted.

igalia ho!

So what's next? Igalia is what's next! I'll join their new compilers group later this month, with an initial focus on JavaScriptCore optimization work and on Guile projects.

Igalia is a Spanish free software consulting shop that is fairly well-known in the GNOME community, but perhaps not so widely outside of it. I admit: it is refreshing (and relieving) to return to free software "on the clock", and to hack compilers for a living. This is fantastic. But that's not the thing that's really awesome (and unique, as far as I know) about Igalia.

No, the truly great thing is that Igalia is an entirely egalitarian organization. All important decisions are taken by the assembly, using consensus and similar collective decision-making procedures. That includes research, business development, budgets, hiring, salaries: everything. After a few years working there you become a partner, which makes you an equal co-owner of the business.

Wild, no? And wonderful. Igalia has been around for almost 10 years now, so the system definitely works, and I am very much looking forward to seeing how it works. It's a project, and I'm pleased to be a part.


My last note of this kind quoted Thoreau, ending with:

If I should sell both my forenoons and afternoons to society, neglecting my peculiar calling, there would be nothing left worth living for. I trust that I shall never thus sell my birthright for a mess of pottage.

Time passes, and it turns out that, among other things, Guile is my "peculiar calling". So the thing that I've worked out with Igalia folk is to go back to full time, though still with a nominal split between JavaScriptCore work and Guile work.

To be honest this split seems to be something akin to therapy. Before, I sold my time for sustenance, and if I could get interesting projects and nice folks on top of that, then great. But now, the prospect of drawing passion and work closer together is daunting at the same time that it's exciting. Hopefully, eventually, this split will heal itself, into a state in which I do what needs doing, while doing what I love. We'll see.

I'll be staying in Barcelona, at least until the fall. Until next time, dear readers, happy hacking. It's good to be back!

Syndicated 2011-04-05 06:16:26 from wingolog


git-brunch - Fresh fruit before lunch

git brunch [-p]

With no arguments, stashes all work and cherry-picks a snack onto the dining room working tree. Option -p runs a git-prune-snacked(1) afterwards.

If running git-brunch does not produce the desired effects, try with sudo. For external brunches, see git-remote.

Objects received via git-fetch-snack(1) should be inspected manually to ensure their fitness for brunch. Strawberries, cream, and waffles are currently unimplemented.

git-cherry-pick(1), git-snack-objects(1)

Independently discovered a few days ago, it seems.

git-brunch: the power of an idea whose time has come.

Syndicated 2011-03-28 11:24:52 from wingolog

359 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!