Older blog entries for wingo (starting at number 366)

on-stack replacement in v8

A recent post looked at what V8 does with a simple loop:

function g () { return 1; }
function f () { 
  var ret = 0;
  for (var i = 0; i < 10000000; i++) {
    ret += g ();
  }
  return ret;
}

It turns out that calling f actually inlines g into the body, at runtime. But how does V8 do that, and what was the dead code about? V8 hacker Kasper Lund kindly chimed in with an explanation, which led me to write this post.

When V8 decided to compile an optimized version of f with g inlined into it, f was already running, partway through the loop. V8 needed to stop the loop in order to replace the slow version with the fast version. The mechanism it uses is to reset the stack limit, causing the overflow handler in g to be called. So the computation is stuck in a call to g, and somehow must be transformed into a call to a combined f+g.
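
That stack-limit trick is worth making concrete. Here is a minimal sketch of the idiom in C, with invented names (stack_limit, handle_interruption, and STACK_CHECK are stand-ins, not V8's actual symbols): compiled code compares the stack pointer against a limit cell at each check, so whoever owns that cell can interrupt the code at its next check simply by making the limit impossibly high.

#include <stdint.h>

extern uintptr_t stack_limit;            // normally the real stack limit
extern void handle_interruption (void);  // e.g., perform OSR

// To interrupt: raise the limit so that the next check must trip.
void interrupt_at_next_check (void)
{
  stack_limit = UINTPTR_MAX;
}

// Emitted by the compiler at function entries and loop back-edges.
#define STACK_CHECK()                                           \
  do {                                                          \
    uintptr_t sp = (uintptr_t) __builtin_frame_address (0);    \
    if (sp < stack_limit)   /* stack grows down */              \
      handle_interruption ();                                   \
  } while (0)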

The actual mechanics are a bit complicated, so I'm going to try a picture. Here we go:

As you can see, V8 replaces the top of the current call stack with one in which the frames have been merged together. This is called on-stack replacement (OSR). OSR takes the contexts of some contiguous range of function applications and the variables that are live in those activations, and transforms those inputs into a new context or contexts and live variable set, and splats the new activations on the stack.

I'm using weasel words here because OSR can go both ways: optimizing (in which N frames are combined to 1) and deoptimizing (typically the other way around), as Hölzle notes in "Debugging Optimized Code with Dynamic Deoptimization". More on deoptimization some other day, perhaps.

The diagram above mentions "loop restart entry points", which is indeed the "dead code" that I mentioned previously. That code forms part of an alternate entry point to the function, used only once, when the computation is restarted after on-stack replacement.

details, godly or otherwise

Given what I know now about OSR, I'm going to try an alternate decompilation of V8's f+g optimized function. This time I am going to abuse C as a high-level assembler. (I know, I know. Play along with me :))

Let us start by defining our types.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

typedef union { uint64_t bits; } Value;
typedef struct { Value map; uint64_t data[]; } Object;

All JavaScript values are of type Value. Some Values encode small integers (SMIs), and others encode pointers to Object. Here I'm going to assume we are on a 64-bit system. The data member of Object is a tail array of slots.

inline bool val_is_smi (Value v) 
{ return !(v.bits & 0x1); }

inline Value val_from_smi (int32_t i)
{ return (Value) { ((uint64_t)i) << 32 }; }

inline int32_t val_to_smi (Value v)
{ return (int32_t) (v.bits >> 32); }

Small integers have 0 as their least significant bit, and the payload in the upper 32 bits. Did you see the union literal syntax here? (Value){ ((uint64_t)i) << 32 }? It's a C99 compound literal, a thing that most folks don't know about. Anyway.

inline bool val_is_obj (Value v)
{ return !val_is_smi (v); }

inline Value val_from_obj (Object *p)
{ return (Value) { ((uint64_t)p) + 1U }; }

inline Object* val_to_obj (Value v)
{ return (Object*) (v.bits - 1U); }

Values that are Object pointers have 01 as their least significant bits. We subtract off that tag to recover the pointer to Object.

Numbers that are not small integers are stored as Object values, on the heap. All Object values have a map field, which points to a special object that describes the value: what type it is, how its fields are laid out, etc. Much like GOOPS, actually, though not as reflective.

In V8, the double value will be the only slot in a heap number. Therefore to access the double value of a heap number, we simply check whether the Object's map is the heap number map, and in that case return the first slot, as a double.

There is a complication though: what is the value of the heap number map? V8 actually doesn't encode it into the compiled function directly. I'm not sure why it doesn't. Instead it stores the heap number map in a slot in the root object, and stores a pointer into the middle of the root object in r13. It's as if we had a global variable like this:

Value *root;    // r13

const int STACK_LIMIT_IDX = 0;
const int HEAP_NUMBER_MAP_IDX = -7; // Really.

inline bool obj_is_double (Object* o)
{ return o->map.bits == root[HEAP_NUMBER_MAP_IDX].bits; }

Indeed, the offset into the root pointer is negative, which is a bit odd. But hey, details! Let's add the functions to actually get a double from an object:

union cvt { double d; uint64_t u; };

inline double obj_to_double (Object* o)
{ return ((union cvt) { .u = o->data[0] }).d; }

inline Object* double_to_obj (double d)
{
  Object *ret = malloc (sizeof (Value) * 2);
  ret->map = root[HEAP_NUMBER_MAP_IDX];
  ret->data[0] = ((union cvt) { d }).u;
  return ret;
}

I'm including double_to_obj here just for completeness. Also did you enjoy the union literals in this block?

So, we're getting there. As the reader perhaps recalls, V8's calling convention specifies that the context and the function are passed in registers. Let's model that in this silly code with some global variables:

Value context;  // rsi
Value function; // rdi

// Provided by the runtime; see the stack check below.
extern void handle_overflow (void);

Recall also that when f+g is called, it needs to check that g is actually bound to the same value. So here we go:

const Value* g_cell;
const Value expected_g = (Value) { 0x7f7b205d7ba1U };

Finally, f+g. I'm going to show the main path all in one go. Take a big breath!

Value f_and_g (Value receiver)
{
  // Stack slots (5 of them)
  Value osr_receiver, osr_unused, osr_i, osr_ret, arg_context;
  // Dedicated registers
  register int64_t i, ret;

  arg_context = context;

  // Assuming the stack grows down.
  if ((uint64_t) &arg_context < root[STACK_LIMIT_IDX].bits)
    // The overflow handler knows how to inspect the stack.
    // It can longjmp(), or do other things.
    handle_overflow ();

  i = 0;
  ret = 0;
  
restart_after_osr:
  if (g_cell->bits != expected_g.bits)
    goto deoptimize_1;

  while (i < 10000000)
    {
      register uint64_t tmp = ret + 1;
      if ((int64_t)tmp < 0)
        goto deoptimize_2;

      i++;
      ret = tmp;
    }

  return val_from_smi (ret);

And exhale. The receiver object is passed as an argument. There are five stack slots, none of which are really used in the main path. Two locals are allocated in registers: i and ret, as we mentioned before. There's the stack check, locals initialization, the check for g, and then the loop, and the return. The test in the loop is intended to be a jump-on-overflow check.

restart_after_osr what?

But what about OSR, and what's that label about? What happens is that when f+g is replaced on the stack, the computation needs to restart. It does so from the osr_after_inlining_g label:

osr_after_inlining_g:
  if (val_is_smi (osr_i))
    i = val_to_smi (osr_i);
  else
    {
      if (!obj_is_double (val_to_obj (osr_i)))
        goto deoptimize_3;

      double d = obj_to_double (val_to_obj (osr_i));

      i = (int64_t)trunc (d);
      if ((double) i != d || isnan (d))
        goto deoptimize_3;
    }

Here we take the value for the loop counter, as stored on the stack by OSR, and unpack it so that it is stored in the right register.

Unpacking a SMI into an integer register is simple. On the other hand, unpacking a heap number has to check that the number has an integer value. I tried to represent that here with the C code. The actual assembly is just as hairy but a bit more compact; see the earlier article for the full assembly listing.

The same thing happens for the other value that was live at OSR time, ret:

  if (val_is_smi (osr_ret))
    ret = val_to_smi (osr_ret);
  else
    {
      if (!obj_is_double (val_to_obj (osr_ret)))
        goto deoptimize_4;

      double d = obj_to_double (val_to_obj (osr_ret));

      ret = (int64_t)trunc (d);
      if ((double) ret != d || isnan (d))
        goto deoptimize_4;

      if (ret == 0 && signbit (d))
        goto deoptimize_4;
    }
  goto restart_after_osr;

Here we see the same thing, except that additionally there is a check for -0.0 at the end, because V8 could not prove to itself that ret would not be -0.0. But if all succeeded, we jump back to the restart_after_osr case, and the loop proceeds.

Finally we can imagine some deoptimization bailouts, which would result in OSR, but for deoptimization instead of optimization. They are implemented as tail calls (jumps), so we can't represent them properly in C.

deoptimize_1:
  return deopt_1 ();
deoptimize_2:
  return deopt_2 ();
deoptimize_3:
  return deopt_3 ();
deoptimize_4:
  return deopt_4 ();
}

and that's that.

I think I've gnawed all the meat that's to be had off of this bone, so hopefully that's the last we'll see of this silly loop. Comments and corrections very much welcome. Next up will probably be a discussion of how V8's optimizer works. Happy hacking.

Syndicated 2011-06-20 13:14:03 from wingolog

V8 is faster than GCC

Do you like the linkbait title? Neither do I, but it is true in this case. Check it out.

After my last post, Benjamin noted that GCC would reduce my simple test to a mov rax, $10000000; ret sequence. Well yes, that's true, and GCC does do that: but only if GCC is able and allowed to do the inlining. So the equivalent of the test, for GCC, is to compile the g in a separate file:

unsigned int g (void) { return 1; }

Then in the main file, we have:

extern unsigned int g (void);

int main (int argc, char *argv[]) {
  unsigned int i, ret = 0;
  for (i = 0; i < N; i++)
    ret += g ();
  return ret;
}

Here we replace N with the number of iterations we're interested in testing.
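
For contrast, here is the variant Benjamin was talking about, with g visible in the same translation unit. This is just a sketch to try at home; the exact output depends on your GCC version, but with optimizations on, the call is inlined and the loop folds away entirely:

/* Compile with "gcc -O2 -S" and look at the assembly: main
   becomes a single move of 10000000 followed by a return. */
static unsigned int g (void) { return 1; }

int main (void)
{
  unsigned int i, ret = 0;
  for (i = 0; i < 10000000; i++)
    ret += g ();
  return ret;
}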

I decided to run such a test; the harness is here. I ran both GCC and V8 for the above program, for N from 2^0 to 2^31, and plotted the total time taken by V8 as compared to GCC.

Naturally this includes the GCC compile times, for which we expect GCC to lose at low iterations, but the question is, when does GCC start being faster than V8?

answer: never!

That's right, V8 is always faster than GCC, right up to the point at which its fixnums start failing. For the record, only the points on the right of the graph are really worth anything, as the ones on the left only run for a few milliseconds. Raw data here.

Why is V8 faster, even for significant iteration counts? Because it can inline hot function calls at runtime. It can see more than GCC can, and it takes advantage of it: but still preserving the dynamic characteristics of JavaScript. In this context the ELF symbol interposition rules that GCC has to deal with, which forbid exactly this kind of cross-module inlining, are quite a burden!

Syndicated 2011-06-10 15:57:58 from wingolog

what does v8 do with that loop?

Hi!

I've spent the last month or so swimming in V8. Not the tomato juice, mind you, though that would be pleasant, but rather the JavaScript implementation.

In my last dispatch I made the assertion that the current JavaScript implementations are totally undocumented, but in the case of V8 that's not precisely true. There were at least two PhD dissertations written on it, along with a few technical reports and conference proceedings.

I refer of course to the Self work at Stanford and Sun, back in the late eighties and early nineties. I had read a number of their papers, but hadn't seen Hölzle's thesis yet. That one, for me, is the best of all, because it explains the Self implementation on an engineering level. The V8 implementation, I mean to say, because the vocabulary is entirely the same: maps, on-stack replacement, lazy deoptimization, etc. These are all parts of V8 now.

And it's no wonder, really. Lars Bak, the lead developer of V8, was there back then, hacking on Self, and then Strongtalk, then HotSpot, then a little startup built around virtual machines on mobile devices. So you see there's a reason why V8 doesn't have very much big-picture internal documentation -- Bak has been writing the same program for 20 years now; he knows how it works.

(Note however that Bak's name never appears in the V8 commit logs. You can see his hand at work, but never his face. Like Dr. Claw. Actually not very much like Dr. Claw but I just wanted to put that out there. Is Lars Bak Dr. Claw? Is he?)

enough with the personification

As you might recall, V8 always compiles JavaScript to native code. The first time V8 sees a piece of code, it compiles it quickly but without optimizing it. The initial unoptimized code is fully general, handling all of the various cases that one might see, and also includes some type-feedback code, recording what types are being seen at various points in the procedure.

At startup, if the "Crankshaft" optimizing compiler is enabled, V8 spawns off a profiling thread. If it notices that a particular unoptimized procedure is hot, it collects the recorded type feedback data for that procedure and uses it to compile an optimized version of the procedure. The old unoptimized code is then replaced with the new optimized code, and the process continues. (This on-stack replacement (OSR) process is covered in Hölzle's thesis.)

The optimized procedure does not actually cover all cases. If a callee procedure starts returning floating-point values where it was returning integer values before, the optimized procedure is deoptimized -- a relatively expensive process of recompiling the original procedure to the unoptimized form, replacing the optimized function on the stack with the unoptimized version, and then continuing the computation. This requires that the compiler keep around additional information about how to continue the computation -- you could be in the middle of a loop, after all.

Deoptimization is particularly tricky if the optimizing process inlined a callee procedure, and thus has to de-inline, replacing the activation of one optimized procedure call with two or more unoptimized calls. Again, Hölzle's thesis discusses this in depth. Inlining is such a win though that the complexity appears to be worth it.

assembly

I wanted to see what V8 does with a simple loop, but one for which lexical inlining isn't possible. Like this:

function g () { return 1; }
function f () { 
  var ret = 0;
  for (var i = 0; i < 10000000; i++) {
    ret += g ();
  }
  return ret;
}

Pretty simple: adding 1 to a counter, 10 million times, but attempting to foil statically apparent inlining possibilities. I entered these two definitions at the d8 shell, invoked with the --code_comments --print_code options.

Note that running V8 in this way is going to spew a lot on your console, as V8 itself warms up. Startup is quick, though. On my modern laptop with an SSD, the debugging shell takes about 17ms to start up. The standard shell takes about 5ms to start. Both of these numbers are with snapshots on; without snapshots, the numbers are more like 32ms and 18ms, respectively. Just for comparison, the JavaScriptCore shell (jsc) takes about 12ms to start here.

Interestingly, V8's profiler decides that the best thing to do here is not to optimize g -- which it actually can't, as it's so small the unoptimized code is already optimal -- but to inline g into f, and optimize f.

V8 is able to do this inlining because it keeps around the parsed AST of every piece of code it sees. It needs some high-level internal representation, and it turns out that the AST is the easiest one: it's fairly small, it's already the input format to the "full-codegen" unoptimized compiler, and it also has all of the lexical information necessary to do good inlining analysis. Much easier than trying to decompile bytecode, it seems to me.

I did say that I was going to show some assembly, so here we go. This is what d8 prints out when evaluating f(). I've trimmed the output a bit. The comments on the right are my annotations.

Instructions (size = 466)
  0  push rbp           ;; Save the frame pointer.
  1  movq rbp,rsp       ;; Set the new frame pointer.
  4  push rsi           ;; Save the callee's "context object".
  5  push rdi           ;; Save the callee's JSFunction object.
  6  subq rsp,0x28      ;; Reserve space for 5 locals.

Here we have a standard prelude. The JavaScript calling conventions in V8 pass arguments on the stack, using rbp and rsp as callee-saved stack and frame pointers. Additionally, information associated with the function itself is passed in rsi and rdi: the context, and the function object itself. The context is an array optimized for looking up various information that the function needs at runtime, mostly free variables (lexical and global).

In this case it's redundant to take the context out of the function, but it does allow for faster access to the global scope object that was current when the function was defined. In the case of closures, every time the function() expression is evaluated, a new context will be created with a new set of free variables.

Anyway! Moving along:

 10  movq rax,rsi        ;; Store the context in a scratch register.
 13  movq [rbp-0x38],rax ;; And save it in the last (4th) stack slot.
 17  cmpq rsp,[r13+0x0]  ;; Check for stack overflow
 21  jnc 28              ;; If we overflowed,
 23  call 0x7f7b20f40a00 ;; Call the overflow handler.

Here we store the context again, in the last stack slot, and we check for overflow. On x86-64, V8 reserves r13 for a pointer to somewhere in the middle of a "roots" object, holding the GC roots for a given JavaScript execution context. There is a cell in the root list that holds the stack limit, which V8 abuses for other purposes: interrupting loops, breakpoints, deoptimization, etc. Again, Hölzle's thesis has some details on this technique.

The "somewhere in the middle" bit is arranged so that a simple dereference of r13 will allow us to get at the stack limit. V8 will reset the stack limit whenever it needs to interrupt a procedure.

Having passed the stack overflow check, we initialize local variables:

 28  movq rax,[rbp+0x10] ;; Receiver object to rax,
 32  movq rcx,rax        ;; and thence to rcx (unused).
 35  movq rdx,[rbp-0x38] ;; Context object to rdx.
 39  movl rbx,(nil)      ;; Loop counter (i) to 0.
 44  movl rax,(nil)      ;; Accumulator (ret) to 0.
 49  jmp 97              ;; Jump over some stuff.

Note that the receiver object (the this object) is passed as an argument, and thus is above rbp.

Following this bit we have some code that appears to be completely dead. I'll include it for completeness, but unless the deoptimization bailouts jump back here, I don't know why it's there.

 54  movq rax,rsi        ;; Dead code.
 57  movq rbx,[rbp-0x28]
 61  testb rbx,0x1
 64  jnz 189  (0x7f7b20fa2a7d)
 70  shrq rbx,32
 74  movq rdx,[rbp-0x30]
 78  testb rdx,0x1
 81  jnz 237  (0x7f7b20fa2aad)
 87  shrq rdx,32
 91  movq rcx,[rbp-0x18]
 95  xchgq rax, rdx

We'll just forget that happened, shall we? However we are getting closer to the firm tofu of the matter, the loop. First one more check:

 97  movq rdx,[rsi+0x2f] ;; Slot 6 of the context: the global object.
101  movq rdi,0x7f7b20e401e8 ;; Location of cell holding `g'
111  movq rdi,[rdi]      ;; Dereference cell
114  movq r10,0x7f7b205d7ba1 ;; The expected address of `g'
124  cmpq rdi,r10        ;; If they're not the same...
127  jnz 371             ;; Deoptimization bailout 2

Here we see if the current definition of g, the function that we inlined below, is actually the same as when the inlining was performed.

Note that on line 97, since pointers in V8 are low-bit tagged with 01, to access slot 6 (0-based) of the context object, we only need to add 0x2f instead of 0x30. Clever, right? But we don't actually need the global object here in the main loop, so we could have delayed this load until finding out that deoptimization was necessary. Perhaps it was needed though for GC reasons.
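
As an aside, the C version of that displacement trick is simply that the -1 untagging folds into the load offset. A quick sketch, using the 01 heap-object tag (the helper name is mine, not V8's):

#include <stdint.h>

#define HEAP_TAG 1

// Load 8-byte slot n of a tagged object pointer: the untag and the
// field offset combine into one displacement, as in [rsi+0x2f] above.
static inline uint64_t tagged_slot_ref (uint64_t tagged, int n)
{
  return *(const uint64_t *) (tagged + 8 * n - HEAP_TAG);
}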

133  movq rdx,[rdx+0x27] ;; Another redundant load.
137  cmpl rbx,0x989680   ;; 10000000, you see.
143  jge 178             ;; If i >= 10000000, break.
149  movq rdx,rax        ;; tmp = ret
152  addl rdx,0x1        ;; tmp += 1
155  jo 384              ;; On overflow, deoptimize.
161  addl rbx,0x1        ;; i++
164  movq rax,rdx        ;; ret = tmp
167  cmpq rsp,[r13+0x0]  ;; Re-check the stack limit.
171  jnc 137             ;; Loop if no interrupt,
173  jmp 306             ;; Otherwise bail out.
178  shlq rax,32         ;; Tag rax as a small integer.
182  movq rsp,rbp        ;; Restore stack pointer.
185  pop rbp             ;; Restore frame pointer.
186  ret 0x8             ;; Return, popping receiver.

And that's it! It's a fairly tight loop: g is inlined of course, its return value is untagged to a native int32, as are the accumulator and loop counter variables. Of course, improvements are possible -- the loop could be unrolled a few times, range analysis could avoid the deoptimization check, the overflow check could possibly be cheaper, and indeed the whole thing could be folded, but all in all, good job, MAD kittens!

Note the interesting approach to tagging: instead of storing the integer in the lower 32 bits, shifted by one, it is stored in the upper 32 bits, without a shift.
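
In C, that representation would look something like this sketch (helper names are mine). It has a pleasant consequence: since the low 32 bits of a smi are all zeros, addition can even be done directly on tagged values, as long as the payloads' sum still fits in 32 bits.

#include <stdint.h>

static inline uint64_t smi_tag (int32_t i)
{ return ((uint64_t) (uint32_t) i) << 32; }

static inline int32_t smi_untag (uint64_t v)
{ return (int32_t) (v >> 32); }

// smi_tag (a) + smi_tag (b) == smi_tag (a + b), provided a + b
// does not overflow 32 bits.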

Actually there's some more junk after all of this. Another dead code block, apparently meaning to deal with floating-point values, but totally unreferenced:

189  movq r10,[r13-0x38]
193  cmpq [rbx-0x1],r10
197  jnz 397
203  movsd xmm0,[rbx+0x7]
208  cvttsd2sil rbx,xmm0
212  cvtsi2sd xmm1,rbx
216  ucomisd xmm0,xmm1
220  jnz 397
226  jpe 397
232  jmp 74
237  movq r10,[r13-0x38]
241  cmpq [rdx-0x1],r10
245  jnz 410
251  movsd xmm0,[rdx+0x7]
256  cvttsd2sil rdx,xmm0
260  cvtsi2sd xmm1,rdx
264  ucomisd xmm0,xmm1
268  jnz 410
274  jpe 410
280  testl rdx,rdx
282  jnz 301
288  movmskpd xmm2,xmm0
292  andl rdx,0x1
295  jnz 410
301  jmp 91

And then there's our stack check handler, saving all of the registers of interest, and calling out:

306  push rax
307  push rcx
308  push rdx
309  push rbx
310  push rsi
311  push rdi
312  push r8
314  push r9
316  push r11
318  push r14
320  push r15
322  leaq rsp,[rsp-0x28]
327  movq rsi,[rbp-0x8]
331  xorl rax,rax
333  leaq rbx,[r13-0x20a4b70]
340  call 0x7f7b20fa25c0
345  leaq rsp,[rsp+0x28]
350  pop r15
352  pop r14
354  pop r11
356  pop r9
358  pop r8
360  pop rdi
361  pop rsi
362  pop rbx
363  pop rdx
364  pop rcx
365  pop rax
366  jmp 137

And finally the deoptimization bailouts:

371  movq r10,0x7f7b20f7c054
381  jmp r10
384  movq r10,0x7f7b20f7c05e
394  jmp r10
397  movq r10,0x7f7b20f7c068
407  jmp r10
410  movq r10,0x7f7b20f7c072
420  jmp r10

Whew! I'd say that's long enough for today. I wanted to establish a fixed point on the low level though, so that I could write in the future about how it is that V8 compiles down to this code, what deoptimization is about, and such; filling in the middle, between the assembly and the JavaScript source. Corrections and suggestions in the comments please. Happy hacking!

Syndicated 2011-06-08 14:33:55 from wingolog

value representation in javascript implementations

In my last note I mentioned that I had been doing a lot of reading of JavaScript implementation code. One point that I didn't mention is that the state of the art is completely undocumented. So to work on rectifying that, here's the first in what might be a series of articles about the internals of JavaScript implementations.

tagging

Initially, all JavaScript implementations used tagged pointers to represent JS values. This is an old trick that comes from the observation that allocated memory takes up at least 4 or 8 bytes, and is aligned in such a way that the least significant bit or three will be zero.

Here's an old part of the Guile manual that explains the technique. For example you could take values whose lowest bit is 1 to be a pointer to the heap. Any value that ends in 0 could be an integer, represented directly in the bits of the pointer. (If your pointer tag is not 0, you will have to mask off the low bits before dereferencing a pointer.)

Google's V8 JavaScript engine still does it this way. This yields 31-bit immediate signed integers; you can just do a signed right-shift to get the value. They call them "smi" values, for "small integers"; they are more commonly called "fixnums". There are other fun tricks you can do to avoid shifts for some common operations, like addition. Note that they also take advantage of the four-byte alignment (in their case) to encode another immediate value, "failure objects", using the other low bit.
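
A sketch of the classic 32-bit scheme in C, with my own names (V8's actual code differs in its details): the low bit separates smis from heap pointers, payloads must fit in 31 bits, and since the smi tag is 0, the addition trick mentioned above falls right out.

#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t value_t;   // 32 bits wide on a 32-bit machine

static inline bool    is_smi (value_t v)    { return (v & 1) == 0; }
static inline value_t smi (int32_t i)       { return (value_t) i << 1; }
static inline int32_t smi_value (value_t v) { return (int32_t) v >> 1; }

static inline bool  is_heap_object (value_t v) { return (v & 1) == 1; }
static inline void *heap_object (value_t v)    { return (void *) (v - 1); }

// smi (a) + smi (b) == smi (a + b): no untagging needed to add.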

The size of a tagged pointer is the size of a pointer, so 32 bits on a 32-bit machine, and 64 on a 64-bit machine. V8 doesn't actually use 63-bit smi values on 64-bit machines, AFAIK.

nan-boxing

JavaScriptCore, the implementation used by WebKit, switched from tagged pointers to a "nan-boxed" format. The best explanation of this strategy is given by Rob Sayre of Mozilla. I believe the technique was first made popular by LuaJIT, though I would appreciate additional references.

Sayre does a good job of explaining things, but to recap, there are about 2^53 bit patterns that represent not-a-number values for IEEE-754 double-precision floats, but only one of those is actually emitted by current hardware and software. So cheeky implementations are free to use the other patterns for their own purposes.

What I always wondered when I heard about this strategy was how 53 bits were possibly enough, for a 64-bit machine. How are you going to fit a pointer in that space?

Well it turns out that on x86-64 systems, you only have a 48-bit address space, currently anyway. The shape of that space is quite peculiar, too. I don't know whether Linux gives out addresses in the upper half, but Windows does not, though Solaris does.

Anyway, the other way of looking at NaN-boxing is that there are 2^64 - 2^48 bit patterns in a 64-bit word that pointers aren't using, and we might as well stuff something in them, and hey, the whole space of valid double-precision numbers fits!

It's convenient too that numbers in JS are specified as doubles. (Even still, all implementations define separate "small integer" types, for use in loops and indexing and such; integer operations are faster when you can use them, too.)

JSC's implementation of nan-boxing is described here, and you can see some of the neat tricks they do here. It actually makes me envious of C++, as a C programmer!

nun-boxing

So, when you choose to do nan-boxing, you basically choose to do one of two things: you favor pointers, or you favor doubles. To favor pointers means that you recognize pointers as having initial (most-significant) 0 bits; if the initial bits are not 0, then it's a double, and you have to add or subtract a bit pattern to get to the double value.

Favoring doubles means that pointers are left as NaN values, so their initial bits are all ones (or the sign bit can be unset, it doesn't matter), and to unpack a pointer, you have to rotate the double space around.
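
To make the favor-doubles flavor concrete, here is a minimal sketch; the tag constant is illustrative, not any engine's actual encoding. Doubles pass through unchanged, and pointers live inside the unused NaN space, under a high tag:

#include <stdint.h>
#include <string.h>

typedef uint64_t value_t;

#define TAG_SHIFT   48
#define TAG_POINTER 0xFFFCULL  // a quiet-NaN pattern hardware never emits

static inline value_t box_double (double d)
{ value_t v; memcpy (&v, &d, sizeof v); return v; }

static inline double unbox_double (value_t v)
{ double d; memcpy (&d, &v, sizeof d); return d; }

static inline value_t box_pointer (void *p)
{ return (TAG_POINTER << TAG_SHIFT) | (uint64_t) (uintptr_t) p; }

static inline int is_pointer (value_t v)
{ return (v >> TAG_SHIFT) == TAG_POINTER; }

static inline void *unbox_pointer (value_t v)
{ return (void *) (uintptr_t) (v & ((1ULL << TAG_SHIFT) - 1)); }

Unboxing a double is free (it is just a reinterpretation), at the cost of a shift and compare to recognize pointers: that is the trade-off between the two camps.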

Amusingly, there is a third option as well. For 32-bit machines, you can address the second word in the double directly, so there is no need to mask off anything. This is the JSVALUE32_64 case mentioned in the JSValue code I linked to above.

JSC chose to favor pointers, and as the first JS implementation to nan-box, got to call their choice "nan-boxing". Mozilla chose to favor doubles, and so made up the silly name "nun-boxing".

get thee to a nun boxery?

So you're implementing a language. Should you nan-box or not? I can't say in your case but I can give a couple of examples.

NaN-boxing has the obvious advantage of not allocating doubles on the heap. This reduces cache pressure, GC pressure, and such. That's why Moz and JSC chose it.

V8 on the other hand has not chosen it, at least not yet, anyway. I think that one reason is because especially on embedded devices, and to an extent on ia32, passing around 64-bit values is a big lose. It's so bad that I understand that Mozilla actually passes these values by reference instead of by value in some places, on 32-bit systems only.

But the real reason that V8 doesn't nan-box I think is that they have a compiler that is able to unbox values in registers and in temporary stack locations, both as int32 values and as double values. This works for loops and such things in a function, and is obviously quite efficient, as you don't need to box and unbox at all. Passing doubles as arguments or return values however does allocate them on the heap, as far as I can tell anyway.

Also there is the hackery that with NaN-boxing, you assume things about your OS's memory management. A language implementation should be able to get around this with use of mmap at specific addresses, but still, it's tricky.

I looked at doing nan-boxing for Guile, and eventually decided against it. Guile has a few more constraints that V8 does not have. Firstly it is very portable, so it can't rely on the address range constraints that x86-64 has, not without some hackery. It's also portable to the low end, like V8, so 64-bit values are a lose there.

But the killer is the conservative GC on 32-bit systems. If you represent integers with the tag in the high bits, as you would with nan-boxing, then the lower 32 bits are not distinguishable from a pointer, so they can cause the GC to retain heap objects for longer than it should. With tagged pointers, the low-bit masking lets the GC know that an integer is not a pointer. If you were to reintroduce low-bit tagging to a nan-boxed system, you've lost much of the advantage of it. Furthermore if you want to support more than 32 bits of precision in a fixnum, then on a 32-bit system you need to play games with sign extension of a 48-bit value to a 64-bit value, and for all word sizes it looks like the overhead will be significantly higher than tagging.

Finally I think that if we're really going to shoot for the moon, we should implement something like V8's Crankshaft compiler that does decent type inference, to allow unboxed values in registers and on the stack. (For readers of source code, "crankshaft" is basically the "hydrogen" flow-graph code plus a "lithium" assembler. Confusing, right?)

Well that's all for now. If you have corrections, or an idea for a further topic, let me know in the comments. Thanks!

Syndicated 2011-05-18 16:57:41 from wingolog

javascript in 2011

Good evening tubes,

This hacklog needs an update, and how! I meant to write some stuff about Guile, which recently got some pretty great performance improvements, but instead bits on JavaScript were what trickled out of my fingers onto this Emacs. So here we go!

kung fu

I took a fairly deep dive into the world of JavaScript implementations recently, and man, what a world: a rotated-double-space punboxed jaeger-lithium crankshafted nitro-assembling trace stack of bit-froth. Yow! Now that's a fat value!

That probably makes no sense, right, but you have to understand that I've been jacked into this matrix for like three weeks now, and they speak a different language out there. If I were to draw some preliminary conclusions about the state of the art in JavaScript implementations, they would be these:

  • Tracing looked fun, but it seems that it's possible to get trace-level performance with more traditional compilation techniques. (The trick appears to be to treat tagging, untagging and type checks as nodes in your control-flow graph.) V8 already has an SSA-based intermediate language, JavaScriptCore (the WebKit JavaScript implementation) is moving to one, and the above bug is from the Mozilla folk.

  • As a corollary to the previous point, data representation matters. It seems to me that JavaScriptCore's solid improvements have all been centered on cleanliness: keeping it simple, direct, and clean leads to speed. Right now JSC is trailing the pack again, but with a good SSA compiler, they could reach V8 again. Inner loops are important, but that's not where big applications spend most of their time.

  • The JS benchmarks are terrible, but everyone is using them to optimize their implementations. There's a serious danger of garbage-in, garbage-out here.

    To my eye the V8 ones are the best general-purpose benchmarks: the ones that represent not just the inner loops, but bigger programs. But hey, maybe that's just the Schemer in me amused at seeing cross-compiled Lisp benchmarks in their suite.

  • Tool support is terrible.

    All of these projects have to implement their own profilers, disassemblers, debuggers, etc, and for all platforms, in theory. In practice, yuck. GDB just got support for JIT-generated code, and it's terrible: you have to generate whole object files (ELF on ELF architectures, presumably Mach-O or whatever on Darwin, and who knows on Windows!), and if you want debugging information, you have to generate DWARF.

    Of course oprofile couldn't use the same format, it has its own thing, and it's not the same as the lovely but ungooglable perf, which appears to only support map files generated by some JVM no one has heard of, but has no support for garbage-collecting executable code or for line numbers. Aaargh!!!

    Incidentally, my colleague Xan appears to have run into these issues independently.

I've been joking recently that I've been coming up like Neo, in the training sequence, eyes open, drawn breath: "whoa: I know nun-boxing."

All in all JavaScript in mid-2011 is looking much better than JavaScript in mid-2010 -- humbling, even -- and I look forward to hacking more on it. Actually, simply "hacking on it", as I have been taking uncomfortably long to get up to speed on things. (No one has yet said "Really? Show me." Ha ha.)

I suspect that tools is where I'll start. It sure would be nice to have line-level profiling interleaved with disassembly for JavaScript, somehow. (I hope it would be perf, but they seem an opinionated bunch. I hope you don't have to be Peter Zijlstra to get something into it.)

forward-looking statements

I try not to make promises about the future, so consider it simply a prediction, that the next entry will feature more parentheses than this one. Until then, happy hacking!

Syndicated 2011-05-15 17:53:21 from wingolog

meta: per-tag feeds

Just a brief meta-note: I've added per-tag feeds to tekuti, based on a patch kindly provided by Brian Gough, and unkindly neglected by me these last four months.

Anyway, you can now access http://wingolog.org/feed/atom?with=guile&without=web&with=scheme and other such URLs. Should be good for planets.

Let me know if there are any problems. Thanks!

Syndicated 2011-04-25 18:13:27 from wingolog

new beginnings

Friday is my last day at Oblong. It's been good times: my colleagues have all been generous, intelligent, respectful folk. It has been a real privilege to work with them for these few years, and I wish them all the best. Oblong is in a great position to emancipate both pixels and people from the proxy world of mice and windows.

At Oblong I was mostly focused on video, but if you're a regular reader of this electro-rag, you'd see that my passion is more on the side of compilers and free software, especially Guile. So when the opportunity presented itself to do something more directly aligned with those interests, I let myself be tempted.

igalia ho!

So what's next? Igalia is what's next! I'll join their new compilers group later this month, with an initial focus on JavaScriptCore optimization work and on Guile projects.

Igalia is a Spanish free software consulting shop that is fairly well-known in the GNOME community, but perhaps not so widely outside of it. I admit: it is refreshing (and relieving) to return to free software "on the clock", and to hack compilers for a living. This is fantastic. But that's not the thing that's really awesome (and unique, as far as I know) about Igalia.

No, the truly great thing is that Igalia is an entirely egalitarian organization. All important decisions are taken by the assembly, using consensus and similar collective decision-making procedures. That includes research, business development, budgets, hiring, salaries: everything. After a few years working there you become a partner, which makes you an equal co-owner of the business.

Wild, no? And wonderful. Igalia has been around for almost 10 years now, so the system definitely works, and I am very much looking forward to seeing how it works from the inside. It's a project, and I'm pleased to be a part of it.

pottage

My last note of this kind quoted Thoreau, ending with:

If I should sell both my forenoons and afternoons to society, neglecting my peculiar calling, there would be nothing left worth living for. I trust that I shall never thus sell my birthright for a mess of pottage.

Time passes, and it turns out that, among other things, Guile is my "peculiar calling". So the thing that I've worked out with Igalia folk is to go back to full time, though still with a nominal split between JavaScriptCore work and Guile work.

To be honest this split seems to be something akin to therapy. Before, I sold my time for sustenance, and if I could get interesting projects and nice folks on top of that, then great. But now, the prospect of drawing passion and work closer together is daunting at the same time that it's exciting. Hopefully, eventually, this split will heal itself, into a state in which I do what needs doing, while doing what I love. We'll see.

I'll be staying in Barcelona, at least until the fall. Until next time, dear readers, happy hacking. It's good to be back!

Syndicated 2011-04-05 06:16:26 from wingolog

git-brunch(1)

NAME
git-brunch - Fresh fruit before lunch

SYNOPSIS
git brunch [-p]

DESCRIPTION
With no arguments, stashes all work and cherry-picks a snack onto the dining room working tree. Option -p runs git-prune-snacked(1) afterwards.

NOTES
If running git-brunch does not produce the desired effects, try with sudo. For external brunches, see git-remote.

BUGS
Objects received via git-fetch-snack(1) should be inspected manually to ensure their fitness for brunch. Strawberries, cream, and waffles are currently unimplemented.

SEE ALSO
git-cherry-pick(1), git-snack-objects(1)

Independently discovered a few days ago, it seems.

git-brunch: the power of an idea whose time has come.

Syndicated 2011-03-28 11:24:52 from wingolog

storage primitives for the distributed web

An example-driven discussion of what seems to me to be the minimal set of persistent storage primitives needed in an autonomous web.

Lisa would like to keep a private journal, and store the journal entries in Marge's computer. Lisa is not worried about Marge reading her journal entries, but she wants to make sure that Bart, who also uses the computer, does not have access to her journal. Marge needs to give Lisa some storage primitives so that she can write her journal-entry program.

Marge calls up her operating system vendor, and gets the following procedure documentation in response:

make-cell init-string
Create a new storage location, and store init-string in it. Returns a write-cap, which allows a user to change the string associated with this storage location.
write-cap? obj
Return #t if obj is a write-cap, and #f for any other kind of object.
cell-set! write-cap string
Set the string associated with a cell to a new value.
write-cap->read-cap write-cap
Take a write-cap and return a read-cap, which allows a user to read the value associated with a cell.
read-cap? obj
Return #t if obj is a read-cap, and #f for any other kind of object.
cell-ref read-cap
Return the string associated with the cell.

Marge makes all of these procedures available to Lisa. Lisa then starts to program. She does not want to allow herself to edit her old entries, so she makes a helper:

(define (put-text string)
  (write-cap->read-cap (make-cell string)))

Since put-text throws away the write-cap for this string, nothing else will be able to change an entry, once it is written.

Read-caps and write-caps are capabilities. They are unforgeable. Since Lisa did not give any of these capabilities to Bart, she feels safe typing her innermost thoughts into put-text.

persistent objects

In Scheme, capabilities are objects. A piece of code has capabilities, in the normal English sense of the term, to all of the objects that are in its scope, and to objects that are in the current module.

But since not everything lives in the warm world of Scheme, the storage primitives that the operating system vendor provides allow read-caps and write-caps to be serialized to strings:

cap->string cap
Return a string representation of the capability cap.
string->cap string
Return a capability corresponding to string. Read capabilities have a different representation from write capabilities, so this procedure may return a read-cap or a write-cap. It returns #f if this string does not look like a capability.

When Lisa started writing, she wrote down the cap-strings for all of her entries in a book. Then when she wants to read them, she types the capability strings into the terminal:

(cell-ref (string->cap "read-cap:a6785kjiyv8c0..."))
=> "I was really happy with my solo today at band, but..."

But this got tiring after a while, so she decided to store the list of capabilities instead:

;; Build a list data type.
;;
(define (make-list-cell)
  (make-cell ""))

(define (list-cell-ref read-cap)
  (let ((str (cell-ref read-cap)))
    (if (equal? str "")
        '()
        (map string->cap (string-split str #\,)))))

(define (list-cell-set! write-cap caps)
  (cell-set! write-cap
             (string-join (map cap->string caps) ",")))

;; Helper.
;;
(define (->readable cap)
  (if (write-cap? cap)
      (write-cap->read-cap cap)
      cap))

;; Make a new cell, and print out its cap-string.
;; Note to self: write down this string!
;;
(display (cap->string (make-list-cell)))

(define (add-entry! entries cap)
  (list-cell-set!
   entries
   (cons cap (list-cell-ref (->readable entries)))))

Now she just has to write down the cap-string of the new list cell that she made, and she has a reference to all of her entries. Whenever she writes a new entry, she uses add-entry! to update the cell's value, adding on the new cap-string.

distribution

Colin's father Bono has a computer just like Marge's, and Lisa would like for Colin to be able to read some specific entries. So she asks Marge how to give other machines access to the cells that she is using for data storage.

Marge asks her vendor, and the vendor says that actually, the cells implementation that was provided to her stores its data in the cloud. So Lisa can just give Colin a cap-string -- read-only, presumably -- for the essays that she would like to share, and all is good.

Marge doesn't know what this means, but she tells Lisa, and Lisa freaks out. "You mean that I've been practicing Careless Computing this whole time, and you didn't tell me??? You mean that Chief Wiggum could have called up the cloud storage provider and gotten access to all of my data?"

careful computing

Lisa's concern is a good one. Marge puts her in contact with the vendor directly, who explains that actually, the cells implementation is designed to preserve users' autonomy.

Creating a cell creates an RSA key-pair. The write-cap contains the signing key (SK) and the public key (PK), and the read-cap contains just the PK. Before sending data to the cloud provider, the data is signed and encrypted using the SK, so only people with access to the read-cap can actually decrypt the strings.

The cells are stored in a standard key-value store. The key is computed as the secure hash (H) of the PK, so that even the cloud storage provider cannot decrypt the data. Furthermore, the cell-ref user does not rely on the provider to vouch for the data's integrity, as she can verify it directly against the PK. The only drawback is that Lisa cannot be sure that cell-ref is returning the latest value of a cell, whatever that means.
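
To give a flavor of how a vendor might implement this, here is a loose sketch using libsodium (my choice of library, purely for illustration; a real scheme, like Tahoe-LAFS's, involves more machinery, and the symmetric encryption of the payload is not shown):

#include <sodium.h>

typedef struct {
  unsigned char pk[crypto_sign_PUBLICKEYBYTES];
  unsigned char sk[crypto_sign_SECRETKEYBYTES];
} write_cap;

typedef struct {
  unsigned char pk[crypto_sign_PUBLICKEYBYTES];
} read_cap;

/* make-cell: a fresh cell is a fresh key pair. */
static void make_cell_keys (write_cap *w)
{
  crypto_sign_keypair (w->pk, w->sk);
}

/* cell-set!: only the write-cap holder can produce a valid signature,
   which readers verify against the PK in their read-cap. */
static void sign_payload (const write_cap *w,
                          const unsigned char *data, unsigned long long len,
                          unsigned char sig[crypto_sign_BYTES])
{
  crypto_sign_detached (sig, NULL, data, len, w->sk);
}

/* The store key is H(PK): the provider can store and serve cells
   without being able to read them. */
static void storage_index (const read_cap *r,
                           unsigned char out[crypto_generichash_BYTES])
{
  crypto_generichash (out, crypto_generichash_BYTES,
                      r->pk, sizeof r->pk, NULL, 0);
}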

The vendor continues by noting that it doesn't actually matter much, from Lisa's perspective, what the key-value store is. It could be an Amazon S3-backed web service, a Tahoe-LAFS friendnet, home-grown CouchDB things, or GNUnet. This is an important property in a time in which peer-to-peer key-value stores have not yet stabilized. The vendor also says that they don't have accounting figured out yet, so they don't know how to charge people for storage, but they trust that the commercial folks will work that out.

context

These primitives are sufficient to build proper data structures on top of k-v stores -- tries, queues, and such -- all with fine-grained access controls, and without having to trust the store itself. Lisa can, if she ever grows up, publish all (or a set) of her diaries to the world, which could then form part of larger data structures, like the "wall" of whatever comes after Facebook.

It seems to me that this set of primitives is a minimal set. You could add in better support for immutable data, but since you can implement it in terms of mutable data, it seemed unnecessary.

This scheme was mostly inspired by Tahoe-LAFS. You can read a short and very interesting technical paper about it here.

future

Next up would be seeing if these primitives interact well with a capabilities-based security kernel for mobile code. Cross your fingers!

Syndicated 2011-03-24 14:28:17 from wingolog

bart and lisa, hacker edition

Bart and Lisa are both hacking on a computer run by Marge. Bart has a program that sorts a list of integers. Being a generous person, he shares it with Lisa. Lisa would like to use Bart's program, but she doesn't trust Bart -- she wants to make sure that the program is safe to run before running it. She would like to sort a list of credit card numbers and would be quite vexed indeed if Bart's sort procedure posted them all to a web site.

(This example and much of the discussion is taken from the excellent Rees thesis, which I highly, highly recommend.)

One approach she can take is to examine the program, and only run it if it is obviously harmless. Due to fundamental concerns like undecidability, this will be a conservative evaluation. Then if the program is deemed safe, it can be compiled and invoked directly, without having to put it in a sandbox.

The question I wish to discuss this evening is how to do this safe compilation in Guile, in the case in which a Guile program provides environments for Bart and Lisa to both hack on, at the same time, along with means for Bart and Lisa to communicate.

Let us imagine that Bart gives Lisa a program, as a string. The first thing to do is to read it in as a Scheme data structure. Lisa creates a string port, and calls the system's read procedure on the port. This produces a Scheme value.

Lisa trusts that the read procedure and the open-input-string procedures are safe to call. Indeed she has to trust many things from the system; she doesn't really have much of a choice. She has to trust Marge. In this case, barring bugs, that trust is mostly well-placed. The exception would be reader macros, which cause arbitrary code to run at read-time. Lisa must assume that Bart has not invoked read-hash-extend on her machine. Indeed, Marge cannot supply read-hash-extend to Bart or to Lisa, just as she would not give them access to the low-level foreign function interface.

This brief discussion indicates that there exist in Guile cross-cutting, non-modular facilities which cannot be given to users. If you want to create a secure environment in which programs may only use the capabilities they are provided, "user-level" environments must not include some routines, like read-hash-extend.

names come to have meanings

So, having proceeded on, Lisa reads a Scheme form which she can pretty-print to her console, and it looks like this:

(define sort
  (lambda (l <)
    (define insert
      (lambda (x sorted)
        (if (null? sorted)
            (list x)
            (if (< x (car sorted))
                (cons x sorted)
                (cons (car sorted) (insert x (cdr sorted)))))))
    (if (null? l)
        '()
        (insert (car l) (sort (cdr l) <)))))

How can Lisa know if this program is safe to run?

Actually, this question is something of a cart before the horse; what we should ask first is, what does this program mean? To answer that question, we have the lambda calculus, as part of the Scheme language definition.

In this form we have binding forms, bound variable references, free variable references, and a few conditionals and constants. The compiler obtains this information during its expansion of the form. Here we are assuming that Lisa compiles the form in an environment with the conventional bindings for define, lambda, and if.

Of these, the constants, conditionals, lexical binding forms, and bound variable references are all safe.

The only thing we need be concerned about are the free variable references (null?, list, car, cdr, cons, and, interestingly, sort itself), and the effect of the toplevel definition (of sort).

In Scheme, forms are either definitions, expressions, or sequences of forms. Definitions bind names to values, and have no value themselves. Expressions can evaluate to any number of values. So the question here for Lisa is: is she interested in a definition of a sort procedure, or a value? If the latter, she must mark this form as unsafe, as it is a definition. If the former, she needs to decide on what it means for Bart to provide a definition to her.

In Guile, top-level definitions and variable references are given meaning by modules. Free variables are scoped relative to the module in which the code appears. If Lisa expands the sort definition in a fresh module, then all of the free variables will be resolved relative to that module -- including the sort reference within the lambda expression.

Lisa can probably give Bart access to all of the primitives that his program calls: car, cons, etc.

We start to see how the solution works, then: a program is safe if all of its free variables are safe. if is safe. null? is safe. And so on. Lisa freely provides these resources to Bart (through his program), because she knows that he can't break them, or use them to break other things. She provides those resources through a fresh module, in which she compiles his program. Once compiled, she can invoke Bart's sort directly, via ((module-ref bart-module 'sort) my-credit-card-numbers <), without the need to run Bart's code in any kind of sandbox.

world enough, and time

However, there are two resources which Lisa transfers to Bart's program which are not passed as procedure arguments: time, and space.

What if Bart's program had a bug that made it fail to terminate? What if it simply took too much time? In this case, Marge may provide Lisa with a utility, safe-apply, which calls Bart's function, but which cancels it after some amount of time. Such a procedure is easy for Marge to implement with setitimer. setitimer is one of those cross-cutting concerns however which Marge would not want to provide to either Lisa or Bart directly.
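
For the curious, the setitimer trick looks something like this in C (a sketch only; call_with_timeout is my name, and a real safe-apply would need more care around nesting and signal safety):

#include <setjmp.h>
#include <signal.h>
#include <sys/time.h>

static sigjmp_buf bail;

static void on_alarm (int sig)
{
  (void) sig;
  siglongjmp (bail, 1);
}

/* Run thunk, giving up if it takes more than `seconds' of wall time. */
static int call_with_timeout (void (*thunk) (void), int seconds)
{
  struct itimerval t = { { 0, 0 }, { seconds, 0 } };

  if (sigsetjmp (bail, 1))
    return -1;                        /* cancelled: took too long */

  signal (SIGALRM, on_alarm);
  setitimer (ITIMER_REAL, &t, NULL);  /* arm the timer */
  thunk ();                           /* run Bart's code */
  t.it_value.tv_sec = 0;
  setitimer (ITIMER_REAL, &t, NULL);  /* disarm */
  return 0;
}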

The space question is much more difficult, however. Bart's algorithm might cons a lot of memory, but that's probably OK if it's all garbage. However it's tough for Marge to determine what is Lisa's garbage and what is Bart's garbage, and what non-garbage is Lisa's, and what non-garbage is Bart's, given that they may share objects.

In the Guile case, Marge has even fewer resources at her disposal, as the BDW-GC doesn't give you all that much information. If however she can assume that she can have accurate per-thread byte allocation counters, she can give Lisa the tools needed to abort Bart's program if it allocates too much. Similarly, if Marge restricts her program to only one OS thread, she can guesstimate the active heap size. In that way she can give Lisa tools to restrict Bart's program to a certain amount of active memory.

related

Note that Marge's difficulties are not unique to her: until just a few weeks ago, the Linux kernel was still vulnerable to fork bombs, which are an instance of denial-of-service through resource allocation (processes, in that case).

Nor are Lisa's, for that matter, considering the number of Android applications that are given access to user data, and then proceed to give it to large corporations, and Android does have a security model for apps. Likewise the sort UNIX utility has access to your SSH private key, and it's only the hacker ethic that has kept free software systems in their current remarkably benign state; though, given the weekly Flash Player and Acrobat Reader vulnerabilities, even that is not good enough (for those of us that actually use those programs).

future

I'm going to work on creating a "safe evaluator" that Lisa can use to evaluate expressions from Bart, and later call them. The goal is to investigate the applicability of the ocap security model to the needs of an autonomy- and privacy-preserving distributed computing substrate. In such an environment, Bart should be able to share an "app" (in the current "app store" sense) with all of his friends. His friends could then run the app, but without needing to trust the app maker, Bart, or anyone else. (One precedent would be Racket's sandboxed evaluators; though I don't understand all of their decisions.)

The difficulty of this model seems to me to be more on the side of data storage than of local computation. It's tough to build a distributed queue on top of Tahoe-LAFS, much less GNUnet. Amusingly though, this sort of safe evaluation can be a remedy for that: if an app not only stores data but code, and data storage nodes allow apps to run safe mapreduce / indexing operations there, we may regain more proper data structures, built from the leaves on up. But hey, I get ahead of myself. Until next time, happy hacking.

Syndicated 2011-03-19 21:43:28 from wingolog
