Older blog entries for wingo (starting at number 365)

10 Jun 2011 »

V8 is faster than GCC

Do you like the linkbait title? Neither do I, but it is true in this case. Check it out.

After my last post, Benjamin noted that GCC would reduce my simple test to a mov rax, $10000000; ret sequence. Well yes, that's true, and GCC does do that: but only if GCC is able and allowed to do the inlining. So the equivalent of the test, for GCC, is to compile the g in a separate file:

unsigned int g (void) { return 1; }

Then in the main file, we have:

extern unsigned int g (void);

int main (int argc, char *argv) {
  unsigned int i, ret = 0;
  for (i = 0; i < NU; i++)
    ret += g ();
  return ret;
}

Here we replace N with the number of iterations we're interested in testing.

I decided to run such a test; the harness is here. I ran both GCC and V8 for the above program, for N from 2⁰ to 2³¹, and plotted the total time taken by V8 as compared to GCC.

Naturally this includes the GCC compile times, for which we expect GCC to lose at low iterations, but the question is, when does GCC start being faster than V8?

answer: never!

That's right, V8 is always faster than GCC, right up to the point at which its fixnums start failing. For the record, only the points on the right of the graph are really worth anything, as the ones on the left only run for a few milliseconds. Raw data here.

Why is V8 faster, even for significant iteration counts? Because it can inline hot function calls at runtime. It can see more than GCC can, and it takes advantage of it: but still preserving the dynamic characteristics of JavaScript. In this context the ELF symbol interposition rules that GCC has to deal with are not such a burden!

Syndicated 2011-06-10 15:57:58 from wingolog

8 Jun 2011 »

what does v8 do with that loop?

Hi!

I've spent the last month or so swimming in V8. Not the tomato juice, mind you, though that would be pleasant, but rather the JavaScript implementation.

In my last dispatch I made the assertion that the current JavaScript implementations are totally undocumented, but in the case of V8 that's not precisely true. There were at least two PhD dissertations written on it, along with a few technical reports and conference proceedings.

I refer of course to the Self work at Stanford and Sun, back in the late eighties and early nineties. I had read a number of their papers, but hadn't seen Hölzle's thesis yet. That one, for me, is the best of all, because it explains the Self implementation on an engineering level. The V8 implementation, I mean to say, because the vocabulary is entirely the same: maps, on-stack replacement, lazy deoptimization, etc. These are all parts of V8 now.

And it's no wonder, really. Lars Bak, the lead developer of V8, was there back then, hacking on Self, and then Strongtalk, then HotSpot, then a little startup built around virtual machines on mobile devices. So you see there's a reason why V8 doesn't have very much big-picture internal documentation -- Bak has been writing the same program for 20 years now; he knows how it works.

(Note however that Bak's name never appears in the V8 commit logs. You can see his hand at work, but never his face. Like Dr. Claw. Actually not very much like Dr. Claw but I just wanted to put that out there. Is Lars Bak Dr. Claw? Is he?)

enough with the personification

As you might recall, V8 always compiles JavaScript to native code. The first time V8 sees a piece of code, it compiles it quickly but without optimizing it. The initial unoptimized code is fully general, handling all of the various cases that one might see, and also includes some type-feedback code, recording what types are being seen at various points in the procedure.

At startup, if the "Crankshaft" optimizing compiler is enabled, V8 spawns off a profiling thread. If it notices that a particular unoptimized procedure is hot, it collects the recorded type feedback data for that procedure and uses it to compile an optimized version of the procedure. The old unoptimized code is then replaced with the new optimized code, and the process continues. (This on-stack replacement (OSR) process is covered in Hölzle's thesis.)

The optimized procedure does not actually cover all cases. If a callee procedure starts returning floating-point values where it was returning integer values before, the optimized procedure is deoptimized -- a relatively expensive process of recompiling the original procedure to the unoptimized form, replacing the optimized function on the stack with the unoptimized version, and then continuing the computation. This requires that the compiler keep around additional information about how to continue the computation -- you could be in the middle of a loop, after all.

Deoptimization is particularly tricky if the optimizing process inlined a callee procedure, and thus has to de-inline, replacing the activation of one optimized procedure call with two or more unoptimized calls. Again, Hölzle's thesis discusses this in depth. Inlining is such a win though that the complexity appears to be worth it.

assembly

I wanted to see what V8 does with a simple loop, but one for which lexical inlining isn't possible. Like this:

function g () { return 1; }
function f () { 
  var ret = 0;
  for (var i = 1; i < 10000000; i++) {
    ret += g ();
  }
  return ret;
}

Pretty simple: adding 1 to a counter, 10 million times, but attempting to foil statically apparent inlining possibilities. I entered these two definitions at the d8 shell, invoked with the --code_comments --print_code options.

Note that running V8 in this way is going to spew a lot on your console, as V8 itself warms up. Startup is quick, though. On my modern laptop with an SSD, the debugging shell takes about 17ms to start up. The standard shell takes about 5ms to start. Both of these numbers are with snapshots on; without snapshots, the numbers are more like 32ms and 18ms, respectively. Just for comparison, the JavaScriptCore shell (jsc) takes about 12ms to start here.

Interestingly, V8's profiler decides that the best thing to do here is not to optimize g -- which it actually can't, as it's so small the unoptimized code is already optimal -- but to inline g into f, and optimize f.

V8 is able to do this inlining because it keeps around the parsed AST of every piece of code it sees. It needs some high-level internal representation, and it turns out that the AST is the easiest one: it's fairly small, it's already the input format to the "full-codegen" unoptimized compiler, and it also has all of the lexical information necessary to do good inlining analysis. Much easier than trying to decompile bytecode, it seems to me.

I did say that I was going to show some assembly, so here we go. This is what d8 prints out when evaluating f(). I've trimmed the output a bit. The comments on the right are my annotations.

Instructions (size = 466)
  0  push rbp           ;; Save the frame pointer.
  1  movq rbp,rsp       ;; Set the new frame pointer.
  4  push rsi           ;; Save the callee's "context object".
  5  push rdi           ;; Save the callee's JSFunction object.
  6  subq rsp,0x28      ;; Reserve space for 5 locals.

Here we have a standard prelude. The JavaScript calling conventions in V8 pass arguments on the stack, using rbp and rsp as callee-saved stack and frame pointers. Additionally, information associated with the function itself is passed in rsi and rdi: the context, and the function object itself. The context is an array optimized for looking up various information that the function needs at runtime, mostly free variables (lexical and global).

In this case it's redundant to take the context out of the function, but it does allow for faster access to the global scope object that was current when the function was defined. In the case of closures, every time the function() expression is evaluated, a new context will be created with a new set of free variables.

Anyway! Moving along:

 10  movq rax,rsi        ;; Store the context in a scratch register.
 13  movq [rbp-0x38],rax ;; And save it in the last (4th) stack slot.
 17  cmpq rsp,[r13+0x0]  ;; Check for stack overflow
 21  jnc 28              ;; If we overflowed,
 23  call 0x7f7b20f40a00 ;; Call the overflow handler.

Here we store the context again, in the last stack slot, and we check for overflow. On x86-64, V8 reserves r13 for a pointer to somewhere in the middle of a "global object", holding the GC roots for a given JavaScript execution context. There is a cell in the root list that holds the stack limit, which V8 abuses for other purposes: interrupting loops, breakpoints, deoptimization, etc. Again, Hölzle's thesis has some details on this technique.

The "somewhere in the middle" bit is arranged so that a simple dereference of r13 will allow us to get at the stack limit. V8 will reset the stack limit whenever it needs to interrupt a procedure.

Having passed the stack overflow check, we initialize local variables:

 28  movq rax,[rbp+0x10] ;; Receiver object to rcx (unused).
 32  movq rcx,rax        ;;
 35  movq rdx,[rbp-0x38] ;; Global object to rdx.
 39  movl rbx,(nil)      ;; Loop counter (i) to 0.
 44  movl rax,(nil)      ;; Accumulator (ret) to 0.
 49  jmp 97              ;; Jump over some stuff.

Note that the receiver object (the this object) is passed as an argument, and thus is above rbp.

Following this bit we have some code that appears to be completely dead. I'll include it for completeness, but unless the deoptimization bailouts jump back here, I don't know why it's there.

 54  movq rax,rsi        ;; Dead code.
 57  movq rbx,[rbp-0x28]
 61  testb rbx,0x1
 64  jnz 189  (0x7f7b20fa2a7d)
 70  shrq rbx,32
 74  movq rdx,[rbp-0x30]
 78  testb rdx,0x1
 81  jnz 237  (0x7f7b20fa2aad)
 87  shrq rdx,32
 91  movq rcx,[rbp-0x18]
 95  xchgq rax, rdx

We'll just forget that happened, shall we? However we are getting closer to the firm tofu of the matter, the loop. First one more check:

 97  movq rdx,[rsi+0x2f] ;; Slot 6 of the context: the global object.
101  movq rdi,0x7f7b20e401e8 ;; Location of cell holding `g'
111  movq rdi,[rdi]      ;; Dereference cell
114  movq r10,0x7f7b205d7ba1 ;; The expected address of `g'
124  cmpq rdi,r10        ;; If they're not the same...
127  jnz 371             ;; Deoptimization bailout 2

Here we see if the current definition of g, the function that we inlined below, is actually the same as when the inlining was performed.

Note that on line 97, since pointers in V8 are low-bit tagged with 01, to access slot 6 (0-based) of the context object, we only need to add 0x2f instead of 0x30. Clever, right? But we don't actually need the global object here in the main loop, so we could have delayed this load until finding out that deoptimization was necessary. Perhaps it was needed though for GC reasons.

133  movq rdx,[rdx+0x27] ;; Another redundant load.
137  cmpl rbx,0x989680   ;; 10000000, you see.
143  jge 178             ;; If i >= 10000000, break.
149  movq rdx,rax        ;; tmp = ret
152  addl rdx,0x1        ;; tmp += 1
155  jo 384              ;; On overflow, deoptimize.
161  addl rbx,0x1        ;; i++
164  movq rax,rdx        ;; ret = tmp
167  cmpq rsp,[r13+0x0]  ;; Reload stack limit.
171  jnc 137             ;; Loop if no interrupt,
173  jmp 306             ;; Otherwise bail out.
178  shlq rax,32         ;; Tag rax as a small integer.
182  movq rsp,rbp        ;; Restore stack pointer.
185  pop rbp             ;; Restore frame pointer.
186  ret 0x8             ;; Return, popping receiver.

And that's it! It's a fairly tight loop: g is inlined of course, its return value is untagged to a native int32, as are the accumulator and loop counter variables. Of course, improvements are possible -- the loop could be unrolled a few times, range analysis could avoid the deoptimization check, the overflow check could possibly be cheaper, and indeed the whole thing could be folded, but all in all, good job, MAD kittens!

Note the interesting approach to tagging: instead of storing the integer in the lower 32 bits, shifted by one, it is stored in the upper 32 bits, without a shift.

Actually there's some more junk after all of this. Another dead code block, apparently meaning to deal with floating-point values, but totally unreferenced:

189  movq r10,[r13-0x38]
193  cmpq [rbx-0x1],r10
197  jnz 397
203  movsd xmm0,[rbx+0x7]
208  cvttsd2sil rbx,xmm0
212  cvtsi2sd xmm1,rbx
216  ucomisd xmm0,xmm1
220  jnz 397
226  jpe 397
232  jmp 74
237  movq r10,[r13-0x38]
241  cmpq [rdx-0x1],r10
245  jnz 410
251  movsd xmm0,[rdx+0x7]
256  cvttsd2sil rdx,xmm0
260  cvtsi2sd xmm1,rdx
264  ucomisd xmm0,xmm1
268  jnz 410
274  jpe 410
280  testl rdx,rdx
282  jnz 301
288  movmskpd xmm2,xmm0
292  andl rdx,0x1
295  jnz 410
301  jmp 91

And then there's our stack check handler, saving all of the registers of interest, and calling out:

306  push rax
307  push rcx
308  push rdx
309  push rbx
310  push rsi
311  push rdi
312  push r8
314  push r9
316  push r11
318  push r14
320  push r15
322  leaq rsp,[rsp-0x28]
327  movq rsi,[rbp-0x8]
331  xorl rax,rax
333  leaq rbx,[r13-0x20a4b70]
340  call 0x7f7b20fa25c0
345  leaq rsp,[rsp+0x28]
350  pop r15
352  pop r14
354  pop r11
356  pop r9
358  pop r8
360  pop rdi
361  pop rsi
362  pop rbx
363  pop rdx
364  pop rcx
365  pop rax
366  jmp 137

And finally the deoptimization bailouts:

371  movq r10,0x7f7b20f7c054
381  jmp r10
384  movq r10,0x7f7b20f7c05e
394  jmp r10
397  movq r10,0x7f7b20f7c068
407  jmp r10
410  movq r10,0x7f7b20f7c072
420  jmp r10

Whew! I'd say that's long enough for today. I wanted to establish a fixed point on the low level though, so that I could write in the future about how it is that V8 compiles down to this code, what deoptimization is about, and such; filling in the middle, between the assembly and the JavaScript source. Corrections and suggestions in the comments please. Happy hacking!

Syndicated 2011-06-08 14:33:55 from wingolog

18 May 2011 »

value representation in javascript implementations

In my last note I mentioned that I had been doing a lot of reading of JavaScript implementation code. One point that I didn't mention is that the state of the art is completely undocumented. So to work on rectifying that, here's the first in what might be a series of articles about the internals of JavaScript implementations.

tagging

Initially, all JavaScript implementations used tagged pointers to represent JS values. This is a old trick that comes from the observation that allocated memory takes up at least 4 or 8 bytes, and are aligned in such a way that the least significant bit or three will be zero.

Here's an old part of the Guile manual that explains the technique. For example you could take values whose lowest bit is 1 to be a pointer to the heap. Any value that ends in 0 could be an integer, represented directly in the bits of the pointer. (If your pointer tag is not 0, you will have to mask off the low bits before dereferencing a pointer..)

Google's V8 JavaScript engine still does it this way. This yields 31-bit immediate signed integers; you can just do a signed right-shift to get the value. They call them "smi" values, for "small integers"; they are more commonly called "fixnums". There are other fun tricks you can do to avoid shifts for some common operations, like addition. Note that they also take advantage of the four-byte alignment (in their case) to encode another immediate value, "failure objects", using the other low bit.

The size of a tagged pointer is the size of a pointer, so 32 bits on a 32-bit machine, and 64 on a 64-bit machine. V8 doesn't actually use 63-bit smi values on 64-bit machines, AFAIK.

nan-boxing

JavaScriptCore, the implementation used by WebKit, switched from tagged pointers to a "nan-boxed" format. The best explanation of this strategy is given by Rob Sayre of Mozilla. I believe the technique was first made popular by LuaJIT, though I would appreciate additional references.

Sayre does a good job of explaining things, but to recap, there are about 2⁵³ bit patterns that represent not-a-number values for IEEE-754 double-precision floats, but that only one of those is actually emitted by current hardware and software. So cheeky implementations are free to use the other patterns for their own purposes.

What I always wondered when I heard about this strategy was how 53 bits were possibly enough, for a 64-bit machine. How are you going to fit a pointer in that space?

Well it turns out that on x64-64 systems, you only have a 48-bit address space, currently anyway. The shape of that space is quite peculiar, too. I don't know whether Linux gives out addresses in the upper half, but Windows does not, though Solaris does.

Anyway, the other way of looking at NaN-boxing is that there are 2⁶⁴ - 2⁴⁸ values in a pointer that aren't being used, and we might as well stuff something in them, and hey, the whole space of valid double precision numbers fits!

It's convenient too that numbers in JS are specified as doubles. (Even still, all implementations define separate "small integer" types, for use in loops and indexing and such; integer operations are faster when you can use them, too.)

JSC's implementation of nan-boxing is described here, and you can see some of the neat tricks they do here. It actually makes me envious of C++, as a C programmer!

nun-boxing

So, when you choose to do nan-boxing, you basically choose to do one of two things: you favor pointers, or you favor doubles. To favor pointers means that you recognize pointers as having initial (most-significant) 0 bits; if the initial bits are not 0, then it's a double, and you have to add or subtract a bit pattern to get to the double value.

Favoring doubles means that pointers are left as NaN values, so their initial bits are all ones (or the sign bit can be unset, it doesn't matter), and to unpack a pointer, you have to rotate the double space around.

Amusingly, there is a third option as well. For 32-bit machines, you can address the second word in the double directly, so there is no need to mask off anything. This is the JSVALUE32_64 case mentioned in the JSValue code I linked to above.

JSC chose to favor pointers, and as the first JS implementation to nan-box, got to call their choice "nan-boxing". Mozilla chose to favor doubles, and so made up the silly name "nun-boxing".

get thee to a nun boxery?

So you're implementing a language. Should you nan-box or not? I can't say in your case but I can give a couple of examples.

NaN-boxing has the obvious advantage of not allocating doubles on the heap. This reduces cache pressure, GC pressure, and such. That's why Moz and JSC chose it.

V8 on the other hand has not chosen it, at least not yet, anyway. I think that one reason is because especially on embedded devices, and to an extent on ia32, passing around 64-bit values is a big lose. It's so bad that I understand that Mozilla actually passes these values by reference instead of by value in some places, on 32-bit systems only.

But the real reason that V8 doesn't nan-box I think is that they have a compiler that is able to unbox values in registers and in temporary stack locations, both as int32 values and as double values. This works for loops and such things in a function, and is obviously quite efficient, as you don't need to box and unbox at all. Passing doubles as arguments or return values however does allocate them on the heap, as far as I can tell anyway.

Also there is the hackery that with NaN-boxing, you assume things about your OS's memory management. A language implementation should be able to get around this with use of mmap at specific addresses, but still, it's tricky.

I looked at doing nan-boxing for Guile, and eventually decided against it. Guile has a few more constraints that V8 does not have. Firstly it is very portable, so it can't rely the address range constraints that x86-64 has, not without some hackery. It's also portable to the low end, like V8, so 64-bit values are a lose there.

But the killer is the conservative GC on 32-bit systems. If you represent integers with the tag in the high bits, as you would with nan-boxing, then the lower 32 bits are not distinguishable from a pointer, so they can cause the GC to retain heap objects for longer than it should. With tagged pointers, the low-bit masking lets the GC know that an integer is not a pointer. If you were to reintroduce low-bit tagging to a nan-boxed system, you've lost much of the advantage of it. Furthermore if you want to support more than 32 bits of precision in a fixnum, then on a 32-bit system you need to play games with sign extension of a 48-bit value to a 64-bit value, and for all word sizes it looks like the overhead will be significantly higher than tagging.

Finally I think that if we're really going to shoot for the moon, we should implement something like V8's Crankshaft compiler that does decent type inference, to allow unboxed values in registers and on the stack. (For readers of source code, "crankshaft" is basically the "hydrogen" flow-graph code plus a "lithium" assembler. Confusing, right?)

Well that's all for now. If you corrections, or an idea for a further topic, let me know in the comments. Thanks!

Syndicated 2011-05-18 16:57:41 from wingolog

15 May 2011 »

javascript in 2011

Good evening tubes,

This hacklog needs an update, and how! I meant to write some stuff about Guile, which recently got some pretty great performance improvements, but instead bits on JavaScript were what trickled out of my fingers onto this Emacs. So here we go!

kung fu

I took a fairly deep dive into the world of JavaScript implementations recently, and man, what a world: a rotated-double-space punboxed jaeger-lithium crankshafted nitro-assembling trace stack of bit-froth. Yow! Now that's a fat value!

That probably makes no sense, right, but you have to understand that I've been jacked into this matrix for like three weeks now, and they speak a different language out there. If I were to draw some preliminary conclusions about the state of the art in JavaScript implementations, they would be these:

Trace looked fun, but it seems that it's possible to get trace-level performance with more traditional compilation techniques. (The trick appears to be to treat tagging, untagging and type checks as nodes in your control-flow graph.) V8 already has an SSA-based intermediate language, JavaScriptCore (the WebKit JavaScript implementation) is moving to one, and the above bug is from the Mozilla folk.
As a correlary to the previous point, data representation matters. It seems to me that JavaScriptCore's solid improvements have all been centered on cleanliness: keeping it simple, direct, and clean leads to speed. Right now JSC is trailing the pack again, but with a good SSA compiler, they could reach V8 again. Inner loops are important, but that's not where big applications spend most of their time.
The JS benchmarks are terrible, but everyone is using them to optimize their implementations. There's a serious danger of garbage-in, garbage-out here.

To my eye the V8 ones are the best general-purpose benchmarks: the ones that represent not just the inner loops, but bigger programs. But hey, maybe that's just the Schemer in me amused at seeing cross-compiled Lisp benchmarks in their suite.
Tool support is terrible.

All of these projects have to implement their own profilers, disassemblers, debuggers, etc, and for all platforms, in theory. In practice, yuck. GDB just got support for JIT-generated code, and it's terrible: you have to generate whole object files (ELF on ELF architectures, presumably mach-O or whatever on Darwin, and who knows on windows!), and if you want debugging information, you have to generate DWARF.

Of course oprofile couldn't use the same format, it has its own thing, and it's not the same as the lovely but ungooglable perf, which appears to only support map files generated by some JVM no one has heard of, but has no support for garbage-collecting executable code or for line numbers. Aaargh!!!

Incidentally, my colleague Xan appears to have run into these issues independently.

I've been joking recently that I've been coming up like Neo, in the training sequence, eyes open, drawn breath: "whoa: I know ~~kung-fu~~nun-boxing."

All in all JavaScript in mid-2011 is looking much better than JavaScript in mid-2010 -- humbling, even -- and I look forward to hacking more on it. Actually, simply "hacking on it", as I have been taking uncomfortably long to get up to speed on things. (No one has yet said "Really? Show me." Ha ha.)

I suspect that tools is where I'll start. It sure would be nice to have line-level profiling interleaved with disassembly for JavaScript, somehow. (I hope it would be perf, but they seem an opinionated bunch. I hope you don't have to be Peter Zijlstra to get something into it.)

forward-looking statements

I try not to make promises about the future, so consider it simply a prediction, that the next entry will feature more parentheses than this one. Until then, happy hacking!

Syndicated 2011-05-15 17:53:21 from wingolog

25 Apr 2011 »

meta: per-tag feeds

Just a brief meta-note: I've added per-tag feeds to tekuti, based on a patch kindly provided by Brian Gough, and unkindly neglected by me these last four months.

Anyway, you can now access http://wingolog.org/feed/atom?with=guile&without=web&with=scheme and other such URLs. Should be good for planets.

Let me know if there are any problems. Thanks!

Syndicated 2011-04-25 18:13:27 from wingolog

5 Apr 2011 »

new beginnings

Friday is my last day at Oblong. It's been good times: my colleagues have all been generous, intelligent, respectful folk. It has been a real privilege to work with them for these few years, and I wish them all the best. Oblong is in a great position to emancipate both pixels and people from the proxy world of mice and windows.

At Oblong I was mostly focused on video, but if you're a regular reader of this electro-rag, you'd see that my passion is more on the side of compilers and free software, especially Guile. So when the opportunity presented itself to do something more directly aligned with those interests, I let myself be tempted.

igalia ho!

So what's next? Igalia is what's next! I'll join their new compilers group later this month, with an initial focus on JavaScriptCore optimization work and on Guile projects.

Igalia is a Spanish free software consulting shop that is fairly well-known in the GNOME community, but perhaps not so widely outside of it. I admit: it is refreshing (and relieving) to return to free software "on the clock", and to hack compilers for a living. This is fantastic. But that's not the thing that's really awesome (and unique, as far as I know) about Igalia.

No, the truly great thing is that Igalia is an entirely egalitarian organization. All important decisions are taken by the assembly, using consensus and similar collective decision-making procedures. That includes research, business development, budgets, hiring, salaries: everything. After a few years working there you become a partner, which makes you an equal co-owner of the business.

Wild, no? And wonderful. Igalia has been around for almost 10 years now, so the system definitely works, and I am very much looking forward to seeing how it works. It's a project, and I'm pleased to be a part.

pottage

My last note of this kind quoted Thoreau, ending with:

If I should sell both my forenoons and afternoons to society, neglecting my peculiar calling, there would be nothing left worth living for. I trust that I shall never thus sell my birthright for a mess of pottage.

Time passes, and it turns out that, among other things, Guile is my "peculiar calling". So the thing that I've worked out with Igalia folk is to go back to full time, though still with a nominal split between JavaScriptCore work and Guile work.

To be honest this split seems to be something akin to therapy. Before, I sold my time for sustenance, and if I could get interesting projects and nice folks on top of that, then great. But now, the prospect of drawing passion and work closer together is daunting at the same time that it's exciting. Hopefully, eventually, this split will heal itself, into a state in which I do what needs doing, while doing what I love. We'll see.

I'll be staying in Barcelona, at least until the fall. Until next time, dear readers, happy hacking. It's good to be back!

Syndicated 2011-04-05 06:16:26 from wingolog

28 Mar 2011 »

git-brunch(1)

NAME: git-brunch - Fresh fruit before lunch
SYNOPSIS: git brunch [-p]
DESCRIPTION: With no arguments, stashes all work and cherry-picks a snack onto the dining room working tree. Option -p runs a git-prune-snacked(1) afterwards.
NOTES: If running git-brunch does not produce the desired effects, try with sudo. For external brunches, see git-remote.
BUGS: Objects received via git-fetch-snack(1) should be inspected manually to ensure their fitness for brunch. Strawberries, cream, and waffles are currently unimplemented.
SEE ALSO: git-cherry-pick(1), git-snack-objects(1)

Independently discovered a few days ago, it seems.

git-brunch: the power of an idea whose time has come.

Syndicated 2011-03-28 11:24:52 from wingolog

24 Mar 2011 »

storage primitives for the distributed web

An example-driven discussion of what seems to me to be the minimal set of persistent storage primitives needed in an autonomous web.

❧

Lisa would like to keep a private journal, and store the journal entries in Marge's computer. Lisa is not worried about Marge reading her journal entries, but she wants to make sure that Bart, who also uses the computer, does not have access to her journal. Marge needs to give Lisa some storage primitives so that she can write her journal-entry program.

Marge calls up her operating system vendor, and gets the following procedure documentation in response:

make-cell init-string: Create a new storage location, and store init-string in it. Returns a write-cap, which allows a user to change the string associated with this storage location.
write-cap? obj: Return #t if obj is a write-cap, and #f for any other kind of object.
cell-set! write-cap string: Set the string associated with a cell to a new value.
write-cap->read-cap write-cap: Take a write-cap and return a read-cap, which allows a user to read the value associated with a cell.
read-cap? obj: Return #t if obj is a read-cap, and #f for any other kind of object.
cell-ref read-cap: Return the string associated with the cell.

Marge makes all of these procedures available to Lisa. Lisa then starts to program. She does not want to allow herself to edit her old entries, so she makes a helper:

(define (put-text string)
  (write-cap->read-cap (make-cell string)))

Since put-text throws away the write-cap for this string, nothing else will be able to change an entry, once it is written.

Read-caps and write-caps are capabilities. They are unforgeable. Since Lisa did not give any of these capabilities to Bart, she feels safe typing her innermost thoughts into put-text.

persistent objects

In Scheme, capabilities are objects. A piece of code has capabilities, in the normal English sense of the term, to all of the objects that are in its scope, and to objects that are in the current module.

But since all does not live in the warm world of Scheme, the storage primitives that the computer vendor provides allow read-caps and write-caps to be serialized to strings:

cap->string cap: Return a string representation of the capability cap.
string->cap string: Return a capability corresponding to string. Read capabilities have a different representation from write capabilities, so this procedure may return a read-cap or a write-cap. It returns #f if this string does not look like a capability.

When Lisa started writing, she wrote down the cap-strings for all of her entries in a book. Then when she wants to read them, she types the capability strings into the terminal:

(cell-ref (string->cap "read-cap:a6785kjiyv8c0..."))
=> "I was really happy with my solo today at band, but..."

But this got tiring after a while, so she decided to store the list of capabilities instead:

;; Build a list data type.
;;
(define (make-list-cell)
  (make-mutable-cell ""))

(define (list-cell-ref read-cap)
  (let ((str (cell-ref read-cap)))
    (if (equal? str "")
        '()
        (map string->cap (string-split str #\,)))))

(define (list-cell-set! write-cap caps)
  (cell-set! write-cap
             (string-join (map cap->string caps) ",")))

;; Helper.
;;
(define (->readable cap)
  (if (write-cap? cap)
      (write-cap->read-cap cap)
      cap))

;; Make a new cell, and print out its cap-string.
;; Note to self: write down this string!
;;
(display (cap->string (make-list-cell)))

(define (add-entry! entries cap)
  (list-cell-set!
   entries
   (cons cap (list-cell-ref (->readable *all-entries*)))))

Now she just has to write down the cap-string of the new list cell that she made, and she has a reference to all of her entries. Whenever she writes a new entry, she uses add-entry! to update the cell's value, adding on the new cap-string.

distribution

Colin's father Bono has a computer just like Marge's, and Lisa would like for Colin to be able to be able to read some specific entries. So she asks Marge how to give access to the cells that she is using for data storage to other machines.

Marge asks her vendor, and the vendor says that actually, the cells implementation that was provided to her stores its data in the cloud. So Lisa can just give Colin a cap-string -- read-only, presumably -- for the essays that she would like to share, and all is good.

Marge doesn't know what this means, but she tells Lisa, and Lisa freaks out. "You mean that I've been practicing Careless Computing this whole time, and you didn't tell me??? You mean that Chief Wiggum could have called up the cloud storage provider and gotten access to all of my data?"

careful computing

Lisa's concern is a good one. Marge puts her in contact with the vendor directly, who explains that actually, the cells implementation is designed to preserve users' autonomy.

Creating a cell creates an RSA key-pair. The write-cap contains the signing key (SK) and the public key (PK), and the read-cap contains just the PK. Before sending data to the cloud provider, the data is signed and encrypted using the SK, so only people with access to the read-cap can actually decrypt the strings.

The cells are stored in a standard key-value store. The key is computed as the secure hash (H) of the PK, so that even the cloud storage provider cannot decrypt the data. Furthermore, the cell-ref user does not rely on the provider to vouch for the data's integrity, as she can verify it directly against the PK. The only drawback is that Lisa cannot be sure that cell-ref is returning the latest value of a cell, whatever that means.

The vendor continues by noting that it doesn't actually matter much, from Lisa's perspective, what the key-value store is. It could be Amazon S3-backed web service, a Tahoe-LAFS friendnet, home-grown CouchDB things, or GNUnet. This is an important property in a time in which the peer-to-peer key-value stores have not yet stabilized. The vendor also says that they don't have accounting figured out yet, so they don't know how to charge people for storage, but that they trust that the commercial folks will work that out.

context

These primitives are sufficient to build proper data structures on top of k-v stores -- tries, queues, and such -- all with fine-grained access controls, and without having to trust the store itself. Lisa can, if she ever grows up, publish all (or a set) of her diaries to the world, which could then form part of larger data structures, like the "wall" of whatever comes after Facebook.

It seems to me that this set of primitives is a minimal set. You could add in better support for immutable data, but since you can implement it in terms of mutable data, it seemed unnecessary.

This scheme was mostly inspired by Tahoe-LAFS. You can read a short and very interesting technical paper about it here.

future

Next up would be seeing if these primitives interact well with a capabilities-based security kernel for mobile code. Cross your fingers!

Syndicated 2011-03-24 14:28:17 from wingolog

19 Mar 2011 »

bart and lisa, hacker edition

Bart and Lisa are both hacking on a computer run by Marge. Bart has a program that sorts a list of integers. Being a generous person, he shares it with Lisa. Lisa would like to use Bart's program, but she doesn't trust Bart -- she wants to make sure that the program is safe to run before running it. She would like to sort a list of credit card numbers and would be quite vexed indeed if Bart's sort procedure posted them all to a web site.

(This example and much of the discussion is taken from the excellent Rees thesis, which I highly, highly recommend.)

One approach she can take is to examine the program, and only run it if it is obviously harmless. Due to fundamental concerns like undecidability, this will be a conservative evaluation. Then if the program is deemed safe, it can be compiled and invoked directly, without having to put it in a sandbox.

The question I wish to discuss this evening is how to do this safe compilation in Guile, in the case in which a Guile program provides environments for Bart and Lisa to both hack on, at the same time, along with means for Bart and Lisa to communicate.

Let us imagine that Bart gives Lisa a program, as a string. The first thing to do is to read it in as a Scheme data structure. Lisa creates a string port, and calls the system's read procedure on the port. This produces a Scheme value.

Lisa trusts that the read procedure and the open-input-string procedures are safe to call. Indeed she has to trust many things from the system; she doesn't really have much of a choice. She has to trust Marge. In this case, barring bugs, the choice is mostly valid. The exception would be reader macros, which cause arbitrary code to run at read-time. Lisa must assume that Bart has not invoked read-hash-extend on her machine. Indeed, Marge cannot supply read-hash-extend to Bart or to Lisa, just as she would not give them access to the low-level foreign function interface.

This brief discussion indicates that there exist in Guile cross-cutting, non-modular facilities which cannot be given to users. If you want to create a secure environment in which programs may only use the capabilities they are provided, "user-level" environments must not include some routines, like read-hash-extend.

names come to have meanings

So, having proceeded on, Lisa reads a Scheme form which she can pretty-print to her console, and it looks like this:

(define sort
  (lambda (l <)
    (define insert
      (lambda (x sorted)
        (if (null? sorted)
            (list x)
            (if (< x (car sorted))
                (cons x sorted)
                (cons (car sorted) (insert x (cdr sorted)))))))
    (if (null? l)
        '()
        (insert (car l) (sort (cdr l) <)))))

How can Lisa know if this program is safe to run?

Actually, this question is something of a cart before the horse; what we should ask is, what does this program mean? To that question, we have the lambda calculus to answer for, as part of the Scheme language definition.

In this form we have binding forms, bound variable references, free variable references, and a few conditionals and constants. The compiler obtains this information during its expansion of the form. Here we are assuming that Lisa compiles the form in an environment with the conventional bindings for define, lambda, and if.

Of these, the constants, conditionals, lexical binding forms, and bound variable references are all safe.

The only thing we need be concerned about are the free variable references (null?, list, car, cdr, cons, and, interestingly, sort itself), and the effect of the toplevel definition (of sort).

In Scheme, forms are either definitions, expressions, or sequences of forms. Definitions bind names to values, and have no value themselves. Expressions can evaluate to any number of values. So the question here for Lisa is: is she interested in a definition of a sort procedure, or a value? If the latter, she must mark this form as unsafe, as it is a definition. If the former, she needs to decide on what it means for Bart to provide a definition to her.

In Guile, top-level definitions and variable references are given meaning by modules. Free variables are scoped relative to the module in which the code appears. If Lisa expands the sort definition in a fresh module, then all of the free variables will be resolved relative to that module -- including the sort reference within the lambda expression.

Lisa can probably give Bart access to all of the primitives that his program calls: car, cons, etc.

We start to see how the solution works, then: a program is safe if all of its free variables are safe. if is safe. null? is safe. And so on. Lisa freely provides these resources to Bart (through his program), because she knows that he can't break them, or use them to break other things. She provides those resources through a fresh module, in which she compiles his program. Once compiled, she can invoke Bart's sort directly, via ((module-ref bart-module 'sort) my-credit-card-numbers), without the need to run Bart's code in any kind of sandbox.

world enough, and time

However, there are two resources which Lisa transfers to Bart's program which are not passed as procedure arguments: time, and space.

What if Bart's program had a bug that made it fail to terminate? What if it simply took too much time? In this case, Marge may provide Lisa with a utility, safe-apply, which calls Bart's function, but which cancels it after some amount of time. Such a procedure is easy for Marge to implement with setitimer. setitimer is one of those cross-cutting concerns however which Marge would not want to provide to either Lisa or Bart directly.

The space question is much more difficult, however. Bart's algorithm might cons a lot of memory, but that's probably OK if it's all garbage. However it's tough for Marge to determine what is Lisa's garbage and what is Bart's garbage, and what non-garbage is Lisa's, and what non-garbage is Bart's, given that they may share objects.

In the Guile case, Marge has even fewer resources at her disposal, as the BDW-GC doesn't give you all that much information. If however she can assume that she can have accurate per-thread byte allocation counters, she can give Lisa the tools needed to abort Bart's program if it allocates too much. Similarly, if Marge restricts her program to only one OS thread, she can guesstimate the active heap size. In that way she can give Lisa tools to restrict Bart's program to a certain amount of active memory.

related

Note that Marge's difficulties are not unique to her: until just a few weeks ago, the Linux kernel was still vulnerable to fork bombs, which are an instance of denial-of-service through resource allocation (processes, in that case).

Nor are Lisa's, for that matter, considering the number of Android applications that are given access to user data, and then proceed to give it to large corporations, and Android does have a security model for apps. Likewise the sort UNIX utility has access to your SSH private key, and it's only the hacker ethic that has kept free software systems in their current remarkably benign state; though, given the weekly Flash Player and Acrobat Reader vulnerabilities, even that is not good enough (for those of us that actually use those programs).

future

I'm going to work on creating a "safe evaluator" that Lisa can use to evaluate expressions from Bart, and later call them. The goal is to investigate the applicability of the ocap security model to the needs of an autonomy- and privacy-preserving distributed computing substrate. In such an environment, Bart should be able to share an "app" (in the current "app store" sense) with all of his friends. His friends could then run the app, but without needing to trust the app maker, Bart, or anyone else. (One precedent would be Racket's sandboxed evaluators; though I don't understand all of their decisions.)

The difficulty of this model seems to me to be more on the side of data storage than of local computation. It's tough to build a distributed queue on top of Tahoe-LAFS, much less GNUnet. Amusingly though, this sort of safe evaluation can be a remedy for that: if an app not only stores data but code, and data storage nodes allow apps to run safe mapreduce / indexing operations there, we may regain more proper data structures, built from the leaves on up. But hey, I get ahead of myself. Until next time, happy hacking.

Syndicated 2011-03-19 21:43:28 from wingolog

19 Mar 2011 »

/usr/local, fedora, rpath, foo

Good evening intertubes,

I wish to broach a nerdy topic, once more. May my lay readers be rewarded in some heaven for your earthly sufferings. I'll see you all at the bar.

problem statement

For the rest of us, we have a topic, and that topic is about autotools, software installation, -rpath, and all that kind of thing.

As you might be aware, last month we released Guile 2.0.0. Of course we got a number of new interested users. For some of them things went great. It all started like this:

wget http://ftp.gnu.org/gnu/guile/guile-2.0.0.tar.gz
tar zxvf guile-2.0.0.tar.gz
cd guile-2.0.0
./configure
make
sudo make install

Up to here, things are pretty much awesome for everybody. Obviously the next thing is to run it.

$ guile
GNU Guile 2.0.0
Copyright (C) 1995-2011 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> "Hello, World!"
$1 = "Hello, World!"

Sweetness! Let's start going through the manual. (Time passes.) You get to the point of compiling a short program that links to Guile:

gcc -o simple-guile simple-guile.c \
  `guile-config compile` `guile-config link`

And so far so good! But alack:

$ ./simple-guile
./simple-guile: error while loading shared libraries:
    libguile-2.0.so.22: cannot open shared object

Pants! What is the deal?

righteous indignation

The deal is, you appear to be on a Fedora or some other Red Hat-derived system. These systems add /usr/local/bin to the PATH, so the guile-config call succeeds. /usr/local/lib is in the link-time path too, so it finds libguile-2.0.so. But /usr/local/lib is not in the runtime library lookup path by default. Hence the error above.

This, my friends, is a bug. It is a bug in Fedora and similar systems. This recipe works fine on Debian. It works fine with the GNU toolchain, as configured out-of-the-box.

The only reason I can think that Fedora would break this is because of their lib versus lib64 split. It would be strange to say, "32-bit libraries go in /usr/lib, but maybe or maybe not in /usr/local/lib." And in fact the FHS explicitly disclaims what's happening in /usr/local.

But the fact is, /usr/local is the default location for people to compile, and the $PATH and other settings indicate that Fedora folk know this. I can only conjecture that the situation is this way in order to preserve compatibility with legacy 32-bit closed-source binary apps that cannot be recompiled to politely install their libraries elsewhere.

This decision costs me time and energy in the form of bug reports. Thanks but no thanks, whoever made this decision.

solutions?

It's true, this is a specific case of a more general issue. If you installed instead to /opt/guile, you would have to adjust your PATH to be able to run guile-config, or your PKG_CONFIG_PATH if you wanted to use the underlying pkg-config file instead. So you do that:

export PKG_CONFIG_PATH=/opt/guile/lib/pkgconfig
gcc -o simple-guile simple-guile.c `pkg-config --cflags --libs guile-2.0`

Awesome. You run it:

$ ./simple-guile

Hey awesome it works! Actually that's not true. Or rather, it's only true if you're on some other system, like a Mac. If you're on GNU, you get, again:

./simple-guile: error while loading shared libraries:
   libguile-2.0.so.22: cannot open shared object

Willikers! The same badness! And in this case, we can't even blame Fedora, as no one would think to add /opt/guile/lib to the runtime library path. (Well, actually, Apple did; it linked the binary to the link-time location of the library. That has other problems, though.)

The deal is, that we ourselves need to add /opt/guile/lib to the runtime library lookup path. But how we do that is compiler-specific; and in any case it's not necessary if we're installing to somewhere that's already in the system's runtime search path.

So---and this is the point I was coming to, besides ripping on Fedora's defaults---you need a build-time determination of what to add to the compilation line to set the rpath.

That, my friends, is the function of AC_LIB_HAVE_LINKFLAGS, from the excellent gnulib. It takes some link-time flags and spits out something that will also set the run-time library search path.

why haven't I seen this?

Amusingly I had not personally encountered this issue because all of my projects use libtool for linking, which does add -rpath as appropriate. Libtool-induced blissful ignorance: who'd have thought it possible?

Having said my message, I now entreat my more informed readers to leave corrections below in the comments. Thanks!

Syndicated 2011-03-18 23:52:21 from wingolog

356 older entries...