Older blog entries for wingo (starting at number 387)

an in-depth look at the performance of guile's web server

What ho, ladies! And what ho, gentlemen! The hack is on and apace. Today, the topic is performance: of Guile and of its web server, in microseconds and kiloinstructions. Brew up a pot of tea; this is a long article.

the problem

I have been poking at Guile's web server recently. To recap, Guile is an implementation of Scheme. It is byte-compiled, and has a set of runtime libraries written in C and, increasingly, in Scheme itself.

Guile includes some modules for dealing with the web: HTTP, clients, servers, and such. It's all written in Scheme. It runs this blog you're reading right now.

To be precise, this web log is served by tekuti, an application written on top of Guile's web server; and actually, in this case there is Apache in front of everything right now, so you're not talking directly to Guile. Perhaps that will change at some point.

But anyway, a few months ago, this blog was really slow to access. That turned out to be mostly due to a bug in tekuti, the blog application, in which generating the "related links" for a post would always end up invoking git. (The database for the blog is implemented in git.) Spawning another process was slow. Fellow hacker and tekuti user Aleix Conchillo Flaqué fixed the problem a year ago, but it took a while for it me to finally review and merge it.

So then, at that point, things were tolerable, but I had already contracted the performance bug, so I went on to spend a couple months drilling down, optimizing Guile and its web server -- the layers below tekuti.

10K requests per second: achievement unlocked!

After a lot of tweaking, to the compiler, runtime, HTTP libraries, and to the VM, Guile can now serve over 10000 requests per second on a simple "Hello world" benchmark. This is out of the box, so to speak, if the master branch in git were a box. You just check out Guile, build it using the normal autoreconf -vif & & ./configure && make dance, and then run the example uninstalled:

$ meta/guile examples/web/hello.scm

In another terminal, you can connect directly to the port to see what it does. Paste the first paragraph of this and press return, and you should see the second part:

$ nc localhost 8080
GET / HTTP/1.0
Host: localhost:8080
User-Agent: ApacheBench/2.3
Accept: */*

HTTP/1.0 200 OK
Content-Length: 13
Content-Type: text/plain;charset=utf-8

Hello, World!

I should say a little more about what the server does, and what the test application is. It reads the request, and parses all the headers to native Scheme data types. This is strictly more work than is needed for a simple "pong" benchmark, but it's really useful for applications to have all of the headers parsed for you already. It also filters out bad requests.

Guile then passes the request and request body to a handler, which returns the response and body. This example's handler is very simple:

(define (handler request body)
  (values '((content-type . (text/plain)))
          "Hello, World!"))

It uses a shorthand, that instead of building a proper response object, it just returns an association-list of headers along with the body as a string, and relies on the web server's sanitize-response to produce a response object and encode the body as a bytevector. Again, this is more work for the server, but it's a nice convenience.

Finally, the server writes the response and body to the client, and either closes the port, as in this case for HTTP/1.0 with no keep-alive, or moves the client back to the poll set if we have a persistent connection. The reads and writes are synchronous (blocking), and the web server runs in one thread. I'll discuss this a bit more later.

You can then use ApacheBench to test it out:

$ ab -n 100000 -c100 http://localhost:8080/
Server Software:        
Server Hostname:        localhost
Server Port:            8080

Document Path:          /
Document Length:        13 bytes

Concurrency Level:      100
Time taken for tests:   9.631 seconds
Complete requests:      100000
Failed requests:        0
Write errors:           0
Total transferred:      9200000 bytes
HTML transferred:       1300000 bytes
Requests per second:    10383.03 [#/sec] (mean)
Time per request:       9.631 [ms] (mean)
Time per request:       0.096 [ms] (mean, across all concurrent requests)
Transfer rate:          932.85 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       3
Processing:     1   10   1.7      9      20
Waiting:        1   10   1.7      9      20
Total:          3   10   1.7      9      20

Percentage of the requests served within a certain time (ms)
  50%      9
  66%      9
  75%     12
  80%     12
  90%     12
  95%     12
  98%     13
  99%     13
 100%     20 (longest request)

I give you the whole printout because I find it interesting. There's no huge GC pause anywhere. This laptop has an Intel i7-2620M 2.70GHz CPU. It was bought last year, so it's recent but not fancy. There are two cores, and so part of one core is being used by ApacheBench, and the whole other one used by Guile.

Of course, this is a flawed benchmark. If you really care about this sort of thing, Mark Nottingham wrote a nice article on HTTP benchmarking last year that shows all the ways in which this test is wrong.

However, flawed though it is, this test serves my purposes: to understand the overhead that Guile and its web server imposes on a web application.

So in that light, I'd like to take apart this test and try to understand its performance. I'll look at it from three directions: bottom-up (using low-level profiling), top-down (using a Scheme profiler), and transverse (profiling the garbage collector).

bottom-up: clock cycles, instructions retired

The best way to measure the performance of an application is with a statistical profiler. Statistical profiling samples what's really happening, without perturbing the performance characteristics of your application.

On GNU/Linux systems, we have a few options. None are really easy to use, however. Perf comes the closest. You just run perf record -g guile examples/web/hello.scm, and it records its information. Then you run perf report and dive into the details through the mostly pleasant curses application.

However, for some reason perf could not capture the call graphs, in my tests. My machine is x86-64, which does not include frame pointers in its call frames, so perhaps it was a naive stack-walking algorithm. The associated DWARF information does include the necessary stack-walking data.

Anyway, at least you can get a good handle on what individual functions are hot, and indeed what source lines and instructions take the most time. These lowest-level statistical profilers typically sample based on number of clock cycles, so they correspond to real time very well.

So much for measurement. For actually understanding and improving performance, I find valgrind's callgrind tool much more useful. You run it like this:

valgrind --tool=callgrind --num-callers=20 /path/to/guile examples/web/http.scm

Valgrind can record the execution of your program as it runs, tracking every call and every instruction that is executed. You can then explore the resulting log with the kcachegrind graphical tool. It's the best thing there is for exploring low-level execution of your program.

Now, valgrind is not a statistical profiler. It slows down and distorts the execution of your program, as you would imagine. Beyond those caveats, though, you have to keep two things in mind: firstly, that a count of instructions executed does not correspond directly to clock cycles spent. There are memory latencies, cache misses, branch mispredictions, and a whole host of randomnesses that can affect your runtime. Secondly, valgrind only records the user-space behavior of your program. I'll have more to say on that later.

If I run the "hello.scm" benchmark under valgrind, and hit it with a hundred thousand connections, I can get a pretty good idea what the aggregate behavior of my program is. But I can do better than big-picture, for this kind of test. Given that this test does the same thing a hundred thousand times, I can use valgrind's accurate call and instruction counts to give me precise information on how much a single request takes.

So, looking at the total instruction and call counts, and dividing by the number of requests (100K), I see that when Guile handles one request, the algorithmic breakdown is as follows:

instructions per request percentage of total procedure calls per request
250K 100 (total) -
134K 52.5 bytecode interpretation -
47K 18.5 port buffer allocation 1
18K 7.2 display 1
7.6K 3 allocation within VM (closures, pairs, etc) 83
7.2K 2.84 read-line 5
3.6K 1.40 substring-downcase 3
3.2K 1.24 string->symbol 4
3.1K 1.23 accept 1
2.7K 1.06 substring 11
2.6K 1.03 string-index 8
2.5K 1.00 hashq-ref 14
23K 9 other primitives < 1% each -

Guile's ports can be buffered, like C's stdio FILE* streams, and the web server does turn on buffering. The 18.5% of the time spent allocating port buffers seems like a ripe place for optimization, but digging into it, almost all of the time within those routines is spent in the garbage collector, and most of that marking the heap. Switching to a generational collector could help here, but I'm not sure how much, given that port buffers are probably 4 KB each for input and output, and thus might not fit into a young generation. Marking from more threads at once could help -- more on that in some future essay.

There are some primitives that can be optimized as well, but with the VM taking up 52% of the runtime, and 23% for allocation and the garbage collector, Amdahl's law is against us: making the primitives twice as fast would result only a 15% improvement in throughput.

Turning Amdahl's argument around, we can predict the effect of native compilation on throughput. If Guile 2.2 comes out with native compilation, as it might, and that makes Scheme code run 5 times faster (say), then the 50% of the instructions that are currently in the VM might drop to 10% -- leading to an expected throughput improvement of 67%.

top-down: scheme-level profiling with statprof

But what is going on in the VM? For that, I need a Scheme profiler. Fortunately, Guile comes with one, accessible at the REPL:

$ guile --no-debug
> ,profile (load "examples/web/hello.scm")
%     cumulative   self             
time   seconds     seconds      name
 15.42      1.55      1.55  close-port
 11.61      1.17      1.17  %after-gc-thunk
  6.20      1.65      0.62  setvbuf
  4.29      0.43      0.43  display
  3.34      0.35      0.34  accept
  2.86      0.35      0.29  call-with-error-handling
  2.38      0.24      0.24  hashq-ref
  2.23      9.16      0.22  with-default-trap-handler
  2.07      0.67      0.21  build-response
  1.75      1.17      0.18  sanitize-response
...
Sample count: 629
Total time: 10.055736686 seconds (1.123272351 seconds in GC)

(I use the --no-debug argument to avoid some per-VM-instruction overhead imposed by running Guile interactively; see VM Hooks for more.)

Here I get a really strange result. How is close-port taking all the time? It's implemented as a primitive, not in Scheme, and Valgrind only thought it took 0.40% of the instructions. How is that?

To answer this, we need to remember a couple things. First of all, we recall that Guile's statistical profiler uses setitimer to get signals delivered periodically, after some amount of time spent on the program's behalf, including time spent in the kernel. Valgrind doesn't account for system time. So here we are seeing that close-port is indeed taking time, specifically to flush out the buffered writes. The time is really being spent in the write system call.

So this is good! We know now that we should perhaps look at tuning the kernel to buffer our writes better.

We can also use the profiling data counts to break down the time spent in serving one request from a high level. For example, http-read handles traversing the poll set and accepting connections, and tail-calls read-request to actually read the request. Looking at the cumulative times in the chart tells me that out of each request, the time spent breaks down like this:

microseconds per request procedure
100 (total)
29 poll and accept
16 reading request
0 request handler
12 sanitize-response
8 writing response headers
0 writing response body
15 close-port
20 other

Of course, since this is statistical, there is some uncertainty about the whole thing. Still, it seems sensible enough.

transverse: who is doing all the allocating?

Often when you go to optimize a Scheme program, you find that it's spending a fair amount of time in garbage collection. How do you optimize that? The rookie answer to this is to try to patch the collector to allocate faster, or less frequently, or something. Veterans know that the solution to GC woes is usually just to allocate less. But how do you know what is allocating? GC is a fairly transverse cost, in that it can charge the good parts of your program for the expenses of the bad.

Guile uses the Boehm-Demers-Weiser conservative collector. For what it is, it's pretty good. However its stock configuration does not provide very much insight into the allocation patterns of your program. One approximation that can be made, though, is that the parts of the program that cause garbage collection to run are the parts that are allocating the most.

Based on this insight, Guile includes a statistical profiler that samples when the garbage collector runs. One thing to consider is that GC probably doesn't run very often, so for gcprof tests, one might need to run the test for longer. In this example, I increased the load to a million requests.

$ guile --no-debug
> (use-modules (statprof))
> (gcprof (lambda () (load "examples/web/hello.scm")))
%     cumulative   self             
time   seconds     seconds      name
 86.55     82.75     82.75  setvbuf
  9.37      8.96      8.96  accept
  1.70      1.63      1.63  call-with-error-handling
  1.31      1.25      1.25  %read-line
  0.29      0.28      0.28  substring
  0.24      0.23      0.23  string-downcase
  0.10     91.80      0.09  http-read
  0.10      0.09      0.09  parse-param-list
  0.10      0.09      0.09  write-response
---
Sample count: 2059
Total time: 95.611532937 seconds (9.371730448 seconds in GC)

Here we confirm the result that we saw in the low-level profile: that the setvbuf Scheme procedure, which can cause Guile to allocate buffers for ports, is the primary allocator in this test.

Another interesting question to answer is, "how much are we allocating, anyway?" Using the statistics REPL command, I can see that the 100K requests entailed a total allocation of about 1.35 GB, which divides out to 13.5 KB per request. That sounds reasonable: about 4 KB each for the read and write buffers, and some 4 KB of various other allocation: strings, pairs, the final bytevector for output, etc.

The test incurred 220 stop-the-world garbage collections. So, about 1 out of 450 (0.2%) of requests trigger a GC. The average GC time seems to be about 5 ms (1100 ms / 220 times). That squares fairly well with our last-percentile ApacheBench results.

The total heap size is modest: 14 megabytes. It does not leak memory, thankfully. If I run mem_usage.py on it, I get:

Mapped memory:
               Shared            Private
           Clean    Dirty    Clean    Dirty
    r-xp    1236        0     1028        0  -- Code
    rw-p      24        0       28      160  -- Data
    r--p      60        0     1580      112  -- Read-only data
    ---p       0        0        0        0
    r--s      24        0        0        0
   total    1344        0     2636      272
Anonymous memory:
               Shared            Private
           Clean    Dirty    Clean    Dirty
    r-xp       4        0        0        0
    rw-p       0        0        0    11652  -- Data (malloc, mmap)
    ---p       0        0        0        0
   total       4        0        0    11652
   ----------------------------------------
   total    1348        0     2636    11924

Native ahead-of-time compilation would allow for more shareable, read-only memory. Still, as it is, this seems acceptable.

a quick look at more dynamic tests

That's all well and good, you say, but it's a fairly static test, right?

For that I have a couple of data points. One is the simple SXML debugging test in examples/test/debug-sxml.scm, which simply spits back the headers that it receives, wrapped in an HTML table. The values are printed as the Scheme object that they parse to. Currently, I have a version of this script running on my site. (You can see the headers that Apache adds on, there.)

Testing it as before tells me that Guile can serve 6000 of these pages per second, on my laptop. That's pretty respectable for an entirely dynamic page that hasn't been optimized at all. You can search around the net for comparable tests in other languages; I think you'll find Guile's performance to be very good.

The other point to mention is Tekuti itself. Tekuti includes a caching layer in the application, so after the first request, it's not really a dynamic test. It does check to make sure its caches are fresh on every round, though, by checking the value of the refs/heads/master ref. But still, it is a test of pushing a lot of data; for my test, the page is 50 KB, and Guile still reaches 5700 requests per second on this one core, serving 280 megabytes per second of... well, of my blather, really. But it's a powerful blather-pipe!

related & future work

Here's a similarly flawed test from a year ago of static servers, serving small static files over localhost. Flawed, but it's a lot like this "hello world" test in semantics. We see that nginx gets up to about 20K requests per second, per core. It also does so with a flat memory profile, which is nice.

Guile's bundled web server is currently single-threaded and blocking, which does not make it a good frontend server. There's room for a project to build a proper web server on top of Guile, I think, but I probably won't do it myself. In the meantime though, I do want to offer the possibility for the built-in web server to be multithreaded, with some number of I/O threads and some more limited number of compute threads. I've been testing out some code in that direction -- in fact, this server is running that code -- but as yet Guile's synchronization primitives have too much overhead for it to be a real win. There's more runtime and compiler work to do here.

As far as web servers written in safe languages, it would be remiss to not mention Warp, a Haskell web server. Again, their tests effectively utilize multiple cores, but it seems that 20K per core is the standard.

Unlike Haskell, Guile lacks a proper event manager. I'm not sure whether to work on such a thing; Havoc seems to think it's necessary, and who am I to argue.

Finally, I mention a benchmark of python WSGI servers from a couple years ago. (Is March the month of benchmarks?) The python performance is notably worse; hopefully PyPy has improved things in the meantime. On the other hand, GEvent's use of greenlets is really nice, and makes me envious.

conclusions

Hey, you read the thing! Congratulations to you! Good thing you didn't just skip down to the end :)

If I have a message to send, it's this: that you should consider using Guile to be perfectly acceptable for implementing dynamic web applications with high performance requirements.

It's a modest point, I know. There are all kinds of trade-offs here, but hey, Guile is plucky and still a little bit shy, but would love it if you to ask it to the hack.

If it works for you, boast about it to your colleagues! And if it doesn't, let us know, over at the usual places (guile-user@gnu.org, and #guile on freenode). Happy hacking!

Syndicated 2012-03-08 22:44:59 from wingolog

for love and $

Friends, I'm speaking at JSConf.us this year! Yee haw!

I would have mentioned this later, but events push me to say something now.

You see, I wrote the web server that runs this thing, together with the blog software. I've been hacking on it recently, too; more on that soon. But it seems that this is the first time I've noticed a link from a site that starts with a number. The URI parsers for the referer link were bombing out, because I left off a trailing $ on a regular expression.

So, for love and $, JSConf ho! We ride at dawn!

Syndicated 2012-02-21 16:48:19 from wingolog

palindromically delimited carnival

Our time aligns on strange axes, sometimes. Last palindrome day, 11/11/11, found me walking the streets of Gothenburg with Werner Koch, the GPG maintainer. Werner said that in Germany, the carnival season opens on the 11th of November, at 11:11:11 in the morning. Today, 21022012, closes the carnival week here in Catalunya.

I was in Gothenburg for FSCONS. It so happens that the videos for the talk I gave there, Guile: Free Software Means of Production, just came out last week. So, as another point along that carnival axis, I offer in <video> form:

Alternately you can download the video directly (~112MB, 50 minutes). There are notes too, a superset of the slides from the talk.

As I said back then, this one was aimed at folks that didn't necessarily know very much about Guile. It was also different from other talks in that it emphasized Guile as a general programming environment, not as an extension language. Guile is both things, and as the general-purpose side gets a lot less publicity, I wanted to emphasize it in this talk.

In the last 20 minutes or so, we did a live-hack. Inspired by a tweet by mattmight, we built Bitter, a one-bit Twitter. I tried to convey what it's like to hack in Guile, with some success I think. Source code for the live-hack, such as it is, is linked to at the end of the page.

Syndicated 2012-02-21 12:07:58 from wingolog

unexpected concurrency

OK kids, quiz time. Spot the bugs in this Python class:

import os

class FD:
    _all_fds = set()

    def __init__(self, fd):
        self.fd = fd
        self._all_fds.add(fd)

    def close(self):
        if (self.fd):
            os.close(self.fd)
            self._all_fds.remove(self.fd)
            self.fd = None

    @classmethod
    def for_each_fd(self, proc):
        for fd in self._all_fds:
            proc(fd)

    def __del__(self):
        self.close()

The intention is pretty clear: you have a limited resource (file descriptors, in this case). You would like to make sure they get closed, no matter what happens in your program, so you wrap them in objects known to the garbage collector, and attach finalizers that close the descriptors. You have a for_each_fd procedure that should at least allow you to close all file descriptors, for example when your program is about to exec another program.

So, bugs?

* * *

Let's start with one: FD._all_fds can't sensibly be accessed from multiple threads at the same time. The file descriptors in the set are logically owned by particular pieces of code, and those pieces of code could be closing them while you're trying to for_each_fd on them.

Well, OK. Let's restrict the problem, then. Let's say there's only one thread. Next bug?

* * *

Another bug is that signals cause arbitrary code to run, at arbitrary points in your program. For example, if in the close method, you get a SIGINT after the os.close but before removing the file descriptor from the set, causing an exception to be thrown, you will be left with a closed descriptor in the set. If you swap the operations, you leak an fd. Neither situation is good.

The root cause of the problem here is that asynchronous signals introduce concurrency. Signal handlers are run in another logical thread of execution in your program -- even if they happen to share the same stack (as they do in CPython).

OK, let's mask out signals then. (This is starting to get ugly). What next?

* * *

What happens if, during for_each_fd, one of the FD objects becomes unreachable?

The Python language does not guarantee anything about when finalizers (__del__ methods) get called. (Indeed, it doesn't guarantee that they get called at all.) The CPython implementation will immediately finalize objects whose refcount equals zero. Running a finalizer on the method will mutate FD._all_fds, while it is being traversed, in this case.

The implications of this particular bug are either that CPython will throw an exception when it sees that the set was modified while iterating over it, or that the finalizer happens to close the fd being processed. Neither one of these cases are very nice, either.

This is the bug I wanted to get to with this article. Like asynchronous signals, finalizers introduce concurrency: even in languages with primitive threading models like Python.

Incidentally, this behavior of running finalizers from the main thread was an early bug in common Java implementations, 15 years ago. All JVM implementors have since fixed this, in the same way: running finalizers within a dedicated thread. This avoids the opportunity for deadlock, or for seeing inconsistent state. Guile will probably do this in 2.2.

For a more thorough discussion of this problem, Hans Boehm has published a number of papers on this topic. The 2002 Destructors, Finalizers, and Synchronization paper is a good one.

Syndicated 2012-02-16 22:12:33 from wingolog

eval, that spectral hound

Friends, I am not a free man. Eval has been my companion of late, a hellhound on my hack-trail. I give you two instances.

the howl of the-environment, across the ages

As legend has it, in the olden days, Aubrey Jaffer, the duke of SCM, introduced low-level FEXPR-like macros into his Scheme implementation. These allowed users to capture the lexical environment:

(define the-environment
  (procedure->syntax
   (lambda (exp env)
     env)))

Tom Lord inherited this cursed bequest from Jaffer, when he established himself in the nearby earldom of Guile. It so affected him that he added local-eval to Guile, allowing the user to evaluate an expression within a captured local environment:

(define env (let ((x 10)) (the-environment)))
(local-eval 'x env)
=> 10
(local-eval '(set! x 42) env)
(local-eval 'x env)
=> 42

Since then, the tenants of the earldom of Guile have been haunted by this strange leakage of the state of the interpreter into the semantics of Guile.

When the Guile co-maintainer title devolved upon me, I had a plan to vanquish the hound: to compile Guile into fast bytecode. There would be no inefficient association-lists of bindings at run-time. Indeed, there would be no "environment object" to capture. I succeeded, and with Guile 2.0, local-eval, procedure->syntax and the-environment were no more.

But no. As Guile releases started to make it into distributions, and users started to update their code, there arose such a howling on the mailing lists as set my hair on end. The ghost of local-eval was calling: it would not be laid to rest.

I resisted fate, for as long as I could do so in good conscience. In the end, Guile hacker Mark Weaver led an expedition to the mailing list moor, and came back with a plan.

Mark's plan was to have the syntax expander recognize the-environment, and residualize a form that would capture the identities of all lexical bindings. Like this:

(let ((x 10)) (the-environment))
=>
(let ((x 10))
  (make-lexical-environment
   ;; Procedure to wrap captured environment around
   ;; an expression
   wrapper
   ;; Captured variables: only "x" in this case
   (list (capture x))))

I'm taking it a little slow because hey, this is some tricky macrology. Let's look at (capture x) first. How do you capture a variable? In Scheme, with a closure. Like this:

;; Capture a variable with a closure.
;;
(define-syntax-rule (capture var)
  (case-lambda
    ;; When called with no arguments, return the value
    ;; of VAR.
    (() var)
    ;; When called with one argument, set the VAR to the
    ;; new value.
    ((new-val) (set! var new-val))))

The trickier part is reinstating the environment, so that x in a local-eval'd expression results in the invocation of a closure. Identifier syntax to the rescue:

;; The wrapper from above: a procedure that wraps
;; an expression in a lexical environment containing x.
;;
(lambda (exp)
  #`(lambda (x*) ; x* is a fresh temporary var
      (let-syntax ((x (identifier-syntax
                        (_ (x*))
                        ((set! _ val) (x* val)))))
        #,exp)))

By now it's clear what local-eval does: it wraps an expression, using the wrapper procedure from the environment object, evaluates that expression, then calls the resulting procedure with the case-lambda closures that captured the lexical variable.

So it's a bit intricate and nasty in some places, but hey, it finally tames the ghostly hound with modern Scheme. We were able to build local-eval on top of Guile's procedural macros, once a couple of accessors were added to our expander to return the set of bound identifiers visible in an expression, and to query whether those bindings were regular lexicals, or macros, or pattern variables, or whatever.

"watson, your service revolver, please."

As that Guile discussion was winding down, I started to hear the howls from an unexpected quarter: JavaScript. You might have heard, perhaps, that JavaScript eval is crazy. Well, it is. But ES5 strict was meant to kill off its most egregious aspect, in which eval can introduce new local variables to a function.

Now I've been slowly hacking on implementing block-scoped let and const in JavaScriptCore, so that we can consider switching gnome-shell over to use JSC. Beyond standard ES5 supported in JSC, existing gnome-shell code uses let, const, destructuring binding, and modules, all of which are bound to be standardized in the upcoming ES6. So, off to the hack.

My initial approach was to produce a correct implementation, and then make it fast. But the JSC maintainers, inspired by the idea that "let is the new var", wanted to ensure that let was fast from the beginning, so that it doesn't get a bad name with developers. OK, fair enough!

Beyond that, though, it looks like TC39 folk are eager to get let and const into all parts of JavaScript, not just strict mode. Do you hear the hound? It rides again! Now we have to figure out how block scope interacts with non-strict eval. Awooooo!

Thankfully, there seems to be a developing consensus that eval("let x = 20") will not introduce a new block-scoped lexical. So, down boy. The hound is at bay, for now.

life with dogs

I'm making my peace with eval. Certainly in JavaScript it's quite a burden for an implementor, but the current ES6 drafts don't look like they're making the problem worse. And in Scheme, I'm very happy to provide the primitives needed so that local-eval can be implemented in terms of our existing machinery, without needing symbol tables at runtime. But if you are making a new language, as you value your life, don't go walking on the local-eval moors at night!

Syndicated 2012-02-01 15:33:49 from wingolog

javascript eval considered crazy

Peoples. I was hacking recently on JavaScriptCore, and I came to a realization: JavaScript's eval is absolutely crazy.

I mean, I thought I knew this before. But... words fail me, so I'll have to show a few examples.

eval and introduced bindings

This probably isn't worth mentioning, as you probably know it, but eval can introduce lexical bindings:

 > var foo = 10;
 > (function (){ eval('var foo=20;'); return foo; })()
 20
 > foo
 10

I find this to be pretty insane already, but I knew about it. You would think though that var x = 10; and eval('var x = 10;'); would be the same, though, but they're not:

 > (function (){ var x = 10; return delete x; })()
 false
 > (function (){ eval('var x = 10;'); return delete x; })()
 true

eval-introduced bindings do not have the DontDelete property set on them, according to the post-hoc language semantics, so unlike proper lexical variables, they may be deleted.

when is eval really eval?

Imagine you are trying to analyze some JavaScript code. If you see eval(...), is it really eval?

Not necessarily.

eval pretends to be a regular, mutable binding, so it can be rebound:

 > eval = print
 > eval('foo')
 foo // printed

or, shadowed lexically:

 > function () { var eval = print; eval('foo'); }
 foo // printed

or, shadowed dynamically:

 > with({'eval': print}) { eval('foo'); }
 foo // printed

You would think that if you can somehow freeze the eval binding on the global object, and verify that there are no with forms, and no statements of the form var eval = ..., that you could guarantee that eval is eval, but that is not the case:

 > Object.freeze(this);
 > (function (x){ return [eval(x), eval(x)]; })('var eval = print; 10')
 var eval = print; 10 // printed, only once!
 10,

(Here the first eval created a local binding for eval, and evaluated to 10. The second eval was actually a print.)

Crazy!!!!

an eval by any other name

So eval is an identifier that can be bound to another value. OK. One would expect to be able to bind another identifier to eval, then. Does that work? It seems to work:

 > var foo = eval;
 > foo('foo') === eval;
 true

But not really:

 > (function (){ var quux = 10; return foo('quux'); } )()
 Exception: ReferenceError: Can't find variable: quux

eval by any other name isn't eval. (More specifically, eval by any other name doesn't have access to lexical scope.)

Note, however, the mere presence of a shadowed declaration of eval doesn't mean that eval isn't eval:

 > var foo = 10
 > (function(x){ var eval = x; var foo = 20; return [x('foo'), eval('foo')] })(eval)
 10,20

Crazy!!!!

strict mode restrictions

ECMAScript 5 introduces "strict mode", which prevents eval from being rebound:

 > (function(){ "use strict"; var eval = print; })
 Exception: SyntaxError: Cannot declare a variable named 'eval' in strict mode.
 > (function(){ "use strict"; eval = print; })
 Exception: SyntaxError: 'eval' cannot be modified in strict mode
 > (function(){ "use strict"; eval('eval = print;'); })()
 Exception: SyntaxError: 'eval' cannot be modified in strict mode
 > (function(x){"use strict"; x.eval = print; return eval('eval');})(this)
 Exception: TypeError: Attempted to assign to readonly property.

But, since strict mode is embedded in "classic mode", it's perfectly fine to mutate eval from outside strict mode, and strict mode has to follow suit:

 > eval = print;
 > (function(){"use strict"; return eval('eval');})()
 eval // printed

The same is true of non-strict lexical bindings for eval:

 > (function(){ var eval = print; (function(){"use strict"; return eval('eval');})();})();
 eval // printed
 > with({'eval':print}) { (function(){ "use strict"; return eval('eval');})() }
 eval // printed

An engine still has to check at run-time that eval is really eval. This crazy behavior will be with us as long as classic mode, which is to say, forever.

Strict-mode eval does have the one saving grace that it cannot introduce lexical bindings, so implementors do get a break there, but it's a poor consolation prize.

in summary

What can an engine do when it sees eval?

Not much. It can't even prove that it is actually eval unless eval is not bound lexically, there is no with, there is no intervening non-strict call to any identifier eval (regardless of whether it is eval or not), and the global object's eval property is bound to the blessed eval function, and is configured as DontDelete and ReadOnly (not the default in web browsers).

But the very fact that an engine sees a call to an identifier eval poisons optimization: because eval can introduce variables, the scope of free variables is no longer lexically apparent, in many cases.

I'll say it again: crazy!!!

Syndicated 2012-01-12 16:34:08 from wingolog

webkittens! lexical scoping is in danger!

The GTK+ WebKittens are on the loose here in Coruña. There's folks here from Red Hat, Motorola, Collabora, and of course Igalia. It's good times; beyond the obvious platitudes of "um, the web is important and stuff" it's good to be in a group that is creating the web experience of millions of users.

My part in that is very small, adding support for block-scoped let and const to JavaScriptCore.

I've made some progress, but it could be going more smoothly. I have made the parser do the right thing for const, correctly raising errors for duplicate bindings, including nested var declarations that get hoisted. The parser is fine: it maintains an environment like you would expect. But the AST assumes that all locals get hoisted to function scope, so there's no provision for e.g. two distinct local variables with the same name. So there is still some work to do on the AST, and it's a thicket of templates.

Hopefully I'll end up with a prototype by the end of the hackfest (Sunday). Sooner if I find that sword of omens, which I seem to have misplaced. Sight beyond sight!

Syndicated 2011-12-02 17:36:56 from wingolog

fscons 2011: free software, free society

Good morning, hackersphere! Time and space are moving, in the egocentric coordinate system at least, but before their trace is gone, I would like to say: FSCONS 2011 was fantastic!

FSCONS is a conference unlike any other I know. I mean, where else can you go from a talk about feminism in free software, to talk about the state of the OpenRISC chip design project, passing through a hallway track conversation on the impact of cryptocurrency on the welfare state, approached from an anarchist perspective?

Like many of you, I make software because I like to hack. But I make Free Software in particular because I value all kinds of freedom, as part of the "more beautiful world our hearts know is possible". We make the material conditions of tomorrow's social relations, and I want a world of sharing and mutual aid.

But when we reflect on what our hands are making, we tend do so in a context of how, not why. That's why I enjoyed FSCONS so much, that it created a space for joining the means of production to their ends: a cons of Free Software, Free Society.

As a GNU hacker, I'm especially honored by the appreciation that FSCONS particpants have for GNU. FSCONS has a tithe, in which a portion of the entry fees is donated to some project, and this year GNU was chosen as the recipient. It's especially humbling, given the other excellent projects that were nominated for the tithe.

So thank you very much, FSCONS organizers and participants. I had a great time!

are you bitter about it?

I gave a talk there at FSCONS, GNU Guile: Free Software Means of Production (slides, notes).

Unlike many of my other talks, this one was aimed at folks that didn't necessarily know very much about Guile. It was also different from other talks in that it emphasized Guile as a general programming environment, not as an extension language. Guile is both things, and as the general-purpose side gets a lot less publicity, I wanted to emphasize it in this talk. Hopefully the videos will be up soon.

In the last 20 minutes or so, we did a live-hack. Inspired by a tweet by mattmight, we built Bitter, a one-bit Twitter. I tried to convey what it's like to hack in Guile, with some success I think. Source code for the live-hack, such as it is, is linked to at the end of the page.

For a slightly more extended example of a web application, check out Peeple, originally presented in a talk at FOSDEM, back in February. Peeple has the advantage of being presented as a development of separate git commits. Slides of that talk, Dynamic Hacking with Guile, are also available, though they are not as developed as the ones from FSCONS.

Finally, for the real documentation, see the Guile manual.

Happy hacking, and hopefully see you at FSCONS next year!

Syndicated 2011-11-28 11:38:36 from wingolog

JavaScriptCore, the Webkit JS implementation

My readers will know that I have recently had the pleasure of looking into the V8 JavaScript implementation, from Google. I'm part of a small group in Igalia doing compiler work, and it's clear that in addition to being lots of fun, JavaScript implementations are an important part of the compiler market today.

But V8 is not the only JS implementation in town. Besides Mozilla's SpiderMonkey, which you probably know, there is another major Free Software JS implementation that you might not have even heard of, at least not by its proper name: JavaScriptCore.

jsc: js for webkit

JavaScriptCore (JSC) is the JavaScript implementation of the WebKit project.

In the beginning, JavaScriptCore was a simple tree-based interpreter, as Mozilla's SpiderMonkey was. But then in June of 2008, a few intrepid hackers at Apple wrote a compiler and bytecode interpreter for JSC, threw away the tree-based interpreter, and called the thing SquirrelFish. This was eventually marketed as "Nitro" inside Apple's products[0].

JSC's bytecode interpreter was great, and is still pretty interesting. I'll go into some more details later in this article, because its structure conditions the rest of the implementation.

But let me continue here with this historical sketch by noting that later in 2008, the WebKit folks added inline caches, a regular expression JIT, and a simple method JIT, and then called the thing SquirrelFish Extreme. Marketers called this Nitro Extreme. (Still, the proper name of the engine is JavaScriptCore; Wikipedia currently gets this one wrong.)

One thing to note here is that the JSC folks were doing great, well-factored work. It was so good that SpiderMonkey hackers at Mozilla adopted JSC's regexp JIT compiler and their native-code assembler directly.

As far as I can tell, for JSC, 2009 and 2010 were spent in "consolidation". By that I mean that JSC had a JIT and a bytecode interpreter, and they wanted to maintain them both, and so there was a lot of refactoring and tweaking to make them interoperate. This phase consolidated the SFX gains on x86 architectures, while also adding implementations for ARM and other architectures.

But with the release of V8's Crankshaft in late 2010, the JS performance bar had been lowered again (assuming it is a limbo), and so JSC folks started working on what they call their "DFG JIT" (DFG for "data-flow graph"), which aims be more like Crankshaft, basically.

It's possible to configure a JSC with all three engines: the interpreter, the simple method JIT, and the DFG JIT. In that case there is tiered compilation between the three forms: initial parsing and compilation produces bytecode, that can be optimized with the method JIT, that can be optimized by the DFG JIT. In practice, though, on most platforms the interpreter is not included, so that all code runs through the method JIT. As far as I can tell, the DFG JIT is shipping in Mac OS X Lion's Safari browser, but it is not currently enabled on any platform other than 64-bit Mac. (I am working on getting that fixed.)

a register vm

The interpreter has a number of interesting pieces, but it is important mostly for defining the format of bytecode. Bytecode is effectively the high-level intermediate representation (IR) of JSC.

To put that into perspective, in V8, the high-level intermediate representation is the JS source code itself. When V8 first sees a piece of code, it pre-parses it to raise early syntax errors. Later when it needs to analyze the source code, either for the full-codegen compiler or for Hydrogen, it re-parses it to an AST, and then works on the AST.

In contrast, in JSC, when code is first seen, it is fully parsed to an AST and then that AST is compiled to bytecode. After producing the bytecode, the source text isn't needed any more, and so it is forgotten. The interpreter interprets the bytecode directly. The simple method JIT compiles the bytecode directly. The DFG JIT has to re-parse the bytecode into an SSA-style IR before optimizing and producing native code, which is a bit more expensive but worth it for hot code.

As you can see, bytecode is the common language spoken by all of JSC's engines, so it's important to understand it.

Before really getting into things, I should make an aside about terminology here. Traditionally, at least in my limited experience, a virtual machine was considered to be a piece of software that interprets sequences of virtual instructions. This would be in contrast to a "real" machine, that interprets sequences of "machine" or "native" instructions in hardware.

But these days things are more complicated. A common statement a few years ago would be, "is JavaScript interpreted or compiled?" This question is nonsensical, because "interpreted" or "compiled" are properties of implementations, not languages. Furthermore the implementation can compile to bytecode, but then interpret that bytecode, as JSC used to do.

And in the end, if you compile all the bytecode that you see, where is the "virtual machine"? V8 hackers still call themselves "virtual machine engineers", even as there is no interpreter in the V8 sources (not counting the ARM simulator; and what of a program run under qemu?).

All in all though, it is still fair to say that JavaScriptCore's high-level intermediate language is a sequence of virtual instructions for an abstract register machine, of which the interpreter and the simple method JIT are implementations.

When I say "register machine", I mean that in contrast to a "stack machine". The difference is that in a register machine, all temporary values have names, and are stored in slots in the stack frame, whereas in a stack machine, temporary results are typically pushed on the stack, and most instructions take their operands by popping values off the stack.

(Incidentally, V8's full-codegen compiler operates on the AST in a stack-machine-like way. Accurately modelling the state of the stack when switching from full-codegen to Crankshaft has been a source of many bugs in V8.)

Let me say that for an interpreter, I am totally convinced that register machines are the way to go. I say this as a Guile co-maintainer, which has a stack VM. Here are some reasons.

First, stack machines penalize named temporaries. For example, consider the following code:

(lambda (x)
  (* (+ x 2)
     (+ x 2)))

We could do common-subexpression elimination to optimize this:

(lambda (x)
  (let ((y (+ x 2)))
    (* y y)))

But in a stack machine is this really a win? Consider the sequence of instructions in the first case:

; stack machine, unoptimized
0: local-ref 0      ; x
1: make-int8 2
2: add
3: local-ref 0      ; x
4: make-int8 2
5: add
6: mul
7: return

Contrast this to the instructions for the second case:

; stack machine, optimized
0: local-ref 0      ; push x
1: make-int8 2      ; push 2
2: add              ; pop x and 2, add, and push sum
3: local-set 1      ; pop and set y
4: local-ref 1      ; push y
5: local-ref 1      ; push y
6: mul              ; pop y and y, multiply, and push product
7: return           ; pop and return

In this case we really didn't gain anything, because the storing values to locals and loading them back to the stack take up separate instructions, and in general the time spent in a procedure is linear in the number of instructions executed in the procedure.

In a register machine, on the other hand, things are easier, and CSE is definitely a win:

0: add 1 0 0           ; add x to x and store in y
1: mul 2 1 1           ; multiply y and y and store in z
2: return 2            ; return z

In a register machine, there is no penalty to naming a value. Using a register machine reduces the push/pop noise around the instructions that do the real work.

Also, because they include the names (or rather, locations) of their operands within the instruction, register machines also take fewer instructions to do the job. This reduces dispatch cost.

In addition, with a register VM, you know the size of a call frame before going into it, so you can avoid overflow checks when pushing values in the function. (Some stack machines also have this property, like the JVM.)

But the big advantage of targeting a register machine is that you can take advantage of traditional compiler optimizations like CSE and register allocation. In this particular example, we have used three virtual registers, but in reality we only need one. The resulting code is also closer to what real machines expect, and so is easier to JIT.

On the down side, instructions for a register machine typically occupy more memory than instructions for a stack machine. This is particularly the case for JSC, in which the opcode and each of the operands takes up an entire machine word. This was done to implement "direct threading", in which the opcodes are not indexes into jump tables, but actually are the addresses of the corresponding labels. This might be an acceptable strategy for an implementation of JS that doesn't serialize bytecode out to disk, but for anything else the relocations are likely to make it a lose. In fact I'm not sure that it's a win for JSC even, and perhaps the bloat was enough of a lose that the interpreter was turned off by default.

Stack frames for the interpreter consist of a six-word frame, the arguments to the procedure, and then the locals. Calling a procedure reserves space for a stack frame and then pushes the arguments on the stack -- or rather, sets them to the last n + 6 registers in the stack frame -- then slides up the frame pointer. For some reason in JSC the stack is called the "register file", and the frame pointer is the "register window". Go figure; I suppose the names are as inscrutable as the "activation records" of the stack world.

jit: a jit, a method jit

I mention all these details about the interpreter and the stack (I mean, the register file), because they apply directly to the method JIT. The simple method JIT (which has no name) does the exact same things that the bytecode interpreter does, but it does them via emitted machine instructions instead of interpreting virtual instructions.

There's not much to say here; jitting the code has the result you would expect, reducing dispatching overhead, while at the same time allowing some context-specific compilation, like when you add a constant integer to a variable. This JIT is really quick-and-dirty though, so you don't get a lot of the visibility benefits traditionally associated with method JITs like what HotSpot's C1 or C2 currently have. Granted, the register VM bytecode does allow for some important optimizations to happen, but JSC currently doesn't do very much in the way of optimizing bytecode, as far as I can tell.

Thinking more on the subject, I suspect that for Javascript, CSE isn't even possible unless you know the types, as a valueOf() callback could have side effects.

an interlude of snarky footnotes

Hello, reader! This is a long article, and it's a bit dense. I had some snarky footnotes that I enjoyed writing, but it felt wrong to put them at the end, so I thought it better to liven things up in the middle here. The article continues in the next section.

0. In case you didn't know, compilers are approximately 37% composed of marketing, and rebranding is one of the few things you can do to a compiler, marketing-wise, hence the name train: SquirrelFish, Nitro, SFX, Nitro Extreme...[1] As in, in response to "I heard that Nitro is slow.", one hears, "Oh, man they totally fixed that in SquirrelFish Extreme. It's blazingly fast![2]"

1. I don't mean to pick on JSC folks here. V8 definitely has this too, with their "big reveals". Firefox people continue to do this for some reason (SpiderMonkey, TraceMonkey, JaegerMonkey, IonMonkey). I expect that even they have forgotten the reason at this point. In fact the JSC marketing has lately been more notable in its absence, resulting in a dearth of useful communication. At this point, in response to "Oh, man they totally are doing a great job in JavaScriptCore", you're most likely to hear, "JavaScriptCore? Never heard of it. Kids these days hack the darndest things."

2. This is the other implement in the marketer's toolbox: "blazingly fast". It means, "I know that you don't understand anything I'm saying, but I would like for you to repeat this phrase to your colleagues please." As in, "LLVM does advanced TBAA on the SSA IR, allowing CSE and LICM while propagating copies to enable SIMD loop vectorization. It is blazingly fast."

dfg: a new crankshaft for jsc?

JavaScriptCore's data flow graph (DFG) JIT is work by Gavin Barraclough and Filip Pizlo to enable speculative optimizations for JSC. For example, if you see the following code in JS:

a[i++] = 0.7*x;

then a is probably an array of floating-point numbers, and i is probably an integer. To get great performance, you want to use native array and integer operations, so you speculatively compile a version of your code that makes these assumptions. If the assumptions don't work out, then you bail out and try again with the normal method JIT.

The fact that the interpreter and simple method JIT have a clear semantic model in the form of bytecode execution makes it easy to bail out: you just reconstruct the state of the virtual registers and register window, then jump back into the code. (V8 calls this process "deoptimization"; the DFG calls it "speculation failure".)

You can go the other way as well, switching from the simple JIT to the optimized DFG JIT, using on-stack replacement. The DFG JIT does do OSR. I hear that it's needed if you want to win Kraken, which puts you in lots of tight loops that you need to be able to optimize without relying on being able to switch to optimized code only on function re-entry.

When the DFG JIT is enabled, the interpreter (if present) and the simple method JIT are augmented with profiling information, to record what types flow through the various parts of the code. If a loop is executed a lot (currently more than 1000 times), or a function is called a lot (currently about 70 times), the DFG JIT kicks in. It parses the bytecode of a function into an SSA-like representation, doing inlining and collecting type feedback along the way. This probably sounds very familiar to my readers.

The difference between JSC and Crankshaft here is that Crankshaft parses out type feedback from the inline caches directly, instead of relying on in-code instrumentation. I think Crankshaft's approach is a bit more elegant, but it is prone to lossage when GC blows the caches away, and in any case either way gets the job done.

I mentioned inlining before, but I want to make sure that you noticed it: the DFG JIT does do inlining, and does so at parse-time, like HotSpot does. The type profiling (they call it "value profiling") combined with some cheap static analysis also allows the DFG to unbox int32 and double-precision values.

One thing that the DFG JIT doesn't do, currently, is much in the way of code motion. It does do some dead-code elimination and common-subexpression elimination, and as I mentioned before, you need the DFG's value profiles in order to be able to do this correctly. But it does not, as far as I can tell, do much in the way of code motion, like loop-invariant code motion.

Also, the DFG's register allocator is not as good as Crankshaft's, yet. It is hampered in this regard by the JSC assembler that I praised earlier; while indeed a well-factored, robust piece of code, JSC's assembler has a two-address interface instead of a three-address interface. This means that instead of having methods like add(dest, op1, op2), it has methods like add(op1, op2), where the operation implicitly stores its result in its first operand. Though it does correspond to the x86 instruction set, this sort of interface is not great for systems where you have more registers (like on x86-64), and forces the compiler to shuffle registers around a lot.

The counter-based optimization triggers do require some code to run that isn't strictly necessary for the computation of the results, but this stratey does have the nice property that the DFG performance is fairly predictable, and measurable. Crankshaft, on the other hand, being triggered by a statistical profiler, has statistically variable performance.

And speaking of performance, AWFY on the mac is really where it's at for JSC right now. Since the DFG is only enabled by default on recent Mac OS 64-bit builds, you need to be sure you're benchmarking the right thing.

Looking at the results, I think we can say that JSC's performance on the V8 benchmark is really good. Also it's interesting to see JSC beat V8 on SunSpider. Of course, there are lots of quibbles to be had as to the validity of the various benchmarks, and it's also clear that V8 is the fastest right now once it has time to warm up. But I think we can say that JSC is doing good work right now, and getting better over time.

future

So that's JavaScriptCore. The team -- three people, really -- is mostly focusing on getting the DFG JIT working well right now, and I suspect they'll be on that for a few months. But we should at least get to the point where the DFG JIT is working and enabled by default on free systems within a week or two.

The one other thing that's in the works for JSC is a new generational garbage collector. This is progressing, but slowly. There are stubs in the code for card-marking write barriers, but currently there is no such GC implementation, as far as I can tell. I suspect that someone has a patch they're cooking in private; we'll see. At least JSC does have a Handle API, unlike SpiderMonkey.

conclusion

So, yes, in summary, JavaScriptCore is a fine JS implementation. Besides being a correct implementation for real-world JS -- something that is already saying quite a lot -- it also has good startup speed, is fairly robust, and is working on getting an optimizing compiler. There's work to do on it, as with all JS implementations, but it's doing fine.

Thanks for reading, if you got this far, though I understand if you skipped some parts in the middle. Comments and corrections are most welcome, from those of you that actually read all the way through, of course :). Happy hacking!

Syndicated 2011-10-28 15:51:24 from wingolog

the user in the loop

The videos from this year's GNU Hackers Meeting in Paris are up. All the videos are linked from that page. There were technical problems with a couple of them, and we didn't get BT Templeton's presentation on Emacs Lisp in Guile (using delimited dynamic bindings to implement buffer-local variables!), but all in all it's a good crop.

I gave a talk entitled "The User in the Loop" that argued for the place of extensibility in the GNU project. I also pointed out some ways in which Guile fulfills those needs, but that was not the central point to the talk. I argued that point at more length in a previous article.

Anyway, let's give that newfangled HTML5 video tag a go here. It starts with the cut-off statement, "We all know that code motion is very important to efficiency, so..."

Alternately you can download the video directly (~350MB, 52 minutes). There are notes too, a superset of the slides from the talk.

Syndicated 2011-10-19 14:32:34 from wingolog

378 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!