Older blog entries for nikodemus (starting at number 77)

Is That A Rest-List In Your Pocket?

Prelude

SBCL has for a while now been able to elide &REST list allocation when it is only used as an argument to APPLY, so

(defun foo (&rest args) (apply #'bar args))

is non-consing if BAR is. Note: I'm not saying eliding heap allocation, I'm saying eliding list allocation completely: instead of moving the arguments into a stack- or heap-allocated list and then pulling them out later, the compiler generates code that passes them directly on to BAR.

If all you look at is heap consing, this is no better than stack allocation, but it does save a noticeable amount of work at runtime, and it doesn't break tail calls the way stack allocation does.

That was as far as it went, however: if you did anything else with the rest list, the compiler gave up and allocated the full list -- on the stack if you asked for that using DYNAMIC-EXTENT.

First Act

Earlier this week Nathan Froyd (an SBCL hacker, and he of Ironclad fame) committed a change that fixed a rather embarrassing oversight: we were heap-allocating the rest lists in full calls to the vararg entry points of arithmetic functions like + and *.

This is less catastrophic for most code than you might imagine, since SBCL works pretty hard to call more efficient entry points -- so those calls are virtually never seen in performance sensitive code.

Doesn't make it any less embarrassing, though. Years and years it's been like that, until Nathan noticed.

Nathan fixed the heap consing by adding DYNAMIC-EXTENT declarations to the functions involved, which not only reduced GC pressure a bit, but also provided a small performance boost.

Second Act

Adding those DYNAMIC-EXTENT declarations had another side effect as well: a couple of backtrace tests broke due to unexpected frames, the stack allocation having foiled tail calls. Several tests used division by zero to trigger an error, so the arithmetic changes showed up there.

That would have been a fair tradeoff, and the backtrace tests could just have been adjusted to allow the extra frames, but we could do a bit better.

SBCL has an extra (internal, unsupported) lambda-list keyword: SB-INT:&MORE, which is a fair deal hairier to use than &REST, but allows dealing with variable arguments without any consing -- heap or stack. So those arithmetic functions got changed to use SB-INT:&MORE instead, which fixed the backtrace tests and gave another small performance boost.

Third Act

I was looking at the SB-INT:&MORE changes, and wondering if we should expose it to users as well, since it obviously is useful occasionally -- and what kind of interface cleanup that would entail.

Thinking about that I realized that I could instead just extend the compiler smarts for dealing with &REST. Under the hood, when SBCL optimizes an APPLY with a rest list as the final argument, it actually converts it to use &MORE.

So, I extended that part of the compiler to deal with (for starters) a few list functions that would be sufficient for implementing the arithmetic functions using rest lists and compiler magic.

The conversion path looks roughly like this:

;;; Original source, using LENGTH as an example
(defun foo (&rest args)
  (length args))

;;; Compiler adds hidden &MORE arguments when it sees the &REST.
(lambda (&rest args &more #:context #:count)
  (length args))

;;; Source-level transformation notices LENGTH is applied to a &REST
;;; argument and transforms it into %REST-LENGTH.
(lambda (&rest args &more #:context #:count)
  (sb-c::%rest-length args #:context #:count))

;;; During optimization another transformation sees the %REST-LENGTH,
;;; and verifies that the rest list is never modified, or used in
;;; any place that would require actually allocating it -- this being
;;; the case, it proceeds.
(lambda (&rest args &more #:context #:count)
  #:count)

;;; Since the rest list isn't used anymore, it is deleted.
(lambda (&more #:context #:count)
  #:count)

That's it, approximately. Currently this can be done for: ELT, NTH, CAR, FIRST, LENGTH, LIST-LENGTH, and VALUES-LIST -- and additionally using a rest-list as the test-form in an IF is equally efficient and doesn't force its allocation.

LENGTH, ELT, and NTH on rest-lists deserve a special mention: they're all O(1) when this optimization has been applied.

Unfortunately we don't yet have any compiler notes about this, so if you intend to take advantage of this optimization, you're best off verifying the results from the disassembly.
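To make this concrete, here's the sort of code that should now compile without allocating the rest list at all (an illustrative sketch -- as noted, check the disassembly to be sure the optimization fired):

```lisp
;; LENGTH on the rest list compiles to a read of the hidden
;; argument count -- O(1), no list allocated.
(defun count-args (&rest args)
  (length args))

;; NTH compiles to an indexed read of the hidden argument
;; context -- also O(1), no list allocated.
(defun nth-arg (n &rest args)
  (nth n args))
```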

Coda

With that in place, I rewrote the vararg arithmetic functions using &REST. Amusingly they now look rather charmingly naive: the way someone who doesn't understand the cost of list traversal would write things:

(defun - (number &rest more-numbers)
  (if more-numbers
      (let ((result number))
        (dotimes (i (length more-numbers) result)
          (setf result (- result (nth i more-numbers)))))
      (- number)))

...but using bleeding edge SBCL, this compiles into rather nice code.

Finally, some pretty pictures. These are benchmark results for calling the vararg #'+ with 2, 4, or 8 arguments. F means fixnum, S a single float, and D a double float. The numbers are benchmark iterations per second, so bigger is better. Topmost chart is for the current version using SBCL's newly found rest-smarts, middle chart is for the version using DYNAMIC-EXTENT, and bottom one is for the version before all this madness started.

Benchmarks, linear scale.

Benchmarks, logarithmic scale.

If you look at the vararg+[ff], vararg+[ffff], and vararg+[ffffffff] benchmarks, you can see how the &REST list allocation and access costs almost dominate them: even with stack allocation, going from 8 to 2 arguments barely doubles the speed; with the latest version each halving of the argument count doubles the speed for both the fixnum-only and the single-float-only benchmarks.

This was run on x86-64, so both single-floats and fixnums are immediate objects. Doubles, however, need heap allocation here -- so if you look at the double float numbers some of the allocation costs come from the numbers and intermediate results.

...but before you get too excited about these numbers, remember the reason why no-one noticed this for such a long time: in real-world performance sensitive code these entry points don't really matter that much.

Syndicated 2012-09-23 14:37:00 from Nikodemus Siivola

Neat TYPEP Trick

How do you test if an object is a cons that has the desired symbol in the car?

(typep x '(cons (eql :foo)))

Sure,

(and (consp x) (eq :foo (car x)))

is essentially just as short...
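Cons types compose, too, so deeper structure can be matched in one go -- for example, a two-element list whose head is :FOO:

```lisp
;; Matches exactly (:FOO <integer>): the CAR must be :FOO, the CDR a
;; one-element list holding an integer.
(typep '(:foo 42) '(cons (eql :foo) (cons integer null)))    ; => T
(typep '(:foo 42 43) '(cons (eql :foo) (cons integer null))) ; => NIL
```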

I still find cons types neat, even if they're a nightmare when it comes to type derivation, but that's a different matter. Some nightmares aren't all bad.

Syndicated 2012-05-15 17:19:49 from Nikodemus Siivola

Trolled

(set-macro-character #\{ (lambda (s c) (read-delimited-list #\} s t)) nil)
(set-macro-character #\} (get-macro-character #\)) nil)
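(For the spoiler-inclined: the first line makes { read everything up to the matching } as a list, and the second makes } behave like a closing parenthesis, so:)

```lisp
;; Curly braces now read as plain lists:
(read-from-string "{1 2 3}") ; primary value: (1 2 3)
```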

Syndicated 2012-05-15 05:36:46 from Nikodemus Siivola

MADEIRA-PORT

This isn't Madeira proper yet, but something small and useful on its own, I hope: MADEIRA-PORT.

The main feature is the :MADEIRA-PORT ASDF component class:

(defsystem :foo
  :defsystem-depends-on (:madeira-port)
  :serial t
  :components
  ((:file "package")
   (:module "ports"
    :components
    ((:madeira-port "sbcl" :when :sbcl)
     (:madeira-port "ccl" :when :ccl)
     (:madeira-port "ansi" :unless (:or :sbcl :ccl))))
   (:file "foo")))

The :WHEN and :UNLESS options support an extended feature syntax, which allows things such as:

(:find-package :swank)
(:find-function #:exit :sb-ext)

This extended feature syntax is also accessible by calling EXTEND-FEATURE-SYNTAX at the toplevel, after which the regular #+ and #- read macros will also understand it -- but unfortunately they will also lose any implementation-specific extensions in the process.

Happy Hacking!

Syndicated 2012-05-08 08:33:41 from Nikodemus Siivola

Please Don't Use SB-UNIX

SB-UNIX is an internal implementation package. If you use functionality provided by it, sooner or later your code will break when SBCL's internals change: it is subject to change without notice. When we stop using a function that used to live there, that function gets deleted. Things may also change names and interfaces.

CL-USER> (documentation (find-package :sb-unix) t)

"private: a wrapper layer for SBCL itself to use when talking
with an underlying Unix-y operating system.
This was a public package in CMU CL, but that was different.
CMU CL's UNIX package tried to provide a comprehensive,
stable Unix interface suitable for the end user.
This package only tries to implement what happens to be
needed by the current implementation of SBCL, and makes
no guarantees of interface stability."

Instead, use either SB-POSIX (which is the supported external API), or call the foreign functions directly. Alternatively, if you're using something from SB-UNIX that doesn't have a counterpart in SB-POSIX or elsewhere, put a feature request / bug report on Launchpad explaining what you need. Just saying "wanted: a supported equivalent of SB-UNIX:UNIX-FOO" is enough.
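For instance, instead of digging the process id out of SB-UNIX, use the supported package:

```lisp
;; Supported: SB-POSIX provides stable wrappers for POSIX calls.
(require :sb-posix)
(sb-posix:getpid) ; => the current process id, a positive integer
```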

(The same holds more or less for all internal packages, of course, but SB-UNIX is the most common offender.)

I realize this is an imperfect world, and sometimes using an unsupported API is the best thing you can do, but please try to avoid this especially in libraries used by other people as well.

Syndicated 2012-05-01 15:39:35 from Nikodemus Siivola

Updated Common Lisp FAQ

I updated my Common Lisp FAQ. If you spot any glaring errors or omissions, please let me know.

Syndicated 2012-04-06 14:03:32 from Nikodemus Siivola

Holiday Hack: Bit Position

Logically speaking, POSITION with trivial :KEY and :TEST arguments should be much faster on bit-vectors than on simple vectors: the system should be able to pull one word's worth of bits out of the vector in a single go, check if any are set (or unset), and if so locate the one we're interested in -- otherwise going on to grab the next word.

Practically speaking, no-one who needed fast POSITION on bit-vectors seems to have cared enough to implement it, and so until yesterday (1.0.54.101) SBCL painstakingly pulled things one bit at a time from the vector, creating a lot of unnecessary memory traffic and branches.
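The per-word locate step can be sketched in portable Common Lisp: isolate the lowest set bit of the word and take its index. (A sketch only -- the real implementation in src/code/bit-bash.lisp works on raw vector words, and handles the find-a-zero case as well.)

```lisp
;; Index of the lowest set bit in a non-zero word W:
;; (LOGAND W (- W)) isolates the lowest set bit, and INTEGER-LENGTH
;; gives its 1-based position, hence the 1-.
(defun lowest-set-bit-index (w)
  (1- (integer-length (logand w (- w)))))

;; (lowest-set-bit-index #b1000) => 3
```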

How much of a difference does this make? I think the technical term is "quite a bit of a difference." See here for the benchmark results. The first chart is from the new implementation, the second from the old one. Other calls to POSITION are included for comparison: the ones prefixed with generic- all go through the full generic POSITION, while the others know the type of the sequence at the call site, and are able to sidestep a few things.

So, if you at some point considered using bit-vectors, but decided against them because POSITION wasn't up to snuff, now might be a good time to revisit that decision.

Gory details at the end of src/code/bit-bash.lisp, full story (including how the system dispatches to the specialized version) best read from git.

Also, if you're looking for an SBCL project for next year, consider the following:

  • Using a similar strategy for POSITION on base-strings: on a 64-bit system one memory read will net you 8 base-chars.
  • Using a similar strategy for POSITION on all vectors with an element-type width of a half-word or less.
  • Improving the performance of the generic POSITION for other cases, using e.g. specialized out-of-line versions.

Happy Hacking and New Year!

Syndicated 2011-12-30 09:35:55 from Nikodemus Siivola

SBCL Threading News

SBCL 1.0.54 is barely out of the door, but I'm actually going to mention something that went in the repository today, and will be in the next release:

(TL;DR: Threads on Darwin are looking pretty solid right now. Go give them a shake and let me know what falls out.)

commit 8340bf74c31b29e9552ef8f705b6e1298547c6ab
Author: Nikodemus Siivola 
Date:   Fri Nov 18 22:37:22 2011 +0200

  semaphores in the runtime
    
    Trivial refactorings:
    
    * Rename STATE_SUSPENDED STATE_STOPPED for elegance. (Spells with
      the same number of letters as STATE_RUNNING, things line up
      nicer.)
    
    * Re-express make_fixnum in terms of MAKE_FIXNUM so that we can
      use the latter to define STATE_* names in a manner acceptable to
      use in switch-statements.
    
    * Move Mach exception handling initialization to darwin_init from
      create_initial_thread so that current_mach_task gets initialized
      before the first thread struct is initialized.
    
    The Beef:
    
      Replace condition variables in the runtime with semaphores.
    
      On most platforms use sem_t, but on Darwin use semaphore_t. Hide
      the difference behind, os_sem_t, os_sem_init, os_sem_destroy,
      os_sem_post, and os_sem_wait.
    
      POSIX realtime semaphores are supposedly safe to use in signal
      handlers, unlike condition variables -- and experimentally at
      least Mach semaphores on Darwin are a lot less prone to
      problems.
    
      (Our pthread mutex usage isn't quite kosher either, but it's the
      pthread_cond_wait and pthread_cond_broadcast pair that seemed to
      be causing most of the trouble.)

(There are some other neat things lurking in HEAD in addition to this, but I'll let you discover them for yourself.)

Syndicated 2011-12-05 18:47:29 from Nikodemus Siivola

December Potpourri

Features of Common Lisp. Abhishek Reddy used to have a page up on the topic, based on Robert Strandh's list. It's been down for a while now, so I rescued a copy from the Wayback Machine and put it up. So: Features of Common Lisp.

Reporting Bugs, Howto. I think it is actually a good thing that I need to say this, because it tends to be a sign of new people in the community, but if you've never read it, go now and read Simon Tatham's How to Report Bugs Effectively.

TL;DR. *sigh* Short version: provide directions to reproduce such that your idiot cousin could follow them while drunk. Don't be afraid of giving too much detail. Don't speculate on causes.

Specific hints.

  • Use (lisp-implementation-version) to check the version of the Lisp you're actually running.
  • Use "uname -a" to get information about the OS and architecture you're running on.
  • When providing information, copy-paste as much as possible directly from the terminal or Emacs.

SBCL 1.0.54 due in a few days. This means we're in our monthly freeze, and testing is much appreciated. This month's release contains a lot of changes -- including plenty of threading work.

Microbench. A while ago I mentioned a microbenchmarking suite I'd been working on on-again, off-again. It's still not much to look at, and comes with zero documentation -- but curious souls can now get it from Github. It should work on SBCL, CMUCL, CCL, and LispWorks. CLISP and ACL are not tested yet, but a port should be fairly trivial.

What Microbench Is Good For, and Why You Should Not Trust Benchmarks at All. Look here. Pay special attention to double-sans-result+ and double-unsafe-sans-result+. When I published some of the results earlier, I was a bit surprised that they didn't perform much better than double+. Then I ran the same benchmarks on CCL and saw it running those two benchmarks 5 times faster!

With a bit of help from Paul Khuong the difference turned out to be SBCL's loading of floating-point constants, which is heavily optimized for inline use. I have a pending patch that makes this smarter, whose effect you can see at the link above.

The moral of "be sure what you're /really/ benchmarking" is an old one, but bears repeating. What makes microbenchmarks attractive to me, however -- despite their many shortcomings -- is that when something turns out slow (in comparison to another implementation, a previous version of SBCL, or another comparable benchmark operation) it tends to be easier to figure out the cause than with a macrobenchmark.

You probably also noticed that CCL seems to do really badly at inline floating-point arithmetic, if my benchmarks are to be trusted. They're not. I'm 99% sure this is a case of something specific in the way those benchmarks are implemented heavily pessimizing them for CCL.

Syndicated 2011-12-01 10:13:57 from Nikodemus Siivola

SBCL Numeric Performance Today

I've been working on and off on a new microbenchmark tool -- primarily for SBCL, but usable for other implementations as well. Last night I finally got around to teaching it how to generate pretty pictures, using Google Visualization API, and wrote a number of microbenchmarks that show the variety of numeric performance in SBCL.

Here it is: Performance of addition variants in SBCL 1.0.53.31. The higher the bar, the better the performance -- the numbers are millions of iterations of the benchmark per second of run-time.

Each benchmark does a roughly comparable task: it adds two numbers. What varies is the types of the numbers, and how much the compiler knows about the situation. (In some benchmarks there may be an extra comparison or so per iteration to keep the compiler from flushing the code away as having no effect.) There are basically four classes of performance:

  • Superb: Modular inline integer arithmetic. This is performance essentially identical with what you'd expect from C or ASM.
  • Good: Compiler knows the types, the argument types are inline-arithmetic-friendly, the result type is not in doubt (addition of two fixnums can be a bignum), and the function doing the addition is inlined at the site where the results are unboxed and the result is used.
  • Decent: Compiler knows the types, the types are inline-arithmetic-friendly and have an immediate representation, but the function doing the addition is out of line.
  • Bad: Generic arithmetic on anything but fixnums small enough for the result to also be a fixnum is just not that great.
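To illustrate the gap between these classes, compare a declared, inlinable double-float addition with a fully generic one (a sketch with made-up names; the actual benchmark code differs):

```lisp
;; "Good" territory: types known, SPEED optimized, inlinable, so at
;; an inlined use site the addition can compile down to a single
;; unboxed floating-point add.
(declaim (inline dadd))
(defun dadd (a b)
  (declare (type double-float a b)
           (optimize (speed 3)))
  (+ a b))

;; "Bad" territory: generic arithmetic, argument types unknown until
;; run time, so every call dispatches on the types of A and B.
(defun gadd (a b)
  (+ a b))
```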

What should be of interest to anyone optimizing floating-point performance is that type checking doesn't really cost anything measurable most of the time. All of those benchmarks do full typechecks except for double-unsafe-sans-result+, and the gain over the safe variant is minuscule.

What matters is that you generate inline arithmetic so that your floating-point numbers don't get boxed. On x86-64 SBCL has immediate single-floats, so occasional boxing isn't quite as disastrous (compare single+ and double+), but getting rid of the boxed representations completely is a huge win -- just compare single+ to complex-double-inline+.

Postscript: I know not everyone reading this will be clear on unboxed, boxed, immediate, non-immediate, etc. My apologies. I will try to remedy the situation and write about the different representations and how and why they matter at a later date.

Post-Postscript: I will be publishing the benchmark tool once it settles down, and once I have a chance to test-drive it with something besides SBCL. Could be a while, though. If you urgently need it, get in touch and we'll arrange something.

Syndicated 2011-11-16 11:04:46 from Nikodemus Siivola
