Older blog entries for nikodemus (starting at number 71)

30 Dec 2011 »

Holiday Hack: Bit Position

Logically speaking, POSITION with trivial :KEY and :TEST arguments should be much faster on bit-vectors than on simple vectors: the system should be able to pull one words worth of bits out of the vector at a single go, check if any are set (or unset), and if so locate the one we're interested in -- else going on to grab the next word.

Practically speaking, no-one who needed fast POSITION on bit-vectors seems to have cared enough to implement it, and so until yesterday (1.0.54.101) SBCL painstakingly pulled things one bit at a time from the vector, creating a lot of unnecessary memory traffic and branches.

How much of a difference does this make? I think the technical term is "quite a bit of a difference." See here for the benchmark results. First chart is from the new implementation, second from the new one. Other calls to POSITION are included for comparison: ones prefixed with generic- all go through the full generic POSITION, while the others know the type of the sequence at the call-site, and are able to sidestep a few things.

So, if you at some point considered using bit-vectors, but decided against them because POSITION wasn't up to snuff, now might be a good time to revisit that decision.

Gory details at the end of src/code/bit-bash.lisp, full story (including how the system dispatches to the specialized version) best read from git.

Also, if you're looking for an SBCL project for next year, consider the following:

Using a similar strategy for POSITION on base-strings: on a 64-bit system one memory read will net you 8 base-chars.
Using similar strategy for POSITION on all vectors with element-type width of half-word or less.
Improving the performance of the generic POSITION for other cases, using eg. specialized out-of-line versions.

Happy Hacking and New Year!

Syndicated 2011-12-30 09:35:55 from Nikodemus Siivola

5 Dec 2011 »

SBCL Threading News

SBCL 1.0.54 is barely out of the door, but I'm actually going to mention something that went in the repository today, and will be in the next release:

(TL;DR: Threads on Darwin are looking pretty solid right now. Go give them a shake and let me know what falls out.)

commit 8340bf74c31b29e9552ef8f705b6e1298547c6ab
Author: Nikodemus Siivola 
Date:   Fri Nov 18 22:37:22 2011 +0200

  semaphores in the runtime
    
    Trivial refactorings:
    
    * Rename STATE_SUSPENDED STATE_STOPPED for elegance. (Spells with
      the same number of letters as STATE_RUNNING, things line up
      nicer.)
    
    * Re-express make_fixnum in terms of MAKE_FIXNUM so that we can
      use the latter to define STATE_* names in a manner acceptable to
      use in switch-statements.
    
    * Move Mach exception handling initialization to darwin_init from
      create_initial_thread so that current_mach_task gets initialized
      before the first thread struct is initialized.
    
    The Beef:
    
      Replace condition variables in the runtime with semaphores.
    
      On most platforms use sem_t, but on Darwin use semaphore_t. Hide
      the difference behind, os_sem_t, os_sem_init, os_sem_destroy,
      os_sem_post, and os_sem_wait.
    
      POSIX realtime semaphores are supposedly safe to use in signal
      handlers, unlike condition variables -- and experimentally at
      least Mach semaphores on Darwin are a lot less prone to
      problems.
    
      (Our pthread mutex usage isn't quite kosher either, but it's the
      pthread_cond_wait and pthread_cond_broadcast pair that seemed to
      be causing most of the trouble.)

(There are some other neat things lurking in HEAD in addition to this, but I'll let you discover them for yourself.)

Syndicated 2011-12-05 18:47:29 from Nikodemus Siivola

1 Dec 2011 »

December Potpourri

Features of Common Lisp Abhishek Reddy used to have a page up on the topic, based on Robert Strandh's list. It's been down for a while now, so I rescued a copy from the Wayback Machine and put it up. So: Features of Common Lisp.

Reporting Bugs, Howto. I think it is actually a good thing that I need to say this, because it tends to be a sign of new people in the community, but if you've never read it, go now and read Simon Tatham's How to Report Bugs Effectively.

TL;DR. *sigh* Short version: provide directions to reproduce such that your idiot cousing could follow them while drunk. Don't be afraid of giving too much details. Don't speculate on causes.

Specific hints.

Use (lisp-implementation-version) to check the version of the Lisp you're actually running.
Use "uname -a" to get information about the OS and architecture you're running on.
When providing information, copy-paste as much as possible directly from the terminal or Emacs.

SBCL 1.0.54 due in a few days. This means we're in our monthly freeze, and testing is much appreciated. This month's release contains a lot of changes -- including plenty of threading work.

Microbench. A while ago I mentioned a microbenchmarking suite I'd been working on on-again, off-again. It's still not much to look at, and comes with zero documentation -- but curious souls can now get it from Github It should work on SBCL, CMUCL, CCL, and Lispworks. Clisp and ACL not tested yet, but a port should be fairly trivial.

What Microbench Is Good For, and Why You Should Not Trust Benchmarks at All. Look here. Pay special attention to double-sans-result+ and double-unsafe-sans-result+. When I published some of the results earlier, I was a bit surprised that they didn't perform much better then double+. Then I ran the same benchmarks on CCL and saw it running those two benchmarks 5 times faster!

With a bit of help from Paul Khuong the difference turned out to be SBCL's loading of floating-point constants, which is heavily optimized for inline-use. I have a pending patch that makes this smarter, whose effect you can at link see above.

The moral of "be sure what you're /really/ benchmarking" is an old one, but bears repeating. What makes microbenchmarks attractive to me, however -- despite their many shortcomings -- is that when something turns out slow (in comparison to another implementation, a previous version of SBCL, or another comparable benchmark operation) is tends to be easier to figure out the cause than with a macrobenchmark.

You probably also noticed that CCL seesm to do really badly at inline floating point arithmetic if my benchmarks are to be trusted. They're not. I'm 99% sure this is a case of the something specific in the way those benchmarks are implemented heavily pessimizing them for CCL.

Syndicated 2011-12-01 10:13:57 from Nikodemus Siivola

16 Nov 2011 »

SBCL Numeric Performance Today

I've been working on and off on a new microbenchmark tool -- primarily for SBCL, but usable for other implementations as well. Last night I finally got around to teaching it how to generate pretty pictures, using Google Visualization API, and wrote a number of microbenchmarks that show the variety of numeric performance in SBCL.

Here it is: Performance of addition variants in SBCL 1.0.53.31 The higher the bar, the better the performance -- the numbers are million iterations of the benchmark per second of run-time.

Each benchmark does a roughly comparable task: adds two numbers. What varies is what the types of the numbers are, and how much the compiler knows about the situation. (In some benchmarks there may be an extra comparison or so per iteration to keep the compiler from getting and flushing out the code as effectless.) There are basically four classes of performance:

Superb: Modular inline integer arithmetic. This is performance essentially identical with what you'd expect from C or ASM.
Good: Compiler knows the types, the argument types are inline-arithmetic-friendly, the result type is not in doubt (addition of two fixnums can be a bignum), and the function doing the addition is inlined at the site where the results are unboxed and the result is used.
Decent: Compiler knows the types, the types are inline-arithmetic-friendly and have an immediate representation, but the function doing the addition is out of line.
Bad Generic arithmetic on anything else but fixnums small enough for the result to be a fixnum is just not that great.

What should be of interest to anyone optimizing floating point performance is that type-checking doesn't really cost anything measurable most of the time. All of those benchmarks do full type typechecks except for double-unsafe-sans-result+, and the gain over the safe variant is minuscule.

What matters is that you generate inline arithmetic so that your floating points don't get boxed. On x86-64 SBCL has immediate single-floats, so occastional boxing isn't quite as disastrous (compare single+ and double+), but getting rid of the boxed representations completely is a huge win -- just compare single+ to complex-double-inline+.

Postscript: I know not everyone reading this will be clear on unboxed, boxed, immediate, non-immediate, etc. My apologies. I will try to remedy the situation and write about the different representations and how and why they matter at a later date.

Post-Postscript: I will be publishing the benchmark tool once it settles down, and once I have a chance to test-drive it with something besides SBCL. Could be a while, though. If you urgently need it, get in tough and we'll arrange something.

Syndicated 2011-11-16 11:04:46 from Nikodemus Siivola

13 Nov 2011 »

What Does Extensible CAS Mean?

Yesterday I said SBCL now had extensible CAS. Was sind paranormale Tonbandstimmen, what is CAS, and why should you care if it's extensible? Turn off the music and I'll tell you.

CAS is short for compare-and-swap. Compare-and-swap is a fairly common atomic operation. It compares the current value of a memory location with another value, and if they are the same it replaces the value of that memory location with a specified new value.

Depending on the language and the exact design of the interface, it might just return true for success, or it might return the old value. In SBCL it does the latter, which is sometimes very convenient, but also means you need to compare the return value to the expected-old-value you specified to know if CAS succeeded.

Because it is atomic, if you have two threads doing CAS on the same memory location in parallel, only one can succeed:

(let* ((x (list nil))
       (a (join-thread (make-thread (lambda () (cas (car x) nil :a)))))
       (b (join-thread (make-thread (lambda () (cas (car x) nil :b))))))
  ;; Because CAS is atomic, we know that exactly one of the threads
  ;; will succeed -- but we can't know which beforehand.
  (cond ((not a)
         ;; A returned NIL, therefore it replaced the CAS with :A
	 ;; and therefore B must return :A and do nothing.
         (assert (and (eq :a (car x)) (eq :a b))))
        ((not b)
         ;; ...and vice versa.
         (assert (and (eq :b (car x)) (eq :b a))))
        (t
         (error "WTF? Broken CAS?"))))

If you have the least bit of threads on your mind, you can imagine how this can be quite useful.

Out of the box current bleeding edge SBCL supports CAS on a number places: car, cdr, first, rest, svref, slot-value, standard-instance-access, funcallable-standard-instance-access, symbol-value, symbol-plist, and defstruct-defined slot accessors with slot types fixnum and t. (Note: slot-value is not currently supported by CAS if slot-value-using-class or friends are involved -- that's still in the works.)

With the exception of slot-value all of those pretty much come down to a single LOCK:CMPXCGH instruction on Intel architectures.

...but what it you have a data structure -- say a queue of some sort -- and want to implement cas-queue-head which does CAS on the first element of the queue. Fine. You can do that without any CAS support from the implementation by using eg. a lock.

...but what if you want to write a macro that operates on a CAS-able place?

(defmacro my-atomic-incf (place &optional (delta 1))
  "Spins till it succeeds in atomically incrementing PLACE by
DELTA. PLACE must be CASable."
  ;; OOPS! We're multiply-evaluating PLACE.
  (with-gensyms (old new n-delta))
    `(let* ((,old ,place)
            (,n-delta ,delta)
            (,new (+ ,old ,n-delta)))
      (loop until (eq ,old
                      (setf ,old (cas ,place ,old ,new)))
            do (setf ,new (+ ,old ,n-delta)))
      ,new))

Now imagine some hapless user doing:

(loop with i = 0
      while (

Where instead of I increasing by 1 each time through the loop and
iterating across the whole vector, it could increase I by
who-knows-how-many on a single attempt skipping entries and even
running out of bounds. Ouch.

Turns out that to write a macro that operates on a CASable place
you need something analogous to get-setf-expansion, except
for CAS instead of SETF. As of yesterday, SBCL has
sb-ext:get-cas-expansion that you can use to write a macro
like my-atomic-incf correctly and safely.

(defmacro my-atomic-incf (place &optional (delta 1) &environment env)
  (multiple-value-bind (vars vals old new cas-form read-form)
      (sb-ext:get-cas-expansion place env)
    (with-gensyms (n-delta)
      `(let* (,@(mapcar 'list vars vals)
               (,old ,read-form)
               (,n-delta ,delta)
               (,new (+ ,old ,n-delta)))
          (loop until (eq ,old (setf ,old ,cas-form))
                do (setf ,new (+ ,old ,n-delta)))
          ,new))))

What's more, we've now have the notion of a generalized CASable
place, just like Common Lisp has the notion of a generalized SETFable
place.

This means that the person writing cas-queue-head can
use defcas, define-cas-expander, or even just:

(defun (sb-ext:cas queue-head) (old new queue)
  (cas-queue-head queue old new))

to make their CASable place a first-class citizens on equal footing
with the baked-in ones -- so that

(my-atomic-incf (queue-head queue))

will Just Work. (Assuming your cas-queue-head works, of
course.)

I think that's pretty nifty. I'm still looking at adding support
for (cas slot-value-using-class), which will be even niftier.
Who says there's no innovation in open source? (Maybe I'm feeling a
bit hubristic right now. I'll come down soon enough when the first
bug-reports hit the fan.)

Feel free to turn Laurie Anderson back on now.

Syndicated 2011-11-13 09:52:28 from Nikodemus Siivola

12 Nov 2011 »

Extensible CAS and Timeouts All Over The Place

More IndieGoGo funded goodies have landed in the SBCL repository:

COMPARE-AND-SWAP is now user-extensible, and also supports SLOT-VALUE STANDARD-INSTANCE-ACCESS, and FUNCALLABLE-STANDARD-INSTANCE-ACCESS out of the box.
All blocking functions in the threading API and the SB-CONCURRENCY contrib support timeouts and deadlines.

More to come...

Syndicated 2011-11-12 13:02:19 from Nikodemus Siivola

10 Nov 2011 »

SBCL Threading Work, First Merge

I just merged the first part of the IndieGoGo funded SBCL threading work. This entailed:

GRAB-MUTEX now supports timeouts on all platforms.
CONDITION-WAIT now supports timeouts.
Killing lutexes, replacing them with userspace synchronization. This main affects non-Linux platforms. Performance implications are a mixed bag, depending on what you're doing. Some things perform an order of magnitude better, some things about the same, some things somewhat worse. The stability improvements on Darwin are well worth the costs, though and I will try to address the cases where performance suffers in due course.

There's more to come---stay tuned. ...and please report any regressions to SBCL bug-tracker on Launchpad or on the mailing lists.

Syndicated 2011-11-09 22:09:23 from Nikodemus Siivola

3 Nov 2011 »

SB-TEXINFO

Docstring processors are thirteen to a dozen, and those derived from SBCL's docstrings.lisp are a legion of their own.

I've been guilty of copying SBCL's docstrings.lisp to various projects myself, hand tweaking it to suit their needs... till finally I found myself with a half a dozen divergent copies. Eugh.

So, SB-TEXINFO was born: it should contain tweaks from all my copies, merged back on top of SBCL's upstream copy.

I plan to merge it back to SBCL as a contrib once I'm happy with it, but the schedule is still open on that.

Syndicated 2011-11-03 18:13:18 from Nikodemus Siivola

1 Sep 2011 »

Crowdfunding Success, Work In Progress

My SBCL crowdfunding campaign at IndieGoGo is over, with $16,507 raised and all goals met! (There were some folks who wanted to donate either after the deadline or via a different channel who aren't included in that sum, so the actual number is closer to $17k before IndieGoGo's, PayPal's, and taxman's cuts. I expect my final share to be around $12-13k after dust settles.)

Thank You, Everyone. I can't say this often enough. Your support has been overwhelming, and is greatly appreciated.

Sorting out all the perks is probably going to take most of today and tomorrow, then it's back to work on the project.

To summarize, here are the things that are going to happen, in the approximate order I expect to commit them. (This is a superset of the list at IndieGogo, because it includes some intermediate steps which are likely to get their own commits.)

SB-THREAD:MUTEX outside futex-platforms to be built on top of userspace spinlocks. CONDITION-WAIT is also getting a userspace implementation when futexes are not available. This will fix interrupt-safety issues with locks on non-futex platforms such as Darwin and Solaris. (Done, merge pending cleanup.)
CONDITION-WAIT, WAIT-ON-SEMAPHORE, and JOIN-THREAD in SB-THREAD, and RECEIVE-MESSAGE and RECEIVE-PENDING-MESSAGES in SB-CONCURRENCY to support timeouts and deadlines on all platforms. (Work in progress.)
SB-THREAD:MUTEX to be renamed SB-THREAD:LOCK with similar renamings in the functions and macros. This is mainly motivated by the need for incompatible API-changes to lock acquisition and release. Old names will be supported for a fair while, being deprecated very gradually. Feature :SB-THREAD-LOCK can be used to detect the presence or absence of the new API.(Work in progress.)
SB-THREAD::SPINLOCK to be replaced with the new LOCK object. Old names will be supported for a while, being deprecated gradually. (Done.)
Locks outside futex-platforms are to be mostly fair by default. (Mostly done, bughunt in progress.)
Read/write locks to be added to SB-THREAD or SB-CONCURRENCY. (Not started yet.)
Sempahore notification objects to be added to SB-THREAD or SB-CONCURRENCY. (Not started yet.)
Allegro-style gates to be added to SB-THREAD or SB-CONCURRENCY. (Not started yet.)
CAS support will be extended to at least STANDARD-INSTANCE-ACCESS, and made user-extensible. (Work in progresss.)
Other threading API change/enhancement, to be specified later.
A new portability library called Madeira will be implemented. I'll blog more on it once I'm starting to work on it in earnest. (Not started yet.)
SBCL's big compiler lock will be removed, so that parallel compilation is possible. Again, I'll blog more on this when it's time. (Not started yet.)

I will update this post as work progresses.

Syndicated 2011-09-01 11:03:05 from Nikodemus Siivola

27 Aug 2011 »

IndieGoGo & The Big Compiler Lock

My SBCL crowdfunding at IndieGoGo has continued to be successful beyond my wildest expectations: it has reached $12k funding so far, which means that Madeira has been fully funded in addition to the earlier SBCL specific goals.

This lead me to a bit of a conundrum. Should I add another goal and try for more funding while there's time left in the campaign (can't adjust the duration after the fact), or leave more than good enough alone? At the end, since I lit upon a good goal that fits well with the previous ones, I opted to add it and see if I could reach all the way to the moon...

If funding reaches $16k, that will be enough to allow us to kiss the big compiler lock goodbye.

What is the big compiler lock? SBCL grabs a lock when it compiles anything. The same lock is also used for some non-compiler activities like modifying the class graph. This lock prevents parallel compilation. Even if you don't care about parallel compilation as such, there are other unfortunate consequences: CLOS dispatch (before caches settle down) can involve compiling code. Most of the time BCL is a tiny annoyance, but sometimes it can be a real headache.

If $16k is not reached, any monies over $12k will be split between general SBCL work and work leading to curtailing the effects of the big compiler lock. (Eg. making CLOS dispatch optionally work without the compiler.)

Finally, a big Thank You to all contributors.

Syndicated 2011-08-27 10:45:24 from Nikodemus Siivola

62 older entries...