Following Up On The End Of The World
Being the end of the world and all, I figure I should go into a bit more detail, especially as omnifarious went as far as commenting on this life-altering situation.
He's unfortunately correct about a shared-everything concurrency model being too hard for most people, mainly because the average programmer has a lizard's brain. There's not much I can do about that. For that aspect, we might have an operating system problem here, rather than a language problem. We can fake it in our Erlang and Newsqueak runtimes, but really, we can only pile so many schedulers on top of each other and still convince ourselves that we make sense. That theme comes back later in this post...
omnifarious's other complaint about threads is that they introduce latency, but I think he's got it backward. Communication introduces latency. Threads let the operating system reduce the overall latency by letting other threads run whenever possible, instead of everything being stuck behind one blocked operation. But if you want to avoid the latency of a specific request, then you have to avoid communication, not threads. Now, the thing with a shared-everything model is that it's kind of promiscuous. Not only is it tempting to poke around in memory that you shouldn't, but sometimes you even do it by accident, when multiple threads touch things that sit on the same cache line (better allocators help with that, but you still have to be careful). More points in the "too hard for most people" column.
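To make the cache line accident a bit more concrete, here is a minimal sketch of the usual defence: give each thread's data its own line. The names (kCacheLine, PaddedCounter, count_in_parallel) are made up for illustration, and the 64-byte line size is an assumption that holds on typical x86-64 machines but not everywhere.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Assumption: 64-byte cache lines (common on x86-64, not universal).
constexpr std::size_t kCacheLine = 64;

// Aligning the struct to the line size also pads its size up to it, so
// neighbouring array elements never share a cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should occupy exactly one cache line");

// Hypothetical helper: two threads each bump their own counter, and the
// function returns the combined total once both have finished.
long count_in_parallel(long iterations) {
    PaddedCounter counters[2];
    auto work = [&counters, iterations](int index) {
        for (long n = 0; n < iterations; ++n) {
            counters[index].value.fetch_add(1, std::memory_order_relaxed);
        }
    };
    std::thread first(work, 0);
    std::thread second(work, 1);
    first.join();
    second.join();
    return counters[0].value.load() + counters[1].value.load();
}
```

Without the alignas, both counters would most likely land on the same line and the two threads would keep stealing it from each other, even though they never touch the same variable.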
His analogy of memcached with NUMA is also to the point. While memcached is at the cluster end of the spectrum, at the other end, there is a similar phenomenon with SMP systems that aren't all that symmetrical, multi-cores add another layer, and hyper-threading yet another. All of this should emphasize how complicated it is to write a scheduler that does a good job of using all this properly, and why I'm not particularly thrilled at the idea of having to do it myself, when there are a number of rather clever people trying to do it in the kernel.
What really won me over to threading is the implicit I/O. I got screwed over by paging, so I fought back (wasn't going to let myself be pushed around like that!), summoning the evil powers of mlockall(). That's where it struck me that I was forfeiting virtual memory, at this point, and figured that there had to be some way that sucked less. To use multiple cores, I was already going to have to use threads (assuming workloads that need a higher level of integration than processes), so I was already exposed to sharing and synchronization, and as I was working things out, it got clearer that this was one of those things where the worst is getting from one thread to more than one. I was already in it, why not go all the way?
One of the things that didn't appeal to me in threads was getting preempted. It turns out that when you're not too greedy, you get rewarded! A single-threaded, event-driven program is very busy, because it always finds something interesting to do, and when it's really busy, it tends to exhaust its time slice. With a blocking I/O, thread-per-request design, most servers do not overrun their time slice before running into another blocking point. So in practice, the state machine that I tried so hard to implement in user-space works itself out, if I don't eat all the virtual memory space with huge stacks. With futexes, synchronization is really only expensive in case of contention, so that on a single-processor machine, it's actually just fine too! Seems ironic, but none of it would be useful without futexes and a good scheduler, both of which we only recently got.
There's still the case of CPU-intensive work, which could introduce thrashing between threads and reduce throughput. I haven't figured out the best way to do this yet, but it could be kept under control with something like a semaphore, perhaps? Have it set to the maximum number of CPU-intensive tasks you want going, have them wait on it before doing work, and post it when they're done (or when there's a good moment to yield)...
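A rough sketch of that semaphore idea, to show the shape of it rather than a tuned solution. The Semaphore class here is a small hand-rolled one built on a mutex and a condition variable (an assumption made for portability; a POSIX or C++20 semaphore would do the same job), and run_limited is a hypothetical helper that reports the highest number of tasks it ever saw in the CPU-heavy section at once.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore (illustrative, not a library type).
class Semaphore {
public:
    explicit Semaphore(int count) : count_(count) {}

    // Blocks until a slot is free, then takes it.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }

    // Gives a slot back and wakes one waiter, if any.
    void post() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        available_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable available_;
    int count_;
};

// Starts `tasks` threads, but lets at most `limit` of them do CPU-heavy
// work at the same time. Returns the peak concurrency actually observed.
int run_limited(int tasks, int limit) {
    Semaphore slots(limit);
    std::atomic<int> active{0};
    std::atomic<int> peak{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < tasks; ++i) {
        threads.emplace_back([&slots, &active, &peak] {
            slots.wait();                    // wait before doing work
            int now = ++active;
            int seen = peak.load();
            while (now > seen && !peak.compare_exchange_weak(seen, now)) {
            }
            volatile unsigned long sink = 0; // stand-in for real work
            for (unsigned long n = 0; n < 100000; ++n) {
                sink = sink + n;
            }
            --active;
            slots.post();                    // post when done
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return peak.load();
}
```

With something like run_limited(8, 2), eight threads exist, but no more than two are ever crunching at once; the rest sit blocked on the semaphore, which costs nothing while they wait.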
omnifarious is right that we should take care to learn from what others have done. Clever use of shared_ptr and immutable data can serve as a form of RCU, and immutable data in general tends to make good friends with being replicated (safely) in many places.
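Here is one way the shared_ptr trick could look, as a sketch only. Config and Published are hypothetical names, and a mutex guards the pointer swap to keep the example portable (an assumption; an atomic shared_ptr would avoid even that short lock). It is RCU in spirit rather than the kernel's mechanism: readers grab a snapshot and read it without any locking, and writers copy, modify, and publish a new version.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

using Config = std::map<std::string, std::string>;

// Holds the current immutable version of a Config.
class Published {
public:
    explicit Published(Config initial)
        : current_(std::make_shared<Config>(std::move(initial))) {}

    // Readers copy the pointer under a short lock, then read the data
    // lock-free for as long as they hold the snapshot.
    std::shared_ptr<const Config> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return current_;
    }

    // Writers never modify a published version: they copy it, change the
    // copy, and swap the pointer. Old versions are freed when the last
    // reader drops its snapshot.
    void update(const std::string& key, const std::string& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto next = std::make_shared<Config>(*current_);
        (*next)[key] = value;
        current_ = std::move(next);
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<const Config> current_;
};
```

A reader that took its snapshot before an update keeps seeing the old values, consistently, until it asks for a new snapshot. The reference count plays the role of the grace period.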
One of the great ironies of this, in my opinion, is that Java got NIO almost just in time for it to be obsolete, while we had been doing this in C and C++ since, well, almost forever. Sun has this trick for being right, yet doing it wrong, it's amazing!