Matt Welsh's SEDA is very, very interesting. I especially like the quantitative flavor. Often, that's what makes the difference between expressing one's taste and doing real science.
Matt chose Java for his implementation. On the surface, this is a strange choice. For one, until quite recently, Java didn't even have a nonblocking I/O library. Matt had to implement his own, using JNI. The path of least resistance is to use a thread per connection, which has scaling problems as the number of connections grows large. Of course, it's possible to mitigate these scaling problems by throwing more hardware at the problem. Now why would Sun do a thing like that?
Anyway, SEDA is a fairly sophisticated runtime. Each stage is implemented with some number of threads. The number of threads is dynamically tuned (by "controllers") to optimize things like utilization and flow control. Many asynchronous designs have a much more straightforward mapping between threads and the tasklets (or "fibers", in McCusker's terminology). For example, a lot of people use select() to get one thread, one process. Alternatively, people who sell high-end hardware encourage the use of one thread per tasklet. But neither is truly satisfactory. A select()-based server will block on many important tasks, including file I/O, and is also sensitive to individual event handlers taking too long. It also won't get any gain from multiple CPU's.
SEDA solves this, but places very high demands on the runtime. In particular, you have to deal with very dynamic patterns of memory allocation. In a performance-oriented context, it's tempting to do this memory management (at least partly) by hand. But in an environment as dynamic as a typical SEDA server, it gets really tricky. Not to mention that malloc() and free() themselves start sucking when multithreaded (a microbenchmark shows roughly 3x slowdown with zero lock contention).
Similarly, reference counting would start to bite very hard too. In general, you need a mutex around each refcnt bump. No wonder Python has a global interpreter lock instead.
So the Java approach is actually very sensible. You simply require high performance multithreaded garbage collection from your lower level runtime. This is actually quite a Hard Problem; much PhD-level research has gone into it. However, by now it's more or less been solved. Since it's at the very lowest level, a large investment gets amortized over everything else at higher levels.
There's a pattern. Do sophisticated and complex stuff at the low level (here, hairy and nasty might be better adjectives) so that the higher levels become simpler.
SEDA isn't just interesting because of raw performance, either. From the looks of it, you actually do get good decoupling between the various stages. There's also some thought about debuggability, often a terribly hard problem in asynchronous systems. Perhaps it will become realistic to write an asynchronous library and be able to compose it with other components.