Played some more with the elevator, trying to get optimal performance while keeping latency good. Davem suggested implementing a small module that logs requests, so that we can get hard data on exactly what is causing the problem and how. So I wrote a small logging driver (blklog) that accounts for every buffer submitted for I/O and what happened to it: new request started, had to wait for a request, buffer merged, buffer not merged due to elevator latency, buffer not merged due to elevator starvation, plus device, sector, sector count, etc. All entries are pushed to an internal ring buffer, and a small user space app collects the data and provides a nice printout of the elevator interaction at any given time. Tomorrow I will put this to good use and run benchmarks to collect data. With elevator defaults that basically kill the elevator latency, starvation, and write bomb logic, David sees throughput improve from 26 to 34 MB/s (yes, on a single disk; he has the toys with insane I/O).
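To make the ring buffer idea concrete, here is a rough user space sketch of what such a log could look like. This is not the blklog code: the event names, the record layout, the ring size, and the overwrite-oldest policy are all assumptions made up for illustration, and a real driver would also need locking around the ring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical event codes, one per case listed in the entry above. */
enum blklog_event {
    BLKLOG_NEW_REQUEST,     /* buffer started a new request */
    BLKLOG_WAIT_REQUEST,    /* had to wait for a free request */
    BLKLOG_MERGED,          /* buffer merged into an existing request */
    BLKLOG_NOMERGE_LATENCY, /* not merged due to elevator latency */
    BLKLOG_NOMERGE_STARVE   /* not merged due to elevator starvation */
};

/* Hypothetical record layout; field names and widths are assumptions. */
struct blklog_entry {
    uint32_t device;
    uint64_t sector;
    uint32_t nr_sectors;
    enum blklog_event event;
};

#define BLKLOG_SIZE 8 /* small power of two, for illustration only */

struct blklog_ring {
    struct blklog_entry slot[BLKLOG_SIZE];
    unsigned int head;    /* total entries pushed so far */
    unsigned int tail;    /* total entries consumed or overwritten */
    unsigned int dropped; /* entries lost because the reader fell behind */
};

/* Record one entry. If the ring is full, the oldest entry is overwritten. */
void blklog_push(struct blklog_ring *r, struct blklog_entry e)
{
    assert(r != NULL);
    if (r->head - r->tail == BLKLOG_SIZE) {
        r->tail++;
        r->dropped++;
    }
    r->slot[r->head % BLKLOG_SIZE] = e;
    r->head++;
}

/* Copy the oldest entry to *out. Returns 1 on success, 0 if the ring is empty. */
int blklog_pop(struct blklog_ring *r, struct blklog_entry *out)
{
    assert(r != NULL && out != NULL);
    if (r->head == r->tail)
        return 0;
    *out = r->slot[r->tail % BLKLOG_SIZE];
    r->tail++;
    return 1;
}

/* Human-readable name for an event, as a reader app might print it. */
const char *blklog_event_name(enum blklog_event ev)
{
    switch (ev) {
    case BLKLOG_NEW_REQUEST:     return "new request";
    case BLKLOG_WAIT_REQUEST:    return "wait for request";
    case BLKLOG_MERGED:          return "merged";
    case BLKLOG_NOMERGE_LATENCY: return "no merge (latency)";
    case BLKLOG_NOMERGE_STARVE:  return "no merge (starvation)";
    }
    return "unknown";
}

/* Drain the ring and print one line per entry. */
void blklog_dump(struct blklog_ring *r)
{
    struct blklog_entry e;

    while (blklog_pop(r, &e))
        printf("dev %u sector %llu +%u: %s\n", (unsigned)e.device,
               (unsigned long long)e.sector, (unsigned)e.nr_sectors,
               blklog_event_name(e.event));
}
```

The submit path would call something like blklog_push() once per buffer, and the collector would call blklog_dump() periodically. Overwriting the oldest entry keeps the producer from ever blocking, at the cost of losing data when the reader falls behind; the dropped counter at least makes that loss visible.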
Made other small improvements to the queueing, using spin_lock_irq() instead of spin_lock_irqsave() to save stack where possible (basically almost all of ll_rw_blk.c does not have to consider being called with interrupts disabled, since many paths may end up having to block for a free request). Also fully removed tq_disk for waiting on locked pages and buffers.