Worked some more on the block layer queue stuff. Every block device request queue now has its own request freelist protected by its own lock, instead of a global freelist shared by all devices. This means we no longer have to scan a table of 256 entries to find a free request when generating I/O, but can instead just grab the first entry off the top of the queue's request freelist. In addition, I split the tq_disk task queue so that we only fire a specific device when a buffer or page is needed, instead of all of them. This should give better plugging behaviour and smoother I/O.
A problem arises because of the separate queues: we no longer have a single lock (io_request_lock) protecting all the queues; each queue has its own lock. This means that some low level drivers have to be audited for SMP safety (devices with several queues, such as cpqarray and DAC960). In addition, MD seems to have some problems. I need to look at this some more and do some benching. Preliminary results show that even a single CPU, single disk configuration benefits from the new request freelist, as expected (getting a new request is now O(1) instead of O(n)).
Also worked on fixing the loopback driver, which deadlocks horribly for file backed devices. My theory now is that grab_cache_page() attempts to lock down a page (__find_lock_page()), which may require unplugging of devices (thus tq_disk is run from within tq_disk, hmmm).