Older blog entries for axboe (starting at number 4)

Analyzed heaps of blklog data from dbench runs with various elevator settings. As suspected, the max_bomb_segments setting of 4 causes a huge performance loss. I've put the results that blklog generates from a dbench 48 run with the various elevator settings up at kernel.dk/elevator, along with a small util (elvset.c) for tweaking the elevator and my blklog app (blklog.c). The latter isn't really useful, since I haven't put up the blklog driver just yet...

The (8192, 131072, 128, 2) (read_latency, write_latency, max_bomb_segments, and max_read_p) elevator setting gives good bench results while still maintaining a decent interactive feel. Please experiment and let me know what your findings are! The last of the settings above is an addition of mine; the first three already existed. I mainly run the tests on a SCSI disk with tagged command queueing, so elevator sorting is basically a non-issue for me. People with IDE drives are the ones that should feel the biggest impact when playing with the elevator.

Tried various elevator settings and profiled a dbench 48 run. Depending on settings, the raw dump from a single dbench run takes between 13 and 16MB of disk space! A detailed ASCII dump consumed as much as 60MB. The app also prints a useful one-page summary of I/O activity, which is what I've been using. I haven't had much time to investigate the logs yet, but it looks as if the write bomb logic is what hurts performance the most - especially because the bomb segments are set as low as 4 sectors! Much better results are achieved with a bomb segment setting of 64 and a new max_read_pending member in the elevator, so that we don't unconditionally reduce segments for a single pending read. I will put up detailed info tomorrow along with a possibly good setup for the elevator.

The request accounter does not impose as big an overhead as I had expected. I keep an in-kernel 2MB ring buffer which can hold 87381 entries. The blklog app reads 128 entries at a time and writes them out in intervals of 1MB. A dbench run generates about 500,000-650,000 requests and the miss rate is about 0.02%, which is tolerable. This gives dbench about a 3% performance hit. If blklog is not running (logging is enabled when it opens /dev/blklog), the performance hit is just the overhead of a function call per request - which doesn't even show up in performance testing.

Played some more with the elevator to get optimal performance and good latency. Davem suggested implementing a small module that logs requests, so that we can get hard data on what exactly is causing the problem and how. So I wrote a small logging driver (blklog) that accounts for every buffer submitted for I/O, and how: new request started, had to wait for a request, buffer merged, buffer not merged due to elevator latency, buffer not merged due to elevator starvation, device, sector, sectors, etc. All entries are pushed to an internal ring buffer, and a small user space app collects the data and provides a nice printout of the elevator interaction at any given time. Tomorrow I will put this to good use and run benchmarks to collect data. With elevator defaults that basically kill the elevator latency, starvation, and write bomb logic, David sees a 26 -> 34MB/s improvement (yes, single disk - he has the toys with insane I/O).

Made other small improvements to the queueing, using spin_lock_irq() instead of _irqsave to save stack where possible (basically almost all of ll_rw_blk.c does not have to consider being called with interrupts disabled, since many paths may end up having to block for a free request). Also fully removed tq_disk for waiting on locked pages and buffers.

Found an SMP race in the IDE code of the block patch (pretty stupid) and a more subtle one in the SCSI mid level. I've been doing benchmarks with the new queue stuff to get a feel for performance. A decent RAID setup with a couple of SCSI disks would do nicely here...

Worked with davem to improve the elevator in 2.3. David has a nice description of the problem on his page, but a quick recap is that the elevator will not coalesce adjacent buffers if it thinks doing so will hurt interactivity. Instead a new request is grabbed and the buffer added to that. While interactivity is a must for a desktop machine, this hurts I/O performance quite badly.

Good news is that the loop back driver works with my queueing changes. Other good news is that I'm currently seeing a 14% performance increase from them - and that is on a single disk. Multiple disk I/O should benefit even more.

Worked some more on the block layer queue stuff. Every block device request queue now has its own request freelist protected by its own lock, instead of a global freelist shared by all devices. This means we no longer have to scan a table of 256 entries to find a free request for new I/O, but can instead just grab the first entry off the top of the queue's request list. In addition, I split the tq_disk task queue so that we only fire a specific device when a buffer or page is needed, instead of firing all of them. This should give better plugging behaviour and smoother I/O.

A problem arises because of the separate queues - we no longer have just one lock (io_request_lock) protecting the queues; each queue has its own lock. This means that some low level drivers have to be audited for SMP safety (devices with several queues, such as cpqarray and DAC960). In addition, MD seems to have some problems. I need to look at this some more and do some benching. Preliminary results show that even a single CPU, single disk configuration benefits from the new request freelist, as expected (it gives O(1) time for a new request, not O(n)).

Also worked on fixing the loop back driver, which deadlocks horribly for file-backed devices. My current theory is that grab_cache_page() attempts to lock down a page (__find_lock_page()), which may require unplugging of devices (thus tq_disk is run from within tq_disk, hmmm).
