Older blog entries for axboe (starting at number 8)

Never forget a ; it could cost you many hours of restoring data from your partitions due to corruption. Did I mention the missing ; was in the merge code? Oh well, this is the first time that I've experienced (self inflicted) corruption. I could kick myself. Seems forgetting that character was the theme of the day, eh Rik :-)

Modularized the elevator code so it is easy to write a new elevator plugin or just choose which of the available ones you want for a low level driver. Arjan is playing with some ideas of his own; it will be interesting to see how they turn out.

Didn't do a whole lot more, it is Saturday after all.

What do you know, edit a diary entry and the date changes!

After having studied many different types of I/O schedulers, I've come to the conclusion that simple ascending request sorting is the best fit for most circumstances. It has decent runtime for insertions and good average seek time. Combine that with some logic for limiting starvation and you've got yourself a decent disk elevator - and behold - this is what we have in current 2.3. After trying the BSD style elevator, I implemented one that always returns the request closest to the one the drive is servicing right now. In terms of runtime it is expensive, and the gains over the simple ascending sort were just not worth it. So my work yet again degenerates into just getting good performance by tweaking elevator defaults, how boring. Well almost, I want to modularise the current elevator so that it is possible to select which one you want. For IDE drives the current one is pretty good; for SCSI it does some work that is really unneeded but does not harm performance (well, we do take a small hit because of the unneeded work, but that is negligible). For intelligent devices (high-end SCSI HBAs/disks, I2O) that claim to do their own elevating, we shouldn't need to do much.

At least one person is having problems caused by changing the default mode page size in ide-cd to the standard specified size. So it looks like we are reverting to old behaviour again. The only known case (to me) that fails with the smaller mode page is the ACER 50 drive, not enough to justify changing the old default.

I want to get a new workstation PC. Contemplating getting a K7, but I'm not sure I really need it. Oh well.

Implemented a BSD style elevator to see what effect it would have on I/O behaviour. The BSD elevator is different in that it keeps two lists of requests that the device must service when it gets unplugged (this is handled automatically by me, though; queue_head always points to the list that needs work). We start by filling requests onto the first list in strict sector ordering until a request comes in that lies before the last active request. Then we switch lists and start adding to the other list. This gives good I/O ordering and also imposes a limit on how long we risk waiting for a specific request to finish. Performance is as yet not fully determined. It feels pretty good though, and initial benchmarks show that it is.

Decided to give the nvidia XFree86-4.0 drivers a go. The kernel driver needed porting to 2.3 first, though, but that was fairly trivial. Seems to run well. Soon it is time for the Q3 test to see how well OpenGL performs! XFree86-3.3 performance with the nvidia glx sucked big time, I truly hope the new one is much better. According to the Linux Games site it is, sweet.

Still collecting data to determine the influence of elevator settings on the unplugging of devices. I now have so much data stored it's beginning to get silly :-). On the side I'm collecting data for "normal system use" -- I want to convince David that writes do not dominate as much as he thinks! I don't know, though; I want to let the logger run for another day before I look at the output. I often forget that it is running, which is good for getting unbiased results.

Fixed several sillies in ide-cd, including cleaning up the ACER50 MODE_SENSE stuff that got shoved in earlier. I think we want to go for the regulation sized capabilities page buffer, if that doesn't work I can always blame it on the drive. Sticking to ATAPI is good in that sense, otherwise it has its deficiencies. If anybody should get into trouble there's an easily changeable define in ide-cd.h.

Don't ever try to "hot plug" IDE drives, especially if you turn the power connector upside down by mistake... Lots of smoke and several weeks of packet writing work down the drain. Not to worry, I want to try putting a new board on the drive so I can get my data out. Backups? Psch, that's for wimps (not for idiots like me).

Analyzed heaps of blklog data resulting from dbench runs with various elevator settings. As suspected, the max_bomb_segments setting of 4 causes a huge performance loss. I've put the results that blklog generates from a dbench 48 run with the various elevator settings up at kernel.dk/elevator, along with a small util to tweak the elevator with (elvset.c) and my blklog app (blklog.c). The latter isn't really useful, since I haven't put up the blklog driver just yet...

The (8192, 131072, 128, 2) (read_latency, write_latency, max_bomb_segments, and max_read_p) elevator setting gives good bench results while still maintaining decent interactive feel. Please experiment and let me know what your findings are! The last of the settings above is an addition of mine; the first three already existed. I mainly run the tests on a SCSI disk that has tagged command queueing, so elevator sorting is basically a non-issue for me. People with IDE drives are the ones that should feel the biggest impact when playing with the elevator.

Tried various elevator settings and profiled a dbench 48 run. Depending on settings, the raw dump from a single dbench run takes between 13-16MB of disk space! Getting a detailed ASCII dump consumed as much as 60MB of disk space. The app also prints a useful one page summary of I/O activity, which is what I've been using. I haven't had much time to investigate the logs yet, but it looks as if the write bomb logic is what is hurting performance the most. Especially because the bomb segments are set as low as 4 sectors! Much better results are achieved with a bomb segment of 64 and a new max_read_pending member in the elevator, so that we don't unconditionally reduce segments for a single pending read. I will put up detailed info tomorrow along with a possible good setup for the elevator.

The request accounter does not seem to impose as big an overhead as I had expected. I keep an in-kernel 2MB ring buffer which can hold 87381 entries. The blklog app reads 128 entries at a time and writes them out in intervals of 1MB. A dbench run consumes about 500,000-650,000 requests and the miss rate is about 0.02%, which is tolerable. This gives dbench about a 3% performance hit. If blklog is not running (it enables logging when it opens /dev/blklog), the performance hit is the overhead of a function call per request - which doesn't even show up in performance testing.

Played some more with the elevator to get optimal performance and good latency. Davem suggested implementing a small module that logs requests so that we can get hard data on what exactly is causing the problem and how. So I wrote a small logging driver (blklog) that accounts every buffer that is submitted for I/O, and how: new request started, had to wait for request, buffer merged, buffer not merged due to elevator latency, buffer not merged due to elevator starvation, device, sector, sectors, etc. All entries are pushed to an internal ring buffer, and a small user space app collects the data and provides a nice print out of the elevator interaction at any given time. Tomorrow I will put this to good use and run benchmarks to collect data. With elevator defaults that basically kill the elevator latency, starvation, and write bomb logic, David sees a 26 -> 34MB/s improvement (yes single disk, he has the toys with insane I/O).

Made other small improvements to the queueing, using spin_lock_irq() instead of _irqsave to save stack where possible (basically almost all of ll_rw_blk.c does not have to consider being called with interrupts disabled, since many paths may end up having to block for a free request). Also fully removed tq_disk for waiting on locked pages and buffers.

Found an SMP race in the IDE code of the block patch (pretty stupid) and a more subtle one in the SCSI mid level. I've been doing benchmarks with the new queue stuff to get a feel for performance. A decent RAID setup with a couple of SCSI disks would do nicely here...

Worked with davem to improve the elevator in 2.3. David has a nice description of the problem on his page, but a quick recap is that the elevator will not coalesce adjacent buffers if it thinks it will hurt interactiveness. Instead a new request is grabbed and the buffer added to that. While interactiveness is a must for a desktop machine, this hurt I/O performance quite badly.

Good news is that the loopback driver works with my queuing changes. Other good news is that I'm currently seeing a 14% performance increase with my queueing changes - and that is on a single disk. Multiple disk I/O should benefit even more.

Worked some more on the block layer queue stuff. Every block device request queue now has its own request freelist protected by its own lock, instead of a global freelist for all devices. This means we no longer have to scan a table of 256 entries to find a new free request to generate I/O, but can instead just grab the first entry off the top of the queue request list. In addition, I split the tq_disk task queue so that we only fire a specific device when a buffer or page is needed instead of all of them. This should give better plugging behaviour and smoother I/O.

A certain problem arises because of the separate queues - we no longer have just one lock to protect the queues (io_request_lock); each queue has its own lock. This means that some low level drivers have to be audited for SMP safety (devices with several queues, such as cpqarray and DAC960). In addition, MD seems to have some problems. I need to look at this some more and do some benching. Preliminary results show that even a single CPU, single disk configuration benefits from the new request freelist, as expected (it gives O(1) time for a new request, not O(n)).

Also worked on fixing the loopback driver, which deadlocks horribly for file backed devices. My theory now is that grab_cache_page() attempts to lock down a page (__find_lock_page()) which may require unplugging of devices (thus tq_disk is run from within tq_disk, hmmm).
