Older blog entries for axboe (starting at number 11)

I believe the erratic performance davem saw is now fixed. Instead of having two request_freelist heads (one for reads and one for writes, where reads can steal from the write list if necessary), I use a single list head to hold the requests and let a counter keep track of when writes should block in get_request_wait. Writes can consume 2/3 of the queue, just like before. This approach has a couple of advantages, although it does not "seem" as clean: we save a bit of space in request_queue_t and struct request, and get_request is simpler than before. I'm waiting to hear from davem claiming his free beer before submitting this.
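
A minimal sketch of the single-list idea; the names, queue size, and 2/3 threshold helper here are illustrative, not the actual 2.3 code:

```c
/* Illustrative sketch: one free list for reads and writes, with a
 * counter gating writes so they can't consume the whole queue.
 * Names and layout are hypothetical, not the real 2.3 source. */
#define QUEUE_NR_REQUESTS 128
#define WRITE_LIMIT ((QUEUE_NR_REQUESTS * 2) / 3)

struct request {
	struct request *next;
	/* ... */
};

struct request_queue {
	struct request *free_list;	/* one list for reads and writes */
	int write_count;		/* requests currently held by writes */
};

/* Return a free request, or NULL if the caller must sleep in
 * get_request_wait(). Writes block once they hold 2/3 of the
 * queue, so reads can always make progress. */
static struct request *get_request(struct request_queue *q, int is_write)
{
	struct request *rq = q->free_list;

	if (!rq)
		return NULL;
	if (is_write) {
		if (q->write_count >= WRITE_LIMIT)
			return NULL;
		q->write_count++;
	}
	q->free_list = rq->next;
	return rq;
}
```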

Removed the BROKEN_CAP_PAGE in ide-cd and let a simple id scan decide whether to include the full mode page cap size or not. Should make all drives happy, old as well as ACER50 and similar.

Started a document detailing the Linux block driver stuff.

Davem is seeing inconsistent results with my new elevator stuff, which is very strange. I haven't been home much today, so I'll investigate this matter tomorrow and hopefully be able to offer an explanation of what is going on. Right now I'm puzzled.

And the weather today wasn't really as nice as I expected, compared to the weekend it was kind of cold. A good time was still had, though :-)

Finished and cleaned up the block queueing and elevator changes. The type of elevator is selectable with elevator_init(); blk_init_queue() selects ELEVATOR_DEFAULT for you, which is the elevator we have now in 2.3. The only difference is that max_bomb_segments is increased to 32 for much better performance. The other "elevator" implemented is called noop, since it always stores incoming requests at the back and always coalesces. Give it a shove: patch up and change ELEVATOR_DEFAULT to ELEVATOR_NOOP in ll_rw_blk.c. It's in Linus' inbox.
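
Conceptually, the selection works something like this; beyond the names mentioned above (elevator_init(), ELEVATOR_DEFAULT, ELEVATOR_NOOP), the layout is my own illustration:

```c
/* Illustrative only; the real elevator_t and elevator_init() in the
 * patch may look different. */
typedef struct elevator_s {
	int sort;			/* 1: sort by sector, 0: append at back */
	int max_bomb_segments;		/* write bomb throttle */
} elevator_t;

enum { ELEVATOR_DEFAULT, ELEVATOR_NOOP };

static void elevator_init(elevator_t *e, int type)
{
	if (type == ELEVATOR_NOOP) {
		e->sort = 0;			/* store incoming at the back */
		e->max_bomb_segments = 0;	/* and always coalesce */
	} else {
		e->sort = 1;			/* current 2.3 behaviour */
		e->max_bomb_segments = 32;	/* bumped from 4, see below */
	}
}
```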

Apparently the DVD stuff is going into 2.2.16-pre2. I've got a couple of changes I need to send to Alan, mostly backports from current 2.3. Interesting to see how this goes... It's been ages since I've gotten a bug report for 2.2 + dvd patches, so I think we are fine. In addition, 2.3 has had this stuff since 2.3.16 (or thereabouts) and seems to be doing great.

Tomorrow is May 1st, which means beer and great weather! And in two weeks I'm going on vacation, life is great.

Never forget a ';' -- it could cost you many hours of restoring data from your partitions due to corruption. Did I mention the missing ';' was in the merge code? Oh well, this is the first time that I've experienced (self-inflicted) corruption. I could kick myself. Seems forgetting that character was the theme of the day, eh Rik :-)

Modularized the elevator code so it is easy to write a new elevator plugin or just choose which of the available ones you want for a low level driver. Arjan is playing with some of his own ideas; interesting to see how they turn out.
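
In sketch form, such a plugin interface boils down to a couple of function pointers per elevator; this is a hypothetical shape, not the actual patch:

```c
/* Hypothetical elevator plugin interface; callback names invented. */
struct request;
struct request_queue;
struct buffer_head;

struct elevator_ops {
	/* try to merge a buffer into a queued request; 0 on success */
	int (*merge)(struct request_queue *q, struct buffer_head *bh, int rw);
	/* pick the next request for the low level driver to service */
	struct request *(*next_request)(struct request_queue *q);
};
```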

Didn't do a whole lot more, it is Saturday after all.

What do you know, edit a diary entry and the date changes!

After having studied many different types of I/O schedulers, I've come to the conclusion that simple ascending request sorting is close to optimal for most circumstances. It has decent runtime for insertions and good average seek time. Combine that with some logic for limiting starvation and you've got yourself a decent disk elevator -- and behold -- this is what we have in current 2.3. After trying the BSD style elevator, I implemented one that always returns the request closest to the one the drive is servicing right now. In terms of runtime it is expensive, and the gains over the simple ascending sort were just not worth it.

So my work yet again degenerates into just getting good performance by tweaking elevator defaults, how boring. Well, almost: I want to modularize the current elevator so that it is possible to select which one you want. For IDE drives the current one is pretty good; for SCSI it does some work that is really unneeded but does not harm performance (well, we do take a small hit because of the unneeded work, but that is negligible). For intelligent devices (high-end SCSI HBAs/disks, I2O) that claim to do their own elevating, we shouldn't need to do much.
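
For reference, the core of ascending sort plus a starvation bound can be sketched like this; a toy version where the field names and latency bookkeeping are illustrative, not ll_rw_blk.c:

```c
/* Toy ascending sector sort with a starvation limit. */
struct request {
	struct request *next;
	unsigned long sector;
	int latency;	/* how many times this request may still be passed */
};

static void elevator_insert(struct request **head, struct request *rq,
			    int max_latency)
{
	struct request **p = head;
	struct request *r;

	rq->latency = max_latency;

	/* find the ascending-sort position, but never move ahead of a
	 * request whose latency budget is used up (starvation limit) */
	while (*p && (*p)->latency > 0 && (*p)->sector <= rq->sector)
		p = &(*p)->next;

	rq->next = *p;
	*p = rq;

	/* everything we just queued ahead of gets delayed one more time */
	for (r = rq->next; r; r = r->next)
		r->latency--;
}
```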

At least one person is having problems caused by changing the default mode page size in ide-cd to the standard specified size. So it looks like we are reverting to old behaviour again. The only known case (to me) that fails with the smaller mode page is the ACER 50 drive, which is not enough to justify changing the old default.

I want to get a new workstation PC. Contemplating getting a K7, but I'm not sure I really need it. Oh well.

Implemented a BSD style elevator to see what effects that would have on I/O behaviour. The BSD elevator is different in that it keeps two lists of requests that the device must service when it gets unplugged (this is handled automatically by my code, though; queue_head always points to the list that needs work). We start by filling requests onto the first list in strict sector ordering until a request comes in that lies before the last active request. Then we switch lists and start adding to the other list. This gives good I/O ordering and also imposes a limit on how long we risk waiting for a specific request to finish. Performance is as of yet not quite determined. Feels pretty good though, and initial benchmarks show that it is.
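
A rough sketch of the two-list scheme described above; the names and the exact switch condition are illustrative:

```c
/* Two-list "BSD style" elevator sketch. */
struct request {
	struct request *next;
	unsigned long sector;
};

struct bsd_elevator {
	struct request *list[2];	/* one being serviced, one filling */
	int adding;			/* index new requests go onto */
	unsigned long last_sector;	/* sector of the last request added */
};

static void bsd_add_request(struct bsd_elevator *e, struct request *rq)
{
	struct request **p;

	/* a request before the last one added flips us to the other list,
	 * bounding how long any queued request can be passed over */
	if (rq->sector < e->last_sector)
		e->adding ^= 1;
	e->last_sector = rq->sector;

	/* keep each list in strict ascending sector order */
	p = &e->list[e->adding];
	while (*p && (*p)->sector <= rq->sector)
		p = &(*p)->next;
	rq->next = *p;
	*p = rq;
}
```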

Decided to give the nvidia XFree86-4.0 drivers a go. The kernel driver needed porting to 2.3 first, but that was fairly trivial. Seems to run well. Soon it's time for the Q3 test to see how well the OpenGL performs! XFree86-3.3 performance with the nvidia glx sucked big time; I truly hope the new one is much better. According to the Linux Games site it is, sweet.

Still collecting data to determine the influence of elevator settings on the unplugging of devices. I now have so much data stored it's beginning to get silly :-). On the side I'm collecting data for "normal system use" -- I want to convince David that writes do not dominate as much as he thinks! I don't know, though; I want to let the logger run for another day before I look at the output. I often forget that it is running, which is good for getting unbiased results.

Fixed several sillies in ide-cd, including cleaning up the ACER50 MODE_SENSE stuff that got shoved in earlier. I think we want to go for the regulation sized capabilities page buffer; if that doesn't work I can always blame it on the drive. Sticking to ATAPI is good in that sense, otherwise it has its deficiencies. If anybody should get into trouble, there's an easily changeable define in ide-cd.h.
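
Something along these lines; purely illustrative, as the real define in ide-cd.h may be named and valued differently:

```c
/* Illustrative knob, not the actual ide-cd.h define. The idea: use
 * the regulation ATAPI capabilities page size by default, with one
 * easy switch for drives (like the ACER 50) that choke on it. */
#define CDROM_STANDARD_CAP_PAGE 1	/* flip to 0 if your drive misbehaves */

#if CDROM_STANDARD_CAP_PAGE
#define CAP_PAGE_BUFLEN	sizeof(struct atapi_capabilities_page)
#else
#define CAP_PAGE_BUFLEN	512		/* generously oversized fallback */
#endif
```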

Don't ever try to "hot plug" IDE drives, especially if you turn the power connector upside down by mistake... Lots of smoke and several weeks of packet writing work down the drain. Not to worry, I want to try putting a new board on the drive so I can get my data out. Backups? Psch, that's for wimps (not for idiots like me).

Analyzed heaps of blklog data resulting from dbench runs with various elevator settings. As suspected, the max_bomb_segments setting of 4 causes a huge performance loss. I've put the results that blklog generates from a dbench 48 run with the various elevator settings up at kernel.dk/elevator, along with a small util to tweak the elevator (elvset.c) and my blklog app (blklog.c). The latter isn't really useful, since I haven't put up the blklog driver just yet...

The (8192, 131072, 128, 2) (read_latency, write_latency, max_bomb_segments, and max_read_pending) elevator setting gives good bench results while still maintaining decent interactive feel. Please experiment and let me know what your findings are! The last entry in the settings above is an addition of mine; the first three are the existing elevator parameters. I mainly run the tests on a SCSI disk that has tagged command queueing, so elevator sorting is basically a non-issue for me. People with IDE drives are the ones that should feel the biggest impact when playing with the elevator.
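
An elvset-style tweak could look roughly like this; the ioctl number, struct layout, and max_read_pending field are assumptions on my part, so check the real elvset.c at kernel.dk/elevator:

```c
/* Sketch of an elvset-style tool; BLKELVSET's value and the struct
 * layout are assumed, not taken from the shipped interface. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

struct elv_settings {
	int read_latency;
	int write_latency;
	int max_bomb_segments;
	int max_read_pending;	/* the experimental extra field */
};

#define BLKELVSET _IOW(0x12, 107, struct elv_settings)	/* assumed value */

int main(int argc, char **argv)
{
	struct elv_settings e = { 8192, 131072, 128, 2 };
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <blockdev>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, BLKELVSET, &e) < 0) {
		perror("elvset");
		return 1;
	}
	close(fd);
	return 0;
}
```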

Tried various elevator settings and profiled a dbench 48 run. Depending on settings, the raw dump from a single dbench run takes between 13 and 16MB of disk space! Getting a detailed ASCII dump consumed as much as 60MB of disk space. The app also prints a useful one page summary of I/O activity, which is what I've been using. I haven't had much time to investigate the logs yet, but it looks as if the write bomb logic is what is hurting performance the most, especially because the bomb segments are set as low as 4 sectors! Much better results are achieved with a bomb segment setting of 64 and a new max_read_pending member in the elevator, so that we don't unconditionally reduce segments for a single pending read. I will put up detailed info tomorrow along with a possible good setup for the elevator.
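
The max_read_pending idea in sketch form; the constant and helper name are mine, not the patch:

```c
/* Illustrative: only clamp write requests to max_bomb_segments once
 * at least max_read_pending reads are actually waiting, instead of
 * throttling writes for a single pending read. */
#define MAX_SEGMENTS 128	/* ordinary per-request limit (assumed) */

struct elevator {
	int max_bomb_segments;	/* write clamp when reads are waiting */
	int max_read_pending;	/* reads required before clamping kicks in */
};

static int write_segment_limit(const struct elevator *e, int pending_reads)
{
	if (pending_reads >= e->max_read_pending)
		return e->max_bomb_segments;
	return MAX_SEGMENTS;
}
```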

The request accounting does not seem to impose as big an overhead as I had expected. I keep an in-kernel 2MB ring buffer which can hold 87381 entries. The blklog app reads 128 entries at a time and writes them out in intervals of 1MB. A dbench run consumes about 500,000-650,000 requests and the miss rate is about 0.02%, which is tolerable. This gives dbench about a 3% performance hit. If blklog is not running (it enables logging when it opens /dev/blklog), the performance hit is the overhead of a function call per request -- which doesn't even show up in performance testing.
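
The arithmetic works out if each entry is 24 bytes: 2*1024*1024 / 24 = 87381. A sketch of such a buffer, with an entry layout guessed to match that figure rather than taken from the blklog source:

```c
/* Entry layout is guessed to fit the 24-byte arithmetic above. */
#define LOG_BUF_SIZE (2 * 1024 * 1024)

struct blklog_entry {		/* 24 bytes on a 32-bit kernel */
	unsigned int   time;	/* e.g. jiffies at submission */
	unsigned short dev;	/* target device */
	unsigned short what;	/* merged, new request, had to wait... */
	unsigned long  sector;
	unsigned long  nr_sectors;
	unsigned long  pad[2];	/* room for whatever else is logged */
};

#define LOG_NR_ENTRIES (LOG_BUF_SIZE / sizeof(struct blklog_entry)) /* 87381 */

static struct blklog_entry log_buf[LOG_NR_ENTRIES];
static unsigned int log_head, log_tail;	/* producer / consumer index */

static void blklog_add(const struct blklog_entry *e)
{
	unsigned int next = (log_head + 1) % LOG_NR_ENTRIES;

	if (next == log_tail)
		return;		/* buffer full: drop (the ~0.02% misses) */
	log_buf[log_head] = *e;
	log_head = next;
}
```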

Played some more with the elevator to get optimal performance and good latency. Davem suggested implementing a small module that logs requests, so that we can get hard data on what exactly is causing the problem and how. So I wrote a small logging driver (blklog) that accounts for every buffer that is submitted for I/O, and how: new request started, had to wait for a request, buffer merged, buffer not merged due to elevator latency, buffer not merged due to elevator starvation, device, sector, number of sectors, etc. All entries are pushed to an internal ring buffer, and a small user space app collects the data and provides a nice printout of the elevator interaction at any given time. Tomorrow I will put this to good use and run benchmarks to collect data. With elevator defaults that basically kill the elevator latency, starvation, and write bomb logic, David sees a 26 -> 34MB/s improvement (yes, single disk; he has the toys with insane I/O).
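
The per-buffer event tags could be enumerated something like this; the names are invented from the description above, not taken from the driver:

```c
/* Event tags a logger like blklog might record per buffer. */
enum blklog_event {
	BLKLOG_NEW_REQUEST,		/* a new request was started */
	BLKLOG_WAIT_REQUEST,		/* had to sleep for a free request */
	BLKLOG_MERGE,			/* merged into an existing request */
	BLKLOG_NOMERGE_LATENCY,		/* not merged: latency expired */
	BLKLOG_NOMERGE_STARVATION	/* not merged: starvation limit hit */
};
```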

Made other small improvements to the queueing, using spin_lock_irq() instead of _irqsave to save stack where possible (basically, almost all of ll_rw_blk.c does not have to consider being called with interrupts disabled, since many paths may end up having to block waiting for a free request). Also fully removed tq_disk for waiting on locked pages and buffers.
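
The stack saving comes from dropping the flags word that _irqsave needs, which is safe exactly because these paths can never be entered with interrupts already disabled. Roughly, against the 2.3-era spinlock API:

```c
/* Sketch only. Because these paths may block waiting for a free
 * request, they cannot legally run with interrupts disabled -- so
 * there is no prior IRQ state worth saving. */
#include <linux/spinlock.h>

extern spinlock_t io_request_lock;

static void add_buffer(void)
{
	/* caller guaranteed to have interrupts enabled: no flags needed */
	spin_lock_irq(&io_request_lock);
	/* ... merge the buffer or start a new request ... */
	spin_unlock_irq(&io_request_lock);
}

static void from_any_context(void)
{
	unsigned long flags;	/* extra stack the _irq variant avoids */

	spin_lock_irqsave(&io_request_lock, flags);
	/* ... */
	spin_unlock_irqrestore(&io_request_lock, flags);
}
```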
