I originally posted this in response to the LWN coverage of the panel discussion at LinuxCon, but figure I should probably post it somewhere more sensible. So here goes... with a few paragraphs added at the end.
"The flash hardware itself is better placed to know about and handle failures of its cells, so that is likely to be the place where it is done, [Ted] said."
I was biting my tongue when he said that, so I didn't get up and heckle.
I think it's the wrong approach. It was all very well letting "intelligent" drives remap individual sectors underneath us so that we didn't have to worry about bad sectors or C-H-S and interleaving. But what the flash drives have to do to present a "disk" interface is much more than that; it's wrong to think that the same lessons apply here.
What the SSD does internally is a file system all of its own, commonly called a "translation layer". We then end up putting our own file system (ext4, btrfs, etc.) on top of that underlying file system.
Do you want to trust your data to a closed source file system implementation which you can't debug, can't improve and — most scarily — can't even fsck when it goes wrong, because you don't have direct access to the underlying medium?
I don't, certainly. The last two times I tried to install Linux to a SATA SSD, the disk was corrupted by the time I booted into the new system for the first time. The 'black box' model meant that there was no chance to recover — all I could do with the dead devices was throw them away, along with their entire contents.
File systems take a long time to get to maturity. And these translation layers aren't any different. We've been seeing for a long time that they are completely unreliable, although newer models are supposed to be somewhat better. But still, shipping them in a black box with no way for users to fix them or recover lost data is a bad idea.
That's just the reliability angle; there are also efficiency concerns with the filesystem-on-filesystem model. Flash is divided into "eraseblocks", typically 128KiB or so and growing as devices get larger. You can write in smaller chunks (typically 512 bytes or 2KiB, though those are growing too), but you can't just overwrite things as you desire. Each eraseblock is a bit like an Etch-A-Sketch. Once you've done your drawing, you can't just change bits of it; you have to wipe the whole block.
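The erase-before-write constraint can be sketched as a toy model. Everything here is illustrative — the class name, the sizes and the `None`-means-erased convention are my own, not any real driver API:

```python
PAGE_SIZE = 2048                       # hypothetical write-page size
ERASEBLOCK_SIZE = 128 * 1024           # 128KiB eraseblock, as above
PAGES_PER_BLOCK = ERASEBLOCK_SIZE // PAGE_SIZE

class Eraseblock:
    """Each page can be programmed once; changing anything afterwards
    means erasing the whole block — wiping the whole Etch-A-Sketch."""
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK   # None == erased

    def program(self, page_no, data):
        if self.pages[page_no] is not None:
            raise ValueError("no in-place overwrite: erase the block first")
        self.pages[page_no] = data

    def erase(self):
        # All pages go back to the erased state at once.
        self.pages = [None] * PAGES_PER_BLOCK

blk = Eraseblock()
blk.program(0, b"version 1")
try:
    blk.program(0, b"version 2")       # rejected: page already programmed
except ValueError:
    pass
blk.erase()                            # wipe all 64 pages...
blk.program(0, b"version 2")           # ...and now the page is writable again
```

This is why a simple "overwrite sector N" disk interface cannot map directly onto the medium: every overwrite has to go somewhere else, and something has to keep track of where.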
Our flash will fill up as we use it, and some of the data on the flash will still be relevant. Other parts will have been rendered obsolete: replaced by newer data, or belonging to files that have since been deleted. Before our flash fills up completely, we need to recover some of the space taken by obsolete data. We pick an eraseblock, write out new copies of the data which are still valid, and then we can erase the selected block and re-use it. This process is called garbage collection.
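The garbage-collection pass just described can be sketched like this. The representation is hypothetical — each eraseblock is a list of pages, where a page holds live data, an `OBSOLETE` marker, or `None` for erased:

```python
OBSOLETE = object()    # marker for data that has been superseded or deleted

def garbage_collect(blocks, spare):
    """Pick the eraseblock with the least live data, copy its still-valid
    pages into the spare (erased) block, then erase the victim so it can
    be reused. Returns the newly erased block."""
    def live_pages(block):
        return [p for p in block if p is not OBSOLETE and p is not None]

    victim = min(blocks, key=lambda b: len(live_pages(b)))
    for i, page in enumerate(live_pages(victim)):
        spare[i] = page                     # rewrite the valid data elsewhere
    victim[:] = [None] * len(victim)        # erase: the block is blank again
    return victim

# Four-page blocks: one mostly obsolete, one mostly live, plus an erased spare.
blocks = [
    ["a", OBSOLETE, OBSOLETE, OBSOLETE],
    ["b", "c", "d", OBSOLETE],
]
spare = [None] * 4
freed = garbage_collect(blocks, spare)
# The mostly-obsolete block was the cheapest victim: only "a" was copied.
```

Note that the cost of the pass is entirely the copying of live data — which is exactly why everything that follows (TRIM, hot/cold separation) is about keeping live data out of the victim block.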
One of the biggest disadvantages of the "pretend to be disk" approach is addressed by the recent TRIM work. The problem was that the disk didn't even know that certain data blocks were obsolete and could just be discarded. So it was faithfully copying those sectors around from eraseblock to eraseblock during its garbage collection, even though the contents of those sectors were not at all relevant — according to the file system, they were free space!
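The difference TRIM makes can be shown with a toy model — the function names and the dict-of-sectors representation are mine, nothing like the real ATA command set:

```python
# sector -> data the device believes it must preserve. Without help from
# the file system, the device cannot tell free space from live data.
device = {0: b"inode table", 1: b"old temp file", 2: b"journal"}

def trim(dev, sectors):
    """The file system's discard hint: these sectors are free space, so
    garbage collection need not copy them from eraseblock to eraseblock."""
    for s in sectors:
        dev.pop(s, None)

def gc_copy_workload(dev):
    # Everything the device still believes is live gets faithfully
    # copied around during garbage collection.
    return len(dev)

before = gc_copy_workload(device)   # all three sectors look live
trim(device, [1])                   # the fs deleted the temp file
after = gc_copy_workload(device)    # one less sector to shuffle around
```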
Once TRIM gets deployed for real, that'll help a lot. But there are other ways in which the model is suboptimal.
The ideal case for garbage collection is that we'll find an eraseblock which contains only obsolete data, and in that case we can just erase it without having to copy anything at all. Rather than mixing volatile, short-term data in with the stable, long-term data, we actually want to keep them apart, in separate eraseblocks. But in the SSD model, the underlying "disk" can't easily tell which data is which — the real OS file system code can do a much better job.
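One way to sketch that separation — with a made-up heuristic, since a real file system has far better hints than rewrite counts — is to route data into different open eraseblocks by how often it changes:

```python
from collections import defaultdict

class Segregator:
    """Route frequently rewritten ('hot') data and stable ('cold') data
    into separate open eraseblocks, so that when the hot block fills up,
    most of its contents are already obsolete and it can be erased
    almost for free. The threshold here is purely illustrative."""
    def __init__(self, hot_after=2):
        self.rewrites = defaultdict(int)   # how often each key was written
        self.hot_after = hot_after         # hypothetical hotness threshold
        self.hot_block, self.cold_block = [], []

    def write(self, key, data):
        self.rewrites[key] += 1
        block = (self.hot_block if self.rewrites[key] >= self.hot_after
                 else self.cold_block)
        block.append((key, data))

seg = Segregator()
seg.write("config", b"v1")     # first write: assumed stable
seg.write("log", b"l1")
seg.write("log", b"l2")        # rewritten: now classified as hot
seg.write("log", b"l3")
```

The point of the example is the asymmetry: the file system knows which data is a journal, a temp file, or a rarely-touched binary before the first write, whereas a device behind a block interface can only guess from access patterns after the fact.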
And when we're doing this garbage collection, it's an ideal time for the OS file system to optimise its storage — to defragment or do whatever else it wants (combining data extents, recompressing, data de-duplication, etc.). It can even play tricks like writing new data out in a suboptimal but fast fashion, and then only optimising it later when it gets garbage collected. But when the "disk" is doing this for us behind our back in its own internal file system, we don't get the opportunity to do so.
I don't think Ted is right that the flash hardware is in the best place to handle "failures of its cells". In the SSD model, the flash hardware doesn't do that anyway — it's done by the file system on the embedded microcontroller sitting next to the flash.
I am certain that we can do better than that in our own file system code. All we need is a small amount of information from the flash. Telling us about ECC corrections is a first step, of course — when we had to correct a bunch of flipped bits using ECC, it's getting on for time to GC the eraseblock in question, writing out a clean copy of the data elsewhere. And there are technical reasons why we'll also want the flash to be able to say "please can you GC eraseblock #XX soon".
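The ECC-reporting idea can be sketched as a threshold check. The numbers are hypothetical — real ECC strength and sensible margins vary by device — but the shape of the policy is what matters:

```python
ECC_STRENGTH = 8        # hypothetical: ECC can correct up to 8 flipped bits
GC_THRESHOLD = 6        # migrate data well before the correctable limit

def read_page(raw_read, page):
    """Return the corrected data plus a flag saying the eraseblock should
    be garbage-collected soon, because ECC had to work hard on the read."""
    data, corrected_bits = raw_read(page)
    if corrected_bits > ECC_STRENGTH:
        raise IOError("uncorrectable: too many bit flips")
    return data, corrected_bits >= GC_THRESHOLD

# Simulated raw reads: (data, number of bits ECC had to fix).
healthy = lambda page: (b"stable data", 1)
wearing = lambda page: (b"stable data", 7)

_, gc_hint = read_page(healthy, 0)   # few corrections: leave the block alone
_, gc_soon = read_page(wearing, 0)   # nearing the limit: rewrite elsewhere
```

All the file system needs from the hardware is that `corrected_bits` count; the decision about when and where to move the data is exactly the kind of policy the OS is better placed to make.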
But I see absolutely no reason why we should put up with the "hardware" actually doing that kind of thing for us, behind our back. And badly.
Admittedly, the need to support legacy environments like DOS and to provide INT 13h "DISK BIOS" calls or at least a "block device" driver will never really go away. But that's not a problem. There are plenty of examples of translation layers done in software, where the OS really does have access to the raw flash but still presents a block device to the rest of the system. Linux has about 5 of them already. The corresponding "dumb" devices (like the M-Systems DiskOnChip, which used to be extremely popular) are great for Linux, because we can use real file systems on them directly.
At the very least, we want the "intelligent" SSD devices to have a pass-through mode, so that we can talk directly to the underlying flash medium. That would also allow us to try to recover our data when the internal "file system" screws up, as well as allowing us to do things properly from our own OS file system code.
Now, I'm not suggesting that we already have file system code which can do things better; we don't. I wrote a file system which works on real flash, but I wrote it 8 years ago and it was designed for 16-32MiB of bitbanged NOR flash. We pushed it to work on 1GiB of NAND (and even using DMA!) for OLPC, but that is pretty much the limit of how far we'll get it to scale.
We do, however, have a lot of interesting new work such as UBI and UBIFS, which is rapidly taking the place of JFFS2 in the real world. The btrfs design also lends itself very well to working on real flash, because of the way it doesn't overwrite data in-place. I plan to have btrfs-on-flash, or at least btrfs-on-UBI, working fairly soon.
And, of course, we even have the option of using translation layers in software. That's how I tested the TRIM support when I added it to the kernel; by adding it to our existing flash translation layer implementations. Because when this stuff is done in software, we can work on it and improve it.
So I am entirely confident that we can do much better in software — and especially in open source software — than an SSD could ever do internally.
Let's not be so quick to assume that letting the 'hardware' do it for us is the right thing to do, just because it was right 20 years ago for hard drives to do something which seems vaguely similar at first glance.
Yes, we need the hardware to give us some hints about what's going on, as I mentioned above. But that's about as far as the complexity needs to go; don't listen to the people who tell you that the OS would need to know all kinds of details about the internal geometry of the device, which will be changing from month to month as technology progresses. The basic NAND flash technology hasn't changed that much in the last ten years, and existing file systems which operate on NAND haven't had to make many adjustments to keep up at all.