Flash storage: a polemic
I originally posted this in response to the LWN coverage of the
panel discussion at LinuxCon, but figure I should
probably post it somewhere more sensible. So here goes...
with a few paragraphs added at the end.
"The flash hardware itself is better placed
to know about and handle failures of its cells, so that is
likely to be the place where it is done, [Ted]
said."
I was biting my tongue when he said that, so I didn't get up
and heckle.
I think it's the wrong approach. It was all very well
letting "intelligent" drives remap individual sectors
underneath us so that we didn't have to worry about bad
sectors or C-H-S and interleaving. But what the flash drives
have to do to present a "disk" interface is much
more than that; it's wrong to think that the same lessons
apply here.
What the SSD does internally is a file system all of its
own, commonly called a "translation layer". We then end up
putting our own file system (ext4, btrfs, etc.) on top of
that underlying file system.
Do you want to trust your data to a closed source file
system implementation which you can't debug, can't improve
and — most scarily — can't even fsck
when it goes wrong, because you don't have direct access to
the underlying medium?
I don't, certainly. The last two times I tried to install
Linux to a SATA SSD, the disk was corrupted by the time I
booted into the new system for the first time. The 'black
box' model meant that there was no chance to recover —
all I could do with the dead devices was throw them away,
along with their entire contents.
File systems take a long time to get to maturity. And these
translation layers aren't any different. We've been seeing
for a long time that they are completely unreliable,
although newer models are supposed to be somewhat
better. But still, shipping them in a black box with no way
for users to fix them or recover lost data is a bad
idea.
That's just the reliability angle; there are also efficiency
concerns with the filesystem-on-filesystem model. Flash is
divided into "eraseblocks", typically 128KiB or so and growing
as devices get larger. You can write in smaller chunks
(typically 512 bytes or 2KiB, also growing), but you can't
just overwrite things as you
desire. Each eraseblock is a bit like an Etch-A-Sketch. Once
you've done your drawing, you can't just change bits of it;
you have to wipe the whole block.
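To make that concrete, here's roughly what a rewrite looks like when you talk to raw flash through Linux's MTD character device. This is only a sketch: it assumes a scratch device at /dev/mtd0 and the standard MEMGETINFO/MEMERASE ioctls, and all error handling is left out.

    /* Sketch only: rewriting data on raw flash via Linux's MTD interface.
     * Point this at a scratch device; erasing is destructive. */
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <mtd/mtd-user.h>

    int main(void)
    {
        int fd = open("/dev/mtd0", O_RDWR);
        struct mtd_info_user info;

        ioctl(fd, MEMGETINFO, &info);   /* gives us erasesize, writesize, ... */

        /* You can't change part of the drawing: wipe the whole first
         * eraseblock (the whole "Etch-A-Sketch"). */
        struct erase_info_user ei = { .start = 0, .length = info.erasesize };
        ioctl(fd, MEMERASE, &ei);

        /* Only now can we write again, one full page at a time. */
        char *page = calloc(1, info.writesize);
        strcpy(page, "hello, flash");
        pwrite(fd, page, info.writesize, 0);

        free(page);
        close(fd);
        return 0;
    }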
Our flash will fill up as we use it, and some of the data on
the flash will be still relevant. Other parts will have been
rendered obsolete; replaced by other data or just deleted
files that aren't relevant any more. Before our flash fills
up completely, we need to recover some of the space taken by
obsolete data. We pick an eraseblock, write out new copies
of the data which are still valid, and then we can
erase the selected block and re-use it. This process is called
garbage collection.
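In case it helps, here is a hypothetical sketch of one such pass. The types and helper functions are made up for illustration; they're not lifted from any real translation layer or file system.

    /* Hypothetical garbage-collection pass: copy out the live data,
     * then erase the victim block and reuse it. */
    #include <stdio.h>
    #include <stddef.h>

    struct chunk  { int live; size_t len; };
    struct eblock { struct chunk chunks[64]; int nchunks; };

    /* Stand-ins for "append this to the block we're currently filling"
     * and "erase the physical block and put it on the free list". */
    static void copy_to_current_block(struct chunk *c) { printf("copied %zu bytes\n", c->len); }
    static void erase_and_reuse(struct eblock *b)      { b->nchunks = 0; }

    static void gc_one_block(struct eblock *victim)
    {
        /* Step 1: data which is still valid gets a fresh copy elsewhere. */
        for (int i = 0; i < victim->nchunks; i++)
            if (victim->chunks[i].live)
                copy_to_current_block(&victim->chunks[i]);

        /* Step 2: nothing left in the block matters; wipe it and reuse it. */
        erase_and_reuse(victim);
    }

    int main(void)
    {
        struct eblock b = { .chunks = { { 1, 100 }, { 0, 200 } }, .nchunks = 2 };
        gc_one_block(&b);
        return 0;
    }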
One of the biggest disadvantages of the "pretend to
be disk" approach is addressed by the recent TRIM work. The
problem was that the disk didn't even know that
certain data blocks were obsolete and could just be
discarded. So it was faithfully copying those sectors around
from eraseblock to eraseblock during its garbage collection,
even though the contents of those sectors were not at all
relevant — according to the file system, they
were free space!
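Seen from userspace, the hint boils down to something like the following, using the BLKDISCARD ioctl on a block device; the in-kernel TRIM plumbing does the equivalent per freed extent. /dev/sdX is just a placeholder here, and discarding really does destroy the data, so only ever aim this at a scratch device.

    /* Sketch: tell the device that a byte range no longer holds useful
     * data, so its garbage collector can stop copying it around. */
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(void)
    {
        int fd = open("/dev/sdX", O_RDWR);   /* scratch device, not your root disk */

        /* "The first megabyte is free space -- forget about it." */
        uint64_t range[2] = { 0, 1024 * 1024 };   /* offset, length in bytes */
        ioctl(fd, BLKDISCARD, range);

        close(fd);
        return 0;
    }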
Once TRIM gets deployed for real, that'll help a lot. But
there are other ways in which the model is suboptimal.
The ideal case for garbage collection is that we'll find an
eraseblock which contains only obsolete data, and
in that case we can just erase it without having to copy
anything at all. Rather than mixing volatile, short-term
data in with the stable, long-term data, we actually want to
keep them apart, in separate eraseblocks. But in
the SSD model, the underlying "disk" can't easily tell which
data is which — the real OS file system code can do a
much better job.
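Here's a toy illustration of the victim-selection half of that, again with made-up structures: the less live data a block holds, the cheaper it is to collect, and a block with no live data at all costs nothing. A file system which also knows which data is likely to be short-lived can go further and segregate it at write time, so the long-lived data stops getting copied over and over.

    /* Hypothetical victim selection: prefer the block with the least
     * live data. Zero live data is the jackpot -- erase, copy nothing. */
    #include <stdio.h>

    struct eblock { unsigned live_bytes; };

    static int pick_victim(const struct eblock *blk, int n)
    {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (blk[i].live_bytes < blk[best].live_bytes)
                best = i;
        return best;
    }

    int main(void)
    {
        struct eblock blocks[] = { { 4096 }, { 0 }, { 131072 } };
        printf("collect block %d first\n", pick_victim(blocks, 3));
        return 0;
    }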
And when we're doing this garbage collection, it's an ideal
time for the OS file system to optimise its storage —
to defragment or do whatever else it wants (combining data
extents, recompressing, data de-duplication, etc.). It can
even play tricks like writing new data out in a suboptimal
but fast fashion, and then only optimising it later
when it gets garbage collected. But when the "disk" is doing
this for us behind our back in its own internal file system,
we don't get the opportunity to do so.
I don't think Ted is right that the flash hardware is in the
best place to handle "failures of its cells". In the
SSD model, the flash hardware doesn't do that anyway —
it's done by the file system on the embedded microcontroller
sitting next to the flash.
I am certain that we can do better than that in our
own file system code. All we need is a small amount
of information from the flash. Telling us about ECC
corrections is a first step, of course — when we've had
to correct a bunch of flipped bits using ECC, it's getting
on for time to GC the eraseblock in question, writing out a
clean copy of the data elsewhere. And there are technical
reasons why we'll also want the flash to be able to say
"please can you GC eraseblock #XX soon".
But I see absolutely no reason why we should put up with the
"hardware" actually doing that kind of thing for us, behind
our back. And badly.
Admittedly, the need to support legacy environments like DOS
and to provide INT 13h "DISK BIOS" calls or at
least a "block device" driver will never really go away. But
that's not a problem. There are plenty of examples of
translation layers done in software, where the OS
really does have access to the real flash but still presents
a block device interface on top of it. Linux has about 5 of them
already. The corresponding "dumb" devices (like the
M-Systems DiskOnChip which used to be extremely popular) are
great for Linux, because we can use real file systems on
them directly.
At the very least, we want the "intelligent" SSD devices to
have a pass-through mode, so that we can talk directly to
the underlying flash medium. That would also allow
us to try to recover our data when the internal "file
system" screws up, as well as allowing us to do things
properly from our own OS file system code.
Now, I'm not suggesting that we already have file
system code which can do things better; we don't. I wrote a file system which
works on real flash, but I wrote it 8 years ago and it
was designed for 16-32MiB of bitbanged NOR flash. We pushed
it to work on 1GiB of NAND (and even using DMA!) for OLPC,
but that is pretty much the limit of how far we'll get it to
scale.
We do, however, have a lot of interesting new work such
as UBI
and UBIFS,
which is rapidly taking the place of JFFS2 in the real
world. The btrfs design also lends itself very well to
working on real flash, because of the way it doesn't
overwrite data in-place. I plan to have btrfs-on-flash, or
at least btrfs-on-UBI, working fairly soon.
And, of course, we even have the option of using
translation layers in software. That's how I tested the TRIM
support when I added it to the kernel; by adding it to our
existing flash translation layer implementations. Because
when this stuff is done in software, we can work on
it and improve it.
So I am entirely confident that we can do much better in
software — and especially in open source
software — than an SSD could ever do internally.
Let's not be so quick to assume that letting the 'hardware'
do it for us is the right thing to do, just because it was
right 20 years ago for hard drives to do something which
seems vaguely similar at first glance.
Yes, we need the hardware to give us some hints about what's
going on, as I mentioned above. But that's about as far as
the complexity needs to go; don't
listen to the people who tell you that the OS would need to
know all kinds of details about the internal geometry of the
device, which will be changing from month to month as
technology progresses. The basic NAND flash technology
hasn't changed that much in the last ten years, and
existing file
systems which operate on NAND haven't had to make many
adjustments to keep up at all.