Older blog entries for dwmw2 (starting at number 211)

Flash storage; a polemic

I originally posted this in response to the LWN coverage of the panel discussion at LinuxCon, but figure I should probably post it somewhere more sensible. So here goes... with a few paragraphs added at the end.

"The flash hardware itself is better placed to know about and handle failures of its cells, so that is likely to be the place where it is done, [Ted] said."

I was biting my tongue when he said that, so I didn't get up and heckle.

I think it's the wrong approach. It was all very well letting "intelligent" drives remap individual sectors underneath us so that we didn't have to worry about bad sectors or C-H-S and interleaving. But what the flash drives have to do to present a "disk" interface is much more than that; it's wrong to think that the same lessons apply here.

What the SSD does internally is a file system all of its own, commonly called a "translation layer". We then end up putting our own file system (ext4, btrfs, etc.) on top of that underlying file system.

Do you want to trust your data to a closed source file system implementation which you can't debug, can't improve and — most scarily — can't even fsck when it goes wrong, because you don't have direct access to the underlying medium?

I don't, certainly. The last two times I tried to install Linux to a SATA SSD, the disk was corrupted by the time I booted into the new system for the first time. The 'black box' model meant that there was no chance to recover — all I could do with the dead devices was throw them away, along with their entire contents.

File systems take a long time to get to maturity. And these translation layers aren't any different. We've been seeing for a long time that they are completely unreliable, although newer models are supposed to be somewhat better. But still, shipping them in a black box with no way for users to fix them or recover lost data is a bad idea.

That's just the reliability angle; there are also efficiency concerns with the filesystem-on-filesystem model. Flash is divided into "eraseblocks" of typically 128KiB or so, and getting larger as devices get larger. You can write in smaller chunks (typically 512 bytes or 2KiB, but also getting larger), but you can't just overwrite things as you desire. Each eraseblock is a bit like an Etch-A-Sketch: once you've done your drawing, you can't just change bits of it; you have to wipe the whole block.

Our flash will fill up as we use it, and some of the data on the flash will still be relevant. Other parts will have been rendered obsolete: replaced by newer data, or belonging to deleted files that aren't relevant any more. Before our flash fills up completely, we need to recover some of the space taken by obsolete data. We pick an eraseblock, write out new copies of the data in it which are still valid, and then we can erase the selected block and re-use it. This process is called garbage collection.
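The process above can be sketched as a toy model (Python, purely illustrative; the block size and the "fewest live pages" victim-selection policy are my assumptions here, and real flash file systems use far more elaborate heuristics):

```python
# Toy model of flash garbage collection. Each eraseblock is represented
# as the set of page numbers which still hold live (valid) data.

PAGES_PER_BLOCK = 64  # e.g. a 128KiB eraseblock of 2KiB pages (assumed)

def gc_pick_and_collect(blocks):
    """Pick the block with the fewest live pages, copy its live data
    out (modelled as a count), and erase it.
    Returns (victim_index, pages_copied)."""
    victim = min(range(len(blocks)), key=lambda i: len(blocks[i]))
    copied = len(blocks[victim])   # live pages must be rewritten elsewhere
    blocks[victim] = set()         # the whole block can now be erased
    return victim, copied

# A block that is entirely obsolete is the ideal victim: nothing to copy.
blocks = [set(range(60)), set(), set(range(10))]
victim, copied = gc_pick_and_collect(blocks)
print(victim, copied)  # -> 1 0
```

The point of the model: the cost of collecting a block is exactly the amount of still-live data in it, which is why keeping eraseblocks either mostly-live or mostly-obsolete matters so much.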

One of the biggest disadvantages of the "pretend to be disk" approach is addressed by the recent TRIM work. The problem was that the disk didn't even know that certain data blocks were obsolete and could just be discarded. So it was faithfully copying those sectors around from eraseblock to eraseblock during its garbage collection, even though the contents of those sectors were not at all relevant — according to the file system, they were free space!
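The cost of that blindness is easy to quantify in a toy calculation (all the numbers here are invented for illustration):

```python
# Without TRIM the "disk" must assume that every sector ever written is
# still valid, so garbage collection faithfully copies sectors the
# filesystem deleted long ago. With TRIM, it only copies what's live.

def gc_copy_cost(sectors_written, sectors_live, trimmed):
    """How many sectors GC must copy out of one eraseblock.
    sectors_written: sectors ever written in the block
    sectors_live:    sectors the *filesystem* still cares about
    trimmed:         True if the FS told the device about deletions."""
    return len(sectors_live) if trimmed else len(sectors_written)

written = set(range(64))   # the block was once completely full
live = set(range(8))       # but only 8 of those sectors still matter

print(gc_copy_cost(written, live, trimmed=False))  # -> 64
print(gc_copy_cost(written, live, trimmed=True))   # -> 8
```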

Once TRIM gets deployed for real, that'll help a lot. But there are other ways in which the model is suboptimal.

The ideal case for garbage collection is that we'll find an eraseblock which contains only obsolete data, and in that case we can just erase it without having to copy anything at all. Rather than mixing volatile, short-term data in with the stable, long-term data, we actually want to keep them apart, in separate eraseblocks. But in the SSD model, the underlying "disk" can't easily tell which data is which — the real OS file system code can do a much better job.

And when we're doing this garbage collection, it's an ideal time for the OS file system to optimise its storage — to defragment or do whatever else it wants (combining data extents, recompressing, data de-duplication, etc.). It can even play tricks like writing new data out in a suboptimal but fast fashion, and then only optimising it later when it gets garbage collected. But when the "disk" is doing this for us behind our back in its own internal file system, we don't get the opportunity to do so.

I don't think Ted is right that the flash hardware is in the best place to handle "failures of its cells". In the SSD model, the flash hardware doesn't do that anyway — it's done by the file system on the embedded microcontroller sitting next to the flash.

I am certain that we can do better than that in our own file system code. All we need is a small amount of information from the flash. Telling us about ECC corrections is a first step, of course — when we had to correct a bunch of flipped bits using ECC, it's getting on for time to GC the eraseblock in question, writing out a clean copy of the data elsewhere. And there are technical reasons why we'll also want the flash to be able to say "please can you GC eraseblock #XX soon".
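The kind of hint-driven policy described above is not complicated; a minimal sketch (the threshold value is an invented number, not anything a real driver uses):

```python
# If reading an eraseblock needed "too many" ECC corrections, schedule
# it for garbage collection so a clean copy of the data is written out
# somewhere fresh before the block degrades further.

ECC_SCRUB_THRESHOLD = 4  # corrected bitflips per read before we worry

def should_gc(corrected_bits):
    return corrected_bits >= ECC_SCRUB_THRESHOLD

reads = [("eb0", 0), ("eb1", 5), ("eb2", 2), ("eb3", 7)]
to_scrub = [blk for blk, flips in reads if should_gc(flips)]
print(to_scrub)  # -> ['eb1', 'eb3']
```

All the OS needs from the hardware to implement this is the corrected-bit count per read — exactly the sort of small hint argued for above.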

But I see absolutely no reason why we should put up with the "hardware" actually doing that kind of thing for us, behind our back. And badly.

Admittedly, the need to support legacy environments like DOS and to provide INT 13h "DISK BIOS" calls or at least a "block device" driver will never really go away. But that's not a problem. There are plenty of examples of translation layers done in software, where the OS really does have access to the real flash but a block device is still presented for compatibility. Linux has about five of them already. The corresponding "dumb" devices (like the M-Systems DiskOnChip, which used to be extremely popular) are great for Linux, because we can use real file systems on them directly.

At the very least, we want the "intelligent" SSD devices to have a pass-through mode, so that we can talk directly to the underlying flash medium. That would also allow us to try to recover our data when the internal "file system" screws up, as well as allowing us to do things properly from our own OS file system code.

Now, I'm not suggesting that we already have file system code which can do things better; we don't. I wrote a file system which works on real flash, but I wrote it 8 years ago and it was designed for 16-32MiB of bitbanged NOR flash. We pushed it to work on 1GiB of NAND (and even using DMA!) for OLPC, but that is fairly much the limit of how far we'll get it to scale.

We do, however, have a lot of interesting new work such as UBI and UBIFS, which is rapidly taking the place of JFFS2 in the real world. The btrfs design also lends itself very well to working on real flash, because of the way it doesn't overwrite data in-place. I plan to have btrfs-on-flash, or at least btrfs-on-UBI, working fairly soon.

And, of course, we even have the option of using translation layers in software. That's how I tested the TRIM support when I added it to the kernel; by adding it to our existing flash translation layer implementations. Because when this stuff is done in software, we can work on it and improve it.

So I am entirely confident that we can do much better in software — and especially in open source software — than an SSD could ever do internally.

Let's not be so quick to assume that letting the 'hardware' do it for us is the right thing to do, just because it was right 20 years ago for hard drives to do something which seems vaguely similar at first glance.

Yes, we need the hardware to give us some hints about what's going on, as I mentioned above. But that's about as far as the complexity needs to go; don't listen to the people who tell you that the OS would need to know all kinds of details about the internal geometry of the device, which will be changing from month to month as technology progresses. The basic NAND flash technology hasn't changed that much in the last ten years, and existing file systems which operate on NAND haven't had to make many adjustments to keep up at all.

2009-08-22 01:24:51 +0000 1MefLf-0004Xf-3x H=mailhost8a.rbs.com [] F=<OnlineBanking@Information.natwest.com> rejected after DATA: Your message lacks a Date: header, which RFC5322 says it MUST have.

Dear Mr. Woodhouse,

Thank you for your call of 26th August about not being able to accept notification emails.
I have investigated the matter and can confirm that the statement notification emails are sent out with the date on. The rfc5322 is an internet protocol only and we do not have to abide by this.

Our records show that the notification emails failed delivery on the 21st August due to an invalid email address. I hope this is a satisfactory resolution to your complaint.

Christ, where do I start with this? Yes, if you're claiming to be sending Internet email then you really do have to follow RFC5322. That's the standard that defines what Internet email is.

But that seems to be a red herring — he also claims that they are including a Date: header. Unfortunately, he's wrong. He's probably looking at an email which had the Date: header added in transit by the recipient's mail server. That would be obvious to anyone with a clue, because you can compare the datestamps in the Received: headers and observe that it matches one of the later ones, not the first.

And his diagnosis of the reason for the failure seems to be complete nonsense too, given that the SMTP rejection notice contained precisely the above text: "Your message lacks a Date: header, which RFC5322 says it MUST have.".

Well done, Nat West. Bonus points for stupidity today.

Remember last year when British Telecom kept closing fault tickets without actually fixing the fault or reading what we'd told them? Well, it's official — It is BT policy to ignore all information provided in a fault ticket. They admitted it:

"CRM Teams and customers have also been advised that the only action taken on these 'Amend requests' is to complete them to allow the fault to progress. CS Ops do not actively respond to any information on these requests."

Their current game is attempting to charge me £128,000 for installing a new phone line. That's apparently the full cost of upgrading the line plant into the village, which has been desperately needed for a long time; although they costed it up years ago, they haven't got round to doing it yet. Perhaps they were just waiting for a single individual consumer to pay for it?

Hahaha. Skype might have to shut down due to licensing problems.

I hope it does. Random crap using non-standard protocols and non-free software deserves to die — and the sheep who used it deserve what they get too.

I'm accustomed to technical support being fairly incompetent and clueless, but Acer seem to have taken it to a new level. They have taken to telling direct lies and seem to be attempting to defraud their customers.

I don't think I'll ever be buying Acer hardware again.

I bought an Acer laptop a couple of months ago, through Misco. I phoned Misco and tried to get them to ship it to me without the preinstalled Windows Vista operating system. They said that it was not possible.

At that point I should have taken my business elsewhere, but this was quite a good deal — ISTR it was a return, or something like that, so it was quite cheap. So I ordered the laptop anyway, and then when it arrived I declined to accept the End User Licensing Agreement, installed Linux on it and contacted Acer for my refund as indicated.

Acer's first response was that they would be able to refund the £20.30 that Windows Vista was worth, but that they "will require a £51.99 payment to have the machine brought in to the repair centre so we may remove this for you. This will cover the courier and engineer's labour fee."

This seems to be an obvious scam to prevent customers from obtaining the refund to which they are entitled, and I didn't accept it. I wrote a letter to their head office, returning the Windows serial number sticker and giving photographic evidence that Linux had been installed on the system, wiping the old operating system. And demanding my refund within one month or court proceedings would be issued.

Acer responded to this, retracting the demand for a £51.99 payment but still claiming that the laptop had to be shipped back to them at my expense. They said that they needed to "action the following:

  1. Validate that the Operating System has been removed from the Hard Disk.
  2. Remove the Microsoft COA (Certificate Of Authenticity) label
  3. Verify your proof of purchase to ascertain that you are in the specified timeframe to refund this product.
  4. To verify if any back up recovery disks have been made and if so, recovered from you.
  5. A signed form from you, which may be given to Microsoft and which agrees to hold Acer harmless from any claims by third parties in the event that you have produced any false information on the request."

I pointed out that it was not necessary for them to have the system shipped back to them to achieve their requirements. I offered them remote access to the system in order to verify that there was no trace of Windows left on the hard drive, and asked for a copy of the form they mentioned. I also gave them a copy of my proof of purchase, reminded them that I'd already sent back the sticker, and stated that I had made no backup copies.

At this point, they went silent and stopped responding to my email — even when I reminded them that the deadline was approaching and I was about to file the court claim for my refund. They did eventually start responding again after about two months, when I informed them that I had finally got round to filing the court case.

This did seem to get their attention, but they still claimed that they needed the system to be shipped back to them. When I spoke to an engineer on the telephone, he claimed that it wasn't sufficient merely to check that the hard drive had been wiped, and compare the serial number reported by its firmware with the one in their records. He said they had to actually take the laptop apart and read the serial number from the label on the hard drive, because I might have put a different hard drive into the laptop and flashed its firmware so that it pretended to have the same serial number as the original.

I pointed out that this was somewhat far-fetched, and if I was so inclined it would be much easier for me to just copy Windows off the original hard drive, send it back to them for validation, then put it back again afterwards. He agreed, but said that their agreement with Microsoft was that they must verify that the OS had been removed from the original hard drive — what happened after that wasn't their problem.

At this point, with the court proceedings already filed, they agreed to pay for the courier (and the court costs). Since it would only take a few days, I conceded. Before shipping it off to them, however, I took a screwdriver and carefully aligned all the screws so that I could tell if it had been opened.

Imagine my surprise when it came back and they hadn't opened the case! Despite all their protestation that they needed physical access, and that they had to open the case and physically read the serial numbers from the hard drive, when they finally got the opportunity to do so they didn't bother.

All they did was check the partitioning and serial number through software — which they could have done months ago, remotely.

As far as I can tell, it's just a huge scam to prevent customers from claiming the refund for the unlawfully-bundled software, by making it cost more to do so than they get in the refund. I certainly would have given up a long time ago if it wasn't for the principle of the thing.

Now it seems entirely clear that Acer are simply attempting to defraud their customers, though, I shall be reporting it to Trading Standards to see what they have to say about the matter.

Software makes me sad sometimes.

Every time the iwlagn driver crashes and has to be reloaded (and it does that distressingly often, since it doesn't seem to reset the device and recover when its closed-source firmware crashes), NetworkManager kills the connection and restarts completely. Not unreasonably, I suppose.

But then, all NFS mounts get automatically unmounted, which is a complete pain in the arse.

And my VPN connection is reset, and because Cisco are stupid I don't get the same VPN IP address next time I connect, even if it is still available. (I think I ought to be able to work around this from the client side, if I don't mind storing the authentication cookie on the client machine.)

Although having said that, the main reason I'd want my IP address to remain the same is so that my connection to the mail server can persist and I don't have to wait through Evolution's painfully slow startup.

Unfortunately, Evolution also responds to the network offline/online events by reporting -EAGAIN errors all the time when it auto-saves emails that you're composing, and stops being able to display mail folders — the index just comes up empty. So it needs to be killed and restarted too. (This has been in bugzilla since November last year).

Software makes me sad sometimes.

Q: My application has a command-line option to use an SSL client certificate. What is the OpenSSL function to load and use the certificate from a file?

A: Well, we make this lots of fun for you — it would be boring if there was just one function which you could pass the filename to. You have to write 230 lines of code like this instead.... First you have to check for yourself what type of file it is — is it a PKCS#12 file, is it a PEM file with a key in it, or is it a TPM key 'blob'?

No, there's no function which determines that for you — you have to do it yourself. And depending on the answer, you have to do three entirely different things to load the key.
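One way to do that detection yourself — a sketch of the general approach, not what any particular application does — relies on the fact that PEM files are text framed by "-----BEGIN ..." markers, while PKCS#12 files are DER-encoded ASN.1, which always starts with a SEQUENCE tag byte (0x30). The "TSS KEY BLOB" marker for TPM keys is an assumption based on the common PEM-style wrapping of such blobs:

```python
# Sniff the type of an SSL key/certificate file so we know which of the
# three loading paths to take. Purely illustrative.

def sniff_key_file(data: bytes) -> str:
    head = data[:4096]
    if b"-----BEGIN TSS KEY BLOB-----" in head:
        return "TPM"
    if b"-----BEGIN " in head:
        return "PEM"
    if head[:1] == b"\x30":   # DER SEQUENCE tag
        return "PKCS#12 (probably)"
    return "unknown"

print(sniff_key_file(b"-----BEGIN RSA PRIVATE KEY-----\n..."))  # -> PEM
print(sniff_key_file(b"\x30\x82\x04\x00..."))  # -> PKCS#12 (probably)
```

Note that even this is heuristic: any DER-encoded structure starts with 0x30, so a bare DER certificate would be misidentified as PKCS#12 — which is rather the point being made here about having to guess.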

To make things even more fun, those three file types have wildly different ways to handle their passphrase/PIN:

  • For a PEM file, you can't tell OpenSSL the passphrase in advance. If the user gave it on the command line, you have to manually override the user interface callback that OpenSSL would invoke, and make your replacement function return the pre-set passphrase. If you do ask the user instead, you've got no easy way to tell whether they got the passphrase wrong: if they get it wrong (and type 4 or more characters) then the 'load key' function will fail, and you have to compare against a special error code which may differ from version to version of OpenSSL, because it encodes internal function names. Just for variety, if the user enters a wrong passphrase with fewer than 4 characters, they'll get no feedback and will just be asked again immediately.

  • For a PKCS#12 file, it's the other way round — you have to give the passphrase in advance, so you have to ask the user for it yourself. Even if the file isn't actually encrypted — because you don't know that yet.

  • For a TPM key it's a bit saner — you can either set the PIN in advance or otherwise OpenSSL will ask the user for it if necessary. But you do have to jump through various other hoops to use the TPM 'engine', instead of just pointing OpenSSL at the file and having everything handled for you.

Excuse me while I bash my head against a brick wall for a while...

And no, the answer is not "don't use OpenSSL then".

At least, not until one of the potential replacements actually starts to catch up with the features I need — support for using a TPM for certificates, and DTLS support.

WTF? Case-sensitive, but not case-preserving...

Why are people so bloody clueless about email? I received this in snail mail from my bank today:

Account Number xxxxxxxx Sort Code xx-xx-xx
Your statement
Your statement for the above account, is ready to view by logging in to online banking at www.natwest.com.

Unfortunately, we have been unable to deliver this alert to you by email. This may be because the email address we hold for you (DAVID@WOODHOU.SE) is incorrect.

That has to be almost the most clueless bug report I've ever seen. It should have included at least some of:

  • Precise date and time of the latest delivery attempt
  • Sender's email address
  • Sending server IP address
  • Which MX host was being delivered to
  • The rejection message from the MX host

If I hadn't been running my own mail server, I'd have had no way to work out what happened — no ISP is going to go trawling through their logs looking for a needle in a haystack based on virtually nothing.

Since I do run my own, I was able to log into all the MX hosts for that domain and look through the historical mail logs on each of them, and I happened to find their failed message among all the other people trying to fake mail from NatWest:

2009-04-21 00:38:20 +0000 1Lw40C-0002sE-3D H=mailhost7a.rbs.com [] F=<OnlineBanking@Information.natwest.com> rejected after DATA: Your message lacks a Date: header, which RFC5322 says it MUST have.

Upon calling them to tell them of their problem, I was asked "who says our mails lack a Date: header?" and "who says that they should?".
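RFC5322 does, for one. Getting this right costs a single line with any competent mail library; a minimal sketch using Python's stdlib (addresses invented):

```python
# Construct an outbound message with the Date: header that RFC5322
# requires -- the one the bank's system is failing to add.

from email.message import EmailMessage
from email.utils import formatdate

msg = EmailMessage()
msg["From"] = "OnlineBanking@example.com"
msg["To"] = "customer@example.com"
msg["Subject"] = "Your statement"
msg["Date"] = formatdate(localtime=True)  # RFC5322 date format
msg.set_content("Your statement is ready to view.")

print("Date" in msg)  # -> True
```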

After dealing with that, I left the first-line support person with three items to pass on to Nat West's technical team:

  1. The lack of Date: header on their outbound mail
  2. The uselessness of the letter they send when they can't deliver email
  3. The fact that they are converting email addresses to upper case, when localparts may well be case-sensitive

I wonder what the odds are of any of them actually getting fixed?

Maybe I should have added "you're sending outbound mail without GPG-signing it" as a fourth item? :)

26 Mar 2009 (updated 26 Mar 2009 at 08:55 UTC) »

Today is the third birthday of GNOME bug #336076, which I filed to report a particularly idiotic regression in Evolution's IMAP code. (Update: It looks like I also posted about it on Advogato, too.)

Instead of just issuing a simple STATUS command to check the status of each folder for new mail, Evolution started to actually open the folder, fetch the headers for all new mail in it, re-fetch the flags for all mail in it.... and it does this for every folder that it's checking (which, with bug #336074 still unfixed, is all folders — not just the active folders. So in my case it was continuously re-fetching the flags for years of archived mail in folders which are marked on the server as being inactive.)

This meant that it took Evolution two HOURS to start up that first time, when connected across the Internet. Even when I ran it on a local machine which was connected to the server by Gigabit Ethernet, it still took 23 minutes to start up; downloading half a gigabyte of mail before it was usable.

I don't know what's scarier — the fact that this utterly moronic regression got into the code base in the first place (what in fuck's name were they thinking?), or the fact that GNOME 2.26 went out last week with it still not fixed, three years later.

I've actually moved my older archived mail folders off to a separate server to work around bug #336074, and I've stopped checking for new mail in folders other than the INBOX to work around bug #336076, which is a PITA but is the only way to keep Evolution even vaguely usable — and it's still extremely bad over a slow connection, such as GPRS (or connecting home from China).

It's not just at startup, either. It goes off into the weeds frequently, doing this stuff in the "background" while I'm waiting for it to fetch the mail I just clicked on. Sometimes, I end up using pine to read my email while I'm waiting for Evolution to do whatever weird crack-inspired stuff it's doing with the IMAP server and start responding again.

I think it's about time that the choice of default mail client for GNOME was re-evaluated. Evolution seems to be mostly stagnant, and the changes that are being made seem to be entirely dubious. Version 2.24 was a significant regression in many ways. Evolution is definitely letting the side down.

This kind of post inevitably leads to people mailing suggestions for an alternative MUA. Changing MUAs is a painful process, but I think after the 2.24 release I've reached the point where I'm going to have to give up on Evolution. Things I really need from the MUA are:

  • Graphical folder 'tree' showing the number of new mails in each folder (currently broken/disabled in Evolution as described above).
  • Ability to reach mail server over ssh: ssh $MAILSERVER exec imapd
  • No mangling of outgoing or incoming patches

As far as I'm aware, the latter two requirements rule out Thunderbird. I think I'm going to try Sylpheed. Last time I did that, it would SEGV at startup, which quickly put me off — but I'm sure that's fixed now, and I've heard good things about it. The next alternative, if I can't get on with that, is probably KMail.

Whatever I use, it would also be nice if it handled the calendar stuff that the Outlook/Exchange weenies use — preferably with the calendar on the Exchange server, but just using its own calendar (as I do in Evolution) would be fine.

(Of course, Evolution being the steaming pile of crap that it is, it fucks up the calendaring too. It has its own idea of what the timezone is, perhaps because it thinks it might be in a different timezone to the rest of the system? So for someone who travels a lot and uses the calendar infrequently, it's fairly much guaranteed that a meeting will be displayed in some arbitrary, wrong, timezone. And just for fun, it stupidly displays the meeting times without any hint about the time zone. )
