11 Jul 2014 etbe   » (Master)

Improving Computer Reliability

In a comment on my post about Taxing Inferior Products [1] Ben pointed out that most crashes are due to software bugs. Both Ben and I work on the Debian project and have had significant experience of software causing system crashes for Debian users.

But I still think that the widespread adoption of ECC RAM is a good first step towards improving the reliability of the computing infrastructure.

Currently when software developers receive bug reports they always wonder whether the bug was caused by defective hardware. So when bugs can’t be reproduced (or can’t be reproduced in a way that matches the bug report) they often get put in a list of random crash reports and no further attention is paid to them.

When a system has ECC RAM and a filesystem that uses checksums for all data and metadata we can have greater confidence that random bugs aren’t due to hardware problems. For example if a user reports a file corruption bug they can’t repeat that occurred when using the Ext3 filesystem on a typical desktop PC I’ll wonder about the reliability of storage and RAM in their system. If however the same bug report came from someone who had ECC RAM and used the ZFS filesystem then I would be more likely to consider it a software bug.

The current situation is that every part of a typical PC is unreliable. When a bug can be attributed to one of several pieces of hardware, the OS kernel and even malware (in the case of MS Windows) it’s hard to know where to start in tracking down a bug. Most users have given up and accepted that crashing periodically is just what computers do. Even experienced Linux users sometimes give up on trying to track down bugs properly because it’s sometimes very difficult to file a good bug report. For the typical computer user (who doesn’t have the power that a skilled Linux user has) it’s much worse, filing a bug report seems about as useful as praying.

One of the features of ECC RAM is that the motherboard can inform the user (either at boot time, after a NMI reboot, or through system diagnostics) of the problem so it can be fixed. A feature of filesystems such as ZFS and BTRFS is that they can inform the user of drive corruption problems, sometimes before any data is lost.

My recommendation of BTRFS in regard to system integrity does have a significant caveat, currently the system reliability decrease due to crashes outweighs the reliability increase due to checksums at this time. This isn’t all bad because at least when BTRFS crashes you know what the problem is, and BTRFS is rapidly improving in this regard. When I discuss BTRFS in posts like this one I’m considering the theoretical issues related to the design not the practical issues of software bugs. That said I’ve twice had a BTRFS filesystem seriously corrupted by a faulty DIMM on a system without ECC RAM.

Related posts:

  1. Reliability of RAID ZDNet has an insightful article by Robin Harris predicting the...
  2. Improving Blog Latency to Benefit Readers I just read an interesting post about latency and how...
  3. Banking with an Infected Computer Bruce Schneier summarised a series of articles about banking security...

Syndicated 2014-07-11 04:51:55 from etbe - Russell Coker

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!