Advogato: Blog for etbe

SCSI Failures

For a long time it was widely regarded that SCSI was the interface for all serious drives that were suitable for “Enterprise Use” or for anything else which requires reliable operation. On the other hand IDE was for cheap disks that were only suitable for home use. The SCSI vs IDE issue continues to this day but now we have SAS and SATA filling the same market niches with the main difference between the current debate and the debate a decade ago being that a SATA disk can be connected on a SAS bus.

Both SAS and SATA have a single data cable for each disk which avoids the master/slave configuration on IDE and the issue of bus device ID number (from 0-7 or 0-15) and termination on SCSI.

Termination

When a high speed electrical signal travels through a cable some portion of the signal will be reflected from any cable end point or any point of damage. To prevent the signal reflection from the end of a cable you can have a set of resistors (or some other terminating device) at the end of the cable, see the Terminator(electrical) Wikipedia page [1] for a brief overview. As an aside I think that page could do with some work, if you are an EE with a bit of spare time then improving that page would be a good thing.

SCSI was always designed to have termination while IDE never was. I presume that this was largely due to the cable length (18″ for IDE vs 1.5m to 25m for SCSI) and the number of devices (2 for IDE vs 7 or 15 for SCSI). I also presume that some of the problems that I’ve had with IDE systems have been related to signal problems that could have been avoided with a terminated bus.

My first encounter with SCSI was when working for a small business that focused on WindowsNT software development. Everyone in the office knew a reasonable amount about computers and was happy to adjust the hardware of their own workstation. A room full of people who didn’t understand termination who fiddled with SCSI buses tended to give a bad result. On the up-side I learned that a SCSI bus can work most of the time if you have a terminator in the middle of the cable and a hard drive at the end.

There have been two occasions when I’ve been at ground zero for a large deployment of servers from a company I’ll call Moon Computers. In both cases there were two particularly large and expensive servers in a cluster and one of the cluster servers had data loss from bad SCSI termination. This is particularly annoying as the terminators have different colours, all that was needed to get the servers working was to change the hardware to make them look the same. As an aside the company with no backups [2] had one of the servers with bad SCSI termination.

Heat

SCSI disks and now SAS disks tend to be designed for higher performance, this usually means greater heat dissipation. A disk that dissipates a lot of heat won’t necessarily work well in a desktop case with small and quiet fans. This can become a big problem if you have workstations running 24*7 in a hot place (such as any Australian city that’s not in Tasmania) and turn the air-conditioner off on the weekends. One of my clients lost a few disks before they determined that IDE disks are the only option for systems that are to survive Australian heat without any proper cooling.

Differences between IDE/SATA and SCSI/SAS

In 2009 I wrote about vibration and SATA performance [3]. Rumor has it that SCSI/SAS disks are designed to operate in environments where there is a lot of vibration (servers with lots of big fans and fast disks) while IDE/SATA disks are designed for desktop and laptop systems in quiet environments. One thing I’d like to do is to test performance of SATA vs SAS disks in a server that vibrates.

SCSI/SAS disks have apparently been designed for operation in a RAID array and therefore will give a faster timeout on a read error (so another disk can return the data). While IDE/SATA disks are designed for a non-RAID situation and will spend longer trying to read the data.

There are also various claims about the error rates from SCSI/SAS disks being better than those of IDE/SATA disks. But I think that in all cases the error rates are small enough not to be a problem if you use a filesystem like ZFS or BTRFS but they are also large enough to be a significant risk with modern data volumes if you have a lesser filesystem.

Data Loss from Storage Failure

In the data loss that I’ve personally observed from storage failures the loss from SCSI problems (termination and heat) is about equal to all the hardware related data loss I’ve seen on IDE disks. Given that the majority of disks I’ve been responsible for have been IDE and SATA that’s a bad sign for SCSI use in practice.

But all serious data loss that I’ve seen has involved the use of a single disk (no RAID) and inadequate backups. So a basic RAID-1 or RAID-5 installation will solve most hardware related data loss problems.

There was one occasion when heat caused two disks in a RAID-1 to give errors at the same time, but by reading from both disks I managed to get almost all the data back, RAID can save you from some extreme error conditions. That situation would have been ideal for BTRFS or ZFS to recover data.

Conclusion

SCSI and SAS are designed for servers, using them in non-server systems seems to be a bad idea. Using SATA disks in servers can have problems too, but not typically problems that involve massive data loss.

Using technology that is too complex for the people who install it seems risky. That includes allowing programmers to plug SCSI disks into their workstations and whoever it was from Moon computers or their resellers who apparently couldn’t properly terminate a SCSI bus. It seems that the biggest advantage of SAS over SCSI is that SAS is simple enough for most people to be able to correctly install it.

Making servers similar to the systems that the system administrators use at home seems like a really good idea. I think that one of the biggest benefits of using x86 systems as servers is that skills learned on home PCs can be transferred to administration of servers. Of course it would also be a good idea to have test servers that are identical to servers in production so that the sysadmin team can practice and make mistakes on systems that aren’t mission critical, but companies seem to regard that as a waste of money – apparently the risk of down-time is cheaper.

Hot-swap Storage I recently had to decommission an old Linux server and...
lifetime failures (LF) This morning at LCA Andrew Tanenbaum gave a talk about...
Planning Servers for Failure Sometimes computers fail. If you run enough computers then you...

Syndicated 2013-05-28 08:24:05 from etbe - Russell Coker

28 May 2013 etbe » (Master)

Termination

Heat

Differences between IDE/SATA and SCSI/SAS

Data Loss from Storage Failure

Conclusion