The Most Important things for running a Reliable Internet Service
One of my clients is currently investigating new hosting arrangements. It’s a bit of a complex process because there are lots of architectural issues relating to things such as the storage and backup of some terabytes of data and some serious computation on the data. Among other options we are considering cheap servers in the EX range from Hetzner  which provide 3TB of RAID-1 storage per server along with reasonable CPU power and RAM and Amazon EC2 . Hetzner and Amazon aren’t the only companies providing services that can be used to solve my client’s problems, but they both provide good value for what they provide and we have prior experience with them.
To add an extra complication my client did some web research on hosting companies and found that Hetzner wasn’t even in the list of reliable hosting companies (whichever list that was). This is in some ways not particularly surprising, Hetzner offers servers without a full management interface (you can’t see a serial console or a KVM, you merely get access to reset it) and the best value servers (the only servers to consider for many terabytes of data) have SATA disks which presumably have a lower MTBF than SAS disks.
But I don’t think that this is a real problem. Even when hardware that’s designed for the desktop is run in a server room the reliability tends to be reasonable. My experience is that a desktop PC with two hard drives in a RAID-1 array will give a level of reliability in practice that compares very well to an expensive server with ECC RAM, redundant fans, redundant PSUs, etc.
My experience is that the most critical factor for server reliability is management. A server that is designed to be reliable can give very poor uptime if poorly maintained or if there is no rapid way of discovering and fixing problems. But a system that is designed to be cheap can give quite good uptime if well maintained, if problems can be repidly discovered and fixed.
A Brief Overview of Managing Servers
There are text books about how to manage servers, so obviously I can’t cover the topic in detail in a blog post. But here are some quick points. Note that I’m not claiming that this list includes everything, please add comments about anything particularly noteworthy that you think I’ve missed.
- For a server to be well managed it needs to be kept up to date. It’s probably a good idea for management to have this on the list of things to do. A plan to check for necessary updates and apply them at fixed times (at least once a week) would be a good thing. My experience is that usually managers don’t have anything to do with this and sysadmins either apply patches or not at their own whim.
- It is really ideal for people to know how all the software works. For every piece of software that’s running it should either have come from a source that provides some degree of support (EG a Linux distribution) or be maintained by someone who knows it well. When you install custom software from people who become unavailable then it puts the reliability of the entire system at risk – if anything breaks then you won’t be able to get it fixed quickly.
- It should be possible to rapidly discover problems, having a client phone you to tell you that your web site is offline is a bad thing. Ideally you will have software like Nagios monitoring the network and reporting problems via a SMS gateway service such as ClickaTell.com. I am not sure that Nagios is the best network monitoring system or that ClickaTell is the best SMS gateway, but they have both worked well in my experience. If you think that there are better options for either of those then please write a comment.
- It should be possible to rapidly fix problems. That means that a sysadmin must be available 24*7 to respond to SMS and you must have a backup sysadmin for when the main person takes a holiday, or ideally two backup sysadmins so that if one is on holiday and another has an emergency then problems can still be fixed. Another thing to consider is that an increasing number of hotels, resorts, and cruise ships are providing net access. So you could decrease your need for backup sysadmins if you give a holiday bonus to a sysadmin who uses a hotel, resort, or cruise ship that has good net access. ;)
- If it seems likely that there may be some staff changes then it’s a really good idea to hire a potential replacement on a casual basis so that they can learn how things work. There have been a few occasions when I started a sysadmin contract after the old sysadmin ceased being on speaking terms with the company owner. This made it difficult for me to learn what’s going on.
- If your network is in any way complex (IE it’s something that needs some skill to manage) then it will probably be impossible to hire someone who has experience in all the areas of technology at a salary you are prepared to pay. So you should assume that whoever you hire will do some learning on the job. This isn’t necessarily a problem but is something that needs to be considered. If you use some unusual hardware or software and want it to run reliably then you should have a spare system for testing so that the types of mistake which are typically made in the learning process are not made on your production network.
If you have a business which depends on running servers on the Internet and you don’t do all the things in the above list then the reliability of a service like Hetzner probably isn’t going to be an issue at all.
-  http://www.hetzner.de/hosting/produktmatrix/rootserver-produktmatrix-ex
-  http://aws.amazon.com/ec2/
- Servers vs Phones Hetzner have recently updated their offerings to include servers with...
- Why Internet Access in Australia Sucks In a comment on my post about (relatively) Cheap Net...
- The National Cost of Slow Internet Access Australia has slow Internet access when compared to other first-world...