Advogato: Billion Dollar Bugs

Posted 15 Feb 2002 at 19:58 UTC by bstpierre

So you don't write code for NASA? Your code isn't running on a life support system or medical equipment? If there's a bug, QA will find it, and besides, you're behind schedule and you needed to get this out the door last week! Right?

Wrong. Your bugs, design flaws, and security holes could cost your customers billions of dollars. If the business environment ever changed so that software vendors were liable for defective products just like any other vendor, your company could be bankrupted by the lawsuits.

The point is that we shouldn't just be looking at the SNMP stack we bought. We should also look at code that we've developed in-house for bugs. "Well, duh", I can hear you saying. "That's why you have a QA group." Of course we have a QA group. We have excellent testers! But do we really expect them to write special software that attempts to exploit every possible nook and cranny in our system? Not exactly.

Testing groups are more concerned with black box testing. They don't usually get involved with examining the source code, so they aren't exactly in a position to know where the nooks and crannies are hidden. (Hmm, funny thing. I have a hankering for an English muffin right now.)

I've heard people remark, "We don't need to operate at that level. We're not writing code for the space shuttle." Implicit in this remark is the idea that NASA programmers are very methodical and cautious. Some might say they're the best in the business. Part of the reason for being careful on code for, say, the space shuttle, is the enormous cost and the potential for loss of life. According to the NASA website, the Space Shuttle Endeavour cost $1.7 billion.

If you are writing code for a widely deployed piece of software and there is a serious flaw, the costs imposed by that bug can be enormous. Case in point: Outlook Express is installed on millions of computers around the world. This software was obviously not scrutinized (neither the design nor the code would be my guess) for security flaws. But this was obviously ok at the time, because the software was a) largely being distributed for "free", and b) there are no lives at stake -- it's a mail client. Right?

Wrong. According to this article at cnet (and other sources), the "Love Bug" virus caused an estimated $8.75 billion in damages. That's more than five space shuttles! What would it have cost the programmers at Microsoft to think a little bit more carefully about their design for Outlook Express? Two or three million, tops? I guess it doesn't matter now that Microsoft has made a strong commitment to creating secure software. "SirCam", which also propagates through email, cost $1 billion. "Nimda" was a little more clever, attempting to replicate through multiple methods. One expert estimated that Nimda cost $500 million in one week! Hey, wait, I'm not finished: "Code Red" exploited multiple flaws in Microsoft's IIS, costing an estimated $2.6 billion.

That's over $12 billion in costs to users -- and those are just Microsoft's programs. I could talk about bugs in sendmail and other widely deployed open source software, but I won't. (Because you should be using qmail.)
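As a quick sanity check on that arithmetic, here is a small sketch that adds up the estimates quoted above and compares them with the shuttle figure. The constant and function names are made up for illustration; the numbers are only the ones cited in this article.

```python
# Estimated damages quoted in the article, in millions of US dollars.
DAMAGES_MILLIONS = {
    "Love Bug": 8750,
    "SirCam": 1000,
    "Nimda": 500,      # one expert's estimate for a single week
    "Code Red": 2600,
}

# Cost of the Space Shuttle Endeavour as cited above, in millions.
SHUTTLE_COST_MILLIONS = 1700


def total_damages(damages):
    """Sum all damage estimates (millions of dollars)."""
    return sum(damages.values())


def shuttles_equivalent(cost_millions, shuttle_cost=SHUTTLE_COST_MILLIONS):
    """Express a cost as a multiple of one shuttle's price."""
    return cost_millions / shuttle_cost


if __name__ == "__main__":
    total = total_damages(DAMAGES_MILLIONS)
    print(f"Total: ${total / 1000:.2f} billion")
    ratio = shuttles_equivalent(DAMAGES_MILLIONS["Love Bug"])
    print(f"Love Bug alone: {ratio:.1f} shuttles")
```

Working in whole millions keeps the sums exact and avoids floating point rounding in the total.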

I haven't heard any stories of anyone being killed or injured because of these viruses, but the situation could easily arise: imagine that a worm like Code Red (but with a more vicious payload) is highly active at the same time a major earthquake hits on the US West Coast. People are calling 911 to report emergencies, but the emergency dispatch centers are under attack by the worm. Or people are trying to make calls or send emails to let family members know that they are ok, but major networks are swamped because of the extra traffic introduced by the worm. Or even something as basic as hospitals' information systems getting swamped because of internal or external traffic caused by worm infections. A ten minute delay in getting an ambulance to a victim can be the difference between life and death. Even though you're just writing a web server, you have the potential to either save lives or kill people. That's scary stuff, from where I sit.

The economic costs related to viruses and virus-like software were $17.1 and $13.2 billion in 2000 and 2001, respectively. I'm not sure if those estimates take into account the tremendous amounts we spend on anti-virus (AV) software (the AV market is $2.8B). This wouldn't be necessary if systems were designed for virus resistance in the first place.

The same goes for firewall software. According to this report, firewall software is the fastest growing software market, with AV software right behind. These systems shouldn't be as important as they are! Instead of putting layers of armor over all of these defective products, why not just make the products bullet-proof to begin with? Because after all, firewall software and AV software are, well, software -- and are not immune to bugs and security flaws.

If we got into the habit of creating software that was immune to viruses like Code Red and Love Bug, we'd gain from a reduction in regular bugs (the ones that make software crash without malicious intervention). Those cost money too: lost productivity, lost data, inaccurate data, lost business, etc.

I'm even going to go out on a limb: bulletproofing all software, not just the stuff that NASA uses, would cost nothing. The reason that I say this is that the techniques that are often used to produce "zero defect" software would provide offsetting gains. We'd gain from programmers' increased productivity, better schedule predictability, and decreased testing requirements. Not to mention that our systems would be far less vulnerable to viruses and malicious intruders.

The original appears here. Feedback appreciated.

Government employees and elected officials are held accountable for their actions. If they cause public harm, they will be punished.

I don't know what happens if a billion dollars are wasted due to a software bug by a NASA engineer. What punishment will he/she face?

Free software authors perform public service by giving away their software for anyone to use. And people rely on free software too. And we know the GPL and other licenses say "no warranty..."

If someone dies because of a failure in Linux running on an embedded device somewhere, and someone sues Linus Torvalds, what happens?

(Of course Linus is not elected, and he is not paid for working on Linux, but the scenario, that someone sues him because he is pissed off by some failure of Linux somewhere, is possible...)

What if someone sues Linus?

If they tried they'd get laughed out of court. But that really misses my point: taking the time to weed out the bugs up front will yield nice dividends down the road. Bugs cost more than you might expect, especially as we come to depend on software more and more as part of our daily lives.

And how much revenue generating business never would have existed if the software which you claim has caused so many losses had never existed?

It should be obvious by now that it is impossible to educate the largely ignorant programming masses about safe coding practices (how can we accept low level language code from someone who doesn't understand assembly language and computer internals? It happens all the time!). Higher level languages that are much more immune to such problems are one good solution (Python, Java and Perl instead of C/C++/C#).

also: some of us choose not to use qmail because of the author...

My local LUG just had a presentation on Extreme Programming. For those of you who have never heard of it (or didn't know what it was), Extreme Programming is a methodology for software development that, among other things, stresses the importance of doing Q/A at every step of the process.

Good luck, posted 16 Feb 2002 at 02:56 UTC by apenwarr »

You've struck close to my heart :) I disagree with this article completely.

I've heard this argument many, many times in many, many forms. There are two problems with it:

1. It shows a fundamental misunderstanding of engineering concepts. (Don't feel bad; most times I've heard this argument, it's been from engineers.) Let's put it bluntly: You will NEVER, EVER EVER produce a perfect product, no matter how hard you try. Trust me on this. If you're striving for perfection, give up now. All you can do is try to make your system "better," which is what we're all trying to do.

2. Customers don't want bug-free products, and so the laws aren't going to change. Or, put another way, customers prefer the product with more features, not the one with fewer bugs. Bugs in Windows cut my productivity by, say, let's be cruel, 50%. But features of Windows make my new work possible. It's worth it. Let's think of this another way: if we offered customers two choices, Windows 95 with no bugs and Windows XP with bugs, which would they choose? XP, I'm certain of it, and not just because they're stupid. It's because the advantages (to them) outweigh the disadvantages.

By the way, the whole concept that "thinking more" during the design of Outlook would have only cost $3 mil and solved all kinds of problems is very flawed. Being a few months later with a product like Outlook could have cost Microsoft leadership in the email reader market and thus _most_ of their profits. And a "proper" design for security and an implementation to greatly reduce the number of bugs would have taken many, many months, not just a few.

ObNote: I actually hate using most Microsoft products. But I _really_ dislike people who don't understand why people buy them.

Avery

Perfection is not required to achieve large gains, at least in complex physical systems.

In post mortem engineering failure analysis it is most often found that catastrophic failure (leading to death, high property damage, etc.) is the result of multiple failures. Engineering safety margins and good design generally allow safe system shutdown with a single failure.

The point is not perfection. The postulate proposed is that a small improvement in attitude towards eliminating bad practices and delivering higher quality components will have a dramatic effect on overall system quality. Some of these reports on the shuttle accident are worth reading if you are interested in the subject of high quality complex systems. The shuttle is about as complex as it gets and it has plenty of software. What killed it in the end was not the inability to build perfect components (it has sufficient redundancy and engineering margin to fly successfully with a few failing components, just not all the seals or all the gyroscopes or complete structural collapse) but the unwillingness of the NASA system to allow people to take responsibility and authorize them to deliver adequate components.

The initial Ariane 5 failure is perhaps a better example: All of its components worked as originally designed but the system failed on first launch. Some software components were reused from Ariane 4 and it was not spotted later in the program that this would cause some software components used on the ground but not in flight to generate failure alarms when a flight regime change invalidated the software designer's design criteria. The alarm results in flight controller shutdown and switchover to a backup controller. The backup controller experienced the same fault and shut down. Now out of control, the rocket had to be destroyed. The failure analysis provides detailed discussion of the sequence of decisions and design errors at the system level that led to the overall failure. A single small improvement anywhere in this sequence, just as with the shuttle, would have eliminated the failure. An imperfect design process slightly better than the one that occurred would have eliminated the catastrophe.

While software is certainly not physical, it is often present in today's complex systems. It is often the most complex subsystem or component in a system. I see no reason to expect that it is immune to trends that have been quantified in engineering analysis of complex systems. For additional examples to support the thesis a google search for NTSB reports on airliner crashes or highway accidents might be instructive. Recall that in the recent Firestone fiasco, sports utility vehicles were more susceptible to rollover on blowout. Why? High center of gravity might be viewed as a design defect for high speed freeway travel, add the second defect of tire blowout (instead of a slow leaking flat) and then the third of a poor suspension system for high speed freeway travel, and the rollover is much more likely.

I agree with you that the real problem and thus the real solution lies with the users. The sports utility vehicle example above amply illustrates your point. Two of the design defects for freeway travel result from the users' insistence that they want an off road vehicle which they then drive like a sports car on the freeway. If they do not accept unnecessarily buggy software of overall low "quality" (comments, readability, logical flow and organization, etc.), the "quality" will start rising. While few get paid directly to deliver free/open source software there seem to be a lot of developers around moaning because no user/contributor/fellow developer will bother putting it on their machine and assisting with the debugging/development/maintenance. Perhaps the indirect benefits are sufficient to desire a market.

If so, the users can apply market forces. They can vote with their feet and move to projects where it is fun to take credit for successfully isolating and helping fix an easy isolated bug, rather than tackling sloppy, unreadable, bug-ridden development code that provides approximately the same opportunity to shine as the lottery, at much greater expense if they place any value on their time.

In general, it looks to me like the higher quality code wins. It merely takes a while for volunteer projects to catch up to commercial code developed by teams of paid professionals. As successive waves of developers/users catch on that the highly successful projects (with a few exceptions) tend to have high quality code, and that lower quality projects are ready targets for displacement by killer app clones with a focus on high quality, their code will get better. Until the next wave.

For fun let's call component quality a function of utility, efficiency, and correctness while defining system quality as a function of ease of installation, conflict free operation, and lack of conflicting dependencies.

Any long time community participants willing to state a crude opinion or estimate regarding whether the free/open community average quality is going up or down?

How about if we restrict the average quality estimated to the "successful" widely used projects that have hit critical mass and are in steady demand and maintenance?

Or at least will be, after I have educated him.

As a professional programmer, I always start by discussing priorities with my clients. What features are they willing to pay for? What level of robustness are they willing to pay for? In general, they tend to have a fixed amount of time or money for the project and a number of essential features, and are willing to spend their own time working around any bugs and misfeatures left within these parameters.

[ Of course, clients always initially expect perfect omnipotent software delivered yesterday, but I have always found it worth spending the first couple of meetings teaching them about reality. ]

It makes perfect sense. A bug free program which does not solve the client's problem is useless. A buggy program, on the other hand, may very well save the client lots of work for the (hopefully) majority of cases where it works right, which more than compensates for the manual workaround needed for the remaining cases.

As a hobbyist programmer, I don't care. I implement the features I need or find interesting, and fix bugs that become too annoying to work around. If anyone else wants to use the software and has a problem with that, they can send me a patch, or become my client and pay me to fix it.

In a broader sense, the market also agrees, especially on the desktop. The companies that have brought features to the market first have, in general, outcompeted the companies that have spent more time testing.

And no, there is no magic bullet (sorry Brad Cox), neither in methodology nor in tools, that removes the need for a tradeoff between features and robustness. There are lots and lots of small techniques and tools that make either or both easier to accomplish, reducing the cost of development but not removing the tradeoff. Some people think language features like garbage collection eliminate memory leaks, which is not true for any useful definition of memory leak. I have spent days tracking down memory leaks in Lisp code. However, such tools make them much rarer and easier to avoid, thus saving valuable time.
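To make the garbage collection point concrete, here is a minimal sketch in Python (not the Lisp code mentioned above). The class and function names are invented for illustration. The collector only reclaims objects that are no longer reachable, so a cache that keeps a reference to every result grows without bound even though nothing is "leaked" in the C sense.

```python
import gc
from collections import OrderedDict


class LeakyCache:
    """Keeps every result forever, so the collector can never reclaim any of them."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def __len__(self):
        return len(self._items)


class BoundedCache:
    """Same interface, but evicts the oldest entries once max_items is exceeded."""

    def __init__(self, max_items):
        self._items = OrderedDict()
        self._max_items = max_items

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self._max_items:
            self._items.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._items)


def simulate(cache, requests):
    """Store one 1 KiB buffer per request, force a collection, report what survives."""
    for request_id in range(requests):
        cache.put(request_id, bytearray(1024))
    gc.collect()  # a full collection does not help: the entries are still reachable
    return len(cache)


if __name__ == "__main__":
    print("leaky cache entries:  ", simulate(LeakyCache(), 500))
    print("bounded cache entries:", simulate(BoundedCache(100), 500))
```

The bounded variant is one of those "small techniques": it costs a few extra lines and a decision about the size limit, which is itself a features-versus-robustness tradeoff.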

mirwin: you said many right things, but I think they support my argument more than yours.

In particular, in the case of space shuttle accidents being avoidable with "just a small improvement here or there" - of course! That can be said of nearly all bugs. But remember, we think of NASA as doing some of the best software/hardware engineering in the world. If they made a mistake, my opinion is: so will we!

Car companies also spend astonishingly more money on quality control and testing than most software companies do, and you still get disasters occasionally.

The difference between "normal" software companies and NASA is that for every $1 we spend, we could squash a lot more bugs than NASA could. That's because we have a lot more bugs. But our software also evolves faster, has more features, and meets a wider range of needs than most NASA equipment, and for the most part it still works pretty well.

Maybe our features/quality tradeoff needs tweaking in one direction or the other, but I think we're pretty close. Windows hasn't killed me yet.

I'd like to recommend that everyone read John Lakos' Large Scale C++ Software Design in which he promotes "design for testability" in software.

I think this book would be valuable reading even if you don't program in C++, although it would help to understand the language at least a little.

There is a condensed version of some of the most essential material from the book in three articles by Lakos that are reprinted in More C++ Gems.

Design for Testability has been a huge success in hardware manufacturing, where it became increasingly difficult to test integrated circuit chips at their external pins because more and more functionality was hidden within the chip in complicated ways. It was only by coming up with standards for circuitry devoted specifically for test during manufacturing that we are able to enjoy the complex electronics we have today.

Software presents some tremendous advantages to the tester, but it is necessary for the tester to be a programmer. First, each individual instance of a class does not need to be tested, as is the case with hardware - if all the interfaces of a class can be thoroughly tested, then you can be sure that any new instances of the class will continue to work. So it is actually a lot cheaper to test software, if you do it at all.

Secondly, if you have already tested the classes that a new class uses, then you only need to test the new functionality provided by the new class - the way it puts its components together, or the extensions it provides to its base class, and so on.

Finally, if you make the physical design of your codebase so that there are no cyclic dependencies between different components, you can divide the project up into "levels", and test all the lower levels first, then the next level up (the first level that uses the lowest levels) and so on up.

Whereas the total number of execution paths through a large modern program is enormous, using Lakos' design and testing strategy keeps the effort required to write tests and run them quite manageable.
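The levelization idea can be sketched in a few lines. This is only an illustration in Python, not code from Lakos' book: the component names are hypothetical, and numbering components with no local dependencies as level 1 is an assumption of this sketch. Given a map from each component to the components it uses, it assigns levels and refuses to continue if it finds a cycle.

```python
def assign_levels(dependencies):
    """Map each component to 1 + the highest level among the components it uses.

    Components with no dependencies get level 1 (an assumption of this sketch).
    Raises ValueError if the dependency graph contains a cycle.
    """
    levels = {}
    in_progress = set()

    def level_of(component):
        if component in levels:
            return levels[component]
        if component in in_progress:
            raise ValueError(f"cyclic dependency involving {component!r}")
        in_progress.add(component)
        used = dependencies.get(component, ())
        level = 1 + max((level_of(dep) for dep in used), default=0)
        in_progress.discard(component)
        levels[component] = level
        return level

    for component in dependencies:
        level_of(component)
    return levels


def bottom_up_order(dependencies):
    """List components lowest level first, so each is tested after what it uses."""
    levels = assign_levels(dependencies)
    return sorted(levels, key=lambda name: (levels[name], name))


if __name__ == "__main__":
    # Hypothetical components, for illustration only.
    deps = {
        "string": [],
        "list": [],
        "parser": ["string", "list"],
        "app": ["parser", "string"],
    }
    print(assign_levels(deps))
    print(bottom_up_order(deps))
```

Running the tests in this order means that when a test at one level fails, the components below it have already passed, which narrows down where to look.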

I can personally attest to the usefulness of the "test first" strategy that is promoted by the eXtreme Programming folks. The main usefulness of it to me is to get over the hump of getting something new started - I always have the hardest time beginning a new project. Coding up a simple test is always a straightforward way to begin work and it gets me into the groove. Towards the end of a big project I have a lot of tests I can run to make sure nothing breaks.

I would really like to see more unit testing done.

While unit testing is primarily promoted as part of an object oriented practice, there's no reason it can't be applied to other methodologies, like procedural programming in C.
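As a rough sketch of unit testing without any classes, here is a plain function with plain test functions. It is written in Python rather than C only to keep it short; the function `parse_port` and its rules are hypothetical, and the same shape (one small function, a handful of assertions per behavior) carries over to C with `assert.h`.

```python
def parse_port(text):
    """Return the TCP port number in text, or None if text is not a valid port."""
    if not (text.isascii() and text.isdigit()):
        return None
    port = int(text)
    return port if 1 <= port <= 65535 else None


def test_accepts_valid_ports():
    assert parse_port("80") == 80
    assert parse_port("65535") == 65535


def test_rejects_out_of_range_ports():
    assert parse_port("0") is None
    assert parse_port("65536") is None


def test_rejects_non_numeric_input():
    assert parse_port("") is None
    assert parse_port("-1") is None
    assert parse_port("8o") is None


def run_tests():
    """Run every test and return how many were executed."""
    tests = [
        test_accepts_valid_ports,
        test_rejects_out_of_range_ports,
        test_rejects_non_numeric_input,
    ]
    for test in tests:
        test()
    return len(tests)


if __name__ == "__main__":
    print(f"{run_tests()} tests passed")
```

Writing the three test functions before `parse_port` itself is the "test first" habit described above: the tests pin down the edge cases before any parsing code exists.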

Don't take this as a defense of Microsoft but ...

Companies build what paying customers ask for. Imagine the following phone conversation.

You: Hi, I want to speak to the VP Marketing at Microsoft *small miracle happens, you are connected to this person*

VP Marketing: Hi, how may I help?

You: I want to make your next version be more secure and have fewer bugs.

VP Marketing: Do you want our new feature that automatically adds the TM sign after all trademarked terms such as "OpenSource"?

You: No, sounds interesting, but I would rather have you spend your time on fewer bugs and more security.

VP: Hah, I got you, I can ignore your input - you are not one of my customers. People like you run BSD. Any person who buys my products would have chosen a new feature over the less tangible things. Sorry, Dude, love to help you but paying customers have stuff they need me to do.

This is the sad part - this guy might truly wish he was spending more time working on fixing things that should be fixed. Unfortunately the people who buy this stuff don't care. Ask your friends who run Outlook, have you upgraded to the latest service pack? Not the service pack for NT or 98 or 2000 but the service pack for Office and Outlook. Odds are very high they have not. If your friends have not, what do you think the users who are completely non technical are doing? They can't even be bothered to download a free update. They don't care about general security. Perhaps if their system is not working, then they will download something, but not until it causes problems.

People who care don't buy stuff or form a very small percentage of the overall market. Perhaps that is changing. Bush's war on whatever might be changing that.

So I ask you - who caused the 12 billion dollar cost, the people who wrote the software or the people who did not care about the 12 billion when they bought it? PS I think the 12 billion estimate is completely silly but I will agree that bugs cost lots of money.

Thanks for the replies.

splork: I completely agree with your point that even though these bugs may have cost billions, they have probably generated more. So there is probably still a net gain. Re qmail: I hear what you mean, I use it in spite of the author.

apenwarr: You're partly right. Customers don't want bug free products. But they don't want to be susceptible to viruses or thieves either. Of course you can't produce a perfect product. There will always be something wrong with it. But I'm of the opinion that better software can be produced at the same cost as crappy software. Having worked on both types of projects (better software and crappy software), I can tell you that the amount of effort that goes into making crappy software is greater than or equal to the amount of effort that goes into better software. Of course, this is just anecdotal. Hmmm, I'll have to see if I can dig up some empirical numbers that demonstrate this. You've certainly made me think about a few different angles on this, though; and possibly some opportunities for improvement in a few areas.

If you move out of the shrink-wrap, IT, or web environment into the firmware/embedded space, I think you'll find that customers are less likely to accept bugs. How many people are going to buy a PBX/cell phone/tv/microwave that has buggy software? If the phones aren't working or your microwave goes berserk, you're talking about some serious costs for your customers!

mirwin: Very nicely stated. My point isn't *really* to achieve perfection. Just movement in that direction. Looking at software as a complex system is an excellent place to start. Multiple sources of failure are indeed the reason for many of the worst failures. (Code Red, Love Bug, etc.) And the failures aren't just confined to the application in question -- it is a function of how the application fits into the larger system of the user's pc or even the network environment.

Time is a scarce resource for most software development projects. Most of the time, the real code is written in the last 10% of the coding phase; the rest of the time is mostly spent on communication between peers.

There is one method of defect detection that we used several years back, called Fagan Inspection. I'm not sure whether it is widely practiced today, but I think code walkthroughs and group inspections are important in defect detection and correction.

public service and responsibility, posted 15 Feb 2002 at 20:27 UTC by atai » (Journeyer)

re: public service and responsibility, posted 15 Feb 2002 at 22:11 UTC by bstpierre » (Journeyer)

reverse the question, posted 15 Feb 2002 at 23:49 UTC by splork » (Master)

extreme programming, posted 16 Feb 2002 at 00:45 UTC by scandal » (Master)

Good luck, posted 16 Feb 2002 at 02:56 UTC by apenwarr » (Master)

Perfection not required for improvement., posted 16 Feb 2002 at 05:27 UTC by mirwin » (Master)

The Client is Always Right, posted 16 Feb 2002 at 11:42 UTC by abraham » (Master)

Re perfection not required, posted 16 Feb 2002 at 17:42 UTC by apenwarr » (Master)

Design for Testability, posted 17 Feb 2002 at 09:08 UTC by goingware » (Master)

Do people really want it?, posted 18 Feb 2002 at 05:05 UTC by cullenfluffyjennings » (Journeyer)

thanks for all the feedback, posted 20 Feb 2002 at 14:10 UTC by bstpierre » (Journeyer)

Defect Minimization Rule #1, posted 24 Feb 2002 at 06:37 UTC by nymia » (Master)