Billion Dollar Bugs
Posted 15 Feb 2002 at 19:58 UTC by bstpierre
So you don't write code for NASA? Your code isn't running on a life
support system or medical equipment? If there's a bug, QA will find it,
and besides, you're behind schedule and you needed to get this out the
door last week! Right?
Wrong. Your bugs, design flaws, and security holes could cost your
customers billions of dollars. If the business environment ever changed
so that software vendors were liable for defective products just like
any other vendor, your company could be bankrupted by the lawsuits.
The point is that we shouldn't just be looking at the SNMP stack we
bought. We should also look at code that we've developed in-house for
bugs. "Well, duh", I can hear you saying. "That's why you have a QA
group." Of course we have a QA group. We have excellent testers! But do
we really expect them to write special software that attempts to
exploit every possible nook and cranny in our system? Not exactly.
Testing groups are more concerned with black box testing. They don't
usually get involved with examining the source code, so they aren't
exactly in a position to know where the nooks and crannies are hidden.
(Hmm, funny thing. I have a hankering for an English muffin right now.)
I've heard people remark, "We don't need to operate at that level.
We're not writing code for the space shuttle." Implicit in this remark
is the idea that NASA programmers are very methodical and cautious.
Some might say they're the best in the business. Part of the reason for
being careful on code for, say, the space shuttle, is the enormous cost
and the potential for loss of life. According to the NASA
website, the Space Shuttle Endeavour cost $1.7 billion.
If you are writing code for a widely deployed piece of software and
there is a serious flaw, the costs imposed by that bug can be enormous.
Case in point: Outlook Express is installed on millions of computers
around the world. This software was obviously not scrutinized (neither
the design nor the code would be my guess) for security flaws. But this
was obviously ok at the time, because the software was a) largely being
distributed for "free", and b) there are no lives at stake -- it's a
mail client. Right?
Wrong. According to this article at cnet (and other sources),
the "Love Bug" virus caused an estimated $8.75 billion in damages.
That's more than five space shuttles! What would it have cost the
programmers at Microsoft to think a little bit more carefully about
their design for Outlook Express? Two or three million, tops? I guess
it doesn't matter now that Microsoft has made a strong commitment to
creating secure software.
"SirCam", which also propagates through email, cost $1 billion. "Nimda"
was a little more clever, attempting to replicate through multiple
methods. One expert estimated that Nimda cost $500 million in one week!
Hey, wait, I'm not finished: "Code Red" exploited multiple flaws in
Microsoft's IIS, costing an estimated $2.6 billion.
That's over $12 billion in costs to users -- and those are just
Microsoft's programs. I could talk about bugs in sendmail and other
widely deployed open source software, but I won't. (Because you should
be using qmail.)
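For anyone who wants to check the arithmetic, here is a quick sketch in Python. The figures are the estimates quoted above, expressed in millions of dollars; the variable names are just mine for illustration.

```python
# Damage estimates quoted in the article, in millions of US dollars.
DAMAGE_MILLIONS = {
    "Love Bug": 8750,
    "SirCam": 1000,
    "Nimda": 500,      # one week only, per the quoted estimate
    "Code Red": 2600,
}
SHUTTLE_COST_MILLIONS = 1700  # Endeavour, per the NASA figure quoted above

total = sum(DAMAGE_MILLIONS.values())
shuttles_love_bug = DAMAGE_MILLIONS["Love Bug"] / SHUTTLE_COST_MILLIONS
shuttles_total = total / SHUTTLE_COST_MILLIONS

print(f"Total: ${total / 1000:.2f} billion")
print(f"Love Bug alone: {shuttles_love_bug:.1f} shuttles")
print(f"All four: {shuttles_total:.1f} shuttles")
```

The total comes to $12.85 billion, which is where "over $12 billion" comes from, and Love Bug alone works out to a little over five shuttles.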
I haven't heard any stories of anyone being killed or injured
because of these viruses, but the situation could easily arise: imagine
that a worm like Code Red (but with a more vicious payload) is
highly active at the same time a major earthquake hits on the US West
Coast. People are calling 911 to report emergencies, but the emergency
dispatch centers are under attack by the worm. Or people are trying to
make calls or send emails to let family members know that they are ok,
but major networks are swamped because of the extra traffic introduced
by the worm. Or even something as basic as hospitals' information
systems getting swamped because of internal or external traffic caused
by worm infections. A ten minute delay in getting an ambulance to a
victim can be the difference between life and death. Even though you're
just writing a web server, you have the potential to either save lives
or kill people. That's scary stuff, from where I sit.
The economic costs related to viruses and virus-like software were
$17.1 and $13.2 billion in 2000 and 2001, respectively. I'm not sure if
those estimates take into account the tremendous amounts we spend on
anti-virus (AV) software (the AV market is $2.8B). This wouldn't be necessary if systems
were designed for virus resistance in the first place.
The same goes for firewall software. According to this
report, firewall software is the fastest growing software market,
with AV software right behind. These systems shouldn't be as
important as they are! Instead of putting layers of armor over all of
these defective products, why not just make the products bullet-proof
to begin with? Because after all, firewall software and AV software is,
well, software -- and is not immune to bugs and security flaws.
If we got into the habit of creating software that was immune to
viruses like Code Red and Love Bug, we'd gain from a reduction in
regular bugs (the ones that make software crash without malicious
intervention). Those cost money too: lost productivity, lost data,
inaccurate data, lost business, etc.
I'm even going to go out on a limb: bulletproofing all
software, not just the stuff that NASA uses, would cost
nothing. The reason that I say this is that
the techniques that are often used to produce "zero defect"
software would provide offsetting gains. We'd gain from programmers'
increased productivity, better schedule predictability, and decreased
testing requirements. Not to mention that our systems would be far less
vulnerable to viruses and malicious intruders.
The original appears here.
Feedback appreciated.
Government employees and elected officials are held accountable for
their actions. If they cause public harm, they will be punished.
I don't know what happens if a billion dollars are wasted due to a
software bug by a NASA engineer. What punishment will he/she face?
Free software authors perform public service by giving away their
software for anyone to use. And people rely on free software too. And
we know the GPL and other licenses say "no warranty..."
If someone dies because of a failure in Linux running on an embedded
device somewhere, and someone sues Linus Torvalds, what happens?
(Of course Linus is not elected, and he is not paid for working on
Linux, but the scenario, that someone sues him because he is pissed off
by some failure of Linux somewhere, is possible...)
What if someone sues Linus?
If they tried they'd get laughed out of court. But that really misses
my point: taking the time to weed out the bugs up front will yield
nice dividends down the road. Bugs cost more than you might expect,
especially as we come to depend on software more and more as part of
our daily lives.
And how much revenue generating business never would have existed if the
software which you claim has caused so many losses had never existed?
It should be obvious by now that it is impossible to educate the largely
ignorant programming masses about safe coding practices (how can we
accept low level language code from someone who doesn't understand
assembly language and computer internals? it happens all the time!).
higher level languages that are much more immune to such problems are
one good solution (python, java and perl instead of c/c++/c#).
also: some of us choose not to use qmail because of the author...
My local LUG just had a presentation on Extreme Programming. For
those of you who have never heard of it (or didn't know what it was),
Extreme Programming is a methodology for software development that among
other things stresses the importance of doing Q/A at every step of
the process.
Good luck, posted 16 Feb 2002 at 02:56 UTC by apenwarr »
(Master)
You've struck close to my heart :) I disagree with this article completely.
I've heard this argument many, many times in many, many forms. There
are two problems with it:
1. It shows a fundamental misunderstanding of engineering concepts.
(Don't feel bad; most times I've heard this argument, it's been from
engineers.) Let's put it bluntly: You will NEVER, EVER EVER produce a
perfect product, no matter how hard you try. Trust me on this. If
you're striving for perfection, give up now. All you can do is try to
make your system "better," which is what we're all trying to do.
2. Customers don't want bug-free products, and so the laws aren't going
to change. Or, put another way, customers prefer the product with more
features, not the one with fewer bugs. Bugs in Windows cut my
productivity by, say, let's be cruel, 50%. But features of Windows make
my new work possible. It's worth it. Let's think of this another way:
if we offered customers two choices, Windows 95 with no bugs and Windows
XP with bugs, which would they choose? XP, I'm certain of it, and not
just because they're stupid. It's because the advantages (to them)
outweigh the disadvantages.
By the way, the whole concept that "thinking more" during the design of
Outlook would have only cost $3 mil and solved all kinds of problems is
very flawed. Being a few months later with a product like Outlook could
have cost Microsoft leadership in the email reader market and thus
_most_ of their profits. And a "proper" design for security and an
implementation to greatly reduce the number of bugs would have taken
many, many months, not just a few.
ObNote: I actually hate using most Microsoft products. But I _really_
dislike people who don't understand why people buy them.
Avery
Perfection is not required to achieve large gains, at least in complex
physical systems.
In post mortem engineering failure analysis it is most often found that
catastrophic failure (leading
to death, high property damage, etc.) is the result of multiple
failures. Engineering safety margins
and good design generally allow safe system shutdown with a single
failure.
The point is not perfection. The postulate proposed is that a small
improvement in attitude towards
eliminating bad practices and delivering higher quality components will
have a dramatic effect on
overall system quality. Some of these reports on the shuttle accident are worth
reading if interested in the
subject of high quality complex systems. The shuttle is about as
complex as it gets and it has plenty
of software. What killed it in the end was not the inability to build
perfect components (it has
sufficient redundancy and engineering margin to fly successfully with a
few failing components, just
not all the seals or all the gyroscopes or complete structural collapse)
but the unwillingness of the
NASA system to allow people to take responsibility and authorize them to
deliver adequate
components.
The initial Ariane
5 failure is perhaps a
better example: All of its components worked as originally designed
but the system failed on first
launch. Some software components were reused from Ariane 4 and it was
not spotted later in the
program that this would cause some software components used on the
ground but not in flight to
generate failure alarms when a flight regime change invalidated the
software designer's design
criteria. The alarm results in flight controller shutdown and
switchover to a backup controller. The
backup controller experienced the same fault and shutdown. Now out
of control it had to be
destroyed. The failure analysis provides detailed discussion of the
sequence of decisions and
design errors at the system level that led to the overall failure. A
single small improvement
anywhere in this sequence, just as with the shuttle, would have
eliminated the failure. An imperfect
design process slightly better than the one that occurred would have
eliminated the catastrophe.
While software is certainly not physical, it is often present in today's
complex systems. It is often
the most complex subsystem or component in a system. I see no reason
to expect that it is immune
to trends that have been quantified in engineering analysis of complex
systems. For additional
examples to support the thesis a google search for NTSB reports on
airliner crashes or highway
accidents might be instructive. Recall that in the recent Firestone
fiasco, sports utility vehicles
were more susceptible to rollover on blowout. Why? High center of
gravity might be viewed as a
design defect for high speed freeway travel, add the 2nd defect tire
blowout (instead of slow leaking
flat) and then the third of poor suspension system for highspeed freeway
travel and the rollover is
much more likely.
I agree with you that the real problem and thus the real solution lies
with the users. The sports
utility vehicle example above amply illustrates your point. Two of
the design defects for freeway
travel result from the users insistence that they want an off road
vehicle which they then drive like a
sports car on the freeway. If they do not accept unnecessarily buggy
software of overall low
"quality" (comments, readability, logical flow and organization, etc.);
the "quality" will start rising.
While few get paid directly to deliver free/open source software there
seem to be a lot of developers
around moaning because no user/contributor/fellow developer will bother
putting it on their machine
and assisting with the debugging/development/maintenance. Perhaps the
indirect benefits are
sufficient to desire a market.
If so, the users can apply market forces. They can vote with their
feet and move to projects where
it is fun to take credit for successfully isolating and helping fix an
easy isolated bug. Rather than
tackling sloppy unreadable bug ridden development code that provides
approximately the same
opportunity to shine as the lottery. At much greater expense if they
place any value on their time.
In general, it looks to me like the higher quality code wins. It
merely takes a while for volunteer
projects to catch up to commercial code developed by teams of paid
professionals. As successive
waves of developers/users catch on that the highly successful projects
(with a few exceptions) tend
to have high quality code and that lower quality projects are ready
targets for displacement by killer
app clones with a focus on high quality, their code will get better.
Until the next wave.
For fun, let's call component quality a function of utility, efficiency,
and correctness while defining system quality as a function of ease of
installation, conflict free operation, and lack of conflicting
dependencies.
Any long time community participants willing to state a crude opinion or
estimate regarding whether the free/open community average quality is
going up or down?
How about if we restrict the average quality estimated to the
"successful" widely used projects that have hit critical mass and are in
steady demand and maintenance?
Or at least will be, after I have educated him.
As a professional programmer, I always start discussing priorities with
my clients. What features are they willing to pay for? What level of
robustness are they willing to pay for? In general, they tend to have
a fixed amount of time or money for the project, a number of essential
features, and are willing to accept to spend their own time working
around any bugs and misfeatures left within these parameters.
[ Of course, clients always initially expect perfect omnipotent software
delivered yesterday, but I have always found it worth spending the first
couple of meetings teaching them about reality. ]
It makes perfect sense. A bug free program which does not solve the
client's problem is useless. A buggy program, on the other hand, may very
well save the client lots of work for the (hopefully) majority of cases
where it works right, which more than compensates for the manual
workaround needed for the remaining cases.
As a hobbyist programmer, I don't care. I implement the features I
need or find interesting, and fix bugs that become too annoying to
work around. If anyone else wants to use the software and has a
problem with that, they can send me a patch, or become my clients and
pay me to fix it.
In a broader sense, the market also agrees, especially on the desktop.
The companies that have brought features to the market first have, in
general, outcompeted the companies that have spent more time testing.
And no, there is no magic bullet (sorry Brad Cox), neither in
methodology nor in tools, that removes the need for a tradeoff between
features and robustness. There are lots and lots of small techniques and
tools that make either or both easier to accomplish, reducing the cost of
development, but not removing the tradeoff. Some people think language
features like garbage collection eliminate memory leaks, which is not
true for any useful definition of memory leak. I have spent days
tracking down memory leaks in Lisp code. However, such tools make them
much rarer and easier to avoid, thus saving valuable time.
mirwin: you said many right things, but I think they
support my argument more than yours.
In particular, in the case of space shuttle accidents being avoidable
with "just a small improvement here or there" - of course! That can be
said of nearly all bugs. But remember, we think of NASA as doing some
of the best software/hardware engineering in the world. If they made a
mistake, my opinion is: so will we!
Car companies also spend astonishingly more money on quality control and
testing than most software companies do, and you still get disasters
occasionally.
The difference between "normal" software companies and NASA is that for
every $1 we spend, we could squash a lot more bugs than NASA could.
That's because we have a lot more bugs. But our software also evolves
faster, has more features, and meets a wider range of needs than most
NASA equipment, and for the most part it still works pretty well.
Maybe our features/quality tradeoff needs tweaking in one direction or
the other, but I think we're pretty close. Windows hasn't killed me yet.
I'd like to recommend that everyone read John Lakos'
Large Scale C++ Software Design in which he promotes "design for
testability" in software.
I think this book would be valuable reading even if you don't program in
C++, although it would help to understand the language at least a little.
There is a condensed version of some of the most essential material from
the book in three articles by Lakos that are reprinted in More
C++ Gems.
Design for Testability has been a huge success in hardware
manufacturing, where it became increasingly difficult to test integrated
circuit chips at their external pins because more and more functionality
was hidden within the chip in complicated ways. It was only by coming
up with standards for circuitry devoted specifically for test during
manufacturing that we are able to enjoy the complex electronics we have
today.
Software presents some tremendous advantage to the tester, but it is
necessary for the tester to be a programmer. First, each individual
instance of a class does not need to be tested, as is the case with
hardware - if all the interfaces of a class can be thoroughly tested,
then you can be sure that any new instances of the class will continue
to work. So it is actually a lot cheaper to test software, if you do it
at all.
Secondly, if you have already tested the classes that a new class uses,
then you only need to test the new functionality provided by the new
class - the way it puts its components together, or the extensions it
provides to its base class, and so on.
Finally, if you make the physical design of your codebase so that there
are no cyclic dependencies between different components, you can divide
the project up into "levels", and test all the lower levels first, then
the next level up (the first level that uses the lowest levels) and so
on up.
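As a rough illustration of the levelization idea (my own sketch, not code from Lakos' book), the following Python groups a hypothetical set of components into levels and refuses to proceed if there is a cyclic dependency. The component names are made up, and it assumes Python 3.9 or later for the standard `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical components, each mapped to the components it depends on.
DEPENDS_ON = {
    "string_util": set(),
    "logger": {"string_util"},
    "config": {"string_util"},
    "network": {"logger", "config"},
    "app": {"network", "config"},
}


def levelize(depends_on):
    """Group components into levels, lowest first.

    Raises CycleError if the dependency graph contains a cycle.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises CycleError on a cyclic dependency
    levels = []
    while sorter.is_active():
        # Everything whose dependencies are already done forms one level.
        ready = sorted(sorter.get_ready())
        levels.append(ready)
        sorter.done(*ready)
    return levels


levels = levelize(DEPENDS_ON)
for number, names in enumerate(levels, start=1):
    print(f"level {number}: {', '.join(names)}")
```

A test run would then work through the levels in order, so that by the time you test `network` you already trust `logger` and `config`.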
Whereas the total number of execution paths through a large modern
program is enormous, using Lakos' design and testing strategy keeps the
effort required to write tests and run them quite manageable.
I can personally attest to the usefulness of the "test first" strategy
that is promoted by the eXtreme Programming folks. The main usefulness
of it to me is to get over the hump of getting something new started - I
always have the hardest time beginning a new project. Coding up a
simple test is always a straightforward way to begin work and it gets me
into the groove. Towards the end of a big project I have a lot of tests
I can run to make sure nothing breaks.
I would really like to see more unit testing done.
While unit testing is primarily promoted as part of an object oriented
practice, there's no reason it can't be applied to other methodologies,
like procedural programming in C.
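Here is a small sketch of what that can look like with no classes at all: one plain function under test, one plain function per test, and a tiny runner. It is written in Python to keep it short, and `parse_port` is a hypothetical example function, but the same shape carries over to C with a function per test and a main() that calls them.

```python
def parse_port(text):
    """Convert text to a TCP port number, rejecting anything out of range."""
    port = int(text)  # raises ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def test_accepts_valid_ports():
    assert parse_port("80") == 80
    assert parse_port("65535") == 65535


def test_rejects_bad_input():
    for bad in ("0", "65536", "-1", "http", ""):
        try:
            parse_port(bad)
        except ValueError:
            continue  # rejected, as it should be
        raise AssertionError(f"accepted bad input: {bad!r}")


def run_tests():
    """Run every test function; any failure raises AssertionError."""
    tests = [test_accepts_valid_ports, test_rejects_bad_input]
    for test in tests:
        test()
    return len(tests)


passed = run_tests()
print(f"{passed} tests passed")
```

Writing the two test functions before `parse_port` itself is the "test first" habit described above: the tests pin down the boundary cases before any code exists to get them wrong.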
Don't take this as a defense of Microsoft but ...
Companies build what paying customers ask for. Imagine the following
phone conversation.
You: Hi, I want to speak to the VP Marketing at Microsoft.
*small miracle happens, you are connected to this person*
VP Marketing: Hi, how may I help?
You: I want to make your next version more secure and have fewer bugs.
VP Marketing: Do you want our new feature that automatically adds the
TM sign after all trademarked terms such as "OpenSource"?
You: No, sounds interesting but I would rather have you spend your time
on fewer bugs and more security.
VP: Hah, I got you, I can ignore your input - you are not one of my
customers. People like you run BSD. Any person who buys my products
would have chosen a new feature over the less tangible things. Sorry,
Dude, love to help you but paying customers have stuff they need me to
do.
This is the sad part - this guy might truly wish he was spending more
time working on fixing things that should be fixed. Unfortunately the
people who buy this stuff don't care. Ask your friends who run Outlook,
have you upgraded to the latest service pack? Not the service pack for
NT or 98 or 2000 but the service pack for Office and Outlook. Odds are
very high they have not. If your friends have not, what do you think
the users who are completely non technical are doing? They can't even
be bothered to download a free update. They don't care about general
security. Perhaps if their system is not working, then they will
download something, but not until it causes problems.
People who care don't buy stuff or form a very small percentage of the
overall market. Perhaps that is changing. Bush's war on whatever might
be changing that.
So I ask you - who caused the 12 billion dollar cost, the people who
wrote the software or the people who did not care about the 12 billion
when they bought it? PS I think the 12 billion estimate is completely
silly but I will agree that bugs cost lots of money.
Thanks for the replies.
splork: I completely agree with your point that even
though these bugs may have cost billions, they have probably generated
more. So there is probably still a net gain. Re qmail: I hear what you
mean, I use it in spite of the author.
apenwarr: You're partly right. Customers don't want
bug free products. But they don't want to be susceptible to viruses or
thieves either. Of course you can't produce a perfect product. There
will always be something wrong with it. But I'm of the opinion that
better software can be produced at the same cost as crappy software.
Having worked on both types of projects (better software and crappy
software), I can tell you that the amount of effort that goes into
making crappy software is greater than or equal to the amount of effort that
goes into better software. Of course, this is just anecdotal. Hmmm,
I'll have to see if I can dig up some empirical numbers that
demonstrate this. You've certainly made me think about a few different
angles on this, though; and possibly some opportunities for improvement
in a few areas.
If you move out of the shrink-wrap, IT, or web environment into the
firmware/embedded space, I think you'll find that customers are less
likely to accept bugs. How many people are going to buy a PBX/cell
phone/tv/microwave that has buggy software? If the phones aren't
working or your microwave goes berserk, you're talking about some
serious costs for your customers!
mirwin: Very nicely stated. My point isn't *really* to
achieve perfection. Just movement in that direction. Looking at
software as a complex system is an excellent place to start. Multiple
sources of failure are indeed the reason for many of the worst
failures. (Code Red, Love Bug, etc.) And the failures aren't just
confined to the application in question -- it is a function of how the
application fits into the larger system of the user's pc or even the
network environment.
Time is a scarce resource for most software development projects. Most
of the time, the real code is written in the last 10% of the coding
phase. The rest is mostly spent on communication between peers.
There is one method of defect detection that we used several years
back, called Fagan Inspection. I'm not sure whether it is widely
practiced today, but I think code walkthroughs and group inspections
are important in defect detection and correction.