18 Dec 2001 (updated 18 Dec 2001 at 06:44 UTC)
I thought I'd use this first diary entry for a rant,
since I lost about a week of
productive research work due to a bug in the C++ standard
library that ships with Red Hat. (Why, might you ask, do I
have time to write a diary entry after wasting all that
time? I don't.)
The basic_string implementation uses a
reference-counting internal representation class, called
rep. Rather than use a lock, the implementor
decided to use atomic operations to implement the reference
counting, a strategy of which I heartily approve.
Unfortunately, the code to increase the reference count
looks like this:
    charT* grab () {
        if (selfish) return clone ();
        ++ref;
        return data ();
    }
No problem, right? ++ compiles
down to a single instruction, so the code works fine under
multithreading.
But not on a multiprocessor system. When you pull out
your microscope and think of the CPU as a load/store
machine, an increment is a load, an add, and a store, and
the other CPU can jump in at just the wrong place. The
correct solution, pointed out here,
is to use the LOCK prefix on the add instruction, like this:
    charT* grab () {
        if (selfish) return clone ();
        // LOCK makes the read-modify-write atomic across CPUs.
        asm ("lock; addl %0, (%1)"
             : : "a" (1), "d" (&ref)
             : "memory");
        return data ();
    }
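For readers on a modern compiler, here is a minimal sketch of the same idea using std::atomic from C++11 instead of inline assembly. It is not the libstdc++ code: Rep here is a hypothetical stand-in, and replay_lost_update, count_grabs, and self_check are names made up for illustration. The first function replays the bad interleaving by hand, so the lost update is deterministic; the second hammers an atomic counter from several threads.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the string's shared representation.
struct Rep {
    std::atomic<std::size_t> ref{1};  // one owner to start with

    // Atomic read-modify-write. Relaxed ordering is enough for an
    // increment; on x86 this typically becomes a LOCK-prefixed instruction.
    void grab() { ref.fetch_add(1, std::memory_order_relaxed); }
};

// Replays, step by step, what two CPUs can do with a plain ++ref
// when the increment is split into load, add, and store.
std::size_t replay_lost_update() {
    std::size_t ref = 1;
    std::size_t cpu0 = ref;  // CPU 0 loads 1
    std::size_t cpu1 = ref;  // CPU 1 loads 1 before CPU 0 stores
    cpu0 += 1;               // CPU 0 adds
    cpu1 += 1;               // CPU 1 adds
    ref = cpu0;              // CPU 0 stores 2
    ref = cpu1;              // CPU 1 stores 2; CPU 0's increment is lost
    return ref;              // 2, although two increments ran
}

// Runs `threads` threads that each call grab() `grabs_per_thread` times
// and returns the final reference count.
std::size_t count_grabs(unsigned threads, unsigned grabs_per_thread) {
    Rep rep;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&rep, grabs_per_thread] {
            for (unsigned i = 0; i < grabs_per_thread; ++i) rep.grab();
        });
    }
    for (auto& th : pool) th.join();
    return rep.ref.load();
}

// Usage example: the replayed race loses a count, the atomic version does not.
void self_check() {
    assert(replay_lost_update() == 2);
    assert(count_grabs(4, 1000) == 4001);
}
```

With a plain integer in place of std::atomic, count_grabs would contain a data race and could come up short on a multiprocessor machine, which is the failure described above. A matching release operation would need stronger ordering than relaxed; that part is left out of this sketch.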
When was this patch posted to gcc-patches and gcc-bugs?
July 2000. As of Red Hat 7.1 (libstdc++-2.96-85), this bug
still exists. (GCC 3 has a rewritten string class which
does the right thing, thankfully.)
The patch did me absolutely no good. All I had to start
with were weird memory corruption errors that usually
seemed to hit basic_string's nilRep member.
I only knew what to search for in bug reports once I had
(laboriously) traced the problem down to a race condition in
the reference counting, and at that point the answer was
staring me in the face.
Frankly, I feel as if Red Hat and the GCC maintainers let
me down; they had a fix available for a year, but somehow it
never made anybody's to-do list, and as a result, all my
projects have been pushed back a week.