I thought I'd use this first diary entry for a rant, since I lost about a week of productive research work due to a bug in the C++ standard library that ships with Red Hat. (Why, might you ask, do I have time to write a diary entry after wasting all that time? I don't.)
The basic_string implementation uses a reference-counting internal representation class, called rep. Rather than use a lock, the implementor decided to use atomic operations to implement the reference counting--- a strategy of which I heartily approve.
Unfortunately, the code to increase the reference count looks like this:
charT* grab () { if (selfish) return clone (); ++ref; return data (); }
No problem, right? ++ compiles down to a single instruction, so the code works fine under multithreading.
But not on a multiprocessor system. When you pull out your microscope and think of the CPU as a load/store machine, an increment is a load, an add, and a store--- and the other CPU can jump in at just the wrong place. The correct solution, pointed out here, is to use the LOCK prefix on the add instruction, like this:
charT* grab () { if (selfish) return clone (); asm ("lock; addl %0, (%1)" : : "a" (1), "d" (&ref) : "memory"); return data (); }
When was this patch posted to gcc-patches and gcc-bug? July 2000. As of Red Hat 7.1 (libstdc++-2.96-85), this bug still exists. (GCC 3 has a rewritten string class which does the right thing, thankfully.)
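For anyone who wants to see the idea without inline assembly, here is a minimal sketch of the same fix in newer C++ (C++11 onward), assuming std::atomic is available. It is not the libstdc++ code: Rep is a stripped-down, hypothetical stand-in that keeps only the counter, and hammer is a made-up helper that has several threads call grab() at once. On x86, the fetch_add typically compiles to a LOCK-prefixed instruction, which is exactly what the patch above does by hand.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the string's rep class: only the reference count.
struct Rep {
    std::atomic<std::size_t> ref{1};  // one reference held by the creator

    // Atomic read-modify-write: no other CPU can slip in between the
    // load, the add, and the store.
    void grab() { ref.fetch_add(1, std::memory_order_relaxed); }
};

// Hypothetical helper: start `threads` threads that each call grab()
// `iters` times, then return the final reference count.
std::size_t hammer(unsigned threads, std::size_t iters) {
    Rep rep;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&rep, iters] {
            for (std::size_t i = 0; i < iters; ++i) rep.grab();
        });
    }
    for (auto& th : pool) th.join();
    return rep.ref.load();
}
```

With the atomic counter, hammer(4, 100000) accounts for every increment. If ref were a plain integer bumped with ++ref, as in the buggy grab(), the same test on a multiprocessor machine could lose updates (and would be a data race as far as the language is concerned).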
The patch did me absolutely no good. All I had to start with were weird memory corruption errors that seemed to usually hit basic_string's nilRep member. I only knew what to search for in bug reports once I had (laboriously) traced the problem down to a race condition in the reference counting--- and at that point the answer was staring me in the face.
Frankly, I feel as if Red Hat and the GCC maintainers let me down; they had a fix available for a year, but somehow it never made anybody's to-do list--- and as a result, all my projects have been pushed back a week.