Sometimes I find myself questioning my tenacious pursuit of bugs. Usually this comes while I'm in the middle of trying to understand some really tricky failure. As I work on a system with a code base that's been around for over a decade, bugs these days tend to be the "tail of the dragon" kind which can take days or weeks to make sense of.
Most recently, I worked on a crash produced by a multi-threaded server during a test that ran for a little over 3 days before failing. It happened on a platform that probably doesn't get much use outside my testing environment (Sparc Linux). The crash was inside the memory allocator (in this case the Boehm garbage collector).
Any crash inside a memory allocator suggests some sort of memory corruption. Of course I had no idea whether the application-level code was responsible, but it seemed prudent to pursue it. After a day of digging around in gdb, I had understood the data structure where the memory allocator had gotten the pointer it was trying to de-reference. It was a linked list which, luckily, also had an array pointing to the individual members. (The list is sparse, so the "next" pointer of array element N might point to an element greater than N+1.)
The interesting thing at this point was that several of the elements out of this linked list had been completely zeroed. It was easy to tell this because of the overall data structure: the next/previous links of the zeroed elements were wrong, plus each one also contained its own array index as a field and those were zero as well.
I found that the zeroed elements were all pretty near each other in memory, which suggested that maybe the same thing had zeroed all of them. I wanted to find out just how much memory had been zeroed. Unfortunately, gdb seems to lack any sort of memory searching capabilities. I tried writing a little loop in gdb's command language to do the searching for me, but boy was it slow. (Oh, for those bygone days of Waldemar Horwat's TMON debugger!)
So I opened up my core file in hexedit and started looking around in it. This required using objdump(1) to get information from the core file on mappings from file offsets to virtual addresses and doing a little conversion. Not too much trouble, but not something I usually have to resort to with core files. Of course then I found that hexedit's search function only allowed me to search for a particular hex string, not to search for a non-zero byte. But with a few lines of C code I had a program to do that searching for me. (This is about the point where I started questioning the sanity of continuing. When the debugger isn't good enough, and looking at a core file in a hex editor isn't good enough, and you start writing your own code to read through core files... well maybe that's a good thing, but it gave me pause.)
So now, a good 2-3 days into my debugging, I find that a whopping 15.2M of memory had been zeroed. Luckily for me, it started and ended within the same memory region in the core file. It would have been much harder to follow if it had spanned multiple regions. I definitely felt like I was getting somewhere. I went back to hexedit and searched for a pointer to the beginning of this zeroed region, figuring that there was a good chance there would be one somewhere. Sure enough, I found two copies of the pointer I was looking for.
Then the bad news: the pointers were in the stack of another thread. Since this crash happened on a machine running Debian sarge, which for some reason still hasn't made the jump to NPTL, my core file only contained the registers for the thread which actually crashed. I had the contents of the other stack, but no way to get gdb to display the call stack or the contents of the locals and arguments. The gdb manual claims that you can display an arbitrary stack frame by passing one or more addresses to the "frame" command, but I couldn't get this to work in either gdb 6.3 or gdb 5.2. But I was not to be deterred! There was clearly some code zeroing memory it shouldn't have been, and I was going to find it.
So it was time to figure out how to decode Sparc stack frames. I Googled around for some info on Sparc stack frames, but I didn't find much. Instead I read some of the gdb source to see what it was doing. Eventually I managed to figure out enough about the stack frame layout to make some sense of it. (The output from gdb 6.3's "info frame" on the thread it would display was actually quite helpful with this.)
Now I needed to reconstruct the stack from the other thread. Luckily, I knew that the first few stack frames of each thread would be the same thread start functions because the Boehm garbage collector wraps thread creation so that it can find the stack of each thread. This made it easy to find the first few frames. Once I had those, it was back to hexedit to search for frame pointers and work my way down into the nested call frames. At each one I extracted the saved PC and used gdb's "info line" to map that back to a source line. Pretty soon I had at least a partial picture of what was going on in this thread.
It turned out that it was in a bucketed hash table template class, allocating and initializing some new buckets. Now I was finally getting somewhere. I still needed some way to get the arguments and local variables out of these stack frames. Since gdb refused to do it for me, I needed to figure out what offset from the frame pointer the variables I was interested in were stored at. To do that I ran some other code using the same class under the debugger and stopped in the same member functions. This made it easy to determine the offsets I needed. Then I went back to the other gdb session, and examined memory at these offsets from the frame pointers I had found and noted down.
At last, after something like 4+ days of debugging, I was getting close to some answers. It turned out that the hash table code was trying to allocate an array of 0x40000000 buckets. Each bucket was a pointer, which was 4 bytes. Multiply the number of elements by the element size, and you get a number that doesn't fit in a 32-bit size_t. So it actually allocated a heap block of... zero bytes. And then went to initialize all those pointers to NULL... which meant it was trying to zero the entire address space. Ooops.
This value (0x40000000) was, in fact, the clamped maximum number of allowed buckets. So that's obviously too large. This is one of those things you run into when you take software originally developed on 64-bit systems and move it to a 32-bit platform. Also, I'm a bit disturbed by the fact that allocating a heap block of zero bytes gives you back a non-NULL pointer. However, it seems that "malloc(0)" and "new void * " in C++ both return a valid pointer.
Unfortunately, this is about as far back as I could take my debugging, which is somewhat unsatisfying as I really like to find root causes. In this case it seems as though somehow a floating-point division came up with the value 2210398208 for 1 divided by 0.5, at least as far as I can tell.
If the maximum hash table buckets were clamped to a smaller value and allocating a block of zero bytes returned NULL this all would have crashed much sooner, and in a way that would have made it easier to look at what else was going on. So I guess I'm going to have to be satisfied with those two buglets.
At some prodding from a couple people familiar with the code, I went back and dug up a few more details from the call frame that was initializing the hash table that caused all the trouble. (As long as I've already gone the extra mile, what's another short sprint, right?) Unfortunately, the data structure it was supposed to be copying was right in the middle of the memory region that got zeroed, which again leaves me stuck without enough information. However, the data structure corresponds to some on-disk data. Furthermore, the code path in question is only taken when the data is unmodified from what was paged in from disk. I was able to find enough information on the stack to find the corresponding file. That file confirms that the sizeHint parameter should indeed have been 1 in this case, which unfortunately brings me back to the mysterious result of 1/0.5=2210398208.