To blizzard: If you're going to improve gmon/mcount, please teach it to augment an existing gmon.out in the working directory instead of clobbering it. That way, if you want to profile a program that runs for a short time, you could just run it a few thousand times in a shell loop. Right now you have to do that, plus rename each report so they all get saved, and then crunch them together at the end. This takes much longer than it has to, and throws your results off because disk cache is wasted on the huge gmon.out files, which all have to stay around until the end.
To make this change safely, you should probably save the identity of the executable in gmon.out, and start over if it changes. (This should be done anyway.)
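Here is roughly what I mean by augmenting with an identity check, as a minimal sketch. The record layout, NBINS, exe_identity() and prof_accumulate() are all made up for illustration; this is not the real gmon.out format, and a real version would also have to merge the call-graph arcs, not just the histogram.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

#define NBINS 8   /* toy histogram size; hypothetical */

/* Hypothetical on-disk record, NOT the real gmon.out layout. */
struct prof_file {
    uint64_t exe_dev, exe_ino, exe_mtime, exe_size; /* identity of the executable */
    uint32_t bins[NBINS];                           /* accumulated PC-sample counts */
};

/* Record which executable produced the samples, using stat(2). */
static int exe_identity(const char *exe, struct prof_file *pf)
{
    struct stat st;
    if (stat(exe, &st) != 0)
        return -1;
    pf->exe_dev   = (uint64_t)st.st_dev;
    pf->exe_ino   = (uint64_t)st.st_ino;
    pf->exe_mtime = (uint64_t)st.st_mtime;
    pf->exe_size  = (uint64_t)st.st_size;
    return 0;
}

/* Add this run's counts to the file at `path`.
 * Returns 1 if merged into existing data, 0 if it started over
 * (no usable file, or the executable changed), -1 on error. */
static int prof_accumulate(const char *path, const char *exe,
                           const uint32_t run[NBINS])
{
    struct prof_file now, old;
    int merged = 0;

    memset(&now, 0, sizeof now);
    if (exe_identity(exe, &now) != 0)
        return -1;

    FILE *f = fopen(path, "rb");
    if (f) {
        if (fread(&old, sizeof old, 1, f) == 1 &&
            old.exe_dev == now.exe_dev && old.exe_ino == now.exe_ino &&
            old.exe_mtime == now.exe_mtime && old.exe_size == now.exe_size) {
            memcpy(now.bins, old.bins, sizeof now.bins); /* same binary: keep old counts */
            merged = 1;
        }
        fclose(f);
    }

    for (int i = 0; i < NBINS; i++)
        now.bins[i] += run[i];

    f = fopen(path, "wb");
    if (!f)
        return -1;
    int ok = fwrite(&now, sizeof now, 1, f) == 1;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? merged : -1;
}
```

Device, inode, mtime and size are a cheap identity test; a checksum of the text segment would be more robust but costs more at exit.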
I'd also like to see better kernel-side support for profiling. setitimer(2) has a lot of overhead, and the ticks don't come nearly often enough. SVR4 has a profil(2) system call that pushes the histogram updates into the kernel, which gets rid of the overhead but doesn't help with the granularity. Also, I don't think it can handle gaps in the region to be profiled, so your program has to be statically linked.
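If you want to see the granularity problem for yourself, something like the following should do. It's a rough sketch: count_prof_ticks() and on_sigprof() are names I made up, while setitimer(2), sigaction(2) and clock(3) are the standard calls. It asks for a SIGPROF tick at a given interval, burns CPU for a while, and reports how many ticks actually showed up.

```c
#define _XOPEN_SOURCE 700   /* for sigaction() and setitimer() under strict -std= modes */
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

static volatile sig_atomic_t prof_ticks;

/* SIGPROF handler: a real profiler would bump a histogram bin here. */
static void on_sigprof(int sig)
{
    (void)sig;
    prof_ticks++;
}

/* Burn about cpu_ms milliseconds of CPU time while ITIMER_PROF is asked
 * to fire every interval_us microseconds. Returns the number of ticks
 * delivered, or -1 on error. */
static long count_prof_ticks(long interval_us, long cpu_ms)
{
    struct sigaction sa, old_sa;
    struct itimerval it, off;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigprof;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGPROF, &sa, &old_sa) != 0)
        return -1;

    it.it_interval.tv_sec  = interval_us / 1000000;
    it.it_interval.tv_usec = interval_us % 1000000;
    it.it_value = it.it_interval;
    prof_ticks = 0;
    if (setitimer(ITIMER_PROF, &it, NULL) != 0) {
        sigaction(SIGPROF, &old_sa, NULL);
        return -1;
    }

    /* clock() measures CPU time, the same thing ITIMER_PROF counts down. */
    clock_t end = clock() + (clock_t)(cpu_ms * (CLOCKS_PER_SEC / 1000));
    volatile unsigned long sink = 0;
    while (clock() < end)
        sink++;

    memset(&off, 0, sizeof off);
    setitimer(ITIMER_PROF, &off, NULL);   /* disarm the timer */
    sigaction(SIGPROF, &old_sa, NULL);    /* restore the old handler */
    return prof_ticks;
}
```

Call it as count_prof_ticks(1000, 1000) and compare the result with the 1000 ticks you asked for; if the kernel rounds the interval up to its own timer tick you will get far fewer.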
I'd rather not add system calls. Instead, I envision a pseudo-device which you map several different times, each time specifying a window of the address space to profile. It can use the high-resolution timer in the RTC to get ticks more often than the normal timer interrupt. Updates happen in the driver, so you no longer spend 30% of execution time in __mcount_internal.
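The per-tick work in such a driver would be tiny. This is only a sketch of the bookkeeping, with struct prof_window and prof_record() invented for the purpose; no such device or interface exists as far as I know. Each mapping contributes one window, and a sampled PC that lands in a gap between windows is simply dropped.

```c
#include <stddef.h>
#include <stdint.h>

/* One profiled window of the address space (hypothetical layout). */
struct prof_window {
    uintptr_t start;   /* first address covered */
    uintptr_t end;     /* one past the last address covered */
    unsigned  shift;   /* log2 of bytes per histogram bin */
    uint32_t *bins;    /* counters, enough to cover (end - start) >> shift */
};

/* Credit one sampled PC to whichever window contains it.
 * Returns 1 if the sample was counted, 0 if it fell into a gap. */
static int prof_record(struct prof_window *w, size_t nwin, uintptr_t pc)
{
    for (size_t i = 0; i < nwin; i++) {
        if (pc >= w[i].start && pc < w[i].end) {
            w[i].bins[(pc - w[i].start) >> w[i].shift]++;
            return 1;
        }
    }
    return 0;
}
```

With one window per mapped object (the executable plus each shared library), the gaps between them stop mattering, which is the part I suspect profil(2) can't do.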
GCC/i386 has a stupid bug where it clobbers %edx on every function entry when compiling with profiling. This breaks -mregparm, which can pass an argument in %edx. Okay, that doesn't affect very many people, but it still needs to get fixed.