There are times when you stare at disassembly, wondering just how your 'optimization' has managed to cause a tenfold degradation in performance. "This version can't possibly be slower: look, the old version has several full function calls per loop, has to rotate words to get the right bit out, and more besides..."
It turns out that I was completely right about this. The old version only looked faster because the compiler had constant-folded my test case. The moral of the story: don't benchmark your optimization against something the compiler is smart enough to constant-fold, or you will discourage yourself for no reason.
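To make the trap concrete, here is a minimal sketch in C. It is not the code from the story: `rotr32`, `bit_by_rotate`, `sum_bits_constant` and `sum_bits_opaque` are hypothetical names, and the bit-extraction routine is just a stand-in for "the old version". The first loop feeds the routine a literal, so an optimizing compiler is free to work out the whole answer at build time. The second reads its input through a `volatile` object, which is one common way to make the compiler treat the value as unknown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the "old" routine: rotate a word right by r bits. */
static uint32_t rotr32(uint32_t x, unsigned r) {
    r &= 31u;
    /* The second mask keeps the shift count below 32 when r == 0. */
    return (x >> r) | (x << ((32u - r) & 31u));
}

/* Extract bit n by rotating it down to the lowest position. */
static unsigned bit_by_rotate(uint32_t word, unsigned n) {
    assert(n < 32u);
    return (unsigned)(rotr32(word, n) & 1u);
}

/* Misleading benchmark body: every input is a compile-time constant,
   so the optimizer may fold the whole loop into a single number. */
unsigned sum_bits_constant(void) {
    unsigned total = 0;
    for (unsigned n = 0; n < 32u; n++) {
        total += bit_by_rotate(0xDEADBEEFu, n);
    }
    return total;
}

/* Fairer benchmark body: the word is read back through a volatile
   object, so the compiler cannot assume its value at build time. */
unsigned sum_bits_opaque(uint32_t input) {
    volatile uint32_t opaque = input;
    uint32_t word = opaque;
    unsigned total = 0;
    for (unsigned n = 0; n < 32u; n++) {
        total += bit_by_rotate(word, n);
    }
    return total;
}
```

Both functions return the same value for the same word, so the difference only shows up in the generated code. Whether a given compiler actually folds the first loop depends on the compiler and its flags, so check the disassembly of whatever you are timing before trusting the numbers.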