In response to a problem at work where the new version of a program was running much slower than the released one, I looked at the performance of openssl doing AES on recent Intel x86_64. We had carefully ensured that SSL built using the hand-optimized x86_64 assembly version of the code. Imagine our surprise at finding that the C version runs twice as fast. I had tears from laughing.
How could this be? I interpreted it to mean that somebody at Intel made sure the compiler for x86 -> internal RISC (implemented in silicon) recognizes the instruction sequence generated by Gcc for AES, and pastes in its place a hand-coded sequence of internal operations. This means, in turn, that any new cipher that doesn't much resemble AES will be unfairly handicapped by running much more slowly on current Intel chips. It might mean that if Gcc were to improve its optimization to the point that the chip doesn't recognize the sequence it generates as AES, performance might drop by half or more. I wonder which chip version first got this optimization.
[Update: Or, it means that L1 cache alignment came out unlucky for the assembly code, in that particular build. Or something.]