I was reading the documentation for the different x86 CPUs regarding my SIMD work, and I have started comparing the latencies. I was shocked just how bad the Pentium 4 is.
The Pentium 4 has really horrible latencies, in particular where it really matters: at the SIMD instructions. For example, movq has a latency of 6 on the Pentium 4 (VIA C3: 1, Pentium 3: 1, Athlon: 2). This means moving one MMX register to another, no memory stalls or AGIs included!
Now movq is pretty important because the x86 architecture uses source+destination addressing in the instructions, not source1+source2+destionation as most RISC CPUs, so you need to copy data around all the time to get something done. This means that you have to find five other instructions to schedule between the movq and the next instruction to keep the pipeline filled. That is next to impossible in typical applications. There are not enough registers to unroll a typical loop four times.
No wonder the Pentium 4 performs so badly per clock cycle in comparison to, well, everything else on the market. I'll stick with my Athlon, thank you. Until I can buy a Hammer, that is ;-)
BTW: Since I mentioned it here, I put the 3dnow vorbis decoder diff on my home page, and 46 people downloaded it so far. Whee! Now I need to find the time to do it again in SSE, that should speed things up even more on my Athlon XP.
@nymia: you describe the code slave, not the coder. You describe someone who does it because he needs the money, not because he likes to do it. Don't just copy other people's code; in most cases it turns out those others didn't know what they were doing, too. Read their code and understand why they did it like that. Then you can copy from yourself. Only copy from others after you have completely understood what and why they were doing.