SIMD hacking is starting to be a lot of fun. That, and trying to find clever ways to avoid branches. I spent the last few days hacking vorbis to make it faster on my slow C3. Vorbis is completely float based, and the C3 has 3dnow, so this was a good opportunity to learn 3dnow. So far I have converted the q&p loop in vorbis_lsp_to_curve, the overlap/add (only large/large) and copy sections of vorbis_synthesis_blockin, vorbis_apply_window, mdct_butterfly_generic and the 2-channel same-endianness segment of ov_read, and netted a 10% speed-up on my Athlon. The C3 gets about the same speed-up, which is interesting: the C3's FPU runs at only half the CPU clock speed, so 3dnow should be a bigger gain there. Maybe vorbis is limited by RAM bandwidth and cache misses and not by the FPU?
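To give an idea of what "avoiding branches" can look like in something like the float-to-16-bit conversion at the end of a decoder, here is a minimal sketch in plain C. It is not the libvorbis code and not my 3dnow version: the names clip_branchy, clip_branchless and floats_to_s16_stereo are made up for illustration, the conversion truncates instead of rounding, and it assumes two's complement integers. Whether the compiler really emits branch-free code for it depends on the compiler and target.

```c
#include <assert.h>
#include <stdint.h>

/* Reference version: clip a sample to the int16 range with branches. */
int16_t clip_branchy(int32_t v)
{
    if (v > 32767)  return 32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* Branch-free version (hypothetical helper, assumes two's complement).
 * A comparison yields 0 or 1; negating it gives a mask of all zeros
 * or all ones, which then selects between the sample and the limit. */
int16_t clip_branchless(int32_t v)
{
    int32_t over  = -(int32_t)(v > 32767);    /* all ones if too large */
    v = (32767 & over) | (v & ~over);

    int32_t under = -(int32_t)(v < -32768);   /* all ones if too small */
    v = (-32768 & under) | (v & ~under);

    return (int16_t)v;
}

/* Interleave two float channels (nominally in [-1, 1)) into 16-bit
 * stereo. Truncates toward zero for simplicity; a real decoder would
 * round. Inputs far outside the nominal range are not handled. */
void floats_to_s16_stereo(const float *left, const float *right,
                          int16_t *out, int n)
{
    assert(left && right && out && n >= 0);
    for (int i = 0; i < n; i++) {
        out[2 * i]     = clip_branchless((int32_t)(left[i]  * 32768.0f));
        out[2 * i + 1] = clip_branchless((int32_t)(right[i] * 32768.0f));
    }
}
```

The mask-and-select pattern is also how a packed SIMD compare is normally used: the compare instruction produces a per-element mask, and AND/OR operations pick the result without any jump.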
I have to say that the AMD documentation is much better than the Intel documentation, especially about SIMD. I'll look for some SSE docs on AMD's web page, because the Intel SSE docs are even worse than their other docs, and the Intel web site is less forthcoming too. At least all the necessary documents can be downloaded for free and without registration (and as PDF and not winword) ;-)
Anyway, during my SIMD hacking I found that I need a good assembly level debugger, a good profiler and a good stall simulator. For profiling I'm using gcov right now, but it won't tell me where the time is spent, only what code is executed often -- close but no cigar. gcov is too imprecise, and hrprof seems to get the timings wrong. If someone knows a free tool where I can specify a target CPU, run my code and have it tell me which assembly instructions caused which stalls and for what reason, that would be most helpful. Something like this should be relatively easy to do for the bochs people. Or for valgrind, once someone adds MMX, SSE and 3dnow support to their JIT engine.
By the way: a big hooray for valgrind! What a great piece of software! If you develop software, use valgrind!
I wonder how much crypto bulk cipher performance could be gained using MMX and SSE. SSE is next on my list. Hacking code is great: every day you have the opportunity to learn new exciting things! ;-)
If you are new to SIMD and want to see how really skillful people do it, have a look at the ffmpeg sources. I find the altivec stuff (in libavcodec/ppc) particularly impressive.