The first mass-market computer with a "letter quality" display (which I define as 192 dpi or higher) was released today. Not surprisingly, it's a handheld. Too bad it ships with such a lousy OS.
AltiVec and SSE2
One of the projects I've been working on lately is SIMD-optimized versions of my inkjet error diffusion code. It'll be released soon, so all you guys can take a look, but in the meantime here are my impressions.
AltiVec and SSE2 are very, very similar. Both operate on 128-bit words, which can be segmented into four single-precision floats or 32-bit ints, eight shorts, or sixteen chars (plus a couple more formats, depending on the chip). In both cases, the raw computational bandwidth available is absolutely stunning - well-tuned code will run at two instructions per clock cycle. That's something like 24 GFLOPS theoretical peak for a 3 GHz P4. Wow.
Not surprisingly, it's tricky to get that peak performance. A lot depends on how parallel the problem is. A lot of 2D graphics code is embarrassingly parallel, which makes it pretty easy to get good SIMD performance. Not so for error diffusion algorithms, which have some very tight serialization constraints. I get around this by running four planes at a time. This is a very reasonable approach for printers such as my Epson Photo 2200, which uses 7 inks (so only 1/8 of the compute bandwidth is wasted), but it does make things a bit trickier.
You really feel the constraint of pipeline latencies. On the P4, you can't use the result of a load until 6 cycles after the load instruction (and that's assuming it hits in L1 cache). Considering that ideally you're issuing two instructions per cycle, that means you can't use the results of a load until twelve instructions down. That's a lot, especially when you've only got 8 registers. On the flip side, the bandwidth is incredible. You can issue one load operation per cycle, for an awe-inspiring peak L1 bandwidth of 48 GB/s.
Bottom line: running 4 planes in AltiVec is about the same speed as running one plane in the old C code. SSE2 is about 1.5 times the speed. So, at least for this problem, the potential for speedup is greater on the P4 architecture than on the G4. I haven't analyzed the code closely, but I suspect that the main culprit is the conditional branches in the C version of the code, which have all been replaced by bitwise logical operations in the vectorized version. Mispredicted conditional branches are performance death on the P4.
As I've said, I found the two SIMD approaches more alike than different. Here are some differences I noted, though:
- AltiVec feels cleaner, richer, and better designed, even though SSE2 has more instructions. I'm sure a big part of the problem is that it took several generations for SSE2 to evolve - I consider MMX seriously underpowered, and SSE (Pentium III) lacks packed integer operations, which are critically important for image processing.
- AltiVec has 32 registers; SSE2 has 8.
- The tools for AltiVec are more mature. I really appreciated having the C compiler schedule the instructions for me (using "intrinsics"). Intrinsics are available for SSE2 in bleeding-edge versions of GCC, but don't ship with RH9.
- It's easier to understand AltiVec performance; it's better documented, and tools like CHUD really help (I used amber and simg4).
- All that said, with currently available chips, the raw bandwidth of the P4 outstrips the G4/G5. While the G4's implementation of AltiVec is excellent, its clock speed is pitifully slow by today's standards. The G5 runs at a faster clock, but takes a step backward in how much gets done per cycle (in many more cases, only one instruction can be issued per clock).
I think both architectures are becoming reasonably stable, but it's still easy to find computers that don't support SIMD well, especially laptops and the cool-running Via chips. My desktop is a dual Athlon, which is sadly SSE-only. I also hear that AMD64's implementation of SSE2 is lackluster. So, the performance win still depends a lot on the particulars of the system you're using. I suspect that'll improve with time, as newer models phase in.
SIMD and graphics hardware represent two significantly different approaches to 2D graphics optimization, with different strengths and weaknesses. I feel that SIMD is ultimately the bigger win for printing applications, largely because it easily accommodates sophisticated color transformations and so on. Even so, the raw bandwidth of graphics hardware will lead to great performance in interactive display applications. I'd be happy to stake out the SIMD-optimized rendering territory, while largely leaving optimization through offloading to the video card to Xrender/Cairo. In any case, it looks like some fun times are ahead!