Programming trivia: 4x4 integer matrix transpose in SSE2
The Intel SSE2 intrinsics has a macro _MM_TRANSPOSE4_PS which performs a matrix transposition on a 4x4 array represented by elements in 4 SSE registers. However, it doesn't work with integer registers because Intel intrinsics make a distinction between integer and floating point SSE registers. Theoretically one could cast and use the floating point operations, but it seems quite plausible that this will not round trip properly; for instance if one of your integer values happens to have the same value as a 32-bit IEEE denormal.
However it is easy to do with the punpckldq, punpckhdq, punpcklqdq, and punpckhqdq instructions; code and diagrams ahoy.