AMD 250 User Manual

Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005

Chapter 9: Optimizing with SIMD Instructions

9.17 Optimized 4 × 4 Matrix Multiplication on 4 × 1 Column Vector Routines

Optimization

Transpose the rotation matrix to eliminate the need to accumulate floating-point values in an XMM
register.

Application

This optimization applies to:

• 32-bit software
• 64-bit software

Rationale

The multiplication of a 4 × 4 matrix with a 4 × 1 vector is commonly used in 3-D graphics for
geometric transformation (translating, scaling, rotating, and applying perspective to 3-D points
represented in homogeneous coordinates). Efficiency in single-precision matrix multiplication can be
enhanced by use of SIMD instructions to increase throughput, but there are other general
optimizations that can be implemented to further increase performance. The first optimization is the
transposition of the rotation matrix, such that column n of the matrix becomes row n and row m
becomes column m. This optimization does not benefit 3DNow! technology code (3DNow! technology
has extended instructions that preclude the need for this optimization), but does benefit SSE code.
There are no SSE or SSE2 instructions that sum the floats or doubles held within a single XMM
register; for this reason, the matrix must be transposed. If the rotation matrix is not transposed,
then the dot product of a row of the matrix with the column vector requires accumulating the four
floating-point products across the elements of one XMM register. The multiplication upon the
column vector is illustrated here:

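$$
\mathrm{tr}(R) \times v
= \mathrm{tr}\begin{bmatrix}
r_{00} & r_{01} & r_{02} & r_{03} \\
r_{10} & r_{11} & r_{12} & r_{13} \\
r_{20} & r_{21} & r_{22} & r_{23} \\
r_{30} & r_{31} & r_{32} & r_{33}
\end{bmatrix} \times v
= \begin{bmatrix}
r_{00} & r_{10} & r_{20} & r_{30} \\
r_{01} & r_{11} & r_{21} & r_{31} \\
r_{02} & r_{12} & r_{22} & r_{32} \\
r_{03} & r_{13} & r_{23} & r_{33}
\end{bmatrix} \times
\begin{bmatrix} v_0 \\ v_1 \\ v_2 \\ v_3 \end{bmatrix}
= \begin{bmatrix} v'_0 \\ v'_1 \\ v'_2 \\ v'_3 \end{bmatrix}
$$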

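$$
\begin{bmatrix} v'_0 \\ v'_1 \\ v'_2 \\ v'_3 \end{bmatrix}
= \underbrace{\begin{bmatrix} r_{00} \times v_0 \\ r_{10} \times v_0 \\ r_{20} \times v_0 \\ r_{30} \times v_0 \end{bmatrix}}_{\text{Step 0}}
+ \underbrace{\begin{bmatrix} r_{01} \times v_1 \\ r_{11} \times v_1 \\ r_{21} \times v_1 \\ r_{31} \times v_1 \end{bmatrix}}_{\text{Step 1}}
+ \underbrace{\begin{bmatrix} r_{02} \times v_2 \\ r_{12} \times v_2 \\ r_{22} \times v_2 \\ r_{32} \times v_2 \end{bmatrix}}_{\text{Step 2}}
+ \underbrace{\begin{bmatrix} r_{03} \times v_3 \\ r_{13} \times v_3 \\ r_{23} \times v_3 \\ r_{33} \times v_3 \end{bmatrix}}_{\text{Step 3}}
$$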
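
The four steps above can be sketched with SSE intrinsics in C. This is an illustrative sketch under stated assumptions, not the guide's own listing: the function name mat4_mul_vec4_sse and its parameter names are hypothetical, and the caller is assumed to supply the matrix in the transposed layout, 16-byte aligned, so that the four elements multiplied by v0 in Step 0 are contiguous in memory, followed by those for Step 1, and so on.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, ... */

/*
 * Hypothetical helper (name and signature are illustrative only).
 * cols: 16 floats, 16-byte aligned, laid out so that cols[4*j + i] is
 *       element r(i,j); each group of four floats is the column used
 *       in Step j. This layout is an assumption of the sketch.
 * v:    the four input components v0..v3.
 * out:  receives v'0..v'3 (no alignment required).
 */
static void mat4_mul_vec4_sse(const float *cols, const float *v, float *out)
{
    /* Step 0: aligned load (the MOVAPS form) times v0 broadcast to all lanes. */
    __m128 acc = _mm_mul_ps(_mm_load_ps(cols + 0), _mm_set1_ps(v[0]));

    /* Steps 1-3: each product is added lane by lane (vertical add),
       so no sum across the elements of a single register is needed. */
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(cols + 4),  _mm_set1_ps(v[1])));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(cols + 8),  _mm_set1_ps(v[2])));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(cols + 12), _mm_set1_ps(v[3])));

    _mm_storeu_ps(out, acc);  /* unaligned store of v'0..v'3 */
}
```

For example, with r(i,j) taken from the rows {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16} and v = {1, 2, 3, 4}, the sketch yields {30, 70, 110, 150}. Every addition combines two registers element by element, which is why the transposed layout removes the need to accumulate values within one XMM register.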

In each step above, the elements of the rotation matrix can be loaded into an XMM register with the
MOVAPS instruction, assuming the rotation matrix begins at a 16-byte-aligned memory location.
Transposition of the rotation matrix eliminates the need to accumulate the floating-point values in an