Optimizing with SIMD Instructions
Chapter 9
25112
Rev. 3.06
September 2005
Software Optimization Guide for AMD64 Processors
9.17 Optimized 4 × 4 Matrix Multiplication on 4 × 1 Column Vector Routines
Optimization
Transpose the rotation matrix to eliminate the need to accumulate floating-point values in an XMM 
register.
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale
The multiplication of a 4 × 4 matrix with a 4 × 1 vector is commonly used in 3-D graphics for 
geometric transformation (translating, scaling, rotating, and applying perspective to 3-D points 
represented in homogeneous coordinates). Efficiency in single-precision matrix multiplication can be 
enhanced by use of SIMD instructions to increase throughput, but there are other general 
optimizations that can be implemented to further increase performance. The first optimization is the 
transposition of the rotation matrix such that the column n of the matrix becomes the row n and the 
row m becomes the column m. This optimization does not benefit 3DNow! technology code (3DNow! 
technology has extended instructions that preclude the need for this optimization), but does benefit 
SSE code. SSE and SSE2 have no instruction that sums the floats or doubles held within a single 
XMM register; for this reason, the matrix must be transposed. If the rotation matrix is not transposed, 
then the dot product of each matrix row with the column vector requires accumulating four 
floating-point values within one XMM register. The multiplication of the transposed matrix by the 
column vector is illustrated here:
               |r00 r01 r02 r03|       |r00 r10 r20 r30|   |v0|   |v'0|
tr(R) x v = tr |r10 r11 r12 r13| x v = |r01 r11 r21 r31| x |v1| = |v'1|
               |r20 r21 r22 r23|       |r02 r12 r22 r32|   |v2|   |v'2|
               |r30 r31 r32 r33|       |r03 r13 r23 r33|   |v3|   |v'3|
          Step 0       Step 1       Step 2       Step 3
|v'0|   |r00 x v0|   |r10 x v1|   |r20 x v2|   |r30 x v3|
|v'1| = |r01 x v0| + |r11 x v1| + |r21 x v2| + |r31 x v3|
|v'2|   |r02 x v0|   |r12 x v1|   |r22 x v2|   |r32 x v3|
|v'3|   |r03 x v0|   |r13 x v1|   |r23 x v2|   |r33 x v3|
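The product tr(R) x v can be sketched in C with SSE intrinsics as follows. This is a minimal illustration, not the guide's own routine. The function name mat4_transposed_mul_vec4 is hypothetical, R is assumed to be stored row-major, and R and out are assumed to be 16-byte aligned. Step j loads row j of R as it sits in memory (which is column j of tr(R)), multiplies it by v[j] broadcast to all four lanes, and adds the result to a running vector sum, so no values are ever summed within a single register. Broadcasting with _mm_set1_ps is a convenience chosen for this sketch. A scalar fallback with the same step order is included for targets without SSE.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper (the name is not from the guide): out = tr(R) x v.
 * R holds 16 floats in row-major order. R and out are assumed to be
 * 16-byte aligned; v is read one scalar at a time and may be unaligned. */
static void mat4_transposed_mul_vec4(const float *R, const float *v, float *out)
{
    /* Precondition assumed by this sketch: aligned loads and stores. */
    assert(((uintptr_t)R & 15u) == 0 && ((uintptr_t)out & 15u) == 0);

#if defined(__SSE__)
    /* Each step: aligned load of one stored row, times one broadcast v[j]. */
    __m128 acc = _mm_mul_ps(_mm_load_ps(R + 0), _mm_set1_ps(v[0]));            /* Step 0 */
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(R + 4),  _mm_set1_ps(v[1]))); /* Step 1 */
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(R + 8),  _mm_set1_ps(v[2]))); /* Step 2 */
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(R + 12), _mm_set1_ps(v[3]))); /* Step 3 */
    _mm_store_ps(out, acc);
#else
    /* Scalar fallback: same sum of four row-times-scalar products. */
    for (int i = 0; i < 4; ++i)
        out[i] = R[i] * v[0] + R[4 + i] * v[1] + R[8 + i] * v[2] + R[12 + i] * v[3];
#endif
}
```

A caller would declare the matrix and result with 16-byte alignment (for example with C11 _Alignas(16)) and pass them together with the input vector. The accumulation happens lane by lane across the four steps, which is why no horizontal add is needed.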
In each step above, the elements of the rotation matrix can be loaded into an XMM register with the 
MOVAPS instruction, assuming the rotation matrix begins at a 16-byte-aligned memory location. 
Transposition of the rotation matrix eliminates the need to accumulate the floating-point values in an