AMD Typewriter x86 User Manual

AMD Athlon™ Processor x86 Code Optimization 
22007E/0—November 1999
Completely Unroll Small Loops
Take advantage of the AMD Athlon processor’s large, 64-Kbyte
instruction cache and completely unroll small loops. Unrolling
loops can improve performance, especially when the loop body is
small, because the loop overhead is then a significant share of
the work. Many compilers do not unroll loops aggressively. For
loops that have a small, fixed loop count and a small loop body,
completely unrolling the loops at the source level is recommended.
Example 1 (Avoid):  
// 3D-transform: multiply vector V by 4x4 transform matrix M
for (i=0; i<4; i++) {
   r[i] = 0;
   for (j=0; j<4; j++) {
      r[i] += M[j][i]*V[j];
   }
}
Example 2 (Preferred):  
// 3D-transform: multiply vector V by 4x4 transform matrix M
r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] +
       M[3][0]*V[3];
r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] +
       M[3][1]*V[3];
r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] +
       M[3][2]*V[3];
r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] +
       M[3][3]*V[3];
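A hand-unrolled routine is easy to get wrong by one index, so it can help to check it against the loop form on the same input. The following is a minimal sketch of that check, assuming float elements; the function names transform_loop and transform_unrolled are hypothetical and do not come from this guide.

```c
/* Loop form (Example 1): r[i] = sum over j of M[j][i] * V[j]. */
void transform_loop(float M[4][4], const float V[4], float r[4])
{
    for (int i = 0; i < 4; i++) {
        r[i] = 0.0f;
        for (int j = 0; j < 4; j++) {
            r[i] += M[j][i] * V[j];
        }
    }
}

/* Fully unrolled form (Example 2): same arithmetic, no loop counters
   and no branches. */
void transform_unrolled(float M[4][4], const float V[4], float r[4])
{
    r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] + M[3][0]*V[3];
    r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] + M[3][1]*V[3];
    r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] + M[3][2]*V[3];
    r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] + M[3][3]*V[3];
}
```

Calling both functions with the same M and V and comparing the four results is a cheap regression check. Whether the unrolled form is actually faster depends on the compiler and its options, so it is worth measuring rather than assuming.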
Avoid Unnecessary Store-to-Load Dependencies
A store-to-load dependency exists when data is stored to
memory, only to be read back shortly thereafter. See
“Store-to-Load Forwarding Restrictions” on page 51 for more
details. The AMD Athlon processor contains hardware to
accelerate such store-to-load dependencies, allowing the load to
obtain the store data before it has been written to memory.
However, it is still faster to avoid such dependencies altogether
and keep the data in an internal register. 
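As an illustration of keeping data in a register, the sketch below computes a running sum in two ways. It is a minimal example under stated assumptions: the function names and the choice of a running sum are hypothetical, and the input and output arrays are assumed not to overlap. The first version re-reads the element it stored in the previous iteration; the second carries the running total in a local variable, which the compiler can usually hold in a register.

```c
#include <stddef.h>

/* Each iteration loads out[k - 1], which the previous iteration just
   stored: a store-to-load dependency through memory. */
void prefix_sum_via_memory(const double *in, double *out, size_t n)
{
    if (n == 0) {
        return;
    }
    out[0] = in[0];
    for (size_t k = 1; k < n; k++) {
        out[k] = out[k - 1] + in[k];
    }
}

/* Same results, but the running total lives in a local variable, so
   the loop only stores to out[] and never reads it back. */
void prefix_sum_via_register(const double *in, double *out, size_t n)
{
    double total = 0.0;
    for (size_t k = 0; k < n; k++) {
        total += in[k];
        out[k] = total;
    }
}
```

Some compilers may perform this transformation themselves in simple cases, but possible aliasing between the two arrays can prevent it, which is why writing the temporary explicitly is the more dependable choice.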
Avoiding store-to-load dependencies is especially important if
they are part of a long dependency chain, as might occur in a
recurrence computation. If the dependency occurs while
operating on arrays, many compilers are unable to optimize the