AMD Typewriter x86 사용자 설명서

Completely Unroll Small Loops

AMD Athlon™ Processor x86 Code Optimization

22007E/0—November 1999

Completely Unroll Small Loops

Take advantage of the AMD Athlon processor’s large, 64-Kbyte
instruction cache and completely unroll small loops. Unrolling
loops can be beneficial to performance, especially if the loop
body is small which makes the loop overhead significant. Many
compilers are not aggressive at unrolling loops. For loops that
have a small fixed loop count and a small loop body, completely
unrolling the loops at the source level is recommended.

Example 1 (Avoid):

// 3D-transform: multiply vector V by 4x4 transform matrix M
for (i=0; i<4; i++) {
r[i] = 0;
for (j=0; j<4; j++) {
r[i] += M[j][i]*V[j];
}
}

Example 2 (Preferred):

// 3D-transform: multiply vector V by 4x4 transform matrix M
r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] +

M[3][0]*V[3];

r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] +

M[3][1]*V[3];

r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] +

M[3][2]*V[3];

r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] +

M[3][3]*v[3];

Avoid Unnecessary Store-to-Load Dependencies

A store-to-load dependency exists when data is stored to
m e m o ry, o n ly t o b e re a d b a ck s h o r t ly t h e re a f t e r. S e e
“Store-to-Load Forwarding Restrictions” on page 51 for more
details. The AMD Athlon processor contains hardware to
accelerate such store-to-load dependencies, allowing the load to
obtain the store data before it has been written to memory.
However, it is still faster to avoid such dependencies altogether
and keep the data in an internal register.

Avoiding store-to-load dependencies is especially important if
they are part of a long dependency chains, as might occur in a
recurrence computation. If the dependency occurs while
operating on arrays, many compilers are unable to optimize the