“dotted” with a column of B^T. Once this is done, the rows of matrix A are “dotted” with the next column of B^T, and the process is repeated through all the columns of B^T.
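To make the access pattern concrete, here is a minimal scalar C sketch of this formulation. The identifiers (a, bt, ct) and the fixed order NN = 32 are placeholders chosen for this illustration, not code from the guide's own listings, and no SIMD or prefetch optimizations are applied yet.

#define NN 32                              /* matrix order used in this sketch */

/* Compute C^T by "dotting" each row of A with each row of B^T.
   All matrices are stored row-major as arrays of doubles.                     */
void matmul_bt(const double a[NN][NN],     /* matrix A                         */
               const double bt[NN][NN],    /* matrix B transposed (B^T)        */
               double ct[NN][NN])          /* result, stored transposed (C^T)  */
{
    for (int col = 0; col < NN; col++) {        /* rows of B^T = columns of B  */
        for (int row = 0; row < NN; row++) {    /* rows of A                   */
            double dot = 0.0;
            for (int k = 0; k < NN; k++)
                dot += a[row][k] * bt[col][k];  /* row of A . row of B^T       */
            ct[col][row] = dot;
        }
    }
}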
From a performance standpoint, there are several caveats to recognize, as follows:
•   Once all the rows of A have been multiplied with the first column of B, all the rows of A are in the cache, and subsequent accesses to them do not cause cache misses.
•   The rows of B^T are brought into the cache by “dotting” the first four rows of A with each row of B^T in the Ctr_row_num for-loop.
•   The elements of C^T are not initially in the cache, and every time a new set of four rows of A is “dotted” with a new row of B^T, the processor has to wait for C^T to arrive in the cache before the results can be written.
You can address the last two caveats by prefetching to improve performance. However, to exploit prefetching efficiently, you must structure the code (a simple illustration follows this list) to issue the prefetch instructions such that:
•   Enough time is provided for memory requests sent out through prefetch requests to bring data into the processor’s cache before the data is needed.
•   The loops containing the prefetch instructions are ordered to issue sufficient prefetch instructions to fetch all the pertinent data.
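As an illustration of these two conditions, the following C fragment prefetches a simple array with a fixed lead distance. It is only a sketch: the GCC/Clang __builtin_prefetch intrinsic stands in for the PREFETCH instruction, and the names and the lead distance PF_AHEAD are arbitrary choices for this example, not values from the guide.

#define DOUBLES_PER_LINE 8      /* 64-byte cache line / 8 bytes per double     */
#define PF_AHEAD         4      /* lead distance in cache lines; tune as needed */

double sum_with_prefetch(const double *x, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        /* Issue exactly one prefetch per cache line so every line of the
           array is covered, and issue it PF_AHEAD lines early so memory has
           time to respond before the data is consumed.                       */
        if (i % DOUBLES_PER_LINE == 0 && i + PF_AHEAD * DOUBLES_PER_LINE < n)
            __builtin_prefetch(&x[i + PF_AHEAD * DOUBLES_PER_LINE], 0, 3);
        sum += x[i];
    }
    return sum;
}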
The matrix order of 32 is not a coincidence. A double-precision number consists of 8 bytes. Prefetch instructions bring memory into the processor in chunks called cache lines, each consisting of 64 bytes (or eight double-precision numbers). Because a 32-element row of B^T occupies 256 bytes, or four cache lines, we need to issue four prefetch instructions to prefetch a row of B^T.

Consequently, when multiplying all 32 rows of A with a particular column of B, we want to arrange the for-loop that cycles through the rows of A such that it is repeated four times. To achieve this, we need to dot eight rows of A with a row of B^T every time we pass through the Ctr_row_num for-loop. Additionally, “dotting” eight rows of A upon a row of B^T produces eight doubles of C^T (that is, a full cache line).
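The loop structure this implies might look like the following C sketch. It only illustrates the blocking and the prefetch counts described above, using the GCC/Clang __builtin_prefetch intrinsic in place of the PREFETCH instruction; it is not the guide's own SIMD implementation, the identifiers are placeholders, and it assumes each 32-double row starts on a 64-byte boundary so that a row spans exactly four cache lines.

void matmul_bt_prefetch(const double a[32][32],
                        const double bt[32][32],
                        double ct[32][32])
{
    for (int col = 0; col < 32; col++) {          /* rows of B^T               */
        for (int blk = 0; blk < 32; blk += 8) {   /* Ctr_row_num loop, repeated
                                                     four times per row of B^T */
            /* Each pass prefetches one of the four cache lines of the NEXT
               row of B^T, so the four passes together cover the whole row.   */
            if (col + 1 < 32)
                __builtin_prefetch(&bt[col + 1][blk], 0, 3);

            /* Each pass also prefetches the cache line of C^T that the
               following pass will write (rw = 1 hints at a coming write).    */
            if (blk + 8 < 32)
                __builtin_prefetch(&ct[col][blk + 8], 1, 3);
            else if (col + 1 < 32)
                __builtin_prefetch(&ct[col + 1][0], 1, 3);

            /* Dot eight rows of A with the current row of B^T:
               8 rows x (32 multiplies + 32 adds) = 512 floating-point
               operations per pass, writing one full cache line of C^T.       */
            for (int row = blk; row < blk + 8; row++) {
                double dot = 0.0;
                for (int k = 0; k < 32; k++)
                    dot += a[row][k] * bt[col][k];
                ct[col][row] = dot;
            }
        }
    }
}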
Assume it takes 60 ns to retrieve data from memory; then we must ensure that at least this much time elapses between issuing the prefetch instruction and the processor loading that data into its registers. The dot product of eight rows of A with a row of B^T consists of 512 floating-point operations (dotting a single row of A with a row of B^T consists of 32 additions and 32 multiplications). The AMD Athlon, AMD Athlon 64, and AMD Opteron processors are capable of performing a maximum of two floating-point operations per clock cycle; therefore, it takes the processor no less than 256 clock cycles to process each Ctr_row_num for-loop.
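Spelled out, the arithmetic behind these figures is:

  8 rows × (32 additions + 32 multiplications) = 512 floating-point operations
  512 floating-point operations ÷ 2 operations per cycle = 256 clock cycles per Ctr_row_num pass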
Choosing a matrix order of 32 is convenient for these reasons (the arithmetic is checked in the sketch following this list):
•   All three matrices A, B^T, and C^T can fit into the processor’s 64-Kbyte L1 data cache (three 32 × 32 matrices of doubles occupy 3 × 8 Kbytes = 24 Kbytes).
•   On a 2-GHz processor running at full floating-point utilization, 128 ns elapse during the 256 clock cycles, considerably more than the 60 ns needed to retrieve the data from memory.
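These two figures are easy to verify. The short C program below simply restates the arithmetic as assertions; the constant names are invented for this check, and the 2-GHz clock and 60-ns memory latency are the assumptions stated above.

#include <assert.h>

enum {
    ORDER            = 32,
    BYTES_PER_MATRIX = ORDER * ORDER * 8,               /* 8 Kbytes            */
    TOTAL_BYTES      = 3 * BYTES_PER_MATRIX,            /* 24 Kbytes           */
    L1_BYTES         = 64 * 1024,                       /* 64-Kbyte L1 cache   */
    CYCLES_PER_PASS  = 256,
    CLOCK_MHZ        = 2000,                            /* 2-GHz processor     */
    NS_PER_PASS      = CYCLES_PER_PASS * 1000 / CLOCK_MHZ,  /* 128 ns          */
    MEM_LATENCY_NS   = 60
};

int main(void)
{
    assert(TOTAL_BYTES <= L1_BYTES);        /* all three matrices fit in L1    */
    assert(NS_PER_PASS >= MEM_LATENCY_NS);  /* prefetch has time to complete   */
    return 0;
}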