“dotted” with a column of B^T. Once this is done, the rows of matrix A are “dotted” with the next column of B^T, and the process is repeated through all the columns of B^T.
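To make the access pattern concrete, here is a minimal scalar C sketch of this formulation. The identifiers (a, bt, ct) and the fixed order NN = 32 are placeholders chosen for this illustration, not code from the guide's own listings, and no SIMD or prefetch optimizations are applied yet.

#define NN 32                              /* matrix order used in this sketch */

/* Compute C^T by "dotting" each row of A with each row of B^T.
   All matrices are stored row-major as arrays of doubles.                     */
void matmul_bt(const double a[NN][NN],     /* matrix A                         */
               const double bt[NN][NN],    /* matrix B transposed (B^T)        */
               double ct[NN][NN])          /* result, stored transposed (C^T)  */
{
    for (int col = 0; col < NN; col++) {        /* rows of B^T = columns of B  */
        for (int row = 0; row < NN; row++) {    /* rows of A                   */
            double dot = 0.0;
            for (int k = 0; k < NN; k++)
                dot += a[row][k] * bt[col][k];  /* row of A . row of B^T       */
            ct[col][row] = dot;
        }
    }
}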
From a performance standpoint, there are several caveats to recognize, as follows:
•   Once all the rows of A have been multiplied with the first column of B, all the rows of A are in the cache, and subsequent accesses to them do not cause cache misses.
•   The rows of B^T are brought into the cache by “dotting” the first four rows of A with each row of B^T in the Ctr_row_num for-loop.
•   The elements of C^T are not initially in the cache, and every time a new set of four rows of A is “dotted” with a new row of B^T, the processor has to wait for C^T to arrive in the cache before the results can be written.
You can address the last two caveats by prefetching to improve performance. However, to exploit prefetching efficiently, you must structure the code (a simple illustration follows this list) to issue the prefetch instructions such that:
•   Enough time is provided for memory requests sent out through prefetch requests to bring data into the processor’s cache before the data is needed.
•   The loops containing the prefetch instructions are ordered to issue sufficient prefetch instructions to fetch all the pertinent data.
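As an illustration of these two conditions, the following C fragment prefetches a simple array with a fixed lead distance. It is only a sketch: the GCC/Clang __builtin_prefetch intrinsic stands in for the PREFETCH instruction, and the names and the lead distance PF_AHEAD are arbitrary choices for this example, not values from the guide.

#define DOUBLES_PER_LINE 8      /* 64-byte cache line / 8 bytes per double     */
#define PF_AHEAD         4      /* lead distance in cache lines; tune as needed */

double sum_with_prefetch(const double *x, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        /* Issue exactly one prefetch per cache line so every line of the
           array is covered, and issue it PF_AHEAD lines early so memory has
           time to respond before the data is consumed.                       */
        if (i % DOUBLES_PER_LINE == 0 && i + PF_AHEAD * DOUBLES_PER_LINE < n)
            __builtin_prefetch(&x[i + PF_AHEAD * DOUBLES_PER_LINE], 0, 3);
        sum += x[i];
    }
    return sum;
}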
The matrix order of 32 is not a coincidence. A double-precision number consists of 8 bytes. Prefetch instructions bring memory into the processor in chunks called cache lines, each consisting of 64 bytes (or eight double-precision numbers). Because a 32-element row of B^T occupies 256 bytes, or four cache lines, we need to issue four prefetch instructions to prefetch a row of B^T.

Consequently, when multiplying all 32 rows of A with a particular column of B, we want to arrange the for-loop that cycles through the rows of A such that it is repeated four times. To achieve this, we need to dot eight rows of A with a row of B^T every time we pass through the Ctr_row_num for-loop. Additionally, “dotting” eight rows of A upon a row of B^T produces eight doubles of C^T (that is, a full cache line).
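The loop structure this implies might look like the following C sketch. It only illustrates the blocking and the prefetch counts described above, using the GCC/Clang __builtin_prefetch intrinsic in place of the PREFETCH instruction; it is not the guide's own SIMD implementation, the identifiers are placeholders, and it assumes each 32-double row starts on a 64-byte boundary so that a row spans exactly four cache lines.

void matmul_bt_prefetch(const double a[32][32],
                        const double bt[32][32],
                        double ct[32][32])
{
    for (int col = 0; col < 32; col++) {          /* rows of B^T               */
        for (int blk = 0; blk < 32; blk += 8) {   /* Ctr_row_num loop, repeated
                                                     four times per row of B^T */
            /* Each pass prefetches one of the four cache lines of the NEXT
               row of B^T, so the four passes together cover the whole row.   */
            if (col + 1 < 32)
                __builtin_prefetch(&bt[col + 1][blk], 0, 3);

            /* Each pass also prefetches the cache line of C^T that the
               following pass will write (rw = 1 hints at a coming write).    */
            if (blk + 8 < 32)
                __builtin_prefetch(&ct[col][blk + 8], 1, 3);
            else if (col + 1 < 32)
                __builtin_prefetch(&ct[col + 1][0], 1, 3);

            /* Dot eight rows of A with the current row of B^T:
               8 rows x (32 multiplies + 32 adds) = 512 floating-point
               operations per pass, writing one full cache line of C^T.       */
            for (int row = blk; row < blk + 8; row++) {
                double dot = 0.0;
                for (int k = 0; k < 32; k++)
                    dot += a[row][k] * bt[col][k];
                ct[col][row] = dot;
            }
        }
    }
}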
Assume it takes 60 ns to retrieve data from memory; then we must ensure that at least this much time elapses between issuing the prefetch instruction and the processor loading that data into its registers. The dot product of eight rows of A with a row of B^T consists of 512 floating-point operations (dotting a single row of A with a row of B^T consists of 32 additions and 32 multiplications). The AMD Athlon, AMD Athlon 64, and AMD Opteron processors are capable of performing a maximum of two floating-point operations per clock cycle; therefore, it takes the processor no less than 256 clock cycles to process each Ctr_row_num for-loop.
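Spelled out, the arithmetic behind these figures is:

  8 rows × (32 additions + 32 multiplications) = 512 floating-point operations
  512 floating-point operations ÷ 2 operations per cycle = 256 clock cycles per Ctr_row_num pass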
Choosing a matrix order of 32 is convenient for these reasons (the arithmetic is checked in the sketch following this list):
•   All three matrices A, B^T, and C^T can fit into the processor’s 64-Kbyte L1 data cache (three 32 × 32 matrices of doubles occupy 3 × 8 Kbytes = 24 Kbytes).
•   On a 2-GHz processor running at full floating-point utilization, 128 ns elapse during the 256 clock cycles, considerably more than the 60 ns needed to retrieve the data from memory.
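These two figures are easy to verify. The short C program below simply restates the arithmetic as assertions; the constant names are invented for this check, and the 2-GHz clock and 60-ns memory latency are the assumptions stated above.

#include <assert.h>

enum {
    ORDER            = 32,
    BYTES_PER_MATRIX = ORDER * ORDER * 8,               /* 8 Kbytes            */
    TOTAL_BYTES      = 3 * BYTES_PER_MATRIX,            /* 24 Kbytes           */
    L1_BYTES         = 64 * 1024,                       /* 64-Kbyte L1 cache   */
    CYCLES_PER_PASS  = 256,
    CLOCK_MHZ        = 2000,                            /* 2-GHz processor     */
    NS_PER_PASS      = CYCLES_PER_PASS * 1000 / CLOCK_MHZ,  /* 128 ns          */
    MEM_LATENCY_NS   = 60
};

int main(void)
{
    assert(TOTAL_BYTES <= L1_BYTES);        /* all three matrices fit in L1    */
    assert(NS_PER_PASS >= MEM_LATENCY_NS);  /* prefetch has time to complete   */
    return 0;
}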