User Guide for AMD 250
Optimizing with SIMD Instructions
Chapter 9
25112
Rev. 3.06
September 2005
Software Optimization Guide for AMD64 Processors
9.5 Structuring Code with Prefetch Instructions to Hide Memory Latency
Optimization
When utilizing prefetch instructions, attend to:
• The time allotted (latency) for data to reach the processor between issuing a prefetch instruction and using the data.
• Structuring the code to best take advantage of prefetching.
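The first consideration reduces to simple arithmetic. The following is a minimal sketch, not taken from this guide: the helper names latency_in_cycles and prefetch_distance and the cycles-per-iteration figure are hypothetical and used only for illustration. It converts a memory latency into clock cycles, and then into the number of loop iterations by which a prefetch should lead the use of its data.

```c
#include <assert.h>

/* Hypothetical helper: clock cycles that elapse during one memory access.
 * cycles = latency_ns * clock_mhz / 1000, rounded up. */
unsigned latency_in_cycles(unsigned latency_ns, unsigned clock_mhz)
{
    return (latency_ns * clock_mhz + 999u) / 1000u;
}

/* Hypothetical helper: how many loop iterations ahead a prefetch must be
 * issued so that the data arrives before it is used (rounded up). */
unsigned prefetch_distance(unsigned latency_cycles, unsigned cycles_per_iteration)
{
    assert(cycles_per_iteration > 0);
    return (latency_cycles + cycles_per_iteration - 1u) / cycles_per_iteration;
}
```

Assuming a 2 GHz clock and a 60 ns memory latency, latency_in_cycles(60, 2000) gives 120 cycles. If each unrolled iteration took 30 cycles (an assumed figure), prefetch_distance(120, 30) would give a lead of 4 iterations.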
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale
Prefetch instructions bring the cache line containing a specified memory location into the processor
cache. (For more information on prefetch instructions, see “Prefetch Instructions” on page 104.)
Prefetching hides the main-memory load latency, which is typically about two orders of magnitude longer than a processor clock cycle (60 ns of latency at 2 GHz is 120 cycles).
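As an illustration of a prefetch issued ahead of use, the sketch below sums an array while requesting data a few cache lines in advance. It is a sketch under stated assumptions, not code from this guide: it uses the GCC/Clang builtin __builtin_prefetch in place of a hand-written prefetch instruction, it assumes a 64-byte cache line, and the function name sum_with_prefetch and the lines_ahead parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines, i.e. 8 doubles per line. */
#define DOUBLES_PER_LINE 8

/* Sums n doubles, requesting the cache line that lies `lines_ahead`
 * lines beyond the current position before that data is needed. */
double sum_with_prefetch(const double *data, size_t n, size_t lines_ahead)
{
    assert(data != NULL || n == 0);

    double sum = 0.0;
    size_t ahead = lines_ahead * DOUBLES_PER_LINE;

    for (size_t i = 0; i < n; i++) {
        /* Issue one prefetch per cache line, and only while the target
         * element is still inside the array. Arguments: 0 = prefetch for
         * reading, 3 = high temporal locality. */
        if (i % DOUBLES_PER_LINE == 0 && i + ahead < n)
            __builtin_prefetch(&data[i + ahead], 0, 3);
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint and does not change the computed result. The best value of lines_ahead depends on the memory latency and on how much work each iteration performs, so it is best determined by measurement.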
There are two types of loops:
Loop type           Description
Memory-limited      Data can be processed and requested faster than it can be
                    fetched from memory.
Processor-limited   Data can be requested and brought into the processor before
                    it is needed because considerable processing occurs during
                    each unrolled loop iteration.

The example below illustrates the importance of these considerations by multiplying a double-precision 32 × 32 matrix A with another 32 × 32 transposed double-precision matrix, Bᵀ; the result is returned in another 32 × 32 transposed double-precision matrix, Cᵀ. (The transposition of B and C is performed to efficiently access their elements because matrices in the C programming language are stored in row-major format. Doing the transposition in advance reduces the problem of matrix multiplication to one of computing several dot products, one for each element of the results matrix, Cᵀ. This “dotting” operation is implemented as the sum of pair-wise products of the elements of two equal-length vectors.) For this example, assume the processor clock speed is 2 GHz, and the memory latency is 60 ns. In this example, the rows of matrix A are repeatedly