Software Optimization Guide for AMD64 Processors, 25112 Rev. 3.06, September 2005
Chapter 9: Optimizing with SIMD Instructions
9.5 Structuring Code with Prefetch Instructions to Hide Memory Latency
Optimization
When using prefetch instructions, pay attention to:
- The time allotted (latency) for data to reach the processor between issuing a prefetch instruction and using the data.
- Structuring the code to best take advantage of prefetching.
Application
This optimization applies to:
- 32-bit software
- 64-bit software
Rationale
Prefetch instructions bring the cache line containing a specified memory location into the processor cache. (For more information on prefetch instructions, see "Prefetch Instructions" on page 104.) Prefetching hides the main-memory load latency, which is typically about two orders of magnitude longer than a processor clock cycle (for example, a 60-ns latency is 120 cycles at 2 GHz).
There are two types of loops:

  Loop type          Description
  Memory-limited     Data can be processed and requested faster than it can be fetched from memory.
  Processor-limited  Data can be requested and brought into the processor before it is needed, because considerable processing occurs during each unrolled loop iteration.

The example below illustrates the importance of these considerations: it multiplies a double-precision 32 × 32 matrix A with another 32 × 32 transposed double-precision matrix, B^T; the result is returned in another 32 × 32 transposed double-precision matrix, C^T. (The transposition of B and C is performed to efficiently access their elements, because matrices in the C programming language are stored in row-major format. Doing the transposition in advance reduces the problem of matrix multiplication to one of computing several dot products, one for each element of the result matrix, C^T. This "dotting" operation is implemented as the sum of pair-wise products of the elements of two equal-length vectors.) For this example, assume the processor clock speed is 2 GHz and the memory latency is 60 ns. In this example, the rows of matrix A are repeatedly