User Guide for AMD 250

Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005

Chapter 9: Optimizing with SIMD Instructions

9.5 Structuring Code with Prefetch Instructions to Hide Memory Latency

Optimization

When utilizing prefetch instructions, attend to:

• The time allotted (latency) for data to reach the processor between issuing a prefetch instruction
  and using the data.
• Structuring the code to best take advantage of prefetching.
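
To make the first consideration concrete, the following is a minimal sketch in C of the arithmetic behind choosing how far ahead to prefetch. It uses the 2 GHz clock speed and 60 ns memory latency that this section assumes for its example. The function names and the cycles-per-iteration figure are hypothetical and chosen only for illustration; they do not come from the guide.

```c
#include <assert.h>

/* Memory latency expressed in processor clock cycles (rounded up).
 * latency_ns: memory latency in nanoseconds.
 * clock_mhz:  processor clock in MHz (2 GHz = 2000 MHz).
 * Since 1 MHz is 1/1000 of a cycle per nanosecond, ns * MHz / 1000 = cycles. */
unsigned long latency_in_cycles(unsigned long latency_ns, unsigned long clock_mhz)
{
    return (latency_ns * clock_mhz + 999UL) / 1000UL;
}

/* Number of loop iterations ahead of its use that a prefetch must be issued
 * so the data has time to arrive. cycles_per_iteration is an assumed,
 * measured cost of one (unrolled) loop iteration. The result is rounded up. */
unsigned long prefetch_distance_iterations(unsigned long latency_cycles,
                                           unsigned long cycles_per_iteration)
{
    assert(cycles_per_iteration > 0);
    return (latency_cycles + cycles_per_iteration - 1UL) / cycles_per_iteration;
}
```

With the figures above, latency_in_cycles(60, 2000) gives 120 cycles. If one loop iteration took a hypothetical 8 cycles, the prefetch would need to be issued about 15 iterations before the data is used.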

Application

This optimization applies to:

• 32-bit software
• 64-bit software

Rationale

Prefetch instructions bring the cache line containing a specified memory location into the processor
cache. (For more information on prefetch instructions, see “Prefetch Instructions” on page 104.)
Prefetching hides the main memory load latency, which is typically about two orders of magnitude
larger than a processor clock cycle (a 60 ns latency at 2 GHz, for example, is 120 clock cycles).
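
As an illustration of the idea, the sketch below issues software prefetches a fixed distance ahead of the element currently being processed in a dot-product loop. It is not the guide's own code. It assumes a GCC- or Clang-compatible compiler and uses the `__builtin_prefetch` builtin in place of hand-written prefetch instructions. The 64-byte cache-line size and the prefetch distance are assumptions that would need tuning on real hardware, and `dot_product_prefetch` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed tuning values, not taken from the guide. */
#define DOUBLES_PER_LINE 8   /* 64-byte cache line / 8-byte double (assumed) */
#define PREFETCH_AHEAD   64  /* elements ahead of the current index to prefetch */

/* Sum of pair-wise products of two equal-length vectors, with software
 * prefetching of data that will be needed PREFETCH_AHEAD elements later. */
double dot_product_prefetch(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* One prefetch per cache line is enough, so prefetch only on every
         * DOUBLES_PER_LINE-th element, and never form an address past the
         * end of the arrays. */
        if (i % DOUBLES_PER_LINE == 0 && i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3); /* read, keep in cache */
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 3);
        }
        sum += a[i] * b[i];
    }
    return sum;
}
```

The prefetches change only when data arrives in the cache, not the result, so the function returns the same value as a plain loop. Whether it is faster depends on how well the prefetch distance matches the memory latency and the cost of each iteration.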

There are two types of loops:

Loop type            Description
Memory-limited       Data can be processed and requested faster than it can be fetched from memory.
Processor-limited    Data can be requested and brought into the processor before it is needed because
                     considerable processing occurs during each unrolled loop iteration.

The example below illustrates the importance of these considerations by multiplying a
double-precision 32 × 32 matrix, A, by another 32 × 32 transposed double-precision matrix, B; the
result is returned in another 32 × 32 transposed double-precision matrix, C. (The transposition of B
and C is performed to efficiently access their elements because matrices in the C programming
language are stored in row-major format. Doing the transposition in advance reduces the problem of
matrix multiplication to one of computing several dot-products, one for each element of the results
matrix, C. This “dotting” operation is implemented as the sum of pair-wise products of the elements
of two equal-length vectors.) For this example, assume the processor clock speed is 2 GHz, and the
memory latency is 60 ns. In this example, the rows of matrix A are repeatedly