Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
Chapter 9 Optimizing with SIMD Instructions
9.2 Improving Scalar SSE and SSE2 Floating-Point Performance with MOVLPD and MOVLPS When Loading Data from Memory
Optimization
Use the MOVLPS and MOVLPD instructions to move scalar floating-point data into the XMM 
registers prior to addition, multiplication, or other scalar instructions.
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale—Single Precision
The MOVSS instruction is used to move scalar single-precision floating-point data into the XMM 
registers prior to addition (ADDSS) and multiplication (MULSS) or other scalar instructions. In 
addition to loading a 32-bit floating-point value into the XMM register, the MOVSS instruction clears 
the upper 96 bits of the register. Clearing part of the XMM register is an inefficiency that you can 
bypass by using the MOVLPS instruction. MOVLPS loads two single-precision floating-point values 
(64 bits) from memory into the lower half of the XMM register and leaves the upper 64 bits unchanged.
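The difference between the two loads can be illustrated with compiler intrinsics. The following is a minimal sketch, assuming a C compiler targeting x86 with SSE enabled and the `<xmmintrin.h>` header; `_mm_load_ss` and `_mm_loadl_pi` are the intrinsics that correspond to MOVSS and MOVLPS, although a compiler may choose a different encoding. The function names `demo_load_ss` and `demo_loadl_pi` are hypothetical and are not part of the guide.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ss, _mm_loadl_pi */

/* Hypothetical helper: MOVSS-style load into a register that already holds data.
   After the load, lane 0 = src[0] and lanes 1..3 (the upper 96 bits) are zero. */
void demo_load_ss(const float *src, float out[4])
{
    assert(src != NULL && out != NULL);
    __m128 reg = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* lanes 0..3 = 1, 2, 3, 4 */
    reg = _mm_load_ss(src);                           /* previous contents are discarded */
    _mm_storeu_ps(out, reg);
}

/* Hypothetical helper: MOVLPS-style load into a register that already holds data.
   After the load, lanes 0..1 = src[0], src[1] and lanes 2..3 (the upper 64 bits)
   keep their previous values. Note that src[1] must be readable. */
void demo_loadl_pi(const float *src, float out[4])
{
    assert(src != NULL && out != NULL);
    __m128 reg = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* lanes 0..3 = 1, 2, 3, 4 */
    reg = _mm_loadl_pi(reg, (const __m64 *)src);      /* replaces only the lower 64 bits */
    _mm_storeu_ps(out, reg);
}
```

Calling both helpers on the two-element array {10, 20} shows the contrast: the MOVSS-style load produces lanes (10, 0, 0, 0), while the MOVLPS-style load produces lanes (10, 20, 3, 4).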
The latency of the MOVSS instruction is 3 cycles, whereas the latency of the MOVLPS instruction is 
2 cycles. The AMD Athlon™ 64 and AMD Opteron™ processors can perform two 64-bit loads per 
clock cycle. Two 64-bit MOVLPS loads can be issued in the same cycle, assuming the data is 8-byte 
aligned. Likewise, two MOVSS loads can be performed per cycle, but unlike MOVLPS, MOVSS requires 
additional operations to clear the upper bits of the register, and these operations interfere with the 
MULSS and ADDSS instructions. In processor-limited, floating-point-intensive code, using MOVLPS 
rather than MOVSS to load single-precision scalar data from memory can result in significant 
performance increases.
Consider the following caveats when using the MOVLPS instruction:
• When accessing 4-byte-aligned addresses that are not 8-byte aligned, MOVLPS loads take an 
additional cycle.
• Since MOVLPS loads two floating-point values instead of one, accessing the last floating-point 
value in a single-precision array attempts to load 4 bytes of additional memory directly after the 
end of the array, which may cause an access violation. To avoid an access violation, use MOVSS 
to access the last value in a single-precision array or store a dummy floating-point value at the end 
of the array.
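The second caveat can be handled by splitting the loop so that only the last element uses a MOVSS-style load. The following is a minimal sketch under the same assumptions as any intrinsics-based C code (x86 target, SSE enabled, `<xmmintrin.h>` available); the function name `sum_scalar_movlps` is hypothetical, and the compiler is free to pick different instructions than the intrinsics suggest.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_loadl_pi, _mm_load_ss, _mm_add_ss, _mm_cvtss_f32 */

/* Hypothetical example: scalar sum of a float array using 8-byte (MOVLPS-style)
   loads wherever they stay inside the array, and a 4-byte (MOVSS-style) load
   for the final element only. */
float sum_scalar_movlps(const float *a, size_t n)
{
    assert(a != NULL || n == 0);

    __m128 acc = _mm_setzero_ps();  /* running sum lives in lane 0 */
    __m128 x = _mm_setzero_ps();    /* load target; upper lanes are never used */
    size_t i = 0;

    /* For every element except the last, the 8-byte load reads a[i] and a[i+1],
       both of which are inside the array. Loads at addresses that are 4-byte
       but not 8-byte aligned are legal; the text above notes they take an
       additional cycle. */
    for (; i + 1 < n; ++i) {
        x = _mm_loadl_pi(x, (const __m64 *)(a + i));
        acc = _mm_add_ss(acc, x);   /* ADDSS-style add: only lane 0 participates */
    }

    /* Last element: a 4-byte load avoids touching memory past the array end. */
    if (i < n) {
        acc = _mm_add_ss(acc, _mm_load_ss(a + i));
    }

    return _mm_cvtss_f32(acc);      /* extract lane 0 */
}
```

The alternative named in the text, storing a dummy value after the last element, would let the loop run over all n elements without the tail case, at the cost of allocating n + 1 floats.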