Chapter 9 Optimizing with SIMD Instructions
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
9.12 Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow!™ Instructions
Optimization
For AMD Athlon, AMD Athlon 64, and AMD Opteron processors, use instructions that perform 
XOR operations (PXOR, XORPS, and XORPD) instead of multiplication instructions to change the 
sign bit of operands of SSE, SSE2, and 3DNow! instructions.
Application
This optimization applies to:
32-bit software
64-bit software
Rationale
On the AMD Athlon 64 and AMD Opteron processors, using XOR-type instructions allows for more 
parallelism, as these instructions can execute in either the FADD or FMUL pipe of the floating-point 
unit.
Single Precision
For single-precision operands, you can use either 3DNow! or SSE SIMD XOR operations. The latency of 
multiplying by -1.0 in 3DNow! is 4 cycles, while the latency of the PXOR instruction is only 
2 cycles. Similarly, the latency of the MULPS instruction is 5 cycles, while the latency of the XORPS 
instruction is 3 cycles. The following code example illustrates how to toggle the sign bit of a number 
using 3DNow! instructions:
signmask DQ 8000000080000000h
pxor mm0, [signmask]   ; Toggle sign bits of both floats.
This example does the same thing using SSE instructions:
signmask DQ 8000000080000000h,8000000080000000h   ; Must be 16-byte aligned.
xorps xmm0, [signmask]   ; Toggle sign bits of all four floats.
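To see why the mask works, the following portable C sketch applies the same XOR to the raw bits of a float. It is an illustration only, not code from this guide: the helper names negate_f32 and negate_4xf32 are hypothetical, and the sketch assumes that float is an IEEE-754 32-bit value whose sign is stored in bit 31.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float uses the 32-bit IEEE-754 layout (sign in bit 31). */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper: negate x by toggling only its sign bit. */
float negate_f32(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* copy out the raw bit pattern */
    bits ^= UINT32_C(0x80000000);     /* same mask as each 32-bit lane of signmask */
    memcpy(&x, &bits, sizeof x);      /* copy the modified pattern back */
    return x;
}

/* Hypothetical helper: toggle the sign bits of four floats, one lane at a
   time, mirroring the effect of XORPS with the 128-bit signmask. */
void negate_4xf32(float v[4])
{
    for (int i = 0; i < 4; ++i)
        v[i] = negate_f32(v[i]);
}
```

Because only the sign bit changes, the exponent and significand are untouched, so negate_f32(1.5f) yields -1.5f and applying the helper twice returns the original value.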
Double Precision
For double-precision operands, you can use the XORPD instruction, as in the single-precision 
example, to flip the sign of packed double-precision floating-point operands. The XORPD 
instruction takes 3 cycles to execute, whereas the MULPD instruction requires 5 cycles.
signmask DQ 8000000000000000h,8000000000000000h   ; Must be 16-byte aligned.
xorpd xmm0, [signmask]   ; Toggle sign bit of both doubles.
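The double-precision mask can be checked the same way. The C sketch below is again only an illustration under stated assumptions: negate_f64 is a hypothetical helper name, and double is assumed to be an IEEE-754 64-bit value with its sign in bit 63.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: double uses the 64-bit IEEE-754 layout (sign in bit 63). */
static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Hypothetical helper: negate x by toggling only its sign bit. */
double negate_f64(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          /* copy out the raw bit pattern */
    bits ^= UINT64_C(0x8000000000000000);    /* same mask as each 64-bit lane of signmask */
    memcpy(&x, &bits, sizeof x);             /* copy the modified pattern back */
    return x;
}
```

Each 64-bit lane of the XORPD signmask carries this same constant, so the instruction performs the equivalent of two such negations at once.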