User Manual for AMD 250

Chapter 9

Optimizing with SIMD Instructions

Software Optimization Guide for AMD64 Processors

25112

Rev. 3.06

September 2005

9.12 Use XOR Operations to Negate Operands of SSE, SSE2, and 3DNow!™ Instructions

Optimization

For AMD Athlon, AMD Athlon 64, and AMD Opteron processors, use instructions that perform
XOR operations (PXOR, XORPS, and XORPD) instead of multiplication instructions to change the
sign bit of operands of SSE, SSE2, and 3DNow! instructions.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale

On the AMD Athlon 64 and AMD Opteron processors, using XOR-type instructions allows for more
parallelism, as these instructions can execute in either the FADD or FMUL pipe of the floating-point
unit.
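
The substitution is possible because, in the IEEE 754 formats that these instructions operate on, the sign occupies the most-significant bit of each element, so toggling that one bit negates the value without touching the exponent or fraction. The following portable C sketch illustrates this for a single value. The function name negate_by_xor is hypothetical, and the code assumes that float is a 32-bit IEEE 754 single-precision type.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: negate a float by toggling its sign bit.
   Assumes float is a 32-bit IEEE 754 single-precision value. */
static float negate_by_xor(float x)
{
    uint32_t bits;

    memcpy(&bits, &x, sizeof bits);  /* Copy the float's raw bits out. */
    bits ^= 0x80000000u;             /* Toggle bit 31, the sign bit. */
    memcpy(&x, &bits, sizeof x);     /* Copy the modified bits back. */
    return x;
}
```

For any input that is not a NaN, negate_by_xor(x) compares equal to x * -1.0f. The SIMD XOR instructions apply the same bit toggle to every element of a register at once.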

Single Precision

For single-precision operands, you can use either 3DNow! or SSE SIMD XOR operations. Multiplying
by –1.0 in 3DNow! (PFMUL) has a latency of 4 cycles, whereas the PXOR instruction has a latency of
only 2 cycles. Similarly, the MULPS instruction has a latency of 5 cycles, whereas the XORPS
instruction has a latency of 3 cycles. The following code example shows how to toggle the sign bits
of two packed single-precision values held in the MMX register that 3DNow! instructions use:

signmask DQ 8000000080000000h

pxor mm0, [signmask] ; Toggle sign bits of both floats.

This example does the same thing using SSE instructions:

signmask DQ 8000000080000000h,8000000080000000h ; Must be 16-byte aligned for use as an XORPS memory operand.

xorps xmm0, [signmask] ; Toggle sign bits of all four floats.
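
In C, the same operation can be sketched with SSE intrinsics, assuming a compiler that provides <xmmintrin.h>. The _mm_xor_ps intrinsic corresponds to XORPS, although the compiler makes the final instruction selection. The function name negate4_ps is hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: negate four packed single-precision values.
   Unaligned loads and stores are used, so neither pointer needs
   16-byte alignment. */
static void negate4_ps(const float *src, float *dst)
{
    /* -0.0f has only the sign bit set (80000000h), so this builds the
       same bit pattern as the signmask constant in the assembly example. */
    const __m128 signmask = _mm_set1_ps(-0.0f);

    __m128 v = _mm_loadu_ps(src);   /* Load four floats. */
    v = _mm_xor_ps(v, signmask);    /* Toggle all four sign bits. */
    _mm_storeu_ps(dst, v);          /* Store the four negated floats. */
}
```

Calling negate4_ps(in, out) on an array of four floats writes the negated values to out. Building the mask with _mm_set1_ps avoids keeping a separate aligned constant in memory, at the cost of leaving the mask materialization to the compiler.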

Double Precision

For double-precision operands, use the XORPD instruction, as in the single-precision example, to
flip the sign bits of packed double-precision floating-point operands. The XORPD instruction has a
latency of 3 cycles, whereas the MULPD instruction has a latency of 5 cycles.

signmask DQ 8000000000000000h,8000000000000000h ; Must be 16-byte aligned for use as an XORPD memory operand.

xorpd xmm0, [signmask] ; Toggle sign bit of both doubles.
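
A double-precision counterpart in C can be sketched with SSE2 intrinsics, assuming a compiler that provides <emmintrin.h>. The _mm_xor_pd intrinsic corresponds to XORPD, and the function name negate2_pd is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: negate two packed double-precision values.
   Unaligned loads and stores are used, so neither pointer needs
   16-byte alignment. */
static void negate2_pd(const double *src, double *dst)
{
    /* -0.0 has only the sign bit set (8000000000000000h). */
    const __m128d signmask = _mm_set1_pd(-0.0);

    __m128d v = _mm_loadu_pd(src);  /* Load two doubles. */
    v = _mm_xor_pd(v, signmask);    /* Toggle both sign bits. */
    _mm_storeu_pd(dst, v);          /* Store the two negated doubles. */
}
```

Calling negate2_pd(in, out) on an array of two doubles writes the negated values to out.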