Chapter 10

x87 Floating-Point Optimizations

Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005

10.4 Using the FXCH Instruction Rather Than FST/FLD Pairs

Optimization

Increase parallelism by breaking up dependency chains, or by evaluating multiple dependency chains
simultaneously and explicitly switching execution between them.

Application

This optimization applies to:

• 32-bit software
• 64-bit software

Rationale

Although the floating-point unit of the AMD Athlon 64 and AMD Opteron processors has a deep
scheduler, which in most cases can extract sufficient parallelism from existing code, long dependency
chains can stall the scheduler while issue slots are still available. The maximum dependency chain
length that the scheduler can absorb is about six four-cycle instructions, or roughly 24 cycles of
dependent latency.

To switch execution between dependency chains, use the FXCH instruction: it has an apparent
latency of zero cycles and generates only one micro-op. The floating-point unit of the AMD Athlon 64
and AMD Opteron processors contains special hardware to handle up to three FXCH instructions per
cycle. FXCH is preferred over an FST/FLD pair, even when the FST/FLD pair operates on a register
rather than memory. An FST/FLD pair adds two cycles of latency and consists of two macro-ops.
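The idea of evaluating two dependency chains and switching between them can be sketched in C. This is a minimal illustration, not code from this guide: the function names sum_single_chain and sum_two_chains are hypothetical, and the x87 sequence in the comment is a hand-written approximation of what such a loop might look like on an x87 target, not actual compiler output.

```c
#include <assert.h>
#include <stddef.h>

/* One long dependency chain: each addition must wait for the previous one. */
double sum_single_chain(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}

/*
 * Two independent dependency chains (sum_a and sum_b). The additions to
 * sum_a do not depend on the additions to sum_b, so they can overlap.
 *
 * On an x87 target, with sum_a in ST(0) and sum_b in ST(1), the loop body
 * could look roughly like this (illustrative assumption only):
 *
 *   fadd  qword ptr [esi]      ; sum_a += x[i]
 *   fxch  st(1)                ; switch chains: sum_b moves to ST(0)
 *   fadd  qword ptr [esi+8]    ; sum_b += x[i+1]
 *   fxch  st(1)                ; switch back: sum_a moves to ST(0)
 */
double sum_two_chains(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double sum_a = 0.0;
    double sum_b = 0.0;
    size_t i = 0;

    /* Process elements in pairs, one element per chain. */
    for (; i + 1 < n; i += 2) {
        sum_a += x[i];
        sum_b += x[i + 1];
    }

    /* Handle a leftover element when n is odd. */
    if (i < n)
        sum_a += x[i];

    /* Combine the two chains once, at the end. */
    return sum_a + sum_b;
}
```

Calling sum_two_chains(x, n) in place of sum_single_chain(x, n) gives the same result whenever the additions are exact. Because splitting the sum changes the order of the floating-point additions, the two versions can differ in the last bits for general inputs, so this transformation is appropriate only where that reordering is acceptable.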