User Manual for AMD 250

Chapter 10
x87 Floating-Point Optimizations
Software Optimization Guide for AMD64 Processors
25112
Rev. 3.06
September 2005
10.4 Using the FXCH Instruction Rather Than FST/FLD Pairs
Optimization
Increase parallelism by breaking up dependency chains or by evaluating multiple dependency chains 
simultaneously by explicitly switching execution between them.
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale
Although the floating-point unit of the AMD Athlon 64 and AMD Opteron processors has a deep 
scheduler, which in most cases can extract sufficient parallelism from existing code, long dependency 
chains can stall the scheduler while issue slots are still available. The maximum dependency chain 
length that the scheduler can absorb is about six four-cycle instructions.
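As an illustration of what "evaluating multiple dependency chains" can look like at the source level, the following C sketch sums an array twice: once through a single accumulator (one long chain, where every add waits for the previous one) and once through two accumulators (two shorter, independent chains). The function names are hypothetical and not taken from this guide, and whether a compiler maps the second form onto x87 code that benefits from the scheduler depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

/* One dependency chain: each add needs the result of the previous add. */
double sum_one_chain(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Two dependency chains: acc0 and acc1 do not depend on each other
 * inside the loop, so their adds can overlap in the pipeline. */
double sum_two_chains(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double acc0 = 0.0, acc1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += x[i];       /* chain 0 */
        acc1 += x[i + 1];   /* chain 1 */
    }
    if (i < n)              /* odd element count: fold in the last value */
        acc0 += x[i];
    return acc0 + acc1;     /* join the two chains once, at the end */
}
```

Splitting the sum changes the order of the floating-point additions, so the two functions can differ in the last bits for general inputs; they agree exactly when every partial sum is representable, as with small integer values.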
To switch execution between dependency chains, use the FXCH instruction: it has an apparent 
latency of zero cycles and generates only one micro-op. The floating-point 
unit of the AMD Athlon 64 and AMD Opteron processors contains special hardware to handle up to 
three FXCH instructions per cycle. Using FXCH is preferred over the use of FST/FLD pairs, even if 
the FST/FLD pair works on a register. An FST/FLD pair adds two cycles of latency and consists of 
two macro-ops.
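The way FXCH switches between chains can be sketched with a toy model of the x87 register stack. The C code below is only an illustration of the register-stack semantics: the type X87, the helper functions, and the instruction sequence in the comments are assumptions made for this example, not code from this guide, and the model says nothing about timing. It evaluates (a + d) * f and (b + c) * e as two chains, using FXCH ST(1) to bring the other chain to the top of the stack before each operation.

```c
#include <assert.h>

/* Toy model of the x87 stack: s[depth-1] plays the role of ST(0). */
typedef struct {
    double s[8];
    int depth;
} X87;

static double *st(X87 *f, int i)            /* address of ST(i) */
{
    assert(i >= 0 && i < f->depth);
    return &f->s[f->depth - 1 - i];
}

static void fld(X87 *f, double v)           /* FLD mem: push a value */
{
    assert(f->depth < 8);
    f->s[f->depth++] = v;
}

static void fxch(X87 *f, int i)             /* FXCH ST(i): swap with ST(0) */
{
    double t = *st(f, 0);
    *st(f, 0) = *st(f, i);
    *st(f, i) = t;
}

static void fadd_mem(X87 *f, double m) { *st(f, 0) += m; }  /* FADD mem */
static void fmul_mem(X87 *f, double m) { *st(f, 0) *= m; }  /* FMUL mem */

static void faddp(X87 *f)                   /* FADDP ST(1), ST(0) */
{
    *st(f, 1) += *st(f, 0);
    f->depth--;
}

/* Computes (a + d) * f + (b + c) * e with the two chains interleaved. */
double interleaved(double a, double b, double c,
                   double d, double e, double f)
{
    X87 fpu = { {0}, 0 };
    fld(&fpu, a);        /* fld  [a]      chain A on the stack        */
    fld(&fpu, b);        /* fld  [b]      chain B in ST(0)            */
    fadd_mem(&fpu, c);   /* fadd [c]      chain B: b + c              */
    fxch(&fpu, 1);       /* fxch st(1)    switch to chain A           */
    fadd_mem(&fpu, d);   /* fadd [d]      chain A: a + d              */
    fxch(&fpu, 1);       /* fxch st(1)    switch to chain B           */
    fmul_mem(&fpu, e);   /* fmul [e]      chain B: (b + c) * e        */
    fxch(&fpu, 1);       /* fxch st(1)    switch to chain A           */
    fmul_mem(&fpu, f);   /* fmul [f]      chain A: (a + d) * f        */
    faddp(&fpu);         /* faddp st(1), st(0)   join the two chains  */
    return *st(&fpu, 0);
}
```

For example, interleaved(1, 2, 3, 5, 7, 11) evaluates (1 + 5) * 11 + (2 + 3) * 7. Each FXCH here stands where an FST/FLD pair would otherwise be needed to move the other chain's value to the top of the stack.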