Chapter 2
C and C++ Source-Level Optimizations
Software Optimization Guide for AMD64 Processors
25112
Rev. 3.06
September 2005
2.16 Explicit Parallelism in Code
Optimization
Where possible, break long dependency chains into several independent dependency chains that can 
then be executed in parallel, exploiting the execution units in each pipeline. 
Application
This optimization applies to:
32-bit software
64-bit software
Rationale and Examples
Breaking long dependency chains into shorter, independent chains is especially important in 
floating-point code, whether it is mapped to x87, SSE, or SSE2 instructions, because of the longer 
latency of floating-point operations. Because most languages (including ANSI C) guarantee that 
floating-point expressions are not reordered, compilers usually cannot perform such optimizations 
unless they offer a switch that allows noncompliant reordering of floating-point expressions 
according to algebraic rules.
Reordered code that is algebraically identical to the original code does not necessarily produce 
identical computational results due to the lack of associativity of floating-point operations. There are 
well-known numerical considerations in applying these optimizations (consult a book on numerical 
analysis). In some cases, these optimizations may lead to unexpected results. In the vast majority of 
cases, the final result differs only in the least-significant bits.
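The lack of associativity mentioned above can be made concrete with a small sketch. The function names sum_left_to_right and sum_right_to_left are hypothetical, and the input values are chosen only to exaggerate the effect; in typical code the difference is far smaller.

```c
/* Illustrative sketch only: the function names are hypothetical and
 * are not part of the original guide. */

/* Evaluates (x + y) + z, the order a compliant compiler must keep
 * for the source expression x + y + z. */
double sum_left_to_right(double x, double y, double z)
{
    return (x + y) + z;
}

/* Evaluates x + (y + z), an algebraically identical reordering. */
double sum_right_to_left(double x, double y, double z)
{
    return x + (y + z);
}
```

Called with x = 1.0e20, y = -1.0e20, and z = 1.0, the first function returns 1.0, while the second returns 0.0 because z is lost to rounding when it is added to y. This is the kind of difference a reordering switch can introduce.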
Listing 10. Avoid
double a[100], sum;
int i;
sum = 0.0;
for (i = 0; i < 100; i++) {
   sum += a[i];
}
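In the loop above, every addition depends on the result of the previous one, so the additions form a single serial chain. One way to break that chain, sketched here under stated assumptions, is to keep several independent partial sums and combine them after the loop. The function name sum_four_chains, the choice of four accumulators, and the remainder loop for lengths that are not a multiple of four are all illustrative assumptions; the best number of chains depends on the latency and throughput of the target processor.

```c
#include <stddef.h>

/* Illustrative sketch only: sums n doubles using four independent
 * accumulators so that consecutive additions do not depend on each
 * other. The name and the accumulator count are assumptions. */
double sum_four_chains(const double *a, size_t n)
{
    double sum1 = 0.0, sum2 = 0.0, sum3 = 0.0, sum4 = 0.0;
    size_t n4 = n - (n % 4);   /* largest multiple of 4 not exceeding n */
    size_t i;

    /* Four independent dependency chains, one per accumulator. */
    for (i = 0; i < n4; i += 4) {
        sum1 += a[i];
        sum2 += a[i + 1];
        sum3 += a[i + 2];
        sum4 += a[i + 3];
    }

    /* Handle the 0 to 3 elements left over when n is not a multiple of 4. */
    for (; i < n; i++) {
        sum1 += a[i];
    }

    /* Combine the partial sums pairwise. */
    return (sum1 + sum2) + (sum3 + sum4);
}
```

For the 100-element array of Listing 10, the call would be sum = sum_four_chains(a, 100). Because the additions are performed in a different order, the result may differ from the serial loop in the least-significant bits.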