AMD 250 User Manual

Chapter 2

C and C++ Source-Level Optimizations

Software Optimization Guide for AMD64 Processors

25112

Rev. 3.06

September 2005

2.16 Explicit Parallelism in Code

Optimization

Where possible, break long dependency chains into several independent dependency chains that can
then be executed in parallel, exploiting the execution units in each pipeline.

Application

This optimization applies to:

• 32-bit software

• 64-bit software

Rationale and Examples

It is especially important to break long dependency chains into shorter, independent chains in
floating-point code, whether it is mapped to x87, SSE, or SSE2 instructions, because floating-point
operations have longer latencies. Because most languages (including ANSI C) guarantee that
floating-point expressions are not reordered, compilers usually cannot perform such optimizations
unless they offer a switch that allows noncompliant reordering of floating-point expressions
according to algebraic rules.

Reordered code that is algebraically identical to the original code does not necessarily produce
identical computational results due to the lack of associativity of floating-point operations. There are
well-known numerical considerations in applying these optimizations (consult a book on numerical
analysis). In some cases, these optimizations may lead to unexpected results. In the vast majority of
cases, the final result differs only in the least-significant bits.
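
The lack of associativity can be seen in a minimal sketch. The function names sum_left and sum_right are hypothetical and chosen only for illustration; the sketch assumes IEEE 754 double-precision arithmetic and a build without any switch that permits algebraic reordering.

```c
/* Adds three doubles as (x + y) + z. */
double sum_left(double x, double y, double z)
{
    /* volatile keeps the compiler from folding or reordering the two adds */
    volatile double t = x + y;
    return t + z;
}

/* Adds the same three doubles as x + (y + z). */
double sum_right(double x, double y, double z)
{
    volatile double t = y + z;
    return x + t;
}
```

With x = 1.0e20, y = -1.0e20, and z = 1.0, sum_left returns 1.0, while sum_right returns 0.0 because 1.0 is smaller than the rounding granularity at magnitude 1.0e20 and is lost when it is added to y. This is an extreme case; with values of similar magnitude the two orders typically differ only in the last bits.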

Listing 10. Avoid

double a[100], sum;
int i;

/* Each addition needs the previous value of sum: one long dependency chain. */
sum = 0.0;
for (i = 0; i < 100; i++) {
    sum += a[i];
}
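
One way to break the single dependency chain in the loop above is to accumulate into several independent partial sums and combine them at the end. The following is a sketch rather than a definitive implementation: the function names sum_serial and sum_four_chains are hypothetical, and the choice of four accumulators is an assumption, since the most useful number depends on the latency and throughput of floating-point addition on the target processor.

```c
#include <stddef.h>

/* Reference version: a single dependency chain through sum. */
double sum_serial(const double *a, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}

/* Four independent chains: s0..s3 never depend on each other inside
   the main loop, so their additions can overlap in the pipeline. */
double sum_four_chains(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Handle the remaining 0 to 3 elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        s0 += a[i];
    }
    /* Combine the partial sums only once, after the loop. */
    return (s0 + s1) + (s2 + s3);
}
```

Because the additions are performed in a different order, sum_four_chains may differ from sum_serial in the least-significant bits for general inputs. The two agree exactly when every intermediate sum is exactly representable, for example when the array holds small integers.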