AMD x86 User Manual

Unrolling Loops

22007E/0—November 1999

AMD Athlon™ Processor x86 Code Optimization

Without Loop Unrolling:

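MOV   ECX, MAX_LENGTH      ; loop counter = number of elements
MOV   EAX, OFFSET A        ; EAX -> array A (source and destination)
MOV   EBX, OFFSET B        ; EBX -> array B (source)

$add_loop:
FLD   QWORD PTR [EAX]      ; load A[i]
FADD  QWORD PTR [EBX]      ; add B[i]
FSTP  QWORD PTR [EAX]      ; store the sum back to A[i]
ADD   EAX, 8               ; advance both pointers by one double
ADD   EBX, 8
DEC   ECX
JNZ   $add_loop            ; repeat until the counter reaches zero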

The loop consists of seven instructions. The AMD Athlon
processor can decode/retire three instructions per cycle, so it
cannot execute faster than three iterations in seven cycles, or
3/7 floating-point adds per cycle. However, the pipelined
floating-point adder allows one add every cycle. In the following
code, the loop is partially unrolled by a factor of two, which
creates potential endcases that must be handled outside the
loop:

With Partial Loop Unrolling:

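MOV   ECX, MAX_LENGTH
MOV   EAX, offset A
MOV   EBX, offset B
SHR   ECX, 1               ; ECX = MAX_LENGTH / 2, CF = 1 if MAX_LENGTH is odd
JNC   $add_loop            ; even length: skip the single-element endcase
FLD   QWORD PTR [EAX]      ; odd length: process one element before the loop
FADD  QWORD PTR [EBX]
FSTP  QWORD PTR [EAX]
ADD   EAX, 8
ADD   EBX, 8

$add_loop:
FLD   QWORD PTR [EAX]      ; first element of the pair
FADD  QWORD PTR [EBX]
FSTP  QWORD PTR [EAX]
FLD   QWORD PTR [EAX+8]    ; second element of the pair
FADD  QWORD PTR [EBX+8]
FSTP  QWORD PTR [EAX+8]
ADD   EAX, 16              ; advance both pointers by two doubles
ADD   EBX, 16
DEC   ECX
JNZ   $add_loop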
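
For readers who find the structure easier to follow in a high-level language, the partially unrolled listing above corresponds roughly to the C sketch below. It is an illustration only and not part of the original manual: the function name add_arrays_unrolled2 and its parameters are hypothetical, and a compiler is free to generate different code from what the assembly shows.

```c
#include <stddef.h>

/* Hypothetical helper (not from the manual): computes a[i] += b[i] for
   n doubles, with the loop unrolled by a factor of two. */
void add_arrays_unrolled2(double *a, const double *b, size_t n)
{
    size_t i = 0;

    /* Endcase: if n is odd, handle one element up front so that an even
       number of elements remains for the unrolled loop. */
    if (n & 1) {
        a[0] += b[0];
        i = 1;
    }

    /* Main loop: two independent adds per iteration, so the loop
       overhead (index update, compare, branch) is paid once per pair. */
    for (; i < n; i += 2) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
}
```

One difference is worth noting: the C loop tests its condition before the first iteration, so it also handles n equal to 0 or 1. The assembly listing tests the counter only at the bottom of the loop and therefore assumes MAX_LENGTH is at least 2.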

Now the loop consists of 10 instructions. Based on the
decode/retire bandwidth of three OPs per cycle, this loop goes