AMD x86 User Manual
Unrolling Loops
22007E/0—November 1999
AMD Athlon™ Processor x86 Code Optimization
Without Loop Unrolling:
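MOV  ECX, MAX_LENGTH      ; loop counter = number of elements
MOV  EAX, OFFSET A        ; EAX points to array A
MOV  EBX, OFFSET B        ; EBX points to array B
$add_loop:
FLD  QWORD PTR [EAX]      ; load A[i]
FADD QWORD PTR [EBX]      ; add B[i]
FSTP QWORD PTR [EAX]      ; store the sum back to A[i]
ADD  EAX, 8               ; advance both pointers by one 8-byte element
ADD  EBX, 8
DEC  ECX
JNZ  $add_loop            ; repeat until ECX reaches zero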
The loop consists of seven instructions. The AMD Athlon
processor can decode/retire three instructions per cycle, so it
cannot execute faster than three iterations in seven cycles, or
3/7 floating-point adds per cycle. However, the pipelined
floating-point adder allows one add every cycle. In the following
code, the loop is partially unrolled by a factor of two, which
creates potential endcases that must be handled outside the
loop:
With Partial Loop Unrolling:
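MOV  ECX, MAX_LENGTH      ; number of elements (assumed to be at least 2)
MOV  EAX, OFFSET A        ; EAX points to array A
MOV  EBX, OFFSET B        ; EBX points to array B
SHR  ECX, 1               ; ECX = iteration count, CF = 1 if the length is odd
JNC  $add_loop            ; even length: go straight to the unrolled loop
FLD  QWORD PTR [EAX]      ; odd length: handle one element before the loop
FADD QWORD PTR [EBX]
FSTP QWORD PTR [EAX]
ADD  EAX, 8
ADD  EBX, 8
$add_loop:
FLD  QWORD PTR [EAX]      ; first element of the pair
FADD QWORD PTR [EBX]
FSTP QWORD PTR [EAX]
FLD  QWORD PTR [EAX+8]    ; second element of the pair
FADD QWORD PTR [EBX+8]
FSTP QWORD PTR [EAX+8]
ADD  EAX, 16              ; advance both pointers by two 8-byte elements
ADD  EBX, 16
DEC  ECX
JNZ  $add_loop            ; repeat until ECX reaches zero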
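For comparison, the same partial unrolling can be sketched in C. This is only an illustrative sketch and not part of the original listing: the function name add_arrays_unrolled2 and its parameters are hypothetical, and a compiler is free to generate different instructions than the assembly above. The odd-length end case is handled before the main loop, mirroring the code that JNC skips. Unlike the assembly, which assumes MAX_LENGTH is at least 2, this sketch also tolerates lengths of 0 and 1.

```c
#include <stddef.h>

/* Hypothetical helper: a[i] += b[i] for i in [0, n), unrolled by two. */
void add_arrays_unrolled2(double *a, const double *b, size_t n)
{
    size_t i = 0;

    /* End case: with an odd count, process one element before the
       loop, like the instructions skipped by JNC in the assembly. */
    if (n & 1) {
        a[0] += b[0];
        i = 1;
    }

    /* Main loop: the remaining count is even, so each iteration
       performs two adds and shares one index update and one branch. */
    for (; i < n; i += 2) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
}
```

Calling add_arrays_unrolled2(A, B, MAX_LENGTH) on two arrays of doubles gives the same result as the non-unrolled loop. The loop overhead (index update and branch) is paid once per two adds instead of once per add.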
Now the loop consists of 10 instructions. Based on the
decode/retire bandwidth of three OPs per cycle, this loop goes