AMD x86 User's Manual

Unrolling Loops
22007E/0—November 1999
AMD Athlon™ Processor x86 Code Optimization 
Without Loop Unrolling:  
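MOV   ECX, MAX_LENGTH      ; ECX = number of elements
MOV   EAX, OFFSET A        ; EAX -> A
MOV   EBX, OFFSET B        ; EBX -> B
$add_loop:
FLD   QWORD PTR [EAX]      ; load A[i]
FADD  QWORD PTR [EBX]      ; add B[i]
FSTP  QWORD PTR [EAX]      ; store the sum back to A[i] and pop
ADD   EAX, 8               ; advance both pointers by one double
ADD   EBX, 8
DEC   ECX                  ; one fewer element to go
JNZ   $add_loop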
The loop consists of seven instructions. The AMD Athlon
processor can decode/retire three instructions per cycle, so it
cannot execute faster than three iterations in seven cycles, or
3/7 floating-point adds per cycle. However, the pipelined
floating-point adder allows one add every cycle. In the following
code, the loop is partially unrolled by a factor of two, which
creates potential end cases (here, an odd element count) that
must be handled outside the loop:
With Partial Loop Unrolling:  
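MOV   ECX, MAX_LENGTH      ; assumes MAX_LENGTH >= 2 (loop does not test for zero on entry)
MOV   EAX, OFFSET A
MOV   EBX, OFFSET B
SHR   ECX, 1               ; ECX = MAX_LENGTH / 2, CF = MAX_LENGTH mod 2
JNC   $add_loop            ; even count: go straight to the loop
FLD   QWORD PTR [EAX]      ; odd count: handle one element up front
FADD  QWORD PTR [EBX]
FSTP  QWORD PTR [EAX]
ADD   EAX, 8
ADD   EBX, 8

$add_loop:
FLD   QWORD PTR [EAX]      ; first element of the pair
FADD  QWORD PTR [EBX]
FSTP  QWORD PTR [EAX]
FLD   QWORD PTR [EAX+8]    ; second element of the pair
FADD  QWORD PTR [EBX+8]
FSTP  QWORD PTR [EAX+8]
ADD   EAX, 16              ; advance both pointers by two doubles
ADD   EBX, 16
DEC   ECX                  ; one fewer pair to go
JNZ   $add_loop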
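For readers who prefer a high-level view, the two listings above can be sketched in C. This is only an illustration of the transformation, not code from the manual: the function names add_arrays and add_arrays_unrolled2 and the parameters a, b, and n are hypothetical stand-ins for A, B, and MAX_LENGTH, and a compiler may generate different instructions than the hand-written assembly.

```c
#include <stddef.h>

/* Straightforward version: one floating-point add per iteration. */
void add_arrays(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Unrolled by a factor of two. An odd leftover element is handled
   before the main loop, mirroring the SHR/JNC prologue above. */
void add_arrays_unrolled2(double *a, const double *b, size_t n)
{
    size_t i = 0;

    if (n & 1) {              /* odd count: peel one element first */
        a[0] += b[0];
        i = 1;
    }
    for (; i < n; i += 2) {   /* the remaining count is even */
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
}
```

Because the C loop tests its condition before the first iteration, this sketch also copes with n equal to 0 or 1, which the assembly listing does not check for.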
Now the loop consists of 10 instructions. Based on the
decode/retire bandwidth of three OPs per cycle, this loop goes