Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
Chapter 8 Integer Optimizations
8.2 Alternative Code for Multiplying by a Constant
Optimization
Devise instruction sequences with lower latency to accomplish multiplication by certain constant 
multipliers.
Rationale
A 32-bit integer multiplied by a constant has a latency of 3 cycles; a 64-bit integer multiplied by a 
constant has a latency of 4 cycles. For certain constant multipliers, instruction sequences can be 
devised that accomplish the multiplication with lower latency. Because the AMD Athlon 64 and 
AMD Opteron processors contain only one integer multiplier but three integer execution units, the 
replacement code can provide better throughput as well.
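The arithmetic behind such replacement sequences can be checked with a small C sketch. This is only an illustration, not code from the guide: the helper names are hypothetical, and each function models one ADD, SHL, or LEA form using unsigned 32-bit arithmetic, which wraps in the same way as a 32-bit register.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helpers (names are hypothetical, not from the guide).
   Unsigned arithmetic wraps modulo 2^32, like a 32-bit register. */
static uint32_t mul2_add(uint32_t x) { return x + x; }         /* add reg1, reg1          */
static uint32_t mul3_lea(uint32_t x) { return x + (x << 1); }  /* lea reg1, [reg1+reg1*2] */
static uint32_t mul4_shl(uint32_t x) { return x << 2; }        /* shl reg1, 2             */
static uint32_t mul5_lea(uint32_t x) { return x + (x << 2); }  /* lea reg1, [reg1+reg1*4] */

/* Compare each replacement against an ordinary multiply for a few inputs,
   including values whose products overflow 32 bits. */
static void check_replacements(void)
{
    const uint32_t samples[] = { 0u, 1u, 7u, 0x7FFFFFFFu, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        uint32_t x = samples[i];
        assert(mul2_add(x) == x * 2u);
        assert(mul3_lea(x) == x * 3u);
        assert(mul4_shl(x) == x * 4u);
        assert(mul5_lea(x) == x * 5u);
    }
}
```

An optimizing compiler may already pick such sequences on its own; the sketch only shows that the shift-and-add forms produce the same value as the multiplication, including on overflow.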
Most replacement sequences require the use of an additional temporary register, thus increasing 
register pressure. If register pressure in a piece of code that performs integer multiplication with a 
constant is already high, it could be better for the overall performance of that code to use the IMUL 
instruction instead of the replacement code. Similarly, replacement sequences with low latency but 
containing many instructions may negatively influence decode bandwidth as compared to the IMUL 
instruction. In general, replacement sequences containing more than four instructions are not 
recommended.
The following code samples are written so that the register holding the original source operand 
also receives the final result. Other sequences are possible if the result is to be placed in a 
different register. Sequences that do not require a temporary register are favored over ones that 
do, even if their latency is higher. Arithmetic-logic-unit operations are preferred over shifts to 
keep code size small. Similarly, both arithmetic-logic-unit operations and shifts are favored over 
the LEA instruction.
The multiplier in the AMD Athlon 64 and AMD Opteron processors is improved over that of previous 
x86 processors. For this reason, use an alternative sequence for 32-bit multiplication only if its 
latency is less than or equal to 2 cycles. For 64-bit multiplication, use an alternative sequence 
only if its latency is less than or equal to 3 cycles.
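The selection guidelines above can be summarized as a small predicate. The following C sketch is an illustration under stated assumptions: the function name and parameters are hypothetical, the caller supplies the latency and instruction count of the candidate sequence, and register pressure is not modeled.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of the guide: decides whether a candidate
   replacement sequence meets the latency and length guidelines in the text.
   operand_bits is assumed to be 32 or 64. */
static bool use_replacement(int operand_bits, int seq_latency_cycles,
                            int seq_instr_count)
{
    /* Latency ceilings from the text: 2 cycles (32-bit), 3 cycles (64-bit). */
    int max_latency = (operand_bits == 64) ? 3 : 2;

    /* Sequences of more than four instructions are not recommended. */
    if (seq_instr_count > 4)
        return false;

    return seq_latency_cycles <= max_latency;
}
```

For instance, a single 2-cycle LEA passes the check for a 32-bit operand, whereas a 3-cycle sequence passes only for a 64-bit operand. Even when the predicate returns true, IMUL may still be the better choice if register pressure is already high.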
Examples
by 2:   add reg1, reg1            ; 1 cycle
by 3:   lea reg1, [reg1+reg1*2]   ; 2 cycles
by 4:   shl reg1, 2               ; 1 cycle
by 5:   lea reg1, [reg1+reg1*4]   ; 2 cycles