User Manual for AMD 250
Appendix C
Instruction Latencies
317
Software Optimization Guide for AMD64 Processors
25112
Rev. 3.06
September 2005
C.7 SSE Instructions
Table 18. SSE Instructions

                              ----------- Encoding ----------
Syntax                        Prefix  First  2nd   ModRM       Decode      FPU      Latency  Note
                              byte    byte   byte  byte        type        pipe(s)
----------------------------  ------  -----  ----  ----------  ----------  -------  -------  ----
ADDPS xmmreg1, xmmreg2        -       0Fh    58h   11-xxx-xxx  Double      FADD     5        -
ADDPS xmmreg, mem128          -       0Fh    58h   mm-xxx-xxx  Double      FADD     7        -
ADDSS xmmreg1, xmmreg2        F3h     0Fh    58h   11-xxx-xxx  DirectPath  FADD     4        -
ADDSS xmmreg, mem128          F3h     0Fh    58h   mm-xxx-xxx  DirectPath  FADD     6        -
ANDNPS xmmreg1, xmmreg2       -       0Fh    55h   11-xxx-xxx  Double      FMUL     3        1
ANDNPS xmmreg, mem128         -       0Fh    55h   mm-xxx-xxx  Double      FMUL     5        -
ANDPS xmmreg1, xmmreg2        -       0Fh    54h   11-xxx-xxx  Double      FMUL     3        1
ANDPS xmmreg, mem128          -       0Fh    54h   mm-xxx-xxx  Double      FMUL     5        -
CMPPS xmmreg1, xmmreg2, imm8  -       0Fh    C2h   11-xxx-xxx  Double      FADD     3        1
CMPPS xmmreg, mem128, imm8    -       0Fh    C2h   mm-xxx-xxx  Double      FADD     5        -
CMPSS xmmreg1, xmmreg2, imm8  F3h     0Fh    C2h   11-xxx-xxx  DirectPath  FADD     2        -
CMPSS xmmreg, mem32, imm8     F3h     0Fh    C2h   mm-xxx-xxx  DirectPath  FADD     4        -
COMISS xmmreg1, xmmreg2       -       0Fh    2Fh   11-xxx-xxx  VectorPath  -        4        -
Notes:
1. The low half of the result is available one cycle earlier than listed.
2. The second latency value indicates when the low half of the result becomes available.
3. The high half of the result is available one cycle earlier than listed.
4. The latency listed is the absolute minimum, while average latencies may be higher and are a function of internal
pipeline conditions.
5. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line to be
prefetched.
6. The 8-clock latency is only visible to younger stores that need to do an external write. The 2-clock latency is
visible to the other stores and instructions.
7. This is the execution latency for the instruction. The time to complete the external write depends on the memory
speed and the hardware implementation.
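
The encoding columns of Table 18 can be read as a recipe for the instruction bytes: an optional prefix byte, the two opcode bytes, then the ModRM byte. The following is a minimal C sketch of that reading for the register-to-register forms (ModRM pattern 11-xxx-xxx) only. The function name encode_sse_rr and its parameters are hypothetical and do not come from the guide. The placement of the destination register in the middle ModRM field and the source register in the low field follows the general x86 ModRM convention rather than anything stated in this table, and the trailing imm8 byte of CMPPS/CMPSS is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the guide: writes the bytes of a
 * register-to-register SSE instruction from Table 18 into out[].
 *   prefix  - prefix byte from the table, or 0 when the row has none
 *   opcode2 - the "2nd byte" column (for example 58h for ADDPS/ADDSS)
 *   dst/src - XMM register numbers 0..7 (assumption: no REX prefix)
 * Returns the number of bytes written (3 without a prefix, 4 with one).
 * Instructions that take an imm8 (CMPPS, CMPSS) would need one more
 * byte appended by the caller; that is not handled here. */
size_t encode_sse_rr(uint8_t prefix, uint8_t opcode2,
                     unsigned dst, unsigned src, uint8_t out[4])
{
    size_t n = 0;

    assert(dst < 8 && src < 8);

    if (prefix != 0)
        out[n++] = prefix;          /* e.g. F3h for the scalar forms */
    out[n++] = 0x0F;                /* "First byte" column */
    out[n++] = opcode2;             /* "2nd byte" column */

    /* ModRM 11-xxx-xxx: mod = 11b, middle field = dst, low field = src */
    out[n++] = (uint8_t)(0xC0u | (dst << 3) | src);
    return n;
}
```

Under these assumptions, calling encode_sse_rr(0, 0x58, 1, 2, buf) for ADDPS xmm1, xmm2 yields the bytes 0Fh 58h CAh, and passing F3h as the prefix gives the four-byte ADDSS form F3h 0Fh 58h CAh.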