AMD Athlon™ Processor x86 Code Optimization 
22007E/0—November 1999
Example (PMADDWD computing two 16-bit by 16-bit products at once):
PXOR      MM2, MM2       ;   0 | 0
MOVD      MM0, [ab]      ; 0 0 | b a
MOVD      MM1, [cd]      ; 0 0 | d c
PUNPCKLWD MM0, MM2       ; 0 b | 0 a
PUNPCKLWD MM1, MM2       ; 0 d | 0 c
PMADDWD   MM0, MM1       ; b*d | a*c
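The register comments above can be checked with a small scalar model. The C sketch below is not the author's code. The function names are hypothetical, it models only the arithmetic (not timing), and it treats each 16-bit word as signed, as PMADDWD does. After the two PUNPCKLWD instructions, each dword holds one value next to a zero word, so each PMADDWD pair sum reduces to a single product.

```c
#include <assert.h>
#include <stdint.h>

/* One 32-bit lane of PMADDWD (hypothetical helper): the signed products of
   the two word pairs are added, and only the low 32 bits are kept. */
static uint32_t pmaddwd_lane(int16_t x_hi, int16_t x_lo,
                             int16_t y_hi, int16_t y_lo) {
    int64_t sum = (int64_t)x_hi * y_hi + (int64_t)x_lo * y_lo;
    return (uint32_t)sum;
}

/* Model of the example: [ab] supplies words a (low) and b, [cd] supplies
   c (low) and d. After PUNPCKLWD with a zeroed MM2, MM0 = 0 b | 0 a and
   MM1 = 0 d | 0 c, so PMADDWD leaves b*d in the upper dword and a*c in
   the lower dword. */
uint64_t two_products(int16_t a, int16_t b, int16_t c, int16_t d) {
    uint32_t lo = pmaddwd_lane(0, a, 0, c);   /* 0*0 + a*c */
    uint32_t hi = pmaddwd_lane(0, b, 0, d);   /* 0*0 + b*d */
    return ((uint64_t)hi << 32) | lo;
}

/* Self-check: each dword of the packed result equals the scalar product. */
void check_two_products(int16_t a, int16_t b, int16_t c, int16_t d) {
    uint64_t r = two_products(a, b, c, d);
    assert((uint32_t)r == (uint32_t)((int32_t)a * c));
    assert((uint32_t)(r >> 32) == (uint32_t)((int32_t)b * d));
}
```

For example, `two_products(3, 4, 5, 6)` packs 24 (b*d) in the upper dword and 15 (a*c) in the lower dword.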
3DNow!™ and MMX™ Intra-Operand Swapping
AMD Athlon™ Specific Code
If you need to swap the two halves of an MMX register, use the
PSWAPD instruction, which is a new AMD Athlon 3DNow! DSP
extension. Use this instruction only in AMD Athlon specific
code. “PSWAPD mmreg1, mmreg2” performs the following operation:
mmreg1[63:32] = mmreg2[31:0]
mmreg1[31:0] = mmreg2[63:32]
See the AMD Extensions to the 3DNow! and MMX Instruction Set
Manual, order #22466, for more usage information.
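In C terms, the PSWAPD operation exchanges the two 32-bit halves of a 64-bit value. The sketch below is a scalar model for checking the bit layout only. `pswapd_model` is a hypothetical name, and the code does not emit or call the PSWAPD instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PSWAPD mmreg1, mmreg2 (not the instruction itself):
   mmreg1[63:32] = mmreg2[31:0] and mmreg1[31:0] = mmreg2[63:32]. */
uint64_t pswapd_model(uint64_t mmreg2) {
    uint64_t lo = mmreg2 & 0xFFFFFFFFULL;   /* mmreg2[31:0]  */
    uint64_t hi = mmreg2 >> 32;             /* mmreg2[63:32] */
    return (lo << 32) | hi;
}

/* Swapping the halves twice must restore the original value. */
void check_pswapd_involution(uint64_t x) {
    assert(pswapd_model(pswapd_model(x)) == x);
}
```

For instance, `pswapd_model(0x0123456789ABCDEFULL)` yields `0x89ABCDEF01234567ULL`.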
Blended Code
Otherwise, for blended code, which needs to run well on
AMD-K6 and AMD Athlon family processors, the following code
is recommended:
Example 1 (Preferred, faster):  
;MM1 = SWAP (MM0), MM0 destroyed
MOVQ      MM1, MM0       ;make a copy
PUNPCKLDQ MM0, MM0       ;duplicate lower half
PUNPCKHDQ MM1, MM0       ;combine upper halves
Example 2 (Preferred, fast):  
;MM1 = SWAP (MM0), MM0 preserved
MOVQ      MM1, MM0       ;make a copy
PUNPCKHDQ MM1, MM1       ;duplicate upper half
PUNPCKLDQ MM1, MM0       ;combine lower halves
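Both blended sequences can be checked against the PSWAPD definition with a scalar model of the dword unpack instructions. This C sketch uses hypothetical helper names and models register contents only, not timing. In it, PUNPCKLDQ dst, src places src[31:0] above dst[31:0], and PUNPCKHDQ dst, src places src[63:32] above dst[63:32].

```c
#include <assert.h>
#include <stdint.h>

#define LO32 0x00000000FFFFFFFFULL
#define HI32 0xFFFFFFFF00000000ULL

/* Scalar models of the MMX dword unpacks (dst is also the first source). */
static uint64_t punpckldq(uint64_t dst, uint64_t src) {
    return ((src & LO32) << 32) | (dst & LO32);   /* src[31:0] : dst[31:0]   */
}
static uint64_t punpckhdq(uint64_t dst, uint64_t src) {
    return (src & HI32) | (dst >> 32);            /* src[63:32] : dst[63:32] */
}

/* Example 1: MM1 = SWAP(MM0), MM0 destroyed. */
void swap_example1(uint64_t *mm0, uint64_t *mm1) {
    *mm1 = *mm0;                   /* MOVQ      MM1, MM0 */
    *mm0 = punpckldq(*mm0, *mm0);  /* PUNPCKLDQ MM0, MM0 */
    *mm1 = punpckhdq(*mm1, *mm0);  /* PUNPCKHDQ MM1, MM0 */
}

/* Example 2: MM1 = SWAP(MM0), MM0 preserved. */
void swap_example2(const uint64_t *mm0, uint64_t *mm1) {
    *mm1 = *mm0;                   /* MOVQ      MM1, MM0 */
    *mm1 = punpckhdq(*mm1, *mm1);  /* PUNPCKHDQ MM1, MM1 */
    *mm1 = punpckldq(*mm1, *mm0);  /* PUNPCKLDQ MM1, MM0 */
}

/* Both sequences should leave MM1 equal to MM0 with its halves exchanged. */
void check_swaps(uint64_t x) {
    uint64_t swapped = ((x & LO32) << 32) | (x >> 32);
    uint64_t a = x, b = 0;
    swap_example1(&a, &b);
    assert(b == swapped);
    assert(a == (((x & LO32) << 32) | (x & LO32)));  /* MM0 now holds its low half twice */
    uint64_t c = x, d = 0;
    swap_example2(&c, &d);
    assert(d == swapped);
    assert(c == x);                                  /* MM0 unchanged */
}
```

The model also shows why Example 1 destroys MM0. After PUNPCKLDQ MM0, MM0, the register holds its original lower half in both dwords.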
Both examples swap the halves, but the first example should be
used when the original contents of MM0 do not need to be
preserved. The first example is faster because MOVQ and
PUNPCKLDQ do not depend on each other's results and can execute
in parallel. In the second example, each instruction uses the
result of the one before it, so the sequence takes longer to
execute.
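The dependency argument can be made concrete with a toy critical-path count. This C sketch rests on simplifying assumptions that are not Athlon timing data. The `Insn` encoding is hypothetical, every instruction takes one time unit, and only read-after-write dependencies delay an instruction (overwriting a register that an earlier instruction has already read is assumed not to stall). Under those assumptions, Example 1 has a dependency chain of two instructions and Example 2 has a chain of three.

```c
#include <assert.h>
#include <stddef.h>

enum { NREGS = 8 };   /* MM0..MM7 */

/* Hypothetical encoding of a two-operand MMX instruction. */
typedef struct {
    int dst;        /* register written */
    int src;        /* register read */
    int reads_dst;  /* 1 if dst is also a source (PUNPCK*), 0 for MOVQ */
} Insn;

/* Length of the longest read-after-write chain, assuming unit latency. */
int critical_path(const Insn *code, size_t n) {
    int ready[NREGS] = {0};   /* time at which each register's value is ready */
    int longest = 0;
    for (size_t i = 0; i < n; i++) {
        assert(code[i].dst >= 0 && code[i].dst < NREGS);
        assert(code[i].src >= 0 && code[i].src < NREGS);
        int start = ready[code[i].src];
        if (code[i].reads_dst && ready[code[i].dst] > start)
            start = ready[code[i].dst];
        ready[code[i].dst] = start + 1;   /* result available one unit later */
        if (ready[code[i].dst] > longest)
            longest = ready[code[i].dst];
    }
    return longest;
}
```

Encoding MM0 as 0 and MM1 as 1, Example 1 becomes `{{1,0,0},{0,0,1},{1,0,1}}` and gives a chain of 2. Example 2 becomes `{{1,0,0},{1,1,1},{1,0,1}}` and gives 3.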