Optimizing with SIMD Instructions
Chapter 9
25112
Rev. 3.06
September 2005
Software Optimization Guide for AMD64 Processors
9.15 Accumulating Single-Precision Floating-Point Numbers Using SSE, SSE2, and 3DNow!™ Instructions
Optimization
In 32-bit software, use the 3DNow! PFACC instruction to perform complex-number multiplication, 
4 × 4 matrix multiplication, and dot products. For 64-bit software, careful selection of SSE 
instructions based on how the data is organized can also lead to more efficient code, as shown in the 
second example.
Application
This optimization applies to:
32-bit software
64-bit software
Rationale
Though SSE, SSE2, and 3DNow! instructions are similar in the sense that they all have vectorized 
multiplication and addition, 3DNow! technology supports certain special instructions. One of these is 
the PFACC instruction, which adds the two single-precision values within each operand and packs the 
two sums into the destination register. There are many instances where PFACC is useful, such as 
complex-number multiplication, 4 × 4 matrix multiplication, and dot products.
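
As a reading aid, the horizontal add performed by PFACC can be modeled in scalar C. This is only a sketch of the instruction's documented behavior, not code from this guide: the type mmx2f and the functions pfacc_model and accumulate_model are hypothetical names, and the argument layout of accumulate_model is an assumption inferred from the parameter names in the prototype of the 3DNow! example.

```c
/* Scalar model of a 64-bit MMX register holding two packed floats.
   lo is bits 31:0 and hi is bits 63:32. The type name is hypothetical. */
typedef struct {
    float lo;
    float hi;
} mmx2f;

/* Model of "PFACC dst, src":
     result.lo = dst.lo + dst.hi   (horizontal sum of the destination)
     result.hi = src.lo + src.hi   (horizontal sum of the source)      */
static mmx2f pfacc_model(mmx2f dst, mmx2f src)
{
    mmx2f result;
    result.lo = dst.lo + dst.hi;
    result.hi = src.lo + src.hi;
    return result;
}

/* Assumed layout, inferred from the parameter names only:
     a_and_b       = { a, b }
     c_and_d       = { c, d }
     aplusb_cplusd = { a + b, c + d }                                  */
static void accumulate_model(const float *a_and_b, const float *c_and_d,
                             float *aplusb_cplusd)
{
    mmx2f ab = { a_and_b[0], a_and_b[1] };  /* models a MOVQ load */
    mmx2f cd = { c_and_d[0], c_and_d[1] };  /* models a MOVQ load */
    mmx2f sums = pfacc_model(ab, cd);       /* one PFACC yields both sums */

    aplusb_cplusd[0] = sums.lo;
    aplusb_cplusd[1] = sums.hi;
}
```

Under these assumptions, calling accumulate_model with the pairs {1.0, 2.0} and {3.0, 4.0} stores {3.0, 7.0}. A single PFACC produces both sums, which is why the instruction suits dot products and complex-number multiplication, where partial products must be added across the lanes of a register.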
Examples
The following example accumulates the pair of floats held in each of two MMX registers:
;accumulate_3dnow(float *a_and_b, float *c_and_d, float *aplusb_cplusd);
;
; TO ASSEMBLE INTO *.obj DO THE FOLLOWING:
;       ml.exe -coff -c accumulate_3dnow.asm
;
.586
.K3D
.XMM
_TEXT   SEGMENT
PUBLIC _accumulate_3dnow
_accumulate_3dnow PROC NEAR
;==============================================================================
; INSTRUCTIONS BELOW SAVE THE REGISTER STATE WITH WHICH THIS ROUTINE WAS ENTERED
; REGISTERS (EAX, ECX, EDX ARE CONSIDERED VOLATILE AND ASSUMED TO BE CHANGED)
;  WHILE THE REGISTERS BELOW MUST BE PRESERVED IF THE USER IS CHANGING THEM