User Manual for AMD 250

Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
Chapter 5: Cache and Memory Optimizations
5.13 Memory Copy
Optimization
For a very fast general purpose memory copy routine, call the libc memcpy() function included 
with the Microsoft or gcc tools.  This function features optimizations for all block sizes and 
alignments.
Application
This optimization applies to:
32-bit software
64-bit software
Rationale
The memcpy() routines included with recent compilers from Microsoft and gcc feature optimizations 
for all block sizes and alignments for AMD Athlon 64 and AMD Opteron processors.
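As a minimal sketch of this recommendation, the general-purpose case reduces to a single call to the library routine. The wrapper name copy_block is hypothetical and is not part of the guide; it only illustrates delegating the copy to memcpy() instead of hand-writing a loop, and it assumes the two buffers do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the guide): copy `count` bytes between
   non-overlapping buffers by delegating to the library memcpy(), which
   selects its own strategy for the block size and alignment. */
void *copy_block(void *dst, const void *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    return memcpy(dst, src, count);   /* returns dst, like memcpy itself */
}
```

A call such as copy_block(dst, src, sizeof src) behaves exactly like memcpy(); for overlapping buffers, memmove() would be the appropriate library call instead.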
Copying Small Data Structures
Use inline assembly code to copy a small data structure that is already in cache. Use an unrolled 
series of MOV instructions, either alternating loads and stores (load/store/load/store) or, for even 
better performance, grouping them as load/load/store/store. Align the destination (and the source) 
if possible.
Example 1
The following 64-bit example copies 18 bytes of data:
; rsi = source
; rdi = destination
    mov     r8, [rsi]      ; 8 bytes of source
    mov     r9, [rsi+8]    ; next 8 bytes of source
    mov     [rdi], r8      ; write 8 bytes
    mov     [rdi+8], r9    ; write next 8
    mov     r8w, [rsi+16]  ; read the last 2 bytes into r8w (low word of r8)
    mov     [rdi+16], r8w  ; write the last 2 bytes
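For readers working in C rather than assembly, the same load/load/store/store ordering can be sketched with fixed-size memcpy() calls into temporaries. The function name copy18 is hypothetical, and this is an illustration rather than the guide's method: compilers commonly lower fixed-size 8-byte and 2-byte memcpy() calls to single MOV instructions, but the exact instruction sequence emitted is not guaranteed and depends on the compiler and its optimization settings.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C counterpart of Example 1: copy exactly 18 bytes as
   two 8-byte loads, two 8-byte stores, then one 2-byte load and store.
   Going through memcpy() into temporaries avoids alignment and
   aliasing problems that pointer casts could introduce. */
void copy18(void *dst, const void *src)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    uint64_t q0, q1;
    uint16_t w;

    assert(dst != NULL && src != NULL);

    memcpy(&q0, s, 8);        /* load bytes 0..7   */
    memcpy(&q1, s + 8, 8);    /* load bytes 8..15  */
    memcpy(d, &q0, 8);        /* store bytes 0..7  */
    memcpy(d + 8, &q1, 8);    /* store bytes 8..15 */
    memcpy(&w, s + 16, 2);    /* load bytes 16..17 */
    memcpy(d + 16, &w, 2);    /* store bytes 16..17 */
}
```

Because both loads complete before the first store, the sketch assumes the source and destination do not partially overlap in a way the caller depends on; checking the generated assembly is the only way to confirm that it matches the hand-written sequence.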
Example 2
The following example illustrates how to copy blocks of 32 bytes and larger, in cache. This code 
performs best when the source and destination addresses are 8-byte aligned. Align the destination