User manuals for AMD x86

AMD

User Manual (English)

User Manual

Table of Contents

Contents
3
Revision History
15
Introduction
17
- About this Document
  17
- AMD Athlon™ Processor Family
  19
- AMD Athlon™ Processor Microarchitecture Summary
  20
Top Optimizations
23
- Group I — Essential Optimizations
  23
- Group II — Secondary Optimizations
  23
- Optimization Star
  24
- Group I Optimizations — Essential Optimizations
  24
- Group II Optimizations—Secondary Optimizations
  25
C Source Level Optimizations
29
- Ensure Floating-Point Variables and Expressions are of Type Float
  29
- Use 32-Bit Data Types for Integer Code
  29
- Consider the Sign of Integer Operands
  30
  - Example 1 (Avoid):
    30
  - Example (Preferred):
    30
  - Example (Avoid):
    31
  - Example (Preferred):
    31
  - Use unsigned types for:
    31
  - Use signed types for:
    31
- Use Array Style Instead of Pointer Style Code
  31
  - Example 1 (Avoid):
    32
  - Example 2 (Preferred):
    33
- Completely Unroll Small Loops
  34
  - Example 1 (Avoid):
    34
  - Example 2 (Preferred):
    34
- Avoid Unnecessary Store-to-Load Dependencies
  34
  - Example 1 (Avoid):
    35
  - Example 2 (Preferred):
    35
- Consider Expression Order in Compound Branch Conditions
  36
- Switch Statement Usage
  37
  - Optimize Switch Statements
    37
    - Example 1 (Avoid):
      37
    - Example 2 (Preferred):
      37
- Use Prototypes for All Functions
  37
- Use Const Type Qualifier
  38
- Generic Loop Hoisting
  38
  - Example 1:
    38
  - Generalization for Multiple Constant Control Code
    39
    - Example 2:
      39
- Declare Local Functions as Static
  40
- Dynamic Memory Allocation Consideration
  41
  - Example:
    41
- Introduce Explicit Parallelism into Code
  41
  - Example 1 (Avoid):
    42
  - Example 2 (Preferred):
    42
- Explicitly Extract Common Subexpressions
  42
  - Example 1
    43
    - Avoid:
      43
    - Preferred:
      43
  - Example 2
    43
    - Avoid:
      43
    - Preferred:
      43
- C Language Structure Component Considerations
  43
  - Sort by Base Type Size
    43
  - Pad by Multiple of Largest Base Type Size
    44
    - Original ordering (Avoid):
      44
    - New ordering, with padding (Preferred):
      44
- Sort Local Variables According to Base Type Size
  44
  - Original ordering (Avoid):
    45
  - Improved ordering (Preferred):
    45
- Accelerating Floating-Point Divides and Square Roots
  45
  - Example:
    46
- Avoid Unnecessary Integer Division
  47
  - Example 1 (Avoid):
    47
  - Example 2 (Preferred):
    47
- Copy Frequently De-referenced Pointer Arguments to Local Variables
  47
  - Example 1 (Avoid):
    48
  - Example 2 (Preferred):
    48
Instruction Decoding Optimizations
49
- Overview
  49
- Select DirectPath Over VectorPath Instructions
  50
- Load-Execute Instruction Usage
  50
- Align Branch Targets in Program Hot Spots
  52
- Use Short Instruction Lengths
  52
  - Example 1 (Avoid):
    52
  - Example 2 (Preferred):
    53
- Avoid Partial Register Reads and Writes
  53
- Replace Certain SHLD Instructions with Alternative Code
  54
  - Example 1
    54
    - (Avoid):
      54
    - (Preferred):
      54
  - Example 2
    54
    - (Avoid):
      54
    - (Preferred):
      54
  - Example 3
    54
    - (Avoid):
      54
    - (Preferred):
      54
- Use 8-Bit Sign-Extended Immediates
  54
- Use 8-Bit Sign-Extended Displacements
  55
- Code Padding Using Neutral Code Fillers
  55
  - Recommendations for the AMD Athlon™ Processor
    56
  - Recommendations for AMD-K6® Family and AMD Athlon™ Processor Blended Code
    57
Cache and Memory Optimizations
61
- Memory Size and Alignment Issues
  61
  - Avoid Memory Size Mismatches
    61
    - Example 1 (Avoid):
      61
    - Example 2 (Avoid):
      61
  - Align Data Where Possible
    62
- Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
  62
- Take Advantage of Write Combining
  66
- Avoid Placing Code and Data in the Same 64-Byte Cache Line
  66
- Store-to-Load Forwarding Restrictions
  67
  - Store-to-Load Forwarding Pitfalls—True Dependencies
    67
  - Summary of Store-to-Load Forwarding Pitfalls to Avoid
    70
- Stack Alignment Considerations
  70
  - Extend to 32 Bits Before Pushing onto Stack
    70
    - Example (Preferred):
      71
- Align TBYTE Variables on Quadword Aligned Addresses
  71
- C Language Structure Component Considerations
  71
  - Example:
    72
- Sort Variables According to Base Type Size
  72
  - Example:
    72
Branch Optimizations
73
- Avoid Branches Dependent on Random Data
  73
  - AMD Athlon™ Processor Specific Code
    74
    - Example 1 — Signed integer ABS function (X = labs(X)):
      74
    - Example 2 — Unsigned integer min function (z = x < y ? x : y):
      74
  - Blended AMD-K6® and AMD Athlon™ Processor Code
    74
- Always Pair CALL and RETURN
  75
- Replace Branches with Computation in 3DNow!™ Code
  76
  - Muxing Constructs
    76
    - Example 1 (Avoid):
      76
    - Example 2 (Preferred):
      77
  - Sample Code Translated into 3DNow!™ Code
    77
    - Example 1:
      77
      - C code:
        77
      - 3DNow! code:
        77
    - Example 2:
      78
      - C code:
        78
      - 3DNow! code:
        78
    - Example 3:
      78
      - C code:
        78
      - 3DNow! code:
        78
    - Example 4:
      79
      - C code:
        79
      - 3DNow! code:
        79
    - Example 5:
      80
      - C code:
        80
      - 3DNow! code:
        80
- Avoid the Loop Instruction
  81
  - Example 1 (Avoid):
    81
  - Example 2 (Preferred):
    81
- Avoid Far Control Transfer Instructions
  81
- Avoid Recursive Functions
  82
  - Example 1 (Avoid):
    82
  - Example 2 (Preferred):
    82
Scheduling Optimizations
83
- Schedule Instructions According to their Latency
  83
- Unrolling Loops
  83
  - Complete Loop Unrolling
    83
  - Partial Loop Unrolling
    84
- Use Function Inlining
  87
- Avoid Address Generation Interlocks
  88
  - Example 1 (Avoid):
    89
  - Example 2 (Preferred):
    89
- Use MOVZX and MOVSX
  89
  - Example 1 (Avoid):
    89
  - Example 2 (Preferred):
    89
- Minimize Pointer Arithmetic in Loops
  89
  - Example 1 (Avoid):
    90
  - Example 2 (Preferred):
    90
  - Example 3 (Preferred):
    91
- Push Memory Data Carefully
  91
  - Example 1 (Avoid):
    91
  - Example 2 (Preferred):
    91
Integer Optimizations
93
- Replace Divides with Multiplies
  93
- Use Alternative Code When Multiplying by a Constant
  97
- Use MMX™ Instructions for Integer-Only Work
  99
- Repeated String Instruction Usage
  100
  - Latency of Repeated String Instructions
    100
  - Guidelines for Repeated String Instructions
    100
- Use XOR Instruction to Clear Integer Registers
  102
  - Example 1 (Acceptable):
    102
  - Example 2 (Preferred):
    102
- Efficient 64-Bit Integer Arithmetic
  102
  - Example 1 (Addition):
    102
  - Example 2 (Subtraction):
    102
  - Example 3 (Negation):
    102
  - Example 4 (Left shift):
    103
  - Example 5 (Right shift):
    103
  - Example 6 (Multiplication):
    103
  - Example 7 (Division):
    104
  - Example 8 (Remainder):
    105
- Efficient Implementation of Population Count Function
  107
  - Step 1
    107
  - Step 2
    107
  - Step 3
    108
  - Step 4
    108
    - Example:
      108
- Derivation of Multiplier Used for Integer Division by Constants
  109
  - Unsigned Derivation for Algorithm, Multiplier, and Shift Factor
    109
  - Signed Derivation for Algorithm, Multiplier, and Shift Factor
    111
Floating-Point Optimizations
113
- Ensure All FPU Data is Aligned
  113
- Use Multiplies Rather than Divides
  113
- Use FFREEP Macro to Pop One Register from the FPU Stack
  114
- Floating-Point Compare Instructions
  114
- Use the FXCH Instruction Rather than FST/FLD Pairs
  115
- Avoid Using Extended-Precision Data
  115
- Minimize Floating-Point-to-Integer Conversions
  116
  - Example 1 (Fast):
    116
  - Example 2 (Potentially faster):
    117
  - Example 3 (Potentially faster):
    118
  - Example 4 (Fastest):
    118
- Floating-Point Subexpression Elimination
  119
  - Example 1 (Avoid):
    119
  - Example 2 (Preferred):
    119
- Check Argument Range of Trigonometric Instructions Efficiently
  119
  - Example 1 (Avoid):
    120
  - Example 2 (Preferred):
    120
- Take Advantage of the FSINCOS Instruction
  121
  - Example 1 (Avoid):
    121
  - Example 2 (Preferred):
    121
3DNow!™ and MMX™ Optimizations
123
- Use 3DNow!™ Instructions
  123
- Use FEMMS Instruction
  123
- Use 3DNow!™ Instructions for Fast Division
  124
  - Optimized 14-Bit Precision Divide
    124
    - Example:
      124
  - Optimized Full 24-Bit Precision Divide
    124
    - Example:
      124
  - Pipelined Pair of 24-Bit Precision Divides
    125
    - Example:
      125
  - Newton-Raphson Reciprocal
    125
- Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root
  126
  - Optimized 15-Bit Precision Square Root
    126
    - Example:
      126
  - Optimized 24-Bit Precision Square Root
    126
    - Example:
      126
  - Newton-Raphson Reciprocal Square Root
    127
- Use MMX™ PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel
  127
  - Example:
    128
- 3DNow!™ and MMX™ Intra-Operand Swapping
  128
  - AMD Athlon™ Specific Code
    128
  - Blended Code
    128
    - Example 1 (Preferred, faster):
      128
    - Example 2 (Preferred, fast):
      128
- Fast Conversion of Signed Words to Floating-Point
  129
  - Example 1 (AMD Athlon specific code using 3DNow! DSP extension):
    129
  - Example 2 (AMD-K6 Family and AMD Athlon processor blended code):
    129
- Use MMX™ PXOR to Negate 3DNow!™ Data
  129
- Use MMX™ PCMP Instead of 3DNow!™ PFCMP
  130
  - Both Numbers Positive
    130
  - One Negative, One Positive
    130
  - Both Numbers Negative
    130
- Use MMX™ Instructions for Block Copies and Block Fills
  131
  - AMD-K6® and AMD Athlon™ Processor Blended Code
    131
    - Example 1:
      131
  - AMD Athlon™ Processor Specific Code
    133
    - Example 2:
      133
- Use MMX™ PXOR to Clear All Bits in an MMX™ Register
  134
- Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register
  135
- Use MMX™ PAND to Find Absolute Value in 3DNow!™ Code
  135
- Optimized Matrix Multiplication
  135
- Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions
  138
- Use 3DNow!™ PAVGUSB for MPEG2 Motion Compensation
  139
  - Example 1 (Avoid):
    140
  - Example 2 (Preferred):
    141
- Stream of Packed Unsigned Bytes
  141
  - Example:
    141
- Complex Number Arithmetic
  142
  - Example:
    142
  - Example:
    142
General x86 Optimization Guidelines
143
- Short Forms
  143
  - Example 1 (Avoid):
    143
  - Example 2 (Preferred):
    143
- Dependencies
  144
- Register Operands
  144
- Stack Allocation
  144
Appendix A
145
AMD Athlon™ Processor Microarchitecture
145
- Introduction
  145
- AMD Athlon™ Processor Microarchitecture
  146
  - Superscalar Processor
    146
  - Instruction Cache
    147
  - Predecode
    148
  - Branch Prediction
    148
  - Early Decoding
    149
    - DirectPath Decoder
      149
    - VectorPath Decoder
      149
  - Instruction Control Unit
    150
  - Data Cache
    150
  - Integer Scheduler
    151
  - Integer Execution Unit
    151
  - Floating-Point Scheduler
    152
  - Floating-Point Execution Unit
    153
  - Load-Store Unit (LSU)
    154
  - L2 Cache Controller
    155
  - Write Combining
    155
  - AMD Athlon™ System Bus
    155
Appendix B
157
Pipeline and Execution Unit Resources Overview
157
- Fetch and Decode Pipeline Stages
  157
  - Cycle 1–FETCH
    159
  - Cycle 2–SCAN
    159
  - Cycle 3 (DirectPath)– ALIGN1
    159
  - Cycle 3 (VectorPath)– MECTL
    159
  - Cycle 4 (DirectPath)– ALIGN2
    159
  - Cycle 4 (VectorPath)– MEROM
    159
  - Cycle 5 (DirectPath)– EDEC
    159
  - Cycle 5 (VectorPath)– MEDEC/MESEQ
    159
  - Cycle 6– IDEC/Rename
    159
- Integer Pipeline Stages
  160
  - Cycle 7–SCHED
    161
  - Cycle 8–EXEC
    161
  - Cycle 9–ADDGEN
    161
  - Cycle 10–DCACC
    161
  - Cycle 11–RESP
    161
- Floating-Point Pipeline Stages
  162
  - Cycle 7–STKREN
    163
  - Cycle 8–REGREN
    163
  - Cycle 9–SCHEDW
    163
  - Cycle 10–SCHED
    163
  - Cycle 11–FREG
    163
  - Cycle 12–15– Floating-Point Execution (FEXEC1–4)
    163
- Execution Unit Resources
  164
  - Terminology
    164
    - Operands
      164
    - Results
      164
    - Examples
      164
  - Integer Pipeline Operations
    165
  - Floating-Point Pipeline Operations
    166
  - Load/Store Pipeline Operations
    167
  - Code Sample Analysis
    168
Appendix C
171
Implementation of Write Combining
171
- Introduction
  171
- Write-Combining Definitions and Abbreviations
  172
- What is Write Combining?
  172
- Programming Details
  172
- Write-Combining Operations
  173
  - Sending Write-Buffer Data to the System
    175
Appendix D
177
Performance-Monitoring Counters
177
- Overview
  177
- Performance Counter Usage
  177
- Event and Time-Stamp Monitoring Software
  184
- Monitoring Counter Overflow
  185
Appendix E
187
Programming the MTRR and PAT
187
- Introduction
  187
- Memory Type Range Register (MTRR) Mechanism
  187
  - Memory Types
    190
  - MTRR Capability Register Format
    190
    - MTRR Default Type Register Format
      191
  - MTRR Overlapping
    192
- Page Attribute Table (PAT)
  193
  - MSR Access
    193
  - Accessing the PAT
    194
  - MTRRs and PAT
    194
  - MTRR Fixed-Range Register Format
    198
  - Variable-Range MTRRs
    199
  - Variable-Range MTRR Register Format
    199
  - MTRR MSR Format
    201
Appendix F
203
Instruction Dispatch and Execution Resources
203
Appendix G
235
DirectPath versus VectorPath Instructions
235
- Select DirectPath Over VectorPath Instructions
  235
- DirectPath Instructions
  235
- VectorPath Instructions
  247
- Index
  253

Size:

2.99 MB
Pages:

256
Language:

English