User Manual

Summary

Contents
Revision History

Introduction
  About this Document
    Chapter 1: Introduction
    Chapter 2: Top Optimizations
    Chapter 3: C Source Level Optimizations
    Chapter 4: Instruction Decoding Optimizations
    Chapter 5: Cache and Memory Optimizations
    Chapter 6: Branch Optimizations
    Chapter 7: Scheduling Optimizations
    Chapter 8: Integer Optimizations
    Chapter 9: Floating-Point Optimizations
    Chapter 10: 3DNow!™ and MMX™ Optimizations
    Chapter 11: General x86 Optimizations Guidelines
    Appendix A: AMD Athlon Processor Microarchitecture
    Appendix B: Pipeline and Execution Unit Resources Overview
    Appendix C: Implementation of Write Combining
    Appendix D: Performance Monitoring Counters
    Appendix E: Programming the MTRR and PAT
    Appendix F: Instruction Dispatch and Execution Resources
    Appendix G: DirectPath versus VectorPath Instructions
  AMD Athlon™ Processor Family
  AMD Athlon™ Processor Microarchitecture Summary

Top Optimizations
  Group I – Essential Optimizations
  Group II – Secondary Optimizations
  Optimization Star
  Group I Optimizations – Essential Optimizations
    Memory Size and Alignment Issues
      Avoid Memory Size Mismatches
      Align Data Where Possible
    Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
    Select DirectPath Over VectorPath Instructions
  Group II Optimizations – Secondary Optimizations
    Load-Execute Instruction Usage
      Use Load-Execute Instructions
      Avoid Load-Execute Floating-Point Instructions with Integer Operands
    Take Advantage of Write Combining
    Use 3DNow!™ Instructions
    Avoid Branches Dependent on Random Data
    Avoid Placing Code and Data in the Same 64-Byte Cache Line

C Source Level Optimizations
  Ensure Floating-Point Variables and Expressions are of Type Float
  Use 32-Bit Data Types for Integer Code
  Consider the Sign of Integer Operands
    Example 1 (Avoid):
    Example (Preferred):
    Example (Avoid):
    Example (Preferred):
    Use unsigned types for:
    Use signed types for:
  Use Array Style Instead of Pointer Style Code
    Example 1 (Avoid):
    Example 2 (Preferred):
  Completely Unroll Small Loops
    Example 1 (Avoid):
    Example 2 (Preferred):
  Avoid Unnecessary Store-to-Load Dependencies
    Example 1 (Avoid):
    Example 2 (Preferred):
  Consider Expression Order in Compound Branch Conditions
  Switch Statement Usage
    Optimize Switch Statements
      Example 1 (Avoid):
      Example 2 (Preferred):
  Use Prototypes for All Functions
  Use Const Type Qualifier
  Generic Loop Hoisting
    Example 1:
    Generalization for Multiple Constant Control Code
    Example 2:
  Declare Local Functions as Static
  Dynamic Memory Allocation Consideration
    Example:
  Introduce Explicit Parallelism into Code
    Example 1 (Avoid):
    Example 2 (Preferred):
  Explicitly Extract Common Subexpressions
    Example 1
      Avoid:
      Preferred:
    Example 2
      Avoid:
      Preferred:
  C Language Structure Component Considerations
    Sort by Base Type Size
    Pad by Multiple of Largest Base Type Size
      Original ordering (Avoid):
      New ordering, with padding (Preferred):
  Sort Local Variables According to Base Type Size
    Original ordering (Avoid):
    Improved ordering (Preferred):
  Accelerating Floating-Point Divides and Square Roots
    Example:
  Avoid Unnecessary Integer Division
    Example 1 (Avoid):
    Example 2 (Preferred):
  Copy Frequently De-referenced Pointer Arguments to Local Variables
    Example 1 (Avoid):
    Example 2 (Preferred):

Instruction Decoding Optimizations
  Overview
  Select DirectPath Over VectorPath Instructions
  Load-Execute Instruction Usage
    Use Load-Execute Integer Instructions
    Use Load-Execute Floating-Point Instructions with Floating-Point Operands
      Example 1 (Avoid):
      Example 2 (Preferred):
    Avoid Load-Execute Floating-Point Instructions with Integer Operands
      Example 1 (Avoid):
      Example 2 (Preferred):
  Align Branch Targets in Program Hot Spots
  Use Short Instruction Lengths
    Example 1 (Avoid):
    Example 2 (Preferred):
  Avoid Partial Register Reads and Writes
  Replace Certain SHLD Instructions with Alternative Code
    Example 1
      (Avoid):
      (Preferred):
    Example 2
      (Avoid):
      (Preferred):
    Example 3
      (Avoid):
      (Preferred):
  Use 8-Bit Sign-Extended Immediates
  Use 8-Bit Sign-Extended Displacements
  Code Padding Using Neutral Code Fillers
    Recommendations for the AMD Athlon™ Processor
    Recommendations for AMD-K6® Family and AMD Athlon™ Processor Blended Code

Cache and Memory Optimizations
  Memory Size and Alignment Issues
    Avoid Memory Size Mismatches
      Example 1 (Avoid):
      Example 2 (Avoid):
    Align Data Where Possible
  Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
    PREFETCH/W versus PREFETCHNTA/T0/T1/T2
    PREFETCHW Usage
    Multiple Prefetches
      Example (Multiple Prefetches):
    Determining Prefetch Distance
    Prefetch at Least 64 Bytes Away from Surrounding Stores
  Take Advantage of Write Combining
  Avoid Placing Code and Data in the Same 64-Byte Cache Line
  Store-to-Load Forwarding Restrictions
    Store-to-Load Forwarding Pitfalls – True Dependencies
      Narrow-to-Wide Store-Buffer Data Forwarding Restriction
        Example 1 (Avoid):
        Example 2 (Avoid):
      Wide-to-Narrow Store-Buffer Data Forwarding Restriction
        Example 3 (Avoid):
        Example 4 (Avoid):
        Example 5 (Preferred):
      Misaligned Store-Buffer Data Forwarding Restriction
        Example 6 (Avoid):
      High-Byte Store-Buffer Data Forwarding Restriction
        Example 7 (Avoid):
      One Supported Store-to-Load Forwarding Case
        Example 8 (Allowed):
    Summary of Store-to-Load Forwarding Pitfalls to Avoid
  Stack Alignment Considerations
    Extend to 32 Bits Before Pushing onto Stack
      Example (Preferred):
  Align TBYTE Variables on Quadword Aligned Addresses
  C Language Structure Component Considerations
    Example:
  Sort Variables According to Base Type Size
    Example:

Branch Optimizations
  Avoid Branches Dependent on Random Data
    AMD Athlon™ Processor Specific Code
      Example 1 – Signed integer ABS function (X = labs(X)):
      Example 2 – Unsigned integer min function (z = x < y ? x : y):
    Blended AMD-K6® and AMD Athlon™ Processor Code
      Example 3 – Signed integer ABS function (X = labs(X)):
      Example 4 – Unsigned integer min function (z = x < y ? x : y):
      Example 5 – Hexadecimal to ASCII conversion (y = x < 10 ? x + 0x30 : x + 0x41):
      Example 6 – Increment Ring Buffer Offset:
      Example 7 – Integer Signum Function:
  Always Pair CALL and RETURN
  Replace Branches with Computation in 3DNow!™ Code
    Muxing Constructs
      Example 1 (Avoid):
      Example 2 (Preferred):
    Sample Code Translated into 3DNow!™ Code
      Example 1:
        C code:
        3DNow! code:
      Example 2:
        C code:
        3DNow! code:
      Example 3:
        C code:
        3DNow! code:
      Example 4:
        C code:
        3DNow! code:
      Example 5:
        C code:
        3DNow! code:
  Avoid the Loop Instruction
    Example 1 (Avoid):
    Example 2 (Preferred):
  Avoid Far Control Transfer Instructions
  Avoid Recursive Functions
    Example 1 (Avoid):
    Example 2 (Preferred):

Scheduling Optimizations
  Schedule Instructions According to their Latency
  Unrolling Loops
    Complete Loop Unrolling
    Partial Loop Unrolling
      Without Loop Unrolling:
      With Partial Loop Unrolling:
    Deriving Loop Control For Partially Unrolled Loops
      Example 1 (rolled loop):
      Example 2 (partially unrolled loop):
  Use Function Inlining
    Overview
    Always Inline Functions if Called from One Site
    Always Inline Functions with Fewer than 25 Machine Instructions
  Avoid Address Generation Interlocks
    Example 1 (Avoid):
    Example 2 (Preferred):
  Use MOVZX and MOVSX
    Example 1 (Avoid):
    Example 2 (Preferred):
  Minimize Pointer Arithmetic in Loops
    Example 1 (Avoid):
    Example 2 (Preferred):
    Example 3 (Preferred):
  Push Memory Data Carefully
    Example 1 (Avoid):
    Example 2 (Preferred):

Integer Optimizations
  Replace Divides with Multiplies
    Multiplication by Reciprocal (Division) Utility
      Signed Division Utility
      Unsigned Division Utility
    Unsigned Division by Multiplication of Constant
      Algorithm: Divisors 1 <= d < 2^31, Odd d
      Derivation of a, m, s
      Algorithm: Divisors 2^31 <= d < 2^32
      Example 1:
      Simpler Code for Restricted Dividend
    Signed Division by Multiplication of Constant
      Algorithm: Divisors 2 <= d < 2^31
      Derivation for a, m, s
      Signed Division By 2
      Signed Division By 2^n
      Signed Division By –2
      Signed Division By –(2^n)
      Remainder of Signed Integer 2 or –2
      Remainder of Signed Integer 2^n or –(2^n)
  Use Alternative Code When Multiplying by a Constant
  Use MMX™ Instructions for Integer-Only Work
  Repeated String Instruction Usage
    Latency of Repeated String Instructions
    Guidelines for Repeated String Instructions
      Use the Largest Possible Operand Size
      Ensure DF=0 (UP)
      Align Source and Destination with Operand Size
      Inline REP String with Low Counts
      Use Loop for REP String with Low Variable Counts
      Using MOVQ and MOVNTQ for Block Copy/Fill
  Use XOR Instruction to Clear Integer Registers
    Example 1 (Acceptable):
    Example 2 (Preferred):
  Efficient 64-Bit Integer Arithmetic
    Example 1 (Addition):
    Example 2 (Subtraction):
    Example 3 (Negation):
    Example 4 (Left shift):
    Example 5 (Right shift):
    Example 6 (Multiplication):
    Example 7 (Division):
    Example 8 (Remainder):
  Efficient Implementation of Population Count Function
    Step 1
    Step 2
    Step 3
    Step 4
    Example:
  Derivation of Multiplier Used for Integer Division by Constants
    Unsigned Derivation for Algorithm, Multiplier, and Shift Factor
    Signed Derivation for Algorithm, Multiplier, and Shift Factor

Floating-Point Optimizations
  Ensure All FPU Data is Aligned
  Use Multiplies Rather than Divides
  Use FFREEP Macro to Pop One Register from the FPU Stack
  Floating-Point Compare Instructions
  Use the FXCH Instruction Rather than FST/FLD Pairs
  Avoid Using Extended-Precision Data
  Minimize Floating-Point-to-Integer Conversions
    Example 1 (Fast):
    Example 2 (Potentially faster):
    Example 3 (Potentially faster):
    Example 4 (Fastest):
  Floating-Point Subexpression Elimination
    Example 1 (Avoid):
    Example 2 (Preferred):
  Check Argument Range of Trigonometric Instructions Efficiently
    Example 1 (Avoid):
    Example 2 (Preferred):
  Take Advantage of the FSINCOS Instruction
    Example 1 (Avoid):
    Example 2 (Preferred):

3DNow!™ and MMX™ Optimizations
  Use 3DNow!™ Instructions
  Use FEMMS Instruction
  Use 3DNow!™ Instructions for Fast Division
    Optimized 14-Bit Precision Divide
      Example:
    Optimized Full 24-Bit Precision Divide
      Example:
    Pipelined Pair of 24-Bit Precision Divides
      Example:
    Newton-Raphson Reciprocal
  Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root
    Optimized 15-Bit Precision Square Root
      Example:
    Optimized 24-Bit Precision Square Root
      Example:
    Newton-Raphson Reciprocal Square Root
  Use MMX™ PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel
    Example:
  3DNow!™ and MMX™ Intra-Operand Swapping
    AMD Athlon™ Specific Code
    Blended Code
      Example 1 (Preferred, faster):
      Example 2 (Preferred, fast):
  Fast Conversion of Signed Words to Floating-Point
    Example 1 (AMD Athlon specific code using 3DNow! DSP extension):
    Example 2 (AMD-K6 Family and AMD Athlon processor blended code):
  Use MMX™ PXOR to Negate 3DNow!™ Data
  Use MMX™ PCMP Instead of 3DNow!™ PFCMP
    Both Numbers Positive
    One Negative, One Positive
    Both Numbers Negative
  Use MMX™ Instructions for Block Copies and Block Fills
    AMD-K6® and AMD Athlon™ Processor Blended Code
      Example 1:
    AMD Athlon™ Processor Specific Code
      Example 2:
  Use MMX™ PXOR to Clear All Bits in an MMX™ Register
  Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register
  Use MMX™ PAND to Find Absolute Value in 3DNow!™ Code
  Optimized Matrix Multiplication
  Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions
  Use 3DNow!™ PAVGUSB for MPEG-2 Motion Compensation
    Example 1 (Avoid):
    Example 2 (Preferred):
  Stream of Packed Unsigned Bytes
    Example:
  Complex Number Arithmetic
    Example:
    Example:

General x86 Optimization Guidelines
  Short Forms
    Example 1 (Avoid):
    Example 2 (Preferred):
  Dependencies
  Register Operands
  Stack Allocation

Appendix A: AMD Athlon™ Processor Microarchitecture
  Introduction
  AMD Athlon™ Processor Microarchitecture
    Superscalar Processor
    Instruction Cache
    Predecode
    Branch Prediction
    Early Decoding
    DirectPath Decoder
    VectorPath Decoder
    Instruction Control Unit
    Data Cache
    Integer Scheduler
    Integer Execution Unit
    Floating-Point Scheduler
    Floating-Point Execution Unit
    Load-Store Unit (LSU)
    L2 Cache Controller
    Write Combining
    AMD Athlon™ System Bus

Appendix B: Pipeline and Execution Unit Resources Overview
  Fetch and Decode Pipeline Stages
    Cycle 1 – FETCH
    Cycle 2 – SCAN
    Cycle 3 (DirectPath) – ALIGN1
    Cycle 3 (VectorPath) – MECTL
    Cycle 4 (DirectPath) – ALIGN2
    Cycle 4 (VectorPath) – MEROM
    Cycle 5 (DirectPath) – EDEC
    Cycle 5 (VectorPath) – MEDEC/MESEQ
    Cycle 6 – IDEC/Rename
  Integer Pipeline Stages
    Cycle 7 – SCHED
    Cycle 8 – EXEC
    Cycle 9 – ADDGEN
    Cycle 10 – DCACC
    Cycle 11 – RESP
  Floating-Point Pipeline Stages
    Cycle 7 – STKREN
    Cycle 8 – REGREN
    Cycle 9 – SCHEDW
    Cycle 10 – SCHED
    Cycle 11 – FREG
    Cycle 12–15 – Floating-Point Execution (FEXEC1–4)
  Execution Unit Resources
    Terminology
      Operands
      Results
      Examples
    Integer Pipeline Operations
    Floating-Point Pipeline Operations
    Load/Store Pipeline Operations
    Code Sample Analysis

Appendix C: Implementation of Write Combining
  Introduction
  Write-Combining Definitions and Abbreviations
  What is Write Combining?
  Programming Details
  Write-Combining Operations
    Sending Write-Buffer Data to the System

Appendix D: Performance-Monitoring Counters
  Overview
  Performance Counter Usage
    PerfEvtSel[3:0] MSRs (MSR Addresses C001_0000h–C001_0003h)
      Event Select Field (Bits 0–7)
      Unit Mask Field (Bits 8–15)
      USR (User Mode) Flag (Bit 16)
      OS (Operating System Mode) Flag (Bit 17)
      E (Edge Detect) Flag (Bit 18)
      PC (Pin Control) Flag (Bit 19)
      INT (APIC Interrupt Enable) Flag (Bit 20)
      EN (Enable Counter) Flag (Bit 22)
      INV (Invert) Flag (Bit 23)
      Counter Mask Field (Bits 31–24)
    PerfCtr[3:0] MSRs (MSR Addresses C001_0004h–C001_0007h)
  Starting and Stopping the Performance-Monitoring Counters
  Event and Time-Stamp Monitoring Software
  Monitoring Counter Overflow

Appendix E: Programming the MTRR and PAT
  Introduction
  Memory Type Range Register (MTRR) Mechanism
    Memory Types
    MTRR Capability Register Format
    MTRR Default Type Register Format
    MTRR Overlapping
  Page Attribute Table (PAT)
    MSR Access
    Accessing the PAT
    MTRRs and PAT
    MTRR Fixed-Range Register Format
    Variable-Range MTRRs
    Variable-Range MTRR Register Format
    MTRR MSR Format

Appendix F: Instruction Dispatch and Execution Resources

Appendix G: DirectPath versus VectorPath Instructions
  Select DirectPath Over VectorPath Instructions
  DirectPath Instructions
  VectorPath Instructions

Index

Size: 2.99 MB
Pages: 256
Language: English