Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems
40555  Rev. 3.00  June 2006
Chapter 1
Introduction
The AMD Athlon™ 64 and AMD Opteron™ family of single-core and dual-core multiprocessor 
systems is based on the cache coherent Non-Uniform Memory Access (ccNUMA) architecture. In 
this architecture, each processor has access to its own low-latency local memory (through the 
processor’s on-die memory controller), as well as to higher-latency remote memory through the 
on-die memory controllers of the other processors in the system. At the same time, the ccNUMA 
architecture maintains cache coherence across the entire shared memory space. The high-performance 
coherent HyperTransport™ technology links between the processors carry both remote memory 
accesses and the traffic that keeps the caches coherent.
In traditional symmetric multiprocessing (SMP) systems, all processors share a single memory 
controller. This single memory connection can become a performance bottleneck when every 
processor accesses memory at once, and it keeps the SMP architecture from scaling well to systems 
with larger numbers of processors. The AMD ccNUMA architecture is designed to overcome these 
inherent SMP bottlenecks; it is a mature architecture that extracts greater performance potential from 
multiprocessor systems.
As developers deploy more demanding workloads on these multiprocessor systems, common 
performance questions arise: Where should threads or processes be scheduled (thread or process 
placement)? Where should memory be allocated (memory placement)? The underlying operating 
system (OS), tuned for AMD Athlon 64 and AMD Opteron multiprocessor ccNUMA systems, makes 
these placement decisions transparently on the application’s behalf.
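To make the two placement decisions concrete, the following minimal sketch (an illustration only; it 
assumes a Linux system with the libnuma library, which is just one of several OS interfaces) pins the 
calling thread to node 0 and allocates its working buffer from that node's local memory.

/* Sketch: explicit thread and memory placement with libnuma on Linux (link with -lnuma).
 * Assumes a ccNUMA system with NUMA support enabled in the kernel. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support in this kernel\n");
        return 1;
    }

    /* Thread placement: run the calling thread only on the CPUs of node 0. */
    numa_run_on_node(0);

    /* Memory placement: allocate the working buffer from node 0's local memory. */
    size_t size = 64UL * 1024 * 1024;
    char *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    memset(buf, 0, size);   /* touch the pages so they are faulted in on node 0 */

    /* ... accesses to buf now stay on the local memory controller ... */

    numa_free(buf, size);
    return 0;
}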
Advanced developers, however, should be aware of the additional tools and techniques available for 
performance tuning. In addition to recommending mechanisms provided by the OS for explicit thread 
(or process) and memory placement, this application note explores techniques such as node 
interleaving of memory to boost performance. This document also delves into the characterization of 
an AMD ccNUMA multiprocessor system, giving developers the understanding of the fundamentals 
they need to enhance the performance of synthetic and real applications and to develop advanced 
tools.
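As an early illustration of node interleaving, the sketch below (again assuming the Linux libnuma 
interface; it is not necessarily the mechanism examined later in this document) spreads the pages of 
one large shared buffer round-robin across all allowed nodes, so that no single on-die memory 
controller carries the entire bandwidth load.

/* Sketch: interleave a shared buffer's pages across all allowed NUMA nodes
 * (Linux libnuma; link with -lnuma). */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support in this kernel\n");
        return 1;
    }

    size_t size = 256UL * 1024 * 1024;

    /* Pages are placed round-robin on the nodes in this thread's allowed node mask. */
    char *shared = numa_alloc_interleaved(size);
    if (shared == NULL) {
        fprintf(stderr, "numa_alloc_interleaved failed\n");
        return 1;
    }
    memset(shared, 0, size);   /* touch the pages so the interleaved placement takes effect */

    /* ... threads on any node now see roughly uniform latency to this buffer,
     * and its bandwidth demand is spread across every memory controller ... */

    numa_free(shared, size);
    return 0;
}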
In general, applications can be memory-latency sensitive or memory-bandwidth sensitive; both classes 
are important for performance tuning. In a multiprocessor system, factors beyond memory latency and 
memory bandwidth also influence performance (a sketch that exposes the first of them follows the list):
•  the latency of remote memory access (hop latency)
•  the latency of maintaining cache coherence (probe latency)
•  the bandwidth of the HyperTransport interconnect links
•  the lengths of various buffer queues in the system
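The sketch below is one way to expose hop latency. It is an illustration only, not the synthetic test used 
for the measurements in this document; it assumes a Linux system with libnuma and at least two nodes, 
pins a thread to node 0, and times dependent loads from a buffer on node 0 against a buffer on node 1.

/* Sketch only: time dependent loads from local vs. remote memory to expose hop latency.
 * Assumes Linux, libnuma (link with -lnuma), and at least two NUMA nodes. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE 64                         /* one cache line per pointer slot        */
#define NLOADS (1UL << 24)                /* number of dependent loads to time      */

static double chase_ns(char *buf, size_t size)
{
    size_t nslots = size / STRIDE;

    /* Link the slots into one random cycle (Fisher-Yates shuffle) so that the
     * hardware prefetcher cannot hide the memory latency. */
    size_t *order = malloc(nslots * sizeof *order);
    for (size_t i = 0; i < nslots; i++)
        order[i] = i;
    for (size_t i = nslots - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < nslots; i++)
        *(void **)(buf + order[i] * STRIDE) = buf + order[(i + 1) % nslots] * STRIDE;
    free(order);

    struct timespec t0, t1;
    void *p = buf;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < NLOADS; i++)
        p = *(void **)p;                  /* each load depends on the previous one  */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (p == NULL)                        /* keep the chain live for the optimizer  */
        puts("unreachable");

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)NLOADS;           /* average nanoseconds per dependent load */
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need NUMA support and at least two nodes\n");
        return 1;
    }

    size_t size = 256UL * 1024 * 1024;    /* much larger than the caches            */

    numa_run_on_node(0);                  /* pin the measuring thread to node 0     */
    char *local  = numa_alloc_onnode(size, 0);   /* node 0 memory: local access     */
    char *remote = numa_alloc_onnode(size, 1);   /* node 1 memory: one hop away     */
    if (local == NULL || remote == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    printf("local  memory: %.1f ns per load\n", chase_ns(local,  size));
    printf("remote memory: %.1f ns per load\n", chase_ns(remote, size));

    numa_free(local,  size);
    numa_free(remote, size);
    return 0;
}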
The empirical analysis presented in this document is based upon data provided by running a multi-
threaded synthetic test. While this test is neither a pure memory latency test nor a pure memory