40555 Rev. 3.00 June 2006
Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems

Chapter 3: Analysis and Recommendations
3.1.2 Multiple Threads-Shared Data
When scheduling multiple threads that share data on an idle system, it is preferable to schedule the threads on both cores of an idle node first, then on both cores of the next idle node, and so on. In other words, schedule in core-major order first, followed by node-major order.
For example, when scheduling threads that share data on a dual-core Quartet system, AMD recommends using the following order:
• Core 0 and core 1 on node 0 in any order
• Core 0 and core 1 on node 1 in any order
• Core 0 and core 1 on node 2 in any order
• Core 0 and core 1 on node 3 in any order
3.1.3 Scheduling on a Non-Idle System
In general, most developers will achieve good performance by relying on the ccNUMA-aware OS to make the right scheduling decisions on idle and non-idle systems. For additional details on ccNUMA scheduler support in various operating systems, refer to Section A.6 on page 43.
In addition to the scheduler, several NUMA-aware OSs provide tools and application programming interfaces (APIs) that allow the developer to explicitly set thread placement to a certain core or node. Using these tools or APIs overrides the scheduler and hands control of thread placement to the developer, who should use the previously mentioned techniques to ensure reasonable scheduling.
For additional details on the tools and API libraries supported in various OSs, refer to Section A.7 on page 44.
3.2 Data Locality Considerations
It is best to keep data local to the node from which it is being accessed. Accessing data remotely is slower than accessing data locally, and the farther the hop distance to the data, the greater the cost of accessing remote memory. For most memory-latency-sensitive applications, keeping data local is the single most important recommendation to consider.
As explained in Section 2.1 on page 13, if a thread is running and accessing data on the same node, it is considered a local access. If a thread is running on one node but accessing data resident on a different node, it is considered a remote access. If the node where the thread is running and the node where the data is resident are directly connected to each other, it is considered a 1-hop access