Appendix A
Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™
ccNUMA Multiprocessor Systems
40555
Rev. 3.00
June 2006
4.4 GB/s necessary. The two coherent HyperTransport links are loaded at 3.5 GB/s each. Thus the 
utilization of each of the two coherent HyperTransport links that connect node 0 and node 1 equals 
87.5% (3.5 ÷ 4).
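The utilization figure is the observed load on a link divided by that link's peak bandwidth. The following is a minimal Python sketch of that arithmetic. The function name `link_utilization` is an illustrative assumption, and the 4 GB/s default is the per-link capacity quoted in this appendix. The sketch prints the unrounded value.

```python
def link_utilization(load_gb_s: float, capacity_gb_s: float = 4.0) -> float:
    """Return link utilization as a fraction of peak capacity."""
    if capacity_gb_s <= 0:
        raise ValueError("capacity must be positive")
    return load_gb_s / capacity_gb_s


# Crossfire case from the text: each coherent link carries 3.5 GB/s.
crossfire = link_utilization(3.5)
print(f"{crossfire:.1%}")  # 87.5%
```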
A.2.3 What Role Do Buffers Play in the Throughput Observed?
Node 0 queues up packets in HyperTransport buffers and sends them on the outgoing link only if 
node 1 can accommodate them. Likewise node 1 queues up packets in HyperTransport buffers and 
sends them on the outgoing link only if node 0 can accept them. 
When the HyperTransport buffers are saturated, they can prevent the coherent HyperTransport links 
from reaching their full throughput capacity of 4 GB/s and, thus, 100% utilization. 
Also, saturating the HyperTransport buffers in the XBar has a domino effect on the other buffers in 
the system. Remember, the SRI is connected to the XBar, which is connected to the coherent 
HyperTransport links. 
When packets stall in the XBar buffer queue while waiting to be sent over the coherent 
HyperTransport links, a chain effect can cause packets to stall in the SRI buffer queue while 
waiting to be sent to the XBar.
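The chain effect can be illustrated with a toy model of two bounded queues in series. This is only a sketch under simplifying assumptions, not a description of the actual hardware: the function `simulate`, its parameters, the one-packet-per-tick rates, and the queue capacities are all hypothetical. The model shows that when the remote node stops accepting packets, the XBar queue fills first, then the SRI queue, and finally the source itself stalls.

```python
from collections import deque


def simulate(ticks, sri_cap, xbar_cap, link_ready):
    """Toy pipeline: source -> SRI queue -> XBar queue -> outgoing link.

    link_ready(t) returns True when the remote node can accept a packet
    at tick t. Returns one (sri_len, xbar_len, source_stalled) tuple per tick.
    """
    sri, xbar = deque(), deque()
    history = []
    for t in range(ticks):
        # The link drains one packet from the XBar queue only if the
        # remote node can accommodate it.
        if xbar and link_ready(t):
            xbar.popleft()
        # The XBar accepts one packet from the SRI only if it has room.
        if sri and len(xbar) < xbar_cap:
            xbar.append(sri.popleft())
        # The source injects one packet per tick unless the SRI is full.
        stalled = len(sri) >= sri_cap
        if not stalled:
            sri.append(t)
        history.append((len(sri), len(xbar), stalled))
    return history


# Remote node never accepts packets: both queues fill and the source stalls.
blocked = simulate(ticks=6, sri_cap=2, xbar_cap=2, link_ready=lambda t: False)
print(blocked[-1])  # (2, 2, True)
```

With `link_ready=lambda t: True` the same model never reports a stall, because each queue drains as fast as it fills.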
AMD makes several event profiling tools available under NDA to monitor the HyperTransport 
bandwidth and buffer queue usage patterns. 
The buffer lengths are BIOS configurable within some hardware-specific limits that are specified in 
the appropriate BIOS and Kernel Developer's Guide for the processor under consideration. Following 
AMD recommendations, the BIOS allocates these buffers on a link-by-link basis to optimize for the 
most common workloads.
A.2.4 What Resources Are Used When Write-Only Threads Do Not Fire at Each Other (No Crossfire) on an Idle System?
Now consider the case in which the writer threads do not fire at each other: that is, the first thread 
runs on node 0 and writes to memory on node 1, and the second thread runs on node 1 and writes to 
memory on node 3.
In this case, the bidirectional link from node 0 to node 1 is under substantial use (60% utilization in 
each direction). In addition, the bidirectional link from node 1 to node 3 is also under substantial use 
(54% utilization in each direction). 
As the load is now spread over two bidirectional links instead of one, the performance is better than in 
the crossfire case.
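The difference between the two placements can be sketched by counting how many writer flows share each bidirectional link. This is a deliberately simplified model: the helper `flows_per_link` is hypothetical, it assumes each flow crosses a single direct link between its source and destination nodes, and it ignores the probe and response traffic that the coherence protocol also places on the links.

```python
from collections import Counter


def flows_per_link(flows):
    """Count writer flows per bidirectional link.

    Each flow is a (source_node, destination_node) pair. A link is
    identified by its sorted node pair, so (0, 1) and (1, 0) share a link.
    """
    return Counter(tuple(sorted(flow)) for flow in flows)


# Crossfire: thread on node 0 writes to node 1, thread on node 1 writes to node 0.
crossfire = flows_per_link([(0, 1), (1, 0)])
# No crossfire: thread on node 0 writes to node 1, thread on node 1 writes to node 3.
no_crossfire = flows_per_link([(0, 1), (1, 3)])

print(dict(crossfire))     # {(0, 1): 2}
print(dict(no_crossfire))  # {(0, 1): 1, (1, 3): 1}
```

In the crossfire placement both flows contend for the same link, while in the no-crossfire placement each link carries a single flow.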