Chapter 3 Analysis and Recommendations
40555 Rev. 3.00 June 2006
Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems
Threads firing at each other (crossfire)
The first thread runs on node 0 and writes to memory on node 1 (1 hop). The second thread runs 
on node 1 and writes to memory on node 0 (1 hop).
In each case, both threads run on core 0 of their respective nodes, and the system is left idle except for the two threads. As shown in Figure 6 on page 26, the crossfire 1 hop-1 hop case is the worst performer.
Figure 6. Crossfire 1 Hop-1 Hop Case vs No Crossfire 1 Hop-1 Hop Case on an Idle System
When the write-only threads fire at each other (crossfire), the bidirectional HyperTransport link between node 0 and node 1 is saturated, carrying 3.5 GB/s in each direction. The theoretical maximum bandwidth of the HyperTransport link is 4 GB/s in each direction, so the link is 87.5% utilized (3.5 ÷ 4) in each direction.
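The utilization figure above is simply observed per-direction bandwidth divided by the theoretical per-direction peak. A minimal sketch of that arithmetic follows; the function name `link_utilization` is hypothetical and the inputs are the 3.5 GB/s and 4 GB/s values quoted in the text.

```python
# Link utilization = observed bandwidth / theoretical peak, per direction.
# The numbers used below are the crossfire-case values quoted in the text.

def link_utilization(observed_gbps: float, peak_gbps: float) -> float:
    """Return the fraction of a link's per-direction peak that is in use."""
    if peak_gbps <= 0:
        raise ValueError("peak bandwidth must be positive")
    return observed_gbps / peak_gbps


if __name__ == "__main__":
    u = link_utilization(3.5, 4.0)
    print(f"{u:.1%}")  # prints 87.5%
```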
On the other hand, when the write-only threads do not fire at each other (no crossfire), the bidirectional link between node 0 and node 1 is 60% utilized in each direction, and the bidirectional link between node 1 and node 3 is 54% utilized in each direction. Because the load is now spread over two bidirectional HyperTransport links instead of one, performance is better.
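The difference between the two cases can be sketched by counting how many write streams cross each physical link. This is an illustrative model only, assuming a 4-node square topology (links 0-1, 0-2, 1-3, 2-3) and shortest-path routing; the text itself only confirms that nodes 0-1 and 1-3 are one hop apart. It counts streams, not bandwidth, and every name in it (`LINKS`, `neighbors`, `route`, `link_load`) is hypothetical.

```python
from collections import Counter, deque

# ASSUMPTION: 4-node square topology; only 0-1 and 1-3 are confirmed by the text.
LINKS = [(0, 1), (0, 2), (1, 3), (2, 3)]


def neighbors(links):
    """Build a sorted adjacency list so routing is deterministic."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    for node in adj:
        adj[node].sort()
    return adj


def route(adj, src, dst):
    """Shortest path from src to dst as a list of (from, to) hops, via BFS."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    hops = []
    node = dst
    while prev[node] is not None:
        hops.append((prev[node], node))
        node = prev[node]
    return list(reversed(hops))


def link_load(threads, links=LINKS):
    """Count write streams per physical (bidirectional) link.

    Each thread is (cpu_node, memory_node). A link is keyed by its sorted
    endpoints, because a write stream loads the link in both directions.
    """
    adj = neighbors(links)
    load = Counter()
    for cpu_node, mem_node in threads:
        for hop in route(adj, cpu_node, mem_node):
            load[tuple(sorted(hop))] += 1
    return load


if __name__ == "__main__":
    print(link_load([(0, 1), (1, 0)]))  # crossfire: both streams share link 0-1
    print(link_load([(0, 1), (1, 3)]))  # no crossfire: load split over 0-1 and 1-3
```

Keying on the unordered pair of endpoints reflects the observation that both directions of the shared link were loaded in the crossfire case; a finer model would track each direction separately.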
Saturation of the coherent HyperTransport link between node 0 and node 1 is responsible for the poor performance of the crossfire case compared to the no crossfire case. For a detailed analysis, refer to Section A.2 on page 40.
In this synthetic test, read-only threads do not cause poor performance, because their throughput is not high enough to exhaust the HyperTransport link resources. When both threads are read-only, the crossfire case performs the same as the no crossfire case.
It is also useful to study whether this observation holds on a system that is not idle. The following 
analysis explores the behavior of the two foreground threads under a variable background load.
 
[Chart: Total Time for both threads (write-write). Configurations: 0.0.w.0 / 1.0.w.1 (0 hops each); 0.0.w.1 / 1.0.w.3 (1 hop each, no crossfire); 0.0.w.1 / 1.0.w.0 (1 hop each, crossfire). Percentage labels shown: 113%, 130%, 149%.]