This analogy clearly communicates the performance effects of queuing time versus latency. In a 
computer server, with many concurrent outstanding memory requests, we would gladly incur some 
additional latency (walking) to spread memory transactions (check-out processes) across multiple 
memory controllers (check-out lanes) because this greatly improves performance by reducing the 
queuing time.
However, if the number of customers at the remote queue increases to 20 or more, the customer would much rather wait in the local queue directly in front of him.
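As a back-of-the-envelope illustration (not part of the original analysis, and with assumed numbers), a single-queue M/M/1 model makes the effect of queuing time explicit. The mean time a request spends at one memory controller is

\[
T = \frac{1}{\mu - \lambda},
\]

where $\mu$ is the controller's service rate and $\lambda$ is the request arrival rate. With $\mu = 10$ and $\lambda = 9$ requests per unit time, $T = 1$. Splitting the same load across two controllers gives $\lambda = 4.5$ at each, so $T = 1/(10 - 4.5) \approx 0.18$; even after adding a fixed extra hop latency of, say, $0.3$, the total of roughly $0.48$ is still well below $1$. The gain comes almost entirely from reduced queuing, exactly as in the check-out analogy.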
The following example was extracted by mining the results of the synthetic test case. 
There are four cases illustrated in Figure 10. In each case two threads run on node 0 (on core 0 and core 1, respectively), and the system is otherwise idle:
• Both threads access memory on node 0.
• The first thread accesses memory on node 0. The second thread accesses memory on node 1, which is one hop away.
• The first thread accesses memory on node 0. The second thread accesses memory on node 2, which is one hop away.
• The first thread accesses memory on node 0. The second thread accesses memory on node 3, which is two hops away.
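By way of illustration, the following sketch shows one way such a two-thread placement case could be reproduced on Linux with pthreads and libnuma. It is not the harness used to produce Figure 10; the core numbers, node numbers, buffer size, and pass count are all assumptions made for the example.

/*
 * Illustrative sketch (not the original test harness): two reader threads
 * pinned to core 0 and core 1 on node 0.  Thread 0 reads a buffer placed on
 * node 0; thread 1 reads a buffer placed on MEM_NODE_T1 (0, 1, 2, or 3).
 *
 * Build: gcc -O2 numa_read.c -lnuma -lpthread
 */
#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_BYTES   (256UL * 1024 * 1024)   /* 256 MiB per thread (assumed) */
#define MEM_NODE_T1 1                       /* memory node for thread 1     */

struct arg { int core; int mem_node; };

static void *reader(void *p)
{
    struct arg *a = p;

    /* Pin this thread to its core (both cores are on node 0). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(a->core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Allocate the buffer on the requested memory node. */
    volatile char *buf = numa_alloc_onnode(BUF_BYTES, a->mem_node);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return NULL;
    }

    /* Read one byte per cache line; the first pass also commits the pages
     * on the chosen node, later passes are plain remote/local reads. */
    unsigned long sum = 0;
    for (int pass = 0; pass < 8; pass++)
        for (size_t i = 0; i < BUF_BYTES; i += 64)
            sum += buf[i];
    (void)sum;

    numa_free((void *)buf, BUF_BYTES);
    return NULL;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    struct arg a0 = { .core = 0, .mem_node = 0 };           /* 0-hop case   */
    struct arg a1 = { .core = 1, .mem_node = MEM_NODE_T1 }; /* varied case  */

    pthread_t t0, t1;
    pthread_create(&t0, NULL, reader, &a0);
    pthread_create(&t1, NULL, reader, &a1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}

Timing the two joined threads (for example with clock_gettime() around the create/join pair) and varying MEM_NODE_T1 across 0, 1, 2, and 3 would correspond to the four cases listed above.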
As shown in Figure 10, synthetic tests indicate that when both threads are read-only, the 0 hop-0 hop 
case is faster than the 0 hop-1 hop and 0 hop-2 hop cases.
Figure 10.  Both Read-Only Threads Running on Node 0 (Different Cores) on an Idle System

[Bar chart: Total Time for both threads (read-read), relative. 0 hop / 0 hop: 102%; 0 hop / 1 hop (node 1): 108%; 0 hop / 1 hop (node 2): 107%; 0 hop / 2 hop (node 3): 118%.]