5.4 Collective operations
       4 pair4
       5 pipe0
       6 pipe1
       7 safe
def    8 smp
By looping through these alternatives, the performance of IS (the NAS Parallel Benchmarks Integer Sort kernel) varies:
algorithm 0: Mop/s total = 95.60
algorithm 1: Mop/s total = 78.37
algorithm 2: Mop/s total = 34.44
algorithm 3: Mop/s total = 61.77
algorithm 4: Mop/s total = 41.00
algorithm 5: Mop/s total = 49.14
algorithm 6: Mop/s total = 85.17
algorithm 7: Mop/s total = 60.22
algorithm 8: Mop/s total = 48.61
For this particular combination of application (IS) and Alltoallv algorithm the performance varies significantly: algorithm 0 comes close to doubling the performance over the default (algorithm 8, smp).
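As a sketch, numbers like the ones above can be produced with a loop of the following form. The variable name follows the SCAMPI_<MPI-function>_ALGORITHM pattern described in section 5.4.1, and the IS binary name ./is and the output file names are placeholders, not actual names from this guide:
user% # one run per Alltoallv algorithm, output saved per algorithm
user% for a in 0 1 2 3 4 5 6 7 8; do
> SCAMPI_ALLTOALLV_ALGORITHM=$a \
> mpimon ./is -- <nodes> > is.out.$a
> done
user% grep "Mop/s total" is.out.*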
5.4.1 Finding the best algorithm
Consider the image processing example from Chapter 4, which contains four collective 
operations. All of these can be tuned with respect to algorithm according to the following 
pattern:
user% for a in <range>; do
> SCAMPI_<MPI-function>_ALGORITHM=$a \
> mpimon <application> -- <nodes> > <application>.out.$a
> done
For example, trying out the alternative algorithms for MPI_Reduce with two processes can be 
done as follows (assuming Bourne Again Shell [bash]):
user% for a in 0 1 2 3 4 5 6 7 8; do
> SCAMPI_REDUCE_ALGORITHM=$a \
> mpimon ./kollektive-8 ./uf256-8.pgm -- r1 r2
> done
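Note that the backslash after the assignment continues the command onto the next line, making SCAMPI_REDUCE_ALGORITHM=$a a per-command environment setting for that mpimon invocation; without the continuation the assignment would only create an unexported shell variable that mpimon never sees.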
Given that the application reports the timing of the relevant parts of the code, a best choice 
can then be made. Note, however, that when multiple collective operations are at work in the 
same program, the algorithms may interfere with each other. Also, the performance of the 
implementations is interconnect dependent.
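As a final selection step, the per-algorithm output files produced by the pattern above can be compared side by side. A minimal sketch, assuming each run was redirected to <application>.out.$a and that the application prints its timings on lines containing the word "time":
user% # print the reported timings for each algorithm in turn
user% for a in 0 1 2 3 4 5 6 7 8; do
> echo "algorithm $a:"
> grep -i time <application>.out.$a
> done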