5.3.2 Memory consumption increase after warm-up
Remember that group operations (MPI_Comm_{create, dup, split}) may involve creating 
new communication buffers. If this is a problem, decreasing chunk_size may help.
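For reference, a minimal C sketch (standard MPI, not specific to Scali MPI Connect) of the kind of communicator management involved; the communicator name subcomm and the split criterion are illustrative only:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Creating a new communicator (here by splitting MPI_COMM_WORLD
       into odd and even ranks) may cause the MPI library to allocate
       new communication buffers for it. */
    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &subcomm);

    /* ... collective work on subcomm ... */

    /* Freeing communicators that are no longer needed allows the
       library to release the resources associated with them. */
    MPI_Comm_free(&subcomm);

    MPI_Finalize();
    return 0;
}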
5.4 Collective operations
A collective communication is a communication operation in which a group of processes works 
together to distribute or gather together a set of one or more values. Scali MPI Connect uses 
a number of different approaches to implement collective operations. Through environment 
variables the user can control which algorithm the application uses.
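As an illustration of a collective operation, the following minimal C sketch (standard MPI, not specific to Scali MPI Connect) lets every process contribute one value and receive the combined result; which internal algorithm carries out the operation is what the environment variables mentioned above select:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Every process contributes one value ... */
    int local = rank + 1;
    int sum = 0;

    /* ... and every process receives the combined result. */
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d of %d: sum = %d\n", rank, size, sum);

    MPI_Finalize();
    return 0;
}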
Consider the Integer Sort (IS) benchmark in NPB (the NAS Parallel Benchmarks). When it is run 
with ten processes on five nodes over Gigabit Ethernet (mpimon -net smp,tcp bin/is.A.16.scampi 
-- r1 2 r2 2 r3 2 r4 2 r5 2), the resulting performance is:
Mop/s total     = 34.05
Mop/s/process   = 2.13
Extracting the MPI profile of the run can be done as follows:
user% export SCAMPI_TRACE="-f arg;timing" 
user% mpimon bin/is.A.16.scampi  -- $ALL2  > trace.out
Running the output through scanalyze yields the following:
MPI Call          <128    128-1k   1-8k     8-32k    32-256k  256k-1M  >1M
MPI_Send          0.00    0.00     0.00     0.00     0.00     0.00     0.00
MPI_Irecv         0.00    0.00     0.00     0.00     0.00     0.00     0.00
MPI_Wait          0.69    0.00     0.00     0.00     0.00     0.00     0.00
MPI_Alltoall      0.14    0.00     0.00     0.00     0.00     0.00     0.00
MPI_Alltoallv     11.20   0.00     0.00     0.00     0.00     0.00     0.00
MPI_Reduce        1.04    0.00     0.00     0.00     0.00     0.00     0.00
MPI_Allreduce     0.00    0.00     15.63    0.00     0.00     0.00     0.00
MPI_Comm_size     0.00    0.00     0.00     0.00     0.00     0.00     0.00
MPI_Comm_rank     0.00    0.00     0.00     0.00     0.00     0.00     0.00
MPI_Keyval_free   0.00    0.00     0.00     0.00     0.00     0.00     0.00
MPI_Alltoallv accounts for a high fraction of the total execution time. The communication time 
is the sum over all algorithms used, so the total timing may depend on more than one type of 
communication. If one operation, or a few, dominates the time consumption, those operations 
are promising candidates for tuning and optimization.
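To make the dominating operation concrete, the following minimal C sketch (standard MPI; the buffer sizes and displacements are illustrative only and not taken from the IS benchmark) shows an MPI_Alltoallv() exchange in which every process sends one integer to every other process:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* MPI_Alltoallv allows the amount of data exchanged with each
       peer to differ; here every peer simply gets one integer. */
    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    int *counts  = malloc(size * sizeof(int));
    int *displs  = malloc(size * sizeof(int));

    for (int i = 0; i < size; i++) {
        sendbuf[i] = rank;   /* payload sent to peer i       */
        counts[i]  = 1;      /* one element per peer         */
        displs[i]  = i;      /* contiguous buffer placement  */
    }

    MPI_Alltoallv(sendbuf, counts, displs, MPI_INT,
                  recvbuf, counts, displs, MPI_INT, MPI_COMM_WORLD);

    free(sendbuf); free(recvbuf); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}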
Note: The run-time selectable algorithms and their values may vary between Scali MPI Connect 
release versions. To find out which algorithms are selectable at run time, and their valid 
values, set the environment variable SCAMPI_ALGORITHM and run an example application:
# SCAMPI_ALGORITHM=1 mpimon /opt/scali/examples/bin/hello -- localhost
This will produce a listing of the different implementations of particular collective MPI calls.
For each collective operation a listing is produced that gives a number and a short description 
of each available algorithm, e.g., for MPI_Alltoallv():
SCAMPI_ALLTOALLV_ALGORITHM alternatives
       0 pair0
       1 pair1
       2 pair2
       3 pair3
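One of the listed alternatives can then be selected when launching the application, following the same command pattern as the SCAMPI_ALGORITHM example above (algorithm 2, pair2, is chosen here purely as an illustration):

user% SCAMPI_ALLTOALLV_ALGORITHM=2 mpimon bin/is.A.16.scampi -- $ALL2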