Chapter 4 
Profiling with Scali MPI Connect
The Scali MPI communication library has a number of built-in timing and trace facilities. These 
features are built into the run-time version of the library, so no extra recompiling or linking of 
libraries is needed. All MPI calls can be timed and/or traced, and a number of environment 
variables control this functionality. In addition, an implied barrier call can be automatically 
inserted before every collective MPI call. Together, these facilities can give detailed insight into 
application performance.
The trace and timing facilities are enabled by environment variables that can either be set and 
exported in the shell, or set on the command line just before running mpimon.
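For example, tracing could be enabled in either of the two ways sketched below. SCAMPI_TRACE is 
used here as an assumed variable name, the trace specification is left as a placeholder, and the 
mpimon node list is only schematic:
   # Set and export the variable in the shell, then run mpimon as usual:
   export SCAMPI_TRACE="<trace specification>"
   mpimon ./myprogram -- node1 2 node2 2
   # Or set the variable for a single run only, on the mpimon command line:
   SCAMPI_TRACE="<trace specification>" mpimon ./myprogram -- node1 2 node2 2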
Different tools are available to help detect and analyze the cause of performance bottlenecks:
• Built-in proprietary trace and profiling tools provided with SMC
• Commercial tools that collect information during the run and post-process and present the 
results afterwards, such as Vampir from Pallas GmbH. See http://www.pallas.de for 
more information.
The main difference between these tools is that the SMC built-in tools can be used with an 
existing binary, while the other tools require relinking the application with extra libraries.
The powerful run-time facilities Scali MPI Connect trace and Scali MPI Connect timing can be 
used to monitor and keep track of MPI calls and their characteristics. The various trace and 
timing options can yield many different views of an application's usage of MPI. Common to most 
of these logs is the massive amount of data they produce, which can be overwhelming, 
especially when running with many processes and using both trace and timing concurrently.
The second part of such a log shows the timing of the different MPI calls. The timing is a sum 
over all MPI calls in all MPI processes, so with many processes the numbers can look 
unrealistically high; they do, however, reflect the total time spent in all MPI calls. For example, 
if each of eight processes spends two seconds in MPI calls, the reported total is sixteen seconds 
even though the wall-clock time is only about two seconds. When benchmarking focuses primarily 
on timing rather than tracing MPI calls, the timing functionality is the more appropriate choice: 
the trace functionality introduces some overhead and increases the total wall-clock run time of 
the application, whereas the timing functionality is relatively lightweight and can be used to 
time the application for performance benchmarking.
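As a minimal sketch of such a benchmarking run, assuming an environment variable named 
SCAMPI_TIMING controls the timing facility (the variable name, value syntax, and node list are 
placeholders, as above):
   # Enable only the lightweight timing facility for a benchmark run:
   SCAMPI_TIMING="<timing specification>" mpimon ./myprogram -- node1 2 node2 2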
4.1 Example
To illustrate the potential of tracing and timing with Scali MPI Connect, consider the code 
fragment below (full source reproduced in A-2).
#include <mpi.h>

int main( int argc, char** argv )
{
   int rank, size, my_count;           /* declarations shown here for clarity; see A-2 */
   unsigned char *pixels, *recvbuf;    /* image buffers; allocation omitted in this fragment */

   MPI_Init( &argc, &argv );
   MPI_Comm_rank( MPI_COMM_WORLD, &rank );
   MPI_Comm_size( MPI_COMM_WORLD, &size );
   /* read image from file */
   /* broadcast to all nodes */
   MPI_Bcast( &my_count, 1, MPI_INT, 0, MPI_COMM_WORLD );
   /* scatter the image */
   MPI_Scatter( pixels, my_count, MPI_UNSIGNED_CHAR, recvbuf,
                my_count, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD );
   /* sum the squares of the pixels in the sub-image */