5.2 How to optimize MPI performance
There is no universal recipe for getting good performance out of a message passing program. 
Here are some dos and don'ts for SMC. 
5.2.1 Performance analysis
Learn about the performance behaviour of your particular MPI applications on a Scali System 
by using a performance analysis tool. 
5.2.2 Using processor-power to poll
To maximize performance, ScaMPI polls when waiting for communication to terminate, 
instead of relying on interrupts. Polling means that the CPU performs a busy-wait (looping) while 
waiting for data to arrive over the interconnect. All the exotic interconnects require polling.
Some applications create threads and may end up with more active threads than there are 
CPUs. This has a huge impact on MPI performance. In a threaded application with irregular 
communication patterns, other threads could probably make use of the 
processor. To increase performance in this case, Scali provides a “backoff” feature in 
ScaMPI. The backoff feature still polls when waiting for data, but starts to enter sleep 
states at intervals when no data arrives. The algorithm is as follows: ScaMPI polls for a short 
time (the idle time), then sleeps for a period, and polls again.
The sleep period starts at a parameter-controlled minimum and is doubled each time until it 
reaches the maximum value. The following environment variables set the parameters:
SCAMPI_BACKOFF_ENABLE  (turns the mechanism on)
SCAMPI_BACKOFF_IDLE=n  (defines idle-period as n ms [Default = 20 ms])
SCAMPI_BACKOFF_MIN=n  (defines minimum backoff-time in ms [Default = 10 ms])
SCAMPI_BACKOFF_MAX=n  (defines maximum backoff-time in ms [Default = 100 ms])
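As an illustration only (this is not the SMC source), the following C sketch shows the polling-with-backoff scheme described above; the constants mirror the default values of SCAMPI_BACKOFF_IDLE, SCAMPI_BACKOFF_MIN and SCAMPI_BACKOFF_MAX, and data_ready() is a hypothetical placeholder for the interconnect test:

    /* Illustrative sketch of the backoff algorithm (hypothetical, not SMC code). */
    #include <sys/time.h>
    #include <unistd.h>

    extern int data_ready(void);       /* hypothetical: non-zero when data has arrived */

    static double now_ms(void)
    {
        struct timeval tv;
        gettimeofday(&tv, 0);
        return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
    }

    void wait_with_backoff(void)
    {
        const double idle_ms = 20.0;   /* SCAMPI_BACKOFF_IDLE default */
        const double min_ms  = 10.0;   /* SCAMPI_BACKOFF_MIN default  */
        const double max_ms  = 100.0;  /* SCAMPI_BACKOFF_MAX default  */
        double sleep_ms = min_ms;

        for (;;) {
            /* Poll (busy-wait) for the idle period. */
            double start = now_ms();
            while (now_ms() - start < idle_ms)
                if (data_ready())
                    return;            /* data arrived while polling */

            /* No data: sleep, then double the sleep period up to the maximum. */
            usleep((useconds_t)(sleep_ms * 1000.0));
            sleep_ms *= 2.0;
            if (sleep_ms > max_ms)
                sleep_ms = max_ms;
        }
    }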
5.2.3 Reorder network traffic to avoid conflicts
Many-to-one communication may introduce bottlenecks. Since zero-byte messages are low-cost, 
performance in a many-to-one communication may improve if the receiver sends ready-to-
receive tokens (in the form of a zero-byte message) to the MPI-process that wants to send data.
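A minimal sketch of this pattern (assuming hypothetical tag values and that all senders target rank 0) could look like this:

    /* Illustrative sketch: the receiver (rank 0) paces its senders with
     * zero-byte ready-to-receive tokens. Tag values are hypothetical. */
    #include <mpi.h>

    #define TOKEN_TAG 100
    #define DATA_TAG  101

    void many_to_one(int rank, int nprocs, double *buf, int count)
    {
        char token = 0;
        int src;
        MPI_Status status;

        if (rank == 0) {
            /* Invite one sender at a time, then receive its data. */
            for (src = 1; src < nprocs; src++) {
                MPI_Send(&token, 0, MPI_BYTE, src, TOKEN_TAG, MPI_COMM_WORLD);
                MPI_Recv(buf, count, MPI_DOUBLE, src, DATA_TAG,
                         MPI_COMM_WORLD, &status);
            }
        } else {
            /* Wait for the ready-to-receive token before sending data. */
            MPI_Recv(&token, 0, MPI_BYTE, 0, TOKEN_TAG, MPI_COMM_WORLD, &status);
            MPI_Send(buf, count, MPI_DOUBLE, 0, DATA_TAG, MPI_COMM_WORLD);
        }
    }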
5.3 Benchmarking
Benchmarking is the part of performance evaluation that deals with measuring and 
analysing computer performance using various kinds of test programs. Benchmark figures 
should always be handled with special care when compared with similar results.
5.3.1 How to get expected performance
• Caching the application program on the nodes.
For benchmarks with a short execution time, total execution time may be reduced when 
the process is run repetitively. For large configurations, copying the application to the 
local file system on each node will reduce startup latency and improve disk I/O 
bandwidth.
• The first iteration is (very) slow.
This may happen because the MPI-processes in an application are not started 
simultaneously. Inserting an MPI_Barrier() before the timing loop (see the sketch after this list) will eliminate this effect.
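As a minimal sketch (the work() routine and the iteration count are hypothetical placeholders), the barrier and timing could look like this:

    /* Illustrative sketch: synchronize before timing to avoid a slow first iteration. */
    #include <mpi.h>
    #include <stdio.h>

    extern void work(void);                 /* placeholder for the benchmarked kernel */

    void timed_loop(void)
    {
        int i;
        double start, elapsed;

        MPI_Barrier(MPI_COMM_WORLD);        /* wait until all MPI-processes are ready */
        start = MPI_Wtime();
        for (i = 0; i < 100; i++)
            work();
        elapsed = MPI_Wtime() - start;
        printf("elapsed: %f s\n", elapsed);
    }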