Q-Logic IB6054601-00 D User Manual

C – Troubleshooting
InfiniPath MPI Troubleshooting
If this file is not present, or the node has not been rebooted after the infinipath RPM was installed, a failure message similar to this will be generated:
mpirun -m ~/tmp/sm -np 2 -mpi_latency 1000 1000000
 node-00:1.ipath_update_tid_err: failed: Cannot allocate memory 
 mpi_latency:
 /fs2/scratch/infinipath-build-2.0/mpi-2.0/mpich/psm/src 
mq_ips.c:691:
mq_ipath_sendcts: Assertion ‘rc == 0’ failed.
MPIRUN: Node program unexpectedly quit. Exiting.
You can check the ulimit -l value on all the nodes by running ipath_checkout. A warning is given if ulimit -l is less than 4096.
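The same comparison can be sketched for the local node only. This is a minimal illustration, not how ipath_checkout is implemented: the function name memlock_ok and the variable MIN_MEMLOCK are hypothetical, and the only value taken from the text is the 4096 threshold.

```shell
#!/bin/sh
# Threshold from the text: a warning is given below 4096.
MIN_MEMLOCK=4096

# memlock_ok VALUE   (hypothetical helper, not part of InfiniPath)
# Returns 0 if VALUE, as printed by `ulimit -l`, is "unlimited" or a
# number greater than or equal to MIN_MEMLOCK; returns 1 otherwise.
memlock_ok() {
    case "$1" in
        unlimited) return 0 ;;
        ''|*[!0-9]*) return 1 ;;   # empty or not a plain number
    esac
    [ "$1" -ge "$MIN_MEMLOCK" ]
}

# Check the limit of the current shell on this node.
current=$(ulimit -l)
if memlock_ok "$current"; then
    echo "ulimit -l is $current: OK"
else
    echo "WARNING: ulimit -l is $current, less than $MIN_MEMLOCK"
fi
```

This reports only on the shell in which it runs. To cover all the nodes, use ipath_checkout as described above.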
There are two possible solutions to this. If InfiniPath is not installed on the node 
where you start the job, set this value in the following way (as root). 
ulimit -l 65536 
Or, if you have installed InfiniPath on the node, reboot it to ensure that /etc/initscript is run.
C.8.12 Error Messages Generated by mpirun
In the sections below, types of mpirun error messages are described. They fall into 
these categories:
Messages from the InfiniPath Library
MPI messages
Messages relating to the InfiniPath driver and InfiniBand links
Messages generated by mpirun follow a general format:
program_name: message
function_name: message
Messages may also have different prefixes, such as ipath_ or psm_, which indicate the part of the software in which the error occurred.
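As a rough illustration of this format, the sketch below splits a message at the first ": " and looks for the two prefixes named above. The function names message_name and classify_mpirun_message are hypothetical, and the labels they print are only a convenience for sorting log lines, not official component names.

```shell
#!/bin/sh
# message_name LINE   (hypothetical helper)
# Prints the part of LINE before the first ": ", that is, the
# program_name or function_name field of the general format.
message_name() {
    printf '%s\n' "${1%%: *}"
}

# classify_mpirun_message LINE   (hypothetical helper)
# Prints "ipath", "psm", or "other", based only on whether LINE
# contains the ipath_ or psm_ prefix mentioned in the text.
classify_mpirun_message() {
    case "$1" in
        *ipath_*) echo "ipath" ;;
        *psm_*)   echo "psm" ;;
        *)        echo "other" ;;
    esac
}

# Example using a message quoted earlier in this section.
line="node-00:1.ipath_update_tid_err: failed: Cannot allocate memory"
message_name "$line"
classify_mpirun_message "$line"
```

Matching on a substring is deliberately loose: as the example shows, the prefix can appear after a host and rank label rather than at the start of the line.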
C.8.12.1 Messages from the InfiniPath Library
These messages may appear in the mpirun output.
The first set are error messages, which indicate internal problems and should be 
reported to Support.
Trying to cancel invalid timer (EOC)
sender rank rank is out of range (notification)
sender rank rank is out of range (ack)
Reached TIMER_TYPE_EOC while processing timers