Q-Logic IB6054601-00 D User Manual

C – Troubleshooting
InfiniPath MPI Troubleshooting
The following message indicates that a node program may not be processing 
incoming packets, perhaps due to a very high system load:
eager array full after overflow, flushing (head h, tail t)
The following indicates an invalid InfiniPath link protocol version:
InfiniPath version ERROR: Expected version v, found w (memkey h)
The following error messages should rarely occur and indicate internal software 
problems:
ExpSend opcode h tid=j, rhf_error k: str
Asked to set timeout w/delay l, gives time in past (t2 < t1)
Error in sending packet: str
Fatal error in sending packet, exiting: str
Fatal error in sending packet: str
Here the str can give additional clues to the reason for the failure.
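As a rough illustration of how the messages above map to likely causes, the following Python sketch matches a log line against the fixed leading text of each message. The table name KNOWN_MESSAGES and the function triage are hypothetical helpers, not part of InfiniPath; the cause strings simply paraphrase the descriptions given above.

```python
# Hypothetical triage table: fixed message text -> likely cause, paraphrasing
# the descriptions in this section. Not part of the InfiniPath software.
INTERNAL = "internal software problem"
KNOWN_MESSAGES = [
    ("eager array full after overflow",
     "node program may not be processing incoming packets"),
    ("InfiniPath version ERROR",
     "invalid InfiniPath link protocol version"),
    ("ExpSend opcode", INTERNAL),
    ("Asked to set timeout w/delay", INTERNAL),
    ("Error in sending packet", INTERNAL),
    ("Fatal error in sending packet", INTERNAL),
]


def triage(line):
    """Return the likely cause for a log line, or None if it is not recognized."""
    for text, cause in KNOWN_MESSAGES:
        # Substring match, so a hostname or timestamp prefix does not matter.
        if text in line:
            return cause
    return None
```

For the internal software problems, the trailing str text is still worth keeping alongside the result, since it may carry the extra clue about the failure.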
The following probably indicates a node failure or malfunctioning link in the fabric:
Couldn’t connect to NODENAME, rank RANK#. Time elapsed HH:MM:SS. Still trying
NODENAME is the node (host) name, RANK# is the MPI rank, and HH:MM:SS are 
the hours, minutes, and seconds since we started trying to connect.
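When many of these warnings accumulate in a job log, it can help to extract the fields programmatically. The sketch below is one possible way to do that in Python. It assumes that RANK# is printed as a decimal integer and that the message arrives on a single line; the function name parse_connect_warning and the sample node name node07 are made up for illustration.

```python
import re
from datetime import timedelta

# Assumptions: RANK# appears as a decimal integer, and the apostrophe in
# "Couldn't" may be either a straight or a typographic one.
CONNECT_RE = re.compile(
    r"Couldn['\u2019]t connect to (?P<node>[^,\s]+), rank (?P<rank>\d+)\. "
    r"Time elapsed (?P<h>\d+):(?P<m>\d+):(?P<s>\d+)\."
)


def parse_connect_warning(line):
    """Return (node name, MPI rank, elapsed time), or None if no match."""
    m = CONNECT_RE.search(line)
    if m is None:
        return None
    elapsed = timedelta(
        hours=int(m["h"]), minutes=int(m["m"]), seconds=int(m["s"])
    )
    return m["node"], int(m["rank"]), elapsed


# Hypothetical sample line; node07 and rank 3 are invented values.
sample = "Couldn't connect to node07, rank 3. Time elapsed 00:01:30. Still trying"
result = parse_connect_warning(sample)
```

Grouping the results by node name is a quick way to see whether the warnings point at a single failing node or link.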
If you get messages similar to the following, it may mean that you are trying to 
receive into an invalid (unallocated) memory address, perhaps due to a logic error in 
the program, usually related to malloc/free:
ipath_update_tid_err: Failed TID update for rendevous, allocation problem
kernel: infinipath: get_user_pages (0x41 pages starting at 0x2aaaaeb50000
kernel: infinipath: Failed to lock addr 0002aaaaeb50000, 65 pages: errno 12
TID is short for Token ID, and is part of the InfiniPath hardware. This error indicates 
a failure of the program, not the hardware or driver.
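The numbers in the kernel messages can be decoded to confirm the diagnosis. The Python sketch below parses the "Failed to lock addr" line, reads the address as hexadecimal, and looks up the symbolic name of the errno value. The function name decode_lock_failure is hypothetical. On Linux, errno 12 corresponds to ENOMEM, and the 65 pages reported here equal the 0x41 pages in the get_user_pages line.

```python
import errno
import re

LOCK_RE = re.compile(
    r"Failed to lock addr (?P<addr>[0-9a-fA-F]+), "
    r"(?P<pages>\d+) pages: errno (?P<err>\d+)"
)


def decode_lock_failure(line):
    """Decode address, page count, and errno name from the kernel message."""
    m = LOCK_RE.search(line)
    if m is None:
        return None
    err = int(m["err"])
    return {
        "addr": int(m["addr"], 16),   # address is printed in hexadecimal
        "pages": int(m["pages"]),     # page count is printed in decimal
        "errno": err,
        # Symbolic name from the platform's errno table (ENOMEM for 12 on Linux).
        "errno_name": errno.errorcode.get(err, "unknown"),
    }


sample = "kernel: infinipath: Failed to lock addr 0002aaaaeb50000, 65 pages: errno 12"
info = decode_lock_failure(sample)
```

The decoded address can then be compared with the buffers the program passes to its receive calls, for example to check whether a buffer was freed before the receive completed.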
C.8.12.2 MPI Messages
Some MPI error messages are issued from the parts of the code inherited from the 
MPICH implementation. See the MPICH documentation for descriptions of these. 
This section presents the error messages specific to the InfiniPath MPI 
implementation.