IBM SG24-5131-00 User Manual

hang. After a certain amount of time, by default 360 seconds, the cluster
manager will issue a config_too_long message into the /tmp/hacmp.out file.
The message issued looks like this:
The cluster has been in reconfiguration too long;Something may be wrong.
In most cases, this is because an event script has failed. You can find out
more by analyzing the /tmp/hacmp.out file. The error messages in the
/var/adm/cluster.log file may also be helpful. You can then fix the problem
identified in the log file and execute the clruncmd command, either on the
command line or from the SMIT Cluster Recovery Aids screen. The clruncmd
command signals the Cluster Manager to resume cluster processing.
Note, however, that sometimes scripts simply take too long, so the message
is not always an error; sometimes it is only a warning. The message does not
necessarily mean that the script failed or never finished. A script running
for more than 360 seconds can still be working on something and eventually
get the job done. Therefore, it is essential to look at the /tmp/hacmp.out
file to find out what is actually happening.
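As a starting point for that analysis, the following is a minimal sketch of a helper that locates the config_too_long message in a log. The two search strings are taken from the text above; how they appear inside an actual hacmp.out line is an assumption, so the sketch uses plain substring matching only. The function names find_config_too_long and scan_file are hypothetical and not part of HACMP.

```python
from typing import Iterable, List, Tuple

# Both markers come from the text above. Their exact layout inside a real
# hacmp.out line is an assumption, so only substring matching is used.
MARKERS = (
    "config_too_long",
    "The cluster has been in reconfiguration too long",
)


def find_config_too_long(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line containing a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path: str = "/tmp/hacmp.out") -> List[Tuple[int, str]]:
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, "r", errors="replace") as handle:
        return find_config_too_long(handle)
```

The reported line numbers only show where the message was logged. Whether the event script actually failed or is merely slow still has to be judged from the surrounding log entries.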
7.3  Deadman Switch
The term “deadman switch” describes the AIX kernel extension that causes a 
system panic and dump under certain cluster conditions if it is not reset. The 
deadman switch halts a node when it enters a hung state that extends 
beyond a certain time limit. This enables another node in the cluster to 
acquire the hung node’s resources in an orderly fashion, avoiding possible 
contention problems. 
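The reset-or-halt behavior described above can be illustrated with a toy model. This is only a sketch of the general watchdog idea, not the AIX kernel extension: the class name DeadmanSwitch, its methods, and the 10-second timeout are all hypothetical, and a real deadman switch would halt the node rather than return a flag.

```python
class DeadmanSwitch:
    """Toy model of a timer that must be reset before it runs out."""

    def __init__(self, timeout_seconds: float, now: float = 0.0) -> None:
        if timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")
        self.timeout_seconds = timeout_seconds
        self.last_reset = now

    def reset(self, now: float) -> None:
        """Called by the monitored process to show it is still running."""
        self.last_reset = now

    def expired(self, now: float) -> bool:
        """True once more than timeout_seconds have passed since the last reset."""
        return (now - self.last_reset) > self.timeout_seconds


# Hypothetical timeline, times in seconds (the 10 s limit is illustrative).
switch = DeadmanSwitch(timeout_seconds=10.0)
switch.reset(now=4.0)                   # the monitored process checks in
still_ok = switch.expired(now=12.0)     # 8 s since the last reset
timed_out = switch.expired(now=15.0)    # 11 s since the last reset
```

In this model, a process that is starved of CPU time and cannot call reset in time is indistinguishable from a hung one, which is why the tuning steps later in this section focus on keeping the cluster manager responsive.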
If the deadman switch is halting a node, and it isn’t obvious why the cluster
manager was kept from resetting the timer (for example, because some
application ran at a higher priority than the clstrmgr process),
customizations related to performance problems should be performed in the
following order:
1. Tune the system using I/O pacing.
2. Increase the syncd frequency.
3. If needed, increase the amount of memory available for the communications subsystem.
4. Change the Failure Detection Rate.
Each of these options is described in the following sections.