IBM SG24-5131-00 User Manual

Cluster Troubleshooting


hang. After a certain amount of time, by default 360 seconds, the cluster
manager will issue a config_too_long message into the /tmp/hacmp.out file.

The message issued looks like this:

The cluster has been in reconfiguration too long;Something may be wrong.

In most cases, this is because an event script has failed. You can find out
more by analyzing the /tmp/hacmp.out file. The error messages in the
/var/adm/cluster.log file may also be helpful. You can then fix the problem
identified in the log file and execute the clruncmd command, either on the
command line or by using the SMIT Cluster Recovery Aids screen. The clruncmd
command signals the Cluster Manager to resume cluster processing.

Note, however, that sometimes scripts simply take too long, so the message
showing up isn’t always an error, but sometimes a warning. If the message is
issued, that doesn’t necessarily mean that the script failed or never finished.
A script running for more than 360 seconds can still be working on something
and eventually get the job done. Therefore, it is essential to look at the
/tmp/hacmp.out file to find out what is actually happening.
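As a starting point for that analysis, the following minimal Python sketch locates the message in the log and returns the output that preceded it. It relies only on the message text quoted above. The function names and the choice of 20 lines of context are illustrative assumptions, not part of HACMP.

```python
# Text of the warning as quoted in this section.
MESSAGE = "The cluster has been in reconfiguration too long"


def find_config_too_long(lines):
    """Return the 1-based numbers of all lines containing the warning."""
    return [n for n, line in enumerate(lines, start=1) if MESSAGE in line]


def context(lines, hit, before=20):
    """Return up to `before` lines that precede line number `hit` (1-based).

    The amount of context (20 lines) is an arbitrary choice for illustration.
    """
    start = max(0, hit - 1 - before)
    return lines[start:hit - 1]


def read_log(path="/tmp/hacmp.out"):
    """Read the log file into a list of lines without trailing newlines."""
    with open(path, errors="replace") as f:
        return f.read().splitlines()
```

On a cluster node, one could call read_log(), pass the result to find_config_too_long(), and print context() for each hit to see which event script was running when the message was issued.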

7.3 Deadman Switch

The term “deadman switch” describes the AIX kernel extension that causes a
system panic and dump under certain cluster conditions if it is not reset. The
deadman switch halts a node when it enters a hung state that extends
beyond a certain time limit. This enables another node in the cluster to
acquire the hung node’s resources in an orderly fashion, avoiding possible
contention problems.
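The reset-or-halt behavior described above can be illustrated with a small Python model. This is only a conceptual sketch of a timer that must be reset periodically. It does not reflect how the AIX kernel extension is implemented, and the class name and the 5-second limit used in the example are hypothetical.

```python
class DeadmanTimer:
    """Toy model of a deadman switch: it expires unless reset in time."""

    def __init__(self, limit_seconds, now=0.0):
        self.limit = limit_seconds   # hypothetical time limit
        self.last_reset = now        # time of the most recent reset

    def reset(self, now):
        """Called periodically by a healthy process (the cluster manager)."""
        self.last_reset = now

    def expired(self, now):
        """True if more than `limit` seconds have passed without a reset."""
        return now - self.last_reset > self.limit


# Example with made-up timestamps: the timer is reset at t=3 and then
# left alone, as it would be if the resetting process were starved.
timer = DeadmanTimer(limit_seconds=5)
timer.reset(now=3)
still_running = not timer.expired(now=8)   # exactly at the limit
would_halt = timer.expired(now=8.5)        # limit exceeded
```

In this model, anything that keeps the resetting process from running for longer than the limit has the same effect as a real hang, which is why the tuning steps below focus on system performance.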

If the deadman switch is triggering and there is no obvious reason why the
cluster manager was kept from resetting the timer, such as an application
running at a higher priority than the clstrmgr process, customizations
related to performance problems should be performed in the following order:

1. Tune the system using I/O pacing.

2. Increase the syncd frequency.

3. If needed, increase the amount of memory available for the
communications subsystem.

4. Change the Failure Detection Rate.

Each of these options is described in the following sections.