Cisco ASR 5000 Troubleshooting Guide

determine which aaamgr(s) is/are failing. Starting in some versions of StarOS 17 and in v18+, this
behavior has been changed so that the aaamgr instance number having connectivity issues (as
reported in SNMP traps) is reported in the logs with its particular ID (Cisco CDETS
CSCum84773), though still only the first occurrence of this happening (across multiple aaamgrs)
is reported.

The management aaamgr instance number is the maximum sessmgr instance number + 1, so on an
ASR 5500 it is 385 for the Data Processing Card (DPC) or 1153 for the DPC 2.
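
The arithmetic above can be written out as a minimal Python sketch. The maximum sessmgr
instance numbers (384 and 1152) are inferred from the values quoted in the text, and the names
MAX_SESSMGR_INSTANCE and management_aaamgr_instance are hypothetical, used only for
illustration.

```python
# Maximum sessmgr instance numbers, inferred from the text (385 - 1 and 1153 - 1).
# These names are illustrative assumptions, not StarOS identifiers.
MAX_SESSMGR_INSTANCE = {"DPC": 384, "DPC2": 1152}


def management_aaamgr_instance(card_type: str) -> int:
    """Return the management aaamgr instance: maximum sessmgr instance + 1."""
    return MAX_SESSMGR_INSTANCE[card_type] + 1


print(management_aaamgr_instance("DPC"))   # 385
print(management_aaamgr_instance("DPC2"))  # 1153
```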

As a side note, the management aaamgr is responsible for handling operator/administrator logins
as well as Change of Authorization (CoA) requests initiated by the RADIUS servers themselves.

Continuing, the "show radius accounting servers detail" (or "show radius authentication servers
detail") command indicates the timestamps of the state changes to Down that correspond to the
traps/logs. (Reminder: Not Responding, defined earlier, is only a single aaamgr getting a timeout,
whereas Down is a single aaamgr getting enough consecutive timeouts, per configuration, to
trigger Down.)

If only one server is configured, it is not marked down, because that server is critical for
successful call setup.

It is worth mentioning that another parameter, “response-timeout”, can be configured on the
detect-dead-server config line. When it is specified, a server is marked down only when the
consecutive failures and response-timeout conditions are both met. The response-timeout
specifies a period of time during which NO responses are received to ALL the requests sent to a
particular server. (Note that this timer is continually reset as responses are received.) This
condition would be expected when either a server or the network connection is completely down,
as opposed to partially compromised or degraded.

The use case for this is a scenario where a burst in traffic causes the consecutive failures
condition to trigger, but marking a server down immediately as a result is not desired. Instead,
the server is marked down only after a specific period of time passes in which no responses are
received, effectively representing true server unreachability.
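
The interaction of the two conditions can be illustrated with a small Python model. This is a
sketch of the logic as described above, not StarOS code: the class DeadServerDetector, its field
names, and the values 4 and 60 are hypothetical, and the model tracks a single server as seen
from a single observer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeadServerDetector:
    """Toy model of dead-server detection for one RADIUS server (illustrative only)."""
    consecutive_failures_limit: int            # hypothetical parameter name
    response_timeout: Optional[float] = None   # seconds; None means not configured
    failures: int = 0
    last_response_at: float = 0.0

    def on_response(self, now: float) -> None:
        # Any response clears the failure count and restarts the silence timer.
        self.failures = 0
        self.last_response_at = now

    def on_timeout(self, now: float) -> None:
        # A request that exhausted its retries counts as one failure.
        self.failures += 1

    def is_down(self, now: float) -> bool:
        if self.failures < self.consecutive_failures_limit:
            return False
        if self.response_timeout is None:
            # Consecutive failures alone are enough.
            return True
        # Both conditions must hold: enough failures AND a long enough silence.
        return (now - self.last_response_at) >= self.response_timeout


det = DeadServerDetector(consecutive_failures_limit=4, response_timeout=60.0)
det.on_response(now=100.0)
for t in (101.0, 102.0, 103.0, 104.0):
    det.on_timeout(now=t)

print(det.is_down(now=105.0))  # False: 4 failures, but only 5 s without a response
print(det.is_down(now=160.0))  # True: 4 failures and 60 s without a response
```

With response_timeout left unset, the same four failures would mark the server down
immediately, which is the burst scenario the parameter is meant to avoid.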

The method just discussed for controlling RADIUS state machine changes depends on looking at
all aaamgr processes and finding one that triggers the failed-retries condition. Because failures
are distributed somewhat randomly across aaamgrs, it may not be the ideal algorithm for
detecting server failures. It is, however, especially good at finding aaamgr(s) that are broken
while all others are working fine.

Keepalive approach

Another method of detecting RADIUS server reachability is to use dummy keepalive test messages.
This involves constantly sending fake RADIUS messages instead of monitoring live traffic.
One advantage of this method is that it is always active. With the consecutive failures in an
aaamgr approach, there can be periods where no RADIUS traffic is sent, so there is no way to
know whether a problem exists during those times, and detection is delayed until attempts start
occurring again. Also, when a server is marked down, the keepalives continue to be sent so that
the server can be marked up as soon as possible. The disadvantage of this approach is that it
uses the management aaamgr instance for the test messages, so it misses issues that are tied to
other specific aaamgr instances.
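
As a rough illustration of the keepalive idea, the following Python sketch tracks the server state
from a sequence of probe results. It is a simplified model under stated assumptions: the function
keepalive_states, the thresholds down_after and up_after, and the state names "Up" and "Down"
are hypothetical and are not taken from StarOS.

```python
from typing import Iterable, List


def keepalive_states(probe_answered: Iterable[bool],
                     down_after: int = 3,
                     up_after: int = 1) -> List[str]:
    """Return the server state after each keepalive probe.

    down_after: consecutive unanswered probes needed to mark the server Down.
    up_after:   consecutive answered probes needed to mark it Up again.
    Both thresholds are illustrative assumptions.
    """
    state = "Up"
    missed = 0
    answered = 0
    states = []
    for ok in probe_answered:
        if ok:
            answered += 1
            missed = 0
            if state == "Down" and answered >= up_after:
                state = "Up"
        else:
            missed += 1
            answered = 0
            if state == "Up" and missed >= down_after:
                state = "Down"
        states.append(state)
    return states


# Probes keep flowing while the server is Down, so recovery is seen on the next answer.
print(keepalive_states([True, False, False, False, False, True]))
# ['Up', 'Up', 'Up', 'Down', 'Down', 'Up']
```

Because the probes are sent on a schedule rather than driven by subscriber traffic, the state in
this model changes even when no live RADIUS requests are in flight.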