Extreme 3804 Supplementary Manual

Advanced System Diagnostics and Troubleshooting Guide
Diagnostics
BlackDiamond System with Two MSMs
During the scanning period, the module is taken offline. Expect a minimum offline time of 90 seconds. 
Up to eight correctable single-bit errors are corrected, with minimal loss to the total memory buffers.
In extremely rare cases, non-correctable errors are detected by memory scanning. In these circumstances, 
the condition is noted, but no corrective action is possible. When operating in the manual mode of 
memory scanning, the module is returned to online service after all possible corrective actions have 
been taken.
During the memory scan, CPU utilization is high and mostly dedicated to executing the 
diagnostics, as is normal when running any diagnostic on the modules. During this time, other network 
activities in which this system is expected to participate in a timely manner could be adversely 
affected, for example, in networks that use STP and OSPF.
The alarm-level option of the global system health check facility does not attempt to diagnose a 
suspected module; instead, it simply logs a message at a specified level.
The auto-recovery option, by contrast, attempts to diagnose and recover a failed module up to a 
configured number of times. Plan carefully before you use this option. If you enable the system health 
check facility on the switch and configure the auto-recovery option to use the offline auto-recovery 
action, then once a module failure is suspected, the system takes the module out of service and performs 
extended diagnostics. If the number of auto-recovery attempts exceeds the configured threshold, the 
system removes the module from service for good: the module is permanently marked “down,” is left in a 
non-operational state, and cannot be used in a system running ExtremeWare 6.2.2 or later. A log 
message indicating this is posted to the system log. 
NOTE
The behavior described above is configurable. You can instead enable the system health check 
facility on the switch and configure the auto-recovery option to use the online auto-recovery 
action, which keeps a suspect module online regardless of the number of errors detected.
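The difference between the offline and online auto-recovery actions can be sketched as a small state model. This is an illustration in Python only, not ExtremeWare code: the names Action, ModuleState, Module, and on_suspected_failure are hypothetical, and the sketch assumes that each suspected failure counts as one auto-recovery attempt. It does not model the module returning to service after successful diagnostics.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Hypothetical names for the two auto-recovery actions."""
    OFFLINE = "offline"
    ONLINE = "online"


class ModuleState(Enum):
    OPERATIONAL = "operational"
    DIAGNOSING = "diagnosing"  # out of service while extended diagnostics run
    DOWN = "down"              # permanently marked down, non-operational


@dataclass
class Module:
    slot: int
    attempts: int = 0
    state: ModuleState = ModuleState.OPERATIONAL


def on_suspected_failure(module, action, max_attempts):
    """Update and return the module state after one suspected failure.

    Assumption: every suspected failure consumes one auto-recovery attempt.
    """
    if module.state is ModuleState.DOWN:
        # A module marked down stays down.
        return module.state
    if action is Action.ONLINE:
        # Online action: the suspect module is kept in service.
        module.state = ModuleState.OPERATIONAL
        return module.state
    # Offline action: count the attempt, then compare with the threshold.
    module.attempts += 1
    if module.attempts > max_attempts:
        module.state = ModuleState.DOWN
    else:
        module.state = ModuleState.DIAGNOSING
    return module.state
```

With a threshold of 2 and the offline action, the first two suspected failures put the module into diagnostics and the third marks it down. With the online action, the module stays operational no matter how many failures are reported.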
Example log messages for modules taken offline:
01/31/2002 01:16.40 <CRIT:SYST> Sys-health-check [ACTION] (PBUS checksum)
(CARD_HWFAIL_PBUS_CHKSUM_EDP_ERROR) slot 3
01/31/2002 01:16.40 <INFO:SYST> Card in slot 1 is off line
01/31/2002 01:16.40 <INFO:SYST> card.c 2035: Set card 1 to Non-operational 
01/31/2002 01:16.40 <INFO:SYST> Card in slot 2 is off line
01/31/2002 01:16.44 <INFO:SYST> card.c 2035: Set card 2 to Non-operational 
01/31/2002 01:16.44 <INFO:SYST> Card in slot 3 is off line
01/31/2002 01:16.46 <INFO:SYST> card.c 2035: Set card 3 to Non-operational 
01/31/2002 01:16.46 <INFO:SYST> Card in slot 4 is off line
01/31/2002 01:16.46 <INFO:SYST> card.c 2035: Set card 4 to Non-operational
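When reviewing a captured system log, it can help to extract which slots were set to non-operational. The following is a minimal Python sketch that assumes the log uses exactly the line layout shown in the example above (date, time, severity and subsystem in angle brackets, then the message). The function names parse_log and non_operational_slots are hypothetical, and a line with no leading timestamp is assumed to continue the previous entry.

```python
import re

# Layout assumed from the example: MM/DD/YYYY HH:MM.SS <SEVERITY:SUBSYSTEM> message
ENTRY = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}\.\d{2}) "
    r"<(?P<severity>\w+):(?P<subsystem>\w+)> (?P<message>.*)$"
)
NON_OPERATIONAL = re.compile(r"Set card (\d+) to Non-operational")


def parse_log(text):
    """Split log text into entries, joining wrapped continuation lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match:
            entries.append(match.groupdict())
        elif entries:
            # No timestamp: treat the line as a continuation of the last entry.
            entries[-1]["message"] += " " + line
    return entries


def non_operational_slots(entries):
    """Return slot numbers reported as set to Non-operational, in log order."""
    slots = []
    for entry in entries:
        found = NON_OPERATIONAL.search(entry["message"])
        if found:
            slots.append(int(found.group(1)))
    return slots


SAMPLE = """\
01/31/2002 01:16.40 <CRIT:SYST> Sys-health-check [ACTION] (PBUS checksum)
(CARD_HWFAIL_PBUS_CHKSUM_EDP_ERROR) slot 3
01/31/2002 01:16.40 <INFO:SYST> Card in slot 1 is off line
01/31/2002 01:16.40 <INFO:SYST> card.c 2035: Set card 1 to Non-operational
"""

entries = parse_log(SAMPLE)
slots = non_operational_slots(entries)
```

Filtering the parsed entries on the severity field (for example, CRIT) is one way to find the health-check event that triggered the action.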