Extreme 3804 Supplementary Manual

Advanced System Diagnostics and Troubleshooting Guide
Permanent Failures
The most detrimental conditions that result in packet error events are those that cause permanent 
errors. These errors arise from a failure within the switch fabric that causes data to be corrupted 
in a systematic fashion. These permanent hardware defects might or might not affect normal switch 
operation. They cannot be resolved by user intervention and will not resolve themselves. 
You must replace hardware to resolve permanent errors.
Responding to Failures
Because fabric checksum validation can detect and report both transient and systematic failures, 
human judgment is needed to tell the two kinds of failure apart.
As an example, the following messages are associated with an MSM64i health-check packet problem. 
They indicate that the system is running system-health-check to check the internal connectivity.
<CRIT:SYST> CPU health-check packet missing type 0 on slot 5
<CRIT:SYST> CPU health-check packet problem on card 5
<INFO:SYST> card.C 1937: card5 (type 20) is reset due to autorecovery config reset counter is 1
If these messages occur only once or twice, no action is necessary. (Transient problem.)
If these messages recur continuously, remove and re-insert the module in its slot. (If the problem goes 
away, this was a systematic, soft-state failure.)
If removing and re-inserting the module does not fix the problem, run extended diagnostics on the 
switch, because the messages might point to a systematic, permanent failure.
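The escalation described above can be sketched as a small decision function. This is only an illustration of the reasoning, not part of ExtremeWare: the names `Action`, `triage`, and `TRANSIENT_LIMIT` are hypothetical, and reading "once or twice" as a threshold of two occurrences is an assumption.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action necessary (transient problem)"
    RESEAT = "remove and re-insert the module"
    SOFT_STATE = "resolved by reseating (systematic, soft-state failure)"
    DIAGNOSTICS = "run extended diagnostics (possible systematic, permanent failure)"


# Assumption: "once or twice" is read as a threshold of two occurrences.
TRANSIENT_LIMIT = 2


def triage(occurrences: int, reseated: bool = False,
           recurs_after_reseat: bool = False) -> Action:
    """Map observed health-check message behavior to the next step."""
    if occurrences <= TRANSIENT_LIMIT:
        return Action.NONE
    if not reseated:
        return Action.RESEAT
    if recurs_after_reseat:
        return Action.DIAGNOSTICS
    return Action.SOFT_STATE


print(triage(1).value)
print(triage(40).value)
print(triage(40, reseated=True, recurs_after_reseat=True).value)
```

The function only encodes the order of the checks. Deciding whether messages "recur continuously" still depends on the operator's reading of the log.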
Hardware replacement is indicated when systematic errors cannot be resolved by normal 
troubleshooting methods. That is, one must first demonstrate that an error is both systematic and 
permanent before repairing or replacing a component.
Error Messages for Fabric Checksums
Versions of ExtremeWare prior to Version 6.2.2b56 simply logged the fact that a checksum error occurred 
on a slot, without providing much detail about the type of packet or the reason the checksum message 
was logged. The following messages are examples of the earlier message format.
01/31/2002 01:30.58 <CRIT:KERN> ERROR: Checksum Error on slot 3
01/31/2002 01:30.58 <CRIT:KERN> ERROR: Checksum Error on Slot 3
01/31/2002 01:30.58 <CRIT:KERN> ERROR: Checksum Error on slot 3
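Because the earlier format carries only a slot number, about the only thing that can be derived from it is how often each slot reports an error. A minimal sketch of that tally follows. The function name and the regular expression are assumptions based solely on the three sample lines above, and the match is case-insensitive because the samples show both "slot" and "Slot".

```python
import re
from collections import Counter

# Pattern inferred from the sample lines only; not an official format definition.
LEGACY_CHECKSUM = re.compile(
    r"<CRIT:KERN>\s+ERROR:\s+Checksum Error on slot\s+(\d+)",
    re.IGNORECASE,
)


def count_legacy_checksum_errors(log_text: str) -> Counter:
    """Count earlier-format checksum messages per slot number."""
    return Counter(int(m.group(1)) for m in LEGACY_CHECKSUM.finditer(log_text))


sample = """\
01/31/2002 01:30.58 <CRIT:KERN> ERROR: Checksum Error on slot 3
01/31/2002 01:30.58 <CRIT:KERN> ERROR: Checksum Error on Slot 3
01/31/2002 01:30.58 <CRIT:KERN> ERROR: Checksum Error on slot 3
"""

counts = count_legacy_checksum_errors(sample)
print(counts)  # Counter({3: 3})
```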
ExtremeWare Release 6.2.2b56 and later provide more detailed information about the origins of the 
checksum message by expanding the message to describe the type of message and the condition 
detected. For example, if the system-health-check subsystem detects a panic or a condition 
requiring action by the subsystem or the administrator, you can expect to see a message similar 
to this:
04/16/2003 13:17.23 <CRIT:SYST> Sys-health-check [EXT] checksum error on slot 5 prev=0 cur=6
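The expanded message can be split into its visible fields with a regular expression. The sketch below is based only on the single sample message above; the names `ChecksumEvent` and `parse_checksum_event` are hypothetical, and the timestamp layout (month/day/year, then hours:minutes.seconds) is an assumption. This section does not define the bracketed tag or the `prev` and `cur` values, so the code extracts them as they appear without interpreting them.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Pattern inferred from one sample message; \s+ also tolerates a wrapped line.
EVENT = re.compile(
    r"(?P<timestamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}\.\d{2})\s+"
    r"<(?P<severity>\w+):(?P<facility>\w+)>\s+"
    r"Sys-health-check\s+\[(?P<tag>\w+)\]\s+checksum error\s+"
    r"on slot\s+(?P<slot>\d+)\s+prev=(?P<prev>\d+)\s+cur=(?P<cur>\d+)"
)


@dataclass
class ChecksumEvent:
    timestamp: datetime
    severity: str
    facility: str
    tag: str
    slot: int
    prev: int
    cur: int


def parse_checksum_event(text: str) -> Optional[ChecksumEvent]:
    """Return the first expanded checksum message found in text, or None."""
    m = EVENT.search(text)
    if m is None:
        return None
    return ChecksumEvent(
        # Assumed layout: month/day/year hours:minutes.seconds
        timestamp=datetime.strptime(m["timestamp"], "%m/%d/%Y %H:%M.%S"),
        severity=m["severity"],
        facility=m["facility"],
        tag=m["tag"],
        slot=int(m["slot"]),
        prev=int(m["prev"]),
        cur=int(m["cur"]),
    )


line = ("04/16/2003 13:17.23 <CRIT:SYST> Sys-health-check [EXT] "
        "checksum error on slot 5 prev=0 cur=6")
event = parse_checksum_event(line)
print(event.slot, event.tag, event.prev, event.cur)  # 5 EXT 0 6
```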