Cisco UCS B200 M2 Blade Server Troubleshooting Guide
UCS Memory Error Management
UCS Enhanced Memory Error Management
Error Reporting Mechanisms
When correctable memory errors occur, they are counted in registers which can be accessed by the Cisco
Integrated Management Controller (CIMC). The CIMC tracks these errors and uses an IPMI sensor for each DIMM
to enforce a failure threshold. When new correctable errors are detected, the CIMC will generate an entry in the
System Event Log (SEL) indicating which DIMM encountered the error.
If a DIMM exceeds the pre-defined correctable ECC threshold, a subsequent entry is made in the SEL
indicating that the threshold was exceeded. Once a DIMM crosses the threshold, it is marked as “Degraded” and
remains in this state until the memory errors are manually reset or the DIMM is replaced. Even though the DIMM is
marked as “Degraded”, its overall status is “Operable” and it continues to operate normally
with no impact on performance (see Figure 1). The “Degraded” state serves as a notification that further
investigation is needed to determine why the DIMM crossed the threshold.
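The counting and threshold behavior described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the CIMC implementation: the class name DimmSensor, the ASSUMED_THRESHOLD value, and the SEL message strings are all hypothetical, and the real threshold is pre-defined by the platform.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical value for illustration; the actual threshold is
# pre-defined by the platform and is not stated in this document.
ASSUMED_THRESHOLD = 10


@dataclass
class DimmSensor:
    """Toy model of a per-DIMM correctable-error sensor (names are illustrative)."""
    dimm_id: str
    threshold: int = ASSUMED_THRESHOLD
    correctable_errors: int = 0
    degraded: bool = False

    @property
    def operability(self) -> str:
        # Correctable errors alone do not change the overall status.
        return "Operable"

    def record_errors(self, new_errors: int, sel: List[str]) -> None:
        """Add newly detected correctable errors and log SEL-style entries."""
        if new_errors <= 0:
            return
        self.correctable_errors += new_errors
        # One entry for the newly detected errors (message text is made up).
        sel.append(f"{self.dimm_id}: correctable ECC errors detected "
                   f"(count={self.correctable_errors})")
        # A second entry only the first time the threshold is exceeded.
        if not self.degraded and self.correctable_errors > self.threshold:
            self.degraded = True
            sel.append(f"{self.dimm_id}: correctable ECC threshold exceeded")

    def reset_errors(self) -> None:
        """Model a manual reset of memory errors, which clears the Degraded state."""
        self.correctable_errors = 0
        self.degraded = False
```

In this model, a sensor created with threshold=3 that records two errors and then two more ends up with degraded set to True while operability still reports "Operable"; calling reset_errors() clears the Degraded state, mirroring the manual reset described above.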
As a result of these SEL events, UCSM will trigger a series of faults pertaining to the individual DIMM state, the
Server Health LED Status, and the Overall Server Operability. For more details on different ways to view memory
error statistics, please refer to Appendix A.
Figure 1: For correctable ECC errors, the overall DIMM status is “Operable” and the DIMM continues to function with no
impact on performance