Cisco UCS B22 M3 Blade Server Troubleshooting Guide

Handling of Memory Errors

As explained in the previous section, increased error rates can be attributed to multiple trends in the server
industry. Cisco UCS servers handle memory errors in a way that does not compromise the reliability, availability,
and serviceability (RAS) of the server. Memory errors are handled through three main technologies: server ECC
capabilities, scrub protocols, and field memory-error threshold policies.

Cisco UCS Server ECC Capabilities

All Cisco UCS servers use memory modules with ECC applied across 64-bit (8-byte) data words, each protected
by 8 check bits to form a 72-bit code word. Such single-error-correcting, double-error-detecting (SECDED)
codes can correct any single-bit error and detect any double-bit error. In addition, Cisco UCS servers built from
Intel Xeon EP-class processors employ codes that not only correct any single-bit error, but also correct errors
confined to a single x4 DRAM chip and detect errors in up to two devices. This capability is known as single-device
data correction (SDDC).
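
To make the 64 + 8 = 72-bit SECDED arrangement concrete, the following is a minimal sketch of a textbook extended Hamming (72,64) code. It is an illustration only: the check-bit layout actually used by the memory controller is not described here, and the names `encode_secded` and `decode_secded` are hypothetical. Seven check bits sit at the power-of-two positions, and one overall parity bit at position 0 distinguishes single-bit from double-bit errors.

```python
# Textbook extended Hamming (72,64) SECDED sketch.
# Assumption: this is NOT the controller's real check-bit matrix, only an illustration.
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32, 64)
DATA_POSITIONS = tuple(p for p in range(1, 72) if p not in PARITY_POSITIONS)  # 64 slots


def _syndrome(code: int) -> int:
    """XOR of the positions (1..71) of all set bits; 0 for a valid code word."""
    s = 0
    for pos in range(1, 72):
        if (code >> pos) & 1:
            s ^= pos
    return s


def _odd_weight(code: int) -> bool:
    """True when the 72-bit word has an odd number of set bits."""
    return bin(code).count("1") % 2 == 1


def encode_secded(data: int) -> int:
    """Encode a 64-bit data word into a 72-bit code word."""
    if not 0 <= data < (1 << 64):
        raise ValueError("data must fit in 64 bits")
    code = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            code |= 1 << pos
    s = _syndrome(code)
    for p in PARITY_POSITIONS:      # set check bits so the syndrome becomes 0
        if s & p:
            code |= 1 << p
    if _odd_weight(code):           # overall parity bit makes the total weight even
        code |= 1
    return code


def decode_secded(code: int):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    s = _syndrome(code)
    odd = _odd_weight(code)
    if s == 0 and not odd:
        status = "ok"
    elif odd and s < 72:
        code ^= 1 << s              # single-bit error at position s (0 = parity bit)
        status = "corrected"
    else:
        return None, "uncorrectable"  # e.g. even weight with a nonzero syndrome
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (code >> pos) & 1:
            data |= 1 << i
    return data, status


word = 0x0123456789ABCDEF
stored = encode_secded(word)
print(decode_secded(stored ^ (1 << 17)))              # one flipped bit
print(decode_secded(stored ^ (1 << 17) ^ (1 << 40)))  # two flipped bits
```

In this sketch a single flipped bit anywhere in the 72-bit word is corrected, while any two flipped bits are reported as uncorrectable rather than silently miscorrected.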

Additionally, when a system is operating in lockstep mode, which spreads the ECC code words across a pair of
memory channels, SDDC is extended to correct errors in any x8 DRAM chip (or adjacent pair of x4 DRAM
chips). To provide even greater reliability and availability, Cisco UCS servers built from the Intel Xeon EX-class
processors can correct errors in any (not necessarily adjacent) pair of x4 devices and can detect errors in up to
three devices. This capability is known as double-device data correction (DDDC).

Scrub Protocol

During every normal memory read access, the memory controller checks for and corrects single-bit errors. However,
because of data locality, some parts of the memory array may go unread for long periods, so errors in them are
not caught by normal accesses. Scrub patrol protocols therefore provide correction coverage beyond what the
usual SECDED codes deliver on read accesses alone. The scrub patrol routine reads the entire memory array and
corrects any single-bit errors it finds. This patrol routine runs periodically at a predetermined interval,
usually once every 24 hours.
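
The effect of a patrol pass can be sketched with a toy model. The code below is an assumption-laden illustration, not the controller's implementation: memory is modeled as a list holding the number of flipped bits per code word, and the function names `scrub_pass` and `scrub_bandwidth` are hypothetical. The second helper simply works out the average read rate needed to cover a given capacity within one scrub interval.

```python
def scrub_pass(flipped_bits):
    """Run one patrol pass over a toy memory model.

    flipped_bits[i] is the number of flipped bits currently in code word i.
    Single-bit errors are corrected in place; anything larger is only reported.
    """
    corrected = 0
    uncorrectable = []
    for addr, count in enumerate(flipped_bits):
        if count == 1:
            flipped_bits[addr] = 0      # SECDED can repair a single flipped bit
            corrected += 1
        elif count >= 2:
            uncorrectable.append(addr)  # beyond single-bit correction in this model
    return corrected, uncorrectable


def scrub_bandwidth(capacity_bytes, interval_hours=24.0):
    """Average bytes per second needed to read the whole array once per interval."""
    return capacity_bytes / (interval_hours * 3600)


memory = [0, 1, 0, 2, 1]                # hypothetical error counts per code word
print(scrub_pass(memory), memory)
print(scrub_bandwidth(64 * 2**30) / 2**20, "MiB/s")  # 64 GiB is an assumed capacity
```

For an assumed 64 GiB of memory scrubbed once every 24 hours, the required average read rate is roughly 0.76 MiB/s, which suggests why a full patrol adds little load to the memory channels.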

Field Memory-Error Threshold Policies

In addition to ECC capabilities and scrub protocol, Cisco UCS servers employ field memory-error threshold policies
that flag certain memory modules as candidates for replacement after the module reaches a certain memory-error
threshold. If any memory module generates an uncorrectable error, that module is flagged as degraded. Similar
logic flags modules that generate correctable errors as degraded as well.
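
The flagging logic described above can be sketched as follows. This is a minimal illustration under stated assumptions: the correctable-error threshold value, the `DimmErrorCounters` type, the `module_state` function, and the "operable" state name are all hypothetical, since the actual threshold and state names are not given here.

```python
from dataclasses import dataclass

# Hypothetical threshold; the real value is not stated in this document.
CORRECTABLE_THRESHOLD = 10


@dataclass
class DimmErrorCounters:
    """Error counts observed for one memory module (illustrative type)."""
    correctable: int = 0
    uncorrectable: int = 0


def module_state(counters: DimmErrorCounters,
                 threshold: int = CORRECTABLE_THRESHOLD) -> str:
    """Classify a module the way the current policy is described."""
    if counters.uncorrectable > 0:
        return "degraded"               # any uncorrectable error flags the module
    if counters.correctable >= threshold:
        return "degraded"               # correctable errors are flagged the same way
    return "operable"                   # assumed name for the healthy state


print(module_state(DimmErrorCounters(correctable=3)))
print(module_state(DimmErrorCounters(correctable=12)))
print(module_state(DimmErrorCounters(uncorrectable=1)))
```

Note that the two error classes lead to the same "degraded" result in this sketch, even though only the uncorrectable case involves data that ECC could not repair.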

A problem with current threshold policies is that modules with correctable errors and modules with uncorrectable
errors are both flagged as degraded, even though advanced ECC capabilities and scrub protocols effectively address
all correctable errors. Flagging a module with only correctable errors as degraded is premature, because data shows
that in many cases correctable errors are transient and resolve themselves with little user intervention. In
addition, when a module is flagged as degraded because of correctable errors, it is replaced, causing unnecessary
disruption to server uptime.