IBM 520Q User Manual

IBM System p5 520 and 520Q Technical Overview and Introduction
The operating system cannot program or access the temperature threshold using the SP. 
EPOW events can, for example, trigger the following actions:
• Temperature monitoring increases the fan rotation speed when the ambient 
  temperature is above a preset operating range.
• Temperature monitoring warns the system administrator of potential environment-related 
  problems. It also performs an orderly system shutdown when the operating temperature 
  exceeds a critical level.
• Voltage monitoring provides a warning and an orderly system shutdown when the voltage is 
  out of the operational specification.
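The escalation described in the list above can be sketched as a small policy function. This is only an illustration: the threshold values, the voltage window, and the action names are hypothetical and are not taken from the p5-520 firmware, which does not expose these thresholds to the operating system.

```python
# Hypothetical thresholds, chosen only for illustration. The real values
# are set by the platform and are not given in this document.
FAN_BOOST_TEMP_C = 35.0    # above this, raise the fan rotation speed
WARN_TEMP_C = 45.0         # above this, warn the system administrator
CRITICAL_TEMP_C = 55.0     # above this, perform an orderly shutdown
VOLTAGE_MIN = 11.4         # assumed lower bound of the operational range
VOLTAGE_MAX = 12.6         # assumed upper bound of the operational range


def epow_actions(temp_c, voltage):
    """Return the actions an EPOW-style monitor might take for one sample."""
    actions = []

    def add(action):
        # Keep each action at most once, in the order it was triggered.
        if action not in actions:
            actions.append(action)

    if temp_c > FAN_BOOST_TEMP_C:
        add("increase_fan_speed")
    if temp_c > WARN_TEMP_C:
        add("warn_administrator")
    if temp_c > CRITICAL_TEMP_C:
        add("orderly_shutdown")
    if not (VOLTAGE_MIN <= voltage <= VOLTAGE_MAX):
        # Out-of-specification voltage: warn, then shut down in order.
        add("warn_administrator")
        add("orderly_shutdown")
    return actions


print(epow_actions(40.0, 12.0))
print(epow_actions(60.0, 12.0))
print(epow_actions(25.0, 13.5))
```

The sketch keeps the checks cumulative, so a critical temperature also reports the fan and warning actions that a lower threshold would already have triggered.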
3.1.4  Self-healing
For a system to be self-healing, it must be able to recover from a failing component by first 
detecting and isolating the failed component, taking it offline, fixing or isolating it, and 
reintroducing the fixed or replacement component into service without any application 
disruption. Examples include:
• Bit steering to redundant memory in the event of a failed memory module to keep the 
  server operational
• Bit-scattering, thus allowing for error correction and continued operation in the presence 
  of a complete chip failure (Chipkill™ recovery)
• Single bit error correction using ECC without reaching error thresholds for main, L2, and 
  L3 cache memory
• L3 cache line deletes extended from 2 to 10 for additional self-healing
• ECC extended to inter-chip connections on fabric and processor bus
• Memory scrubbing to help prevent soft-error memory faults
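As an illustration of how ECC can correct a single-bit error and detect a double-bit error, the following minimal sketch uses a textbook extended Hamming (8,4) code. This is an assumption made for the example: the actual ECC code and word geometry used in the p5-520 and p5-520Q are not described here, and the function names encode and decode are hypothetical.

```python
from typing import List, Optional, Tuple


def encode(data: int) -> List[int]:
    """Encode a 4-bit value as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 form a Hamming(7,4)
    word with parity bits at positions 1, 2, and 4.
    """
    d = [(data >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None for an uncorrectable word."""
    # The syndrome is the XOR of the positions of all set bits in 1..7.
    # It is 0 for a valid word and equals the position of a single flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    odd_parity = sum(code) % 2 == 1

    fixed = list(code)
    if odd_parity:
        # An odd number of flips: assume one and correct it. A syndrome
        # of 0 here means the overall parity bit itself was flipped.
        fixed[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even parity with a nonzero syndrome: two bits flipped.
        return "uncorrectable", None
    else:
        status = "ok"
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, data


word = encode(0b1011)
word[6] ^= 1              # inject a single-bit error
print(decode(word))       # ('corrected', 11)
```

A word with two flipped bits is reported as uncorrectable rather than silently miscorrected, which is the property that protects data integrity.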
Memory reliability, fault tolerance, and integrity
The p5-520 and p5-520Q use Error Checking and Correcting (ECC) circuitry for system 
memory to correct single-bit and to detect double-bit memory failures. Detection of double-bit 
memory failures helps maintain data integrity. Furthermore, the memory chips are organized 
such that the failure of any specific memory module only affects a single bit within a four-bit 
ECC word (bit-scattering), thus allowing for error correction and continued operation in the 
presence of a complete chip failure (Chipkill recovery). The memory DIMMs also use 
memory scrubbing and thresholding to determine when spare memory modules within each 
bank of memory should be used to replace memory modules that have exceeded their 
threshold of error count (dynamic bit-steering). Memory scrubbing is the process of reading 
the contents of the memory during idle time and checking and correcting any single-bit errors 
that have accumulated by passing the data through the ECC logic. This function is a 
hardware function on the memory controller and does not influence normal system memory 
performance.
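The interaction of scrubbing, thresholding, and dynamic bit-steering can be sketched in software, although on the real system it is a hardware function of the memory controller. The sketch below rests on several assumptions: an 8-bit extended Hamming word stands in for the real ECC word, bit i of every word is assumed to live on memory module i (a simple form of bit-scattering), and ERROR_THRESHOLD and all function names are hypothetical.

```python
ERROR_THRESHOLD = 3   # hypothetical: corrected errors tolerated per module


def syndrome(word):
    """Return (position, odd_parity) for an 8-bit extended Hamming word.

    position is the XOR of the indices of the set bits in 1..7 and
    odd_parity is True when the total number of set bits is odd.
    """
    pos = 0
    for i in range(1, 8):
        if word[i]:
            pos ^= i
    return pos, sum(word) % 2 == 1


def scrub(memory, error_counts):
    """One scrub pass: read every word and fix single-bit errors in place.

    memory is a list of 8-bit codewords (lists of 0/1). error_counts[i]
    tallies the corrected errors attributed to module i. Returns the
    number of uncorrectable (double-bit) words found.
    """
    uncorrectable = 0
    for word in memory:
        pos, odd = syndrome(word)
        if odd:                # single-bit error: correct it and count it
            word[pos] ^= 1
            error_counts[pos] += 1
        elif pos != 0:         # even parity, nonzero syndrome: two flips
            uncorrectable += 1
    return uncorrectable


def modules_to_steer(error_counts, threshold=ERROR_THRESHOLD):
    """Modules whose error count exceeds the threshold (spare candidates)."""
    return [m for m, n in enumerate(error_counts) if n > threshold]


# The all-zero word is a valid codeword, so it serves as clean memory.
memory = [[0] * 8 for _ in range(6)]
error_counts = [0] * 8
for word in memory[:4]:
    word[5] ^= 1               # soft errors that all land on module 5

print(scrub(memory, error_counts))
print(modules_to_steer(error_counts))
```

Because each module contributes only one bit to a word in this model, even a module that fails on every access produces single-bit errors that remain correctable until a spare is steered in.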