HP (Hewlett-Packard) t2808-90006 User Manual

Disaster Tolerance and Recovery in a Serviceguard Cluster
Managing a Disaster Tolerant Environment
Chapter 1
Even if recovery is automated, you may choose, or need, to recover 
from some types of disasters manually. A rolling disaster, which is a 
disaster that happens before the cluster has recovered from a previous 
disaster, is an example of when you may want to switch over manually. 
If the data link failed and then, while it was coming back up and 
resynchronizing data, the data center failed, you would want human 
intervention to make judgment calls about which site had the most 
current and consistent data before failing over.
Who manages the nodes in the cluster and how are they trained?
Putting a disaster tolerant architecture in place without planning for 
the people aspects is a waste of money. Training and documentation 
are more complex because the cluster is in multiple data centers. 
Each data center often has its own operations staff with their own 
processes and ways of working. These operations people will now be 
required to communicate with each other and coordinate 
maintenance and failover rehearsals, as well as to work together to 
recover from an actual disaster. If the remote nodes are placed in a 
“lights-out” data center, the operations staff may want to put 
additional processes or monitoring software in place to maintain the 
nodes in the remote location.
Rehearsing failover scenarios is important for staying prepared. A 
written plan should describe what to do in each disaster scenario and 
set a rehearsal schedule: at least once every 6 months, and ideally 
once every 3 months.
How is the cluster maintained?
Planned downtime and maintenance, such as backups or upgrades, 
must be more carefully thought out because they may leave the 
cluster vulnerable to another failure. For example, nodes need to be 
brought down for maintenance in pairs: one node at each site, so that 
quorum calculations do not prevent automated recovery if a disaster 
occurs during planned maintenance.
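The reasoning behind taking nodes down in pairs can be sketched with a small calculation. The example below is a simplified model, not Serviceguard's actual implementation: it assumes that the surviving nodes can re-form the cluster automatically if they are a strict majority of the nodes that were running, or exactly half of them when a tie-breaker (such as a cluster lock or arbitrator) is available. The function names and the site names are hypothetical.

```python
from typing import Dict


def can_reform(surviving: int, previously_running: int,
               tie_breaker: bool = True) -> bool:
    """Simplified quorum rule (an assumption for illustration only).

    Survivors re-form the cluster with a strict majority of the nodes
    that were running, or with exactly half plus a tie-breaker.
    """
    if surviving * 2 > previously_running:
        return True
    return (surviving > 0
            and surviving * 2 == previously_running
            and tie_breaker)


def survives_any_site_failure(running_per_site: Dict[str, int],
                              tie_breaker: bool = True) -> bool:
    """True if losing any single site still leaves a quorum."""
    total = sum(running_per_site.values())
    return all(
        can_reform(total - lost, total, tie_breaker)
        for lost in running_per_site.values()
    )


# Two sites with two nodes each; one node at site_a is down for maintenance.
unbalanced = {"site_a": 1, "site_b": 2}
# Same cluster, but one node is down at each site (maintenance in pairs).
balanced = {"site_a": 1, "site_b": 1}

print(survives_any_site_failure(unbalanced))  # False
print(survives_any_site_failure(balanced))    # True
```

In the unbalanced case, losing site_b leaves one node out of three, which is less than half, so automated recovery would be blocked under this model. With maintenance done in pairs, either site failure leaves exactly half of the running nodes, which the tie-breaker can resolve.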
Rapid detection of failures and rapid repair of hardware are essential 
so that the cluster is not vulnerable to additional failures. 
Testing is more complex and requires personnel in each of the data 
centers. Site failure testing should be added to the current cluster 
testing plans.