Cisco Process Orchestrator 3.0 User Guide

OL-30196-01
Chapter 5      Managing High Availability and Resiliency
  Handling Restarts and Failures
Handling Events During Restarts and Failures
Process Orchestrator adapters resume sending events after the Process Orchestrator server restarts. 
Although each adapter is responsible for its own implementation in this area, existing Process 
Orchestrator adapter implementations attempt to be consistent with this guaranteed-delivery design. 
For example, when the SAP adapter reads CCMS alerts or the Remedy adapter polls for Incident state, 
the adapter stores the last-read record so that it can resume reading with the next entry when the 
Process Orchestrator server resumes. For network-initiated event technologies that do not guarantee 
delivery, such as SNMP traps, the Process Orchestrator server cannot know about events that occurred 
while it was down. If required, many of these technologies can use highly available intermediaries to 
persist the transient events. For example, tools exist that listen for SNMP traps and write them to 
persistent stores such as log files or Windows events. 
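The last-read-record behavior described above can be illustrated with a minimal sketch. This is not the SAP or Remedy adapter code. The class name `CheckpointedReader`, the JSON checkpoint file, and the `id` field on each record are all assumptions made for illustration.

```python
import json
import os
import tempfile


class CheckpointedReader:
    """Hypothetical adapter-side reader that remembers the last record it delivered."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path

    def load_last_id(self):
        # No checkpoint file yet means nothing has been delivered so far.
        if not os.path.exists(self.checkpoint_path):
            return 0
        with open(self.checkpoint_path, "r", encoding="utf-8") as f:
            return json.load(f)["last_id"]

    def save_last_id(self, last_id):
        # Write to a temporary file and rename it, so that a crash cannot
        # leave a half-written checkpoint behind.
        directory = os.path.dirname(self.checkpoint_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"last_id": last_id}, f)
        os.replace(tmp_path, self.checkpoint_path)

    def poll(self, records, deliver):
        """Deliver every record newer than the checkpoint, in id order."""
        last_id = self.load_last_id()
        delivered = []
        for record in sorted(records, key=lambda r: r["id"]):
            if record["id"] <= last_id:
                continue  # already delivered before the restart
            deliver(record)  # may raise if the server is unavailable
            last_id = record["id"]
            self.save_last_id(last_id)  # advance only after a successful delivery
            delivered.append(record["id"])
        return delivered
```

Because the checkpoint advances only after `deliver` returns, a reader created after a restart with the same checkpoint path skips records that were already delivered and continues with the next entry. A record whose delivery was interrupted is sent again on the next poll.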
There are two types of event systems in the Process Orchestrator server: 
  • Event-based triggers. Because of the adapter implementations, event-based triggers are not lost 
    across server restarts or failures. Like state management in other areas of the product, trigger 
    submissions are transactional. When a Process Orchestrator adapter sends a trigger to the Process 
    Orchestrator server, the processes that depend on that trigger are initiated as part of the 
    submission. As with any other processes, these process instances are persisted to the database, so 
    that after a restart the triggered processes are running. With the exception of transient, 
    non-persisted events, such as SNMP traps or performance thresholds, there should be no time gap 
    in which events are lost and triggers fail to launch.
  • Correlation. The Correlation feature allows a server to tie together a related series of events. 
    It relies on a caching mechanism that makes the event data available to the Correlate activities 
    in processes when needed. Correlate activities can be used to make a process wait for a certain 
    condition, or to branch based on how many events with particular properties have been received in 
    a particular time frame. This cache of events is not retained across server restarts, so events 
    received before a restart will not be matched by Correlate activities. This can affect the 
    accuracy of a process “decision” that is based on the data from a Correlate activity.
In practice, this has not been found to be an issue, for the following reasons:
  – First, correlation is a fairly advanced feature of the product and is rarely used in most 
    scenarios. 
  – Second, best practice is to configure a fairly small correlation time frame, which minimizes 
    both the likelihood and the impact of this compound condition. If the correlation time frame 
    coincides with a server restart, the restart itself takes up some or all of that time frame. 
  – Third, these correlations are typically used in diagnostic situations, and the events often 
    repeat if the problem recurs. Where a problem is still occurring, Process Orchestrator will 
    typically pick up the condition the next time the diagnostic process launches. 
  – Finally, Process Orchestrator automation packs tend to set these processes as non-persistent 
    anyway, so they are not restarted. After the Process Orchestrator server has been down, it is 
    usually best to get a fresh view of the health of the component being monitored rather than 
    depend on its status before the restart. In many cases the failure has already been resolved, 
    and recording the old diagnosis just creates noise.
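The effect of a non-persisted correlation cache can be sketched as follows. This is a simplified illustration, not the Process Orchestrator implementation. The class `CorrelationCache` and its methods are hypothetical names, and events are modeled as plain dictionaries of properties with a timestamp in seconds.

```python
import time


class CorrelationCache:
    """Hypothetical in-memory event cache; its contents are lost when it is recreated."""

    def __init__(self):
        self._events = []  # list of (timestamp, properties) tuples

    def record(self, properties, timestamp=None):
        # Store a copy of the event properties with the time they were received.
        ts = time.time() if timestamp is None else timestamp
        self._events.append((ts, dict(properties)))

    def count_matching(self, criteria, window_seconds, now=None):
        """Count cached events inside the time frame whose properties match criteria."""
        now = time.time() if now is None else now
        cutoff = now - window_seconds
        return sum(
            1
            for ts, props in self._events
            if cutoff <= ts <= now
            and all(props.get(key) == value for key, value in criteria.items())
        )
```

In this sketch, a process that branches when three matching events arrive within 60 seconds sees all three only if they are in the same cache instance. Creating a new `CorrelationCache`, as a restart would, starts the count again from zero. A shorter time frame reduces the number of earlier events that such a restart could discard.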