When System Recovery Is Needed
Caution
|
The methods explained in this topic may fail if you use a cluster profile consisting of only 3 hybrid VM nodes (and no worker
nodes). Without worker nodes, the VMs lack resiliency, and the recovery methods can fail as a result.
|
At some time during normal operations of your Cisco Crosswork cluster, you may find that you need to recover the entire system.
This can be the result of one or more malfunctioning nodes, one or more malfunctioning services or applications, or a disaster
that destroys the hosts for the entire cluster.
A functional cluster requires a minimum of three hybrid nodes. These hybrid nodes share the processing and traffic loads imposed
by the core Cisco Crosswork management, orchestration, and infrastructure services. The hybrid nodes are highly available
and automatically redistribute processing loads among themselves and to worker nodes.
The cluster can tolerate one hybrid node reboot (whether graceful or ungraceful). During the hybrid node reboot, the system
is still functional, but degraded from an availability point of view. The system can tolerate any number of failed worker
nodes, but again, system availability is degraded until the worker nodes are restored.
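The fault tolerance described above is consistent with a standard majority-quorum model, which the following sketch illustrates. This is an illustration only; the guide does not state the exact algorithm Crosswork uses, and the function name is hypothetical.

```python
# Illustrative majority-quorum arithmetic (an assumption, not documented
# Crosswork behavior): the cluster stays functional as long as a majority
# of hybrid nodes is up.

def tolerated_hybrid_failures(hybrid_nodes: int) -> int:
    """Hybrid-node failures a majority-quorum cluster can absorb."""
    majority = hybrid_nodes // 2 + 1
    return hybrid_nodes - majority

# A 3-node hybrid cluster keeps its majority with one node down,
# matching the one-reboot tolerance described above.
print(tolerated_hybrid_failures(3))  # 1
```

Under this model, losing two of three hybrid nodes leaves no majority, which is why the double fault described later in this topic cannot be reliably recovered by rebooting alone.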
Cisco Crosswork generates alarms when nodes, applications, or services are malfunctioning. If you are experiencing system
faults, examine the alarm and check the health of the individual node, application, or service identified in the alarm. You
can use the features described in the Check Cluster Health section to drill down on the source of the problem and, if it turns out to be a service fault, restart the problem service.
If you see alarms indicating that one hybrid node has failed, or that one hybrid node and one or more worker nodes have failed,
start by attempting to reboot or replace (erase and then re-add) the failed nodes. If you are still having trouble after that,
consider performing a clean system reboot.
The loss of two or more hybrid nodes is a double fault. Even if you replace or reboot the failed hybrid nodes, there is no
guarantee that the system will recover correctly. There may also be cases where the entire system has degraded to a state
from which reboots and node replacements cannot recover it. For such states, you can deploy a new cluster, and then recover
the entire system using a recent backup taken from the old cluster.
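The decision flow above can be summarized in a short triage sketch. The function name and returned action strings are purely illustrative; they are not part of any Crosswork API.

```python
# Hypothetical triage helper condensing the guidance in this topic.
# Inputs are counts of failed nodes taken from alarms.

def recovery_action(failed_hybrid: int, failed_worker: int) -> str:
    if failed_hybrid >= 2:
        # Double fault: reboot/replace may not recover the system.
        return "deploy a new cluster and restore from a recent backup"
    if failed_hybrid == 1 or failed_worker > 0:
        return ("reboot or replace (erase and re-add) the failed nodes; "
                "if trouble persists, perform a clean system reboot")
    # No node failures: the fault is likely an application or service.
    return "check the alarm and restart the identified service"

print(recovery_action(2, 0))  # deploy a new cluster and restore ...
```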
Important
|
-
VM shutdown is not supported on a 3 VM cluster that is running the Crosswork Network Controller solution. If a VM fails, the
remaining two VMs cannot support all the pods being migrated from the failed VM. You must deploy additional worker nodes to
enable VM shutdown.
-
Reboot of one of the VMs is supported in a 3 VM cluster. In case of a reboot, the VM restore can take from 5 minutes (if the
orch pod is not running in the rebooted VM) up to 25 minutes (if the orch pod is running in the rebooted VM).
|
The following two sections describe the steps to follow in each case.