vSphere provides several solutions to ensure a high level of availability, during both planned and unplanned downtime scenarios. vSphere depends on the following technologies to make sure that virtual machines running in the environment remain available:
Together with vSAN, vSphere HA produces a resilient, highly available solution for VxRail virtual machine workloads. vSphere HA protects virtual machines by restarting them in the event of a host failure. It leverages the ESXi cluster configuration to ensure rapid recovery from outages, providing cost-effective high availability for applications running in virtual machines. When a host joins a cluster, its resources become part of the cluster resources, and the cluster manages the resources of all hosts within it. In a vSphere environment, vSphere HA, DRS, and the vSAN technology that provides VxRail software-defined storage capabilities all operate at the ESXi cluster level. See the figure below.
Figure 38. vSphere HA
vSphere HA provides several points of protection for applications:
It protects against server (host) failure by restarting the affected virtual machines on other hosts within the cluster.
It continuously monitors virtual machines and resets any virtual machine in which a failure is detected.
It protects against datastore accessibility failures and provides automated recovery for affected virtual machines. With Virtual Machine Component Protection (VMCP), the affected VMs are restarted on other hosts that still have access to the datastores.
It protects virtual machines against network isolation by restarting them if their host becomes isolated on the management or VMware vSAN network. This protection is provided even if the network has become partitioned.
Once vSphere HA is configured, all workloads are protected. No actions are required to protect new virtual machines and no special software needs to exist within the application or virtual machine.
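The four points of protection listed above can be summarized as a simple lookup from failure type to response. The following is a minimal sketch for illustration only: the enum, its member names, and the response strings are hypothetical and are not part of any vSphere API.

```python
from enum import Enum, auto


class Failure(Enum):
    """Failure types from the list above (names are illustrative)."""
    HOST_FAILURE = auto()
    VM_FAILURE = auto()
    DATASTORE_INACCESSIBLE = auto()
    HOST_ISOLATED = auto()


# Responses paraphrase the text above; they are not real HA settings.
RESPONSES = {
    Failure.HOST_FAILURE: "restart VMs on other hosts in the cluster",
    Failure.VM_FAILURE: "reset the VM",
    Failure.DATASTORE_INACCESSIBLE:
        "restart VMs on hosts that still have datastore access",
    Failure.HOST_ISOLATED: "restart VMs",
}


def ha_response(failure: Failure) -> str:
    """Return the response described in the text for a failure type."""
    return RESPONSES[failure]


print(ha_response(Failure.VM_FAILURE))  # reset the VM
```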
Included in the failover capabilities in vSphere HA is a service called the Fault Domain Manager (FDM) that runs on the member hosts (shown in the figure below). After the FDM agents have started, the cluster hosts become part of a fault domain, and a host can exist in only one fault domain at a time. Hosts cannot participate in a fault domain if they are in maintenance mode, standby mode, or disconnected from vCenter Server.
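The participation rule above reduces to three conditions on a host. The sketch below models it with a hypothetical Host record; the class and field names are assumptions made for illustration and do not correspond to vSphere object properties.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical host record; field names are illustrative."""
    name: str
    in_maintenance_mode: bool = False
    in_standby_mode: bool = False
    connected_to_vcenter: bool = True


def can_join_fault_domain(host: Host) -> bool:
    """A host is excluded if it is in maintenance mode, in standby mode,
    or disconnected from vCenter Server."""
    return (not host.in_maintenance_mode
            and not host.in_standby_mode
            and host.connected_to_vcenter)


print(can_join_fault_domain(Host("esxi-01")))                            # True
print(can_join_fault_domain(Host("esxi-02", in_maintenance_mode=True)))  # False
```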
Figure 39. Fault Domain Manager
FDM uses a master-slave operational model (see the figure above). An automatically designated master host manages the fault domain, and the remaining hosts are slaves. FDM agents on slave hosts communicate with the FDM service on the master host using a secure TCP connection. In VxRail clusters, vSphere HA is enabled only after the vSAN cluster has been configured. Once vSphere HA has started, vCenter Server contacts the master host agent and sends it a list of cluster-member hosts along with the cluster configuration. That information is saved to local storage on the master host and then pushed out to the slave hosts in the cluster. If additional hosts are added to the cluster during normal operation, the master agent sends an update to all hosts in the cluster.
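The configuration flow described above (vCenter Server sends the configuration to the master, the master saves it locally and pushes it to the slaves, and later host additions trigger an update to all hosts) can be sketched as a toy model. The class names, method names, and the "example_setting" key are hypothetical; this is not the FDM interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FdmAgent:
    """Toy stand-in for an FDM agent on one host (illustrative only)."""
    host: str
    local_storage: Dict[str, object] = field(default_factory=dict)

    def save(self, config: Dict[str, object]) -> None:
        # Each agent keeps its own copy of the cluster configuration.
        self.local_storage = dict(config)


@dataclass
class MasterAgent(FdmAgent):
    slaves: List[FdmAgent] = field(default_factory=list)

    def _distribute(self, config: Dict[str, object]) -> None:
        self.save(config)            # saved to the master's local storage
        for slave in self.slaves:    # then pushed out to every slave
            slave.save(config)

    def receive_from_vcenter(self, hosts: List[str],
                             settings: Dict[str, object]) -> None:
        self._distribute({"hosts": list(hosts), "settings": dict(settings)})

    def add_host(self, new_slave: FdmAgent) -> None:
        # A host added during normal operation triggers an update to all hosts.
        self.slaves.append(new_slave)
        hosts = list(self.local_storage.get("hosts", [])) + [new_slave.host]
        self._distribute({"hosts": hosts,
                          "settings": self.local_storage.get("settings", {})})


master = MasterAgent("esxi-01", slaves=[FdmAgent("esxi-02")])
master.receive_from_vcenter(["esxi-01", "esxi-02"], {"example_setting": 1})
master.add_host(FdmAgent("esxi-03"))
print(master.slaves[0].local_storage["hosts"])
# ['esxi-01', 'esxi-02', 'esxi-03']
```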
The master host provides an interface to vCenter Server for querying and reporting on the state of the fault domain and virtual-machine availability. vCenter Server governs the vSphere HA agent, identifying the virtual machines to protect and maintaining a VM-to-host compatibility list.
The agent learns of state changes through hostd, and vCenter Server learns of them through vpxa. The master host monitors the health of the slaves and takes responsibility for virtual machines that had been running on a failed slave host. Meanwhile, the slave host monitors the health of its local virtual machines and sends state changes to the master host. A slave host also monitors the health of the master host.
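The master's monitoring role can be illustrated with a small sketch. It assumes, purely for illustration, that slave health is tracked as the time of each slave's last health report and that a slave is treated as failed after an arbitrary timeout; the function names, host names, and timings are hypothetical and do not reflect actual FDM behavior or settings.

```python
from typing import Dict, List


def find_failed_slaves(last_seen: Dict[str, float], now: float,
                       timeout_s: float) -> List[str]:
    """Return slave hosts whose last health report is older than timeout_s."""
    return sorted(host for host, seen in last_seen.items()
                  if now - seen > timeout_s)


def vms_needing_takeover(failed_hosts: List[str],
                         placement: Dict[str, List[str]]) -> List[str]:
    """Return the VMs that had been running on the failed slave hosts."""
    return [vm for host in failed_hosts for vm in placement.get(host, [])]


# Example with made-up host names, VM names, and timings.
last_seen = {"esxi-02": 100.0, "esxi-03": 88.0}
placement = {"esxi-02": ["vm-a"], "esxi-03": ["vm-b", "vm-c"]}
failed = find_failed_slaves(last_seen, now=102.0, timeout_s=10.0)
print(failed)                                   # ['esxi-03']
print(vms_needing_takeover(failed, placement))  # ['vm-b', 'vm-c']
```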
vSphere HA is configured, managed, and monitored through vCenter Server. Cluster configuration data is maintained by the vCenter Server vpxd process. If vpxd reports any cluster configuration changes to the master agent, the master advertises a new copy of the cluster configuration information, and each slave then fetches the updated copy and writes the new information to local storage. Each datastore includes a list of protected virtual machines. The list is updated after vCenter Server notices any user-initiated power-on (protected) or power-off (unprotected) operation.
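The protected-list rule above (a user-initiated power-on marks a virtual machine as protected, a user-initiated power-off marks it as unprotected) can be sketched as a small set-backed model. The class and method names are hypothetical, and the on-datastore format of the real list is not modeled.

```python
class ProtectedVmList:
    """Toy model of a protected-VM list (illustrative only)."""

    def __init__(self) -> None:
        self._protected = set()

    def on_user_power_on(self, vm: str) -> None:
        # A user-initiated power-on makes the VM protected.
        self._protected.add(vm)

    def on_user_power_off(self, vm: str) -> None:
        # A user-initiated power-off makes the VM unprotected.
        self._protected.discard(vm)

    def is_protected(self, vm: str) -> bool:
        return vm in self._protected


vms = ProtectedVmList()
vms.on_user_power_on("vm-a")
print(vms.is_protected("vm-a"))  # True
vms.on_user_power_off("vm-a")
print(vms.is_protected("vm-a"))  # False
```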