vSAN policy attributes establish parameters to protect against node failures, but they may not be the most effective or efficient way to build tolerance for events like rack failures. This section reviews the availability features for vSAN clusters on the VxRail system. It also looks at the availability implications for small VxRail deployments with fewer than four nodes.
vSAN and VxRail systems use fault domains to configure tolerance for rack and site failures. By default, each node is considered its own fault domain. Because vSAN spreads components across fault domains, by default it spreads components across nodes. Consider, for example, a cluster of four (4) four-node VxRail systems, each placed in a different rack. When each four-node system is explicitly defined as a separate fault domain, vSAN spreads redundancy components across the different racks.
In terms of implementation, any host that is not part of another fault domain is considered its own single-host fault domain. VxRail requires at least three fault domains, each with at least one host. Fault domain definitions recognize the physical hardware constructs that represent the domain itself. Once a fault domain is enabled, vSAN applies the active virtual machine storage policy to the entire domain, instead of just to the individual hosts. The minimum number of fault domains required in a cluster is calculated from the FTT attribute: (Number of fault domains) = 2 * (Failures to tolerate) + 1. Administrators can manage fault domains from the vSphere web client, as shown in the figure below.
Figure 51. Managing fault domains
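The fault domain formula given above can be worked through for a few FTT values. The following is a minimal sketch; the function name min_fault_domains is hypothetical and is not part of any vSAN or VxRail tooling.

```python
def min_fault_domains(failures_to_tolerate: int) -> int:
    """Apply the formula from the text: 2 * FTT + 1."""
    if failures_to_tolerate < 0:
        raise ValueError("failures_to_tolerate must be non-negative")
    return 2 * failures_to_tolerate + 1


# Print the fault domain count the formula yields for a few FTT values.
for ftt in (1, 2, 3):
    print(f"FTT={ftt}: {min_fault_domains(ftt)} fault domains")
```

With FTT set to 1, the formula yields three fault domains, which matches the three-fault-domain minimum stated for VxRail.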
The fault domain mechanism detects when the configuration is vulnerable. Consider a cluster that contains four server racks, each with two nodes. If the FTT is set to 1 and fault domains are not enabled, vSAN might store both replicas of an object on hosts in the same rack. In that case, applications are exposed to a potential rack-level failure. With fault domains enabled, vSAN ensures that each protection component (replicas and witnesses) is placed in a separate fault domain, so that a single rack failure cannot affect more than one of them.
The figure below illustrates a four-rack setup, each with two ESXi nodes (a subset of the available hosts in a VxRail system). There are four defined fault domains:
Figure 52. Fault domains for a four-rack VxRail configuration
This configuration guarantees that the replicas of an object are stored in hosts of different rack enclosures, ensuring availability and data protection in case of a rack-level failure.
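The rack-level exposure described above can be illustrated with a small placement check. This is only a sketch of the idea, not how vSAN itself places components: the host names, rack names, and the shared_fault_domains helper are all hypothetical, and the layout simply mirrors the four-rack, two-nodes-per-rack example.

```python
from collections import Counter

# Hypothetical layout: four racks (fault domains), two ESXi hosts in each.
HOST_TO_RACK = {
    "esxi-01": "rack-1", "esxi-02": "rack-1",
    "esxi-03": "rack-2", "esxi-04": "rack-2",
    "esxi-05": "rack-3", "esxi-06": "rack-3",
    "esxi-07": "rack-4", "esxi-08": "rack-4",
}


def shared_fault_domains(component_hosts, host_to_domain):
    """Return the fault domains that hold more than one component of an object."""
    counts = Counter(host_to_domain[host] for host in component_hosts)
    return sorted(domain for domain, n in counts.items() if n > 1)


# Node-level placement only: two components can share a rack.
node_only = shared_fault_domains(["esxi-01", "esxi-02", "esxi-03"], HOST_TO_RACK)

# Rack-aware placement: two replicas and a witness in three different racks.
rack_aware = shared_fault_domains(["esxi-01", "esxi-03", "esxi-05"], HOST_TO_RACK)

print(node_only)
print(rack_aware)
```

An empty result means no single rack failure can remove more than one component of the object. A non-empty result lists the racks whose failure would remove two or more components at once.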
When deploying a cluster that just meets the minimum requirements, it is important to understand the high availability implications. Choosing a 3-node minimum configuration for RAID-1 protection or a 4-node minimum configuration for RAID-5 protection means that the cluster will not be able to self-heal by rebuilding data on another host if one host fails. When a host is in maintenance mode, such as during a node upgrade, the data is exposed to potential loss or inaccessibility if an additional failure occurs.
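The self-healing limitation of minimum configurations can be expressed as a simple host-count check. The sketch below uses only the minimums stated above (3 nodes for RAID-1, 4 nodes for RAID-5) and assumes that self-healing needs at least the minimum number of hosts to remain after one host is lost. It ignores free capacity and other real-world constraints, and the names MIN_HOSTS and can_self_heal are hypothetical.

```python
# Minimum host counts taken from the text above.
MIN_HOSTS = {"RAID-1": 3, "RAID-5": 4}


def can_self_heal(host_count: int, protection: str) -> bool:
    """Return True if, after losing one host, enough hosts remain to rebuild.

    Simplifying assumption: a rebuild is possible only when the surviving
    host count still meets the minimum for the protection method.
    """
    minimum = MIN_HOSTS[protection]
    if host_count < minimum:
        raise ValueError(f"{protection} needs at least {minimum} hosts")
    return host_count - 1 >= minimum


for protection, hosts in (("RAID-1", 3), ("RAID-1", 4), ("RAID-5", 4), ("RAID-5", 5)):
    print(f"{protection} with {hosts} hosts: self-heal = {can_self_heal(hosts, protection)}")
```

Under this assumption, adding one host beyond the minimum is what allows the cluster to rebuild data after a single host failure.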