The following sections describe the steps to take in the event of different failure types.
Handling a node failure at either site in a stretched cluster is no different from handling one in a traditional or standalone Azure Stack HCI cluster. A complete node failure can result from operating system or HBA corruption, or from a complete hardware failure on the node. In either case, restoring system functionality is the priority.
The high-level steps are:
- Replace the hardware as needed.
- Re-install the operating system on the operating system drives (if needed).
- Join the system to the domain.
- Assign the new node IP addresses that belong to the site where the node is hosted.
- Add the node to the existing stretched cluster.
- The cluster adds the drives to the correct pool, based on the IP subnets used or the cluster fault domains added.
- Wait for the storage jobs to complete.
- During this process, workloads on the affected site keep running, and replication should not be interrupted.
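
The node recovery steps above can be sketched in PowerShell as follows. This is a minimal sketch, not a definitive procedure: the cluster name, node name, domain, interface alias, and IP values are all hypothetical placeholders, and it assumes the operating system is already reinstalled and the failover clustering and storage cmdlets are available.

```powershell
# All names and addresses below are hypothetical; replace them with your own values.
$clusterName = "StretchCluster01"   # assumed name of the existing stretched cluster
$newNode     = "Site1-Node2"        # assumed name of the rebuilt node
$domain      = "contoso.com"        # assumed Active Directory domain

# --- Run on the rebuilt node ---
# Assign an IP address from the subnet of the site that hosts the node.
New-NetIPAddress -InterfaceAlias "Management" -IPAddress "10.1.0.12" `
    -PrefixLength 24 -DefaultGateway "10.1.0.1"

# Join the domain and restart.
Add-Computer -DomainName $domain -Credential (Get-Credential) -Restart

# --- Run on an existing cluster node once the rebuilt node is back online ---
Add-ClusterNode -Cluster $clusterName -Name $newNode

# Check which site fault domain the node was placed in.
Get-ClusterFaultDomain

# Poll until no storage jobs remain unfinished.
while (Get-StorageJob | Where-Object JobState -ne 'Completed') {
    Get-StorageJob | Format-Table Name, JobState, PercentComplete
    Start-Sleep -Seconds 30
}
```

The polling loop only reports progress; it does not change how the repair jobs are scheduled.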
A site failure in a stretched cluster topology requires rebuilding all of the nodes of the affected site. If the failure happens at the primary site, the following scenarios occur:
- All volumes hosted on the affected site and associated VMs become inaccessible.
- After a brief period, the volumes move to the secondary site.
- The VMs restart on the secondary site.
- Depending on whether synchronous or asynchronous replication is being used, you either have zero data loss or data loss within the limits of the defined RPO:
  - For replica volumes configured with synchronous replication, the VMs are crash consistent. Application recovery depends on the backup and recovery options available for the application.
  - For replica volumes configured with asynchronous replication, the VMs are not crash consistent. The default RPO is 30 seconds, which you can change using PowerShell or Windows Admin Center. Application recovery still depends on the backup and recovery options available for the application.
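
As an illustration of changing the RPO with PowerShell, the sketch below inspects the existing partnerships and then raises the asynchronous RPO to 60 seconds. The node and replication group names are hypothetical, and the example assumes the `-AsyncRPO` parameter of `Set-SRPartnership` (a value in seconds) applies to your partnership; verify this against the Storage Replica cmdlet reference for your build before relying on it.

```powershell
# List the current Storage Replica partnerships to find the real names.
Get-SRPartnership

# Hypothetical node and replication group names; substitute the values
# reported by Get-SRPartnership in your environment.
Set-SRPartnership -SourceComputerName "Site1-Node1" -SourceRGName "ReplicationGroup01" `
    -DestinationComputerName "Site2-Node1" -DestinationRGName "ReplicationGroup02" `
    -AsyncRPO 60   # assumed: RPO in seconds for asynchronous replication
```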
Follow these steps to recover the nodes on the failed site:
- Remove the failed nodes from the cluster and remove their computer names from Active Directory.
- Remove the Storage Replica partnerships (SRPartnership) and replication groups (SRGroups) using PowerShell cmdlets. You can also disable replication from Failover Cluster Manager.
- Bring up all the nodes on the affected site. The node names and IPs used should be the same as those used before the crash.
- Join the nodes to the domain.
- Add all the nodes to the existing stretched cluster at the same time.
- All drives in the rebuilt site are added to a new pool.
- Re-create and enable replication for replica volumes and associated log volumes using Failover Cluster Manager or PowerShell cmdlets.
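
The site recovery steps above can be sketched in PowerShell as follows. Treat this as a rough outline under stated assumptions rather than a tested runbook: every node, cluster, replication group, and volume name is a hypothetical placeholder, the `Remove-ADComputer` call assumes the ActiveDirectory module is installed, and the volume paths assume Cluster Shared Volumes.

```powershell
# All names below are hypothetical; replace them with your own values.
$clusterName = "StretchCluster01"
$failedNodes = "Site1-Node1", "Site1-Node2"

# 1. Remove the failed nodes from the cluster and from Active Directory.
foreach ($node in $failedNodes) {
    Remove-ClusterNode -Cluster $clusterName -Name $node -Force
    Remove-ADComputer -Identity $node -Confirm:$false
}

# 2. Remove the Storage Replica partnerships, then the replication groups.
#    -IgnoreRemovalFailure is used here because the partner site is unreachable.
Get-SRPartnership | Remove-SRPartnership -IgnoreRemovalFailure -Force
Get-SRGroup | Remove-SRGroup -Force

# 3. After the nodes are rebuilt with their original names and IP addresses
#    and joined to the domain, add them to the cluster in one call.
Add-ClusterNode -Cluster $clusterName -Name $failedNodes

# 4. Re-create replication for each replica volume and its log volume.
#    Here the surviving site acts as the source (an assumption; adjust as needed).
New-SRPartnership -SourceComputerName "Site2-Node1" -SourceRGName "ReplicationGroup02" `
    -SourceVolumeName "C:\ClusterStorage\Volume2" -SourceLogVolumeName "C:\ClusterStorage\Log2" `
    -DestinationComputerName "Site1-Node1" -DestinationRGName "ReplicationGroup01" `
    -DestinationVolumeName "C:\ClusterStorage\Volume1" -DestinationLogVolumeName "C:\ClusterStorage\Log1" `
    -ReplicationMode Synchronous
```

Step 4 presumes that the destination volumes and log volumes have already been created on the new pool at the rebuilt site.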