The real benefit of stretched clustering is automatic failover after a site failure. In test scenario 2, we simulated an outage of Site 1 by simultaneously powering down both ax740xds1N1 and ax740xds1N2 from their respective iDRAC consoles. As the VMs were brought back online in Site 2, we observed how the process was orchestrated through Windows Admin Center and Failover Cluster Manager.
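For reference, the same power-off action can be scripted through the iDRAC racadm utility instead of the iDRAC web console. The following is an illustrative sketch only; the iDRAC IP addresses and credentials are placeholders, not values from the lab, and racadm must be installed on the management workstation:

# Illustrative only: remove power from both Site 1 nodes through their iDRACs.
# The iDRAC IP addresses and credentials below are placeholders.
$site1Idracs = @('192.168.100.11', '192.168.100.12')   # ax740xds1N1, ax740xds1N2 (hypothetical)
foreach ($idrac in $site1Idracs) {
    # 'serveraction powerdown' removes power immediately, simulating an unplanned outage.
    racadm -r $idrac -u root -p 'password' serveraction powerdown
}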
The following figures show the health of the cluster, VMs, and volumes, respectively, before power was removed from the Site 1 nodes.
Figure 14. Healthy cluster reported by Windows Admin Center
Figure 15. Healthy VMs in Site 1 as reported by Failover Cluster Manager
Figure 16. Healthy volumes in Site 1 as reported by Failover Cluster Manager
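The same pre-failure state can be confirmed from PowerShell; a minimal sketch, assuming the commands are run locally on any node of the Bangalore cluster:

# Pre-failure health snapshot, run on any Bangalore cluster node.
Import-Module FailoverClusters
Get-ClusterNode | Format-Table Name, State                        # all four nodes: Up
Get-ClusterGroup | Where-Object GroupType -eq 'VirtualMachine' |
    Format-Table Name, OwnerNode, State                           # VM roles: Online
Get-ClusterSharedVolume | Format-Table Name, OwnerNode, State     # volumes: Online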
In Figure 15, OME-1 is the first VM listed. Its virtual disk was placed on the OM volume, as indicated in Table 3. The VM was running Dell OpenManage Enterprise, a simple-to-use, one-to-many systems management console that provides comprehensive life cycle management for Dell PowerEdge servers and other devices. This test scenario and test scenario 3 focused on the behavior of this VM during failover because it was running a real-world web-based application.
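The placement of the OME-1 virtual disk on the OM volume can also be confirmed from PowerShell on the node that owns the VM; a minimal sketch, assuming the Hyper-V module is available on that node:

# Confirm that the OME-1 virtual hard disk resides on the OM cluster shared volume.
Get-VM -Name 'OME-1' | Get-VMHardDiskDrive |
    Select-Object VMName, ControllerType, Path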
The following figure shows the fully functioning OpenManage Enterprise application before the simulated site failure:
Figure 17. Functioning OpenManage Enterprise web application server
As in the unplanned node failure scenario, the nodes appeared as Isolated in Failover Cluster Manager a couple of minutes after power was removed from the Site 1 nodes, as shown in the following figure:
Figure 18. Bangalore cluster nodes after site outage
The following figures record the VM status and volume status, respectively, in Failover Cluster Manager at approximately the same time.
Figure 19. Unmonitored VMs after site outage
The affected volumes were reported as offline, as shown in the following figure:
Figure 20. Offline volumes after site outage
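The same commands used for the pre-failure snapshot report the degraded states at this point; the sketch below filters the output to the affected resources:

# Run on a surviving Site 2 node shortly after the Site 1 outage.
Get-ClusterNode | Where-Object State -ne 'Up' |
    Format-Table Name, State                      # Site 1 nodes: Isolated, then Down
Get-ClusterGroup |
    Where-Object { $_.GroupType -eq 'VirtualMachine' -and $_.State -ne 'Online' } |
    Format-Table Name, OwnerNode, State           # VM roles not yet back Online
Get-ClusterSharedVolume | Where-Object State -ne 'Online' |
    Format-Table Name, OwnerNode, State           # affected volumes: Offline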
The affected volumes took 15 to 20 minutes to move to Site 2 and come fully back online.
The following figure shows the active volumes coming online in the new site. The replica volumes remained offline until Site 1 was restored to full health. Once Site 1 was back online, synchronous replication began again from the source volumes in Site 2 to their destination replica partners in Site 1.
Figure 21. Volumes moved and fully brought back online in Site 2
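Progress of the volume move and the subsequent resynchronization can be followed from a Site 2 node; a minimal sketch, assuming the Storage Replica PowerShell module is installed:

# Track volume ownership and replication state from a Site 2 node.
Get-ClusterSharedVolume | Format-Table Name, OwnerNode, State   # active volumes now owned by Site 2 nodes

# Per-replica status; after Site 1 is healthy again, NumOfBytesRemaining drains
# to zero as synchronous replication resumes toward the Site 1 replica partners.
Get-SRPartnership
(Get-SRGroup).Replicas |
    Select-Object DataVolume, ReplicationStatus, NumOfBytesRemaining

# Any in-flight storage repair or resynchronization jobs.
Get-StorageJob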
A few minutes after the volumes were moved and brought online in Site 2, all the VMFleet VMs and the OME-1 VM were restarted and reachable. These VMs were randomly distributed across ax740xds2N1 and ax740xds2N2 when they were moved to Site 2, as shown in the following figure.
Figure 22. VMs restarted and running on Site 2
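The distribution of the restarted VMs across the two Site 2 nodes can be confirmed with a quick PowerShell check; a minimal sketch:

# Count VM roles per owner node; only ax740xds2N1 and ax740xds2N2 should appear.
Get-ClusterGroup | Where-Object GroupType -eq 'VirtualMachine' |
    Group-Object OwnerNode |
    Format-Table Name, Count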
The following figure shows the OpenManage Enterprise web application accessible by its hostname and operational after restarting on Site 2, approximately 23 minutes after the failure of Site 1 was initiated. Other applications tested outside the scope of this document took between 20 and 25 minutes to become fully operational.
Note: As a best practice, thoroughly test any application being considered for hosting on a stretched cluster to observe its behavior during an unplanned site failure.
Figure 23. OpenManage Enterprise functional after restarting on Site 2
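Reachability by hostname can also be verified from a management workstation; a minimal sketch, which assumes that the application answers on HTTPS (port 443) and that its hostname matches the VM name OME-1:

# Confirm that the hostname resolves to the new Site 2 address and answers on HTTPS.
Resolve-DnsName -Name 'OME-1'
Test-NetConnection -ComputerName 'OME-1' -Port 443 |
    Select-Object ComputerName, RemoteAddress, TcpTestSucceeded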
Because the VM networks are separate L3 subnets in each site, the OME-1 VM required a new IP address after migrating to Site 2. When moving a server from one L3 subnet to another, you can change the IP address in many ways. In the lab, for the OpenManage Enterprise web application to be accessible by its hostname after it was restarted on Site 2, we assigned the VM a new IP address from the Site 2 subnet and ensured that its DNS record was updated to match.
The OME-1 application was reachable by its hostname within 3 minutes of being restarted in Site 2.
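The exact re-addressing steps are environment-specific. The following is a hypothetical illustration of one common approach, reassigning the guest address and re-registering its DNS record from inside the VM; the interface alias and all addresses are placeholders, and this is not necessarily the exact procedure used in the lab:

# Hypothetical illustration only; interface alias and addresses are placeholders.
# Remove the old Site 1 address, assign a Site 2 address, and re-register DNS.
Remove-NetIPAddress -InterfaceAlias 'Ethernet' -IPAddress '172.16.10.50' -Confirm:$false
New-NetIPAddress -InterfaceAlias 'Ethernet' -IPAddress '172.16.20.50' `
    -PrefixLength 24 -DefaultGateway '172.16.20.1'
Set-DnsClientServerAddress -InterfaceAlias 'Ethernet' -ServerAddresses '172.16.10.10'
Register-DnsClient    # force dynamic re-registration of the host (A) record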
Note: SDN is not currently supported for multisite clusters.