The following sections discuss SRM recovery with 3-site SRDF/Star.
Note: SRDF/Metro does not support SRDF/Star.
Detailed step-by-step configuration of SRDF/Star is beyond the scope of this book, but there are some important considerations to note before configuring it for use with SRM.
The protected SRM server and vCenter server must be located at the appropriate SRDF/Star sites to support failover to the desired site. In other words, the protected VMware compute environment must always be at the workload site, and the recovery VMware compute environment must be at the SRDF/Star synchronous site to support failover to the synchronous site, or at the asynchronous site to support failover to the asynchronous site. The location of the compute environment in an SRDF/Star environment dictates where the SRDF SRA will fail over for a given configuration. Once a topology has been configured, failover may only occur between the original workload site and the configured target site, be it the synchronous site or the asynchronous site. Whichever site does not host one of the two SRM-protected compute environments is deemed a bunker site by the SRDF SRA and therefore will not be managed by the SRDF SRA.
Note: The SRDF SRA advanced option FailoverToAsyncSite must be set to NO to allow recovery to the Synchronous site or YES to allow recovery to the Asynchronous site. This option must be set identically on both the recovery and protected SRM servers.
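The rule in the note above can be sketched as a small check. This is only an illustration, not part of the SRDF SRA: the function name resolve_star_topology and the idea of passing the two option values in as strings are assumptions made for the example. How the values are actually read from each SRM server is not covered here.

```python
def resolve_star_topology(protected_option: str, recovery_option: str) -> dict:
    """Map the FailoverToAsyncSite values from both SRM servers to a topology.

    Hypothetical helper: the inputs are the option values ("YES" or "NO")
    as configured on the protected and recovery SRM servers.
    """
    protected = protected_option.strip().upper()
    recovery = recovery_option.strip().upper()

    for value in (protected, recovery):
        if value not in ("YES", "NO"):
            raise ValueError(f"unexpected FailoverToAsyncSite value: {value!r}")

    # The option must be set identically on both SRM servers.
    if protected != recovery:
        raise ValueError("FailoverToAsyncSite must match on both SRM servers")

    # The site that does not host an SRM compute environment is the bunker
    # site and is not managed by the SRDF SRA.
    if protected == "YES":
        return {"recovery_site": "asynchronous", "bunker_site": "synchronous"}
    return {"recovery_site": "synchronous", "bunker_site": "asynchronous"}


if __name__ == "__main__":
    print(resolve_star_topology("NO", "NO"))
```

A mismatch between the two servers raises an error in this sketch, which mirrors the requirement that the option be identical on both sides.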
The recovery option “Planned Migration” assures a graceful migration of virtual machines from a local vCenter to a remote vCenter. Any errors that the recovery plan encounters will immediately fail the operation and require the user to remediate these errors and restart the migration process. Therefore, a Planned Migration assumes the following things (among other details):
Before executing a recovery plan failover, it is highly recommended to test the recovery plan first (preferably multiple times) using the “Test” feature offered by SRM. Information on configuring and running a recovery test is discussed in detail in Chapter 4.
The first step is to ensure that SRDF/Star is in a proper state. To run a successful recovery, the SRDF/Star state must be “Protected”. If SRDF/Star is in any other state, either the state must be corrected using PowerMax management applications, or a Disaster Recovery operation, which ignores an invalid SRDF/Star state, may need to be run instead.
Generally, a good indicator of valid SRDF/Star pair status is shown in the “Status” column for a given array pair. If the “Status” column shows a blue directional arrow with “Forward”, the RDF Pair State is valid for Planned Migration. This differs slightly from 2-site SRDF, however, because 3-site SRDF places additional limitations on the device states that allow failover, and these are not reflected in the “Status” column. Not only do the device pairs need to be replicating, but they also need to be consistent. In SRDF/Star terminology, this means that both the Asynchronous site and the Synchronous site must be in the “Protected” state and SRDF/Star must be in the overall “Protected” state. An example of a valid configuration is seen in Figure 132.
The SRA requires that SRDF/Star is “Protected”, so if either site is not in the “Protected” state, SRDF/Star cannot be in the proper state and Planned Migration will fail. After a recovery plan fails due to an invalid SRDF/Star state, the SRDF SRA log will report errors such as those below:
[01/16 13:18:00 16212 0569 SraStarGroup::IsValidStarState] STAR Group state is Unprotected.
[01/16 13:18:00 16212 0985 SraStarGroup::ValidateInputDevices] [WARNING]: The STAR state is not valid. Exiting
[01/16 13:18:00 16212 3018 PrepareFailoverCommand::RunOnGroup] [ERROR]: Input device validation succeeded but one/many of the adapter's conditions is not met. Exiting with failure
Errors of a failed recovery plan due to an invalid SRDF/Star state will take the form of “Failed to demote source consistency group ‘xxxx’” as shown in Figure 133.
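When reviewing a long SRDF SRA log, entries like the ones quoted above can be picked out with a short script. The following is a minimal sketch that assumes the log layout matches the excerpt: a bracketed header containing a timestamp, two numeric fields (whose meaning is not stated here and is treated as opaque), and a Class::Method name, followed by an optional [WARNING] or [ERROR] tag and the message. The function names are hypothetical.

```python
import re

# Header: [MM/DD HH:MM:SS <num> <num> <Class::Method>], then an optional
# [WARNING]/[ERROR] tag, then the message up to the next "[".
# \s+ between fields tolerates the method name wrapping onto the next line.
ENTRY = re.compile(
    r"\[(\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+(\d+)\s+(\d+)\s+([\w:]+)\]\s*"
    r"(?:\[(WARNING|ERROR)\]:\s*)?([^\[]*)"
)

SAMPLE = """[01/16 13:18:00 16212 0569
SraStarGroup::IsValidStarState] STAR Group state is Unprotected.
[01/16 13:18:00 16212 0985
SraStarGroup::ValidateInputDevices] [WARNING]: The STAR state is not valid. Exiting
[01/16 13:18:00 16212 3018
PrepareFailoverCommand::RunOnGroup] [ERROR]: Input device validation succeeded but one/many of the adapter's conditions is not met. Exiting with failure
"""


def parse_entries(text):
    """Return one dict per log entry found in text."""
    entries = []
    for match in ENTRY.finditer(text):
        timestamp, _field1, _field2, source, level, message = match.groups()
        entries.append({
            "time": timestamp,
            "source": source,
            "level": level,  # None when the entry carries no tag
            "message": " ".join(message.split()),
        })
    return entries


def reported_star_state(text):
    """Return the state named in a 'STAR Group state is X' entry, or None."""
    for entry in parse_entries(text):
        found = re.search(r"STAR Group state is (\w+)", entry["message"])
        if found:
            return found.group(1)
    return None


if __name__ == "__main__":
    for entry in parse_entries(SAMPLE):
        print(entry["level"], entry["source"], "-", entry["message"])
    print("Reported Star state:", reported_star_state(SAMPLE))
```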
If the “Status” column shows a different message, such as green arrows with “Failover in Progress”, either manual intervention is required or the Disaster Recovery option needs to be selected. Planned migration will never be allowed in this state.
While the “Status” column is, in general, a good indicator of RDF pair states, it cannot cover the many possible RDF pair states. Therefore, it is advisable to use Solutions Enabler to determine the exact status. An SRDF/Star group is valid for Planned Migration when the 1st Target (the synchronous site), the 2nd Target (the asynchronous site), and the SRDF/Star state itself all report as “Protected”.
At this point, a planned migration can be initiated by selecting the appropriate recovery plan and then selecting the “Recovery” link, as can be seen in Figure 134.
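The validity rule above amounts to a three-way check. The sketch below assumes the three states have already been obtained from Solutions Enabler and are passed in as plain strings. It does not run or parse any Solutions Enabler command, and the function name is hypothetical.

```python
REQUIRED_STATE = "Protected"


def planned_migration_blockers(star_state, sync_target_state, async_target_state):
    """Return a description of each state that blocks Planned Migration.

    An empty list means the overall SRDF/Star state and both targets
    report "Protected".
    """
    observed = {
        "SRDF/Star": star_state,
        "1st Target (synchronous site)": sync_target_state,
        "2nd Target (asynchronous site)": async_target_state,
    }
    return [
        f"{name} is {state}"
        for name, state in observed.items()
        if state != REQUIRED_STATE
    ]


if __name__ == "__main__":
    print(planned_migration_blockers("Protected", "Protected", "Protected"))
    print(planned_migration_blockers("Unprotected", "Protected", "Halted"))
```

Returning the list of offending states, rather than a single boolean, makes it clearer which site needs attention before the plan is retried.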
Once the Recovery link has been selected, a short confirmation wizard appears asking to confirm the initiation of the recovery operation and in which mode the recovery plan should be run. This screen is shown in Figure 135.
As soon as the wizard completes, the recovery operation will commence. During the steps “Prepare Protected Site VMs for Migration” and “Change Recovery Site Storage To Writable”, as shown in Figure 136, the SRDF SRA performs the necessary Star operations on the devices in the protection group to fail over.
The following two tables describe the steps for recovery with Concurrent SRDF/Star. The tables have seven columns:
For Concurrent SRDF/Star recovery to the Synchronous site, the SRA performs the steps listed in Table 20. For Concurrent SRDF/Star recovery to the Asynchronous site, the SRA performs the steps listed in Table 21.
SRDF SRA step # | Issuing SRDF SRA and Solutions Enabler | Step detail | Resulting SRDF/Star state | Site A after step | Site B after step | Site C after step |
1 | Protected site | [OPTIONAL] Create protected site gold copies | Protected | Workload | Protected | Protected |
2 | Protected site | [OPTIONAL] Remove R1 devices from storage group | Protected | Workload | Protected | Protected |
3 | Protected site | Halt Star | Unprotected | Workload | Halted | Halted |
4 | Recovery site | [OPTIONAL] Create recovery site gold copies | Unprotected | Workload | Halted | Halted |
5 | Recovery site | Switch Workload site. The original Workload site is now the Sync site and vice versa. | Unprotected | Disconnected | Workload | Disconnected |
6 | Recovery site | Connect Sync target site | Unprotected | Connected | Workload | Disconnected |
7 | Recovery site | Connect Async target site | Unprotected | Connected | Workload | Connected |
8 | Recovery site | Protect Sync target site | Unprotected | Protected | Workload | Connected |
9 | Recovery site | Protect Async target site | Unprotected | Protected | Workload | Protected |
10 | Recovery site | Enable Star | Protected | Protected | Workload | Protected |
11 | Recovery site | [OPTIONAL] Add R2 devices to storage group | Protected | Protected | Workload | Protected |
SRDF SRA step # | Issuing SRDF SRA and Solutions Enabler | Step detail | Resulting SRDF/Star state | Site A after step | Site B after step | Site C after step |
1 | Protected site | [OPTIONAL] Create protected site gold copies | Protected | Workload | Protected | Protected |
2 | Protected site | [OPTIONAL] Remove R1 devices from storage group | Protected | Workload | Protected | Protected |
3 | Protected site | Halt Star | Unprotected | Workload | Halted | Halted |
4 | Recovery site | [OPTIONAL] Create recovery site gold copies | Unprotected | Workload | Halted | Halted |
5 | Recovery site | Switch Workload site. The original Workload site is now the Async site and vice versa. | Unprotected | Disconnected | Disconnected | Workload |
6 | Recovery site | Connect Async target site | Unprotected | Disconnected | Connected | Workload |
7 | Recovery site | Connect Sync target site | Unprotected | Connected | Connected | Workload |
8 | Recovery site | Protect Async target site | Unprotected | Connected | Protected | Workload |
9 | Recovery site | [OPTIONAL] Add R2 devices to storage group | Unprotected | Connected | Protected | Workload |
Note an important difference between the two recoveries: after an Asynchronous recovery, Star does not end in a fully protected state, whereas after a Synchronous recovery it does. When the workload site is located at the tertiary site, which is at an asynchronous distance from the other two sites, it cannot replicate synchronously to the Synchronous target site. Without a Synchronous replication site, Star cannot be fully protected. Consequently, the SRDF SRA can only “Connect” the Synchronous site, which leaves it in Adaptive Copy Disk mode.
Unlike 2-site SRDF or Non-Star 3-site, the SRA will always reverse the direction of replication and enable consistency during the failover process. For Non-Star SRDF solutions, this replication reversal is reserved for the “Reprotect” operation. Due to the requirements inherent in SRDF/Star, these steps are included in the failover process.
Once the SRDF/Star switch workload process has completed, the target site devices are write-enabled, so the devices will be mounted and the virtual machines will be registered and powered on. The SRDF/Star switch workload process will swap the RDF personalities of the devices and change the site definitions within the SRDF/Star configuration. With Concurrent SRDF/Star, the R1 devices will become R2 devices. Similarly, the R2 devices (on the Synchronous site or the Asynchronous site, depending on the configuration) are now R1 devices.
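To make the difference in end states concrete, the rows of Table 20 and Table 21 can be transcribed as data and compared. This sketch only restates the tables (with step details abbreviated). It does not issue any SRDF/Star operations, and the names used are hypothetical.

```python
# Each row: (step detail, Star state, Site A, Site B, Site C),
# transcribed from Table 20 (recovery to the Synchronous site).
SYNC_RECOVERY = [
    ("Create protected site gold copies", "Protected", "Workload", "Protected", "Protected"),
    ("Remove R1 devices from storage group", "Protected", "Workload", "Protected", "Protected"),
    ("Halt Star", "Unprotected", "Workload", "Halted", "Halted"),
    ("Create recovery site gold copies", "Unprotected", "Workload", "Halted", "Halted"),
    ("Switch Workload site", "Unprotected", "Disconnected", "Workload", "Disconnected"),
    ("Connect Sync target site", "Unprotected", "Connected", "Workload", "Disconnected"),
    ("Connect Async target site", "Unprotected", "Connected", "Workload", "Connected"),
    ("Protect Sync target site", "Unprotected", "Protected", "Workload", "Connected"),
    ("Protect Async target site", "Unprotected", "Protected", "Workload", "Protected"),
    ("Enable Star", "Protected", "Protected", "Workload", "Protected"),
    ("Add R2 devices to storage group", "Protected", "Protected", "Workload", "Protected"),
]

# Transcribed from Table 21 (recovery to the Asynchronous site).
ASYNC_RECOVERY = [
    ("Create protected site gold copies", "Protected", "Workload", "Protected", "Protected"),
    ("Remove R1 devices from storage group", "Protected", "Workload", "Protected", "Protected"),
    ("Halt Star", "Unprotected", "Workload", "Halted", "Halted"),
    ("Create recovery site gold copies", "Unprotected", "Workload", "Halted", "Halted"),
    ("Switch Workload site", "Unprotected", "Disconnected", "Disconnected", "Workload"),
    ("Connect Async target site", "Unprotected", "Disconnected", "Connected", "Workload"),
    ("Connect Sync target site", "Unprotected", "Connected", "Connected", "Workload"),
    ("Protect Async target site", "Unprotected", "Connected", "Protected", "Workload"),
    ("Add R2 devices to storage group", "Unprotected", "Connected", "Protected", "Workload"),
]


def final_state(steps):
    """Return (Star state, Site A, Site B, Site C) after the last step."""
    return steps[-1][1:]


def fully_protected(steps):
    """True when Star is Protected and both non-workload sites are Protected."""
    star, *sites = final_state(steps)
    return star == "Protected" and sorted(sites) == ["Protected", "Protected", "Workload"]


if __name__ == "__main__":
    print("Sync recovery end state: ", final_state(SYNC_RECOVERY))
    print("Async recovery end state:", final_state(ASYNC_RECOVERY))
```

In this transcription the Asynchronous recovery ends with Site A only “Connected” and Star “Unprotected”, which is the behavior described above.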
In addition to the R2/21 devices being mounted on the recovery-side ESXi hosts, the R1 volumes will be unmounted and detached from the protection-side ESXi hosts.
When the VMFS volumes on the R2 devices are mounted, the ESXi kernel must first resignature the VMFS because its signature is invalid and the volume is therefore seen as a copy. The signature is invalid because the R1 and R2 devices have different world-wide names (WWNs) but hold an identical VMFS volume. The VMFS volume was (most likely) originally created on the R1 device, and the signature of the VMFS is based, in part, on the WWN of the underlying device. Since the WWN differs between the R1 and the R2 while the signature is copied over unchanged, the ESXi kernel will identify a WWN/VMFS signature mismatch and resignature the volume.
The VMware kernel automatically renames VMFS volumes that have been resignatured by adding a “SNAP-XXXXXX” prefix to the original name to denote that it is a copied file system. VMware Site Recovery Manager provides an advanced setting (disabled by default), storageProvider.fixRecoveredDatastoreNames, that will cause this prefix to be automatically removed during the recovery plan. Check this option on the recovery site to enable this automatic prefix removal behavior.
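The effect of the prefix removal can be illustrated with a short sketch. It is not how SRM implements the setting. It assumes the prefix has the form "snap-", a run of hexadecimal characters, and a trailing hyphen; the exact format is an assumption, and the datastore names are made up.

```python
import re

# Assumed prefix shape: "snap-" + hex characters + "-", case-insensitive.
SNAP_PREFIX = re.compile(r"^snap-[0-9a-f]+-", re.IGNORECASE)


def strip_snap_prefix(datastore_name: str) -> str:
    """Return the datastore name without a leading resignature prefix."""
    return SNAP_PREFIX.sub("", datastore_name, count=1)


if __name__ == "__main__":
    for name in ("snap-1a2b3c4d-Prod_DS01", "Prod_DS01"):
        print(name, "->", strip_snap_prefix(name))
```

Names without the assumed prefix are returned unchanged, so the function is safe to apply to every recovered datastore.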
The option “Disaster Recovery” should be selected for recovery when there are issues with the infrastructure that will prevent a graceful recovery of virtual machines from a local vCenter to a remote vCenter. Unlike the “Planned Migration” option, most errors that the recovery plan encounters will be ignored by SRM. The only errors that will prevent a recovery in disaster recovery mode are failures in the recovery site infrastructure. Anything from minor errors to complete failure of the protected site infrastructure will not prevent recovery.
If possible, the “Planned Migration” option is preferable, as it is more likely to allow a clean subsequent reprotection and/or failback. Therefore, if errors are encountered, an earnest attempt to remediate them should be made. If these errors cannot be fixed (due to equipment failure, or because time is of the essence and the virtual environment must be recovered as quickly as possible), the “Disaster Recovery” option should be selected.
This section will discuss disaster recovery failover in two parts: