Test failover workflow in VMware SRM
Test failover is a two-part process:
This section first covers the replica modes, then the general workflow of the test failover operation, which is the same regardless of replica mode. The subsequent sections deal specifically with the configuration and execution of the different test failover modes.
In general, there are three ways to run a test failover with the SRDF SRA:
The first of these is the preferred method as it requires the least input from the user. Once the parameter is set, the SRA will create the necessary snapshots, linked devices, and present those devices to the recovery site. Upon cleanup the user can decide whether to delete the snapshot devices or retain them for future tests.
The second option requires that the user pre-create the devices that the SRA will link to the snapshots. The XML file must be modified to show the relationship between the R2 devices and the pre-created devices. Customers who are unable to create new devices for the SRM test (for example, because of cache limitations) or who already have devices available might select this option.
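Because the second option depends on a hand-maintained relationship between R2 devices and pre-created devices, it can help to sanity-check the intended pairing before editing the XML file. The following is a minimal sketch of such a check, not part of the SRA. The function name `check_device_pairs`, the device IDs, and the tuple layout are all hypothetical, and the sketch does not read or write the SRA's XML file.

```python
from collections import Counter


def check_device_pairs(r2_devices, pairs):
    """Return a list of problems found in a proposed R2 -> target pairing.

    r2_devices: iterable of R2 device IDs that the test should cover.
    pairs: list of (r2_device, target_device) tuples (hypothetical layout).
    """
    problems = []
    sources = Counter(src for src, _ in pairs)
    targets = Counter(tgt for _, tgt in pairs)
    r2_set = set(r2_devices)

    # Every R2 device should appear exactly once as a source.
    for dev in sorted(r2_set):
        if sources[dev] == 0:
            problems.append(f"R2 {dev} has no target device")
        elif sources[dev] > 1:
            problems.append(f"R2 {dev} is paired more than once")

    # A source that is not a known R2 is probably a typo.
    for src in sorted(set(sources) - r2_set):
        problems.append(f"{src} is paired but is not a listed R2")

    # A target should be used once and should not itself be an R2.
    for tgt, count in sorted(targets.items()):
        if count > 1:
            problems.append(f"target {tgt} is used by {count} R2 devices")
        if tgt in r2_set:
            problems.append(f"target {tgt} is itself an R2 device")
    return problems


# Example with made-up device IDs: a clean pairing reports no problems.
issues = check_device_pairs(["00A1", "00A2"],
                            [("00A1", "00B1"), ("00A2", "00B2")])
print(issues or "pairing looks consistent")
```

An empty result only means the pairing is internally consistent; it does not confirm that the devices exist on the array or are sized correctly.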
The final option is to use the R2s directly, without a copy. This has the benefit of not requiring additional devices for the test. The SRA halts replication between the protection and recovery sites for the duration of the test, then during cleanup discards any changes made to the R2 and resynchronizes the R2 from the R1. This option is not recommended, however, because if the protection site is lost during the test, the recovery site cannot receive any changes that occurred on the R1 during the test. Furthermore, any changes made to the R2 during the test are then part of the recovered environment. Therefore, if this option is selected, Dell recommends taking a targetless SnapVX copy of the R2 before testing commences. This provides a viable backup if the R1 is lost after changes have already been made to the R2 during testing. Details on how this might be scripted can be found in the following section, Targetless Gold Copies.
Before a test failover can be executed the following requirements must be met:
“Failed to create snapshots of replica devices. Failed to create snapshot of replica consistency group test_cg. SRA command 'testFailoverStart' failed for consistency group 'test_cg'. The devices don't have consistency enabled. Please check and enable consistency on the devices.”
Refer to the section Consistency groups for more information on consistency.
Note: In Cascaded SRDF/Star or Cascaded SRDF/Non-Star environments that have been recovered to the Asynchronous target site, test failover, with or without TimeFinder, is not supported. Only full recovery back to the original workload site is supported with the SRDF SRA.
Once all required configurations are complete and the user is ready to perform a test failover operation, the user must select the desired Recovery Plan and click the Test link as shown in Figure 52.
Once the link is clicked, a short series of windows pops up to confirm the test failover operation. The first window offers a check-box called “Replicate Recent changes to recovery site”. This option, which enables or disables the “SyncOnce” operation, is present for all storage vendors. Though it may be relevant for other vendors and their respective replication technologies, it is ignored by the SRDF SRA whether checked or unchecked. SRDF and consistency technology (as well as additional SRA requirements) ensure that an up-to-date and consistent copy is available on the R2 side. Steps 3 and 4 complete the wizard as seen in Figure 53.
At a high level, a recovery plan test involves the following:
Once the user has confirmed the test failover operation can proceed, the Recovery Plan will be initiated in test mode. A completed test recovery can be seen in Figure 54. The test environment will remain operational until a Cleanup operation has been executed.
Once a recovery plan has been tested, the test environment can be discarded and reset by using the “Cleanup” operation offered by SRM. The Cleanup operation automatically reverts all changes incurred by the recovery plan test and allows for subsequent failover operations.
Before resetting the environment after a test failover, ensure that the recovery plan worked as desired. Verify the success of any custom scripts, application functionality, networking, etc. Once all facets of the test have been verified by the involved stakeholders a Cleanup operation can be executed.
Note: After a test failover has been executed, an actual failover or another test failover cannot be run until a Cleanup operation has occurred. It is advisable to run a Cleanup operation as soon as the test environment is no longer needed to allow for any subsequent operations to be run without delay.
As can be seen in Figure 55, a Cleanup can only be run against a recovery plan if the recovery plan status is “Test Complete”. Furthermore, even if a test failover was not entirely successful, a Cleanup operation must still be executed before another test failover can be attempted. The “Test Complete” status is assigned to the Recovery Plan regardless of the level of success reached by the test failover.
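The ordering rules described here (a test always ends in “Test Complete”, Cleanup is only available in that status, and no further test or failover can run until Cleanup has occurred) can be summarized as a small state model. This is an illustrative sketch only, not SRM code: the class `RecoveryPlanModel`, its method names, and the `READY` state label are assumptions made for the example.

```python
from enum import Enum


class PlanState(Enum):
    READY = "Ready"                  # assumed label for a plan with no pending test
    TEST_COMPLETE = "Test Complete"  # status named in the text


class RecoveryPlanModel:
    """Toy model of the test / cleanup ordering rules (hypothetical)."""

    def __init__(self):
        self.state = PlanState.READY

    def run_test(self, succeeded=True):
        if self.state is not PlanState.READY:
            raise RuntimeError("Run Cleanup before another test failover")
        # The status becomes Test Complete regardless of how far the test got.
        self.state = PlanState.TEST_COMPLETE
        return succeeded

    def run_failover(self):
        if self.state is not PlanState.READY:
            raise RuntimeError("Run Cleanup before an actual failover")
        return "failover started"

    def run_cleanup(self):
        if self.state is not PlanState.TEST_COMPLETE:
            raise RuntimeError("Cleanup requires the Test Complete status")
        self.state = PlanState.READY


plan = RecoveryPlanModel()
plan.run_test(succeeded=False)   # a partly failed test still needs Cleanup
print(plan.state.value)
plan.run_cleanup()
print(plan.state.value)
```

Calling `run_test()` or `run_failover()` on the model between the test and the Cleanup raises an error, which mirrors why running Cleanup promptly keeps the plan available for a real failover.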
The Cleanup process is initiated, in a similar fashion to the test failover process, by clicking on the Cleanup link (broom) after selecting the appropriate Recovery Plan. This can be seen in Figure 56.
The Cleanup link launches a set of windows, similar to those of the original test operation, to confirm the reset activities that it will execute. The first attempt at running a Cleanup after a particular test failover offers no configurable parameters and simply displays details for confirmation. This set of screens is shown in Figure 57.
Figure 58 shows the steps taken by the Cleanup process itself.
Figure 58. Cleanup operation steps in SRM
Due to a variety of reasons, the first attempt at a Cleanup operation may fail. Typical causes include:
Note: Errors reported in the SRM interface can often be very generic. Refer to the SRDF SRA log on the recovery site if the error indicates that the failure is related to storage operations.
In cases such as these, the first Cleanup operation, which does not permit the use of force, will fail. This is because on the first run the Cleanup operation does not tolerate a failure in any step of the Cleanup process. Therefore, if the Cleanup process encounters an error, it will immediately fail as in Figure 59.
Once the Cleanup process has failed for the first time the ability to force the Cleanup becomes available. The Cleanup confirmation wizard, when run subsequent to a failure, will now offer a check-box to force the Cleanup as seen in Figure 60. This will alter the behavior of the Cleanup process to ride through any error encountered. Any operation it can complete successfully will be completed and, unlike before, any operation that encounters an error will be skipped.
In general, it is not advisable to resort to a force Cleanup unless an actual failover operation needs to be run immediately and there is no time to troubleshoot the issues encountered in the Cleanup. Otherwise, before using the force option, attempt to resolve the issues and then retry a non-forced Cleanup. If a force Cleanup is used in haste, it may require additional manual intervention afterwards because the SRA and SRM may not be able to recover and ready themselves for another test failover or failover on their own.
When a force Cleanup is run, as seen in Figure 61, users should refer to the logs to identify the exact errors encountered. If necessary, resolve these issues and attempt to execute another test failover as soon as possible to verify the environment is functioning correctly. A very common failure when running Cleanup with the SRDF SRA is the inability to remove snapshot devices. Frequently this is because they were never created due to issues with the XML configuration files. In such cases a Cleanup with force is perfectly acceptable.