Reprotection with SRDF/Star

After a recovery plan has run, the environment often must remain protected against failure to preserve its resilience and to meet disaster recovery objectives. Site Recovery Manager offers “reprotection,” an extension to recovery plan management that enables the environment at the recovery site to establish replication and protection back to the original protected site. This allows users to recover the environment quickly and easily back to the original site if necessary.
A completely unassisted reprotection by VMware SRM may not always be possible, depending on the circumstances and results of the preceding recovery operation. Recovery plans executed in “Planned Migration” mode are the likeliest candidates for a successful automated reprotection by SRM. Exceptions occur if certain failures or changes happen between the time the recovery plan was run and the time the reprotection operation is initiated; those situations may cause the reprotection to fail. Similarly, if a recovery plan was executed in “Disaster Recovery” mode, any persisting failures may cause the reprotection to fail partially or completely.
These different situations are discussed in the subsequent sections. Note that these sections cover both Concurrent SRDF/Star and Cascaded SRDF/Star configurations.
Note: If a reprotect operation fails, a discover devices operation, performed at the recovery site, may be necessary to resolve the failure. The use of device/composite (consistency) groups, which is a prerequisite, is essential in reducing the likelihood of such failures.
Reprotect after Planned Migration
The scenario most likely to lead to a successful reprotection is a reprotection after a planned migration. In a planned migration, no failures in either the storage or the compute environment preceded the recovery operation. Therefore, reversing the recovery plans and protection groups, as well as swapping and establishing replication in the reverse direction, is possible.
Reprotect is only available after a recovery operation has occurred, which is indicated by the recovery plan being in the “Recovery Complete” state. A reprotect can be executed by selecting the appropriate recovery plan and selecting the “REPROTECT” link as shown in Figure 154.
Figure 154. Executing a reprotect operation in SRM
The reprotect operation does the following things for Concurrent or Cascaded SRDF/Star environments:
- Reverses protection groups. The protection groups are deleted on the original protection SRM server and are recreated on the original recovery SRM server. The inventory mappings are configured (assuming the user has pre-configured them in SRM on the recovery site) and the necessary “shadow” or “placeholder” VMs are created and registered on the newly designated recovery SRM server.
- Reverses recovery plan. The failed-over recovery plan is deleted on the original recovery SRM server and recreated with the newly reversed protection group.
Figure 155 shows the steps involved in a reprotect operation.
Figure 155. Reprotect operation steps
In the case of Concurrent or Cascaded SRDF/Star environments, the RDF swap and the re-establish are not executed in the reprotect operation by default. Due to the nature and requirements of SRDF/Star, these steps must be included in the failover operation. Therefore, the SRA detects Concurrent or Cascaded SRDF/Star configurations and these reprotect steps become non-operation events for the SRA. If for some reason the Concurrent SRDF/Star environment is not in a fully-protected mode, the SRDF SRA will connect, protect and enable Star as needed. If the Cascaded SRDF/Star environment has one or more legs in a disconnected state, the SRDF SRA will attempt to connect, protect and enable Star.
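As an illustration only, the "connect, protect and enable Star as needed" behavior described above can be modeled as a small function that maps the current state of each Star leg to the actions still required. This is a minimal sketch, not the SRA's actual implementation: the `LegState` values, the leg names, the `star_enabled` flag, and the per-leg ordering of actions are all assumptions made for the example.

```python
from enum import Enum


class LegState(Enum):
    # Hypothetical leg states for this sketch, ordered from least to most protected.
    DISCONNECTED = "Disconnected"
    CONNECTED = "Connected"
    PROTECTED = "Protected"


def actions_needed(leg_states, star_enabled):
    """Return the ordered (action, leg) pairs required to reach full protection.

    leg_states:   dict mapping a leg name (hypothetical, e.g. "sync") to a LegState.
    star_enabled: True if Star protection is already enabled.
    """
    actions = []
    for leg, state in leg_states.items():
        if state is LegState.DISCONNECTED:
            # A disconnected leg must be connected before it can be protected.
            actions.append(("connect", leg))
            actions.append(("protect", leg))
        elif state is LegState.CONNECTED:
            actions.append(("protect", leg))
        # A PROTECTED leg needs no per-leg action.
    if not star_enabled:
        # Enabling Star is modeled as a single step once every leg is protected.
        actions.append(("enable", None))
    return actions


if __name__ == "__main__":
    legs = {"sync": LegState.PROTECTED, "async": LegState.DISCONNECTED}
    for action, leg in actions_needed(legs, star_enabled=False):
        print(action, leg or "star")
```

When every leg is already protected and Star is enabled, the function returns an empty list, which corresponds to the case where the reprotect storage steps are non-operation events.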
Reprotect after a temporary failure
The previous section describes the best possible scenario for a smooth reprotection because it follows a planned migration where no errors are encountered. For recovery plans failed over in disaster recovery mode, this may not be the case.
Disaster recovery mode allows for failures ranging from the very small to a full site failure of the protection datacenter. If these failures are temporary and recoverable, a fully successful reprotection may be possible once they have been rectified. In that case, the reprotection behaves similarly to the scenario described in the previous section. If a reprotection is run before the failures are corrected, or if certain failures cannot be fully recovered, the reprotection operation will be incomplete. This section describes this scenario.
For reprotect to be available, the following steps must first occur:
- A recovery must be executed with all steps finishing successfully. If there were any errors during the recovery, the user needs to resolve the issues that caused the errors and then rerun the recovery.
- The original site should be available and SRM servers at both sites should be in a connected state. If the original site cannot be restored (for example, if a physical catastrophe destroys the original site) automated reprotection cannot be run and manual recreation will be required if and when the original protected site is rebuilt.
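The prerequisites listed above can be expressed as a simple pre-check. The sketch below is purely illustrative: `SiteStatus`, its field names, and `reprotect_blockers` are hypothetical and do not correspond to any SRM or SRA API. It only encodes the conditions from the list.

```python
from dataclasses import dataclass


@dataclass
class SiteStatus:
    # Hypothetical status flags; in practice these would be read from SRM by an operator.
    recovery_completed_without_errors: bool
    original_site_available: bool
    srm_servers_connected: bool


def reprotect_blockers(status):
    """Return human-readable reasons why reprotect cannot be run yet."""
    blockers = []
    if not status.recovery_completed_without_errors:
        blockers.append("resolve the recovery errors and rerun the recovery")
    if not status.original_site_available:
        blockers.append("original site is unavailable")
    if not status.srm_servers_connected:
        blockers.append("SRM servers at both sites must be connected")
    return blockers


if __name__ == "__main__":
    status = SiteStatus(
        recovery_completed_without_errors=True,
        original_site_available=True,
        srm_servers_connected=False,
    )
    blockers = reprotect_blockers(status)
    print("reprotect available" if not blockers else blockers)
```

An empty list means the listed prerequisites are met. It does not guarantee that the reprotect itself will succeed.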
If the protected site SRM server was disconnected during failover and is reconnected later, SRM will want to retry certain failover operations before allowing reprotect. This typically occurs if the recovery plan was not able to connect to the protected side vCenter server and power down the virtual machines due to network connectivity issues. If network connectivity is restored after the recovery plan was failed over, SRM will detect this situation and require the recovery plan to be re-run in order to power those VMs down. Figure 156 shows the message SRM displays in the recovery plan when this situation occurs.
Figure 156. Recovery plan message after SRM network partition
A reprotection operation will fail if it encounters any errors the first time it runs. If this is the case, the reprotect must be run a second time but with the “Force cleanup” option selected [47] as in Figure 157.
Figure 157. Forcing a reprotect operation
Once the force option is selected, any errors will be acknowledged and reported but ignored. This will allow the reprotect operation to continue even if the operation has experienced errors. It will attempt all of the typical steps and complete whichever ones are possible.
Therefore, in certain situations, the SRDF replication may not be properly reversed even though the recovery plan and protection group(s) were. If the “Configure storage to reverse direction” step fails, manual user intervention may be required to complete the process. The SRDF SRA will attempt to connect, protect and enable Star during the reprotection operation.
If any of these steps fail during reprotection, a device discovery operation can be executed once the paths are back online. If one or both of the sites are in the “disconnected” state, the SRDF SRA will connect, protect and enable the discovered Star environments.
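The two-pass pattern described in this section (a normal reprotect, followed by a second run with “Force cleanup” if the first one fails) can be sketched as follows. The `run_reprotect` callable is a hypothetical stand-in for whatever starts the operation, for example an operator using the SRM interface. It is assumed here to accept a `force` flag and to return a list of error strings.

```python
STORAGE_STEP = "Configure storage to reverse direction"


def reprotect_with_fallback(run_reprotect):
    """Run reprotect once; if it reports errors, run it again with force enabled.

    run_reprotect: hypothetical callable taking force (bool) and returning
                   a list of error strings (empty on success).
    """
    errors = run_reprotect(force=False)
    if not errors:
        return {"forced": False, "errors": [], "manual_storage_fix": False}

    # Second pass: errors are reported but ignored, so the operation completes
    # whichever steps are possible.
    forced_errors = run_reprotect(force=True)
    return {
        "forced": True,
        "errors": forced_errors,
        # If the storage reversal step failed, manual intervention may be needed.
        "manual_storage_fix": any(STORAGE_STEP in e for e in forced_errors),
    }


if __name__ == "__main__":
    def fake_reprotect(force):
        # Simulated outcome: the storage step fails on both passes.
        return [STORAGE_STEP + " failed"]

    print(reprotect_with_fallback(fake_reprotect))
```

The `manual_storage_fix` flag reflects the point made above: a forced reprotect can reverse the recovery plan and protection groups while leaving the SRDF replication direction unchanged.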
Reprotect after a failover due to unrecoverable failure
In extreme circumstances, the storage and/or the compute environment may be rendered completely unrecoverable by a disaster. In this scenario, reprotect is not possible. The process of “reprotecting” the original recovery site is therefore no different from setting up the protection groups and recovery plans from scratch: the existing protection group and recovery plan must be deleted and re-created. Refer to Chapter 2 for instructions on that process.
