PowerOne and SRDF (Part 2)
Thu, 25 Jun 2020 17:55:46 -0000
PowerOne with SRDF use cases and associated topologies
In my previous blog post (PowerOne and SRDF Part I) we introduced the business context and technologies involved in a PowerOne with SRDF scenario. In this second blog post we describe two data center use cases and the associated topologies:
- Two sites with SRDF Synchronous (SRDF/S) or Asynchronous (SRDF/A) data replication with Remote Restart (disaster restart protection for virtual infrastructure)
- Two sites in a stretch cluster configuration with synchronous data mirroring (SRDF/Metro). This stretched cluster use case relies on the existence of a third site to perform the witness role.
Protected clusters with SRDF/S and SRDF/A
In this topology, vSphere clusters built using traditional configuration techniques on PowerOne systems are recovered at a defined secondary site on a per-cluster basis. The recovery process can take up to several minutes, depending on the number of servers involved.
The functional requirements for this approach are:
- Deploy two PowerOne Systems with PowerMax, one at the primary site and one at the recovery site. These systems must be licensed for SRDF and VMware SRM.
- Create, modify, or delete a cluster at the primary site. Add volumes to the replication set. Select the mode of protection based on RPO: synchronous for zero data loss, or asynchronous for a few seconds to minutes of data loss, depending on the RPO.
- Add stretched application VLAN(s), typically called overlay networks, such as VMware VXLAN using NSX, so that IP addressing works at either the primary or the recovery site. Any mechanism that re-IPs VMs and modifies DNS entries as needed at the secondary site is also valid.
- Create, modify or delete remote array connections, called SRDF Groups. Links must be scalable so you can add links to increase throughput.
- Invoke failover/failback to prove all technologies and operational processes are functioning as expected.
- In order to be able to attach the replicated storage volumes for failover and failback, create storage-less clusters and add them to the vCenter instance at the secondary site.
- For non-disruptive failover testing, SRM provides an orchestrated recovery validation mechanism called Bubble. Bubble provides a recovery area using SnapVX clones of R2 volumes without impacting the replication process. Isolate the Bubble recovery to specific VLANs or subnets (so that it does not overlap with production or recovery networks).
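The protection-mode choice described above (synchronous for zero data loss, asynchronous otherwise) can be sketched as a simple decision rule. The 10 ms round-trip threshold and the function below are illustrative assumptions, not official Dell EMC sizing logic:

```python
def select_srdf_mode(rpo_seconds: float, round_trip_ms: float) -> str:
    """Illustrative policy mapping an RPO target and inter-site latency
    to an SRDF replication mode (hypothetical, for discussion only)."""
    # SRDF/S gives zero data loss, but every write waits for the remote
    # acknowledgment, so it is only practical at low round-trip times.
    if rpo_seconds == 0:
        if round_trip_ms > 10:
            raise ValueError("synchronous replication needs ~10 ms RTT or less")
        return "SRDF/S"
    # SRDF/A trades seconds-to-minutes of potential data loss for
    # distance-independent write latency.
    return "SRDF/A"

print(select_srdf_mode(0, 4))     # SRDF/S
print(select_srdf_mode(300, 40))  # SRDF/A
```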
This approach describes a typical Disaster Recovery scenario. Through the integration of PowerOne, SRDF/S/A, and VMware SRM, we can create an automated DR architecture that, in case of a site failure, will fail over the production CRGs to the secondary disaster site. VMware SRM automates this multi-step recovery of virtual machines. For more details about prerequisites, supported devices, and configurations, see Implementing Dell EMC SRDF SRA with VMware SRM.
Stretched clusters with SRDF/Metro
In this topology, PowerOne clusters are always on. If one site goes down, VMs are restarted on the surviving servers, and the application architecture determines recovery time. Monolithic applications that require all VMs to be restarted will incur a wait, whereas distributed or cloud-native applications continue to run at reduced capacity until the failed VMs restart.
Figure 1: PowerOne with SRDF Metro architecture
The functional requirements for this approach are:
- Deploy two PowerOne Systems, with PowerMax licensed for SRDF/Metro, within metro distance. Create a low-latency communication channel between them to avoid disk write delays.
- Create, modify, or delete remote array connections, called SRDF Groups, which require redundant ports and replication adapters using Ethernet or Fibre Channel protocols. Links must be scalable so you can add links to increase throughput.
- Create, modify, or delete VMware Metro Storage Cluster(s) using SRDF device pairing. Split servers 50/50 across both systems so that storage is bidirectionally mirrored across both sites.
- Add stretched application VLAN(s), such as VMware VXLAN using NSX, so that IP addressing works regardless of which half of the cluster runs the application VM.
- (Optional) Implement any fine-grained workload migration controls to address restarting the workloads if application-specific needs or dependencies arise.
- Create, delete, or modify SRDF pairs. Suspend and deactivate SRDF/Metro failure recovery controls.
- Implement either a bias or a witness mechanism to prevent data inconsistencies when both sides accept writes. The witness is an external arbitrator reachable by both sites, and can be a Virtual Witness (vWitness) or a physical array acting as a witness.
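The bias/witness arbitration idea can be sketched as a toy decision function. This is hypothetical illustration logic only; the real decision is made inside the PowerMax arrays and the (v)Witness:

```python
def winning_side(link_up: bool,
                 r1_sees_witness: bool,
                 r2_sees_witness: bool,
                 bias: str = "R1") -> str:
    """Simplified split-brain arbitration for a two-site metro pair
    (hypothetical logic, for illustration only)."""
    if link_up:
        return "both"  # normal active/active operation
    # Inter-site link is down: exactly one side may continue serving I/O,
    # otherwise both would accept writes and diverge (split brain).
    if r1_sees_witness and not r2_sees_witness:
        return "R1"
    if r2_sees_witness and not r1_sees_witness:
        return "R2"
    # Witness cannot break the tie: fall back to the configured bias side.
    return bias

print(winning_side(link_up=False, r1_sees_witness=True, r2_sees_witness=False))  # R1
```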
In the third and final part of this blog series, we will explore the network architecture, best practices, and recommendations for the different PowerOne with SRDF scenarios we have presented.
Related Blog Posts
PowerOne and SRDF (Part 3)
Mon, 20 Jul 2020 20:12:56 -0000
In this third and final blog post on PowerOne and SRDF, we focus on the SRDF/Metro use case. We explain the network requirements for providing Layer 2 and Layer 3 services and go into depth on best practices and recommendations for setting up a PowerOne system in an SRDF/Metro scenario.
First, we need to define how we will stretch the network for a Metro scenario. As part of the project design, we must determine how to provide Layer 2 and Layer 3 networking for vSphere, NSX Management, and vMotion networks.
The SRDF/Metro use case consists of three sites: two workload sites (local and remote) and a witness site. (Remember that a witness serves to prevent data inconsistencies between local and remote sites. A witness can be virtual or physical.)
Here are the essential Dell EMC Best Practices for setting up SRDF/Metro on both workload sites and on a witness site:
- Where possible, use dedicated ports on each PowerMax for connectivity to the dedicated replication network. The network must meet the latency requirements for SRDF/Metro and VMware Metro Storage Clusters.
- Use non-uniform host access to simplify SAN design and provide predictable I/O latency for workloads.
- Implement vSphere and NSX-T Management as required to meet operational needs and constraints through a dedicated management cluster or through a cluster that is shared with workloads.
- Implement vCenter HA with a witness site or with restart recovery, depending on network architecture and operational needs.
- Follow Dell Technologies Best Practices (NSX-T Data Center Administration Guide) to set up NSX-T L2 VPN across both sites.
- Deploy or migrate workload VMs with NSX-T L2 VPN as the VM Network, for example, for application access.
- Configure workload VMs to leverage a VMware NSX-T L2 VPN or stretched VXLAN implementation in Dell Smart Fabric Services to retain identical IP addressing on both sites.
The proposed architecture options and their implications appear in the following examples.
Figure 1: Layer 2 networking architecture
In this example, a Layer 2 stretched network is implemented using either of the following:
- BGP EVPN—The network is extended by running a VXLAN tunnel over the top of the Layer 3 handoff from a PowerOne system.
- Layer 2 trunking—Dedicated 802.1Q trunk ports can be configured on the Dell S5232-F switches used as leaf devices, mapping incoming 802.1Q tags into the proper VXLAN virtual networks. If the customer already has a VLAN that is Layer 2 adjacent between sites, this effectively extends it to the other datacenter, where the other PowerOne system uses the same method to map an incoming tag to the virtual network (VXLAN).
Stretching the management VLAN adds complexity to the physical network but has the advantage of retaining identical IP addressing for management components across sites.
Figure 2: Layer 3 networking architecture
In this example, you can include Layer 3 networking to eliminate the need to stretch a VLAN at the physical network level. This example shows how vCenter HA can be used to distribute vSphere Management across sites.
This approach requires that all NSX-T management components are configured with DNS entries that have a short TTL. It also requires a re-IP operation and a DNS update after restarting on Site 2. All workload VMs are restarted automatically in this scenario, through vCenter HA, and are immediately fully functional on the NSX-T L2 VPN.
Management activities, such as changing the NSX-T configuration, become available once the NSX-T Management VMs are restarted.
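The re-IP and DNS update step described for this Layer 3 design could be scripted, for example by generating an `nsupdate` batch that repoints management DNS names to their Site 2 addresses with a short TTL so clients re-resolve quickly. The DNS server address, zone, and record names below are hypothetical:

```python
def nsupdate_script(records, ttl=60, dns_server="10.2.0.53", zone="corp.example"):
    """Build an `nsupdate` batch for the post-failover DNS update.
    Server, zone, and record names are hypothetical placeholders."""
    lines = [f"server {dns_server}", f"zone {zone}"]
    for name, new_ip in records:
        # Replace the old A record with the Site 2 address, short TTL.
        lines.append(f"update delete {name}.{zone} A")
        lines.append(f"update add {name}.{zone} {ttl} A {new_ip}")
    lines.append("send")
    return "\n".join(lines)

print(nsupdate_script([("nsx-mgr-01", "10.2.10.21")]))
```

The generated text can be fed to the standard `nsupdate` utility (part of BIND) against a DNS server that accepts dynamic updates.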
Figure 3: Distributed networking architecture
Another option is to use the third site for witness duties. Centralizing the management components at the third site isolates them from the workload sites. In this case, in addition to the PowerMax witness, all the vCenter and NSX-T network management elements reside at this third site, so those services do not need to restart if a workload site fails.
Protecting the management in the third site is not covered in this example.
All of these vMSC architectures (as described in Best Practices for Using Dell EMC SRDF/Metro in a VMware vSphere Metro Storage Cluster) form a highly available, business-continuity scenario in which both primary and secondary sites are perceived as one by a PowerOne vSphere host and from a CRG provisioning perspective. Stretching the sites at the storage and network levels enables seamless vMotion and Storage vMotion operations between the two sites.
PowerOne continues to deliver the now-traditional Converged Infrastructure values with extensive automation for site-local needs, while simultaneously allowing for seamless integration with outcomes that fall outside of autonomous operations. All operations related to these stretched CRGs (such as provisioning, expansion, and lifecycle management) can be tailored to extended use cases, such as SRDF/Metro, through traditional configuration approaches widely used in the industry today.
Best Practices and Recommendations
To determine the correct PowerOne technology configuration for the required outcome, we first need to determine the organization’s continuity requirements for a given cluster, in terms of the following:
- Recovery Time Objective (RTO)—How long does it take to get the cluster working again after a site failure?
- Recovery Point Objective (RPO)—How much data is the customer prepared to lose?
Designing the architecture means:
- Determining RTO and RPO requirements and site distance constraints
- Working with an organization to understand their continuity requirements and DR needs
- Designing and estimating the price of the appropriate PowerOne configurations to meet those needs
It is typical for an organization to classify its recovery needs on a per-application basis, resulting in collections of applications that have common availability requirements. This maps well to the PowerOne approach in which the cluster is the primary unit of consumption. At the cluster level, one cluster could have an RPO and RTO that is different from that of another cluster, allowing a direct mapping of the recovery needs of applications to the clusters in which they run.
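The per-application classification described above can be sketched as a simple grouping of applications into clusters by recovery class. The application inventory and RPO/RTO values below are hypothetical:

```python
from collections import defaultdict

# Hypothetical inventory: (application, rpo_seconds, rto_seconds)
apps = [
    ("billing",   0,   0),
    ("analytics", 900, 3600),
    ("web",       0,   0),
    ("reporting", 900, 3600),
]

# One cluster per recovery class: applications sharing an RPO/RTO
# pair land in the same cluster, mirroring PowerOne's cluster-as-
# unit-of-consumption model.
clusters = defaultdict(list)
for name, rpo, rto in apps:
    clusters[(rpo, rto)].append(name)

for (rpo, rto), members in sorted(clusters.items()):
    print(f"RPO={rpo}s RTO={rto}s -> {members}")
```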
Investment in operational continuity is only the starting point; what is critical is having confidence that your plan will work. That confidence comes from regular testing that is minimally disruptive and performed in a controlled manner.
PowerOne makes possible these various outcomes by making the right set of components and other resources available. This means that the initial sizing work must include understanding the organization’s continuity requirements and how they will be implemented. This will help ensure that the components and configurations needed to fulfill the requirements are incorporated in the system definition and that the correct technologies and capacities are available at implementation.
Operational Continuity: Solutions
To investigate further how an organization’s continuity solution can be provided, let’s examine the best-practice recovery configurations for PowerOne, and how they can be extended to achieve the required outcomes.
We need to address how we deal with physical connectivity at the storage and network layers, and how we configure the logical behavior of our vSphere environments to minimize RPO. To use traditional configuration techniques for non-autonomous extended use cases such as SRDF, the PowerOne Controller is invoked to reserve components and allow their seamless hand-off. We rely on well-proven tools such as VMware Site Recovery Manager to automate failover and failback operations, and to configure the production network at the vSphere remote site.
PowerOne with SRDF/Metro provides an organization’s continuity solution in which we can define RPO and RTO as zero or near-zero when VMware vSphere Metro Storage Cluster technology and architecture are also implemented.
There are a number of considerations to take into account when designing a PowerOne system with SRDF/Metro architecture, specifically:
- Compute, storage, and network configuration
- How we design our vSphere environment in terms of vCenter configuration (HA and Platform Service Controller architectures)
- NSX-T management design guidelines
- A few considerations about the SRDF witness
- Potential benefits of including a third site in the architecture design
- Architectural considerations on site design and intersite replication
There is market demand for a way to deploy critical applications in a highly available manner. As an architecture option for building that solution, combining the core Converged Infrastructure and site-local autonomous outcomes of PowerOne with traditional configuration capabilities for SRDF (and VMware SRM) delivers cumulative, industry-leading value from each of those components in one solution, fully supported by Dell Technologies.
For additional in-depth information, please read the supporting white paper: Protecting Business-Critical Workloads with Dell EMC SRDF and PowerOne.
PowerOne and SRDF (Part I)
Tue, 23 Jun 2020 18:03:50 -0000
Planning for disaster recovery is essential for IT organizations that design their environments to support business-critical applications. Each new application or instance must be deployed with enough resiliency to overcome common hazards such as floods, fires, power failures, and human error.
To design a resilient IT infrastructure, we must first consider the datacenter site, or sites. If there is a single datacenter site, simply duplicating the IT infrastructure will not provide the desired resiliency if the failure event impacts the entire site. If we deploy our business applications across more than one site, we would need mechanisms to replicate the information across sites. In the event of a site loss, even with information replicated, we would still need to introduce processes or tools that would help during the subsequent failover and failback operations.
All industries and geographies share a need for a resilient IT architecture, one that is manifested in IT architecture proposals that consider factors such as:
- Site Distance – Depending on the distance between sites, the required technologies will vary, and the recovery scenarios will be different. Relatively short distances (under 100 km, with a Round Trip Time under 10 ms) allow the use of more powerful tools to minimize the following two factors, RTO and RPO.
- Recovery Time Objective (RTO) – Every business or application may allow for a different length of time during which to recover when a failure occurs. Some will only support a few seconds of application downtime or no downtime at all, while others may be able to withstand minutes or even hours. This factor greatly influences the architectural requirements.
- Recovery Point Objective (RPO) – Another key factor when defining a solution is the amount of data a business can afford to lose in the event of an application outage or site loss. In some cases, a business might be able to withstand recovery of its applications to a data state that existed minutes or even hours before the failure; in other cases, the business could not withstand the loss of a single transaction.
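The relationship between site distance and Round Trip Time noted above can be roughly estimated from fiber propagation delay, which is about 5 µs per km one way. The sketch below deliberately ignores switching equipment, amplifiers, and non-direct fiber paths, all of which add to the real RTT:

```python
def fiber_rtt_ms(distance_km: float, us_per_km: float = 5.0) -> float:
    """Propagation-only round-trip time over fiber.
    ~5 us/km one way; real-world RTT is higher once equipment
    and indirect fiber routing are added."""
    return 2 * distance_km * us_per_km / 1000.0

# 100 km of fiber costs ~1 ms of RTT in propagation alone; the rest of
# a 10 ms budget is left for equipment, path detours, and headroom.
print(fiber_rtt_ms(100))  # 1.0
```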
In this context, we propose a solution to address this business need with a highly effective and function-rich architecture, featuring Dell EMC PowerOne with Dell EMC Symmetrix Remote Data Facility (SRDF), and VMware Site Recovery Manager.
This blog post is part one of a three-part series. In this first installment, we introduce the business context and the technologies involved. In part two, we cover the main use cases that this series addresses. In the third and last post, we share technology recommendations and best practices.
PowerOne and SRDF technology overview
Dell EMC PowerOne combines compute, storage, and networking in a fully engineered and highly automated converged infrastructure that provides autonomous operations, all-in-one simplicity, and flexible consumption options. With PowerOne, IT organizations can start moving from traditional operations to modern cloud outcomes.
Based on vSphere clusters, PowerOne delivers business outcomes. During daily tasks, such as provisioning workloads, the customer is never required to specify low-level details about IP stack configuration parameters, storage array configuration object names, and so on. Instead, the customer is asked only to identify the capacity required to support the target workload. All other information required to deliver the desired outcome is derived from system standards and best practices.
Dell EMC Symmetrix Remote Data Facility (SRDF) solutions provide near real-time copies of application data from a production storage array to one or more remote storage arrays. The main use cases are:
- Disaster recovery
- High availability
- Data migration
In a traditional SRDF device pair relationship, the secondary device (“R2”) is read-only, and writes are disabled. Only the primary device (“R1”) is enabled for read and write activity. With SRDF/Metro, the R2 is also write-enabled and accessible by the host or application. The R2 takes on the personality of the R1, including the World Wide Name (WWN), so a host sees both the R1 and the R2 as the same device.
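The WWN identity behavior can be illustrated with a toy model of a device pair; the `Device` class and array names are purely hypothetical and do not reflect any real PowerMax data structure:

```python
class Device:
    """Toy model of a storage device as presented to a host."""
    def __init__(self, array: str, wwn: str, writable: bool):
        self.array, self.wwn, self.writable = array, wwn, writable

# Traditional SRDF pair: the R2 is read-only and keeps its own identity.
r1 = Device("array-A", "wwn-R1", writable=True)
r2 = Device("array-B", "wwn-R2", writable=False)

# SRDF/Metro: the R2 becomes write-enabled and presents the R1's WWN,
# so a multipathing host treats paths to both arrays as one device.
r2.wwn, r2.writable = r1.wwn, True

host_paths = [r1, r2]
assert len({d.wwn for d in host_paths}) == 1  # host sees a single device
```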
When SRDF/Metro is used in conjunction with VMware vSphere across various hosts in two sites, a VMware vSphere Metro Storage Cluster (vMSC) is formed. A VMware vMSC infrastructure is a stretched cluster: an architecture that extends local network and storage configuration across remote sites, enabling on-demand and nonintrusive workload mobility.
VMware Site Recovery Manager (SRM) is another technology that can play a key role in simplifying operations in multi-site architectures. VMware SRM provides workflow, business continuity, and disaster restart process management for VMware vSphere workloads. For the SRDF/Metro use case, SRM is not required because we can build a vMSC: the multi-site deployment is perceived by vSphere workloads as a single stretched site. However, SRM is a mainstream technology with SRDF/S and SRDF/A for handling failover and failback operations. In a second use case documented in the white paper Protecting Business-Critical Workloads with Dell EMC SRDF and PowerOne, VMware SRM leverages SRDF replication to protect PowerOne Cluster Resource Groups (CRGs).
The integration of VMware SRM with SRDF automates storage-based disaster restart operations on PowerOne systems. In the white paper, we focus on the availability and disaster recovery scenarios made possible by PowerOne.
Figure 1: PowerOne with SRDF basic architecture
In my next blog post, we will explore two use cases and their associated technologies:
- Two data center sites with SRDF Synchronous or Asynchronous (SRDF/S or SRDF/A)
- Two sites in a stretch cluster configuration with synchronous data mirroring (SRDF/Metro).