Your infrastructure, no matter where it is located, must be in a highly available state to support your business and customers. In production environments, SAP business systems need infrastructure that supports a high level of availability to ensure that there are no missed transactions or opportunities. Ensuring this availability is the responsibility of the IT department.
Planning and configuring a highly available SAP landscape can be a challenge. In an ideal solution, no single point of failure exists, so the system is unaffected by the loss of a single component or even a data center. The required level of protection is determined by weighing the productivity lost when the SAP system is unavailable against the investment needed to prevent that loss.
Dell understands the importance of delivering successful outcomes for our customers. This document describes best practices for attaining high availability (HA) for local hardware failures and data center failures, including:
By using the resources that are already in the data center, customers can add a layer of availability on top of what SAP offers with SAP HANA System Replication (HSR). SAP HANA systems that use HSR can experience a "split-brain" condition if a node fails on the primary SAP HANA host before a failover is fully initiated. Managing this process through cluster software can automate the system replication takeover in the event of a failure. By selecting an active/active read-enabled SAP HANA configuration, you avoid having to "cold start" your SAP HANA standby database. This reduces the time lost while tables reload into memory, which might otherwise cause application delays.
The Dell SUSE Pacemaker cluster solution addresses this issue by using a fencing mechanism. If a standby node suddenly cannot communicate with the primary node, it terminates the affected node to ensure that the two nodes are never active at the same time. The failover to the surviving node then continues, and the surviving node takes over the workloads as the new primary node:
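The ordering that fencing enforces (terminate the unreachable node first, promote the survivor second) can be illustrated with a minimal Python sketch. This is a conceptual model only, not Pacemaker code or configuration: the `Node` class, the `Role` values, the `handle_lost_heartbeat` function, and the host names `hana-a` and `hana-b` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Role(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"
    FENCED = "fenced"


@dataclass
class Node:
    name: str  # hypothetical host name, for illustration only
    role: Role


def handle_lost_heartbeat(survivor: Node, unreachable: Node) -> List[str]:
    """Model the fence-then-promote ordering for a two-node cluster."""
    actions = []
    # Step 1: fence the unreachable peer so it can no longer act as primary.
    unreachable.role = Role.FENCED
    actions.append(f"fence {unreachable.name}")
    # Step 2: only after the peer is fenced is the survivor promoted.
    if survivor.role is Role.SECONDARY:
        survivor.role = Role.PRIMARY
        actions.append(f"promote {survivor.name}")
    return actions


def primaries(nodes: List[Node]) -> List[str]:
    """Return the names of all nodes currently holding the primary role."""
    return [n.name for n in nodes if n.role is Role.PRIMARY]


node_a = Node("hana-a", Role.PRIMARY)
node_b = Node("hana-b", Role.SECONDARY)

actions = handle_lost_heartbeat(survivor=node_b, unreachable=node_a)
print(actions)
print(primaries([node_a, node_b]))
```

Because the peer is marked as fenced before the survivor is promoted, the model never holds two primaries at once, which is the property that prevents a split-brain condition. A real cluster performs the fencing step through a fencing device rather than a state change in memory.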