HA Functionality
The general architecture of the NSS-HA family remains the same, including the high availability functionality. Many distinct types of failures and faults can impact the functionality of a highly available NFS solution. The following table lists the potential failures that are tolerated in this solution. The NFS service runs on the active server of the HA failover cluster, while the passive server is in standby mode, ready to fence or replace the active server and continue providing the NFS service to clients.
| Failure type | Mechanism to handle failure |
|---|---|
| Single local operating system disk failure on a server | Operating system installed on a RAID 1 virtual device (two disks). If the PowerVault ME4084 HDDs are smaller than 12 TB, another SSD can be configured as a global hot spare. The PERC controller handles any SSD failures and reports them to the OS. A single OS disk failure is unlikely to make the server non-operational, especially if a hot spare is enabled. |
| Single local swap space disk failure on a server | The swap space on a local SSD-based RAID 0 is idle during normal operation and is only required by xfs_repair after significant failures on the XFS file system. In addition, a smaller (4 GiB) swap space is normally created as part of the default RHEL installation. |
| Single server failure | Event monitored by the cluster service. In case of a failure, the NFS service and the other services required by NFS fail over to the passive server. |
| Power supply or power bus failure | Each server has dual redundant PSUs, and each PSU must be connected to a separate power bus. The server can continue operating with a single PSU. |
| Fence device failure | The cluster is configured with two fencing devices. |
| SAS cable or SAS port failure | Two SAS HBA cards are installed in each NFS server, and each card has a SAS cable to a different SAS controller in the shared PowerVault ME4084. The operating system multipath service manages all available SAS paths, keeping only one active per LUN. A single SAS card or cable failure does not impact data availability, but performance may be reduced depending on the I/O load. |
| Single ME4084 SAS controller failure | If a single PowerVault ME4084 SAS controller fails, the remaining controller takes over its I/O transactions (leveraging the cache shared between controllers), along with ownership of disk groups, volumes, SAS connections, and so on, and instructs the multipath services on the operating systems of both servers to make active only the paths connected to itself (using ALUA). Performance may be degraded depending on the I/O load. |
| Dual SAS cable or card failure | Event monitored by the cluster service. If all SAS data paths to the shared storage are lost, the active server is fenced and all services under cluster control fail over to the passive server. |
| InfiniBand or 10 GbE link failure | Event monitored by the cluster service. The active server is fenced and all services under cluster control fail over to the passive server. |
| Private Ethernet switch failure | Although a single point of failure, the switch is not a vital resource for the cluster unless a fencing event needs to take place, and the NFS service continues running on the active server. If an additional component fails before the Ethernet switch comes back online, the service stops and manual intervention by a system administrator is required. |
| Heartbeat network interface failure | Event monitored by the cluster service. The active server is fenced and all services under cluster control fail over to the passive server. |
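The multipath behavior described in the table (one active path per LUN, with ALUA-based failover between controllers) is typically configured with a device-specific stanza in `/etc/multipath.conf`. The following is a minimal sketch only, assuming an ME4-series device entry; the exact vendor/product strings and tuning values should be taken from the deployment guide for the validated design.

```
devices {
    device {
        vendor               "DellEMC"
        product              "ME4"
        path_grouping_policy group_by_prio    # group paths by ALUA priority
        prio                 alua             # rank paths by controller ownership
        path_checker         tur              # probe path health with TEST UNIT READY
        failback             immediate        # return to the preferred controller when it recovers
        no_path_retry        30               # queue I/O briefly while paths recover
    }
}
```

With this policy, the paths to the controller that owns a LUN form the highest-priority group, so only those paths carry I/O until a controller or cable failure forces a regroup.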
HA functionality is verified by mounting the solution file system on the clients via NFSv4. Failures are then simulated on the HA cluster, based on the failures and faults listed in the table above, to confirm that each handling mechanism behaves as expected.
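In a RHEL-based HA stack of this kind, the cluster service, the two fencing devices, and the NFS resources are typically managed by Pacemaker and configured with `pcs`. The commands below are a hedged sketch, not the validated configuration: node names, IP addresses, credentials, device paths, and resource names are hypothetical placeholders, and the exact resource set should follow the deployment guide for the validated design.

```
# Hypothetical names and addresses; adapt to the actual deployment.
# Two fencing devices, one per server (for example, iDRAC-based fencing).
pcs stonith create fence-n1 fence_idrac ip=192.168.20.11 username=root \
    password=secret lanplus=1 pcmk_host_list=nfs-n1
pcs stonith create fence-n2 fence_idrac ip=192.168.20.12 username=root \
    password=secret lanplus=1 pcmk_host_list=nfs-n2

# NFS resource group: shared file system, NFS server, and floating IP,
# started in order and failed over together to the passive server.
pcs resource create nss-fs ocf:heartbeat:Filesystem \
    device=/dev/mapper/nss_vol directory=/mnt/nss fstype=xfs --group nss-ha
pcs resource create nss-nfs ocf:heartbeat:nfsserver \
    nfs_shared_infodir=/mnt/nss/nfsinfo --group nss-ha
pcs resource create nss-ip ocf:heartbeat:IPaddr2 \
    ip=10.10.10.100 cidr_netmask=24 --group nss-ha
```

Grouping the resources ensures that when the active server is fenced, the file system, NFS server, and client-facing IP address move to the passive server as a unit, which is the failover behavior the table above describes.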
The rest of this document describes the test bed and provides I/O performance information for the expected use cases of the Dell Technologies Validated Design for HPC NFS Storage, using several standard benchmarks. To contrast the performance of the current design for NFS storage with the previous release, named "NSS7.4-HA", the corresponding performance numbers of that release are also presented.