HA Functionality
The general architecture of the NSS-HA family remains the same, including the high availability functionality. Many distinct types of failures and faults can impact the functionality of a highly available NFS solution. The following table lists the potential failures that are tolerated in this solution. The NFS service runs on the active server of the HA failover cluster, while the passive server is in standby mode, ready to fence or replace the active server and continue providing the NFS service to clients.
| Failure type | Mechanism to handle failure |
|---|---|
| Single local operating system disk failure on a server | Operating system installed on a RAID 1 virtual device (two disks). If the PowerVault ME4084 HDDs are smaller than 12 TB, another SSD can be configured as a global hot spare. The PERC controller handles any SSD failures and reports them to the OS. A single OS disk failure is unlikely to make the server non-operational, especially if a hot spare is enabled. |
| Single local swap space disk failure on a server | The swap space on a local SSD-based RAID 0 is idle during normal operation and is only required by xfs_repair after significant failures on the XFS file system. In addition, a smaller (4 GiB) swap space is normally created as part of the default RHEL installation. |
| Single server failure | Event monitored by the cluster service. In case of a failure, the NFS service and the other services required by NFS fail over to the passive server. |
| Power supply or power bus failure | Each server has dual redundant PSUs, and each PSU must be connected to a separate power bus. The server can continue operating with a single PSU. |
| Fence device failure | The cluster is configured with two fencing devices. |
| SAS cable or SAS port failure | Two SAS HBA cards are installed in each NFS server, and each card has a SAS cable to a different SAS controller in the shared PowerVault ME4084. The operating system multipath service manages all available SAS paths, keeping only one active per LUN. A single SAS card or cable failure does not impact data availability, but performance may be reduced depending on the I/O load. |
| Single ME4084 SAS controller failure | If a single PowerVault ME4084 SAS controller fails, the remaining controller takes over its I/O transactions (leveraging the cache shared between controllers), along with ownership of disk groups, volumes, SAS connections, and so on, and instructs the multipath services on the operating systems of both servers to make active only the paths connected to itself (using ALUA). Performance may be degraded depending on the I/O load. |
| Dual SAS cable or card failure | Event monitored by the cluster service. If all SAS data paths to the shared storage are lost, the active server is fenced and all services under cluster control fail over to the passive server. |
| InfiniBand or 10 GbE link failure | Event monitored by the cluster service. The active server is fenced and all services under cluster control fail over to the passive server. |
| Private Ethernet switch failure | Although a single point of failure, the switch is not a vital resource for the cluster unless a fencing event needs to take place, and the NFS service continues running on the active server. If an additional component fails before the Ethernet switch comes back online, the service stops and manual intervention by a system administrator is required. |
| Heartbeat network interface failure | Event monitored by the cluster service. The active server is fenced and all services under cluster control fail over to the passive server. |
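The multipath behavior described in the table (one active path per LUN, with ALUA-based failover between controllers) is typically configured with a device-specific stanza in `/etc/multipath.conf`. The following is a minimal sketch only, assuming an ME4-series device entry; the exact vendor/product strings and tuning values should be taken from the deployment guide for the validated design.

```
devices {
    device {
        vendor               "DellEMC"
        product              "ME4"
        path_grouping_policy group_by_prio    # group paths by ALUA priority
        prio                 alua             # rank paths by controller ownership
        path_checker         tur              # probe path health with TEST UNIT READY
        failback             immediate        # return to the preferred controller when it recovers
        no_path_retry        30               # queue I/O briefly while paths recover
    }
}
```

With this policy, the paths to the controller that owns a LUN form the highest-priority group, so only those paths carry I/O until a controller or cable failure forces a regroup.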
HA functionality is verified by mounting the solution file system on the clients via NFSv4. Failures are then simulated on the HA cluster, based on the failures and faults listed in the table above, to confirm that each handling mechanism behaves as expected.
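In a RHEL-based HA stack of this kind, the cluster service, the two fencing devices, and the NFS resources are typically managed by Pacemaker and configured with `pcs`. The commands below are a hedged sketch, not the validated configuration: node names, IP addresses, credentials, device paths, and resource names are hypothetical placeholders, and the exact resource set should follow the deployment guide for the validated design.

```
# Hypothetical names and addresses; adapt to the actual deployment.
# Two fencing devices, one per server (for example, iDRAC-based fencing).
pcs stonith create fence-n1 fence_idrac ip=192.168.20.11 username=root \
    password=secret lanplus=1 pcmk_host_list=nfs-n1
pcs stonith create fence-n2 fence_idrac ip=192.168.20.12 username=root \
    password=secret lanplus=1 pcmk_host_list=nfs-n2

# NFS resource group: shared file system, NFS server, and floating IP,
# started in order and failed over together to the passive server.
pcs resource create nss-fs ocf:heartbeat:Filesystem \
    device=/dev/mapper/nss_vol directory=/mnt/nss fstype=xfs --group nss-ha
pcs resource create nss-nfs ocf:heartbeat:nfsserver \
    nfs_shared_infodir=/mnt/nss/nfsinfo --group nss-ha
pcs resource create nss-ip ocf:heartbeat:IPaddr2 \
    ip=10.10.10.100 cidr_netmask=24 --group nss-ha
```

Grouping the resources ensures that when the active server is fenced, the file system, NFS server, and client-facing IP address move to the passive server as a unit, which is the failover behavior the table above describes.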
The rest of this document describes the test bed and provides I/O performance information for the expected use cases of the Dell Technologies Validated Design for HPC NFS Storage, using several standard benchmarks. To contrast the performance of the current design for NFS storage with the previous release, named "NSS7.4-HA", the corresponding performance numbers of that release are also presented.