Mechanisms to handle failure
There are many distinct types of failures and faults that can impact the functionality of a highly available BeeGFS storage solution. Table 2 lists the potential failures that are tolerated in the solution.
Failure type | Mechanism to handle failure |
Single local operating system disk failure on a server | Operating system installed on a RAID1 virtual device (two disks). |
HDD failures in the ME arrays | The storage targets are in RAID 6 disk groups of 10 drives (8+2). Four dynamic global spares are configured. An automatic RAID rebuild takes place if a disk fails. |
Power supply or power bus failure | Each server has dual redundant PSUs, and each PSU must be connected to a separate power bus. The server can continue to be operational with a single PSU. |
SAS cable / SAS port failure | Two SAS HBA cards installed on each MDS, four on each SS. Redundant connections are made to all arrays across both MDS/SS. A single SAS card or cable failure will not impact data availability, but performance may be reduced depending on I/O load. |
InfiniBand link/adapter failure | Two physical interfaces are configured in an active-backup logical bonded interface. When the active link fails, the backup link takes over. When the active adapter fails, the backup adapter takes its place. |
Single ME5 SAS controller failure (SS or MDS) | If a single ME5 SAS controller fails, the remaining ME5 controller takes over the I/O transactions. Performance may be degraded depending on the I/O load. |
Single server failure | The event is monitored by the pcs (Pacemaker/Corosync) cluster services. |
Private Ethernet switch failure (yet to be tested) | The switch is a single point of failure, but it is not a vital resource for the cluster. If an additional component fails before the Ethernet switch comes back online, the service is stopped and manual intervention from a system administrator is required. |
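
As a rough illustration of the RAID 6 (8+2) layout listed in Table 2, the following Python sketch works out the parity overhead and the number of simultaneous drive failures one disk group can tolerate. It is not part of the solution tooling: the class name `Raid6Group` and the per-drive size passed to `usable_capacity_tb` are hypothetical, since the table does not state a drive capacity. It relies only on the general property that RAID 6 keeps data available with up to two failed drives per group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Raid6Group:
    """One RAID 6 disk group, described as data drives + parity drives."""

    data_drives: int = 8    # the "8" in 8+2 (from Table 2)
    parity_drives: int = 2  # the "2" in 8+2; RAID 6 uses two parity drives' worth

    @property
    def total_drives(self) -> int:
        return self.data_drives + self.parity_drives

    @property
    def usable_fraction(self) -> float:
        # Share of raw capacity left for data after parity overhead.
        return self.data_drives / self.total_drives

    def usable_capacity_tb(self, drive_tb: float) -> float:
        # drive_tb is a hypothetical per-drive size; the source does not give one.
        return self.data_drives * drive_tb

    def survives(self, failed_drives: int) -> bool:
        # Data stays available while failures do not exceed the parity count.
        return 0 <= failed_drives <= self.parity_drives


group = Raid6Group()
print(f"drives per group : {group.total_drives}")
print(f"usable fraction  : {group.usable_fraction:.0%}")
print(f"survives 2 failed: {group.survives(2)}")
print(f"survives 3 failed: {group.survives(3)}")
```

The global spares do not change this arithmetic. They only allow a rebuild to start automatically, which shortens the time a group spends with reduced redundancy.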