Mechanisms to handle failure | HPC High-Performance Storage Solution for BeeGFS | Dell Technologies Info Hub

Mechanisms to handle failure

There are many distinct types of failures and faults that can impact the functionality of a highly available BeeGFS storage solution. Table 2 lists the potential failures that are tolerated in the solution.

Table 2. High availability—mechanisms to handle failure

Failure type	Mechanism to handle failure
Single local operating system disk failure on a server	The operating system is installed on a RAID 1 virtual device (two disks).
HDD failures in the ME arrays	The storage targets are in RAID 6 disk groups of 10 drives (8+2). Four dynamic global spares have been configured. An automatic RAID rebuild takes place if a disk fails.
Power supply or power bus failure	Each server has dual redundant PSUs, and each PSU must be connected to a separate power bus. The server can continue to be operational with a single PSU.
SAS cable / SAS port failure	Two SAS HBA cards are installed on each MDS and four on each SS. Redundant connections are made to all arrays across both MDS/SS. A single SAS card or cable failure will not impact data availability, but performance may be reduced depending on I/O load.
InfiniBand link/adapter failure	Two physical interfaces are configured in an active-backup bonded logical interface. When the active link fails, the backup link takes over. When the active adapter fails, the backup adapter takes its place.
Single ME5 SAS controller failure (SS or MDS)	If a single ME5 SAS controller fails, the remaining ME5 controller takes over the I/O transactions. Performance may be degraded depending on the I/O load.
Single server failure	The event is monitored by pcs services. If a server fails, its services fail over to the other server.
Private Ethernet switch failure (yet to be tested)	A single point of failure, but it is not a vital resource for the cluster. If there is an additional component failure before the Ethernet switch comes back online, the service is stopped and manual intervention from a system administrator is required.
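
The 8+2 RAID 6 layout in Table 2 can be illustrated with a little arithmetic. The following is a minimal sketch, not part of the solution's tooling: the class name `Raid6Group`, its methods, and the 12 TB drive size are assumptions made for illustration only. It shows how many concurrent drive failures one disk group tolerates and what fraction of its raw capacity holds data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Raid6Group:
    """Hypothetical model of one RAID 6 disk group (defaults: 8+2 as in Table 2)."""

    data_drives: int = 8
    parity_drives: int = 2  # RAID 6 keeps two parity blocks per stripe

    @property
    def total_drives(self) -> int:
        return self.data_drives + self.parity_drives

    def usable_tb(self, drive_tb: float) -> float:
        """Capacity available for data, ignoring formatting and filesystem overhead."""
        return self.data_drives * drive_tb

    def efficiency(self) -> float:
        """Fraction of raw capacity that stores data rather than parity."""
        return self.data_drives / self.total_drives

    def survives(self, failed_drives: int) -> bool:
        """True if data stays available with this many concurrent drive failures."""
        return failed_drives <= self.parity_drives


group = Raid6Group()
DRIVE_TB = 12.0  # assumed drive size for illustration; not taken from the source

print("drives per group:", group.total_drives)
print("usable TB per group:", group.usable_tb(DRIVE_TB))
print("capacity efficiency:", group.efficiency())
print("survives 2 failures:", group.survives(2))
print("survives 3 failures:", group.survives(3))
```

Under these assumptions, a group keeps serving data through two simultaneous drive failures. The global spares listed in Table 2 matter because a completed rebuild restores that margin before further drives fail.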