As part of a Dell Validated Design, we ask how resilient the design is. In this case, we focused on designing a resilient system for the depot, since it is a non-data center environment with limited IT support. Fault-tolerant resiliency is also built into the EPIC iO device mounted on the bus, but that was not the focus of our testing.
The bus depot design uses a rugged XR4000 system with two servers and a witness node, running vSAN. This design was chosen to ensure that a resilient, low-touch system is available in the depot. To test this resilience, we performed a hard shutdown (via iDRAC) of the server running the depot workload. This test triggered an HA failover using the capabilities of VMware. The workload should be reinstantiated on the second server in the cluster without manual intervention.
Test Specification
The following steps were performed to test HA:
- Set up the system to stream data from bus to depot.
- Identify which server in the depot cluster the workload is running on.
- Force a shutdown of the server that runs the workload.
- Monitor the second server to see how long the recovery takes.
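The monitoring step can be automated with a simple availability probe. The following is a minimal sketch, not the tooling used in this test: it assumes the depot workload exposes a TCP port that can serve as a liveness signal, and the host name and port in the usage comment are hypothetical placeholders.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def measure_outage(
    probe: Callable[[], bool],
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Wait for the probe to fail, then to succeed again; return the outage in seconds."""
    # Phase 1: poll until the workload stops responding (the forced shutdown).
    while probe():
        sleep(interval)
    down_at = clock()
    # Phase 2: poll until the workload responds again on the surviving server.
    while not probe():
        sleep(interval)
    return clock() - down_at


# Hypothetical usage (host and port are placeholders, not values from this design):
#   outage = measure_outage(lambda: tcp_probe("depot-receiver.example", 443))
#   print(f"Workload unavailable for about {outage:.0f} s")
```

The measured value is only as precise as the polling interval, so an interval of one second is adequate for a recovery time on the order of a minute.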
Test Results
The VMware view during the HA process looks as follows:
In the graphic above, one server in the vSAN cluster is not responding, as shown by the red "dot". This is the server on which we forced a hard failure. All workloads switched over to run on the cluster's second server. Because vSAN is in use, no data is lost at the depot. It took approximately 1 minute for the workload to be back up and running on the second depot server after the failure, with no human intervention.
During the failover period, the offloading buses see that the depot is temporarily unavailable, so they pause the offload and wait until the destination is back online. This behavior is similar to what happens when a bus moves away from the depot. Once the depot connection resumes, the offload continues as before. See below for an example of the 5G traffic during an HA event:
This graph shows performance data taken from the Cradlepoint while offloading video over 5G. The spike in the middle corresponds to the HA event. The EPIC iO UIG accelerates the rate of data transfer to catch up on the minute during which the depot receiver was unavailable. This short spike is expected. The data transfer rate quickly returns to its previous level and continues at a steady state as normal.
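The pause-and-resume behavior described above can be illustrated with a generic retry loop. This is a sketch of the general pattern only and does not represent the EPIC iO implementation: the `offload` function, the `send` callback, and the retry delay are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Sequence


def offload(
    chunks: Sequence[bytes],
    send: Callable[[bytes], None],
    retry_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send every chunk in order, pausing while the destination is unreachable.

    Returns the number of retries that were needed.
    """
    retries = 0
    index = 0
    while index < len(chunks):
        try:
            send(chunks[index])
        except ConnectionError:
            # Destination is offline: keep our place, wait, then retry the same chunk.
            retries += 1
            sleep(retry_delay)
            continue
        index += 1  # Advance only after the chunk was sent without error.
    return retries
```

Because the loop advances only after a successful send, an outage of any length delays the transfer but does not skip data, which matches the observation that the offload continues as before once the depot is reachable again.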
Findings
- Having HA capabilities from VMware makes it possible to build a resilient system at the depot.
- An entire server can fail unexpectedly or go down for maintenance without stalling the bus offload procedure.
- The local cache on the bus can bridge the time taken (approximately 1 minute) for the depot VM to reinstantiate on the working server.
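To reason about how much data the bus must hold during a failover and how long the subsequent catch-up burst lasts, a back-of-the-envelope estimate can help. The sketch below assumes that data queues at the steady transfer rate for the whole outage and drains at a constant, higher burst rate afterward. The function names are hypothetical, and the rates in the usage comment are illustrative only, not measurements from this test.

```python
def backlog_bytes(steady_rate_bps: float, outage_seconds: float) -> float:
    """Bytes that queue on the bus while the depot is unreachable (assumed model)."""
    return steady_rate_bps / 8 * outage_seconds


def catch_up_seconds(
    steady_rate_bps: float, burst_rate_bps: float, outage_seconds: float
) -> float:
    """Seconds of transfer at burst_rate_bps needed to clear the backlog.

    Data keeps queuing at the steady rate while the backlog drains, so the
    backlog shrinks at (burst_rate_bps - steady_rate_bps).
    """
    if burst_rate_bps <= steady_rate_bps:
        raise ValueError("burst rate must exceed steady rate to catch up")
    backlog_bits = steady_rate_bps * outage_seconds
    return backlog_bits / (burst_rate_bps - steady_rate_bps)


# Illustrative values only: 40 Mbit/s steady, 120 Mbit/s burst, 60 s outage.
#   backlog_bytes(40e6, 60)           -> 300000000.0 (about 300 MB to cache)
#   catch_up_seconds(40e6, 120e6, 60) -> 30.0
```

Under these assumptions, the required cache grows linearly with the outage duration, so a longer recovery time would call for proportionally more local storage on the bus.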