For this benchmark, we used two comparators to assess the effectiveness of our platform. For the training component of the analysis, we compared our results to NVIDIA’s published DGX-1 training results (NVIDIA MLPerf Training v0.6 Results, published July 2019: https://mlperf.org/training-results-0-6).
For the inference analysis, we compared our results to the results of the PowerEdge R740xd with four T4 GPUs (Dell EMC MLPerf Inference v0.5 Results, published November 2019: https://mlperf.org/inference-results/). A notable difference between our tests and the published comparison tests is that our stack included an additional software layer: OpenShift Container Platform.
In addition to comparing raw numbers, we also present our results on a per-GPU basis, normalizing overall system performance by GPU count. For the training comparison, the PowerEdge C4140 platform has four V100 GPUs, while the NVIDIA DGX-1 has eight V100 GPUs (both with NVLink). For the inference comparison, we compare a PowerEdge R740xd with one T4 GPU to a PowerEdge R740xd with four T4 GPUs.
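The per-GPU normalization described above can be sketched as follows. This is an illustrative sketch, not the report's actual calculation: the metric names, function names, and the sample numbers in the usage example are placeholders, not measured MLPerf results. It assumes throughput-style metrics (higher is better) are divided by GPU count, while time-to-train (lower is better) is expressed as GPU-minutes so systems with different GPU counts can be compared.

```python
# Illustrative per-GPU normalization (all values below are placeholders,
# not measured MLPerf results).

def per_gpu_throughput(samples_per_sec: float, gpu_count: int) -> float:
    """Normalize a system-level throughput metric (higher is better) by GPU count."""
    return samples_per_sec / gpu_count

def gpu_minutes(time_to_train_min: float, gpu_count: int) -> float:
    """Express time-to-train as GPU-minutes (lower is better) so systems
    with different GPU counts can be compared on equal footing."""
    return time_to_train_min * gpu_count

if __name__ == "__main__":
    # Placeholder numbers: a 4-GPU system vs. an 8-GPU system.
    print(per_gpu_throughput(4000.0, 4))  # per-GPU throughput of a 4-GPU system
    print(gpu_minutes(120.0, 8))          # GPU-minutes for an 8-GPU system
```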
Our MLPerf Training and Inference results show that running containerized models on Red Hat OpenShift Container Platform software and Dell EMC hardware is similar to, and sometimes faster than, the NVIDIA published results on a per-GPU basis. Our goal was to match the NVIDIA published training results and the Dell EMC published inference results after normalizing to GPU count.
Our results demonstrate that AI and machine learning training and inference workloads can be performed efficiently and cost-effectively by combining Red Hat OpenShift Container Platform and Dell EMC PowerEdge servers with NVIDIA GPUs. They also demonstrate that incorporating OpenShift Container Platform into the stack adds little overhead. Further, we demonstrate that it is feasible to run different workload types on OpenShift Container Platform and achieve efficient performance by incorporating different GPU models in the same cluster. OpenShift Container Platform and the GPU Operator enable scheduling of tasks to the hardware resources in the cluster best suited for each job. We performed neural network training on the PowerEdge C4140 server equipped with NVIDIA V100 GPUs, enabling fast training times. We ran inference on the PowerEdge R740xd equipped with NVIDIA T4 GPUs, which are optimized for low-latency, power-efficient inference in the data center.
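The steering of inference work to T4-equipped nodes described above can be sketched as a Kubernetes pod spec with a node selector. This is an illustrative config fragment, not the configuration used in our tests: the pod name and image are placeholders, and the nvidia.com/gpu.product node label is the one published by NVIDIA's GPU Feature Discovery (part of the GPU Operator), so its exact value depends on the operator and driver versions in the cluster.

```yaml
# Illustrative pod spec (assumption: GPU Feature Discovery node labels such as
# nvidia.com/gpu.product are present; the pod name and image are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: t4-inference        # placeholder name
spec:
  containers:
  - name: inference
    image: inference-image  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1   # request one GPU via the NVIDIA device plugin
  nodeSelector:
    nvidia.com/gpu.product: Tesla-T4  # pin the pod to nodes with T4 GPUs
```

A training pod could use the same pattern with a label value matching the V100-equipped C4140 nodes, letting one cluster serve both workload types.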