We used the Dell EMC Ready Stack reference architecture for OpenShift 4.3 to perform the testing described in this paper. For details about the reference architecture, see the design guide and deployment guide on the Dell Technologies Info Hub. To further enhance performance for AI and machine learning workloads, we included NVIDIA GPUs and Mellanox ConnectX-5 network cards in the server configurations for our testing.
Red Hat OpenShift Container Platform provides an enterprise-grade platform for deploying and hosting applications and runtimes using container technology. Container workloads on OpenShift are orchestrated with Kubernetes. GPU-accelerated workloads are enabled by a combination of the Node Feature Discovery Operator and the NVIDIA GPU Operator, which together ensure that containers requesting a GPU are scheduled only on nodes that can satisfy the accelerator requirement. OpenShift also provides integrated monitoring with Prometheus and Grafana, and an easy-to-use graphical interface.
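As an illustration of how a workload states its accelerator requirement, the following minimal sketch builds a Kubernetes Pod manifest that requests GPUs through the `nvidia.com/gpu` extended resource, which is the resource name the NVIDIA GPU Operator's device plugin advertises on GPU nodes. The function name, pod name, and image reference are hypothetical placeholders and are not taken from our test configuration.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a Pod manifest (as a dict) that requests `gpus` NVIDIA GPUs.

    The scheduler places the pod only on a node that advertises enough
    free `nvidia.com/gpu` resources.
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources such as GPUs are requested under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image; replace <tag> with a real NGC image tag.
    manifest = gpu_pod_manifest("gpu-demo", "nvcr.io/nvidia/tensorflow:<tag>")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `oc apply -f`, because the OpenShift and Kubernetes command-line tools accept JSON manifests as well as YAML.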
The NVIDIA GPU Cloud (NGC) containers facilitate development of workloads that use NVIDIA GPU acceleration by including packages that are useful for AI and machine learning applications. NGC containers are optimized for many machine learning and deep learning frameworks, such as TensorFlow and PyTorch. The NGC Catalog includes many AI models and examples to help you get started with various use cases, including image recognition, natural language processing, and hosting models for inference.
We validated the reference architecture for this solution in the Dell HPC & AI Innovation Lab in Austin, Texas, by running the MLPerf Training v0.6 and MLPerf Inference v0.5 machine learning benchmarks on OpenShift Container Platform with Dell EMC PowerEdge servers. Our results show that the reference architecture delivers performance that is competitive with NVIDIA's MLPerf Training and Inference results on a per-GPU basis. The published MLPerf inference results that we used for comparison were obtained with ECC (memory error correction) turned off. We left ECC turned on when we ran the benchmarks; however, previous tests at Dell suggest that turning off ECC would improve performance by 3.41 to 14.29 percent, depending on the model.
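To make the ECC caveat concrete, the following sketch applies the 3.41 to 14.29 percent range to a result measured with ECC on. It assumes that the improvement is a proportional gain in throughput, which is our reading of the figures and not a stated definition. The function name and the input value are hypothetical and are not measured results.

```python
# Range of improvement from turning off ECC, as quoted in the text.
ECC_GAIN_LOW = 0.0341
ECC_GAIN_HIGH = 0.1429


def ecc_off_estimate(throughput_ecc_on):
    """Return (low, high) estimated throughput with ECC turned off.

    Assumes the quoted percentages are proportional throughput gains.
    """
    low = throughput_ecc_on * (1 + ECC_GAIN_LOW)
    high = throughput_ecc_on * (1 + ECC_GAIN_HIGH)
    return low, high


if __name__ == "__main__":
    # Hypothetical throughput in samples per second; not a measured result.
    low, high = ecc_off_estimate(1000.0)
    print(f"Estimated ECC-off throughput: {low:.1f} to {high:.1f} samples/s")
```

Because the actual gain depends on the model, an estimate of this kind gives only a band for comparing ECC-on results with published ECC-off results.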