MLPerf is a benchmark suite used to evaluate the training and inference performance of on-premises and cloud platforms. It is intended as an independent, objective performance yardstick for machine learning software frameworks, hardware platforms, and cloud platforms. A consortium of AI researchers and developers from more than 30 organizations developed these benchmarks and continues to evolve them. The goal of MLPerf is to give developers a way to evaluate hardware architectures and the wide, rapidly advancing range of machine learning frameworks.
The MLPerf consortium published its results for Training v0.6 and Inference v0.5 in July 2019 and November 2019, respectively. The MLPerf Training benchmarking suite measures the time it takes to train machine learning models to a target level of accuracy. MLPerf Inference benchmarks measure how quickly a trained neural network can perform inference tasks on new data.
The MLPerf Training and Inference benchmarks that we ran on OpenShift in the Dell lab perform human language translation and object detection. These models are deep neural networks trained in an iterative learning process: training data is passed through each model to adjust its weights so that it “learns” to translate language or detect objects in images.
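The training metric described above can be made concrete with a minimal sketch that times how long a small PyTorch model takes to reach a target accuracy on synthetic data. The model, the data, and the 0.90 target below are illustrative placeholders, not MLPerf reference values.

```python
# Minimal sketch of the "time to target accuracy" metric, assuming a
# PyTorch-style model and synthetic data; these are placeholders, not
# the MLPerf reference model, dataset, or quality target.
import time
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1000, 20)                      # synthetic features
y = (X.sum(dim=1) > 0).long()                  # synthetic binary labels
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

TARGET_ACCURACY = 0.90                         # quality target (placeholder)
start = time.time()
for epoch in range(100):                       # iterative learning process
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()                            # adjust model weights
    optimizer.step()

    with torch.no_grad():                      # check quality after each pass
        accuracy = (model(X).argmax(dim=1) == y).float().mean().item()
    if accuracy >= TARGET_ACCURACY:            # stop once the target is reached
        break

print(f"Reached {accuracy:.2%} accuracy in {time.time() - start:.1f} s "
      f"({epoch + 1} epochs)")                 # time to train is the score
```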
Model training is typically the most computationally intensive step of the machine learning workflow. Because it can take days or weeks to train a model, reducing training time enables data scientists to innovate faster. After a model is developed and trained, it is deployed into a production environment, where it takes in new data and performs inference to answer questions. In a production machine learning cycle, the inference phase is where AI and machine learning go to work: identifying medical conditions from CAT scans, detecting obstacles and road signs for self-driving cars, helping robots identify objects, translating phrases between source and target languages, and so on.
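The deploy-then-infer step might look like the following sketch, in which a PyTorch model (standing in for a trained one) is exported as a TorchScript artifact and then loaded on the production side to answer a single request. The file name, input shape, and model are assumptions for illustration, not part of the MLPerf reference implementations.

```python
# Minimal sketch of deploying a trained model and running inference on
# new data; the model, file name, and shapes are illustrative assumptions.
import torch
import torch.nn as nn

# Export a model (standing in for the trained one) as a deployable artifact.
trained = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
torch.jit.script(trained).save("model.pt")

# --- production side: load the artifact and answer a request ---
model = torch.jit.load("model.pt")
model.eval()
new_sample = torch.randn(1, 20)               # new data arriving in production
with torch.no_grad():                         # no weight updates at inference time
    prediction = model(new_sample).argmax(dim=1).item()
print(f"Predicted class: {prediction}")
```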
The MLPerf Inference benchmark is intended as an objective way to measure inference performance both in the data center and at the edge. Each benchmark has four measurement scenarios: server, offline, single-stream, and multi-stream. The following figure depicts the basic structure of the MLPerf Inference benchmark and the order in which data flows through the system:
Figure 8. Four MLPerf inference scenarios (source: Nvidia.com)
Server and offline scenarios are most relevant for data center use cases, while single-stream and multi-stream scenarios evaluate the workloads of edge devices. In the server scenario, input arrives randomly (using a Poisson distribution), as it would in an online translation service. In the offline scenario, all the data is immediately available, representing batch processing applications such as the task of identifying people and locations in photo albums. To test our reference architecture, we ran the MLPerf Inference server and offline scenarios, which are intended for data center evaluation.
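As a rough illustration of how these two data center scenarios exercise a system under test, the following sketch generates Poisson-distributed query arrivals for the server case and a single large batch for the offline case. The infer() stub, the target rate, and the sample count are placeholders; this is not the actual MLPerf LoadGen.

```python
# Minimal sketch of the two data center scenarios: Poisson-distributed
# query arrivals for "server" and one large batch for "offline".
# The infer() stub and rates below are illustrative placeholders.
import random
import time

def infer(batch):
    """Stand-in for the system under test (e.g., a translation model)."""
    time.sleep(0.002 * len(batch))               # pretend each sample costs 2 ms
    return [f"result-{i}" for i in batch]

samples = list(range(64))

# Server scenario: queries arrive one at a time with Poisson-distributed
# inter-arrival times; the metric of interest is latency at a target rate.
target_qps = 100
latencies = []
for sample in samples:
    time.sleep(random.expovariate(target_qps))   # Poisson arrival process
    start = time.time()
    infer([sample])
    latencies.append(time.time() - start)
p90 = sorted(latencies)[int(0.9 * len(latencies))]
print(f"server: p90 latency {p90 * 1e3:.1f} ms")

# Offline scenario: all samples are available up front and can be batched;
# the metric of interest is total throughput (samples per second).
start = time.time()
infer(samples)
print(f"offline: {len(samples) / (time.time() - start):.0f} samples/s")
```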