MLCommons™ is an open engineering consortium of deep learning experts focused on improving the machine learning process in areas such as system acceleration, ease of access, and standardization of deep learning tooling. It was founded by experts from large companies, startups, academia, and research institutes. MLCommons hosts the MLPerf benchmarks, which aim to be effective benchmarks that correlate with the real-world use cases customers face regularly in their data center settings.
The MLPerf Training benchmark aims to be a fair representation of a workload that not only demands high throughput and heavy compute, but must also reach an expected target convergence metric. Its primary metric is time to convergence. MLPerf does not use system throughput as the primary metric because throughput can be inflated simply by increasing the batch size, whereas convergence cannot be manipulated so easily.
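To make the metric concrete, the following is a minimal sketch of measuring time to convergence: train until the quality target is reached and report the elapsed wall-clock time. The `train_one_epoch` and `evaluate` callables and the `max_epochs` cap are hypothetical placeholders, not part of the official MLPerf harness.

```python
import time

TARGET_ACCURACY = 0.7590  # e.g., the ResNet-50 v1.5 target (see Table 1 below)

def run_benchmark(model, train_one_epoch, evaluate, max_epochs=100):
    """Train until the quality target is reached; return wall-clock seconds.

    `train_one_epoch` and `evaluate` are hypothetical callables standing in
    for the benchmark's real training and evaluation loops.
    """
    start = time.time()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)
        accuracy = evaluate(model)
        if accuracy >= TARGET_ACCURACY:
            elapsed = time.time() - start
            print(f"Converged at epoch {epoch} in {elapsed:.1f} s")
            return elapsed
    raise RuntimeError("Did not reach the quality target; the run is invalid")
```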
MLPerf Training includes two divisions: Open and Closed.
The Open division allows new model architectures, model update mechanisms (optimizers), and other research approaches, provided the submission is still expected to reach the target quality (accuracy).
Our submission to MLPerf Training is in the Closed division, which is designed so that submissions can be compared like-for-like across Closed division submitters. The Closed division requires submissions to follow the same dataset preprocessing, model, training method, and quality target as the reference implementation. For instance, hyperparameters must match the reference implementation, including the optimizer used and values such as regularization norms and weight decays. The fp64, fp32, tf32, fp16, fp8, bfloat16, int8, uint8, int4, and uint4 numerical formats are preapproved for use; additional formats require explicit approval.
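As an illustration of training with one of the preapproved formats, the sketch below runs a forward and backward pass under bfloat16 autocast in PyTorch. The tiny model and random batch are placeholders for illustration only and are not taken from any reference implementation.

```python
import torch
import torch.nn as nn

# Placeholder model and batch, for illustration only.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
inputs = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# bfloat16 is one of the preapproved numerical formats; autocast runs
# eligible ops in bf16 while the parameters themselves stay in fp32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = loss_fn(model(inputs), labels)

loss.backward()
optimizer.step()
```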
The following table lists the benchmarks available for submission and their corresponding quality targets. For a submission to be valid, the model must converge to the specified quality target.
Table 1. Available benchmarks in MLPerf Training v3.0
| Area | Problem | Model | Target |
| --- | --- | --- | --- |
| Vision | Image classification | ResNet-50 v1.5 | 75.90% classification accuracy |
| Vision | Image segmentation (medical) | U-Net3D | 0.908 Mean DICE score |
| Vision | Object detection (lightweight) | SSD (RetinaNet) | 34.0% mAP |
| Vision | Object detection (heavyweight) | Mask R-CNN | 0.377 Box min AP and 0.339 Mask min AP |
| Language | Speech recognition | RNN-T | 0.058 Word Error Rate |
| Language | NLP | BERT | 0.720 Mask-LM accuracy |
| Language | Large language model | GPT3 | 2.69 log perplexity |
| Commerce | Recommendation | DLRMv2 (DCNv2) | 0.80275 AUC |
The following table shows the minimum number of expected runs for a valid benchmark in the Closed division:
Table 2. Minimum number of expected runs
| Area | Problem | Minimum number of runs |
| --- | --- | --- |
| Vision | Image classification | 5 |
| Vision | Image segmentation (medical) | 40 |
| Vision | Object detection (lightweight) | 5 |
| Vision | Object detection (heavyweight) | 5 |
| Language | NLP | 10 |
| Language | Speech recognition | 10 |
| Language | Large language model | 3 |
| Commerce | Recommendation | 10 |
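As a sketch of how these requirements might be enforced and summarized in a results pipeline, the snippet below encodes Table 2 and averages per-run training times. The benchmark keys and the simple mean are illustrative assumptions; the official scoring procedure is defined by the MLPerf Training rules.

```python
# Table 2 encoded as a lookup; the keys are hypothetical identifiers.
MIN_RUNS = {
    "image_classification": 5,
    "image_segmentation": 40,
    "object_detection_light": 5,
    "object_detection_heavy": 5,
    "nlp": 10,
    "speech_recognition": 10,
    "large_language_model": 3,
    "recommendation": 10,
}

def summarize_runs(benchmark: str, run_times_minutes: list[float]) -> float:
    """Check the minimum-run requirement and return a simple mean time.

    The mean is an illustrative stand-in; the actual MLPerf scoring rules
    define how per-run results are aggregated into a reported score.
    """
    required = MIN_RUNS[benchmark]
    if len(run_times_minutes) < required:
        raise ValueError(f"{benchmark} needs at least {required} runs, "
                         f"got {len(run_times_minutes)}")
    return sum(run_times_minutes) / len(run_times_minutes)

# Example: five hypothetical image-classification run times, in minutes.
print(summarize_runs("image_classification", [13.2, 13.5, 13.1, 13.8, 13.4]))
```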
The MLPerf Training suite uses a compliance checker to ensure that submissions are made fairly. This check is performed by a package checker and an RCP (reference convergence point) checker. The package checker verifies that the submission package meets the criteria defined by the Closed division rules.
The RCP checker ensures that the convergence of the submitted benchmark does not deviate from the convergence of the reference; its purpose is to catch cases where a submission converges faster than the reference. The reference implementation's convergence sets a lower bound on the number of epochs to convergence, and a valid submission must not converge in fewer epochs than that bound.
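The sketch below captures the idea behind the RCP check rather than the actual checker shipped by MLCommons: compare the submission's epochs-to-converge against a bound derived from the reference runs. The simple mean comparison and the `tolerance` parameter are illustrative assumptions; the real checker's comparison is defined by the MLPerf compliance tooling.

```python
from statistics import mean

def rcp_check(submission_epochs, reference_epochs, tolerance=0.0):
    """Return True if the submission converges no faster than the reference allows.

    Both arguments are lists of epochs-to-converge from multiple runs. The
    mean comparison and `tolerance` are illustrative only; the actual RCP
    checker's comparison is defined by the MLPerf compliance tooling.
    """
    lower_bound = mean(reference_epochs) - tolerance
    return mean(submission_epochs) >= lower_bound

# A submission converging in noticeably fewer epochs than the reference fails.
print(rcp_check([30, 31, 30], [33, 34, 33]))  # False: suspiciously fast
print(rcp_check([33, 34, 34], [33, 34, 33]))  # True: within the expected range
```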