
Interpreting TPCx-AI Benchmark Results
Wed, 01 Feb 2023 14:29:11 -0000
TPCx-AI Benchmark
Overview
The TPCx-AI benchmark abstracts the diversity of operations in a retail data center scenario. Choosing a retail business model helps the reader relate intuitively to the components of the benchmark without tying the benchmark tightly to that industry segment. Such tight tracking would reduce the relevance of the benchmark. The TPCx-AI benchmark can be used to characterize any industry that must transform operational and external data into business intelligence.
This paper introduces the TPCx-AI benchmark and uses a published TPCx-AI result to describe how the primary metrics are determined and how they should be read.
Benchmark model
TPCx-AI data science pipeline
The TPCx-AI benchmark imitates the activity of retail businesses and data centers with:
- Customer information
- Department stores
- Sales
- Financial data
- Product catalog and reviews
- Emails
- Data center logs
- Facial images
- Audio conversations
It models the challenges of end-to-end artificial intelligence systems and pipelines where the power of machine learning and deep learning is used to:
- Detect anomalies (fraud and failures)
- Drive AI-based logistics optimizations to reduce costs through real-time forecasts (classification, clustering, forecasting, and prediction)
- Use deep learning AI techniques for customer service management and personalized marketing (facial recognition and speech recognition)
It consists of ten different use cases that help any retail business data center address and manage any business analysis environment.
The TPCx-AI kit uses the Parallel Data Generator Framework (PDGF) to generate the test dataset. To mimic the datasets of companies of different sizes, the user specifies a scale factor (SF), a configuration parameter that sets the target input dataset size in GB. For example, SF=100 corresponds to 100 GB. Once generated, all the data is processed by the subsequent stages of the data science pipeline.
Use cases
The TPCx-AI Benchmark models the following use cases:
Figure 1: TPCx-AI benchmark use case pipeline flow
Table 1: TPCx-AI benchmark use cases
ID | Use case | ML/DL | Area | Algorithm |
UC01 | Customer Segmentation | ML | Analytics | K-Means |
UC02 | Conversation Transcription | DL | NLP | Recurrent Neural Network |
UC03 | Sales Forecasting | ML | Analytics | ARIMA |
UC04 | Spam Detection | ML | Analytics | Naïve Bayes |
UC05 | Price Prediction | DL | NLP | RNN |
UC06 | Hardware Failure Detection | ML | Analytics | Support Vector Machines |
UC07 | Product Rating | ML | Recommendation | Alternating Least Squares |
UC08 | Trip Type Classification | ML | Analytics | XGBoost |
UC09 | Facial Recognition | DL | Analytics | Logistic Regression |
UC10 | Fraud Detection | ML | Analytics | Logistic Regression |
Benchmark run
The TPCx-AI Benchmark run consists of seven separate tests run sequentially. The tests are listed below:
- Data Generation using PDGF
- Load Test – Loads data into persistent storage (HDFS or other file systems)
- Power Training Test – Generates and trains models
- Power Serving Test I – Uses the models trained in the Power Training Test to run the serving (inference) phase for each use case
- Power Serving Test II – Repeats the serving phase. The two serving tests run sequentially, and the one with the greater geometric mean (geomean) of serving times is used in the overall score.
- Scoring Test – Model validation stage. Accuracy of the model is determined using defined accuracy metrics and criteria
- Throughput Test – Runs two or more concurrent serving streams
The elapsed time for each test is reported.
Note: Seven benchmark phases span the end-to-end data science pipeline, as shown in Figure 1. For a compliant performance run, the data generation phase is run but not scored. The scored run consists of the subsequent six tests, Load Test through Throughput Test, run sequentially.
Primary metrics
For every result, the TPC requires the publication of three primary metrics:
- Performance
- Price-Performance
- Availability Date
Performance metric
Not all scenarios in TPCx-AI are applicable to all users. To account for this, the performance metric is defined so that no single scenario dominates it. The primary performance metric is throughput, expressed in AI use cases per minute at a given scale factor (AIUCpm@SF), as defined in the figure below.
Figure 2: Definition of the TPCx-AI benchmark metric
Where:
TLD = Load time
TPTT = Geomean of training times
TPST1 = Geomean of serving times in Power Serving Test I
TPST2 = Geomean of serving times in Power Serving Test II
TPST = Max (TPST1, TPST2)
TTT = Total elapsed time of the Throughput Test / (number of streams * number of use cases)
N = Number of use cases
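For readers who do not have Figure 2 at hand, the definition can be written out as follows. This is a reconstruction from the symbol list above and from the worked calculation later in this paper (where SF = 10 and N = 10), so treat it as a sketch and refer to the TPCx-AI specification for the authoritative form. The factor of 60 converts seconds to minutes.

```latex
\[
\mathrm{AIUCpm@SF} \;=\; \frac{SF \times N \times 60}
{\sqrt[4]{\,T_{LD} \times T_{PTT} \times T_{PST} \times T_{TT}\,}}
\]
```

Because the denominator is a geometric mean of the four phase times, a proportional change in any one phase affects the score equally, which is how the metric avoids being dominated by a single phase.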
Note: The elapsed time for the scoring test is not considered for the calculation of the performance metric. Instead, the results of the scoring test are used to determine whether the Performance test was successful.
The scoring test result for each use case must be at least as good as the reference result provided in the kit, as shown in the figure below.
Figure 3: Benchmark run accuracy metrics
Calculating the Performance metric
To illustrate how the performance metric is calculated, let us consider the results published for SF=10 at:
https://www.tpc.org/tpcx-ai/results/tpcxai_result_detail5.asp?id=122110802
A portion of the TPCx-AI result highlights, which shows the elapsed time for the six sequential tests that constitute the benchmark run, is shown in the figure below.
Figure 4: Elapsed time for the benchmark test phases
The result highlights provide only the elapsed training and serving times for each test. Calculating the final performance metric requires the geometric mean of the training times and of the serving times, which in turn requires the time taken for each use case. Those times are provided in the Full Disclosure Report (FDR) that is part of the benchmark results. The FDR for the SF=10 result that we are considering is at:
https://www.tpc.org/results/fdr/tpcxai/dell~tpcxai~10~dell_poweredge_r7615~fdr~2022-11-09~v01.pdf
The use case times and accuracy table from the FDR are shown in the figure below.
Figure 5: Use case times and accuracy
Note: The accuracy metrics are defined in Table 7a of the TPCx-AI User Guide.
Using the data in Figure 4 and Figure 5:
TLD = Load time = 2.306 seconds
TPTT = Geomean of training times = 316.799337 seconds
(119.995 * 2104.383 * 113.122 * 89.595 * 974.454 * 424.76 * 26.14 * 4928.427 * 29.112 * 253.63)^(1/10)
TPST1 = Geomean of serving times in Power Serving Test I = 19.751 seconds
(10.025 * 8.949 * 4.405 * 12.05 * 4.489 * 144.016 * 4.254 * 396.486 * 75.706 * 22.987)^(1/10)
TPST2 = Geomean of serving times in Power Serving Test II = 19.893 seconds
(10.043 * 8.92 * 4.39 * 12.288 * 4.622 * 148.551 * 4.275 * 396.099 * 75.508 * 22.881)^(1/10)
TPST = Max (TPST1, TPST2) = 19.893 seconds
TTT = Total elapsed time / (number of streams * number of use cases) = 2748.071 / (100 * 10) = 2.748 seconds
N = Number of use cases = 10
Note: The geometric mean is arrived at by multiplying the time taken for each of the use cases and finding the 10th root of the product.
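The geomean step can be checked with a few lines of Python. The following is a minimal sketch that uses the per-use-case serving times quoted above; the variable names are illustrative and are not part of the TPCx-AI kit.

```python
from statistics import geometric_mean

# Per-use-case serving times in seconds, in the order quoted above.
serving_test_1 = [10.025, 8.949, 4.405, 12.05, 4.489,
                  144.016, 4.254, 396.486, 75.706, 22.987]
serving_test_2 = [10.043, 8.92, 4.39, 12.288, 4.622,
                  148.551, 4.275, 396.099, 75.508, 22.881]

# Geomean = 10th root of the product of the ten times.
tpst1 = geometric_mean(serving_test_1)
tpst2 = geometric_mean(serving_test_2)

# The slower (larger) of the two serving geomeans enters the metric.
tpst = max(tpst1, tpst2)

print(f"TPST1 = {tpst1:.3f} s, TPST2 = {tpst2:.3f} s, TPST = {tpst:.3f} s")
```

Using `statistics.geometric_mean` (Python 3.8 and later) avoids forming the very large intermediate product by hand.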
Plugging these values into the formula for AIUCpm@SF given in Figure 2, we get:
AIUCpm@SF = 10 * 10 * 60 / (2.306 * 316.799 * 19.893 * 2.748)^(1/4)
= 6000 / (39935.591)^(1/4)
= 6000 / 14.1365 = 424.433
The published result is AIUCpm@SF10 = 425.31.
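The same arithmetic can be reproduced in code. The sketch below assumes the metric has the form used in the calculation above (scale factor times number of use cases times 60, divided by the fourth root of the product of the four phase times); the function and parameter names are hypothetical.

```python
def aiucpm(sf, n_use_cases, t_load, t_train, t_serve, t_throughput):
    """Return AI use cases per minute at scale factor sf (all times in seconds)."""
    product = t_load * t_train * t_serve * t_throughput
    return sf * n_use_cases * 60 / product ** 0.25


n_use_cases = 10
streams = 100

# TTT: throughput test elapsed time normalized per stream and per use case.
t_throughput = 2748.071 / (streams * n_use_cases)

score = aiucpm(
    sf=10,
    n_use_cases=n_use_cases,
    t_load=2.306,        # TLD
    t_train=316.799,     # TPTT, as stated above
    t_serve=19.893,      # TPST = max(TPST1, TPST2)
    t_throughput=t_throughput,
)
print(f"AIUCpm@SF10 = {score:.2f}")
```

Keeping the four phase times as separate arguments makes it easy to see how much a change in any single phase would move the final score.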
Calculating the Price-Performance metric
The Price-Performance metric is defined in the figure below.
Figure 6: Price-Performance metric definition
Where:
- P is the price of the hardware and software components in the System Under Test (SUT)
- AIUCpm@SF is the reported primary performance metric
Note: A one-year pricing model must be used to calculate the price and the price-performance result of the TPCx-AI Benchmark.
For the SF=10 result:
AIUCpm@SF10 = 425.31
Price of the configuration = $48,412
$/AIUCpm@SF10 = 48,412 / 425.31 = 113.83 USD per AIUCpm@SF10
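As a quick check, the price-performance figure can be recomputed from the two published numbers. This sketch assumes the metric is simply the price divided by the performance metric, which is consistent with the values quoted above; the function name is illustrative.

```python
def price_performance(price_usd, aiucpm_at_sf):
    """Return the price in USD per unit of AIUCpm@SF."""
    return price_usd / aiucpm_at_sf


ratio = price_performance(price_usd=48412, aiucpm_at_sf=425.31)
print(f"{ratio:.2f} USD per AIUCpm@SF10")
```

A lower value is better: it means less money spent per unit of measured throughput.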
Availability date
All components used in this result will be orderable and available for shipping by February 22, 2023.
Performance results
Dell has published six world record-setting results based on the TPCx-AI Benchmark standard of the TPC. Links to the publications are provided below.
SF1000
Dell PowerEdge R650/Intel Xeon Gold (Ice Lake) 6348/CDP 7.1.7—11 nodes
https://www.tpc.org/tpcx-ai/results/tpcxai_result_detail5.asp?id=122120101
SF300
Dell PowerEdge R6625/AMD EPYC Genoa 9354/CDP 7.1.7—four nodes
https://www.tpc.org/tpcx-ai/results/tpcxai_result_detail5.asp?id=122110805
SF100
Dell PowerEdge R6625/AMD EPYC Genoa 9354/CDP 7.1.7—four nodes
https://www.tpc.org/tpcx-ai/results/tpcxai_result_detail5.asp?id=122110804
SF30
Dell PowerEdge R6625/AMD EPYC Genoa 9174F/Anaconda3—one node
https://www.tpc.org/tpcx-ai/results/tpcxai_result_detail5.asp?id=122110803
SF10
Dell PowerEdge R7615/AMD EPYC Genoa 9374F/Anaconda3—one node
https://www.tpc.org/tpcx-ai/results/tpcxai_result_detail5.asp?id=122110802
SF3
Dell PowerEdge R7615/AMD EPYC Genoa 9374F/Anaconda3—one node
https://www.tpc.org/tpcx-ai/results/tpcxai_result_detail5.asp?id=122110801
With these results, Dell Technologies holds the following world records on the TPCx-AI Benchmark Standard:
- #1 Performance and Price-Performance on SF1000
- #1 Performance and Price-Performance on SF300
- #1 Performance and Price-Performance on SF100
- #1 Performance and Price-Performance on SF30
- #1 Performance on SF10
- #1 Performance and Price-Performance on SF3
Conclusion
Summary
This blog describes the TPCx-AI benchmark and how the performance result of the TPCx-AI Benchmark can be interpreted. It also describes how Dell Technologies maintains leadership in the TPCx-AI landscape.
Related Blog Posts

Dell Servers Excel in MLPerf™ Training v2.1
Wed, 16 Nov 2022 10:07:33 -0000
Dell Technologies has completed the successful submission of MLPerf Training, which marks the seventh round of submission to MLCommons™. This blog provides an overview and highlights the performance of the Dell PowerEdge R750xa, XE8545, and DSS8440 servers that were used for the submission.
What’s new in MLPerf Training v2.1?
This round of submission does not include new benchmarks or changes in the existing benchmarks. A change is introduced in the submission compliance checker.
This round adds one-sided normalization to the checker to reduce variance in the number of steps to converge. This change means that if a result converges faster than the RCP mean within a certain range, the checker normalizes the results to the RCP mean. This normalization was not available in earlier rounds of submission.
What’s new in MLPerf Training v2.1 with Dell submissions?
For the Dell submission to MLPerf Training v2.1, we included:
- Improved performance with BERT and Mask R-CNN models
- Minigo submission results on Dell PowerEdge R750xa server with A100 PCIe GPUs
Overall Dell Submissions
Figure 1. Overall submissions for all Dell PowerEdge servers in MLPerf Training v2.1
Figure 1 shows our submissions, in which the workloads span image classification, lightweight and heavy object detection, speech recognition, natural language processing, recommender systems, medical image segmentation, and reinforcement learning. The submissions used different NVIDIA GPUs: the A100 in PCIe and SXM4 form factors with 40 GB and 80 GB of VRAM, and the A30.
The Minigo on the PowerEdge R750xa server is a first-time submission, and it takes around 516 minutes to run to target quality. That submission has 4x A100 PCIe 80 GB GPUs.
Our results have increased in count from 41 to 45. This increased number of submissions helps customers see the performance of the systems using different PowerEdge servers, GPUs, and CPUs. With more results, customers can expect to see the influence of using different hardware settings that can play a vital role in time to convergence.
We have earned several winning titles that demonstrate the higher performance of our systems relative to other submitters, starting with the highest number of results across all submitters. Other titles include the top position in time to converge for BERT, ResNet, and Mask R-CNN with our PowerEdge XE8545 server powered by NVIDIA A100-40GB GPUs.
Improvement in Performance for BERT and Mask R-CNN
Figure 2. Performance gains from MLPerf v2.0 to MLPerf v2.1 running BERT
Figure 2 shows the improvements seen with the PowerEdge R750xa and PowerEdge XE8545 servers with A100 GPUs from MLPerf Training v2.0 to MLPerf Training v2.1 running the BERT language model workload. The PowerEdge XE8545 server with A100-80GB has the fastest time to convergence and the highest improvement at 13.1 percent, followed by the PowerEdge XE8545 server with A100-40GB at 7.74 percent and the PowerEdge R750xa server with A100-PCIe at 5.35 percent.
Figure 3. Performance gains from MLPerf v2.0 to MLPerf v2.1 running Mask R-CNN
Figure 3 shows the improvements seen with the PowerEdge XE8545 server with A100 GPUs. There is a 3.31 percent improvement in time to convergence with MLPerf v2.1.
For both BERT and Mask R-CNN, the improvements are software-based. These results show that software-only improvements can reduce convergence time. Customers can benefit from similar improvements without any changes in their hardware environment.
The following sections compare the performance differences between SXM and PCIe form factor GPUs.
Performance Difference Between PCIe and SXM4 Form Factor with A100 GPUs
Figure 4. SXM4 form factor compared to PCIe for BERT
Figure 5. SXM4 form factor compared to PCIe for Resnet50 v1.5
Figure 6. SXM4 form factor compared to PCIe for RNN-T
Table 1: Results for the PCIe and SXM4 form factors
System | BERT | Resnet50 | RNN-T |
R750xax4A100-PCIe-80GB | 48.95 | 61.27 | 66.19 |
XE8545x4A100-SXM-80GB | 32.79 | 54.23 | 55.08 |
Percentage difference | 39.54% | 12.19% | 18.32% |
Figures 4, 5, and 6 and Table 1 show that SXM form factor is faster than the PCIe form factor for BERT, Resnet50 v1.5, and RNN-T workloads.
The SXM form factor typically consumes more power and is faster than PCIe. For the above workloads, the minimum percentage improvement in convergence that customers can expect is in double digits, ranging from approximately 12 percent to 40 percent, depending on the workload.
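The percentages in Table 1 are consistent with a symmetric percentage difference, that is, the absolute difference divided by the mean of the two values. The post does not state the formula, so the following sketch rests on that inference; the function name is illustrative, and the values are copied from Table 1 without assuming their units.

```python
def percentage_difference(a, b):
    """Symmetric percentage difference: |a - b| relative to the mean of a and b."""
    return abs(a - b) / ((a + b) / 2) * 100


# (PCIe result, SXM result) pairs from Table 1.
results = {
    "BERT": (48.95, 32.79),
    "Resnet50": (61.27, 54.23),
    "RNN-T": (66.19, 55.08),
}

diffs = {
    name: percentage_difference(pcie, sxm)
    for name, (pcie, sxm) in results.items()
}

for name, diff in diffs.items():
    print(f"{name}: {diff:.2f}%")
```

Note that a symmetric difference is not the same as the speedup relative to the PCIe result. For BERT, for example, the reduction relative to the PCIe value alone would be (48.95 - 32.79) / 48.95, which is about 33 percent.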
Multinode Results Comparison
Multinode performance assessment is more important than ever. With the advent of large models and different parallelism techniques, customers have an ever-increasing need to find results faster. Therefore, we have submitted several multinode results to assess scaling performance.
Figure 7. BERT multinode results with PowerEdge R750xa and XE8545 servers
Figure 7 shows multinode results from three different systems with the following configurations:
- R750xa with 4 A100-PCIe-80GB GPUs
- XE8545 with 4 A100-SXM-40GB GPUs
- XE8545 with 4 A100-SXM-80GB GPUs
Each node in the systems above has four GPUs. When the graph shows eight GPUs, the performance results are derived from two nodes. Similarly, for 16 GPUs the results are derived from four nodes, and so on.
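The relationship between node count and GPU count can also be read off the MLPerf system IDs used in this post. The sketch below assumes a naming convention inferred from those IDs (an optional node-count prefix such as `2x`, the server model, then `x` followed by the GPUs per node); this convention is an assumption, not a documented MLPerf rule, and the helper name is hypothetical.

```python
import re

# Assumed pattern: optional "<nodes>x" prefix, server model, then "x<GPUs per node>".
SYSTEM_ID = re.compile(r"^(?:(\d+)x)?([A-Za-z0-9]+?)x(\d+)")


def total_gpus(system_id):
    """Return nodes * GPUs per node for an ID such as '2xXE8545x4A100-SXM-40GB'."""
    match = SYSTEM_ID.match(system_id)
    if match is None:
        raise ValueError(f"unrecognized system ID: {system_id}")
    nodes = int(match.group(1) or 1)  # no prefix means a single node
    gpus_per_node = int(match.group(3))
    return nodes * gpus_per_node


for system_id in ("XE8545x4A100-SXM-40GB",
                  "2xXE8545x4A100-SXM-40GB",
                  "32xXE8545x4A100-SXM-40GB"):
    print(system_id, "->", total_gpus(system_id), "GPUs")
```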
Figure 8. Resnet50 multinode results with R750xa and XE8545 servers
Figure 9. Mask R-CNN multinode results with R750xa and XE8545 servers
As shown in Figures 7, 8, and 9, the multinode results for BERT, Resnet50, and Mask R-CNN scale linearly or nearly linearly. This shows that Dell servers offer outstanding performance with single-node and multinode scaling.
Conclusion
The findings described in this blog show that:
- Dell servers can run all types of workloads in the MLPerf Training submission.
- Software-only enhancements reduce time to solution for our customers, as shown in our MLPerf Training v2.1 submission, and customers can expect to see improvements in their environments.
- Dell PowerEdge XE8545 and PowerEdge R750xa servers with NVIDIA A100 with PCIe and SXM4 form factors are both great selections for all deep learning models.
- PCIe-based PowerEdge R750xa servers can deliver reinforcement learning workloads in addition to other classes of workloads, such as image classification, lightweight and heavy object detection, speech recognition, natural language processing, and medical image segmentation.
- The single-node results of our submission indicate that Dell servers deliver outstanding performance, and the multinode results show well-scaled performance that helps to reduce time to solution across a distinct set of workload types. This makes Dell servers apt for both small training workloads on single nodes and large deep learning training workloads on multiple nodes.
Appendix
System Under Test
MLPerf system configurations for PowerEdge XE8545 systems
MLPerf System ID | XE8545x4A100-SXM-40GB, 2xXE8545x4A100-SXM-40GB, 4xXE8545x4A100-SXM-40GB, 8xXE8545x4A100-SXM-40GB, 16xXE8545x4A100-SXM-40GB, 32xXE8545x4A100-SXM-40GB | XE8545x4A100-SXM-80GB | 2xXE8545x4A100-SXM-80GB, 4xXE8545x4A100-SXM-80GB |
Operating system | Red Hat Enterprise Linux | Ubuntu 20.04.4 | Red Hat Enterprise Linux |
CPU | AMD EPYC 7713 | AMD EPYC 7763 | AMD EPYC 7713 |
Memory | 1 TB | 1 TB | 1 TB |
GPU | NVIDIA A100-SXM-40GB | NVIDIA A100-SXM-80GB | NVIDIA A100-SXM-80GB |
GPU form factor | SXM4 | SXM4 | SXM4 |
GPU count | 4, 8, 16, 32, 64, 128 | 4 | 4, 8 |
Networking | 2x ConnectX-6 IB HDR 200Gb/Sec | | |
Software stack | CUDA 11.6 Driver 510.47.03 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 | CUDA 11.6 Driver 510.47.03 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 | CUDA 11.6 Driver 510.47.03 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 |
MLPerf system configurations for Dell PowerEdge R750xa servers
| 2xR750xa_A100 | 8xR750xa_A100 |
MLPerf System ID | 2xR750xax4A100-PCIE-80GB | 8xR750xax4A100-PCIE-80GB |
Operating system | CentOS 8.2.2004 | |
CPU | Intel Xeon Gold 6338 | |
Memory | 512 GB | |
GPU | NVIDIA A100-PCIE-80GB | |
GPU form factor | PCIe | |
GPU count | 4,32 | |
Networking | 1x ConnectX-5 IB EDR 100Gb/Sec | |
Software stack | CUDA 11.6 Driver 470.42.01 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 | CUDA 11.6 Driver 470.42.01 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 |
MLPerf system configurations for Dell DSS 8440 servers
| DSS 8440 |
MLPerf System ID | DSS8440x8A30-NVBRIDGE |
Operating system | CentOS 8.2.2004 |
CPU | Intel Xeon Gold 6248R |
Memory | 768 GB |
GPU | NVIDIA A30 |
GPU form factor | PCIe |
GPU count | 8 |
Networking | 1x ConnectX-5 IB EDR 100Gb/Sec |
Software stack | CUDA 11.6 Driver 510.47.03 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 |

Performance of the Dell PowerEdge R750xa Server for MLPerf™ Inference v2.0
Thu, 21 Apr 2022 18:20:33 -0000
Abstract
Dell Technologies recently submitted results to the MLPerf Inference v2.0 benchmark suite. The results provide information about the performance of Dell servers. This blog takes a closer look at the Dell PowerEdge R750xa server and its performance for MLPerf Inference v1.1 and v2.0.
We compare the v1.1 results with the v2.0 results. We show the performance difference between the software stack versions. We also use the PowerEdge R750xa server to demonstrate that the v1.1 results from all systems can be referenced for planning an ML workload on systems that are not available for MLPerf Inference v2.0.
PowerEdge R750xa server
Built with state-of-the-art components, the PowerEdge R750xa server is ideal for artificial intelligence (AI), machine learning (ML), and deep learning (DL) workloads. The PowerEdge R750xa server is the GPU-optimized version of the PowerEdge R750 server. It supports up to four 300 W double-width (DW) or six 75 W single-width (SW) accelerators. The GPUs are placed at the front of the PowerEdge R750xa server, allowing for better airflow management. It has up to eight available PCIe Gen4 slots and supports up to eight NVMe SSDs.
The following figures show the PowerEdge R750xa server:
Figure 1: Front view of the PowerEdge R750xa server
Figure 2: Rear view of the PowerEdge R750xa server
Figure 3: Top view of the PowerEdge R750xa server
Configuration comparison
The following table describes the software stack configurations from the two rounds of submission for the closed data center division:
Table 1: MLPerf Inference v1.1 and v2.0 software stacks
NVIDIA component | v1.1 software stack | v2.0 software stack |
TensorRT | 8.0.2 | 8.4.0 |
CUDA | 11.3 | 11.6 |
cuDNN | 8.2.1 | 8.3.2 |
GPU driver | 470.42.01 | 510.39.01 |
DALI | 0.30.0 | 0.31.0 |
Triton | 21.07 | 22.01 |
Although the software has been updated across the two rounds of submission, performance is consistent, if not better, for the v2.0 submission. For MLPerf Inference v2.0, Triton performance results can be extrapolated from MLPerf Inference v1.1, except for the 3D U-Net benchmark, whose dataset changed in v2.0.
The following table describes the System Under Test (SUT) configurations from MLPerf Inference v1.1 and v2.0 of data center inference submissions:
Table 2: MLPerf Inference v1.1 and v2.0 system configuration of the PowerEdge R750xa server
Component | v1.1 system configuration | v2.0 system configuration |
Platform | R750xa 4x A100-PCIE-80GB, TensorRT | R750xa 4xA100 TensorRT |
MLPerf system ID | R750xa_A100-PCIE-80GBx4_TRT | R750xa_A100_PCIE_80GBx4_TRT |
Operating system | CentOS 8.2 | |
CPU | Intel Xeon Gold 6338 CPU @ 2.00 GHz | |
Memory | 1 TB | |
GPU | NVIDIA A100-PCIE-80GB | |
GPU form factor | PCIe | |
GPU count | 4 | |
Software stack | TensorRT 8.0.2 | TensorRT 8.4.0 CUDA 11.6 cuDNN 8.3.2 Driver 510.39.01 DALI 0.31.0 |
In the v1.1 round of submission, Dell Technologies submitted four different configurations on the PowerEdge R750xa server. Although the GPU count of four was maintained, Dell Technologies submitted both the 40 GB and the 80 GB versions of the NVIDIA A100 GPU. Additionally, Dell Technologies submitted Multi-Instance GPU (MIG) numbers using 28 MIG instances, each with one compute instance and the 10 GB memory profile, on the 80 GB A100 GPU. Furthermore, Dell Technologies submitted power numbers (MaxQ is a performance and power submission) for the 40 GB version of the A100 GPU and submitted results with the Triton server on the 80 GB version of the A100 GPU. A discussion about the v1.1 submission by Dell Technologies can be found in this blog.
Performance comparison of the PowerEdge R750xa server for MLPerf Inference v2.0 and v1.1
ResNet 50
ResNet50 is a 50-layer deep convolutional neural network made up of 48 convolution layers along with a single max pool layer and a single average pool layer. This model is used for computer vision applications including image classification, object detection, and object classification. For the ResNet 50 benchmark, the v2.0 submission matches the v1.1 submission in the server scenario and outperforms it in the offline scenario. As shown in the following figure, the v2.0 results are within 0.02 percent of the v1.1 results in the server scenario and outperform the previous round by 1 percent in the offline scenario:
Figure 4: MLPerf Inference v2.0 compared to v1.1 ResNet 50 per card results on the PowerEdge R750xa server
BERT
Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art language representation model for Natural Language Processing applications. This benchmark performs the SQuAD question answering task. The BERT benchmark consists of default and high accuracy modes for the offline and server scenarios. In the v2.0 round of submission, the PowerEdge R750xa server matched and slightly outperformed its performance from the previous round. In the default BERT server and offline scenarios, the extracted performance is within 0.06 and 2.33 percent respectively. In the high accuracy BERT server and offline scenarios, the extracted performance is within 0.14 and 1.25 percent respectively.
Figure 5: MLPerf Inference v2.0 compared to v1.1 BERT per card results on the PowerEdge R750xa server
SSD-ResNet 34
The SSD-ResNet 34 model falls under the computer vision category. This benchmark performs object detection. For the SSD-ResNet 34 benchmark, the results produced in the v2.0 round of submission are within 0.14 percent for the server scenario and show a 1 percent improvement in the offline scenario.
Figure 6: MLPerf Inference v2.0 compared to v1.1 SSD-ResNet 34 per card results on the PowerEdge R750xa server
DLRM
Deep Learning Recommendation Model (DLRM) is an effective benchmark for understanding workload requirements for building recommender systems. This model uses collaborative filtering and predictive analytics-based approaches to process large amounts of data. The DLRM benchmark consists of default and high accuracy modes, both containing the server and offline scenarios. For the server scenario in both the default and high accuracy modes, the v2.0 submission results are within 0.003 percent. For the offline scenario across both modes, the PowerEdge R750xa server showed a 2.62 percent performance gain.
Figure 7: MLPerf Inference v2.0 compared to v1.1 DLRM per card results on the PowerEdge R750xa server
RNNT
The Recurrent Neural Network Transducer (RNNT) model falls under the speech recognition category. This benchmark accepts raw audio samples and produces the corresponding character transcription. For the RNNT benchmark, the PowerEdge R750xa server maintained similar performance, within 0.04 percent in the server scenario, and showed a 1.46 percent performance gain in the offline scenario.
Figure 8: MLPerf Inference v2.0 compared to v1.1 RNNT per card results on the PowerEdge R750xa server
3D U-Net
The 3D U-Net performance numbers have changed in terms of scale and are not comparable in a bar graph because of a change to the dataset. The new dataset for this model is the KiTS 2019 Kidney Tumor Segmentation set. However, the PowerEdge R750xa server yielded the top results among the PCIe form factor systems that were submitted. This model falls under the computer vision category, but it specifically deals with medical image data.
Results summary
Figure 4 through Figure 8 show the consistent performance of the PowerEdge R750xa server across both rounds of submission.
The following figure shows that in the offline scenarios for the benchmarks there is a small but noticeable performance improvement:
Figure 9: Performance improvement in percentage of the PowerEdge R750xa server across MLPerf Inference v2.0 and v1.1
The small percentage deltas in the server scenarios can be a result of noise, and the results are consistent with the previous round of submission.
Conclusion
This blog confirms the consistent performance of the Dell PowerEdge R750xa server across the MLPerf Inference v1.1 and MLPerf Inference v2.0 submissions. Because an identical system from round v1.1 performed at a consistent level for MLPerf Inference v2.0, we see that the software stack upgrades had minimal impact on performance. Therefore, the optimal results from the v1.1 round of submission can be used for making informed decisions about server performance for a specific ML workload. Because Dell Technologies submitted a diverse set of configurations in the v1.1 round of submission, customers can take advantage of many results.