Dell EMC Servers Excel in MLPerf™ Training v1.0 Benchmarks
Thu, 08 Jul 2021 15:28:25 -0000
Dell Technologies has submitted MLPerf Training v1.0 results. This blog explains what is new with MLPerf Training v1.0 and gives a high-level overview of our submissions. Results indicate that Dell EMC DSS 8440 and PowerEdge XE8545 servers offer promising performance for Deep Learning training workloads across different areas.
MLCommons™ is a consortium of Machine Learning/Deep Learning experts from different fields within AI technology, drawn from industry, academia, startups, and individual researchers. MLPerf™ Training is the community-led test suite focusing on deep learning training. This test suite aims to measure how fast a system can train deep learning models across eight different problem types:
- Image classification
- Medical image segmentation
- Light-weight object detection
- Heavy-weight object detection
- Speech recognition
- Natural language processing
- Recommendation
- Reinforcement learning
These benchmarks provide a consistent and reproducible way to measure accuracy and convergence on individual accelerators, systems, and cloud setups. In June 2021, the MLPerf™ Training v1.0 results were released as the fourth round of MLPerf Training submissions. The following changes are new with v1.0:
- Addition of two benchmarks:
  - RNN-T: a speech recognition model. Speech recognition accepts raw audio samples and produces a corresponding text transcription. It uses the LibriSpeech dataset, which is derived from audiobooks. An example of the use of speech recognition is Google Voice Search.
  - 3D-UNet: a model for 3D medical image segmentation. It accepts 3D images that contain tumors; the model separates (or segments) the tumor from the other parts of the image. It uses the KiTS19 dataset. An example of the use of 3D medical image segmentation is the identification of kidney tumors.
- Introduction of a uniform and more mature process for evaluation and submission:
  - Reference Convergence Points (RCP) checker to ensure hyperparameters are assessed consistently and uniformly across different submissions.
  - Other checkers such as the compliance checker, system description checker, and package checker to check the accuracy of the submission.
  - Result summarizer to provide a submission summary.
- Retirement of two language translation benchmarks from v0.7:
  - GNMT
  - Transformer
BERT serves as a replacement for language model tasks.
The following figure demonstrates the numbers from the Deep Learning v1.0 benchmarks submitted by Dell Technologies:
Figure 1: MLPerf v1.0 results from Dell Technologies
Contributions from Dell Technologies
Our submissions focused on Dell EMC DSS 8440 and Dell EMC PowerEdge XE8545 servers. The DSS 8440 server is an Intel-based, PCIe Gen3 4U server that supports up to 10 double-wide PCIe GPUs and is focused on Machine Learning/Deep Learning applications such as training. The 4U PowerEdge XE8545 server supports the latest 3rd Gen AMD EPYC processors, PCIe Gen4, and the latest NVIDIA A100 Tensor Core GPUs for cutting-edge machine learning workloads. Both of these system configurations are NVIDIA-Certified, which means they have been validated for best performance and optimal scalability. The submission from Dell Technologies also included multinode training entries to showcase scale-out performance.
Multinode training is important. Because training is compute intensive, models are often trained across multiple compute nodes to reduce the turnaround time, so it is critical to showcase multinode performance. Dell Technologies and NVIDIA are the only submitters that submitted multinode results on GPUs. The submissions from NVIDIA run on Docker with a customized Slurm environment to optimize performance; we submitted multinode results with Singularity on our DSS 8440 servers as well as Docker and Slurm submissions on PowerEdge XE8545 servers. Singularity is a secure containerization solution primarily used in traditional HPC GPU clusters. Setup scripts with Singularity help traditional HPC customers run MLPerf™ Training on their cluster without the need to fully restructure their existing cluster setup.
The PowerEdge XE8545 server provides the best performing submission with an air-cooled solution for NVIDIA A100-SXM-80GB 500W GPUs. Most vendors' systems typically cool 500W GPUs with liquid because of the challenges presented by the high TDP. Dell Technologies invested engineering and design time to solve the thermal challenge with air cooling, so customers can avoid costly changes to a standard data center setup.
The DSS 8440 server submissions to MLPerf™ Training v1.0 using the latest generation NVIDIA A100 40 GB PCIe GPUs show a 2.1 to 2.4 times performance improvement over equivalent MLPerf™ Training v0.7 submissions using NVIDIA V100S PCIe GPUs. Dell Technologies is committed to bringing the latest performance advancements to customers as quickly as possible.
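MLPerf Training reports time to train, so a generation-over-generation factor such as the 2.1 to 2.4 times figure above can be read as a ratio of training times. The following Python sketch shows that arithmetic. It assumes the factor is computed as the older time divided by the newer time; the function name and the minute values are hypothetical and are not taken from the published results.

```python
def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """Return how many times faster the new run is than the baseline run."""
    if baseline_minutes <= 0 or new_minutes <= 0:
        raise ValueError("training times must be positive")
    return baseline_minutes / new_minutes


# Hypothetical time-to-train values in minutes (placeholders, not MLPerf data).
v07_minutes = 120.0  # assumed older-generation result
v10_minutes = 50.0   # assumed newer-generation result

factor = speedup(v07_minutes, v10_minutes)
print(f"Speedup: {factor:.1f}x")
```

Because lower time to train is better, a factor above 1 means that the newer system converges faster.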
Out of 12 different organizations, Dell Technologies and NVIDIA are the only two organizations that submitted results for all eight models in the MLPerf™ training v1.0 benchmarking suite.
Next steps
As a next step, we will publish more technical blogs to provide deep dives into DSS 8440 server and PowerEdge XE8545 server results.
Related Blog Posts
Multinode Performance of Dell PowerEdge Servers with MLPerf™ Training v1.1
Mon, 07 Mar 2022 19:51:12 -0000
The Dell MLPerf v1.1 submission included multinode results. This blog showcases performance across multiple nodes on Dell PowerEdge R750xa and XE8545 servers and demonstrates that the multinode scaling performance was excellent.
The compute requirement for deep learning training is growing at a rapid pace. It is imperative to train models across multiple nodes to attain a faster time-to-solution. Therefore, it is critical to showcase the scaling performance across multiple nodes. To demonstrate to customers the performance that they can expect across multiple nodes, our v1.1 submission includes multinode results. The following figures show multinode results for PowerEdge R750xa and XE8545 systems.
Figure 1: One-, two-, four-, and eight-node results with PowerEdge R750xa Resnet50 MLPerf v1.1 scaling performance
Figure 1 shows the performance of the PowerEdge R750xa server with ResNet50 training. These results scale from one node to eight nodes, that is, from four NVIDIA A100-PCIE-80GB GPUs to 32 NVIDIA A100-PCIE-80GB GPUs. The scaling is almost linear across nodes. MLPerf training requires passing Reference Convergence Points (RCP) for compliance, and these RCPs kept the eight-node case from showing fully linear scaling. The near-linear scaling makes the PowerEdge R750xa server an excellent choice for a multinode training setup.
The workload was distributed by using Singularity on PowerEdge R750xa servers. Singularity is a secure containerization solution that is primarily used in traditional HPC GPU clusters. Our submission includes setup scripts with Singularity that help traditional HPC customers run workloads without the need to fully restructure their existing cluster setup. The submission also includes Slurm Docker-based scripts.
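One common way to quantify "almost linear" scaling is scaling efficiency: the measured speedup divided by the number of nodes. The following Python sketch illustrates the calculation. The function name and the minute values are hypothetical placeholders, not the submitted ResNet50 results.

```python
def scaling_efficiency(single_node_minutes: float,
                       multi_node_minutes: float,
                       nodes: int) -> float:
    """Return measured speedup divided by node count (1.0 means perfectly linear)."""
    if single_node_minutes <= 0 or multi_node_minutes <= 0:
        raise ValueError("training times must be positive")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    measured_speedup = single_node_minutes / multi_node_minutes
    return measured_speedup / nodes


# Hypothetical time-to-train values in minutes (placeholders, not MLPerf data).
one_node = 40.0
eight_nodes = 5.5

efficiency = scaling_efficiency(one_node, eight_nodes, nodes=8)
print(f"Eight-node scaling efficiency: {efficiency:.1%}")
```

An efficiency close to 1.0 indicates near-linear scaling; values below 1.0 reflect overheads such as communication or convergence constraints.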
Figure 2: Multinode submission results for PowerEdge XE8545 server with BERT, MaskRCNN, Resnet50, SSD, and RNNT
Figure 2 shows the submitted performance of the PowerEdge XE8545 server with BERT, MaskRCNN, ResNet50, SSD, and RNN-T training. These results scale from one node to two nodes, that is, from four NVIDIA A100-SXM-80GB GPUs to eight NVIDIA A100-SXM-80GB GPUs. All GPUs operate at 500W TDP for maximum performance. The workloads were distributed using Slurm and Docker on PowerEdge XE8545 servers. The scaling is nearly linear.
Note: The RNN-T single-node results submitted for the PowerEdge XE8545x4A100-SXM-80GB system used a different set of hyperparameters than for two nodes. After the submission, we ran the RNN-T benchmark again on the PowerEdge XE8545x4A100-SXM-80GB system with the same hyperparameters and found that the new time to converge is approximately 77.37 minutes. Because we only had the resources to update the results for the 2xXE8545x4A100-SXM-80GB system before the submission deadline, the MLCommons results show 105.6 minutes for a single-node XE8545x4A100-SXM-80GB system.
The following figure shows the adjusted representation of performance for the PowerEdge XE8545x4A100-SXM-80GB system, in which RNN-T uses an unverified score of 77.31 minutes[1]:
Figure 3: Revised multinode results with PowerEdge XE8545 BERT, MaskRCNN, Resnet50, SSD, and RNNT
Figure 3 shows the linear scaling abilities of the PowerEdge XE8545 server across different workloads such as BERT, MaskRCNN, ResNet, SSD, and RNNT. This linear scaling ability makes the PowerEdge XE8545 server an excellent choice to run large-scale multinode workloads.
Note: This rnnt.zip file includes log files for 10 runs that show that the averaged performance is 77.31 minutes.
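To put the RNN-T numbers in context, the following Python sketch averages a set of per-run times and compares the rerun average of 77.31 minutes with the 105.6 minutes shown in the MLCommons results. It assumes a plain arithmetic mean, which may differ from how the average was actually taken; the function names are hypothetical and the run list is a placeholder, not the contents of the log files.

```python
from statistics import mean


def average_minutes(run_minutes: list[float]) -> float:
    """Return the arithmetic mean of per-run convergence times."""
    if not run_minutes:
        raise ValueError("at least one run is required")
    return mean(run_minutes)


def relative_reduction(old_minutes: float, new_minutes: float) -> float:
    """Return the fractional reduction in time going from old to new."""
    if old_minutes <= 0:
        raise ValueError("old_minutes must be positive")
    return (old_minutes - new_minutes) / old_minutes


# Placeholder run times in minutes (hypothetical, not from rnnt.zip).
placeholder_runs = [70.0, 80.0, 90.0]
print(f"Mean of placeholder runs: {average_minutes(placeholder_runs):.2f} minutes")

# Values quoted in the text: submitted single-node result and rerun average.
submitted_minutes = 105.6
rerun_minutes = 77.31
reduction = relative_reduction(submitted_minutes, rerun_minutes)
print(f"Reduction in time to converge: {reduction:.1%}")
```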
Conclusion
- It is critical to measure deep learning performance across multiple nodes to assess the scalability component of training as deep learning workloads are growing rapidly.
- Our MLPerf training v1.1 submission includes multinode results that scale nearly linearly.
- Scaling numbers for the PowerEdge XE8545 and PowerEdge R750xa server make them excellent platform choices for enabling large scale deep learning training workloads across different areas and tasks.
[1] MLPerf v1.1 Training RNN-T; Result not verified by the MLCommons™ Association. The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See http://www.mlcommons.org for more information.
Dell PowerEdge Servers Unleash Another Round of Excellent Results with MLPerf™ v4.0 Inference
Wed, 27 Mar 2024 15:12:53 -0000
Today marks the unveiling of MLPerf v4.0 Inference results, which have emerged as an industry benchmark for AI systems. These benchmarks assess the system-level performance of state-of-the-art hardware and software stacks. The benchmarking suite contains image classification, object detection, natural language processing, speech recognition, recommenders, medical image segmentation, LLM 6B and LLM 70B question answering, and text to image benchmarks that aim to replicate different deployment scenarios such as the data center and edge.
Dell Technologies is a founding member of MLCommons™ and has been actively making submissions since the inception of the Inference and Training benchmarks. See our MLPerf™ Inference v2.1 with NVIDIA GPU-Based Benchmarks on Dell PowerEdge Servers white paper that introduces the MLCommons Inference benchmark.
Our performance results are outstanding and reflect our resolve to deliver high system performance. These improvements enable higher system performance when it is most needed, for example, for demanding generative AI (GenAI) workloads.
What is new with Inference 4.0?
Inference 4.0 and Dell’s submission include the following:
- Newly introduced Llama 2 question answering and text to image stable diffusion benchmarks, and submission across different Dell PowerEdge XE platforms.
- Improved GPT-J (225 percent improvement) and DLRM-DCNv2 (100 percent improvement) performance. Higher GPT-J throughput means faster natural language processing tasks such as summarization, and higher DLRM-DCNv2 throughput means faster relevant recommendations, which can boost revenue.
- First-time submission of server results with the recently released PowerEdge R7615 and PowerEdge XR8620t servers with NVIDIA accelerators.
- Intel-based CPU-only results, in addition to accelerator-based results.
- Results for PowerEdge servers with Qualcomm accelerators.
- Power results showing high performance/watt scores for the submissions.
- Virtualized results on Dell servers with Broadcom.
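The percentage improvements quoted in the list above can be converted to multiplicative throughput factors. The following Python sketch assumes that each percentage is measured relative to the earlier throughput, that is, (new - old) / old; that reading and the function name are assumptions, not statements from the results.

```python
def improvement_to_factor(percent_improvement: float) -> float:
    """Convert a relative improvement in percent to a multiplicative factor."""
    return 1.0 + percent_improvement / 100.0


# Percentages quoted in the list above.
for model, percent in [("GPT-J", 225.0), ("DLRM-DCNv2", 100.0)]:
    factor = improvement_to_factor(percent)
    print(f"{model}: {percent:.0f} percent improvement -> {factor:.2f}x throughput")
```

Under this reading, a 100 percent improvement doubles throughput, and a 225 percent improvement corresponds to 3.25 times the earlier throughput.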
Overview of results
Dell Technologies delivered 187 data center, 28 data center power, 42 edge, and 24 edge power results. Some of the more impressive results were generated by our:
- Dell PowerEdge XE9680, XE9640, and XE8640 servers with NVIDIA H100 Tensor Core GPUs
- Dell PowerEdge R7515, R750xa, and R760xa servers with NVIDIA L40S and A100 Tensor Core GPUs
- Dell PowerEdge XR7620 and XR8620t servers with NVIDIA L4 Tensor Core GPUs
- Dell PowerEdge R760 server with Intel Emerald Rapids CPUs
- Dell PowerEdge R760 with Qualcomm QAIC100 Ultra accelerators
NVIDIA-based results include the following GPUs:
- Eight-way NVIDIA H100 GPU (SXM)
- Four-way NVIDIA H100 GPU (SXM)
- Four-way NVIDIA A100 GPU (PCIe)
- Four-way NVIDIA L40S GPU (PCIe)
- NVIDIA L4 GPU
These accelerators were benchmarked on different servers such as PowerEdge XE9680, XE8640, XE9640, R760xa, XR7620, and XR8620t servers across data center and edge suites.
Dell contributed about one quarter of the closed data center and edge submissions. The large number of result choices offers end users an opportunity to make data-driven purchase decisions and set performance and data center design expectations.
Interesting Dell data points
The most interesting data points include:
- Performance results across different benchmarks are excellent and show that Dell servers meet the increasing need to serve different workload types.
- Among 20 submitters, Dell Technologies was one of the few companies that covered all benchmarks in the closed division for data center suites.
- Compared to other four-way systems, the PowerEdge XE8640 and PowerEdge XE9640 servers secured winning titles across all the benchmarks, including the newly launched Stable Diffusion and Llama 2 benchmarks.
- Compared to other eight-way systems, the PowerEdge XE9680 server secured several winning titles for benchmarks such as ResNet Server, 3D-UNet, BERT-99, and BERT-99.9 Server.
- The PowerEdge XE9680 server delivers the highest performance/watt compared to other submitters with eight-way NVIDIA H100 GPUs for ResNet Server, GPT-J Server, and Llama 2 Offline.
- The Dell XR8620t server for edge benchmarks with NVIDIA L4 GPUs outperformed other submissions.
- The PowerEdge R750xa server with NVIDIA A100 PCIe GPUs outperformed other submissions on the ResNet, RetinaNet, 3D-Unet, RNN-T, BERT 99.9, and BERT 99 benchmarks.
- The PowerEdge R760xa server with NVIDIA L40S GPUs outperformed other submissions on the ResNet Server, RetinaNet Server, RetinaNet Offline, 3D-UNet 99, RNN-T, BERT-99, BERT-99.9, DLRM-v2-99, DLRM-v2-99.9, GPTJ-99, GPTJ-99.9, Stable Diffusion XL Server, and Stable Diffusion XL Offline benchmarks.
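The performance/watt comparison mentioned in the list above divides throughput by the power drawn during the run. The following Python sketch shows that calculation and how two systems could be ranked by it. The system names, throughput values, and power values are hypothetical placeholders, not submitted results.

```python
def performance_per_watt(throughput: float, average_power_watts: float) -> float:
    """Return throughput (for example, samples per second) per watt."""
    if average_power_watts <= 0:
        raise ValueError("average power must be positive")
    return throughput / average_power_watts


# Hypothetical (throughput, average power in watts) pairs; not MLPerf data.
systems = {
    "system_a": (50000.0, 5000.0),
    "system_b": (42000.0, 4000.0),
}

scores = {
    name: performance_per_watt(throughput, power)
    for name, (throughput, power) in systems.items()
}
best = max(scores, key=scores.get)
print(f"Highest performance/watt: {best} ({scores[best]:.2f} per watt)")
```

The example shows that the system with the highest raw throughput is not necessarily the most power efficient.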
Highlights
The following figure shows the different Offline and Server performance scenarios in the data center suite. These results provide an overview; follow-up blogs will provide more details about the results.
The following figure shows that these servers delivered excellent performance for all models in the benchmark, such as ResNet, RetinaNet, 3D-UNet, RNN-T, BERT, DLRM-v2, GPT-J, Stable Diffusion XL, and Llama 2. Because different benchmarks operate on varied scales, all results are shown on an exponentially scaled y-axis:
Figure 1: System throughput for submitted systems for the data center suite.
The following figure shows single-stream and multistream scenario results for the edge for ResNet, RetinaNet, 3D-UNet, RNN-T, BERT 99, GPT-J, and Stable Diffusion XL benchmarks. For the single-stream and multistream scenarios, lower latency is better; for the Offline scenario, higher throughput is better.
Figure 2: Edge results with PowerEdge XR7620 and XR8620t servers overview
Conclusion
The preceding results were officially submitted to MLCommons. They are MLPerf-compliant results for the Inference v4.0 benchmark across various suites and cover all the tasks in the benchmark, such as image classification, object detection, natural language processing, speech recognition, recommenders, medical image segmentation, LLM 6B and LLM 70B question answering, and text to image. These results show that Dell PowerEdge XE9680, XE8640, XE9640, and R760xa servers are capable of delivering high performance for inference workloads. Dell Technologies secured several #1 titles that make Dell PowerEdge servers an excellent choice for data center and edge inference deployments. End users can benefit from the plethora of submissions, which help with server performance and sizing decisions that ultimately advance enterprises’ AI transformation and which show Dell’s commitment to deliver higher performance.
MLCommons Results
https://mlcommons.org/en/inference-datacenter-40/
https://mlcommons.org/en/inference-edge-40/
The preceding graphs are MLCommons results for MLPerf IDs from 4.0-0025 to 4.0-0035 on the closed datacenter, 4.0-0036 to 4.0-0038 on the closed edge, 4.0-0033 in the closed datacenter power, and 4.0-0037 in closed edge power.