Dell PowerEdge Servers demonstrate excellent performance in MLPerf™ Training 3.0 benchmark
Tue, 27 Jun 2023 19:49:45 -0000
MLPerf Training v3.0 has just been released, and Dell Technologies' results are shining brighter than ever. Our submission includes benchmarking results from recently launched new generation servers, such as the Dell PowerEdge XE9680, XE8640, and R760xa servers, as well as our previous generation of servers, such as the PowerEdge XE8545 and R750xa servers. The submission covers various use cases in the MLPerf training benchmark, including image classification, medical image segmentation, lightweight and heavyweight object detection, speech recognition, NLP, and recommendation. We encourage you to read our previous whitepaper about MLPerf Training v2.0, which introduces the MLPerf training benchmark. These benchmarks serve as a reference for the kind of performance customers can expect.
Dell Technologies also announced Project Helix, which introduced a solution that customers can use to run their generative AI workloads.
What’s new with MLPerf Training 3.0 and the Dell Technologies submissions?
New features for this submission include:
- Significant performance gains over the previous round.
- Results that include NVIDIA H100 Tensor Core GPUs, including a submission to the newly introduced DLRMv2 benchmark, which uses multi-hot encodings.
- First-time training submission using new generation Dell PowerEdge servers.
- First and only multinode results using Cornelis Omnipath interconnect fabric.
- More multinode results using different interconnect fabrics.
Overview of results
Dell Technologies submitted a total of 91 results, the highest number of any submitter, constituting over one-third of all closed division results. These results were submitted using 27 different systems. The most outstanding results were from Dell PowerEdge XE9680, XE8640, and R760xa servers with the new NVIDIA H100 PCIe and NVIDIA H100 SXM form factor-based accelerators, and included multinode runs. Other accelerators included NVIDIA A100 PCIe and SXM form factors.
Interesting data points include the following:
- Among servers with four NVIDIA H100 PCIe accelerators, the Dell PowerEdge R760xa server had the lowest time to converge on the MaskRCNN, ResNet, and UNet-3D benchmarks. Similarly, among servers with four NVIDIA H100 SXM accelerators, the PowerEdge XE8640 server had the lowest time to converge on the BERT, DLRMv2, ResNet, and UNet-3D benchmarks.
- The Dell PowerEdge R760xa server features PCIe Gen 5, which allows for faster multi-GPU training. Our submissions included PowerEdge R750xa and R760xa servers with the same accelerators to show the performance gains customers can expect.
- MLPerf Training 3.0 was the first time that Dell Technologies made an eight-way NVIDIA SXM form factor accelerator submission for training workloads. The Dell PowerEdge XE9680 server with eight NVIDIA HGX H100 SXM GPUs had the lowest time to converge on the ResNet-50 benchmark among eight-GPU configurations, and performed closely to other NVIDIA HGX systems on the remaining benchmarks.
- Multinode results demonstrate near-linear scaling, which shows that customers can gain faster time to value across all the workloads. These multinode submissions include different interconnects, such as InfiniBand and Cornelis Omnipath, which allows customers to make tradeoffs.
- Results were submitted for different Dell PowerEdge servers that support different accelerator TDPs. These results are useful for scenarios in which the data center is power-constrained, and they help with FLOPS-per-watt decisions.
- Intel and AMD-based server submissions enable customers to see how CPUs can influence the training process.
- Our results not only span various systems but also show greater performance gains than the last round, owing to the newer generation of servers and accelerators.
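The near-linear scaling noted above can be quantified as a scaling efficiency: the speedup over a single node divided by the node count, where 1.0 means perfectly linear scaling. A minimal sketch of that arithmetic (the times below are illustrative placeholders, not actual submitted results):

```python
def scaling_efficiency(t_single: float, t_multi: float, nodes: int) -> float:
    """Speedup over a single node, normalized by node count (1.0 = perfectly linear)."""
    speedup = t_single / t_multi
    return speedup / nodes

# Hypothetical time-to-converge values in minutes (not MLPerf submission numbers).
single_node_minutes = 40.0
two_node_minutes = 21.0
print(f"speedup: {single_node_minutes / two_node_minutes:.2f}x")
print(f"efficiency: {scaling_efficiency(single_node_minutes, two_node_minutes, 2):.0%}")
```

An efficiency close to 100 percent across node counts is what "near-linear scaling" refers to; interconnect latency and collective-communication overhead are the usual sources of the gap.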
Fig 1: Dell systems used for ResNet, MaskRCNN, and BERT benchmarks
Fig 2: Dell systems used for SSD, RNN-T, UNet-3D, and DLRM benchmarks
Figure 1 and Figure 2 list the systems and corresponding NVIDIA GPUs that were used in tests. We see that various systems with different NVIDIA GPUs were used for different use cases such as ResNet-50, MaskRCNN, BERT, SSD, RNN-T, UNet-3D, and DLRMv2. All the systems performed optimally and delivered low time to converge. These results also include multinode results.
The single server with the lowest time to converge is the Dell PowerEdge XE9680 server, which delivers incredible time to value for training and inference workloads. These systems scale well and meet the current demand for very high compute. Large AI workloads, including sizable generative AI training such as LLMs, can be trained on multiple PowerEdge XE9680 servers.
The following figure shows the improvement in performance from the previous submission. It shows the best Dell single-system training submission results compared to the previous round of submissions.
Fig 3: Performance improvement factor using a Dell PowerEdge XE9680 server with the previous generation Dell PowerEdge XE8545 server as a baseline across different benchmarks
The figure shows the performance gains customers can expect if they upgrade to the latest generation of servers. Note that the latest generation server, the Dell PowerEdge XE9680 server, has eight NVIDIA H100 SXM GPUs; the previous generation Dell PowerEdge XE8545 server has four NVIDIA A100 SXM GPUs.
The greatest improvement, at 846 percent, was observed with the SSD benchmark, followed by the BERT benchmark at 611 percent. All other benchmarks yielded greater than 230 percent improvement. These results are significant: an improvement of more than two times in time to train means more time for other workloads in the data center, yielding faster time to value for the business. With this acceleration, customers can expect faster prototyping, faster model training, and an expedited MLOps pipeline.
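These percentages follow from the ratio of baseline time to train to new time to train. A quick sketch of the arithmetic, assuming the common convention that "X percent improvement" means the ratio minus one (the times below are hypothetical, not the submitted figures):

```python
def improvement_percent(baseline_minutes: float, new_minutes: float) -> float:
    """Percentage improvement in time to train relative to a baseline system,
    using the (ratio - 1) convention: 2x faster == 100 percent improvement."""
    return (baseline_minutes / new_minutes - 1.0) * 100.0

# Hypothetical example: a baseline taking 94.6 minutes vs. a new system at
# 10.0 minutes corresponds to an 846 percent improvement under this convention.
print(f"{improvement_percent(94.6, 10.0):.0f}%")
```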
Conclusion
We have submitted compliant results for the MLCommons Training 3.0 benchmark. These results are numerous and span different servers powered by NVIDIA GPUs. They show that multinode scaling is near linear, meaning more servers can help solve the same problem faster. Having various results helps customers choose the best server for their data center setting in which to deploy training workloads. Newer generation servers such as the Dell PowerEdge XE9680, XE8640, and R760xa all deliver high performance while breaking MLCommons records across use cases such as image classification, medical image segmentation, lightweight and heavyweight object detection, speech recognition, NLP, and recommendation. Furthermore, Project Helix offers customers an effective way to derive value from generative AI. Enterprises can efficiently advance their AI transformation with Dell Technologies, achieving faster time to value that uniquely fits their needs.
MLCommons Results
https://mlcommons.org/en/training-normal-30/
The preceding graphs show MLCommons results for MLPerf IDs 3.0-2027 through 3.0-2053.
The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
Related Blog Posts
Dell Technologies Shines in MLPerf™ Stable Diffusion Results
Tue, 12 Dec 2023 14:51:21 -0000
Abstract
The recent release of MLPerf Training v3.1 results includes the newly launched Stable Diffusion benchmark. At the time of publication, Dell Technologies leads the OEM market in this performance benchmark for training a Generative AI foundation model, specifically the Stable Diffusion model. With the Dell PowerEdge XE9680 server submission, Dell Technologies is differentiated as the only vendor with a Stable Diffusion score for an eight-way system. The time to converge using eight NVIDIA H100 Tensor Core GPUs is 46.7 minutes.
Overview
Generative AI workload deployment is growing at an unprecedented rate. Key reasons include increased productivity and the increasing convergence of multimodal input. Creating content has become easier and more practical across various industries. Generative AI has enabled many enterprise use cases and continues to expand into new frontiers. This growth can be attributed to higher-resolution text-to-image generation, text-to-video generation, and other modalities. For these impressive AI tasks, the need for compute is even more expansive. Some of the more popular generative AI workloads include chatbots, video generation, music generation, and 3D asset generation.
Stable Diffusion is a deep learning text-to-image model that accepts input text and generates a corresponding image. The output is credible and appears realistic; occasionally, it can be hard to tell that the image is computer generated. This workload merits consideration because of the rapid expansion of use cases such as eCommerce, marketing, graphic design, simulation, video generation, applied fashion, and web design.
Because these workloads demand intensive compute to train, measuring system performance on them is essential. As an AI systems benchmark, MLPerf has emerged as a standard way to compare submitters, including OEMs, accelerator vendors, and others, in a like-for-like way.
MLPerf recently introduced the Stable Diffusion benchmark in MLPerf Training v3.1. It measures the time for a Stable Diffusion workload to converge to the expected quality targets. The benchmark uses the Stable Diffusion v2 model trained on the LAION-400M-filtered dataset. The original LAION-400M dataset has 400 million image and text pairs; a subset of approximately 6.5 million images is used for training in the benchmark. The validation dataset is a subset of 30,000 COCO 2014 images. The expected quality targets are FID <= 90 and CLIP >= 0.15.
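The benchmark's convergence criterion can be sketched as a simple check of both quality targets at each evaluation point (a minimal illustration of the rule stated above; the actual benchmark harness is far more involved):

```python
# MLPerf Training v3.1 Stable Diffusion quality targets (from the text above).
FID_TARGET = 90.0    # Frechet Inception Distance: lower is better
CLIP_TARGET = 0.15   # CLIP score: higher is better

def has_converged(fid: float, clip_score: float) -> bool:
    """True once both quality targets are met: FID <= 90 and CLIP >= 0.15."""
    return fid <= FID_TARGET and clip_score >= CLIP_TARGET

# Hypothetical evaluation checkpoints (illustrative values only).
print(has_converged(120.0, 0.12))  # neither target met yet
print(has_converged(85.0, 0.16))   # both targets met: run converges
```

Time to converge is then the wall-clock training time until the first evaluation at which both targets hold.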
The following figure shows a latent diffusion model[1]:
Figure 1: Latent diffusion model
[1] Source: https://arxiv.org/pdf/2112.10752.pdf
Stable Diffusion v2 is a latent diffusion model that combines an autoencoder with a diffusion model trained in the latent space of the autoencoder. MLPerf Stable Diffusion focuses on the U-Net denoising network, which has approximately 865 million parameters. There are some deviations from the v2 model; however, these adjustments are minor and are intended to encourage submissions from submitters with compute constraints.
The submission uses the NVIDIA NeMo framework, included with NVIDIA AI Enterprise for secure, supported, and stable production AI. NeMo is a framework to build, customize, and deploy generative AI models. It includes training and inferencing frameworks, guardrailing toolkits, data curation tools, and pretrained models, offering enterprises an easy, cost-effective, and fast way to adopt generative AI.
Performance of the Dell PowerEdge XE9680 server and other NVIDIA-based GPUs on Stable Diffusion
The following figure shows the performance of NVIDIA H100 Tensor Core GPU-based systems on the Stable Diffusion benchmark. It includes submissions from Dell Technologies and NVIDIA that use different numbers of NVIDIA H100 GPUs, ranging from eight GPUs (Dell submission) to 1,024 GPUs (NVIDIA submission). The figure shows the expected performance of this workload and demonstrates that strong scaling is achievable with minimal scaling loss.
Figure 2: MLPerf Training Stable Diffusion scaling results on NVIDIA H100 GPUs from Dell Technologies and NVIDIA
End users can use state-of-the-art compute to derive faster time to value.
Conclusion
The key takeaways include:
- The latest MLPerf Training release, v3.1, measures Generative AI workloads such as Stable Diffusion.
- Dell Technologies is the only OEM vendor to have made an MLPerf-compliant Stable Diffusion submission.
- The Dell PowerEdge XE9680 server is an excellent choice to derive value from Image Generation AI workloads for marketing, art, gaming, and so on. The benchmark results are outstanding for Stable Diffusion v2.
MLCommons Results
https://mlcommons.org/benchmarks/training/
The preceding graphs are MLCommons results for MLPerf IDs 3.1-2019, 3.1-2050, 3.1-2055, and 3.1-2060.
The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
Dell PowerEdge Servers Achieve Stellar Scores with MLPerf™ Training v3.1
Wed, 08 Nov 2023 17:43:48 -0000
Abstract
MLPerf is an industry-standard AI performance benchmark. For more information about the MLPerf benchmarks, see Benchmark Work | Benchmarks MLCommons.
Today marks the release of a new set of results for MLPerf Training v3.1. The Dell PowerEdge XE9680, XE8640, and XE9640 servers in the submission demonstrated excellent performance. The tasks included image classification, medical image segmentation, lightweight and heavyweight object detection, speech recognition, language modeling, recommendation, and text-to-image. MLPerf Training v3.1 results provide a baseline for end users to set performance expectations.
What is new with MLPerf Training 3.1 and the Dell Technologies submissions?
The following are new for this submission:
- A new benchmark was added to the suite: Stable Diffusion with the LAION-400M dataset.
- Dell Technologies submitted the newly introduced Liquid Assisted Air Cooled (LAAC) PowerEdge XE9640 system, which is a part of the latest generation Dell PowerEdge servers.
Overview of results
Dell Technologies submitted 30 results. These results were submitted using five different systems. We submitted results for the PowerEdge XE9680, XE8640, and XE9640 servers. We also submitted multinode results for the PowerEdge XE9680 and XE8640 servers. The PowerEdge XE9680 server was powered by eight NVIDIA H100 Tensor Core GPUs, while the PowerEdge XE8640 and XE9640 servers were powered by four NVIDIA H100 Tensor Core GPUs each.
Datapoints of interest
Interesting datapoints include:
- Our new Stable Diffusion results with the PowerEdge XE9680 server were submitted for the first time. Dell Technologies, NVIDIA, and Habana Labs are the only submitters to have made an official submission to this benchmark. This submission is important because of the explosion of Generative AI workloads. It uses the NVIDIA NeMo framework, included in NVIDIA AI Enterprise for secure, supported, and stable production AI.
- Dell PowerEdge XE8640 and XE9640 servers secured several top performer titles (#1 titles) among other systems equipped with four NVIDIA H100 GPUs. The tasks included language modeling, recommendation, heavy-weight object detection, speech to text, and medical image segmentation.
- Multinode results for the PowerEdge XE9680 server were submitted, which can be compared with those from the previous round. Additionally, this round was the first time that multinode results were submitted with the newer generation PowerEdge XE8640 server. The results show near-linear scaling. Furthermore, Dell Technologies is the only submitter besides NVIDIA, Habana Labs, and Intel to make multinode, on-premises submissions.
- The results for the PowerEdge XE9640 server with liquid assisted air cooling (LAAC) are similar to those of the air-cooled PowerEdge XE8640 server.
The following figure shows all the convergence times for Dell systems and corresponding workloads in the benchmark. Because different benchmarks are included in the same graph, the y axis is expressed logarithmically. Overall, these numbers show an excellent time to converge for the workload in question.
Figure 1. Logarithmic y axis: Overview of Dell MLPerf Training v3.1 results
Conclusion
We submitted compliant results for the MLCommons Training v3.1 benchmark. These results are based on the latest generation of Dell PowerEdge XE9680, XE8640, and XE9640 servers, powered by NVIDIA H100 Tensor Core GPUs. All results are stellar: they demonstrate that multinode scaling is near linear and that more servers can help solve the same problem faster. The range of results allows end users to set performance expectations before deploying their compute-intensive training workloads. The workloads in the submission include image classification, medical image segmentation, lightweight and heavyweight object detection, speech recognition, language modeling, recommendation, and text-to-image. Enterprises can efficiently enable and maximize their AI transformation with Dell solutions.
MLCommons Results
https://mlcommons.org/benchmarks/training/
The preceding graphs are MLCommons results for MLPerf IDs from 3.1-2005 to 3.1-2009.
The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.