
Empowering Enterprises with Generative AI: How Does MLPerf™ Help Support Requirements?
Fri, 14 Apr 2023 17:05:26 -0000
Generative AI has developed into a critical workload in the deep learning ecosystem. 2023 has been a year of explosive growth, with generative AI making huge progress in both output quality and ease of access. With the advent of popular systems such as ChatGPT and Stable Diffusion, generative AI has become one of the pivotal use cases bringing AI into the mainstream. We expect generative AI to push new frontiers and enable an explosion of productivity. This blog provides an overview of generative AI and its relevance to the MLCommons™ AI system benchmarks to which we submit on a frequent basis.
Introduction to Generative AI
Generative AI refers to AI systems (consisting of hardware and software) that can produce plausible renders of images, audio, video, text, code, 3D renders, and so on when given an instruction prompt. The prompt can be text, voice, or other forms of input. Popular examples include ChatGPT, the Stable Diffusion image generator, and text-to-speech engines.
These AI systems can enable a significant productivity boost by generating and modifying existing pieces of content that effectively improve the user’s workflow.
What can these AI Systems do?
Generative AI is capable of generating and optimizing:
- Chat and text─This modality is useful for customer support; generating blogs, ad copy, design guides, and technical reports; reading and taking action; answering questions; summarizing large documents; producing code that can run directly; inspiring developers to write improved code; and so on.
- Video generations:
- Talking head videos─These videos can be useful for content producers, tutorial guides, and so on, in which personas communicate with voice, lip syncing, and emotions. They are helpful for customer support and other interactive services.
- NeRF (neural radiance fields)─Given pictures taken from a few angles, a NeRF can produce smooth footage of the entire scene that looks real. NeRFs can provide more perspective on a scene and enable more interesting viewpoints.
- High-resolution images─These creative images can be used for multiple purposes including B-rolls, explanation of ideas and simulated concepts, special effects, graphic vectors, infographics, backdrops, scenes and so on.
- High-fidelity audio─These audio samples can be voices, music, and so on. Voices can deliver emotions, be of high quality like voiceovers, and deliver speech for advertisements. Audio samples can also be songs for karaoke, songs with beats, customer support and so on.
- 3D Generations─These renders are useful for producing a new world with just imagination. They are powerful for VFX, VR, and other immersive experiences. These 3D generations can be used for creating digital clones of the real world, games, commercials, movies, and so on.
There are many other use cases that this blog does not highlight. With more innovation and research, there will be a Cambrian explosion of use cases fueled by generative AI. These models can also produce personalized content for the end user rather than serving generic material.
What kind of compute is needed to train these AI systems?
Training generative AI systems is a compute-intensive task. Text generators, chatbots, and instruction followers typically have billions of parameters and consume thousands of GPU hours. Training at this scale requires different mechanisms of parallelization, training update optimizations, full-stack (hardware and software) optimization, and so on.
For instance, the GPT-3 model has 175 B parameters and the Megatron model has approximately 530 B parameters. Training and inference procedures for these systems differ significantly from traditional deep learning models that have far fewer parameters. For instance, large language models (LLMs) require large inference setups, including multinode inference, while scaling training to trillion-parameter models requires different mechanisms, including dynamic sparsity, optimizing communication costs, self-tuning, and so on.
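To put those parameter counts in perspective, the following back-of-the-envelope sketch (our own illustration, not part of any MLPerf benchmark) estimates how much GPU memory the weights alone occupy in FP16 and how many 80 GB GPUs that implies. The 2 bytes-per-parameter and 80 GB figures are assumptions; real deployments need considerably more memory for activations, optimizer state, and caches.

```python
import math

def min_gpus_for_weights(num_params, bytes_per_param=2, gpu_mem_gb=80):
    """Memory for the weights alone (GB) and the minimum number of GPUs needed
    to hold them, ignoring activations, optimizer state, and KV caches, which
    add substantially more in practice."""
    weight_gb = num_params * bytes_per_param / 1e9
    return weight_gb, math.ceil(weight_gb / gpu_mem_gb)

# Parameter counts from the text; FP16 (2 bytes/parameter) and 80 GB GPUs are assumptions.
for name, params in [("BERT", 340e6), ("GPT-3", 175e9), ("Megatron-scale", 530e9)]:
    gb, gpus = min_gpus_for_weights(params)
    print(f"{name}: ~{gb:,.0f} GB of FP16 weights -> at least {gpus} x 80 GB GPU(s)")
```

Even under these generous assumptions, a 175 B parameter model cannot fit on a single 80 GB GPU, which is why multi-GPU and multinode setups are unavoidable.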
In essence, the compute needs for generative AI are growing in unique ways. While training generative AI models remains a crucial compute need, the next wave of compute demand is likely to come from fine-tuning and inferencing.
Why adopt now?
Generative AI has been in development for many years; Transformers, WaveNet, GANs, autoencoders with decoders, and so on have been around for quite some time. There has been much innovation in these areas, which continues to be mixed and matched to achieve productive outcomes. For instance, the growth of multimodal models (models that take different kinds of inputs) facilitates a more collaborative workflow. Multimodal models form the cornerstone for enabling near-human intelligence for a specific task. Although there is a small chance of reaching human-level performance overall, these multimodal models can produce plausible results. Consumers of these systems can take the outputs, modify them, and use them in their workflow. These systems render output quickly compared to a manual effort and provide more layers of creativity.
These plausible renders, ease of access, and open-source development have been incredible fuel for popularizing generative AI systems. The next step is pushing these systems to perform better, whether by improving quality of service or improving throughput. Improving convergence and throughput is a well-established problem, and many benchmarks have been driving the optimization of AI systems toward these goals.
Relationship to MLCommons
The MLCommons Training benchmark has been instrumental in driving significant improvements in training time convergence by taking a holistic view of hardware and software. The MLCommons Inference benchmark has similarly been conducive to optimizing the inference performance of AI systems.
Furthermore, MLCommons has generative AI benchmarks on its road map. For instance, an LLM benchmark is part of MLCommons Training v3.0, and Stable Diffusion is scheduled to be included in MLCommons Training v3.1.
The need to continuously improve systems is essential, even more so now for generative AI use cases. The MLCommons community has made significant performance improvements every year, and these optimizations from vendors, benchmarks, and the deep learning community continue to serve the generative AI effort. All these efforts make adoption of generative AI more attractive now.
Paradigms
Some fundamental models that are used for generative AI workloads in MLPerf benchmarks include:
Figure 1: Transformer architecture
- Transformer─This model uses an attention mechanism to model areas of interest in a specific context, building relationships that signify how strongly one element relates to others (a minimal sketch of the attention computation follows this list).
Figure 2: U-Net architecture
- 3D-UNet─This model uses convolution and pooling blocks to set up contracting and expanding paths that create a bottleneck, and the image is reconstructed from this bottleneck. The bottleneck captures a compressed representation of the data; only the important information is used to reconstruct the image.
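The sketch below is a simplified, single-head illustration of scaled dot-product attention in PyTorch, not any MLPerf reference implementation. Each query scores every key, the scores are normalized with a softmax, and the output is the corresponding weighted blend of the values.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Minimal single-head attention: weight each value by how strongly
    its key matches the query (softmax over scaled dot products)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (seq_q, seq_k) relevance scores
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v                              # per-query blend of the values

# Toy example: 4 tokens with embedding dimension 8, self-attention (q = k = v)
x = torch.randn(4, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([4, 8])
```

Production models such as BERT and GPT-3 stack many such attention layers, split them into multiple heads, and wrap them with layer normalization and feed-forward blocks.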
How is generative AI relevant to MLPerf Training and Inference?
MLPerf Training uses the BERT language model. Many text-based generative AI workloads are LLMs. While BERT is not as large as GPT-3 (about 1/500th the size based on parameter count: 340 M compared to 175 B), it uses the same fundamental building blocks as GPT-3.
For instance, BERT uses multiple attention heads, layer normalization, SoftMax, and so on, which GPT-3 also uses. While the parameter count, layer count, and model size are larger for GPT-3, BERT uses fundamentally similar procedures that are essential for training.
Similarly, Stable Diffusion uses U-Net layers, which are useful for constructing high-quality images. It takes encoded text and uses the U-Net bottleneck to drive a denoising procedure. 3D-UNet is part of the MLPerf benchmark suite and is already optimized.
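The toy PyTorch module below is our own illustration of the U-Net pattern, not the MLPerf 3D-UNet or the Stable Diffusion U-Net: a contracting path compresses the input into a bottleneck, an expanding path reconstructs it, and a skip connection carries encoder features across.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy 2D U-Net-style block: contracting path -> bottleneck -> expanding
    path, with a skip connection reusing encoder features."""
    def __init__(self, channels=16):
        super().__init__()
        self.encode = nn.Sequential(nn.Conv2d(1, channels, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                                       # contracting path
        self.bottleneck = nn.Sequential(
            nn.Conv2d(channels, channels * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(channels * 2, channels, 2, stride=2)  # expanding path
        self.decode = nn.Conv2d(channels * 2, 1, 3, padding=1)             # after skip concat

    def forward(self, x):
        enc = self.encode(x)
        bottleneck = self.bottleneck(self.down(enc))      # compressed representation
        up = self.up(bottleneck)
        return self.decode(torch.cat([up, enc], dim=1))   # skip connection

x = torch.randn(1, 1, 64, 64)
print(TinyUNet()(x).shape)  # torch.Size([1, 1, 64, 64])
```

Real U-Nets stack several such down/up stages; the denoising U-Net in a diffusion model additionally conditions each stage on the encoded text prompt and the noise timestep.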
The preceding examples show that optimizations used in MLPerf are transferable, and current MLPerf models can serve as a relative proxy for generative AI workloads.
Furthermore, MLPerf includes LLMs and Stable Diffusion on the road map for the upcoming training submission versions. We can expect optimized versions of these implementations to be made available to the public.
The links in the references show optimizations made by NVIDIA for the benchmarks. Customers can take the already optimized references and use them for their generative AI use cases.
We recognize the importance of AI workloads, including generative AI. Therefore, we submit to MLCommons benchmarks, which provide like-to-like comparisons across different OEMs and vendors. Scale is an important aspect of generative AI workloads, and we have introduced the new PowerEdge XE9680 server, which delivers stellar performance at scale. The following figure shows the performance improvement from MLPerf Inference v2.1 to MLPerf Inference v3.0.
* MLPerf ID 2.1-0014 and MLPerf ID 3.0-0013
Figure 3: Performance improvement in MLPerf Inference v3.0 on the XE9680 server with 8x H100 GPUs compared to MLPerf Inference v2.1 on the XE8545 server with 4x A100 GPUs
PowerEdge XE9680 and XE8545 systems are an excellent choice for generative AI workloads. Customers can expect faster time to value, and these systems scale very well, as demonstrated by the MLPerf Training results.
Conclusion
While generative AI has produced enormous excitement, there are many challenges, such as biased outputs, incorrect answers, hallucinations, and instability, that require monitoring and policing. Generative AI systems still cannot be trusted to make autonomous decisions or be tied to other algorithms in mission-critical applications.
The latest MLPerf Inference v3.0 results show improvements of three to eight times across all categories. These improvements demonstrate Dell Technologies' commitment to continuously improving performance. We understand that generative AI is an important class of AI workload, and Dell hardware supports these workloads. By upgrading to the latest servers, such as the PowerEdge XE9680, customers can derive faster time to value. Dell Technologies can help customers adopt and deploy generative AI workloads.
To summarize, compute, quality of service (plausible outputs), open-source development, and ease of access are major drivers for the mass adoption of generative AI. Organizations can leverage these drivers to produce outputs for their workflows. Enabling these systems with humans in the loop is a good first step toward boosting productivity. Dell Technologies has been making MLPerf submissions to show how our servers deliver excellent performance, and the optimizations made for MLPerf are transferable to generative AI workloads.
References
- Vaswani et al., "Attention Is All You Need": https://arxiv.org/abs/1706.03762
- Ronneberger et al., "U-Net: Convolutional Networks for Biomedical Image Segmentation": https://arxiv.org/abs/1505.04597
- NVIDIA, "Leading MLPerf Training 2.1 with Full Stack Optimizations for AI": https://developer.nvidia.com/blog/leading-mlperf-training-2-1-with-full-stack-optimizations-for-ai/
- NVIDIA, "Boosting MLPerf Training Performance with Full Stack Optimization": https://developer.nvidia.com/blog/boosting-mlperf-training-performance-with-full-stack-optimization/
Related Blog Posts

Choosing a PowerEdge Server and NVIDIA GPUs for AI Inference at the Edge
Fri, 05 May 2023 16:38:19 -0000
Dell Technologies submitted several benchmark results for the latest MLCommons™ Inference v3.0 benchmark suite. An objective was to provide information to help customers choose a favorable server and GPU combination for their workload. This blog reviews the Edge benchmark results and provides information about how to determine the best server and GPU configuration for different types of ML applications.
Results overview
For computer vision workloads, which are widely used in security systems, industrial applications, and even in self-driving cars, ResNet and RetinaNet results were submitted. ResNet is an image classification task and RetinaNet is an object detection task. The following figures show that for intensive processing, the NVIDIA A30 GPU, which is a double-wide card, provides the best performance, with almost two times more images per second than the NVIDIA L4 GPU. However, the NVIDIA L4 GPU is a single-wide card that requires only 43 percent of the energy consumption of the NVIDIA A30 GPU, considering the nominal Thermal Design Power (TDP) of each GPU. This low energy consumption is a great advantage for applications that need lower power consumption or run in environments that are more challenging to cool. The NVIDIA L4 GPU is the replacement for the best-selling NVIDIA T4 GPU and provides twice the performance with the same form factor, so this card is the best option for most Edge AI workloads.
Conversely, the NVIDIA A2 GPU offers the lowest price (compared to the NVIDIA A30 GPU), the lowest power consumption (TDP), and the lowest performance of the options compared here. Therefore, if the application is compatible with this GPU, it has the potential to deliver the lowest total cost of ownership (TCO).
Figure 1: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the ResNet Offline benchmark
Figure 2: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the RetinaNet Offline benchmark
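As an illustration of the TDP-based comparison above, the sketch below computes performance per watt from throughput and nominal TDP. The TDP values are approximate public figures (about 72 W for the L4 and 165 W for the A30, consistent with the roughly 43 percent ratio mentioned earlier); the throughput numbers are placeholders, not MLPerf results.

```python
# Approximate nominal TDPs (assumption, consistent with the ~43 percent figure above).
tdp_watts = {"NVIDIA A30": 165, "NVIDIA L4": 72}
# Placeholder throughputs -- substitute the published MLPerf Offline results.
throughput = {"NVIDIA A30": 2000, "NVIDIA L4": 1100}   # hypothetical samples/second

print(f"L4 TDP as a share of A30 TDP: "
      f"{tdp_watts['NVIDIA L4'] / tdp_watts['NVIDIA A30']:.1%}")
for gpu, watts in tdp_watts.items():
    print(f"{gpu}: {throughput[gpu] / watts:.1f} samples/second per watt")
```

With numbers like these, a card can trail in absolute throughput yet still lead in throughput per watt, which is exactly the trade-off to weigh at power-constrained edge sites.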
The 3D-UNet benchmark is another computer vision, image-related benchmark. It uses medical images for volumetric segmentation. We saw the same results for default accuracy and high accuracy, and the NVIDIA A30 GPU delivered significantly better performance than the NVIDIA L4 GPU. However, the same considerations of energy consumption, space, and cooling capacity discussed previously apply when choosing a GPU for each application and use case.
Figure 3: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the 3D-UNet Offline benchmark
Another important benchmark is for BERT, which is a Natural Language Processing model that performs tasks such as question answering and text summarization. We observed similar performance differences between the NVIDIA A30, L4, T4, and A2 GPUs. The higher the value, the better.
Figure 4: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the BERT Offline benchmark
MLPerf benchmarks also include latency results, which measure the time that systems take to process requests. For some use cases, this processing time can be more critical than the number of requests that can be processed per second. For example, if a conversational algorithm or an object detection query that needs a real-time response takes several seconds, that delay can significantly impact the experience of the user or application.
As shown in the following figures, the NVIDIA A30 and NVIDIA L4 GPUs have similar latency results; depending on the workload, either GPU can provide the lowest latency. For customers planning to replace the NVIDIA T4 GPU or seeking a better response time for their applications, the NVIDIA L4 GPU is an excellent option. The NVIDIA A2 GPU can also be used for applications that require low latency because it performed better than the NVIDIA T4 GPU in single-stream workloads. The lower the value, the better.
Figure 5: Latency comparison of NVIDIA A30, L4, T4, and A2 GPUs for the ResNet single-stream and multistream benchmarks
Figure 6: Latency comparison of NVIDIA A30, L4, T4, and A2 GPUs for the RetinaNet single-stream and multistream benchmarks and the BERT single-stream benchmark
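For readers who want to relate these latency charts to raw measurements, the sketch below computes a high-percentile latency from a list of per-query times (MLPerf single-stream reporting uses the 90th percentile). The sample values here are synthetic placeholders, not benchmark data.

```python
import random

def percentile_latency(latencies_ms, pct=90.0):
    """Nearest-rank percentile of a list of per-query latencies (milliseconds)."""
    ordered = sorted(latencies_ms)
    rank = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

# Synthetic placeholder measurements; real values come from the benchmark's logs.
random.seed(0)
samples = [random.uniform(2.0, 6.0) for _ in range(1000)]
print(f"p90 latency: {percentile_latency(samples, 90):.2f} ms")
print(f"p99 latency: {percentile_latency(samples, 99):.2f} ms")
```

Reporting a high percentile rather than the average is what makes these results meaningful for real-time applications, where the occasional slow query is what users actually notice.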
Dell Technologies submitted to various benchmarks to help customers understand which configuration is the most environmentally friendly, as the data center's carbon footprint is a growing concern. This concern is especially relevant because some edge locations have power and cooling limitations, so it is important to understand performance relative to power consumption.
The following figure shows that the NVIDIA L4 GPU has equal or better performance per watt compared to the NVIDIA A2 GPU, even though it consumes more power. For Throughput and Perf/watt values, higher is better; for Power (watt) values, lower is better.
Figure 7: NVIDIA L4 and A2 GPU power consumption comparison
Conclusion
Based on the workloads measured with MLPerf Inference 3.0, we can conclude that all the NVIDIA GPUs tested for Edge workloads have characteristics that address several use cases. Customers must evaluate size, performance, latency, power consumption, and price. Depending on the requirements of the application, one of the evaluated GPUs will provide the best result for the final use case.
Another important conclusion is that the NVIDIA L4 GPU can be considered as an exceptional upgrade for customers and applications running on NVIDIA T4 GPUs. The migration to this new GPU can help consolidate the amount of equipment, reduce the power consumption, and reduce the TCO; one NVIDIA L4 GPU can provide twice the performance of the NVIDIA T4 GPU for some workloads.
With this benchmark submission, Dell Technologies demonstrates a broad portfolio that provides infrastructure for any type of customer requirement.
The following blogs provide analyses of other MLPerf™ benchmark results:
- Dell Servers Excel in MLPerf™ Inference 3.0 Performance
- Dell Technologies’ NVIDIA H100 SXM GPU submission to MLPerf™ Inference 3.0
- Empowering Enterprises with Generative AI: How Does MLPerf™ Help Support Requirements?
- Comparison of Top Accelerators from Dell Technologies’ MLPerf™
References
For more information about Dell PowerEdge servers, go to the following links:
- Dell’s PowerEdge XR7620 for Telecom/Edge Compute
- Dell’s PowerEdge XR5610 for Telecom/Edge Compute
- PowerEdge XR4520c Compute Sled specification sheet
- PowerEdge XE2420 Spec Sheet
For more information about NVIDIA GPUs, go to the following links:
MLCommons™ Inference v3.0 results presented in this document are based on the following system IDs:
ID | Submitter | Availability | System |
---|---|---|---|
2.1-0005 | Dell Technologies | Available | Dell PowerEdge XE2420 (1x T4, TensorRT) |
2.1-0017 | Dell Technologies | Available | Dell PowerEdge XR4520c (1x A2, TensorRT) |
2.1-0018 | Dell Technologies | Available | Dell PowerEdge XR4520c (1x A30, TensorRT) |
2.1-0019 | Dell Technologies | Available | Dell PowerEdge XR4520c (1x A2, MaxQ, TensorRT) |
2.1-0125 | Dell Technologies | Preview | Dell PowerEdge XR5610 (1x L4, TensorRT, MaxQ) |
2.1-0126 | Dell Technologies | Preview | Dell PowerEdge XR7620 (1x L4, TensorRT) |
Table 1: MLPerf™ system IDs

Dell Servers Excel in MLPerf™ Training v2.1
Wed, 16 Nov 2022 10:07:33 -0000
Dell Technologies has completed a successful submission to MLPerf Training v2.1, which marks our seventh round of submission to MLCommons™. This blog provides an overview and highlights the performance of the Dell PowerEdge R750xa, XE8545, and DSS 8440 servers that were used for the submission.
What’s new in MLPerf Training v2.1?
This round of submission does not include new benchmarks or changes to the existing benchmarks; the only change is in the submission compliance checker.
This round adds one-sided normalization to the checker to reduce variance in the number of steps to converge. This change means that if a result converges faster than the RCP mean within a certain range, the checker normalizes the results to the RCP mean. This normalization was not available in earlier rounds of submission.
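As a rough illustration of the idea (this is not the actual MLCommons compliance-checker code, and the range check is simplified away), one-sided normalization can be sketched as follows: runs that converge in fewer epochs than the reference convergence point (RCP) mean have their reported time scaled up as if they had taken the mean number of epochs, while slower runs are left untouched.

```python
def normalize_result(minutes: float, epochs_to_converge: float,
                     rcp_mean_epochs: float) -> float:
    """Simplified sketch of one-sided RCP normalization (illustrative only):
    runs that converge faster than the RCP mean are scaled up to the mean;
    runs that converge slower keep their measured time."""
    if epochs_to_converge < rcp_mean_epochs:
        return minutes * rcp_mean_epochs / epochs_to_converge
    return minutes

print(normalize_result(20.0, epochs_to_converge=2.7, rcp_mean_epochs=3.0))  # ~22.2
print(normalize_result(25.0, epochs_to_converge=3.2, rcp_mean_epochs=3.0))  # 25.0 (unchanged)
```

The effect is to reduce the run-to-run variance caused by lucky early convergence, so results across submitters are more directly comparable.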
What’s new in MLPerf Training v2.1 with Dell submissions?
For the Dell submission to MLPerf Training v2.1, we included:
- Improved performance with BERT and Mask R-CNN models
- Minigo submission results on Dell PowerEdge R750xa server with A100 PCIe GPUs
Overall Dell Submissions
Figure 1. Overall submissions for all Dell PowerEdge servers in MLPerf Training v2.1
Figure 1 shows our submissions, in which the workloads span image classification, lightweight and heavy object detection, speech recognition, natural language processing, recommender systems, medical image segmentation, and reinforcement learning. The submissions used different NVIDIA GPUs, including the A100 (in PCIe and SXM4 form factors with 40 GB and 80 GB of VRAM) and the A30.
The Minigo run on the PowerEdge R750xa server is a first-time submission; it used 4x A100-PCIe-80GB GPUs and took around 516 minutes to reach target quality.
Our result count has increased from 41 to 45. This increased number of submissions helps customers see the performance of systems using different PowerEdge servers, GPUs, and CPUs. With more results, customers can better see how different hardware configurations influence time to convergence.
We procured several winning titles that demonstrate the higher performance of our systems relative to other submitters, starting with the highest number of results across all submitters. Other titles include the top position in time to converge for BERT, ResNet, and Mask R-CNN with our PowerEdge XE8545 server powered by NVIDIA A100-40GB GPUs.
Improvement in Performance for BERT and Mask R-CNN
Figure 2. Performance gains from MLPerf v2.0 to MLPerf v2.1 running BERT
Figure 2 shows the improvements seen with the PowerEdge R750xa and PowerEdge XE8545 servers with A100 GPUs from MLPerf Training v2.0 to MLPerf Training v2.1 running the BERT language model workload. The PowerEdge XE8545 server with A100-80GB GPUs has the fastest time to convergence and the highest improvement at 13.1 percent, whereas the PowerEdge XE8545 server with A100-40GB GPUs shows a 7.74 percent improvement, followed by the PowerEdge R750xa server with A100-PCIe GPUs at 5.35 percent.
Figure 3. Performance gains from MLPerf v2.0 to MLPerf v2.1 running Mask R-CNN
Figure 3 shows the improvements seen with the PowerEdge XE8545 server with A100 GPUs. There is a 3.31 percent improvement in time to convergence with MLPerf v2.1.
For both BERT and Mask R-CNN, the improvements are software-based. These results show that software-only improvements can reduce convergence time. Customers can benefit from similar improvements without any changes in their hardware environment.
The following sections compare the performance differences between SXM and PCIe form factor GPUs.
Performance Difference Between PCIe and SXM4 Form Factor with A100 GPUs
Figure 4. SXM4 form factor compared to PCIe for BERT
Figure 5. SXM4 form factor compared to PCIe for Resnet50 v1.5
Figure 6. SXM4 form factor compared to PCIe for RNN-T
Table 1: Time to converge for the SXM4 and PCIe form factors (lower is better)
System | BERT | Resnet50 | RNN-T |
---|---|---|---|
R750xax4A100-PCIe-80GB | 48.95 | 61.27 | 66.19 |
XE8545x4A100-SXM-80GB | 32.79 | 54.23 | 55.08 |
Percentage difference | 39.54% | 12.19% | 18.32% |
Figures 4, 5, and 6 and Table 1 show that the SXM form factor is faster than the PCIe form factor for the BERT, Resnet50 v1.5, and RNN-T workloads.
The SXM form factor typically consumes more power and is faster than PCIe. For the above workloads, the minimum improvement in convergence that customers can expect is in double digits, ranging from approximately 12 percent to 40 percent, depending on the workload.
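The "Percentage difference" row in Table 1 can be reproduced as the absolute difference between the two times expressed relative to their mean, as the short sketch below shows using the values from the table.

```python
def percentage_difference(a: float, b: float) -> float:
    """Absolute difference expressed relative to the mean of the two values."""
    return abs(a - b) / ((a + b) / 2) * 100

# Time-to-converge values from Table 1 (PCIe vs. SXM4).
results = {"BERT": (48.95, 32.79), "Resnet50": (61.27, 54.23), "RNN-T": (66.19, 55.08)}
for workload, (pcie, sxm) in results.items():
    print(f"{workload}: {percentage_difference(pcie, sxm):.2f}%")
# BERT: 39.54%, Resnet50: 12.19%, RNN-T: 18.32%
```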
Multinode Results Comparison
Multinode performance assessment is more important than ever. With the advent of large models and different parallelism techniques, customers have an ever-increasing need to find results faster. Therefore, we have submitted several multinode results to assess scaling performance.
Figure 7. BERT multinode results with PowerEdge R750xa and XE8545 servers
Figure 7 indicates multinode results from three different systems with the following configurations:
- R750xa with 4 A100-PCIe-80GB GPUs
- XE8545 with 4 A100-SXM-40GB GPUs
- XE8545 with 4 A100-SXM-80GB GPUs
Every node of the above systems has four GPUs. When the graph shows eight GPUs, the performance results are derived from two nodes; similarly, for 16 GPUs the results are derived from four nodes, and so on.
Figure 8. Resnet50 multinode results with R750xa and XE8545 servers
Figure 9. Mask R-CNN multinode results with R750xa and XE8545 servers
As shown in Figures 7, 8, and 9, the multinode results for BERT, Resnet50, and Mask R-CNN scale linearly or nearly linearly. This shows that Dell servers offer outstanding performance with both single-node and multinode scaling.
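One common way to read charts like these is to compute scaling efficiency: the fraction of ideal linear speedup achieved as the GPU count grows. The sketch below illustrates the calculation with hypothetical time-to-train numbers, not values from the figures.

```python
def scaling_efficiency(base_minutes: float, base_gpus: int,
                       minutes: float, gpus: int) -> float:
    """Fraction of ideal (linear) speedup achieved when scaling from base_gpus to gpus."""
    ideal_minutes = base_minutes * base_gpus / gpus
    return ideal_minutes / minutes

# Hypothetical time-to-train numbers purely for illustration (not from the figures).
runs = {4: 60.0, 8: 31.0, 16: 16.5}
base_gpus, base_minutes = 4, runs[4]
for gpus, minutes in runs.items():
    eff = scaling_efficiency(base_minutes, base_gpus, minutes, gpus)
    print(f"{gpus} GPUs: {eff:.0%} of linear scaling")
```

Efficiency close to 100 percent across GPU counts is what "linear or nearly linear" scaling means in practice.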
Conclusion
The findings described in this blog show that:
- Dell servers can run all types of workloads in the MLPerf Training submission.
- Software-only enhancements reduce time to solution for our customers, as shown in our MLPerf Training v2.1 submission, and customers can expect to see improvements in their environments.
- Dell PowerEdge XE8545 and PowerEdge R750xa servers with NVIDIA A100 with PCIe and SXM4 form factors are both great selections for all deep learning models.
- PCIe-based PowerEdge R750xa servers can deliver reinforcement learning workloads in addition to other classes of workloads, such as image classification, lightweight and heavy object detection, speech recognition, natural language processing, and medical image segmentation.
- The single-node results of our submission indicate that Dell servers deliver outstanding performance, and the multinode results show well-scaled performance that helps reduce time to solution across a distinct set of workload types. This makes Dell servers apt for both small training workloads on single nodes and large deep learning training workloads on multiple nodes.
Appendix
System Under Test
MLPerf system configurations for PowerEdge XE8545 systems
System | Operating system | CPU | Memory | GPU | GPU form factor | GPU count | Networking | Software stack |
---|---|---|---|---|---|---|---|---|
XE8545x4A100-SXM-40GB, 2xXE8545x4A100-SXM-40GB, 4xXE8545x4A100-SXM-40GB, 8xXE8545x4A100-SXM-40GB, 16xXE8545x4A100-SXM-40GB, 32xXE8545x4A100-SXM-40GB | Red Hat Enterprise Linux | AMD EPYC 7713 | 1 TB | NVIDIA A100-SXM-40GB | SXM4 | 4, 8, 16, 32, 64, 128 | 2x ConnectX-6 IB HDR 200Gb/Sec | CUDA 11.6, Driver 510.47.03, cuBLAS 11.9.2.110, cuDNN 8.4.0.27, TensorRT 8.0.3, DALI 1.5.0, NCCL 2.12.10, Open MPI 4.1.1rc1, MOFED 5.4-1.0.3.0 |
XE8545x4A100-SXM-80GB | Ubuntu 20.04.4 | AMD EPYC 7763 | 1 TB | NVIDIA A100-SXM-80GB | SXM4 | 4 | | CUDA 11.6, Driver 510.47.03, cuBLAS 11.9.2.110, cuDNN 8.4.0.27, TensorRT 8.0.3, DALI 1.5.0, NCCL 2.12.10, Open MPI 4.1.1rc1, MOFED 5.4-1.0.3.0 |
2xXE8545x4A100-SXM-80GB, 4xXE8545x4A100-SXM-80GB | Red Hat Enterprise Linux | AMD EPYC 7713 | 1 TB | NVIDIA A100-SXM-80GB | SXM4 | 4, 8 | | CUDA 11.6, Driver 510.47.03, cuBLAS 11.9.2.110, cuDNN 8.4.0.27, TensorRT 8.0.3, DALI 1.5.0, NCCL 2.12.10, Open MPI 4.1.1rc1, MOFED 5.4-1.0.3.0 |
MLPerf system configurations for Dell PowerEdge R750xa servers
| 2xR750xa_A100 | 8xR750xa_A100 |
MLPerf System ID | 2xR750xax4A100-PCIE-80GB | 8xR750xax4A100-PCIE-80GB |
Operating system | CentOS 8.2.2004 | |
CPU | Intel Xeon Gold 6338 | |
Memory | 512 GB | |
GPU | NVIDIA A100-PCIE-80GB | |
GPU form factor | PCIe | |
GPU count | 4,32 | |
Networking | 1x ConnectX-5 IB EDR 100Gb/Sec | |
Software stack | CUDA 11.6 Driver 470.42.01 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 | CUDA 11.6 Driver 470.42.01 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 |
MLPerf system configurations Dell DSS 8440 servers
| DSS 8440 |
MLPerf System ID | DSS8440x8A30-NVBRIDGE |
Operating system | CentOS 8.2.2004 |
CPU | Intel Xeon Gold 6248R |
Memory | 768 GB |
GPU | NVIDIA A30 |
GPU form factor | PCIe |
GPU count | 8 |
Networking | 1x ConnectX-5 IB EDR 100Gb/Sec |
Software stack | CUDA 11.6 Driver 510.47.03 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 |