Reference Architecture: Acceleration over PCIe for Dell EMC PowerEdge MX7000
Wed, 12 Aug 2020 14:04:57 -0000
Many of today's demanding applications require GPU resources. Our reference architecture incorporates GPUs into the PowerEdge MX infrastructure, using the PowerEdge MX Scalable Fabric, Dell EMC DSS 8440 GPU Server, and Liqid Command Center Software. Request a remote demo of this reference architecture or a quote from Dell Technologies Design Solutions Experts at the Design Solutions Portal.
The Dell EMC PowerEdge MX7000 Modular Chassis simplifies the deployment and management of today’s most challenging workloads by allowing IT administrators to dynamically assign, move, and scale shared pools of compute, storage, and networking resources. It gives IT administrators the ability to deliver fast results without having to manually manage and reconfigure infrastructure to meet the ever-changing needs of their end users. Adding PCIe infrastructure to this managed pool of resources, using Liqid technology designed on the Dell EMC MX7000, expands the promise of software-defined composability for today’s AI-driven compute environments and high-value applications.
GPU Acceleration for PowerEdge MX7000
For workloads like AI that require parallel accelerated computing, the addition of GPU acceleration within the PowerEdge MX7000 is paramount. With Liqid technology and management software, GPUs of any form factor can be added to any new or existing MX compute sled through the management interface, quickly delivering the resources needed for each step of the machine learning workflow, including data ingest, cleansing, training, and inferencing. Spin up new bare-metal servers with the exact number of accelerators required, and then dynamically add or remove accelerators as workload needs change.
GPU Expansion Over PCIe
Up to 8 x Compute Sleds per Chassis
PCIe Expansion Chassis
PCIe Gen3 x4 Per Compute Sled
20x GPU (FHFL)
V100, A100, RTX, T4, Others
Linux, Windows, VMware and Others
GPU, FPGA, and NVMe Storage
14U Total = MX7000 (7U) + PCIe Expansion Chassis (7U)
Implementing GPU Expansion for MX
GPUs are installed into the PCIe expansion chassis. Next, U.2 to four PCIe Gen3 adapters are added to each compute sled that requires GPU acceleration, and then they are connected to the expansion chassis (Figure 1). Liqid Command Center software enables discovery of all GPUs, making them ready to be added to the server over native PCIe. FPGA and NVMe storage can also be added to compute nodes in tandem. The PCIe expansion chassis and software are available from the Dell Design Solutions team.
Software Defined Composability
Once PCIe devices are connected to the MX7000, Liqid Command Center software enables the dynamic allocation of GPUs to MX compute sleds at the bare-metal level. Any number of resources can be added to the compute sleds, in any ratio, through the Liqid Command Center GUI or RESTful API (GPU hot-plug is supported). To the operating system, the GPUs are presented as local resources directly connected to the MX compute sled over PCIe (Figure 3). All operating systems are supported, including Linux, Windows, and VMware. As workload needs change, resources, including NVMe SSDs and FPGAs, can be added or removed on the fly through software (Table 1).
Enabling GPU Peer-2-Peer Capability
A key feature included with the PCIe expansion solution for PowerEdge MX7000 is the ability for RDMA Peer-2-Peer between GPU devices. Direct RDMA transfers have a massive impact on both throughput and latency for the highest performing GPU-centric applications. Up to 10x improvement in performance has been achieved with RDMA Peer-2-Peer enabled. Below is the overview of how PCIe Peer-2-Peer functions (Figure 4).
Bypassing the x86 processor and enabling direct RDMA communication between GPUs yields a dramatic improvement in bandwidth and a reduction in latency. Table 2 outlines the performance expected for GPUs that are composed to a single node with GPU RDMA Peer-2-Peer enabled.
Application Level Performance
RDMA Peer-2-Peer is a key feature in GPU scaling for artificial intelligence, specifically for machine learning based applications. Figure 5 outlines performance data measured on mainstream AI/ML applications on the MX7000 with GPU expansion over PCIe. It further demonstrates the performance scaling from one GPU to eight GPUs for a single MX740c compute sled. High scaling efficiency is observed for ResNet152, VGG16, Inception V3, and ResNet50 on the MX7000 with composable PCIe GPUs, measured with Peer-2-Peer enabled. These results indicate a near-linear growth pattern. With the current capabilities of the Liqid PCIe 7U expansion chassis, up to 20 GPUs can be allocated to an application running on a single node.
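Scaling efficiency of the kind described above is commonly computed as the measured multi-GPU throughput divided by the ideal linear throughput. The following is a minimal sketch of that calculation. The function name and the throughput values are hypothetical placeholders chosen for illustration; they are not the measurements shown in Figure 5.

```python
def scaling_efficiency(throughput_by_gpu_count):
    """Return {gpu_count: efficiency}, where 1.0 means perfectly linear scaling.

    Efficiency = throughput at N GPUs / (N * throughput at 1 GPU).
    """
    baseline = throughput_by_gpu_count[1]
    return {
        n: throughput / (n * baseline)
        for n, throughput in sorted(throughput_by_gpu_count.items())
    }


# Placeholder images/sec values for illustration only (not measured data).
example_throughput = {1: 100.0, 2: 196.0, 4: 380.0, 8: 720.0}

efficiency = scaling_efficiency(example_throughput)
for gpus, eff in efficiency.items():
    print(f"{gpus} GPU(s): {eff:.0%} of linear scaling")
```

An efficiency that stays close to 1.0 as the GPU count grows corresponds to the near-linear pattern described in the text.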
Liqid PCIe expansion for the Dell EMC PowerEdge MX7000 unlocks the ability to manage the most demanding workloads in which accelerators are required, for both new and existing deployments. Liqid collaborated with Dell Technologies Design Solutions to accelerate applications through the addition of GPUs to the Dell EMC MX compute sleds over PCIe.
This reference architecture is available as part of the Dell Technologies Design Solutions.
Related Blog Posts
Dell EMC vSAN Ready Nodes: Taking VDI and AI Beyond “Good Enough”
Mon, 18 Oct 2021 12:52:37 -0000
Some people have speculated that 2020 was “the year of VDI” while others say that it will never be the “year of VDI.” However, there is one certainty. In 2020 and part of 2021, organizations worldwide consumed a large amount of virtual desktop infrastructure (VDI). Some of these deployments went extremely well while other deployments were just “good enough.”
If you are a VDI enthusiast like me, there was much to learn from all that happened over the last 24 months. An interesting observation is that test VDI environments turned into production environments overnight. Also, people discovered that the capacity of clouds is not limitless. My favorite observation is the discovery by many IT professionals that GPUs can change the VDI experience from “good enough” to enjoyable, especially when coupled with an outstanding environment powered by Dell Technologies with VMware vSphere and VMware Horizon.
In this blog, I will tell you about how exceptional VDI (and AI/ML) is when paired with powerful technology.
This blog does not address cloud workloads because that is a substantial topic in its own right. It would be difficult for me to give it the proper level of attention here, so I will address only on-premises deployments.
Many end users adopt hyperconverged infrastructure (HCI) in their data centers because it is easy to consume. One of the most popular HCIs is Dell EMC VxRail Hyperconverged Infrastructure. You can purchase nodes to match your needs. These needs range from traditional data center workloads, to Tanzu clusters, to VDI with GPUs, and to AI. VxRail enables you to deliver whatever your end users need. If your end users are developers working from home on a containers-based AI project and they need a development environment, VxRail can provide it with relative ease.
Some IT teams might want an HCI experience that is more customer managed, but they still want a system that is straightforward to deploy, validate, and maintain. This scenario is where Dell EMC vSAN Ready Nodes come into play.
Dell EMC vSAN Ready Nodes provide comprehensive, flexible, and efficient solutions optimized for your workforce’s business goals with a large choice of options (more than 250 as of the September 29, 2021 vSAN Compatibility Guide) from tower to rack mount to blades. A surprising option is that you can purchase Dell EMC vSAN Ready Nodes with GPUs, making them a great platform for VDI and virtualized AI/ML workloads.
Dell EMC vSAN Ready Nodes support many NVIDIA GPUs used for VDI and AI workloads, notably the NVIDIA M10 and A40 GPUs for VDI workloads and the NVIDIA A30 and A100 GPUs for AI workloads. Other GPUs are available depending on workload requirements; however, this blog focuses on the more common use cases.
For some time, the NVIDIA M10 GPU has been the GPU of choice for VDI-based knowledge workers who typically use applications such as Microsoft PowerPoint and YouTube. The M10 GPU provides a high density of users per card and can support multiple virtual GPU (vGPU) profiles per card. The multiple profiles result from having four GPU chips per PCI board. Each chip can run a unique vGPU profile, which means that you can have four vGPU profiles. That is, there are twice as many profiles as are provided by other NVIDIA GPUs. This scenario is well suited for organizations with a larger set of desktop profiles.
By combining this profile capacity with Dell EMC vSAN Ready Nodes, organizations can deliver various desktop options while remaining on a standardized platform. Organizations can let end users choose the system that suits them best and can optimize IT resources by aligning them to an end user’s needs.
Typically, power users need or want more graphics capabilities than knowledge workers. For example, power users working in CAD applications need larger vGPU profiles and other capabilities like NVIDIA’s Ray Tracing technology to render drawings. These power users’ VDI instances tend to be more suited to the NVIDIA A40 GPU and associated vGPU profiles. It allows power users who do more than create Microsoft PowerPoint presentations and watch YouTube videos to have the desktop experience they need to work effectively.
The ideal Dell EMC vSAN Ready Nodes platform for the A40 GPU is based on the Dell EMC PowerEdge R750 server. The PowerEdge R750 server provides the power and capacity for demanding workloads like healthcare imaging and natural resource exploration. These workloads also tend to take full advantage of other features built into NVIDIA GPUs like CUDA. CUDA is a parallel computing platform and programming model that uses GPUs. It is used in many high-end applications. Typically, CUDA is not used with traditional graphics workloads.
In this scenario, we start to see the blend between graphics and AI/ML workloads. Some VDI users not only render complex graphics sets, but also use the GPU for other computational outcomes, much like AI and ML do.
I really like that I can run AI/ML workloads in a virtual environment. It does not matter if you are an IT administrator or an AI/ML administrator. You can run AI and ML workloads in a virtual environment.
Many organizations have realized that the same benefits virtualization has brought to IT can also be realized in the AI/ML space. There are additional advantages, but those are best kept for another time.
For some organizations, IT is now responsible for AI/ML environments, whether delivering test/dev environments for programmers or delivering a complete AI training environment. For other IT groups, this responsibility falls to highly paid data scientists. And for some IT groups, the responsibility is a mix.
In this scenario, virtualization shines. IT administrators can do what they do best: deliver a powerful Dell EMC vSAN Ready Node infrastructure. Then, data scientists can spend their time building systems in a virtual environment consuming IT resources instead of racking and cabling a server.
Dell EMC vSAN Ready Nodes are great for many AI/ML applications. They are easy to consume as a single unit of infrastructure. Both the NVIDIA A30 GPU and the A100 GPU are available so that organizations can quickly and easily assemble the ideal architecture for AI/ML workloads.
This ease of consumption is important for both IT and data scientists. It is unacceptable when IT consumers like data scientists must wait for the infrastructure they need to do their job. Time is money. Data scientists need environments quickly, which Dell EMC vSAN Ready Nodes can help provide. Dell EMC vSAN Ready Nodes deploy 130 percent faster with Dell EMC OpenManage Integration for VMware vCenter (OMIVV). (Based on Dell EMC internal competitive testing of PowerEdge and OMIVV compared to Cisco UCS manual operating system deployment.)
This speed extends beyond day 0 (deployment) to day 1+ operations. When using vLCM and OMIVV, complete hypervisor and firmware updates to an eight-node PowerEdge cluster took under four minutes, compared to a manual process that took 3.5 hours. (Principled Technologies report commissioned by Dell Technologies, New VMware vSphere 7.0 features reduced the time and complexity of routine update and hardware compliance tasks, July 2020.)
Dell EMC vSAN Ready Nodes ensure that you do not have to be an expert in hardware compatibility. With over 250 Dell EMC vSAN Ready Nodes available (as of the September 29, 2021 vSAN Compatibility Guide), you do not need to guess which drives will work or if a network adapter is compatible. You can then focus more on data and the results and less on building infrastructure.
These time-to-value considerations, especially for AI/ML workloads, are important. Being able to deliver workloads such as AI/ML or VDI quickly can have a significant impact on organizations, as has been evident in many organizations over the last two years. It has been amazing to see how fast organizations have adopted or expanded their VDI environments to accommodate everyone from knowledge workers to high-end power users wherever they need to consume IT resources.
Beyond “just expanding VDI” to more users, organizations have discovered that GPUs can improve the end-user experience and, in some cases, were not only helpful but required. For many, the NVIDIA M10 GPU helped users gain the remote experience they wanted and move beyond “good enough.” For others who needed a more graphics-rich experience, the NVIDIA A40 GPU continues to be an ideal choice.
When GPUs are brought together as part of a Dell EMC vSAN Ready Node, organizations have the opportunity to deliver an expanded VDI and AI/ML experience to their users. To find out more about Dell EMC vSAN Ready Nodes, see Dell EMC vSAN Ready Nodes.
Author: Tony Foster Twitter: @wonder_nerd LinkedIn: https://linkedin.com/in/wondernerd
Comparison of MLPerf™ Inference v1.1 Results of Dell EMC PowerEdge R7525 Servers with NVIDIA GPUs
Fri, 15 Oct 2021 19:25:26 -0000
This blog showcases the MLPerf Inference v1.1 performance results of Dell EMC PowerEdge R7525 servers configured with NVIDIA A100 40 GB GPUs or with NVIDIA A30 GPUs. We compare the cost of a system with both types of GPUs to help you choose the best configuration for your AI inference workloads.
MLPerf Inference v1.1 falls under the benchmarks and metrics category from MLCommons™ and serves as the industry standard for machine learning (ML) inference performance. The MLPerf benchmarking suite measures the performance of ML workloads consistently and fairly. The MLPerf Inference benchmark measures how fast a system can perform ML inference by using a pretrained model in various deployment scenarios. For a comprehensive understanding of MLPerf Inference, see this blog.
Test bed details
The systems under test (SUT) include:
- PowerEdge R7525 server that is configured with three NVIDIA A100 PCIe 40 GB (250 W, 40 GB passive, double wide, full height GPU) GPUs. All references to the PowerEdge R7525 server with A100 GPUs assume that the configuration includes three NVIDIA A100 GPUs.
- PowerEdge R7525 server that is configured with three NVIDIA A30 (165 W, 24 GB passive, double wide, full height GPU with cable) GPUs. All references to the PowerEdge R7525 server with A30 GPUs assume that the configuration includes three NVIDIA A30 GPUs.
The following figure shows the PowerEdge R7525 server:
The following table shows the MLPerf system configurations for the SUTs:
Table 1: SUT configuration
PowerEdge R7525 with 3 A100 PCIe 40 GB GPUs
PowerEdge R7525 with 3 A30 GPUs
MLPerf System ID
GPU Driver 470.42.01
MLPerf Inference v1.1 results per model
ResNet50 is a 50-layer deep convolution neural network that is used for many computer vision applications. This neural network can address vanishing gradients using the concept of skip connections by allowing gradients to move through layers in the network. For an introduction to ResNet, see Deep Residual Learning for Image Recognition.
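The effect of skip connections on vanishing gradients can be illustrated with a deliberately simplified scalar model. This sketch assumes that every layer contributes the same small local slope, which is not how a real ResNet50 behaves (real layers are convolutions with learned weights); the function names and numbers are hypothetical and serve only to show why adding the identity path keeps the gradient from collapsing.

```python
def plain_stack_gradient(local_slope, depth):
    """Gradient through `depth` stacked layers y = f(x), each with slope `local_slope`."""
    grad = 1.0
    for _ in range(depth):
        grad *= local_slope
    return grad


def residual_stack_gradient(local_slope, depth):
    """Gradient through `depth` stacked residual layers y = f(x) + x.

    The identity (skip) path adds 1 to each layer's slope, so the product
    does not shrink toward zero when `local_slope` is small and non-negative.
    """
    grad = 1.0
    for _ in range(depth):
        grad *= local_slope + 1.0
    return grad


# Assumed toy values: 50 layers, each with a small local slope of 0.01.
plain = plain_stack_gradient(0.01, 50)
residual = residual_stack_gradient(0.01, 50)
print(f"plain stack gradient:    {plain:.3e}")
print(f"residual stack gradient: {residual:.3e}")
```

In this toy setting the gradient through the plain stack is vanishingly small, while the gradient through the residual stack stays above 1.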
We conducted four tests on this model across the two SUTs: two in the Offline scenario and two in the Server scenario. The following figure shows our ResNet50 results. The performance of the PowerEdge R7525 server with A100 GPUs across both scenarios is approximately 50 percent higher than the PowerEdge R7525 server with A30 GPUs.
Figure 1: ResNet50 results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs
Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art language representation model. In essence, BERT is a stack of Transformer encoders. The Transformer architecture is fast because it can process words simultaneously, and the context of words can be learned from both directions simultaneously. BERT can be used for neural machine translation, question answering, sentiment analysis, and text summarization, all of which require language understanding. BERT is trained in two phases: pretraining, in which the model learns language and context, and fine-tuning, in which BERT learns specific tasks such as question answering. For an in-depth understanding, see BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding from Google AI Language.
For this model, we conducted eight tests across our systems in which we considered the default and high accuracy modes in both the Server and Offline scenarios. In the default mode, the PowerEdge R7525 server with A100 GPUs performed 69 percent better than the PowerEdge R7525 server with A30 GPUs in the Offline scenario and 99 percent better in the Server scenario. The high accuracy mode provided similar results in which the PowerEdge R7525 server with A100 GPUs performed 72 percent better than the PowerEdge R7525 server with A30 GPUs in the Offline scenario and 96 percent better in the Server scenario. In the following figure, bert-99 refers to the default accuracy target, whereas bert-99.9 refers to the high accuracy target.
Figure 2: BERT results on a PowerEdge R7525 with A100 GPUs and a PowerEdge R7525 with A30 GPUs
ResNet34 is an encoder on top of Single Shot Multibox Detector (SSD) that is used to improve performance and reduce training time. As the full form suggests, the SSD is a single-stage object detection model that is known for speed. For an in-depth understanding, see Small Object Detection using Context and Attention.
For this model, we conducted four tests across both of our systems. In the Offline scenario, the PowerEdge R7525 server with A100 GPUs outperformed the PowerEdge R7525 server with A30 GPUs by 74 percent. Similarly, in the Server scenario, the PowerEdge R7525 server with A100 GPUs performed 78 percent better than the PowerEdge R7525 server with A30 GPUs.
Figure 3: SSD-ResNet34 results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs
DLRM, an open-source Deep Learning Recommendation Model, is available on Facebook’s PyTorch platform. The model is composed of compute-dominated multilayer perceptrons (MLPs) and relies on data parallelism to improve performance. When predicting click percentage for certain items, for example, it is aligned with randomized Las Vegas algorithms in which resources (time and memory) are used freely but the results are always correct. DLRM uses collaborative filtering and predictive analysis-based approaches to process large amounts of data. For more information about DLRM, see Deep Learning Recommendation Model for Personalization and Recommendation Systems.
For this model, we conducted eight tests across both of our systems. For the PowerEdge R7525 server with A100 GPUs, we notice a tight range with a lower and upper bound of 764,569 and 768,806 result samples per second, respectively. Also, the results produced across the default and high accuracy tests are the same for their respective systems. The initial numbers from the PowerEdge R7525 server with A30 GPUs were slightly below expectations. After the submission deadline, our team was able to extract additional performance, particularly in the Server scenario. The numbers for the PowerEdge R7525 server with A30 GPUs shown in the following figure are not the same as the numbers published on the MLCommons website. However, these numbers are valid and pass all the required compliance tests. The PowerEdge R7525 server with A30 GPUs behaved like the PowerEdge R7525 server with A100 GPUs in that the Server scenario results are slightly lower than the Offline results. The tuned numbers provided the best per card performance among all A30 GPU submissions.
Figure 4: DLRM results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs
Recurrent Neural Network Transducer (RNNT) is a type of recurrent neural network in which outputs from previous steps are recycled as inputs for the current step. By using one-hot encoding and memory, RNNT can remember information through time that might be useful in time series prediction. This model uses a squashing function to learn to predict the next potential word or step to take. The result of the squashing function is always between –1 and 1, which allows neural networks to remain nonlinear and thus effective as the same values are passed through the neural network.
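The recurrence and the squashing function described above can be sketched with a single scalar recurrent step. This is a toy illustration that assumes tanh as the squashing function and uses hypothetical weights and inputs; it is not the RNNT model used in the MLPerf benchmark.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, bias):
    """One scalar recurrent step: the previous output h_prev feeds back in as an input."""
    return math.tanh(w_x * x + w_h * h_prev + bias)


# Hypothetical weights and inputs chosen only for illustration.
w_x, w_h, bias = 0.8, 0.5, 0.0
inputs = [1.0, -2.0, 3.5, 10.0]

h = 0.0  # initial hidden state
states = []
for x in inputs:
    h = rnn_step(x, h, w_x, w_h, bias)  # output of this step becomes input to the next
    states.append(h)

print(states)
```

However large the inputs become, tanh keeps every state within the range from -1 to 1, so the values that are fed back into the next step stay bounded.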
For this model, we conducted four tests across both of our systems. In the Offline scenario, the PowerEdge R7525 server with A100 GPUs outperformed the PowerEdge R7525 server with A30 GPUs by 80 percent. In the Server scenario, the PowerEdge R7525 server with A100 GPUs excelled by performing 199 percent better than the PowerEdge R7525 server with A30 GPUs.
Figure 5: RNNT results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs
3D U-Net is an elegant improvement to the sliding window approach of convolutional neural networks (CNNs) in which fewer training images can be used and more precise segmentations can be yielded. In brief, an input image goes through a contraction and expansion path (in a U-shaped architecture with skip connections) and becomes a segmentation map output. This segmentation map provides class labels for what is inside the image. For a deeper understanding of 3D U-Net's architecture, see U-Net: Convolutional Networks for Biomedical Image Segmentation.
Across the two systems, we conducted Offline scenario tests for the default and high accuracy modes. The default and high accuracy modes yielded the same results across the two systems. Across the two systems, the PowerEdge R7525 server with A100 GPUs performed 75 percent better than the PowerEdge R7525 server with A30 GPUs.
Figure 6: 3D U-Net results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs
When placing an order for the PowerEdge R7525 Rack Server on the Dell Technologies website, customers are guided through the purchasing process with suggestions and requirements for their specific rack server. The PowerEdge R7525 server with three NVIDIA Ampere A100 GPUs is 1.423 times more expensive than the PowerEdge R7525 server with three NVIDIA Ampere A30 GPUs. The price difference between the two configurations is due to the powerful GPU itself. Also, the PowerEdge R7525 server with A100 GPUs requires higher performance fans and a more powerful thermal configuration. Despite the additional options required for the PowerEdge R7525 server with A100 GPUs, understanding the throughput performance (queries per second (QPS) in the Server mode and samples per second in the Offline mode) per dollar provides valuable insight into achievable performance per dollar spent.
The following figure shows the relative performance of the two systems per dollar. If we divide the performance achieved on a system for a particular benchmark by the total cost of the system, we determine the achievable throughput per dollar spent on the system. A higher throughput per dollar indicates that greater performance can be extracted from the system for each dollar spent.
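The normalization described above can be approximated from numbers already quoted in this blog. The sketch below assumes that the figure of 1.423 can be read as the price ratio of the A100 system to the A30 system, and it uses the "percent better" values stated in the results sections rather than the raw benchmark results, so its output might not match Figure 7 exactly. The benchmark labels and function name are illustrative.

```python
PRICE_RATIO_A100_TO_A30 = 1.423  # assumed reading of the cost comparison in this blog

# Percent by which the A100 system outperformed the A30 system, as quoted in the text.
a100_percent_better = {
    "bert-99 Offline": 69,
    "bert-99 Server": 99,
    "ssd-resnet34 Offline": 74,
    "ssd-resnet34 Server": 78,
    "rnnt Offline": 80,
    "rnnt Server": 199,
    "3d-unet Offline": 75,
}


def relative_throughput_per_dollar(percent_better, price_ratio):
    """A100 throughput per dollar relative to the A30 system (A30 system = 1.0)."""
    return (1 + percent_better / 100) / price_ratio


relative = {
    name: relative_throughput_per_dollar(pct, PRICE_RATIO_A100_TO_A30)
    for name, pct in a100_percent_better.items()
}
for name, value in relative.items():
    print(f"{name}: {value:.2f}x the A30 system's throughput per dollar")
```

A value above 1.0 means that the A100 system delivers more throughput per dollar than the A30 system for that benchmark; the break-even point under this assumption is a 42.3 percent performance advantage.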
Figure 7: Relative QPS per cost of a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs
In the figure, the orange line shows the normalized data of the throughput per cost of the PowerEdge R7525 server with A30 GPUs. The blue bars indicate the relative achievable performance of the PowerEdge R7525 server with A100 GPUs. For most of the benchmarks, we see an acceptable range of performance on both systems. However, the PowerEdge R7525 server with A100 GPUs unconditionally outperformed the PowerEdge R7525 server with A30 GPUs in the DLRM Server default and high accuracy modes as well as in the RNNT Server mode. Both systems perform well per dollar spent.
Note: We compiled the cost data in this section from the PowerEdge R7525 Rack Server page on the Dell Technologies website on September 7, 2021. The data might be subject to change.
This blog provides a detailed comparison of performance between the Dell EMC PowerEdge R7525 server configured with three A100 GPUs and the Dell EMC PowerEdge R7525 server configured with three A30 GPUs. If your ML workload focuses on inferencing, the PowerEdge R7525 server configured with A100 GPUs might suit your needs well. However, if you are looking for a system that not only performs well but is also more cost-effective, the PowerEdge R7525 server configured with A30 GPUs will suit those needs. Both systems performed well and are a great investment based on your ML workload requirements.
In future blogs, we plan to describe:
- How to run MLPerf Inference v1.1
- The PowerEdge R750xa server as a platform for inference v1.1
- The DSS8440 server as a platform for inference v1.1
- The PowerEdge R7525 server as a platform for inference v1.1
- The PowerEdge XE8545 server as a platform for inference v1.1
- Comparison of inference v1.0 performance with inference v1.1 performance