
GPU-Accelerated AI and ML Capabilities
Mon, 14 Dec 2020 15:37:06 -0000
Dell EMC Integrated System for Microsoft Azure Stack Hub has been extending Microsoft Azure services to customer-owned data centers for over three years. Our platform has enabled organizations to create a hybrid cloud ecosystem that drives application modernization and to address business concerns around data sovereignty and regulatory compliance.
Dell Technologies, in collaboration with Microsoft, is excited to announce upcoming enhancements that will unlock valuable, real-time insights from local data using GPU-accelerated AI and ML capabilities. Actionable information can be derived from large on-premises data sets at the intelligent edge without sacrificing security.
Partnership with NVIDIA
Today, customers can order our Azure Stack Hub dense scale unit configuration with NVIDIA Tesla V100S GPUs for running compute-intensive AI processes like inferencing, training, and visualization from virtual machine or container-based applications. Some customers choose to run Kubernetes clusters on their hardware-accelerated Azure Stack Hub scale units to process and analyze data sent from IoT devices or Azure Stack Edge appliances. Powered by the Dell EMC PowerEdge R840 rack server, these NVIDIA Tesla V100S GPUs use Discrete Device Assignment (DDA), also known as GPU pass-through, to dedicate one or more GPUs to an Azure Stack Hub NCv3 VM.
The following figure illustrates the resources installed in each GPU-equipped Azure Stack Hub dense configuration scale unit node.
This month, our Dell EMC Azure Stack Hub release 2011 will also support the NVIDIA T4 GPU – a single-slot, low-profile adapter powered by NVIDIA Turing Tensor Cores. These GPUs are perfect for accelerating diverse cloud-based workloads, including light machine learning, inference, and visualization. These adapters can be ordered with Dell EMC Azure Stack Hub all-flash scale units powered by Dell EMC PowerEdge R640 rack servers. Like the NVIDIA Tesla V100S, these GPUs use DDA to dedicate one adapter’s powerful capabilities to a single Azure Stack Hub NCas_v4 VM. A future Azure Stack Hub release will also enable GPU partitioning on the NVIDIA T4.
The following figure illustrates the resources installed in each GPU-equipped Azure Stack Hub all-flash configuration scale unit node.
Partnership with AMD
We are also pleased to announce a partnership with AMD to deliver GPU capabilities in our Dell EMC Integrated System for Microsoft Azure Stack Hub. Customers can order our dense scale unit configuration today with AMD Radeon Instinct MI25 GPUs aimed at graphics-intensive visualization workloads like simulation, CAD applications, and gaming. The MI25 uses GPU partitioning (GPU-P) technology to allow users of an Azure Stack Hub NVv4 VM to consume only a portion of the GPU’s resources based on their workload requirements.
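As a rough illustration, the GPU pairings described in the two partnership sections can be captured in a small lookup structure. This is only a sketch: the class name `GpuOption`, the field names, and the helper `options_for` are hypothetical, and the entries contain only what the text above states.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GpuOption:
    """One GPU offering as described in the text (field names are illustrative)."""
    gpu: str
    scale_unit: str   # "dense" or "all-flash"
    vm_series: str    # Azure Stack Hub N-Series VM size family
    assignment: str   # "DDA" (pass-through) or "GPU-P" (partitioning)


# Entries transcribed from the partnership sections above.
GPU_OPTIONS = [
    GpuOption("NVIDIA Tesla V100S", "dense", "NCv3", "DDA"),
    GpuOption("NVIDIA T4", "all-flash", "NCas_v4", "DDA"),
    GpuOption("AMD Radeon Instinct MI25", "dense", "NVv4", "GPU-P"),
]


def options_for(scale_unit: str) -> List[GpuOption]:
    """Return the GPU options offered for a given scale unit configuration."""
    return [opt for opt in GPU_OPTIONS if opt.scale_unit == scale_unit]


if __name__ == "__main__":
    for opt in options_for("dense"):
        print(f"{opt.gpu}: {opt.vm_series} via {opt.assignment}")
```

A lookup like this makes it easy to see that both dense-configuration GPUs target different VM families and use different assignment models.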
The following table is a summary of our hardware acceleration capabilities.
An engineered approach
Following our stringent engineered approach, Dell Technologies goes far beyond considering GPUs as just additional hardware components in the Dell EMC Integrated System for Microsoft Azure Stack Hub portfolio. We apply our pedigree as leaders in appliance-based solutions to the entire lifecycle of all our scale unit configurations. The dense and all-flash scale unit configurations with integrated GPUs are designed to follow best practices and use cases specifically with Azure-based workloads, rather than workloads running on traditional virtualization platforms. Dell Technologies is also committed to ensuring a simplified experience for initial deployment, patch and update, support, and streamlined operations and monitoring for these new configurations.
Additional considerations
There are a couple of additional details worth mentioning about our new Azure Stack Hub dense and all-flash scale unit configurations with hardware acceleration:
- The use of the GPU-backed N-Series VMs in Azure Stack Hub for compute-intensive AI and ML workloads is still in preview. Dell Technologies is very interested in speaking with customers about their use cases and workloads supported by this configuration. Please contact us at mhc.preview@dell.com to speak with one of our engineering technologists.
- The Dell EMC Integrated System for Microsoft Azure Stack Hub configurations with GPUs can be delivered fully racked and cabled in our Dell EMC rack. Customers can also elect to have the scale unit components re-racked and cabled in their own existing cabinets with the assistance of Dell Technologies Services.
Resources for further study
- At the time of publishing this blog post, only the NCv3 and NVv4 VMs are available in the Azure Stack Hub marketplace. The NCas_v4 currently is not visible in the portal. Please proceed to the Azure Stack Hub User Documentation for more information on these VM sizes.
- Customers may want to explore the Train Machine Learning (ML) model at the edge design pattern in the Azure Hybrid Documentation. This may prove to be a good starting point for putting this technology to work for their organization.
- Customers considering running AI and ML workloads on Dell EMC Integrated System for Microsoft Azure Stack Hub can also greatly benefit from storage-as-a-service with Dell EMC PowerScale. PowerScale can help enable faster training and validation of AI models, improve model accuracy, drive higher GPU utilization, and increase data science productivity. Visit Artificial Intelligence with Dell EMC PowerScale for more information.
Related Blog Posts

New Frontiers—Dell EMC PowerEdge R750xa Server with NVIDIA A100 GPUs
Tue, 01 Jun 2021 20:18:04 -0000
Dell Technologies has released the new PowerEdge R750xa server, a GPU workload-based platform that is designed to support artificial intelligence, machine learning, and high-performance computing solutions. The dual socket/2U platform supports 3rd Gen Intel Xeon processors (code named Ice Lake). It supports up to 40 cores per processor, has eight memory channels per CPU, and up to 32 DDR4 DIMMs at 3200 MT/s DIMM speed. This server can accommodate up to four double-width PCIe GPUs that are located in the front left and the front right of the server.
Compared with the previous generation PowerEdge C4140 and PowerEdge R740 GPU platform options, the new PowerEdge R750xa server supports larger storage capacity, provides more flexible GPU offerings, and improves thermal performance.
Figure 1 PowerEdge R750xa server
The NVIDIA A100 GPUs are built on the NVIDIA Ampere architecture to enable double precision workloads. This blog evaluates the new PowerEdge R750xa server and compares its performance with the previous generation PowerEdge C4140 server.
The following table shows the specifications for the NVIDIA GPU that is discussed in this blog and compares the performance improvement from the previous generation.
Table 1 NVIDIA GPU specifications
GPU name | A100 (PCIe) | V100 (PCIe) | Improvement |
GPU architecture | Ampere | Volta | - |
GPU memory | 40 GB | 32 GB | 25% |
GPU memory bandwidth | 1555 GB/s | 900 GB/s | 73% |
Peak FP64 | 9.7 TFLOPS | 7 TFLOPS | 39% |
Peak FP64 Tensor Core | 19.5 TFLOPS | N/A | - |
Peak FP32 | 19.5 TFLOPS | 14 TFLOPS | 39% |
Peak FP32 Tensor Core | 156 TFLOPS / 312 TFLOPS* | N/A | - |
Peak Mixed Precision FP16 ops/ FP32 Accumulate | 312 TFLOPS / 624 TFLOPS* | 125 TFLOPS | 5x |
GPU base clock | 765 MHz | 1230 MHz | - |
Peak INT8 | 624 TOPS / 1,248 TOPS* | N/A | - |
GPU Boost clock | 1410 MHz | 1380 MHz | 2.1% |
NVLink speed | 600 GB/s | N/A | - |
Maximum power consumption | 250 W | 250 W | No change |
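The improvement figures for the bandwidth and peak FLOPS rows can be recomputed from the raw specifications. The following is a minimal sketch; the function and variable names are illustrative, and the input values are copied from the A100 and V100 columns of Table 1.

```python
def improvement_pct(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# (A100 PCIe, V100 PCIe) values copied from Table 1.
SPECS = {
    "GPU memory bandwidth (GB/s)": (1555, 900),
    "Peak FP64 (TFLOPS)": (9.7, 7),
    "Peak FP32 (TFLOPS)": (19.5, 14),
}


def improvements() -> dict:
    """Improvement for each row, rounded to the nearest whole percent."""
    return {
        name: round(improvement_pct(a100, v100))
        for name, (a100, v100) in SPECS.items()
    }


if __name__ == "__main__":
    for name, pct in improvements().items():
        print(f"{name}: {pct}%")
```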
Test bed and applications
This blog quantifies the performance improvement of the GPUs with the new PowerEdge GPU platform.
All results presented in this blog come from a single-node PowerEdge R750xa server in the Dell HPC & AI Innovation Lab. This section describes the test bed and the applications that were evaluated as part of the study. The following table provides test environment details:
Table 2 Server configuration
Component | Test Bed 1 | Test Bed 2 |
Server | Dell PowerEdge R750xa | Dell PowerEdge C4140 configuration M |
Processor | Intel Xeon 8380 | Intel Xeon 6248 |
Memory | 32 x 16 GB @ 3200MT/s | 16 x 16 GB @ 2933MT/s |
Operating system | Red Hat Enterprise Linux 8.3 | Red Hat Enterprise Linux 8.3 |
GPU | 4 x NVIDIA A100-PCIe-40 GB GPU | 4 x NVIDIA V100-PCIe-32 GB GPU |
The following table provides information about the applications and benchmarks used:
Table 3 Benchmark and application details
Application | Domain | Version | Benchmark dataset |
High-Performance Linpack | Floating point compute-intensive system benchmark | xhpl_cuda-11.0-dyn_mkl-static_ompi-4.0.4_gcc4.8.5_7-23-20 | Problem size is more than 95% of GPU memory |
HPCG | Sparse matrix calculations | xhpcg-3.1_cuda_11_ompi-3.1 | 512 * 512 * 288 |
GROMACS | Molecular dynamics application | 2020 | Ligno Cellulose, Water 1536, Water 3072 |
LAMMPS | Molecular dynamics application | 29 October 2020 release | Lennard Jones |
LAMMPS
Large-Scale Atomic/Molecular Massively Parallel simulator (LAMMPS) is distributed by Sandia National Labs and the US Department of Energy. LAMMPS is open-source code that has different accelerated models for performance on CPUs and GPUs. For our test, we compiled the binary using the KOKKOS package, which runs efficiently on GPUs.
Figure 2 LAMMPS Performance on PowerEdge R750xa and PowerEdge C4140 servers
In a single-GPU comparison, this application runs 2.4 times faster on the newer generation GPU. Overall performance from a single server doubled with the PowerEdge R750xa server and NVIDIA A100 GPUs.
GROMACS
GROMACS is a free and open-source parallel molecular dynamics package designed for simulations of biochemical molecules such as proteins, lipids, and nucleic acids. It is used by a wide variety of researchers, particularly for biomolecular and chemistry simulations. GROMACS supports all the usual algorithms expected from modern molecular dynamics implementation. It is open-source software with the latest versions available under the GNU Lesser General Public License (LGPL).
Figure 3 GROMACS performance on PowerEdge C4140 and R750xa servers
In a single-GPU comparison, this application runs approximately 1.5 times faster across the datasets on the newer generation GPU. Overall performance from a single server also improved 1.5 times with a PowerEdge R750xa server and NVIDIA A100 GPUs.
High-Performance Linpack
High-Performance Linpack (HPL) needs no introduction in the HPC arena. It is a widely used standard benchmark in the industry.
Figure 4 HPL Performance on the PowerEdge R750xa server with A100 GPU and PowerEdge C4140 server with V100 GPU
Figure 5 Power use of the HPL running on NVIDIA GPUs
From Figure 4 and Figure 5, the following results were observed:
- Performance: For the same GPU count, the NVIDIA A100 GPU demonstrates twice the performance of the NVIDIA V100 GPU. Larger memory, higher double-precision FLOPS, and a newer architecture contribute to the improvement for the NVIDIA A100 GPU.
- Scalability: The PowerEdge R750xa server with four NVIDIA A100-PCIe-40 GB GPUs delivers 3.6 times higher HPL performance compared to one NVIDIA A100-PCIe-40 GB GPU. The NVIDIA A100 GPUs scale well inside the PowerEdge R750xa server for the HPL benchmark.
- Higher Rpeak: The HPL code on NVIDIA A100 GPUs uses the new double-precision Tensor Cores. The theoretical peak for each GPU is therefore 19.5 TFLOPS, as opposed to 9.7 TFLOPS.
- Power: Figure 5 shows power consumption of a complete HPL run with the PowerEdge R750xa server using four A100-PCIe GPUs. This result was measured with iDRAC commands, and the peak power consumption was observed as 2022 W. Based on our previous observations, we know that the PowerEdge C4140 server consumes approximately 1800 W of power.
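The scalability, Rpeak, and power observations above reduce to simple arithmetic. The sketch below works through them using only the figures quoted in the list; the function names are illustrative, and the aggregate Rpeak counts the GPUs only (any CPU contribution is ignored, which is an assumption of this example).

```python
def scaling_efficiency(speedup: float, gpu_count: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup / gpu_count


def aggregate_rpeak_tflops(per_gpu_tflops: float, gpu_count: int) -> float:
    """GPU-only theoretical peak for the server; CPU FLOPS are not counted."""
    return per_gpu_tflops * gpu_count


def relative_increase_pct(new: float, old: float) -> float:
    """Percentage increase of `new` over `old`."""
    return (new / old - 1.0) * 100.0


if __name__ == "__main__":
    # 3.6x HPL speedup on four A100 GPUs relative to one A100 GPU.
    print(f"Scaling efficiency: {scaling_efficiency(3.6, 4):.0%}")
    # Per-GPU FP64 peak with and without Tensor Cores.
    print(f"Rpeak (FP64 Tensor Core): {aggregate_rpeak_tflops(19.5, 4):.1f} TFLOPS")
    print(f"Rpeak (FP64): {aggregate_rpeak_tflops(9.7, 4):.1f} TFLOPS")
    # Peak power: 2022 W (R750xa) versus approximately 1800 W (C4140).
    print(f"Power increase: {relative_increase_pct(2022, 1800):.1f}%")
```

Read together, these numbers suggest that the four-GPU configuration retains most of its ideal scaling while drawing only modestly more power than the previous generation server.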
HPCG
Figure 6 Scaling GPU performance data for HPCG Benchmark
As discussed in other blogs, High Performance Conjugate Gradient (HPCG) is another standard benchmark that tests the data access patterns of sparse matrix calculations. From the graph, we see that performance scales well on this benchmark, resulting in a 1.6 times performance improvement over the previous generation PowerEdge C4140 server with an NVIDIA V100 GPU.
The 73 percent improvement in memory bandwidth of the NVIDIA A100 GPU over the NVIDIA V100 GPU contributes to the performance improvement.
Conclusion
In this blog, we introduced the latest generation PowerEdge R750xa platform and discussed the performance improvement over the previous generation PowerEdge C4140 server. The PowerEdge R750xa server is a good option for customers looking for an Intel Xeon scalable CPU-based platform powered with NVIDIA GPUs.
With the newer generation PowerEdge R750xa server and NVIDIA A100 GPUs, the applications discussed in this blog show significant performance improvement.
Next steps
In future blogs, we plan to evaluate NVLink bridge support, which is another important feature of the PowerEdge R750xa server and NVIDIA A100 GPUs.

Running containerized applications on Microsoft Azure's hybrid ecosystem - Introduction
Mon, 17 Aug 2020 18:48:17 -0000
Introduction
A vast array of services and tooling has evolved in support of microservices and container-based application development patterns. One indispensable asset in the technology value stream found in most of these patterns is Kubernetes (K8s). Technology professionals like K8s because it has become the de facto standard for container orchestration. Business leaders like it for its potential to help disrupt their chosen marketplace. However, deploying and maintaining a Kubernetes cluster and its complementary technologies can be a daunting task to the uninitiated.
Enter Microsoft Azure’s portfolio of services, tools, and documented guidance for developing and maintaining containerized applications. Microsoft continues to invest heavily in simplifying this application modernization journey without sacrificing features and functionality. The differentiators of the Microsoft approach are two-fold. First, the applications can be hosted wherever the business requirements dictate – i.e. public cloud, on-premises, or spanning both. More importantly, there is a single control plane, Azure Resource Manager (ARM), for managing and governing these highly distributed applications.
In this blog series, we share the results of hands-on testing in the Dell Technologies labs with container-related services that span both Public Azure and on-premises with Azure Stack Hub. Azure Stack Hub provides a discrete instance of ARM, which allows us to leverage a consistent control plane even in environments with no connectivity to the Internet. It might be helpful to review articles rationalizing the myriad of announcements made at Microsoft Ignite 2019 about Microsoft’s hybrid approach from industry experts like Kenny Lowe, Thomas Maurer, and Mary Branscombe before delving into the hands-on activities in this blog.
Services available in Public Azure
Azure Kubernetes Service (AKS) is a fully managed platform service hosted in Public Azure. AKS makes it simple to define, deploy, debug, and upgrade even the most complex Kubernetes applications. With AKS, organizations can accelerate past the effort of deploying and maintaining the clusters to leveraging the clusters as target platforms for their CI/CD pipelines. DevOps professionals only need to concern themselves with the management and maintenance of the K8s agent nodes and leave the management of the master nodes to Public Azure.
AKS is just one of Public Azure’s container-related services. Azure Monitor, Azure Active Directory, and Kubernetes role-based access controls (RBAC) provide the critical governance needed to successfully operate AKS. Serverless Kubernetes using Azure Container Instances (ACI) can add compute capacity without any concern about the underlying infrastructure. In fact, ACI can be used to elastically burst from AKS clusters when workload demand spikes. Azure Container Registry (ACR) delivers a fully managed private registry for storing, securing, and replicating container images and artifacts. This is perfect for organizations that do not want to store container images in publicly available registries like Docker Hub.
Leveraging the hybrid approach
Microsoft is working diligently to deliver the fully managed AKS resource provider to Azure Stack Hub. The first step in this journey is to use AKS engine to bootstrap K8s clusters on Azure Stack Hub. AKS engine provides a command-line tool that helps you create, upgrade, scale, and maintain clusters. Customers interested in running production-grade and fully supported self-managed K8s clusters on Azure Stack Hub will want to use AKS engine for deployment and not the Kubernetes Cluster (preview) marketplace gallery item. This marketplace item is only for demonstration and POC purposes.
AKS engine can also upgrade and scale the K8s cluster it deployed on Azure Stack Hub. However, unlike the fully managed AKS in Public Azure, the master nodes and the agent nodes need to be maintained by the Azure Stack Hub operator. In other words, this is not a fully managed solution today. The same warning applies to the self-hosted Docker Container Registry that can be deployed to an on-premises Azure Stack Hub via a QuickStart template. Unlike ACR in Public Azure, Azure Stack Hub operators must consider backup and recovery of the images. They would also need to deploy new versions of the QuickStart template as they become available to upgrade the OS or the container registry itself.
If no requirements prohibit the sending of monitoring data to Public Azure and the proper connectivity exists, Azure Monitor for containers can be leveraged for feature-rich monitoring of the K8s clusters deployed on Azure Stack Hub with AKS engine. In addition, Azure Arc for Data Services can be leveraged to run containerized images of Azure SQL Managed Instances or Azure PostgreSQL Hyperscale on this same K8s cluster. The Azure Monitor and Azure Arc for Data Services options would not be available in submarine scenarios where there would be no connectivity to Azure whatsoever. In the disconnected scenario, the customer would have to determine how best to monitor and run data services on their K8s cluster independent of Public Azure.
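The connected and disconnected cases described above can be summarized as a small decision helper. This is a sketch for illustration only: the function name, parameter names, and service labels are hypothetical, and the logic encodes nothing beyond the conditions stated in the preceding paragraph.

```python
from typing import List


def hybrid_services(connected_to_azure: bool, monitoring_data_may_leave: bool) -> List[str]:
    """List the options the text describes for a K8s cluster on Azure Stack Hub.

    connected_to_azure: the proper connectivity to Public Azure exists.
    monitoring_data_may_leave: no requirement prohibits sending monitoring
        data to Public Azure.
    """
    # A self-managed cluster deployed with AKS engine is the starting point
    # in both the connected and the disconnected scenario.
    services = ["K8s cluster deployed with AKS engine"]
    if connected_to_azure:
        services.append("Azure Arc for Data Services")
        if monitoring_data_may_leave:
            services.append("Azure Monitor for containers")
    return services


if __name__ == "__main__":
    print(hybrid_services(connected_to_azure=True, monitoring_data_may_leave=True))
    print(hybrid_services(connected_to_azure=False, monitoring_data_may_leave=True))
```

In the fully disconnected case the helper returns only the cluster itself, which mirrors the point that monitoring and data services must then be handled independently of Public Azure.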
Here is a summary of the articles in this blog post series:
Part 1: Running containerized applications on Microsoft Azure’s hybrid ecosystem – Provides an overview of the concepts covered in the blog series.
Part 2: Deploy K8s Cluster into an Azure Stack Hub user subscription – Set up an AKS engine client VM, deploy a cluster using AKS engine, and onboard the cluster to Azure Monitor for Containers.
Part 3: Deploy a self-hosted Docker Container Registry – Use one of the Azure Stack Hub QuickStart templates to set up a container registry and push images to this registry. Then, pull these images from the registry into the K8s cluster deployed with AKS engine in Part 2.