Blogs

Short articles about artificial intelligence solutions and related technology trends



Can AI Shape Cellular Network Operations?

Raja Neogi

Tue, 01 Dec 2020 17:55:41 -0000


Mobile network operators (MNOs) are in the process of overlaying their conventional macro cellular networks with shorter-range cells such as outdoor pico-cells. This substantially increases network complexity, which makes OPEX planning and management challenging. Artificial intelligence (AI) offers the potential for MNOs to operate their networks in a more cost-efficient manner. Even though AI deployment has its challenges, most agree such deployment will ease emerging network, model, and algorithm complexity.

Advancements in error coding and communication design have brought point-to-point link performance close to the Shannon limit. This approach proved effective for designing the fourth generation (4G) long-term evolution (LTE)-Advanced air interface, which consists of multiple parallel point-to-point links. However, 5G air interfaces are more complicated because of their complex network topologies and coordination schemes and their vastly diverse end-user applications, so deriving any performance optimum is computationally infeasible. AI, however, can tame this network complexity while delivering competitive performance.

Cellular networks have been designed with the goal of approximating end-to-end system behavior using simple modeling approaches that are amenable to clean mathematical analysis. For example, practical systems use digital pre-distortion to linearize the end-to-end model, for which information theory provides a simple closed-form capacity expression. However, with non-linearities in the wireless channel (e.g., mm-Wave) or device components (e.g., power amplifier), it’s difficult to analytically model such behaviors.
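The simple closed-form capacity expression mentioned above is the Shannon capacity of a linear channel with additive white Gaussian noise, C = B log2(1 + SNR). As a quick illustration (the bandwidth and SNR values below are arbitrary examples, not figures from this blog):

```python
import math

def shannon_capacity_bps(bandwidth_hz: float, snr_linear: float) -> float:
    """AWGN channel capacity C = B * log2(1 + SNR), in bits per second."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)

# Example: a 20 MHz channel at 20 dB SNR (illustrative values only).
snr_db = 20.0
snr_linear = 10 ** (snr_db / 10)
print(f"{shannon_capacity_bps(20e6, snr_linear) / 1e6:.1f} Mbit/s")
```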

In contrast, AI-based detection strategies can easily model such non-linearities. There are also cases in cellular networks where the optimal algorithms are well characterized but too complex to implement in practice. For example, for a point-to-point multiple-input-multiple-output (MIMO) link operating with an M-ary quadrature amplitude modulation (QAM) constellation and K spatial streams, or for signal reconstruction in compressive spectrum sensing, the optimum solutions are extremely complex. In practice, most MIMO systems employ linear receivers, such as the linear minimum mean squared error (MMSE) receiver, which is known to be sub-optimal yet easy to implement. AI can offer an attractive performance-complexity trade-off: a deep-learning-based MIMO receiver can outperform linear receivers in a variety of scenarios while retaining low complexity.
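To make the performance-complexity trade-off concrete, the following is a minimal sketch of the linear MMSE receiver referenced above, which computes x_hat = (H^H H + sigma^2 I)^-1 H^H y. The link dimensions, constellation, and noise level are arbitrary illustrative choices; a learned detector would replace this closed-form filter with a trained network.

```python
import numpy as np

def mmse_detect(H: np.ndarray, y: np.ndarray, noise_var: float) -> np.ndarray:
    """Linear MMSE estimate of the transmitted symbol vector x from y = Hx + n."""
    K = H.shape[1]                       # number of spatial streams
    G = np.linalg.inv(H.conj().T @ H + noise_var * np.eye(K)) @ H.conj().T
    return G @ y

# Illustrative 4x4 MIMO link with QPSK symbols (values chosen arbitrarily).
rng = np.random.default_rng(0)
K = 4
H = (rng.standard_normal((K, K)) + 1j * rng.standard_normal((K, K))) / np.sqrt(2)
x = (rng.choice([-1, 1], K) + 1j * rng.choice([-1, 1], K)) / np.sqrt(2)
noise_var = 0.01
y = H @ x + np.sqrt(noise_var / 2) * (rng.standard_normal(K) + 1j * rng.standard_normal(K))
print(np.round(mmse_detect(H, y, noise_var), 2))
```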

Deep learning can be used to devise computationally efficient approaches for physical (PHY) layer communication receivers. Supervised learning can be applied to MIMO symbol detection and channel decoding, potentially delivering superior performance; recurrent neural network (RNN)-based detection can be used for MIMO orthogonal frequency division multiplexing (OFDM) systems; convolutional neural network (CNN)-based supervised learning techniques can deliver channel estimation; unsupervised learning approaches, such as self-organizing maps, can be used for automatic fault detection and root cause analysis; and deep reinforcement learning (DRL) can be used for designing spectrum access, scheduling radio resources, and cell sectorization. An AI-managed edge or data center can consider diverse network parameters and KPIs to optimize the on-off operation of servers while ensuring uninterrupted service for clients. Leveraging historical data collected by data center servers, it is possible to learn emerging service-usage patterns.
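As a hedged illustration of the supervised-learning approach described above (not the specific architecture of any production receiver), a symbol detector can be framed as a small classifier that maps received samples to constellation points. The layer sizes and the placeholder training data below are arbitrary.

```python
import torch
import torch.nn as nn

class SymbolDetector(nn.Module):
    """Toy fully connected detector: maps a received sample (real, imag) to one of M QAM symbols."""
    def __init__(self, num_symbols: int = 16, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_symbols),   # logits over the constellation
        )

    def forward(self, rx: torch.Tensor) -> torch.Tensor:
        return self.net(rx)

# Training-loop skeleton: (rx, label) pairs would come from simulated or measured channels.
model = SymbolDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
rx_batch = torch.randn(32, 2)            # placeholder received samples
labels = torch.randint(0, 16, (32,))     # placeholder ground-truth symbol indices
loss = loss_fn(model(rx_batch), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```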

Standards bodies like the Third Generation Partnership Project (3GPP) have defined Network Data Analytics Function (NWDAF) specifications for data collection and analytics in automated cellular networks (3GPP TR 23.791 specification). By leaving AI model development to implementation, 3GPP provides adequate flexibility for network vendors to deploy AI-enabled use cases. The inbound interfaces ingest data from various sources such as operation, administration, and maintenance (OAM), network function (NF), application function (AF), and data repositories; the outbound interfaces relay the algorithmic decisions to the NF and AF blocks, respectively. 

In addition to 3GPP, MNOs (AT&T, China Mobile, Deutsche Telekom, NTT DOCOMO, and Orange) established the O-RAN Alliance (https://www.o-ran.org/) with the intent to automate network functions and reduce operating expenses. The O-RAN architecture, which is shown in the following figure, includes an AI-enabled RAN intelligent controller (RIC) for both non-real time (non-RT) and near-real time (near-RT), multi-radio access technology protocol stacks. 

Figure: O-RAN Architecture (source: O-RAN Alliance)

The non-RT functions include service and policy management, higher-layer procedure optimization, and model training for the near-RT RAN functionality. The near-RT RIC is compatible with legacy radio resource management and enhances challenging operational functions such as seamless handover control, Quality of Service (QoS) management, and connectivity management with AI. The O-RAN alliance has set up two work groups standardizing the A1 interface (between non-RT RIC and near-RT RIC) and E2 interface (between near-RT RIC and digital unit [DU] stack). 

Even though AI shows great promise for cellular networks, significant challenges remain:

  • From a PHY and MAC layer perspective, training a cellular AI model using over-the-air feedback to update layer weights based on the back-propagation algorithm is expensive in terms of uplink control overhead. 
  • Separation of information across network protocol layers makes it difficult to obtain labeled training data. For example, training an AI model residing within a base-station scheduler might be challenging if it requires access to application-layer information.
  • It is important for cellular networks to be able to predict the worst-case behavior. This isn’t always easy for non-linear AI building blocks. 
  • Cellular networks and wireless standards have been designed based on theoretical analysis, channel measurements, and human intuition. This approach allows domain experts to run computer simulations to validate communication system building blocks. AI tools remain black boxes. It is still challenging to develop analytical models to test correctness and explain behaviors in a simple manner.  
  • If a communication task is performed using an AI model, it is often unclear whether the dataset used for training the model is general enough to capture the distribution of inputs as encountered in reality. For example, if a neural network-based symbol detector is trained under one modulation and coding scheme (MCS), it is unclear how the system would perform for a different MCS level. This is important because if the MCS is changing adaptively due to mobility and channel fading, there has to be a way of predicting system behavior.
  • Interoperability is crucial in today's software-defined everything (SDE). Inconsistency among AI-based modules from different vendors can potentially degrade overall network performance. For example, some actions (e.g., setting a handover threshold) taken by an AI-based module from one vendor could counteract the actions taken by another network module (which may or may not be AI-based) from a second vendor. This could lead to unwanted handovers between the original base station (BS) and the neighboring BS, causing increased signaling overhead.

 In summary, MNOs agree that:

  • Training needs to be distributed as more complex scenarios arise.
  • More tools explaining AI decision making are essential.
  • More tools are needed to compare AI model output to theoretical performance bounds.
  • AI models need to adapt based on surrounding contextual information.
  • AI deployment should first focus on longer-timescale models until model decision making becomes indistinguishable from that of domain experts.
  • Fail-safe wrappers around models should limit impact of cascading errors.

AI can revitalize wireless communications. There are challenges to overcome, but, done right, there is opportunity to deliver massive-scale autonomics in cellular networks that support ultra-reliable low-latency communications, enhanced mobile broadband, and massive machine-to-machine communications.


Quantifying Performance of Dell EMC PowerEdge R7525 Servers with NVIDIA A100 GPUs for Deep Learning Inference

Rakshith Vasudev Frank Han Dharmesh Patel

Tue, 17 Nov 2020 18:30:15 -0000


The Dell EMC PowerEdge R7525 server provides exceptional MLPerf Inference v0.7 Results, which indicate that:

  • Dell Technologies holds the #1 spot in performance per GPU with the NVIDIA A100-PCIe GPU on the DLRM-99 Server scenario
  • Dell Technologies holds the #1 spot in performance per GPU with the NVIDIA A100-PCIe on the DLRM-99.9 Server scenario
  • Dell Technologies holds the #1 spot in performance per GPU with the NVIDIA A100-PCIe on the ResNet-50 Server scenario

Summary

In this blog, we provide the performance numbers of our recently released Dell EMC PowerEdge R7525 server with two NVIDIA A100 GPUs on all the results of the MLPerf Inference v0.7 benchmark. Our results indicate that the PowerEdge R7525 server is an excellent choice for inference workloads. It delivers optimal performance for different tasks that are in the MLPerf Inference v0.7 benchmark. These tasks include image classification, object detection, medical image segmentation, speech to text, language processing, and recommendation. 

The PowerEdge R7525 server is a two-socket, 2U rack server that is designed to run workloads using flexible I/O and network configurations. The PowerEdge R7525 server features the 2nd Gen AMD EPYC processor, supports up to 32 DIMMs, has PCI Express (PCIe) Gen 4.0-enabled expansion slots, and provides a choice of network interface technologies to cover networking options. 

The following figure shows the front view of the PowerEdge R7525 server:

Figure 1. Dell EMC PowerEdge R7525 server

The PowerEdge R7525 server is designed to handle demanding workloads and AI applications such as AI training for different kinds of models and inference for different deployment scenarios. The PowerEdge R7525 server supports various accelerators such as NVIDIA T4, NVIDIA V100S, NVIDIA RTX, and NVIDIA A100 GPUs. The following sections compare the performance of NVIDIA A100 GPUs with NVIDIA T4 and NVIDIA RTX GPUs using MLPerf Inference v0.7 as a benchmark.

The following table provides details of the PowerEdge R7525 server configuration and software environment for MLPerf Inference v0.7:

Component | Description
Processor | AMD EPYC 7502 32-Core Processor
Memory | 512 GB (32 GB 3200 MT/s * 16)
Local disk | 2x 1.8 TB SSD (No RAID)
Operating system | CentOS Linux release 8.1
GPU | NVIDIA A100-PCIe-40G, T4-16G, and RTX8000
CUDA Driver | 450.51.05
CUDA Toolkit | 11.0
Other CUDA-related libraries | TensorRT 7.2, CUDA 11.0, cuDNN 8.0.2, cuBLAS 11.2.0, libjemalloc2, cub 1.8.0, tensorrt-laboratory mlperf branch
Other software stack | Docker 19.03.12, Python 3.6.8, GCC 5.5.0, ONNX 1.3.0, TensorFlow 1.13.1, PyTorch 1.1.0, torchvision 0.3.0, PyCUDA 2019.1, SacreBLEU 1.3.3, simplejson, OpenCV 4.1.1
System profiles | Performance

For more information about how to run the benchmark, see Running the MLPerf Inference v0.7 Benchmark on Dell EMC Systems.

MLPerf Inference v0.7 performance results

The MLPerf inference benchmark measures how fast a system can perform machine learning (ML) inference using a trained model in various deployment scenarios. The following results represent the Offline and Server scenarios of the MLPerf Inference benchmark. For more information about different scenarios, models, datasets, accuracy targets, and latency constraints in MLPerf Inference v0.7, see Deep Learning Performance with MLPerf Inference v0.7 Benchmark.

In the MLPerf inference evaluation framework, the LoadGen load generator sends inference queries to the system under test, in our case, the PowerEdge R7525 server with various GPU configurations. The system under test uses a backend (for example, TensorRT, TensorFlow, or PyTorch) to perform inferencing and sends the results back to LoadGen.

MLPerf has identified four different scenarios that enable representative testing of a wide variety of inference platforms and use cases. In this blog, we discuss the Offline and Server scenario performance. The main differences between these scenarios are based on how the queries are sent and received:

  • Offline—One query with all samples is sent to the system under test. The system under test can send the results back once or multiple times in any order. The performance metric is samples per second.
  • Server—Queries are sent to the system under test following a Poisson distribution (to model real-world random events). One query has one sample. The performance metric is queries per second (QPS) within latency bound.

Note: The performance metrics for both the Offline and Server scenarios represent the throughput of the system.
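To illustrate how the two scenarios differ, the following is a simplified sketch, not the MLPerf LoadGen implementation: Offline throughput is simply samples divided by elapsed time, whereas Server throughput counts queries answered within the latency bound when arrivals follow a Poisson process. The arrival rate, per-query service time, and latency bound below are arbitrary.

```python
import random

def simulate_server_scenario(target_qps: float, service_time_s: float,
                             latency_bound_s: float, num_queries: int = 10000,
                             seed: int = 0) -> float:
    """Fraction of queries answered within the latency bound for Poisson arrivals
    hitting a single-queue server (a toy stand-in for the MLPerf Server scenario)."""
    rng = random.Random(seed)
    now, server_free_at, within_bound = 0.0, 0.0, 0
    for _ in range(num_queries):
        now += rng.expovariate(target_qps)      # Poisson arrivals: exponential gaps
        start = max(now, server_free_at)        # wait if the accelerator is busy
        server_free_at = start + service_time_s
        if server_free_at - now <= latency_bound_s:
            within_bound += 1
    return within_bound / num_queries

# Example: 1000 QPS offered load, 0.8 ms per query, 15 ms bound (illustrative numbers).
print(f"{simulate_server_scenario(1000, 0.0008, 0.015):.3f} of queries met the bound")
```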

In all the benchmarks, two NVIDIA A100 GPUs outperform eight NVIDIA T4 GPUs and three NVIDIA RTX8000 GPUs for the following models:

  • ResNet-50 image classification model
  • SSD-ResNet34 object detection model
  • RNN-T speech recognition model
  • BERT language processing model
  • DLRM recommender model
  • 3D U-Net medical image segmentation model

The following graphs show PowerEdge R7525 server performance with two NVIDIA A100 GPUs, eight NVIDIA T4 GPUs, and three NVIDIA RTX8000 GPUs with 99% accuracy target benchmarks and 99.9% accuracy targets for applicable benchmarks:

  • 99% accuracy (default accuracy) target benchmarks: ResNet-50, SSD-Resnet34, and RNN-T
  • 99% and 99.9% accuracy (high accuracy) target benchmarks: DLRM, BERT, and 3D-Unet

99% accuracy target benchmarks

ResNet-50

The following figure shows results for the ResNet-50 model:

Figure 2. ResNet-50 Offline and Server inference performance

From the graph, we can derive per-GPU values by dividing the system throughput (across all GPUs) by the number of GPUs, because the results scale linearly with GPU count.
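A minimal sketch of that normalization follows; the throughput numbers and configuration names are placeholders, not the published MLPerf results.

```python
def per_gpu_throughput(system_throughput: float, num_gpus: int) -> float:
    """Per-GPU throughput, assuming near-linear scaling across GPUs in the system."""
    return system_throughput / num_gpus

# Hypothetical example: compare two configurations on a per-GPU basis.
configs = {"2x A100": (60000.0, 2), "8x T4": (48000.0, 8)}   # (system samples/s, GPU count)
for name, (throughput, gpus) in configs.items():
    print(f"{name}: {per_gpu_throughput(throughput, gpus):,.0f} samples/s per GPU")
```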

SSD-Resnet34

The following figure shows the results for the SSD-Resnet34 model:

Figure 3. SSD-Resnet34 Offline and Server inference performance

RNN-T

The following figure shows the results for the RNN-T model:

 

Figure 4. RNN-T Offline and Server inference performance

99.9% accuracy target benchmarks

DLRM

The following figures show the results for the DLRM model with 99% and 99.9% accuracy:


Figure 5. DLRM Offline and Server Scenario inference performance – 99% and 99.9% accuracy                                   

For the DLRM recommender and 3D U-Net medical image segmentation (see Figure 7) models, the 99% and 99.9% accuracy targets deliver the same throughput: the 99.9% accuracy benchmark satisfies its stricter accuracy constraint without losing any throughput relative to the 99% benchmark.

BERT

The following figures show the results for the BERT model with 99%  and 99.9% accuracy:

Figure 6. BERT Offline and Server inference performance – 99% and 99.9% accuracy 

For the BERT language processing model, two NVIDIA A100 GPUs outperform eight NVIDIA T4 GPUs and three NVIDIA RTX8000 GPUs. However, the performance of three NVIDIA RTX8000 GPUs is a little better than that of eight NVIDIA T4 GPUs.

3D U-Net

For the 3D-Unet medical image segmentation model, only the Offline scenario benchmark is available.

The following figure shows the results for the 3D U-Net model Offline scenario:

 

Figure 7. 3D U-Net Offline inference performance 

Because only the Offline scenario benchmark is available for the 3D U-Net medical image segmentation model, the preceding graph shows Offline scenario results only.

The following table compares the throughput between two NVIDIA A100 GPUs, eight NVIDIA T4 GPUs, and three NVIDIA RTX8000 GPUs with 99% accuracy target benchmarks and 99.9% accuracy targets:

Model | Scenario | Accuracy | 2x A100 GPUs vs 8x T4 GPUs | 2x A100 GPUs vs 3x RTX8000 GPUs
ResNet-50 | Offline | 99% | 5.21x | 2.10x
ResNet-50 | Server | 99% | 4.68x | 1.89x
SSD-Resnet34 | Offline | 99% | 6.00x | 2.35x
SSD-Resnet34 | Server | 99% | 5.99x | 2.21x
RNN-T | Offline | 99% | 5.55x | 2.14x
RNN-T | Server | 99% | 6.71x | 2.43x
DLRM | Offline | 99% | 6.55x | 2.52x
DLRM | Server | 99% | 5.92x | 2.47x
DLRM | Offline | 99.9% | 6.55x | 2.52x
DLRM | Server | 99.9% | 5.92x | 2.47x
BERT | Offline | 99% | 6.26x | 2.31x
BERT | Server | 99% | 6.80x | 2.72x
BERT | Offline | 99.9% | 7.04x | 2.22x
BERT | Server | 99.9% | 6.84x | 2.20x
3D U-Net | Offline | 99% | 5.05x | 2.06x
3D U-Net | Offline | 99.9% | 5.05x | 2.06x

Conclusion

With support for NVIDIA A100, NVIDIA T4, or NVIDIA RTX8000 GPUs, the Dell EMC PowerEdge R7525 server is an exceptional choice for workloads that involve deep learning inference. The higher throughput that we observed with NVIDIA A100 GPUs translates into performance gains and faster business value for inference applications.

The Dell EMC PowerEdge R7525 server with two NVIDIA A100 GPUs delivers optimal performance for various inference workloads, whether in a batch inference setting such as the Offline scenario or an online inference setting such as the Server scenario.

Next steps

In future blogs, we will discuss sizing the system (server and GPU configurations) correctly based on the type of workload (area and task).

Deep Learning Training Performance on Dell EMC PowerEdge R7525 Servers with NVIDIA A100 GPUs

Frank Han Dharmesh Patel

Wed, 11 Nov 2020 16:22:42 -0000


 

Overview

The Dell EMC PowerEdge R7525 server, which was recently released, supports NVIDIA A100 Tensor Core GPUs. It is a two-socket, 2U rack-based server that is designed to run complex workloads using highly scalable memory, I/O capacity, and network options. The system is based on the 2nd Gen AMD EPYC processor (up to 64 cores), has up to 32 DIMMs, and has PCI Express (PCIe) 4.0-enabled expansion slots. The server supports SATA, SAS, and NVMe drives and up to three double-wide 300 W or six single-wide 75 W accelerators.

The following figure shows the front view of the server:

Figure 1: Dell EMC PowerEdge R7525 server

This blog focuses on the deep learning training performance of a single PowerEdge R7525 server with two NVIDIA A100-PCIe GPUs. The results of using two NVIDIA V100S GPUs in the same PowerEdge R7525 system are presented as reference data. We also present results from the cuBLAS GEMM test and the ResNet-50 model from the MLPerf Training v0.7 benchmark.

The following table provides the configuration details of the PowerEdge R7525 system under test:

Component | Description
Processor | AMD EPYC 7502 32-core processor
Memory | 512 GB (32 GB 3200 MT/s * 16)
Local disk | 2 x 1.8 TB SSD (No RAID)
Operating system | RedHat Enterprise Linux Server 8.2
GPU | 2 x NVIDIA V100S-PCIe-32G or 2 x NVIDIA A100-PCIe-40G
CUDA driver | 450.51.05
CUDA toolkit | 11.0
Processor Settings > Logical Processors | Disabled
System profiles | Performance

CUDA Basic Linear Algebra 

The CUDA Basic Linear Algebra (cuBLAS) library is the CUDA version of standard basic linear algebra subroutines, part of CUDA-X. NVIDIA provides the cublasMatmulBench binary, which can be used to test the performance of general matrix multiplication (GEMM) on a single GPU. The results of this test reflect the performance of an ideal application that only runs matrix multiplication in the form of the peak TFLOPS that the GPU can deliver. Although GEMM benchmark results might not represent real-world application performance, it is still a good benchmark to demonstrate the performance capability of different GPUs.
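cublasMatmulBench is NVIDIA's own binary. As a rough approximation of the same idea, the sketch below times a large matrix multiplication in PyTorch on a GPU and converts it to TFLOPS; the matrix size and iteration count are arbitrary, a CUDA-capable GPU is assumed, and TF32 is toggled through PyTorch's standard switch rather than through cuBLAS directly.

```python
import time
import torch

def measure_gemm_tflops(n: int = 8192, iters: int = 20, dtype=torch.float32) -> float:
    """Time an n x n GEMM on the GPU and return achieved TFLOPS (2*n^3 FLOPs per matmul)."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return 2 * n**3 * iters / elapsed / 1e12

torch.backends.cuda.matmul.allow_tf32 = False    # pure FP32 GEMM (SGEMM-like)
print(f"FP32: {measure_gemm_tflops(dtype=torch.float32):.1f} TFLOPS")
torch.backends.cuda.matmul.allow_tf32 = True     # allow TF32 Tensor Core math on A100-class GPUs
print(f"TF32: {measure_gemm_tflops(dtype=torch.float32):.1f} TFLOPS")
print(f"FP16: {measure_gemm_tflops(dtype=torch.float16):.1f} TFLOPS")
```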

Precision formats such as FP64 and FP32 are important to HPC workloads; precision formats such as INT8 and FP16 are important for deep learning inference. We plan to discuss these observed performances in our upcoming HPC and inference blogs.

Because FP16, FP32, and TF32 precision formats are imperative to deep learning training performance, the blog focuses on these formats.

The following figure shows the results that we observed:

Figure 2: cuBLAS GEMM performance on the PowerEdge R7525 server with NVIDIA V100S-PCIe-32G and NVIDIA A100-PCIe-40G GPUs

The results include:

  • For FP16, the HGEMM TFLOPS of the NVIDIA A100 GPU is 2.27 times higher than that of the NVIDIA V100S GPU.
  • For FP32, the SGEMM TFLOPS of the NVIDIA A100 GPU is 1.3 times higher than that of the NVIDIA V100S GPU.
  • For TF32, performance improvement is expected without code changes for deep learning applications on the new NVIDIA A100 GPUs, because math operations run on NVIDIA A100 Tensor Cores with the new TF32 precision format. Although TF32 reduces precision by a small margin, it preserves the range of FP32 and strikes an excellent balance between speed and accuracy. Matrix multiplication gained a sizable boost, from 13.4 TFLOPS (FP32 on the NVIDIA V100S GPU) to 86.5 TFLOPS (TF32 on the NVIDIA A100 GPU).

MLPerf Training v0.7 ResNet-50

MLPerf is a benchmarking suite that measures the performance of machine learning (ML) workloads. The MLPerf Training benchmark suite measures how fast a system can train ML models.

The following figure shows the performance results of the ResNet-50 under the MLPerf Training v0.7 benchmark:

Figure 3: MLPerf Training v0.7 ResNet-50 performance on the PowerEdge R7525 server with NVIDIA V100S-PCIe-32G and NVIDIA A100-PCIe-40G GPUs

The metric for the ResNet-50 training is the minutes that the system under test spends to train the ImageNet dataset to achieve 75.9 percent accuracy. Both runs using two NVIDIA A100 GPUs and two NVIDIA V100S GPUs converged at the 40th epoch. The NVIDIA A100 run took 166 minutes to converge, which is 1.8 times faster than the NVIDIA V100S run. Regarding throughput, two NVIDIA A100 GPUs can process 5240 images per second, which is also 1.8 times faster than the two NVIDIA V100S GPUs.
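As a rough cross-check of the reported numbers, the throughput can be estimated from the training time and epoch count, assuming the standard ImageNet training split of 1,281,167 images (an assumption about the dataset size; the estimate ignores warmup and evaluation overhead, so it lands slightly below the measured figure).

```python
# Rough cross-check: images processed per second over the whole run.
imagenet_images = 1_281_167   # assumed size of the ImageNet training split
epochs = 40
train_minutes = 166

throughput = imagenet_images * epochs / (train_minutes * 60)
print(f"~{throughput:,.0f} images/second")   # in the same ballpark as the ~5,240 images/s reported
```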

Conclusion

The Dell EMC PowerEdge R7525 server with two NVIDIA A100-PCIe GPUs demonstrates optimal performance for deep learning training workloads, and the NVIDIA A100 GPU delivers a substantial performance improvement over the NVIDIA V100S GPU.

To evaluate deep learning and HPC workload and application performance with the PowerEdge R7525 server powered by NVIDIA GPUs, contact the HPC & AI Innovation Lab.

Next steps

We plan to provide performance studies on:

  • Three NVIDIA A100 GPUs in a PowerEdge R7525 server
  • Results of other deep learning models in the MLPerf Training v0.7 benchmark
  • Training scalability results on multiple PowerEdge R7525 servers





Supercharge Inference Performance at the Edge using the Dell EMC PowerEdge XE2420

Liz Raymond Trevor Cockrell Ramesh Radhakrishnan

Wed, 04 Nov 2020 18:52:42 -0000


Deployment of compute at the Edge enables the real-time insights that inform competitive decision making. Application data is increasingly coming from outside the core data center (“the Edge”) and harnessing all that information requires compute capabilities outside the core data center. It is estimated that 75% of enterprise-generated data will be created and processed outside of a traditional data center or cloud by 2025.[1]

This blog demonstrates that the Dell EMC PowerEdge XE2420, a high-performance Edge server, performs AI inference operations efficiently by leveraging up to four NVIDIA T4 GPUs in an edge-friendly, short-depth chassis. The XE2420 with NVIDIA T4 GPUs can classify images at 25,141 images/second, matching the performance of conventional 2U rack servers, and this parity holds across the full range of benchmarks.

XE2420 Features and Capabilities

The Dell EMC PowerEdge XE2420 is a 16” (400mm) deep, high-performance server that is purpose-built for the Edge. The XE2420 has features that provide dense compute, simplified management and robust security for harsh edge environments.

Built for performance: Powerful 2U, two-socket performance with the flexibility to add up to four accelerators per server and a maximum local storage of 132TB.

Designed for harsh edge environments: Tested to Network Equipment-Building System (NEBS) guidelines, with extended operating temperature tolerance of 5˚-45˚C without sacrificing performance, and an optional filtered bezel to guard against dust. Short depth for edge convenience and lower latency.

Integrated security and consistent management: Robust, integrated security with cyber-resilient architecture, and the new iDRAC9 with Datacenter management experience. Front accessible and cold-aisle serviceable for easy maintenance.

The XE2420 allows for flexibility in the type of GPUs you use, in order to accelerate a wide variety of workloads including high-performance computing, deep learning training and inference, machine learning, data analytics, and graphics. It can support up to 2x NVIDIA V100/S PCIe, 2x NVIDIA RTX6000, or up to 4x NVIDIA T4.

Edge Inferencing with the T4 GPU

The NVIDIA T4 is optimized for mainstream computing environments and uniquely suited for Edge inferencing. Packaged in an energy-efficient, 70-watt, small PCIe form factor, it features multi-precision Turing Tensor Cores and new RT Cores to deliver power-efficient inference performance. Combined with accelerated, containerized software stacks from NGC, the XE2420 with NVIDIA T4 GPUs is a powerful solution for deploying AI applications at scale on the edge.

 

Fig 1: NVIDIA T4 Specifications

Fig 2: Dell EMC PowerEdge XE2420 w/ 4x T4 & 2x 2.5” SSDs





Dell EMC PowerEdge XE2420 MLPerf Inference Tested Configuration


Component | Configuration
Processors | 2x Intel Xeon Gold 6252 CPU @ 2.10GHz
Storage | 1x 2.5" SATA 250GB, 1x 2.5" NVMe 4TB
Memory | 12x 32GB 2666MT/s DDR4 DIMM
GPUs | 4x NVIDIA T4
OS | Ubuntu 18.04.5
Software | TensorRT 7.2, CUDA 11.0 Update 1, cuDNN 8.0.2, DALI 0.25.0
Hardware Settings | ECC off

 

Inference Use Cases at the Edge

As computing extends further to the Edge, higher performance and lower latency become vastly more important in order to decrease response time and reduce bandwidth consumption. One suite of diverse and useful inference workload benchmarks is MLPerf. MLPerf Inference demonstrates the performance of a system under a variety of deployment scenarios and aims to provide a test suite that enables balanced comparisons between competing systems along with reliable, reproducible results.

The MLPerf Inference v0.7 suite covers a variety of workloads, including image classification, object detection, natural language processing, speech-to-text, recommendation, and medical image segmentation. Specific scenarios covered include "offline", which represents batch-processing applications such as mass image classification of existing photos, and "server", which represents an application where query arrival is random and latency is important; an example of the server scenario is essentially any consumer-facing website where a consumer is waiting for an answer to a question. Many of these workloads are directly relevant to Telco and Retail customers, as well as other Edge use cases where AI is becoming more prevalent.

Measuring Inference Performance using MLPerf

We demonstrate inference performance for the XE2420 + 4x NVIDIA T4 accelerators across the 6 benchmarks of MLPerf Inference v0.7 in order to showcase the versatility of the system. The inference benchmarking was performed on:

  • Offline and Server scenarios at 99% accuracy for ResNet50 (image classification), RNNT (speech-to-text), and SSD-ResNet34 (object detection)
  • Offline and Server scenarios at 99% and 99.9% accuracy for BERT (NLP) and DLRM (recommendation)
  • Offline scenario at 99% and 99.9% accuracy for 3D-Unet (medical image segmentation)

These results and the corresponding code are available at the MLPerf website.[1]

Key Highlights

The XE2420 is a compact server that supports four 70 W T4 GPUs efficiently, reducing overall power consumption without sacrificing performance. This density and efficient power draw translate into increased performance per dollar, especially on a per-GPU basis.

Additionally, the PowerEdge XE2420 is part of the NVIDIA NGC-Ready and NGC-Ready for Edge validation programs[i]. At Dell, we understand that performance is critical, but customers are not willing to compromise quality and reliability to achieve maximum performance. Customers can confidently deploy inference and other software applications from the NVIDIA NGC catalog knowing that the PowerEdge XE2420 meets the requirements set by NVIDIA to deploy customer workloads on-premises or at the Edge.

In the chart above, per-GPU (that is, 1x T4) performance numbers are derived from the total system performance on MLPerf Inference v0.7 and the total number of accelerators in the system. The XE2420 with T4 GPUs shows per-card performance equivalent to other Dell EMC T4 offerings across the range of MLPerf tests.

When placed side by side with the Dell EMC PowerEdge R740 (4x T4) and R7515 (4x T4), the XE2420 (4x T4) showed performance on par across all MLPerf submissions. This demonstrates that operating capabilities and performance were not sacrificed to achieve the smaller depth and form-factor.

Conclusion: Better Density and Flexibility at the Edge without sacrificing Performance

MLPerf inference benchmark results clearly demonstrate that the XE2420 is a high-performance, half-depth server ideal for edge computing use cases and applications. The capability to pack four NVIDIA T4 GPUs enables it to perform AI inference operations on par with traditional mainstream 2U rack servers that are deployed in core data centers. The compact design gives customers new, powerful capabilities at the edge to do more, faster, without extra components. The XE2420 is capable of true versatility at the edge, demonstrating performance not only for common retail workloads but also for the full range of tested workloads. Dell EMC offers a complete portfolio of trusted technology solutions to aggregate, analyze, and curate data from the edge to the core to the cloud, and the XE2420 is a key component of this portfolio to meet your compute needs at the Edge.

XE2420 MLPerf Inference v0.7 Full Results

The raw results from the MLPerf Inference v0.7 published benchmarks are displayed below, where the metric is throughput (items per second).

Benchmark | Scenario | Accuracy % | Result (items/second)
ResNet50 | Offline | 99 | 25,141
ResNet50 | Server | 99 | 21,002
RNNT | Offline | 99 | 6,239
RNNT | Server | 99 | 4,584
SSD-ResNet-34 | Offline | 99 | 568
SSD-ResNet-34 | Server | 99 | 509
BERT | Offline | 99 | 1,796
BERT | Offline | 99.9 | 839
BERT | Server | 99 | 1,608
BERT | Server | 99.9 | 759
DLRM | Offline | 99 | 140,217
DLRM | Offline | 99.9 | 140,217
DLRM | Server | 99 | 126,513
DLRM | Server | 99.9 | 126,513
3D-Unet | Offline | 99 | 30.32
3D-Unet | Offline | 99.9 | 30.32

 

 


Dell EMC Servers Shine in MLPerf Inference v0.7 Benchmark

Ramesh Radhakrishnan Frank Han Liz Raymond

Mon, 02 Nov 2020 15:16:55 -0000


As software applications and systems using Artificial Intelligence (AI) gain mainstream adoption across all industries, inference workloads for ongoing operations are becoming a larger resource consumer in the datacenter. MLPerf is a benchmark suite that is used to evaluate the performance profiles of systems for both training and inference AI tasks. In this blog we take a closer look at the recent results submitted by Dell EMC and how our various servers performed in the datacenter category.  

The reason we do this type of work is to help customers understand which server platform makes the most sense for their use case. Dell Technologies wants to make the choice easier and reduce work for our customers so that they don't waste precious resources; we want customers to spend their time on the use case itself, accelerating time to value for the business.

Dell Technologies has a total of 210 submissions for MLPerf Inference v0.7 in the Datacenter category using various server platforms and accelerators. Why so many? Because many customers have never run AI in their environment, the potential use cases span every industry, and in-house expertise is limited. Customers have told us they need help identifying the correct server platform based on their workloads.

We’re proud of what we’ve done, but it’s still all about helping customers adopt AI. By sharing our expertise and providing guidance on infrastructure for AI, we help customers become successful and get their use case into production.

 

MLPerf Benchmarks

MLPerf was founded in 2018 with a goal of accelerating improvements in ML system performance. Formed as a collaboration of companies and researchers from leading educational institutions, MLPerf leverages open source code, public state-of-the-art Machine Learning (ML) models and publicly available datasets contributed to the ML community. The MLPerf suites include MLPerf Training and MLPerf Inference.

MLPerf Training measures how fast a system can train machine learning models. Training benchmarks have been defined for image classification, lightweight and heavy-weight object detection, language translation, natural language processing, recommendation and reinforcement learning.  Each benchmark includes specifications for input datasets, quality targets and reference implementation models. The first round of training submissions was published on the MLPerf website in December 2018 with results submitted by Google, Intel and NVIDIA.

The MLPerf Inference suite measures how quickly a trained neural network can evaluate new data and perform forecasting or classification for a wide range of applications. MLPerf Inference includes image classification, object detection, and machine translation with specific models, datasets, quality targets, server latency constraints, and multi-stream latency constraints. MLPerf validated and published results for MLPerf Inference v0.7 on October 21, 2020. In this blog, we take a closer look at the MLPerf Inference v0.7 results submitted by Dell EMC and how the servers performed in the datacenter category.

A summary of the key highlights of the Dell EMC results is shown in Table 1. These are derived from the submitted results in the MLPerf datacenter closed category. Ranking and claims are based on Dell analysis of published MLPerf data. Per-accelerator performance is calculated by dividing the primary metric of total performance by the number of accelerators reported.

Rank | Category | Specifics | Use Cases
#1 | Performance per Accelerator for NVIDIA A100-PCIe | PowerEdge R7525 | Medical Imaging, Image Classification
#1 | Performance per Accelerator with NVIDIA T4 GPUs | PowerEdge XE2420, PowerEdge R7525, DSS8440 | Medical Imaging, NLP, Image Classification, Speech Recognition, Object Detection, Recommendation
#1 | Highest inference results with Quadro RTX6000 and RTX8000 | PowerEdge R7525, DSS 8440 | Medical Imaging, NLP, Image Classification, Speech Recognition, Object Detection, Recommendation

Dell EMC had a total of 210 submissions for MLPerf Inference v0.7 in the Datacenter category using various Dell EMC platforms and accelerators from leading vendors. We achieved impressive results when compared to other submissions in the same class of platforms.

MLPerf Inference Categories and Dell EMC Achievements

A benchmark suite is made up of tasks or models from vision, speech, language and commerce use cases.   MLPerf Inference measures how fast a system can perform ML inference by using a load generator against the System Under Test (SUT) where the trained model is deployed.

There are three types of benchmark tests defined in MLPerf inference v0.7, one for datacenter systems, one for edge systems and one for mobile systems.  MLPerf then has four different scenarios to enable representative testing of a wide variety of inference platforms and use cases:

  • Single stream
  • Multiple stream
  • Server  
  • Offline

The single-stream and multiple-stream scenarios are only used for edge and mobile inference benchmarks. The data center benchmark type targets systems designed for data center deployments and requires evaluation of both the Server and Offline scenarios. The metric used in the Datacenter category is inference operations per second. In the Server scenario, the MLPerf load generator sends new queries to the SUT according to a Poisson distribution, which is representative of online AI applications such as translation and image tagging that have variable arrival patterns based on end-user traffic. Offline represents AI tasks done through batch processing, such as photo categorization, where all the data is readily available ahead of time.

Dell EMC published multiple results in the datacenter systems category. Details on the models, datasets, and scenarios submitted for the different datacenter benchmarks are shown in Table 2:

Area | Task | Model | Dataset | Required Scenarios
Vision | Image classification | ResNet50-v1.5 | Imagenet (224x224) | Server, Offline
Vision | Object detection (large) | SSD-ResNet34 | COCO (1200x1200) | Server, Offline
Vision | Medical image segmentation | 3D UNet | BraTS 2019 (224x224x160) | Offline
Speech | Speech-to-text | RNNT | Librispeech dev-clean (samples < 15 seconds) | Server, Offline
Language | Language processing | BERT | SQuAD v1.1 (max_seq_len=384) | Server, Offline
Commerce | Recommendation | DLRM | 1TB Click Logs | Server, Offline

Next we highlight some of the key performance achievements for the broad range of solutions available in the Dell EMC portfolio for inference use cases and deployments.

1. Dell EMC is #1 in total number of datacenter submissions in the closed division, including bare-metal submissions using different GPUs, Xeon CPUs, and a Xilinx FPGA, plus virtualized submissions on VMware vSphere

The closed division enables head to head comparisons and consists of server platforms used from the Edge to private or public clouds. The Dell Technologies engineering team submitted 210 out of the total 509 results. 

We remain committed to helping customers deploy inference workloads as efficiently as possible, meeting their unique requirements of power, density, budget, and performance. The wide range of servers submitted by Dell Technologies is a testament to this commitment:

  • The only vendor with submissions for a variety of inference solutions – leveraging GPU, FPGA and CPUs for the datacenter/private cloud and Edge
  • Unique in the industry by submitting results across a multitude of servers that range from mainstream servers (R740/R7525) to dense GPU-optimized servers supporting up to 16 NVIDIA GPUs (DSS8440).
  • Demonstrated that customers that demand real-time inferencing at the telco or retail Edge can deploy up to 4 GPUs in a short depth NEBS-compliant PowerEdge XE2420 server.
  • Demonstrated efficient Inference performance using the 2nd Gen Intel Xeon Scalable platform on the PowerEdge R640 and PowerEdge R740 platforms for customers wanting to run inference on Intel CPUs.
  • Dell submissions using Xilinx U280 in PowerEdge R740 demonstrated that customers wanting low latency inference can leverage FPGA solutions.

2. Dell EMC is #1 in performance “per Accelerator” with PowerEdge R7525 and A100-PCIe for multiple benchmarks

The Dell EMC PowerEdge R7525 was purpose-built for superior accelerated performance. The MLPerf results validated leading performance across many scenarios including:

 

Performance Rank "Per Accelerator" | Inference Throughput | Dell EMC System
#1 ResNet50 (server) | 30,005 | PowerEdge R7525 (3x NVIDIA A100-PCIE)
#1 3D-Unet-99 (offline) | 39 | PowerEdge R7525 (3x NVIDIA A100-PCIE)
#1 3D-Unet-99.9 (offline) | 39 | PowerEdge R7525 (3x NVIDIA A100-PCIE)
#2 DLRM-99 (server) | 192,543 | PowerEdge R7525 (2x NVIDIA A100-PCIE)
#2 DLRM-99.9 (server) | 192,543 | PowerEdge R7525 (2x NVIDIA A100-PCIE)


3. Dell achieved the highest inference scores with NVIDIA Quadro RTX GPUs using the DSS 8440 and R7525

Dell Technologies engineering understands that training isn't the only AI workload; using the right technology for each job is far more cost effective. Dell is the only vendor to submit results using NVIDIA RTX6000 and RTX8000 GPUs, which provide up to 48 GB of GPU memory for large inference models. The DSS 8440 with 10 Quadro RTX GPUs achieved:

  • The #2 and #3 highest system performance on RNN-T for the Offline scenario.

 

The #1 ranking was delivered using 8x NVIDIA A100 SXM4 GPUs, which were introduced in May 2020 and form a powerful system for customers to train state-of-the-art deep learning models. Dell Technologies took the #2 and #3 spots with the DSS 8440 equipped with 10x NVIDIA RTX8000 and the DSS 8440 with 10x NVIDIA RTX6000, providing better power and cost efficiency for inference workloads compared with other submissions.

4. Dell EMC claims #1 spots for NVIDIA T4 platforms with DSS 8440, XE2420 and PowerEdge R7525 

Dell Technologies provides system options for customers to deploy inference workloads that match their unique requirements. Today’s accelerators vary significantly in price, performance and power consumption. For example, the NVIDIA T4 is a low profile, lower power GPU option that is widely deployed for inference due to its superior power efficiency and economic value for that use case.

The MLPerf results corroborate the exemplary inference performance of the NVIDIA T4 on Dell EMC servers. The T4 leads in performance per GPU among the 20 servers that submitted scores using NVIDIA T4 GPUs:

  • #1 in performance per GPU on 3d-unet-99 and 3d-unet-99.9 Offline scenario
  • #1 in performance per GPU on Bert-99 Server and Bert-99.9 Offline scenario
  • #1, #2 and #3 in performance with T4 on DLRM-99 & DLRM-99.9 Server scenario
  • #1 in performance per GPU on ResNet50 Offline scenario
  • #1 in performance per GPU on RNN-T Server and Offline scenario
  • #1 in performance per GPU on SSD-large Offline scenario

The best scores achieved for the NVIDIA T4 "per GPU" rankings above, and the respective platforms, are shown in the following table:

Benchmark | Offline rank | Offline throughput | Offline system | Server rank | Server throughput | Server system
3d-unet-99 | #1 | 7.6 | XE2420 | n/a | n/a | n/a
3d-unet-99.9 | #1 | 7.6 | XE2420 | n/a | n/a | n/a
bert-99 | #3 | 449 | XE2420 | #1 | 402 | XE2420
bert-99.9 | #1 | 213 | DSS 8440 | #2 | 190 | XE2420
dlrm-99 | #2 | 35,054 | XE2420 | #1 | 32,507 | R7525
dlrm-99.9 | #2 | 35,054 | XE2420 | #1 | 32,507 | R7525
resnet | #1 | 6,285 | XE2420 | #4 | 5,663 | DSS 8440
rnnt | #1 | 1,560 | XE2420 | #1 | 1,146 | XE2420
ssd-large | #1 | 142 | XE2420 | #2 | 131 | DSS 8440

5. Dell is the only vendor to submit results on virtualized infrastructure with vCPUs and NVIDIA virtual GPUs (vGPU) on VMware vSphere

Customers interested in deploying inference workloads for AI on virtualized infrastructure can leverage Dell servers with VMware software to reap the benefits of virtualization.

To demonstrate efficient virtualized performance on 2nd Generation Intel Xeon Scalable processors, Dell EMC and VMware submitted results using vSphere and OpenVINO on the PowerEdge R640.

  • Virtualization overhead for a single VM was observed to be minimal, and testing showed that deploying multiple VMs on a single server achieved ~26% better throughput compared with a bare-metal environment.

Dell EMC has published guidance on virtualizing GPUs using DirectPath I/O, NVIDIA Virtual Compute Server (vCS), and more. Dell EMC and VMware used NVIDIA vCS virtualization software in vSphere for MLPerf Inference benchmarks on virtualized NVIDIA T4 GPUs:

  • VMware vSphere using NVIDIA vCS delivers near bare metal performance for MLPerf Inference v0.7 benchmarks. The inference throughput (queries processed per second) increases linearly as the number of vGPUs attached to the VM increases.

Blogs covering these virtualized tests in greater detail are published at VMware’s performance Blog site.

This finishes our coverage of the top 5 highlights out of the 200+ submissions done by Dell EMC in the datacenter division. Next we discuss other aspects of the GPU optimized portfolio that are important for customers – quality and support.


Dell has the highest number of NVIDIA GPU submissions using NVIDIA NGC Ready systems

Dell GPU enabled platforms are part of NVIDIA NGC-Ready and NGC-Ready for Edge validation programs. At Dell, we understand that performance is critical, but customers are not willing to compromise quality and reliability to achieve maximum performance. Customers can confidently deploy inference and other software applications from the NVIDIA NGC catalog knowing that the Dell systems meet all the requirements set by NVIDIA to deploy customer workloads on-premises or at the Edge.

NVIDIA NGC validated configs that were used for this round of MLPerf submissions are:

  • Dell EMC PowerEdge XE2420 (4x T4)
  • Dell EMC DSS 8440 (10x Quadro RTX 8000) 
  • Dell EMC DSS 8440 (12x T4) 
  • Dell EMC DSS 8440 (16x T4)
  • Dell EMC DSS 8440 (8x Quadro RTX 8000)
  • Dell EMC PowerEdge R740 (4x T4)
  • Dell EMC PowerEdge R7515 (4x T4)
  • Dell EMC PowerEdge R7525 (2x A100-PCIE)
  • Dell EMC PowerEdge R7525 (3x Quadro RTX 8000)

Dell EMC portfolio can address customers inference needs from on-premises to the edge

In this blog, we highlighted the results submitted by Dell EMC to demonstrate how our various servers performed in the datacenter category. The Dell EMC server portfolio provides many options for customers wanting to deploy AI inference in their datacenters or at the edge. We also offer a wide range of accelerator options, including multiple GPU and FPGA models, for running inference either on bare metal or on virtualized infrastructure to meet specific application and deployment requirements.

Finally, we list the performance for a subset of the server platforms that we see most commonly used by customers today for running inference workloads. These rankings highlight that each platform can support the wide range of inference use cases showcased in the MLPerf suite.

 1. The Dell EMC PowerEdge XE2420 with 4x NVIDIA T4 GPUs: Ranked between #1 and #3 in 14 out of 16 benchmark categories when compared with other T4 Servers

Dell EMC PowerEdge XE2420 (4x T4) Per Accelerator Ranking*

Benchmark | Offline | Server
3d-unet-99 | #1 | n/a
3d-unet-99.9 | #1 | n/a
bert-99 | #3 | #1
bert-99.9 | #2 | #2
dlrm-99 | #1 | #3
dlrm-99.9 | #1 | #3
resnet | #1 |
rnnt | #1 | #1
ssd-large | #1 |

*out of 20 server submissions using T4


2. Dell EMC PowerEdge R7525 with 8x T4 GPUs: Ranked between #1 and #5 in 11 out of 16 benchmark categories in T4 server submission

Dell EMC PowerEdge R7525 (8x T4) Per Accelerator Ranking*

Benchmark | Offline | Server
3d-unet-99 | #4 | n/a
3d-unet-99.9 | #4 | n/a
bert-99 | #4 |
dlrm-99 | #2 | #1
dlrm-99.9 | #2 | #1
rnnt | #2 | #5
ssd-large | #5 |

*out of 20 server submissions using T4


3. The Dell EMC PowerEdge R7525 with up to 3xA100-PCIe: ranked between #3 and #10 in 15 out of 16 benchmark categories across all datacenter submissions

Dell EMC PowerEdge R7525 (2|3x A100-PCIe) Per Accelerator Ranking*

Benchmark | Offline | Server
3d-unet-99 | #4 | n/a
3d-unet-99.9 | #4 | n/a
bert-99 | #8 | #9
bert-99.9 | #7 | #8
dlrm-99 | #6 | #4
dlrm-99.9 | #6 | #4
resnet | #10 | #3
rnnt | #6 | #7
ssd-large | #10 |

*out of total submissions (53)


4. The Dell EMC DSS 8440 with 16x T4 ranked between #3 and #7 when compared against all submissions using T4

Dell EMC DSS 8440 (16x T4) Per Accelerator Ranking*

Benchmark | Offline | Server
3d-unet-99 | #4 | n/a
3d-unet-99.9 | #4 | n/a
bert-99 | #6 | #4
bert-99.9 | #7 | #5
dlrm-99 | #3 | #3
dlrm-99.9 | #3 | #3
resnet | #6 | #4
rnnt | #5 | #5
ssd-large | #7 | #5

*out of 20 server submissions using T4


5. The Dell EMC DSS 8440 with 10x RTX6000 ranked between #2 and #6 in 14 out of 16 benchmarks when compared against all submissions

Dell EMC DSS 8440 (10x Quadro RTX6000) Per Accelerator Ranking*

Benchmark | Offline | Server
3d-unet-99 | #4 | n/a
3d-unet-99.9 | #4 | n/a
bert-99 | #4 | #5
bert-99.9 | #4 | #5
dlrm-99 | |
dlrm-99.9 | |
resnet | #5 | #6
rnnt | #2 | #5
ssd-large | #5 | #6

*out of total submissions (53)

 

6. Dell EMC DSS 8440 with 10x RTX8000 ranked between #2 and #6 when compared against all submissions

Dell EMC DSS 8440 (10x Quadro RTX8000) Per Accelerator Ranking*

Benchmark | Offline | Server
3d-unet-99 | #5 | n/a
3d-unet-99.9 | #5 | n/a
bert-99 | #5 | #4
bert-99.9 | #5 | #4
dlrm-99 | #3 | #3
dlrm-99.9 | #3 | #3
resnet | #6 | #5
rnnt | #3 | #6
ssd-large | #6 | #5

*out of total submissions (53)

 

Get more information on MLPerf results at www.mlperf.org and learn more about PowerEdge servers that are optimized for AI/ML/DL at www.DellTechnologies.com/Servers.

Acknowledgements: These impressive results were made possible by the work of the following Dell EMC and partner team members - Shubham Billus, Trevor Cockrell, Bagus Hanindhito (Univ. of Texas, Austin), Uday Kurkure (VMWare), Guy Laporte, Anton Lokhmotov (Dividiti), Bhavesh Patel, Vilmara Sanchez, Rakshith Vasudev, Lan Vu (VMware) and Nicholas Wakou. We would also like to thank our partners – NVIDIA, Intel and Xilinx for their help and support in MLPerf v0.7 Inference submissions.


Deep Learning Performance with MLPerf Inference v0.7 Benchmark

Rakshith Vasudev Frank Han Dharmesh Patel

Wed, 21 Oct 2020 17:28:57 -0000


 

Summary

MLPerf is a benchmarking suite that measures the performance of Machine Learning (ML) workloads. It focuses on the most important aspects of the ML life cycle:

  • Training—The MLPerf training benchmark suite measures how fast a system can train ML models. 
  • Inference—The MLPerf inference benchmark measures how fast a system can perform ML inference by using a trained model in various deployment scenarios.

This blog outlines the MLPerf inference v0.7 data center closed results on Dell EMC PowerEdge R7525 and DSS8440 servers with NVIDIA GPUs running the MLPerf inference benchmarks. Our results show optimal inference performance for the systems and configurations on which we chose to run inference benchmarks.  

In the MLPerf inference evaluation framework, the LoadGen load generator sends inference queries to the system under test (SUT). In our case, the SUTs are carefully chosen PowerEdge R7525 and DSS8440 servers with various GPU configurations. The SUT uses a backend (for example, TensorRT, TensorFlow, or PyTorch) to perform inferencing and sends the results back to LoadGen.

MLPerf has identified four different scenarios that enable representative testing of a wide variety of inference platforms and use cases. The main differences between these scenarios are based on how the queries are sent and received:

  • Offline—One query with all samples is sent to the SUT. The SUT can send the results back once or multiple times in any order. The performance metric is samples per second. 
  • Server—The queries are sent to the SUT following a Poisson distribution (to model real-world random events). One query has one sample. The performance metric is queries per second (QPS) within latency bound.
  • Single-stream—One sample per query is sent to the SUT. The next query is not sent until the previous response is received. The performance metric is 90th percentile latency.
  • Multi-stream—A query with N samples is sent with a fixed interval. The performance metric is max N when the latency of all queries is within a latency bound.

MLPerf Inference Rules describes the detailed inference rules and latency constraints. This blog focuses only on the Offline and Server scenarios, which are designed for data center environments; the Single-stream and Multi-stream scenarios are designed for non-data-center (edge and IoT) settings.

MLPerf Inference results can be submitted under either of the following divisions:

  • Closed division—The Closed division is intended to provide an “apples to apples” comparison of hardware platforms or software frameworks. It requires using the same model and optimizer as the reference implementation.

    The Closed division requires using preprocessing, postprocessing, and a model that is equivalent to the reference or alternative implementation. It allows calibration for quantization and does not allow any retraining. MLPerf provides a reference implementation of each benchmark. The benchmark implementation must use a model that is equivalent, as defined in MLPerf Inference Rules, to the model used in the reference implementation.

     
  • Open division—The Open division is intended to foster faster models and optimizers and allows any ML approach that can reach the target quality. It allows using arbitrary preprocessing or postprocessing and model, including retraining. The benchmark implementation may use a different model to perform the same task.

To allow the apples-to-apples comparison of Dell EMC results and enable our customers and partners to repeat our results, we chose to conduct testing under the Closed division, as shown in the results in this blog.

Criteria for MLPerf Inference v0.7 benchmark result submission  

The following table describes the MLPerf benchmark expectations:

Table 1: Available benchmarks in the Closed division for MLPerf inference v0.7 with their expectations.

Area | Task | Model | Dataset | QSL size | Required quality | Required server latency constraint
Vision | Image classification | Resnet50-v1.5 | ImageNet (224x224) | 1024 | 99% of FP32 (76.46%) | 15 ms
Vision | Object detection (large) | SSD-ResNet34 | COCO (1200x1200) | 64 | 99% of FP32 (0.20 mAP) | 100 ms
Vision | Medical image segmentation | 3D UNET | BraTS 2019 (224x224x160) | 16 | 99% of FP32 and 99.9% of FP32 (0.85300 mean DICE score) | N/A
Speech | Speech-to-text | RNNT | Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) | 1000 ms
Language | Language processing | BERT | SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms
Commerce | Recommendation | DLRM | 1 TB Click Logs | 204800 | 99% of FP32 and 99.9% of FP32 (AUC=80.25%) | 30 ms

For any benchmark, it is essential for the result submission to meet all the specifications in this table. For example, if we choose the Resnet50 model, then the submission must meet the 76.46 percent target accuracy and the latency must be within 15 ms for the ImageNet dataset.
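The following is a small sketch of how such a latency constraint might be sanity-checked from measured query latencies; the percentile is an assumption (MLPerf defines a target latency percentile per benchmark), and the sample latencies below are made up.

```python
def meets_latency_constraint(latencies_ms, bound_ms: float, percentile: float = 99.0) -> bool:
    """True if the given percentile of measured latencies is within the bound."""
    ordered = sorted(latencies_ms)
    index = min(len(ordered) - 1, int(len(ordered) * percentile / 100.0))
    return ordered[index] <= bound_ms

# Hypothetical ResNet-50 Server-scenario latencies checked against the 15 ms bound.
samples = [9.5, 10.2, 11.8, 12.4, 13.0, 13.9, 14.2, 14.8, 15.6, 21.0]
print(meets_latency_constraint(samples, bound_ms=15.0))
```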

Each data center benchmark requires the scenarios in the following table:

Table 2: Tasks and corresponding required scenarios for data center benchmark suite in MLPerf inference v0.7.

Area | Task | Required scenarios
Vision | Image classification | Server, Offline
Vision | Object detection (large) | Server, Offline
Vision | Medical image segmentation | Offline
Speech | Speech-to-text | Server, Offline
Language | Language processing | Server, Offline
Commerce | Recommendation | Server, Offline

SUT configurations

We selected the following servers with different types of NVIDIA GPUs as our SUT to conduct data center inference benchmarks:

Results

The following provides the results of the MLPerf Inference v0.7 benchmark.  

For the Offline scenario, the performance metric is Offline samples per second. For the Server scenario, the performance metric is queries per second (QPS). In general, the metrics represent throughput.

The following graphs include performance metrics for the Offline and Server scenarios. A higher throughput is a better result.

 

 

Figure 1: Resnet50 v1.5 Offline and Server scenario with 99 percent accuracy target

 

Figure 2: SSD w/ Resnet34 Offline and Server scenario with 99 percent accuracy target


Figure 3: DLRM Offline and Server scenario with 99 percent accuracy target

 

 

Figure 4: DLRM Offline and Server scenario with 99.9 percent accuracy target


Figure 5: 3D-Unet using the 99 and 99.9 percent accuracy targets.

Note: The 99 and 99.9 percent accuracy targets with DLRM and 3D-Unet show the same performance because the accuracy targets were met early.

 

 

Figure 6: BERT Offline and Server scenario with 99 percent accuracy target

 

Figure 7: BERT Offline and Server scenario with 99.9 percent accuracy target.

 

Figure 8: RNN-T Offline and Server scenario with 99 percent accuracy target


Performance per GPU

For an estimate of per-GPU performance, we divided the results in the previous section by the number of GPUs in the system. We observed that performance scales linearly as we increase the number of GPUs; that is, system performance is approximately the per-card performance multiplied by the number of cards. We will provide this information in a subsequent blog post.

The following figure shows the approximate per GPU performance:

Figure 9: Approximate per card performance for the Resnet50 Offline scenario

The R7525_QuadroRTX8000x3 and DSS8440_QuadroRTX8000x10 systems both use the RTX8000 card. Therefore, performance per card for these two systems is about the same. The A100 cards yield the highest performance; the T4 cards yield the lowest performance. 

Conclusion

In this blog, we quantified MLPerf inference v0.7 performance on Dell EMC DSS8440 and PowerEdge R7525 servers with NVIDIA A100, RTX8000, and T4 GPUs, using the Resnet50, SSD w/ Resnet34, DLRM, BERT, RNN-T, and 3D-Unet benchmarks. These benchmarks span tasks from vision to recommendation. Dell EMC servers delivered top inference performance normalized to processor count among commercially available results. We found that the A100 GPU delivered the best overall performance and best performance-per-watt, while the RTX GPUs delivered the best performance-per-dollar. In power-constrained environments, T4 GPUs deliver the best performance-per-watt.

Next steps

In future blogs, we plan to describe how to:

  • Run and performance tune MLPerf inference v0.7
  • Size the system (server and GPU configurations) correctly based on the type of workload (area and task)
  • Understand per-card, per-watt, and per-dollar metrics to determine your infrastructure needs 
  • Understand MLPerf training results on recently released R7525 PowerEdge servers with NVIDIA A100 GPUs
Read Full Blog
deep learning NVIDIA PowerEdge GPU

Running the MLPerf Inference v0.7 Benchmark on Dell EMC Systems

Rakshith Vasudev Frank Han Dharmesh Patel

Tue, 03 Nov 2020 20:38:04 -0000

|

Read Time: 0 minutes

MLPerf is a benchmarking suite that measures the performance of Machine Learning (ML) workloads. It focuses on the most important aspects of the ML life cycle:

  • Training—The MLPerf training benchmark suite measures how fast a system can train ML models. 
  • Inference—The MLPerf inference benchmark measures how fast a system can perform ML inference by using a trained model in various deployment scenarios.

The MLPerf inference v0.7 suite contains the following models for benchmark:

  • Resnet50 
  • SSD-Resnet34 
  • BERT 
  • DLRM 
  • RNN-T 
  • 3D U-Net

Note: The BERT, DLRM, and 3D U-Net models have 99% (default accuracy) and 99.9% (high accuracy) targets.

This blog describes the steps to run MLPerf inference v0.7 tests on Dell Technologies servers with NVIDIA GPUs. It helps you run and reproduce the results that we observed in our HPC and AI Innovation Lab. For more details about the hardware and the software stack for different systems in the benchmark, see this GitHub repository.

Getting started

A system under test (SUT) consists of a defined set of hardware and software resources that will be measured for performance. The hardware resources may include processors, accelerators, memories, disks, and interconnect. The software resources may include an operating system, compilers, libraries, and drivers that significantly influence the running time of a benchmark. In this case, the system on which you clone the MLPerf repository and run the benchmark is the SUT.

For storage, SSD RAID or local NVMe drives are acceptable for running all the subtests without any penalty; inference does not have strict requirements for fast parallel storage. However, parallel file systems such as BeeGFS or Lustre, or the PixStor storage solution, are helpful when maintaining multiple copies of large datasets.

Clone the MLPerf repository

Follow these steps:

  1. Clone the repository to your home directory or another acceptable path:

    cd ~
    git clone https://github.com/mlperf/inference_results_v0.7.git
  2. Go to the closed/DellEMC directory:

    cd inference_results_v0.7/closed/DellEMC
  3. Create a “scratch” directory to store the models, datasets, preprocessed data, and so on:

    mkdir scratch

    This scratch directory requires at least 3 TB of space.

  4. Export the absolute path for MLPERF_SCRATCH_PATH with the scratch directory:

    export MLPERF_SCRATCH_PATH=/home/user/inference_results_v0.7/closed/DellEMC/scratch

Set up the configuration file

The closed/DellEMC/configs directory includes a config.json file for each benchmark and scenario that lists the configurations for the Dell EMC servers that were systems in the MLPerf Inference v0.7 benchmark. If necessary, modify the configs/<benchmark>/<Scenario>/config.json file to include the system that will run the benchmark.

Note: If your system is already present in the configuration file, there is no need to add another configuration. 

In the configs/<benchmark>/<Scenario>/config.json file, select a similar configuration and modify it based on the current system, matching the number and type of GPUs in your system.

For this blog, we considered our R7525 server with one A100 GPU. We chose R7525_A100x1 as the name for this new system. Because the R7525_A100x1 system is not already in the list of systems, we added the R7525_A100x1 configuration.

Because the R7525_A100x2 reference system is the most similar, we modified that configuration and picked Resnet50 Server as the example benchmark.

The following example shows the reference configuration for two GPUs for the Resnet50 Server benchmark in the configs/<benchmark>/<Scenario>/config.json file:

"R7525_A100x2": {
         "active_sms": 100,
         "config_ver": {
        },
         "deque_timeout_us": 2000,
         "gpu_batch_size": 64,
         "gpu_copy_streams": 4,
         "gpu_inference_streams": 3,
         "input_dtype": "int8",
         "input_format": "linear",
         "map_path": "data_maps/imagenet/val_map.txt",
         "precision": "int8",
         "server_target_qps": 52400,
         "tensor_path": "${PREPROCESSED_DATA_DIR}/imagenet/ResNet50/int8_linear",
         "use_cuda_thread_per_device": true,
         "use_deque_limit": true,
         "use_graphs": true
    },

This example shows the modified configuration for one GPU:

"R7525_A100x1": {
         "active_sms": 100,
         "config_ver": {
        },
         "deque_timeout_us": 2000,
         "gpu_batch_size": 64,
         "gpu_copy_streams": 4,
         "gpu_inference_streams": 3,
         "input_dtype": "int8",
         "input_format": "linear",
         "map_path": "data_maps/imagenet/val_map.txt",
         "precision": "int8",
         "server_target_qps": 26200,
         "tensor_path": "${PREPROCESSED_DATA_DIR}/imagenet/ResNet50/int8_linear",
         "use_cuda_thread_per_device": true,
         "use_deque_limit": true,
         "use_graphs": true
    },

We modified the QPS parameter (server_target_qps) to match the number of GPUs. The server_target_qps parameter is linearly scalable; therefore, QPS = number of GPUs x QPS per GPU.
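
In other words, the single-GPU value can be derived directly from the two-GPU reference configuration shown above; a quick sketch of that arithmetic:

# server_target_qps scales roughly linearly with GPU count:
# QPS = number of GPUs x QPS per GPU.
reference_qps = 52400      # server_target_qps from the R7525_A100x2 reference config
reference_gpus = 2

qps_per_gpu = reference_qps / reference_gpus   # 26200 QPS per A100
target_gpus = 1
print(int(qps_per_gpu * target_gpus))          # 26200, the value used for R7525_A100x1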

We only modified the server_target_qps parameter to get a baseline run first. You can also modify other parameters, such as gpu_batch_size, gpu_copy_streams, and so on. We will discuss these other parameters in a future blog that describes performance tuning. 

Finally, we added the modified configuration for the new R7525_A100x1 system to the configuration file at configs/resnet50/Server/config.json.

Register the new system

After you add the new system to the config.json  file, register the new system in the list of available systems. The list of available systems is in the code/common/system_list.py file.  

Note: If your system is already registered, there is no need to add it to the code/common/system_list.py file. 

To register the system, add the new system to the list of available systems in the code/common/system_list.py file, as shown in the following:

# (system_id, gpu_name_from_driver, gpu_count)
system_list = ([
     ("R740_vT4x4", "GRID T4-16Q", 4),
     ("XE2420_T4x4", "Tesla T4", 4),
     ("DSS8440_T4x12", "Tesla T4", 12),
     ("R740_T4x4", "Tesla T4", 4),
     ("R7515_T4x4", "Tesla T4", 4),
     ("DSS8440_T4x16", "Tesla T4", 16),
     ("DSS8440_QuadroRTX8000x8", "Quadro RTX 8000", 8),
     ("DSS8440_QuadroRTX6000x10", "Quadro RTX 6000", 10),
     ("DSS8440_QuadroRTX8000x10", "Quadro RTX 8000", 10),
     ("R7525_A100x2", "A100-PCIE-40GB", 2),
     ("R7525_A100x3", "A100-PCIE-40GB", 3),
     ("R7525_QuadroRTX8000x3", "Quadro RTX 8000", 3),
    ("R7525_A100x1", "A100-PCIE-40GB", 1),
])

In the preceding example, the last line under system_list is the newly added R7525_A100x1 system. It is a tuple of the form (<system name>, <GPU name>, <GPU count>). To find the GPU name from the driver, run the nvidia-smi -L command.
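
If it helps, the GPU name and count for the tuple can be pulled from the driver programmatically; the short sketch below is our own convenience script (it is not part of the MLPerf repository) and assumes nvidia-smi is on the PATH:

import subprocess

# Each line of `nvidia-smi -L` looks like:
#   GPU 0: A100-PCIE-40GB (UUID: GPU-xxxxxxxx-...)
output = subprocess.check_output(["nvidia-smi", "-L"], text=True)
gpu_lines = [line for line in output.splitlines() if line.startswith("GPU")]

gpu_name = gpu_lines[0].split(":", 1)[1].split("(")[0].strip()
gpu_count = len(gpu_lines)

# Tuple in the (system_id, gpu_name_from_driver, gpu_count) form used above;
# the system_id string is whatever name you chose for your configuration.
print(("R7525_A100x1", gpu_name, gpu_count))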

Note: Ensure that you add the system configuration for all the benchmarks that you intend to run and add the system to the system_list.py file. Otherwise, the results might be suboptimal, or the benchmark might choose the wrong system configuration or not run at all because it cannot find an appropriate configuration.

Build the Docker image and required libraries

Build the Docker image and then launch an interactive container. Then, in the interactive container, build the required libraries for inferencing.

  1. To build the Docker image, run the following command:

    make prebuild 

    ………

    Launching Docker interactive session
    docker run --gpus=all --rm -ti -w /work -v /home/user/inference_results_v0.7/closed/DellEMC:/work -v /home/user:/mnt//home/user \
            --cap-add SYS_ADMIN -e NVIDIA_MIG_CONFIG_DEVICES="all" \
            -v /etc/timezone:/etc/timezone:ro -v /etc/localtime:/etc/localtime:ro \
            --security-opt apparmor=unconfined --security-opt seccomp=unconfined \
            --name mlperf-inference-user -h mlperf-inference-userv0.7 --add-host mlperf-inference-user:127.0.0.1 \
            --user 1004:1004 --net host --device /dev/fuse --cap-add SYS_ADMIN   \
            -e MLPERF_SCRATCH_PATH="/home/user/inference_results_v0.7/closed/DellEMC/scratch" mlperf-inference:user
    (mlperf) user@mlperf-inference-user:/work$

    The Docker container is launched with all the necessary packages installed.

  2. Access the interactive terminal on the container.
  3. To build the required libraries for inferencing, run the following command inside the interactive container:

    make build 
    (mlperf) user@mlperf-inference-user:/work$ make build
      …….
    [ 92%] Built target harness_triton
    [ 96%] Linking CXX executable /work/build/bin/harness_default
    [100%] Linking CXX executable /work/build/bin/harness_rnnt
    make[4]: Leaving directory '/work/build/harness'
    [100%] Built target harness_default
    make[4]: Leaving directory '/work/build/harness'
    [100%] Built target harness_rnnt
    make[3]: Leaving directory '/work/build/harness'
    make[2]: Leaving directory '/work/build/harness'
    Finished building harness.
    make[1]: Leaving directory '/work'
    (mlperf) user@mlperf-inference-user:/work

Download and preprocess validation data and models

To run MLPerf inference v0.7, download datasets and models, and then preprocess them. MLPerf provides scripts that download the trained models. The scripts also download the dataset for benchmarks other than Resnet50, DLRM, and 3D U-Net. 

For Resnet50, DLRM, and 3D U-Net, register for an account and then download the datasets manually:

  • Resnet50—Download the ImageNet 2012 Validation set and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/imagenet/
  • DLRM—Download the Criteo Terabyte dataset and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/criteo/
  • 3D U-Net—Download the BraTS challenge data and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training

Except for the Resnet50, DLRM, and 3D U-Net datasets, run the following commands to download all the models, datasets, and then preprocess them:

$ make download_model # Downloads models and saves to $MLPERF_SCRATCH_PATH/models
$ make download_data # Downloads datasets and saves to $MLPERF_SCRATCH_PATH/data
$ make preprocess_data # Preprocess data and saves to $MLPERF_SCRATCH_PATH/preprocessed_data

Note: These commands download all the datasets, which might not be required if the objective is to run one specific benchmark. To run a specific benchmark rather than all the benchmarks, see the following sections for information about the specific benchmark.

After building the libraries and preprocessing the data, the folders containing the following are displayed:

(mlperf) user@mlperf-inference-user:/work$ tree -d -L 1

.

├── build—Logs, preprocessed data, engines, models, plugins, and so on 

├── code—Source code for all the benchmarks

├── compliance—Passed compliance checks 

├── configs—Configurations that run different benchmarks for different system setups

├── data_maps—Data maps for different benchmarks

├── docker—Docker files to support building the container

├── measurements—Measurement values for different benchmarks

├── results—Final result logs 

├── scratch—Storage for models, preprocessed data, and the dataset that is symlinked to the preceding build directory

├── scripts—Support scripts 

└── systems—Hardware and software details of systems in the benchmark

Running the benchmarks

Run any of the benchmarks that are required for your tests.

The Resnet50, SSD-Resnet34, and RNN-T benchmarks have 99% (default accuracy) targets. 

The BERT, DLRM, and 3D U-Net benchmarks have 99% (default accuracy) and 99.9% (high accuracy) targets. For information about running these benchmarks, see the Running high accuracy target benchmarks section  below.   

If you downloaded and preprocessed all the datasets (as shown in the previous section), there is no need to do so again. Skip the download and preprocessing steps in the procedures for the following benchmarks. 

NVIDIA TensorRT is the inference engine for the backend. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning applications.

Run the Resnet50 benchmark

To set up the Resnet50 dataset and model to run the inference:

  1. If you already downloaded and preprocessed the datasets, go to step 5.
  2. Download the ImageNet 2012 Validation set.
  3. Extract the images to $MLPERF_SCRATCH_PATH/data/imagenet/. 
  4. Run the following commands:

    make download_model BENCHMARKS=resnet50
    make preprocess_data BENCHMARKS=resnet50
  5. Generate the TensorRT engines:

    # generates the TRT engines with the specified config. In this case it generates engine for both Offline and Server scenario 
    
     make generate_engines RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline,Server --config_ver=default"
  6. Run the benchmark:

    # run the performance benchmark
    
    make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly" 
    make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
    
    # run the accuracy benchmark 
     
    make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
    make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"

    The following output is displayed for a “PerformanceOnly” mode:

    The following is a “VALID“ result:
    ======================= Perf harness results: =======================
    R7525_A100x1_TRT-default-Server:
         resnet50: Scheduled samples per second : 26212.91 and Result is : VALID
    ======================= Accuracy results: =======================
    R7525_A100x1_TRT-default-Server:
         resnet50: No accuracy results in PerformanceOnly mode.

Run the SSD-Resnet34 benchmark 

To set up the SSD-Resnet34 dataset and model to run the inference:

  1. If necessary, download and preprocess the dataset:

    make download_model BENCHMARKS=ssd-resnet34
    make download_data BENCHMARKS=ssd-resnet34 
    make preprocess_data BENCHMARKS=ssd-resnet34
  2. Generate the TensorRT engines:

    # generates the TRT engines with the specified config. In this case it generates engine for both Offline and Server scenario 
    
    make generate_engines RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline,Server --config_ver=default"
  3. Run the benchmark:

    # run the performance benchmark
    
    make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
    
    # run the accuracy benchmark
    
    make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
    make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"

Run the RNN-T benchmark

To set up the RNN-T dataset and model to run the inference:

  1. If necessary, download and preprocess the dataset:

    make download_model BENCHMARKS=rnnt
    make download_data BENCHMARKS=rnnt 
    make preprocess_data BENCHMARKS=rnnt
  2. Generate the TensorRT engines:

    # generates the TRT engines with the specified config. In this case it generates engine for both Offline and Server scenario
    
    make generate_engines RUN_ARGS="--benchmarks=rnnt --scenarios=Offline,Server --config_ver=default" 
  3. Run the benchmark:

    # run the performance benchmark
    
    make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Server --config_ver=default --test_mode=PerformanceOnly" 
     
    # run the accuracy benchmark 
     
    make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
    make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"

Running high accuracy target benchmarks

The BERT, DLRM, and 3D U-Net benchmarks have high accuracy targets.

Run the BERT benchmark

To set up the BERT dataset and model to run the inference:

  1. If necessary, download and preprocess the dataset:

    make download_model BENCHMARKS=bert
    make download_data BENCHMARKS=bert 
    make preprocess_data BENCHMARKS=bert
  2. Generate the TensorRT engines:

    # generates the TRT engines with the specified config. In this case it generates engine for both Offline and Server scenario and also for default and high accuracy targets.
    
    make generate_engines RUN_ARGS="--benchmarks=bert --scenarios=Offline,Server --config_ver=default,high_accuracy"
  3. Run the benchmark:

    # run the performance benchmark
    
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=high_accuracy --test_mode=PerformanceOnly" 
     
    # run the accuracy benchmark 
     
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=default --test_mode=AccuracyOnly" 
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly" 
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=high_accuracy --test_mode=AccuracyOnly"

Run the DLRM benchmark

To set up the DLRM dataset and model to run the inference:

  1. If you already downloaded and preprocessed the datasets, go to step 5.
  2. Download the Criteo Terabyte dataset.
  3. Extract the downloaded file to the $MLPERF_SCRATCH_PATH/data/criteo/ directory.
  4. Run the following commands:

    make download_model BENCHMARKS=dlrm
    make preprocess_data BENCHMARKS=dlrm
  5. Generate the TensorRT engines:

    # generates the TRT engines with the specified config. In this case it generates engine for both Offline and Server scenario and also for default and high accuracy targets.
    
    make generate_engines RUN_ARGS="--benchmarks=dlrm --scenarios=Offline,Server --config_ver=default,high_accuracy"
  6. Run the benchmark:

    # run the performance benchmark
    
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=high_accuracy --test_mode=PerformanceOnly"
    
    # run the accuracy benchmark
    
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=high_accuracy --test_mode=AccuracyOnly"

Run the 3D U-Net benchmark

Note: This benchmark only has the Offline scenario.

To set up the 3D U-Net dataset and model to run the inference:

  1. If you already downloaded and preprocessed the datasets, go to step 5.
  2. Download the BraTS challenge data.
  3. Extract the images to the $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training  directory.
  4. Run the following commands:

    make download_model BENCHMARKS=3d-unet
    make preprocess_data BENCHMARKS=3d-unet
  5. Generate the TensorRT engines:

    # generates the TRT engines with the specified config. In this case it generates engine for both Offline and Server scenario and for default and high accuracy targets.
    
    make generate_engines RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default,high_accuracy"
  6. Run the benchmark:

    # run the performance benchmark
    
    make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
    
    # run the accuracy benchmark 
     
    make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
    make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"

Limitations and Best Practices

Note the following limitations and best practices:

  • To build the engine and run the benchmark by using a single command, use the make run RUN_ARGS… shortcut. The shortcut is a valid alternative to running the make generate_engines … and make run_harness … commands separately.
  • If the Server scenario results are “INVALID”, reduce the QPS; “INVALID” results are expected if the latency constraints are not met during the run.
  • If you change the batch size, rebuild the engine.
  • Only the BERT, DLRM, and 3D U-Net benchmarks support high accuracy targets.
  • The 3D U-Net benchmark supports only the Offline scenario.

Conclusion

This blog provided the step-by-step procedures to run and reproduce MLPerf inference v0.7 results on Dell Technologies servers with NVIDIA GPUs.  

Next steps

In future blogs, we will discuss techniques to improve performance.

Read Full Blog
AI Kubernetes

Bare Metal Compared with Kubernetes

Sam Lucido

Thu, 04 Jun 2020 16:19:26 -0000

|

Read Time: 0 minutes

It has been fascinating to watch the tide of application containerization build from stateless cloud native web applications to every type of data-centric workload. These workloads include high performance computing (HPC), machine learning and deep learning (ML/DL), and now most major SQL and NoSQL databases. As an example, I recently read the following Dell Technologies knowledge base article: Bare Metal vs Kubernetes: Distributed Training with TensorFlow.

The terms bare metal and bare metal server refer to running applications directly on physical hardware, without virtualization, containerization, or cloud hosting. Bare metal is often compared with virtualization or containerization to contrast performance and manageability features. For example, contrasting an application on bare metal with the same application running in a container can provide insights into the potential performance differences between the two implementations.

Figure 1: Comparison of bare metal to containers implementations


Containers encapsulate an application with supporting binaries and libraries to run on one shared operating system. The container’s runtime engine or management applications, such as Kubernetes, manage the container. Because of the shared operating system, a container’s infrastructure is lightweight, providing more reason to understand the differences in terms of performance.

In the comparison of bare metal with Kubernetes, distributed training performance with TensorFlow was measured in terms of throughput; that is, we measured the number of images per second when training CheXNet. Five tests were run, with each test consecutively adding more GPUs across the bare metal and Kubernetes systems. The solid data points in Figure 2 show that the tests were run using 1, 2, 3, 4, and 8 GPUs.

Figure 2: Running CheXNet training on Kubernetes compared to bare metal


Figure 2 shows that the Kubernetes container configuration was similar in terms of performance to the bare metal configuration through 4 GPUs. The test through 8 GPUs shows an eight percent increase for bare metal compared with Kubernetes. However, the article that I referenced offers factors that might contribute to the delta:

  • The bare metal system takes advantage of the full bandwidth and latency of raw InfiniBand while the Kubernetes configuration uses software defined networking using flannel.
  • The Kubernetes configuration uses IP over InfiniBand, which can reduce available bandwidth.

Studies like this are useful because they provide performance insight that customers can use. I hope we see more studies that encompass other workloads. For example, a study about Oracle and SQL Server databases in containers compared with running on bare metal would be interesting. The goal would be to understand how a Kubernetes ecosystem can support a broad ecosystem of different workloads.

Hope you like the blog!

Read Full Blog
deep learning AI Spark

Deep Learning on Spark is Getting Interesting

Phil Hummel

Tue, 02 Jun 2020 18:57:09 -0000

|

Read Time: 0 minutes

The year 2012 will be remembered in history as a breakout year for data analytics. Deep learning's meteoric rise to prominence can largely be attributed to the 2012 introduction of convolutional neural networks (CNNs) for image classification on the ImageNet dataset during the Large-Scale Visual Recognition Challenge (LSVRC). [1]  It was a historic event after a very long incubation period for deep learning that started with mathematical theory work in the 1940s, 50s, and 60s.  The earlier history of neural networks and deep learning is fascinating and should not be forgotten, but it is not an overstatement to say that 2012 was the breakout year for deep learning.

Coincidentally, 2012 was also a breakout year for in-memory distributed computing.  A group of researchers from the University of California, Berkeley's AMPLab published a paper with an unusual title that changed the world of data analytics: “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing.” [2] This paper describes how the initial creators developed an efficient, general-purpose, and fault-tolerant in-memory data abstraction for sharing data in cluster applications.  The effort was motivated by the shortcomings of both MapReduce and other distributed-memory programming models for processing iterative algorithms and interactive data mining jobs.

The ongoing development of so many application libraries that leverage Spark's RDD abstraction, including GraphX for graphs and graph-parallel computation, Spark Streaming for scalable, fault-tolerant streaming applications, and MLlib for scalable machine learning, is proof that Spark achieved its original goal of being a general-purpose programming environment.  The rest of this article describes the development and integration of deep learning libraries, a now extremely useful class of iterative algorithms that Spark was designed to address.  The importance of deep learning for data analytics and artificial intelligence was just starting to emerge when Spark was created, so the combination of the two developments has been interesting to watch.

MLlib – The original machine learning library for Spark

MLlib development started not long after the AMPLab code was transferred to the Apache Software Foundation in 2013.  It is not really a deep learning library; however, it does include multilayer perceptron classifiers [3], feedforward artificial neural networks trained with backpropagation.  Fully connected neural networks were quickly abandoned after the development of more sophisticated models constructed using convolutional, recursive, and recurrent networks. 

Fully connected shallow and deep networks are making a comeback as alternatives to tree-based models for both regression and classification.  There is also a lot of current interest in various forms of autoencoders used to learn latent (hidden) compressed representations of data for dimension reduction and self-supervised classification.  MLlib, therefore, can best be characterized as a machine learning library with some limited neural network capability.

BigDL – Intel open sources a full-featured deep learning library for Spark

BigDL is a distributed deep learning library for Apache Spark.  BigDL implements distributed, data-parallel training directly on top of the functional compute model using the core Spark features of copy-on-write and coarse-grained operations.  The framework has been referenced in applications as diverse as transfer learning-based image classification, object detection and feature extraction, sequence-to-sequence prediction for precipitation nowcasting, neural collaborative filtering for recommendations, and more.  Contributors and users include a wide range of industries including Mastercard, World Bank, Cray, Talroo, University of California San Francisco (UCSF), JD, UnionPay, Telefonica, GigaSpaces. [4]

Engineers with Dell EMC and Intel recently completed a white paper demonstrating the use of deep learning development tools from the Intel Analytics Zoo [5] to build an integrated pipeline on Apache Spark ending with a deep neural network model to predict diseases from chest X-rays. [6]   Tools and examples in the Analytics Zoo give data scientists the ability to train and deploy BigDL, TensorFlow, and Keras models on Apache Spark clusters. Application developers can also use the resources from the Analytics Zoo to deploy production class intelligent applications through model extractions capable of being served in any Java, Scala, or other Java virtual machine (JVM) language. 

The researchers conclude that modern deep learning applications can be developed and deployed at scale on an existing Hadoop and Spark cluster. This approach avoids the need to move data to a different deep learning cluster and eliminates the operational complexities of provisioning and maintaining yet another distributed computing environment.  The open-source software that is described in the white paper is available from Github. [7]

H2O.ai – Sparkling Water for Spark

H2O is a fast, scalable, open-source machine learning and deep learning platform for smarter applications. Much like MLlib, the H2O algorithms cover a wide range of useful machine learning techniques but offer only fully connected MLPs for deep learning.  With H2O, enterprises like PayPal, Nielsen Catalina, Cisco, and others can use all their data without sampling to get accurate predictions faster. [8]  Dell EMC, Intel, and H2O.ai recently developed a joint reference architecture that outlines both technical considerations and sizing guidance for an on-premises enterprise AI platform. [9]

The engineers show how running H2O.ai software on optimized Dell EMC infrastructure with the latest Intel® Xeon® Scalable processors and NVMe storage enables organizations to use AI to improve customer experiences, streamline business processes, and decrease waste and fraud. Validated software included the H2O Driverless AI enterprise platform and the H2O and H2O Sparkling Water open-source software platforms. Sparkling Water is designed to be executed as a regular Spark application. It provides a way to initialize H2O services on Spark and access data stored in both Spark and H2O data structures. H2O Sparkling Water algorithms are designed to take advantage of the distributed in-memory computing of existing Spark clusters.  Results from H2O can easily be deployed using H2O low-latency pipelines or within Spark for scoring.

H2O Sparkling Water cluster performance was evaluated on three- and five-node clusters. In this mode, H2O launches through Spark workers, and Spark manages the job scheduling and communications between the nodes. Three and five Dell EMC PowerEdge R740xd Servers with Intel Xeon Gold 6248 processors were used to train XGBoost and GBM models using the mortgage data set derived from the Fannie Mae Single-Family Loan Performance data set.

Spark and GPUs

Many data scientists familiar with Spark for machine learning have been waiting for official support for GPUs.  The advantages realized from modern neural network models like the CNN entry in the 2012 LSVRC would not have been fully realized without the work of NVIDIA and others on new acceleration hardware.  NVIDIA's GPU technology, like the Volta V100, has morphed into a class of advanced, enterprise-class ML/DL accelerators that reduce training time for all types of neural network configurations, including CNNs, RNNs (recurrent neural networks), and GANs (generative adversarial networks), to mention just a few of the most popular forms.  Deep learning researchers see many advantages to building end-to-end model training “pipelines” that take advantage of the generalized distributed computing capability of Spark for everything from data cleaning and shaping through to scale-out training using integration with GPUs.


NVIDIA recently announced that it has been working with the Apache Spark open-source community to bring native GPU acceleration to the next version of the big data processing framework, Spark 3.0. [10]  The Apache Spark community is distributing a preview release of Spark 3.0 to encourage wide-scale community testing of the upcoming release.  The preview is not a stable release in terms of either the API specification or functionality.  No firm date for the general availability of Spark 3.0 has been released, but organizations exploring options for distributed deep learning with GPUs should start evaluating the proposed features and advantages of Spark 3.0.

Cloudera is also giving developers and data scientists an opportunity to test and evaluate the preview release of Spark 3.0.  The current GA version of the Cloudera Runtime includes the Apache Spark 3.0 preview 2 as part of its CDS 3 (Experimental) Powered by Apache Spark release. [11] Full Spark 3.0 preview 2 documentation, including many code samples, is available from the Apache Spark website. [12] 

What’s next

It’s been 8 years since the breakout events for deep learning and distributed computing with Spark were announced.  We have seen tremendous adoption of both deep learning and Spark for all types of analytics use cases from medical imaging to language processing to manufacturing control and beyond.  We are just now poised to see new breakthroughs in the merging of Spark and deep learning, especially with the addition of support for hardware accelerators.  IT professionals and data scientists are still too heavily burdened with the hidden technical debt overhead for managing machine learning systems. [13]  The integration of accelerated deep learning with the power of the Spark generalized distributed computing platform will give both the IT and data science communities a capable and manageable environment to develop and host end-to-end data analysis pipelines in a common framework.  

References

[1] Alom, M. Z., Taha, T. M., Yakopcic, C., Westberg, S., Sidike, P., Nasrin, M. S., ... & Asari, V. K. (2018). The history began from alexnet: A comprehensive survey on deep learning approaches. arXiv preprint arXiv:1803.01164.

[2] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., ... & Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Presented as part of the 9th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 12) (pp. 15-28).

[3] Apache Spark (June 2020) Multilayer perceptron classifier https://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier

[4] Dai, J. J., Wang, Y., Qiu, X., Ding, D., Zhang, Y., Wang, Y., ... & Wang, J. (2019, November). Bigdl: A distributed deep learning framework for big data. In Proceedings of the ACM Symposium on Cloud Computing (pp. 50-60).

[5] Intel Analytics Zoo (June 2020) https://software.intel.com/content/www/us/en/develop/topics/ai/analytics-zoo.html

[6] Chandrasekaran, Bala (Dell EMC) Yang, Yuhao (Intel) Govindan, Sajan (Intel) Abd, Mehmood (Dell EMC) A. A. R. U. D. (2019).  Deep Learning on Apache Spark and Analytics Zoo.

[7] Dell AI Engineering (June 2020)  BigDL Image Processing Examples https://github.com/dell-ai-engineering/BigDL-ImageProcessing-Examples

[8] Candel, A., Parmar, V., LeDell, E., and Arora, A. (Apr 2020). Deep Learning with H2O https://www.h2o.ai/wp-content/themes/h2o2016/images/resources/DeepLearningBooklet.pdf

[9] Reference Architectures for H2O.ai (February 2020) https://www.dellemc.com/resources/en-us/asset/white-papers/products/ready-solutions/dell-h20-architectures-pdf.pdf Dell Technologies

[10] Woodie, Alex (May 2020) Spark 3.0 to Get Native GPU Acceleration https://www.datanami.com/2020/05/14/spark-3-0-to-get-native-gpu-acceleration/ datanami

[11] CDS 3 (Experimental) Powered by Apache Spark Overview (June 2020) https://docs.cloudera.com/runtime/7.0.3/cds-3/topics/spark-spark-3-overview.html

[12] Spark Overview (June 2020) https://spark.apache.org/docs/3.0.0-preview2/

[13] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). Hidden technical debt in machine learning systems. In Advances in neural information processing systems (pp. 2503-2511).

Read Full Blog
AI deep learning HPC

Accelerating Insights with Distributed Deep Learning

Luke Wilson Ph.D. Michael Bennett

Wed, 15 Apr 2020 19:18:38 -0000

|

Read Time: 0 minutes

Originally published on Aug 6, 2018 1:17:46 PM 

Artificial intelligence (AI) is transforming the way businesses compete in today’s marketplace. Whether it’s improving business intelligence, streamlining supply chain or operational efficiencies, or creating new products, services, or capabilities for customers, AI should be a strategic component of any company’s digital transformation.

Deep neural networks have demonstrated astonishing abilities to identify objects, detect fraudulent behaviors, predict trends, recommend products, enable enhanced customer support through chatbots, convert voice to text and translate one language to another, and produce a whole host of other benefits for companies and researchers. They can categorize and summarize images, text, and audio recordings with human-level capability, but to do so they first need to be trained.

Deep learning, the process of training a neural network, can sometimes take days, weeks, or months, and effort and expertise are required to produce a neural network of sufficient quality to trust your business or research decisions on its recommendations. Most successful production systems go through many iterations of training, tuning, and testing during development. Distributed deep learning can speed up this process, reducing the total time to tune and test so that your data science team can develop the right model faster, but it requires a method for aggregating knowledge between systems.

There are several evolving methods for efficiently implementing distributed deep learning, and the way in which you distribute the training of neural networks depends on your technology environment. Whether your compute environment is container native, high performance computing (HPC), or Hadoop/Spark clusters for Big Data analytics, your time to insight can be accelerated by using distributed deep learning. In this article we are going to explain and compare systems that use a centralized or replicated parameter server approach, a peer-to-peer approach, and finally a hybrid of these two developed specifically for Hadoop distributed big data environments.

Distributed Deep Learning in Container Native Environments

Container-native environments (for example, Kubernetes, Docker Swarm, and OpenShift) have become the standard for many DevOps environments, where rapid, in-production software updates are the norm and bursts of computation may be shifted to public clouds. Most deep learning frameworks support distributed deep learning for these types of environments using a parameter server-based model that allows multiple processes to look at training data simultaneously, while aggregating knowledge into a single, central model.

The process of performing parameter server-based training starts with specifying the number of workers (processes that will look at training data) and parameter servers (processes that will handle the aggregation of error reduction information, backpropagate those adjustments, and update the workers). Additional parameter servers can act as replicas for improved load balancing.

Parameter server model for distributed deep learning

Worker processes are given a mini-batch of training data to test and evaluate, and upon completion of that mini-batch, report the differences (gradients) between produced and expected output back to the parameter server(s). The parameter server(s) will then handle the training of the network and transmitting copies of the updated model back to the workers to use in the next round.
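
The following toy, single-process sketch (plain Python with NumPy, illustrative only) captures the flow described above: workers produce gradients from their mini-batches, the parameter server averages them, applies the update, and returns the new weights to the workers.

import numpy as np

learning_rate = 0.1
weights = np.zeros(4)     # the shared model held by the parameter server

# Stand-ins for the gradients each worker computes from its mini-batch;
# in a real framework these come from a forward/backward pass over the network.
worker_gradients = [np.array([1.0, 2.0, 3.0, 4.0]),
                    np.array([2.0, 1.0, 0.0, 1.0])]

def parameter_server_update(weights, gradients):
    avg_gradient = np.mean(gradients, axis=0)
    return weights - learning_rate * avg_gradient

weights = parameter_server_update(weights, worker_gradients)
# The updated weights are then broadcast back to the workers for the next mini-batch.
print(weights)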

This model is ideal for container native environments, where parameter server processes and worker processes can be naturally separated. Orchestration systems, such as Kubernetes, allow neural network models to be trained in container native environments using multiple hardware resources to improve training time. Additionally, many deep learning frameworks support parameter server-based distributed training, such as TensorFlow, PyTorch, Caffe2, and Cognitive Toolkit.

Distributed Deep Learning in HPC Environments

High performance computing (HPC) environments are generally built to support the execution of multi-node applications that are developed and executed using the single program, multiple data (SPMD) methodology, where data exchange is performed over high-bandwidth, low-latency networks, such as Mellanox InfiniBand and Intel OPA. These multi-node codes take advantage of these networks through the Message Passing Interface (MPI), which abstracts communications into send/receive and collective constructs.

Deep learning can be distributed with MPI using a communication pattern called Ring-AllReduce. In Ring-AllReduce each process is identical, unlike in the parameter-server model where processes are either workers or servers. The Horovod package by Uber (available for TensorFlow, Keras, and PyTorch) and the mpi_collectives contributions from Baidu (available in TensorFlow) use MPI Ring-AllReduce to exchange loss and gradient information between replicas of the neural network being trained. This peer-based approach means that all nodes in the solution are working to train the network, rather than some nodes acting solely as aggregators/distributors (as in the parameter server model). This can potentially lead to faster model convergence.
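
As a conceptual illustration only, the toy code below simulates the end result of an allreduce in a single process: every worker ends up holding the same aggregated gradient, with no dedicated parameter server. The real Ring-AllReduce algorithm pipelines chunks of the gradient around a ring so that each node's network traffic stays roughly constant as the cluster grows.

import numpy as np

# One local gradient per simulated worker.
local_gradients = [np.array([0.1, 0.2]),
                   np.array([0.3, 0.0]),
                   np.array([0.2, 0.5])]

def allreduce_sum(gradients):
    # After allreduce, every rank holds the same reduced (summed) tensor.
    total = np.sum(gradients, axis=0)
    return [total.copy() for _ in gradients]

reduced = allreduce_sum(local_gradients)
print(reduced[0])   # identical on every simulated worker: [0.6 0.7]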

Ring-AllReduce model for distributed deep learning

The Dell EMC Ready Solutions for AI, Deep Learning with NVIDIA allows users to take advantage of high-bandwidth Mellanox InfiniBand EDR networking, fast Dell EMC Isilon storage, accelerated compute with NVIDIA V100 GPUs, and optimized TensorFlow, Keras, or Pytorch with Horovod frameworks to help produce insights faster. 

Distributed Deep Learning in Hadoop/Spark Environments

Hadoop and other Big Data platforms achieve extremely high performance for distributed processing but are not designed to support long running, stateful applications. Several approaches exist for executing distributed training under Apache Spark. Yahoo developed TensorFlowOnSpark, accomplishing the goal with an architecture that leveraged Spark for scheduling Tensorflow operations and RDMA for direct tensor communication between servers.

BigDL is a distributed deep learning library for Apache Spark. Unlike Yahoo's TensorflowOnSpark, BigDL not only enables distributed training; it is designed from the ground up to work on Big Data systems. To enable efficient distributed training, BigDL takes a data-parallel approach with synchronous mini-batch SGD (Stochastic Gradient Descent). Training data is partitioned into RDD samples and distributed to each worker. Model training is an iterative process: each worker first computes gradients locally, taking advantage of locally stored partitions of the training data and model to perform in-memory transformations. Then an AllReduce function schedules workers with tasks to calculate and update weights. Finally, a broadcast syncs the distributed copies of the model with the updated weights.

BigDL implementation of AllReduce functionality

The Dell EMC Ready Solutions for AI, Machine Learning with Hadoop is configured to allow users to take advantage of the power of distributed deep learning with Intel BigDL and Apache Spark. It supports loading models and weights from other frameworks such as Tensorflow, Caffe and Torch to then be leveraged for training or inferencing. BigDL is a great way for users to quickly begin training neural networks using Apache Spark, widely recognized for how simple it makes data processing.

One more note on Hadoop and Spark environments: The Intel team working on BigDL has built and compiled high-level pipeline APIs, built-in deep learning models, and reference use cases into the Intel Analytics Zoo library. Analytics Zoo is based on BigDL but helps make it even easier to use through these high-level pipeline APIs designed to work with Spark Dataframes and built in models for things like object detection and image classification.

Conclusion

Regardless of whether your preferred server infrastructure is container native, HPC clusters, or Hadoop/Spark-enabled data lakes, distributed deep learning can help your data science team develop neural network models faster. Our Dell EMC Ready Solutions for Artificial Intelligence can work in any of these environments to help jumpstart your business’s AI journey. For more information on the Dell EMC Ready Solutions for Artificial Intelligence, go to dellemc.com/readyforai.


Lucas A. Wilson, Ph.D. is the Chief Data Scientist in Dell EMC's HPC & AI Innovation Lab. (Twitter: @lucasawilson)

Michael Bennett is a Senior Principal Engineer at Dell EMC working on Ready Solutions.

Read Full Blog
deep learning AI HPC

Training an AI Radiologist with Distributed Deep Learning

Luke Wilson Ph.D.

Wed, 15 Apr 2020 19:18:38 -0000

|

Read Time: 0 minutes

Originally published on Aug 16, 2018 11:14:00 AM

The potential of neural networks to transform healthcare is evident. From image classification to dictation and translation, neural networks are achieving or exceeding human capabilities. And they are only getting better at these tasks as the quantity of data increases.

But there’s another way in which neural networks can potentially transform the healthcare industry: Knowledge can be replicated at virtually no cost. Take radiology as an example: To train 100 radiologists, you must teach each individual person the skills necessary to identify diseases in x-ray images of patients’ bodies. To make 100 AI-enabled radiologist assistants, you take the neural network model you trained to read x-ray images and load it into 100 different devices.

The hurdle is training the model. It takes a large amount of cleaned, curated, labeled data to train an image classification model. Once you’ve prepared the training data, it can take days, weeks, or even months to train a neural network. Even once you’ve trained a neural network model, it might not be smart enough to perform the desired task. So, you try again. And again. Eventually, you will train a model that passes the test and can be used out in the world.

Workflow for developing neural network models

In this post, I’m going to talk about how to reduce the time spent in the Train/Test/Tune cycle by speeding up the training portion with distributed deep learning, using a test case we developed in Dell EMC’s HPC & AI Innovation Lab to classify pathologies in chest x-ray images. Through a combination of distributed deep learning, optimizer selection, and neural network topology selection, we were able to not only speed the process of training models from days to minutes, but also to improve the classification accuracy significantly. 

Starting Point: Stanford University’s CheXNet

We began by surveying the landscape of AI projects in healthcare, and Andrew Ng’s group at Stanford University provided our starting point. CheXNet was a project to demonstrate a neural network’s ability to accurately classify cases of pneumonia in chest x-ray images.

The dataset that Stanford used was ChestXray14, which was developed and made available by the United States’ National Institutes of Health (NIH). The dataset contains over 120,000 images of frontal chest x-rays, each potentially labeled with one or more of fourteen different thoracic pathologies. The data set is very unbalanced, with more than half of the data set images having no listed pathologies.

Stanford decided to use DenseNet, a neural network topology which had just been announced as the Best Paper at the 2017 Conference on Computer Vision and Pattern Recognition (CVPR), to solve the problem. The DenseNet topology is a deep network of repeating blocks over convolutions linked with residual connections. Blocks end with a batch normalization, followed by some additional convolution and pooling to link the blocks. At the end of the network, a fully connected layer is used to perform the classification.

An illustration of the DenseNet topology (source: Kaggle)

Stanford’s team used a DenseNet topology with the layer weights pretrained on ImageNet and replaced the original ImageNet classification layer with a new fully connected layer of 14 neurons, one for each pathology in the ChestXray14 dataset. 

Building CheXNet in Keras

It sounds like it would be difficult to set up. Thankfully, Keras (provided with TensorFlow) provides a simple, straightforward way of taking standard neural network topologies and bolting on new classification layers.

from keras.applications import DenseNet121

# Import DenseNet121 pretrained on ImageNet, dropping its classification head
orig_net = DenseNet121(include_top=False, weights='imagenet', input_shape=(256,256,3)) 

In this code snippet, we are importing the original DenseNet neural network (DenseNet121) and removing the classification layer with the include_top=False argument. We also automatically import the pretrained ImageNet weights and set the image size to 256x256, with 3 channels (red, green, blue).

With the original network imported, we can begin to construct the classification layer. If you look at the illustration of DenseNet above, you will notice that the classification layer is preceded by a pooling layer. We can add this pooling layer back to the new network with a single Keras function call, and we can call the resulting topology the neural network's filters, or the part of the neural network which extracts all the key features used for classification. 

from keras.layers import GlobalAveragePooling2D

filters = GlobalAveragePooling2D()(orig_net.output) 

The next task is to define the classification layer. The ChestXray14 dataset has 14 labeled pathologies, so we have one neuron for each label. We also activate each neuron with the sigmoid activation function, and use the output of the feature filter portion of our network as the input to the classifiers. 

from keras.layers import Dense

classifiers = Dense(14, activation='sigmoid', bias_initializer='ones')(filters)  

The choice of sigmoid as an activation function is due to the multi-label nature of the data set. For problems where only one label ever applies to a given image (e.g., dog, cat, sandwich), a softmax activation would be preferable. In the case of ChestXray14, images can show signs of multiple pathologies, and the model should rightfully identify high probabilities for multiple classifications when appropriate.

Finally, we can put the feature filters and the classifiers together to create a single, trainable model.

from keras.models import Model  
  
chexnet = Model(inputs=orig_net.inputs, outputs=classifiers)  

With the final model configuration in place, the model can then be compiled and trained.
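
A minimal sketch of that last step is shown below; x_train and y_train are hypothetical arrays of preprocessed ChestXray14 images and 14-element multi-hot label vectors produced by your own input pipeline, and the hyperparameters are only placeholders.

import keras

# binary_crossentropy pairs with the per-class sigmoid outputs defined above.
chexnet.compile(optimizer=keras.optimizers.Adadelta(lr=1.0),
                loss='binary_crossentropy',
                metrics=['accuracy'])

# x_train / y_train are placeholders for your preprocessed data.
chexnet.fit(x_train, y_train, batch_size=32, epochs=10, validation_split=0.1)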

Accelerating the Train/Test/Tune Cycle with Distributed Deep Learning

To produce better models sooner, we need to accelerate the Train/Test/Tune cycle. Because testing and tuning are mostly sequential, training is the best place to look for potential optimization.

How exactly do we speed up the training process? In Accelerating Insights with Distributed Deep Learning, Michael Bennett and I discuss the three ways in which deep learning can be accelerated by distributing work and parallelizing the process:

  • Parameter server models such as in Caffe or distributed TensorFlow,
  • Ring-AllReduce approaches such as Uber’s Horovod, and
  • Hybrid approaches for Hadoop/Spark environments such as Intel BigDL.

Which approach you pick depends on your deep learning framework of choice and the compute environment that you will be using. For the tests described here we performed the training in house on the Zenith supercomputer in the Dell EMC HPC & AI Innovation Lab. The ring-allreduce approach enabled by Uber’s Horovod framework made the most sense for taking advantage of a system tuned for HPC workloads, and which takes advantage of Intel Omni-Path (OPA) networking for fast inter-node communication. The ring-allreduce approach would also be appropriate for solutions such as the Dell EMC Ready Solutions for AI, Deep Learning with NVIDIA.

The MPI-RingAllreduce approach to distributed deep learning

Horovod is an MPI-based framework for performing reduction operations between identical copies of the otherwise sequential training script. Because it is MPI-based, you will need to be sure that an MPI compiler (mpicc) is available in the working environment before installing horovod.

Adding Horovod to a Keras-defined Model

Adding Horovod to any Keras-defined neural network model only requires a few code modifications:

  1. Initializing the MPI environment,
  2. Broadcasting initial random weights or checkpoint weights to all workers,
  3. Wrapping the optimizer function to enable multi-node gradient summation,
  4. Averaging metrics among workers, and
  5. Limiting checkpoint writing to a single worker.

Horovod also provides helper functions and callbacks for optional capabilities that are useful when performing distributed deep learning, such as learning-rate warmup/decay and metric averaging.

Initializing the MPI Environment

Initializing the MPI environment in Horovod only requires calling the init method:

import horovod.keras as hvd  
  
hvd.init()  

This will ensure that the MPI_Init function is called, setting up the communications structure and assigning ranks to all workers.
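
Once initialized, each copy of the script can query its place in the job. The check below is purely illustrative (it is not part of the original training script), but it is a handy way to confirm that ranks have been assigned as expected:

print("Worker %d of %d (local rank %d)" % (hvd.rank(), hvd.size(), hvd.local_rank()))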

Broadcasting Weights

Broadcasting the neuron weights is done using a callback to the Model.fit Keras method. In fact, many of Horovod’s features are implemented as callbacks to Model.fit, so it’s worthwhile to define a callback list object for holding all the callbacks.

callbacks = [ hvd.callbacks.BroadcastGlobalVariablesCallback(0) ] 

You’ll notice that the BroadcastGlobalVariablesCallback takes a single argument that’s been set to 0. This is the root worker, which will be responsible for reading checkpoint files or generating new initial weights, broadcasting weights at the beginning of the training run, and writing checkpoint files periodically so that work is not lost if a training job fails or terminates.

Wrapping the Optimizer Function

The optimizer function must be wrapped so that it can aggregate error information from all workers before executing. Horovod’s DistributedOptimizer function can wrap any optimizer which inherits Keras’ base Optimizer class, including SGD, Adam, Adadelta, Adagrad, and others.

import keras.optimizers  
  
opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta(lr=1.0)) 

The distributed optimizer will now use an allreduce collective to aggregate error information from training batches across all workers, rather than collecting it only on the root worker. This allows the workers to independently update their models rather than waiting for the root to re-broadcast updated weights before beginning the next training batch.

Averaging Metrics

Between steps, error metrics need to be averaged to calculate the global loss. Horovod provides another callback function to do this, called MetricAverageCallback.

callbacks = [ hvd.callbacks.BroadcastGlobalVariablesCallback(0),  
              hvd.callbacks.MetricAverageCallback()  
            ]  

This will ensure that optimizations are performed on the global metrics, not the metrics local to each worker.

Writing Checkpoints from a Single Worker

When using distributed deep learning, it’s important that only one worker write checkpoint files to ensure that multiple workers writing to the same file does not produce a race condition, which could lead to checkpoint corruption.

Checkpoint writing in Keras is enabled by another callback to Model.fit. However, we only want one worker, rather than all of them, to write checkpoints. By convention we use worker 0 for this task, but technically any worker could do it. One advantage of worker 0 is that even if you run your distributed deep learning job with only 1 worker, that worker will be worker 0.

callbacks = [ ... ]  
  
if hvd.rank() == 0:  
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
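
Putting all of these pieces together, a minimal sketch of the Horovod-enabled training script might look like the following. The chexnet model is the one defined earlier; the loss, batch size, epoch count, and x_train/y_train arrays are placeholders for whatever your data pipeline provides.

import keras
import horovod.keras as hvd

hvd.init()                                                           # 1. Initialize MPI

opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta(lr=1.0))    # 3. Wrap the optimizer
chexnet.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0),      # 2. Broadcast weights
             hvd.callbacks.MetricAverageCallback()]                  # 4. Average metrics

if hvd.rank() == 0:                                                  # 5. Root-only checkpoints
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

# x_train / y_train are placeholders for the training data available to each worker
chexnet.fit(x_train, y_train, batch_size=32, epochs=10, callbacks=callbacks)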

Result: A Smarter Model, Faster!

Once a neural network can be trained in a distributed fashion across multiple workers, the Train/Test/Tune cycle can be sped up dramatically.

The figure below shows exactly how dramatically. The three tests shown are the training speed of the Keras DenseNet model on a single Zenith node without distributed deep learning (far left), the Keras DenseNet model with distributed deep learning on 32 Zenith nodes (64 MPI processes, 2 MPI processes per node, center), and a Keras VGG16 version using distributed deep learning on 64 Zenith nodes (128 MPI processes, 2 MPI processes per node, far right). By using 32 nodes instead of a single node, distributed deep learning was able to provide a 47x improvement in training speed, taking the training time for 10 epochs on the ChestXray14 data set from 2 days (50 hours) to less than 2 hours!

Performance comparisons of Keras models with distributed deep learning using Horovod

The VGG variant, trained on 64 Zenith nodes (128 MPI processes), completed the same number of epochs as the single-node DenseNet version in less than an hour, although it required more epochs overall to converge. It was, however, able to converge to a higher-quality solution. This VGG-based model outperformed the baseline single-node model in 4 of 14 conditions, and achieved nearly 90% accuracy in classifying emphysema.

Accuracy comparison of baseline single-node DenseNet model vs VGG variant with Horovod

Conclusion

In this post we’ve shown you how to accelerate the Train/Test/Tune cycle when developing neural network-based models by speeding up the training phase with distributed deep learning. We walked through the process of transforming a Keras-based model to take advantage of multiple nodes using the Horovod framework, and showed how these few simple code changes, coupled with some additional compute infrastructure, can reduce the time needed to train a model from days to hours. That leaves more time for the testing and tuning pieces of the cycle, and more time for tuning means higher-quality models, which means better outcomes for patients, customers, or whomever will benefit from the deployment of your model.


Lucas A. Wilson, Ph.D. is the Chief Data Scientist in Dell EMC's HPC & AI Innovation Lab. (Twitter: @lucasawilson)

Read Full Blog
AI deep learning HPC

Challenges of Large-batch Training of Deep Learning Models

Vineet Gundecha

Wed, 15 Apr 2020 21:22:49 -0000

|

Read Time: 0 minutes

Originally published on Aug 27, 2018 1:29:28 PM

The process of training a deep neural network is akin to finding the minimum of a function in a very high-dimensional space. Deep neural networks are usually trained using stochastic gradient descent (or one of its variants). A small batch (usually 16-512 samples), randomly sampled from the training set, is used to approximate the gradients of the loss function (the optimization objective) with respect to the weights. The computed gradient is essentially an average of the gradients for each data-point in the batch. The natural way to parallelize the training across multiple nodes/workers is to increase the batch size and have each node compute the gradients on a different chunk of the batch. Distributed deep learning differs from traditional HPC workloads, where scaling out affects only how the computation is distributed, not the outcome.
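
The key point is that averaging gradients computed on chunks of a batch reproduces the gradient of the whole batch, which is why data parallelism effectively trains with a larger batch. A toy numerical check (illustrative only):

import numpy as np

per_sample_grads = np.random.randn(512, 10)      # 512 samples, 10 weights (toy sizes)

single_worker = per_sample_grads.mean(axis=0)    # gradient of the full batch

chunks = np.split(per_sample_grads, 4)           # 4 workers, 128 samples each
multi_worker = np.mean([c.mean(axis=0) for c in chunks], axis=0)

assert np.allclose(single_worker, multi_worker)  # identical up to floating-point error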

Challenges of large-batch training

It has been consistently observed that the use of large batches leads to poor generalization performance, meaning that models trained with large batches perform poorly on test data. One of the primary reasons for this is that large batches tend to converge to sharp minima of the training function, which tend to generalize less well. Small batches tend to favor flat minima that result in better generalization. The stochasticity afforded by small batches encourages the weights to escape the basins of attraction of sharp minima. Also, models trained with small batches are shown to converge farther away from the starting point. Large batches tend to be attracted to the minimum closest to the starting point and lack the exploratory properties of small batches.

The number of gradient updates per pass of the data is reduced when using large batches. This is sometimes compensated for by scaling the learning rate with the batch size. But simply using a higher learning rate can destabilize the training. Another approach is to just train the model longer, but this can lead to overfitting. Thus, there’s much more to distributed training than just scaling out to multiple nodes.

An illustration showing how sharp minima lead to poor generalization. The sharp minimum of the training function corresponds to a maximum of the testing function, which hurts the model's performance on test data 

How can we make large batches work?

There has been a lot of interesting research recently in making large-batch training more feasible. The training time for ImageNet has now been reduced from weeks to minutes by using batches as large as 32K without sacrificing accuracy. The following methods are known to alleviate some of the problems described above:

  1. Scaling the learning rate
    The learning rate is multiplied by k when the batch size is multiplied by k. However, this rule does not hold in the first few epochs of the training since the weights are changing rapidly. This can be alleviated by using a warm-up phase. The idea is to start with a small value of the learning rate and gradually ramp up to the linearly scaled value (a minimal sketch of this schedule appears after this list).

  2. Layer-wise adaptive rate scaling
    A different learning rate is used for each layer. A global learning rate is chosen and it is scaled for each layer by the ratio of the Euclidean norm of the weights to the Euclidean norm of the gradients for that layer.

  3. Using regular SGD with momentum rather than Adam
    Adam is known to make convergence faster and more stable. It is usually the default optimizer choice when training deep models. However, Adam seems to settle to less optimal minima, especially when using large batches. Using regular SGD with momentum, although noisier than Adam, has shown improved generalization.

  4. Topologies also make a difference
    In a previous blog post, my colleague Luke showed how using VGG16 instead of DenseNet121 considerably sped up the training for a model that identified thoracic pathologies from chest x-rays while improving area under ROC in multiple categories. Shallow models are usually easier to train, especially when using large batches.
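
As a concrete illustration of point 1, here is a minimal Keras-style schedule combining linear scaling with a warm-up phase. The base learning rate, worker count, and warm-up length are placeholder values, and frameworks such as Horovod provide their own callbacks for this purpose.

from keras.callbacks import LearningRateScheduler

base_lr = 0.1        # learning rate tuned for the single-worker batch size (placeholder)
k = 8                # factor by which the batch size (and learning rate) is scaled
warmup_epochs = 5    # ramp-up period before the fully scaled rate is used

def scaled_lr_with_warmup(epoch):
    target_lr = base_lr * k                      # linear scaling rule
    if epoch < warmup_epochs:
        # ramp linearly from the small base rate up to the scaled rate
        return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
    return target_lr

lr_callback = LearningRateScheduler(scaled_lr_with_warmup)
# model.fit(..., callbacks=[lr_callback])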

Conclusion   

Large-batch distributed training can significantly reduce training time but it comes with its own challenges. Improving generalization when using large batches is an active area of research, and as new methods are developed, the time to train a model will keep going down.

Read Full Blog
AI deep learning

Training Neural Network Models for Financial Services with Intel® Xeon Processors

Pei Yang Ph.D.

Thu, 16 Apr 2020 19:53:13 -0000

|

Read Time: 0 minutes

Originally published on Nov 5, 2018 9:10:17 AM 

Time series is a very important type of data in the financial services industry. Interest rates, stock prices, exchange rates, and option prices are good examples of this type of data. Time series forecasting plays a critical role when financial institutions design investment strategies and make decisions. Traditionally, statistical models such as SMA (simple moving average), SES (simple exponential smoothing), and ARIMA (autoregressive integrated moving average) are widely used to perform time series forecasting tasks.

Neural networks are promising alternatives, as they are more robust for such regression problems due to the flexibility in model architectures (e.g., there are many hyperparameters that we can tune, such as the number of layers, number of neurons, learning rate, etc.). Recently, applications of neural network models in the time series forecasting area have been gaining more and more attention from the statistics and data science communities.

In this blog, we will first discuss some basic properties that a machine learning model must have to perform financial service tasks. Then we will design our model based on these requirements and show how to train the model in parallel on an HPC cluster with Intel® Xeon processors.

Requirements from Financial Institutions

High accuracy and low latency are two important properties that financial service institutions expect from a quality time series forecasting model.

High Accuracy  A high level of accuracy in the forecasting model helps companies lower the risk of losing money in investments. Neural networks are believed to be good at capturing the dynamics in time series and hence yield more accurate predictions. There are many hyperparameters in the model that data scientists and quantitative researchers can tune to obtain the optimal model. Moreover, the data science community believes that ensemble learning tends to improve prediction accuracy significantly. The flexibility of model architecture provides a good variety of member models for ensemble learning.

Low Latency  Operations in financial services are time-sensitive. For example, high-frequency trading usually requires models to finish training and prediction within very short time periods. For deep neural network models, low latency can be guaranteed by distributed training with Horovod or distributed TensorFlow. Intel® Xeon multi-core processors, coupled with Intel’s MKL-optimized TensorFlow, prove to be a good infrastructure option for such distributed training.

With these requirements in mind, we propose an ensemble learning model as in Figure 1, which is a combination of MLP (Multi-Layer Perceptron), CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) models. Because the architecture topologies for MLP, CNN and LSTM are quite different, the ensemble model has a good variety of members, which helps reduce the risk of overfitting and produces more reliable predictions. The member models are trained at the same time over multiple nodes with Intel® Xeon processors. If more models need to be integrated, we just add more nodes to the system so that the overall training time stays short. With neural network models and the HPC power of Intel® Xeon processors, this system meets the requirements of financial service institutions. A minimal sketch of the member mix appears after the figure below.

Training high accuracy ensemble model on HPC cluster with Intel® Xeon processors
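
To make the member mix concrete, here is a minimal Keras sketch of the three member types feeding an averaged forecast. In our system the members are trained on separate nodes; they are shown here as branches of a single graph purely for illustration, and the window length, layer sizes, and loss are placeholder choices.

from keras.layers import Input, Dense, Conv1D, Flatten, LSTM, Average
from keras.models import Model

window = 20                                   # hypothetical look-back window of past observations
inp = Input(shape=(window, 1))

mlp_out = Dense(1)(Dense(32, activation='relu')(Flatten()(inp)))          # MLP member
cnn_out = Dense(1)(Flatten()(Conv1D(16, 3, activation='relu')(inp)))      # CNN member
lstm_out = Dense(1)(LSTM(32)(inp))                                        # LSTM member

# Ensemble: average the member forecasts into a single prediction
ensemble = Model(inputs=inp, outputs=Average()([mlp_out, cnn_out, lstm_out]))
ensemble.compile(optimizer='adam', loss='mae')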

Fast Training with Intel® Xeon Scalable Processors

Our tests used Dell EMC’s Zenith supercomputer, which consists of 422 Dell EMC PowerEdge C6420 nodes, each with 2 Intel® Xeon Scalable Gold 6148 processors. Figure 2 shows an example of time-to-train for training MLP, CNN and LSTM models with different numbers of processes. The data set used is the 10-Year Treasury Inflation-Indexed Security data. For this example, running distributed training with 40 processes is the most efficient, primarily because the data set for this time series is small and the neural network models we used do not have many layers. With this setting, model training finishes within 10 seconds, much faster than training the models on a single processor with only a few cores, which typically takes more than one minute. Regarding accuracy, the ensemble model can predict this interest rate with an MAE (mean absolute error) of less than 0.0005. Typical values for this interest rate are around 0.01, so the relative error is less than 5%.

Training time comparison: Each of the models is trained on a single Dell EMC PowerEdge C6420 with 2x Intel Xeon® Scalable 6148 processors

Conclusion

With both high-accuracy and low-latency being very critical for time series forecasting in financial services, neural network models trained in parallel using Intel® Xeon Scalable processors stand out as very promising options for financial institutions. And as financial institutions need to train more complicated models to forecast many time series with high accuracy at the same time, the need for parallel processing will only grow.

Read Full Blog
AI deep learning

Neural Network Inference Using Intel® OpenVINO™

Vineet Gundecha

Thu, 16 Apr 2020 20:37:58 -0000

|

Read Time: 0 minutes

Originally published on Nov 9, 2018 2:12:18 PM 

Deploying trained neural network models for inference on different platforms is a challenging task. The inference environment is usually different from the training environment, which is typically a data center or a server farm. The inference platform may be power constrained and limited from a software perspective. The model might be trained using one of the many available deep learning frameworks such as Tensorflow, PyTorch, Keras, Caffe, MXNet, etc. Intel® OpenVINO™ provides tools to convert trained models into a framework-agnostic representation, including tools to reduce the memory footprint of the model using quantization and graph optimization. It also provides dedicated inference APIs that are optimized for specific hardware platforms, such as Intel® Programmable Acceleration Cards and Intel® Movidius™ Vision Processing Units. 

The Intel® OpenVINO™ toolkit

Components

  1. The Model Optimizer is a cross-platform command-line tool that facilitates the transition between the training and deployment environment, performs static model analysis, and adjusts deep learning models for optimal execution on end-point target devices. It is a Python script that takes a trained Tensorflow/Caffe model as input and produces an Intermediate Representation (IR), which consists of a .xml file containing the model definition and a .bin file containing the model weights.
  2. The Inference Engine is a C++ library with a set of C++ classes to infer input data (images) and get a result. The C++ library provides an API to read the Intermediate Representation, set the input and output formats, and execute the model on devices. Each supported target device has a plugin, which is a DLL/shared library. The engine also supports heterogeneous execution to distribute a workload across devices, for example implementing custom layers on a CPU while executing the rest of the model on an accelerator device.

Workflow

  1. Using the Model Optimizer, convert a trained model to produce an optimized Intermediate Representation (IR) of the model based on the trained network topology, weights, and bias values.
  2. Test the model in the Intermediate Representation format using the Inference Engine in the target environment with the validation application or the sample applications.
  3. Integrate the Inference Engine into your application to deploy the model in the target environment.

Using the Model Optimizer to convert a Keras model to IR

The model optimizer doesn’t natively support Keras model files. However, because Keras uses Tensorflow as its backend, a Keras model can be saved as a Tensorflow checkpoint, which can be loaded into the model optimizer. A Keras model can be converted to an IR using the following steps:

  1. Save the Keras model as a Tensorflow checkpoint. Make sure the learning phase is set to 0. Get the name of the output node.
import tensorflow as tf
from keras.applications import ResNet50
from keras import backend as K
from keras.models import Model

K.set_learning_phase(0)   # Set the learning phase to 0 (inference mode)

# 224x224x3 matches the input shape passed to the Model Optimizer in step 3
model = ResNet50(weights='imagenet', input_shape=(224, 224, 3))

# Rebuild the model from its config so the inference-mode graph is used
config = model.get_config()
weights = model.get_weights()
model = Model.from_config(config)
model.set_weights(weights)

output_node = model.output.name.split(':')[0]  # We need this in the next step

# Write the graph definition and a checkpoint containing the weights
graph_file = "resnet50_graph.pb"
ckpt_file = "resnet50.ckpt"
sess = K.get_session()
saver = tf.train.Saver(sharded=True)
tf.train.write_graph(sess.graph_def, '', graph_file)
saver.save(sess, ckpt_file)

2. Run the Tensorflow freeze_graph program to generate a frozen graph from the saved checkpoint.

tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph --input_graph=./resnet50_graph.pb --input_checkpoint=./resnet50.ckpt --output_node_names=Softmax --output_graph=resnet50_frozen.pb


3. Use the mo.py script and the frozen graph to generate the IR. The model weights can be quantized to FP16.

python mo.py --input_model=resnet50_frozen.pb --output_dir=./ --input_shape=[1,224,224,3] --data_type=FP16

Inference

The C++ library provides utilities to read an IR, select a plugin depending on the target device, and run the model. A minimal Python sketch of the same workflow follows the steps below.

  1. Read the Intermediate Representation - Using the InferenceEngine::CNNNetReader class, read an Intermediate Representation file into a CNNNetwork class. This class represents the network in host memory.
  2. Prepare inputs and outputs format - After loading the network, specify the input and output precision and the layout on the network. For these specifications, use the CNNNetwork::getInputInfo() and CNNNetwork::getOutputInfo() methods.
  3. Select Plugin - Select the plugin on which to load your network. Create the plugin with the InferenceEngine::PluginDispatcher load helper class. Pass any per-device loading configurations and register extensions for that device.
  4. Compile and Load - Use the plugin interface wrapper class InferenceEngine::InferencePlugin to call the LoadNetwork() API to compile and load the network on the device. Pass in the per-target load configuration for this compilation and load operation.
  5. Set input data - With the network loaded, you have an ExecutableNetwork object. Use this object to create an InferRequest, in which you specify the buffers to use for input and output. Either allocate memory on the device and copy your data into it directly, or tell the device to use your application memory and avoid the extra copy.
  6. Execute - With the input and output memory now defined, choose your execution mode:
    • Synchronously - Infer() method. Blocks until inference finishes.
    • Asynchronously - StartAsync() method. Check status with the wait() method (0 timeout), wait, or specify a completion callback.
  7. Get the output - After inference is completed, get the output memory or read the memory you provided earlier. Do this with the InferRequest GetBlob API.
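
For readers who prefer Python, the Inference Engine also ships with Python bindings that mirror these steps. The sketch below assumes the pre-2020 IENetwork/IEPlugin interface (newer releases use IECore instead), the IR files produced earlier, and a placeholder NCHW input; treat it as an outline rather than a drop-in sample.

import numpy as np
from openvino.inference_engine import IENetwork, IEPlugin

net = IENetwork(model="resnet50_frozen.xml", weights="resnet50_frozen.bin")   # read the IR

input_blob = next(iter(net.inputs))           # input/output names from the network
output_blob = next(iter(net.outputs))

plugin = IEPlugin(device="HETERO:FPGA,CPU")   # select the plugin / target device
exec_net = plugin.load(network=net)           # compile and load the network

image = np.zeros((1, 3, 224, 224), dtype=np.float32)      # placeholder pre-processed input
result = exec_net.infer(inputs={input_blob: image})       # synchronous execution
probabilities = result[output_blob]                       # read the output blob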

The classification_sample and classification_sample_async programs perform inference using the steps mentioned above. We use these samples in the next section to perform inference on an Intel® FPGA.

Using the Intel® Programmable Acceleration Card with Intel® Arria® 10GX FPGA for inference

The OpenVINO toolkit supports using the PAC as a target device for running low power inference. The steps for setting up the card are detailed here. The pre-processing and post-processing is performed on the host while the execution of the model is performed on the card. The toolkit contains bitstreams for different topologies.

Programming the bitstream

aocl program <device_id> <open_vino_install_directory>/a10_dcp_bitstreams/2-0-1_RC_FP16_ResNet50-101.aocx

The Hetero plugin can be used with the CPU as the fallback device for layers that are not supported by the FPGA. The -pc flag prints performance details for each layer:

./classification_sample_async -d HETERO:FPGA,CPU -i <path/to/input/image.png> -m <path/to/ir>/resnet50_frozen.xml            

Conclusion

The Intel® OpenVINO™ toolkit is a great way to quickly integrate trained models into applications and deploy them in different production environments. The complete documentation for the toolkit can be found at https://software.intel.com/en-us/openvino-toolkit/documentation/featured.

Read Full Blog
AI deep learning

Deep Neural Network Inference Performance on Intel FPGAs using Intel OpenVINO

Vineet Gundecha

Thu, 16 Apr 2020 21:09:49 -0000

|

Read Time: 0 minutes

Originally published on Nov 16, 2018 9:22:39 AM 

Inference is the process of running a trained neural network to process new inputs and make predictions. Training is usually performed offline in a data center or a server farm. Inference can be performed in a variety of environments depending on the use case. Intel® FPGAs provide a low power, high throughput solution for running inference. In this blog, we look at using the Intel® Programmable Acceleration Card (PAC) with Intel® Arria® 10GX FPGA for running inference on a Convolutional Neural Network (CNN) model trained for identifying thoracic pathologies.

Advantages of using Intel® FPGAs

System Acceleration: Intel® FPGAs accelerate and aid the compute and connectivity required to collect and process the massive quantities of information around us by controlling the data path. In addition to being used as compute offload, FPGAs can also directly receive data and process it inline without going through the host system. This frees the processor to manage other system events and enables higher real-time system performance.

Power Efficiency: Intel® FPGAs have over 8 TB/s of on-die memory bandwidth. Therefore, solutions tend to keep the data on the device, tightly coupled with the next computation. This minimizes the need to access external memory and results in a more efficient circuit implementation in the FPGA, where data can be parallelized, pipelined, and processed on every clock cycle. These circuits can run at significantly lower clock frequencies than traditional general-purpose processors, resulting in very powerful and efficient solutions.

Future Proofing: In addition to system acceleration and power efficiency, Intel® FPGAs help future proof systems. With such a dynamic technology as machine learning, which is evolving and changing constantly, Intel® FPGAs provide flexibility unavailable in fixed devices. As precisions drop from 32-bit to 8-bit and even binary/ternary networks, an FPGA has the flexibility to support those changes instantly. As next generation architectures and methodologies are developed, FPGAs will be there to implement them.

Model and software

The model is a ResNet-50 CNN trained on the NIH chest x-ray dataset. The dataset contains over 100,000 chest x-rays, each labelled with one or more pathologies. The model was trained on 512 Intel® Xeon® Scalable Gold 6148 processors in 11.25 minutes on the Zenith cluster at Dell EMC.

The model is trained using Tensorflow 1.6. We use the Intel® OpenVINO™ R3 toolkit to deploy the model on the FPGA. The Intel® OpenVINO™ toolkit is a collection of software tools to facilitate the deployment of deep learning models. This OpenVINO blog post details the procedure to convert a Tensorflow model to a format that can be run on the FPGA.

Performance

In this section, we look at the power consumption and throughput numbers on the Dell EMC PowerEdge R740 and R640 servers.

Using the Dell EMC PowerEdge R740 with 2x Intel® Xeon® Scalable Gold 6136 (300W) and 4x Intel® PACs

The figures below show the power consumption and throughput numbers for running the model on Intel® PACs, and in combination with Intel® Xeon® Scalable Gold 6136. We observe that the addition of a single Intel® PAC adds only 43W to the system power while providing the ability to perform inference on over 100 chest X-rays per second. The additional power and inference performance scales linearly with the addition of more Intel® PACs. At a system level, we see a 2.3x improvement in throughput and a 116% improvement in efficiency (images per second per Watt) when using 4x Intel® PACs with 2x Intel® Xeon® Scalable Gold 6136.

Inference performance tests using ResNet-50 topology. FP11 precision. Image size is 224x224x3. Power measured via racadm


Performance per watt tests using ResNet-50 topology. FP11 precision. Image size is 224x224x3. Power measured via racadm

Using the Dell EMC PowerEdge R640 with 2x Intel® Xeon® Scalable Gold 5118 (210W) and 2x Intel® PACs

We also used a server with lower idle power. We see a 2.6x improvement in system performance in this case. As before, each Intel® PAC linearly adds performance to the system, adding more than 100 inferences per second for 43W (2.44 images/sec/W).

Inference performance tests using ResNet-50 topology. FP11 precision. Image size is 224x224x3. Power measured via racadm

Performance per watt tests using ResNet-50 topology. FP11 precision. Image size is 224x224x3. Power measured via racadm 

Conclusion

Intel® FPGAs coupled with Intel® OpenVINO™ provide a complete solution for deploying deep learning models in production. FPGAs offer low power and flexibility that make them very suitable as an accelerator device for deep learning workloads.

Read Full Blog