Short articles about data analytics solutions and related technology trends
AI-based Edge Analytics in the Service Provider Space
Fri, 15 Jan 2021 11:48:37 -0000
Introduction
Advances in Service Provider performance management lag behind the growth in digital transformation. Consider, for example, Dynamic Spectrum Sharing (DSS) in 5G networks – operators need to rapidly map small-cell flows to available frequency bands, in the presence of constraints like differing radio technologies and interference. Another example is the need to detect and/or predict infrastructure failures from KPIs, traces, profiles, and knowledge bases, to trigger a fix before an issue manifests itself. Yet another example is energy optimization in data centers, where servers are powered off to save energy and workloads are moved around in the cluster, without affecting end-to-end service. It is clear that in all of these scenarios, and in numerous other use cases affecting industries such as factory automation, automotive, IIoT, smart cities, energy, healthcare, entertainment, and surveillance, AI on Big Data needs to replace legacy IT processes and tasks to trigger timely changes in the network substrate. The traditional approach of DevOps reacting to irregularities visualized through Network Operations Center (NOC) terminals does not scale. Gartner and IDC both predict that by 2024 more than 60 percent of Mobile Network Operators (MNO) will adopt AI-based analytics in their IT operations. The following figure illustrates how Big Data from the substrate can be consumed by fast-responding, interconnected AI models to act on service degradations.
Figure 1. Decision and Controls with Models
Architecture
Data streams originating in the substrate and gathered in the collection gateway may be compressed. There may be gaps in data collection that need interpolation. Not all data types collected have an equal impact on decision-making, which makes feature filtering important. These issues justify the need for multi-stage pre-processing. Similarly, rapid decision-making can be achieved through multi-stage interconnected models using deep-learning technology. Instead of one huge and complex model, experts agree that simpler interconnected models lead to a more reusable design. The following figure illustrates the decision-making process. It shows a sample interconnected model graph that detects anomalies, identifies root causes, and decides on a control sequence to recommend remediation measures.
Figure 2. Runtime acceleration key for real-time loops
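As a minimal sketch of what one such pre-processing stage might look like, the snippet below fills collection gaps by interpolation and drops near-constant KPIs before they reach the models. It assumes time-indexed KPI samples arrive as a pandas DataFrame; the column handling and variance threshold are illustrative assumptions, not part of any specific product.

```python
import pandas as pd

def preprocess(kpi_frame: pd.DataFrame, variance_threshold: float = 1e-3) -> pd.DataFrame:
    """One illustrative pre-processing stage: fill collection gaps, then filter weak features."""
    # Interpolate gaps left by missed collection intervals (a DatetimeIndex is assumed).
    filled = kpi_frame.interpolate(method="time").ffill().bfill()

    # Drop near-constant KPIs that contribute little to downstream decision-making.
    variances = filled.var(numeric_only=True)
    keep = variances[variances > variance_threshold].index
    return filled[keep]
```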
Deep learning is a good tool for inductive reasoning, but deductive reasoning is also important for decision-making (for example, to limit cascading errors), and this requires one or more post-processing stages. Collectively, these arguments point to a need for auto-pipelining through Function-as-a-Service (FaaS) for WebScale automation in the cloud-native space. Add to this the need for streaming, visualization, and time-series databases for selective data processing in the stages, and what we end up with is a Machine Learning Operating System (ML-OS) that provides these services. An ML-OS, such as Nuclio, automatically maps pipelined functions (for example, Python definitions) to cloud-native frameworks, utilizing specified configurations, and supports open-source tools for visualization, streaming, in-memory time-series databases, and GPU-based model acceleration. Applications developed on the ML-OS ingest data and output control sequences for continuous optimization in decision-making. These real-time decision-making loops collectively enable WebScale Network Automation, Slice Management, RAN operations, Traffic Optimization, QoE Optimization, and Security. The following figure illustrates the AIOps platform.
Figure 3. AIOps Platform
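As a rough illustration of how a single pipeline stage could be expressed for an ML-OS such as Nuclio, the sketch below uses the Nuclio Python handler signature; the anomaly-scoring logic and event fields are placeholders for this blog, not part of any Dell or Iguazio product.

```python
import json

# Nuclio invokes this handler for every event delivered to the function.
def handler(context, event):
    # The collection gateway is assumed to post KPI samples as JSON in the event body.
    sample = json.loads(event.body)

    # Placeholder scoring step; a real pipeline would call the anomaly-detection model here.
    score = abs(sample.get("cpu_util", 0.0) - sample.get("cpu_util_baseline", 0.0))

    context.logger.info_with("scored sample", score=score)

    # Downstream stages (root-cause analysis, control sequencing) consume this output.
    return json.dumps({"metric": sample.get("metric"), "anomaly_score": score})
```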
Deployment
In this section we show our prototype deployment using Generation (substrate) and Analytics infrastructure, as shown in the following figure. Generation includes a workload that is placed in Red Hat OpenStack (R16) VMs, where synthetically generated tomography images are compressively sensed in one VM and then reconstructed in another VM. System performance metrics from this workload environment are exported to the Service Telemetry Framework (STF) Gateway placed in Red Hat OpenShift (v4.3) containers, which gathers data for streaming to the Analytics cluster placed in VMware (v6.7) VMs. The Analytics cluster includes the Iguazio ML-OS with native GPU acceleration, and Anodot models for correlation and anomaly detection.
Figure 4. Workload layout in virtual infrastructure
The OpenStack cluster has 3 physical control nodes (R640) and 2 physical compute nodes (R740). VM-1 generates random tomography images, which are compressed and dispatched to VM-2 for reconstruction using L1 (Lasso) regression. OpenShift (OCP) is deployed on a pool of VMware virtual hosts (v6.7) with vSAN (see Reference Design) on 3 physical nodes (R740xd). The OCP deployment spans 3 control and 5 compute virtual hosts. There is a separate administration virtual host (CSAH) for infrastructure services (DHCP, DNS, HAPROXY, TFTP, PXE, and CLI) on the OCP platform. vSphere CSI drivers are enabled on the OCP platform so that persistent volume requirements for OCP pods are satisfied by vSAN storage. RH STF deployed on OpenShift facilitates the automated collection of measurements from the workload environment over a RabbitMQ message bus. STF stores metrics in the local Prometheus database and can forward them to data sinks like Nuclio or Isilon (remote storage). The Nuclio ML-OS is installed as 3 data VMs and 3 application VMs using data, client, and management networks. Anodot models in the application VMs process metrics from the OpenStack environment to detect anomalies and correlate them, as shown in the following figure.
Figure 5. Sleeve tightening of metrics in Anodot
Building Blocks
The Python snippet and image snapshot shown below capture a workload running in the OpenStack space. Self-timed logic (not shown here) in VM-1 is used to randomly generate CPU-utilization surges in compression by resizing the imaging resolution. The Anodot dashboard shows the resulting surge in CPU utilization in VM-2 during reconstruction, hinting at a root-cause issue. Similar behavior can be seen in network utilization, which the Anodot dashboard shows by aligning the anomalies to indicate correlation.
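The original snippet is not reproduced here; as an illustration of the compress-and-reconstruct pattern it describes, the scikit-learn sketch below generates a synthetic image, compressively senses it with a random measurement matrix, and recovers it with L1 (Lasso) regression. The image size, compression ratio, and regularization strength are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# VM-1 side: generate a synthetic "tomography" image and compressively sense it.
n_pixels = 32 * 32
image = rng.random(n_pixels)                      # stand-in for a generated tomography slice
n_measurements = n_pixels // 4                    # 4:1 compression
sensing_matrix = rng.standard_normal((n_measurements, n_pixels))
measurements = sensing_matrix @ image             # compressed payload dispatched to VM-2

# VM-2 side: reconstruct the image with L1 (Lasso) regression.
lasso = Lasso(alpha=0.001, max_iter=10000)
lasso.fit(sensing_matrix, measurements)
reconstruction = lasso.coef_

print("relative reconstruction error:",
      np.linalg.norm(reconstruction - image) / np.linalg.norm(image))
```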
Figure 6. Anomaly detection in Anodot
Summary
The analytics solution proposed here uses open interfaces to aggregate data from all segments of the network, such as RAN, Packet Core, IMS, Messaging, Transport, Platform, and Devices. This provides the ability to correlate metrics and events across all nodes to create an end-to-end view of the network, a flow, or a slice. AI turns this end-to-end insight into tangible inferences that drive autonomics in the substrate through control sequences.
Real-Time Streaming Solutions Beyond Data Ingestion
Wed, 16 Dec 2020 22:31:30 -0000
So, it has been all about data—data at rest, data in-flight, IoT data, and so forth. Let’s touch on traditional data processing approaches and look at their synergy with modern database technologies. Users’ model-based inquiries manifest as data entities that are created when the request payloads are initiated. Traditional database and business applications have been the lone actors that collaborate to implement such data models. They interact to process users’ inquiries and persist the results in static data stores for further updates. Business continuity is measured by the degree of such activity among the business applications consuming data from these shared data stores. Of course, with a lower degree of activity, there is a high potential for the business to sit at an idle state of operations, waiting for more data to be acquired.
The above paradigm is inherently set up to miss a great opportunity to maintain a higher degree of business continuity. To fill these gaps, a shift in the static data store paradigm is necessary. The new massive ingested data processing requirements mandate the implementation of processing models that continuously generate insight from any “data in-flight,” mostly in real time. To overcome storage access performance bottlenecks, persistence of interim computed results in a permanent data store is expected to be kept to a minimum.
This blog addresses these modern data processing models from a real-time streaming ingestion and processing perspective. In addition, it discusses Dell Technologies’ offerings of such models in detail.
Customers have the option of building their own solutions from open source projects when adopting real-time streaming analytics technologies. Mixing and matching such components to implement real-time data ingestion and processing infrastructures is cumbersome, and it requires a variety of costly skills to stabilize such infrastructures in production environments. Dell Technologies offers validated reference architectures that meet target KPIs on storage and compute capacities to simplify these implementations. The following sections provide high-level information about real-time data streaming and popular platforms to implement these solutions. This blog focuses particularly on two Ready Architecture solutions from Dell—Streaming Data Platform (formerly known as Nautilus) and a Real-Time Streaming reference architecture based on Confluent’s Kafka ingestion platform—and provides a comparative analysis of the platforms.
The topic of real-time data streaming goes far beyond ingesting data in real time. Many publications clearly describe the compelling objectives behind a system that ingests millions of data events in real time. An article from Jay Kreps, one of the co-creators of open source Apache Kafka, provides a comprehensive and in-depth overview of ingesting real-time streaming data. This blog focuses on both ingestion and the processing side of the real-time streaming analytics platforms.
A comprehensive end-to-end big data analytics platform demands must-have features that:
Over the years, demand for real-time ingestion features has motivated the implementation of several streaming analytics engines, each with a unique targeted architecture. Streaming analytics engines provide capabilities ranging from micro-batching of streamed data during processing, to near-real-time performance, to true real-time processing behavior. The ingested data type may range from a byte-stream event to a complex event format. Examples of such ingestion engines are Pravega, which is supported by Dell Technologies, and open source Apache Kafka (Apache 2.0 licensed), both of which can be seamlessly integrated with open source big data analytics engines such as Samza, Spark, Flink, and Storm, to name a few. Proprietary implementations of similar technologies are offered by a variety of vendors. A short list of these products includes Striim, WSO2 Complex Event Processor, IBM Streams, SAP Event Stream Processor, and TIBCO Event Processing.
Dell Technologies offers customers two solutions for implementing their real-time streaming infrastructure. One solution is built on Apache Kafka as the ingestion layer and Kafka Stream Processing as the default streaming data processing engine. The second solution is built on open source Pravega as the ingestion layer and Flink as the default real-time streaming data processing engine. But how are these solutions being used in response to customers’ requirements? Let’s review possible integration patterns where the Dell Technologies real-time streaming offerings provide the data ingestion and partial preprocessing layers for implementing these patterns.
Customers implement real-time streaming in different ways to meet their specific requirements. This implies that there may exist many ways of integrating a real-time streaming solution with the remaining components in the customer’s infrastructure ecosystem. Figure 1 depicts a minimal big data integration pattern that customers may implement by mixing and matching a variety of existing streaming, storage, compute, and business analytics technologies.
Figure 1: A modern big data integration pattern for processing real-time ingested data
There are several options to implement the Stream Processors layer, including the following two offerings from Dell Technologies.
The core component of this solution is Apache Kafka, which also delivers Kafka Stream Processing in the same package. Confluent provides and supports the Apache Kafka distribution along with the Confluent Enterprise-Ready Platform, which adds advanced capabilities that improve Kafka. Additional community and commercial platform features enable:
Dell Technologies provides infrastructure for implementing stream processing deployments using the Kafka distribution from Confluent in one of two cluster architectures—the Standard Cluster Architecture or the Large Cluster Architecture. Both cluster architectures may be implemented as either the streaming branch of a Lambda Architecture or as the single process flow engine in a Kappa Architecture. For a description of the difference between the two architectures, see this blog. For more details about the product, see the Dell Real-Time Big Data Streaming Ready Architecture documentation.
Figure 2: Standard Dell Real-Time Streaming Big Data Cluster Architecture
Figure 3: Large Scalable Dell Real-Time Streaming Big Data Cluster Architecture
The Streaming Data Platform (SDP) is an elastically scalable platform for ingesting, storing, and analyzing continuously streaming data in real time. The platform can concurrently process both real-time and collected historical data in the same application. The core components of SDP are open source Pravega for ingestion, Long Term Storage, Apache Flink for compute, open source Kubernetes, and Dell Technologies proprietary software known as the Management Platform. Figure 4 shows the SDP architecture and its software stack components.
Figure 4: Streaming Data Platform Architecture Overview
Both Dell EMC real-time streaming solutions address the same problem and ultimately serve the same purpose. However, in addition to using different technology implementations, each tends to be a better fit for certain streaming workloads. The best starting point for selecting one over the other is an understanding of the exact requirements of the target use case and workload.
In most situations, users know what they want for a real-time ingestion solution—typically an open source solution that is popular in the industry. Kafka is what customers demand in most of these situations. Additional characteristics, such as the mechanisms for receiving, storing, and processing events, are secondary. Most of our customer conversations are about a reliable ingestion layer that can guarantee delivery of the customer’s business events to the consuming applications. Further detailed expectations are focused on no loss of events, simple yet long-term storage capacity, and, in most cases, a well-defined process integration method for implementing initial preprocessing tasks such as filtering, cleansing, and transformations like Extract, Transform, Load (ETL). The purpose of preprocessing is to offload nonbusiness-logic-related work from the target analytics engine—for example, Spark, Flink, or Kafka Stream Processing—resulting in better overall end-to-end real-time performance.
Kafka is essentially a messaging vehicle that decouples the sender of an event from the application that processes it to gain business insight. By default, Kafka uses the local disk to temporarily persist incoming data, while longer-term storage for the ingested data is implemented in what are known as Kafka Broker servers. When an event is received, it is broadcast to the interested applications, known as subscribers. An application may subscribe to more than one event-type group, also known as a topic. By default, Kafka stores and replicates the events of a topic in partitions configured on the Kafka Brokers. The replicas of an event may be distributed among several Brokers to prevent data loss and guarantee recovery in case of a failover. A Broker cluster may be constructed and configured on several Dell EMC PowerEdge R640 servers. To avoid the Brokers’ storage and compute capacity limitations, the Broker cluster may be extended by adding more Brokers to the cluster topology; this horizontal scalability is a characteristic of the Kafka architecture.

By design, the de facto analytics engine provided in an open source Kafka stack is Kafka Stream Processing. It is customary to use Kafka Stream Processing as a preprocessing engine and then route the results as real-time streaming artifacts to an analytics engine that implements the actual business logic, such as Flink or Spark Streaming. Confluent wraps the Kafka Stream Processing implementation in an abstract process layer known as the KSQL APIs. This makes it extremely simple to process events in the core Kafka Stream Processing engine by running SQL-like statements instead of writing complex code in third-generation languages such as Java or C++, or scripting languages such as Python.
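As a simple illustration of the pub/sub and preprocessing flow described above, the sketch below uses the confluent-kafka Python client to consume raw events from one topic, apply a lightweight filtering/cleansing step, and produce the cleansed events to another topic for the downstream analytics engine. The broker address, topic names, and filter rule are assumptions for illustration only.

```python
import json
from confluent_kafka import Consumer, Producer

BROKERS = "localhost:9092"                               # assumed broker address
RAW_TOPIC, CLEAN_TOPIC = "raw-events", "clean-events"    # assumed topic names

consumer = Consumer({
    "bootstrap.servers": BROKERS,
    "group.id": "preprocessing-stage",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": BROKERS})

consumer.subscribe([RAW_TOPIC])
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Illustrative cleansing/filtering step before handing off to the analytics engine.
        if event.get("value") is None:
            continue
        producer.produce(CLEAN_TOPIC, json.dumps(event).encode("utf-8"))
        producer.poll(0)                                 # serve delivery callbacks
finally:
    consumer.close()
    producer.flush()
```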
Unlike Kafka’s messaging protocol and event-persisting partitions, Pravega implements a storage protocol and initially persists events as appended streams. As time goes by and the events age, they become long-term data entities. Therefore, unlike Kafka, the Pravega architecture does not require separate long-term storage; eventually, the historical data is available in the same storage. In Dell’s current SDP architecture, Pravega routes previously appended streams to Flink, which provides a data pipeline to implement the actual business logic. When it comes to scalability, Pravega uses Isilon or ECS S3 as extended and/or archiving storage.
Although both SDP and Kafka act as a vehicle between the event sender and the event processor, they implement this transport differently. By design, Kafka implements the pub/sub pattern. It basically broadcasts the event to all interested applications at the same time. Pravega makes specific events available directly to a specific application by implementing a point-to-point pattern. Both Kafka and Pravega claim guaranteed delivery. However, the point-to-point approach supports a more rigid underlying transport.
Dell Technologies offers two real-time streaming solutions, and it is not a simple task to promote one over the other. Ideally, every customer problem requires an initial analysis of the data source, data format, data size, expected data ingestion frequency, guaranteed delivery requirements, integration requirements, transactional rollback requirements (if applicable), storage requirements, transformation requirements, and data structural complexity. Aggregated results from such an analysis may help us suggest a specific solution.
Dell works with customers to collect as much detailed information as possible about the customer’s streaming use cases. Kafka Stream Processing has an impressive feature: it can offload the transformation portion of the analytics from a pipeline engine such as Flink or Spark. This can be a great advantage. Meanwhile, SDP requires extra scripting effort outside of the Flink configuration space to provide a logically equivalent capability. On the other hand, SDP simplifies storage complexities through Pravega’s native streams-per-segment architecture, while Kafka’s core storage logic pertains to a messaging layer that requires a dedicated file system. Customers with IoT device data use cases are concerned with high ingestion rates (number of events per second). Soon, we plan to use this parameter to provide benchmarking results from a comparative analysis of ingestion rates performed on our SDP and Confluent Real-Time Streaming solutions.
I owe an enormous debt of gratitude to my colleagues Mike Pittaro and Mike King of Dell Technologies. They shared their valuable time to discuss the nuances of the text, guided me to clarify concepts, and made specific recommendations to deliver cohesive content.
Author: Amir Bahmanyari, Advisory Engineer, Dell Technologies Data-Centric Workload & Solutions. Amir joined the Dell Technologies Big Data Analytics team in late 2017. He works with Dell Technologies customers to build their Big Data solutions. Amir has a special interest in the field of Artificial Intelligence. He has been active in Artificial and Evolutionary Intelligence work since the late 1980s, when he was a Ph.D. candidate at Wayne State University, Detroit, MI. Amir implemented multiple AI/Computer Vision solutions for motion detection and analysis. His special interest in biological and evolutionary intelligence algorithms led him to invent a neuron model that mimics the data processing behavior of the protein structures in cytoskeletal fibers. Prior to Dell, Amir worked for several startups in Silicon Valley and as a Big Data Analytics Platform Architect at Walmart Stores, Inc.
MLPerf Inference v0.7 Benchmarks on PowerEdge R7515 Servers
Tue, 08 Dec 2020 00:14:16 -0000
MLPerf (https://mlperf.org) Inference is a benchmark suite for measuring how fast Machine Learning (ML) and Deep Learning (DL) systems can process input inference data and produce results using a trained model. The benchmarks belong to a diversified set of ML use cases that are popular in the industry and provide a standard for hardware platforms to perform ML-specific tasks. Hence, good performance under these benchmarks signifies a hardware setup that is well optimized for real world ML inferencing use cases.
Table 1 Dell EMC PowerEdge R7515 technical specifications
Component | Description |
---|---|
System name | PowerEdge R7515 |
Status | Commercially available |
System type | Data center |
Number of nodes | 1 |
Host processor model name | AMD® EPYC® 7702P |
Host processors per node | 1 |
Host processor core count | 64 |
Host processor frequency | 2.00 GHz |
Host memory capacity | 256 GB DDR4, 2933 MHz |
Host storage | 3.2 TB SSD |
Host accelerator | NVIDIA Tesla T4 |
Accelerators per node | 4 |
The NVIDIA Tesla T4, based on NVIDIA’s Turing architecture, is one of the most widely used AI inference accelerators. The Tesla T4 features NVIDIA Turing Tensor cores, which enable it to accelerate all types of neural networks for images, speech, translation, and recommender systems, to name a few. The Tesla T4 supports a wide variety of precisions and accelerates all major DL and ML frameworks, including TensorFlow, PyTorch, MXNet, Chainer, and Caffe2.
Table 2 NVIDIA Tesla T4 technical specifications
Component | Description |
---|---|
GPU architecture | NVIDIA Turing |
NVIDIA Turing Tensor cores | 320 |
NVIDIA CUDA® cores | 2,560 |
Single-precision | 8.1 TFLOPS |
Mixed-precision (FP16/FP32) | 65 TFLOPS |
INT8 | 130 TOPS |
INT4 | 260 TOPS |
GPU memory | 16 GB GDDR6, 320+ GB/s |
ECC | Yes |
Interconnect bandwidth | 32 GB/s |
System interface | X16 PCIe Gen3 |
Form factor | Low-profile PCIe |
Thermal solution | Passive |
Compute APIs | CUDA, NVIDIA TensorRT™, ONNX |
Power | 70 W |
The MLPerf inference benchmark measures how fast a system can perform ML inference using a trained model with new data that is provided in various deployment scenarios. Table 3 shows seven mature models that are in the official v0.7 release.
Table 3 MLPerf Inference Suite v0.7
Model | Reference application | Dataset |
---|---|---|
resnet50-v1.5 | vision/classification and detection | ImageNet (224 x 224) |
ssd-mobilenet 300 x 300 | vision/classification and detection | COCO (300 x 300) |
ssd-resnet34 1200 x 1200 | vision/classification and detection | COCO (1200 x 1200) |
bert | language | SQuAD v1.1 |
dlrm | recommendation | Criteo Terabyte |
3d-unet | vision/medical imaging | BraTS 2019 |
rnnt | speech recognition | OpenSLR LibriSpeech Corpus |
The above models serve in various critical inference applications or use cases that are known as “scenarios.” Each scenario requires different metrics and demonstrates performance in a production environment. MLPerf Inference consists of four evaluation scenarios that are shown in Table 4:
Table 4 Deployment scenarios
Scenario | Sample use case | Metrics |
---|---|---|
Single-stream | Cell phone augmented reality | Latency in ms |
Multi-stream | Multiple camera driving assistance | Number of streams |
Server | Translation sites | QPS |
Offline | Photo sorting | Inputs/s |
The units on which inference is measured are based on samples and queries. A sample is a unit on which inference is run, such as an image or a sentence. A query is a set of samples that are issued to an inference system together. For a detailed explanation of the definitions, rules, and constraints of MLPerf Inference, see: https://github.com/mlperf/inference_policies/blob/master/inference_rules.adoc#constraints-for-the-closed-division
Default Accuracy refers to a configuration where the model must achieve at least 99% of the reference FP32 accuracy. High Accuracy refers to a configuration where the model must achieve at least 99.9% of the reference FP32 accuracy.
For the MLPerf Inference v0.7 result submissions, Dell EMC used the Offline and Server scenarios, as they are the most representative of datacenter systems. The Offline scenario represents use cases where inference is done as a batch job (for instance, using AI for photo sorting), while the Server scenario represents an interactive inference operation (for instance, a translation app).
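The difference between the two scenarios can be pictured with a small, purely conceptual sketch (this is not the MLPerf LoadGen harness): the Offline scenario measures throughput on one large batch of samples, while the Server scenario issues queries at random arrival times and tracks latency. The inference function, sample counts, and arrival rate below are placeholders.

```python
import random, time

def infer(batch):
    """Placeholder for the actual model inference call."""
    time.sleep(0.001 * len(batch))

samples = list(range(10000))

# Offline-style measurement: one large batch, metric is samples per second.
start = time.time()
infer(samples)
print("offline samples/s:", len(samples) / (time.time() - start))

# Server-style measurement: queries arrive over time, metric is queries per second
# subject to a latency bound on each query.
latencies = []
for _ in range(200):
    time.sleep(random.expovariate(100))   # Poisson-like arrivals
    t0 = time.time()
    infer([random.choice(samples)])
    latencies.append(time.time() - t0)
print("server ~p99 latency (s):", sorted(latencies)[int(0.99 * len(latencies)) - 1])
```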
Table 5 PowerEdge R7515 inference results
System | Scenario | 3D-UNET (default accuracy) | 3D-UNET (high accuracy) | BERT (default accuracy) | BERT (high accuracy) | DLRM (default accuracy) | DLRM (high accuracy) | ResNet50 (default accuracy) | RNNT (default accuracy) | SSD-ResNet34 (default accuracy) |
---|---|---|---|---|---|---|---|---|---|---|
Dell EMC R7515 (4 x T4) | Offline (samples/s) | 28 | 28 | 1,708 | 715 | 126,287 | 126,287 | 23,290 | 5,712 | 535 |
Dell EMC R7515 (4 x T4) | Server (queries/s) | | | 1,249 | 629 | 126,514 | 126,514 | 21,506 | 4,096 | 450 |
Table 5 above shows the raw performance of the R7515_T4x4 SUT in samples/s for the Offline scenario and queries/s for the Server scenario. Detailed results for this and other configurations can be found at https://mlperf.org/inference-results-0-7/
Figures 1 to 4 below show the inference capabilities of two Dell PowerEdge servers: the R7515 and the PowerEdge R7525. Both are 2U servers powered by AMD processors; the R7515 is single socket, and the R7525 is dual socket. The R7515 used four NVIDIA Tesla T4 GPUs, while the R7525 was tested in four different configurations of three NVIDIA GPU accelerators: Tesla T4, Quadro RTX 8000, and A100. Each bar graph indicates the relative performance of inference operations that are completed in a set amount of time while bounded by latency constraints. The higher the bar, the higher the inference capability of the platform.
Figure 1 Offline scenario relative performance with default accuracy for six different benchmarks and five different configurations using R7515_T4x4 as a baseline
Figure 2 Offline scenario relative performance with high accuracy for six different benchmarks and five different configurations using R7515_T4x4 as a baseline
Figure 3 Server scenario relative performance with default accuracy for five different benchmarks and five different configurations using R7515_T4x4 as a baseline
Figure 4 Server scenario relative performance with high accuracy for two different benchmarks and five different configurations using R7515_T4x4 as a baseline
Figure 5 shows the relative price of each GPU configuration using the cost of the Tesla T4 as the baseline, and the corresponding price/performance. The price/performance shown is an estimate to illustrate the “bang” for the money that is spent on the GPU configurations and should not be taken as the price/performance of the entire SUT. In this case, the shorter the bar, the better.
As shown in the charts above, Dell EMC PowerEdge R7515 performed well in a wide range of benchmark scenarios. The benchmarks that are discussed in this paper included diverse use cases. For instance, image dataset inferencing (Object Detection using SSD-Resnet34 model on COCO dataset), language processing (BERT model used on SQUAD v1.1 for machine comprehension of texts), and recommendation engine (DLRM model with Criteo 1 TB clicks dataset).
MLPerf Inference v0.7 Benchmarks on PowerEdge R740 Servers
Tue, 08 Dec 2020 00:14:15 -0000
MLPerf (https://mlperf.org) Inference is a benchmark suite for measuring how fast Machine Learning (ML) and Deep Learning (DL) systems can process input inference data and produce results using a trained model. The benchmarks belong to a diversified set of ML use cases that are popular in the industry and provide a standard for hardware platforms to perform ML-specific tasks. Hence, good performance under these benchmarks signifies a hardware setup that is well optimized for real world ML inferencing use cases.
Table 1 Dell EMC PowerEdge R740 technical specifications
Component | Description |
---|---|
System name | PowerEdge R740 |
Status | Commercially available |
System type | Data center |
Number of nodes | 1 |
Host processor model name | Intel® Xeon® Gold 6248R |
Host processors per node | 2 |
Host processor core count | 24 |
Host processor frequency | 3.00 GHz |
Host memory capacity | 384 GB DDR4, 2933 MHz |
Host storage | 3.84 TB SSD |
Host accelerator | NVIDIA Tesla T4 |
Accelerators per node | 4 |
The NVIDIA Tesla T4, based on NVIDIA’s Turing architecture, is one of the most widely used AI inference accelerators. The Tesla T4 features NVIDIA Turing Tensor cores, which enable it to accelerate all types of neural networks for images, speech, translation, and recommender systems, to name a few. The Tesla T4 supports a wide variety of precisions and accelerates all major DL and ML frameworks, including TensorFlow, PyTorch, MXNet, Chainer, and Caffe2.
Table 2 NVIDIA Tesla T4 technical specifications
Component | Description |
---|---|
GPU architecture | NVIDIA Turing |
NVIDIA Turing Tensor cores | 320 |
NVIDIA CUDA® cores | 2,560 |
Single-precision | 8.1 TFLOPS |
Mixed-precision (FP16/FP32) | 65 TFLOPS |
INT8 | 130 TOPS |
INT4 | 260 TOPS |
GPU memory | 16 GB GDDR6, 320+ GB/s |
ECC | Yes |
Interconnect bandwidth | 32 GB/s |
System interface | X16 PCIe Gen3 |
Form factor | Low-profile PCIe |
Thermal solution | Passive |
Compute APIs | CUDA, NVIDIA TensorRT™, ONNX |
Power | 70 W |
The MLPerf inference benchmark measures how fast a system can perform ML inference using a trained model with new data that is provided in various deployment scenarios. Table 3 shows seven mature models that are in the official v0.7 release.
Table 3 MLPerf Inference Suite v0.7
Model | Reference application | Dataset |
---|---|---|
resnet50-v1.5 | vision/classification and detection | ImageNet (224 x 224) |
ssd-mobilenet 300 x 300 | vision/classification and detection | COCO (300 x 300) |
ssd-resnet34 1200 x 1200 | vision/classification and detection | COCO (1200 x 1200) |
bert | language | SQuAD v1.1 |
dlrm | recommendation | Criteo Terabyte |
3d-unet | vision/medical imaging | BraTS 2019 |
rnnt | speech recognition | OpenSLR LibriSpeech Corpus |
The above models serve in various critical inference applications or use cases that are known as “scenarios.” Each scenario requires different metrics and demonstrates performance in a production environment. MLPerf Inference consists of four evaluation scenarios that are shown in Table 4:
Table 4 Deployment scenarios
Scenario | Sample use case | Metrics |
---|---|---|
Single-stream | Cell phone augmented reality | Latency in milliseconds |
Multi-stream | Multiple camera driving assistance | Number of streams |
Server | Translation sites | QPS |
Offline | Photo sorting | Inputs/s |
The units on which inference is measured are based on samples and queries. A sample is a unit on which inference is run, such as an image or a sentence. A query is a set of samples that are issued to an inference system together. For a detailed explanation of the definitions, rules, and constraints of MLPerf Inference, see: https://github.com/mlperf/inference_policies/blob/master/inference_rules.adoc#constraints-for-the-closed-division
Default Accuracy refers to a configuration where the model must achieve at least 99% of the reference FP32 accuracy. High Accuracy refers to a configuration where the model must achieve at least 99.9% of the reference FP32 accuracy. For the MLPerf Inference v0.7 result submissions, Dell EMC used the Offline and Server scenarios, as they are the most representative of datacenter systems. The Offline scenario represents use cases where inference is done as a batch job (for instance, using AI for photo sorting), while the Server scenario represents an interactive inference operation (for instance, a translation app).
Table 5 PowerEdge R740 inference results
System | Scenario | 3D-UNET (default accuracy) | 3D-UNET (high accuracy) | BERT (default accuracy) | BERT (high accuracy) | DLRM (default accuracy) | DLRM (high accuracy) | ResNet50 (default accuracy) | RNNT (default accuracy) | SSD-ResNet34 (default accuracy) |
---|---|---|---|---|---|---|---|---|---|---|
Dell EMC R740 (4 x T4) | Offline (samples/s) | 29 | 29 | 1,732.9 | 743 | 131,571 | 131,571 | 23,844 | 5,875 | 546 |
Dell EMC R740 (4 x T4) | Server (queries/s) | | | 1,349 | 679 | 126,015 | 126,015 | 21,805 | 4,196 | 470 |
Table 5 above shows the raw performance of the R740_T4x4 SUT in samples/s for the Offline scenario and queries/s for the Server scenario. Detailed results for this and other configurations can be found at https://mlperf.org/inference-results-0-7/.
Figures 1 and 2 below show the raw inference performance of the R740_T4x4 SUT for five of the six MLPerf benchmarks that were submitted. Each bar graph indicates the relative performance of inference operations that are completed in a set amount of time while bounded by latency constraints. The higher the bar, the higher the inference capability of the platform. Figure 3 compares Offline scenario performance to Server scenario performance, and Figure 4 compares Offline performance using the default and high accuracy targets.
Figure 1 Default accuracy performance for BERT, RNNT, and SSD offline and server scenarios
Figure 2 Default accuracy performance for DLRM and ResNet50 offline and server scenarios
Figure 3 Comparing offline to server scenario performance
Figure 4 Comparing offline default accuracy to high accuracy performance
Figure 5 Comparing NVIDIA Tesla T4 configurations’ offline performance using R740_T4x4 as a baseline
Figure 5 shows the relative offline performance per GPU card for Tesla T4 configurations from several submitter organizations.
Figure 6 Relative cost of GPU card configurations using R740_T4x4 as baseline and its BERT default Performance
Figure 6 shows the relative price of each GPU configuration using the cost of the Tesla T4 as the baseline, and the corresponding price/performance. The price/performance shown is an estimate to illustrate the “bang” for the money that is spent on the GPU configurations and should not be taken as the price/performance of the entire SUT. In this case, the shorter the bar, the better.
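As an arithmetic illustration of how such relative charts are derived (the prices and throughputs below are placeholders, not the actual GPU costs or submitted results), relative performance is each configuration’s throughput divided by the baseline’s, and price/performance is relative price divided by relative performance, so lower is better.

```python
# Hypothetical inputs: (relative price vs. T4 baseline, measured samples/s) per GPU configuration.
configs = {
    "4 x T4 (baseline)": (1.0, 1000),
    "GPU config A":      (3.0, 2500),   # placeholder values
    "GPU config B":      (6.0, 5200),   # placeholder values
}

base_price, base_perf = configs["4 x T4 (baseline)"]
for name, (rel_price, perf) in configs.items():
    rel_perf = perf / base_perf
    price_perf = rel_price / rel_perf          # lower is better
    print(f"{name}: relative performance {rel_perf:.2f}x, price/performance {price_perf:.2f}")
```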
As shown in the charts above, Dell EMC PowerEdge R740 performed well in a wide range of benchmark scenarios. The benchmarks that are discussed in this blog included diverse use cases. For instance, image dataset inferencing (Object Detection using SSD-Resnet34 model on COCO dataset), language processing (BERT model used on SQUAD v1.1 for machine comprehension of texts), and recommendation engine (DLRM model with Criteo 1 TB clicks dataset).
The Case for Elastic Stack on HCI
Thu, 11 Jun 2020 21:34:33 -0000
The Elastic Stack, also known as the “ELK Stack”, is a widely used collection of open source-based software products for search, analysis, and visualization of data. The Elastic Stack is useful for a wide range of applications including observability (logging, metrics, APM), security, and general-purpose enterprise search. Dell Technologies is an Elastic Technology Partner. This blog covers some basics of hyper-converged infrastructure (HCI), some Elastic Stack fundamentals, and the benefits of deploying the Elastic Stack on HCI.
HCI integrates the compute and storage resources from a cluster of servers using virtualization software for both CPU and disk resources to deliver flexible, scalable performance and capacity on demand. The breadth of server offerings in the Dell PowerEdge portfolio gives system architects many options for designing the right blend of compute and storage resources. Local resources from each server in the cluster are combined to create virtual pools of compute and storage with multiple performance tiers.
VxFlex is a Dell Technologies-developed, hypervisor-agnostic HCI platform integrated with high-performance, software-defined block storage. VxFlex OS is the software that creates a server- and IP-based SAN from direct-attached storage as an alternative to a traditional SAN infrastructure. Dell Technologies also offers the VxRail HCI platform for VMware-centric environments. VxRail is the only fully integrated, pre-configured, and pre-tested VMware HCI system powered with VMware vSAN. We show below why both HCI offerings are highly efficient and effective platforms for a truly scalable Elastic Stack deployment.
The Elastic Stack is a collection of four open-source projects: Elasticsearch, Logstash, Kibana, and Beats. Elasticsearch is an open-source, distributed, scalable, enterprise-grade search engine based on Lucene. The Elastic Stack is an end-to-end solution for searching, analyzing, and visualizing machine data from diverse source formats. With the Elastic Stack, organizations can collect data from across the enterprise, normalize the format, and enrich the data as desired. Platforms designed for scale-out performance running the Elastic Stack provide the ability to analyze and correlate data in near real time.
In March 2020, Dell Technologies validated the Elastic Stack running on our VxFlex family of HCI. We show below how the features of HCI provide distinct benefits and cost savings as an integrated solution for the Elastic Stack. The Elastic Stack, and Elasticsearch specifically, is designed for scale-out. Data nodes can be added to an Elasticsearch cluster to provide additional compute and storage resources. HCI also uses a scale-out deployment model that allows for easy, seamless horizontal scaling by adding nodes to the cluster(s). However, unlike bare-metal deployments, HCI also scales vertically by adding resources dynamically to Elasticsearch data nodes or any other Elastic Stack roles through virtualization. VxFlex admins use their preferred hypervisor with VxFlex OS; for VxRail, this is done with VMware ESX and vSAN. Additionally, the Elastic Stack can be deployed on Kubernetes clusters, so admins can also choose to leverage VMware Tanzu for Kubernetes management.
Virtualization has long been a strategy for achieving more efficient resource utilization and data center density. Elasticsearch data nodes tend to have average allocations of 8-16 cores and 64 GB of RAM. With the current ability to support up to 112 cores and 6 TB of RAM in a single 2RU Dell server, Elasticsearch is an attractive application for virtualization. Additionally, the Elastic Stack is significantly more CPU efficient than some alternative products, improving the cost-effectiveness of deploying Elastic with VMware or other virtualization technologies. We recommend sizing for 1 physical CPU to 1 virtual CPU (vCPU) for the Elasticsearch Hot Tier, along with the management and control plane resources. While this is admittedly similar to the VMware guidance for some comparable analytics platforms, these VMs tend to consume a significantly smaller CPU footprint per data node, and the Elastic Stack tends to take advantage of hyperthreading and resource overcommitment more effectively. While needs will vary by customer use case, our experience shows that the efficiencies in the Elastic Stack and Elastic data lifecycle management allow the Elasticsearch Warm Tier, Kibana, and proxy servers to be supported at 1 physical CPU to 2 vCPUs, and the Cold Tier at upwards of 4 vCPUs per physical CPU.
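To make those ratios concrete, here is a small sizing sketch that converts a hypothetical set of Elastic Stack roles into required physical cores using the 1:1, 1:2, and 1:4 pCPU-to-vCPU guidance above; the node counts and per-VM vCPU allocations are illustrative assumptions, not a Dell sizing tool.

```python
import math

# (number of VMs, vCPUs per VM, vCPUs per physical core) for each role -- illustrative only.
roles = {
    "hot data nodes":      (6, 16, 1),   # 1 pCPU : 1 vCPU
    "warm data nodes":     (4, 16, 2),   # 1 pCPU : 2 vCPU
    "cold data nodes":     (4, 8, 4),    # 1 pCPU : 4 vCPU
    "kibana + proxy":      (3, 8, 2),
    "management/control":  (3, 8, 1),
}

total_physical_cores = 0
for role, (vms, vcpus, vcpus_per_pcpu) in roles.items():
    cores = math.ceil(vms * vcpus / vcpus_per_pcpu)
    total_physical_cores += cores
    print(f"{role}: {cores} physical cores")

print("total physical cores required:", total_physical_cores)
```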
Because Elasticsearch tiers data on independent data nodes, versus multiple mount points on a single data node or indexer, the multiple types and classes of software-defined storage defined for independent HCI clusters can be easily leveraged across Elasticsearch data tiers to address data temperatures. It should be noted that Elastic does not currently recommend any non-block storage (S3, NFS, and so on) as a target for Elasticsearch, except as a target for Elasticsearch Snapshot and Restore. (It is possible to use S3 or NFS on Isilon or ECS, for example, as a retrieval target for Logstash, but that is a subject for a later blog.) For example, vSAN in VxRail provides Optane, NVMe, SSD, and HDD storage options. A user can deploy their primary Elastic Stack environment with its Hot Elasticsearch data nodes, Kibana, and the Elastic Stack management and control plane on an all-flash VxRail cluster, and then leverage a storage-dense hybrid vSAN cluster for Elasticsearch cold data.
Image 1. Example Logical Elastic Stack Architecture on HCI
Software-defined storage in HCI provides native enterprise capabilities including data encryption and data protection. Because VxFlex OS and vSAN provide HA via the software-defined storage, Replica Shards in Elastic are not required for data protection. Elastic will shard an index into 5 shards by default for processing, but Replica Shards for data protection are optional. Because we have data protection at the storage layer, we did not use Replicas in our validation of VxFlex, and we saw no impact on performance.
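For example, disabling replica shards while keeping a primary shard count for parallel processing is a single index setting; the sketch below issues it against a hypothetical local Elasticsearch endpoint using the standard index-creation REST API (index name, endpoint, and shard count are assumptions for illustration).

```python
import requests

# Hypothetical endpoint and index name; authentication omitted for brevity.
ES_URL = "http://localhost:9200"

resp = requests.put(
    f"{ES_URL}/logs-hci-demo",
    json={
        "settings": {
            "number_of_shards": 5,      # primary shards for parallel processing
            "number_of_replicas": 0,    # rely on vSAN/VxFlex OS for data protection
        }
    },
)
print(resp.status_code, resp.json())
```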
HCI enables customers to expand and efficiently manage the rapid adoption of an Elastic environment with dynamic resource expansion and improved infrastructure management tools. This allows for the rapid adoption of new use cases and new insights. HCI reduces datacenter sprawl and associated costs and inefficiencies related to the adoption of Elastic on bare metal. Ultimately HCI can deliver a turnkey experience that enables our customers to continuously innovate through insights derived by the Elastic Stack.
About the Author
Keith Quebodeaux is an Advisory Systems Engineer and analytics specialist with Dell Technologies Advanced Technology Solutions (ATS) organization. He has worked in various capacities with Dell Technologies for over 20 years including managed services, converged and hyper-converged infrastructure, and business applications and analytics. Keith is a graduate of the University of Oregon and Southern Methodist University.
Acknowledgments
I would like to gratefully acknowledge the input and assistance of Craig G., Rakshith V., and Chidambara S. for their input and review of this blog. I would like to especially thank Phil H., Principal Engineer with Dell Technologies whose detailed and extensive advice and assistance provided clarity and focus to my meandering evangelism. Your support was invaluable. As with anything the faults are all my own.
Dell Technologies and Deloitte DataPaaS: Data Platform as a Service
Tue, 26 May 2020 14:13:30 -0000
The Dell Technologies and Deloitte alliance combines Dell Technologies’ leading infrastructure, software, and services with Deloitte’s ability to deliver solutions, to drive digital transformation for our mutual clients.
DataPaaS enables enterprise deployment and adoption of Deloitte best practice data analytics platforms for use cases such as Financial Services, Cyber Security, Business Analytics, IT Operations and IoT.
Why choose Dell Technologies and Deloitte
Best-in-class capabilities: The Dell Technologies and Deloitte alliance draws on strengths from each organization with the goal of providing best-in-class technology solutions to customers.
Strong track record of success: For years Dell Technologies and Deloitte have successfully worked together to help solve enterprise customers’ most complex infrastructure, technology, cloud strategy, and business challenges.
Strategic approach: Successful engagements with a large, diverse group of customers have demonstrated the importance of taking a strategic approach to technology, solution design, integrations, and implementation.
Dell Technologies collaborates with Deloitte to deliver data analytics at scale, allowing customers to focus on outcomes, use cases and value
Keeping up with the demands of a growing data platform can be a real challenge. Getting data on-boarded quickly, deploying and scaling infrastructure, and managing users’ reporting and access demands becomes increasingly difficult. DataPaaS employs Deloitte’s best-practice D8 Methodology to orchestrate the deployment, management, and adoption of an organization-wide data platform.
Find out more
Contact us
Asia Pacific region
Stuart Hirst
Partner
Deloitte Risk Advisory Pty Ltd
shirst@deloitte.com.au
+612 487 471 729 @convergingdata
United States region
Todd Wingler
Business Development Executive
Deloitte Risk and Financial Advisory
twingler@deloitte.com
+1 480 232-8540 @twingler
EMEA region
Nicola Esposito
Partner
Deloitte Cyber
niesposito@deloitte.es
+34 918232431 @nicolaesposito
Chris Belsey
ISV Strategy & Alliances, Global Alliances
Dell Technologies
chris.belsey@dell.com
+44 75 0088 0803 @chrisbelseyemc
Byron Cheng
High Value Workloads Leader, Global Alliances
Dell Technologies
byron.cheng@dell.com
+1 949 241 6328 @byroncheng1
IIoT Analytics Design: How important is MOM (message-oriented middleware)?
Wed, 29 Apr 2020 22:20:16 -0000
Originally published on Aug 6, 2018 1:17:46 PM
Artificial intelligence (AI) is transforming the way businesses compete in today’s marketplace. Whether it’s improving business intelligence, streamlining supply chain or operational efficiencies, or creating new products, services, or capabilities for customers, AI should be a strategic component of any company’s digital transformation.
Deep neural networks have demonstrated astonishing abilities to identify objects, detect fraudulent behaviors, predict trends, recommend products, enable enhanced customer support through chatbots, convert voice to text and translate one language to another, and produce a whole host of other benefits for companies and researchers. They can categorize and summarize images, text, and audio recordings with human-level capability, but to do so they first need to be trained.
Deep learning, the process of training a neural network, can sometimes take days, weeks, or months, and effort and expertise are required to produce a neural network of sufficient quality to trust your business or research decisions on its recommendations. Most successful production systems go through many iterations of training, tuning, and testing during development. Distributed deep learning can speed up this process, reducing the total time to tune and test so that your data science team can develop the right model faster, but it requires a method to aggregate knowledge between systems.
There are several evolving methods for efficiently implementing distributed deep learning, and the way in which you distribute the training of neural networks depends on your technology environment. Whether your compute environment is container native, high performance computing (HPC), or Hadoop/Spark clusters for Big Data analytics, your time to insight can be accelerated by using distributed deep learning. In this article we are going to explain and compare systems that use a centralized or replicated parameter server approach, a peer-to-peer approach, and finally a hybrid of these two developed specifically for Hadoop distributed big data environments.
Container-native environments (for example, Kubernetes, Docker Swarm, and OpenShift) have become the standard for many DevOps environments, where rapid, in-production software updates are the norm and bursts of computation may be shifted to public clouds. Most deep learning frameworks support distributed deep learning for these types of environments using a parameter server-based model that allows multiple processes to look at training data simultaneously, while aggregating knowledge into a single, central model.
The process of performing parameter server-based training starts with specifying the number of workers (processes that will look at training data) and parameter servers (processes that will handle the aggregation of error reduction information, backpropagate those adjustments, and update the workers). Additional parameter servers can act as replicas for improved load balancing.
Parameter server model for distributed deep learning
Worker processes are given a mini-batch of training data to test and evaluate, and upon completion of that mini-batch, they report the differences (gradients) between produced and expected output back to the parameter server(s). The parameter server(s) then handle the training of the network and transmit copies of the updated model back to the workers to use in the next round.
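In TensorFlow, for example, the worker/parameter-server split is commonly described through a TF_CONFIG environment variable that each process reads at startup; the sketch below shows the shape of that configuration, with placeholder host names, ports, and task assignment.

```python
import json
import os

# Each process in the training job sets TF_CONFIG before starting TensorFlow.
# "cluster" lists every participant; "task" identifies this particular process.
cluster = {
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    "ps": ["ps0.example.com:2222"],
}

# This process plays the role of worker 0; a parameter server process would instead
# set {"type": "ps", "index": 0}.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "worker", "index": 0},
})

# A distribution strategy (for example, tf.distribute.experimental.ParameterServerStrategy)
# can then read this configuration and route gradient updates to the parameter server.
```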
This model is ideal for container native environments, where parameter server processes and worker processes can be naturally separated. Orchestration systems, such as Kubernetes, allow neural network models to be trained in container native environments using multiple hardware resources to improve training time. Additionally, many deep learning frameworks support parameter server-based distributed training, such as TensorFlow, PyTorch, Caffe2, and Cognitive Toolkit.
High performance computing (HPC) environments are generally built to support the execution of multi-node applications that are developed and executed using the single program, multiple data (SPMD) methodology, where data exchange is performed over high-bandwidth, low-latency networks, such as Mellanox InfiniBand and Intel OPA. These multi-node codes take advantage of these networks through the Message Passing Interface (MPI), which abstracts communications into send/receive and collective constructs.
Deep learning can be distributed with MPI using a communication pattern called Ring-AllReduce. In Ring-AllReduce each process is identical, unlike in the parameter-server model where processes are either workers or servers. The Horovod package by Uber (available for TensorFlow, Keras, and PyTorch) and the mpi_collectives contributions from Baidu (available in TensorFlow) use MPI Ring-AllReduce to exchange loss and gradient information between replicas of the neural network being trained. This peer-based approach means that all nodes in the solution are working to train the network, rather than some nodes acting solely as aggregators/distributors (as in the parameter server model). This can potentially lead to faster model convergence.
Ring-AllReduce model for distributed deep learning
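A minimal Horovod-with-Keras sketch of this Ring-AllReduce approach looks like the following; the model definition and dataset are placeholders, while the Horovod calls (init, rank/size, DistributedOptimizer, broadcast callback) follow the standard Horovod Keras API. Each replica of the script is launched per GPU via mpirun or horovodrun.

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()  # one replica of this script runs per GPU/process

# Pin each replica to its own GPU.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model and data.
model = tf.keras.Sequential(
    [tf.keras.layers.Dense(10, activation="softmax", input_shape=(784,))]
)
(x, y), _ = tf.keras.datasets.mnist.load_data()
x = x.reshape(-1, 784).astype("float32") / 255.0

# Scale the learning rate by the number of replicas and wrap the optimizer so that
# gradients are averaged across all replicas with Ring-AllReduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights from rank 0
model.fit(x, y, batch_size=128, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```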
The Dell EMC Ready Solutions for AI, Deep Learning with NVIDIA allows users to take advantage of high-bandwidth Mellanox InfiniBand EDR networking, fast Dell EMC Isilon storage, accelerated compute with NVIDIA V100 GPUs, and optimized TensorFlow, Keras, or PyTorch with Horovod frameworks to help produce insights faster.
Hadoop and other Big Data platforms achieve extremely high performance for distributed processing but are not designed to support long-running, stateful applications. Several approaches exist for executing distributed training under Apache Spark. Yahoo developed TensorFlowOnSpark, accomplishing the goal with an architecture that leverages Spark for scheduling TensorFlow operations and RDMA for direct tensor communication between servers.
BigDL is a distributed deep learning library for Apache Spark. Unlike Yahoo’s TensorFlowOnSpark, BigDL not only enables distributed training; it is designed from the ground up to work on Big Data systems. To enable efficient distributed training, BigDL takes a data-parallel approach with synchronous mini-batch SGD (Stochastic Gradient Descent). Training data is partitioned into RDD samples and distributed to each worker. Model training is done in an iterative process that first computes gradients locally on each worker, taking advantage of locally stored partitions of the training data and model to perform in-memory transformations. Then an AllReduce function schedules workers with tasks to calculate and update weights. Finally, a broadcast syncs the distributed copies of the model with the updated weights.
BigDL implementation of AllReduce functionality
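As a hedged sketch based on the publicly documented BigDL 0.x Python API (not the configuration shipped in the Dell EMC Ready Solution), the snippet below builds an RDD of Samples, defines a tiny placeholder network, and runs the synchronous mini-batch SGD Optimizer described above; the data, network, and hyperparameters are illustrative assumptions.

```python
import numpy as np
from pyspark import SparkContext
from bigdl.util.common import init_engine, create_spark_conf, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

# Spark context configured with BigDL properties, then the BigDL engine is initialized.
sc = SparkContext(conf=create_spark_conf().setAppName("bigdl-sketch"))
init_engine()

# Training data partitioned across the cluster as an RDD of BigDL Samples (placeholder data).
train_rdd = sc.parallelize(range(1000)).map(
    lambda _: Sample.from_ndarray(np.random.rand(10), np.array([1.0]))
)

# A tiny placeholder network.
model = Sequential().add(Linear(10, 32)).add(ReLU()).add(Linear(32, 2)).add(LogSoftMax())

# The Optimizer runs synchronous mini-batch SGD; gradients are aggregated with the
# AllReduce-style parameter synchronization described above.
optimizer = Optimizer(
    model=model,
    training_rdd=train_rdd,
    criterion=ClassNLLCriterion(),
    optim_method=SGD(learningrate=0.01),
    end_trigger=MaxEpoch(2),
    batch_size=128,
)
trained_model = optimizer.optimize()
```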
The Dell EMC Ready Solutions for AI, Machine Learning with Hadoop is configured to allow users to take advantage of the power of distributed deep learning with Intel BigDL and Apache Spark. It supports loading models and weights from other frameworks such as TensorFlow, Caffe, and Torch, which can then be leveraged for training or inferencing. BigDL is a great way for users to quickly begin training neural networks using Apache Spark, widely recognized for how simple it makes data processing.
One more note on Hadoop and Spark environments: The Intel team working on BigDL has built and compiled high-level pipeline APIs, built-in deep learning models, and reference use cases into the Intel Analytics Zoo library. Analytics Zoo is based on BigDL but helps make it even easier to use through these high-level pipeline APIs, which are designed to work with Spark DataFrames and include built-in models for things like object detection and image classification.
Regardless of whether your preferred server infrastructure is container native, HPC clusters, or Hadoop/Spark-enabled data lakes, distributed deep learning can help your data science team develop neural network models faster. Our Dell EMC Ready Solutions for Artificial Intelligence can work in any of these environments to help jumpstart your business’s AI journey. For more information on the Dell EMC Ready Solutions for Artificial Intelligence, go to dellemc.com/readyforai.
Lucas A. Wilson, Ph.D. is the Chief Data Scientist in Dell EMC's HPC & AI Innovation Lab. (Twitter: @lucasawilson)
Michael Bennett is a Senior Principal Engineer at Dell EMC working on Ready Solutions.