IIoT Analytics Design: How important is MOM (message-oriented middleware)?
Wed, 29 Apr 2020 22:20:16 -0000|
Read Time: 0 minutes
Originally published on Aug 6, 2018 1:17:46 PM
Artificial intelligence (AI) is transforming the way businesses compete in today’s marketplace. Whether it’s improving business intelligence, streamlining supply chain or operational efficiencies, or creating new products, services, or capabilities for customers, AI should be a strategic component of any company’s digital transformation.
Deep neural networks have demonstrated astonishing abilities to identify objects, detect fraudulent behaviors, predict trends, recommend products, enable enhanced customer support through chatbots, convert voice to text and translate one language to another, and produce a whole host of other benefits for companies and researchers. They can categorize and summarize images, text, and audio recordings with human-level capability, but to do so they first need to be trained.
Deep learning, the process of training a neural network, can sometimes take days, weeks, or months, and effort and expertise is required to produce a neural network of sufficient quality to trust your business or research decisions on its recommendations. Most successful production systems go through many iterations of training, tuning and testing during development. Distributed deep learning can speed up this process, reducing the total time to tune and test so that your data science team can develop the right model faster, but requires a method to allow aggregation of knowledge between systems.
There are several evolving methods for efficiently implementing distributed deep learning, and the way in which you distribute the training of neural networks depends on your technology environment. Whether your compute environment is container native, high performance computing (HPC), or Hadoop/Spark clusters for Big Data analytics, your time to insight can be accelerated by using distributed deep learning. In this article we are going to explain and compare systems that use a centralized or replicated parameter server approach, a peer-to-peer approach, and finally a hybrid of these two developed specifically for Hadoop distributed big data environments.
Distributed Deep Learning in Container Native Environments
Container native (e.g., Kubernetes, Docker Swarm, OpenShift, etc.) have become the standard for many DevOps environments, where rapid, in-production software updates are the norm and bursts of computation may be shifted to public clouds. Most deep learning frameworks support distributed deep learning for these types of environments using a parameter server-based model that allows multiple processes to look at training data simultaneously, while aggregating knowledge into a single, central model.
The process of performing parameter server-based training starts with specifying the number of workers (processes that will look at training data) and parameter servers (processes that will handle the aggregation of error reduction information, backpropagate those adjustments, and update the workers). Additional parameters servers can act as replicas for improved load balancing.
Worker processes are given a mini-batch of training data to test and evaluate, and upon completion of that mini-batch, report the differences (gradients) between produced and expected output back to the parameter server(s). The parameter server(s) will then handle the training of the network and transmitting copies of the updated model back to the workers to use in the next round.
This model is ideal for container native environments, where parameter server processes and worker processes can be naturally separated. Orchestration systems, such as Kubernetes, allow neural network models to be trained in container native environments using multiple hardware resources to improve training time. Additionally, many deep learning frameworks support parameter server-based distributed training, such as TensorFlow, PyTorch, Caffe2, and Cognitive Toolkit.
Distributed Deep Learning in HPC Environments
High performance computing (HPC) environments are generally built to support the execution of multi-node applications that are developed and executed using the single process, multiple data (SPMD) methodology, where data exchange is performed over high-bandwidth, low-latency networks, such as Mellanox InfiniBand and Intel OPA. These multi-node codes take advantage of these networks through the Message Passing Interface (MPI), which abstracts communications into send/receive and collective constructs.
Deep learning can be distributed with MPI using a communication pattern called Ring-AllReduce. In Ring-AllReduce each process is identical, unlike in the parameter-server model where processes are either workers or servers. The Horovod package by Uber (available for TensorFlow, Keras, and PyTorch) and the mpi_collectives contributions from Baidu (available in TensorFlow) use MPI Ring-AllReduce to exchange loss and gradient information between replicas of the neural network being trained. This peer-based approach means that all nodes in the solution are working to train the network, rather than some nodes acting solely as aggregators/distributors (as in the parameter server model). This can potentially lead to faster model convergence.
The Dell EMC Ready Solutions for AI, Deep Learning with NVIDIA allows users to take advantage of high-bandwidth Mellanox InfiniBand EDR networking, fast Dell EMC Isilon storage, accelerated compute with NVIDIA V100 GPUs, and optimized TensorFlow, Keras, or Pytorch with Horovod frameworks to help produce insights faster.
Distributed Deep Learning in Hadoop/Spark Environments
Hadoop and other Big Data platforms achieve extremely high performance for distributed processing but are not designed to support long running, stateful applications. Several approaches exist for executing distributed training under Apache Spark. Yahoo developed TensorFlowOnSpark, accomplishing the goal with an architecture that leveraged Spark for scheduling Tensorflow operations and RDMA for direct tensor communication between servers.
BigDL is a distributed deep learning library for Apache Spark. Unlike Yahoo’s TensorflowOnSpark, BigDL not only enables distributed training - it is designed from the ground up to work on Big Data systems. To enable efficient distributed training BigDL takes a data-parallel approach to training with synchronous mini-batch SGD (Stochastic Gradient Descent). Training data is partitioned into RDD samples and distributed to each worker. Model training is done in an iterative process that first computes gradients locally on each worker by taking advantage of locally stored partitions of the training data and model to perform in memory transformations. Then an AllReduce function schedules workers with tasks to calculate and update weights. Finally, a broadcast syncs the distributed copies of model with updated weights.
The Dell EMC Ready Solutions for AI, Machine Learning with Hadoop is configured to allow users to take advantage of the power of distributed deep learning with Intel BigDL and Apache Spark. It supports loading models and weights from other frameworks such as Tensorflow, Caffe and Torch to then be leveraged for training or inferencing. BigDL is a great way for users to quickly begin training neural networks using Apache Spark, widely recognized for how simple it makes data processing.
One more note on Hadoop and Spark environments: The Intel team working on BigDL has built and compiled high-level pipeline APIs, built-in deep learning models, and reference use cases into the Intel Analytics Zoo library. Analytics Zoo is based on BigDL but helps make it even easier to use through these high-level pipeline APIs designed to work with Spark Dataframes and built in models for things like object detection and image classification.
Regardless of whether you preferred server infrastructure is container native, HPC clusters, or Hadoop/Spark-enabled data lakes, distributed deep learning can help your data science team develop neural network models faster. Our Dell EMC Ready Solutions for Artificial Intelligence can work in any of these environments to help jumpstart your business’s AI journey. For more information on the Dell EMC Ready Solutions for Artificial Intelligence, go to dellemc.com/readyforai.
Lucas A. Wilson, Ph.D. is the Chief Data Scientist in Dell EMC's HPC & AI Innovation Lab. (Twitter: @lucasawilson)
Michael Bennett is a Senior Principal Engineer at Dell EMC working on Ready Solutions.
Related Blog Posts
The Case for Elastic Stack on HCI
Thu, 11 Jun 2020 21:34:33 -0000|
Read Time: 0 minutes
The Elastic Stack, also known as the “ELK Stack”, is a widely used, collection of software products based on open source used for search, analysis, and visualization of data. The Elastic Stack is useful for a wide range of applications including observability (logging, metrics, APM), security, and general-purpose enterprise search. Dell Technologies is an Elastic Technology Partner1 This blog covers some basics of hyper-converged infrastructure (HCI), some Elastic Stack fundamentals, and the benefits of deploying Elastic Stack on HCI.
HCI integrates the compute and storage resources from a cluster of servers using virtualization software for both CPU and disk resources to deliver flexible, scalable performance and capacity on demand. The breadth of server offerings in the Dell PowerEdge portfolio gives system architects many options for designing the right blend of compute and storage resources. Local resources from each server in the cluster are combined to create virtual pools of compute and storage with multiple performance tiers.
VxFlex is a Dell Technologies developed, hypervisor agnostic, HCI platform integrated with high-performance, software-defined block storage. VxFlex OS is the software that creates a server and IP-based SAN from direct-attached storage as an alternative to a traditional SAN infrastructure. Dell Technologies also offers the VxRail HCI platform for VMware-centric environments. VxRail is the only fully integrated, pre-configured, and pre-tested VMware HCI system powered with VMware vSAN. We show below why both HCI offerings are highly efficient and effective platforms for a truly scalable Elastic Stack deployment.
Elastic Stack Overview
The Elastic Stack is a collection of four open-source projects: Elasticsearch, Logstash, Kibana, and Beats. Elasticsearch is an open-source, distributed, scalable, enterprise-grade search engine based on Lucene. Elasticsearch is an end-to-end solution for searching, analyzing, and visualizing machine data from diverse source formats. With the Elastic Stack, organizations can collect data from across the enterprise, normalize the format, and enrich the data as desired. Platforms designed for scale-out performance running the Elastic Stack provides the ability to analyze and correlate data in near real-time.
Elastic Stack on HCI
In March 2020, Dell Technologies validated the Elastic Stack running on our VxFlex family of HCI2. It will be shown how the features of HCI provide distinct benefits and cost savings as an integrated solution for the Elastic Stack. The Elastic Stack, and Elasticsearch specifically, is designed for scale-out. Data nodes can be added to an Elasticsearch cluster to provide additional compute and storage resources. HCI also uses a scale-out deployment model that allows for easy, seamless scalability horizontally by adding additional nodes to the cluster(s). However, unlike bare-metal deployments, HCI also scales vertically by adding resources dynamically to Elasticsearch data nodes or any other Elastic Stack roles through virtualization. VxFlex admins use their preferred hypervisor and VxFLEX OS and for VxRail it is done with VMware ESX and vSAN. Additionally, the Elastic Stack can be deployed on Kubernetes clusters, therefor admins can also choose to leverage VMware Tanzu for Kubernetes management.
Virtualization has long been a strategy for achieving more efficient resource utilization and data center density. Elasticsearch data nodes tend to have average allocations of 8-16 cores and 64GB of RAM. With the current ability to support up to 112 cores and 6TB of RAM in a single 2RU Dell server, Elasticsearch is an attractive application for virtualization. Additionally, the Elastic Stack is also significantly more CPU efficient than some alternative products improving the cost-effectiveness of deploying Elastic with VMware or other virtualization technologies. We would recommend sizing for 1 physical CPU to 1 virtual CPU (vCPU) for Elasticsearch Hot Tier along with the management and control plane resources. While this is admittedly like the VMware guidance for some similar analytics platforms, these VMs tend to consume a significantly smaller CPU footprint per data node. The Elastic Stack tends to take advantage of hyperthreading and resource overcommitment more effectively. While needs will vary by customer use case, our experience shows the efficiencies in the Elastic Stack and Elastic data lifecycle management allow the Elasticsearch Warm Tier, Kibana, and Proxy servers can be supported by 1 physical CPU to 2 vCPUs and the Cold Tier can be upwards of 4 vCPUs to a physical CPU.
Because Elasticsearch tiers data on independent data nodes versus multiple mount points on a single data node or indexer, the multiple types and classes of software-defined storage defined for independent HCI clusters can be easily leveraged between Elasticsearch clusters to address data temperatures. It should be noted that currently Elastic does not currently recommend any non-block storage (S3, NFS, etc.) as a target for Elasticsearch except as a target for Elasticsearch Snapshot and Restore. (It is possible to use S3 or NFS on Isilon or ECS as an example as a retrieval target for Logstash, but that is a subject for a later blog.) For example, vSAN in VxRail provides Optane, NVMe, SSD, and HDD storage options. A user can deploy their primary Elastic Stack environment with its Hot Elasticsearch data nodes, Kibana, and the Elastic Stack management and control plane on an all-flash VxRail cluster, and then leverage a storage dense hybrid vSAN cluster for Elasticsearch cold data.
Image 1. Example Logical Elastic Stack Architecture on HCI
Software-defined storage in HCI provides native enterprise capabilities including data encryption and data protection. Because FlexOS and vSAN provide HA via the software-defined storage, Replica Shards in Elastic for data protection are not required. Elastic will shard an index into 5 shards by default for processing, but Replica Shards for data protection are optional. Because we have data protection at the storage layer we did not use Replicas in our validation of VxFlex and we saw no impact on performance.
HCI enables customers to expand and efficiently manage the rapid adoption of an Elastic environment with dynamic resource expansion and improved infrastructure management tools. This allows for the rapid adoption of new use cases and new insights. HCI reduces datacenter sprawl and associated costs and inefficiencies related to the adoption of Elastic on bare metal. Ultimately HCI can deliver a turnkey experience that enables our customers to continuously innovate through insights derived by the Elastic Stack.
- Elastic Technology and Cloud Partners - https://www.elastic.co/about/partners/technology
- Elastic Stack Solution on Dell EMC VxFlex Family - https://www.dellemc.com/en-in/collaterals/unauth/white-papers/products/converged-infrastructure/elastic-on-vxflex.pdf
- Elasticsearch Sizing and Capacity Planning Webinar - https://www.elastic.co/webinars/elasticsearch-sizing-and-capacity-planning
About the Author
Keith Quebodeaux is an Advisory Systems Engineer and analytics specialist with Dell Technologies Advanced Technology Solutions (ATS) organization. He has worked in various capacities with Dell Technologies for over 20 years including managed services, converged and hyper-converged infrastructure, and business applications and analytics. Keith is a graduate of the University of Oregon and Southern Methodist University.
I would like to gratefully acknowledge the input and assistance of Craig G., Rakshith V., and Chidambara S. for their input and review of this blog. I would like to especially thank Phil H., Principal Engineer with Dell Technologies whose detailed and extensive advice and assistance provided clarity and focus to my meandering evangelism. Your support was invaluable. As with anything the faults are all my own.
Dell Technologies and Deloitte DataPaaS: Data Platform as a Service
Tue, 26 May 2020 14:13:30 -0000|
Read Time: 0 minutes
The Dell Technologies and Deloitte alliance combines Dell Technologies leading infrastructure software, and services with Deloitte’s ability to deliver solutions, to drive digital transformation for our mutual clients.
DataPaaS enables enterprise deployment and adoption of Deloitte best practice data analytics platforms for use cases such as Financial Services, Cyber Security, Business Analytics, IT Operations and IoT.
Why choose Dell Technologies and Deloitte
Best-in-class capabilities: The Dell Technologies and Deloitte alliance draws on strengths from each organization with the goal of providing best-in-class technology solutions to customers.
Strong track record of success: For years Dell Technologies and Deloitte have successfully worked together to help solve enterprise customers‘ most complex infrastructure, technology, cloud strategy, and business challenges.
Strategic approach: Successful engagements with a large, diverse group of customers have demonstrated the importance of taking a strategic approach to technology, solution design, integrations, and implementation.
Dell Technologies collaborates with Deloitte to deliver data analytics at scale, allowing customers to focus on outcomes, use cases and value
Keeping up with the demands of a growing data platform can be a real challenge. Getting data on-boarded quickly, deploying and scaling infrastructure, and managing users reporting and access demands becomes increasingly difficult. DataPaaS employs Deloitte’s best practise D8 Methodology to orchestrate the deployment, management and adoption of an organisation wide data platform.
- “Splunk as a Platform” enabling data reuse and analytics across the business
- On-premise, Cloud or Hybrid – route data to the most cost-effective option or depending on Information Governance policies
- DataPaaS delivers a catalog of use-cases that can be deployed in minutes…not days or weeks
- Free up and retain specialist resources - move from troubleshooting and management of the platform, to getting value out of the data in the platform
- True DevOps, using CICD, spin up and destroy full environments as needed
- Enforce and maintain consistent configuration, continuously synced enabling simple recovery
- Data Acquisition Channel for rapid and automated data onboarding and routing
- DataPaaS enables Data DevOps; 5x faster, at 50% of the cost with 100% control and 8x the return on investment
Find out more
- Dell Technologies and Deloitte DataPaaS Data Platform as a Service
- Dell technologies and Deloitte DataPaaS Pandemic Support for Government Crisis Management
- Dell Technologies and Deloitte DataPaaS Risk Management Analytics for Real-Time Decision Making
Asia Pacific region
Deloitte Risk Advisory Pty Ltd
+612 487 471 729
United States region
Business Development Executive
Deloitte Risk and Financial Advisory
+1 480 232-8540
ISV Strategy & Alliances, Global Alliances
+44 75 0088 0803
High Value Workloads Leader, Global Alliances
+1 949 241 6328