Dell EMC Solution Insight for i-Abra
Thu, 14 Jan 2021 23:54:16 -0000
Developing and deploying deep learning models doesn’t have to be so complicated
Dell and Intel® recently evaluated a solution that greatly simplifies the process of developing and deploying deep learning models. The i-Abra AI system we tested automatically builds an optimized inference classifier, trained with a customer-supplied data set of labeled images and deployable on FPGA accelerators, all in a single workflow. In this overview, we describe how the i-Abra Pathworks software works and compare it with traditional development and deployment practices. Our engineers worked with Pathworks in a lab environment, and we share the insights they gained from that comparison in this article.
Deep learning applications and challenges
The promise of creating applications that bring to light valuable insights from data is real. The data science community has been very successful over the last 10 years in improving the process of extracting information from data using deep learning (DL), the practice of training artificial neural networks to solve specific analytics and business-driven problems. In fact, deep learning has become the de facto approach for developing highly accurate models for image classification, object detection, real-time video analytics, and many other problems. These accuracy advancements have, however, increased the complexity of model development, training, and deployment.
There are many factors that contribute to the overall complexity of developing and using DL models. Apart from the work of data preparation, the deep learning life cycle consists of two phases: 1) the training phase, where a neural network architecture is proposed by the data scientists and trained to perform a task, and 2) the deployment phase, where the trained neural network is readied for production use. The deployment target may be a user-facing software application or an embedded backend process, often requiring a different hardware system.
During the training stage, the data science team can spend days, weeks, or months painstakingly selecting, crafting, and customizing a neural network architecture that produces the most accurate model for performing the specified task. It is a process that requires data scientists with a combination of training and experience, drawn from a shockingly small talent pool. This is an expensive process in time, money, and personnel that impacts TCO for many systems.
But remember, the training phase is only the first half of the story. To get the trained model deployed, it must go through an optimization process for the target user system. A DL model must be quantized (converted to reduced-precision mathematical operations), pruned (unnecessary neurons and connections are eliminated), and fused (multiple layers, as well as neuron weights and biases, are frozen and merged together). This preparation requires trial and error, together with testing, that can take as much time as the training stage. Many trained models never make it into production because the hurdles of getting high-quality DL models deployed prove insurmountable, and the AI applications that would benefit customers and employees never materialize. In short, deployment is an additional expensive process in time, money, and resources, with the risk of never reaching production usage. This complexity greatly impacts TCO decisions and can threaten the value of the original investment in training.
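To make the quantization and pruning steps above concrete, here is a minimal, self-contained sketch in plain Python. It is illustrative only, not i-Abra's or any vendor's actual toolchain, and the threshold and weight values are made up for the example:

```python
# Sketch of two deployment-stage optimizations: post-training quantization
# (float -> int8) and magnitude pruning. Real toolchains do far more.

def quantize_int8(weights):
    """Map float weights onto the signed 8-bit range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate float weights from the int8 representation."""
    return [q * scale for q in q_weights]

def prune(weights, threshold=0.05):
    """Zero out low-magnitude weights; they contribute little to the output."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

weights = [0.42, -1.27, 0.013, 0.88, -0.004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Quantization introduces a small, bounded rounding error per weight.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err <= scale / 2 + 1e-9

pruned = prune(weights)  # the two tiny weights become exact zeros
```

The point of the sketch is the trade-off: the int8 representation is far cheaper to store and compute, at the cost of a small, bounded approximation error that must be validated against accuracy targets before production use.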
How i-Abra simplifies the deployment of neural network-based models
There is another way to gain DL productivity while minimizing both the TCO and the risk of deployment delays. i-Abra Pathworks is a productivity tool for data scientists that integrates and automates both the training and deployment phases into a single workflow. Pathworks eliminates several of the pain points noted above in the traditional neural network-based data science life cycle. First, instead of relying on a data scientist to select, fine-tune, and craft a neural network architecture during training, i-Abra uses an evolutionary architecture design approach that automatically crafts a custom neural network architecture for the labeled training data being used. That means that your highly trained data scientists can spend more time solving problems, with minimal time lost in the minutiae of exploring every hyperparameter combination while building your next AI model.
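The general idea behind evolutionary architecture design can be sketched in a few lines of Python. This is emphatically not i-Abra's algorithm; the fitness function and mutation operators below are toy assumptions that only illustrate the pattern of mutating candidate architectures and keeping the fittest:

```python
# Toy evolutionary architecture search: architectures are lists of layer
# widths; mutate candidates, keep the best (elitism), repeat.
import random

random.seed(0)

def fitness(arch):
    # Hypothetical score: prefer ~96 total units spread over few layers.
    return -abs(sum(arch) - 96) - 2 * len(arch)

def mutate(arch):
    arch = list(arch)
    op = random.choice(["grow", "shrink", "resize"])
    if op == "grow":
        arch.append(random.choice([8, 16, 32]))
    elif op == "shrink" and len(arch) > 1:
        arch.pop(random.randrange(len(arch)))
    else:
        arch[random.randrange(len(arch))] = random.choice([8, 16, 32, 64])
    return arch

# Evolve a population of candidate architectures over a few generations.
population = [[16, 16] for _ in range(8)]
for _ in range(30):
    population.sort(key=fitness, reverse=True)
    survivors = population[:4]                                   # selection
    population = survivors + [mutate(random.choice(survivors)) for _ in range(4)]

best = max(population, key=fitness)
```

Because the best candidates survive each generation, the top fitness can never decrease; in a real system the fitness function would be validation accuracy (and, in Pathworks' case, efficiency on the FPGA target), which is what makes each evaluation expensive and automation valuable.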
Additionally, i-Abra Pathworks performs this evolutionary architecture discovery and training on efficient Intel Field Programmable Gate Array (FPGA) accelerators, the same hardware acceleration that i-Abra deploys for boosting the performance of models in production. This full ecosystem approach, employing the same FPGA technology for both the training and deployment stages, means that the model created during the training stage is the same model used in production deployment. There is no quantization, no pruning, and no fusing to be performed. The model does not need to be converted, translated, or reevaluated since the training optimization produced the most efficient design for the data. You simply push that efficient model directly into production, because the model you deploy is the exact same model that you trained.
With i-Abra Pathworks productivity tools, users provide the labeled images to the process and the result is a deployable model. This change from traditional training and deployment approaches dramatically reduces time, money, and required resources. The result is improved TCO, lower risk, and more successful projects. It also reduces ongoing system complexity, improves overall financial performance, and increases the technical agility to deploy newer models as system needs change over time.
The very best AI capabilities result from tightly coupled software and hardware, achieving optimal TCO, performance, power, and maintenance for the deployed AI system. For many years, i-Abra and Intel have collaborated through joint investment in the core technology to tightly couple i-Abra software with Intel architecture. The i-Abra Pathworks training environment relies on an integrated mesh of multiple Intel Xeon processors and Intel FPGAs for fast trained-model generation. The i-Abra Synapse inference environment runs the deployed AI models on Intel FPGA, Xeon, and Atom processors, creating many optimized deployment options. i-Abra GraphDB builds on Pathworks and Synapse for an additional level of AI insight, combining multiple i-Abra Synapse inference classification results with additional metadata in a reasoning neural network built on Intel architecture. Collectively, the i-Abra products are tuned to Intel architecture, bringing unique turnkey AI capability to use cases in multiple markets. Dell, i-Abra, and Intel have further collaborated to integrate the i-Abra training capabilities into Dell production-class training appliances and server technologies. Dell is also creating tightly integrated, end-to-end i-Abra based AI edge and network solutions across multiple markets on Intel architecture.
Dell EMC Model Training Cluster for i-Abra
The Dell EMC HPC & AI Innovation Lab has worked with i-Abra and Intel to develop a model training cluster for the i-Abra ecosystem, ensuring that the i-Abra Pathworks software stack functions in an efficient and performant manner while providing the robustness and reliability of the Dell EMC PowerEdge server and Dell EMC PowerSwitch network switch portfolio.
The model training cluster for i-Abra includes eight (8) Dell EMC PowerEdge R740 servers, each with two (2) second-generation Intel® Xeon® Scalable Gold 6248 processors and 768GB of high-speed DDR4 RAM. Each server also contains two (2) Intel® Programmable Acceleration Cards (PACs) with Intel Arria® 10 GX FPGA, as well as two (2) high capacity hard drives for storing and ingesting labeled training data. The PowerEdge servers are all connected with 10Gbps networking to ensure fast movement of training data and model information exchange during training.
The advantages of deep learning for extracting useful insights from many types of data have been proven across many industries and use cases. One of the most significant roadblocks for organizations that want to start or expand their use of deep learning models is the complexity of moving from the training stage to the deployment stage. The i-Abra Pathworks solution simplifies that transition by optimizing models for deployment during training, rather than using the traditional two-stage process of training followed by deployment optimization. The result is a more streamlined workflow with less room for error, more models successfully deployed, and a simpler long-term management experience. Dell and Intel have partnered with i-Abra to develop recommendations for assembling a model training cluster for the i-Abra ecosystem. Our results show that the i-Abra Pathworks software stack functions in an efficient and performant manner while providing the robustness and reliability of the Dell EMC PowerEdge server and Dell EMC PowerSwitch network switch portfolio.
Why Dell Technologies?
From the core to the edge to the cloud, Dell Technologies solutions are the proven choice for the modern datacenter. Dell Technologies provides the most secure, innovative, and scalable solutions, enabling our customers to solve today's challenges wherever they are in their digital transformation. And that's why Dell Technologies solutions are used together at scale, across the globe, more than any other platform. Dell EMC also partners with the largest, most innovative players in information technology (IT) to provide jointly engineered services and solutions that speed deployment, improve performance, and maximize return on investment (ROI).
Related Blog Posts
Omnia: Open-source deployment of high-performance clusters to run simulation, AI, and data analytics workloads
Tue, 02 Feb 2021 16:07:10 -0000
High Performance Computing (HPC), in which clusters of machines work together as one supercomputer, is changing the way we live and how we work. These clusters of CPU, memory, accelerators, and other resources help us forecast the weather and understand climate change, understand diseases, design new drugs and therapies, develop safe cars and planes, improve solar panels, and even simulate life and the evolution of the universe itself. The cluster architecture model that makes this compute-intensive research possible is also well suited for high performance data analytics (HPDA) and developing machine learning models. With the Big Data era in full swing and the Artificial Intelligence (AI) gold rush underway, we have seen marketing teams with their own Hadoop clusters attempting to transition to HPDA and finance teams managing their own GPU farms. Everyone has the same goals: to gain new, better insights faster by using HPDA and by developing advanced machine learning models using techniques such as deep learning and reinforcement learning. Today, everyone has a use for their own high-performance computing cluster. It’s the age of the clusters!
Today's AI-driven IT Headache: Siloed Clusters and Cluster Sprawl
Unfortunately, cluster sprawl has taken over our data centers and consumes inordinate amounts of IT resources. Large research organizations and businesses have a cluster for this and a cluster for that. Perhaps each group has a little “sandbox” cluster, or each type of workload has a different cluster. Many of these clusters look remarkably similar, but they each need a dedicated system administrator (or set of administrators), have different authorization credentials, different operating models, and sit in different racks in your data center. What if there was a way to bring them all together?
That’s why Dell Technologies started the Omnia project.
The Omnia Project
The Omnia project is an open-source initiative with a simple aim: to make consolidated infrastructure easy and painless to deploy using open-source and freely available software. By bringing the best open-source software tools together with the domain expertise of Dell Technologies' HPC & AI Innovation Lab, HPC & AI Centers of Excellence, and the broader HPC community, Omnia gives customers decades of accumulated expertise in deploying state-of-the-art systems for HPC, AI, and data analytics – all in a set of easily executable Ansible playbooks. In a single day, a stack of servers, networking switches, and storage arrays can be transformed into one consolidated cluster for running all your HPC, AI, and data analytics workloads.
Simple by Design
Omnia’s design philosophy is simplicity. We look for the best, most straightforward approach to solving each task.
- Need to run the Slurm workload manager? Omnia assembles Ansible plays that build the correct RPM files and deploy them, making sure all the required dependencies are installed and functional.
- Need to run the Kubernetes container orchestrator? Omnia takes advantage of community supported package repositories for Linux (currently CentOS) and automates all the steps for creating a functional multi-node Kubernetes cluster.
- Need a multi-user, interactive Python/R/Julia development environment? Omnia takes advantage of best-of-breed deployments for Kubernetes through Helm and OperatorHub, provides configuration files for dynamic and persistent storage, points to optimized containers in DockerHub, Nvidia GPU Cloud (NGC), or other container registries for unaccelerated and accelerated workloads, and automatically deploys machine learning platforms like Kubeflow.
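The Ansible plays that power the tasks above share one key property: they are idempotent, checking current state and acting only when the desired state is missing. Here is a tiny Python sketch of that pattern (illustrative only; Omnia's real plays live in its GitHub repository, and the `ensure` helper and cluster keys are invented for the example):

```python
# Idempotent configuration task, the core pattern behind Ansible-style plays:
# running the same task twice must not repeat the change.

def ensure(state, key, value, actions):
    """Apply a change only when the observed state differs from the desired one."""
    if state.get(key) != value:
        state[key] = value
        actions.append(f"changed: {key} -> {value}")
    else:
        actions.append(f"ok: {key}")
    return state

cluster = {"slurm": "absent", "kubernetes": "v1.20"}
log = []
ensure(cluster, "slurm", "installed", log)   # first run: makes the change
ensure(cluster, "slurm", "installed", log)   # second run: no-op, reports "ok"
```

Idempotence is what lets a playbook be re-run safely against a half-configured cluster, which is why a full stack of servers can converge to a working state in a single day.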
Before we go through the process of building something from scratch, we will make sure there isn’t already a community actively maintaining that toolset. We’d rather leverage others' great work than reinvent the wheel.
Inclusive by Nature
Omnia’s contribution philosophy is inclusivity. From code and documentation updates to feature requests and bug reports, every user’s contributions are welcomed with open arms. We provide an open forum for conversations about feature ideas and potential implementation solutions, making use of issue threads on GitHub. And as the project grows and expands, we expect the technical governance committee to grow to include the top contributors and stakeholders from the community.
Omnia is just getting started. Right now, we can easily deploy Slurm and Kubernetes clusters from a stack of pre-provisioned, pre-networked servers, but our aim is higher than that. We are currently adding capabilities for performing bare-metal provisioning and supporting new and varying types of accelerators. In the future, we want to collect information from the iDRAC out-of-band management system on Dell EMC PowerEdge servers, configure Dell EMC PowerSwitch Ethernet switches, and much more.
What does the future hold? While we have near-term plans for additional feature integrations, we are looking to partner with the community to define and develop future integrations. Omnia will grow and develop based on community feedback and your contributions. In the end, the Omnia project will install and configure not only the open-source software we at Dell Technologies think is important, but also the software that you, the community, want! We can't think of a better way for our customers to easily set up clusters for HPC, AI, and HPDA workloads, all while leveraging the expertise of the entire Dell Technologies HPC community.
Omnia is available today on GitHub at https://github.com/dellhpc/omnia. Join the community now and help guide the design and development of the next generation of open-source consolidated cluster deployment tools!
Removing the barriers to hybrid-cloud flexibility for data analytics
Wed, 13 Jan 2021 20:54:36 -0000
The fundamental tasks of collecting data, storing data, and providing processing power for data analytics are getting more difficult. Increasing data volumes, along with the number of remote data sources and the rapidly evolving options for extracting valuable information, make forecasting needs challenging and investment risky. IT organizations need the ability to quickly provision resources and incrementally scale both compute and storage on demand as needs develop. The three largest hyper-scale cloud providers all offer a wide range of infrastructure, platform, and analytics "as-a-service" options, but each requires vastly different skill sets, security models, and connectivity investments. Organizations interested in hybrid cloud flexibility for data analytics are forced to choose a single cloud partner or add significant IT complexity by managing multiple options with no common toolset. In this Solutions Insight, we describe how the Robin Cloud-Native Platform (CNP), hosted onsite on Dell EMC PowerEdge servers, provides application and infrastructure topology awareness to streamline the provisioning and life cycle management of your data applications with true hybrid cloud flexibility.
Providing a robust self-service experience
Data analytics professionals want easy access to internally managed provisioning of resources for experimentation and development without complex interactions with IT. Many of these professionals have experience with self-service portals that work for a single cloud service but have not yet had any hybrid cloud flexibility. Robin provides a rich out-of-the-box portal capability that IT can offer to developers, data engineers, and data scientists. Data professionals save valuable development time at each stage of the application lifecycle by leveraging the automation framework of Robin. IT gets a fully functional automation framework for hosting many popular enterprise applications on the Robin Platform. The Robin platform comes out-of-the-box with cluster-aware application bundles including relational databases, big data, NoSQL, and several AI/ML tools.
Robin leverages cloud-native technologies such as Kubernetes and Docker to modernize the management of your data analytics infrastructure. The Robin Kubernetes-based architecture gives you complete freedom and offers a consistent self-service capability to provision and move workloads across private and/or public clouds. Native integration between Kubernetes, storage, network, and the application management layer enables full automation of both cluster and application management, with all the advantages of a true hybrid cloud experience. Robin has a built-in capability to create managed application snapshots that enable cloning, backup, and migration of applications between on-prem and cloud, or between datacenters within an enterprise. Robin fully automates the end-to-end cluster provisioning process for the most challenging platform deployments, including Cloudera, Apache Spark, Kafka, TensorFlow, PyTorch, Kubeflow, scikit-learn, Caffe, Torch, and even custom application configurations.
Organizations that adopt the Robin platform benefit from accelerated deployment and simplified management of complex applications that can be provisioned by end-users through a familiar portal experience and true hybrid cloud flexibility.
Moving from self-service sandboxes to enterprise scale
We described above how the Robin platform benefits both data and IT professionals who want a full-featured self-service data analytics capability with true hybrid cloud operations, by layering additional platform awareness and automation onto cloud-native technologies such as Kubernetes and Docker. Organizations can start with small deployments and, as applications grow, add more resources. Robin can be deployed on the full range of Dell EMC PowerEdge servers with a custom mix of memory, storage, and accelerator options, making it easy to scale out by adding servers with the right capabilities to match changing resource demands. The Robin management console provides a single interface to expand existing deployments and/or add new clusters. Consolidating multiple workloads under Robin management can also improve hardware utilization without compromising SLAs or QoS. The Robin platform provides multi-tenancy with fine-grained Role-Based Access Control (RBAC), enabling safe resource sharing on fewer clusters. Applications can be incubated on multi-tenant, mixed-application clusters and then easily migrated to production-class clusters hosting one or more mission-critical applications, using Robin backup and restore capability across clusters and/or clouds.
While open-source Kubernetes has become the de facto platform for deploying on-demand applications, organizations that need multi-cluster production deployments still require additional investment in service orchestration that can automate and manage day-0 and day-n lifecycle operations at scale. The Robin Automation Platform combines simplicity, usability, performance, and scale with a modern UI to provide bare-metal, cluster, and application-as-a-service for both infrastructure and service orchestration. With Robin Bare Metal-as-a-Service, hundreds of thousands of bare-metal servers can be provisioned with specific BIOS, firmware, OS, and other software packages or configurations, depending on the needs of the application. With Robin, it is equally easy to manage upgrades of firmware, OS, and application software across a wide array of PowerEdge server options and container platforms.
Automating day-n operations for stateful applications
Several priorities are driving interest in running stateful applications on Kubernetes. These include operational consistency, extending agility of containerization to data, faster collaboration, and the need for simplifying the delivery of data services. Robin solves the storage and network persistency challenges in Kubernetes to enable its use in the provisioning, management, high availability and fault tolerance of mission-critical stateful applications.
Creating a persistent storage volume for a single container is becoming a routine operation. However, provisioning storage for complex stateful applications that span multiple pods and services requires automation of cluster resources coordinated with storage management. Managing the changing requirements of stateful applications on a day-to-day basis requires data and storage management services such as snapshotting, backup, and cloning. Traditionally, this capability has resided only on high-end storage systems managed by IT storage administrator teams. To provide true self-service capabilities to data professionals, organizations need a simple storage and data management solution for Kubernetes that hides all of these complexities and provides simple, developer-friendly commands that can easily be incorporated into development and production workflows.
With Robin CNP, analytics and DevOps teams can be self-sufficient while managing complex stateful applications without requiring specific storage expertise. Data management is supported with a Robin managed CSI-compliant block storage access layer with bare-metal performance. Storage management seamlessly integrates with Kubernetes-native administrative tooling such as Kubectl, Helm Charts, and Operators through standard APIs.
Robin CNP simplifies storage operations such as provisioning storage, ensuring data availability, maintaining low latency I/O performance, and detecting and repairing disk and I/O errors. Robin CNP also provides simple commands for data management operations such as backup/recovery, snapshots/rollback, and cloning of entire applications including data, metadata, and application configuration.
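The key idea behind application-level snapshots, as opposed to per-volume snapshots, is that data, metadata, and configuration are captured together so that a rollback or clone restores a consistent whole. The following Python sketch illustrates that semantic only; the `App` class and its fields are hypothetical and do not represent Robin's API:

```python
# Sketch of application-consistent snapshot / rollback / clone semantics:
# everything that defines the application is captured at one point in time.
import copy

class App:
    def __init__(self):
        self.data = {"orders": [1, 2, 3]}     # application data
        self.config = {"replicas": 3}         # application configuration
        self._snapshots = {}

    def snapshot(self, name):
        # Deep-copy data and config together: one consistent point in time.
        self._snapshots[name] = copy.deepcopy(
            {"data": self.data, "config": self.config})

    def rollback(self, name):
        state = copy.deepcopy(self._snapshots[name])
        self.data, self.config = state["data"], state["config"]

    def clone(self, name):
        # A clone is a new application instance started from a snapshot.
        new = App()
        state = copy.deepcopy(self._snapshots[name])
        new.data, new.config = state["data"], state["config"]
        return new

app = App()
app.snapshot("pre-upgrade")
app.data["orders"].append(4)       # day-n changes after the snapshot
app.config["replicas"] = 5
clone = app.clone("pre-upgrade")   # dev/test copy from the earlier point in time
app.rollback("pre-upgrade")        # failed upgrade: restore data AND config
```

Capturing data and configuration in one operation is what makes the rollback safe: restoring only the data while keeping the new configuration (or vice versa) would leave the application in an inconsistent state.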
Robin CNP offers several improvements over open-source Kubernetes at the networking layer. These improvements are required to run enterprise-scale data- and network-centric applications on Kubernetes. With Robin CNP, developers and IT can set networking options while deploying applications and clusters in Kubernetes, and can preserve IP addresses during restarts and application migration. Robin's flexible networking, built on OVS and Calico, supports overlay networking. Robin also supports dual-stack (IPv4/IPv6).
IT organizations adopting the Robin platform benefit from a single approach to application and infrastructure management from experimentation to dev/test to a production environment that can span multiple clouds. Robin excels at managing heterogeneous infrastructure assets with a mix of compute, storage, and workload accelerators that can match the changing needs of fast-moving enterprise-wide demand for resources. Dell Technologies provides a wide range of PowerEdge rack servers with innovative designs to transform IT and maximize performance across the widest range of applications. PowerEdge servers match well with the three main types of infrastructure assets typically needed for a Robin managed implementation:
The PowerEdge R640 is the ideal dual-socket platform for dense scale-out data center computing.
The PowerEdge R740xd delivers a perfect balance between storage scalability and performance. The 2U two-socket platform is ideal for software defined storage.
The PowerEdge R740 was designed to accelerate application performance leveraging accelerator cards and storage scalability. The 2-socket, 2U platform has the optimum balance of resources to power the most demanding environments.
Key specifications of the PowerEdge R740 include:
- Up to two 2nd Generation Intel® Xeon® Scalable processors, with up to 28 cores per processor
- Up to 24 NVMe drives, and a total of 32 x 2.5” or 18 x 3.5” drives in a 2U dual-socket platform
- A scalable business architecture that supports up to three 300W or six 150W GPUs, or up to three double-width or four single-width FPGAs
- 24 DDR4 DIMM slots; supports RDIMM/LRDIMM at speeds up to 2933MT/s, 3TB max
- Up to 12 NVDIMMs, 192GB max
- Up to 12 Intel® Optane™ DC persistent memory modules (DCPMM), 6.14TB max (7.68TB max with DCPMM + LRDIMM)
- Front bays: up to 24 x 2.5” SAS/SATA (HDD/SSD) with NVMe SSD max 184.32TB, or up to 12 x 3.5” SAS/SATA HDD, max 192TB
- Mid bay: up to 4 x 2.5” SAS/SATA (HDD/SSD), max 30.72TB, or up to 4 x 3.5” SAS/SATA HDD, max 64TB
- Rear bays: up to 4 x 2.5” SAS/SATA (HDD/SSD), max 30.72TB, or up to 2 x 3.5” SAS/SATA HDD, max 32TB
Robin is the ideal platform for hosting both stateful and stateless applications with support for both virtual machines and Docker-based applications. It includes a storage layer that provides data services, including snapshots, clones, backup/restore, and replication that enable hybrid cloud and multi-cloud operations for stateful applications that are not possible with pure open-source cloud-native technologies.
With the Robin platform on Dell EMC PowerEdge servers, organizations can:
· Decouple and scale compute and storage independently
· Provision and decommission compute-only clusters within minutes for ephemeral workloads
· Integrate all operations with simple API commands from your development and/or production workflows
· Migrate data workloads among data centers and public clouds
· Provide self-service capability for developers and data scientists to improve productivity
· Eliminate planning delays: start small and dynamically scale up/out to meet demand
· Consolidate multiple workloads on shared infrastructure to improve hardware utilization
· Trade resources among application clusters to manage cyclical compute requirements and surges
These capabilities deliver:
· Reduced costs
· Faster insights
· A future-proofed enterprise
For more information
Dell Technologies and Robin Systems welcome your feedback on this article and the information presented herein. Contact the Dell Technologies Solutions team by email or provide your comments by completing our documentation survey.
You can also contact our regional sales teams for more information via email at the following addresses:
North America: email@example.com
Thank you for your interest.