Implementing AI: Dell PowerEdge XE9640 and Intel® Data Center GPU Max 1550
Fri, 12 Apr 2024 15:36:58 -0000
An AI inference and training POC powered by Dell Technologies
Authors: Esther Baldwin (Intel), Yashesh Shroff (Intel), Justin King (Dell)
Summary
In the current economic climate, CIOs face real challenges securing infrastructure for Artificial Intelligence (AI) development and delivery. Beyond the increasing demand for FLOPS to generate new and faster insights for the business, CIOs contend with supply chain constraints and long lead times for traditional resources, and with the need to maintain modernized environments that drive growth while reducing program costs and delivering tangible value.
In addition to modernization challenges for new data management approaches, such as data lakes, CIOs are being asked to support artificial intelligence technology for various uses as it permeates all aspects of the business environment. For many, this is an emerging technology, and they are often under a barrage of marketing and sales information. Dell Technologies, as a trusted advisor, is there to alleviate these pressures and help its customers navigate the complex decisions that turn vision and planning into reality.
This Dell Technologies solution is designed and optimized to give CIOs options. With the Intel® Data Center GPU Max Series, developers working on multiple models for inference and training will find extensive resources for competing in today's landscape. This brief provides an overview of how the solution meets the AI developer's needs.
Dell and Intel have partnered to deliver a server solution powered by Dell infrastructure with the Intel Max Series GPU: the Dell PowerEdge XE9640.
Business Challenges and Benefits
The Intel Max Series 1550 GPU meets industry challenges with flexible options that empower you to deliver everything you would expect from a modern high-performance graphics processing unit.
• Program once - no code changes between Intel® Xeon® CPUs and Max Series GPUs
• Intel oneAPI enables hardware vendor independence - no vendor lock-in
• AI-boosting Intel® Xe Matrix Extensions (XMX) with deep systolic arrays, enabling vector and matrix capabilities in a single device
• Solving large problems - the largest L2 cache on a GPU at 408 MB (10x that of the A100 and 25x that of the MI250)
• Built-in hardware-accelerated ray tracing cores, an advantage for visualization
• Xe Link - a high-speed, coherent, unified fabric that offers flexibility in any form factor and enables scale-up, with 16 Xe Links for GPU-to-GPU communication
• Advanced manufacturing processes - a modular, flexible architecture that allows the SoC to be constructed from 47 individual silicon tiles
• High Bandwidth Memory integrated on the package
• Versatile: supports both HPC and AI workloads, with AI support for popular models such as ResNet, BERT, CosmicTagger, Llama 2 (7B, 13B, 70B), GPT-J 6B, BLOOM-176B, and more
• Available today from Dell
What is Intel® Data Center GPU Max 1550?
The Intel Max Series GPU provides support for over 60 AI models. It also offers application readiness for business applications in high-performance computing, including energy, life sciences, and physics, as well as top applications in financial services, manufacturing, and more. It is available with the Dell PowerEdge XE9640.
What do AI developers care about, and where and why will the XE9640 work for them?
AI developers look for several key features in a compute platform, such as performance, versatility, scalability, ease of use, and support for AI frameworks. The Dell XE9640 platform is designed with these needs in mind.
Performance: With the Max 1550 GPUs, the Dell XE9640 platform offers 2.7X peak throughput across various datatypes (FP64, FP32, TF32, BF16/FP16, and INT8).
Versatility: This performance translates into inference and training advantages for the top AI models used for image classification, image segmentation, object detection, natural language processing, speech recognition, speech synthesis, and recommendation. Details of these workloads are provided in the following section.
Scalability: With Xe Link, developers can access high-speed GPU-to-GPU connections on the Max 1550, scaling workloads up to four cards on the Dell XE9640 platform. Moreover, Ethernet or InfiniBand fabrics can connect GPUs across nodes in a scale-out configuration.
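As an illustration, here is a minimal sketch of multi-GPU data parallelism on such a platform using PyTorch DistributedDataParallel with the oneCCL collective backend. It assumes the intel_extension_for_pytorch and oneccl_bindings_for_pytorch packages are installed and that the script is started by a multi-process launcher (such as mpirun or torchrun) that sets the rank and world-size environment variables; the model here is a stand-in.

```python
import torch
import torch.distributed as dist
import intel_extension_for_pytorch as ipex   # registers the "xpu" device with PyTorch
import oneccl_bindings_for_pytorch           # registers the "ccl" distributed backend

# Rank, world size, and master address are expected from the launcher's environment.
dist.init_process_group(backend="ccl")

# Pin each process to one of the Max GPUs in the node.
local_rank = dist.get_rank() % torch.xpu.device_count()
device = f"xpu:{local_rank}"

# Stand-in model; real training code would wrap its own network the same way.
model = torch.nn.Linear(1024, 1024).to(device)
model = torch.nn.parallel.DistributedDataParallel(model)
```

Within a node, the collectives can ride the Xe Link fabric between GPUs; across nodes, they use the cluster interconnect described above.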
Ease of use: With numerous models now available for easy onboarding via GitHub (https://github.com/IntelAI/models) and a well-documented oneAPI software stack, developers can start building applications that leverage Intel's advanced Data Center CPU capabilities, such as AVX-512, and the Max Series GPU.
AI Frameworks: The oneAPI software stack supports the latest releases of PyTorch and TensorFlow through plugins, IPEX and ITEX, respectively. This makes it possible to write code that runs efficiently on Max 1550 GPU cards with as few as two lines changed. Find more details at https://software.intel.com/.
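To make the "two lines of change" concrete, here is a minimal sketch for PyTorch with IPEX, assuming the intel_extension_for_pytorch package is installed; the toy model stands in for real application code.

```python
import torch
import intel_extension_for_pytorch as ipex  # change 1: import the IPEX plugin

# Toy model in place of a real network.
model = torch.nn.Sequential(torch.nn.Linear(256, 64), torch.nn.ReLU()).eval()
model = model.to("xpu")                     # change 2: place the model on the Max GPU
model = ipex.optimize(model)                # optional: apply IPEX kernel optimizations

x = torch.randn(8, 256).to("xpu")
with torch.no_grad():
    y = model(x)
```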
AI Workloads
Published AI and HPC workloads can be found here, along with detailed guides on running them with supported frameworks and essential open-source libraries that give developers the tools they need to deliver value. The workloads are provided as deployable PyTorch and TensorFlow containers and include sample scripts that minimize deployment time.
Below are sample use cases and associated models:
- Enterprise: Llama 2, GPT-J-6B, BLOOM-176B, ResNet-50, BERT-Large, and many more
- Financial Services: STAC-A2 and FSI Kernels
- Life & Material Sciences: LAMMPS Multi-GPU scaling–Tungsten workload, NWChemEx PWDFT, AutoDock, NAMD, RELION
- Astrophysics: DPEcho
- Physics: 3D GAN for Particle Shower Simulation, DeepGalaxy, QMCPack
- Earth Systems Modelling: SpecFEM3D_Globe Multi-GPU Scaling (Global_s362ani_shakemovie), ECMWF CLOUDSC
- Energy: Seismic Kernel Multi-GPU scaling
- Manufacturing: CoMLSim, JacobiSolver
Generative AI is of high interest for delivering business impact. The Dell PowerEdge XE9640 has the software and hardware capabilities to drive GenAI use cases in an enterprise setting. Along with traditional deep learning (CV, RecSys, NLP) models, there is growing support for GenAI workloads such as Llama 2 and Mistral for inference, fine-tuning, and developing Retrieval Augmented Generation (RAG) pipelines; a minimal inference sketch follows. To see the performance results for a workload that interests you, contact your Dell representative.
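As a flavor of what such a GenAI inference flow can look like, the sketch below loads a causal language model with Hugging Face transformers and runs generation on the GPU. It is illustrative only: the checkpoint name is an example (the Llama 2 weights are gated on Hugging Face), and it assumes intel_extension_for_pytorch is installed to provide the "xpu" device.

```python
import torch
import intel_extension_for_pytorch as ipex  # enables the "xpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; access to this checkpoint is gated
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("xpu").eval()

prompt = "Explain Retrieval Augmented Generation in one sentence."
inputs = tok(prompt, return_tensors="pt").to("xpu")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```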
oneAPI Software and AI Tools from Intel
The oneAPI open, standards-based specification makes developers' lives easier by supporting multiple architecture types, including but not limited to GPUs, CPUs, and FPGAs. The specification defines a set of library interfaces that are commonly used in a variety of workflows.
AI Tools from Intel is a toolkit that provides familiar Python tools and frameworks to data scientists, AI developers, and researchers to accelerate end-to-end data science and analytics pipelines on Intel® architecture, a vital component of the Dell PowerEdge XE9640. The components are built using oneAPI libraries for low-level compute optimizations.
The AI Tools maximize performance from preprocessing through machine learning and provide interoperability for efficient model development. Train on Intel® CPUs and GPUs and integrate fast inference into your AI development workflow with Intel®-optimized deep learning frameworks for TensorFlow and PyTorch, pre-trained models, and model optimization tools. Don't forget to look at the Intel Distribution for Python, with its highly optimized scikit-learn, which is part of AI Tools from Intel.
Compute-intensive Python packages such as Modin*, scikit-learn*, and XGBoost provide drop-in acceleration for data preprocessing and machine learning workflows, as the sketch below shows.
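For example, here is a minimal sketch of that drop-in pattern with the Intel Extension for Scikit-learn (the scikit-learn-intelex package, part of AI Tools from Intel); the data is synthetic.

```python
import numpy as np
from sklearnex import patch_sklearn

patch_sklearn()  # re-route supported estimators to Intel-optimized implementations

from sklearn.cluster import KMeans  # import after patching so the optimized version is used

X = np.random.rand(100_000, 16).astype(np.float32)  # synthetic feature matrix
km = KMeans(n_clusters=8, n_init=10).fit(X)
print(km.inertia_)  # within-cluster sum of squared distances
```

Because the estimator API is unchanged, existing scikit-learn code picks up the acceleration without modification.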
For more details, refer to the oneAPI specification page here and the Resources section at the end of the guide, where you can download the oneAPI base toolkit and AI Tools from Intel.
Dell PowerEdge XE9640 Overview
Density-optimized AI acceleration with Dell PowerEdge delivers real-time insights. The XE9640, a 2U server, is Dell's first liquid-cooled 4-way GPU platform, designed to drive the latest cutting-edge AI, Machine Learning, and Deep Learning neural network applications.
- Combines a high core count of up to 56 cores in 4th Gen Intel® Xeon® processors with the most GPU memory and bandwidth available today to break through the bounds of today's and tomorrow's AI computing.
- Four Intel Data Center GPU Max 1550 600W OAM GPUs, fully interconnected with Xe Link.
- An ideal 2U form-factor building block for dense supercomputing and HPC acceleration workloads and applications.
- Supports rack direct liquid cooling infrastructure: CoolIT with 42U and 48U XE9640 rack manifolds.
Security
Security is integrated into every phase of the PowerEdge lifecycle, including a protected supply chain and factory-to-site integrity assurance. Silicon-based root of trust anchors end-to-end boot resilience while Multi-Factor Authentication (MFA) and role-based access controls ensure trusted operations.
- Cryptographically signed firmware
- Data at Rest Encryption (SEDs with local or external key management)
- Secure Boot
- Secured Component Verification (Hardware integrity check)
- Secure Erase
- Silicon Root of Trust
- System Lockdown (requires iDRAC9 Enterprise or Datacenter)
- TPM 2.0 FIPS, CC-TCG certified, TPM 2.0 China NationZ
Accelerated I/O throughput
- Direct liquid-cooled Processors and GPUs enable efficient cooling for the highest performance, efficient power utilization, and lower TCO
- Dell Multi-vector cooling manages components to operate optimally
- The ideal dual-socket 2U rack server building block for dense data center computing. Benefiting from NVMe performance and embedded intelligence, it ensures optimized application performance in a secure platform.
Dell Infrastructure Components
The following Dell components provide the foundation for AI solutions that lend themselves to development and delivery.
Dell PowerScale is an AI-ready data platform designed to easily store, manage, and protect data. Accelerate your AI workloads wherever your unstructured data lives—on-premises, at the edge, and in any cloud.
Dell Unity XT Storage provides flexible hybrid flash storage for cost-sensitive enterprises that want to leverage a combination of flash and disk for lower cost than all flash/NVMe architectures. It supports unified block and file workloads, online upgrades without migrations, guaranteed 3:1 dedupe, and sync replication.
Dell PowerVault Storage is optimized for DAS and SAN applications and supports PowerEdge server capacity expansion via PowerEdge-ready JBODs. It provides management simplicity and low-cost block storage and is ideal for edge and high-capacity data warehouse deployments.
Dell ECS Storage is an enterprise-grade, cloud-scale object storage platform providing comprehensive protocol support for unstructured object and file workloads on a single modern platform. Depending on capacity requirements, either the ECS EX500 or EX5000 may be used.
Dell PowerSwitch Networking switches are based on open standards to free the data center from outdated, proprietary approaches. They support future-ready networking technology that helps you improve network performance, lower network management cost and complexity, and adopt innovations in networking.
Why Dell Technologies
The technology required for data management and enterprise analytics is evolving quickly, and companies may not have experts on staff who have the time to design, deploy, and manage solution stacks at the pace required. Dell Technologies has been a leader in AI, Big Data, and advanced analytics for over a decade, with proven products, solutions, and expertise. Dell Technologies has teams of application and infrastructure experts dedicated to staying on the cutting edge, testing new technologies, and tuning solutions for your applications to help you keep pace with this constantly evolving landscape.
Dell Technologies is building a broad ecosystem of partners in the data space to bring our customers the necessary experts, resources, and capabilities and accelerate their data strategy. We believe customers should be able to deliver AI innovation using data irrespective of where it resides, across on-prem, public cloud, and edge. By partnering with industry leaders in enterprise data management and analytics, we create optimized solutions for our customers.
Dell Technologies uniquely provides an extensive portfolio of technologies to deliver the advanced infrastructure that underpins successful data implementations. With years of experience and an ecosystem of curated technology and service partners, Dell Technologies provides innovative solutions, servers, networking, storage, workstations, and services that reduce complexity and enable you to capitalize on a data universe.
Proof Points
One of the more recent and fast-growing use cases is Generative AI (GenAI). The following chart shows a sample benchmark of a GenAI workload run on a Dell PowerEdge XE9640 platform with four Max 1550 GPUs:
This figure demonstrates efficient linear scale-up from a single card to four cards for Llama 2 7B inference. Details on the workload configuration and environment setup can be found at the Dell InfoHub blog: https://infohub.delltechnologies.com/p/expanding-gpu-choice-with-intel-data-center-gpu-max-series/. Scalers AI, a Dell partner, developed and performed the benchmark.
Conclusion
Whether you want to expand your existing capabilities or start your first project, talk to us about your AI vision and what you need.
Your company needs all of its tools and technologies working in concert to achieve success. Fast, effective systems that complement time-management practices are crucial to maximizing every employee hour, and data collection and processing that yields rich, detailed analytics can ensure your marketing campaigns strategically target your ideal customers and encourage conversion. On top of that, you need affordable products, delivered in a timely fashion, that meet your criteria. The XE9640 with the Intel Max 1550 GPU meets the needs of AI developers.
Understanding AI Language
AI Models
A model is a program that analyzes data. Many different models are in use in AI, each specialized for the type of data it analyzes. The model used in this brief is Llama 2, a collection of generative AI models that are pretrained and fine-tuned to generate text, scaling from 7 to 70 billion parameters. Llama 2 is part of a new trend toward "nimble" models: more customized to specific business needs, smaller, and lower cost to train and deploy.
“Dell uses Llama 2 internally for both experimental work and production deployment. One use case provides a chatbot-style interface to support Retrieval Augmented Generation (RAG) to get information from Dell’s knowledge base of articles. Llama 2 itself is a freely available open-source technology.” [i]
What is "GenAI"?
Generative AI, or GenAI, is a subset of artificial intelligence with the potential to transform the business world through its ability to create new content from existing data. It is a powerful tool that can generate text, images, videos, and even code, revolutionizing how businesses operate. Example applications include:
- Healthcare: generate synthetic data for research without violating privacy regulations.
- Research: create new models of chemical compound molecules for pharmaceutical drug discovery; design products and parts for manufacturing.
- Creativity: create music, fashion design, and product design; edit images; create unique art; provide realistic images and immersive worlds for virtual and augmented reality; augment game development for in-game content creation and game play adaptation.
- Natural language understanding and processing: human-like chatbot interaction, virtual assistants.
- Software: write code, significantly reducing development time and costs.
- Data generation: create synthetic data for training machine learning models and testing edge-based communication systems.
Parameters
A parameter is one of the internal variables (weights) that a model learns during training; the parameter count indicates the size of the model. For instance, Llama 2 70B has around 70 billion parameters.
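A short, self-contained illustration of what the parameter count measures, using a toy PyTorch model:

```python
import torch

# A toy two-layer model; each Linear layer holds a weight matrix plus a bias vector.
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.Linear(4096, 4096))
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~33.6M here; Llama 2 70B has ~70,000M
```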
Tokens
A token is the unit a model uses to measure the length of a text; tokens can be pieces of words, punctuation, or emojis, and the model uses them to learn context and semantics. Text is split into these smaller units for processing, and new text is generated token by token. One way to measure a system's capacity is the number of tokens its hardware can process per second.
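A small illustrative example of tokenization using the Hugging Face transformers package; the GPT-2 tokenizer is used purely for demonstration, since Llama 2 ships its own tokenizer.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer
ids = tok.encode("Dell PowerEdge XE9640 accelerates AI.")
print(len(ids))                        # number of tokens in the sentence
print(tok.convert_ids_to_tokens(ids))  # the word pieces themselves
```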
References
- Dell PowerScale
- Dell PowerEdge XE9640
- PowerEdge XE9640 Rack Server
- Dell PowerVault Storage
- Dell Unity XT Storage
- Dell ECS Object Storage
- Dell PowerSwitch Networking
- CoMLSim background: https://arxiv.org/abs/2210.05837
- Intel® oneAPI Base Toolkit
- AI Tools from Intel
Related Documents
Spark Machine Learning on Dell HS5610 Platform with Cloudera
Mon, 29 Jan 2024 22:49:51 -0000
Executive Summary
To establish thorough solution collateral for the Dell PowerEdge HS5610 platform integrated with Cloudera software, we are commencing benchmarking initiatives this year. These benchmarks will form the baseline for our future testing; note that we are not making comparisons with previous generations.
This initiative was prompted directly by Dell's request to craft this reference solution. Intel executed the benchmark tests and shared its Best-Known Methods (BKMs), providing invaluable guidance for this undertaking.
What are the key takeaways?
Cloudera Data Platform built on Dell’s 16G PowerEdge servers with Intel® 4th Generation Xeon processor architecture can accommodate growing enterprise data workloads and efficiently manage increasing demands for analytics and machine learning in a smaller footprint. Cloudera Data Platform delivers easier data management and scalability for data anywhere with optimal performance, scalability, and security.
As organizations create more diverse and more user-focused data products and services, there is a growing need for machine learning, which can be used to develop personalization, recommendations, and predictive insights. But as organizations amass greater volumes and varieties of data, data scientists spend most of their time supporting their infrastructure instead of building the models that solve their data problems. To help address this, Spark, as an integrated part of Cloudera's platform, provides a general machine learning library designed for simplicity, scalability, and easy integration with other tools. With the scalability, language compatibility, simple administration, and compliance-ready security and governance provided through Cloudera, data scientists can solve and iterate through their data problems faster.
Spark MLlib
Spark MLlib is a distributed machine learning framework built on top of Spark Core. The key benefit of MLlib is that it allows data scientists to focus on their data problems and models instead of solving the complexities surrounding distributed data. MLlib leverages the advantages of in-memory computation and is optimized for matrix and vector operations, aligning its capabilities with specific algorithmic requirements for the given use case.
K-means Overview
Clustering stands as a fundamental exploratory data analysis technique, providing valuable insights into the inherent structure of data. One prominent algorithm for this purpose is K-means, widely recognized for partitioning data points into a predefined number of clusters. The technique finds extensive applications in diverse domains, including market segmentation, document clustering, image segmentation, search engines, real estate, anomaly detection, and image compression, highlighting its versatility and importance in data analysis.
K-means clustering performance
We achieved remarkable results by clustering 10 million samples with 1,000 dimensions in just 283 seconds. This was made possible through the K-means algorithm from Spark's ML library, provided by Cloudera 7.1.8 and deployed on the Dell PowerEdge HS5610 platform.
We conducted a performance evaluation of Spark's MLlib K-Means algorithm using the HiBench Benchmark.
For detailed information on our benchmarking process, refer to the Intel HiBench GitHub repository: https://github.com/Intel-bigdata/HiBench
Note: This result is not compared against any other platform hardware or software; we will use it as a baseline for future products. A minimal sketch of this kind of MLlib K-means run follows.
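For orientation, here is a minimal PySpark sketch of an MLlib K-means run of this kind. It is not the HiBench harness itself: the input path is hypothetical, and the cluster and iteration counts simply echo the configuration tables below.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("kmeans-baseline").getOrCreate()

# Hypothetical input: a table of numeric feature columns stored on HDFS.
df = spark.read.parquet("hdfs:///data/kmeans_samples")
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
features = assembler.transform(df)

# Cluster and iteration counts echo the configuration tables below.
kmeans = KMeans(k=5, maxIter=40, featuresCol="features")
model = kmeans.fit(features)
print(model.summary.trainingCost)  # within-cluster sum of squared distances
```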
Configuration Details
Workload Configuration

| Component | Configuration |
|---|---|
| Platform | Dell PowerEdge HS5610 |
| CPU | Intel® Xeon® Gold 6448Y |
| Memory | 512 GB (16 x 32 GB DDR5-4800) |
| Boot Device | Dell EMC™ Boot Optimized Server Storage (BOSS-N1) with 2 x 480 GB M.2 NVMe SSDs (RAID 1) |
| HDFS Data Disk | 2 x Dell Ent NVMe P5500 RI U.2 3.84 TB |
| HDFS NameNode Disk | 1 x Dell Ent NVMe P5500 RI U.2 3.84 TB |
| YARN Cache Disk | 1 x Dell Ent NVMe P5500 RI U.2 3.84 TB |
| Network Interface Controller | NetXtreme BCM5720 Gigabit Ethernet PCIe |
| Cluster size | 1 |
| Cloudera Distribution | Cloudera Data Platform 7.1.8 |
| Compute Engine | Spark 3.2.0 |
| Workload | HiBench 7.1.1 - K-means algorithm |
| Iterations and result choice | 3 iterations, average |

Spark Configuration

| Setting | Value |
|---|---|
| spark.deploy.mode | yarn |
| Executor numbers | 16 |
| Executor cores | 8 |
| spark.executor.memory | 24g |
| spark.executor.memoryOverhead | 4g |
| spark.driver.memory | 20g |
| spark.default.parallelism | 128 |
| spark.driver.maxResultSize | 20g |
| spark.serializer | org.apache.spark.serializer.KryoSerializer |
| spark.kryoserializer.buffer.max | 1g |
| spark.network.timeout | 1200s |

K-means Configuration

| Parameter | Value |
|---|---|
| Number of clusters | 5 |
| Dimensions | 1,000 |
| Number of samples | 10,000,000 |
| Samples per input file | 10,000 |
| Number of iterations | 40 |
| k | 300 |
Powering AI using Red Hat OpenShift with Intel-based PowerEdge servers
Fri, 13 Oct 2023 14:42:09 -0000
End-to-End AI using OpenShift Overview
At the top of this webpage are three PDF files outlining test results and reference configurations for Dell PowerEdge servers using both the 3rd Generation Intel® Xeon® processors and the 4th Generation Intel Xeon processors. All testing was conducted in Dell Labs by Intel and Dell engineers in May and June of 2023.
- “Dell DfD E2E AI ICX” – highlights the recommended configurations for Dell PowerEdge servers using 3rd generation Intel Xeon processors.
- “Dell DfD E2E AI SPR” – highlights the recommended configurations for Dell PowerEdge servers using 4th generation Intel Xeon processors.
- “DfD – PowerEdge E2E AI Test Report” – highlights the results of performance testing on both configurations, with comparisons that demonstrate both the performance and the reduced power consumption of each.
Solution Overview
Red Hat OpenShift, the industry's leading hybrid cloud application platform powered by Kubernetes, brings together tested and trusted services to reduce the friction of developing, modernizing, deploying, running, and managing applications. OpenShift delivers a consistent experience across public cloud, on-premises, hybrid cloud, or edge architecture.[i]
Companies using OpenShift[ii]
- 50% of Fortune Global 500 aerospace and defense companies.
- 57% of Fortune Global 500 technology companies.
- 51% of Fortune Global 500 financial companies.
- 80% of Fortune Global 500 telecommunications companies.
- 54% of Fortune Global 500 motor vehicles and parts companies.
- 50% of Fortune Global 500 food and drug stores.
Elasticsearch with Dell PowerEdge and Intel processor benefits
The introduction of new server technologies allows customers to deploy solutions using newly introduced functionality, but it can also provide an opportunity to review their current infrastructure and determine whether the new technology might increase performance and efficiency. With this in mind, Dell and Intel recently conducted Natural Language Processing (NLP) AI performance testing of a Red Hat OpenShift solution on the new Dell PowerEdge R760 with 4th Generation Intel® Xeon® Scalable processors, and compared the results to the same solution running on the previous-generation R750 with 3rd Generation Intel® Xeon® Scalable processors to determine whether customers could benefit from a transition.
Some of the key changes incorporated into the 4th Generation Intel® Xeon® Scalable processors used for this test included the following (see the sketch after this list):
- New Advanced Matrix Extension (AMX) capabilities
- Improved Advanced Vector Extension (AVX) performance
- The new Intel® Extension for TensorFlow® open-source solution
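As a hedged illustration of how these features surface to a developer, the sketch below imports the Intel Extension for TensorFlow, which registers its optimizations on import (assuming the intel-extension-for-tensorflow package is installed), and enables bfloat16 mixed precision, which lets oneDNN dispatch AMX instructions on 4th Generation Xeon processors; the model is a toy.

```python
import tensorflow as tf
import intel_extension_for_tensorflow as itex  # the plugin registers itself on import

# bfloat16 mixed precision allows oneDNN to use AMX tile instructions on 4th Gen Xeon.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

# Toy classifier standing in for the NLP models used in the testing.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```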
Raw performance: As noted in the report, our tests showed a 3.47x increase in transfer learning performance and a 5.59x increase in inferencing performance.
Relative power consumption: In addition to higher performance, the R760-based solution also delivered up to 3.39x better performance per watt than the previous generation.
Conclusion
Choosing the right combination of server and processor can increase performance and reduce cost. As this testing demonstrated, the Dell PowerEdge R760 with 4th Generation Intel® Xeon® Platinum 8462Y+ CPUs delivered up to 5.59x more throughput than the Dell PowerEdge R750 with 3rd Generation Intel® Xeon® Platinum 8362 CPUs and provided up to 3.39x better power efficiency.
The result is an efficient, scalable, and optimized means to run enterprise AI pipelines on Intel hardware, with a full end-to-end OpenShift stack with Kubeflow:
- Up to 3.47x better transfer learning (fine-tuning) throughput than 3rd Gen Intel Xeon Scalable processors, with linear scaling on 1, 2, and 4 nodes
- Up to 3.39x higher transfer learning power efficiency than 3rd Gen Intel Xeon Scalable processors
- Up to 5.59x better inferencing performance over 3rd Gen Intel Xeon Scalable processors with FP32 precision using the same core count
- Up to 3.61x performance improvement over 3rd Gen Intel Xeon Scalable processors with INT8 precision using the same core count
[ii] Source: Fortune 500 subscription data as of 26 September 2022