LLM inference using TorchServe on Red Hat OpenShift with PowerEdge R760 using 5th Gen Intel® Xeon® Processors
Summary
This Direct from Development (DfD) tech note describes the new AI capabilities you can expect from the Dell PowerEdge R760 with the 5th Gen Intel® Xeon® Platinum 8562Y+ Scalable processor. Online model serving is critical for enterprises that must meet concurrent user needs at low latency, especially as model sizes and capabilities grow. Deploying the latest generation of generative AI models in production also requires resource management and complex model deployment pipelines.
In this document, we cover the deployment of two such large language models (LLMs) using TorchServe, an open-source tool for serving PyTorch models in production. Combined with Red Hat OpenShift, the industry's leading hybrid cloud application platform powered by Kubernetes, this approach enables businesses to scale AI workloads easily based on demand.
Market positioning
Red Hat OpenShift (RHOS) provides a robust platform for Large Language Model (LLM) inference and fine-tuning experiments. Leveraging Kubernetes containerization, Red Hat OpenShift Container Platform (RHOCP) packages LLM models and their dependencies for easy deployment and portability. Using TorchServe with RHOS, PyTorch models can be deployed within OpenShift clusters and their model endpoints exposed as RESTful APIs. This enables enterprises to deploy and scale the models of their choice to meet their end customers' or internal users' needs for responses at acceptable next-token latencies.
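As an illustration, the Python sketch below shows how a client application might call one of these TorchServe REST endpoints once it is exposed through an OpenShift Route. The route hostname and registered model name are hypothetical placeholders, and the exact request and response format depends on the model handler used in the deployment.

```python
# Minimal sketch of querying a TorchServe inference endpoint exposed through an
# OpenShift Route. The route hostname and model name ("llama2-7b") are
# hypothetical placeholders, not values from the tested deployment.
import requests

ROUTE_URL = "http://torchserve-llm.apps.example.com"  # hypothetical OpenShift Route
MODEL_NAME = "llama2-7b"                              # hypothetical registered model name

prompt = "Summarize the benefits of CPU-based LLM inference in two sentences."

# TorchServe inference API: POST /predictions/{model_name}
response = requests.post(
    f"{ROUTE_URL}/predictions/{MODEL_NAME}",
    data=prompt.encode("utf-8"),
    headers={"Content-Type": "text/plain"},
    timeout=300,
)
response.raise_for_status()
print(response.text)  # generated continuation returned by the model handler
```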
Dell Technologies and Intel® conducted LLM AI performance testing using the latest Dell PowerEdge R760 with 5th Gen Intel® Xeon® Scalable processors. This platform supports up to DDR5-5600 MT/s memory, providing the higher memory bandwidth that is critical for LLMs. The platform offers support for eight hot-plug 2.5”/3.5” HDD/SSD drives. Value SAS (vSAS) SSD support has also been expanded to provide more options for an affordable performance SSD tier. These drives can be configured with a Dell rack solution for small businesses that require a scalable server optimized for enterprise-class workloads.
Dell PowerEdge R760 with 5th Gen Intel® Xeon® Scalable processors also helps "Accelerate Transformation Anywhere" with technology and solutions for AI-driven innovations and automation. 5th Gen Intel® Xeon® Scalable processors offer advanced CPU capabilities for AI:
- A built-in AI accelerator, Intel® Advanced Matrix Extensions (Intel® AMX), in each core. AMX supports BF16 for inferencing and fine-tuning, and it supports INT8 quantization for inferencing, which significantly reduces memory requirements.
- Intel® Advanced Vector Extensions 512 (Intel® AVX-512) to accelerate non-deep-learning vector computations.
- Software optimizations for Intel® Xeon® processors, available in the latest releases of the Intel® Extension for PyTorch (IPEX) GitHub repository; a minimal usage sketch follows this list.
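The sketch below shows one way these optimizations can be applied from Python, using IPEX's generic optimize() entry point with BF16. The model ID is an example (the Llama 2 weights are gated and require access), and the deployment tested in this document may rely on IPEX's LLM-specific optimization paths instead.

```python
# Minimal sketch of applying Intel Extension for PyTorch (IPEX) BF16 optimization
# to a Hugging Face causal LM so that AMX/AVX-512 kernels can be used on Xeon cores.
# The model ID is an example; INT8 weight-only quantization follows a similar
# pattern through IPEX's quantization APIs.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example; gated weights, access required
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# ipex.optimize() fuses operators and selects AMX/AVX-512 kernels where available.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Red Hat OpenShift is", return_tensors="pt")
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```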
System configuration tested
Tables 1, 2, and 3 list details about the hardware platform, software configuration, and workload configuration.
Table 1. Hardware configuration
System | Dell PowerEdge R760 |
Chassis | Rack Mount Chassis |
CPU Model | Intel® Xeon® Platinum 8562Y+ |
Microarchitecture | EMR_MCC |
Sockets | 2 |
Cores per Socket | 32 |
Hyperthreading | Enabled |
Logical CPUs | 128 |
Intel Turbo Boost | Enabled |
Base Frequency | 2.8GHz |
All-core Maximum Frequency | 3.8GHz |
Maximum Frequency | 4.1GHz |
NUMA Nodes | 2 |
Prefetchers | L2 HW: Enabled, L2 Adj.: Enabled, DCU HW: Enabled, DCU IP: Enabled, AMP: Disabled, Homeless: Enabled, LLC: Enabled |
Accelerators | DLB 2, DSA 2, IAA 2, QAT 2 |
Installed Memory | 512 GB (16 x 32 GB DDR5-5600 MT/s) |
Hugepagesize | 2048 kB |
Transparent Huge Pages | Always |
Automatic NUMA Balancing | Disabled |
NIC | 2x Ethernet Controller X710 for 10GBASE-T for SFP, 2x Ethernet Controller E810-C for QSFP |
Disk | 4 x 3.5 TB SAMSUNG MZQL23T8HCLS-00B7C, 1 x 894.3 GB INTEL SSDSC2KG96 |
Table 2. Software configuration
Configuration | Setting |
OS | Red Hat Enterprise Linux CoreOS 415.92.202403061641-0 (Plow) |
Kernel | 5.14.0-284.55.1.el9_2.x86_64 |
Red Hat OpenShift (Kubernetes) | v1.27.8+4fab27b |
Framework | torch 2.2.0+cpu, torchserve 0.9.0, intel_extension_for_pytorch 2.2.0, oneccl_bind_pt 2.2.0+cpu |
Other Software | cmake-3.20.2, findutils-4.6.0, bzip2-1.0.6, gcc-8.5.0, gcc-c++-8.5.0, gcc-toolset-12-12.0, gcc-toolset-12-runtime-12.0, git-2.39.3, gperftools-devel-2.7-9.el8, libatomic-8.5.0, libfabric-1.18.0, procps-ng-3.3.15, python3-distutils-extra-2.39, python39-3.9.18, python39-devel-3.9.18, python39-pip-20.2.4, unzip-6.0, wget-1.19.5, which-2.21, java-17-openjdk-17.0.10.0.7, intel-oneapi-openmp-2023.2.1, ninja 1.11.1.1, accelerate 0.25.0, sentencepiece 0.1.99, protobuf 4.25.1, datasets 2.15.0, transformers 4.35.0, wheel 0.42.0 |
Table 3. Workload configuration
System | Dell PowerEdge R760 |
Model | Meta Llama-2-7b-hf, Meta Llama-2-13b-hf |
Containers and Virtualization | Custom container with TorchServe-related deployment files (“torchserve-ipex-22-quant:1.0”) |
Warm up steps | 10 |
Steps | 60 |
Number of Users [1 BS each] | 1, 2, 4, 8, 16, 32, 64 |
Batch size per inference server (TorchServe instance) | Maximum of 64. The actual batch size processed per instance in a given scenario depends on the round-robin load balancer that distributes requests among the available instances; requests are then dynamically grouped into a larger batch on each instance to optimize processing efficiency. Typical batch size per instance: [number of users] / [number of instances]. See the registration sketch after this table. |
Beam Width | 1 (greedy search) |
Input Token Size | 32, 256, 1024, 2048 |
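For reference, the sketch below shows one way a TorchServe instance's dynamic batching could be configured through the management API at model registration time. The endpoint, model archive name, and parameter values are illustrative assumptions, not the exact deployment pipeline used in this study.

```python
# Minimal sketch of configuring TorchServe dynamic batching via the management API.
# The management URL, model archive name, and model name are hypothetical.
import requests

MANAGEMENT_URL = "http://localhost:8081"  # TorchServe management API (default port)

# Register the model with a maximum batch size of 64; requests arriving within
# max_batch_delay milliseconds are grouped into a single batch by the frontend.
resp = requests.post(
    f"{MANAGEMENT_URL}/models",
    params={
        "url": "llama2-7b.mar",    # hypothetical model archive name
        "model_name": "llama2-7b", # hypothetical registered model name
        "batch_size": 64,
        "max_batch_delay": 100,    # milliseconds; illustrative value
        "initial_workers": 1,
    },
)
resp.raise_for_status()
print(resp.json())
```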
Performance results
The benchmark spawns a subprocess for each simulated user. Within each subprocess, independent requests are generated and sent to the service endpoint, which is exposed as an OpenShift Route and acts as the load balancer, distributing all incoming user requests across the TorchServe instances (two running per node) using true round-robin logic. Thus, in every concurrent-user test scenario, each instance receives exactly half of the requests, which keeps second-token latency essentially identical on both instances (within about +/-1 ms). The Topology Manager and CPU Manager features of OpenShift are enabled to ensure that each TorchServe instance is affinitized to a single NUMA node. Of the 128 logical CPUs in the system, four are reserved for CoreOS and other node services; the remaining 124 CPUs are split across the two TorchServe instances, each with 62 CPUs.
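The sketch below mirrors this concurrent-user pattern: one subprocess per simulated user, each sending independent requests to the service endpoint behind the OpenShift Route. The endpoint URL, prompt, and measurement details are illustrative stand-ins for the actual benchmark harness.

```python
# Rough sketch of the concurrent-user pattern described above: one subprocess per
# simulated user, each sending independent requests to the service endpoint.
# The endpoint, prompt, and step counts are illustrative, not the exact harness used.
import time
import requests
from multiprocessing import Process

ENDPOINT = "http://torchserve-llm.apps.example.com/predictions/llama2-7b"  # hypothetical
PROMPT = "Tell me about Dell PowerEdge servers. " * 8  # stand-in for a fixed-size input
WARMUP_STEPS, STEPS = 10, 60

def user_loop(user_id: int) -> None:
    latencies = []
    for step in range(WARMUP_STEPS + STEPS):
        start = time.perf_counter()
        r = requests.post(ENDPOINT, data=PROMPT.encode("utf-8"), timeout=600)
        r.raise_for_status()
        if step >= WARMUP_STEPS:  # discard warm-up iterations
            latencies.append(time.perf_counter() - start)
    # The study reports next-token latency, which requires token-level timing;
    # end-to-end request latency is printed here only as a simple illustration.
    print(f"user {user_id}: mean request latency {sum(latencies) / len(latencies):.2f}s")

if __name__ == "__main__":
    num_users = 8  # one of 1, 2, 4, 8, 16, 32, 64
    procs = [Process(target=user_loop, args=(i,)) for i in range(num_users)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```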
The following figure shows the single-node online inference performance for both INT8 and BFloat16 data types, accelerated by the 5th Gen Xeon® built-in AI acceleration with AMX. With the models on the X-axis and the number of concurrent users supported on the Y-axis, we scale the input token size from 256 to 1024. Some key takeaways from this graph are:
- The LLAMA2-7B and LLAMA2-13B (BF16 and INT8 quantized) models can support up to 64 and 32 concurrent users, respectively, at next-token latencies of roughly under 100 ms, for an input token size of 256 with 256 output tokens and greedy search.
- The LLAMA2-7B and LLAMA2-13B (BF16 and INT8 quantized) models can support up to 32 and 16 concurrent users, respectively, at next-token latencies of roughly under 100 ms, for an input token size of 1024 with 256 output tokens and greedy search.
From a real-world use case perspective, a two-socket Dell PowerEdge R760, equipped with two Intel® Xeon® Platinum 8562Y+ processors, can handle up to 32 customer chatbot queries, each approximately two pages long. Additionally, it can support up to 64 customer chatbot queries, each around half a page long.
Figure 1. LLM Model Serving using TorchServe on a Dell PowerEdge R760 node with 2x 5th Gen Intel® Xeon® Scalable Processors
Using this methodology, enterprises can support additional concurrent users by scaling out beyond a single node.
Real-world application
One way to translate this concurrent-user data into a realistic use case is through the number of Daily Active Users (DAUs) a chatbot can serve. Suppose you support a company of around 5,000 employees worldwide, are planning to stand up a global chatbot, and need to determine how many Xeon® servers it would take. You start with an assumption: roughly 1,500 DAUs will use the tool. You could simply divide 1,500 DAUs by the number of concurrent users in Figure 1, but you would likely be overprovisioning your Xeon® servers, because those DAUs would not all be hitting the chatbot simultaneously and continuously over a 24-hour period. If you instead assume that each DAU engages in five sessions per day with four prompts per session, a single Dell PowerEdge R760 equipped with two Intel® Xeon® Platinum 8562Y+ processors should be able to support this chatbot (a rough sizing sketch follows).
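A back-of-the-envelope version of that sizing logic is sketched below. All values are illustrative assumptions (for a real sizing exercise, the concurrency, output length, and per-token latency should come from Figure 1 and the measured data), and the sketch deliberately ignores prompt prefill time.

```python
# Back-of-the-envelope chatbot capacity estimate mirroring the sizing logic above.
# Every value below is an illustrative assumption, not a measured result.
daily_active_users   = 1500
sessions_per_user    = 5
prompts_per_session  = 4

concurrent_slots     = 32     # assumed concurrency for the chosen scenario in Figure 1
output_tokens        = 256    # assumed generated tokens per prompt
next_token_latency_s = 0.100  # ~100 ms next-token latency target

prompt_seconds  = output_tokens * next_token_latency_s      # ignores prefill time
session_seconds = prompts_per_session * prompt_seconds

sessions_needed   = daily_active_users * sessions_per_user
sessions_per_slot = 24 * 3600 / session_seconds
sessions_capacity = concurrent_slots * sessions_per_slot

print(f"needed:   {sessions_needed} sessions/day")
print(f"capacity: {sessions_capacity:.0f} sessions/day on one server (under these assumptions)")
```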
How many 5th Gen Intel® Xeon®-based servers does it take to support a chatbot?[1]
Figure 2. Hypothetical example of the number of Daily Active Users for a chatbot that can be supported using a single Dell PowerEdge R760 node with 2x 5th Gen Intel® Xeon® Scalable Processors, based on the concurrency data in Figure 1
Terminology
Table 4. Terminology
Term | Explanation |
Concurrent users | The number of users sending inference requests to the service at the same time. Handling concurrent users effectively is crucial in online inferencing to ensure performance, scalability, and reliability; it requires careful planning, resource management, and optimization of both the infrastructure and the inferencing algorithms. The number can vary depending on whether the application is small, medium, or large scale. |
Next Token Latency | Next token latency refers to the time it takes for a language model to generate the next token in a sequence. (Language models generate text token by token.) |
Token Size | A token typically represents a discrete unit of text, such as a word or character. For common English text, one token corresponds to approximately 4 characters or roughly ¾ of a word (for example, 100 tokens ≈ 75 words)[2]. |
Greedy Search | When generating sequences (such as sentences or text), greedy search selects the most probable next token at each step. This method is simple and computationally efficient, but it can produce suboptimal results because it cannot look ahead to future tokens (see the decoding sketch following this table). |
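To make the greedy-search setting concrete, the sketch below implements a manual greedy decoding loop with Hugging Face transformers; the model ID is an example and requires access to the gated Llama 2 weights.

```python
# Minimal manual greedy-decoding loop: at each step the single most probable
# next token (argmax over the vocabulary) is appended to the sequence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example; gated weights, access required
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

input_ids = tokenizer("Concurrent users in online inference are", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(32):                             # generate up to 32 new tokens
        logits = model(input_ids).logits[:, -1, :]  # scores for the next token
        next_id = logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```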
Conclusion
This study showcased the performance effectiveness of 5th Gen Xeon® processors in Dell PowerEdge servers for Large Language Model (LLM) tasks within the Red Hat OpenShift (RHOS) environment. Specifically, it focused on Meta LLAMA 2 models served with TorchServe to support concurrent users. Furthermore, the research emphasized the importance of selecting the right combination of server, processor, and software products to achieve scalability and improved performance.
For more details about offline inferencing performance with Intel 5th Gen Xeon® Processors on Dell PowerEdge R760, see this white paper: Driving GenAI Advancements: Dell PowerEdge R760 with the Latest 5th Gen Intel® Xeon® Scalable Processors.
References
- https://pytorch.org/serve/use_cases.html
- https://pytorch.org/serve/server.html
- https://pytorch.org/serve/large_model_inference.html
- https://docs.databricks.com/en/machine-learning/foundation-models/prov-throughput-run-benchmark.html
- https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.3.0%2Bcpu
[1] Modeling is based on the number of chatbot sessions that can be processed over a 24-hour period, using Llama2-7B INT8 concurrency data for 1024 input and output tokens, assuming next token latencies of ~100ms. Each session consists of four separate prompts to mimic a chatbot conversation. Based on provisioning for a single server, an average of 2.3% of the DAUs would be able to access the chatbot simultaneously at a given time without experiencing additional latency.
[2] https://platform.openai.com/tokenizer
Authors: Sharath Kumar, Marcin Hoffmann, Abi Prabhakaran, Mishali Naik, Esther Baldwin, Manya Rastogi, Edward Groden