Tao Zhang

Tao is a Senior Principal Engineering Technologist in the Hardware Architecture team of the ISG CTIO Office. He works on hardware feasibility evaluation and performance benchmarking for various accelerators and AI/ML solutions. In the past, he worked mostly on silicon and hardware design at several companies and institutes, including LSI Corporation, CERN, and Google. He is an IEEE Senior Member and co-author of the book Advanced Fiber Access Networks.




Running Llama 3.1 405B models on Dell PowerEdge XE9680

Khushboo Rathi | Tao Zhang | Sarah Griffin

Tue, 01 Oct 2024 18:31:21 -0000


Introduction

Meta recently introduced Llama 3.1 405B, its largest and most capable open-source language model, which enables the AI community to explore new frontiers in synthetic data generation, model distillation, and agent applications. Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. This openly available model sets a new standard for general knowledge, steerability, math, tool use, and multilingual translation across 8 supported languages (English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai). The Dell Technologies PowerEdge XE9680 server stands ready to smoothly host these powerful new models.

The Llama 3.1 405B release includes two versions: a pre-trained general LLM and an instruction fine-tuned LLM. The general version was trained with about 15 trillion tokens of publicly available data. Each version has open model weights released in BF16 precision and supports a model parallelism of 16, designed to run on two nodes with 8 GPUs each. Both the general and the instruction-tuned releases also include a version with model parallelism of 8 that uses FP8 dynamic quantization to run on a single 8-GPU node, as well as a third, compact version of each model with FP8 weights. All model versions use Grouped-Query Attention (GQA) for improved inferencing scalability. Table 1 lists the models that can be downloaded after obtaining Llama 3.1 405B model access by completing the required Meta AI license agreement. The models are also available on Dell Enterprise Hub.

Table 1: Llama 3.1 405B Available Versions [1]

Pre-trained versions:
  • meta-llama-3.1-405b-MP16 (multi-node)
  • meta-llama-3.1-405b (MP8)
  • meta-llama-3.1-405b-FP8 (Quantized)

Instruct versions:
  • meta-llama-3.1-405b-instruct-MP16 (multi-node)
  • meta-llama-3.1-405b-instruct (MP8)
  • meta-llama-3.1-405b-instruct-FP8 (Quantized)

Both sets share a context length of 128K, ~15T training tokens, and a model parallelism of 16 (MP16) or 8 (MP8 and FP8).


In this blog, we demonstrate the inference flow of the Llama 3.1 405B models on Dell PowerEdge XE9680 servers [2]. With 8x NVIDIA H100 GPUs connected through 900GB/s NVLink internal fabric, as well as the support for 400Gb/s InfiniBand and Spectrum-X cross-node communication, the XE9680 is an ideal platform for both single-node and multi-node LLM inferencing for this level of model size.

Experimental setup 

In our experiments, two Dell PowerEdge XE9680 servers are configured identically and connected through an InfiniBand (IB) fabric for multi-node inferencing. The flexibility of the XE9680 enables us to run two experiments: one running the FP8 instruct version on a single server, and another running the BF16 instruct model across two nodes. Table 2 details each system's configuration.

Table 2: Experimental Configuration for one Dell PowerEdge XE9680

Hardware
  • Compute server for inferencing: PowerEdge XE9680
  • GPUs: 8x NVIDIA H100 80GB 700W SXM5
  • Host Processor Model Name: Intel(R) Xeon 8468 (TDP 350W)
  • Host Processors per Node: 2
  • Host Processor Core Count: 48
  • Host Memory Capacity: 16x 64GB 4800 MT/s RDIMMs
  • Host Storage Type: SSD
  • Host Storage Capacity: 4x 1.92TB Samsung U.2 NVMe

Software
  • Operating System: Ubuntu 22.04.3 LTS
  • CUDA: 12.1
  • CUDNN: 9.1.0.70
  • NVIDIA Driver: 550.90.07
  • Framework: PyTorch 2.5.0

Single Node Inferencing on PowerEdge XE9680

We ran torchrun on one XE9680 using the FP8-quantized version of Llama 3.1 405B. Details on the FP8 version can be found in the official release paper from Meta [1]. The following steps describe how to run inference on this single system.

Step 1: Complete the download form for the Llama 3.1 405B Instruct FP8 model. A unique URL will be provided. Git clone the llama-models repo and run download.sh under the models/llama3_1 folder to start downloading the models.

Step 2:  Create and activate a conda environment.

  • conda create --name myenv python=3.10 (replace "myenv" with your desired environment name)
  • Activate the environment: conda activate myenv    

Step 3: Navigate to the model folder within the conda environment. Ensure conda environment matches the CUDA, CUDNN, and NVIDIA driver versions listed in Table 2. Install required libraries from requirements.txt. 

  • cd path/to/model/folder (replace "path/to/model/folder" with the actual path)
  • pip install -r requirements.txt

Step 4: To verify the deployment with torchrun, create run_chat_completion.sh as shown below. We use example_chat_completion.py from the llama Git repository as the test example within the script. For FP8 inference, the FBGEMM library is imported for efficient execution [3]. We also use the environment variables and settings detailed in the PyTorch distributed communication package documentation [4].

run_chat_completion.sh
 
#!/bin/bash

set -euo pipefail
set -x

# Run from the top of the llama repository so relative paths resolve.
cd "$(git rev-parse --show-toplevel)"

MASTER_ADDR=$1      # IP address of the master (rank-0) node
NODE_RANK=$2        # rank of this node: 0 for single node / master, 1 for the second node
CKPT_DIR=$3         # path to the downloaded model checkpoint
TOKENIZER_PATH=$4   # path to the tokenizer
NNODES=$5           # total number of nodes (1 or 2)
NPROC=$6            # processes (GPUs) per node
RUN_ID=$7           # "fp8" enables FP8 inference

if [ "$RUN_ID" = "fp8" ]; then
    ENABLE_FP8="--enable_fp8"
else
    ENABLE_FP8=""
fi

NCCL_NET=Socket NCCL_SOCKET_IFNAME=eno8303 TIKTOKEN_CACHE_DIR="" \
        torchrun \
         --nnodes=$NNODES --nproc_per_node=$NPROC \
         --node_rank=$NODE_RANK \
         --master_addr=$MASTER_ADDR \
         --master_port=29501 \
         example_chat_completion.py $CKPT_DIR $TOKENIZER_PATH $ENABLE_FP8

Step 5: Execute run_chat_completion.sh with the required arguments. Example values are shown below.

bash run_chat_completion.sh $MASTER_HOST $NODE_RANK $LOCAL_CKPT_DIR $TOKENIZER_PATH $NNODES $NPROCS $RUN_ID

Example Usage:

MASTER_HOST=192.x.x.x
NODE_RANK=0
LOCAL_CKPT_DIR=<model_checkpoint_dir_location>
TOKENIZER_PATH=<model_tokenizer_dir_location>
NNODES=1
NPROCS=8
RUN_ID=fp8

Figure 1 shows sample output after running the inference on the XE9680.

Figure 1: XE9680 display of inference output of Llama 3.1 405B

Multi-Node Inferencing on 2 PowerEdge XE9680

Running the Llama 3.1 405B Instruct model at BF16 precision requires more GPU memory than the 640GB available on a single XE9680 with eight NVIDIA H100 GPUs. Fortunately, the XE9680 is flexible and enables the deployment of large-scale models that must be distributed across multiple GPUs to meet these demanding memory requirements. We recommend a multi-node inferencing setup using two XE9680 nodes, with a total of 16 GPUs, connected via a 400 Gb/s high-speed InfiniBand (IB) network, which ensures seamless communication and maximum throughput. Within each PowerEdge XE9680, high-bandwidth NVLink enables tensor parallelism. In our experiment, the model consumed roughly 1TB of GPU memory across the two XE9680 servers during inferencing.
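As a rough sanity check (not part of the original test procedure), the arithmetic below illustrates why the BF16 weights alone already exceed a single node's GPU memory; the figures are approximate and ignore activations.

# Back-of-the-envelope estimate of why the BF16 405B model needs more than one
# 8x 80GB node: the weights alone exceed a single XE9680's total HBM capacity.
params = 405e9                  # ~405 billion parameters
bytes_per_param_bf16 = 2        # BF16 stores each parameter in 2 bytes

weight_gb = params * bytes_per_param_bf16 / 1e9
node_hbm_gb = 8 * 80            # 8x NVIDIA H100 80GB per XE9680

print(f"BF16 weights: {weight_gb:.0f} GB")        # ~810 GB
print(f"One XE9680:   {node_hbm_gb} GB")          # 640 GB
print(f"Two XE9680s:  {2 * node_hbm_gb} GB")      # 1280 GB
# The KV cache and activations add to the weights, which is consistent with the
# ~1TB peak usage observed across the two nodes during inferencing.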

Multi-node inferencing follows steps similar to those of single-node inferencing.

Step 1: Download the Llama3.1-405B-instruct-MP16 model.

Step 2: Follow steps 2 to 4 from the single-node inferencing section above to create a conda environment on each XE9680 node. Use example_chat_completion.py as the test example within run_chat_completion.sh.

Step 3: Execute torchrun separately on both nodes. Assign one node as the master ($MASTER_HOST) and set its $NODE_RANK to 0. On the second node, use the same $MASTER_HOST IP address and set its $NODE_RANK to 1, as outlined below.

bash run_chat_completion.sh $MASTER_HOST $NODE_RANK $LOCAL_CKPT_DIR $TOKENIZER_PATH $NNODES $NPROCS $RUN_ID

Example Usage on Node1:
MASTER_HOST=192.x.x.x
NODE_RANK=0
LOCAL_CKPT_DIR=<model_checkpoint_dir_location>
TOKENIZER_PATH=<model_tokenizer_dir_location>
NNODES=2
NPROCS=8
RUN_ID=bf16

Example Usage on Node2:
MASTER_HOST=192.x.x.x
NODE_RANK=1
LOCAL_CKPT_DIR=<model_checkpoint_dir_location>
TOKENIZER_PATH=<model_tokenizer_dir_location>
NNODES=2
NPROCS=8
RUN_ID=bf16

Note that any RUN_ID value other than "fp8" runs the model at BF16 precision.

Conclusion

Dell’s PowerEdge XE9680 server is purpose-built to excel at the most demanding AI workloads, including the latest Llama 3.1 405B releases. With the best available GPUs and high-speed in-node and cross-node network fabrics, the XE9680 is an ideal platform flexible enough for both single-node and multi-node inferencing. In this blog, we demonstrated both single-node and multi-node inferencing by deploying the best-in-class open-source Llama 3.1 405B models, and our step-by-step guide provides clear instructions for replication. Stay tuned for our next blog posts, where we dive into use cases such as synthetic data generation and LLM distillation and share performance metrics of these models on the XE9680 platform.

References

[1] The Llama 3 Herd of Models: https://arxiv.org/pdf/2407.21783

[2] https://www.dell.com/en-us/shop/ipovw/poweredge-xe9680

[3] https://github.com/meta-llama/llama-stack/blob/main/llama_toolchain/inference/quantization

[4] Distributed communication package - torch.distributed — PyTorch 2.4 documentation

Authors: 
Khushboo Rathi (khushboo.rathi@dell.com) ; 
Tao Zhang (tao.zhang9@dell.com) ; 
Sarah Griffin (sarah_g@dell.com)

 



Deploying Llama 8B Model with Advanced Quantization Techniques on Dell Server

Anamika Chatterjee | Tao Zhang | Onur Celebioglu | Saurabh Jha

Tue, 06 Aug 2024 19:16:24 -0000


Introduction

Large language models (LLMs) have shown excellent performance on various tasks, but their large size makes them difficult to deploy in resource-constrained environments. The LLM evolution diagram in Figure 1 shows the popular pre-trained models since 2017, most of which are based on the transformer architecture [1] [2]. The timeline shows a clear trend toward larger and more open-source models. Open-source models boosted the popularity of LLMs by eliminating the training cost associated with large-scale infrastructure and the long training time required to create them. Another portion of the cost of LLM applications comes from deployment, where an efficient inference platform is required.

Meta Llama models have become one of the most powerful open-source Large Language Model (LLM) series. Notably, the recently released Llama 3 models achieved impressive performance across various benchmarks due, in part, to the highly curated 15T tokens of pre-training data [3]. Despite their impressive performance, deploying Llama 3 models with high throughput can still pose significant challenges because of the large model size and the limited hardware resources available in many scenarios. Fortunately, low-bit quantization has emerged as one of the most popular techniques for compressing LLMs. This technique reduces the memory and computational requirements of LLMs during inference, enabling them to run on resource-limited devices and potentially boosting performance.

This blog focuses on how to deploy LLMs efficiently on Dell servers with different quantization techniques. To better understand the impact of quantization, we first benchmarked model accuracy under different quantization techniques. We then measured, through experiments, the throughput and memory requirements of running LLMs under those techniques. Specifically, we chose the open-source model Meta-Llama-3-8B-Instruct [4] for its popularity. Given that the Llama-2-7B and Llama-3.1-8B models have a similar architecture and comparable size, the conclusions based on the Llama-3-8B experiments in this blog also apply to those models. The server used was a Dell PowerEdge XE9680 with NVIDIA H100 GPUs [5] [6]. The deployment framework in the experiments is TensorRT-LLM, which enables different quantization techniques, including the advanced 4-bit quantization demonstrated in this blog [7].

 

Figure 1. LLM evolution

Background

One key approach to model compression is quantization, which converts full- or half-precision floating-point numbers into simpler forms such as integers. This conversion cuts down the memory footprint the model needs. Quantization intrinsically introduces some accuracy loss; most quantization techniques implement algorithms to minimize this loss so that model quality degrades only negligibly after quantization.

In this blog, we focused on post-training quantization (PTQ) under the TensorRT-LLM inferencing framework. PTQ entails quantizing the parameters of a Large Language Model (LLM) after it has completed the training phase. The primary objective of PTQ is to reduce the memory requirements of the LLM without modifying the LLM architecture or requiring a re-training process.
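To make the idea concrete, here is a minimal, self-contained sketch of symmetric INT8 post-training quantization of a single weight matrix in PyTorch; it is illustrative only and is not the algorithm used by TensorRT-LLM or by AWQ/GPTQ/SmoothQuant.

import torch

def quantize_int8_per_tensor(w: torch.Tensor):
    # Symmetric PTQ: one step size (scale) maps the float range onto INT8.
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    # Recover an approximate float tensor for use in the matmul.
    return q.float() * scale

w = torch.randn(4096, 4096)              # a hypothetical weight matrix
q, scale = quantize_int8_per_tensor(w)
w_hat = dequantize(q, scale)

print("max abs error:", (w - w_hat).abs().max().item())
print("memory: fp32 %.0f MB -> int8 %.0f MB"
      % (w.numel() * 4 / 2**20, q.numel() / 2**20))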

Test setup

Meta-Llama-3-8B-Instruct [4] is an instruction-tuned version of the base 8B model meta-llama/Meta-Llama-3-8B [10]. It has been optimized for dialogue applications and was fine-tuned on over 10 million human-annotated data samples with a combination of rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO) [18]. We downloaded the bfloat16 model from Hugging Face and used it as the baseline for comparison with the quantized models.

We investigated four main advanced quantization techniques, along with KV cache quantization options, and compared them against the baseline bfloat16 model. As shown in Table 1, for weight-only quantization we explored AWQ (activation-aware weight quantization) W4A16 [11] and GPTQ (Accurate Post-Training Quantization for Generative Pre-trained Transformers) W4A16 [12]. For weight-activation quantization, we explored FP8 W8A8 [13], AWQ W4A8 [17], and SmoothQuant W8A8 [14]. We ran the experiments under the TensorRT-LLM framework, which integrates a toolkit that allows quantization and deployment of the mainstream LLM models.

Table 1. List of quantization techniques under investigation

Quantization Technique | Precision of Weights | Precision of Activations | Precision of KV Cache
AWQ W4A16 | INT4 | FP16 | FP16
AWQ W4A16 with INT8 KV Cache | INT4 | FP16 | INT8
AWQ W4A8 | INT4 | FP8 | FP16
GPTQ W4A16 | INT4 | FP16 | FP16
FP8 W8A8 | FP8 | FP8 | FP16
SmoothQuant W8A8 | INT8 | INT8 | FP16

We also investigated SmoothQuant with different quantization granularities: per-tensor and per-token + per-channel quantization. Per-tensor quantization applies a single quantization step size to the entire matrix. For better precision, fine-grained quantization can be enabled by using different quantization step sizes for each token's activations (per-token quantization) or for each output channel of the weights (per-channel quantization) [14].
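The difference between the two granularities can be sketched in a few lines of PyTorch; the shapes are illustrative and this is not the SmoothQuant implementation, only a demonstration of how the scaling factors are derived.

import torch

x = torch.randn(8, 4096)       # activations: 8 tokens x 4096 channels (illustrative)
w = torch.randn(4096, 4096)    # weights: 4096 output channels x 4096 inputs

# Per-tensor: a single step size shared by the entire matrix.
scale_x_tensor = x.abs().max() / 127.0

# Finer granularity: one step size per token's activations (rows of x) and
# one per output channel of the weights (rows of w).
scale_x_token   = x.abs().amax(dim=1, keepdim=True) / 127.0     # shape [8, 1]
scale_w_channel = w.abs().amax(dim=1, keepdim=True) / 127.0     # shape [4096, 1]

def int8(t, s):
    return torch.clamp(torch.round(t / s), -128, 127)

# Mean quantization error of the activations under each granularity.
err_tensor = (x - int8(x, scale_x_tensor) * scale_x_tensor).abs().mean()
err_token  = (x - int8(x, scale_x_token) * scale_x_token).abs().mean()
print(f"per-tensor activation error: {err_tensor.item():.5f}")
print(f"per-token  activation error: {err_token.item():.5f}")   # typically smaller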

For the accuracy evaluation across models with different quantization techniques, we chose the Massive Multitask Language Understanding (MMLU) dataset. The benchmark covers 57 subjects across different difficulty levels for both world-knowledge and problem-solving tests [8]. The granularity and breadth of the subjects in the MMLU dataset allow us to evaluate model accuracy across different applications. To summarize the results more concisely, the 57 subjects can be further grouped into 21 categories, or even into 4 main categories: STEM, humanities, social sciences, and others (business, health, misc.) [9].

Performance is evaluated in terms of throughput (tokens/sec) across different batch sizes on the Dell XE9680 server. In this blog, we do not investigate the impact of GPU parallelism but focus on quantization itself, so we use only one of the 8 available H100 GPUs for the experiments. The XE9680 server configuration and the high-level specification of the H100 are shown in Tables 2 and 3 [5] [6]. We tested four combinations of input sequence length and output length, 128/128, 128/2048, 2048/128, and 2048/2048, to represent different use cases.
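For reference, the throughput metric reported below is simply the number of generated tokens divided by the end-to-end generation time; the helper below is a hypothetical illustration of that bookkeeping, not part of the TensorRT-LLM benchmark scripts.

def tokens_per_second(batch_size: int, output_len: int, elapsed_s: float) -> float:
    # Throughput counts every generated token across the whole batch.
    return batch_size * output_len / elapsed_s

# Example: a batch of 16 requests, 2048 output tokens each, finishing in 60 seconds.
print(tokens_per_second(batch_size=16, output_len=2048, elapsed_s=60.0))   # ~546 tokens/s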

Table 2. XE9680 server configuration

System Name: PowerEdge XE9680
Status: Available
System Type: Data Center
Number of Nodes: 1
Host Processor Model: 4th Generation Intel® Xeon® Scalable Processors
Host Processor Name: Intel® Xeon® Platinum 8470
Host Processors per Node: 2
Host Processor Core Count: 52
Host Processor Frequency: 2.0 GHz, 3.8 GHz Turbo Boost
Host Memory Capacity and Type: 2TB, 32x 64 GB DIMM, 4800 MT/s DDR5
Host Storage Capacity: 1.8 TB, NVMe

Table 3. H100 high-level specification

GPU Architecture: NVIDIA Hopper
GPU Number, Memory, and Name: 1x 80GB H100
GPU Memory Bandwidth and Type: 2TB/s, HBM3
Max Power Consumption: Up to 700W (configurable)
Form Factor: SXM

The inference framework that includes different quantization tools is NVIDIA TensorRT-LLM release version 0.11.0. The operating system for the experiments is Ubuntu 22.04 LTS.

To achieve optimized performance, a TensorRT engine needs to be built with the appropriate options for the H100 GPU under the TensorRT-LLM framework. A brief introduction to those options and recommendations can be found in [15] [16].

Results

We first show the model accuracy results based on the MMLU dataset tests, then the performance results when running those models on PowerEdge XE9680. Lastly, we show the actual peak memory usage for different scenarios. Brief discussions are given for each result. The conclusions are summarized in the next section.

Accuracy

Figure 2. MMLU 4-category accuracy test result

Figure 2 shows the accuracy test results of 4 main MMLU categories for the Meta-Llama-3-8B-Instruct model. Compared to the baseline bfloat16 model, all int4 quantized models (AWQ W4A16, GPTQ W4A16, AWQ W4A8) exhibit an overall accuracy drop of approximately 2%. In contrast, the accuracy of the FP8 quantized model is much closer to that of the bfloat16 model. Additionally, the SmoothQuant (W8A8) model with per token + per channel quantization achieves accuracy levels close to the bfloat16 model, while the SmoothQuant model with per tensor quantization shows the worst performance among all. Notably, the AWQ W4A16 model with a quantized KV cache (int8) does not experience any accuracy degradation from the test, maintaining accuracy comparable to the native AWQ W4A16 model.

Figure 3. MMLU 21-category accuracy test result

Figure 3 further shows the accuracy test results of 21 MMLU sub-categories for the Meta-Llama-3-8B-Instruct model. Similar conclusions can be drawn that AWQ quantized models performed worse than bfloat16 across all subjects, except for chemistry and engineering, where AWQ W4A16 and AWQ W4A8 outperformed bfloat16. GPTQ did not surpass bfloat16 in general, except in math, chemistry, engineering, and philosophy, where it shows similar accuracy. GPTQ outperformed or matched AWQ in most subjects, except for economics, other, history, and politics. The FP8 model performed on par with bfloat16 in most areas, except health, chemistry, computer science, and history, while exceeding bfloat16 in economics and philosophy. SmoothQuant with per-tensor quantization had the worst performance among all, whereas SmoothQuant with per-token and per-channel quantization performed comparably to bfloat16 in all subjects except biology, history, and geography. Overall, AWQ and GPTQ experienced an approximate 2% accuracy drop compared to bfloat16, while FP8 and SmoothQuant with per-token and per-channel quantization techniques maintained the accuracy close to bfloat16.

Performance

Figure 4. Throughput test result

Figure 5. Throughput at small batch sizes

Figure 4 shows the throughput numbers when running Meta-Llama-3-8B-Instruct with different batch sizes, input lengths, output lengths, and quantization methods on the XE9680 server. Figure 5 shows the throughput benefits at smaller batch sizes. From the diagrams, we can observe that AWQ W4A16 achieves ~1.4x to 1.9x higher throughput for smaller batch sizes (<=4). Similarly, AWQ W4A16 with a quantized KV cache (INT8) achieves a throughput increase of about 1.3x to 1.8x for smaller batch sizes (<=4). However, its overall throughput across all batch sizes is lower than that of AWQ W4A16 without a quantized KV cache. This is because TensorRT-LLM maintains one KV cache per Transformer layer, necessitating as many KV caches as there are layers in the model. When INT8 or FP8 KV caches are enabled, input values must be quantized to 8 bits using a scaling factor, and during generation these values are dequantized on the fly in the MHA/MQA kernel, making this dequantization step compute-bound for every layer [16].
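The sketch below, which is illustrative rather than the TensorRT-LLM kernel, shows what an 8-bit KV cache trades off: each cached tensor shrinks to half the FP16 size, but a dequantization step has to run inside attention at every layer and every generation step.

import torch

def quantize_kv(kv: torch.Tensor):
    # Store K/V cache entries as INT8 plus a per-tensor scaling factor.
    scale = kv.float().abs().max() / 127.0
    q = torch.clamp(torch.round(kv.float() / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor, dtype=torch.float16):
    # In a real engine this happens on the fly inside the attention kernel,
    # once per layer for every generated token.
    return q.to(dtype) * scale.to(dtype)

# One layer's cached keys: [batch, heads, seq_len, head_dim] (illustrative shapes).
k = torch.randn(4, 32, 2048, 128, dtype=torch.float16)
k_int8, k_scale = quantize_kv(k)
k_restored = dequantize_kv(k_int8, k_scale)

print("cache MB fp16:", k.numel() * 2 / 2**20)        # 2 bytes per value
print("cache MB int8:", k_int8.numel() / 2**20)       # half the memory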

AWQ W4A8 achieves ~ 1.3x to 1.9x higher throughput than bfloat16 across all batch sizes but has lower throughput than AWQ W4A16 for smaller batch sizes (<=4). FP8 outperforms bfloat16 across all batch sizes, achieving throughput gains of about 1.3x to 1.5x, but does not outperform AWQ and GPTQ for smaller batch sizes (<=4). GPTQ outperforms bfloat16 for smaller batch sizes (<=4), achieving throughput gains of approximately 1.4x to 1.7x, and performs slightly better than AWQ W4A16 across all batch sizes except when batch size is 1.

SmoothQuant (W8A8) outperforms bfloat16 across all batch sizes, achieving throughput gains of approximately 1.1x to 1.3x, but does not outperform AWQ or GPTQ for smaller batch sizes (<=4), and does not surpass FP8 at any batch size. SmoothQuant with per-tensor quantization performs slightly better than SmoothQuant with per-token and per-channel quantization, but it has the worst accuracy among all techniques, as illustrated in the MMLU accuracy test result figures.

GPU memory usage

Figure 6. Peak GPU memory usage

Figure 6 shows the peak GPU memory usage when running Meta-Llama-3-8B-Instruct with different batch sizes, input lengths, output lengths, and quantization methods on the XE9680 server. The results indicate that 4-bit quantization techniques significantly reduce the memory required to run the model. Compared to the memory needed for the baseline bfloat16 model, quantized models using AWQ or GPTQ require half the memory or even less, depending on the batch size. The AWQ model with a quantized KV cache shows lower memory consumption than the native AWQ model due to the reduced precision of the cache values. SmoothQuant consumes less memory than bfloat16, but its footprint is greater than that of the INT4 and FP8 quantized models. FP8 also consumes significantly less memory than bfloat16, and at larger batch sizes (>=32) it uses less memory than AWQ and GPTQ.

Conclusions

Table 4. Summary of different quantization techniques

Quantization Technique | Accuracy vs bfloat16 | Throughput vs bfloat16 | Memory vs bfloat16
AWQ W4A16 | 2.2% down | ~1.4x to 1.9x higher for smaller batch sizes (<=4) | ~1.2x to 2.5x memory saving
AWQ W4A16 with INT8 KV Cache | 2.2% down | ~1.3x to 1.8x higher for smaller batch sizes (<=4) | ~1.6x to 2.5x memory saving
AWQ W4A8 | 2.6% down | ~1.3x to 1.9x higher for smaller batch sizes (<=4) | ~1.2x to 2.5x memory saving
GPTQ W4A16 | 1.9% down | ~1.4x to 1.7x higher for all batch sizes | ~1.2x to 2.4x memory saving
FP8 W8A8 | 0.1% down | ~1.3x to 1.5x higher for all batch sizes | ~1.3x to 1.7x memory saving
SmoothQuant W8A8 per-tensor | 27.6% down | ~1.1x to 1.5x higher for all batch sizes | ~1.1x to 1.6x memory saving
SmoothQuant W8A8 per-token+per-channel | 0.4% down | ~1.1x to 1.3x higher for all batch sizes | ~1.1x to 1.6x memory saving

Table 4 summarizes the experiment results. In terms of performance, we can conclude that for smaller batch sizes (<=4), AWQ W4A16 and GPTQ are the preferred options. AWQ W4A16 with a quantized KV cache has lower throughput but greater memory savings than AWQ W4A16 without it. For large-batch inference scenarios, such as serving scenarios (batch size >=8), AWQ W4A8, FP8, and SmoothQuant are the preferred options. Specifically, FP8 shows the best performance in terms of both throughput and accuracy, followed by SmoothQuant and AWQ W4A8. SmoothQuant with per-token + per-channel quantization has better accuracy than AWQ W4A8, whereas SmoothQuant with per-tensor quantization has the worst accuracy of all. AWQ and GPTQ experienced an approximate 2% accuracy drop compared to bfloat16, while FP8 and SmoothQuant with per-token and per-channel quantization maintained accuracy close to bfloat16. For some subjects, one quantization technique has better accuracy than another, which indicates that the choice of quantization technique should take into account the specific subjects tied to the target application.

Note: these results are based on the Llama-3-8B model running on an XE9680 server with the TensorRT-LLM framework. The conclusions are valid for models with similar architecture and size under the same setup; they may vary with a different model architecture or size, hardware, or deployment framework. One should therefore experiment to find the optimal quantization and deployment techniques under the given hardware and application constraints.

References

[1] https://infohub.delltechnologies.com/en-us/p/deploying-llama-7b-model-with-advanced-quantization-techniques-on-dell-server/

[2] A. Vaswani et al., “Attention Is All You Need,” https://arxiv.org/abs/1706.03762

[3] https://ai.meta.com/blog/meta-llama-3/

[4] https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

[5] https://www.dell.com/en-us/shop/ipovw/poweredge-xe9680

[6] https://www.nvidia.com/en-us/data-center/h100/

[7] https://github.com/NVIDIA/TensorRT-LLM

[8] D. Hendrycks et al., “Measuring Massive Multitask Language Understanding,” https://arxiv.org/abs/2009.03300

[9] https://github.com/hendrycks/test/blob/master/categories.py

[10] https://huggingface.co/meta-llama/Meta-Llama-3-8B

[11] J. Lin et al., “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration,” https://arxiv.org/abs/2306.00978

[12] E. Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers,” https://arxiv.org/abs/2210.17323

[13] P. Micikevicius et al., “FP8 Formats for Deep Learning,” https://arxiv.org/pdf/2209.05433

[14] G. Xiao et al., “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models,” https://arxiv.org/abs/2211.10438

[15] https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.html

[16] https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html

[17] https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_choosing_quant_methods.html

[18] https://huggingface.co/blog/llama3



Running Meta Llama 3 Models on Dell PowerEdge XE9680

Tao Zhang | Khushboo Rathi | Onur Celebioglu

Tue, 18 Jun 2024 15:42:05 -0000


Introduction

Recently, Meta open-sourced its Meta Llama 3 text-to-text models in 8B and 70B sizes, which are the highest-scoring open-source LLMs in their size ranges so far in terms of response quality [1]. In this blog, we run these models on the Dell PowerEdge XE9680 server to show their performance and improvement compared to the Llama 2 models.

Open-sourcing Large Language Models (LLMs) enables easy access to this state-of-the-art technology and has accelerated both innovation in the field and adoption across different applications. As shown in Table 1, this round of the Llama 3 release includes five models in total: two pre-trained models with sizes of 8B and 70B, their instruction-tuned versions, plus a safeguard version of the 8B model [2].

Table 1: Released Llama 3 Models

Model Size (Parameters) | Model Names | Context Length | Training Tokens | Vocabulary Length
8B | Meta-Llama-3-8B; Meta-Llama-3-8B-Instruct; Meta-Llama-Guard-2-8B | 8K | 15T | 128K
70B | Meta-Llama-3-70B; Meta-Llama-3-70B-Instruct | 8K | 15T | 128K

Llama 3 is trained on 15T tokens, 7.5x the number of tokens on which Llama 2 was trained. Training with large, high-quality datasets and refined post-training processes improved the Llama 3 models' capabilities, such as reasoning, code generation, and instruction following. Evaluated across the main accuracy benchmarks, the Llama 3 models not only exceed their predecessors but also lead other main open-source models by significant margins. The Llama 3 70B instruct model shows results close to, or even better than, commercial closed-source models such as Gemini Pro [1].

The model architecture of Llama 3 8B is similar to that of Llama 2 7B, with one significant difference. Besides a larger parameter count, the Llama 3 8B model uses the grouped-query attention (GQA) mechanism instead of the multi-head attention (MHA) mechanism used in the Llama 2 7B model. Unlike MHA, which has the same number of Q (query), K (key), and V (value) matrices, GQA reduces the number of K and V matrices required by sharing the same KV matrices across grouped Q matrices. This reduces the memory required and improves computing efficiency during inferencing [3]. In addition, the Llama 3 models increase the max context window length to 8192, compared to 4096 for the Llama 2 models. Llama 3 uses a new tokenizer called tiktoken that expands the vocabulary size to 128K, compared to the 32K used in Llama 2. The new tokenization scheme offers improved token efficiency, yielding up to 15% fewer tokens compared to Llama 2 based on Meta's benchmark [1].
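A minimal PyTorch sketch of the GQA idea follows; it is not Meta's implementation, and the shapes are illustrative, but the head counts (32 query heads sharing 8 KV heads) match the published Llama 3 8B configuration.

import torch

batch, seq, head_dim = 1, 16, 128
n_heads, n_kv_heads = 32, 8          # Llama 3 8B: 32 query heads, 8 KV heads
group = n_heads // n_kv_heads        # 4 query heads share each KV head

q = torch.randn(batch, n_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Broadcast each K/V head to its group of query heads before attention.
k_exp = k.repeat_interleave(group, dim=1)       # [batch, n_heads, seq, head_dim]
v_exp = v.repeat_interleave(group, dim=1)

attn = torch.softmax(q @ k_exp.transpose(-2, -1) / head_dim**0.5, dim=-1)
out = attn @ v_exp
print(out.shape)                                # torch.Size([1, 32, 16, 128])

# The K/V tensors (and therefore the KV cache) are 4x smaller than with MHA.
print("KV elements, MHA vs GQA:",
      2 * batch * n_heads * seq * head_dim,
      2 * batch * n_kv_heads * seq * head_dim)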

This blog focuses on the inference performance tests of the Llama 3 models running on the Dell PowerEdge XE9680 server, especially, the comparison with Llama 2 models, to show the improvements in the new generation of models.

Test setup

The server we used to benchmark the performance is the PowerEdge XE9680 with 8x H100 GPUs[4]. The detailed server configurations are shown in Table 2.

Table 2: XE9680 server configuration

System Name: PowerEdge XE9680
Status: Available
System Type: Data Center
Number of Nodes: 1
Host Processor Model: 4th Generation Intel® Xeon® Scalable Processors
Host Processor Name: Intel® Xeon® Platinum 8470
Host Processors per Node: 2
Host Processor Core Count: 52
Host Processor Frequency: 2.0 GHz, 3.8 GHz Turbo Boost
Host Memory Capacity and Type: 2TB, 32x 64 GB DIMM, 4800 MT/s DDR5
Host Storage Capacity: 1.8 TB, NVMe
GPU Number and Name: 8x H100
GPU Memory Capacity and Type: 80GB, HBM3
GPU High-speed Interface: PCIe Gen5 / NVLink Gen4

The XE9680 is the ideal server, optimized for AI workloads with its 8x NVSwitch interconnected H100 GPUs. The high-speed NVLink interconnect allows deployment of large models like Llama 3 70B that need to span multiple GPUs for best performance and memory capacity requirements. With its 10 PCIe slots, the XE9680 also provides a flexible PCIe architecture that enables a variety of AI fabric options.

For these tests, we deployed the Llama 3 models Meta-Llama-3-8B and Meta-Llama-3-70B, and the Llama 2 models Llama-2-7b-hf and Llama-2-70b-hf. These models are available for download from Hugging Face once access is approved by Meta. For a fair comparison between the Llama 2 and Llama 3 models, we ran the models in their native precision (float16 for Llama 2 models and bfloat16 for Llama 3 models) rather than any quantized precision.

Given that it shares the same basic model architecture as Llama 2, Llama 3 can easily be integrated into any software ecosystem that currently supports Llama 2. For the experiments in this blog, we chose the latest NVIDIA TensorRT-LLM release (version 0.9.0) as the inference framework. The NVIDIA CUDA version was 12.4 and the driver version was 550.54.15. The operating system for the experiments was Rocky Linux 9.1.

Knowing that Llama 3 improved accuracy significantly, we first concentrated on the inferencing speed tests. More specifically, we tested the Time-to-First-Token (TTFT) and throughput over different batch sizes for both the Llama 2 and Llama 3 models, as shown in the Results section. To make the comparison between the two generations of models easy, and to mimic a summarization task, we kept the input and output token lengths at 2048 and 128 respectively for most of the experiments. We also measured the throughput of Llama 3 with the long input token length (8192), as this is one of its most significant improvements. Because H100 GPUs support the FP8 data format with negligible accuracy loss, we measured the throughput under the long input token length for the Llama 3 model at FP8 precision.
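As a rough illustration of the two metrics (this is not the TensorRT-LLM benchmark harness, and it assumes a Hugging Face-style generate() API), TTFT can be approximated by timing the generation of a single new token, and throughput by dividing the total generated tokens by the full generation time:

import time
import torch

def measure(model, input_ids, output_len):
    # TTFT ~= prefill plus one decoding step.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    model.generate(input_ids, max_new_tokens=1)
    torch.cuda.synchronize()
    ttft = time.perf_counter() - t0

    # Throughput = generated tokens / end-to-end generation time.
    t0 = time.perf_counter()
    out = model.generate(input_ids, max_new_tokens=output_len)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    new_tokens = (out.shape[-1] - input_ids.shape[-1]) * input_ids.shape[0]
    return ttft, new_tokens / elapsed       # seconds, tokens per second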

Results

Figure 1: Inference speed comparison: Llama-3-70b vs Llama-2-70b: Time-to-First-Token

Figure 2: Inference speed comparison: Llama-3-70b vs Llama-2-70b: Throughput

Figures 1 and 2 show the inference speed comparison with the 70b Llama 2 (Llama-2-70b) and Llama 3 (Llama-3-70b) models running across eight H100 GPUs in a tensor parallel (TP=8) fashion on an XE9680 server. From the test results, we can see that for both TTFT (Figure 1) and throughput (Figure 2), the Llama 3 70B model has a similar inference speed to the Llama 2 70b model. This is expected given the same size and architecture of the two models. So, by deploying Llama 3 instead of Llama 2 on an XE9680, organizations can immediately see a big boost in accuracy and quality of responses, using the same software infrastructure, without any impact to latency or throughput of the responses.

Figure 3: Inference speed comparison: Llama-3-8b vs Llama-2-7b: Time-to-First-Token

Figure 4: Inference speed comparison: Llama-3-8b vs Llama-2-7b: Throughput

Figures 3 and 4 show the inference speed comparison of the 7B Llama 2 (Llama-2-7b) and 8B Llama 3 (Llama-3-8b) models running on a single H100 GPU in an XE9680 server. From the results, we can see the benefits of grouped-query attention (GQA) in the Llama 3 8B architecture: it reduces the GPU memory footprint in the prefill stage and speeds up computation in the decoding stage of LLM inferencing. Figure 3 shows that Llama 3 8B has a similar response time in generating the first token even though it is a roughly 15% larger model than Llama-2-7b. Figure 4 shows that Llama-3-8b has higher throughput than Llama-2-7b when the batch size is 4 or larger, and the benefits of GQA grow as the batch size increases. We can see from the experiments that:

  • Llama-2-7b cannot run at a batch size of 64 or larger with 16-bit precision and the given input/output token lengths, because of an OOM (out of memory) error
  • Llama-3-8b can run at a batch size of 128, which gives more than 2x the throughput with the same hardware configuration (a rough memory estimate illustrating why is sketched below)
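The rough estimate below (weights plus KV cache only, activations and framework overhead ignored; the architecture parameters are the published ones for the two models) shows why GQA makes the larger batch size possible on a single 80GB H100:

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_val=2):
    # K and V are both cached for every layer, token, and KV head (16-bit values).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_val / 1e9

seq = 2048 + 128   # input + output tokens used in these tests

# Llama-2-7b: MHA, 32 layers, 32 KV heads, head_dim 128; ~13 GB of fp16 weights.
print(13 + kv_cache_gb(32, 32, 128, seq, batch=64))    # ~86 GB -> exceeds 80 GB, OOM
# Llama-3-8b: GQA, 32 layers, 8 KV heads, head_dim 128; ~16 GB of bf16 weights.
print(16 + kv_cache_gb(32, 8, 128, seq, batch=128))    # ~53 GB -> fits on one H100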

Figure 5: Llama-3-70b throughput under 8192 input token length

Another improvement in the Llama 3 models is support for a max input token length of 8192, twice that of the Llama 2 models. We tested it with the Llama-3-70b model running on 8 H100 GPUs of the XE9680 server; the results are shown in Figure 5. The throughput increases with the batch sizes tested and reaches 271 tokens/s at a batch size of 16, indicating that more aggressive quantization techniques can further improve the throughput.

Conclusion

In this blog, we investigated the recently released Llama 3 models by comparing their inferencing speed with that of the Llama 2 models on a Dell PowerEdge XE9680 server. With the numbers collected through experiments, we showed that not only is the Llama 3 model series a big leap in terms of response quality, it also has great inferencing advantages in terms of high throughput with a large achievable batch size and a long input token length. This makes the Llama 3 models great candidates for long-context and offline processing applications.

Authors: Tao Zhang, Khushboo Rathi, and Onur Celebioglu

[1] Meta AI, “Introducing Meta Llama 3: The most capable openly available LLM to date,” https://ai.meta.com/blog/meta-llama-3/

[3] J. Ainslie et al., “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints,” https://arxiv.org/abs/2305.13245



MLPerf™ Inference 4.0 on Dell PowerEdge Server with Intel® 5th Generation Xeon® CPU

Tao Zhang | Bhavesh Patel | Nirmala Sundararajan | Jay Engh | Yuning Qiu | Louie Tsai | Taran Iyengar | Ramesh Chukka

Mon, 22 Apr 2024 05:40:38 -0000


Introduction

In this blog, we present the MLPerf™ v4.0 Data Center Inference results obtained on a Dell PowerEdge R760 with the latest 5th Generation Intel® Xeon® Scalable Processors (a CPU-only system).

These new Intel® Xeon® processors use an Intel® AMX matrix multiplication engine in each core to boost overall inferencing performance. With a focus on ease of use, Dell Technologies delivers exceptional CPU performance results out of the box with an optimized BIOS profile that fully unleashes the power of Intel's oneDNN software, which is fully integrated with both the PyTorch and TensorFlow frameworks. The server configuration and the CPU specifications used in the benchmark experiments are shown in Tables 1 and 2, respectively.

Table 1. Dell PowerEdge R760 Server Configuration

System Name: PowerEdge R760
Status: Available
System Type: Data Center
Number of Nodes: 1
Host Processor Model: 5th Generation Intel® Xeon® Scalable Processors
Host Processors per Node: 2
Host Processor Core Count: 64
Host Processor Frequency: 1.9 GHz, 3.9 GHz Turbo Boost
Host Memory Capacity: 2 TB, 16x 128 GB 5600 MT/s
Host Storage Capacity: 7.68 TB, NVMe

Table 2. 5th Generation Intel® Xeon® Scalable Processor Technical Specifications

Product Collection: 5th Generation Intel® Xeon® Scalable Processors
Processor Name: Platinum 8592+
Status: Launched
# of CPU Cores: 64
# of Threads: 128
Base Frequency: 1.9 GHz
Max Turbo Speed: 3.9 GHz
Cache L3: 320 MB
Memory Type: DDR5 5600 MT/s
ECC Memory Supported: Yes


MLPerf™ Inference v4.0 - Datacenter

The MLPerf™ inference benchmark measures how fast a system can perform ML inference using a trained model with new data in a variety of deployment scenarios. There are two benchmark suites, one for Datacenter systems and one for Edge. Figure 1 shows the 7 models, each targeting a different task, in the official v4.0 release for the Datacenter category that were run on this PowerEdge R760 and submitted in the closed category. The dataset and quality target defined for each benchmarked model are listed in Table 3.

 

 Figure 1. Benchmarked models for MLPerf™ datacenter inference v4.0

Table 3. Datacenter Suite Benchmarks. Source: MLCommons™

Area | Task | Model | Dataset | QSL Size | Quality | Server latency constraint
Vision | Image classification | ResNet50-v1.5 | ImageNet (224x224) | 1024 | 99% of FP32 (76.46%) | 15 ms
Vision | Object detection | RetinaNet | OpenImages (800x800) | 64 | 99% of FP32 (0.20 mAP) | 100 ms
Vision | Medical imaging | 3D-Unet | KITS 2019 (602x512x512) | 16 | 99.9% of FP32 (0.86330 mean DICE score) | N/A
Speech | Speech-to-text | RNN-T | Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) | 1000 ms
Language | Language processing | BERT-large | SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms
Language | Summarization | GPT-J | CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | 20 s
Commerce | Recommendation | DLRMv2 | Criteo 4TB multi-hot | 204800 | 99% of FP32 (AUC=80.25%) | 60 ms

Scenarios

The models are deployed in a variety of critical inference applications or use cases known as "scenarios," where each scenario requires different metrics to demonstrate production-environment performance. Each scenario is described below. Table 4 shows the scenarios required for each Datacenter benchmark included in this v4.0 submission.

Offline scenario: represents applications that process batches of data that are available immediately and have no latency constraint; the performance metric is measured in samples per second.

Server scenario: represents the deployment of online applications with random input queries. The performance metric is queries per second (QPS) subject to a latency bound. The server scenario is more complicated in terms of latency constraints and input query generation; this complexity is reflected in the throughput degradation relative to the offline scenario.

Each Datacenter benchmark requires the following scenarios:

Table 4. Datacenter Suite Benchmark Scenarios. Source: MLCommons™

Area | Task | Required Scenarios
Vision | Image classification | Server, Offline
Vision | Object detection | Server, Offline
Vision | Medical imaging | Offline
Speech | Speech-to-text | Server, Offline
Language | Language processing | Server, Offline
Language | Summarization | Server, Offline
Commerce | Recommendation | Server, Offline

Software stack and system configuration

The software stack and system configuration used for this submission are summarized in Table 5.

Table 5. System Configuration

OS: CentOS Stream 8 (GNU/Linux x86_64)
Kernel: 6.7.4-1.el8.elrepo.x86_64
Intel® Optimized Inference SW for MLPerf™: MLPerf™ Intel® oneDNN integrated with Intel® Extension for PyTorch (IPEX)
ECC memory mode: ON
Host memory configuration: 2TB, 16x 128 GB, 1 DIMM per channel, well balanced
Turbo mode: ON
CPU frequency governor: Performance

What is Intel® AMX (Advanced Matrix Extensions)?

Intel® AMX is a built-in accelerator that enables 5th Gen Intel® Xeon® Scalable processors to optimize deep learning (DL) training and inferencing workloads. With the high-speed matrix multiplications enabled by Intel® AMX, 5th Gen Intel® Xeon® Scalable processors can quickly pivot between optimizing general computing and AI workloads.

Imagine an automobile that could excel at city driving and then quickly shift to deliver Formula 1 racing performance. 5th Gen Intel® Xeon® Scalable processors deliver this level of flexibility. Developers can code AI functionality to take advantage of the Intel® AMX instruction set as well as code non-AI functionality to use the processor instruction set architecture (ISA). Intel® has integrated the oneAPI Deep Neural Network Library (oneDNN) – its oneAPI DL engine – into popular open-source tools for AI applications, including TensorFlow, PyTorch, PaddlePaddle, and ONNX.
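As a minimal, hedged sketch of what "taking advantage of AMX through the frameworks" looks like in practice: running matmuls in bfloat16 on the CPU lets PyTorch hand them to oneDNN, which dispatches to the AMX tile units on Xeon processors that support them; the developer writes no intrinsics.

import torch

a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

# Under CPU autocast, the matmul runs in bfloat16; oneDNN executes it and uses
# AMX instructions when the processor supports them.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    c = a @ b

print(c.dtype)    # torch.bfloat16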

AMX architecture

Intel® AMX architecture consists of two components, as shown in Figure 2:

  • Tiles consist of eight two-dimensional registers, each 1 kilobyte in size. They store large chunks of data.
  • Tile Matrix Multiplication (TMUL) is an accelerator engine attached to the tiles that performs matrix-multiply computations for AI.

Figure 2. Intel® AMX architecture consists of 2D register files (tiles) and TMUL

Results

Both the MLPerf™ v3.1 and MLPerf™ v4.0 benchmark results are based on the Dell R760 server but with different generations of Xeon® CPUs (4th Generation Intel® Xeon® CPUs for MLPerf™ v3.1 versus 5th Generation Intel® Xeon® CPUs for MLPerf™ v4.0) and optimized software stacks. In this section, we show the performance in comparison mode so the improvement from the last submission can be easily observed.

Comparing Performance from MLPerf™ v4.0 to MLPerf™ v3.1

 

ResNet50 server & offline scenarios:

Figure 3. ResNet50 inference throughput in server and offline scenarios

BERT Large Language Model server & offline scenarios:

Figure 4. BERT Inference results for server and offline scenarios 

RetinaNet Object Detection Model server & offline scenarios:

Figure 5. RetinaNet Object Detection Model Inference results for server and offline scenarios

RNN-T Text to Speech Model server & offline scenarios:

Figure 6. RNN-T Text to Speech Model Inference results for server and offline scenarios

3D-Unet Medical Imaging Model offline scenario:

Figure 7. 3D-Unet Medical Imaging Model Inference results for the offline scenario

DLRMv2-99 Recommendation Model server & offline scenarios:

Figure 8. DLRMv2-99 Recommendation Model Inference results for server and offline scenarios

GPT-J-99 Summarization Model server & offline scenarios:

Figure 9. GPT-J-99 Summarization Model Inference results for server and offline scenarios

Conclusion

  • The PowerEdge R760 server with 5th Generation Intel® Xeon® Scalable Processors produces strong data center inference performance, confirmed by the official version 4.0 MLPerf™ benchmarking results from MLCommons™.
  • The high performance and versatility are demonstrated across natural language processing, image classification, object detection, medical imaging, speech-to-text, recommendation, and summarization systems.
  • Compared to its prior version 3.0 and 3.1 submissions enabled by 4th Generation Intel® Xeon® Scalable Processors, the R760 with 5th Generation Intel® Xeon® Scalable Processors shows significant performance improvement across different models, including generative AI models like GPT-J.
  • The R760 supports the different deep learning inference scenarios in the MLPerf™ benchmark as well as other complex workloads such as database and advanced analytics. It is an ideal solution for data center modernization to drive operational efficiency, deliver higher productivity, and minimize total cost of ownership (TCO).

References

MLCommons™ MLPerf™ v4.0 Inference Benchmark Submission IDs

ID | Submitter | System
4.0-0026 | Dell | Dell PowerEdge Server R760 (2x Intel® Xeon® Platinum 8592+)

 

 

 



Investigating the Memory Access Bottlenecks of Running LLMs

Tao Zhang | Bhavesh Patel

Thu, 18 Jan 2024 20:20:03 -0000


Introduction

Memory access and computing are the two main functions of any computer system. Over the past decades, the computing capability of processors has benefited greatly from Moore's Law, which brings smaller and faster transistors into the silicon die almost every year. System memory, on the other hand, has not kept pace: relative to compute, memory access has become much slower. This imbalance causes computer system performance to be bottlenecked by memory access, which is referred to as the "memory wall" issue. The issue gets worse for large language model (LLM) applications because they require more memory and more computation, and therefore more memory accesses, to execute these larger models.

In this blog, we investigate the impact of memory access bottlenecks on LLM inference results. For the experiments, we chose the Llama2 chat models running on a Dell PowerEdge HS5610 server with 4th Generation Intel® Xeon® Scalable Processors. For quantitative analysis, we use the Intel profiling tool, Intel® VTune™ Profiler, to capture memory access information while running the workload. After identifying where the memory access bottlenecks are, we propose possible techniques and configurations to mitigate them in the conclusion section.

Background

Natural Language Processing (NLP) has greatly benefited from the transformer architecture since it was introduced in 2017 [1]. The trajectory of NLP models has moved toward transformer-based architectures, given their parallelization and scalability advantages over traditional Recurrent Neural Network (RNN) architectures. Research shows a scaling law for transformer-based language models, in which accuracy is strongly related to model size, dataset size, and the amount of compute [2]. This inspired interest in using Large Language Models (LLMs) for high-accuracy and complicated tasks. Figure 1 shows the evolution of LLMs since the transformer architecture was invented. We can see that the parameter counts of LLMs have increased dramatically in the last 5 years, and this trend is continuing. As shown in the figure, most LLMs today come with more than 7 billion parameters. Some models, like GPT-4 and PaLM 2, have trillion-level parameter counts to support multi-modal features.

 

Figure 1: LLM evolution

What comes with the large models are challenges for the hardware systems used to train and inference them. On the one hand, the computation required is tremendous, as it is proportional to the model size. On the other hand, memory access is expensive, mainly due to off-chip communication and the complicated cache architectures required to support the large model parameters and computation.

Test Setup

The hardware platform we used for this study is the HS5610, the latest 16G cloud-optimized server in the Dell product portfolio. Figure 2 gives an overview of the HS5610. It has been designed with CSP features that provide full PowerEdge features and management like mainstream Dell servers, as well as open management (OpenBMC), cold-aisle service, channel firmware, and services. The server has two sockets, each with a 32-core 4th Generation Intel® Xeon® CPU. The TDP for each CPU is 250W. Table 1 and Table 2 show the details of the server configuration and CPU specifications.

Figure 2: PowerEdge HS5610 [3]

Table 1: HS5610 Server Configuration

System Name: PowerEdge HS5610
Status: Available
System Type: Data Center
Number of Nodes: 1
Host Processor Model: 4th Generation Intel® Xeon® Scalable Processors
Host Processors per Node: 2
Host Processor Core Count: 32
Host Processor Frequency: 2.0 GHz, 3.8 GHz Turbo Boost
Host Memory Capacity: 1TB, 16x 64GB DIMM 4800 MT/s
Host Storage Capacity: 4.8 TB, NVMe

Table 2: 4th Generation 32-core Intel® Xeon® Scalable Processor Technical Specifications

Product Collection: 4th Generation Intel® Xeon® Scalable Processors
Processor Name: Platinum 8480+
Status: Launched
# of CPU Cores: 32
# of Threads: 64
Base Frequency: 2.0 GHz
Max Turbo Speed: 3.8 GHz
Cache L3: 64 MB
Memory Type: DDR5 4800 MT/s
ECC Memory Supported: Yes

Software Stack and System Configuration

The software stack and system configuration used for these experiments are summarized in Table 3. Optimizations have been applied to the PyTorch framework and Transformers library to unleash the Xeon CPU's AI instruction capabilities. In addition, a low-level tool, Intel® Neural Compressor, has been used for high-accuracy quantization.

Table 3: Software stack and system configuration

OS: CentOS Stream 8 (GNU/Linux x86_64)
Intel® Optimized Inference SW: oneDNN™ Deep Learning, ONNX, Intel® Extension for PyTorch (IPEX), Intel® Extension for Transformers (ITREX), Intel® Neural Compressor
ECC memory mode: ON
Host memory configuration: 1TiB
Turbo mode: ON
CPU frequency governor: Performance

The model under test is the Llama2-chat-hf model with 13 billion parameters (Llama2-13b-chat-hf). It is based on the pre-trained 13-billion-parameter Llama2 model and fine-tuned with human feedback for chatbot applications. The Llama2 family comes in light (7b), medium (13b), and heavy (70b) sizes.

The profiling tool used in the experiments is Intel® VTune™. It is a powerful low-level performance analysis tool for x86 CPUs that supports algorithm, micro-architecture, parallelism, and IO-related analysis, among others. For the experiments, we use the memory access analysis under the micro-architecture category. Note that Intel® VTune™ consumes significant hardware resources, which would distort the performance results if we ran the tool alongside the workload; we therefore use it only as a profiling/debugging tool to investigate the bottleneck. The performance numbers we report here are measured without Intel® VTune™ running.

The experiments are targeted to cover the following:

  • Single-socket performance vs dual-socket performance to demonstrate the NUMA memory access impact.
  • Performance under different CPU-core numbers within a single socket to demonstrate the local memory access impact.
  • Performance with different quantization to demonstrate the quantization impact.
  • Intel® VTune™ memory access results.

Because Intel® VTune™ has minimum capture duration and maximum capture size requirements, we focus on capturing results for the medium-size model (Llama2-13b-chat-hf); this avoids inference times that are too short or too long, and therefore an underload or overload issue. All the experiments are based on a batch size of 1. Performance is characterized by latency or throughput. To reduce measurement error, each inference is executed 10 times and the results are averaged. A warm-up pass that loads the parameters and runs a sample test is executed before the measured inferences.
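The measurement procedure can be summarized by the hypothetical helper below (the generate_fn callable and its return value are placeholders, not part of the actual test scripts): one warm-up pass, then the same inference repeated 10 times and averaged.

import time

def benchmark(generate_fn, prompt, n_runs=10):
    generate_fn(prompt)                        # warm-up: load weights, fill caches
    latencies = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        output_tokens = generate_fn(prompt)    # assumed to return the token count
        latencies.append(time.perf_counter() - t0)
    avg = sum(latencies) / len(latencies)
    return avg, output_tokens / avg            # average latency (s), throughput (tokens/s)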

Results

In this section, we showcase the performance results in terms of throughput for the single-socket and dual-socket scenarios under different quantization types, followed by the Intel® VTune™ capture results.

Single-socket Results Under Different Quantization Types:

 

Figure 3: Single-socket throughput in HS5610 server running Llama2 models under different quantization types

Figure 3 shows the throughputs of running different Llama2 chat models with different quantization types on a single socket. The “numactl” command is used to confine the workload within one single 32-core CPU. From the results, we can see that quantization greatly helps to improve the performance across different models.

Figure 4: Intel® VTune™ memory analysis results for single-socket fp32: (a) bandwidth and utilization diagram; (b) elapsed time analysis

 

Figure 5: Intel® VTune™ memory analysis results for single-socket bf16: (a) bandwidth and utilization diagram; (b) elapsed time analysis

To better understand what happens at the lower level, we take the Llama2 13 billion model as an example and use Intel® VTune™ to capture the bandwidth and utilization diagram and the elapsed time analysis for the fp32 data type (Figure 4) and the bf16 data type (Figure 5). We can see that by reducing the number of bits used to represent the data, the bandwidth required for CPU-DRAM communication is reduced: the DRAM utilization drops from 63.4% for fp32 (Figure 4 (a)) to 28.7% for bf16 (Figure 5 (a)). This also means the weight data arrives at the CPU chip more quickly, so we benefit from the faster memory communication. CPU utilization also increases, from 10% for fp32 (Figure 4 (a)) to 15.6% for bf16 (Figure 5 (a)). Both the faster memory access and the better CPU utilization translate into better performance, with a more than 50% throughput boost (from 2.47 tokens/s for fp32 to 3.74 tokens/s for bf16), as shown in Figure 3. Diving deeper with the elapsed time analysis shown in Figure 4 (b) and Figure 5 (b), the L1 cache is one of the performance bottleneck locations on the chip; quantization reduces the likelihood that a task stalls there.

Dual-socket Results Under Different Quantization Types:

 


 

Figure 6: Dual-socket throughput in HS5610 server running Llama2 models under different quantization types

 


Figure 7: Intel® VTune™ memory analysis results for dual-socket fp32: (a) bandwidth and utilization diagram; (b) elapsed time analysis

 


Figure 8: Intel® VTune™ memory analysis results for dual-socket bf16: (a) bandwidth and utilization diagram; (b) elapsed time analysis

Moving to the dual-socket scenarios shown in Figures 6-8, we make similar observations regarding the impact of quantization: quantization increases CPU utilization and reduces the L1 cache bottleneck, thereby boosting the throughput across the different Llama2 models.

Comparing the single-socket (Figure 3) and dual-socket (Figure 6) scenarios shows negligible performance improvement. As seen in Figures 7 and 8, even though we get better CPU utilization, the communication between the two sockets (UPI or NUMA remote memory access) becomes the main bottleneck and offsets the benefit of having more computing cores.

Conclusion

Based on the experimental results for the different Llama2 models under various configurations, we draw the following conclusions:

  • Quantization improves performance across models of different sizes by reducing the L1 cache bottleneck and increasing CPU utilization. This also indicates that TCO can be optimized by reducing the memory requirements (in terms of capacity and speed) if the model is quantized properly.
  • Cross-socket communication, through either UPI or NUMA remote memory access, is a significant bottleneck that limits performance. Optimizations that reduce inter-socket communication, such as better partitioning of the model, are therefore critical. Alternatively, this also indicates that executing one workload on a single dedicated CPU with enough cores is desirable from both cost and performance perspectives.


Home > AI Solutions > Gen AI > Blogs

LLM Llama2 Generative AI quantization

Deploying Llama 7B Model with Advanced Quantization Techniques on Dell Server

Tao Zhang Bhavesh Patel

Tue, 16 Jan 2024 20:05:01 -0000

|

Read Time: 0 minutes

Introduction

Large Language Models (LLMs) have gained great industrial and academic interest in recent years. Different LLMs have been adopted in various applications, such as content generation, text summarization, sentiment analysis, and healthcare. The LLM evolution diagram in Figure 1 shows the popular pre-trained models since 2017, when the transformer architecture was first introduced [1]. The trend toward larger and more open-source models along the timeline is easy to see. Open-source models boosted the popularity of LLMs by eliminating the huge training cost associated with the large-scale infrastructure and long training time required. Another portion of the cost of LLM applications comes from deployment, where an efficient inference platform is required.

This blog focuses on how to deploy LLMs efficiently on Dell platforms with different quantization techniques. We first benchmarked the model accuracy under different quantization techniques. Then we demonstrated, through experiments, the performance and memory requirements of running LLMs under those quantization techniques. Specifically, we chose the open-source model Llama-2-7b-chat-hf for its popularity [2]. The server is the Dell mainstream server R760xa with NVIDIA L40 GPUs [3] [4]. The deployment framework in the experiments is TensorRT-LLM, which enables different quantization techniques, including the advanced 4-bit quantization demonstrated in this blog [5].

 


Figure 1: LLM evolution

 

Background

LLM inferencing tends to be slow and power hungry because LLMs are large in weight size and generate tokens auto-regressively. Making the inferencing process more efficient under limited hardware resources is among the most critical problems for LLM deployment. Quantization is an important technique widely used to push for more efficient LLM deployment. It relieves the large hardware resource requirements by reducing the memory footprint and computation energy, and it improves performance through faster memory access compared to deploying the original un-quantized model. For example, in [6], the throughput in tokens per second (tokens/s) for the Llama-2-7b model is improved by more than 2x by quantizing from 16-bit floating point to 8-bit integer. Recent research has made more aggressive quantization techniques, such as 4-bit, possible and available in deployment frameworks like TensorRT-LLM. However, quantization is not free, and it normally comes with accuracy loss. Beyond cost, reliable performance with acceptable accuracy for specific applications is what users care about. The two key topics covered in this blog are therefore accuracy and performance. We first benchmark the accuracy of the original and quantized models over different tasks. Then we deploy those models on the Dell server and measure their performance. We further measure the GPU memory usage for each scenario.

Test Setup

The model under investigation is Llama-2-7b-chat-hf [2]. This is an LLM fine-tuned with human feedback and optimized for dialogue use cases, based on the 7-billion-parameter Llama-2 pre-trained model. We load the fp16 model from Hugging Face as the baseline by setting torch_dtype to float16.
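As a minimal sketch of that baseline setup (assuming access to the gated Llama-2 weights on Hugging Face and the accelerate package for device placement):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Baseline fp16 model, loaded as described above.
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 baseline
    device_map="auto",          # place the model on the available GPU
)
```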

We investigated two advanced 4-bit quantization techniques to compare with the baseline fp16 model: one is activation-aware weight quantization (AWQ) and the other is GPTQ [7] [8]. TensorRT-LLM integrates a toolkit that allows quantization and deployment of these advanced 4-bit quantized models.
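The measurements in this blog were produced with the quantization toolkit bundled with TensorRT-LLM; purely as an illustration of what consuming a 4-bit GPTQ checkpoint can look like from application code, the sketch below loads a pre-quantized model through the Hugging Face Transformers integration instead. The repository name is an assumption, and the optimum and auto-gptq packages are required for this path.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: load an already GPTQ-quantized 4-bit Llama-2 chat checkpoint.
# This is not the TensorRT-LLM workflow used for the measurements in this blog.
quantized_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # assumed community repository
tokenizer = AutoTokenizer.from_pretrained(quantized_id)
model = AutoModelForCausalLM.from_pretrained(
    quantized_id,
    device_map="auto",  # the 4-bit weights fit comfortably on a single L40
)
```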

For accuracy evaluation across models with different quantization techniques, we choose the Massive Multitask Language Understanding (MMLU) dataset. The benchmark covers 57 subjects across different difficulty levels for both world-knowledge and problem-solving ability tests [9]. The granularity and breadth of the subjects in the MMLU dataset allow us to evaluate model accuracy across different applications. To summarize the results more easily, the 57 subjects can be grouped into 21 categories, or further into 4 main categories: STEM, humanities, social sciences, and others (business, health, misc.) [10].
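A simple accuracy loop over one MMLU subject might look like the sketch below; the dataset identifier and field names follow the Hugging Face hosting of MMLU, and generate_answer() stands in for whichever inference backend is being evaluated.

```python
from datasets import load_dataset

CHOICES = ["A", "B", "C", "D"]

def evaluate_subject(subject: str, generate_answer) -> float:
    """Return the fraction of correctly answered questions for one MMLU subject."""
    data = load_dataset("cais/mmlu", subject, split="test")
    correct = 0
    for row in data:
        prompt = (
            row["question"] + "\n"
            + "\n".join(f"{letter}. {choice}"
                        for letter, choice in zip(CHOICES, row["choices"]))
            + "\nAnswer:"
        )
        prediction = generate_answer(prompt)  # expected to return "A", "B", "C", or "D"
        correct += prediction.strip().startswith(CHOICES[row["answer"]])
    return correct / len(data)

# Example: accuracy = evaluate_subject("college_mathematics", my_answer_fn)
```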

Performance is evaluated in terms of tokens/s across different batch sizes on the Dell R760xa server with one L40 GPU installed in a PCIe slot. The R760xa server configuration and the high-level specification of the L40 are shown in Tables 1 and 2 [3] [4]. To make the comparison easier, we fix the input and output sequence lengths to 512 and 200 tokens, respectively.


System Name: PowerEdge R760xa
Status: Available
System Type: Data Center
Number of Nodes: 1
Host Processor Model: 4th Generation Intel® Xeon® Scalable Processors
Host Processor Name: Intel® Xeon® Gold 6430
Host Processors per Node: 2
Host Processor Core Count: 32
Host Processor Frequency: 2.0 GHz, 3.8 GHz Turbo Boost
Host Memory Capacity and Type: 512 GB, 16 x 32 GB DIMM, 4800 MT/s DDR5
Host Storage Capacity: 1.8 TB, NVMe

Table 1: R760xa server configuration

 

GPU Architecture: NVIDIA Ada Lovelace Architecture (L40)
GPU Memory: 48 GB GDDR6 with ECC
Max Power Consumption: 300 W
Form Factor: 4.4" (H) x 10.5" (L), dual slot
Thermal: Passive

Table 2: L40 high-level specification

The inference framework, which includes the different quantization tools, is NVIDIA TensorRT-LLM, initial release version 0.5. The operating system for the experiments is Ubuntu 22.04 LTS.
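To tie the setup together, the sketch below shows one way to measure tokens/s with the fixed 512-token input and 200-token output described above; it assumes a Transformers-style model and tokenizer (for example, the fp16 baseline loaded earlier) rather than the TensorRT-LLM engine itself.

```python
import time
import torch

def measure_tokens_per_second(model, tokenizer, batch_size: int) -> float:
    """Throughput with a fixed 512-token input and 200-token output."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
    prompt = "Summarize the history of space exploration."  # placeholder prompt
    inputs = tokenizer(
        [prompt] * batch_size,
        return_tensors="pt",
        padding="max_length",
        max_length=512,   # fixed input sequence length
        truncation=True,
    ).to(model.device)

    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, min_new_tokens=200, max_new_tokens=200)
    elapsed = time.perf_counter() - start
    return batch_size * 200 / elapsed  # fixed output length of 200 tokens
```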

Results

We first show the model accuracy results based on the MMLU dataset tests in Figures 2 and 3, followed by the throughput results of running those models on the PowerEdge R760xa in Figure 4. Lastly, we show the actual peak memory usage for the different scenarios. Brief discussions are given for each result; the conclusions are summarized in the next section.

Accuracy

 


Figure 2: MMLU 4-category accuracy test result

Figure 2 shows the accuracy test results for the 4 main MMLU categories with the Llama-2-7b-chat-hf model. Compared to the baseline fp16 model, the model with 4-bit AWQ shows a significant accuracy drop. On the other hand, the model with 4-bit GPTQ shows a much smaller accuracy drop; for the STEM category in particular, the drop is smaller than 5%.

 


Figure 3: MMLU 21-category accuracy test result

Figure 3 further shows the accuracy test results for 21 MMLU sub-categories with the Llama-2-7b-chat-hf model. Similar conclusions can be drawn: the 4-bit GPTQ quantization gives much better accuracy, except for the law category, where the two quantization techniques achieve similar accuracy.

Performance

 


Figure 4: Throughput test result

Figure 4 shows the throughput numbers when running Llama-2-7b-chat-hf with different batch sizes and quantization methods on the R760xa server. We observe a significant throughput boost with 4-bit quantization, especially when the batch size is small. For example, roughly 3x tokens/s is achieved at batch size 1 when comparing the 4-bit AWQ or GPTQ scenarios to the 16-bit baseline. AWQ and GPTQ quantization give similar performance across the different batch sizes.

GPU Memory Usage

 


Figure 5: Peak GPU memory usage

Figure 5 shows the peak GPU memory usage when running Llama-2-7b-chat-hf with different batch sizes and quantization methods on the R760xa server. From the results, the 4-bit quantization techniques greatly reduce the memory required for running the model. Compared to the memory required for the baseline fp16 model, the AWQ- and GPTQ-quantized models require only half, or even less, of the memory, depending on the batch size. A slightly larger peak memory usage is also observed for the GPTQ-quantized model compared to the AWQ-quantized model.
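For a PyTorch-hosted model, one way to capture a comparable peak-memory number is through the CUDA allocator statistics, as sketched below; for a TensorRT-LLM engine, the peak is typically read from nvidia-smi instead, so treat this as an approximation. Here, model and inputs refer to the objects from the earlier loading sketch.

```python
import torch

# Reset the allocator's high-water mark, run one generation, then read it back.
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=200)
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak GPU memory: {peak_gib:.1f} GiB")  # tracks PyTorch allocations only
```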

Conclusion

  • We have shown the impacts on accuracy, performance, and GPU memory usage of applying advanced 4-bit quantization techniques on a Dell PowerEdge server when running the Llama 7B model.
  • We have demonstrated the great benefits of these 4-bit quantization techniques in terms of improving throughput and saving GPU memory.
  • We have quantitatively compared the quantized models with the baseline model in terms of accuracy across various subjects based on the MMLU dataset.
  • Tests showed that, with an acceptable accuracy loss, 4-bit GPTQ is an attractive quantization method for LLM deployment where hardware resources are limited. On the other hand, large accuracy drops across many MMLU subjects were observed for 4-bit AWQ. This indicates that the AWQ-quantized model should be limited to applications tied to specific subjects; otherwise, other techniques such as re-training or fine-tuning may be required to improve accuracy.

References

[1]. A. Vaswani et. al, “Attention Is All You Need”, https://arxiv.org/abs/1706.03762

[2]. https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

[3]. https://www.dell.com/en-us/shop/dell-poweredge-servers/poweredge-r760xa-rack-server/spd/poweredge-r760xa/pe_r760xa_16902_vi_vp

[4]. https://www.nvidia.com/en-us/data-center/l40/

[5]. https://github.com/NVIDIA/TensorRT-LLM

[6]. https://infohub.delltechnologies.com/p/running-llms-on-dell-poweredge-servers-with-intel-r-4th-generation-xeon-r-cpus/

[7]. J. Lin et. al, “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration”, https://arxiv.org/abs/2306.00978

[8]. E. Frantar et. al, “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers”, https://arxiv.org/abs/2210.17323

[9]. D. Hendrycks et. al, “Measuring Massive Multitask Language Understanding”, https://arxiv.org/abs/2009.03300

[10]. https://github.com/hendrycks/test/blob/master/categories.py


Home > AI Solutions > Artificial Intelligence > Blogs

AI MLPerf R760 generative AI Intel XEON CPU Inference

MLPerf™ Inference 3.1 on Dell PowerEdge Server with Intel® 4th Generation Xeon® CPU

Tao Zhang Brandt Springman Bhavesh Patel Louie Tsai Yuning Qiu Ramesh Chukka

Thu, 11 Jan 2024 19:43:07 -0000

|

Read Time: 0 minutes

Introduction

MLCommons™ has released the v3.1 results for its machine learning inference benchmark suite, MLPerf™. This blog focuses on the impressive datacenter inference results obtained across different use cases by using the new 4th Generation Intel Xeon Scalable Processors on a Dell PowerEdge R760 server. This submission covers the benchmark results for all 7 use cases defined in MLPerf™, which are Natural Language Processing (BERT), Image Classification (ResNet50), Object Detection (RetinaNet), Speech-to-Text (RNN-T), Medical Imaging (3D-Unet), Recommendation Systems (DLRMv2), and Summarization (GPT-J).

These new Intel® Xeon® processors use an Intel AMX® matrix multiplication engine in each core to boost overall inferencing performance. With a focus on ease of use, Dell Technologies delivers exceptional CPU performance results out of the box with an optimized BIOS profile that fully unleashes the power of Intel’s OneDNN software – software which is fully integrated with both PyTorch and TensorFlow frameworks. The server configurations and the CPU specifications in the benchmark experiments are shown in Tables 1 and 2 respectively.

System Name: PowerEdge R760
Status: Available
System Type: Data Center
Number of Nodes:
Host Processor Model: 4th Generation Intel® Xeon® Scalable Processors
Host Processors per Node:
Host Processor Core Count: 56
Host Processor Frequency: 2.0 GHz, 3.8 GHz Turbo Boost
Host Memory Capacity: 1 TB, 16 x 64 GB DIMM, 4800 MHz
Host Storage Capacity: 4.8 TB, NVMe

Table 1. Dell PowerEdge R760 Server Configuration

Product Collection: 4th Generation Intel® Xeon® Scalable Processors
Processor Name: Platinum 8480+
Status: Launched
# of CPU Cores: 56
# of Threads: 112
Base Frequency: 2.0 GHz
Max Turbo Speed: 3.8 GHz
Cache L3: 105 MB
Memory Type: DDR5 4800 MT/s
ECC Memory Supported: Yes

Table 2. 4th Generation Intel® Xeon® Scalable Processor Technical Specifications

 

MLPerf™ Inference v3.1 - Datacenter

The MLPerf™ inference benchmark measures how fast a system can perform ML inference using a trained model with new data in a variety of deployment scenarios. There are two benchmark suites – one for Datacenter systems and one for Edge. Table 3 lists the 7 mature models in the official v3.1 release for the Datacenter systems category, each targeting a different task, that were run on this PowerEdge R760. Compared to the v3.0 release, v3.1 added an updated version of the recommendation model – DLRMv2 – and introduced the first Large Language Model (LLM) – GPT-J.

Area | Task | Model | Dataset | QSL Size | Quality | Server latency constraint
Vision | Image classification | ResNet50-v1.5 | ImageNet (224x224) | 1024 | 99% of FP32 (76.46%) | 15 ms
Vision | Object detection | RetinaNet | OpenImages (800x800) | 64 | 99% of FP32 (0.20 mAP) | 100 ms
Vision | Medical imaging | 3D-Unet | KITS 2019 (602x512x512) | 16 | 99.9% of FP32 (0.86330 mean DICE score) | N/A
Speech | Speech-to-text | RNN-T | Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) | 1000 ms
Language | Language processing | BERT-large | SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms
Language | Summarization | GPT-J-99 | CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 (f1_score=80.25%, rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | 20 s
Commerce | Recommendation | DLRMv2 | Criteo 4TB Multi-hot | 204800 | 99% of FP32 (AUC=80.25%) | 60 ms

Table 3. Datacenter Suite Benchmarks. Source: MLCommons™

 

Scenarios

The models are deployed in a variety of critical inference applications or use cases known as “scenarios,” where each scenario requires different metrics, demonstrating production-environment performance in practice. A description of each scenario follows. Table 4 shows the scenarios required for each Datacenter benchmark included in this v3.1 submission.

Offline scenario: represents applications that process batches of input data that are all available immediately and have no latency constraint; performance is measured in samples per second.

Server scenario: represents deployment of online applications with randomly arriving input queries; performance is measured in queries per second (QPS) subject to a latency bound. The server scenario is more complicated in terms of latency constraints and input query generation, and this complexity is reflected in lower throughput compared to the offline scenario.

Each Datacenter benchmark requires the following scenarios:

Area | Task | Required Scenarios
Vision | Image classification | Server, Offline
Vision | Object detection | Server, Offline
Vision | Medical imaging | Offline
Speech | Speech-to-text | Server, Offline
Language | Language processing | Server, Offline
Language | Summarization | Server, Offline
Commerce | Recommendation | Server, Offline

Table 4. Datacenter Suite Benchmark Scenarios. Source: MLCommons™


Software stack and system configuration

The software stack and system configuration used for this submission are summarized in Table 5.

OS: CentOS Stream 8 (GNU/Linux x86_64)
Intel® Optimized Inference SW for MLPerf™: MLPerf™ Intel OneDNN integrated with PyTorch
ECC memory mode: ON
Host memory configuration: 1 TiB
Turbo mode: ON
CPU frequency governor: Performance

Table 5. System Configuration

 

What is Intel AMX (Advanced Matrix Extensions)?

Intel AMX is a built-in accelerator that enables 4th Gen Intel Xeon Scalable processors to optimize deep learning (DL) training and inferencing workloads. With the high-speed matrix multiplications enabled by Intel AMX, 4th Gen Intel Xeon Scalable processors can quickly pivot between optimizing general computing and AI workloads.

Imagine an automobile that could excel at city driving and then quickly shift to deliver Formula 1 racing performance. 4th Gen Intel Xeon Scalable processors deliver this level of flexibility. Developers can code AI functionality to take advantage of the Intel AMX instruction set as well as code non-AI functionality to use the processor instruction set architecture (ISA).

Intel has integrated the Intel® oneAPI Deep Neural Network Library (oneDNN) – its oneAPI DL engine – into popular open-source tools for AI applications, including TensorFlow, PyTorch, PaddlePaddle, and ONNX.
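One common way to exercise AMX from PyTorch is through the bf16 path of Intel® Extension for PyTorch (IPEX), which builds on oneDNN. The sketch below is a generic illustration under the assumption that IPEX is installed and uses the base GPT-J checkpoint from Hugging Face; it is not the MLPerf™ submission code or the fine-tuned summarization model used in the benchmark.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6b"  # base GPT-J checkpoint, used here only as an example
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# ipex.optimize applies operator fusion and weight prepacking so that bf16 matmuls
# can be dispatched to the AMX tile instructions where the hardware supports them.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Summarize: MLPerf is a benchmark suite for machine learning.",
                   return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```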

 

AMX architecture

Intel AMX architecture consists of two components, as shown in Figure 1:

  • Tiles consist of eight two-dimensional registers, each 1 kilobyte in size. They store large chunks of data.
  • Tile Matrix Multiplication (TMUL) is an accelerator engine attached to the tiles that performs matrix-multiply computations for AI.

Figure 1. Intel AMX architecture consists of 2D register files (tiles) and TMUL

 

Results

Both MLPerf™ v3.0 and MLPerf™ v3.1 benchmark results are based on the latest Dell R760 server utilizing 4th Generation Intel® Xeon® Scalable Processors.

For the ResNet50 Image Classification, RetinaNet Object Detection, BERT Large Language, and RNN-T Speech models – which are identical models with the same datasets for both MLPerf™ v3.0 and MLPerf™ v3.1 – we re-ran the benchmarks for the latest submission. The results show negligible differences between the two submissions.

We added three new benchmark results for the MLPerf™ v3.1 submission compared to the MLPerf™ v3.0 submission: the 3D-Unet Medical Imaging, DLRMv2 Recommendation, and GPT-J Summarization models. Given that there is no previous result for comparison, we simply show the current result on the R760.

Comparing Performance from MLPerf™ v3.1 to MLPerf™ v3.0

ResNet50 server & offline scenarios:

 

Figure 2. ResNet50 inference throughput in server and offline scenarios

BERT Large Language Model server & offline scenarios:

 

Figure 3. BERT Inference results for server and offline scenarios 

RetinaNet Object Detection Model server & offline scenarios:

Figure 4. RetinaNet Object Detection Model Inference results for server and offline scenarios

RNN-T Speech-to-Text Model server & offline scenarios:

Figure 5. RNN-T Speech-to-Text Model Inference results for server and offline scenarios

3D-Unet Medical Imaging Model offline scenario:

Figure 6. 3D-Unet Medical Imaging Model Inference results for the offline scenario

DLRMv2-99 Recommendation Model server & offline scenarios:

Figure 7. DLRMv2-99 Recommendation Model Inference results for server and offline scenarios (submitted in the open category)

GPT-J-99 Summarization Model server & offline scenarios:

Figure 8. GPT-J-99 Summarization Model Inference results for server and offline scenarios

 

Conclusion

  • The PowerEdge R760 server with 4th Generation Intel® Xeon® Scalable Processors produces strong data center inference performance, confirmed by the official v3.1 MLPerf™ benchmarking results from MLCommons™.
  • High performance and versatility are demonstrated across natural language processing, image classification, object detection, medical imaging, speech-to-text, recommendation, and summarization workloads.
  • The R760 with 4th Generation Intel® Xeon® Scalable Processors shows good performance in supporting generative AI models like GPT-J.
  • The R760 supports the different deep learning inference scenarios in the MLPerf™ benchmark as well as other complex workloads such as databases and advanced analytics. It is an ideal solution for data center modernization to drive operational efficiency, deliver higher productivity, and minimize total cost of ownership (TCO).

 

References

MLCommons™ MLPerf™ v3.1 Inference Benchmark Submission IDs

ID | Submitter | System
3.1-0059 | Dell | Dell PowerEdge Server R760 (1x Intel Xeon Platinum 8480+)
3.1-0060 | Dell | Dell PowerEdge Server R760 (1x Intel Xeon Platinum 8480+)
3.1-4184 | Dell | Dell PowerEdge Server R760 (1x Intel Xeon Platinum 8480+)

 

Authors: Tao Zhang (tao.zhang9@dell.com); Brandt Springman (brandt.springman@dell.com); Bhavesh Patel (bhavesh_a_patel@dell.com); Louie Tsai (louie.tsai@intel.com); Yuning Qiu (yuning.qiu@intel.com); Ramesh Chukka (ramesh.n.chukka@intel.com)

 

Home > AI Solutions > Artificial Intelligence > Blogs

AI inferencing generative AI Intel XEON 4th Generation CPU Dell CSP platform

Running LLMs on Dell PowerEdge Servers with Intel® 4th Generation Xeon® CPUs

Tao Zhang Bhavesh Patel

Thu, 11 Jan 2024 19:38:53 -0000

|

Read Time: 0 minutes

Introduction

Large Language Models (LLMs) have gained great industrial and academic interest in recent years. Different LLMs have been adopted in various applications, such as content generation, text summarization, sentiment analysis, and healthcare. The list goes on.

When we think about LLMs and what methodologies we can use for inferencing and fine-tuning, the question always comes up as to which compute device we should use. For inferencing, we wanted to explore the performance metrics when running on an Intel 4th Generation CPU, and what some of the variables are that we should explore.

This blog focuses on LLM inference results on Dell PowerEdge servers with 4th Generation Intel® Xeon® Scalable Processors. Specifically, we demonstrate the performance and power of running the Stable Diffusion and Llama2 chat models on the R760 and HS5610 servers, and we explore, through experiments, the performance and power impacts of different quantization bits and CPU/socket counts.

We selected these Dell platforms because we wanted to explore how our CSP-focused platforms like the HS5610 perform when it comes to inferencing and whether they can meet the requirements of LLM models. These new Intel® Xeon® processors use an Intel® AMX matrix multiplication engine in each core to boost overall inferencing performance. By combining them with quantization techniques, we further improved the inference performance of the CPU-only system. We also show how the CPU core and socket counts affect the performance results.

 

Background

The transformer is regarded as the fourth fundamental model after the Multilayer Perceptron (MLP), Recurrent Neural Networks (RNN), and Convolutional Neural Networks (CNN). Known for its parallelization and scalability, the transformer has greatly boosted the performance and capability of LLMs since it was introduced in 2017 [1].

Today, LLMs have been rapidly adopted in various applications like content generation, text summarization, sentiment analysis, code generation, healthcare, and so on, as shown in Figure 1 [2]. This trend is continuing, and more open-source LLMs are popping up almost on a monthly basis. Moreover, transformer-based techniques are being used alongside other methods, greatly improving the accuracy and performance of the original tasks. For example, the Stable Diffusion model uses an LLM at the input as the natural language understanding engine. Combined with the diffusion model, this has greatly improved the quality and throughput of the text-to-image generation task [3]. Note that, for simplicity, in this blog we use the term “LLMs” to represent both the transformer-based models shown in Figure 1 and derivative models like the Stable Diffusion model.


Figure 1. LLM Timeline [2] (Image credit: Wayne Xin Zhao, et. al, “A Survey of Large Language Models”)

While training and fine-tuning these LLMs is normally time- and cost-consuming, deploying LLMs at the edge has its own challenges. Considering both performance and power, deploying LLMs can be, in a sense, even more cost-sensitive given the volumes of systems required to cover various applications. GPUs are widely used to deploy LLMs. In this blog, we demonstrate the feasibility of deploying LLMs on Dell PowerEdge servers with 4th Generation Intel® Xeon® CPUs and illustrate that good performance can be achieved for popular LLMs with a proper hardware configuration, such as the CPU core count and quantization method.

 

Test Setup

The hardware platforms we used for the experiments are the PowerEdge R760 and HS5610, which are the latest mainstream and cloud-optimized servers, respectively, in the Dell product portfolio. Figure 2 shows the rack-level view of the HS5610 server. As a cloud-optimized solution, the HS5610 has been designed with CSP features that provide the same full PowerEdge features and management as the mainstream R760 server, as well as open management (OpenBMC), cold-aisle service, channel firmware, and services. Both servers have two sockets with a 4th Generation Intel Xeon CPU in each socket. The R760 features a 56-core CPU – Intel® Xeon® Platinum 8480+ (TDP: 350W) – in each socket, and the HS5610 has a 32-core CPU – Intel® Xeon® Gold 6430 (TDP: 250W) – in each socket. Tables 1-4 show the details of the server configurations and CPU specifications. During the tests, we use the numactl command to set the number of sockets or CPU cores used to execute the LLM inference tasks.

 


Figure 2.   PowerEdge HS5610 [4]

System Name: PowerEdge R760
Status: Available
System Type: Data Center
Number of Nodes: 1
Host Processor Model: 4th Generation Intel® Xeon® Scalable Processors
Host Processors per Node: 2
Host Processor Core Count: 56
Host Processor Frequency: 2.0 GHz, 3.8 GHz Turbo Boost
Host Memory Capacity: 1 TB, 16 x 64 GB DIMM, 4800 MHz
Host Storage Capacity: 4.8 TB, NVMe

Table 1. R760 Server Configuration

Product Collection: 4th Generation Intel® Xeon® Scalable Processors
Processor Name: Platinum 8480+
Status: Launched
# of CPU Cores: 56
# of Threads: 112
Base Frequency: 2.0 GHz
Max Turbo Speed: 3.8 GHz
Cache L3: 105 MB
Memory Type: DDR5 4800 MT/s
ECC Memory Supported: Yes

Table 2. 4th Generation 56-core Intel® Xeon® Scalable Processor Technical Specifications

System Name: PowerEdge HS5610
Status: Available
System Type: Data Center
Number of Nodes: 1
Host Processor Model: 4th Generation Intel® Xeon® Scalable Processors
Host Processors per Node: 2
Host Processor Core Count: 32
Host Processor Frequency: 2.0 GHz, 3.8 GHz Turbo Boost
Host Memory Capacity: 1 TB, 16 x 64 GB DIMM, 4800 MHz
Host Storage Capacity: 4.8 TB, NVMe

Table 3. HS5610 Server Configuration

Product Collection: 4th Generation Intel® Xeon® Scalable Processors
Processor Name: Gold 6430
Status: Launched
# of CPU Cores: 32
# of Threads: 64
Base Frequency: 2.0 GHz
Max Turbo Speed: 3.8 GHz
Cache L3: 64 MB
Memory Type: DDR5 4800 MT/s
ECC Memory Supported: Yes

Table 4. 4th Generation 32-core Intel® Xeon® Scalable Processor Technical Specifications



Software stack and system configuration

The software stack and system configuration used for the experiments are summarized in Table 5. Optimizations have been applied to the PyTorch framework and the Transformers library to unleash the Xeon CPU's machine learning capabilities. Moreover, a low-level tool, Intel® Neural Compressor, has been used for high-accuracy quantization; a minimal illustrative quantization sketch follows Table 5.

OS: CentOS Stream 8 (GNU/Linux x86_64)
Intel® Optimized Inference SW: OneDNN™ Deep Learning, ONNX, Intel® Extension for PyTorch (IPEX), Intel® Extension for Transformers (ITREX), Intel® Neural Compressor
ECC memory mode: ON
Host memory configuration: 1 TiB
Turbo mode: ON
CPU frequency governor: Performance

Table 5. Software stack and system configuration
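As a minimal stand-in that shows the basic idea of weight quantization on a CPU, the sketch below applies stock PyTorch dynamic int8 quantization to the Linear layers; the blog's actual results were produced with the Intel tooling listed in Table 5 (IPEX, ITREX, and Intel® Neural Compressor), which offers higher-accuracy, lower-bit schemes.

```python
import torch
from transformers import AutoModelForCausalLM

# Simplified stand-in for the quantization step used in the experiments.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model.eval()

# Replace all Linear layers with dynamically quantized int8 equivalents.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```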

The models under test are the Stable Diffusion model version 1.4 (~1 billion parameters) and the Llama2-chat-HF models with 7 billion, 13 billion, and 70 billion parameters. We purposely chose these models because they are open source, representative, and cover a wide parameter range. Different quantization bit widths are tested to characterize the corresponding performance and power consumption.

All the experiments use a batch size of 1. Performance is characterized by latency or throughput. To reduce measurement error, the inference is executed 10 times and the results are averaged. A warm-up pass that loads the parameters and runs a sample test is executed before the measured inference runs.

 

Results

We show some typical results in this section alongside brief discussions for each result. The conclusions are summarized in the next section.

HS5610 Results

Latency vs Quantization vs Cores – Stable Diffusion Model:


Figure 3. Latency in HS5610 server running Stable Diffusion

Figure 3 shows that the HS5610 can generate a new image in approximately 3 seconds when running the bf16 Stable Diffusion V1.4 model. Quantizing to 16 bits greatly reduces the latency compared to the fp32 model. Scaling up from 16 to 32 cores greatly reduces the latency; however, scaling up across the sockets does not help. This is mainly due to the NUMA remote-memory bottleneck.
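A minimal sketch of the bf16 Stable Diffusion v1.4 run on CPU is shown below, assuming the diffusers package; the prompt is a placeholder, and the measured results additionally rely on the Intel-optimized stack listed in Table 5.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the v1.4 weights in bf16 and run a single text-to-image generation on CPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cpu")

image = pipe("a photo of a data center at sunset").images[0]  # placeholder prompt
image.save("output.png")
```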

Power Consumption – Stable Diffusion Model:


Figure 4. Power consumption of CPU and DIMM in HS5610 server running stable diffusion: (a) fp32 model (b) bf16 model

Figure 4 shows the power profile comparison for the HS5610 when running the Stable Diffusion model with (a) fp32 weights and (b) bf16 weights. To finish the same tasks (warm-up and inferencing), the bf16 model takes significantly less time (a shorter power profile duration) than the fp32 scenario. The plots also show that much more DIMM power is required to run fp32 compared to bf16. Executing the task pushes the CPUs close to the TDP limit, with the exception of CPU1 in Figure 4(b), indicating that further latency reduction is possible for the bf16 model.

Throughput vs Quantization vs Cores – Llama2 Chat Models:


Figure 5. Throughput in HS5610 server running Llama2: (a) 1-socket (b) 2-socket

Figure 5 shows the throughput numbers when running the Llama2 chat models with different parameter sizes and quantization bits on the HS5610 server. Figure 5(a) shows the single-socket scenario and Figure 5(b) the dual-socket scenario. Smaller models with lower quantization bit widths give higher throughput, as expected. As with the Stable Diffusion model, quantization greatly improves the throughput. However, scaling up with more CPU cores across the sockets does little to boost the performance.

R760 Results

Throughput vs Quantization vs Cores – Llama2 Chat Models:


Figure 6. Throughput in R760 server running Llama2: (a) 1-socket (b) 2-socket

Figure 6 shows the throughput numbers when running the Llama2 chat models with different parameter sizes and quantization bits on the R760 server. We make similar observations to the HS5610 results: a smaller model gives higher throughput, and quantization greatly improves the throughput. One difference is that we get a 10-30% performance improvement, depending on the model, when scaling up across the sockets, showing a benefit from the larger core count. The performance across the models is good enough for most real-time chatbot applications.

Performance Per Watt – Llama2 Chat Models:


Figure 7. Performance per watt in R760 server running Llama2: (a) 7b (b)13b (c) 70b

In Figure 7, we further plot the performance-per-watt curves, which are strongly related to the total cost of ownership (TCO) of the system. From the plots, quantization greatly helps performance efficiency, especially for the models with larger parameter counts.

 

Conclusion

  • We have shown that 4th Generation Intel® Xeon® CPUs on Dell PowerEdge mainstream and HS-class platforms can easily meet the performance requirements for inferencing with Llama2 models.
  • We have also quantitatively demonstrated the benefits of quantization, or using lower precision for inferencing, which can give a better TCO in terms of performance per watt and memory footprint, as well as enable a better user experience by improving the throughput.
  • These studies also show that we need to right-size the infrastructure based on the application and model size.

 

References

[1]. A. Vaswani et. al, “Attention Is All You Need”, https://arxiv.org/abs/1706.03762

[2]. W. Zhao et. al, “A Survey of Large Language Models”, https://doi.org/10.48550/arXiv.2303.18223

[3]. R. Rombach et. al, “High-Resolution Image Synthesis with Latent Diffusion Models”, https://arxiv.org/abs/2112.10752

[4]. https://www.dell.com/en-us/shop/ipovw/poweredge-hs5610

 

Authors: Tao Zhang (tao.zhang9@dell.com); Bhavesh Patel (bhavesh_a_patel@dell.com)