The following tables list the system configurations and software stack used for generative AI validation:
| Component | Details | | | |
|---|---|---|---|---|
| Compute server for inferencing | 4 x PowerEdge R760xa | 4 x PowerEdge R760xa | 2 x PowerEdge XE9680 | 2 x PowerEdge XE8640 |
| GPUs per server | 4 x NVIDIA L40S PCIe GPUs | 4 x NVIDIA H100 PCIe GPUs | 8 x NVIDIA H100 SXM GPUs | 4 x NVIDIA H100 SXM GPUs |
| Network adapter | 1 x NVIDIA ConnectX-6 DX Dual Port 100 GbE | 1 x NVIDIA ConnectX-6 DX Dual Port 100 GbE | 2 x NVIDIA ConnectX-6 DX Dual Port 100 GbE | 1 x NVIDIA ConnectX-6 DX Dual Port 100 GbE, OCP NIC 3.0 |
| Network switch | 2 x PowerSwitch S5248F-ON | 2 x PowerSwitch S5248F-ON | 2 x PowerSwitch S5248F-ON | 2 x PowerSwitch S5248F-ON |
| Component | Details |
|---|---|
| Operating system | Ubuntu 22.04.1 LTS |
| Cluster management | NVIDIA Base Command Manager Essentials 10.23.12 |
| Kubernetes | Upstream Kubernetes v1.27.6 |
| GPU operator | NVIDIA GPU Operator v22.9.2 |
| Inference server | NVIDIA Triton Inference Server v23.04 |
| Inference engine | NVIDIA TensorRT-LLM 0.7.1 and 0.8.0 |
We deployed and validated the models in the following scenarios:
TensorRT-LLM supports a wide variety of models as listed here. In this design, we validated the following generative AI models:
We used the following steps to validate the models in Triton Inference Server:
$ git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
Cloning into 'tensorrtllm_backend'...
remote: Enumerating objects: 870, done.
remote: Counting objects: 100% (348/348), done.
remote: Compressing objects: 100% (165/165), done.
remote: Total 870 (delta 229), reused 242 (delta 170), pack-reused 522
Receiving objects: 100% (870/870), 387.70 KiB | 973.00 KiB/s, done.
Resolving deltas: 100% (439/439), done.
$ cd tensorrtllm_backend/
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ git submodule update --init --recursive
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ git lfs install
Updated git hooks.
Git LFS initialized.
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ git lfs pull
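The copy step that follows assumes that the TensorRT-LLM engines for the model have already been built under ../TensorRT-LLM/examples/llama/out. As a minimal sketch only (the checkpoint path, output directories, and parallelism settings below are illustrative assumptions, not the exact commands used in this validation), building 4-way tensor-parallel FP16 engines for Llama 2 70B with TensorRT-LLM v0.8.0 follows the examples/llama workflow:
$ # Convert the Hugging Face checkpoint to TensorRT-LLM format with 4-way tensor parallelism
$ python3 ../TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir /path/to/Llama-2-70b-chat-hf --output_dir ./llama-70b-ckpt-tp4 --dtype float16 --tp_size 4
$ # Build the serving engines from the converted checkpoint
$ trtllm-build --checkpoint_dir ./llama-70b-ckpt-tp4 --output_dir ../TensorRT-LLM/examples/llama/out --gemm_plugin float16
The world_size value passed to launch_triton_server.py later in this procedure must match the tensor-parallel size used when building the engines.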
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ cp ../TensorRT-LLM/examples/llama/out/* all_models/inflight_batcher_llm/tensorrt_llm/1/
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ export HF_LLAMA_MODEL=meta-llama/Llama-2-70b-chat-hf
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ cp all_models/inflight_batcher_llm/ llama_ifb -r
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:llama,triton_max_batch_size:64,preprocessing_instance_count:1
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:llama,triton_max_batch_size:64,postprocessing_instance_count:1
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:64
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:/llama_ifb/tensorrt_llm/1/,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ docker images
REPOSITORY       TAG      IMAGE ID       CREATED       SIZE
triton_trt_llm   latest   03f416455199   2 hours ago   53.1GB
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v $(pwd)/llama_ifb:/llama_ifb -v $(pwd)/scripts:/opt/scripts triton_trt_llm:latest bash
root@node002:/app# huggingface-cli login --token <token>
root@node002:/app# python /opt/scripts/launch_triton_server.py --model_repo /llama_ifb/ --world_size 4
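Before sending inference requests, you can optionally confirm that the server and the loaded models are ready. Triton exposes a standard HTTP readiness endpoint on port 8000; the following check (a hedged example, not part of the original procedure) prints 200 when the server is ready:
$ curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready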
root@node002:/app# curl -X POST localhost:8000/v2/models/ensemble/generate -d '{
"text_input": " <s>[INST] <<SYS>> You are a helpful assistant <</SYS>> What is the capital of Texas?[/INST]",
"parameters": {
"max_tokens": 100,
"bad_words":[""],
"stop_words":[""],
"temperature":0.2,
"top_p":0.7
}
}'
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" Sure, I'd be happy to help! The capital of Texas is Austin."}
To run the procedure from another server, replace localhost with the IP address or name of the host.
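For programmatic access, the same generate endpoint can be called from any HTTP client. The following minimal Python sketch (the host name, prompt, and generation parameters are carried over from the curl example above and are illustrative) sends the request and prints the completion:
import requests

# Triton's HTTP generate endpoint for the ensemble model; replace localhost
# with the IP address or host name of the inference server for remote access.
URL = "http://localhost:8000/v2/models/ensemble/generate"

payload = {
    "text_input": " <s>[INST] <<SYS>> You are a helpful assistant <</SYS>> What is the capital of Texas?[/INST]",
    "parameters": {
        "max_tokens": 100,
        "bad_words": [""],
        "stop_words": [""],
        "temperature": 0.2,
        "top_p": 0.7,
    },
}

response = requests.post(URL, json=payload, timeout=120)
response.raise_for_status()

# The generated completion is returned in the "text_output" field.
print(response.json()["text_output"])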
TensorRT-LLM can also be installed in a standalone container without Triton Inference Server, or used with other inference servers such as KServe. To deploy the container, follow the instructions described here.
We validated the Llama 2 models for binary text classification using the Stanford Politeness Corpus, a dataset of text phrases labeled as polite or impolite. We evaluated the models on a fixed test set of 700 phrases, prompting the model to classify each phrase as polite or impolite.
The following text is an example of a prompt and the response:
Prompt: What do you mean? How am I supposed to vindicate myself of this ridiculous accusation? Question: polite or impolite?
Response: Impolite
The following table summarizes the scores we obtained for the Llama 2 70B model. Scores for the other Llama 2 models were similar. An explanation of the performance metrics can be found here.
| Model | Label | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|
| Llama 2 70B | Polite | 93.46 | 92.86 | 93.16 |
| Llama 2 70B | Impolite | 92.52 | 93.15 | 92.83 |
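To reproduce these scores, the labels parsed from the model's responses can be compared against the gold labels using standard classification metrics. The following is a minimal sketch using scikit-learn; the toy lists below stand in for the 700 gold labels and the parsed model predictions, and the data layout is an assumption:
from sklearn.metrics import classification_report

def score_predictions(y_true, y_pred):
    # Per-label precision, recall, and F1; multiply by 100 to match the
    # percentage values reported in the table above.
    print(classification_report(y_true, y_pred, labels=["polite", "impolite"], digits=4))

# Toy example; in the validation run, y_true holds the gold labels and
# y_pred the labels parsed from each response's text_output field.
y_true = ["polite", "impolite", "polite", "impolite"]
y_pred = ["polite", "impolite", "impolite", "impolite"]
score_predictions(y_true, y_pred)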
Measuring the accuracy of LLMs is vital for assessing their performance, comparing different models, and tracking progress in the field of NLP. It guides users in determining which foundation model to use for further customization and fine-tuning. Accurate evaluation metrics also validate research findings, aid in model optimization, and help address potential biases or ethical concerns. Ultimately, accurate assessment ensures informed and responsible deployment of LLMs in diverse real-world applications. We found that the Hugging Face Open LLM Leaderboard is a good starting point for comparing model accuracy.