The following tables list the system configurations and software stack used for generative AI validation:
| Component | Details | | | |
|---|---|---|---|---|
| Compute server for inferencing | 4 x PowerEdge R760xa | 4 x PowerEdge R760xa | 2 x PowerEdge XE9680 | 2 x PowerEdge XE8640 |
| GPUs per server | 4 x NVIDIA L40S PCIe GPUs | 4 x NVIDIA H100 PCIe GPUs | 8 x NVIDIA H100 SXM GPUs | 4 x NVIDIA H100 SXM GPUs |
| Network adapter | 1 x NVIDIA ConnectX-6 DX Dual Port 100 GbE | 1 x NVIDIA ConnectX-6 DX Dual Port 100 GbE | 2 x NVIDIA ConnectX-6 DX Dual Port 100 GbE | 1 x NVIDIA ConnectX-6 DX Dual Port 100 GbE, OCP NIC 3.0 |
| Network switch | 2 x PowerSwitch S5248F-ON | 2 x PowerSwitch S5248F-ON | 2 x PowerSwitch S5248F-ON | 2 x PowerSwitch S5248F-ON |
| Component | Details |
|---|---|
| Operating system | Ubuntu 22.04.1 LTS |
| Cluster management | NVIDIA Base Command Manager Essentials 10.23.12 |
| Kubernetes | Upstream Kubernetes v1.27.6 |
| GPU operator | NVIDIA GPU Operator v22.9.2 |
| Inference server | NVIDIA Triton Inference Server v23.04 |
| Inference engine | NVIDIA TensorRT-LLM 0.7.1 and 0.8.0 |
We deployed and validated the models in the following scenarios:
TensorRT-LLM supports a wide variety of models as listed here. In this design, we validated the following generative AI models:
We used the following steps to validate the models in Triton Inference Server:
$ git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
Cloning into 'tensorrtllm_backend'...
remote: Enumerating objects: 870, done.
remote: Counting objects: 100% (348/348), done.
remote: Compressing objects: 100% (165/165), done.
remote: Total 870 (delta 229), reused 242 (delta 170), pack-reused 522
Receiving objects: 100% (870/870), 387.70 KiB | 973.00 KiB/s, done.
Resolving deltas: 100% (439/439), done.
$ cd tensorrtllm_backend/
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ git submodule update --init --recursive
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ git lfs install
Updated git hooks.
Git LFS initialized.
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ git lfs pull
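The copy step that follows assumes that the TensorRT-LLM engines for the model have already been built under ../TensorRT-LLM/examples/llama/out. As a minimal sketch only (the checkpoint path, output directories, and parallelism settings below are illustrative assumptions, not the exact commands used in this validation), building 4-way tensor-parallel FP16 engines for Llama 2 70B with TensorRT-LLM v0.8.0 follows the examples/llama workflow:
$ # Convert the Hugging Face checkpoint to TensorRT-LLM format with 4-way tensor parallelism
$ python3 ../TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir /path/to/Llama-2-70b-chat-hf --output_dir ./llama-70b-ckpt-tp4 --dtype float16 --tp_size 4
$ # Build the serving engines from the converted checkpoint
$ trtllm-build --checkpoint_dir ./llama-70b-ckpt-tp4 --output_dir ../TensorRT-LLM/examples/llama/out --gemm_plugin float16
The world_size value passed to launch_triton_server.py later in this procedure must match the tensor-parallel size used when building the engines.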
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ cp ../TensorRT-LLM/examples/llama/out/* all_models/inflight_batcher_llm/tensorrt_llm/1/
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ export HF_LLAMA_MODEL=meta-llama/Llama-2-70b-chat-hf
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ cp all_models/inflight_batcher_llm/ llama_ifb -r
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:llama,triton_max_batch_size:64,preprocessing_instance_count:1
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:llama,triton_max_batch_size:64,postprocessing_instance_count:1
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:64
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:/llama_ifb/tensorrt_llm/1/,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ docker images
REPOSITORY       TAG      IMAGE ID       CREATED       SIZE
triton_trt_llm   latest   03f416455199   2 hours ago   53.1GB
user@node002:/aipsf600/TensonRT-LLM/v0.8.0/tensorrtllm_backend$ docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v $(pwd)/llama_ifb:/llama_ifb -v $(pwd)/scripts:/opt/scripts triton_trt_llm:latest bash
root@node002:/app# huggingface-cli login --token <token>
root@node002:/app# python /opt/scripts/launch_triton_server.py --model_repo /llama_ifb/ --world_size 4
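Before sending inference requests, you can optionally confirm that the server and the loaded models are ready. Triton exposes a standard HTTP readiness endpoint on port 8000; the following check (a hedged example, not part of the original procedure) prints 200 when the server is ready:
$ curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready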
root@node002:/app# curl -X POST localhost:8000/v2/models/ensemble/generate -d '{
"text_input": " <s>[INST] <<SYS>> You are a helpful assistant <</SYS>> What is the capital of Texas?[/INST]",
"parameters": {
"max_tokens": 100,
"bad_words":[""],
"stop_words":[""],
"temperature":0.2,
"top_p":0.7
}
}'
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" Sure, I'd be happy to help! The capital of Texas is Austin."}
To run the procedure from another server, replace localhost with the IP address or name of the host.
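For programmatic access, the same generate endpoint can be called from any HTTP client. The following minimal Python sketch (the host name, prompt, and generation parameters are carried over from the curl example above and are illustrative) sends the request and prints the completion:
import requests

# Triton's HTTP generate endpoint for the ensemble model; replace localhost
# with the IP address or host name of the inference server for remote access.
URL = "http://localhost:8000/v2/models/ensemble/generate"

payload = {
    "text_input": " <s>[INST] <<SYS>> You are a helpful assistant <</SYS>> What is the capital of Texas?[/INST]",
    "parameters": {
        "max_tokens": 100,
        "bad_words": [""],
        "stop_words": [""],
        "temperature": 0.2,
        "top_p": 0.7,
    },
}

response = requests.post(URL, json=payload, timeout=120)
response.raise_for_status()

# The generated completion is returned in the "text_output" field.
print(response.json()["text_output"])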
TensorRT-LLM can also be installed in a standalone container without Triton Inference Server, or used with other inference servers such as KServe. To deploy the container, follow the instructions described here.
We validated the Llama 2 models for binary text classification using the Stanford Politeness Corpus, a dataset of text phrases labeled as polite or impolite. We evaluated the models on a fixed test set of 700 phrases, prompting the model to classify each phrase as polite or impolite.
The following text is an example of a prompt and the response:
Prompt: What do you mean? How am I supposed to vindicate myself of this ridiculous accusation? Question: polite or impolite?
Response: Impolite
The following table summarizes the scores we obtained for the Llama 2 70B model. Scores for the other Llama 2 models were similar. An explanation of the performance metrics can be found here.
| Model | Label | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|
| Llama 2 70B | Polite | 93.46 | 92.86 | 93.16 |
| Llama 2 70B | Impolite | 92.52 | 93.15 | 92.83 |
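To reproduce these scores, the labels parsed from the model's responses can be compared against the gold labels using standard classification metrics. The following is a minimal sketch using scikit-learn; the toy lists below stand in for the 700 gold labels and the parsed model predictions, and the data layout is an assumption:
from sklearn.metrics import classification_report

def score_predictions(y_true, y_pred):
    # Per-label precision, recall, and F1; multiply by 100 to match the
    # percentage values reported in the table above.
    print(classification_report(y_true, y_pred, labels=["polite", "impolite"], digits=4))

# Toy example; in the validation run, y_true holds the gold labels and
# y_pred the labels parsed from each response's text_output field.
y_true = ["polite", "impolite", "polite", "impolite"]
y_pred = ["polite", "impolite", "impolite", "impolite"]
score_predictions(y_true, y_pred)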
Measuring the accuracy of LLMs is vital for assessing their performance, comparing different models, and tracking progress in the field of NLP. It guides users in determining which foundation model to use for further customization and fine-tuning. Accurate evaluation metrics also validate research findings, aid in model optimization, and help address potential biases or ethical concerns. Ultimately, accurate assessment ensures informed and responsible deployment of LLMs in diverse real-world applications. We found that the Hugging Face Open LLM Leaderboard is a good starting point for comparing model accuracy.