Introduction to Using the GenAI-Perf Tool on Dell Servers
Mon, 22 Jul 2024 13:16:51 -0000
Overview
Performance analysis is a critical part of the inference phase of the machine learning life cycle because it ensures that models run efficiently and reliably. For example, simulating load on a server with a concurrency mode or a request rate mode helps you understand how the system behaves under different load conditions, which is crucial for capacity planning and resource allocation. Depending on the use case, the analysis can replicate real-world scenarios, such as maintaining a specific concurrency of incoming requests to verify that the server can handle constant load or bursty traffic patterns. A comprehensive view of a model's performance enables data scientists to build models that are not only accurate but also robust and efficient.
Triton Performance Analyzer is a CLI tool that analyzes and optimizes the performance of Triton-based systems. It provides detailed information about system performance, including metrics related to GPU, CPU, and memory, and it can also collect custom metrics using Triton's C API. The tool supports various inference load modes and performance measurement modes.
The Triton Performance Analyzer can help identify performance bottlenecks, optimize system performance, troubleshoot issues, and more. In Triton's suite of performance analysis tools, the recently released GenAI-Perf uses Perf Analyzer in the backend. The GenAI-Perf tool can be used to gather various LLM metrics.
This blog focuses on the capabilities and use of GenAI-Perf.
GenAI-Perf
GenAI-Perf is a command-line performance measurement tool that is customized to collect metrics that are more useful when analyzing an LLM's performance. These metrics include output token throughput, time to first token, inter-token latency, and request throughput.
These metrics help you:
- Analyze overall system performance
- Determine how quickly the system starts processing a request (time to first token)
- Measure the total time the system takes to completely process a request (request latency)
- Get a granular view of how fast the system generates each part of the response (inter-token latency)
- Gauge the system's overall efficiency in generating tokens (output token throughput)
This blog also describes how to collect these metrics and automatically create plots using GenAI-Perf.
Implementation
The following steps guide you through the process of using GenAI-Perf. In this example, we collect metrics from a Llama 3 model.
Triton Inference Server
Before running the GenAI-Perf tool, launch Triton Inference Server with your Large Language Model (LLM) of choice.
The following procedure starts Llama 3 70B on Triton Inference Server v24.05. For more information about how to convert Hugging Face weights to run on Triton Inference Server, see the Converting Hugging Face Large Language Models to TensorRT-LLM blog.
The following example shows a sample command to start a Triton container:
docker run --rm -it --net host --shm-size=128g \
  --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
  -v $(pwd)/llama3-70b-instruct-ifb:/llama_ifb \
  -v $(pwd)/scripts:/opt/scripts \
  nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
Because Llama 3 is a gated model distributed by Hugging Face, you must request access to Llama 3 using Hugging Face and then create a token. For more information about Hugging Face tokens, see https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/gated_model_access.
An easy method to use your token with Triton is to log in to Hugging Face, which caches a local token:
huggingface-cli login --token hf_Enter_Your_Token
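To confirm that the token was cached correctly, you can run a quick check (assuming the Hugging Face CLI is available in the container):
# Prints the account associated with the cached token
huggingface-cli whoami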
The following example shows a sample command to start the inference:
python3 /opt/scripts/launch_triton_server.py --model_repo /llama_ifb/ --world_size 4
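Before profiling, it is a good idea to confirm that the server is ready to accept requests. A minimal check, assuming Triton's default HTTP port (8000) is reachable from the host, uses the standard readiness endpoint:
# Returns HTTP 200 when the server and its models are ready
curl -v localhost:8000/v2/health/ready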
NIM for LLMs
Another method to deploy the Llama 3 model is to use the NVIDIA Inference Microservices (NIM). For more information about how to deploy NIM on the Dell PowerEdge XE9680 server, see Introduction to NVIDIA Inference Microservices, aka NIM. Also, see NVIDIA NIM for LLMs - Introduction.
The following example shows a sample script to start NIM with Llama 3 70b Instruct:
export NGC_API_KEY=<enter-your-key>
export CONTAINER_NAME=meta-llama3-70b-instruct
export IMG_NAME="nvcr.io/nim/meta/llama3-70b-instruct:1.0.0"
export LOCAL_NIM_CACHE=/aipsf810/project-helix/NIM/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
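After the container reports that it is ready, you can optionally verify that the OpenAI-compatible endpoint is serving the model before benchmarking. A minimal check, assuming the port mapping shown above:
# Lists the models served by NIM; the response should include meta/llama3-70b-instruct
curl -s http://localhost:8000/v1/models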
Triton SDK container
After the Triton Inference Server container is launched and the inference is started, run the Triton Server SDK container:
docker run -it --net=host --gpus=all \
  nvcr.io/nvidia/tritonserver:24.05-py3-sdk
You can install the GenAI-Perf tool using pip. In our example, we use the NGC container, which is easier to use and manage.
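If you prefer a local installation instead of the SDK container, a pip-based setup along the following lines should work (package name as documented by NVIDIA; a recent Python environment is assumed):
# Install GenAI-Perf from PyPI
pip install genai-perf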
Measure throughput and latency
When the containers are running, log in to the SDK container and run the GenAI-Perf tool.
The following example shows a sample command:
genai-perf \
  -m ensemble \
  --service-kind triton \
  --backend tensorrtllm \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --streaming \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --output-tokens-mean-deterministic \
  --tokenizer hf-internal-testing/llama-tokenizer \
  --concurrency 1 \
  --generate-plots \
  --measurement-interval 10000 \
  --profile-export-file profile_export.json \
  --url localhost:8001
This command produces values similar to the values in the following table:
Statistic | Average | Minimum | Maximum | p99 | p90 | p75 |
Time to first token (ns) | 40,375,620 | 37,453,094 | 74,652,113 | 69,046,198 | 39,642,518 | 38,639,988 |
Inter token latency (ns) | 17,272,993 | 5,665,738 | 19,084,237 | 19,024,802 | 18,060,240 | 18,023,915 |
Request latency (ns) | 1,815,146,838 | 1,811,433,087 | 1,850,664,440 | 1,844,310,335 | 1,814,057,039 | 1,813,603,920 |
Number of output tokens | 108 | 100 | 123 | 122 | 116 | 112 |
Number of input tokens | 200 | 200 | 200 | 200 | 200 | 200 |
Output token throughput (per sec): 59.63
Request throughput (per sec): 0.55
See Metrics at https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/README.html#metrics for more information.
To run the performance tool with NIM, you must change parameters such as the model name, service-kind, endpoint-type, and so on, as shown in the following example:
genai-perf \
  -m meta/llama3-70b-instruct \
  --service-kind openai \
  --endpoint-type chat \
  --backend tensorrtllm \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --streaming \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --tokenizer hf-internal-testing/llama-tokenizer \
  --concurrency 1 \
  --measurement-interval 10000 \
  --profile-export-file profile_export.json \
  --url localhost:8000
Results
The GenAI-Perf tool saves the output to the artifacts directory by default. Each run creates an artifacts/<model-name>-<service-kind>-<backend>-concurrency<N> directory.
The following example shows a sample directory:
ll artifacts/ensemble-triton-tensorrtllm-concurrency1/
total 2800
drwxr-xr-x  3 user user     127 Jun 10 13:40 ./
drwxr-xr-x 10 user user    4096 Jun 10 13:34 ../
-rw-r--r--  1 user user   16942 Jun 10 13:40 all_data.gzip
-rw-r--r--  1 user user  126845 Jun 10 13:40 llm_inputs.json
drwxr-xr-x  2 user user    4096 Jun 10 12:24 plots/
-rw-r--r--  1 user user 2703714 Jun 10 13:40 profile_export.json
-rw-r--r--  1 user user     577 Jun 10 13:40 profile_export_genai_perf.csv
The profile_export_genai_perf.csv file provides the same results that are displayed during the test.
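Because it is a plain CSV file, it is easy to inspect or post-process. For example, the following command (using the sample run above) prints it as aligned columns in the shell:
# Pretty-print the CSV as aligned columns
column -s, -t < artifacts/ensemble-triton-tensorrtllm-concurrency1/profile_export_genai_perf.csv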
GenAI-Perf can also generate plots from the collected data automatically. To enable this feature, include the --generate-plots option in the command, as shown in the earlier example.
The following figure shows the distribution of input tokens to generated tokens. This metric is useful to understand how the model responds to different lengths of input.
Figure 1: Distribution of input tokens to generated tokens
The following figure shows a scatter plot of how token-to-token latency varies with output token position. These results show how quickly tokens are generated and how consistent the generation is across output token positions.
Figure 2: Token-to-token latency compared to output token position
Conclusion
Performance analysis during the inference phase is crucial because it directly impacts the overall effectiveness of a machine learning solution. Tools such as GenAI-Perf provide comprehensive information that helps anyone looking to deploy optimized LLMs in production. The NVIDIA Triton suite has been extensively tested on Dell servers and can be used to capture important LLM metrics with minimal effort. The GenAI-Perf tool is easy to use and produces extensive data that you can use to tune your LLM for optimal performance.