Introduction to NVIDIA Inference Microservices, aka NIM
Fri, 14 Jun 2024 19:16:30 -0000
At NVIDIA GTC 2024, the major release of NVIDIA Inference Microservices, aka NIM, was announced. NIM is part of the portfolio making up the NVIDIA AI Enterprise stack. Why the focus on inferencing? Because when we look at use cases for Generative AI, the vast majority of them are inferencing use cases. Therefore, it was critical to simplify the deployment of applications that leverage inferencing.
What is NIM?
NIM is a set of microservices designed to automate the deployment of Generative AI inferencing applications. NIM was built with flexibility in mind: it supports a wide range of GenAI models and also enables frictionless scaling of GenAI inferencing. Below is a high-level view of the NIM components:
Diving a layer deeper, NIM is made of the following services:
- An API layer
- A server layer
- A runtime layer
- A model “engine”
Each microservice is based on a Docker container, simplifying deployment on a wide variety of platforms and operating systems. Those containers can be downloaded from the NVIDIA Docker registry on NGC (https://catalog.ngc.nvidia.com).
Because of their modular nature, NIMs can be adapted for vastly different use cases. For instance, at the time of launch, NVIDIA is releasing a number of NIMs, including but not limited to a NIM for LLMs, a NIM for Automated Speech Recognition, and a NIM for Biology to predict the 3D structure of a protein. The remainder of this blog focuses on the NIM for LLMs.
While models can be deployed as part of a NIM, NIMs are flexible enough to use a model that is already present locally, but they are also able to download models from NGC. In both cases, NVIDIA pre-generates the model engines and includes industry-standard APIs with the Triton Inference Server. The figure below shows what happens at the start of a NIM:
When the docker run command is issued to start the NIM and the containers have been downloaded, the container checks whether a model is already present on the filesystem. If it is, the container uses that model; if it is not, it downloads a default model from NGC. While having the model automatically downloaded seems like a small step, it fits with the overall NIM philosophy of ease of use and faster time to deployment.
Setting up NIM
Thanks to their packaging as Docker containers, NIMs have few prerequisites. You basically need a GPU or a set of homogeneous GPUs with sufficient aggregate memory to run your model, with tensor parallelism used to spread the model across multiple GPUs. NIMs can run on pretty much any Linux distribution and only require Docker, the NVIDIA Container Toolkit, and the CUDA drivers to be installed.
Once all the prerequisites have been installed, you can verify your installation using the following docker command: docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi, which should display an output similar to the one below:
$ docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Fri May 24 14:13:48 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 48C P0 128W / 700W | 76989MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
| N/A 40C P0 123W / 700W | 77037MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
| N/A 40C P0 128W / 700W | 77037MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 47C P0 129W / 700W | 77037MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
| N/A 49C P0 142W / 700W | 77037MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
| N/A 41C P0 131W / 700W | 77037MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 48C P0 144W / 700W | 77037MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 40C P0 129W / 700W | 76797MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
The next thing you will need is an NGC authentication key. An NGC API key is required to access NGC resources, and a key can be generated here: https://org.ngc.nvidia.com/setup/personal-keys.
When creating an NGC API key, ensure that at least “NGC Catalog” is selected from the “Services Included” dropdown. More Services can be included if this key is to be reused for other purposes.
This key will need to be passed to docker run in the next section as the NGC_API_KEY environment variable to download the appropriate models and resources when starting the NIM.
If you’re not familiar with how to do this, the simplest way is to export it in your terminal:
export NGC_API_KEY=<value>
Run one of the following to make it available at startup:
# If using bash
echo "export NGC_API_KEY=<value>" >> ~/.bashrc
# If using zsh
echo "export NGC_API_KEY=<value>" >> ~/.zshrc
To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
Use $oauthtoken as the username and NGC_API_KEY as the password. The $oauthtoken username is a special name that indicates that you will authenticate with an API key and not a username and password. This is what the output of the command looks like:
$ echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
WARNING! Your password will be stored unencrypted in /home/fbronzati/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
A few times throughout this documentation, the ngc CLI tool will be used. Before continuing, please refer to the NGC CLI documentation for information on how to download and configure the tool; a minimal configuration sketch is shown after the note below.
Note: The ngc tool used to use the environment variable NGC_API_KEY but has deprecated it in favor of NGC_CLI_API_KEY. In the previous section, you set NGC_API_KEY, and it will be used in future commands. If you run ngc with this variable set, you will get a warning saying it is deprecated in favor of NGC_CLI_API_KEY. This can be safely ignored for now. You can set NGC_CLI_API_KEY, but as long as NGC_API_KEY is set, you will still get the warning.
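As a quick sketch (assuming the ngc binary is already installed and on your PATH; exact subcommands can vary slightly between CLI versions), the CLI can be configured interactively and then used to verify access to NGC:
# Configure the NGC CLI interactively (it prompts for the API key, org, and team)
$ ngc config set
# Verify that the CLI can reach NGC by printing the current user information
$ ngc user who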
Launching NIM
The below command will launch a Docker container for the meta-llama3-8b-instruct model.
# Choose a container name for bookkeeping
export CONTAINER_NAME=meta-llama3-8b-instruct
# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME --runtime=nvidia --gpus all -e NGC_API_KEY -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" -u $(id -u) -p 8000:8000 $IMG_NAME
The NVIDIA NIM for LLMs will automatically select the most compatible profile based on your system specification, using one of the following backends:
- TensorRT-LLM for optimized inference engines, when a compatible model is found
- vLLM for generic, non-optimized models
The selection will be logged at startup. For example:
Detected 6 compatible profile(s).
Valid profile: dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency) on GPUs [0, 1]
Valid profile: 30b562864b5b1e3b236f7b6d6a0998efbed491e4917323d04590f715aa9897dc (tensorrt_llm-h100-fp8-tp1-throughput) on GPUs [0]
Valid profile: 879b05541189ce8f6323656b25b7dff1930faca2abe552431848e62b7e767080 (tensorrt_llm-h100-fp16-tp2-latency) on GPUs [0, 1]
Valid profile: cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1 (tensorrt_llm-h100-fp16-tp1-throughput) on GPUs [0]
Valid profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2) on GPUs [0, 1, 2, 3, 4, 5, 6, 7]
Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0, 1, 2, 3, 4, 5, 6, 7]
Selected profile: dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)
Profile metadata: precision: fp8
Profile metadata: tp: 2
Profile metadata: llm_engine: tensorrt_llm
Profile metadata: feat_lora: false
Profile metadata: gpu: H100
Profile metadata: pp: 1
Profile metadata: gpu_device: 2330:10de
Profile metadata: profile: latency
It is possible to override this behavior and set a specific profile ID with -e NIM_MODEL_PROFILE=<value>. The following list-model-profiles command lists the available profiles for the IMG_NAME LLM NIM:
docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles
Updating NIM to use PowerScale as model cache
It is possible to modify the NIM deployment process to leverage PowerScale as the cache to store the model.
Why would one do that? Because storing the model in a cache on PowerScale allows the model to be re-used by any server or even multiple clusters. This is not as critical when an application leverages a foundation model, but if you have spent money and resources to customize a particular model, this method allows that particular model to be re-used by multiple applications.
It also means that the application can scale horizontally as the model is now available to multiple servers, thus potentially improving its performance.
To achieve this, a few things need to happen.
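As a precondition, the PowerScale export has to be visible on the host as a local path; below is a minimal sketch of mounting it over NFS, assuming a hypothetical SmartConnect hostname and export path (your actual values will differ):
# Create a mount point and mount the PowerScale NFS export (hypothetical hostname and export path)
$ sudo mkdir -p /aipsf810
$ sudo mount -t nfs powerscale.example.com:/ifs/data/aipsf810 /aipsf810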
First let’s export the container name and the image name:
$ export CONTAINER_NAME=meta-llama3-8b-instruct
$ export IMG_NAME="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0"
Then let’s create the directory that will be used as cache on PowerScale and export that directory:
$ export LOCAL_NIM_CACHE=/aipsf810/project-helix/NIM/nim
$ mkdir -p "$LOCAL_NIM_CACHE"
Then let’s run the container with these environment variables:
$ docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
Unable to find image 'nvcr.io/nim/meta/llama3-8b-instruct:1.0.0' locally
1.0.0: Pulling from nim/meta/llama3-8b-instruct
5e8117c0bd28: Already exists
d67fcc6ef577: Already exists
47ee674c5713: Already exists
63daa0e64b30: Already exists
d9d9aecefab5: Already exists
d71f46a15657: Already exists
054e2ffff644: Already exists
7d3cd81654d5: Already exists
dca613dca886: Already exists
0fdcdcda3b2e: Already exists
af7b4f7dc15a: Already exists
6d101782f66c: Already exists
e8427cb13897: Already exists
de05b029a5a2: Already exists
3d72a2698104: Already exists
aeff973c2191: Already exists
85d7d3ff0cca: Already exists
5996430251dd: Already exists
314dc83fdfc2: Already exists
5cef8f59ae9a: Already exists
927db4ce3e96: Already exists
cbe4a04f4491: Already exists
60f1a03c0955: Pull complete
67c1bb2b1aac: Pull complete
f16f7b821143: Pull complete
9be4fff0cd1a: Pull complete
Digest: sha256:7fe6071923b547edd9fba87c891a362ea0b4a88794b8a422d63127e54caa6ef7
Status: Downloaded newer image for nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
2024-06-05 14:52:46,069 [INFO] PyTorch version 2.2.2 available.
2024-06-05 14:52:46,732 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-06-05 14:52:46,733 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-06-05 14:52:48,096 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 06-05 14:52:49.6 api_server.py:489] NIM LLM API version 1.0.0
INFO 06-05 14:52:49.12 ngc_profile.py:217] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 06-05 14:52:49.12 ngc_profile.py:219] Detected 6 compatible profile(s).
INFO 06-05 14:52:49.12 ngc_injector.py:106] Valid profile: dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency) on GPUs [0, 1]
INFO 06-05 14:52:49.12 ngc_injector.py:106] Valid profile: 30b562864b5b1e3b236f7b6d6a0998efbed491e4917323d04590f715aa9897dc (tensorrt_llm-h100-fp8-tp1-throughput) on GPUs [0]
INFO 06-05 14:52:49.12 ngc_injector.py:106] Valid profile: 879b05541189ce8f6323656b25b7dff1930faca2abe552431848e62b7e767080 (tensorrt_llm-h100-fp16-tp2-latency) on GPUs [0, 1]
INFO 06-05 14:52:49.12 ngc_injector.py:106] Valid profile: cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1 (tensorrt_llm-h100-fp16-tp1-throughput) on GPUs [0]
INFO 06-05 14:52:49.12 ngc_injector.py:106] Valid profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2) on GPUs [0, 1, 2, 3, 4, 5, 6, 7]
INFO 06-05 14:52:49.12 ngc_injector.py:106] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0, 1, 2, 3, 4, 5, 6, 7]
INFO 06-05 14:52:49.12 ngc_injector.py:141] Selected profile: dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)
INFO 06-05 14:52:50.112 ngc_injector.py:146] Profile metadata: precision: fp8
INFO 06-05 14:52:50.112 ngc_injector.py:146] Profile metadata: tp: 2
INFO 06-05 14:52:50.112 ngc_injector.py:146] Profile metadata: llm_engine: tensorrt_llm
INFO 06-05 14:52:50.112 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 06-05 14:52:50.112 ngc_injector.py:146] Profile metadata: gpu: H100
INFO 06-05 14:52:50.112 ngc_injector.py:146] Profile metadata: pp: 1
INFO 06-05 14:52:50.112 ngc_injector.py:146] Profile metadata: gpu_device: 2330:10de
INFO 06-05 14:52:50.112 ngc_injector.py:146] Profile metadata: profile: latency
INFO 06-05 14:52:50.112 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
tokenizer_config.json [00:00:00] 49.79 KiB/49.79 KiB 843.55 KiB/s
rank1.engine [00:01:43] 4.30 GiB/4.30 GiB 42.53 MiB/s
trt_llm_config.yaml [00:00:00] 1012 B/1012 B 17.60 KiB/s
config.json [00:00:00] 654 B/654 B 27.50 KiB/s
special_tokens_map.json [00:00:00] 73 B/73 B 3.12 KiB/s
rank0.engine [00:01:43] 4.30 GiB/4.30 GiB 42.63 MiB/s
tokenizer.json [00:00:00] 8.66 MiB/8.66 MiB 31.35 MiB/s
checksums.blake3 [00:00:00] 402 B/402 B 11.67 KiB/s
generation_config.json [00:00:00] 187 B/187 B 8.23 KiB/s
model.safetensors.index.json [00:00:00] 23.39 KiB/23.39 KiB 931.10 KiB/s
config.json [00:00:00] 5.21 KiB/5.21 KiB 98.42 KiB/s
metadata.json [00:00:00] 232 B/232 B 6.36 KiB/s
INFO 06-05 14:56:25.185 ngc_injector.py:172] Model workspace is now ready. It took 215.073 seconds
INFO 06-05 14:56:25.207 async_trtllm_engine.py:74] Initializing an LLM engine (v1.0.0) with config: model='/tmp/meta--llama3-8b-instruct-ic179k86', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-ic179k86', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 06-05 14:56:25.562 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-05 14:56:25.583 utils.py:142] Using provided selected GPUs list [0, 1]
INFO 06-05 14:56:25.583 utils.py:201] Using 0 bytes of gpu memory for PEFT cache
INFO 06-05 14:56:25.593 utils.py:207] Engine size in bytes 4613382012
INFO 06-05 14:56:25.604 utils.py:211] available device memory 85170061312
INFO 06-05 14:56:25.604 utils.py:218] Setting free_gpu_memory_fraction to 0.9
WARNING 06-05 14:57:25.189 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-05 14:57:25.198 serving_chat.py:347] Using default chat template:
{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>
' }}{% endif %}
WARNING 06-05 14:57:25.454 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-05 14:57:25.462 api_server.py:456] Serving endpoints:
0.0.0.0:8000/openapi.json
0.0.0.0:8000/docs
0.0.0.0:8000/docs/oauth2-redirect
0.0.0.0:8000/metrics
0.0.0.0:8000/v1/health/ready
0.0.0.0:8000/v1/health/live
0.0.0.0:8000/v1/models
0.0.0.0:8000/v1/version
0.0.0.0:8000/v1/chat/completions
0.0.0.0:8000/v1/completions
INFO 06-05 14:57:25.462 api_server.py:460] An example cURL request:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama3-8b-instruct",
"messages": [
{
"role":"user",
"content":"Hello! How are you?"
},
{
"role":"assistant",
"content":"Hi! I am quite well, how can I help you today?"
},
{
"role":"user",
"content":"Can you write me a song?"
}
],
"top_p": 1,
"n": 1,
"max_tokens": 15,
"stream": true,
"frequency_penalty": 1.0,
"stop": ["hello"]
}'
INFO 06-05 14:57:25.508 server.py:82] Started server process [32]
INFO 06-05 14:57:25.509 on.py:48] Waiting for application startup.
INFO 06-05 14:57:25.518 on.py:62] Application startup complete.
INFO 06-05 14:57:25.519 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Your output might differ slightly from the above, but if you reach the last line, then you have now successfully deployed a NIM for LLM with Llama 3 8B cached on PowerScale.
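As a quick sanity check (a minimal sketch using the endpoints listed in the startup log above), you can confirm the NIM is ready and see which model it is serving:
# Check that the NIM is ready to serve requests
$ curl http://0.0.0.0:8000/v1/health/ready
# List the model(s) exposed by this NIM
$ curl http://0.0.0.0:8000/v1/models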
But what if I want to run a different model, such as Llama 3 70B instead of 8B? Easy: stop the previous container and change the two following environment variables:
$ export CONTAINER_NAME=meta-llama3-70b-instruct
$ export IMG_NAME="nvcr.io/nim/meta/llama3-70b-instruct:1.0.0"
And run the same command as previously:
$ docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
At the end of this, you will have a NIM for LLMs running Llama 3 70B. And yes, it really is that simple to deploy all the required components to run inference.
Selecting a specific GPU
In all the commands above, I have instructed the container to use all the available GPUs by passing the --gpus all parameter. This is acceptable in homogeneous environments with one or more of the same GPU.
In heterogeneous environments with a combination of GPUs (for example: A6000 + a GeForce display GPU), workloads should only run on compute-capable GPUs. Expose specific GPUs inside the container using either:
- the --gpus flag (ex: --gpus='"device=1"')
- the environment variable NVIDIA_VISIBLE_DEVICES (ex: -e NVIDIA_VISIBLE_DEVICES=1)
The device ID(s) to use as input(s) are listed in the output of nvidia-smi -L:
fbronzati@node041:/aipsf810/project-helix/NIM$ nvidia-smi -L
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-a4c60bd7-b5fc-f461-2902-65138251f2cf)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-e5cd81b5-2df5-6404-8568-8b8e82783770)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-da78242a-c12a-3d3c-af30-5c6d5a9627df)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-65804526-f8a9-5f9e-7f84-398b099c7b3e)
GPU 4: NVIDIA H100 80GB HBM3 (UUID: GPU-14a79c8f-0439-f199-c0cc-e46ee9dc05c1)
GPU 5: NVIDIA H100 80GB HBM3 (UUID: GPU-06938695-4bfd-9c39-e64a-15edddfd5ac2)
GPU 6: NVIDIA H100 80GB HBM3 (UUID: GPU-23666331-24c5-586b-8a04-2c6551570860)
GPU 7: NVIDIA H100 80GB HBM3 (UUID: GPU-14f2736c-f0e9-0cb5-b12b-d0047f84531c)
For further instructions, please refer to the NVIDIA Container Toolkit documentation.
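For example, to pin the NIM to the first two GPUs listed above, the docker run command can be adjusted as follows (a sketch reusing the same environment variables as before):
# Start the LLM NIM on GPUs 0 and 1 only
$ docker run -it --rm --name=$CONTAINER_NAME \
    --runtime=nvidia \
    --gpus '"device=0,1"' \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    $IMG_NAME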
Selecting a specific model profile
When I ran the NIM container above, I let it pick the default model profile for my system. It is also possible to specify which model profile to use. To do that, I need the ID of the profile. Getting the ID of a profile is as easy as running the following command for the specific image you are looking at. For instance, to get all the profiles available for meta-llama3-70b-instruct:
$ docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-70b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
SYSTEM INFO
- Free GPUs:
- [2330:10de] (0) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
- [2330:10de] (1) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
- [2330:10de] (2) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
- [2330:10de] (3) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
- [2330:10de] (4) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
- [2330:10de] (5) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
- [2330:10de] (6) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
- [2330:10de] (7) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
MODEL PROFILES
- Compatible with system and runnable:
- 93782c337ddaf3f6b442ef7baebd0a732e433b7b93ec06bfd3db27945de560b6 (tensorrt_llm-h100-fp8-tp8-latency)
- 8bb3b94e326789381bbf287e54057005a9250e3abbad0b1702a70222529fcd17 (tensorrt_llm-h100-fp8-tp4-throughput)
- a90b2c0217492b1020bead4e7453199c9f965ea53e9506c007f4ff7b58ee01ff (tensorrt_llm-h100-fp16-tp8-latency)
- abcff5042bfc3fa9f4d1e715b2e016c11c6735855edfe2093e9c24e83733788e (tensorrt_llm-h100-fp16-tp4-throughput)
- 7e8f6cc0d0fde672073a20d5423977a0a02a9b0693f0f3d4ffc2ec8ac35474d4 (vllm-fp16-tp8)
- df45ca2c979e5c64798908815381c59159c1d08066407d402f00c6d4abd5b108 (vllm-fp16-tp4)
- With LoRA support:
- 36fc1fa4fc35c1d54da115a39323080b08d7937dceb8ba47be44f4da0ec720ff (tensorrt_llm-h100-fp16-tp4-throughput-lora)
- 0f3de1afe11d355e01657424a267fbaad19bfea3143a9879307c49aed8299db0 (vllm-fp16-tp8-lora)
- a30aae0ed459082efed26a7f3bc21aa97cccad35509498b58449a52c27698544 (vllm-fp16-tp4-lora)
- Incompatible with system:
- 2e9b29c44b3d82821e7f4facd1a652ec5e0c4e7e473ee33451f6d046f616c3e5 (tensorrt_llm-l40s-fp8-tp8-latency)
- 8b8e03de8630626b904b37910e3d82a26cebb99634a921a0e5c59cb84125efe8 (tensorrt_llm-l40s-fp8-tp4-throughput)
- 96b70da1414c7beb5cf671b3e7cf835078740764064c453cd86e84cf0491aac0 (tensorrt_llm-l40s-fp16-tp8-throughput)
- b811296367317f5097ed9f71b8f08d2688b2411c852978ae49e8a0d5c3a30739 (tensorrt_llm-a100-fp16-tp4-throughput)
- 7f8bb4a2b97cf07faf6fb930ba67f33671492b7653f2a06fe522c7de65b544ca (tensorrt_llm-a100-bf16-tp8-latency)
- 03fdb4d11f01be10c31b00e7c0540e2835e89a0079b483ad2dd3c25c8cc29b61 (tensorrt_llm-l40s-fp16-tp8-throughput-lora)
- 7ba9fbd93c41a28358215f3e94e79a2545ab44e39df016eb4c7d7cadc384bde7 (tensorrt_llm-a100-fp16-tp4-throughput-lora)
For example, instead of using the default configuration, we selected the profile df45ca2c979e5c64798908815381c59159c1d08066407d402f00c6d4abd5b108 (vllm-fp16-tp4) and deployed it using the -e NIM_MODEL_PROFILE= flag. The following is the command for the Llama 3 70B deployment with vLLM:
$ docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-e NIM_MODEL_PROFILE=df45ca2c979e5c64798908815381c59159c1d08066407d402f00c6d4abd5b108 \
-p 8000:8000 \
$IMG_NAME
Running Inference Requests
A NIM typically exposes two OpenAI-compatible API endpoints: the Completions endpoint and the Chat Completions endpoint. In the next sections, I will show how to interact with those endpoints.
OpenAI Completion Request
The Completions endpoint is generally used for base models. With the Completions endpoint, prompts are sent as plain strings, and the model produces the most likely text completions subject to the other parameters chosen. To stream the result, set "stream": true.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama3-8b-instruct",
"prompt": "Once upon a time",
"max_tokens": 64
}'
Using the Llama 3 8B model, the request outputs the following:
$ curl -X 'POST' \
'http://0.0.0.0:8000/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama3-8b-instruct",
"prompt": "Once upon a time",
"max_tokens": 64
}'
{"id":"cmpl-799d4f8aa64943c4a5a737b5defdfdeb","object":"text_completion","created":1716483431,"model":"meta-llama3-8b-instruct","choices":[{"index":0,"text":", there was a young man named Jack who lived in a small village at the foot of a vast and ancient forest. Jack was a curious and adventurous soul, always eager to explore the world beyond his village. One day, he decided to venture into the forest, hoping to discover its secrets.\nAs he wandered deeper into","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":69,"completion_tokens":64}}
Using the Llama 3 70B model, the request outputs the following:
$ curl -X 'POST' \
'http://0.0.0.0:8000/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama3-70b-instruct",
"prompt": "Once upon a time",
"max_tokens": 64
}'
{"id":"cmpl-1b7394c809b244d489efacd13c2e13ac","object":"text_completion","created":1716568789,"model":"meta-llama3-70b-instruct","choices":[{"index":0,"text":", there was a young girl named Lily. She lived in a small village surrounded by lush green forests and rolling hills. Lily was a gentle soul, with a heart full of love for all living things.\nOne day, while wandering through the forest, Lily stumbled upon a hidden clearing. In the center of the clearing stood","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":69,"completion_tokens":64}}
OpenAI Chat Completion Request
The Chat Completions endpoint is typically used with chat or instruct tuned models that are designed to be used through a conversational approach. With the Chat Completions endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.
Running this request against Llama 3 8B produces the output below:
$ curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama3-8b-instruct",
"messages": [
{
"role":"user",
"content":"Hello! How are you?"
},
{
"role":"assistant",
"content":"Hi! I am quite well, how can I help you today?"
},
{
"role":"user",
"content":"Can you write me a song?"
}
],
"max_tokens": 32
}'
{"id":"cmpl-cce6dabf331e465f8e9107b05eb92f6c","object":"chat.completion","created":1716483457,"model":"meta-llama3-8b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"I'd be happy to try. Can you give me a few details to get started?\n\n* Is there a specific topic or theme you'd like the song to"},"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":47,"total_tokens":79,"completion_tokens":32}}
And against Llama 3 70B:
$ curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
"model": "meta-llama3-70b-instruct",
"messages": [
{
"role":"user",
"content":"Hello! How are you?"
},
{
"role":"assistant",
"content":"Hi! I am quite well, how can I help you today?"
},
{
"role":"user",
"content":"Can you write me a song?"
}
],
"max_tokens": 32
}'
{"id":"cmpl-c6086a36cbf84fa387a18d6da4de6ffb","object":"chat.completion","created":1716569009,"model":"meta-llama3-70b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"I'd be happy to try. Can you give me a bit more information on what you're looking for? Here are a few questions to get started:\n\n*"},"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":47,"total_tokens":79,"completion_tokens":32}}
Conclusion
As shown in this blog, NIMs significantly change the way inference infrastructure is deployed. A NIM packages all the critical components together in a single container that can be run using the standard docker run command.