Deploying the Llama 3.1 405B Model Using vLLM
Thu, 10 Oct 2024 19:42:18 -0000
This blog is one in a series of three that shows how Dell Technologies and our partner AI ecosystem can help you easily provision one of the most powerful open-source models available. In this series, we share information about the ease of deploying the Llama 3.1 405B model on the Dell PowerEdge XE9680 server by using NVIDIA NIM, Dell Enterprise Hub with Text Generation Inference (TGI), or vLLM. We hope this series equips you with the knowledge and tools needed for a successful deployment.
This blog describes the vLLM option.
Overview
In another blog in the series, we show how to deploy the Llama 3.1 405B model using NVIDIA NIM in single-node and multinode deployments, the latter also known as distributed inference.
Following a process similar to the one described in the NVIDIA NIM blog, this blog demonstrates how easily you can deploy Llama 3.1 405B on a Dell PowerEdge XE9680 server using vLLM, either with Docker or with Kubernetes.
Deployment with vLLM
The vLLM library is designed for high-throughput, memory-efficient inference and serving of large language models (LLMs). The vLLM community is vibrant and active, centered on the development and use of the library.
In the following sections, we show two simple ways to deploy vLLM with the Llama 3.1 405b model.
Docker deployment
The easiest deployment is to run Llama 3.1 405B with vLLM directly in Docker. The basic requirements are a Dell PowerEdge XE9680 server running Linux, Docker, the NVIDIA GPU driver, and the NVIDIA Container Toolkit. This blog does not include installation instructions; see https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html for more information.
To deploy the container and download the model, run the following command. The example also shows the output after the command completes and the model is deployed:
fbronzati@node005:~$ docker run --runtime nvidia --gpus all -v /aipsf710-21/vllm:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=<replace_with_your_HuggingFace_key>" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --max_model_len 10000
INFO 09-11 13:44:28 api_server.py:459] vLLM API server version 0.6.0
INFO 09-11 13:44:28 api_server.py:460] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=10000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 09-11 13:44:31 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/4761d363-8d98-4215-a0ae-1d63b684f5c1 for RPC Path. 
INFO 09-11 13:44:31 api_server.py:176] Started engine process with PID 78 INFO 09-11 13:44:36 config.py:890] Defaulting to use mp for distributed inference INFO 09-11 13:44:36 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fbgemm_fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True) WARNING 09-11 13:44:37 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. INFO 09-11 13:44:37 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager (VllmWorkerProcess pid=209) INFO 09-11 13:44:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=215) INFO 09-11 13:44:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=211) INFO 09-11 13:44:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=213) INFO 09-11 13:44:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=214) INFO 09-11 13:44:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=210) INFO 09-11 13:44:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=212) INFO 09-11 13:44:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=210) INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=210) INFO 09-11 13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=212) INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=209) INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=212) INFO 09-11 13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 INFO 09-11 13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=214) INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=215) INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=209) INFO 09-11 13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=215) INFO 09-11 13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=214) INFO 09-11 13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=213) INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=213) INFO 09-11 
13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=211) INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=211) INFO 09-11 13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 INFO 09-11 13:44:48 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=214) INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=212) INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=211) INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=210) INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=213) INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=209) INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=215) INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json INFO 09-11 13:45:08 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fff44678250>, local_subscribe_port=45245, remote_subscribe_port=None) INFO 09-11 13:45:08 model_runner.py:915] Starting to load model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8... (VllmWorkerProcess pid=210) INFO 09-11 13:45:08 model_runner.py:915] Starting to load model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8... (VllmWorkerProcess pid=212) INFO 09-11 13:45:08 model_runner.py:915] Starting to load model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8... . . . . (VllmWorkerProcess pid=212) INFO 09-11 13:45:11 weight_utils.py:236] Using model weights format ['*.safetensors'] (VllmWorkerProcess pid=210) INFO 09-11 13:45:11 weight_utils.py:236] Using model weights format ['*.safetensors'] Loading safetensors checkpoint shards: 0% Completed | 0/109 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 1% Completed | 1/109 [00:02<04:05, 2.28s/it] Loading safetensors checkpoint shards: 2% Completed | 2/109 [00:05<05:20, 3.00s/it] Loading safetensors checkpoint shards: 3% Completed | 3/109 [00:09<05:47, 3.28s/it] Loading safetensors checkpoint shards: 4% Completed | 4/109 [00:12<05:49, 3.32s/it] Loading safetensors checkpoint shards: 5% Completed | 5/109 [00:17<06:33, 3.78s/it] . . . 
Loading safetensors checkpoint shards: 97% Completed | 106/109 [06:28<00:11, 3.67s/it] Loading safetensors checkpoint shards: 98% Completed | 107/109 [06:32<00:07, 3.59s/it] Loading safetensors checkpoint shards: 99% Completed | 108/109 [06:36<00:03, 3.86s/it] Loading safetensors checkpoint shards: 100% Completed | 109/109 [06:39<00:00, 3.69s/it] Loading safetensors checkpoint shards: 100% Completed | 109/109 [06:39<00:00, 3.67s/it] INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=213) INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=212) INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=214) INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=210) INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=215) INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=211) INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=209) INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB INFO 09-11 13:52:04 distributed_gpu_executor.py:57] # GPU blocks: 3152, # CPU blocks: 4161 (VllmWorkerProcess pid=213) INFO 09-11 13:52:08 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. (VllmWorkerProcess pid=213) INFO 09-11 13:52:08 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. . . . . INFO 09-11 13:52:09 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. INFO 09-11 13:52:09 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (VllmWorkerProcess pid=214) INFO 09-11 13:52:09 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. (VllmWorkerProcess pid=214) INFO 09-11 13:52:09 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (VllmWorkerProcess pid=215) INFO 09-11 13:52:25 custom_all_reduce.py:223] Registering 8855 cuda graph addresses . . . . .. (VllmWorkerProcess pid=209) INFO 09-11 13:52:26 model_runner.py:1335] Graph capturing finished in 17 secs. (VllmWorkerProcess pid=214) INFO 09-11 13:52:26 model_runner.py:1335] Graph capturing finished in 17 secs. INFO 09-11 13:52:26 model_runner.py:1335] Graph capturing finished in 17 secs. 
INFO 09-11 13:52:27 api_server.py:224] vLLM to use /tmp/tmpje3c_sb0 as PROMETHEUS_MULTIPROC_DIR WARNING 09-11 13:52:27 serving_embedding.py:190] embedding_mode is False. Embedding API will not work. INFO 09-11 13:52:27 launcher.py:20] Available routes are: INFO 09-11 13:52:27 launcher.py:28] Route: /openapi.json, Methods: HEAD, GET INFO 09-11 13:52:27 launcher.py:28] Route: /docs, Methods: HEAD, GET INFO 09-11 13:52:27 launcher.py:28] Route: /docs/oauth2-redirect, Methods: HEAD, GET INFO 09-11 13:52:27 launcher.py:28] Route: /redoc, Methods: HEAD, GET INFO 09-11 13:52:27 launcher.py:28] Route: /health, Methods: GET INFO 09-11 13:52:27 launcher.py:28] Route: /tokenize, Methods: POST INFO 09-11 13:52:27 launcher.py:28] Route: /detokenize, Methods: POST INFO 09-11 13:52:27 launcher.py:28] Route: /v1/models, Methods: GET INFO 09-11 13:52:27 launcher.py:28] Route: /version, Methods: GET INFO 09-11 13:52:27 launcher.py:28] Route: /v1/chat/completions, Methods: POST INFO 09-11 13:52:27 launcher.py:28] Route: /v1/completions, Methods: POST INFO 09-11 13:52:27 launcher.py:28] Route: /v1/embeddings, Methods: POST INFO 09-11 13:52:27 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing INFO: Started server process [1] INFO: Waiting for application startup. INFO: Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) INFO 09-11 13:52:37 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%. INFO 09-11 13:52:47 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
Confirming the model
To confirm that the model is working, send a curl command. Use localhost if you submit the request from the node itself, or the IP address of the host if you are testing from another system.
fbronzati@node005:~$ curl -X 'POST' 'http://localhost:8000/v1/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", "prompt": "Once upon a time", "max_tokens": 64 }'
The following example is a response from the model:
{"id":"cmpl-9f94c78172db488db84eac5f1fb5165e","object":"text_completion","created":1726232453,"model":"meta-llama/Meta-Llama-3.1-405B-Instruct-FP8","choices":[{"index":0,"text":" there was a man who devoted his entire life to mastering the art of dancing. He trained tirelessly and meticulously in every style he could find, from waltzing to hip-hop. He quickly became a master of each craft, impressing everyone around him with his incredible raw talent and dedication. People everywhere sought after this Renaissance","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":69,"completion_tokens":64}}
Because the container exposes an OpenAI-compatible API, you can use Python or other languages to interact with the model.
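For example, the following minimal sketch uses the openai Python package (version 1.x or later) to send the same completion request. The base URL, the placeholder API key, and the prompt are illustrative assumptions; adjust them for your environment, and keep the model name identical to the one served by vLLM.

from openai import OpenAI

# Point the client at the local vLLM server. vLLM ignores the API key unless one was configured at startup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    prompt="Once upon a time",
    max_tokens=64,
)
print(completion.choices[0].text)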
Kubernetes deployment
As with the NVIDIA NIM deployment, this blog does not describe the installation of the Kubernetes cluster or additional software such as the NVIDIA GPU Operator, although these components are essential. Many resources on the Internet describe how to deploy them. Dell Technologies can also provide you with a working environment through the Dell AI Factory solution, which facilitates your vLLM deployment.
Creating the deployment file
The following example shows how to create a deployment YAML file, which can be expanded with liveness and readiness probes. This blog does not describe these advanced configurations.
- Use your preferred text editor to create a deploy-vllm-llama3.1-405B-8xH100-9680.yaml file.
fbronzati@login01:/mnt/f710/vllm$ vi deploy-vllm-llama3.1-405B-8xH100-9680.yaml
We used vi; however, vim and GNU nano are also good options.
- Paste and edit the following content as required for your environment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm-container
        image: "vllm/vllm-openai:v0.6.0"
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 8
        args:
        - "--model=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8"
        - "--tensor-parallel-size=8"
        - "--max_model_len=10000"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          value: "replace_with_your_HuggingFace_key"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /root/.cache/huggingface
          name: model-cache
      imagePullSecrets:
      - name: regcred
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 32Gi
      - name: model-cache
        nfs:
          server: f710.f710
          path: /ifs/data/Projects/vllm
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  type: LoadBalancer
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  selector:
    app: vllm-server
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx-ingress
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: vllm-service
            port:
              number: 8000
Some important considerations include:
- The Llama 3.1 405B model is approximately 700 GB, so it is impractical to download it every time that you switch to a different host. Therefore, we recommend using external NFS storage such as the PowerScale F710. This configuration is shown in the volumes section of the file.
- You must accept the Meta license agreement on Hugging Face and use your Hugging Face token to be able to download the model. Otherwise, an error message is displayed when you deploy the pod.
- We recommend that you create a new namespace for deploying the pod.
- Because the vLLM image is hosted on Docker Hub, you might need to create a registry secret. Otherwise, image pulls might be rate-limited.
Creating the Kubernetes namespace and secrets
After creating the deployment file, create a namespace. For our example, we used vllm so that the pods are easy to identify.
fbronzati@login01:/mnt/f710/vllm$ kubectl create namespace vllm
namespace/vllm created
Create the Docker registry secret so that image pulls from Docker Hub are not rate-limited:
fbronzati@login01:/mnt/f710/vllm$ kubectl create secret docker-registry regcred --docker-username=<replace_with_your_docker_user> --docker-password=<replace_with_your_docker_key> -n vllm
Deploying the vLLM pod
To deploy the pod and the services that are required to access the model:
- Run the following command:
fbronzati@login01:/mnt/f710/vllm$ kubectl apply -f deploy-vllm-llama3.1-405B-8xH100-9680.yaml -n vllm
deployment.apps/vllm-deployment created
service/vllm-service created
ingress.networking.k8s.io/vllm-ingress created
For a first-time deployment, the process of downloading the image and the model takes some time because the model is approximately 683 GB.
fbronzati@login01:/mnt/f710/vllm$ du -sh hub/*
683G    hub/models--meta-llama--Meta-Llama-3.1-405B-Instruct-FP8
- To monitor the deployment of the pod and services, run the following commands:
fbronzati@login01:/mnt/f710/vllm$ kubectl get pods -n vllm -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP              NODE       NOMINATED NODE   READINESS GATES
vllm-deployment-cd6c8564c-kh6tb   1/1     Running   0          58s   10.194.214.55   helios25   <none>           <none>
fbronzati@login01:/mnt/f710/vllm$ kubectl get services -n vllm -o wide
NAME           TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE   SELECTOR
vllm-service   LoadBalancer   *.*.*.*      <pending>     8000:30757/TCP   83s   app=vllm-server
- To check for errors or to see whether the container image is still being downloaded, run the kubectl describe command:
fbronzati@login01:/mnt/f710/vllm$ kubectl describe pod vllm-deployment-cd6c8564c-kh6tb -n vllm Name: vllm-deployment-cd6c8564c-kh6tb Namespace: vllm Priority: 0 Service Account: default Node: helios25/*.*.*.* Start Time: Thu, 12 Sep 2024 09:05:22 -0500 Labels: app=vllm-server pod-template-hash=cd6c8564c Annotations: cni.projectcalico.org/containerID: 56bbadf0bf9193c47e481263e2a52770595c5c16f6bbee3e63177953a755c52c cni.projectcalico.org/podIP: 10.194.214.55/32 cni.projectcalico.org/podIPs: 10.194.214.55/32 k8s.v1.cni.cncf.io/network-status: [{ "name": "k8s-pod-network", "ips": [ "10.194.214.55" ], "default": true, "dns": {} }] k8s.v1.cni.cncf.io/networks-status: [{ "name": "k8s-pod-network", "ips": [ "10.194.214.55" ], "default": true, "dns": {} }] Status: Running IP: *.*.*.* IPs: IP: *.*.*.* Controlled By: ReplicaSet/vllm-deployment-cd6c8564c Containers: vllm-container: Container ID: containerd://92d98154a4faebfb5fe67ffc5aaa0404f1a6e3c37698a8eb94173543fd2b182c Image: vllm/vllm-openai:v0.6.0 Image ID: docker.io/vllm/vllm-openai@sha256:072427aa6f95c74782a9bc3fe1d1fcd1e1aa3fe47b317584ea2181c549ad2de8 Port: <none> Host Port: <none> Args: --model=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size=8 --max_model_len=10000 State: Running Started: Thu, 12 Sep 2024 09:05:24 -0500 Ready: True Restart Count: 0 Limits: nvidia.com/gpu: 8 Requests: nvidia.com/gpu: 8 Environment: HUGGING_FACE_HUB_TOKEN: █████████████████████████████ Mounts: /dev/shm from dshm (rw) /root/.cache/huggingface from model-cache (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kd9zq (ro) Conditions: Type Status Initialized True Ready True ContainersReady True PodScheduled True Volumes: dshm: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: Memory SizeLimit: 32Gi model-cache: Type: NFS (an NFS mount that lasts the lifetime of a pod) Server: f710.f710 Path: /ifs/data/Projects/vllm ReadOnly: false kube-api-access-kd9zq: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true QoS Class: BestEffort Node-Selectors: nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3 Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s node.kubernetes.io/unreachable:NoExecute op=Exists for 300s Events: <none>
Model initialization
After downloading the container image and creating the pod and service, the model is downloaded and loaded onto the GPUs. This process might take a long time, so we recommend monitoring the pod logs to follow its progress. The following example shows sample output that enables you to verify that the behavior is the same in your environment:
fbronzati@login01:/mnt/f710/vllm$ kubectl logs vllm-deployment-cd6c8564c-kh6tb -n vllm -f INFO 09-12 07:05:28 api_server.py:459] vLLM API server version 0.6.0 INFO 09-12 07:05:28 api_server.py:460] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=10000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None) INFO 09-12 07:05:33 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/29c1a11c-a17a-4fa1-8768-a001a3385714 for RPC Path. 
INFO 09-12 07:05:33 api_server.py:176] Started engine process with PID 78 INFO 09-12 07:05:37 config.py:890] Defaulting to use mp for distributed inference INFO 09-12 07:05:37 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fbgemm_fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True) WARNING 09-12 07:05:37 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. INFO 09-12 07:05:37 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager (VllmWorkerProcess pid=210) INFO 09-12 07:05:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=213) INFO 09-12 07:05:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=212) INFO 09-12 07:05:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=211) INFO 09-12 07:05:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=209) INFO 09-12 07:05:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=214) INFO 09-12 07:05:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=215) INFO 09-12 07:05:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=211) INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=209) INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=215) INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=211) INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=209) INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=210) INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=215) INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=213) INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=214) INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=210) INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=212) INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=213) 
INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=214) INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=212) INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 INFO 09-12 07:05:55 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json INFO 09-12 07:06:38 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=213) INFO 09-12 07:06:38 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=214) INFO 09-12 07:06:38 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json . . . . (VllmWorkerProcess pid=209) INFO 09-12 07:06:38 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json INFO 09-12 07:06:38 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fff438fd6f0>, local_subscribe_port=33971, remote_subscribe_port=None) INFO 09-12 07:06:38 model_runner.py:915] Starting to load model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8... (VllmWorkerProcess pid=210) INFO 09-12 07:06:38 model_runner.py:915] Starting to load model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8... (VllmWorkerProcess pid=212) INFO 09-12 07:06:38 model_runner.py:915] Starting to load model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8... . . . . (VllmWorkerProcess pid=209) INFO 09-12 07:06:40 weight_utils.py:236] Using model weights format ['*.safetensors'] (VllmWorkerProcess pid=210) INFO 09-12 07:06:40 weight_utils.py:236] Using model weights format ['*.safetensors'] Loading safetensors checkpoint shards: 0% Completed | 0/109 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 1% Completed | 1/109 [00:03<07:09, 3.97s/it] Loading safetensors checkpoint shards: 2% Completed | 2/109 [00:07<06:20, 3.56s/it] Loading safetensors checkpoint shards: 3% Completed | 3/109 [00:10<06:03, 3.42s/it] . . . . . 
Loading safetensors checkpoint shards: 96% Completed | 105/109 [06:23<00:12, 3.22s/it] Loading safetensors checkpoint shards: 97% Completed | 106/109 [06:27<00:10, 3.51s/it] Loading safetensors checkpoint shards: 98% Completed | 107/109 [06:31<00:06, 3.44s/it] Loading safetensors checkpoint shards: 99% Completed | 108/109 [06:35<00:03, 3.75s/it] Loading safetensors checkpoint shards: 100% Completed | 109/109 [06:38<00:00, 3.59s/it] Loading safetensors checkpoint shards: 100% Completed | 109/109 [06:38<00:00, 3.66s/it] INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=209) INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=211) INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=212) INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=215) INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=214) INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=210) INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=213) INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB INFO 09-12 07:13:26 distributed_gpu_executor.py:57] # GPU blocks: 2937, # CPU blocks: 4161 (VllmWorkerProcess pid=209) INFO 09-12 07:13:30 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. (VllmWorkerProcess pid=209) INFO 09-12 07:13:30 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. . . . . . INFO 09-12 07:13:30 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. INFO 09-12 07:13:30 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (VllmWorkerProcess pid=212) INFO 09-12 07:13:30 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. (VllmWorkerProcess pid=212) INFO 09-12 07:13:30 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (VllmWorkerProcess pid=215) INFO 09-12 07:13:46 custom_all_reduce.py:223] Registering 8855 cuda graph addresses (VllmWorkerProcess pid=211) INFO 09-12 07:13:46 custom_all_reduce.py:223] Registering 8855 cuda graph addresses (VllmWorkerProcess pid=213) INFO 09-12 07:13:46 custom_all_reduce.py:223] Registering 8855 cuda graph addresses . . . . . 
(VllmWorkerProcess pid=211) INFO 09-12 07:13:46 model_runner.py:1335] Graph capturing finished in 16 secs. (VllmWorkerProcess pid=209) INFO 09-12 07:13:46 model_runner.py:1335] Graph capturing finished in 17 secs. INFO 09-12 07:13:46 model_runner.py:1335] Graph capturing finished in 16 secs. INFO 09-12 07:13:47 api_server.py:224] vLLM to use /tmp/tmpzt8ci2e8 as PROMETHEUS_MULTIPROC_DIR WARNING 09-12 07:13:47 serving_embedding.py:190] embedding_mode is False. Embedding API will not work. INFO 09-12 07:13:47 launcher.py:20] Available routes are: INFO 09-12 07:13:47 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD INFO 09-12 07:13:47 launcher.py:28] Route: /docs, Methods: GET, HEAD INFO 09-12 07:13:47 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD INFO 09-12 07:13:47 launcher.py:28] Route: /redoc, Methods: GET, HEAD INFO 09-12 07:13:47 launcher.py:28] Route: /health, Methods: GET INFO 09-12 07:13:47 launcher.py:28] Route: /tokenize, Methods: POST INFO 09-12 07:13:47 launcher.py:28] Route: /detokenize, Methods: POST INFO 09-12 07:13:47 launcher.py:28] Route: /v1/models, Methods: GET INFO 09-12 07:13:47 launcher.py:28] Route: /version, Methods: GET INFO 09-12 07:13:47 launcher.py:28] Route: /v1/chat/completions, Methods: POST INFO 09-12 07:13:47 launcher.py:28] Route: /v1/completions, Methods: POST INFO 09-12 07:13:47 launcher.py:28] Route: /v1/embeddings, Methods: POST INFO 09-12 07:13:47 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing INFO: Started server process [1] INFO: Waiting for application startup. INFO: Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) INFO 09-12 07:13:57 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%. INFO 09-12 07:14:07 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
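Instead of, or in addition to, watching the logs, you can poll the /health route that the server registers at startup (see the list of available routes in the log above). The following minimal sketch assumes the server is reachable at localhost:8000; replace that address with the pod or service IP in your environment:

import time
import urllib.error
import urllib.request

# Replace localhost with the pod or service IP if polling from outside the node.
HEALTH_URL = "http://localhost:8000/health"

while True:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
            if response.status == 200:
                print("vLLM server is ready to accept requests")
                break
    except (urllib.error.URLError, OSError):
        pass  # Server is not up yet; keep waiting
    print("Server not ready, retrying in 30 seconds...")
    time.sleep(30)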
Verifying the model’s use of GPU memory
With access to the host or the container, you can verify GPU memory use with the nvidia-smi utility. The following example shows how the GPUs appear in the container after the model is loaded:
fbronzati@login01:/mnt/f710/vllm$ kubectl exec -it vllm-deployment-cd6c8564c-kh6tb -n vllm -- bash
root@vllm-deployment-cd6c8564c-kh6tb:/vllm-workspace# nvidia-smi
Thu Sep 12 11:31:04 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:19:00.0 Off |                    0 |
| N/A   40C    P0            117W /  700W |   72156MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:3B:00.0 Off |                    0 |
| N/A   36C    P0            114W /  700W |   69976MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   34C    P0            115W /  700W |   69976MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   38C    P0            112W /  700W |   69976MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9B:00.0 Off |                    0 |
| N/A   41C    P0            119W /  700W |   69976MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:BB:00.0 Off |                    0 |
| N/A   36C    P0            111W /  700W |   69976MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:CB:00.0 Off |                    0 |
| N/A   38C    P0            117W /  700W |   69976MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
| N/A   34C    P0            117W /  700W |   69496MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
Confirming the model
To confirm that the model is working, send a curl command:
fbronzati@login01:/mnt/f710/vllm$ curl -X 'POST' 'http://*.*.*.*:8000/v1/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", "prompt": "Once upon a time", "max_tokens": 64 }'
The following example is a response from the model:
{"id":"cmpl-cc26bb0694e34a26940becc791b2eacf","object":"text_completion","created":1726166028,"model":"meta-llama/Meta-Llama-3.1-405B-Instruct-FP8","choices":[{"index":0,"text":", there was a little girl name Sophie. She was eight years old and lived in a big, noisy city with her parents. Sophie's room was filled with belonging to a grown-up until just a little while ago her mother items. The room was cramped and cluttered, but Sophie loved it because it was hers.\n","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":69,"completion_tokens":64}}
Because the container exposes an OpenAI-compatible API, you can use Python or other languages to interact with the model.
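As an illustration, the following sketch uses the openai Python package (version 1.x or later) against the /v1/chat/completions route listed in the startup log. The service or ingress address, the placeholder API key, and the prompt are assumptions for your environment:

from openai import OpenAI

# Replace the address with your vLLM service or ingress IP; the API key is a placeholder unless one was configured.
client = OpenAI(base_url="http://<service-or-ingress-ip>:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    messages=[{"role": "user", "content": "Summarize the benefits of distributed inference in two sentences."}],
    max_tokens=128,
)
print(response.choices[0].message.content)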
Conclusion
The Llama 3.1 405B model is a highly capable open-source model that offers multiple deployment options.
Using vLLM increases flexibility: a single Docker command is enough to serve the model, and installing the requirements is easier because neither a Kubernetes cluster nor a license is required for deployment.
While this flexibility is beneficial, it can also introduce complexity when determining the most suitable deployment approach for specific requirements. Enterprises usually require high availability and support, which can be more difficult to achieve with open-source tools and a simple Docker deployment.
To address this complexity, Dell Technologies has developed the Dell AI Factory—a comprehensive framework that provides detailed studies, code snippets, and execution outputs to facilitate the deployment process. This resource enables organizations to replicate deployments, gain insights, and evaluate different strategies, helping them select the optimal deployment method based on their unique infrastructure and business needs.