Deploying the Llama 3.1 405B Model Using vLLM
Thu, 10 Oct 2024 19:42:18 -0000
This blog is one in a series of three that shows how Dell Technologies and our partner AI ecosystem can help you easily provision one of the most powerful open-source models available. In this series, we share information about the ease of deploying the Llama 3.1 405B model on the Dell PowerEdge XE9680 server by using NVIDIA NIM, Dell Enterprise Hub with Text Generation Inference (TGI), or vLLM. We hope this series equips you with the knowledge and tools needed for a successful deployment.
This blog describes the vLLM option.
Overview
In another blog in the series, we show how to deploy the Llama 3.1 405B model using NVIDIA NIM in single-node and multinode deployments, the latter also known as distributed inference.
Following a process similar to the one described in the NVIDIA NIM blog, this blog demonstrates how easily you can deploy Llama 3.1 405B on a Dell PowerEdge XE9680 server using vLLM, either with Docker or with Kubernetes.
Deployment with vLLM
The vLLM library is designed for high-throughput, memory-efficient inference and serving of large language models (LLMs). The vLLM community is vibrant and active, centered on the development and use of the library.
In the following sections, we show two simple ways to deploy vLLM with the Llama 3.1 405b model.
Docker deployment
The easiest deployment is to run Llama 3.1 405B with vLLM directly in Docker. The basic requirements are a Dell PowerEdge XE9680 server running Linux, Docker, the NVIDIA GPU driver, and the NVIDIA Container Toolkit. This blog does not include installation instructions; see https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html for more information.
To deploy the container and download the model, run the following command. The example also shows the output after the command completes and the model is deployed:
fbronzati@node005:~$ docker run --runtime nvidia --gpus all -v /aipsf710-21/vllm:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=<replace_with_your_HuggingFace_key>" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --max_model_len 10000
INFO 09-11 13:44:28 api_server.py:459] vLLM API server version 0.6.0
INFO 09-11 13:44:28 api_server.py:460] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=10000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 09-11 13:44:31 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/4761d363-8d98-4215-a0ae-1d63b684f5c1 for RPC Path. 
INFO 09-11 13:44:31 api_server.py:176] Started engine process with PID 78 INFO 09-11 13:44:36 config.py:890] Defaulting to use mp for distributed inference INFO 09-11 13:44:36 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fbgemm_fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True) WARNING 09-11 13:44:37 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. INFO 09-11 13:44:37 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager (VllmWorkerProcess pid=209) INFO 09-11 13:44:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=215) INFO 09-11 13:44:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=211) INFO 09-11 13:44:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=213) INFO 09-11 13:44:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=214) INFO 09-11 13:44:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=210) INFO 09-11 13:44:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=212) INFO 09-11 13:44:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=210) INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=210) INFO 09-11 13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=212) INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=209) INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=212) INFO 09-11 13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 INFO 09-11 13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=214) INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=215) INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=209) INFO 09-11 13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=215) INFO 09-11 13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=214) INFO 09-11 13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=213) INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=213) INFO 09-11 
13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=211) INFO 09-11 13:44:40 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=211) INFO 09-11 13:44:40 pynccl.py:63] vLLM is using nccl==2.20.5 INFO 09-11 13:44:48 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=214) INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=212) INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=211) INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=210) INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=213) INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=209) INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=215) INFO 09-11 13:45:08 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json INFO 09-11 13:45:08 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fff44678250>, local_subscribe_port=45245, remote_subscribe_port=None) INFO 09-11 13:45:08 model_runner.py:915] Starting to load model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8... (VllmWorkerProcess pid=210) INFO 09-11 13:45:08 model_runner.py:915] Starting to load model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8... (VllmWorkerProcess pid=212) INFO 09-11 13:45:08 model_runner.py:915] Starting to load model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8... . . . . (VllmWorkerProcess pid=212) INFO 09-11 13:45:11 weight_utils.py:236] Using model weights format ['*.safetensors'] (VllmWorkerProcess pid=210) INFO 09-11 13:45:11 weight_utils.py:236] Using model weights format ['*.safetensors'] Loading safetensors checkpoint shards: 0% Completed | 0/109 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 1% Completed | 1/109 [00:02<04:05, 2.28s/it] Loading safetensors checkpoint shards: 2% Completed | 2/109 [00:05<05:20, 3.00s/it] Loading safetensors checkpoint shards: 3% Completed | 3/109 [00:09<05:47, 3.28s/it] Loading safetensors checkpoint shards: 4% Completed | 4/109 [00:12<05:49, 3.32s/it] Loading safetensors checkpoint shards: 5% Completed | 5/109 [00:17<06:33, 3.78s/it] . . . 
Loading safetensors checkpoint shards: 97% Completed | 106/109 [06:28<00:11, 3.67s/it] Loading safetensors checkpoint shards: 98% Completed | 107/109 [06:32<00:07, 3.59s/it] Loading safetensors checkpoint shards: 99% Completed | 108/109 [06:36<00:03, 3.86s/it] Loading safetensors checkpoint shards: 100% Completed | 109/109 [06:39<00:00, 3.69s/it] Loading safetensors checkpoint shards: 100% Completed | 109/109 [06:39<00:00, 3.67s/it] INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=213) INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=212) INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=214) INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=210) INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=215) INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=211) INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=209) INFO 09-11 13:51:55 model_runner.py:926] Loading model weights took 56.7677 GB INFO 09-11 13:52:04 distributed_gpu_executor.py:57] # GPU blocks: 3152, # CPU blocks: 4161 (VllmWorkerProcess pid=213) INFO 09-11 13:52:08 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. (VllmWorkerProcess pid=213) INFO 09-11 13:52:08 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. . . . . INFO 09-11 13:52:09 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. INFO 09-11 13:52:09 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (VllmWorkerProcess pid=214) INFO 09-11 13:52:09 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. (VllmWorkerProcess pid=214) INFO 09-11 13:52:09 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (VllmWorkerProcess pid=215) INFO 09-11 13:52:25 custom_all_reduce.py:223] Registering 8855 cuda graph addresses . . . . .. (VllmWorkerProcess pid=209) INFO 09-11 13:52:26 model_runner.py:1335] Graph capturing finished in 17 secs. (VllmWorkerProcess pid=214) INFO 09-11 13:52:26 model_runner.py:1335] Graph capturing finished in 17 secs. INFO 09-11 13:52:26 model_runner.py:1335] Graph capturing finished in 17 secs. 
INFO 09-11 13:52:27 api_server.py:224] vLLM to use /tmp/tmpje3c_sb0 as PROMETHEUS_MULTIPROC_DIR WARNING 09-11 13:52:27 serving_embedding.py:190] embedding_mode is False. Embedding API will not work. INFO 09-11 13:52:27 launcher.py:20] Available routes are: INFO 09-11 13:52:27 launcher.py:28] Route: /openapi.json, Methods: HEAD, GET INFO 09-11 13:52:27 launcher.py:28] Route: /docs, Methods: HEAD, GET INFO 09-11 13:52:27 launcher.py:28] Route: /docs/oauth2-redirect, Methods: HEAD, GET INFO 09-11 13:52:27 launcher.py:28] Route: /redoc, Methods: HEAD, GET INFO 09-11 13:52:27 launcher.py:28] Route: /health, Methods: GET INFO 09-11 13:52:27 launcher.py:28] Route: /tokenize, Methods: POST INFO 09-11 13:52:27 launcher.py:28] Route: /detokenize, Methods: POST INFO 09-11 13:52:27 launcher.py:28] Route: /v1/models, Methods: GET INFO 09-11 13:52:27 launcher.py:28] Route: /version, Methods: GET INFO 09-11 13:52:27 launcher.py:28] Route: /v1/chat/completions, Methods: POST INFO 09-11 13:52:27 launcher.py:28] Route: /v1/completions, Methods: POST INFO 09-11 13:52:27 launcher.py:28] Route: /v1/embeddings, Methods: POST INFO 09-11 13:52:27 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing INFO: Started server process [1] INFO: Waiting for application startup. INFO: Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) INFO 09-11 13:52:37 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%. INFO 09-11 13:52:47 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
Confirming the model
To confirm that the model is working, send a curl command. Use localhost if you submit the request from the node itself, or the IP address of the host if you are testing from another system.
fbronzati@node005:~$ curl -X 'POST' 'http://localhost:8000/v1/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", "prompt": "Once upon a time", "max_tokens": 64 }'
The following example is a response from the model:
{"id":"cmpl-9f94c78172db488db84eac5f1fb5165e","object":"text_completion","created":1726232453,"model":"meta-llama/Meta-Llama-3.1-405B-Instruct-FP8","choices":[{"index":0,"text":" there was a man who devoted his entire life to mastering the art of dancing. He trained tirelessly and meticulously in every style he could find, from waltzing to hip-hop. He quickly became a master of each craft, impressing everyone around him with his incredible raw talent and dedication. People everywhere sought after this Renaissance","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":69,"completion_tokens":64}}
Because the container exposes an OpenAI-compatible API, you can use Python or other languages to interact with the model.
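For example, the following minimal sketch uses the openai Python package (version 1.x or later) to send the same completion request. The base URL, the placeholder API key, and the prompt are illustrative assumptions; adjust them for your environment, and keep the model name identical to the one served by vLLM.

from openai import OpenAI

# Point the client at the local vLLM server. vLLM ignores the API key unless one was configured at startup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    prompt="Once upon a time",
    max_tokens=64,
)
print(completion.choices[0].text)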
Kubernetes deployment
As with the NVIDIA NIM deployment, this blog does not describe the installation of the Kubernetes cluster or additional software such as the NVIDIA GPU Operator, although these components are essential. Many resources on the Internet describe how to deploy them. Dell Technologies can also provide you with a working environment through the Dell AI Factory solution, which facilitates your vLLM deployment.
Creating the deployment file
The following example shows how to create a deployment YAML file, which can be expanded with liveness and readiness probes. This blog does not describe these advanced configurations.
- Use your preferred text editor to create a deploy-vllm-llama3.1-405B-8xH100-9680.yaml file.
fbronzati@login01:/mnt/f710/vllm$ vi deploy-vllm-llama3.1-405B-8xH100-9680.yaml
We used vi; however, vim and GNU nano are also good options.
- Paste and edit the following content as required for your environment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm-container
        image: "vllm/vllm-openai:v0.6.0"
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 8
        args:
        - "--model=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8"
        - "--tensor-parallel-size=8"
        - "--max_model_len=10000"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          value: "replace_with_your_HuggingFace_key"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /root/.cache/huggingface
          name: model-cache
      imagePullSecrets:
      - name: regcred
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 32Gi
      - name: model-cache
        nfs:
          server: f710.f710
          path: /ifs/data/Projects/vllm
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  type: LoadBalancer
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  selector:
    app: vllm-server
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx-ingress
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: vllm-service
            port:
              number: 8000
Some important considerations include:
- The Llama 3.1 405B model is approximately 700 GB, so it is impractical to download it every time that you switch to a different host. Therefore, we recommend using external NFS storage such as the PowerScale F710. This configuration is shown in the volumes section of the file.
- You must accept the Meta license agreement on Hugging Face and use your Hugging Face token to be able to download the model. Otherwise, an error message is displayed when you deploy the pod.
- We recommend that you create a new namespace for deploying the pod.
- Because the vLLM image is hosted on Docker Hub, you might need to create a registry secret. Otherwise, image pulls might be rate-limited.
Creating the Kubernetes namespace and secrets
After creating the deployment file, create a namespace. For our example, we used vllm so that the pods are easy to identify.
fbronzati@login01:/mnt/f710/vllm$ kubectl create namespace vllm
namespace/vllm created
Create the Docker registry secret so that image pulls from Docker Hub are not rate-limited:
fbronzati@login01:/mnt/f710/vllm$ kubectl create secret docker-registry regcred --docker-username=<replace_with_your_docker_user> --docker-password=<replace_with_your_docker_key> -n vllm
Deploying the vLLM pod
To deploy the pod and the services that are required to access the model:
- Run the following command:
fbronzati@login01:/mnt/f710/vllm$ kubectl apply -f deploy-vllm-llama3.1-405B-8xH100-9680.yaml -n vllm
deployment.apps/vllm-deployment created
service/vllm-service created
ingress.networking.k8s.io/vllm-ingress created
For a first-time deployment, the process of downloading the image and the model takes some time because the model is approximately 683 GB.
fbronzati@login01:/mnt/f710/vllm$ du -sh hub/*
683G    hub/models--meta-llama--Meta-Llama-3.1-405B-Instruct-FP8
- To monitor the deployment of the pod and services, run the following commands:
fbronzati@login01:/mnt/f710/vllm$ kubectl get pods -n vllm -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP              NODE       NOMINATED NODE   READINESS GATES
vllm-deployment-cd6c8564c-kh6tb   1/1     Running   0          58s   10.194.214.55   helios25   <none>           <none>
fbronzati@login01:/mnt/f710/vllm$ kubectl get services -n vllm -o wide
NAME           TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE   SELECTOR
vllm-service   LoadBalancer   *.*.*.*      <pending>     8000:30757/TCP   83s   app=vllm-server
- To check for errors or to see whether the container image is still being downloaded, run the kubectl describe command:
fbronzati@login01:/mnt/f710/vllm$ kubectl describe pod vllm-deployment-cd6c8564c-kh6tb -n vllm Name: vllm-deployment-cd6c8564c-kh6tb Namespace: vllm Priority: 0 Service Account: default Node: helios25/*.*.*.* Start Time: Thu, 12 Sep 2024 09:05:22 -0500 Labels: app=vllm-server pod-template-hash=cd6c8564c Annotations: cni.projectcalico.org/containerID: 56bbadf0bf9193c47e481263e2a52770595c5c16f6bbee3e63177953a755c52c cni.projectcalico.org/podIP: 10.194.214.55/32 cni.projectcalico.org/podIPs: 10.194.214.55/32 k8s.v1.cni.cncf.io/network-status: [{ "name": "k8s-pod-network", "ips": [ "10.194.214.55" ], "default": true, "dns": {} }] k8s.v1.cni.cncf.io/networks-status: [{ "name": "k8s-pod-network", "ips": [ "10.194.214.55" ], "default": true, "dns": {} }] Status: Running IP: *.*.*.* IPs: IP: *.*.*.* Controlled By: ReplicaSet/vllm-deployment-cd6c8564c Containers: vllm-container: Container ID: containerd://92d98154a4faebfb5fe67ffc5aaa0404f1a6e3c37698a8eb94173543fd2b182c Image: vllm/vllm-openai:v0.6.0 Image ID: docker.io/vllm/vllm-openai@sha256:072427aa6f95c74782a9bc3fe1d1fcd1e1aa3fe47b317584ea2181c549ad2de8 Port: <none> Host Port: <none> Args: --model=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size=8 --max_model_len=10000 State: Running Started: Thu, 12 Sep 2024 09:05:24 -0500 Ready: True Restart Count: 0 Limits: nvidia.com/gpu: 8 Requests: nvidia.com/gpu: 8 Environment: HUGGING_FACE_HUB_TOKEN: █████████████████████████████ Mounts: /dev/shm from dshm (rw) /root/.cache/huggingface from model-cache (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kd9zq (ro) Conditions: Type Status Initialized True Ready True ContainersReady True PodScheduled True Volumes: dshm: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: Memory SizeLimit: 32Gi model-cache: Type: NFS (an NFS mount that lasts the lifetime of a pod) Server: f710.f710 Path: /ifs/data/Projects/vllm ReadOnly: false kube-api-access-kd9zq: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true QoS Class: BestEffort Node-Selectors: nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3 Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s node.kubernetes.io/unreachable:NoExecute op=Exists for 300s Events: <none>
Model initialization
After downloading the container image and creating the pod and service, the model is downloaded and loaded onto the GPUs. This process might take a long time, so we recommend monitoring the pod logs to follow its progress. The following example shows sample output that enables you to verify that the behavior is the same in your environment:
fbronzati@login01:/mnt/f710/vllm$ kubectl logs vllm-deployment-cd6c8564c-kh6tb -n vllm -f INFO 09-12 07:05:28 api_server.py:459] vLLM API server version 0.6.0 INFO 09-12 07:05:28 api_server.py:460] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=10000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None) INFO 09-12 07:05:33 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/29c1a11c-a17a-4fa1-8768-a001a3385714 for RPC Path. 
INFO 09-12 07:05:33 api_server.py:176] Started engine process with PID 78 INFO 09-12 07:05:37 config.py:890] Defaulting to use mp for distributed inference INFO 09-12 07:05:37 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fbgemm_fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True) WARNING 09-12 07:05:37 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. INFO 09-12 07:05:37 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager (VllmWorkerProcess pid=210) INFO 09-12 07:05:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=213) INFO 09-12 07:05:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=212) INFO 09-12 07:05:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=211) INFO 09-12 07:05:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=209) INFO 09-12 07:05:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=214) INFO 09-12 07:05:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=215) INFO 09-12 07:05:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=211) INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=209) INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=215) INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=211) INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=209) INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=210) INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=215) INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=213) INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=214) INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=210) INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=212) INFO 09-12 07:05:48 utils.py:977] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=213) 
INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=214) INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=212) INFO 09-12 07:05:48 pynccl.py:63] vLLM is using nccl==2.20.5 INFO 09-12 07:05:55 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json INFO 09-12 07:06:38 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=213) INFO 09-12 07:06:38 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json (VllmWorkerProcess pid=214) INFO 09-12 07:06:38 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json . . . . (VllmWorkerProcess pid=209) INFO 09-12 07:06:38 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json INFO 09-12 07:06:38 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fff438fd6f0>, local_subscribe_port=33971, remote_subscribe_port=None) INFO 09-12 07:06:38 model_runner.py:915] Starting to load model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8... (VllmWorkerProcess pid=210) INFO 09-12 07:06:38 model_runner.py:915] Starting to load model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8... (VllmWorkerProcess pid=212) INFO 09-12 07:06:38 model_runner.py:915] Starting to load model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8... . . . . (VllmWorkerProcess pid=209) INFO 09-12 07:06:40 weight_utils.py:236] Using model weights format ['*.safetensors'] (VllmWorkerProcess pid=210) INFO 09-12 07:06:40 weight_utils.py:236] Using model weights format ['*.safetensors'] Loading safetensors checkpoint shards: 0% Completed | 0/109 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 1% Completed | 1/109 [00:03<07:09, 3.97s/it] Loading safetensors checkpoint shards: 2% Completed | 2/109 [00:07<06:20, 3.56s/it] Loading safetensors checkpoint shards: 3% Completed | 3/109 [00:10<06:03, 3.42s/it] . . . . . 
Loading safetensors checkpoint shards: 96% Completed | 105/109 [06:23<00:12, 3.22s/it] Loading safetensors checkpoint shards: 97% Completed | 106/109 [06:27<00:10, 3.51s/it] Loading safetensors checkpoint shards: 98% Completed | 107/109 [06:31<00:06, 3.44s/it] Loading safetensors checkpoint shards: 99% Completed | 108/109 [06:35<00:03, 3.75s/it] Loading safetensors checkpoint shards: 100% Completed | 109/109 [06:38<00:00, 3.59s/it] Loading safetensors checkpoint shards: 100% Completed | 109/109 [06:38<00:00, 3.66s/it] INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=209) INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=211) INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=212) INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=215) INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=214) INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=210) INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB (VllmWorkerProcess pid=213) INFO 09-12 07:13:19 model_runner.py:926] Loading model weights took 56.7677 GB INFO 09-12 07:13:26 distributed_gpu_executor.py:57] # GPU blocks: 2937, # CPU blocks: 4161 (VllmWorkerProcess pid=209) INFO 09-12 07:13:30 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. (VllmWorkerProcess pid=209) INFO 09-12 07:13:30 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. . . . . . INFO 09-12 07:13:30 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. INFO 09-12 07:13:30 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (VllmWorkerProcess pid=212) INFO 09-12 07:13:30 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. (VllmWorkerProcess pid=212) INFO 09-12 07:13:30 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (VllmWorkerProcess pid=215) INFO 09-12 07:13:46 custom_all_reduce.py:223] Registering 8855 cuda graph addresses (VllmWorkerProcess pid=211) INFO 09-12 07:13:46 custom_all_reduce.py:223] Registering 8855 cuda graph addresses (VllmWorkerProcess pid=213) INFO 09-12 07:13:46 custom_all_reduce.py:223] Registering 8855 cuda graph addresses . . . . . 
(VllmWorkerProcess pid=211) INFO 09-12 07:13:46 model_runner.py:1335] Graph capturing finished in 16 secs. (VllmWorkerProcess pid=209) INFO 09-12 07:13:46 model_runner.py:1335] Graph capturing finished in 17 secs. INFO 09-12 07:13:46 model_runner.py:1335] Graph capturing finished in 16 secs. INFO 09-12 07:13:47 api_server.py:224] vLLM to use /tmp/tmpzt8ci2e8 as PROMETHEUS_MULTIPROC_DIR WARNING 09-12 07:13:47 serving_embedding.py:190] embedding_mode is False. Embedding API will not work. INFO 09-12 07:13:47 launcher.py:20] Available routes are: INFO 09-12 07:13:47 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD INFO 09-12 07:13:47 launcher.py:28] Route: /docs, Methods: GET, HEAD INFO 09-12 07:13:47 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD INFO 09-12 07:13:47 launcher.py:28] Route: /redoc, Methods: GET, HEAD INFO 09-12 07:13:47 launcher.py:28] Route: /health, Methods: GET INFO 09-12 07:13:47 launcher.py:28] Route: /tokenize, Methods: POST INFO 09-12 07:13:47 launcher.py:28] Route: /detokenize, Methods: POST INFO 09-12 07:13:47 launcher.py:28] Route: /v1/models, Methods: GET INFO 09-12 07:13:47 launcher.py:28] Route: /version, Methods: GET INFO 09-12 07:13:47 launcher.py:28] Route: /v1/chat/completions, Methods: POST INFO 09-12 07:13:47 launcher.py:28] Route: /v1/completions, Methods: POST INFO 09-12 07:13:47 launcher.py:28] Route: /v1/embeddings, Methods: POST INFO 09-12 07:13:47 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing INFO: Started server process [1] INFO: Waiting for application startup. INFO: Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) INFO 09-12 07:13:57 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%. INFO 09-12 07:14:07 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
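Instead of, or in addition to, watching the logs, you can poll the /health route that the server registers at startup (see the list of available routes in the log above). The following minimal sketch assumes the server is reachable at localhost:8000; replace that address with the pod or service IP in your environment:

import time
import urllib.error
import urllib.request

# Replace localhost with the pod or service IP if polling from outside the node.
HEALTH_URL = "http://localhost:8000/health"

while True:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
            if response.status == 200:
                print("vLLM server is ready to accept requests")
                break
    except (urllib.error.URLError, OSError):
        pass  # Server is not up yet; keep waiting
    print("Server not ready, retrying in 30 seconds...")
    time.sleep(30)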
Verifying the model’s use of GPU memory
With access to the host or the container, you can verify GPU memory use with the nvidia-smi utility. The following example shows how the GPUs appear in the container after the model is loaded:
fbronzati@login01:/mnt/f710/vllm$ kubectl exec -it vllm-deployment-cd6c8564c-kh6tb -n vllm -- bash
root@vllm-deployment-cd6c8564c-kh6tb:/vllm-workspace# nvidia-smi
Thu Sep 12 11:31:04 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:19:00.0 Off |                    0 |
| N/A   40C    P0            117W /  700W |   72156MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:3B:00.0 Off |                    0 |
| N/A   36C    P0            114W /  700W |   69976MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   34C    P0            115W /  700W |   69976MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   38C    P0            112W /  700W |   69976MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9B:00.0 Off |                    0 |
| N/A   41C    P0            119W /  700W |   69976MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:BB:00.0 Off |                    0 |
| N/A   36C    P0            111W /  700W |   69976MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:CB:00.0 Off |                    0 |
| N/A   38C    P0            117W /  700W |   69976MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
| N/A   34C    P0            117W /  700W |   69496MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
Confirming the model
To confirm that the model is working, send a curl command:
fbronzati@login01:/mnt/f710/vllm$ curl -X 'POST' 'http://*.*.*.*:8000/v1/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", "prompt": "Once upon a time", "max_tokens": 64 }'
The following example is a response from the model:
{"id":"cmpl-cc26bb0694e34a26940becc791b2eacf","object":"text_completion","created":1726166028,"model":"meta-llama/Meta-Llama-3.1-405B-Instruct-FP8","choices":[{"index":0,"text":", there was a little girl name Sophie. She was eight years old and lived in a big, noisy city with her parents. Sophie's room was filled with belonging to a grown-up until just a little while ago her mother items. The room was cramped and cluttered, but Sophie loved it because it was hers.\n","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":69,"completion_tokens":64}}
Because the container exposes an OpenAI-compatible API, you can use Python or other languages to interact with the model.
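As an illustration, the following sketch uses the openai Python package (version 1.x or later) against the /v1/chat/completions route listed in the startup log. The service or ingress address, the placeholder API key, and the prompt are assumptions for your environment:

from openai import OpenAI

# Replace the address with your vLLM service or ingress IP; the API key is a placeholder unless one was configured.
client = OpenAI(base_url="http://<service-or-ingress-ip>:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    messages=[{"role": "user", "content": "Summarize the benefits of distributed inference in two sentences."}],
    max_tokens=128,
)
print(response.choices[0].message.content)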
Conclusion
The Llama 3.1 405B model is a highly capable open-source model that offers multiple deployment options.
Using vLLM increases flexibility: a single Docker command is enough to serve the model, and installing the requirements is easier because neither a Kubernetes cluster nor a license is required for deployment.
While this flexibility is beneficial, it can also introduce complexity when determining the most suitable deployment approach for specific requirements. Enterprises usually require high availability and support, which can be more difficult to achieve with open-source tools and a simple Docker deployment.
To address this complexity, Dell Technologies has developed the Dell AI Factory—a comprehensive framework that provides detailed studies, code snippets, and execution outputs to facilitate the deployment process. This resource enables organizations to replicate deployments, gain insights, and evaluate different strategies, helping them select the optimal deployment method based on their unique infrastructure and business needs.