

Short articles about Gen AI and related technology trends


  • GenAI
  • Inference
  • Generative AI
  • NIM
  • microservices

Introduction to NVIDIA Inference Microservices, aka NIM

Bertrand Sirodot Fabricio Bronzati

Fri, 14 Jun 2024 19:16:30 -0000



At NVIDIA GTC 2024, the major release of NVIDIA Inference Microservices, aka NIM, was announced. NIM is part of the portfolio that makes up the NVIDIA AI Enterprise stack. Why the focus on inferencing? Because when we look at use cases for Generative AI, the vast majority of them are inferencing use cases. It was therefore critical to simplify the deployment of applications that rely on inferencing.

What is NIM?

NIM is a set of microservices designed to automate the deployment of Generative AI inferencing applications. NIM was built with flexibility in mind: it supports a wide range of GenAI models and enables frictionless scaling of GenAI inferencing. Below is a high-level view of the NIM components:


Diving a layer deeper, NIM is made of the following services:

  • An API layer
  • A server layer
  • A runtime layer
  • A model “engine” 

Each microservice is based on a Docker container, which simplifies deployment on a wide variety of platforms and operating systems. Those containers can be downloaded from the NVIDIA Docker registry on NGC.

Because of their modular nature, NIMs can be adapted for vastly different use cases. For instance, at time of launch, NVIDIA is releasing a number of NIMs, including but not limited to a NIM for LLM, a NIM for Automated Speech Recognition and a NIM for Biology to predict the 3D structure of a protein. The remainder of this blog will be focused on NIM for LLMs.

Models can be deployed as part of a NIM, and NIMs are flexible enough to use NVIDIA's own models as well as models from other providers, such as the Meta Llama 3 models used later in this blog. In both cases, NVIDIA pre-generates the model engines and includes industry-standard APIs with the Triton Inference Server. The figure below shows what happens at the start of a NIM:

When the docker run command is issued to start the NIM and the container images have been downloaded, the container checks whether a model is already present on the filesystem. If it is, the container uses that model; if it isn't, it downloads a default model from NGC. While having the model automatically downloaded seems like a small step, it fits with the overall philosophy of NIM around ease of use and faster time to deployment.

Setting up NIM

Thanks to their packaging as Docker containers, NIMs have few prerequisites. You need a GPU, or a set of homogeneous GPUs, with sufficient aggregate memory to run your model, and the GPU(s) must have tensor parallelism enabled. NIMs can run on almost any Linux distribution and only require Docker, the NVIDIA Container Toolkit, and CUDA drivers to be installed.
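As a quick preflight, a small shell sketch like the one below can report which prerequisite commands are missing from PATH. The helper name missing_tools is hypothetical, and nvidia-ctk is assumed to be the command-line tool installed with the NVIDIA Container Toolkit. Finding a command does not prove that the driver or toolkit is configured correctly; it only catches the most obvious gaps.

```shell
# missing_tools CMD...: print each command name that is not found on PATH.
# (Hypothetical helper, not part of NIM or Docker.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report which prerequisites still appear to be missing.
# Assumption: nvidia-ctk is the CLI shipped with the NVIDIA Container Toolkit.
missing=$(missing_tools docker nvidia-smi nvidia-ctk)
if [ -n "$missing" ]; then
  printf 'Missing prerequisites:\n%s\n' "$missing"
else
  echo "All prerequisite commands found"
fi
```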

Once all the prerequisites have been installed, you can verify your installation with the following docker command, which should display output similar to this:

$ docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
 Fri May 24 14:13:48 2024
 | NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
 | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
 |                                         |                      |               MIG M. |
 |   0  NVIDIA H100 80GB HBM3          On  | 00000000:19:00.0 Off |                    0 |
 | N/A   48C    P0             128W / 700W |  76989MiB / 81559MiB |      0%      Default |
 |                                         |                      |             Disabled |
 |   1  NVIDIA H100 80GB HBM3          On  | 00000000:3B:00.0 Off |                    0 |
 | N/A   40C    P0             123W / 700W |  77037MiB / 81559MiB |      0%      Default |
 |                                         |                      |             Disabled |
 |   2  NVIDIA H100 80GB HBM3          On  | 00000000:4C:00.0 Off |                    0 |
 | N/A   40C    P0             128W / 700W |  77037MiB / 81559MiB |      0%      Default |
 |                                         |                      |             Disabled |
 |   3  NVIDIA H100 80GB HBM3          On  | 00000000:5D:00.0 Off |                    0 |
 | N/A   47C    P0             129W / 700W |  77037MiB / 81559MiB |      0%      Default |
 |                                         |                      |             Disabled |
 |   4  NVIDIA H100 80GB HBM3          On  | 00000000:9B:00.0 Off |                    0 |
 | N/A   49C    P0             142W / 700W |  77037MiB / 81559MiB |      0%      Default |
 |                                         |                      |             Disabled |
 |   5  NVIDIA H100 80GB HBM3          On  | 00000000:BB:00.0 Off |                    0 |
 | N/A   41C    P0             131W / 700W |  77037MiB / 81559MiB |      0%      Default |
 |                                         |                      |             Disabled |
 |   6  NVIDIA H100 80GB HBM3          On  | 00000000:CB:00.0 Off |                    0 |
 | N/A   48C    P0             144W / 700W |  77037MiB / 81559MiB |      0%      Default |
 |                                         |                      |             Disabled |
 |   7  NVIDIA H100 80GB HBM3          On  | 00000000:DB:00.0 Off |                    0 |
 | N/A   40C    P0             129W / 700W |  76797MiB / 81559MiB |      0%      Default |
 |                                         |                      |             Disabled |

 | Processes:                                                                            |
 |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
 |        ID   ID                                                             Usage      |

The next thing you will need is an NGC authentication key. An NGC API key is required to access NGC resources and a key can be generated here:

When creating an NGC API key, ensure that at least “NGC Catalog” is selected from the “Services Included” dropdown. More Services can be included if this key is to be reused for other purposes.

This key will need to be passed to docker run in the next section as the NGC_API_KEY environment variable to download the appropriate models and resources when starting the NIM.

If you’re not familiar with how to do this, the simplest way is to export it in your terminal:

export NGC_API_KEY=<value>

Run one of the following to make it available at startup:

# If using bash

echo "export NGC_API_KEY=<value>" >> ~/.bashrc

# If using zsh

echo "export NGC_API_KEY=<value>" >> ~/.zshrc

To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:

echo "$NGC_API_KEY" | docker login --username '$oauthtoken' --password-stdin

Use $oauthtoken as the username and NGC_API_KEY as the password. The $oauthtoken username is a special name that indicates that you will authenticate with an API key and not a username and password. This is what the output of the command looks like:

$ echo "$NGC_API_KEY" | docker login --username '$oauthtoken' --password-stdin
 WARNING! Your password will be stored unencrypted in /home/fbronzati/.docker/config.json.
 Configure a credential helper to remove this warning. See

Login Succeeded

A few times throughout this documentation, the ngc CLI tool will be used. Before continuing, please refer to the NGC CLI documentation for information on how to download and configure the tool.

Note: The ngc tool used to use the environment variable NGC_API_KEY but has deprecated that in favor of NGC_CLI_API_KEY. In the previous section, you set NGC_API_KEY and it will be used in future commands. If you run ngc with this variable set, you will get a warning saying it is deprecated in favor of NGC_CLI_API_KEY. This can be safely ignored for now. You can set NGC_CLI_API_KEY, but so long as NGC_API_KEY is set, you will still get the warning.


Launching NIM

The following commands launch a Docker container for the meta-llama3-8b-instruct model.

# Choose a container name for bookkeeping

export CONTAINER_NAME=meta-llama3-8b-instruct

# Choose an LLM NIM image from NGC

export IMG_NAME=""

# Choose a path on your system to cache the downloaded models

export LOCAL_NIM_CACHE=~/.cache/nim

mkdir -p "$LOCAL_NIM_CACHE"

# Start the LLM NIM

docker run -it --rm --name=$CONTAINER_NAME --runtime=nvidia  --gpus all -e NGC_API_KEY -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" -u $(id -u) -p 8000:8000 $IMG_NAME 

The NVIDIA NIM for LLM will automatically select the most compatible profile based on your system specification, using one of two backends:

  • TensorRT-LLM for optimized inference engines, when a compatible model is found
  • vLLM for generic, non-optimized models

The selection will be logged at startup. For example:

Detected 6 compatible profile(s).

Valid profile: dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency) on GPUs [0, 1]

Valid profile: 30b562864b5b1e3b236f7b6d6a0998efbed491e4917323d04590f715aa9897dc (tensorrt_llm-h100-fp8-tp1-throughput) on GPUs [0]

Valid profile: 879b05541189ce8f6323656b25b7dff1930faca2abe552431848e62b7e767080 (tensorrt_llm-h100-fp16-tp2-latency) on GPUs [0, 1]

Valid profile: cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1 (tensorrt_llm-h100-fp16-tp1-throughput) on GPUs [0]

Valid profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2) on GPUs [0, 1, 2, 3, 4, 5, 6, 7]

Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0, 1, 2, 3, 4, 5, 6, 7]

Selected profile: dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)

Profile metadata: precision: fp8

Profile metadata: tp: 2

Profile metadata: llm_engine: tensorrt_llm

Profile metadata: feat_lora: false

Profile metadata: gpu: H100

Profile metadata: pp: 1

Profile metadata: gpu_device: 2330:10de

Profile metadata: profile: latency 

To override this behavior, set a specific profile ID with -e NIM_MODEL_PROFILE=<value>. The following list-model-profiles command lists the available profiles for the IMG_NAME LLM NIM:

docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles
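In the startup log above, the profile names appear to encode the backend, GPU type, precision, and tensor-parallel degree (for example, tensorrt_llm-h100-fp8-tp2-latency is reported with tp: 2 in its metadata). Assuming that naming convention holds, a small sketch like the following can pull the tensor-parallel degree out of a profile name, which helps when checking how many GPUs a profile expects. The helper name profile_tp is hypothetical.

```shell
# profile_tp NAME: print the number that follows "-tp" in a profile name.
# Assumption: "tpN" in the name matches the "tp" value in the profile metadata.
profile_tp() {
  printf '%s\n' "$1" | sed -n 's/.*-tp\([0-9][0-9]*\).*/\1/p'
}

profile_tp tensorrt_llm-h100-fp8-tp2-latency   # prints 2
profile_tp vllm-fp16-tp1                       # prints 1
```

The function prints nothing when the name carries no tpN component, so callers can treat empty output as "unknown".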

Updating NIM to use PowerScale as model cache

It is possible to modify the NIM deployment process to leverage PowerScale as the cache to store the model.

Why would one do that? Because storing the model in a cache on PowerScale allows the model to be re-used on any server, or even across multiple clusters. This is not as critical when an application leverages a foundation model, but if you have spent money and resources to customize a particular model, this method allows that model to be re-used by multiple applications.

It also means that the application can scale horizontally as the model is now available to multiple servers, thus potentially improving its performance.

To achieve this, a few things need to happen.

First let’s export the container name and the image name:

$ export CONTAINER_NAME=meta-llama3-8b-instruct
 $ export IMG_NAME=""

Then let’s create the directory that will be used as cache on PowerScale and export that directory:

$ export LOCAL_NIM_CACHE=/aipsf810/project-helix/NIM/nim
 $ mkdir -p "$LOCAL_NIM_CACHE"
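Because the container is started with -u $(id -u), the cache directory has to be writable by the user who launches it. The sketch below is one way to check this before starting the NIM; cache_dir_ok is a hypothetical helper, and the fallback path is only used when LOCAL_NIM_CACHE is not already set.

```shell
# cache_dir_ok DIR: succeed only if DIR exists and the current user can
# write to it and traverse it. (Hypothetical helper.)
cache_dir_ok() {
  [ -d "$1" ] && [ -w "$1" ] && [ -x "$1" ]
}

# Fall back to the local cache path used earlier if the variable is unset.
LOCAL_NIM_CACHE=${LOCAL_NIM_CACHE:-"$HOME/.cache/nim"}
if cache_dir_ok "$LOCAL_NIM_CACHE"; then
  echo "cache directory is usable: $LOCAL_NIM_CACHE"
else
  echo "cache directory is missing or not writable: $LOCAL_NIM_CACHE" >&2
fi
```

This only tests permissions as seen from the host; it does not verify that the path is actually backed by shared storage.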

Then let’s run the container with these environment variables:

$ docker run -it --rm --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus all \
 --shm-size=16GB \
 -e NGC_API_KEY \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -p 8000:8000 \
 $IMG_NAME
 Unable to find image '' locally
 1.0.0: Pulling from nim/meta/llama3-8b-instruct
 5e8117c0bd28: Already exists
 d67fcc6ef577: Already exists
 47ee674c5713: Already exists
 63daa0e64b30: Already exists
 d9d9aecefab5: Already exists
 d71f46a15657: Already exists
 054e2ffff644: Already exists
 7d3cd81654d5: Already exists
 dca613dca886: Already exists
 0fdcdcda3b2e: Already exists
 af7b4f7dc15a: Already exists
 6d101782f66c: Already exists
 e8427cb13897: Already exists
 de05b029a5a2: Already exists
 3d72a2698104: Already exists
 aeff973c2191: Already exists
 85d7d3ff0cca: Already exists
 5996430251dd: Already exists
 314dc83fdfc2: Already exists
 5cef8f59ae9a: Already exists
 927db4ce3e96: Already exists
 cbe4a04f4491: Already exists
 60f1a03c0955: Pull complete
 67c1bb2b1aac: Pull complete
 f16f7b821143: Pull complete
 9be4fff0cd1a: Pull complete
 Digest: sha256:7fe6071923b547edd9fba87c891a362ea0b4a88794b8a422d63127e54caa6ef7
 Status: Downloaded newer image for

 == NVIDIA Inference Microservice LLM NIM ==

NVIDIA Inference Microservice LLM NIM Version 1.0.0
 Model: nim/meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
 A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
 A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

2024-06-05 14:52:46,069 [INFO] PyTorch version 2.2.2 available.
 2024-06-05 14:52:46,732 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
 2024-06-05 14:52:46,733 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
 2024-06-05 14:52:48,096 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
 [TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
 INFO 06-05 14:52:49.6] NIM LLM API version 1.0.0
 INFO 06-05 14:52:49.12] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
 INFO 06-05 14:52:49.12] Detected 6 compatible profile(s).
 INFO 06-05 14:52:49.12] Valid profile: dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency) on GPUs [0, 1]
 INFO 06-05 14:52:49.12] Valid profile: 30b562864b5b1e3b236f7b6d6a0998efbed491e4917323d04590f715aa9897dc (tensorrt_llm-h100-fp8-tp1-throughput) on GPUs [0]
 INFO 06-05 14:52:49.12] Valid profile: 879b05541189ce8f6323656b25b7dff1930faca2abe552431848e62b7e767080 (tensorrt_llm-h100-fp16-tp2-latency) on GPUs [0, 1]
 INFO 06-05 14:52:49.12] Valid profile: cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1 (tensorrt_llm-h100-fp16-tp1-throughput) on GPUs [0]
 INFO 06-05 14:52:49.12] Valid profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2) on GPUs [0, 1, 2, 3, 4, 5, 6, 7]
 INFO 06-05 14:52:49.12] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0, 1, 2, 3, 4, 5, 6, 7]
 INFO 06-05 14:52:49.12] Selected profile: dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)
 INFO 06-05 14:52:50.112] Profile metadata: precision: fp8
 INFO 06-05 14:52:50.112] Profile metadata: tp: 2
 INFO 06-05 14:52:50.112] Profile metadata: llm_engine: tensorrt_llm
 INFO 06-05 14:52:50.112] Profile metadata: feat_lora: false
 INFO 06-05 14:52:50.112] Profile metadata: gpu: H100
 INFO 06-05 14:52:50.112] Profile metadata: pp: 1
 INFO 06-05 14:52:50.112] Profile metadata: gpu_device: 2330:10de
 INFO 06-05 14:52:50.112] Profile metadata: profile: latency
 INFO 06-05 14:52:50.112] Preparing model workspace. This step might download additional files to run the model.
 tokenizer_config.json [00:00:00] 49.79 KiB/49.79 KiB 843.55 KiB/s (0s)
 rank1.engine [00:01:43] 4.30 GiB/4.30 GiB 42.53 MiB/s (0s)
 trt_llm_config.yaml [00:00:00] 1012 B/1012 B 17.60 KiB/s (0s)
 config.json [00:00:00] 654 B/654 B 27.50 KiB/s (0s)
 special_tokens_map.json [00:00:00] 73 B/73 B 3.12 KiB/s (0s)
 rank0.engine [00:01:43] 4.30 GiB/4.30 GiB 42.63 MiB/s (0s)
 tokenizer.json [00:00:00] 8.66 MiB/8.66 MiB 31.35 MiB/s (0s)
 checksums.blake3 [00:00:00] 402 B/402 B 11.67 KiB/s (0s)
 generation_config.json [00:00:00] 187 B/187 B 8.23 KiB/s (0s)
 model.safetensors.index.json [00:00:00] 23.39 KiB/23.39 KiB 931.10 KiB/s (0s)
 config.json [00:00:00] 5.21 KiB/5.21 KiB 98.42 KiB/s (0s)
 metadata.json [00:00:00] 232 B/232 B 6.36 KiB/s (0s)
 INFO 06-05 14:56:25.185] Model workspace is now ready. It took 215.073 seconds
 INFO 06-05 14:56:25.207] Initializing an LLM engine (v1.0.0) with config: model='/tmp/meta--llama3-8b-instruct-ic179k86', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-ic179k86', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
 WARNING 06-05 14:56:25.562] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
 INFO 06-05 14:56:25.583] Using provided selected GPUs list [0, 1]
 INFO 06-05 14:56:25.583] Using 0 bytes of gpu memory for PEFT cache
 INFO 06-05 14:56:25.593] Engine size in bytes 4613382012
 INFO 06-05 14:56:25.604] available device memory 85170061312
 INFO 06-05 14:56:25.604] Setting free_gpu_memory_fraction to 0.9
 WARNING 06-05 14:57:25.189] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
 INFO 06-05 14:57:25.198] Using default chat template:
 {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}
 WARNING 06-05 14:57:25.454] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
 INFO 06-05 14:57:25.462] Serving endpoints:
 INFO 06-05 14:57:25.462] An example cURL request:
 curl -X 'POST' \
'' \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
     "model": "meta/llama3-8b-instruct",
     "messages": [
         "content":"Hello! How are you?"
         "content":"Hi! I am quite well, how can I help you today?"
         "content":"Can you write me a song?"
     "top_p": 1,
     "n": 1,
     "max_tokens": 15,
     "stream": true,
     "frequency_penalty": 1.0,
     "stop": ["hello"]

INFO 06-05 14:57:25.508] Started server process [32]
 INFO 06-05 14:57:25.509] Waiting for application startup.
 INFO 06-05 14:57:25.518] Application startup complete.
INFO 06-05 14:57:25.519] Uvicorn running on (Press CTRL+C to quit)

Your output might differ slightly from the above, but if you reach the last line, then you have now successfully deployed a NIM for LLM with Llama 3 8B cached on PowerScale.

But what if I want to run a different model, such as Llama 3 70B instead of 8B? Easy: kill the previous container and change the following two environment variables:

$ export CONTAINER_NAME=meta-llama3-70b-instruct

$ export IMG_NAME=""

And run the same command as previously:

$ docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME

At the end of this, you will have a NIM for LLM running Llama 3 70B. Yes, it is that simple to deploy all the components required to run inference.


Selecting a specific GPU

In all the commands above, I instructed the container to use all the available GPUs by passing the --gpus all parameter. This is acceptable in homogeneous environments with one or more of the same GPU.

In heterogeneous environments with a combination of GPUs (for example: A6000 + a GeForce display GPU), workloads should only run on compute-capable GPUs. Expose specific GPUs inside the container using either:

  • the --gpus flag (ex: --gpus='"device=1"')
  • the environment variable NVIDIA_VISIBLE_DEVICES (ex: -e NVIDIA_VISIBLE_DEVICES=1)

The device ID(s) to use as input(s) are listed in the output of nvidia-smi -L:

fbronzati@node041:/aipsf810/project-helix/NIM$ nvidia-smi -L
 GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-a4c60bd7-b5fc-f461-2902-65138251f2cf)
 GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-e5cd81b5-2df5-6404-8568-8b8e82783770)
 GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-da78242a-c12a-3d3c-af30-5c6d5a9627df)
 GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-65804526-f8a9-5f9e-7f84-398b099c7b3e)
 GPU 4: NVIDIA H100 80GB HBM3 (UUID: GPU-14a79c8f-0439-f199-c0cc-e46ee9dc05c1)
 GPU 5: NVIDIA H100 80GB HBM3 (UUID: GPU-06938695-4bfd-9c39-e64a-15edddfd5ac2)
 GPU 6: NVIDIA H100 80GB HBM3 (UUID: GPU-23666331-24c5-586b-8a04-2c6551570860)
 GPU 7: NVIDIA H100 80GB HBM3 (UUID: GPU-14f2736c-f0e9-0cb5-b12b-d0047f84531c)

For further information, please refer to the NVIDIA Container Toolkit documentation.
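In a mixed system, the device IDs listed by nvidia-smi -L can be filtered by GPU name to build the value for the --gpus flag. The sketch below assumes the "GPU N: name (UUID: ...)" line format shown above; gpu_indices is a hypothetical helper, and the sample listing in the demo is made up for illustration.

```shell
# gpu_indices PATTERN: read `nvidia-smi -L` output on stdin and print the
# comma-separated indices of GPUs whose line contains PATTERN.
# (Hypothetical helper; assumes the "GPU N: ..." line format.)
gpu_indices() {
  grep -F -- "$1" | sed -n 's/^ *GPU \([0-9][0-9]*\):.*/\1/p' | paste -sd, -
}

# Demo with an invented listing for a heterogeneous host.
sample='GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-aaaa)
GPU 1: NVIDIA GeForce GT 1030 (UUID: GPU-bbbb)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-cccc)'
printf '%s\n' "$sample" | gpu_indices H100   # prints 0,2

# On a real host, the result could feed the --gpus flag, for example:
#   docker run ... --gpus "\"device=$(nvidia-smi -L | gpu_indices H100)\"" ...
```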


Selecting a specific model profile

When I ran the NIM container above, I let it pick the default model profile for my system. It is also possible to specify which model profile to use. To do that, I need the ID of the profile. Getting the ID of a profile is as easy as running the following command for the specific image you are looking at. For instance, to get all the profiles available for meta-llama3-70b-instruct:

$ docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles

 == NVIDIA Inference Microservice LLM NIM ==

NVIDIA Inference Microservice LLM NIM Version 1.0.0
 Model: nim/meta/llama3-70b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
 A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
 A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

 - Free GPUs:
   -  [2330:10de] (0) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
   -  [2330:10de] (1) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
   -  [2330:10de] (2) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
   -  [2330:10de] (3) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
   -  [2330:10de] (4) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
   -  [2330:10de] (5) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
   -  [2330:10de] (6) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
   -  [2330:10de] (7) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]
 - Compatible with system and runnable:
   - 93782c337ddaf3f6b442ef7baebd0a732e433b7b93ec06bfd3db27945de560b6 (tensorrt_llm-h100-fp8-tp8-latency)
   - 8bb3b94e326789381bbf287e54057005a9250e3abbad0b1702a70222529fcd17 (tensorrt_llm-h100-fp8-tp4-throughput)
   - a90b2c0217492b1020bead4e7453199c9f965ea53e9506c007f4ff7b58ee01ff (tensorrt_llm-h100-fp16-tp8-latency)
   - abcff5042bfc3fa9f4d1e715b2e016c11c6735855edfe2093e9c24e83733788e (tensorrt_llm-h100-fp16-tp4-throughput)
   - 7e8f6cc0d0fde672073a20d5423977a0a02a9b0693f0f3d4ffc2ec8ac35474d4 (vllm-fp16-tp8)
   - df45ca2c979e5c64798908815381c59159c1d08066407d402f00c6d4abd5b108 (vllm-fp16-tp4)
   - With LoRA support:
     - 36fc1fa4fc35c1d54da115a39323080b08d7937dceb8ba47be44f4da0ec720ff (tensorrt_llm-h100-fp16-tp4-throughput-lora)
     - 0f3de1afe11d355e01657424a267fbaad19bfea3143a9879307c49aed8299db0 (vllm-fp16-tp8-lora)
     - a30aae0ed459082efed26a7f3bc21aa97cccad35509498b58449a52c27698544 (vllm-fp16-tp4-lora)
 - Incompatible with system:
   - 2e9b29c44b3d82821e7f4facd1a652ec5e0c4e7e473ee33451f6d046f616c3e5 (tensorrt_llm-l40s-fp8-tp8-latency)
   - 8b8e03de8630626b904b37910e3d82a26cebb99634a921a0e5c59cb84125efe8 (tensorrt_llm-l40s-fp8-tp4-throughput)
   - 96b70da1414c7beb5cf671b3e7cf835078740764064c453cd86e84cf0491aac0 (tensorrt_llm-l40s-fp16-tp8-throughput)
   - b811296367317f5097ed9f71b8f08d2688b2411c852978ae49e8a0d5c3a30739 (tensorrt_llm-a100-fp16-tp4-throughput)
   - 7f8bb4a2b97cf07faf6fb930ba67f33671492b7653f2a06fe522c7de65b544ca (tensorrt_llm-a100-bf16-tp8-latency)
   - 03fdb4d11f01be10c31b00e7c0540e2835e89a0079b483ad2dd3c25c8cc29b61 (tensorrt_llm-l40s-fp16-tp8-throughput-lora)
   - 7ba9fbd93c41a28358215f3e94e79a2545ab44e39df016eb4c7d7cadc384bde7 (tensorrt_llm-a100-fp16-tp4-throughput-lora)
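Copying a 64-character ID by hand is error-prone, so the listing above can also be filtered by profile name. The following is a minimal sketch that assumes each profile line has the form "- ID (name)" as shown; profile_id is a hypothetical helper, and it does not distinguish compatible profiles from incompatible ones, so the name should be taken from the "Compatible with system and runnable" section.

```shell
# profile_id NAME: read list-model-profiles output on stdin and print the ID
# of the first profile whose name (the part in parentheses) equals NAME.
# (Hypothetical helper; assumes the "- ID (name)" line format.)
profile_id() {
  awk -v want="($1)" '$1 == "-" && $3 == want { print $2; exit }'
}

# Demo on an excerpt of the listing above.
listing=' - Compatible with system and runnable:
   - 7e8f6cc0d0fde672073a20d5423977a0a02a9b0693f0f3d4ffc2ec8ac35474d4 (vllm-fp16-tp8)
   - df45ca2c979e5c64798908815381c59159c1d08066407d402f00c6d4abd5b108 (vllm-fp16-tp4)'
printf '%s\n' "$listing" | profile_id vllm-fp16-tp4

# Against a real image, the output could be captured for later use:
#   export NIM_MODEL_PROFILE=$(docker run --rm --runtime=nvidia --gpus=all \
#     $IMG_NAME list-model-profiles | profile_id vllm-fp16-tp4)
```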


In the example below, instead of using the default configuration, we selected the profile df45ca2c979e5c64798908815381c59159c1d08066407d402f00c6d4abd5b108 (vllm-fp16-tp4) and deployed Llama 3 70B with vLLM by passing the -e NIM_MODEL_PROFILE= flag:

$ docker run -it --rm --name=$CONTAINER_NAME \
 --runtime=nvidia \
 --gpus all \
 --shm-size=16GB \
 -e NGC_API_KEY \
 -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
 -u $(id -u) \
 -e NIM_MODEL_PROFILE=df45ca2c979e5c64798908815381c59159c1d08066407d402f00c6d4abd5b108 \
 -p 8000:8000 \
 $IMG_NAME

Running Inference Requests

A NIM typically exposes two OpenAI-compatible API endpoints: the completions endpoint and the chat completions endpoint. In the next sections, I will show how to interact with those endpoints.

OpenAI Completion Request

The Completions endpoint is generally used for base models. With the Completions endpoint, prompts are sent as plain strings, and the model produces the most likely text completions subject to the other parameters chosen. To stream the result, set "stream": true.

curl -X 'POST' \
'' \
     -H 'accept: application/json' \
     -H 'Content-Type: application/json' \
     -d '{
 "model": "meta-llama3-8b-instruct",
 "prompt": "Once upon a time",
 "max_tokens": 64
 }'

Using the Llama 3 8B model, the request outputs the following:

$ curl -X 'POST' \
     '' \
     -H 'accept: application/json' \
     -H 'Content-Type: application/json' \
     -d '{
 "model": "meta-llama3-8b-instruct",
 "prompt": "Once upon a time",
 "max_tokens": 64
 }'
 {"id":"cmpl-799d4f8aa64943c4a5a737b5defdfdeb","object":"text_completion","created":1716483431,"model":"meta-llama3-8b-instruct","choices":[{"index":0,"text":", there was a young man named Jack who lived in a small village at the foot of a vast and ancient forest. Jack was a curious and adventurous soul, always eager to explore the world beyond his village. One day, he decided to venture into the forest, hoping to discover its secrets.\nAs he wandered deeper into","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":69,"completion_tokens":64}}

Using Llama 3 70B, the request outputs the following:

$ curl -X 'POST' \
'' \
     -H 'accept: application/json' \
     -H 'Content-Type: application/json' \
     -d '{
 "model": "meta-llama3-70b-instruct",
 "prompt": "Once upon a time",
 "max_tokens": 64
 }'
 {"id":"cmpl-1b7394c809b244d489efacd13c2e13ac","object":"text_completion","created":1716568789,"model":"meta-llama3-70b-instruct","choices":[{"index":0,"text":", there was a young girl named Lily. She lived in a small village surrounded by lush green forests and rolling hills. Lily was a gentle soul, with a heart full of love for all living things.\nOne day, while wandering through the forest, Lily stumbled upon a hidden clearing. In the center of the clearing stood","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":69,"completion_tokens":64}}
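The completion requests above differ only in the model name, so the request body can be generated by a small helper. The sketch below is one way to do that in plain shell; completion_payload and the NIM_COMPLETIONS_URL variable are hypothetical names, and the endpoint URL itself has to come from your own deployment. No JSON escaping is performed, so the sketch assumes the model name and prompt contain no double quotes, backslashes, or newlines.

```shell
# completion_payload MODEL PROMPT MAX_TOKENS: print a Completions request body.
# Assumption: MODEL and PROMPT need no JSON escaping (no quotes, backslashes,
# or newlines). (Hypothetical helper.)
completion_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}\n' "$1" "$2" "$3"
}

completion_payload meta-llama3-8b-instruct "Once upon a time" 64

# Possible use with curl, where NIM_COMPLETIONS_URL is a variable you set to
# the completions endpoint of your NIM:
#   curl -s -X POST "$NIM_COMPLETIONS_URL" \
#     -H 'accept: application/json' -H 'Content-Type: application/json' \
#     -d "$(completion_payload meta-llama3-8b-instruct 'Once upon a time' 64)"
```

For prompts with arbitrary text, a JSON-aware tool is a safer way to build the body than printf.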

OpenAI Chat Completion Request

The Chat Completions endpoint is typically used with chat-tuned or instruct-tuned models that are designed to be used through a conversational approach. With the Chat Completions endpoint, prompts are sent in the form of messages with roles and contents, giving a natural way to keep track of a multi-turn conversation. To stream the result, set "stream": true.

Running this request against Llama 3 8B produces the output below:

$ curl -X 'POST' \
'' \
     -H 'accept: application/json' \
     -H 'Content-Type: application/json' \
     -d '{
 "model": "meta-llama3-8b-instruct",
 "messages": [
   {"role":"user", "content":"Hello! How are you?"},
   {"role":"assistant", "content":"Hi! I am quite well, how can I help you today?"},
   {"role":"user", "content":"Can you write me a song?"}
 ],
 "max_tokens": 32
 }'
 {"id":"cmpl-cce6dabf331e465f8e9107b05eb92f6c","object":"chat.completion","created":1716483457,"model":"meta-llama3-8b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"I'd be happy to try. Can you give me a few details to get started?\n\n* Is there a specific topic or theme you'd like the song to"},"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":47,"total_tokens":79,"completion_tokens":32}}

And against Llama 3 70B:

$ curl -X 'POST' '' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
 "model": "meta-llama3-70b-instruct",
 "messages": [
   {"role":"user", "content":"Hello! How are you?"},
   {"role":"assistant", "content":"Hi! I am quite well, how can I help you today?"},
   {"role":"user", "content":"Can you write me a song?"}
 ],
 "max_tokens": 32
 }'
 {"id":"cmpl-c6086a36cbf84fa387a18d6da4de6ffb","object":"chat.completion","created":1716569009,"model":"meta-llama3-70b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"I'd be happy to try. Can you give me a bit more information on what you're looking for? Here are a few questions to get started:\n\n*"},"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":47,"total_tokens":79,"completion_tokens":32}}


As shown in this blog, NIMs significantly change the way infrastructure to run inference is deployed. NIMs package all the critical components together in a single container that can be run using the standard docker run command.


  • AI
  • Artificial Intelligence
  • inferencing
  • XE9680
  • GenAI
  • LLM
  • Meta
  • Llama

vLLM Meets Kubernetes-Deploying Llama-3 Models using KServe on Dell PowerEdge XE9680 with AMD MI300X

Ajay Kadoula, Savitha Pareek, Subhankar Adak

Fri, 17 May 2024 19:18:34 -0000




Dell's PowerEdge XE9680 server infrastructure, coupled with the capabilities of Kubernetes and KServe, offers a comprehensive platform for seamlessly deploying and managing sophisticated large language models such as Meta AI's Llama-3, addressing the evolving needs of AI-driven businesses.

Our previous blog post explored leveraging advanced Llama-3 models (meta-llama/Meta-Llama-3-8B-Instruct and meta-llama/Meta-Llama-3-70B-Instruct) for inference tasks, where Dell Technologies highlighted the container-based deployment, endpoint API methods, and OpenAI-style inferencing. This subsequent blog delves deeper into the inference process but with a focus on Kubernetes (k8s) and KServe integration.

This method seamlessly integrates with the Hugging Face ecosystem and the vLLM framework, all operational on the robust Dell™ PowerEdge™ XE9680 server, empowered by the high-performance AMD Instinct™ MI300X accelerators.

System configurations and prerequisites

  • Operating System: Ubuntu 22.04.3 LTS
  • Kernel: Linux 5.15.0-105-generic
  • Architecture: x86-64
  • ROCm™ version: 6.1
  • Server: Dell™ PowerEdge™ XE9680
  • GPU: 8x AMD Instinct™ MI300X Accelerators
  • vLLM version:
  • Llama-3: meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B-Instruct
  • Kubernetes: 1.26.12
  • KServe: V0.11.2

To install vLLM, see our previous blog for instructions on setting up the cluster environment needed for inference.

Deploying Kubernetes and KServe on XE9680


To set up Kubernetes on a bare metal XE9680 cluster, Dell Technologies used Kubespray, an open-source tool that streamlines Kubernetes deployments. Dell Technologies followed the quick start section of its documentation, which provides clear step-by-step instructions for installation and configuration.

Next, Dell Technologies installed KServe, a highly scalable and standards-based model inference platform on Kubernetes.

KServe provides the following features:

  • It provides a standardized inference protocol across a range of machine learning frameworks, ensuring compatibility across different platforms.
  • It supports serverless inference workloads with autoscaling capabilities, including GPU scaling down to zero when not in use.
  • It uses ModelMesh to achieve high scalability, optimized resource utilization, and intelligent request routing.
  • It provides a simple yet modular solution for production-level ML serving, encompassing prediction, preprocessing and postprocessing, monitoring, and explainability features.
  • It facilitates advanced deployment strategies such as canary rollouts, experimental testing, model ensembles, and transformers for more complex use cases.


Setting up KServe: Quickstart and Serverless installation

To test inference with KServe, start with the KServe Quickstart guide for a simple setup. If you need a production-ready environment, refer to the Administration Guide. Dell Technologies opted for Serverless installation to meet its scalability and resource flexibility requirements.

Serverless installation:

As part of KServe (v0.11.2) installation, Dell Technologies had to install the following dependencies first:

  1. Istio (V1.17.0)
  2. Certificate manager (v1.13.0)
  3. Knative Serving (v1.11.0)
  4. DNS Configuration

Each dependency is described below.

Istio (V1.17.0)

Purpose: Manages traffic and security.

Why needed: Ensures efficient routing, secure service communication, and observability for microservices.

See the Istio guide for details.

Certificate manager (v1.13.0)

Purpose: Automates TLS certificate management.

Why needed: Provides encrypted and secure communication between services, crucial for protecting data.

Knative Serving (v1.11.0)

Purpose: Enables serverless deployment and scaling.

Why needed: Automatically scales KServe model serving pods based on demand, ensuring efficient resource use.

DNS Configuration

Purpose: Configures the domain name under which services are exposed.

Why needed: Ensures that services can communicate using human-readable names, which is crucial for reliable request routing.

kubectl patch configmap/config-domain \
      --namespace knative-serving \
      --type merge \
      --patch '{"data":{"":""}}'

For more details, please see the Knative Serving install guide.
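
As a sketch of what the --patch argument carries, the snippet below builds a JSON merge patch of the same shape for a given domain. The domain example.com is a placeholder assumption, not the value used in this deployment, and build_domain_patch is a hypothetical helper name.

```python
import json


def build_domain_patch(domain):
    """Return a JSON merge patch that sets `domain` as a key with an empty value."""
    if not domain or " " in domain:
        raise ValueError("domain must be a non-empty string without spaces")
    # Compact separators keep the output easy to paste into a shell command
    return json.dumps({"data": {domain: ""}}, separators=(",", ":"))


# example.com is a placeholder; substitute the domain for your cluster
print(build_domain_patch("example.com"))
```

The printed string can be passed as the value of --patch in the kubectl command above.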

Installation steps for KServe

The deployment requires ClusterStorageContainer. Provide the http_proxy and https_proxy values if the cluster is behind a proxy and the nodes do not have direct Internet access.

KServe is also included with the Kubeflow deployment. If you must deploy Kubeflow, refer to the Kubeflow git page for installation steps. Dell Technologies used a single command installation approach to install Kubeflow with Kustomize.

Llama 3 model execution with Kubernetes

To run the Llama 3 model in Kubernetes, deploy it using the configuration established above, then follow the steps below.

Create the manifest file

This YAML snippet employs the API version, ensuring compatibility and adherence to the standards within the KServe environment. Within this setup, a KServe Inference Service named llama-3-70b is established using a container housing the vLLM model.

The configuration allocates MI300X GPUs and sets environment variables. It specifies resource limits for CPUs, memory, and GPUs, along with proxy settings and authentication tokens.

In the YAML example file below, arguments are passed to the container's command. This container expects:

  • --port: The port on which the service will listen (8080)
  • --model: The model to be loaded, specified as either meta-llama/Meta-Llama-3-70B-Instruct or meta-llama/Meta-Llama-3-8B-Instruct

Alternatively, separate YAML files can be created to run both models independently.
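
As a rough illustration of the fields described above, the following Python sketch assembles a manifest as a dictionary and prints it as JSON, which kubectl accepts in place of YAML. It is not the manifest used in this deployment. The apiVersion value, the amd.com/gpu resource key, the HUGGING_FACE_HUB_TOKEN variable name, the container name, and the image name are all assumptions, and build_isvc_manifest is a hypothetical helper.

```python
import json


def build_isvc_manifest(name, model, image, gpus=1, port=8080, hf_token="<your-token>"):
    """Assemble an InferenceService-style manifest as a plain dict (illustrative only)."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",  # assumption: not stated in the text
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "containers": [{
                    "name": "kserve-container",  # assumed container name
                    "image": image,  # hypothetical image name
                    # Arguments passed to the container's command
                    "args": ["--port", str(port), "--model", model],
                    "env": [{"name": "HUGGING_FACE_HUB_TOKEN", "value": hf_token}],
                    "resources": {
                        "limits": {"amd.com/gpu": str(gpus)},  # assumed GPU resource key
                    },
                }]
            }
        },
    }


manifest = build_isvc_manifest(
    name="llama-3-70b",
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    image="example.registry/vllm-rocm:latest",  # placeholder, not a real image
)
print(json.dumps(manifest, indent=2))
```

Calling the helper a second time with a different name and model gives the separate manifest needed to run the 8B model independently.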

For endpoint inferencing, choose any of the three methods mentioned in the previous blog: offline inferencing with a container image, the endpoint API, or the OpenAI approach.

Apply the manifest file

Next, run the kubectl apply command to deploy the vLLM configuration defined in the YAML file onto the Kubernetes cluster. Kubernetes reads the YAML specification and creates the Inference Service named llama-3-70b, setting up the vLLM model container with the designated resources and environment configuration.

The initial READY status will be either unknown or null. After the model is ready, it changes to True. For an overview of Inference Services across all namespaces, run the kubectl get isvc -A command. It shows readiness status, URLs, and revision information, giving a quick view of deployment status and history.
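
The readiness check can be sketched in Python, assuming the service list is fetched as JSON (for example with kubectl get isvc -A -o json) and that each item carries a standard Kubernetes-style status.conditions list with a condition of type Ready. The sample data below is hand-written for illustration, and isvc_ready and summarize are hypothetical helper names.

```python
def isvc_ready(isvc):
    """Return the Ready condition status, or 'Unknown' when it is missing."""
    conditions = isvc.get("status", {}).get("conditions") or []
    for cond in conditions:
        if cond.get("type") == "Ready":
            return cond.get("status", "Unknown")
    return "Unknown"


def summarize(isvc_list):
    """Map 'namespace/name' to the readiness of each service in the list."""
    return {
        f'{item["metadata"]["namespace"]}/{item["metadata"]["name"]}': isvc_ready(item)
        for item in isvc_list.get("items", [])
    }


# Hand-written sample, not real cluster output
sample = {
    "items": [
        {"metadata": {"namespace": "default", "name": "llama-3-70b"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"namespace": "default", "name": "llama-3-8b"},
         "status": {}},
    ]
}
print(summarize(sample))
```

A service whose status block is still empty is reported as Unknown, which matches the initial state described above.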

For each deployed inference service, one pod is created. In this deployment, two pods are running, each hosting a distinct Llama-3 model (8B and 70B) on a different GPU of the XE9680 server.

To get the detailed information about the specified pod's configuration, status, and events use the kubectl describe pod command, which aids in troubleshooting and monitoring within Kubernetes.

After the pod is up and running, users can perform inference through the designated endpoints. Perform the curl request to verify whether the model is successfully served at the specific local host.

Users can also follow the ingress IP and ports method for inferencing.


Integrating Llama 3 with the Dell PowerEdge XE9680 and the AMD MI300X highlights the adaptability and efficiency of Kubernetes infrastructure. vLLM provides efficient model serving, and KServe supplies a Kubernetes-native serving layer for AI workloads.

  • AI
  • Dell
  • Hugging Face
  • Dell Portal
  • AI Made Easy

AI Agents Made Easy by Dell Enterprise Hub

Khushboo Rathi, Bala Rajendran

Wed, 22 May 2024 15:06:05 -0000



From Models to AI Agents: The Future of Intelligent Applications

Our previous blogs explored model selection criteria and model merging for complex applications. Now, let's delve into the next level: AI Agents.

AI Agents: Intelligent Collaboration

AI agents are software programs that learn and act autonomously. They can work alongside other models and tools, forming a powerful team to tackle your application's requirements.

With 2022 being the year of Generative AI and LLMs, and 2023 being the year of RAG, 2024 is poised to be the year of AI Agents. Some might even say they're already here. Unlike LLMs and RAG applications, AI Agents are better built for real-world interaction. They can not only process text but also execute tasks and make decisions, making them ideal for practical applications.

Seamless Flight Information with AI Agents: Imagine a simple yet powerful feature in your airline app:

  1. Speak Your Request: Tap the microphone icon and ask, "When is my flight tomorrow?".
  2. AI Agent #1: Understanding Your Voice: This first AI agent specializes in speech recognition, converting your spoken question into text.
  3. AI Agent #2: Finding Your Flight: The processed text is sent to another AI agent that specializes in querying the airline database, identifying you and retrieving your flight information.
  4. AI Agent #3: Real-Time Flight Status: The third AI agent, specializing in real-time flight data, checks the departure, boarding, and arrival times for your specific flight.
  5. AI Agent #1: Speaking the Answer: All the information is gathered and sent back to the first AI agent which converts the text into an audio response personalized for you: "Dear Khushboo, your Delta flight to Las Vegas is on time and departs at 3:00 PM. Would you like me to check you in?"

AI Agents offer a highly versatile architecture where each agent can be independently scaled to meet specific requirements, ensuring optimal performance with minimized costs. Dell Technologies’ diverse platform portfolio provides the ideal environment for running these agents efficiently.

Furthermore, this AI agent architecture allows for seamless A/B testing, guaranteeing reliable service and the implementation of best practices in model deployment for a superior user experience.


In this blog, we will share our guide for creating an AI Agent system that connects multiple models to solve a specific application need. We will use the LangChain framework to create a research article search agent utilizing Dell Enterprise Hub. For this guide, we chose the meta-llama/Meta-Llama-3-8b-Instruct model, a smaller yet powerful model refined through Reinforcement Learning from Human Feedback (RLHF).

Figure 1. Architecture of the AI Agent. The user sends a question to the frontend application. If the agent feature is enabled, an agent with two tools (arXiv search and a calculator) handles the question; otherwise, llama3-8B-instruct answers it directly.

Model deployment

  1. From Dell Enterprise Hub, select a model from the model dashboard to deploy. In this blog, meta-llama/Meta-Llama-3-8b-Instruct will be used to work with the agents. Select the deployment options that match your environment. 
  2. Now the Docker command to run TGI will be available to be copied. Paste the command in your Linux system. It will look something like the following: 
docker run \
    -it \
    --gpus 1 \
    --shm-size 1g \
    -p 8080:80 \
    -e NUM_SHARD=1 \
    -e MAX_INPUT_TOKENS=8000 \
    -e MAX_TOTAL_TOKENS=8192 \

This command will launch a container with the TGI server running the llama3-8b-instruct model on port 8080. 

UI interface

The UI server script is the combination of Gradio, LangChain agents, and associated tools. For this example, we have built an agent with arxiv, a tool that helps with searching and summarizing technical papers from

In the following code base, the inference endpoint url values must be changed to your specific server settings. The endpoint_url must point to the TGI server container from Dell Enterprise Hub shown in the model deployment section. You may change gr.Image to an image of your choosing. 

The following are the prerequisites to be installed before running the final UI agent code:

pip install gradio langchain langchain-community
pip install --upgrade --quiet  arxiv

Then, run it as:

import gradio as gr
from langchain_community.llms import HuggingFaceEndpoint
from langchain.chains import RetrievalQA
from langchain.agents import load_tools
from langchain.agents import initialize_agent

# Create endpoint for llm
llm = HuggingFaceEndpoint(
    # Placeholder: replace with the URL of your TGI server container.
    # Any further arguments are truncated in the source.
    endpoint_url="http://<TGI-server-IP>:8080",
)

# Generating response by the agent
def agent_gen_resp(mes):
    tools = load_tools(["arxiv", "llm-math"], llm=llm)
    # The arguments after `tools` are truncated in the source; the agent type
    # and flags below are assumptions
    agent = initialize_agent(tools,
                             llm,
                             agent="zero-shot-react-description",
                             verbose=True,
                             handle_parsing_errors=True)
    respond = agent.run(mes)
    return respond

# Inferencing using llm or llm+agent
def gen_response(message, history, agent_flag):
    if agent_flag == False:
        resp = llm(message)
    else:
        resp = agent_gen_resp(message)
    history.append((message, resp))
    return "", history

# Flag for agent use
def flag_change(agent_flag):
    if agent_flag == False:
        return gr.Textbox(visible=False)
    else:
        return gr.Textbox(visible=True)

# Creating gradio blocks
with gr.Blocks(theme=gr.themes.Soft(), css="footer{display:none !important}") as demo:
    with gr.Row():
        with gr.Column(scale=0):
            gr.Image("dell_tech.png", scale=0, show_download_button=False, show_label=False, container=False)
        with gr.Column(scale=4):
            gr.Markdown("# AI Agents made easy by Dell Enterprise Hub")
            gr.Markdown("## Using Meta-Llama-3-8B-Instruct")
    with gr.Row():
        chatbot = gr.Chatbot(scale=3)
    prompt = gr.Textbox(container=False)
    with gr.Row():
        agent_flag = gr.Checkbox(label="Enable ArXiv Agent", scale=4)
        clear = gr.ClearButton([prompt, chatbot])
    prompt.submit(gen_response, [prompt, chatbot, agent_flag], [prompt, chatbot])
    agent_flag.change(flag_change, agent_flag)

# Launching the application
demo.launch(server_name="", server_port=7860)



We ran an example where we asked the llama3-8b-instruct LLM model to summarize the paper 2401.00304 from The response from the model is shown in Figure 2. The base model fails to retrieve the correct article.

However, when the model is provided with the arxiv tool from Langchain, the model is able to retrieve the correct article. It then recognizes further instructions to summarize the abstract and produces the results shown in Figure 3. Figure 4 shows the thought processes of an agent, its corresponding actions, and the tools it used to get the results. 

Figure 2. Query the Llama3-8b-instruct to get the summary of the abstract of paper in

Figure 3. Enabling the Agent and asking it to summarize the paper from

Figure 4. Background process followed by the agent to come to the final answer


With a few clicks, you have a live, working AI agent implementation in which the model is chained with tools to solve a specific application requirement.

Eager for more? Check out the other blogs in this series to get inspired and discover what else you can do with the Dell Enterprise Hub and Hugging Face partnership.


Authors: Khushboo Rathi, Engineering Technologist

Bala Rajendran, AI Technologist

To see more from these authors, check out Bala Rajendran and Khushboo Rathi on Info Hub.

  • AI
  • Dell
  • Hugging Face
  • Dell Portal
  • AI Made Easy

Code Assistant Made Easy by Dell Enterprise Hub

Paul Montgomery, Bala Rajendran

Mon, 20 May 2024 15:06:37 -0000




Large language models (LLMs) are proving to be invaluable tools for developers as powerful code assistants. Code, after all, is a language: a precise expression of intent understood by compilers and programs. Since LLMs excel at understanding and manipulating language, it's no wonder they excel as code assistants, especially when trained on vast amounts of code in languages such as C, Python, and Ruby.

The recently-released Llama 3 8B model surpasses even CodeLlama, a previous generation model specifically designed for code generation. This blog delves into a method for implementing a powerful code assistant leveraging Llama 3 models.



Visual Studio Code, or VS Code for short, is a lightweight but powerful code editor that runs on Windows, macOS, and Linux. It supports many languages, extensions, debugging and Git integrations, and more. As such, it made sense for us to build this Llama 3-powered code assistant to integrate into VS Code. The Continue extension for VS Code will be the communication path between VS Code and Llama 3.

Figure 1. Architectural diagram of VS Code, Continue Extension, and Llama 3 deployed on-premise by Dell Enterprise Hub to create a powerful code assistant

The Continue extension takes user requests and converts them into RESTful API calls to the Hugging Face TGI (Text Generation Inference) endpoint, which is running Llama 3 from Dell Enterprise Hub.

Model deployment

  1. From Dell Enterprise Hub, select a model to deploy. In this example, meta-llama/Meta-Llama-3-8b-Instruct will be used.
  2. Select the deployment options that match your environment, and the Docker command will be updated.
  3. Copy and paste the following code into your Linux command line to start a TGI server. The command will look something like the following:
docker run \
    -it \
    --gpus 1 \
    --shm-size 1g \
    -p 8080:80 \
    -e NUM_SHARD=1 \
    -e MAX_INPUT_TOKENS=8000 \
    -e MAX_TOTAL_TOKENS=8192 \

This will start the TGI server and serve Llama 3 8B over a RESTful interface. The model is bundled with the Docker container to make this process easy. The Docker container is optimized and configured to run efficiently on your choice of Dell PowerEdge platforms.

VS Code integration

When the TGI server is hosting the model, VS Code can be configured to use the model for code assistance:

1. Once VS Code is running, click the Extensions icon.

2. Search for continue and install the “Continue - Llama 3, GPT-4, and more” extension.

3. Click the Continue icon (which looks like “>CD_”).

4. At the bottom of the Continue extension, there is a Configure Continue icon as shown in the following image. Click that, and a JSON configuration will be shown.


5. In the models section, a new entry will need to be made. The following JSON is an example:

    {
       "title": "TGI Llama 3 Instruct",
       "model": "llama3",
       "apiBase": "http://192.x.x.x:8080/v1",
       "completionOptions": {},
       "apiKey": "EMPTY",
       "provider": "openai"
    }

6. Customize the JSON configuration by setting apiBase to the IP address of the server that hosts your inference endpoint.

7. Save the configuration file modifications.

8. At the bottom of the Continue extension, there is a dropdown box with all the model configurations available. Set the configuration to TGI Llama 3 Instruct, as shown here:


9. You can also map to multiple models in this configuration. For example, you could map to a fine-tuned model on your organization’s code and bring your company’s coding guidelines and best practices into this code assistant framework.

The code assistant is now operational. In the Continue prompt text box, enter something similar to “Calculate Pi in Python”, and the code assistant will return several algorithm options from which to choose.

Capability examples

Following are a few examples of what the code assistant we have created can produce:

1. Create a Game: Here, we create a brick breaker game with the prompt “Make a Python brick breaker (breakout) game.”

Figure 2. VS Code with the prompt window on the left and generated code on the right. The prompt used here is “Make a Python brick breaker (breakout) game.”

Running the code creates a playable game that looks like the following:

Figure 3. The game created by running the code generated by the code assistant

2. Code Optimization: In this example, we provide context with @ and point to Fibonacci code written in C++ with the prompt, “Optimize my code below @ {code}”. You can also ask follow-up prompts in this context, such as “What is the time complexity between above mentioned code?”.

Figure 4. VS Code with the prompt “optimize my code below” and the context code with “@”

Figure 5. VS Code with generated output with optimized code and detailed explanation

Figure 6. VS Code output to the prompt, “What is the time complexity between the above-mentioned code”

With its longer context window, the generated output compares both versions of the code and gives detailed insight into their time complexity.

3. Code Conversion: Here, the prompt is “Convert this code to python @”. The code assistant does a great job of converting the code and describing the new Python code.

Figure 7. VS Code with the prompt, “convert the code to python” and “@” with context, pointing to the previously optimized Fibonacci series generator in C++

Figure 8. VS Code with generated output with Python code and “brief” explanation of the conversion


Like CodeLlama 70B, Llama 3 can analyze your existing code, suggest completions, explain code written by someone else as shown throughout this blog, and even generate entirely new content or sections based on the context. It can also perform intelligent problem-solving, such as recommending algorithms, data structures, and libraries to fit your needs. In this example, the code assistant even optimized existing code for better performance. Llama 3 also works across multiple programming languages, which allows the code assistant to port code from one language to another. These are exciting times for code assistants. For more innovations, check out these interesting takes on automating unit tests with LLMs: Automated Unit Test Improvement using Large Language Models and An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation.

Eager for more? Check out the other blogs in this series to get inspired and discover what else you can do with the Dell Enterprise Hub and Hugging Face partnership.


Authors: Paul Montgomery, Balachandran Rajendran

  • AI
  • Dell
  • Hugging Face
  • Dell Portal
  • AI Made Easy

Open Source RAG Made Easy by Dell Enterprise Hub

Paul Montgomery, Bala Rajendran

Wed, 22 May 2024 15:07:41 -0000



Beyond Pre-Trained LLMs: The Power of Retrieval-Augmented Generation (RAG)

While pre-trained large language models (LLMs) excel at factual tasks after fine-tuning, their ability to access and update knowledge remains limited. This can hinder performance on tasks requiring deep understanding. Additionally, tracing the source of their responses is challenging.

Enter Retrieval-Augmented Generation (RAG). This popular method overcomes these limitations by combining LLMs with a retrieval component. RAG offers several advantages:

  • Up-to-date Information: Access to external knowledge sources ensures responses reflect current information.
  • Context-Aware Responses: RAG considers context for more relevant and informative answers.
  • Source Attribution: RAG identifies the source of retrieved information, enhancing transparency.
  • Cost-Effectiveness: Implementation is often more efficient compared to complex task-specific architectures.
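
To make the retrieval step concrete, here is a toy sketch in plain Python. It ranks documents by bag-of-words cosine similarity, which stands in for the embedding model and vector database used later in this blog, and then places the best match into the prompt together with its source. All names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(tokenize(d["text"]))), reverse=True)
    return ranked[:k]


def build_prompt(query, retrieved):
    """Prepend retrieved passages, tagged with their source, to the question."""
    context = "\n".join(f'[{d["source"]}] {d["text"]}' for d in retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical sample documents
docs = [
    {"source": "deploy.txt", "text": "Select a model in the hub and copy the docker command to deploy it."},
    {"source": "lunch.txt", "text": "The cafeteria serves lunch from noon until two."},
]
question = "How do I deploy a model?"
top = retrieve(question, docs)
print(build_prompt(question, top))
```

Because each passage travels with its source tag, the answer can be traced back to a document, which is the source attribution advantage listed above.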

The field of RAG techniques is rapidly evolving with advancements like RePlug, REALM, FiD, TRIME, Self-RAG, and In-Context RALM.

This blog focuses on a simplified RAG implementation made effortless with Dell Enterprise Hub. With minimal clicks and code copy-pasting, you can have a basic RAG solution up and running for your documents.



Figure 1. Architecture of a simple RAG with UI powered by Gradio, LangChain framework, vector database, and Llama 3 powered by TGI from Dell Enterprise Hub

The first component of a simple RAG implementation is the Data Ingest during which the database is populated with data such as PDFs, Word documents, PowerPoint presentations, and other similar data sources. The next component is the UI. In this case, we chose Gradio for simplicity. The third component is the LLM itself. In this case, we chose the Llama 3 8B model from Dell Enterprise Hub. Let’s delve deeper into each of these components. 

Model deployment

  1. Go to the Dell Enterprise Hub, and select a model to deploy. In this example, meta-llama/Meta-Llama-3-8b-Instruct will be used. Select the deployment options that match your environment. 
  2. Now the Docker command to run TGI will be available to copy. Paste the command in your Linux system. It will look something like the following:
docker run \
    -it \
    --gpus 1 \
    --shm-size 1g \
    -p 8080:80 \
    -e NUM_SHARD=1 \
    -e MAX_INPUT_TOKENS=8000 \
    -e MAX_TOTAL_TOKENS=8192 \

This command will:

  • Download the model and install all required packages. If the model is already downloaded, it will use the latest version of the model.
  • Start the TGI server.
    • The TGI server will be served on port 8080.
    • Llama 3 8B will be served via a RESTful inference endpoint.


The UI server is a combination of LangChain and Gradio. Deploy using the following steps:

  1. Install Gradio:
  2. Install Langchain:
  3. Copy and paste the following code into a file. This example uses
  4. Customize the code example.
    • The huggingface_api_token and endpoint_url values must be changed to your specific settings. 
    • The endpoint_url must point to the Hugging Face TGI server URL. 
    • You may change the gr.Image to an image of your choosing.
  5. To run the program, run python3
  6. When this program runs, it serves the UI on port 7860. Open that port in a web browser.
from langchain_community.llms import HuggingFaceEndpoint
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
import gradio as gr

# The embedding model must match the one used by the database ingest code
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
vectordb = Chroma(embedding_function=embeddings, persist_directory='./chroma_db')
llm = HuggingFaceEndpoint(
    # Placeholders: replace with your own settings.
    # Any further arguments are truncated in the source.
    endpoint_url="http://<TGI-server-IP>:8080",
    huggingfacehub_api_token="<your-Hugging-Face-token>",
)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectordb.as_retriever())

# Answer with the LLM alone, or with retrieved context when RAG is enabled
def gen_response(message, history, rag_flag):
    vect_data = ""
    if rag_flag == False:
        resp = llm(message)
    else:
        resp = qa_chain({"query": message})['result'].strip()
        # Collect the retrieved chunks so the UI can display them
        docs = vectordb.similarity_search(message)
        for doc in docs:
            vect_data += str(doc) + "\n\n"
    history.append((message, resp))
    return "", history, vect_data

# Show the retrieved-context box only when RAG is enabled
def flag_change(rag_flag):
    if rag_flag == False:
        return gr.Textbox(visible=False)
    else:
        return gr.Textbox(visible=True)

with gr.Blocks(theme=gr.themes.Soft(), css="footer{display:none !important}") as demo:
    with gr.Row():
        with gr.Column(scale=0):
            gr.Image("dell_tech.png", scale=0, show_download_button=False, show_label=False, container=False)
        with gr.Column(scale=4):
            gr.Markdown("# Dell RAG Demo")
    with gr.Row():
        chatbot = gr.Chatbot(scale=3)
        data = gr.Textbox(lines=17, max_lines=17, show_label=False, scale=1)
    prompt = gr.Textbox(container=False)
    with gr.Row():
        rag_flag = gr.Checkbox(label="Enable RAG")
        clear = gr.ClearButton([prompt, chatbot])
    prompt.submit(gen_response, [prompt, chatbot, rag_flag], [prompt, chatbot, data])
    rag_flag.change(flag_change, rag_flag, data)

demo.launch(server_name="", server_port=7860)

Database ingest code

This code enables the ingestion of a directory of documents into the vector database, which powers the RAG implementation. Retain the default values unless you fully understand what effect the changes will have. For example, changing the embedding model will necessitate changes in the UI Server code to match. To execute:

  1. Install Langchain:
  2. Copy and paste the code into a file. In this example, is used.
  3. Create a directory with PDF files. In this example, the directory is ./data.
  4. The following program may be run with no parameters to get the help screen: python3
  5. To ingest documents, run python3 ./data
  6. All the documents should now be loaded into the vector database.
import argparse
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

def add_pdf_to_db(dir_name, dbname, chunk, overlap, model_name):
    loader = PyPDFDirectoryLoader(dir_name)
    docs = loader.load()
    print("Documents loaded")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk, chunk_overlap=overlap)
    all_splits = text_splitter.split_documents(docs)
    print("Documents split")
    embeddings = HuggingFaceEmbeddings(model_name=model_name)
    print("Embeddings created")
    vectordb = Chroma.from_documents(all_splits, embedding=embeddings, persist_directory=dbname)
    print("Documents added into the vector database")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        prog='Add documents to Vector DB',
        description='Add a directory of documents into the Chroma vector database')
    parser.add_argument('dir', type=str, help='Dir of PDF files to ingest')
    parser.add_argument('--dbname', type=str, help='Database directory', default='./chroma_db')
    parser.add_argument('--chunk', type=int, help='Document chunk size', default=500)
    parser.add_argument('--overlap', type=int, help='Document overlap size', default=20)
    parser.add_argument('--embedmodel', type=str, help='Embedding model name', default='sentence-transformers/all-MiniLM-L6-v2')
    args = parser.parse_args()
    add_pdf_to_db(args.dir, args.dbname, args.chunk, args.overlap, args.embedmodel)
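
To build intuition for the --chunk and --overlap defaults, the sketch below splits text into fixed-size character windows that share a few characters with their neighbors. It is a simplified stand-in: RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs and sentences, so its chunks will differ. The sliding_chunks name is hypothetical.

```python
def sliding_chunks(text, chunk_size=500, overlap=20):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window reached the end of the text
    return chunks


# Synthetic text whose characters make the overlap easy to verify
text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(text)
print([len(c) for c in chunks])  # [500, 500, 240]
```

The overlap keeps a sentence that straddles a boundary partly visible in both chunks, at the cost of storing a small amount of duplicated text.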

Before issuing queries to the RAG, check to make sure Hugging Face TGI (Text Generation Inference) is running. If the Enable RAG flag is active, the database will be queried for extra context. Otherwise, the query will be sent directly to the LLM without extra context. This enables easy comparisons for RAG vs non-RAG answers.

Example Usage

This blog was ingested using the database ingestion code and has been made ready for RAG. The prompt used is “How do I deploy a model using Dell Enterprise Hub?”. First, we will ask the question without RAG:

Figure 2. UI powered by Gradio with prompt, “How do I deploy a model using Dell Enterprise Hub?” and its generated result

The answer looks reasonable, but it is a hallucination. The instructions are incorrect. Now, we enable RAG and ask the same question:

Figure 3. UI powered by Gradio with prompt, “How do I deploy a model using Dell Enterprise Hub?” and its generated result with RAG enabled

This provides the correct response and shows the vector database contents used to generate the response on the right.


Delivering higher accuracy and better performance at scale, RAG offers a powerful approach for generating more informative, trustworthy, and adaptable responses from AI systems. Once you have a basic RAG implementation, you can mix and match different LLMs with different context window requirements for your applications, incorporate AI agents in the mix, create multiple vector databases for different kinds of documents, and incorporate different embedding models and chunking techniques into your pipeline. This technology is evolving as we speak, and what works for some applications might not work as accurately for others. That said, with a combination of various techniques, you might just find the sweet spot for your application.

Eager for more? Check out the other blogs in this series to get inspired and discover what else you can do with the Dell Enterprise Hub and Hugging Face partnership.


Authors: Paul Montgomery, Balachandran Rajendran

  • AI
  • Dell
  • Hugging Face
  • Dell Portal
  • AI Made Easy

Model Merging Made Easy by Dell Enterprise Hub

Khushboo Rathi, Bala Rajendran

Wed, 22 May 2024 15:08:58 -0000



Beyond Open-Source LLMs: Tailoring Models for Your Needs

The open-source LLM landscape is booming! But with so many options, choosing the right model can be overwhelming. What if you need a model with both domain-specific knowledge and diverse generation capabilities? Enter model merging, a powerful technique to unlock the full potential of LLMs.

Model merging: Unlocking model versatility

Model merging allows you to combine the strengths of different pre-trained models without additional training. This creates a "multitask" model, excelling in both specific domains and diverse generation tasks and addressing key challenges in AI like:

  • Catastrophic Forgetting: This occurs when a model learning new tasks forgets those previously learned. Merging preserves the original model’s capabilities.
  • Multitask Learning: Effectively training a model for multiple tasks can be difficult. Merging offers a way to combine pre-trained models with different strengths.

This blog explores the use of the MergeKit Python library to merge pre-trained LLMs like Mistral-7B-v0.1 and Zephyr-7B-beta. We'll demonstrate how to create a new model that leverages the strengths of both.


There are a variety of methods that can be used during model merging, such as Linear, Spherical Linear Interpolation (SLERP), TIES, DARE, Passthrough, and Task Arithmetic. For the purposes of this blog, we will be using the task arithmetic method, which computes a task vector for each model by subtracting the base model weights from that model's weights. This method works best with models that were fine-tuned from a common ancestor and share the same architecture. Hence, in this walk-through, we will merge the fine-tuned Zephyr-7B with its base model, Mistral-7B, to form our new merged model. Alternatively, you could merge your own domain-specific, highly fine-tuned version of Mistral-7B with the base Mistral-7B model.
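
To make the idea of a task vector concrete, here is a toy sketch of task arithmetic on tiny weight lists. It is an illustration under simplifying assumptions, not mergekit's implementation: real models hold tensors with billions of values, and the function name task_arithmetic_merge and the dictionary layout are hypothetical.

```python
def task_arithmetic_merge(base, finetuned_models, weights):
    """Merge per-parameter weights: base + sum_i w_i * (finetuned_i - base)."""
    merged = {}
    for name, base_values in base.items():
        merged_values = list(base_values)
        for model, w in zip(finetuned_models, weights):
            for i, value in enumerate(model[name]):
                # Task vector entry = fine-tuned weight minus base weight
                merged_values[i] += w * (value - base_values[i])
        merged[name] = merged_values
    return merged


# Toy "models" with a single two-element parameter
base = {"layer.weight": [1.0, 2.0]}
tuned = {"layer.weight": [2.0, 4.0]}
merged = task_arithmetic_merge(base, [tuned], [0.5])
print(merged)  # {'layer.weight': [1.5, 3.0]}
```

With a weight of 0.5, the merged parameter lands halfway between the base and the fine-tuned values, which is why the per-model weights act as a dial for how much of each fine-tune to keep.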

Figure 1. Architecture of model merging deployment, with UI powered by Gradio and Zephyr 7B, Mistral 7B, and the merged model all powered by TGI from Dell Enterprise Hub


The following describes the process for merging two models using mergekit and deploying the merged model to production:

1. Log in with your user access token from Hugging Face.

2. From Dell Enterprise Hub, select the models you would like to merge. For the purposes of this blog, we chose Zephyr-7b-beta and mistralai/Mistral-7B-v0.1.

docker run \
    -it \
    --gpus 1 \
    --shm-size 1g \
    -p 80:80 \
    -v /path/on/local_workspace:/Model_zephyr-7b-beta_weights \
    -e NUM_SHARD=1 \
    -e MAX_INPUT_TOKENS=8000 \
    -e MAX_TOTAL_TOKENS=8192 \

docker run \
    --gpus 2 \
    --shm-size 1g \
    -v /path/on/local_workspace:/Model_mistralai-mistral-7b-v0.1 \
    -v /home/$USER/autotrain:/app/autotrain \
    --model /app/model \
    --project-name fine-tune \
    --data-path /app/data \
    --text-column text \
    --trainer sft \
    --epochs 3 \
    --mixed_precision bf16 \
    --batch-size 2 \
    --peft \
    --quantization int4

3. Once we have the Dell optimized containers, the weights of the models must be stored locally to mount them on our training container. The weights can be found in the /model directory inside the container, as shown here:

#container ID of the image running the model 
kradmin@jpnode4:~$ docker ps
CONTAINER ID   IMAGE                                                                                      COMMAND                  CREATED          STATUS          PORTS                                                       NAMES
19c2e634c2ba   "/ …"   25 seconds ago   Up 25 seconds>80/tcp, :::8888->80/tcp                       compassionate_varahamihira
#Capture the container ID to execute the docker 
kradmin@jpnode4:~$  docker exec -it 19c2e634c2ba bash
#copying the weights outside from the container 
root@19c2e634c2ba:/usr/src# cd /model
root@19c2e634c2ba:/model# cp -r /model /Model_zephyr-7b-beta_weights

Now, the weights are stored locally in the folder Model_zephyr-7b-beta_weights outside the container. Follow the same process for the mistral-7b-v0.1 model weights.

4. Retrieve the training container from Dell Enterprise Hub, and mount both of these weights: 

docker run \
    -it \
    --gpus 1 \
    --shm-size 1g \
    -p 80:80 \
    -v /path/to/model_weights/:/Model_zephyr-7b-beta_weights \
    -v /path/to/mistral_model_weights/:/Model_mistralai-mistral-7b-v0.1 \
    -e NUM_SHARD=1 \
    -e MAX_INPUT_TOKENS=8000 \
    -e MAX_TOTAL_TOKENS=8192 \

5. Inside the training container, clone the mergekit toolkit locally and install the required packages:

git clone
cd mergekit 
pip install -e .  

6. Create a config YAML file and configure your merge method and percentage weights for each model. Following is the config file we used for the task arithmetic method. Feel free to experiment with various weights associated with the model to achieve optimal performance for your application:

models:
  - model: /path/to/your/huggingface_model/zephyr-7b-beta
    parameters:
      weight: 0.35
  - model: /path/to/your/huggingface_model/Mistral-7B-v0.1
    parameters:
      weight: 0.65
merge_method: task_arithmetic
dtype: bfloat16
7. The script mergekit-yaml is the main entry point for mergekit, taking your YAML configuration file and an output path to store the merged model:
mergekit-yaml path/to/your/config.yml ./output-model-directory  --allow-crimes  --copy-tokenizer  --out-shard-size 1B  --lazy-unpickle  --write-model-card 


We have run three container servers—Mistral-7B-v0.1, zephyr-7b-beta, and our new merged model. We have built a simple Gradio UI to compare the results from these three models. Check out our blog on model plug and play for a more in-depth implementation of the Gradio UI.

Figure 2. UI when the model Mistral 7B is selected and the inferencing results generated by Mistral 7B model for the prompt, “What is the python code to generate pi?”

Figure 3. UI when the model Zephyr-7b-beta is selected and the inferencing results generated by Zephyr-7b-beta model for the prompt, “What is the python code to generate pi?”

Figure 4. UI when the merged model is selected, and the inferencing results generated by the merged model


In this small-scale example, both the Mistral-7B-v0.1 and zephyr-7b-beta models failed to generate the correct text for the prompt “What is the python code to generate pi?”. However, the blended model generated the text successfully and accurately with no fine-tuning or prompt engineering needed. The core idea of model merging is that the whole is greater than the sum of its parts. Dell Enterprise Hub makes it easy to deploy these blended models at scale.

Eager for more? Check out the other blogs in this series to get inspired and discover what else you can do with the Dell Enterprise Hub and Hugging Face partnership.


Authors: Khushboo Rathi, Engineering Technologist,

Bala Rajendran, AI Technologist

To see more from these authors, check out Bala Rajendran and Khushboo Rathi on Info Hub.

  • AI
  • Dell
  • Hugging Face
  • Dell Portal
  • AI Made Easy

Model Plug and Play Made Easy by Dell Enterprise Hub

Paul Montgomery, Bala Rajendran

Wed, 22 May 2024 15:15:14 -0000



Seamless Model Deployment and A/B Testing with Dell Enterprise Hub

Dell Enterprise Hub simplifies model deployment—as detailed in our blog, Model Deployments Made Easy by Dell Enterprise Hub—and choosing the right model requires careful consideration—as discussed in our blog, Model Selection Made Easy by Dell Enterprise Hub.

In this blog, we will walk through creating a user-friendly "plug-and-play" code assistant powered by Dell Enterprise Hub. The solution features a common UI that empowers chat application end-users to conduct A/B testing between base models and their fine-tuned counterparts.

Key functionalities:

  • Model Selection: Choose which model receives prompts with multiple models running concurrently in the background.
  • Simultaneous Prompting: Send the same prompt to all production models simultaneously.

This guide streamlines model deployment and empowers users to experiment with different models, ultimately enhancing the chat application's effectiveness.


The user interface for the chatbot is powered by Gradio. The models deployed from Dell Enterprise Hub contain all relevant software dependencies, including Hugging Face’s Text Generation Inference and optimized configurations to run the model on Dell PowerEdge hardware. The Gradio UI server communicates with multiple Inference Endpoints powered by TGI to detect health and lets the user know when the chatbot is ready to chat.

Figure 1. Architectural diagram of a Gradio UI server communicating to multiple models

This architecture lets users interact with different models at the click of a button, allowing for easy comparison of different models from a single UI. The Gradio server itself does not require a GPU; however, the TGI servers will need at least one GPU, depending on the model deployed in the backend from Dell Enterprise Hub.

Model deployment

  1. From Dell Enterprise Hub, select a model that you would like to deploy and open the model details. For the purposes of this walk-through, we chose meta-llama/Meta-Llama-3-8b-Instruct. Click on the Deploy link and select the deployment options that fit your environment.
  2. Copy-paste the docker command from Dell Enterprise Hub to be executed on your Linux system. Following is an example of a command:
docker run \
    -it \
    --gpus 1 \
    --shm-size 1g \
    -p 8080:80 \
    -e NUM_SHARD=1 \
    -e MAX_INPUT_TOKENS=8000 \
    -e MAX_TOTAL_TOKENS=8192 \

This command will launch TGI and serve the Llama 3 model over a RESTful interface on port 8080. The Llama 3 model is bundled with the Docker container to make this process as simple as possible. 

Repeat steps 1 and 2 for deploying additional models or to deploy fine-tuned models from Dell Enterprise Hub.


User Interface

To implement the Gradio UI—a relatively simple and clean chat interface which communicates with the TGI servers—use the following steps:

  1. Install Gradio:
  2. Copy and paste the following code into a file. In this example, was used.
  3. Customize the code example.
    1. Make sure to change the RB_TEXT, RB_CONFIG, and CURR_URL variables to match your own environment. 
    2. There must be an entry in RB_CONFIG for every entry in RB_TEXT, and CURR_URL should point to the URL of the first entry of RB_CONFIG. 
    3. The gr.Image may be substituted for your own logo.
  4. The following command executes the UI server: python3
import gradio as gr
import json
import requests

# Display names for the model selector, in the order shown in the UI
RB_TEXT = ["meta-llama-3-70b-Dell", "mistral-7b-v0.1", "model-blend"]
# Base URL of the TGI server behind each display name
RB_CONFIG = {
    "meta-llama-3-70b-Dell": "http://192.x.x.x:8080",
    "mistral-7b-v0.1": "http://192.x.x.x:8082",
    "model-blend": "http://192.x.x.x:8081"
}
# URL of the currently selected model (the first RB_CONFIG entry at startup)
CURR_URL = "http://192.x.x.x:8080"

def rb_change(model_rb):
    # Switch the health check to the newly selected model
    global CURR_URL
    CURR_URL = RB_CONFIG[model_rb]

def check_conn(model_rb):
    # Report whether the selected TGI server answers on /health
    try:
        resp = requests.get(CURR_URL + '/health')
    except Exception:
        return "Disconnected"
    return "Ready"

def gen_response(message, history, model_rb, send_all):
    payload = {'inputs': message, 'parameters': {'max_new_tokens': 250, 'temperature': 0.1, 'repetition_penalty': 1.03}}
    header = {'Content-type': 'application/json'}
    if send_all:
        # Send the prompt to every configured model in turn
        for model in RB_TEXT:
            try:
                resp = requests.post(RB_CONFIG[model] + '/generate', json=payload, headers=header)
                json_resp = json.loads(resp.text)
                history.append((f"Model: {model} --- {message}", json_resp['generated_text']))
            except Exception:
                history.append((f"Model: {model} --- {message}", "[Disconnected]"))
    else:
        # Send the prompt only to the selected model
        try:
            resp = requests.post(RB_CONFIG[model_rb] + '/generate', json=payload, headers=header)
            json_resp = json.loads(resp.text)
            history.append((message, json_resp['generated_text']))
        except Exception:
            history.append((message, "[Disconnected]"))
    # Clear the prompt box and return the updated chat history
    return "", history

with gr.Blocks(theme=gr.themes.Soft(), css="footer{display:none !important}") as demo:
    with gr.Row():
        with gr.Column(scale=0):
            gr.Image("dell_tech.png", scale=0, show_download_button=False, show_label=False, container=False)
        with gr.Column(scale=4):
            pass  # column body not shown in the original listing
    chatbot = gr.Chatbot()
    model_rb = gr.Radio(RB_TEXT, label="Select Model", value="meta-llama-3-70b-Dell")
    with gr.Row():
        with gr.Column(scale=0):
            status = gr.Button("Checking...")
        with gr.Column(scale=2):
            prompt = gr.Textbox(container=False)
    with gr.Row():
        send_all = gr.Checkbox(label="Send to all models simultaneously")
        clear = gr.ClearButton([prompt, chatbot])
    prompt.submit(gen_response, [prompt, chatbot, model_rb, send_all], [prompt, chatbot])
    model_rb.input(rb_change, model_rb, None)
    # Refresh the Ready/Disconnected status every 3 seconds
    demo.load(check_conn, model_rb, status, every=3)
demo.launch(server_name="", server_port=7860)

This launches a web server on port 7860:

Figure 2. UI powered by Gradio with option to select the model

The Ready/Disconnected indicator next to the prompt text box indicates the health of the TGI server with which the model is associated. Behind the scenes, the Gradio application is reaching out over HTTP to see if the TGI server is running and sets the status appropriately.

Checking the Send prompts to all models simultaneously checkbox will cause any prompt to be sent to all models, and all responses will be shown in the chat window. We found it very useful to test different model responses and compare them with each other quickly using this custom UI.
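
When several models are slow to respond, contacting them one after another adds up. As a possible variation, the sketch below fans a prompt out to all endpoints concurrently using Python's standard thread pool. This is a sketch under assumptions: the names fan_out and fake_send are hypothetical, and fake_send stands in for a real HTTP call to each TGI server so that the example runs without any server.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(prompt, endpoints, send):
    """Send `prompt` to every endpoint concurrently; return {model_name: reply}."""
    if not endpoints:
        return {}

    def safe_send(item):
        name, url = item
        try:
            return name, send(url, prompt)
        except Exception:
            # Mirror the UI convention for unreachable servers
            return name, "[Disconnected]"

    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        return dict(pool.map(safe_send, endpoints.items()))


def fake_send(url, prompt):
    # Stand-in for an HTTP POST to <url>/generate (hypothetical helper)
    if "down" in url:
        raise ConnectionError("unreachable")
    return f"{url} says: {prompt}"


results = fan_out("hi", {"a": "http://a", "b": "http://down"}, fake_send)
print(results)
```

Passing the send function in as an argument keeps the fan-out logic testable without a running model server.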

Figure 3. A prompt and generated response from model 1

Figure 4. A prompt and generated response from model 2

Figure 5. Sending the same prompt to all models simultaneously

Figure 6. The third model was brought down for maintenance and UI reflects the status


Simple but very powerful, this architecture is highly useful when testing and validating your models and deploying a chatbot at scale. With this guide, you can set up a custom chatbot in a few minutes with just a few clicks. That is model plug and play made easy by Dell Technologies. 

Eager for more? Check out the other blogs in this series to get inspired and discover what else you can do with the Dell Enterprise Hub and Hugging Face partnership.

Authors: Paul Montgomery, Rajendran Balachandran

  • AI
  • Dell
  • Hugging Face
  • Dell Portal
  • AI Made Easy

Model Selection Made Easy by Dell Enterprise Hub

Bala Rajendran

Tue, 21 May 2024 20:06:41 -0000



With so many models available in Dell Enterprise Hub, which one should you use?

Figure 1. Dell Enterprise Hub

In reality, there is no one model to rule them all. Even if there were one, it would be inefficient and ineffective to use the same model for all your applications. The model best used as a point-of-sale chatbot is very different from the model for a domain-specific knowledge bot. Dell Enterprise Hub has a diverse array of popular high-performance model architectures to enable a wide range of customers and applications. We will continue to add more models to meet the needs of our customers, as model architectures, capabilities, and application needs evolve.

Let’s look at some of the most important criteria for selecting the right model for an application.

Size and capabilities

The number of parameters used while training, often referred to as the size of the model, varies per model. Larger models with more parameters tend to demonstrate superior capabilities; however, they tend to be slower and have higher computation costs. Sometimes, larger models support special techniques while a smaller model of the same architecture might not.

For example, Llama 3 70B uses Grouped Query Attention (GQA) for improved inference scalability to overcome computational complexity but not Sliding Window Attention (SWA) for handling sequences of arbitrary length with a reduced inference cost. In comparison, Mistral’s models support GQA, SWA, and the Byte-fallback BPE tokenizer which ensures that characters are never mapped to out-of-vocabulary tokens. As a unique feature, Dell Enterprise Hub maps a model and task to a Dell Platform, and thus model selection may also be limited by hardware requirements.

Training data

Different models are trained on different datasets. The quality and quantity of the training data vary from model to model. Llama 3 was trained on 15T tokens, all collected from publicly available sources. Compared to Llama 2, the Llama 3 training dataset is seven times larger and includes four times more code. Five percent of the Llama 3 training data consists of high-quality non-English datasets that cover 30 languages. Gemma models are trained on 6T tokens of web documents, code, and mathematics that help the model learn logical reasoning, symbolic representation, and mathematical queries. Compared to the Gemma and Llama 3 family of models, Mistral is fluent in English, French, Italian, German, and Spanish.

The quality and diversity of data sources is crucial when training powerful language models that handle a variety of tasks and text formats, whether the models are trained on data passed through heuristic filters, Not Safe for Work (NSFW) filters, Child Sexual Abuse Material (CSAM) filters, Sensitive Data Filtering, semantic deduplication approaches, or text classifiers to improve data quality.

Model evaluation benchmarks

Benchmarks provide good insight into a model’s application performance; however, they should be taken with a grain of salt. The datasets used in these benchmarks are public and can end up in the data used to train these models, which inflates the benchmark scores.

The assumption that the test prompts within a benchmark represent a random sample is incorrect. The correlation in model performance across test prompts is non-random, and accounting for correlation across tests reveals variability in model rankings on major benchmarks. This raises serious concerns about the validity of existing benchmarking studies and the future of evaluation using benchmarks.

Massive Multitask Language Understanding (MMLU), the most popular benchmark which uses multiple choice questions and answers to evaluate models, has been shown to be highly sensitive to minute details in the questions asked. Simple tweaks like changing the order of choices or the method of answer selection result in ranking shifts of up to 8 positions. To learn more about this phenomenon, check out these arXiv papers: Examining the robustness of LLM evaluation and When Benchmarks are Targets: Revealing the Sensitivity of Large.
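
One way to probe this sensitivity on your own evaluation set is to reorder the answer choices and check whether a model's pick follows the correct answer. The sketch below shows only the reordering step. The function name shuffle_choices and the sample question are hypothetical, and this is not the procedure used in the papers mentioned above.

```python
import random


def shuffle_choices(question, choices, answer_index, seed):
    """Return the question with choices reordered and the new index of the correct answer."""
    rng = random.Random(seed)  # seeded so a perturbation can be reproduced
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # Track where the correct answer moved to
    new_answer_index = order.index(answer_index)
    return question, shuffled, new_answer_index


choices = ["Venus", "Mars", "Jupiter", "Saturn"]
question, shuffled, new_index = shuffle_choices(
    "Which planet is known as the Red Planet?", choices, answer_index=1, seed=0
)
print(shuffled, "correct:", shuffled[new_index])
```

Running the same question under several seeds and comparing the model's accuracy across them gives a rough sense of how much the ordering alone moves the score.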

Model architectures

Most new LLMs are based on the transformer architecture, yet there are many architectural differences between them. Traditional large language models (LLMs) often use an encoder-decoder architecture. The encoder processes the entire input, and the decoder generates the output. A decoder-only model skips the encoder and directly generates the output based on the input it receives, one piece at a time. Llama 3 is a decoder-only model, which makes it well-suited for tasks that involve generating texts for chatbots, dialogue systems, machine translation, text summarization, and creative text generation, but not well-suited for tasks that require a deeper understanding of context.

BERT and T5 are among the best-known transformer models that include an encoder: BERT is encoder-only, while T5 is a full encoder-decoder architecture. Gemma is a decoder-only LLM. Implementation of techniques like GQA and SWA within the model delivers better inference performance for Mistral compared to its peers. Mixture of Experts (MOE) models like Mixtral 8x22B are sparse MOE models, which reduce inference costs by keeping only about 39B parameters active during inference across 8 experts, despite having about 141B parameters in total. On the other hand, fine-tuning and training are a lot more complex for MOE models compared to non-MOE architecture models. New techniques are constantly evolving.

Context windows

LLMs are stateless and do not understand the difference between one question and another. The short-term memory is built into the application, where previous inputs and outputs are fed back to the LLM to provide context and the illusion of continuous conversation. A larger context window allows the model to consider a broader context and could potentially lead to a more accurate response. Llama 2 7B has a context window of 4096 tokens (meaning the model can consider up to 4096 tokens of text while generating a response), whereas Gemma 7B has a context window of 8192 tokens. A RAG-based AI solution tends to need a much larger context window to facilitate high-quality retrieval and, as a result, high-quality generation from the LLM. Mixtral 8x22B has a 64K context window. Both Llama 3 8B and 70B have 8K context windows, but there are plans to increase that in future releases.
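
The application-side memory described above can be sketched as a function that replays as many recent turns as fit in the context window. This is a simplified illustration: the names count_tokens and build_context are hypothetical, whitespace splitting is a crude stand-in for the model's real tokenizer, and the "User:" and "Assistant:" labels are not counted against the budget.

```python
def count_tokens(text):
    # Whitespace split is a crude stand-in for a real tokenizer (assumption)
    return len(text.split())


def build_context(history, new_message, max_tokens):
    """Keep the newest turns that fit in the token budget, plus the new message."""
    budget = max_tokens - count_tokens(new_message)
    kept = []
    # Walk backwards so the most recent turns are kept first
    for user_msg, bot_msg in reversed(history):
        cost = count_tokens(user_msg) + count_tokens(bot_msg)
        if cost > budget:
            break
        kept.append((user_msg, bot_msg))
        budget -= cost
    kept.reverse()  # restore chronological order
    lines = []
    for user_msg, bot_msg in kept:
        lines.append(f"User: {user_msg}")
        lines.append(f"Assistant: {bot_msg}")
    lines.append(f"User: {new_message}")
    return "\n".join(lines)


history = [("one two three", "four five"), ("six", "seven")]
prompt = build_context(history, "eight nine", max_tokens=5)
print(prompt)
```

With a budget of five tokens, only the most recent turn survives alongside the new message, which is the same forgetting effect a chat application shows when a conversation outgrows the model's context window.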

Vocab size and head size

Vocab size, referring to the number of distinct words or tokens that the model can recognize and work with, is essentially the LLM’s vocabulary breadth. It is one of the most important criteria, yet it is often overlooked. A larger vocabulary size translates to a more nuanced understanding of language by the LLM; however, larger vocab sizes come with higher training costs.

Another interesting criterion is head size, which is specifically associated with the self-attention layer. The self-attention layer allows the model to identify relationships between parts of the input sequence. Head size determines the dimensionality of the output vectors produced by this layer. Imagine these vectors as representations of the input, where each dimension captures a different aspect. Head size influences the model’s capacity to capture different aspects of the relationship within the input sequence. More heads generally allow for richer understanding at the cost of increased computational complexity.
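
In the common multi-head attention layout, head size follows directly from two values in a model's configuration: the hidden size divided by the number of attention heads. The short sketch below shows that arithmetic. The function name head_size is hypothetical, the numbers are illustrative, and some architectures set the head size independently, so treat this as the typical case rather than a rule.

```python
def head_size(hidden_size, num_heads):
    """Per-head dimensionality when the hidden size is split evenly across heads."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


# Illustrative values: a hidden size of 4096 split across 32 heads
print(head_size(4096, 32))  # 128
```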


Open-source model licenses define how the users interact with the models and use them in their applications. The licenses grant them specific rights and responsibilities, ensuring transparency and collaboration with the open-source community. The models made available on Dell Enterprise Hub have the following license categories:

  • Apache 2.0: Permissive 
    • Allows the users to use, modify, and distribute the code for any purpose, including commercially. 
    • Requires users to maintain copyright and license notices in the code.
    • Offers a potential patent grant with certain patent rights associated with the licensed software.
  • Llama 2 and Llama 3: CopyLeft
    • More restrictive than Apache 2.0, emphasizing sharing and collaboration.
    • Enforces sharing of source code for any modifications or derivative works from Llama 3-licensed code.
    • Requires derivative works to also be released under the Llama 3 license.
  • Gemma: The Gemma license is similar to Apache 2.0 in terms of permissiveness, allowing free use and modification with certain exceptions that potentially limit some commercial uses.

We encourage developers to review the respective licenses in detail on the following pages: 


A multitude of pieces go into building an AI solution in addition to the model. In an AI solution powered by LLMs, it might not be just one model, but a combination of different models working together to deliver an elegant solution to a business challenge. Dell Technologies considered all of the criteria mentioned in this blog when creating this curated set of models for Dell Enterprise Hub in partnership with Hugging Face. Each month, newer and more powerful open-source LLMs are expected to be released, and Dell Technologies and Hugging Face will continue to add optimized support for more models to Dell Enterprise Hub.

Eager for more? Check out the other blogs in this series to get inspired and discover what else you can do with the Dell Enterprise Hub and Hugging Face partnership.


Author: Bala Rajendran, AI Technologist

To see more from this author, check out Bala Rajendran on Info Hub.


Hugging Face Model Deployments Made Easy by Dell Enterprise Hub

Ian Roche, Bala Rajendran

Mon, 20 May 2024 14:46:02 -0000




The Dell Enterprise Hub is a game changer for obtaining and using optimized models that run on some of the latest and greatest Dell hardware. The Dell Enterprise Hub has a curated set of models that have been containerized with all the software dependencies, optimized to run and validated on Dell hardware.

This blog shows how a user can go from the Dell Enterprise Hub portal to a running model in minutes. We will step through the setup from the beginning until one or more containers are running.


The Dell Optimized containers are built on top of the TGI framework. This allows a user to rely on all the existing benefits of TGI while running a container optimized for Dell hardware. In addition, once a Dell container is downloaded, it comes preconfigured with all the required model weights, so no additional searching is needed to have a running system. The trade-off is larger containers in exchange for simplicity and a lower risk of accidentally running incorrect model weights.

In this blog we look at the simpler case of deploying a model for inference. There are also containers that can be used for model training and fine-tuning and these will be covered in a future blog.

Server setup

During our testing we worked on different Dell Servers and GPUs. In this example we will focus on the 760xa servers for inference.



CPU: 2 x Intel(R) Xeon(R) Gold 6438M (32 cores each)
Memory: 512GB (16 x 32GB)
Storage: 2TB local storage + PowerScale F600 mounted share



This server has the capacity to run multiple inference sessions in parallel. It contains the maximum number of GPUs supported by this Dell server. If more GPUs are required for your model, then an XE9680 can be used that hosts up to 8 GPUs.


The software stack along with the versions we used are below:

  • Ubuntu 22.04
  • Docker 24.0.6
  • NVIDIA Container Toolkit 1.14.2

It's likely that other versions also work but this was what was running in our lab.

Top tip: We missed installing the NVIDIA Container Toolkit on one server. To avoid this, run a test container to confirm that the toolkit is working:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

If the toolkit is missing, follow the instructions at

Optimized containers from Dell Enterprise Hub

The optimized containers from Dell Enterprise Hub have three basic requirements - Docker, NVIDIA Container Toolkit, and Linux OS on Dell PowerEdge Platforms.

Select a model

The Dell Enterprise Hub contains an expanding set of models that are optimized to run on Dell hardware. To select a model, go to the Dell Enterprise Hub portal, log in using your Hugging Face username, and select your model of choice. For guidance, check out the Model Selection Made Easy by Dell Enterprise Hub blog. It is also possible to use your own fine-tuned model, but for this test we will use a prebuilt Llama 3 8B model. For more details on how to use the portal, see AI Made Easy Unleash the Potential of Dell Enterprise Hub on Hugging Face.

See below for a sample portal screen for deployment of the Llama 3 8B model on a Dell 760xa with L40S GPUs:  

Sample portal screen for Llama 3 deployment on a Dell 760xa with L40S GPUs

See the Dell Enterprise Hub Deploy page for Meta Llama 3 8B instructions.

The models on the Dell Enterprise Hub fall under three broad categories of licenses: Apache 2.0, Llama 3, and Gemma. Even though all of these models are permissive for enterprise usage, you will have to accept terms and conditions before accessing the models.

Container deployment

From the portal above the following Docker run command was generated:

docker run \
    -it \
    --gpus 2 \
     --shm-size 1g \
    -p 80:80 \
    -e NUM_SHARD=2 \
    -e MAX_INPUT_TOKENS=8000 \
    -e MAX_TOTAL_TOKENS=8192 \

This command can be performed as-is on your Dell server and the model will be pulled locally and run.

Note: The containers come prebuilt with all the weights, so some models can be more than 100GB. This Llama 3 8B model is ~27GB.

When running a standard Docker command (as above), there is no link between your Hugging Face account and the running model. Providing your Hugging Face Hub token creates that link and allows secure access to the models. You can specify the token in two ways:

  1. Set your token as an environment variable “HUGGING_FACE_HUB_TOKEN”
  2. Add it to each Docker container run command “-e HUGGING_FACE_HUB_TOKEN=$token”

It is important to secure your token and not to post it in any public repo. For full details on how to use tokens see How to use HF tokens and for token generation see Token Generation.
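
If you launch containers from a script, one way to keep the token out of your source code is to read it from the environment at run time. The sketch below builds the extra docker arguments from the HUGGING_FACE_HUB_TOKEN variable. The function name docker_token_args is hypothetical, and the token value in the example is a placeholder, not a real token.

```python
import os


def docker_token_args(env=None):
    """Return the extra `docker run` arguments that pass the token to the container."""
    env = os.environ if env is None else env
    token = env.get("HUGGING_FACE_HUB_TOKEN")
    if not token:
        raise RuntimeError("HUGGING_FACE_HUB_TOKEN is not set")
    # Never print or log the returned list: it contains the secret
    return ["-e", f"HUGGING_FACE_HUB_TOKEN={token}"]


# Example with a placeholder value instead of a real token
args = docker_token_args({"HUGGING_FACE_HUB_TOKEN": "hf_example"})
print(len(args), "arguments prepared")
```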

Testing the deployment

The TGI containers expose HTTP endpoints that can be used to perform queries in various formats. The full Swagger definition of the API is available at

For a simple test we can use the “generate” endpoint to POST a simple query to the model that we ran in the previous step:

 curl    \
  -X POST   \
  -d '{"inputs":"What is Dell Technologies World?", "parameters":{"max_new_tokens":50}}'   \
  -H 'Content-Type: application/json'

This produces the following output:

{"generated_text":" Dell Technologies World is an annual conference held by Dell Technologies, a multinational technology company. The conference is designed to bring together customers, partners, and industry experts to share knowledge, showcase new products and services, and network with others in the technology industry.\n"}

As can be seen, the response is generated and stays within the limit of 50 new tokens specified in the query.
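The same query can be sketched in Python using only the standard library. The host and port below are assumptions taken from the "-p 80:80" mapping in the Docker command (the endpoint address is not shown above), and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the container from the previous step is reachable on localhost,
# port 80 (from "-p 80:80"). Adjust the host for your own server.
GENERATE_URL = "http://localhost:80/generate"


def build_generate_request(prompt, max_new_tokens=50, url=GENERATE_URL):
    """Build (but do not send) a POST request for the "generate" endpoint."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Pull the generated text out of the JSON body returned by the endpoint."""
    return json.loads(response_body)["generated_text"]


def generate(prompt, max_new_tokens=50, timeout=60):
    """Send the query to a running container and return the generated text."""
    request = build_generate_request(prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_text(response.read().decode("utf-8"))
```

With a container running, generate("What is Dell Technologies World?") should return text similar to the output shown above.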


The Dell Enterprise Hub simplifies the deployment and execution of the latest AI models. The prebuilt containers run seamlessly on Dell Hardware.

In this example we showed how quick and easy it is to run the latest Llama 3 model on a 760xa with L40S GPUs. The Dell Enterprise Hub also supports training and fine-tuning models on Dell Hardware.  

Eager for more? Check out the other blogs in this series to get inspired and discover what else you can do with the Dell Enterprise Hub and Hugging Face partnership.

  • PowerEdge
  • GenAI
  • LLM
  • Generative AI
  • LLM training

Model Training – Dell Validated Design

Bertrand Sirodot, Damian Erangey

Fri, 03 May 2024 16:09:06 -0000




When it comes to large language models (LLMs), there is a fundamental question that everyone looking to leverage foundation models needs to answer: should I train my own model, or should I customize an existing model?

There can be strong arguments for either. In a previous post, Nomuka Luehr covered some popular customization approaches. In this blog, I will look at the other side of the question: training, and answer the following questions: Why would I train an LLM? What factors should I consider? I’ll also cover the recently announced Generative AI in the Enterprise – Model Training Dell Validated Design - A Scalable and Modular Production Infrastructure for AI Large Language Model Training. This is a collaborative effort between Dell Technologies and NVIDIA, aimed at facilitating high-performance, scalable, and modular solutions for training large language models (LLMs) in enterprise settings (more about that later).

Training pipeline

The data pipelines for training and customization are similar because both processes involve feeding specific datasets through the LLM.

In the case of training, the dataset is typically much larger than for customization, because customization is targeted at a specific domain. Remember, for training a model, the goal is to embed as much knowledge into the model as possible, so the dataset must be large. 

This raises the question of the dataset's accuracy and relevance. Curating and preparing the data is vital for ensuring the quality and relevance of what is fed into the LLM. It involves meticulously selecting and refining data to minimize biases and misinformation, which, if overlooked, could compromise the accuracy and reliability of the model's output. Data curation is not just about gathering information; it's about ensuring that the model's knowledge base is comprehensive, balanced, and reflects a wide array of perspectives.

When the dataset is curated and prepped, the actual process of training involves a series of steps where the data is fed through the LLM. The model generates outputs based on the input provided, which are then compared against expected results. Discrepancies between the actual and expected outputs lead to adjustments in the model's weights, gradually improving its accuracy through iterative refinement (using supervised learning, unsupervised learning, and so on).
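The loop described above can be illustrated with a deliberately tiny sketch: a one-weight model rather than an LLM, trained with plain gradient descent. The function name, learning rate, and sample data are hypothetical and chosen only for illustration.

```python
def train_linear_model(samples, learning_rate=0.01, epochs=200):
    """Fit y = weight * x by comparing outputs with expected results."""
    weight = 0.0
    for _ in range(epochs):
        for x, expected in samples:
            output = weight * x                 # the model generates an output
            error = output - expected           # discrepancy vs. the expected result
            gradient = 2 * error * x            # derivative of error**2 w.r.t. weight
            weight -= learning_rate * gradient  # adjust the weight
    return weight


# Toy dataset generated by y = 3 * x; training should recover a weight near 3.
samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
weight = train_linear_model(samples)
print(round(weight, 3))
```

An LLM applies the same compare-and-adjust idea to billions of weights at once, which is where the cost and complexity discussed next come from.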

While the overarching principle of this process might seem straightforward, it's anything but simple. Each step involves complex decisions, from selecting the right data and preprocessing it effectively to customizing the model's parameters for optimal performance. Moreover, the training landscape is evolving, with advancements, such as supervised and unsupervised learning, which offer different pathways to model development. Supervised learning, with its reliance on labeled datasets, remains a cornerstone for most LLM training regimes, by providing a structured way to embed knowledge. However, unsupervised learning, which explores data patterns without predefined labels, is gaining traction for its ability to unearth novel insights.

These intricacies highlight the importance of leveraging advanced tools and technologies. Companies like NVIDIA are at the forefront, offering sophisticated software stacks that automate many aspects of the process, and reducing the barriers to entry for those looking to develop or refine LLMs.

Network and storage performance

In the previous section, I touched on the dataset required to train or customize models. While having the right dataset is a critical piece of this process, being able to deliver that dataset fast enough to the GPUs running the model is another critical and yet often overlooked piece. To achieve that, you must consider two components:

  • Storage performance
  • Network performance

For anyone looking to train a model, a node-to-node (also known as East-West) network infrastructure based on 100Gbps or, better yet, 400Gbps is critical, because it provides enough bandwidth and throughput to keep the GPUs required for training, such as the NVIDIA H100, saturated.

Because customization datasets are typically smaller than full training datasets, a 100Gbps network can be sufficient, but as with everything in technology, your mileage may vary and proper testing is critical to ensure adequate performance for your needs.

Datasets used to train models are typically very large: in the 100s of GB. For instance, the dataset used to train GPT-4 is said to be over 550GB. With the advance of RDMA over Converged Ethernet (RoCE), GPUs can pull the data directly from storage. And because 100Gbps networks are able to support that load, the bottleneck has moved to the storage.
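As a rough back-of-the-envelope check, the sketch below computes the ideal time to move a dataset over a network link. It assumes full line rate with no protocol overhead and decimal gigabytes, so real transfers will be slower; the function name is hypothetical.

```python
def transfer_seconds(dataset_gb, link_gbps):
    """Ideal time to move dataset_gb gigabytes over a link_gbps link."""
    dataset_gigabits = dataset_gb * 8  # 1 byte = 8 bits
    return dataset_gigabits / link_gbps


for link_gbps in (100, 400):
    seconds = transfer_seconds(550, link_gbps)
    print(f"{link_gbps} Gbps: {seconds:.0f} s for a 550 GB dataset")
```

At these rates the network can deliver the whole dataset in well under a minute, which is why attention shifts to whether the storage can supply data that fast.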

Because of the nature of large language models, the datasets used to train them are made of unstructured data, such as SharePoint sites and document repositories, and are therefore most often hosted on network attached storage, such as Dell PowerScale. I am not going to get into further details on the storage part because I’ll be publishing another blog on how to use PowerScale to support model training. But you must design the storage carefully to ensure that it can keep up with the GPUs and the network.

A note about checkpointing

As we previously mentioned, the process of training is iterative. Based on the input provided, the model generates outputs, which are then compared against expected results. Discrepancies between the actual and expected outputs lead to adjustments in the model's weights, gradually improving its accuracy through iterative refinement. This process is repeated across many iterations over the entire training dataset.

A training run (that is, running an entire dataset through a model and updating its weights) is extremely time consuming and resource intensive. According to this blog post, a single training run of ChatGPT-4 costs about $4.6M. Imagine running a few of them in a row, only to have an issue and having to start again. Because of the cost associated with training runs, it is often a good idea to save the weights of the model at an intermediate stage during the run. Should something fail later on, you can load the saved weights and restart from that point. Snapshotting the weights of a model in this way is called checkpointing. The challenge with checkpointing is performance.
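The save-and-restart idea can be sketched in a few lines. This is a toy illustration, not how any particular training framework implements checkpointing: the JSON file format, the function names, and the stand-in weight update are all hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, weights):
    """Write the state to a temporary file, then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "weights": weights}, f)
    os.replace(tmp_path, path)  # avoids leaving a half-written checkpoint


def load_checkpoint(path):
    """Return (step, weights), or a fresh start if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, [0.0, 0.0]
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["weights"]


def train(path, total_steps, checkpoint_every=100, fail_at=None):
    """Run (or resume) training, saving a checkpoint every few steps."""
    step, weights = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated failure at step {step}")
        weights = [w + 0.01 for w in weights]  # stand-in for a real weight update
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, weights)
    return step, weights


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "checkpoint.json")
    try:
        train(ckpt, total_steps=500, fail_at=250)  # first attempt fails midway
    except RuntimeError:
        pass
    resumed_from, _ = load_checkpoint(ckpt)        # last checkpoint before the failure
    final_step, _ = train(ckpt, total_steps=500)   # second attempt resumes from it
```

The second attempt repeats only the steps after the last saved checkpoint instead of the whole run.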

A checkpoint is typically stored on an external storage system, so again, storage performance and network performance are critical considerations to offer the proper bandwidth and throughput for checkpointing. For instance, the Llama-2 70B consumes about 129GB of storage. Because each of its checkpoints is the exact same predictable size, they can be saved quickly (to disk) to ensure the proper performance of the training process.
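The size quoted above is consistent with simple arithmetic on the weights alone. The sketch below assumes roughly 69 billion parameters stored as 16-bit values and a hypothetical sustained write speed of 10 GiB/s; neither figure comes from this blog, and a full training checkpoint that also stores optimizer state would be larger.

```python
def checkpoint_gib(num_parameters, bytes_per_parameter=2):
    """Size of the model weights in GiB (weights only, no optimizer state)."""
    return num_parameters * bytes_per_parameter / 2**30


def write_seconds(size_gib, storage_gib_per_s):
    """Ideal time to write a checkpoint at a sustained storage throughput."""
    return size_gib / storage_gib_per_s


size = checkpoint_gib(69e9)                       # assumed parameter count
seconds = write_seconds(size, storage_gib_per_s=10)  # hypothetical write speed
print(f"{size:.1f} GiB, about {seconds:.1f} s at 10 GiB/s")
```

Because training pauses or slows while a checkpoint is written, this write time is paid at every checkpoint interval.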

NVIDIA software stack

The choice of which framework to use depends on whether you typically lean more towards doing it yourself or buying specific outcomes. The benefit of doing it yourself is ultimate flexibility, sometimes at the expense of time to market, whereas buying an outcome can offer better time to market at the expense of having to choose within a pre-determined set of components. In my case, I have always tended to favor buying outcomes, which is why I want to cover the NVIDIA AI Enterprise (NVAIE) software stack at a high level.

The following figure is a simple layered cake that showcases the various components of the NVAIE, in light green.

The image represents the layers of the stack required to train models. At the bottom is the hardware layer with, going from left to right, Dell PowerEdge servers for management, Dell PowerSwitch for the network infrastructure, Dell PowerScale storage, Dell PowerEdge R760xa with L40S GPU, Dell XE9680 and Dell XE8640.

The white paper Generative AI in the Enterprise – Model Training Dell Validated Design provides an in-depth exploration of a reference design developed by Dell Technologies in collaboration with NVIDIA. It offers enterprises a robust and scalable framework to train large language models effectively. Whether you're a CTO, AI engineer, or IT executive, this guide addresses the crucial aspects of model training infrastructure, including hardware specifications, software design, and practical validation findings.

Training the Dell Validated Design architecture

The validated architecture aims to give the reader a broad output of model training results. We used two separate configuration types across the compute, network and GPU stack.

Both configurations use 8x PowerEdge XE9680 servers with 8x NVIDIA H100 SXM GPUs. The difference between the configurations is the network: the first is equipped with eight ConnectX-7 adapters, the second with four. Both are configured for NDR.

The diagram represents the architecture designed to train the model. It includes a PowerSwitch N3248 at the bottom for out-of-band management; two NVIDIA QM9790 InfiniBand switches connected to the PowerEdge XE9680 servers through ConnectX adapters; and two PowerSwitch S54xx switches for connectivity to the data center network, to the PowerScale F710 all-NVMe NFS storage, and to the PowerEdge R660 running NVIDIA Base Command Manager Essentials.

On the storage side, the evolution of PowerScale continues to thrive in the AI domain with the launch of its latest line, including the notable PowerScale F710. This addition embraces Dell PowerEdge 16G servers, heralding a new era in performance capabilities for PowerScale's all-flash nodes. On the software side, the F710 benefits from the enhanced performance features found in the PowerScale OneFS 9.7 update.

Key takeaways

The guide provides training times for the Llama 2 7B and Llama 2 70B models over 500 steps, with variations based on the number of nodes and configurations used.

Why only 500 steps? The decision to train models for a set number of steps (500), rather than to train models for convergence, is practical for validation purposes. It allows for a consistent benchmarking metric across different scenarios and models, to produce a clearer comparison of infrastructure efficiency and model performance in the early stages.

Efficiency of Model Sizing: The choice of 7B and 70B Llama 2 model architectures indicates a strategic approach to balance computational efficiency with potential model performance. Smaller models like the 7B are quicker to train and require fewer resources, making them suitable for preliminary tests and smaller-scale applications. On the other hand, the 70B model, while significantly more resource-intensive, was chosen for its potential to capture more complex patterns and provide more accurate outputs.

Configuration and Resource Optimization: Comparing two hardware configurations provides valuable insights into optimizing resource allocation. While higher-end configurations (Configuration 1 with 8 adapters) offer slightly better performance, you must weigh the marginal gains against the increased costs. This highlights the importance of tailoring the hardware setup to the specific needs and scale of the project, where sometimes, a less maximalist approach (Configuration 2 with 4 adapters) can provide a more balanced cost-to-benefit ratio, especially in smaller setups. Certainly something to think about!

Parallelism Settings: The specific settings for tensor and pipeline parallelism (as covered in the guide), along with batch sizes and sequence lengths, are crucial for training efficiency. These settings impact the training speed and model performance, indicating the need for careful tuning to balance resource use with training effectiveness. The choice of these parameters reflects a strategic approach to managing computational loads and training times.
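To make the relationship between these settings concrete, the sketch below uses the usual convention that the total GPU count equals tensor-parallel size times pipeline-parallel size times data-parallel size. The 64-GPU total assumes eight servers with eight GPUs each, and the tensor, pipeline, and batch values are hypothetical, not the ones used in the guide.

```python
def data_parallel_size(total_gpus, tensor_parallel, pipeline_parallel):
    """Number of model replicas left after tensor and pipeline parallelism."""
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if total_gpus % gpus_per_replica != 0:
        raise ValueError("total_gpus must be divisible by TP * PP")
    return total_gpus // gpus_per_replica


def global_batch_size(micro_batch, accumulation_steps, data_parallel):
    """Sequences processed per optimizer step across all replicas."""
    return micro_batch * accumulation_steps * data_parallel


total_gpus = 8 * 8                                   # assumed: 8 servers x 8 GPUs
dp = data_parallel_size(total_gpus, tensor_parallel=4, pipeline_parallel=4)
batch = global_batch_size(micro_batch=1, accumulation_steps=32, data_parallel=dp)
print(dp, batch)
```

Raising tensor or pipeline parallelism lets a larger model fit but leaves fewer replicas for data parallelism, which is the trade-off that tuning has to balance.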

To close

With the scalable and modular infrastructure designed by Dell Technologies and NVIDIA, you are well-equipped to embark on or enhance your AI initiatives. Leverage this blueprint to drive innovation, refine your operational capabilities, and maintain a competitive edge in harnessing the power of large language models.

Authors: Bertrand Sirodot and Damian Erangey

  • AI
  • Artificial Intelligence
  • inferencing
  • XE9680
  • GenAI
  • LLM
  • Meta
  • Llama

Running Meta Llama 3 Models on Dell PowerEdge XE9680

Tao Zhang, Khushboo Rathi, Onur Celebioglu

Thu, 02 May 2024 17:54:12 -0000




Meta recently open-sourced its Meta Llama 3 text-to-text models in 8B and 70B sizes, which are the highest-scoring LLMs open-sourced so far in their size ranges in terms of quality of responses[1]. In this blog, we run those models on the Dell PowerEdge XE9680 server and compare them with the Llama 2 models to show their performance and improvement.

Open-sourcing large language models (LLMs) gives easy access to this state-of-the-art technology and has accelerated innovation in the field and adoption across applications. As shown in Table 1, this Llama 3 release includes five models in total: two pre-trained models with sizes of 8B and 70B, their instruction-tuned versions, and a safeguard version for the 8B model[2].

Table 1: Released Llama 3 Models

Model size (Parameters) | Model names | Training tokens | Vocabulary length
8B | Meta-Llama-3-8B, Meta-Llama-3-8B-Instruct, Meta-Llama-Guard-2-8B | 15T | 128K
70B | Meta-Llama-3-70B, Meta-Llama-3-70B-Instruct | 15T | 128K

Llama 3 is trained on 15T tokens, which is 7.5X the number of tokens on which Llama 2 was trained. Training with large, high-quality datasets and refined post-training processes improved the Llama 3 models' capabilities, such as reasoning, code generation, and instruction following. Evaluated across the main accuracy benchmarks, the Llama 3 model not only exceeds its predecessor, but also leads other major open-source models by significant margins. The Llama 3 70B instruct model shows close or even better results compared to commercial closed-source models such as Gemini Pro[1].

The model architecture of Llama 3 8B is similar to that of Llama 2 7B with one significant difference. Besides a larger parameter size, the Llama 3 8B model uses the grouped-query attention (GQA) mechanism instead of the multi-head attention (MHA) mechanism used in the Llama 2 7B model. Unlike MHA, which has the same number of Q (query), K (key), and V (value) matrices, GQA reduces the number of K and V matrices required by sharing the same KV matrices across grouped Q matrices. This reduces the memory required and improves computing efficiency during the inferencing process[3]. In addition, the Llama 3 models increased the max context window length to 8192, compared to 4096 for the Llama 2 models. Llama 3 uses a new tokenizer based on tiktoken that expands the vocabulary size to 128K, compared to the 32K used in Llama 2. The new tokenization scheme offers improved token efficiency, yielding up to 15% fewer tokens compared to Llama 2 based on Meta’s benchmark[1].
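The sharing of KV matrices across grouped Q matrices can be sketched as a simple index mapping. The head counts below (32 query heads with 32 KV heads for Llama 2 7B, 32 query heads with 8 KV heads for Llama 3 8B) are assumptions taken from the published model configurations rather than from this blog, and the function name is hypothetical.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Return the index of the K/V head that a given query head attends with."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = num_query_heads // num_kv_heads  # query heads sharing one KV head
    return query_head // group_size


# MHA: every query head has its own K/V head (assumed Llama 2 7B layout).
mha = [kv_head_for_query_head(q, 32, 32) for q in range(32)]

# GQA: groups of four query heads share one K/V head (assumed Llama 3 8B layout).
gqa = [kv_head_for_query_head(q, 32, 8) for q in range(32)]

print(len(set(mha)), len(set(gqa)))  # distinct K/V heads that must be stored
```

Under these assumptions, GQA keeps a quarter as many K and V heads per layer, which is where the memory saving comes from.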

This blog focuses on the inference performance tests of the Llama 3 models running on the Dell PowerEdge XE9680 server, especially, the comparison with Llama 2 models, to show the improvements in the new generation of models.

Test setup

The server we used to benchmark the performance is the PowerEdge XE9680 with 8x H100 GPUs[4]. The detailed server configurations are shown in Table 2.

Table 2: XE9680 server configuration

System Name | PowerEdge XE9680
System Type | Data Center
Number of Nodes |
Host Processor Model | 4th Generation Intel® Xeon® Scalable Processors
Host Processor Name | Intel® Xeon® Platinum 8470
Host Processors per Node |
Host Processor Core Count |
Host Processor Frequency | 2.0 GHz, 3.8 GHz Turbo Boost
Host Memory Capacity and Type | 2TB, 32x 64 GB DIMM, 4800 MT/s DDR5
Host Storage Capacity | 1.8 TB, NVME
GPU Number and Name | 8x H100
GPU Memory Capacity and Type | 80GB, HBM3
GPU High-speed Interface | PCIe Gen5 / NVLink Gen4

The XE9680 is the ideal server, optimized for AI workloads with its 8x NVSwitch interconnected H100 GPUs. The high-speed NVLink interconnect allows deployment of large models like Llama 3 70B that need to span multiple GPUs for best performance and memory capacity requirements. With its 10 PCIe slots, the XE9680 also provides a flexible PCIe architecture that enables a variety of AI fabric options.

For these tests, we deployed the Llama 3 models Meta-Llama-3-8B and Meta-Llama-3-70B, and the Llama 2 models Llama-2-7b-hf and Llama-2-70b-hf. These models are available for download from Hugging Face after Meta approves access. For a fair comparison between Llama 2 and Llama 3 models, we ran the models with native precision (float16 for Llama 2 models and bfloat16 for Llama 3 models) instead of any quantized precision.

Given that it has the same basic model architecture as Llama 2, Llama 3 can easily be integrated into any available software eco-system that currently supports the Llama 2 model. For the experiments in this blog, we chose NVIDIA TensorRT-LLM latest release (version 0.9.0) as the inference framework. NVIDIA CUDA version was 12.4; the driver version was 550.54.15. The operating system for the experiments was Rocky Linux 9.1.

Knowing that the Llama 3 improved accuracy significantly, we first concentrated on the inferencing speed tests. More specifically, we tested the Time-to-First-Token (TTFT) and throughput over different batch sizes for both Llama 2 and Llama 3 models, as shown in the Results section. To make the comparison between two generations of models easy, and to mimic a summarization task, we kept the input token length and output token length at 2048 and 128 respectively for most of the experiments. We also measured throughput of the Llama 3 with the long input token length (8192), as one of the most significant improvements. Because H100 GPUs support the fp8 data format for the models with negligible accuracy loss, we measured the throughput under long input token length for the Llama 3 model at fp8 precision.
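For readers who want to collect the same two metrics, here is a minimal, framework-independent sketch. It is not the measurement harness used for these tests: it assumes a hypothetical token stream that yields once per decoding step, and it defines throughput as output tokens per second summed across the batch, measured over the whole request.

```python
import time


def measure_generation(token_stream, batch_size=1, clock=time.perf_counter):
    """Return (ttft_seconds, output_tokens_per_second) for one batched request.

    token_stream is any iterable that yields once per decoding step; each step
    is assumed to produce one new token for every sequence in the batch.
    """
    start = clock()
    first_token_at = None
    steps = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = clock()  # the prefill and first decode are done
        steps += 1
    end = clock()
    if first_token_at is None:
        raise ValueError("the stream produced no tokens")
    ttft = first_token_at - start
    throughput = steps * batch_size / (end - start)
    return ttft, throughput
```

The clock parameter exists so the function can be checked with fixed timestamps; in normal use the default timer is sufficient.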


Figure 1: Inference speed comparison: Llama-3-70b vs Llama-2-70b: Time-to-First-Token

Figure 2: Inference speed comparison: Llama-3-70b vs Llama-2-70b: Throughput

Figures 1 and 2 show the inference speed comparison with the 70b Llama 2 (Llama-2-70b) and Llama 3 (Llama-3-70b) models running across eight H100 GPUs in a tensor parallel (TP=8) fashion on an XE9680 server. From the test results, we can see that for both TTFT (Figure 1) and throughput (Figure 2), the Llama 3 70B model has a similar inference speed to the Llama 2 70b model. This is expected given the same size and architecture of the two models. So, by deploying Llama 3 instead of Llama 2 on an XE9680, organizations can immediately see a big boost in accuracy and quality of responses, using the same software infrastructure, without any impact to latency or throughput of the responses.

Figure 3: Inference speed comparison: Llama-3-8b vs Llama-2-7b: Time-to-First-Token

Figure 4: Inference speed comparison: Llama-3-8b vs Llama-2-7b: Throughput

Figures 3 and 4 show the inference speed comparison with the 7b Llama 2 (Llama-2-7b) and 8b Llama 3 (Llama-3-8b) models running on a single H100 GPU on an XE9680 server. From the results, we can see the benefits of using grouped-query attention (GQA) in the Llama 3 8B architecture, in terms of reducing the GPU memory footprint in the prefill stage and speeding up the calculation in the decoding stage of the LLM inferencing. Figure 3 shows that Llama 3 8B has a similar response time in generating the first token even though it is a 15% larger model compared to Llama-2-7b. Figure 4 shows that Llama-3-8b has higher throughput than Llama-2-7b when the batch size is 4 or larger. The benefits of GQA grow as the batch size increases. We can see from the experiments that:

  • the Llama-2-7b cannot run at a batch size of 64 or larger with the 16-bit precision and given input/output token length, because of the OOM (out of memory) error
  • the Llama-3-8b can run at a batch size of 128, which gives more than 2x throughput with the same hardware configuration
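A rough estimate of KV-cache memory helps explain these two observations. The layer count, head counts, and head dimension below are assumptions taken from the published model configurations, not from this blog, and the estimate ignores model weights, activations, and any cache optimizations in the inference framework.

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                 bytes_per_value=2):
    """Approximate KV-cache size in GiB at 16-bit precision."""
    # Each token stores one K and one V vector per KV head in every layer.
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * seq_len * batch_size / 2**30


SEQ_LEN = 2048 + 128  # input plus output tokens used in the tests

# Assumed configs: 32 layers, head dimension 128; 32 KV heads (MHA) vs 8 (GQA).
llama2_7b = kv_cache_gib(32, 32, 128, SEQ_LEN, batch_size=64)
llama3_8b = kv_cache_gib(32, 8, 128, SEQ_LEN, batch_size=128)

print(f"Llama-2-7b at batch 64:  {llama2_7b:.0f} GiB")
print(f"Llama-3-8b at batch 128: {llama3_8b:.0f} GiB")
```

Under these assumptions, the Llama-2-7b cache at a batch size of 64 plus its 16-bit weights would not fit in the 80GB of a single H100, whereas the Llama-3-8b cache at a batch size of 128 is half that size.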

Figure 5: Llama-3-70b throughput under 8192 input token length

Another improvement of the Llama 3 model is that it supports a max input token length of 8192, which is twice that of a Llama 2 model. We tested it with the Llama-3-70b model running on 8 H100 GPUs of the XE9680 server. The results are shown in Figure 5. The throughput is linearly proportional to the batch size tested and can achieve 271 tokens/s at a batch size of 16, indicating that more aggressive quantization techniques can further improve the throughput.


In this blog, we investigated the recently released Llama 3 models by comparing their inferencing speed with that of the Llama 2 models on a Dell PowerEdge XE9680 server. With the numbers collected through experiments, we showed that not only is the Llama 3 model series a big leap in terms of the quality of responses, it also has great inferencing advantages in terms of high throughput with a large achievable batch size and a long input token length. This makes Llama 3 models great candidates for long-context and offline processing applications.

Authors: Tao Zhang, Khushboo Rathi, and Onur Celebioglu

[1] Meta AI, “Introducing Meta Llama 3: The most capable openly available LLM to date”,

[3] J. Ainslie et al., “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints”,

  • deep learning
  • Intel
  • PowerEdge
  • VMware
  • GPU
  • PowerScale

Scaling Deep Learning Workloads in a GPU Accelerated Virtualized Environment

Srinivas Varadharajan, Bala Chandrasekaran

Mon, 29 Apr 2024 14:30:31 -0000




Demand for compute, parallel, and distributed training is ever increasing in the field of deep learning (DL). The introduction of large-scale language models such as Megatron-Turing NLG (530 billion parameters; see References 1 below) highlights the need for newer techniques to handle parallelism in large-scale model training. Impressive results from transformer models in natural language have paved a way for researchers to apply transformer-based models in computer vision. The ViT-Huge (632 million parameters; see References 2 below) model, which uses a pure transformer applied to image patches, achieves amazing results in image classification tasks compared to state-of-the-art convolutional neural networks.

Larger DL models require more training time to achieve convergence. Even smaller models such as EfficientNet (43 million parameters; see References 3 below) and EfficientNetV2 (24 million parameters; see References 3 below) can take several days to train depending on the size of data and the compute used. These results clearly show the need to train models across multiple compute nodes with GPUs to reduce the training time. Data scientists and machine learning engineers can benefit by distributing the training of a DL model across multiple nodes. The Dell Validated Design for AI shows how software-defined infrastructure with virtualized GPUs is highly performant and suitable for AI (Artificial Intelligence) workloads. Different AI workloads require different resource sizing, isolation of resources, use of GPUs, and a better way to scale across multiple nodes to handle the compute-intensive DL workloads.

This blog demonstrates the use and performance across various settings such as multinode and multi-GPU workloads on Dell PowerEdge servers with NVIDIA GPUs and VMware vSphere.

System Details

The following table provides the system details:

Table 1: System details

Server | Dell PowerEdge R750xa (NVIDIA-Certified System)
Processor | 2 x Intel Xeon Gold 6338 CPU @ 2.00 GHz
GPU | 4 x NVIDIA A100 PCIe
Network Adapter | Mellanox ConnectX-6 Dual port 100 GbE
Storage | Dell PowerScale
ESXi Version |
BIOS Version |
GPU Driver Version |
CUDA Version |

Setup for multinode experiments

To achieve the best performance for distributed training, we need to perform the following high-level steps when the ESXi server and virtual machines (VMs) are created:

  1. Enable Address Translation Services (ATS) on VMware ESXi and VMs to enable peer to peer (P2P) transfers with high performance.
  2. Enable ATS on the ConnectX-6 NIC.
  3. Use the ibdev2netdev utility to display the installed Mellanox ConnectX-6 card and mapping between logical and physical ports, and enable the required ports.
  4. Create a Docker container with Mellanox OFED drivers, Open MPI Library, and NVIDIA optimized TensorFlow (the DL framework that is used in the following performance tests).
  5. Set up a keyless SSH login between VMs.
  6. When configuring multiple GPUs in the VM, connect the GPUs with NVLink.

Performance evaluation

For the evaluation, we used VMs with 32 CPUs, 64 GB of memory, and GPUs (depending on the experiment). The evaluation of the training performance (throughput) is based on the following scenarios:

  • Scenario 1—Single node with multiple VMs and multi-GPU model training
  • Scenario 2—Multinode model training (distributed training)

Scenario 1

Imagine the case in which there are multiple data scientists working on building and training different models. It is vital to strictly isolate resources that are shared between the data scientists to run their respective experiments. How effectively can the data scientists use the resources available?

The following figure shows several experiments on a single node with four GPUs and the performance results. For each of these experiments, we run tf_cnn_benchmarks with the ResNet50 model with a batch size of 1024 using synthetic data.

Note: The NVIDIA A100 GPU supports a NVLink bridge connection with a single adjacent NVIDIA A100 GPU. Therefore, the maximum number of GPUs added to a single VM for multi-GPU experiments on a Dell PowerEdge R750xa server is two.

Figure 1: Performance comparison of multi-VMs and multi-GPUs on a single node

Figure 1 shows the throughput (the average on three runs) of three different experiment setups:

  • Setup 1 consists of a single VM with two GPUs. This setup might be beneficial to run a machine learning workload, which needs more GPUs for faster training (5500 images/second) but still allows the remaining resources in the available node for other data scientists to use.
  • Setup 2 consists of two VMs with one GPU each. We get approximately 2700 images/second on each VM, which can be useful to run multiple hyper-parameter search experiments to fine-tune the model.
  • Setup 3 consists of two VMs with two GPUs each. We use all the GPUs available in the node and show the maximum cumulative throughput of approximately 11000 images/second between two VMs.

Scenario 2

Training large DL models requires a large amount of compute. We also need to ensure that the training is completed in an acceptable amount of time. Efficient parallelization of deep neural networks across multiple servers is important to achieve this requirement. There are two main approaches to distributed training: data parallelism and model parallelism. In data parallelism, the same model is replicated on all nodes, and we feed different batches of input data to each node. In model parallelism, we divide the model weights among the nodes, and the same minibatch data is trained across the nodes.

In this scenario, we look at the performance of data parallelism while training the model using multiple nodes. Each node receives different minibatch data. In our experiments, we scale to four nodes with one VM and one GPU each.

To help with scaling models to multiple nodes, we use Horovod (see References 6 below), which is a distributed DL training framework. Horovod uses the Message Passing Interface (MPI) to communicate efficiently between the processes.

MPI concepts include:

  • Size indicates the total number of processes. In our case, the size is four processes.
  • Rank is the unique ID for each process.
  • Local rank indicates the unique process ID in a node. In our case, there is only one GPU in each node.
  • The Allreduce operation aggregates data among multiple processes and redistributes the result back to every process.
  • The Allgather operation is used to gather data from all the processes.
  • The Broadcast operation is used to broadcast data from one process identified by root to other processes.
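To illustrate the Allreduce concept in the data-parallel setting, the following sketch simulates four ranks in a single Python process. It does not use MPI or Horovod; the function names and gradient values are hypothetical, and real implementations exchange data over the network rather than through a list.

```python
def allreduce_sum(values_per_rank):
    """Sum each element across ranks and hand the full result to every rank."""
    reduced = [sum(column) for column in zip(*values_per_rank)]
    return [list(reduced) for _ in values_per_rank]


def average_gradients(gradients_per_rank):
    """Average gradients across ranks, as data-parallel training requires."""
    size = len(gradients_per_rank)  # total number of processes
    summed = allreduce_sum(gradients_per_rank)
    return [[g / size for g in rank_values] for rank_values in summed]


# Four ranks, each holding gradients computed from a different minibatch.
gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
averaged = average_gradients(gradients)
print(averaged[0])
```

After the operation every rank holds the same averaged gradients, so all model replicas apply an identical weight update and stay in sync.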

The following table provides the scaling experiment results:

Table 2: Scaling experiments results

Number of nodes

VM Throughput (images/second)




For the scaling experiment results in the table, we run tf_cnn_benchmarks with the ResNet50 model with a batch size of 1024 using synthetic data. This experiment is a weak scaling-based experiment; therefore, the same local batch size of 1024 is used as we scale across nodes.

The following figure shows the plot of speedup analysis of scaling experiment:

Figure 2: Speedup analysis of scaling experiment

The speedup analysis in Figure 2 shows the speedup (times X) when scaling up to four nodes. We can clearly see that it is almost linear scaling as we increase the number of nodes.

The following figure shows how multinode distributed training on VMs compares to running the same experiments on bare metal (BM) servers:

Figure 3: Performance comparison between VMs and BM servers

The four-node experiment (one GPU per node) achieves a throughput of 10675 images/second in the VM environment, while the similarly configured BM run achieves a throughput of 10818 images/second. One-, two-, and four-node experiments show a percentage difference of less than two percent between BM experiments and VM experiments.
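The four-node figures can be checked with a short calculation. The sketch below expresses the gap relative to the bare metal throughput, which is one reasonable reading of "percentage difference"; the function name is hypothetical.

```python
def percent_difference(bare_metal, virtual_machine):
    """Throughput gap of the VM run, as a percentage of the BM throughput."""
    return (bare_metal - virtual_machine) / bare_metal * 100


gap = percent_difference(bare_metal=10818, virtual_machine=10675)
print(f"{gap:.2f}%")
```

For the four-node case this gives a gap of roughly 1.3 percent, in line with the stated difference of less than two percent.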


In this blog, we described how to set up the ESXi server and VMs to run multinode experiments. We examined various scenarios in which data scientists can benefit from multi-GPU experiments and their corresponding performance. The multinode scaling experiments showed that the speedup is close to linear. We also examined how VM-based distributed training compares to BM-server-based distributed training. In upcoming blogs, we will look at best practices for multinode vGPU training, and the use and performance of NVIDIA Multi-Instance GPU (MIG) for various deep learning workloads.


  1. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model
  2. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  3. EfficientNetV2: Smaller Models and Faster Training
  4. Virtualizing GPUs for AI with VMware and NVIDIA Based on Dell Infrastructure Design Guide


Contributors to this blog: Prem Pradeep Motgi and Sarvani Vemulapalli

  • generative AI
  • Llama 2
  • LLM
  • Meta
  • Fine-tuning

Llama 2: Efficient Fine-tuning Using Low-Rank Adaptation (LoRA) on Single GPU

Khushboo Rathi, Bhavesh Patel

Wed, 24 Apr 2024 14:23:28 -0000




With the growth in the parameter size and performance of large language models (LLMs), many users are increasingly interested in adapting them to their own use case with their own private dataset. These users can either search the market for an enterprise-level application, which is trained on a large corpus of public datasets and might not be applicable to their internal use case, or take an open-source pre-trained model and fine-tune it on their own proprietary data. Efficient resource utilization and cost-effectiveness are crucial when choosing a strategy for fine-tuning an LLM. The latter approach offers a more cost-effective and scalable solution because the model is trained with known data, which gives users more control over its outcome.

This blog investigates how Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning technique, can be used to fine-tune the Llama 2 7B model on a single GPU. We successfully fine-tuned the Llama 2 7B model on a single NVIDIA A100 40 GB GPU, and we provide a deep dive into how to configure the software environment to run the fine-tuning flow on a Dell PowerEdge R760xa featuring NVIDIA A100 GPUs.

This work continues our previous work, in which we performed an inferencing experiment on Llama 2 7B and shared results on GPU performance during the process.

Memory bottleneck

When fine-tuning any LLM, it is important to understand the infrastructure needed to load and fine-tune the model. Standard fine-tuning, in which all the parameters are updated, requires significant computational power and memory to hold the optimizer states and gradients. Together, these usually result in a memory footprint approximately five times larger than the model itself. If we load the 7B model in fp16 (2 bytes per parameter), the weights take about 14 GB and the optimizer states and gradients about 70 GB, so we need around 84 GB of GPU memory, as shown in figure 1. This does not fit on a single A100 40 GB card. To overcome this memory capacity limitation on a single A100 GPU, we can use a parameter-efficient fine-tuning (PEFT) technique. We use one such technique, Low-Rank Adaptation (LoRA), for this experiment.
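The 84 GB estimate follows from simple arithmetic, sketched below. The function name and its default arguments are ours; the factor of five and the 2 bytes per parameter come from the paragraph above.

```python
def full_finetune_memory_gb(num_params, bytes_per_param=2, overhead_factor=5):
    """Estimate GPU memory for standard fine-tuning.

    weights  = parameters * bytes per parameter
    overhead = optimizer states and gradients, taken as ~5x the weights
    """
    weights_gb = num_params * bytes_per_param / 1e9
    overhead_gb = overhead_factor * weights_gb
    return weights_gb, overhead_gb, weights_gb + overhead_gb

weights_gb, overhead_gb, total_gb = full_finetune_memory_gb(7e9)
fits_on_a100_40gb = total_gb <= 40
```

With 7 billion parameters, this gives 14 GB of weights plus 70 GB of overhead, which is why the full fine-tuning run cannot fit on one 40 GB card.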



Figure 1. Schematic showing memory footprint of standard fine-tuning with Llama 2 7B model.


Fine-tuning method

LoRA is an efficient fine-tuning method that, instead of fine-tuning all the weights in the weight matrices of the pre-trained LLM, freezes them and optimizes small rank decomposition matrices of the dense layers during adaptation. These matrices constitute the LoRA adapter. The fine-tuned adapter is then merged with the pre-trained model and used for inferencing. The number of trainable parameters is determined by the rank and the shape of the original weights. In practice, trainable parameters are as few as 0.1% to 1% of all the parameters. As the number of parameters needing fine-tuning decreases, the size of the gradients and optimizer states attached to them decreases accordingly, so the overall memory footprint shrinks. For example, the Llama 2 7B model parameters could be loaded in int8 (1 byte per parameter, about 7 GB), with 1 GB of trainable parameters loaded in fp16 (2 bytes per parameter). The gradients (fp16), optimizer states (fp32), and activations (fp32) then add up to approximately 7-9 GB. This brings the total memory footprint of the model to be fine-tuned to 15-17 GB, as illustrated in figure 2.
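To illustrate how the rank and the weight shape determine the number of trainable parameters, here is a minimal sketch. The rank of 8 and the choice of adapting two projection matrices per layer (query and value) are assumptions, not settings stated in this blog; the hidden size of 4096 and the 32 layers are the published Llama 2 7B dimensions.

```python
def lora_params_per_matrix(d_out, d_in, rank):
    """A d_out x d_in weight gets two low-rank factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)

HIDDEN_SIZE = 4096        # Llama 2 7B hidden dimension
NUM_LAYERS = 32           # Llama 2 7B transformer layers
RANK = 8                  # assumed LoRA rank
ADAPTED_PER_LAYER = 2     # assumed: query and value projections only

per_matrix = lora_params_per_matrix(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
trainable = NUM_LAYERS * ADAPTED_PER_LAYER * per_matrix

# Share of one full 4096 x 4096 matrix that its adapter represents.
fraction_of_matrix = per_matrix / (HIDDEN_SIZE * HIDDEN_SIZE)
```

Under these assumptions, the adapter has about 4.2 million trainable parameters, which is close to the 0.0042 B reported in Table 1, although the source does not confirm the rank or the adapted layers.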


Figure 2. Schematic showing an example of memory footprint of LoRA fine-tuning with Llama 2 7B model.


Experimental setup 

A model characterization gives readers valuable insight into GPU memory utilization, training loss, and computational efficiency measured during fine-tuning by varying the batch size and observing out-of-memory (OOM) occurrences for a given dataset. Table 1 shows the resource profile when fine-tuning the Llama 2 7B-chat model using the LoRA technique on a PowerEdge R760xa with 1*A100-40 GB on the open-source SAMsum dataset. To measure tera floating-point operations (TFLOPs) on the GPU, we used the DeepSpeed Flops Profiler. Table 2 gives details on the system used for this experiment.

Table 1. Actual memory footprint of Llama 2 7B model using LoRA technique in our experiment.

Trainable params (LoRA) | 0.0042 B (0.06% of 7B model)
7B model params (int8) | 7 GB
LoRA adapter (fp16) | 0.0084 GB
Gradients (fp32) | 0.0168 GB
Optimizer states (fp32) | 0.0168 GB
2.96 GB
Total memory for batch size 1 | 10 GB = 9.31 GiB
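The entries in Table 1 can be reproduced from the trainable parameter count, as the sketch below shows. The variable names are ours, and the 2.96 GB entry is carried over as-is because its row label is not available here.

```python
GB = 1e9       # decimal gigabyte
GIB = 2 ** 30  # binary gibibyte

trainable_params = 0.0042e9               # from Table 1

base_model_gb = 7.0                       # 7B parameters at 1 byte each (int8)
adapter_gb = trainable_params * 2 / GB    # fp16: 2 bytes per trainable parameter
gradients_gb = trainable_params * 4 / GB  # fp32: 4 bytes per trainable parameter
optimizer_gb = trainable_params * 4 / GB  # matches the table's 4 bytes per parameter
remaining_gb = 2.96                       # unlabeled row in Table 1

total_gb = base_model_gb + adapter_gb + gradients_gb + optimizer_gb + remaining_gb
ten_gb_in_gib = 10 * GB / GIB             # unit conversion used in the last row
```

The components sum to about 10 GB, and 10 GB converts to about 9.31 GiB, which is the unit that GPU monitoring tools typically report.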


System configuration

In this section, we list the hardware and software configuration of the PowerEdge R760xa server used in this experiment to fine-tune the Llama 2 7B model.

Figure 3. R760XA Specs

Table 2. Hardware and software configuration of the system 




Compute server for inferencing | PowerEdge R760xa
NVIDIA A100-40GB PCIe CEM GPU
Host Processor Model Name | Intel(R) Xeon(R) Gold 6454S (Sapphire Rapids)
Host Processors per Node
Host Processor Core Count
Host Processor Frequency | 2.2 GHz
Host Memory Capacity | 512 GB, 16 x 32GB 4800 MT/s DIMMs
Host Storage Type
Host Storage Capacity | 900 GB
Operating system | Ubuntu 22.04.1
DeepSpeed FLOPs Profiler
Package Management



The SAMsum dataset (2.94 MB) consists of approximately 16,000 rows (train, test, and validation splits) of English dialogues and their summaries. This data was used to fine-tune the Llama 2 7B model. We preprocess the data into prompts to be fed to the model for fine-tuning; the prompts and responses, in JSON format, were used to train the model. During this process, PyTorch batches the data (about 10 to 11 rows per batch) and concatenates them. Preprocessing the training split of the dataset thus creates a total of 1,555 batches. These batches are then passed to the model in chunks for fine-tuning.
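As a rough illustration of this kind of preprocessing, the following sketch builds a prompt from a dialogue and its summary and then concatenates tokenized samples into fixed-length chunks. The prompt template, the whitespace tokenizer, the example rows, and the chunk size are all assumptions for illustration; the source does not show the actual template or tokenizer.

```python
def build_prompt(dialogue, summary):
    """Join a dialogue and its summary into one training string (hypothetical template)."""
    return f"Summarize this dialog:\n{dialogue}\n---\nSummary:\n{summary}"

def toy_tokenize(text):
    """Stand-in tokenizer that splits on whitespace; a real run would use the model tokenizer."""
    return text.split()

def pack(token_lists, chunk_size):
    """Concatenate tokenized samples and cut the stream into equal-sized chunks.

    Tokens left over at the end (fewer than chunk_size) are dropped.
    """
    buffer, chunks = [], []
    for tokens in token_lists:
        buffer.extend(tokens)
        while len(buffer) >= chunk_size:
            chunks.append(buffer[:chunk_size])
            buffer = buffer[chunk_size:]
    return chunks

# Hypothetical rows in the shape of a dialogue-summarization dataset.
rows = [
    {"dialogue": "A: Lunch at noon?\nB: Sure.", "summary": "A and B will have lunch at noon."},
    {"dialogue": "A: Is the report done?\nB: Almost.", "summary": "B has nearly finished the report."},
]
tokenized = [toy_tokenize(build_prompt(r["dialogue"], r["summary"])) for r in rows]
chunks = pack(tokenized, chunk_size=8)
```

Packing several short rows into one fixed-length chunk is what makes the number of batches (1,555) much smaller than the number of rows in the training split.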


Fine-tuning steps

  1. Download the Llama 2 model
    • The model is available either from Meta’s git repository or from Hugging Face. To access the model, you must first submit the required registration form for the Meta AI license agreement.
    • The details can be found in our previous work.
  2. Convert the model from Meta’s git repo to the Hugging Face model format in order to use the PEFT libraries used in the LoRA technique
    • Use the following commands to convert the model:
      ## Install HuggingFace Transformers from source
      pip freeze | grep transformers ## verify it is version 4.31.0 or higher
      git clone
      cd transformers
      pip install protobuf
      python src/transformers/models/llama/ \
         --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
  3. Build a conda environment and then git clone the example fine-tuning recipes from Meta’s git repository to get started
    • We have modified the code base to include the DeepSpeed Flops Profiler and nvitop to profile the GPU
  4. Load the dataset using the dataloader library of Hugging Face and, if need be, perform preprocessing
  5. Input the config file entries with respect to PEFT methods, model name, output directory, save model location, etc.
    • The following is an example code snippet
    model_name: str = "path_of_base_hugging_face_llama_model"
    run_validation: bool = True
    batch_size_training: int = 7
    num_epochs: int = 1
    val_batch_size: int = 1
    dataset = "dataset_name"
    peft_method: str = "lora"
    output_dir: str = "path_to_save_fine_tuning_model"
    save_model: bool = True

  6. Run the following command to perform fine-tuning on a single GPU

python3  --use_peft --peft_method lora --quantization --model_name location_of_hugging_face_model
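The config entries from step 5 could be held in a small Python dataclass, as in the sketch below. The class name TrainConfig and the validate helper are hypothetical and are not taken from Meta's recipes; only the field names and values mirror the snippet in step 5.

```python
from dataclasses import dataclass, asdict

@dataclass
class TrainConfig:  # hypothetical container for the entries listed in step 5
    model_name: str = "path_of_base_hugging_face_llama_model"
    run_validation: bool = True
    batch_size_training: int = 7
    num_epochs: int = 1
    val_batch_size: int = 1
    dataset: str = "dataset_name"
    peft_method: str = "lora"
    output_dir: str = "path_to_save_fine_tuning_model"
    save_model: bool = True

    def validate(self):
        """Reject obviously invalid values before a long fine-tuning run starts."""
        if self.batch_size_training < 1 or self.val_batch_size < 1:
            raise ValueError("batch sizes must be positive")
        if self.num_epochs < 1:
            raise ValueError("num_epochs must be at least 1")
        return self

# Override one entry, for example to repeat the run at batch size 4.
config = TrainConfig(batch_size_training=4).validate()
settings = asdict(config)
```

Keeping the entries in one typed object makes it easy to repeat the experiment at different batch sizes while leaving everything else unchanged.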

Figure 4 shows fine-tuning with the LoRA technique on 1*A100 (40 GiB) with batch size = 7 on the SAMsum dataset, which took 83 minutes to complete.

Figure 4. Example screenshot of fine-tuning with LoRA on SAMsum dataset


Experiment results

The fine-tuning experiments were run at batch sizes 4 and 7. For these two scenarios, we calculated training losses, GPU utilization, and GPU throughput. At batch size 8, we encountered an out-of-memory (OOM) error for the given dataset on 1*A100 with 40 GB.

When a dialogue was sent to the base 7B model, the summarization results were poor, as shown in figure 5. After fine-tuning the base model on the SAMsum dataset, the same dialogue produces a proper summary, as shown in figure 6. The difference in results shows that fine-tuning succeeded.

Figure 5. Summarization results from the base model.

Figure 6. Summarization results from the fine-tuned model.

Figures 7 and 8 show the training losses at batch sizes 4 and 7, respectively. We found that even after increasing the batch size by nearly 2x, the model training performance did not degrade.

Figure 7. Training loss with batch size = 4, a total of 388 training steps in 1 epoch.

Figure 8. Training loss with batch size = 7, a total of 222 training steps in 1 epoch.
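The step counts in the two captions above follow from the 1,555 preprocessed batches. The sketch below assumes that an incomplete final step is dropped, which is consistent with the reported numbers but is not stated in the source.

```python
NUM_BATCHES = 1555  # preprocessed training batches, from the dataset description

def steps_per_epoch(num_samples, batch_size, drop_last=True):
    """Number of optimizer steps in one epoch.

    drop_last=True discards a final incomplete step (floor division);
    drop_last=False keeps it (ceiling division).
    """
    if drop_last:
        return num_samples // batch_size
    return -(-num_samples // batch_size)

steps_bs4 = steps_per_epoch(NUM_BATCHES, 4)
steps_bs7 = steps_per_epoch(NUM_BATCHES, 7)
```

With floor division, batch sizes 4 and 7 give 388 and 222 steps, matching Figures 7 and 8.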

Table 3 captures the GPU memory utilization with the LoRA technique. At batch size 1, used memory was 9.31 GiB. At batch size 4, used memory was 26.21 GiB. At batch size 7, used memory was 39.31 GiB. Going further, we see an OOM error at batch size 8 on the 40 GB GPU card. The memory usage remains constant throughout fine-tuning, as shown in Figure 9, and depends on the batch size. We calculated the reserved memory per batch to be 4.302 GB on 1*A100.
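A rough linear extrapolation from the measured points shows why batch size 8 runs out of memory. This is only a back-of-the-envelope sketch, not the method used to obtain the 4.302 GB figure above, and the 40 GiB capacity is an assumption about how the card reports its memory.

```python
# Used memory in GiB at each measured batch size, from the text.
measured = {1: 9.31, 4: 26.21, 7: 39.31}
CAPACITY_GIB = 40.0  # assumed reported capacity of the A100 40 GB card

def memory_per_extra_sample(points, low, high):
    """Slope of used memory between two measured batch sizes."""
    return (points[high] - points[low]) / (high - low)

def predict(points, ref_batch, target_batch, per_sample):
    """Extrapolate linearly from a measured reference point."""
    return points[ref_batch] + (target_batch - ref_batch) * per_sample

per_sample = memory_per_extra_sample(measured, 4, 7)
predicted_bs8 = predict(measured, 7, 8, per_sample)
would_oom = predicted_bs8 > CAPACITY_GIB
```

Each additional sample between batch sizes 4 and 7 costs a little over 4 GiB, so one more sample on top of 39.31 GiB exceeds the card's capacity.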

Table 3. The GPU memory utilization is captured by varying the max. batch size parameter.

Max. Batch Size | Steps in 1 Epoch | Total Memory (GiB) | Used Memory (GiB) | Free Memory (GiB)

Out of Memory Error

Figure 9. GPU memory utilization for batch size 4 (which remains constant for fine-tuning)

The GPU TFLOPs were determined using the DeepSpeed Flops Profiler. We found that FLOPs vary linearly with the number of batches sent in each step (the batch size), indicating that FLOPs per token is constant.

Figure 10. Training GPU TFlops for batch sizes 4 and 7 while fine-tuning the model.

The GPU multiply-accumulate operations (MACs), which are common operations performed in deep learning models, are also determined. We found that MACs also depend linearly on the batch size and hence are constant per token.

Figure 11. GPU MACs for batch sizes 4 and 7 while fine-tuning the model.

The time taken for fine-tuning, also known as the epoch time, is given in table 4. The training time does not vary much with batch size, which strengthens our argument that the FLOPs per token is constant: a larger batch size means proportionally fewer steps, each doing proportionally more work. Hence, the total training time is independent of the batch size.
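The reasoning can be sketched numerically: if the work per sample is constant, the total work in an epoch is the number of steps times the batch size. The unit of work here is arbitrary, and the step counts are the ones reported in Figures 7 and 8.

```python
def total_work(steps, batch_size, work_per_sample=1.0):
    """Total work in one epoch, assuming a constant cost per sample."""
    return steps * batch_size * work_per_sample

# Batch size -> steps in one epoch, from Figures 7 and 8.
runs = {4: 388, 7: 222}

work = {batch: total_work(steps, batch) for batch, steps in runs.items()}
ratio = work[7] / work[4]
```

Both runs process almost the same number of samples (1,552 versus 1,554), so the epoch time should be nearly identical as long as the GPU is kept busy.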

Table 4. Data showing the time taken by the fine-tuning process.

Max. Batch Size | Steps in 1 Epoch | Epoch time (secs)


Conclusion and Recommendation

  1. We show that using a PEFT technique like LoRA can help reduce the memory requirement for fine-tuning a large language model on a proprietary dataset. In our case, we use a Dell PowerEdge R760xa featuring the NVIDIA A100-40GB GPU to fine-tune a Llama 2 7B model.
  2. We recommend using a lower batch size to minimize automatic memory allocation, leaving memory headroom that could be used for a larger dataset. We have shown that a lower batch size affects neither the training time nor the training performance.
  3. The memory capacity required to fine-tune the Llama 2 7B model was reduced from 84 GB to a level that easily fits on the 1*A100 40 GB card by using the LoRA technique.




Author: Khushboo Rathi

Co-author: Bhavesh Patel


  • PowerEdge
  • GPU
  • MLPerf
  • Broadcom
  • GenAI
  • Large Language Model

Dell PowerEdge Servers Unleash Another Round of Excellent Results with MLPerf™ v4.0 Inference

Rakshith Vasudev, Frank Han, Manpreet Sokhi

Wed, 27 Mar 2024 15:12:53 -0000



Today marks the unveiling of the MLPerf v4.0 Inference results, which have emerged as an industry benchmark for AI systems. These benchmarks assess the system-level performance of state-of-the-art hardware and software stacks. The benchmarking suite contains image classification, object detection, natural language processing, speech recognition, recommenders, medical image segmentation, LLM 6B and LLM 70B question answering, and text to image benchmarks that aim to replicate different deployment scenarios such as the data center and edge.

Dell Technologies is a founding member of MLCommons™ and has been actively making submissions since the inception of the Inference and Training benchmarks. See our white paper, MLPerf™ Inference v2.1 with NVIDIA GPU-Based Benchmarks on Dell PowerEdge Servers, which introduces the MLCommons Inference benchmark.

Our performance results are outstanding and clearly indicate our resolve to deliver excellent system performance. These improvements enable higher system performance when it is most needed, for example, for demanding generative AI (GenAI) workloads.

What is new with Inference 4.0? 

Inference 4.0 and Dell’s submission include the following:

  • Newly introduced Llama 2 question answering and text to image stable diffusion benchmarks, and submission across different Dell PowerEdge XE platforms. 
  • Improved GPT-J (225 percent improvement) and DLRM-DCNv2 (100 percent improvement) performance. Higher throughput for the GPT-J and DLRM-DCNv2 workloads means faster natural language processing tasks such as summarization and faster relevant recommendations that can boost revenue, respectively.
  • First-time submission of server results with the recently released PowerEdge R7615 and PowerEdge XR8620t servers with NVIDIA accelerators.
  • Besides accelerator-based results, Intel-based CPU-only results. 
  • Results for PowerEdge servers with Qualcomm accelerators.
  • Power results showing high performance/watt scores for the submissions. 
  • Virtualized results on Dell servers with Broadcom.  

Overview of results 

Dell Technologies delivered 187 data center, 28 data center power, 42 edge, and 24 edge power results. Some of the more impressive results were generated by our:

  • Dell PowerEdge XE9680, XE9640, and XE8640 servers with NVIDIA H100 Tensor Core GPUs
  • Dell PowerEdge R7515, R750xa, and R760xa servers with NVIDIA L40S and A100 Tensor Core GPUs
  • Dell PowerEdge XR7620 and XR8620t servers with NVIDIA L4 Tensor Core GPUs
  • Dell PowerEdge R760 server with Intel Emerald Rapids CPUs
  • Dell PowerEdge R760 with Qualcomm QAIC100 Ultra accelerators

NVIDIA-based results include the following GPUs:

  • Eight-way NVIDIA H100 GPU (SXM)
  • Four-way NVIDIA H100 GPU (SXM)
  • Four-way NVIDIA A100 GPU (PCIe)
  • Four-way NVIDIA L40S GPU (PCIe)

These accelerators were benchmarked on different servers such as PowerEdge XE9680, XE8640, XE9640, R760xa, XR7620, and XR8620t servers across data center and edge suites.

Dell contributed about one quarter of the closed data center and edge submissions. The large number of result choices offers end users an opportunity to make data-driven purchase decisions and set performance and data center design expectations.

Interesting Dell data points 

The most interesting data points include:

  • Performance results across different benchmarks are excellent and show that Dell servers meet the increasing need to serve different workload types. 
  • Among 20 submitters, Dell Technologies was one of the few companies that covered all benchmarks in the closed division for data center suites. 
  • The PowerEdge XE8640 and PowerEdge XE9640 servers compared to other four-way systems procured winning titles across all the benchmarks including the newly launched stable diffusion and Llama 2 benchmark. 
  • The PowerEdge XE9680 server compared to other eight-way systems procured several winning titles for benchmarks such as ResNet Server, 3D-Unet, BERT-99, and BERT-99.9 Server.
  • The PowerEdge XE9680 server delivers the highest performance/watt compared to other submitters with 8-way NVIDIA H100 GPUs for ResNet Server, GPTJ Server, and Llama 2 Offline.
  • The Dell XR8620t server for edge benchmarks with NVIDIA L4 GPUs outperformed other submissions.
  • The PowerEdge R750xa server with NVIDIA A100 PCIe GPUs outperformed other submissions on the ResNet, RetinaNet, 3D-Unet, RNN-T, BERT 99.9, and BERT 99 benchmarks.
  • The PowerEdge R760xa server with NVIDIA L40S GPUs outperformed other submissions on the ResNet Server, RetinaNet Server, RetinaNet Offline, 3D-UNet 99, RNN-T, BERT-99, BERT-99.9, DLRM-v2-99, DLRM-v2-99.9, GPTJ-99, GPTJ-99.9, Stable Diffusion XL Server, and Stable Diffusion XL Offline benchmarks. 


The following figure shows the different Offline and Server performance scenarios in the data center suite. These results provide an overview; follow-up blogs will provide more details about the results.

The following figure shows that these servers delivered excellent performance for all models in the benchmark such as ResNet, RetinaNet, 3D-UNet, RNN-T, BERT, DLRM-v2, GPT-J, Stable Diffusion XL, and Llama 2. Note that different benchmarks operate on varied scales. They have all been showcased in an exponentially scaled y-axis in the following figure:

Figure 1:  System throughput for submitted systems for the data center suite.

The following figure shows single-stream and multistream scenario results for the edge for ResNet, RetinaNet, 3D-Unet, RNN-T, BERT 99, GPTJ, and Stable Diffusion XL benchmarks. For the single-stream and multistream scenarios, lower latency is better; for the Offline scenario, higher throughput is better.

Figure 2:  Edge results with PowerEdge XR7620 and XR8620t servers overview


The preceding results were officially submitted to MLCommons. They are MLPerf-compliant results for the Inference v4.0 benchmark across various benchmarks and suites for all the tasks in the benchmark such as image classification, object detection, natural language processing, speech recognition, recommenders, medical image segmentation, LLM 6B and LLM 70B question answering, and text to image. These results prove that Dell PowerEdge XE9680, XE8640, XE9640, and R760xa servers are capable of delivering high performance for inference workloads. Dell Technologies secured several #1 titles that make Dell PowerEdge servers an excellent choice for data center and edge inference deployments. End users can benefit from the plethora of submissions, which help them make server performance and sizing decisions that ultimately support enterprises’ AI transformation and show Dell’s commitment to delivering higher performance.

MLCommons Results


The preceding graphs are MLCommons results for MLPerf IDs from 4.0-0025 to 4.0-0035 on the closed datacenter, 4.0-0036 to 4.0-0038 on the closed edge, 4.0-0033 in the closed datacenter power, and 4.0-0037 in closed edge power. 

  • AI

Dell Technologies and NVIDIA: Unleashing the Power of Generative AI Through Collaboration

Gautam Bhagra

Fri, 15 Mar 2024 12:13:22 -0000



A new report from McKinsey reveals the staggering potential of generative AI, estimating its annual impact to be between $2.6 trillion and $4.4 trillion across 63 analyzed use cases. This impact not only underscores the immense economic significance of generative AI but also heralds a new era of possibility and prosperity for businesses and societies worldwide.

In the realm of artificial intelligence (AI), discussions often center on advanced hardware and software capabilities. However, the true journey towards AI adoption in enterprises and telecom companies begins with a strategic shift towards thinking about business transformation and gaining a profound understanding of the intrinsic value residing in their data assets. This focus helps enterprises move towards extracting maximum value from existing data repositories and platforms.

A winning strategy based on collaboration

Dell Technologies has collaborated with NVIDIA to pair NVIDIA’s expertise with full-stack accelerated computing for generative AI with Dell infrastructure and services, specifically for AI adoption in enterprises and telecom companies. The combined technologies provide “a one-stop shop” for organizations looking to harness the power of generative AI by providing the tools, expertise, services, and industry-specific solutions to empower businesses on their generative AI journeys.

Dell Technologies is also working with NVIDIA to develop targeted solutions for specific enterprise verticals, including healthcare, manufacturing, and telecommunications to maximize the impact of generative AI in these areas.

The comprehensive AI-validated solutions encompass not only the essential hardware and software required for AI implementation but also a wide array of professional services. These services range from advisory AI and consulting to implementation services and managed services for generative AI. We accompany our customers along their transformation journeys, ensuring end-to-end support – from strategy formulation to deployment and scaling.

A joint initiative between Dell Technologies and NVIDIA, Project Helix offers industry-leading, best-in-class solutions for various generative AI use cases, helping accelerate adoption across industries. The comprehensive, AI-validated solution is based on AI-optimized Dell PowerEdge servers with the latest NVIDIA GPUs to support all phases of generative AI, from training to fine-tuning to inferencing. Scalable Dell Unstructured Data Solutions (UDS) storage like PowerScale and ECS enables rapid data storage for numerous object types and prepares the data for AI model processing. High-performance networking enables low-latency data gravity operations across the organization.

Using Dell’s proven ProConsult Advisory methodology, we facilitate structured interviews and workshops to assist customers in aligning on their generative AI strategic vision, establishing guiding principles, assessing their current (AS-IS) state, and developing future (TO-BE) road maps for achieving desired outcomes.

Dell and NVIDIA end-to-end validated stack for AI

On top of this accelerated computing platform, NVIDIA offers NVIDIA AI Enterprise software. This end-to-end, cloud-native software platform delivers production-grade AI, which includes the NVIDIA NeMo framework and models for large language model development and deployment.

This solution empowers a wide range of multivendor use cases, including, but not limited to, natural language generation, chatbots, and digital assistants. The use case list for generative AI applications extends to personalized marketing, improved network management, predictive maintenance, enhanced security, resource optimization, new service development, and more.

AI: a paradigm shift for business excellence

Demystifying AI adoption involves moving the narrative away from hardware-centric paradigms towards a holistic focus on business transformation and data value. By recognizing the centrality of data and embracing a strategic approach to AI adoption, enterprises and telecom companies can unlock new avenues of growth and competitiveness. Dell’s collaboration with NVIDIA delivers a unique package of AI solutions to help customers move from the strategy phase to implementing and scaling their AI operations, accelerating the time to innovation.

It is not merely about embracing AI; it is about embracing a mindset of continuous evolution and innovation in the pursuit of organizational excellence.

“To harness the full benefits of generative AI, businesses must first understand their data and goals. Dell’s Validated Designs, integrated with NVIDIA technology, help telecommunications customers more easily move from proof of concept to full-scale deployment of their generative AI-powered data center solutions.” – Chris Penrose, Global Head of Business Development for Telco, NVIDIA

"Our collaboration with NVIDIA is a game-changer for Dell's Telco customers who are at the forefront of AI innovation. By leveraging NVIDIA's cutting-edge GPUs, software, and frameworks, we are empowering our customers to accelerate their Generative AI, Machine Learning, and Deep Learning initiatives across OSS, BSS, Core, and RAN. This partnership enables us to deliver the most advanced AI infrastructure and solutions, helping our Telco customers stay ahead in this rapidly evolving landscape." – Manish Singh, Vice President, Engineering Technology, Dell Technologies

  • healthcare
  • LLM
  • Generative AI

Dell and Northwestern Medicine Collaborate on Next Generation Healthcare Multimodal LLMs

Northwestern Medicine, Bhavesh Patel, Bala Chandrasekaran, Frank Han, Steven Barrow

Thu, 15 Feb 2024 15:56:13 -0000



Generative multimodal large language models, or mLLMs, have shown remarkable new capabilities across a variety of domains, ranging from still images to video to waveforms to language and more. However, the impact of healthcare-targeted mLLM applications remains untried, due in large part to the increased risks and heightened regulation encountered in the patient care setting. A collaboration between Dell Technologies and Northwestern Medicine aims to pave the way for the development and integration of next-generation healthcare-oriented mLLMs into hospital workflows via a strategic partnership anchored in technical and practical expertise at the intersection of healthcare and technology.

One practical application of mLLMs is highlighted in a recent open-source publication from the Research and Development team at Northwestern Medicine which describes the development and evaluation of an mLLM for the interpretation of chest x-rays. These interpretations were judged by emergency physicians to be as accurate and relevant in the emergency setting as interpretations by on-site radiologists, even surpassing teleradiologist interpretations. Clinical implementation of such a model could broaden access to care while aiding physician decision-making. This peer-reviewed study – the first to clinically evaluate a generative mLLM for chest x-ray interpretation – is just one example of the numerous opportunities for meaningful impact by healthcare-tailored mLLMs.

Figure 1. Model architecture for chest x-ray interpretation. The current and prior x-rays are combined into a single input image, which flows through an AI model including an image encoder and a text decoder, which then outputs the observation and impression.

As illustrated in Figure 1, the model is a vision encoder-decoder model, using pretrained ViT-base and RoBERTa-base as the image encoder and text decoder respectively. In total, over 1 million images and radiology reports were used to train this model using one node with 8 GPUs over three weeks. Expanding the scope of such models, such as to other image modalities like computed tomography and magnetic resonance imaging, requires much greater hardware capabilities to efficiently train at scale.

Notably, this model was trained using only 8 graphics processing units (GPUs) in three weeks. As the broader body of LLM research has shown, there is great promise in scaling up such methods, incorporating more data and larger models to create more powerful solutions. Hospital systems generate vast amounts of data spanning numerous modalities, such as numeric lab values, clinical images and videos, waveforms, and free text clinical notes. A key goal of the collaboration between Dell Technologies and Northwestern Medicine is to expand on this work and scale the capabilities of healthcare systems to use their own data to solve clinical problems and blend cutting edge data-centric platforms with clinical expertise, all targeted toward improving the patient and practitioner experience.

HIPAA Compliant HPC

To bring this vision to fruition, it is necessary to build out capable healthcare-tailored high-performance computing (HPC) clusters in which multiple nodes with varying levels of compute, memory, and storage resources are made available to users to run tasks in parallel and at scale. This enables centralized management of resources with the flexibility to provision resources to jobs ranging from single-node experiments to massively distributed model training. The typical HPC cluster structure is illustrated in Figure 2. Users can connect to a login node via virtual private network (VPN) or secure shell (SSH). These nodes provide access to requested compute resources within the internal HPC cluster network as well as job scheduling software, such as Slurm, to coordinate job submission and distribution. Computing nodes are interconnected with varying levels of provisioned access available, ranging from one GPU on a multi-GPU node to dozens of multi-GPU nodes. A shared parallel filesystem is used to access the data storage.

Figure 2. Typical HPC cluster setup. The users can log in from multiple locations through an SSH/VPN, which gives them access to shared compute resources and a shared file system.

However, a special consideration within ecosystems handling hospital data is protected health information, or PHI. The Health Insurance Portability and Accountability Act, or HIPAA, mandates a basic level of security to ensure that PHI is adequately protected, ensuring patient privacy around sensitive health data. Thus, HIPAA-compliant healthcare HPC must account for heightened security and segregation of PHI. But what exactly does it mean to be HIPAA compliant? The following will describe some key components necessary to ensure HIPAA compliance and protection of sensitive patient data throughout all aspects of hospital-based collaborations. Though HIPAA compliance may seem challenging, we break down these requirements into two key facets: the data silo and data stewardship, as shown in Figure 3.

Figure 3. The key facets needed to ensure HIPAA compliance in datacenters housing healthcare data. These can be generally grouped into a data silo, which implements physical and digital protections to ensure data security, and sound data stewardship, which defines best practices and ensures continual collaborative growth among strategic partners.

Firstly, the data silo must ensure that access is provisioned in a secure and controllable fashion. Data must be encrypted in accordance with the Advanced Encryption Standard (AES), such as by AES-256 which utilizes a 256-bit key. Adequate firewalls, a private IP address, and access via remote VPN are further required to ensure that PHI remains accessible only to authorized parties and in a secure fashion. Finally, physical access controls ensure credentialed access and surveillance within the datacenter itself.

Secondly, data stewardship practices must be in place to ensure that practices remain up to date and aligned with institutional goals. A business associate agreement (BAA) describes the responsibilities of each party with regards to protection of PHI in a legally binding fashion and is necessary if business associate operations require PHI access. Security protocols, along with a disaster recovery plan, should be outlined to ensure protection of PHI in all scenarios. Finally, regular security and risk analyses should be performed to maintain compliance with applicable standards and identify areas of improvement.

While many datacenters have implemented measures to ensure compliance with regulations like HIPAA, the greatest challenge remains providing on-demand separation between general workloads and HIPAA-compliant workloads within the same infrastructure. To address this issue, Dell Technologies is working in collaboration with Northwestern Medicine on a new approach that utilizes flexible, controlled provisioning to enable on-demand HIPAA compliance within existing HPC clusters, as shown in Figure 4. This HPC setup, once deployed, would automatically provide network separation and the reconfiguration of compute and data storage resources, ensuring they are isolated from the general allocation.

This newly HIPAA-compliant portion of the cluster can be accessed only by credentialed users via VPN using dedicated login nodes which provide separate job scheduling and filesystem access, enabling access to AI-ready compute resources without disrupting general workloads. When no longer needed, automatic cluster reconfiguration occurs, returning resources to the general allocation until new HIPAA-compliant workloads are needed.

Figure 4. Next-generation HIPAA-compliant HPC cluster, which builds on the typical setup presented in Figure 2. As needed to support healthcare workloads, the Dell model for HIPAA-compliant HPC separates the HPC cluster into HIPAA-compliant and general partitions at the flip of a switch, providing network separation and access to dedicated HIPAA-compliant compute and data storage while maintaining the ability to serve general workloads in parallel.

Our expertise in compute infrastructure for artificial intelligence (AI) initiatives extends to developing datacenters and datacenter infrastructure with the proper security and controls in place that a health system can leverage as part of their efforts in achieving HIPAA compliance.

This integrated model of HIPAA-compliant compute for healthcare is aimed at democratizing the benefits of the artificial intelligence revolution, enabling healthcare institutions to employ these new technologies and provide better, more efficient care for all.


Generative Artificial Intelligence for Chest Radiograph Interpretation in the Emergency Department:



Jonathan Huang, MD/PhD Candidate, Research & Development, Northwestern Medicine

Matthew Wittbrodt, Solutions Architect, Research & Development, Northwestern Medicine

Alex Heller, Director, Research & Development, Northwestern Medicine

Mozziyar Etemadi, Clinical Director, Advanced Technologies, Northwestern Medicine

Bhavesh Patel, Sr. Distinguished Engineer, Dell Technologies

Bala Chandrasekaran, Technical Staff, Dell Technologies

Frank Han, Senior Principal Engineer, Dell Technologies

Steven Barrow, Enterprise Account Executive


  • AI
  • Intel
  • PowerEdge
  • generative AI
  • Large Language Model

Unlocking the Power of Large Language Models and Generative AI: A Dell and Run:ai Joint Solution

Justin King, Ekin Karabulut

Tue, 30 Jan 2024 19:47:13 -0000



In the fast-paced landscape of AI, the last year has undeniably been marked as the era of Large Language Models (LLMs), especially in the Generative AI (GenAI) field. Models like GPT-4 and Falcon have captured our imagination, showcasing the remarkable potential of these LLMs. However, beneath their transformative capabilities lies a substantial challenge: the insatiable hunger for computational resources.

The demand for compute: fueling innovation with computational power

GenAI applications span from the media industry to software development, driving innovation across industries. OpenAI's release of GPT-3 was a turning point, demonstrating the capabilities of language models and their potential to revolutionize every sector. On one hand, startups and tech giants have introduced closed-source models, offering APIs for their usage, exemplified by OpenAI and GPT-4. On the other hand, an active open-source community has emerged, releasing powerful models such as Falcon and Llama 2. These models, both closed- and open-source, have spurred a wave of interest, with companies racing to harness their potential.

While the promise of LLMs is enormous, they come with a significant challenge—access to high-performance GPUs. Enterprises aiming to deploy these models in their private data centers or cloud environments must contend with the need for substantial GPU power. Security concerns further drive the preference for in-house deployments, making GPU accessibility critical.

The infrastructure required to support LLMs often includes high-end GPUs connected through fast interconnects and storage solutions. These resources are not just expensive and scarce but are also in high demand, leading to bottlenecks in machine learning (ML) development and deployment. Orchestrating these resources efficiently and providing data science and ML teams with easy and scalable access becomes a Herculean task.

Challenges with GPU allocation

In this landscape, GPUs are the backbone of the computational power that fuels these massive language models. Due to the limited availability of on-premises and cloud resources, the open-source community has taken steps to address this challenge. Libraries like bitsandbytes (by Tim Dettmers) and ggml (by Georgi Gerganov) have emerged, using various optimization techniques such as quantization to fine-tune and deploy these models on local devices.

However, the challenges are not limited to model development and deployment. These LLMs demand substantial GPU capacity to maintain low latency during inference and high throughput during fine-tuning. In the real world, the need for capacity means having an infrastructure that dynamically allocates GPU resources to handle LLM fine-tuning and inference operations, all while ensuring efficiency and minimal wasted capacity.

As an example, consider loading Llama-7B using half precision (float16). Such a model requires approximately 12 GB of GPU memory, a figure that can be even lower with the use of lower precision. When high-end GPUs, like the NVIDIA A100 GPU with 40 GB (or 80 GB) of memory, are dedicated solely to a single model, severe resource waste results, especially at scale. The wasted resources translate not only to financial inefficiencies but also to reduced productivity in data science teams and an increased carbon footprint, because underutilized hardware keeps running over extended periods.

Some LLMs are so large that they must be distributed across multiple GPUs or multiple GPU servers. Consider Falcon-180B using full precision. Such a model requires approximately 720 GB and the use of more than 16 NVIDIA A100 GPUs with 40 GB each. Fine-tuning such models and running them in production requires tremendous computing power and poses significant scheduling and orchestration challenges. Such workloads require not only a high-end compute infrastructure but also a high-performance software stack that can distribute these workloads efficiently without bottlenecks.
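The memory figures above follow from simple arithmetic on parameter count and bytes per parameter. The sketch below is illustrative only: it counts weights alone (no activations, KV cache, or framework overhead), uses decimal gigabytes, and assumes nominal parameter counts of 7 billion and 180 billion. The function names are hypothetical.

```python
import math

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed for the model weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus(num_params: float, dtype: str, gpu_memory_gb: float) -> int:
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(num_params, dtype) / gpu_memory_gb)


# Falcon-180B at full precision on 40 GB GPUs (nominal 180e9 parameters).
falcon_gb = weight_memory_gb(180e9, "float32")
falcon_gpus = min_gpus(180e9, "float32", 40)

# A nominal 7e9-parameter model at half precision. The real parameter count
# is somewhat below 7 billion, so quoted figures can come out a little lower.
llama_gb = weight_memory_gb(7e9, "float16")

print(f"Falcon-180B fp32 weights: {falcon_gb:.0f} GB, at least {falcon_gpus} x 40 GB GPUs")
print(f"7B model fp16 weights: {llama_gb:.0f} GB")
```

In practice the GPU count is higher than this lower bound, because inference and fine-tuning also need memory for activations and optimizer state.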

Apart from training jobs, serving these models also requires efficient autoscaling on hardware. When there is high demand, these applications must be able to scale up to hundreds of replicas rapidly, while in low demand situations, they can be scaled down to zero to save costs.

Optimizing the management of LLMs for all these specific needs necessitates a granular view of GPU use and performance as well as a high-level scheduling view of compute-intensive workloads. For instance, it is wasteful to run a single model like Llama-7B (12 GB) on an NVIDIA A100 (40 GB) and leave roughly 70 percent of the GPU memory idle instead of using the remaining capacity for an inference workload.

Concurrency and scalability are essential, both when dealing with many relatively small, on-premises models, each fine-tuned and tailored to specific use cases, and when dealing with huge performant models that need careful orchestration. These unique challenges require a resource orchestration tool like Run:ai to work seamlessly on top of Dell hardware. Such a solution empowers organizations to make the most of their GPU infrastructure, ensuring that every ounce of computational power is used efficiently. By addressing these challenges and optimizing GPU resources, organizations can harness the full potential of LLMs and GenAI, propelling innovation across various industries.

Dell Technologies and Run:ai: joint solution

The figure shows a block that represents AI/ML tools over a block that represents AI workload orchestration. That block is above the Dell Technologies block, which is above a block that represents resources.

To address these bottlenecks, which hinder the rapid adoption of GenAI in organizations, Run:ai, a compute orchestration solution, has teamed up with Dell Technologies.

The Dell Generative AI Solutions portfolio, a comprehensive suite of Dell products and services (Dell PowerEdge XE9680, PowerEdge R760xa, and PowerEdge XE8640 servers) in collaboration with NVIDIA, enables customers to build GenAI models on-premises quickly and securely, accelerate improved outcomes, and drive new levels of intelligence. Dell Validated Designs for Generative AI now support both model tuning and inferencing, allowing users to deploy GenAI models quickly with pretested and proven Dell infrastructure, software, and services to power transformative business outcomes with GenAI. The validated designs integrate end-to-end AI solutions including all the critical components (server, networking, storage, and software) for AI systems, while Run:ai introduces two key technological components that unlock the true potential of these AI models: GPU optimization and a sophisticated scheduling system for training and inference workloads. Extending the Dell GenAI approaches with Run:ai orchestration enables customers to optimize GenAI and AI operations to build and train AI models and run inferencing with greater speed and efficiency.

AI-optimized compute: maximizing GPU utilization

Dell Technologies offers a range of acceleration-optimized PowerEdge servers, purpose-built for high-performance workloads like AI and demanding use-cases in generative AI, as part of the extensive server portfolio that supports various NVIDIA GPUs. Dell PowerEdge servers advance accelerated compute to drive enhanced AI workload outcomes with greater insights, inferencing, training, and visualization. However, one of the primary challenges in training and deploying LLMs is GPU use. Together with Dell PowerEdge servers, Run:ai's GPU optimization layer enables features like fractionalizing GPUs and GPU oversubscription. These features ensure that multiple workloads (training and inference), even small models, can efficiently run on the same GPU. By making better use of existing GPU resources, costs are reduced, and bottlenecks are mitigated.
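To make the idea of fractional GPUs concrete, the following sketch packs workloads that each request a fraction of a GPU onto as few GPUs as possible using a first-fit-decreasing heuristic. This is not Run:ai's scheduler; the workload names, the requested fractions, and the `pack_workloads` function are hypothetical, and a real scheduler also accounts for memory, priorities, and fairness.

```python
def pack_workloads(requests: dict[str, float]) -> list[dict[str, float]]:
    """Place workloads on GPUs so that no GPU is allocated beyond 100%.

    `requests` maps a workload name to the GPU fraction it asks for (0 < f <= 1).
    Returns one dict per GPU, mapping workload name to its fraction.
    """
    gpus: list[dict[str, float]] = []
    # First-fit decreasing: place the largest requests first.
    for name, fraction in sorted(requests.items(), key=lambda kv: kv[1], reverse=True):
        if not 0 < fraction <= 1:
            raise ValueError(f"invalid GPU fraction for {name}: {fraction}")
        for gpu in gpus:
            # Small tolerance guards against floating-point rounding.
            if sum(gpu.values()) + fraction <= 1.0 + 1e-9:
                gpu[name] = fraction
                break
        else:
            # No existing GPU has room, so allocate a new one.
            gpus.append({name: fraction})
    return gpus


requests = {
    "llama-7b-inference": 0.35,
    "notebook": 0.25,
    "fine-tune": 0.60,
    "embedding-service": 0.30,
}
placement = pack_workloads(requests)
for index, gpu in enumerate(placement):
    print(f"GPU {index}: {gpu} (allocated {sum(gpu.values()):.2f})")
```

With one GPU per workload these four jobs would occupy four GPUs; sharing by fraction fits them on two in this example.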

Advanced scheduling: efficient workload management

Run:ai's advanced scheduling system integrates seamlessly into Kubernetes environments on top of PowerEdge servers. It is designed to tackle the complexities that arise when multiple teams and users share a GPU cluster and when running large multi-GPU or multi-node workloads. The scheduler optimizes resource allocation, ensuring efficient utilization of GPUs among various workloads, including training, fine-tuning, and inference.

Autoscaling and GPU optimization for inference workloads

Run:ai's autoscaling functionality enables dynamic adjustments to the number of replicas, allowing for efficient scaling based on demand. In times of increased workload, Run:ai optimally uses the available GPU, scaling up the replicas to meet performance requirements. Conversely, during periods of low demand, the number of replicas can be scaled down to zero, minimizing resource use and leading to cost savings. While there might be a brief cold start delay with the first request, this approach provides a flexible and effective solution to adapt to changing inference demands while optimizing costs.
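The scaling behavior described above can be sketched as a small decision function. This is an illustration under stated assumptions, not Run:ai's autoscaler: the function name, the requests-per-second metric, and the per-replica capacity are all hypothetical.

```python
import math


def desired_replicas(requests_per_second: float,
                     capacity_per_replica: float,
                     max_replicas: int) -> int:
    """Return how many inference replicas to run for the current load.

    Scales to zero when there is no traffic, otherwise runs enough replicas
    to cover the load, capped at `max_replicas`.
    """
    if requests_per_second <= 0:
        return 0  # scale to zero; the next request pays a cold-start delay
    needed = math.ceil(requests_per_second / capacity_per_replica)
    return min(needed, max_replicas)


for load in (0, 3, 12, 1000):
    print(f"{load} req/s -> {desired_replicas(load, capacity_per_replica=5, max_replicas=10)} replicas")
```

The cap on replicas stands in for the finite number of GPUs available to the service.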

Beyond autoscaling, deploying models for inference using Run:ai is a straightforward process. Internal users can effortlessly deploy their models and access them through managed URLs or user-friendly web interfaces like Gradio and Streamlit. This streamlined deployment process facilitates sharing and presentation of deployed LLMs, fostering collaboration and delivering a seamless experience for stakeholders.

AI networking

To achieve high throughput in multi-node training and low latency when hosting a model on multiple machines, most GenAI models require robust and highly performant networking, which is where Dell's networking capabilities and offerings come into play. The network interconnects the compute nodes to facilitate communication during distributed training and inferencing. Dell PowerSwitch Z-series switches are high-performance, open, and scalable data center switches well suited to generative AI, and NVIDIA Quantum InfiniBand switches are available for faster connectivity.

Fast access to your data

Data is a crucial component for each part of the development and deployment steps. Dell PowerScale storage supports the most demanding AI workloads with all-flash NVMe file storage solutions that deliver massive performance and efficiency in a compact form factor. PowerScale is an industry-leading storage platform purpose-built to handle massive amounts of unstructured data, ideal for supporting datatypes required for generative AI.

Streamlined LLM tools

To simplify the experience for researchers and ML engineers, Run:ai offers a suite of tools and frameworks. They remove the complexities of GPU infrastructure with interfaces like command-line interfaces, user interfaces, and APIs on top of Dell hardware. With these tools, training, fine-tuning, and deploying models become straightforward processes, enhancing productivity and reducing time-to-market. As a data scientist, you can take pretrained models from the Hugging Face model hub and start working on them with your favorite IDE and experiment management tools in minutes, a testament to the efficiency and ease of the Dell and Run:ai solution.

Benefits of the Dell and Run:ai solution for customers

Now that we have explored the challenges posed by LLMs and the joint Dell Technologies and Run:ai solution to these bottlenecks, let's dive into the benefits that this partnership offers to customers:

1. Accelerated time-to-market

The combination of Run:ai's GPU optimization and scheduling solutions, along with Dell's robust infrastructure, significantly accelerates the time-to-market for AI initiatives. By streamlining the deployment and management of LLMs, organizations can quickly capitalize on their AI investments.

2. Enhanced productivity

Data science and ML engineering teams, often unfamiliar with the complexities of AI infrastructure, can now focus on what they do best: building and fine-tuning models. Run:ai's tools simplify the process, reducing the learning curve and improving productivity.

3. Cost efficiency

Optimizing GPU use improves cost-effectiveness as well as performance. By running multiple workloads on the same GPU, organizations get the most out of their infrastructure and achieve better cost efficiency, making AI initiatives more financially viable.

4. Increased scalability and GPU availability

Run:ai's advanced scheduling system ensures that workloads are efficiently managed, even during peak demand. This scalability is crucial for organizations that need to serve language models in real time to a growing user base. In addition, the scheduling component ensures fair and optimized allocation of GPU resources between multiple users, teams, and tasks, preventing resource bottlenecks and contention and increasing availability of GPUs to allow more users, teams, and AI services to get access and use available GPU resources effectively.

5. Innovation unleashed

The solution empowers enterprise teams to innovate and experiment with LLMs and GenAI without being hindered by infrastructure complexities. Researchers and ML engineers can easily fine-tune and deploy models using abstraction tools, fostering innovation and exploration in AI projects.


The joint solution offered by Dell Technologies and Run:ai addresses the critical challenges faced by organizations ramping up with GenAI for their business needs and working with LLMs. By enhancing GPU accessibility, optimizing scheduling, streamlining workflows, and saving costs, this solution empowers businesses to fully harness the potential of LLMs in GenAI applications while simplifying the challenges. With AI initiatives becoming increasingly vital in today's world, this partnership offers businesses new ways to automate and simplify their GenAI strategy and drive more business innovation.

For information about how to get started with Dell Technologies and Run:ai on your GenAI journey, see these resources:

Authors: Justin King, Ekin Karabulut

Contributor: James Yung


  • HS5610
  • LLM
  • Llama2
  • Intel Xeon CPU
  • Generative AI
  • quantization

Investigating the Memory Access Bottlenecks of Running LLMs

Tao Zhang, Bhavesh Patel

Thu, 18 Jan 2024 20:20:03 -0000




Memory access and computing are the two main functions in any computer system. In past decades, the computing capability of processors has greatly benefited from Moore’s Law, which has brought smaller and faster transistors onto the silicon die almost every year. System memory, on the other hand, has lagged behind: memory access speed has improved far more slowly than compute, so memory access is increasingly slow relative to the processor. This imbalance causes computer system performance to be bottlenecked by memory access, which is referred to as the “memory wall” issue. The issue gets worse for large language model (LLM) applications because they require more memory and more computing, and therefore more memory accesses to execute those larger models.

In this blog, we investigate the impact of memory access bottlenecks on LLM inference performance. For the experiments, we chose the Llama2 chat models running on a Dell PowerEdge HS5610 server with 4th Generation Intel® Xeon® Scalable Processors. For quantitative analysis, we use the Intel profiling tool, Intel® VTune™ Profiler, to capture memory access information while running the workload. After identifying the location of the memory access bottlenecks, we propose possible techniques and configurations to mitigate the issues in the conclusion section.


Natural Language Processing (NLP) has greatly benefited from the transformer architecture since it was introduced in 2017 [1]. NLP models have moved to transformer-based architectures because of their parallelization and scalability advantages over traditional Recurrent Neural Network (RNN) architectures. Research shows a scaling law for transformer-based language models, in which accuracy is strongly related to the model size, the dataset size, and the amount of compute [2]. This inspired interest in using Large Language Models (LLMs) for high-accuracy and complicated tasks. Figure 1 shows the evolution of LLMs since the transformer architecture was invented. The parameter counts of LLMs have increased dramatically in the last five years, and this trend is continuing. As shown in the figure, most LLMs today come with more than 7 billion parameters. Some models like GPT-4 and PaLM 2 have trillion-level parameters to support multimodal features.


Figure 1: LLM evolution (main LLMs since the transformer was introduced in 2017)

Large models bring challenges for the hardware systems that train and run inference on them. On the one hand, the computation required is tremendous because it is proportional to the model size. On the other hand, memory access is expensive, mainly because of the off-chip communication and complicated cache architectures required to support the large model parameters and computation.

Test Setup

The hardware platform used for this study is the HS5610, the latest 16G cloud-optimized server in the Dell product portfolio. Figure 2 gives an overview of the HS5610. It is designed for cloud service providers (CSPs), offering the same full PowerEdge features and management as mainstream Dell servers, as well as open management (OpenBMC), cold aisle service, channel firmware, and services. The server has two sockets, each with a 32-core 4th Generation Intel® Xeon® CPU. The TDP for each CPU is 250 W. Table 1 and Table 2 show the details of the server configurations and CPU specifications.

Figure 2: PowerEdge HS5610 server layout overview [3]

System Name: PowerEdge HS5610
System Type: Data Center
Number of Nodes:
Host Processor Model: 4th Generation Intel® Xeon® Scalable Processors
Host Processors per Node:
Host Processor Core Count:
Host Processor Frequency: 2.0 GHz, 3.8 GHz Turbo Boost
Host Memory Capacity: 1 TB, 16 x 64 GB DIMM 4800 MHz
Host Storage Capacity: 4.8 TB, NVMe

Table 1: HS5610 Server Configurations

Product Collection: 4th Generation Intel® Xeon® Scalable Processors
Processor Name: Platinum 8480+
# of CPU Cores:
# of Threads:
Base Frequency: 2.0 GHz
Max Turbo Speed: 3.8 GHz
Cache L3: 64 MB
Memory Type: DDR5 4800 MT/s
ECC Memory Supported:

Table 2: 4th Generation 32-core Intel® Xeon® Scalable Processor Technical Specifications

Software Stack and System Configuration

The software stack and system configuration used for this study are summarized in Table 3. The PyTorch framework and Transformers library have been optimized to unleash the AI instruction capabilities of the Xeon CPU. In addition, a low-level tool, Intel® Neural Compressor, has been used for high-accuracy quantization.


OS: CentOS Stream 8 (GNU/Linux x86_64)
Intel® Optimized Inference SW: OneDNN™ Deep Learning, ONNX, Intel® Extension for PyTorch (IPEX), Intel® Extension for Transformers (ITREX), Intel® Neural Compressor
ECC memory mode:
Host memory configuration:
Turbo mode:
CPU frequency governor:

Table 3: Software stack and system configuration

The model under test is the Llama2 chat model with 13 billion parameters (Llama2-13b-chat-hf). The model is based on the pre-trained 13-billion-parameter Llama2 model and fine-tuned with human feedback for chatbot applications. Llama2 models come in light (7b), medium (13b), and heavy (70b) size versions.

The profiling tool used in the experiments is Intel® VTune™. It is a powerful low-level performance analysis tool for x86 CPUs that supports analysis of algorithms, microarchitecture, parallelism, I/O, and more. For the experiments, we use the memory access analysis under the microarchitecture category. Note that Intel® VTune™ consumes significant hardware resources, which affects the performance results if the tool runs alongside the workload. We therefore use it only as a profiling and debugging tool to investigate the bottleneck. The performance numbers reported here were measured without Intel® VTune™ running.

The experiments are targeted to cover the following:

  • Single-socket performance vs dual-socket performance to demonstrate the NUMA memory access impact.
  • Performance under different CPU-core numbers within a single socket to demonstrate the local memory access impact.
  • Performance with different quantization to demonstrate the quantization impact.
  • Intel® VTune™ memory access results.

Because Intel® VTune™ has minimum capture duration and maximum capture size requirements, we focus on capturing the results for the medium-size model (Llama2-13b-chat-hf). This avoids inference times that are too short or too long, and therefore avoids underloading or overloading the capture. All the experiments use a batch size of 1. Performance is characterized by latency or throughput. To reduce measurement error, the inference is executed 10 times and the results are averaged. A warm-up step, which loads the parameters and runs a sample test, is executed before the measured inference runs.
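The measurement procedure (a warm-up run followed by 10 timed runs that are averaged) can be sketched as a small harness. This is not the authors' benchmark script: the `benchmark` function and its `generate` callback are hypothetical, and `fake_generate` merely stands in for a real model call that returns the number of tokens it produced.

```python
import time
from statistics import mean
from typing import Callable


def benchmark(generate: Callable[[], int], runs: int = 10, warmup: int = 1) -> dict[str, float]:
    """Average latency and throughput of `generate` over several runs.

    `generate` performs one inference and returns the number of tokens generated.
    """
    for _ in range(warmup):
        generate()  # warm-up runs are executed but not measured

    latencies = []
    throughputs = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate()
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        throughputs.append(tokens / elapsed)

    return {"latency_s": mean(latencies), "tokens_per_s": mean(throughputs)}


def fake_generate() -> int:
    """Placeholder for a real model call (assumption: 32 tokens per request)."""
    time.sleep(0.01)
    return 32


result = benchmark(fake_generate)
print(f"mean latency: {result['latency_s']:.4f} s, mean throughput: {result['tokens_per_s']:.1f} tokens/s")
```

Using `time.perf_counter` rather than wall-clock time avoids errors from system clock adjustments during a run.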


In this section, we showcase the performance results in terms of throughput for single-socket and dual-socket scenarios under different quantization types, followed by the Intel® VTune™ capture results.

Single-socket Results Under Different Quantization Types:


Figure 3: Single-socket throughput in HS5610 server running Llama2 models under different quantization types

Figure 3 shows the throughputs of running different Llama2 chat models with different quantization types on a single socket. The “numactl” command is used to confine the workload within one single 32-core CPU. From the results, we can see that quantization greatly helps to improve the performance across different models.
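The source does not give the exact `numactl` invocation, so the following is only a sketch of one common way to confine a workload to a single socket: binding both the CPUs and the memory allocations to NUMA node 0 with the `--cpunodebind` and `--membind` options. The helper function and the `run_inference.py` script name are hypothetical.

```python
import shlex


def numactl_single_socket(command: list[str], node: int = 0) -> list[str]:
    """Prefix `command` so that it runs on one NUMA node only.

    --cpunodebind restricts execution to the CPUs of the node, and
    --membind restricts memory allocation to the same node.
    """
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *command]


# Hypothetical inference script; replace with the real workload.
cmd = numactl_single_socket(["python", "run_inference.py"], node=0)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on a Linux host that has `numactl` installed. Binding memory as well as CPUs matters here, because otherwise pages could still be allocated on the remote socket.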

Figure 4: Intel® VTune™ memory analysis results for single-socket fp32: (a) bandwidth and utilization diagram, (b) elapsed time analysis


Figure 5: Intel® VTune™ memory analysis results for single-socket bf16: (a) bandwidth and utilization diagram, (b) elapsed time analysis

To better understand what happens at the lower level, we take the Llama2 13 billion model as an example. We use Intel® VTune™ to capture the bandwidth and utilization diagram and the elapsed time analysis for the fp32 data type (shown in Figure 4) and for the bf16 data type (shown in Figure 5). By reducing the number of bits used to represent each value, the bandwidth required for CPU and DRAM communication is reduced. In this scenario, the DRAM utilization drops from 63.4% for fp32 (shown in Figure 4 (a)) to 28.7% for bf16 (shown in Figure 5 (a)). This also indicates that the weight data can arrive at the CPU chip more quickly, so the workload benefits from faster memory communication. The CPU utilization also increases from 10% for fp32 (shown in Figure 4 (a)) to 15.6% for bf16 (shown in Figure 5 (a)). Both faster memory access and better CPU utilization translate to better performance, with a throughput boost of more than 50% (from 2.47 tokens/s for fp32 to 3.74 tokens/s for bf16) as shown in Figure 3. Diving deeper with the elapsed time analysis shown in Figure 4 (b) and Figure 5 (b), the L1 cache is one of the performance bottleneck locations on the chip. Quantization reduces the likelihood that the task gets stalled.
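As a quick check of the figures quoted above, the relative changes can be computed directly from the reported numbers. The helper name is hypothetical; the inputs are the values stated in the text.

```python
def relative_change_percent(baseline: float, new: float) -> float:
    """Change from `baseline` to `new`, as a percentage of `baseline`."""
    return (new - baseline) / baseline * 100


# Throughput in tokens/s, fp32 -> bf16.
throughput_gain = relative_change_percent(2.47, 3.74)
# DRAM utilization in percent, fp32 -> bf16.
dram_change = relative_change_percent(63.4, 28.7)

print(f"throughput change: {throughput_gain:+.1f}%")
print(f"DRAM utilization change: {dram_change:+.1f}%")
```

The throughput gain comes out slightly above 50 percent, consistent with the claim in the text, while DRAM utilization falls by roughly half, in line with halving the bits per weight.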

Dual-socket Results Under Different Quantization Types:


Figure 6: Dual-socket throughput in HS5610 server running Llama2 models under different quantization types


Figure 7: Intel® VTune™ memory analysis results for dual-socket fp32: (a) bandwidth and utilization diagram, (b) elapsed time analysis


Figure 8: Intel® VTune™ memory analysis results for dual-socket bf16: (a) bandwidth and utilization diagram, (b) elapsed time analysis

Moving to the dual-socket scenarios shown in Figures 6-8, we make similar observations regarding the impact of quantization: quantization increases CPU utilization and reduces the L1 cache bottleneck, therefore boosting throughput across different Llama2 models.

Comparing the single-socket (shown in Figure 3) and dual-socket (shown in Figure 6) scenarios indicates negligible performance improvement. As seen in Figures 7 and 8, even though CPU utilization is better, the communication between the two sockets (the UPI or the NUMA memory access) becomes the main bottleneck and offsets the benefit of having more computing cores.


Based on the experiment results for different Llama2 models under various configurations, we draw the following conclusions:

  • Quantization improves performance across models of different sizes by reducing the L1 cache bottleneck and increasing CPU utilization. It also indicates that the TCO can be optimized by reducing the memory requirements (in terms of capacity and speed) if the model is quantized properly.
  • Cross-socket communication from either UPI or NUMA memory access is a significant bottleneck that may affect performance. Optimizations include reducing inter-socket communication; for example, better partitioning of the model is critical. Alternatively, this indicates that executing one workload on a single dedicated CPU with enough cores is desirable for cost and performance considerations.


[1] A. Vaswani et al., “Attention Is All You Need”.

[2] J. Kaplan et al., “Scaling Laws for Neural Language Models”.


  • LLM
  • Llama2
  • Generative AI
  • quantization

Deploying Llama 7B Model with Advanced Quantization Techniques on Dell Server

Tao Zhang, Bhavesh Patel

Tue, 16 Jan 2024 20:05:01 -0000




Large language models (LLMs) have gained great industrial and academic interest in recent years. Different LLMs have been adopted in various applications, such as content generation, text summarization, sentiment analysis, and healthcare. The LLM evolution diagram in Figure 1 shows the popular pre-trained models since 2017, when the transformer architecture was first introduced [1]. The trend toward larger and more open-source models is easy to see along the timeline. Open-source models boosted the popularity of LLMs by eliminating the huge training cost associated with the large-scale infrastructure and long training time required. Another portion of the cost of LLM applications comes from deployment, where an efficient inference platform is required.

This blog focuses on how to deploy LLMs efficiently on a Dell platform with different quantization techniques. We first benchmark the model accuracy under different quantization techniques. We then demonstrate the performance and memory requirements of running LLMs under those quantization techniques through experiments. Specifically, we chose the open-source model Llama-2-7b-chat-hf for its popularity [2]. The server is the Dell mainstream R760xa server with NVIDIA L40 GPUs [3] [4]. The deployment framework in the experiments is TensorRT-LLM, which enables different quantization techniques including advanced 4-bit quantization as demonstrated in the blog [5].


Figure 1: LLM evolution. Main LLMs since the transformer was introduced in 2017.



LLM inference tends to be slow and power hungry because LLMs have large weights and generate output auto-regressively. Making inference more efficient under limited hardware resources is among the most critical problems for LLM deployment. Quantization is an important technique widely used to make LLM deployment more efficient. It relieves the large hardware resource requirements by reducing the memory footprint and computation energy, and it improves performance through faster memory access compared with deploying the original un-quantized model. For example, in [6], throughput in tokens per second (tokens/s) for the Llama-2-7b model improves by more than 2x when quantizing from 16-bit floating point to 8-bit integer. Recent research has made more aggressive quantization, such as 4-bit, possible and available in deployment frameworks like TensorRT-LLM. However, quantization is not free; it normally comes with accuracy loss. Besides cost, users care about reliable performance with acceptable accuracy for their specific applications. This blog therefore covers two key topics: accuracy and performance. We first benchmark the accuracy of the original model and the quantized models over different tasks. We then deploy those models on a Dell server and measure their performance. Finally, we measure the GPU memory usage for each scenario.
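To make the memory argument concrete, the following minimal sketch estimates the memory needed for the model weights alone at different bit widths. It assumes a nominal 7 billion parameters, and the helper name is illustrative. It ignores activations, the KV cache, and the scales and zero points that quantized formats store, so measured usage will be higher.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30

# Assumption: a nominal 7 billion parameters for a "7B" model.
NUM_PARAMS = 7_000_000_000

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Going from 16-bit to 4-bit weights cuts the weight footprint by a factor of four in this estimate; the rest of the inference memory does not shrink by the same factor.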

Test Setup

The model under investigation is Llama-2-7b-chat-hf [2], a fine-tuned LLM trained with human feedback and optimized for dialogue use cases, based on the 7-billion-parameter Llama-2 pre-trained model. We load the fp16 model from Hugging Face as the baseline by setting torch_dtype to float16.

We investigated two advanced 4-bit quantization techniques to compare with the baseline fp16 model. One is activation-aware weight quantization (AWQ) and the other is GPTQ [7] [8]. TensorRT-LLM integrates the toolkit that allows quantization and deployment for these advanced 4-bit quantized models.
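As background for what 4-bit weight quantization does, here is a minimal sketch of plain round-to-nearest quantization of one group of weights. This is not AWQ or GPTQ: both methods add calibration-based steps on top to reduce the error. The function names and the sample weights are illustrative assumptions.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    levels = 2**bits - 1                      # 15 steps for 4-bit codes 0..15
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min

def dequantize_group(codes, scale, w_min):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + w_min for c in codes]

# Illustrative weights, not taken from any real model.
weights = [-0.42, -0.10, 0.03, 0.27, 0.55, -0.31, 0.12, 0.40]
codes, scale, w_min = quantize_group(weights)
restored = dequantize_group(codes, scale, w_min)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, f"max error = {max_error:.4f}")
```

Each weight is stored as a 4-bit code plus a shared scale and offset per group, and the rounding error is bounded by half a quantization step. That error is the source of the accuracy loss discussed in this blog.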

For accuracy evaluation across models with different quantization techniques, we chose the Massive Multitask Language Understanding (MMLU) dataset. The benchmark covers 57 different subjects and ranges across different difficulty levels for both world knowledge and problem-solving ability tests [9]. The granularity and breadth of the subjects in the MMLU dataset allow us to evaluate model accuracy across different applications. To summarize the results more easily, the 57 subjects in the MMLU dataset can be further grouped into 21 categories, or even into 4 main categories: STEM, humanities, social sciences, and others (business, health, misc.) [10].

Performance is evaluated in terms of tokens/s across different batch sizes on a Dell R760xa server with one L40 installed in a PCIe slot. The R760xa server configuration and the high-level specification of the L40 are shown in Tables 1 and 2 [3] [4]. To make the comparison easier, we fix the input sequence length and output sequence length at 512 and 200, respectively.
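The grouping of subjects into main categories can be sketched as below. The mapping covers only a handful of subjects for illustration, and the counts are placeholders, not results from this blog. Pooling correct answers per category (rather than averaging per-subject accuracies) is an assumption about how the summary is computed.

```python
from collections import defaultdict

# Illustrative subset of the subject-to-category mapping, not the full 57 subjects.
SUBJECT_TO_CATEGORY = {
    "abstract_algebra": "STEM",
    "college_physics": "STEM",
    "high_school_us_history": "humanities",
    "econometrics": "social sciences",
    "professional_medicine": "other",
}

def category_accuracy(per_subject):
    """per_subject maps subject -> (num_correct, num_questions)."""
    totals = defaultdict(lambda: [0, 0])
    for subject, (correct, total) in per_subject.items():
        category = SUBJECT_TO_CATEGORY[subject]
        totals[category][0] += correct
        totals[category][1] += total
    return {cat: correct / total for cat, (correct, total) in totals.items()}

# Placeholder counts for demonstration only.
example = {
    "abstract_algebra": (30, 100),
    "college_physics": (40, 100),
    "high_school_us_history": (120, 200),
    "econometrics": (45, 100),
    "professional_medicine": (150, 250),
}
print(category_accuracy(example))
```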
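A throughput measurement of this kind can be sketched as follows. The `generate` callable is a hypothetical stand-in for the inference engine call, and counting only generated (output) tokens is an assumption; the blog does not spell out the exact definition used.

```python
import time

INPUT_LEN = 512    # fixed input sequence length from the test setup
OUTPUT_LEN = 200   # fixed output sequence length from the test setup

def tokens_per_second(batch_size, output_len, elapsed_s):
    """Generated tokens per second for one batch (assumes output tokens only)."""
    return batch_size * output_len / elapsed_s

def benchmark(generate, batch_size, output_len=OUTPUT_LEN):
    """Time one call to a hypothetical generate(batch_size, input_len, output_len)."""
    start = time.perf_counter()
    generate(batch_size, INPUT_LEN, output_len)
    elapsed = time.perf_counter() - start
    return tokens_per_second(batch_size, output_len, elapsed)

# Example with a dummy generator that just sleeps instead of running a model.
print(benchmark(lambda b, i, o: time.sleep(0.01), batch_size=4))
```

In practice, a few warm-up runs and an average over several timed runs give more stable numbers than a single call.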

System Name: PowerEdge R760xa
System Type: Data Center
Number of Nodes:
Host Processor Model: 4th Generation Intel® Xeon® Scalable Processors
Host Processor Name: Intel® Xeon® Gold 6430
Host Processors per Node:
Host Processor Core Count:
Host Processor Frequency: 2.0 GHz, 3.8 GHz Turbo Boost
Host Memory Capacity and Type: 512GB, 16 x 32GB DIMM, 4800 MT/s DDR5
Host Storage Capacity: 1.8 TB, NVMe

Table 1: R760xa server configuration


GPU Architecture: NVIDIA Ada Lovelace Architecture (L40)
GPU Memory: 48 GB GDDR6 with ECC
GPU Memory Bandwidth:
Max Power Consumption:
Form Factor: 4.4" (H) x 10.5" (L) Dual Slot

Table 2: L40 high-level specification

The inference framework, which includes the different quantization tools, is the initial release of NVIDIA TensorRT-LLM, version 0.5. The operating system for the experiments is Ubuntu 22.04 LTS.


We first show the model accuracy results based on the MMLU dataset tests in Figure 2 and Figure 3, and throughput performance results when running those models on PowerEdge R760xa in Figure 4. Lastly, we show the actual peak memory usage for different scenarios. Brief discussions are given for each result. The conclusions are summarized in the next section.



Figure 2: MMLU 4-category accuracy test result. Comparison of MMLU 4-category accuracy for the AWQ, GPTQ, and original models.

Figure 2 shows the accuracy test results for the 4 main MMLU categories for the Llama-2-7b-chat-hf model. Compared to the baseline fp16 model, the model with 4-bit AWQ shows a significant accuracy drop. The model with 4-bit GPTQ, on the other hand, shows a much smaller drop; for the STEM category in particular, the accuracy drop is smaller than 5%.


Figure 3: MMLU 21-category accuracy test result. Comparison of MMLU 21-category accuracy for the AWQ, GPTQ, and original models.

Figure 3 further shows the accuracy test results for the 21 MMLU sub-categories for the Llama-2-7b-chat-hf model. The conclusion is similar: 4-bit GPTQ quantization gives much better accuracy, except in the law category, where the two quantization techniques achieve similar accuracy.



Figure 4: Throughput test result. Throughput comparison for the AWQ, GPTQ, and original models.

Figure 4 shows the throughput numbers when running Llama-2-7b-chat-hf with different batch sizes and quantization methods on the R760xa server. We observe a significant throughput boost with 4-bit quantization, especially when the batch size is small. For example, at a batch size of 1, 4-bit AWQ or GPTQ quantization achieves 3x the tokens/s of the 16-bit baseline. AWQ and GPTQ quantization give similar performance across different batch sizes.

GPU Memory Usage


Figure 5: Peak GPU memory usage. Peak GPU memory usage for the AWQ, GPTQ, and original models.

Figure 5 shows the peak GPU memory usage when running Llama-2-7b-chat-hf with different batch sizes and quantization methods on the R760xa server. The results show that 4-bit quantization greatly reduces the memory required to run the model. Compared to the baseline fp16 model, the models quantized with AWQ or GPTQ require half the memory or even less, depending on the batch size. The GPTQ quantized model also shows a slightly larger peak memory usage than the AWQ quantized model.


  • We have shown the impact on accuracy, performance, and GPU memory usage of applying advanced 4-bit quantization techniques when running the Llama 7B model on a Dell PowerEdge server.
  • We have demonstrated the great benefits of these 4-bit quantization techniques in terms of improving throughput and saving GPU memory.
  • We have quantitatively compared the quantized models with the baseline model in terms of accuracy across various subjects based on the MMLU dataset.
  • Tests showed that, with an acceptable accuracy loss, 4-bit GPTQ is an attractive quantization method for LLM deployment where hardware resources are limited. On the other hand, large accuracy drops across many MMLU subjects were observed for 4-bit AWQ. This indicates that the AWQ-quantized model should be limited to applications tied to specific subjects. Otherwise, techniques such as re-training or fine-tuning may be required to improve accuracy.


[1]. A. Vaswani et al., “Attention Is All You Need”






[7]. J. Lin et al., “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration”

[8]. E. Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers”

[9]. D. Hendrycks et al., “Measuring Massive Multitask Language Understanding”


  • XE9680
  • generative AI
  • LLM
  • AI in Healthcare

Scaling Hardware and Computation for Practical Deep Learning Applications in Healthcare

Northwestern Medicine, Bhavesh Patel, Bala Chandrasekaran, Frank Han

Fri, 01 Dec 2023 15:26:29 -0000



Medical practice requires analysis of large volumes of data spanning multiple modalities. While these can be as simple as numeric lab results, at the other extreme are high-complexity data such as magnetic resonance imaging or decades-worth of text-based clinical documentation which may be present in medical records. Oftentimes, small details buried within piles of clinical information are critical to obtaining a complete clinical picture. Many deep learning methods developed in recent years have focused on very short “sequence lengths” – the term used to describe the number of words or pixels that a model can ingest – of images and text compared to those encountered in clinical practice. How do we scale such tools to model this breadth of clinical data appropriately and efficiently?

In the following blog, we discuss ways to tackle the compute requirements of developing transformer-based deep learning tools for healthcare data from the hardware, data processing, and modeling perspectives. To do so, we present a practical application of Flash Attention using a series of experiments performing an analysis of the publicly available Kaggle RSNA Screening Mammography Breast Cancer Detection challenge, which contains 54,706 images of 11,913 patients. Breast cancer affects 1 in 8 women and is the second leading cause of cancer death. As such, screening mammography is one of the most performed imaging-based medical screening procedures, which offers a clinically relevant and data-centric case study to consider.


Data Primer

To detect breast cancer early when treatments are most effective, high-resolution x-ray images are taken of breast tissue to identify areas of abnormality which require further examination by biopsy or more detailed imaging. Typically, two views are acquired:

  • Craniocaudal (CC) – taken from the head-to-toe perspective
  • Mediolateral oblique (MLO) – taken at an angle

The dataset contains DICOM-formatted images which must be pre-processed in a standard fashion prior to model training. We detail the data preparation pipeline in figure 1. The CC and MLO views of each study are identified, flipped horizontally if necessary, cropped, and combined to form the model input image. We wrap the standard PyTorch Dataset class to load images and preprocess them for training.

Figure 1. Data pre-processing pipeline for DICOM-formatted image

A more in-depth look at the system for data pre-processing is as follows:

  1. For each breast with a corresponding cancer label, the CC and MLO views are extracted, and the image data are normalized. Right-sided images are horizontally flipped so that the tissue is to the left side of the image, as shown.
  2. Images are cropped to the region of interest (ROI), excluding areas of black or non-tissue artifacts.
  3. Images are resized, maintaining aspect ratio, and tiled to a square of the output size of interest, with the CC view occupying the left half of the output and the MLO view occupying the right.
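The flip, crop, and tile steps above can be sketched on tiny nested-list "images" as follows. This is an illustrative stand-in, not the pipeline used in these experiments: normalization and aspect-preserving resizing are omitted, a pixel value of 0 is assumed to mean background, and all function names are hypothetical.

```python
def flip_horizontal(image):
    """Mirror each row so right-sided tissue ends up on the left."""
    return [row[::-1] for row in image]

def crop_to_tissue(image, background=0):
    """Crop to the bounding box of all non-background pixels."""
    rows = [r for r, row in enumerate(image) if any(p != background for p in row)]
    cols = [c for c in range(len(image[0])) if any(row[c] != background for row in image)]
    if not rows:
        return image  # nothing but background; leave unchanged
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]

def tile_side_by_side(cc, mlo, background=0):
    """Place the CC view on the left and the MLO view on the right."""
    height = max(len(cc), len(mlo))
    def pad(img):
        width = len(img[0])
        return img + [[background] * width for _ in range(height - len(img))]
    return [left + right for left, right in zip(pad(cc), pad(mlo))]

# Toy right-sided CC view (tissue on the right) and a toy MLO view.
right_cc = [[0, 0, 0, 0], [0, 0, 5, 7], [0, 0, 6, 8], [0, 0, 0, 0]]
mlo = [[0, 3, 0], [0, 4, 0], [0, 0, 0]]

combined = tile_side_by_side(crop_to_tissue(flip_horizontal(right_cc)), crop_to_tissue(mlo))
print(combined)
```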

An important consideration is whether to perform this processing within the dataloader while training or to save a pre-processed version of the dataset. The former approach allows for iteration on different processing strategies without modifying the dataset itself, providing greater ease of experimentation. However, this level of processing during training may limit the rate at which data can be fed to the graphics processing unit (GPU) for training, resulting in time and monetary inefficiencies. In contrast, the latter approach requires that multiple versions of the dataset be saved locally, which is potentially prohibitive when working with large dataset sizes and storage space and/or network limitations. For the purpose of this blog post, to benchmark GPU hardware and training optimizations, we use the second method, saving data on local solid state drives connected via NVMe to ensure GPU saturation despite processor differences. In general, before implementing the training optimizations described below, it is important to first ensure that dataloading does not bottleneck the overall training process.

Scaling Up

Naturally, increasing the capability and amount of compute available for model training yields direct benefits. To demonstrate the influence of hardware on run time, we present a simple 20-epoch training experiment using the same dataset on three different servers, shown in figure 2:

  1. Dell XE8545 with 4x NVIDIA A100-SXM4 40GB GPUs and an AMD EPYC 7763 with 64 cores
  2. Dell R750xa with 4x NVIDIA A100 80GB GPUs and an Intel Xeon Gold 5320 processor with 26 cores
  3. Dell XE9680 server with 8 NVIDIA HGX A100 80GB SXM4 GPUs and an Intel Xeon Platinum 8470 processor with 52 cores

Input data into the model shown in figure 2 were 512x512 with a patch size of 16. Batch size was 24 per GPU on the 40GB and 64 on the 80GB servers.

Parameters remain the same for each run, except that batch size has been increased to maximally utilize GPU memory on the R750xa and XE9680 compared with the XE8545. Gradient accumulation is performed to maintain a constant global batch size per model weight update for each run. We see a clear improvement in runtime as the hardware is scaled up, demonstrating how increased compute capability directly yields time savings that enable researchers to efficiently iterate on experiments and train effective models.

Figure 2. ViT-base training time across 20 epochs with 4xA100 40GB, 4xA100 80GB, and 8xA100 80GB servers
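The role of gradient accumulation in keeping the global batch size constant can be sketched as below. The per-GPU batch sizes and GPU counts come from the server descriptions above; the global batch size of 1,536 is a hypothetical value chosen only because it divides evenly on all three servers, and the helper name is illustrative.

```python
def accumulation_steps(global_batch, per_gpu_batch, num_gpus):
    """Micro-batches to accumulate per weight update to reach the global batch size."""
    per_step = per_gpu_batch * num_gpus
    if global_batch % per_step:
        raise ValueError("global batch must be a multiple of per_gpu_batch * num_gpus")
    return global_batch // per_step

GLOBAL_BATCH = 1536  # hypothetical target, not stated in this blog

servers = {
    "XE8545 (4x A100 40GB)": (24, 4),
    "R750xa (4x A100 80GB)": (64, 4),
    "XE9680 (8x A100 80GB)": (64, 8),
}

for name, (per_gpu_batch, num_gpus) in servers.items():
    steps = accumulation_steps(GLOBAL_BATCH, per_gpu_batch, num_gpus)
    print(f"{name}: accumulate {steps} micro-batches per update")
```

Larger GPU memory and more GPUs mean fewer accumulation steps per update, which is one reason the scaled-up servers finish the same training run faster.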

In conjunction with hardware, the sequence length of the data should be carefully considered for the application of interest. The selected tokenization scheme, such as the patch size chosen as input to a vision transformer, directly impacts the sequence length of the input data. For example, a patch size of 16 on a 1024x1024 image results in a sequence length of 4,096 (Height*Width/Patch Size^2), while a patch size of 8 results in a sequence length of 16,384. While GPUs increasingly feature more memory, that memory places an upper bound on the sequence length that can practicably be considered. Smaller patch sizes – and thus longer sequences – result in lower throughput because of smaller batch sizes and a greater number of computations, as shown in figure 3. However, larger image sizes coupled with smaller patch sizes are particularly relevant in the analysis of mammography and other applications in which fine-resolution features are of interest.

Figure 3. Average training samples per second (per GPU) for mammograms through a vision transformer by patch size
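The sequence-length formula can be checked with a few lines of code. The helper name is illustrative, and the count covers image patches only; any extra class token a particular model adds is not included.

```python
def vit_sequence_length(height, width, patch_size):
    """Number of patches: Height * Width / Patch Size^2."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)

for image_size, patch_size in [(1024, 16), (1024, 8), (384, 16), (2048, 32)]:
    length = vit_sequence_length(image_size, image_size, patch_size)
    print(f"{image_size}x{image_size}, patch {patch_size}: {length} tokens")
```

Halving the patch size quadruples the sequence length, which is why small patches on large images quickly become expensive.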

The data illustrated in figure 3 are taken from a run of twenty epochs using an image size of 512x512 and tested on an 8xA100 (80 GB) server.


Flash Attention – Experiments

Recently, Dao et al. published Flash Attention, a technique aimed at more efficiently accomplishing the computations involved within transformers by minimizing reads and writes between GPU high-bandwidth memory and the on-chip SRAM. Their reported findings are impressive, yielding 2-3x speedups during an attention forward and backward pass while also having 3-20x smaller memory requirements.
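To see why memory traffic matters, the sketch below estimates the size of the attention score matrix that a standard implementation materializes for a single image in a single layer. Flash Attention computes attention in tiles without storing this full matrix in high-bandwidth memory. The 12 heads match ViT-Base; 2 bytes per element (16-bit values) and the helper name are assumptions, and real training memory also includes activations, gradients, and optimizer state.

```python
def attention_scores_bytes(seq_len, heads, bytes_per_element=2, batch=1):
    """Bytes needed to hold one layer's full seq_len x seq_len score matrices."""
    return batch * heads * seq_len * seq_len * bytes_per_element

for seq_len in (576, 4096, 16384):
    mib = attention_scores_bytes(seq_len, heads=12) / 2**20
    print(f"sequence length {seq_len}: {mib:,.1f} MiB per image per layer")
```

The quadratic growth with sequence length is what makes long sequences impractical for standard attention well before compute becomes the limit.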

Using a Dell XE9680 server with 8 NVIDIA HGX A100 80GB SXM4 GPUs and an Intel Xeon Platinum 8470 processor with 52 cores, we provide a practical demonstration of potential applications for Flash Attention and vision transformers in healthcare. Specifically, we performed experiments to demonstrate how sequence length (determined by patch size and image size) and Flash Attention impact training time. To limit confounding variables, all images were pre-sized on disk and directly loaded into the vision transformer without any permutations. For the vision transformer, the ViT-Base from Huggingface was used. For Flash Attention, the Encoder from the x_transformers library was used, as shown in the code below.

All tests were carried out with the Huggingface trainer using an effective batch size of 128 per GPU, “brain” floating-point 16 (bfloat16) data, and twenty epochs, at patch sizes of 8, 16, and 32 with image sizes of 384, 512, 1024, and 2048.

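import torch.nn as nn
import torch.nn.functional as F
from x_transformers import ViTransformerWrapper, Encoder

class FlashViT(nn.Module):
    def __init__(self, img_size, patch_size):
        super().__init__()
        self.encoder = ViTransformerWrapper(
            image_size = img_size,
            patch_size = patch_size,
            num_classes = 2,
            attn_layers = Encoder(
                dim = 768,
                depth = 12,
                heads = 12,
                # Assumption: the original listing is truncated here; attn_flash
                # is the x_transformers option that enables Flash Attention.
                attn_flash = True,
            ),
        )

    def forward(self, pixel_values, labels):
        # pixel_values: [batch,channel,ht,wt] of pixel values
        # labels: labels for each image
        logits = self.encoder(pixel_values)
        return {'loss': F.cross_entropy(logits, labels), 'logits': logits}

# args.img_size and args.patch_size come from the training script's arguments
model = FlashViT(args.img_size, args.patch_size)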

Figure 4 demonstrates the pronounced benefit of using Flash Attention within a vision transformer with respect to model throughput. With the exception of the two smallest image sizes and largest patch size (and thus shortest sequence length), Flash Attention resulted in a marked speed increase across all other perturbations. The speed-up range across patch sizes was:

  • Patch size of 8: 3.0 - 4.2x
  • Patch size of 16: 2.8 - 4.0x
  • Patch size of 32: 0 - 2.3x

Figure 4. Samples per second throughput for a ViT-base vision transformer with and without Flash Attention across varying image sizes and patch sizes 

Another benefit demonstrated in these experiments is the additional image and patch size combinations achievable only with Flash Attention due to the reduced GPU memory requirement. Models without Flash Attention could handle an image size of 2,048 only with a patch size of 32 (sequence length of 4,096), whereas Flash Attention was also capable of running with patch sizes of 8 and 16. Even at shorter sequence lengths (576, from a 384x384 image with a patch size of 16), Flash Attention used 2.3x less memory. Use of Flash Attention will also be critical when considering larger transformer models, with ViT-Huge having more than 7x the parameters of ViT-Base. In conjunction with hardware that enables distributed training at scale, such as the Dell XE9680, these optimizations will enable new findings at unprecedented scales.



We have described methods by which the benefits of transformer-based models can be scaled to the longer sequences which medical data often require. Notably, we demonstrate the benefits of applying Flash Attention to a vision encoder. Flash Attention presents marked benefits from a modeling perspective, from shorter runtimes (and thus lower cost) to better image encoding (longer sequence lengths). Moreover, we show that these benefits scale substantially along with sequence length, making them indispensable for practitioners aiming to model the full complexity of hospital data. As machine learning continues to grow in healthcare, tight collaborations between hospitals and technology manufacturers are thus essential to allow for greater compute resources to input higher-quality data into machine learning models.



Jonathan Huang, MD/PhD Candidate, Research & Development, Northwestern Medicine

Matthew Wittbrodt, Solutions Architect, Research & Development, Northwestern Medicine

Alex Heller, Director, Research & Development, Northwestern Medicine

Mozziyar Etemadi, Clinical Director, Advanced Technologies, Northwestern Medicine

Bhavesh Patel, Sr. Distinguished Engineer, Dell Technologies

Bala Chandrasekaran, Technical Staff, Dell Technologies

Frank Han, Senior Principal Engineer, Dell Technologies

  • generative AI
  • LLM
  • Dell Validated Design for Generative AI with Nvidia
  • RAG
  • Retrieval Augmented Generation
  • Knowledge Base
  • Chatbot

Using Retrieval Augmented Generation (RAG) on a Custom PDF Dataset with Dell Technologies

David O'Dell

Fri, 20 Oct 2023 17:18:40 -0000



The Generative AI transformation

Artificial Intelligence is transforming the entire landscape of IT and our digital lives.  We’ve witnessed several major disruptions that have changed the course of technology over the past few decades. The birth of the internet, virtual reality, 3D printing, containerization, and more have contributed to major shifts in efficiency as well as the democratization of the tools required to create in those spaces.

Generative AI (GenAI) is now a major disruptor, forcing us all to rethink what efficiency leaps can, should, and should not be made with this new technology. 

On its current trajectory, the larger AI industry has the potential to change entire economies.  AI isn’t tremendously new.  Within the past decade, the bottlenecks that once held back progress have been removed by massive gains in GPU technology, abundant data availability, and vast oceans of distributed storage. 

Nowadays, we must differentiate between traditional AI – used to perform specific tasks and make predictions based on patterns – and GenAI – used to create new data that resembles human-like content.  

With GenAI large-language models (LLM) leading the pack of the latest AI innovations, let’s pause for just a moment and ask ourselves, “Why is it suddenly so popular?”, “How does it work?”, and, more importantly, “How can I make it work better?”  There’s no better way to answer these questions than by diving into the code that makes GenAI such a hot item.  


Our mission today with GenAI

If I were to ask you a random question, chances are you could answer it in a sophisticated, accurate, and grammatically correct way – a “human-like” way.  If I asked you about cars, chances are the topic of tires might come up since cars and tires have a strong relationship.   Your response probably wouldn’t contain anything about zebras since zebras and cars are not strongly related.   What if I asked you about a skyscraper?   The word, “building” is strongly related to skyscrapers, but why not the words, “moon” or “bird” - they’re in the sky too, right?  


To achieve a response that pleases us, we want an accurate answer presented to us in a human-like manner.  Those two concepts – “accuracy” and “human-like” – are the common threads woven throughout the code for all generative AI development.

A traditional AI response of “yes” or “no” can maintain high accuracy, but is that what I want to give to my users or my customers?  A dry, robotic, binary response?  Absolutely not.  I want a human-like response that provides context and additional help.  I want the response to solve my problem either directly through automated actions or indirectly by enabling me to help myself.   If we can’t get all of this, then why bother building any of it?  

Having a human-like response is of tremendous value and something the market desires. So how do we humanize a computer-created response? It takes brains.

Human brains are massive pattern-matching machines that rely on millions of physical neural connections.  AI essentially mirrors those physical connections in the form of numerical relationship strings called vectors.  Accuracy comes from thousands of interlocking relationships of general knowledge about individual things.  Each “thing” you feed an AI model, whether it’s a pixel or a word, is digitized and labeled as a vector that has unique value and location, either on an image or in a sentence.  

Once we digitize our content into a form the computer can digest, we can start to analyze it for patterns of relationships and eventually build a model that is good at providing accurate, human-like responses based on the relationships it was given.  


Defining the problem

There are dozens of major LLMs to choose from and thousands of homebrewed variants.  Each model supports unique features and use cases, so choosing the right one is vital.   Let’s first define our problem and then determine model selection.

Our example company would like to improve their overall customer experience when chatting with support. Besides improving the response time and providing a better self-help pathway, they would like to integrate their legacy and future knowledge base articles into a help desk chatbot that can respond to questions with new information obtained from their PDF dataset. In this example, we’ll use a collection of white papers and infographics from Dell Infohub.


Training a model from scratch vs. fine-tuning vs. RAG

If your industry is highly specialized and has a considerably unique vocabulary - such as legal, medical, or scientific – or your business requires a high level of privacy where intermingling publicly and privately sourced data is forbidden, training a model from scratch might be the route to take.

In most cases, using an existing open-source model and then fine-tuning it to enable a new task is preferred since it requires much less compute and saves a lot of time.  With the right balance of compute, storage, and software, fine-tuning an existing model can be extremely effective.  

If your response model is good and performs the task you want but could use some specific learning based on content from a custom document dataset – a knowledge base for example - then Retrieval Augmented Generation (RAG) would be a great candidate for this type of workload.  


Why use Retrieval Augmented Generation?

Retrieval Augmented Generation (RAG) is used in LLM applications to retrieve relevant knowledge base-style content, augment the user prompt with this domain-specific content, then feed both the prompt and content into the LLM to generate a more complete, useful response.  

Figure 1. Understanding how RAG works

So how does RAG work? Imagine yourself at a restaurant, asking the waiter a question about wine pairings.  In this analogy, you are the user who is inputting a question prompt, and the waiter is our LLM model.  The waiter certainly has basic pairing suggestions – “Red wine pairs well with beef” – but this response speaks nothing to centuries of pairing history, recent wine trends, or this restaurant’s decades of culture.  This very dry response also doesn’t account for the restaurant’s current inventory.  We wouldn’t want the waiter to suggest a wine that isn’t currently stocked.

With a RAG process, the waiter takes the question, retrieves relevant historical and up-to-date information specific to the restaurant (including inventory), and gathers it all together for the customer.  That being said, it’s not enough to just retrieve information.  Our customer needs the finer touch of a well-informed suggestion rather than being inundated with a bunch of articles or snippets about wine pairings.  That’s where the LLM shines.  

LLMs are excellent at taking large, disparate chunks of content, organizing them, and providing a human-like response.  The original question, along with all those wine and food snippets about pairing, are fed into the LLM whereby a more complete, useful response is given, all without having to relearn the basics. That is to say, our waiter didn’t have to become a sommelier to give this response. No retraining was required. The RAG process doesn’t require time-consuming training runs.   The LLM is trained prior to engaging in the process.  We simply made new domain-specific knowledge easier for the LLM to digest.


Peeking under the hood

This is all done by taking domain-specific knowledge bases (in this case pdf files), splitting them up intelligently, then encoding the textual content of these chunks into long numerical vectors.  
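To illustrate the splitting step, here is a minimal character-based splitter with overlapping chunks. It is a simplified stand-in and an assumption about the general idea only; a text splitter such as the one LangChain provides is smarter about breaking on paragraph and sentence boundaries. The function name, chunk size, and overlap are illustrative.

```python
def split_text(text, chunk_size=200, overlap=40):
    """Split text into fixed-size chunks that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

# Placeholder text standing in for the extracted contents of a PDF.
sample = "".join(chr(97 + i % 26) for i in range(500))
chunks = split_text(sample, chunk_size=200, overlap=40)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, so it can still be retrieved later.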

Vectors representing the original text go into a vector database that can be queried extremely quickly.  Vector databases come in a variety of types and use cases.  In this example, we’re using ChromaDB, a “pure” vector database that is designed to store and retrieve vectors from unstructured data, such as text, images, and files.  This is perfect for our use case where we are taking random text from unknown documents in a variety of formats and converting them to vectors that are used to build relationships between the prompt and chunks of content.  

In some ways, we can consider the vector database to be a form of long-term memory.  As we continue to add new content to it and keep it maintained, our LLM can refer to the contents as the database expands with new information.

Using the original question, the vector database is queried to find which chunks are most related to the original question.  The results are ranked for similarity whereby only the most relevant content is eligible to be fed into the LLM for response generation. 
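The query-and-rank step can be sketched without any external services. In this toy version, word counts stand in for learned embeddings and a sorted Python list stands in for the vector database; the chunk texts, prompt wording, and function names are all placeholders for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, top_k=2):
    """Rank chunks by similarity to the question and keep the best top_k."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

def build_prompt(question, context_chunks):
    """Augment the user question with the retrieved context."""
    context = "\n\n".join(context_chunks)
    return f"Use the context to answer the question.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"

# Placeholder chunks, not taken from any real knowledge base.
chunks = [
    "PowerEdge servers support NVIDIA GPUs for inference workloads.",
    "The cafeteria menu changes every Tuesday.",
    "vSAN provides local storage for VxRail nodes.",
]
question = "Which GPUs do PowerEdge servers support?"
print(build_prompt(question, retrieve(question, chunks, top_k=1)))
```

Only the top-ranked chunks reach the prompt, which keeps the context short and focused on the question.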


Choosing our LLM

We’ll be using the Meta Llama2 model since it can be quantized and run locally on-prem, comes in a variety of sizes, performs as well as or better than ChatGPT, and is available for free for commercial use. Llama2 also has a moderately large context window that allows users to introduce text in the form of sentences or entire documents and then generate responses from the new information.

Using Llama2 or other open-source LLMs on-prem allows full control over both your domain-specific content and any content that goes into the model, such as prompts or other proprietary information. On-prem models are also not executable objects on their own, so they lack the ability to send your private data back to the original authors.


Compute and Storage Environment

Our physical environment is on VMware vSphere as the hypervisor in a Dell APEX Private Cloud cluster with VxRail PowerEdge nodes, each with 3 Nvidia T4 GPUs. Storage is on the local vSAN. Our notebook server is running inside a PyTorch virtual environment on an Ubuntu 22.04 virtual machine with Miniconda. This can also be run on bare metal PowerEdge with any OS you choose as long as you can run Jupyter notebooks and have access to GPUs.


GenAI coding first steps

When running any sort of training or fine-tuning, you’ll need access to models, datasets, and monitoring so that you can compare the performance of the chosen task.   Luckily, there are free open-source versions of everything you need.  Simply set up an account on the following sites and create an API access token to pull the leading models and datasets into your notebooks.  

  • Hugging Face  –   incredibly valuable and widely used open-source python libraries, datasets, models, notebook examples, and community support  
  • Weights and Biases  –  free SaaS-based monitoring dashboard for model performance analysis 
  • Github  –  open-source libraries, tools, and notebook examples 

Along with those sites, at some point you will inevitably experiment with or use incarnations of the Meta (Facebook) Llama model.   You’ll need to fill out this simple permission form.


Setting up RAG on the Llama2 model with a custom PDF dataset

First, let’s log in to Huggingface so that we can access libraries, models, and datasets.

## code to auto login to hugging face, avoid the login prompt
!pip install -U huggingface-hub
# get your account token from
token = '<insert your token here>'
from huggingface_hub import login
login(token=token, add_to_git_credential=True)

With each notebook you run, you’ll end up installing, upgrading, and downgrading all sorts of libraries.  The versions shown may very well change over time with added or deprecated features.  If you run into version compatibility issues, try upgrading or downgrading the affected library.

!pip install torch
!pip install transformers
!pip install langchain
!pip install chromadb
!pip install pypdf
!pip install xformers
!pip install sentence_transformers
!pip install InstructorEmbedding
!pip install pdf2image
!pip install pycryptodome
!pip install auto-gptq

From our newly installed packages, let’s import some of the libraries we’ll need.  Most of these will be related to LangChain, an amazing tool for chaining LLM components together as previously seen in figure 1.  As you can see from the library titles, LangChain can connect our pdf loader and vector database and facilitate embeddings.

import torch
from auto_gptq import AutoGPTQForCausalLM
from langchain import HuggingFacePipeline, PromptTemplate
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from pdf2image import convert_from_path
from transformers import AutoTokenizer, TextStreamer, pipeline
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

Let’s check our NVIDIA GPU environment for availability, processes, and CUDA version. Here, we see three T4 GPUs. The driver supports CUDA versions up to 12.2, each device has about 16 GB of memory, and no other processes are running. Everything looks good to start our run.

| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01    CUDA Version: 12.2     |
| GPU  Name                 Persistence-M | Bus-Id         Disp.A | Volatile Uncorr. ECC |
| Fan  Temp    Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:0B:00.0 Off |                   Off |
| N/A   46C    P8              10W /  70W |      5MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
|   1  Tesla T4                       Off | 00000000:14:00.0 Off |                   Off |
| N/A   30C    P8              10W /  70W |      5MiB / 16384MiB |      0%       Default |
|                                         |                      |                  N/A |
|   2  Tesla T4                       Off | 00000000:1D:00.0 Off |                   Off |
| N/A   32C    P8               9W /  70W |      5MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
| Processes:                                                                             |
|  GPU    GI   CI        PID    Type   Process name                            GPU Memory |
|        ID    ID                                                              Usage      |
|  No running processes found                                                            |

Let’s check to make sure we can reach some of the pdfs in our repo by placing a pdf file page thumbnail image into an array and calling it for a preview.

pdf_images = convert_from_path("pdfs-dell-infohub/apex-navigator-for-multicloud-storage-solution-overview.pdf", dpi=100)


Let’s call the pdf directory loader from LangChain to get an idea of how many pages we are dealing with.

loader = PyPDFDirectoryLoader("pdfs-dell-infohub")
docs = loader.load()
len(docs)  # number of pages loaded

Next, let’s split the pages into chunks of useful data. The RecursiveCharacterTextSplitter prefers to break text on natural boundaries such as paragraph and line breaks rather than at arbitrary character offsets, and the downloaded hkunlp/instructor-large model produces the embeddings that let the retriever match our new content to a question. Here we see that we’ve split the documents into over 1,700 chunks.

embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large", model_kwargs={"device": DEVICE}
)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
texts = text_splitter.split_documents(docs)
len(texts)  # number of chunks
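
To make the chunk_size and chunk_overlap parameters concrete, here is a minimal pure-Python sketch of fixed-window chunking. It is a simplification and not LangChain's implementation: the real splitter also tries to break on separators before falling back to raw character counts. The helper name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size=1024, chunk_overlap=64):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut
    at one boundary still appears whole in a neighboring chunk.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "x" * 2500  # stand-in for the text of one PDF page
pieces = chunk_text(sample)
print([len(p) for p in pieces])  # [1024, 1024, 580]
```

A larger overlap raises the chance that an answer is contained in a single chunk, at the cost of storing and embedding more duplicated text.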

Next, we prepare our LLM to receive both the prompt and the relevant chunks from our LangChain retrieval process.  We’ll be using a variant of the Llama2 13 billion parameter model that provides memory optimized (quantized) revisions for us to download.  

model_name_or_path = "TheBloke/Llama-2-13B-chat-GPTQ"
model_basename = "model"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(

Since this is a chatbot, we need to interact with our model directly. We do this with Hugging Face’s pipeline function, which facilitates access to your model and creates a raw interactive chat session directly from your notebook code cells.

text_pipeline = pipeline(

Our prompt is vital in this entire effort. We need to tell the LLM how to behave with responses.  This system prompt, based on the default prompt for the Llama2 model, will provide enough instruction to get us human-like results once our content and question are fed into it.

SYSTEM_PROMPT = "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer."
template = generate_prompt(
Question: {question}
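
The prompt-building step can be sketched in plain Python. This sketch assumes the Llama 2 chat convention of wrapping the system prompt in <<SYS>> markers inside an [INST] block. The helper build_prompt is hypothetical and is not claimed to match the generate_prompt function used in the notebook, whose body is not shown here.

```python
SYSTEM_PROMPT = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer."
)


def build_prompt(body, system_prompt=SYSTEM_PROMPT):
    """Wrap a prompt body in Llama 2 chat markers (hypothetical helper)."""
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{body} [/INST]"


# {context} and {question} stay as placeholders until query time.
template = build_prompt("{context}\n\nQuestion: {question}")

filled = template.format(
    context="APEX Block Storage distributes data across availability zones.",
    question="Does APEX block storage support multi availability zones?",
)
print(filled)
```

Keeping the placeholders in the template means the same instructions are reused for every query, and only the retrieved context and the question change.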

Finally, our chain can be built.  LangChain links the retriever and our LLM together, then stuffs the document chunks into the prompt and passes it along as a normal query to the LLM while also asking for the source document chunks.

qa_chain = RetrievalQA.from_chain_type(
    retriever=vectordb.as_retriever(search_kwargs={"k": 2}),
    chain_type_kwargs={"prompt": prompt},
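
To illustrate what the chain does conceptually, here is a self-contained sketch of retrieval followed by prompt "stuffing". It is not LangChain's implementation: the embed function is a toy bag-of-words stand-in for a real embedding model, the chunks are invented examples, and every function name is hypothetical. The value k=2 mirrors the search_kwargs setting above.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def stuff_prompt(question, chunks, k=2):
    """Place the retrieved chunks into one prompt and keep them as sources."""
    sources = retrieve(question, chunks, k)
    prompt = "\n\n".join(sources) + f"\n\nQuestion: {question}"
    return prompt, sources


chunks = [
    "APEX Block Storage distributes data across three or more availability zones.",
    "PowerFlex Manager exposes a REST API for authentication.",
    "Dell Precision workstations target data scientists.",
]
prompt, sources = stuff_prompt("Does block storage support availability zones?", chunks)
print(sources[0])
```

Returning the sources alongside the prompt is what makes it possible to show the reader which chunks an answer was based on.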

Let’s ask it a question that could only be found in the new documents.  In this example, we’ve chosen a very specific question the Llama2 model was never trained on: “Does APEX block storage support multi availability zones?”  The response given is “yes”.  Even though the answer is positive, the model goes into a lot more detail on what it did find, giving us a very useful, human-like response.

We can also prove the source of truth by using the return_source_documents feature of LangChain and returning that in the next cell.  This way, there is no question of whether the response was part of a hallucination.

result = qa_chain("Does apex block storage support multi availability zones?")
 Based on the information provided in the text, the answer is yes. APEX Block Storage supports multi-availability zones (AZs). The text states that "data is distributed across three or more availability zones" using fault sets, providing resilience without replication or unnecessary copies of the data.
of data or needing to use replication across AZs.  Leveraging the native fault sets feature, data is distributed across three or more availability zones to offer additional protection against AZ failure.  
Well Suited for Challenging Workloads  
APEX Block Storage provides proven enterprise data services, such as thin provisioning, snapshots, replication, volume migration, and backup/restore to S3, which are needed to run mission-critical workloads confidently on the public cloud. With its extreme performance and scalability, APEX Block Storage is well suited to support very large databases, analytics workloads, and multiple container (Kubernetes) development and production deployments. And with the enterprise-class services and resilience provided in the cloud, APEX Block Storage for AWS and Microsoft Azure is the ideal solution to run your most challenging workloads in the public cloud with confidence that you can meet your SLAs.

More Information


Let’s ask an even more specific question: “Provide a curl example of code to authenticate my PowerFlex.”

The LLM delivers a very pleasant response that is well-formatted, polite, and useful.  

result = qa_chain("provide a curl example of code to authenticate my powerflex")
 Based on the provided context, here is an example of how to authenticate with PowerFlex using curl:
curl --location --request POST 'https://<PFXM>/rest/auth/login' \
 --header 'Content-Type: application/json' \
 --data-raw '{"username": "<username>", "password": "<password>"}'
This example uses the `POST` method to send a login request to the PowerFlex Manager, specifying the username and password in the request body. The `--header` option sets the `Content-Type` header to `application/json`, and the `--data-raw` option sends the login credentials in JSON format.
Note that you will need to replace `<PFXM>` with the actual hostname or IP address of your PowerFlex Manager, and `<username>` and `<password>` with your PowerFlex username and password, respectively.


Bringing AI to your data

We’ve shown how the mission of GenAI is to provide accurate, human-like responses and that RAG is a low-impact method to augment those responses with your own custom content.  We went through some of the tools that help facilitate this process, such as LangChain, vector databases, and the LLM model itself.

To really make these models shine, you need to apply your own data, which means having data sovereignty as well as secure access to your data.  You simply can’t afford to have private data leak, potentially being captured and uncontrollably exposed globally.  

Your unique data has immense value.  Dell is here to help bring AI to your data and achieve the best possible results with preconfigured or custom solutions catered to your business needs regardless of trajectory, whether it’s using RAG, fine-tuning, or training from scratch.

With Dell Validated Design for Generative AI with Nvidia, customers can optimize the deployment speed of a modular, secure, and scalable AI platform.  Dell PowerEdge servers deliver high performance and extreme reliability and can be purchased in a variety of ways, including bare metal, preconfigured with popular cloud stacks like our APEX Cloud Platforms, and as a subscription through Dell APEX.  Simplify your structured or unstructured data expansion for GenAI with PowerFlex, PowerScale or ObjectScale, deployed on prem or as a subscription in the major cloud providers.  Dell doesn’t just stop at the data center. With Dell Precision AI workstations in the workplace, data scientists can speed innovation on the most intensive workloads.  

If you have any questions or need expert assistance, Dell Professional Services can help craft an enterprise GenAI strategy for high-value use cases and the roadmap to achieve them.

Dell enables you to maintain data sovereignty and control while simplifying GenAI processes, providing the outcomes you demand with the flexible financing options you deserve.  



Author: David O’Dell, Technical Marketing Engineer, AI and Solutions

  • AI
  • deep learning
  • HPC

Training an AI Radiologist with Distributed Deep Learning

Luke Wilson PhD

Wed, 07 Dec 2022 13:25:02 -0000



Originally published on Aug 16, 2018 11:14:00 AM 

The potential of neural networks to transform healthcare is evident. From image classification to dictation and translation, neural networks are achieving or exceeding human capabilities. And they are only getting better at these tasks as the quantity of data increases.

But there’s another way in which neural networks can potentially transform the healthcare industry: Knowledge can be replicated at virtually no cost. Take radiology as an example: To train 100 radiologists, you must teach each individual person the skills necessary to identify diseases in x-ray images of patients’ bodies. To make 100 AI-enabled radiologist assistants, you take the neural network model you trained to read x-ray images and load it into 100 different devices.

The hurdle is training the model. It takes a large amount of cleaned, curated, labeled data to train an image classification model. Once you’ve prepared the training data, it can take days, weeks, or even months to train a neural network. Even once you’ve trained a neural network model, it might not be smart enough to perform the desired task. So, you try again. And again. Eventually, you will train a model that passes the test and can be used out in the world.

Workflow for developing neural network models

In this post, I’m going to talk about how to reduce the time spent in the Train/Test/Tune cycle by speeding up the training portion with distributed deep learning, using a test case we developed in Dell EMC’s HPC & AI Innovation Lab to classify pathologies in chest x-ray images. Through a combination of distributed deep learning, optimizer selection, and neural network topology selection, we were able not only to speed up model training from days to minutes but also to improve the classification accuracy significantly.

Starting Point: Stanford University’s CheXNet

We began by surveying the landscape of AI projects in healthcare, and Andrew Ng’s group at Stanford University provided our starting point. CheXNet was a project to demonstrate a neural network’s ability to accurately classify cases of pneumonia in chest x-ray images.

The dataset that Stanford used was ChestXray14, which was developed and made available by the United States’ National Institutes of Health (NIH). The dataset contains over 120,000 images of frontal chest x-rays, each potentially labeled with one or more of fourteen different thoracic pathologies. The data set is very unbalanced, with more than half of the data set images having no listed pathologies.

Stanford decided to use DenseNet, a neural network topology which had just been announced as the Best Paper at the 2017 Conference on Computer Vision and Pattern Recognition (CVPR), to solve the problem. The DenseNet topology is a deep network of repeating blocks of convolutions whose layers are linked by dense (concatenating) skip connections. Blocks end with a batch normalization, followed by some additional convolution and pooling to link the blocks. At the end of the network, a fully connected layer is used to perform the classification.

An illustration of the DenseNet topology (source: Kaggle)

Stanford’s team used a DenseNet topology with the layer weights pretrained on ImageNet and replaced the original ImageNet classification layer with a new fully connected layer of 14 neurons, one for each pathology in the ChestXray14 dataset. 

Building CheXNet in Keras

It sounds like it would be difficult to set up. Thankfully, Keras (provided with TensorFlow) offers a simple, straightforward way of taking standard neural network topologies and bolting on new classification layers.

from tensorflow import keras
from keras.applications import DenseNet121

orig_net = DenseNet121(include_top=False, weights='imagenet', input_shape=(256,256,3)) 

In this code snippet, we are importing the original DenseNet neural network (DenseNet121) and removing the classification layer with the include_top=False argument. We also automatically import the pretrained ImageNet weights and set the image size to 256x256, with 3 channels (red, green, blue).

With the original network imported, we can begin to construct the classification layer. If you look at the illustration of DenseNet above, you will notice that the classification layer is preceded by a pooling layer. We can add this pooling layer back to the new network with a single Keras function call, and we can call the resulting topology the neural network's filters, or the part of the neural network which extracts all the key features used for classification. 

from keras.layers import GlobalAveragePooling2D

filters = GlobalAveragePooling2D()(orig_net.output) 
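
For intuition, global average pooling collapses each channel of a feature map to a single number by averaging over all spatial positions. The following pure-Python sketch shows the arithmetic on nested lists. It illustrates the operation only and is not how Keras implements the layer; the function name is hypothetical.

```python
def global_average_pool(feature_maps):
    """Average each channel over all spatial positions.

    feature_maps is a nested list shaped [height][width][channels];
    the result is a flat list with one value per channel.
    """
    height = len(feature_maps)
    width = len(feature_maps[0])
    channels = len(feature_maps[0][0])
    totals = [0.0] * channels
    for row in feature_maps:
        for pixel in row:
            for c, value in enumerate(pixel):
                totals[c] += value
    return [t / (height * width) for t in totals]


# A 2x2 feature map with 3 channels collapses to 3 numbers.
fmap = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 4.0, 2.0], [7.0, 0.0, 2.0]],
]
print(global_average_pool(fmap))  # [4.0, 1.0, 2.0]
```

Because the output length depends only on the number of channels, the classification layer that follows sees a fixed-size vector regardless of the spatial size of the feature map.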

The next task is to define the classification layer. The ChestXray14 dataset has 14 labeled pathologies, so we have one neuron for each label. We also activate each neuron with the sigmoid activation function, and use the output of the feature filter portion of our network as the input to the classifiers. 

from keras.layers import Dense

classifiers = Dense(14, activation='sigmoid', bias_initializer='ones')(filters)  

The choice of sigmoid as an activation function is due to the multi-label nature of the data set. For problems where only one label ever applies to a given image (e.g., dog, cat, sandwich), a softmax activation would be preferable. In the case of ChestXray14, images can show signs of multiple pathologies, and the model should rightfully identify high probabilities for multiple classifications when appropriate.
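
The difference between the two activations is easy to see numerically. This small sketch, using made-up logits, applies both functions to the same inputs: sigmoid scores each label independently, while softmax forces the labels to compete for a total probability of 1.

```python
import math


def sigmoid(logits):
    """Independent per-label probabilities; they need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """Mutually exclusive class probabilities; they always sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three pathologies; the first two are both strong.
logits = [3.0, 3.0, -2.0]
print([round(p, 3) for p in sigmoid(logits)])
print([round(p, 3) for p in softmax(logits)])
```

With sigmoid, both strong labels score above 0.9. With softmax, the same two labels split the probability mass and neither reaches 0.5, which would understate an image that genuinely shows two pathologies.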

Finally, we can put the feature filters and the classifiers together to create a single, trainable model.

from keras.models import Model  
chexnet = Model(inputs=orig_net.inputs, outputs=classifiers)  

With the final model configuration in place, the model can then be compiled and trained.

Accelerating the Train/Test/Tune Cycle with Distributed Deep Learning

To produce better models sooner, we need to accelerate the Train/Test/Tune cycle. Because testing and tuning are mostly sequential, training is the best place to look for potential optimization.

How exactly do we speed up the training process? In Accelerating Insights with Distributed Deep Learning, Michael Bennett and I discuss the three ways in which deep learning can be accelerated by distributing work and parallelizing the process:

  • Parameter server models such as in Caffe or distributed TensorFlow,
  • Ring-AllReduce approaches such as Uber’s Horovod, and
  • Hybrid approaches for Hadoop/Spark environments such as Intel BigDL.

Which approach you pick depends on your deep learning framework of choice and the compute environment that you will be using. For the tests described here we performed the training in house on the Zenith supercomputer in the Dell EMC HPC & AI Innovation Lab. The ring-allreduce approach enabled by Uber’s Horovod framework made the most sense for taking advantage of a system tuned for HPC workloads, and which takes advantage of Intel Omni-Path (OPA) networking for fast inter-node communication. The ring-allreduce approach would also be appropriate for solutions such as the Dell EMC Ready Solutions for AI, Deep Learning with NVIDIA.

The MPI-RingAllreduce approach to distributed deep learning

Horovod is an MPI-based framework for performing reduction operations between identical copies of the otherwise sequential training script. Because it is MPI-based, you will need to be sure that an MPI compiler (mpicc) is available in the working environment before installing horovod.
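
The ring-allreduce pattern itself can be simulated in a single Python process. The sketch below is an illustration of the general algorithm, not Horovod's implementation, which runs across separate processes with a communication library. Each worker's vector is pre-split into one chunk per worker, and the function name is hypothetical.

```python
def ring_allreduce(worker_data):
    """Sum equal-length vectors across workers arranged in a ring.

    worker_data[i] is worker i's local vector, pre-split into one chunk
    (a list of floats) per worker. Returns every worker's final chunks.
    """
    n = len(worker_data)
    data = [[list(chunk) for chunk in worker] for worker in worker_data]

    # Phase 1, scatter-reduce: after n - 1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for sender, idx, payload in sends:
            receiver = (sender + 1) % n
            data[receiver][idx] = [a + b for a, b in zip(data[receiver][idx], payload)]

    # Phase 2, allgather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for sender, idx, payload in sends:
            data[(sender + 1) % n][idx] = list(payload)
    return data


workers = [
    [[1.0], [2.0], [3.0]],
    [[10.0], [20.0], [30.0]],
    [[100.0], [200.0], [300.0]],
]
result = ring_allreduce(workers)
print(result[0])  # [[111.0], [222.0], [333.0]]
```

Each worker only ever talks to its ring neighbor and sends one chunk per step, so the per-worker traffic stays roughly constant as workers are added. That property is why the approach suits clusters with fast inter-node links.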

Adding Horovod to a Keras-defined Model

Adding Horovod to any Keras-defined neural network model only requires a few code modifications:

  1. Initializing the MPI environment,
  2. Broadcasting initial random weights or checkpoint weights to all workers,
  3. Wrapping the optimizer function to enable multi-node gradient summation,
  4. Averaging metrics among workers, and
  5. Limiting checkpoint writing to a single worker.

Horovod also provides helper functions and callbacks for optional capabilities that are useful when performing distributed deep learning, such as learning-rate warmup/decay and metric averaging.

Initializing the MPI Environment

Initializing the MPI environment in Horovod only requires calling the init method:

import horovod.keras as hvd
hvd.init()

This will ensure that the MPI_Init function is called, setting up the communications structure and assigning ranks to all workers.

Broadcasting Weights

Broadcasting the neuron weights is done using a callback passed to the Keras Model.fit method. In fact, many of Horovod’s features are implemented as callbacks to Model.fit, so it’s worthwhile to define a callback list object for holding all the callbacks.

callbacks = [ hvd.callbacks.BroadcastGlobalVariablesCallback(0) ] 

You’ll notice that the BroadcastGlobalVariablesCallback takes a single argument that’s been set to 0. This is the root worker, which will be responsible for reading checkpoint files or generating new initial weights, broadcasting weights at the beginning of the training run, and writing checkpoint files periodically so that work is not lost if a training job fails or terminates.

Wrapping the Optimizer Function

The optimizer function must be wrapped so that it can aggregate error information from all workers before executing. Horovod’s DistributedOptimizer function can wrap any optimizer which inherits Keras’ base Optimizer class, including SGD, Adam, Adadelta, Adagrad, and others.

import keras.optimizers  
opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta(lr=1.0)) 

The distributed optimizer will now use an allreduce collective (MPI_Allreduce) to aggregate the gradients from each training batch onto all workers, rather than collecting them only on the root worker. This allows the workers to update their models independently rather than waiting for the root to re-broadcast updated weights before beginning the next training batch.
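
The effect of aggregating across workers can be shown with a toy data-parallel training loop. This is a sketch under simple assumptions (a one-parameter model, plain gradient descent, made-up data) and is not Horovod code. The key point it demonstrates is that replicas starting from the same weights and applying the same averaged gradient stay identical without any re-broadcast.

```python
def local_gradient(weight, batch):
    """Gradient of mean squared error for the model y = weight * x."""
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)


def distributed_step(weights, batches, lr=0.05):
    """One training step: each worker computes a gradient on its own batch,
    the gradients are averaged across workers, and every worker applies
    the same update to its own replica."""
    grads = [local_gradient(w, b) for w, b in zip(weights, batches)]
    avg = sum(grads) / len(grads)  # stands in for the collective's result
    return [w - lr * avg for w in weights]


# Two workers start from identical (broadcast) weights but see different data.
weights = [0.0, 0.0]
batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
for _ in range(200):
    weights = distributed_step(weights, batches)
print(weights)
```

Both replicas converge to the same value near 2.0, the slope of the underlying data, even though neither worker ever saw the other's batch.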

Averaging Metrics

Between steps, error metrics need to be averaged to calculate the global loss. Horovod provides another callback for this, called MetricAverageCallback.

callbacks = [ hvd.callbacks.BroadcastGlobalVariablesCallback(0),
              hvd.callbacks.MetricAverageCallback() ]

This will ensure that optimizations are performed on the global metrics, not the metrics local to each worker.
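
Conceptually, metric averaging is a per-metric mean over what each worker measured locally. The following sketch uses hypothetical metric values and a hypothetical helper name. It shows the idea only and does not reproduce the callback's internals.

```python
def average_metrics(per_worker_metrics):
    """Average each named metric across workers."""
    names = per_worker_metrics[0].keys()
    count = len(per_worker_metrics)
    return {
        name: sum(m[name] for m in per_worker_metrics) / count for name in names
    }


# Hypothetical end-of-epoch metrics reported by four workers.
local = [
    {"loss": 0.40, "accuracy": 0.80},
    {"loss": 0.50, "accuracy": 0.75},
    {"loss": 0.30, "accuracy": 0.85},
    {"loss": 0.40, "accuracy": 0.80},
]
print(average_metrics(local))
```

Without this step, a decision such as reducing the learning rate on a plateau would be driven by whichever slice of data one worker happened to see.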

Writing Checkpoints from a Single Worker

When using distributed deep learning, it’s important that only one worker write checkpoint files to ensure that multiple workers writing to the same file does not produce a race condition, which could lead to checkpoint corruption.

Checkpoint writing in Keras is enabled by another callback to Model.fit. However, we only want to call this callback from one worker instead of all workers. By convention, we use worker 0 for this task, but technically we could use any worker. The one good thing about worker 0 is that even if you decide to run your distributed deep learning job with only 1 worker, that worker will be worker 0.

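callbacks = [ ... ]
if hvd.rank() == 0:
    # Only worker 0 writes checkpoints; the file name pattern is a hypothetical example.
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))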
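
The single-writer rule can be demonstrated without any distributed framework. The sketch below simulates four ranks in one process and writes to a temporary directory. The function name and the file name pattern are hypothetical, and the plain-text file stands in for a real checkpoint format.

```python
import os
import tempfile


def maybe_write_checkpoint(rank, epoch, weights, directory):
    """Write a checkpoint only from rank 0; other ranks do nothing."""
    if rank != 0:
        return None
    path = os.path.join(directory, f"checkpoint-{epoch}.txt")
    with open(path, "w") as handle:
        handle.write(" ".join(str(w) for w in weights))
    return path


# Simulate four workers finishing the same epoch with identical weights.
with tempfile.TemporaryDirectory() as tmp:
    written = [maybe_write_checkpoint(r, 1, [0.1, 0.2], tmp) for r in range(4)]
    files_on_disk = os.listdir(tmp)
print(files_on_disk)  # ['checkpoint-1.txt']
```

Because all replicas hold the same weights after each synchronized update, a single copy from one rank is sufficient to resume the whole job.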

Result: A Smarter Model, Faster!

Once a neural network can be trained in a distributed fashion across multiple workers, the Train/Test/Tune cycle can be sped up dramatically.

The figure below shows exactly how dramatically. The three tests shown are the training speed of the Keras DenseNet model on a single Zenith node without distributed deep learning (far left), the Keras DenseNet model with distributed deep learning on 32 Zenith nodes (64 MPI processes, 2 MPI processes per node, center), and a Keras VGG16 version using distributed deep learning on 64 Zenith nodes (128 MPI processes, 2 MPI processes per node, far right). By using 32 nodes instead of a single node, distributed deep learning was able to provide a 47x improvement in training speed, taking the training time for 10 epochs on the ChestXray14 data set from 2 days (50 hours) to less than 2 hours!

Performance comparisons of Keras models with distributed deep learning using Horovod
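
As a quick check on the reported figures, the arithmetic can be worked through directly. The sketch uses only numbers stated in the text (50 hours, a 47x improvement, 64 MPI processes on 32 nodes). The efficiency calculation assumes the single-node baseline ran as one process, which the text does not specify, so treat that figure as indicative only.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear scaling achieved."""
    return speedup / workers


baseline_hours = 50.0   # single-node time for 10 epochs, from the text
speedup = 47.0          # reported improvement on 32 nodes
processes = 64          # 2 MPI processes on each of 32 nodes

distributed_hours = baseline_hours / speedup
print(f"distributed time: {distributed_hours:.2f} h")
print(f"efficiency per process: {parallel_efficiency(speedup, processes):.0%}")
```

The result is a little over one hour, which is consistent with the statement that training dropped to less than 2 hours.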

The VGG variant, trained on 64 Zenith nodes (128 MPI processes), completed in less than an hour the same number of epochs that the single-node DenseNet version needed, although it required more epochs overall to finish training. It also converged to a higher-quality solution: this VGG-based model outperformed the baseline, single-node model in 4 of 14 conditions and achieved nearly 90% accuracy in classifying emphysema.

Accuracy comparison of baseline single-node DenseNet model vs VGG variant with Horovod


In this post, we’ve shown you how to accelerate the Train/Test/Tune cycle when developing neural network-based models by speeding up the training phase with distributed deep learning. We walked through the process of transforming a Keras-based model to take advantage of multiple nodes using the Horovod framework, and how these few simple code changes, coupled with some additional compute infrastructure, can reduce the time needed to train a model from days to minutes, allowing more time for the testing and tuning pieces of the cycle. More time for tuning means higher-quality models, which means better outcomes for patients, customers, or whoever will benefit from the deployment of your model.

Lucas A. Wilson, Ph.D. is the Chief Data Scientist in Dell EMC's HPC & AI Innovation Lab. (Twitter: @lucasawilson)


Natural Language Processing

Benita Mordi Amir Bahmanyari

Tue, 02 Mar 2021 16:10:04 -0000



 “Hey Google, do I look good today?”

“You’re more stunning than a new router fresh out of the box.”

“Aww, thank you!”

“You’re welcome.” 

Oh, the joys of natural language processing, and one of many short conversations some of us have with our smart home or personal assistant devices.

The AI subfield of Natural Language Processing (NLP) trains computers to understand human language so that computers can communicate using the same language. The interdisciplinary studies of theoretical computer science, principles of linguistics, and artificial intelligence (AI) that are focused on natural human language and human-machine interactions, brought about what we know today as NLP. Linguistics provides the formula for language such as semantics, syntax, vocabulary, grammar and phrases, while computer science and machine/deep learning transform these linguistic formulas into the NLP algorithm itself.

Common examples of NLP in use today include:

  • Email spam detection or document classification
  • Website chatbots
  • Automated voice response systems (IVR/AVR) on support calls
  • Support and marketing use cases analyze written text on the Internet, in support tickets, on social media platforms, and more to determine if the content contains positive or negative sentiment about a product or service.
  • Real-time translation of a language to another such as in Google Translate.
  • Search made simple such as with Google Search
  • On-demand spell checking such as in Microsoft Word
  • On-demand next word prediction found in messaging applications such as on mobile phones.
  • In drug trials where text is scanned to determine overlap in intellectual property during drug development.
  • Personal assistant agents such as Siri, Alexa, Cortana, and Google Assistant

In the case of personal assistants as an example, NLP in action looks like the following:

  1. You ask Siri: “What’s the weather today?”
  2. Siri collects your question in audio format and converts it to text, which is processed for understanding.
  3. Based on that understanding, a response is created, converted to audio, and then delivered to you.  

Algorithmically, NLP starts with understanding the syntax of the text to extract the grammatical sense from the arrangement of words. This is the easier task, because most languages have clearly defined grammatical rules that can be used to train the algorithms. When the syntax is understood, the algorithm works to infer meaning, nuance, and semantics, which is a harder task because language is not a precise science. The same thing can be said in multiple ways and still have the same meaning in and across multiple languages.

Tools and frameworks

Tools and frameworks that support the implementation of NLP applications, like those mentioned earlier, must be able to derive high-quality information from analyzed text through Text Mining. The components of text mining enable NLP to carry out the following operations:

  • Noise removal: Extraction of useful data
  • Tokenization: Identification and key segmentation of the useful data
  • Normalization: Translation of text into equivalent numerical values appropriate for a computer to understand
  • Pattern classification: Discovery of relevance in segmented data pieces and their classification
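
The first three operations in the list can be sketched in a few lines of standard-library Python. This is a deliberately tiny illustration with a made-up stopword list and hypothetical function names. Production frameworks handle far more cases, such as stemming, subword tokens, and language-specific rules.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "to", "my"}  # tiny illustrative list


def remove_noise(text):
    """Noise removal: strip markup-like tags and punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text):
    """Tokenization: split into words and drop stopwords."""
    return [t for t in text.split() if t not in STOPWORDS]


def normalize(tokens, vocabulary):
    """Normalization: map each token to an integer id, growing the vocabulary."""
    return [vocabulary.setdefault(t, len(vocabulary)) for t in tokens]


vocab = {}
ids = normalize(tokenize(remove_noise("<p>The laptop is slow; the fan is LOUD!</p>")), vocab)
print(ids, vocab)
```

The integer ids produced by the last step are the kind of numerical input that a pattern classification model can then consume.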

Common NLP frameworks with the capabilities that are described above are listed below. The intricacies of these frameworks are outside the scope of this blog; go to the following sites to learn more.


We know where NLP came from and some of its applications today, but where is it going and is it ready for wider adoption? What we understand about most existing AI algorithms is that they are suitable for narrow implementations where they carry out a very specific task. Such algorithms are considered Artificial Narrow Intelligence rather than Artificial General Intelligence, where the latter implies expertise at many things. Most AI has yet to fully grasp context, including time, space, and causality, the way humans do. NLP is no exception.

For example, an Internet search can return irrelevant results that do not answer our questions, because NLP is excellent at parsing large amounts of data for similarities in content but weaker at understanding intent. Then, there is the nuance of spoken language mentioned before and the variance in language rules across languages and even domains. These factors make training for complete accuracy difficult. Some ways to address this might be larger data sets, more infrastructure to train, and perhaps model-based training versus the use of neural networks. However, these come with their own challenges.

At Dell, we have successfully deployed NLP in our tech support center applications, where agents write quick descriptions of a customer’s issues and the application returns predictions for the next best troubleshooting step. 3,000 agents use the tool to service over 10,000 customers per day.

We use NLP techniques on input text to generate a format that the AI model can use and have employed K-nearest neighbor (KNN) clustering and logistic regressions for predictions. Microservice APIs are in place to pass information to agents as well. To address the concerns around text as input, we worked with our subject matter experts from the tech support space to identify Dell-specific lingo, which we used to develop a library of synonyms where different entries could mean the same thing. This helped greatly with cleaning up data, providing data to train, and helped us group similar words for context.  
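
To illustrate how a synonym library and a nearest-neighbor vote can work together, here is a small self-contained sketch. Everything in it is hypothetical: the synonym entries, the ticket history, the troubleshooting steps, and the word-overlap similarity are invented for illustration and do not describe the production system.

```python
from collections import Counter

# Hypothetical synonym library: different entries that mean the same thing.
SYNONYMS = {"bsod": "bluescreen", "wifi": "wireless", "wlan": "wireless"}

# Hypothetical labeled history: (agent description, next troubleshooting step).
HISTORY = [
    ("laptop bsod after update", "roll back driver"),
    ("bluescreen on boot", "roll back driver"),
    ("wifi keeps dropping", "reset network adapter"),
    ("wlan not connecting", "reset network adapter"),
    ("wireless slow", "reset network adapter"),
]


def features(text):
    """Bag of words after mapping synonyms onto one canonical term."""
    return Counter(SYNONYMS.get(tok, tok) for tok in text.lower().split())


def predict_next_step(description, k=3):
    """Vote among the k past tickets that share the most words."""
    query = features(description)
    scored = sorted(
        HISTORY,
        key=lambda item: sum((query & features(item[0])).values()),
        reverse=True,
    )
    votes = Counter(step for _, step in scored[:k])
    return votes.most_common(1)[0][0]


print(predict_next_step("customer reports wifi dropping"))
```

Mapping "wifi" and "wlan" onto one term before comparing lets tickets written in different jargon count as neighbors, which is the benefit the synonym library is described as providing.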

For a high-turnover role (support agents), we were able to train new agents to be successful sooner by making their onboarding process easier. The support application’s ability to provide the right information quickly lessened the time spent browsing large amounts of irrelevant information, which can lead to disgruntled customers and frustrated agents. We saw a 10% reduction in the time it took for customers to be serviced. The solution made it possible to feed newly discovered issues to our engineering teams when agents reported or searched for new technical issues with which we were not already familiar. The flow also worked in reverse, from engineering to support agents.

Our research teams at Dell are actively feeding our findings on neural machine translations into the open-source community: one of our current projects is work on AI Voice Synthesis, where NLP works so well you can’t tell that a computer is speaking!

For more information about natural language processing (BERT) MLPerf benchmark ratings for Dell PowerEdge platforms, visit the linked blog posts, then reach out to Dell’s Emerging Tech Team for help with NLP  projects in your organization.

  • PowerEdge
  • GPU
  • MLPerf

Dell EMC Servers Offer Excellent Deep Learning Performance with the MLPerf™ Training v1.1 Benchmark

Frank Han Rakshith Vasudev Dharmesh Patel

Wed, 01 Dec 2021 21:31:51 -0000




Dell Technologies has submitted results to the MLPerf Training benchmarking suite for the fifth round. This blog provides an overview of our submissions for the latest version, v1.1. Submission results indicate that different Dell EMC servers (Dell EMC DSS8440, PowerEdge R750xa, and PowerEdge XE8545 servers) offer promising performance for deep learning workloads. These workloads are across different problem types such as image classification, medical image segmentation, lightweight object detection, heavyweight object detection, speech recognition, natural language processing, recommendation, and reinforcement learning.

The previous blog about MLPerf v1.0 contains an introduction to MLCommons™ and the benchmarks in the MLPerf training benchmarking suite. We recommend that you read this blog for an overview of the benchmarks. All the benchmarks and rules remain the same as for v1.0.

The following graph with an exponentially scaled y axis indicates time to converge for the servers and benchmarks in question:


Fig 1: All Dell Technologies submission results for MLPerf Training v1.1

Figure 1 shows the 51 results that Dell Technologies submitted in this round. These results cover the Dell EMC DSS8440, PowerEdge R750xa, and PowerEdge XE8545 servers with various NVIDIA A100 accelerator configurations: PCIe and SXM4 form factors, 40 GB and 80 GB VRAM versions, and 300 W, 400 W, and 500 W TDP variants.

Note: For the hardware and software specifications of the systems in the graph, see

The submitted benchmarks span image classification, medical image segmentation, lightweight object detection, heavyweight object detection, speech recognition, natural language processing, recommendation, and reinforcement learning. In all these areas, the performance of the Dell EMC DSS8440, PowerEdge R750xa, and PowerEdge XE8545 servers is outstanding.


Full coverage

Dell Technologies not only submitted the most results but also submitted comprehensive results from a single system. PowerEdge XE8545 x 4 A100-SXM-80GB server results include submissions across the full spectrum of benchmarked models in the MLPerf Training v1.1 suite: BERT, DLRM, Mask R-CNN, Minigo, ResNet, SSD, RNN-T, and 3D U-Net.

Multinode results

The multinode results scale linearly or nearly linearly. This scaling makes Dell EMC servers in a multinode environment conducive to a faster time to value. Furthermore, among the submitters with NVIDIA accelerator-based results, we are one of three that submitted multinode results.
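One way to check a linear-scaling claim for your own runs is to compare time to converge on one node with time to converge on several nodes. The following is a minimal sketch, not part of the MLPerf tooling: the function name scaling_efficiency and the example times are hypothetical placeholders, not Dell or MLPerf results. An efficiency close to 1.0 corresponds to linear scaling.

```python
def scaling_efficiency(single_node_minutes, multi_node_minutes, node_count):
    """Return (speedup, efficiency) of a multinode run relative to one node.

    speedup    = single-node time / multinode time
    efficiency = speedup / node_count (1.0 means perfectly linear scaling)
    """
    if single_node_minutes <= 0 or multi_node_minutes <= 0:
        raise ValueError("times must be positive")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    speedup = single_node_minutes / multi_node_minutes
    efficiency = speedup / node_count
    return speedup, efficiency


if __name__ == "__main__":
    # Hypothetical placeholder times in minutes; substitute your own measurements.
    speedup, efficiency = scaling_efficiency(100.0, 52.0, 2)
    print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.1%}")
```

Time to converge is the quantity to compare because it is what the benchmark reports; throughput alone can look linear even when convergence is not.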

Improvements from v1.0 to v1.1

Updates for the Dell Technologies v1.1 submission include:

  • The v1.1 submission includes results from the PowerEdge R750xa server. The PowerEdge R750xa server offers compelling performance, well suited for artificial intelligence, machine learning, and deep learning training and inferencing workloads.
  • Our results include numbers for 10 GPUs with 80 GB A100 variants on the Dell EMC DSS8440 server. The results for 10 GPUs are useful because more GPUs in a server help to train the model faster when training is constrained to a single-node environment.

Fig 2: Performance comparison of BERT between v1.0 and v1.1 across Dell EMC DSS8440 and PowerEdge XE8545 servers

We observed a performance improvement from v1.0 to v1.1 with the BERT model, especially on the PowerEdge XE8545 server. While many deep learning workloads performed similarly in v1.0 and v1.1, the many results that we submitted help customers understand the performance difference across versions.
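A version-to-version comparison like this one can be expressed as the relative reduction in time to converge. The sketch below is an illustration only: the function name time_reduction and the times are hypothetical placeholders, not the values plotted in Figure 2.

```python
def time_reduction(old_minutes, new_minutes):
    """Return the fractional reduction in time to converge (0.2 means 20% faster)."""
    if old_minutes <= 0 or new_minutes <= 0:
        raise ValueError("times must be positive")
    return (old_minutes - new_minutes) / old_minutes


if __name__ == "__main__":
    # Hypothetical placeholder times in minutes for the same server and model.
    v1_0_minutes = 40.0
    v1_1_minutes = 32.0
    reduction = time_reduction(v1_0_minutes, v1_1_minutes)
    print(f"time to converge reduced by {reduction:.1%}")
```

A negative return value means the newer version took longer to converge, so the same function also flags regressions.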


  • We submitted a significant number of results (51). They help customers observe performance with different Dell EMC servers across various configurations. A higher number of results helps customers understand which server performance enables a faster time to solution across different configuration types, benchmarks, and multinode settings.
  • Among the submitters with NVIDIA accelerator-based results, we are one of three that submitted multinode results. It is imperative to understand scaling performance across multiple servers as deep learning compute needs continue to increase with different kinds of deep learning models and parallelism techniques.
  • PowerEdge XE8545 x 4 A100-SXM-80GB server results include all the models in the MLPerf v1.1 benchmark.
  • PowerEdge R750xa server results were published for this round; they offer excellent performance.

Next steps

In future blogs, we plan to compare the performance of systems that use an NVLink Bridge with systems that do not.
