The following tables list the system configurations and software stack used for generative AI validation:
Table 8. System configuration
| Component | Configuration 1 | Configuration 2 | Configuration 3 |
|---|---|---|---|
| Compute server for inferencing | 4 x PowerEdge R760xa | 2 x PowerEdge XE9680 | 2 x PowerEdge XE8640 |
| GPUs per server | 4 x NVIDIA H100 PCIe GPUs | 8 x NVIDIA H100 SXM GPUs | 4 x NVIDIA H100 SXM GPUs |
| Network adapter | NVIDIA ConnectX-6 25 Gb Ethernet | NVIDIA ConnectX-6 25 Gb Ethernet | NVIDIA ConnectX-6 25 Gb Ethernet |
| Network switch | 2 x PowerSwitch S5248F-ON | 2 x PowerSwitch S5248F-ON | 2 x PowerSwitch S5248F-ON |
Table 9. Software components and versions
| Component | Details |
|---|---|
| Operating system | Ubuntu 22.04.1 LTS |
| Cluster management | NVIDIA Base Command Manager Essentials 9.2 |
| Kubernetes | Upstream Kubernetes v1.24.9 |
| GPU operator | NVIDIA GPU Operator v22.9.2 |
| Inference server | NVIDIA Triton Inference Server v23.04 |
| AI framework | NVIDIA NeMo container v23.04 |
GPT models are Transformer-based language models known for their strong text generation and natural language processing capabilities. The Transformer architecture allows them to capture complex language patterns and relationships. These models are pretrained on vast amounts of text data and can be fine-tuned for specific tasks, enabling them to perform well across a wide range of NLP applications.
In this validated design, we validated the generative AI models shown in Table 10. We validated the NeMo GPT models and the Stable Diffusion model with Triton Inference Server on all three server models. We ran inference on BLOOM 7B, Llama 2 7B, and Llama 2 13B using standard Python or PyTorch containers available from NVIDIA NGC on all three server models.
The Llama 2 70B model in a standard configuration for inference uses model parallelism, where groups of model layers are spread across multiple GPUs. This model uses a model-parallel (MP) parameter of 8, which requires eight GPUs. Therefore, we deployed this model in its standard configuration on a single PowerEdge XE9680 server that offers eight H100 GPUs connected by NVSwitch.
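As an illustration, the following sketch follows the reference implementation in Meta's llama repository: eight processes are launched with torchrun so that each of the eight model-parallel shards maps to one H100 GPU. The checkpoint directory, tokenizer path, and prompt are placeholders:

```python
# Launched with one process per GPU so that the eight model-parallel
# shards of the 70B checkpoint each land on their own H100, for example:
#   torchrun --nproc_per_node 8 llama2_70b_inference.py
from llama import Llama  # reference implementation from Meta's llama repository

generator = Llama.build(
    ckpt_dir="llama-2-70b/",           # placeholder: directory holding the 8 MP shards
    tokenizer_path="tokenizer.model",  # placeholder: Llama 2 tokenizer file
    max_seq_len=512,
    max_batch_size=4,
)

results = generator.text_completion(
    ["The PowerEdge XE9680 is"],  # placeholder prompt
    max_gen_len=64,
    temperature=0.6,
    top_p=0.9,
)
print(results[0]["generation"])
```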
Table 10. Generative AI models and validation container
| Model | Description | Base container used for validation |
|---|---|---|
| NeMo GPT 20B | Transformer-based language model with 20 B total trainable parameters | Triton: nvcr.io/ea-bignlp/bignlp-inference:22.08-py3; NeMo: nvcr.io/nvidia/nemo:23.04 |
| NeMo GPT-2B-001 | Transformer-based language model with 2 B total trainable parameters | Triton: nvcr.io/ea-bignlp/bignlp-inference:22.08-py3; NeMo: nvcr.io/nvidia/nemo:23.04 |
| NeMo GPT 1.3B | Transformer-based language model with 1.3 B total trainable parameters | Triton: nvcr.io/ea-bignlp/bignlp-inference:22.08-py3; NeMo: nvcr.io/nvidia/nemo:23.04 |
| NeMo GPT-345M | Transformer-based language model with 345 M total trainable parameters | Triton: nvcr.io/ea-bignlp/bignlp-inference:22.08-py3; NeMo: nvcr.io/nvidia/nemo:23.04 |
| NeMo GPT 5B | Transformer-based language model with 5 B total trainable parameters | Triton: nvcr.io/ea-bignlp/bignlp-inference:22.08-py3; NeMo: nvcr.io/nvidia/nemo:23.04 |
| BLOOM 7B | BigScience Large Open-science Open-access Multilingual Language Model with 7 B parameters | nvcr.io/nvidia/cuda:12.1.0-devel-ubi8 |
| Llama 2 7B | Optimized transformer model with 7 B parameters | nvcr.io/nvidia/pytorch:23.06-py3 |
| Llama 2 13B | Optimized transformer model with 13 B parameters | nvcr.io/nvidia/pytorch:23.06-py3 |
| Llama 2 70B | Optimized transformer model with 70 B parameters | nvcr.io/nvidia/pytorch:23.06-py3 |
| Stable Diffusion | Text-to-image generation model | Triton: nvcr.io/ea-bignlp/bignlp-inference:22.08-py3; python:latest |
We deployed and validated the preceding models in several scenarios. Not all the models were validated in every scenario due to limitations noted in the following sections.
To optimize throughput and latency, a NeMo model can be converted to the FasterTransformer (FT) format, which applies performance optimizations to the encoder and decoder layers of the transformer architecture. This conversion is performed by launching the NeMo Docker container with the command specified in this blog post. We deployed all the NeMo models listed in Table 10 with Triton Inference Server, except for the NeMo GPT-2B-001 model, which we deployed using the inference server described in the model card.
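As a rough sketch of the conversion step, the example below shells out to a NeMo-to-FasterTransformer converter script. The script path and flag names are assumptions modeled on the FasterTransformer GPT examples and may differ from the container referenced in the blog post:

```python
import subprocess

# Hypothetical invocation of the NeMo-to-FasterTransformer converter.
# Path and flags are assumptions based on the FasterTransformer GPT
# examples; verify them against the container used in your deployment.
subprocess.run(
    [
        "python",
        "/opt/FasterTransformer/examples/pytorch/gpt/utils/nemo_ckpt_convert.py",
        "--in-file", "nemo_gpt_20b.nemo",          # source NeMo checkpoint (placeholder)
        "--saved-dir", "triton-model-repo/gpt/1",  # FT weights for Triton (placeholder)
        "--infer-gpu-num", "4",                    # tensor-parallel degree at inference
        "--weight-data-type", "fp16",
    ],
    check=True,
)
```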
The following figure shows the models in production using Triton Inference Server:
Figure 5. Models in production using Triton Inference Server
We validated the model by asking simple questions, as shown in the following figure:
Figure 6. Prompt and response with a NeMo GPT model
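Deployed models can also be queried programmatically with the tritonclient Python package. The sketch below targets a FasterTransformer-backed GPT model; the model name, tensor names, and token IDs are assumptions that must match the deployed model's configuration:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to the Triton HTTP endpoint (placeholder host/port).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Token IDs for the prompt; a real client would run the model's tokenizer first.
input_ids = np.array([[31373, 995]], dtype=np.uint32)  # placeholder token IDs
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
output_len = np.array([[32]], dtype=np.uint32)

inputs = []
for name, data in [("input_ids", input_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", output_len)]:
    tensor = httpclient.InferInput(name, data.shape, "UINT32")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

# "gpt" is a placeholder; use the model name registered in the repository.
result = client.infer("gpt", inputs)
print(result.as_numpy("output_ids"))
```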
The BLOOM model, in its various versions, was developed through the BigScience Workshop. BigScience is inspired by other open-science initiatives in which researchers pool their time and resources to collectively achieve a higher impact. The BLOOM architecture is similar to GPT-3 (an autoregressive model for next-token prediction) but was trained on 46 natural languages and 13 programming languages. Several smaller versions of the model were trained on the same dataset.
We deployed the BLOOM 7B model on a PowerEdge R760xa server using the instructions available on the GitHub page. We used the latest versions of the required libraries so that the model could be deployed on NVIDIA H100 GPUs.
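A minimal inference sketch for this deployment, assuming the bigscience/bloom-7b1 checkpoint from Hugging Face and the transformers and accelerate libraries (the prompt and generation settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load BLOOM 7B in half precision onto the available GPU(s).
model_id = "bigscience/bloom-7b1"  # assumed checkpoint name on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Dell PowerEdge servers are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=48, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```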
The following figure shows the model inference:
Figure 7. Model inference using BLOOM
Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Llama 2 is available from Hugging Face, and accessing the repository requires accepting Meta's license agreement. We deployed the model in the PyTorch 23.06 container from NVIDIA NGC after installing the packages required by Llama, as listed in its requirements.txt file.
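A minimal inference sketch for the smaller Llama 2 variants, assuming the meta-llama/Llama-2-7b-hf checkpoint on Hugging Face (access requires the accepted Meta license) and the transformers pipeline API:

```python
import torch
from transformers import pipeline

# Text-generation pipeline for Llama 2 7B; the checkpoint name assumes the
# gated Hugging Face repository has been unlocked with the Meta license.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

result = generator(
    "Generative AI on PowerEdge servers",  # placeholder prompt
    max_new_tokens=64,
    do_sample=True,
    top_p=0.9,
)
print(result[0]["generated_text"])
```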
The following figure shows an example of running inference with the Llama 2 model:
Figure 8. Model inference for Llama
Stable Diffusion is a deep learning model that performs text-to-image tasks with a focus on generating detailed images based on text descriptions. Additionally, this versatile model can be extended to handle various tasks, including inpainting, outpainting, and image-to-image translations guided by textual prompts.
We followed the instructions on GitHub to download and run the model. We used the following Dockerfile:
```dockerfile
FROM python:latest
RUN rm -rf /usr/local/cuda/lib64/stubs
COPY requirements.txt /
RUN pip install -r requirements.txt \
    --extra-index-url https://download.pytorch.org/whl/cu118
RUN useradd -m huggingface
USER huggingface
WORKDIR /home/huggingface
ENV USE_TORCH=1
RUN mkdir -p /home/huggingface/.cache/huggingface \
    && mkdir -p /home/huggingface/input \
    && mkdir -p /home/huggingface/output
COPY docker-entrypoint.py /usr/local/bin
COPY token.txt /home/huggingface
ENTRYPOINT [ "docker-entrypoint.py" ]
```

We used the following requirements.txt file:

```text
diffusers[torch]==0.17.1
onnxruntime==1.15.1
safetensors==0.3.1
torch==2.0.1+cu118
transformers==4.30.1
xformers==0.0.20
```
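A minimal generation sketch using the pinned diffusers release, assuming the runwayml/stable-diffusion-v1-5 checkpoint; the model ID and output file name are illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load Stable Diffusion in half precision on a single GPU.
model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint name
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate an image from the same prompt as Figure 9.
image = pipe("Astronaut riding a horse on Mars").images[0]
image.save("astronaut_horse_mars.png")
```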
The following figure shows an image generated using the prompt “Astronaut riding a horse on Mars”:
Figure 9. Image generated by Stable Diffusion for prompt “Astronaut riding a horse on Mars”
We used the Language Model Evaluation Test Suite from AI21 Labs, specifically the RTE suite, to validate NeMo model inference. The goal was to reproduce the results published in the NeMo model card. RTE includes datasets that prompt for a True or False response.
The following text is an example of a prompt and the response:
Prompt: Tropical Storm Irene on August 11, 2005 at 16:15 UTC. Tropical Storm Irene will increase in strength over the next several days, possibly developing into a hurricane that will hit the east coast of the United States, said the National Hurricane Center of Miami, Florida in a report today. Irene was located approximately 975 kilometers south-southeast of Bermuda at 16:00 UTC today. Forecasters say that the storm is now moving in a west-northwest direction with top sustained winds of 40 miles per hour. A storm called Irene is going to approach the east coast of the US. True or False?
Response: True
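The RTE score is the fraction of prompts for which the model's True/False answer matches the gold label. A minimal sketch of that bookkeeping, assuming a generate() callable that wraps whichever deployed model is under test:

```python
def rte_accuracy(examples, generate):
    """Compute RTE accuracy.

    examples: list of (prompt, gold_label) pairs, labels being 'True'/'False'.
    generate: callable mapping a prompt string to the model's text response.
    """
    correct = 0
    for prompt, gold in examples:
        answer = generate(prompt).strip()
        # Count a hit when the response starts with the gold label.
        if answer.lower().startswith(gold.lower()):
            correct += 1
    return correct / len(examples)
```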
The following table shows the scores that we measured. Comparing these scores with those of other models provides a mechanism for comparing model accuracy.
Table 11. RTE score
| Model | RTE score |
|---|---|
| NeMo GPT 20B | 0.527076 |
| NeMo GPT-2B-001 | 0.519856 |
We validated the NeMo models for binary text classification using the Stanford Politeness Corpus dataset, which contains text phrases labeled as polite or impolite. We evaluated the models using a fixed test dataset of 700 phrases, prompting the LLMs to classify the sentiment of each phrase.
The following text is an example of a prompt and the response:
Prompt: What do you mean? How am I supposed to vindicate myself of this ridiculous accusation? Question: polite or impolite?
Response: Impolite
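The per-label scores reported below are standard precision, recall, and F1 values. A minimal sketch of how they can be computed with scikit-learn (an assumption; the validation may have used different tooling):

```python
from sklearn.metrics import precision_recall_fscore_support

# Gold labels and model predictions for the test phrases (placeholder data).
y_true = ["polite", "impolite", "polite", "impolite"]
y_pred = ["polite", "impolite", "impolite", "impolite"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=["polite", "impolite"]
)
for label, p, r, f in zip(["polite", "impolite"], precision, recall, f1):
    print(f"{label}: precision={p:.2%} recall={r:.2%} f1={f:.2%}")
```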
The following table summarizes the score we obtained for the NeMo models. An explanation of the performance metrics can be found here.
Table 12. Score summary
| Model | Label | Precision | Recall | F1 |
|---|---|---|---|---|
| NeMo GPT-345M | Polite | 95.36 | 93.51 | 94.43 |
| NeMo GPT-345M | Impolite | 93.29 | 95.21 | 94.24 |
| NeMo GPT-2B-001 | Polite | 94.67 | 92.21 | 93.42 |
| NeMo GPT-2B-001 | Impolite | 92.00 | 94.52 | 93.24 |
| NeMo GPT 20B | Polite | 94.08 | 92.86 | 93.46 |
| NeMo GPT 20B | Impolite | 92.57 | 93.84 | 93.20 |
Measuring the accuracy of LLMs is vital for assessing their performance, comparing different models, and tracking progress in the field of NLP. It guides users in determining which foundation model to use for further customization and fine-tuning. Additionally, accurate evaluation metrics validate research findings, aid in model optimization, and help address potential biases or ethical concerns. Ultimately, accurate assessment ensures informed and responsible deployment of LLMs in diverse real-world applications.
Every day, newer and better-performing LLMs are released. Furthermore, evaluation metrics for LLMs have evolved over time to tackle the unique challenges posed by these intricate models and to assess their performance more effectively across a wide spectrum of natural language processing tasks. Our validation focused primarily on publicly available NeMo models. In the future, we will update this section to incorporate recently released models, using metrics and tools that are widely recognized and accepted in the field. We will also update the section to demonstrate the deployment of other open-source models from Hugging Face using Triton Inference Server.