The following table lists the system configuration and software stack used for generative AI validation:
Table 5. System configuration and software stack
Component | Details |
Hardware | |
Compute server for inferencing | PowerEdge R760xa |
GPUs | 4 x NVIDIA H100 PCIe GPUs |
Network adapter | NVIDIA ConnectX-6 25 Gb Ethernet |
Network switch | 2 x PowerSwitch S5248F-ON |
Software | |
Operating system | Ubuntu 22.04.1 LTS |
Cluster management | NVIDIA cluster management software |
Kubernetes | Upstream Kubernetes v1.24.9 |
GPU operator | NVIDIA GPU operator v22.9.2 |
Inference server | NVIDIA Triton Inference Server v23.04 |
AI framework | NVIDIA NeMo Container v23.04 |
GPT models are advanced language models known for their impressive text generation and natural language processing capabilities. They use the Transformer architecture, which allows them to understand complex language patterns and relationships. These models are pretrained on vast amounts of text data and can be fine-tuned for specific tasks, enabling them to perform well in various NLP applications.
In this validated design, we validated the following generative AI models. Most of the NeMo GPT models were validated with Triton Inference Server; we ran inference on the BLOOM, Llama 2, and Stable Diffusion models using standard Python or PyTorch containers available from NVIDIA NGC.
Table 6. Generative AI models and validation container
Model | Description | Base container used for validation |
NeMo GPT 20B | Transformer-based language model with 20 B total trainable parameters | Triton - nvcr.io/ea-bignlp/bignlp-inference:22.08-py3; NeMo - nvcr.io/nvidia/nemo:23.04 |
NeMo GPT-2B-001 | Transformer-based language model with 2 B total trainable parameters | Triton - nvcr.io/ea-bignlp/bignlp-inference:22.08-py3; NeMo - nvcr.io/nvidia/nemo:23.04 |
NeMo GPT 1.3B | Transformer-based language model with 1.3 B total trainable parameters | Triton - nvcr.io/ea-bignlp/bignlp-inference:22.08-py3; NeMo - nvcr.io/nvidia/nemo:23.04 |
NeMo GPT-345M | Transformer-based language model with 345 M total trainable parameters | Triton - nvcr.io/ea-bignlp/bignlp-inference:22.08-py3; NeMo - nvcr.io/nvidia/nemo:23.04 |
NeMo GPT 5B | Transformer-based language model with 5 B total trainable parameters | Triton - nvcr.io/ea-bignlp/bignlp-inference:22.08-py3; NeMo - nvcr.io/nvidia/nemo:23.04 |
BLOOM | BigScience Large Open-science Open-access Multilingual Language Model | nvcr.io/nvidia/cuda:12.1.0-devel-ubi8 |
Llama 2 7B | Optimized transformer model with 7 B parameters | nvcr.io/nvidia/pytorch:23.06-py3 |
Llama 2 13B | Optimized transformer model with 13 B parameters | nvcr.io/nvidia/pytorch:23.06-py3 |
Stable Diffusion | Text-to-image generation model | python:latest |
We deployed and validated the preceding models in several scenarios, described in the following sections. Not all models were validated in every scenario because of the limitations noted in those sections.
To optimize a NeMo model's throughput and latency, it can be converted to the FasterTransformer (FT) format, which includes performance modifications to the encoder and decoder layers of the Transformer architecture. The conversion is performed by launching the NeMo Docker container with the command described in this blog post. We deployed all the NeMo models listed in Table 6 with Triton Inference Server, except for the NeMo GPT-2B-001 model, which we deployed using the inference server described in its model card.
The following figure shows the models in production using Triton Inference Server:
Figure 7. Models in production using Triton Inference Server
We validated the model by asking simple questions, as shown in the following figure:
Figure 8. Prompt and response with a NeMo GPT model
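Requests can also be sent programmatically with the Triton Python client. The following is a minimal sketch, not the exact commands we ran; the endpoint URL, model name, and the input and output tensor names are assumptions that depend on how the FasterTransformer backend and its config.pbtxt were set up.
# Minimal sketch of querying a NeMo GPT model served by Triton Inference Server.
# The server URL, model name, and tensor names are assumptions; check the model's
# config.pbtxt for the actual names and data types.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")   # assumed endpoint

# Token IDs would normally come from the model's tokenizer; hard-coded here
# purely for illustration.
input_ids = np.array([[31373, 11, 995]], dtype=np.uint32)          # assumed token IDs
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
output_len = np.array([[32]], dtype=np.uint32)

inputs = []
for name, data in [("input_ids", input_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", output_len)]:
    tensor = httpclient.InferInput(name, list(data.shape), "UINT32")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

result = client.infer(model_name="nemo_gpt_20b", inputs=inputs)    # assumed model name
print(result.as_numpy("output_ids"))                               # assumed output name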
The BLOOM model, in its various versions, was developed through the BigScience Workshop. BigScience is inspired by other open-science initiatives in which researchers pool their time and resources to collectively achieve a higher impact. The BLOOM architecture is similar to GPT-3 (an autoregressive model for next-token prediction) but was trained on 46 natural languages and 13 programming languages. Several smaller versions of the model have been trained on the same dataset.
We deployed the BLOOM model on a PowerEdge R760xa server using the instructions available on the GitHub page. We used the latest versions of the required libraries so that the model could be deployed on NVIDIA H100 GPUs.
The following figure shows the model inference:
Figure 9. Model inference using BLOOM
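As a quick functional check, a smaller BLOOM checkpoint can also be loaded directly with the Hugging Face transformers library. The following sketch is illustrative; the bigscience/bloom-560m checkpoint and the prompt are assumptions, and the full-size model follows the same pattern but must be sharded across the GPUs (for example, with device_map="auto" and the accelerate package).
# Minimal sketch of BLOOM inference with Hugging Face transformers.
# The checkpoint name is an assumption chosen for brevity; larger checkpoints
# follow the same pattern but need multi-GPU sharding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",          # places layers on the available GPUs
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))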
Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Llama 2 is available from Hugging Face; accessing the repository requires accepting the Meta license agreement. We deployed the model in a PyTorch 23.06 container from NVIDIA NGC after installing the packages that Llama requires, as listed in its requirements.txt file.
The following figure is an example of running inference of the Llama model:
Figure 10. Model inference for Llama
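A similar check can be run for Llama 2 with the transformers text-generation pipeline once the Meta license has been accepted and a Hugging Face access token is configured. This is a minimal sketch; the checkpoint name and the prompt are assumptions.
# Minimal sketch of Llama 2 inference with the transformers text-generation pipeline.
# Access to meta-llama checkpoints is gated; the model ID below assumes the license
# has been accepted and a Hugging Face access token is available.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-hf",   # assumed checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
)

result = generator("Explain what a large language model is in one sentence.",
                   max_new_tokens=64, do_sample=False)
print(result[0]["generated_text"])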
Stable Diffusion is a deep learning model that performs text-to-image tasks with a focus on generating detailed images based on text descriptions. Additionally, this versatile model can be extended to handle various tasks, including inpainting, outpainting, and image-to-image translations guided by textual prompts.
We followed the instructions on GitHub to download and run the model. We used the following Dockerfile:
FROM python:latest

# Remove CUDA library stubs, if present, so they do not shadow the driver
# libraries provided by the NVIDIA container runtime.
RUN rm -rf /usr/local/cuda/lib64/stubs

# Install the pinned Python dependencies, pulling CUDA 11.8 wheels for PyTorch.
COPY requirements.txt /
RUN pip install -r requirements.txt \
    --extra-index-url https://download.pytorch.org/whl/cu118

# Run as an unprivileged user and prepare the cache, input, and output directories.
RUN useradd -m huggingface
USER huggingface
WORKDIR /home/huggingface
ENV USE_TORCH=1
RUN mkdir -p /home/huggingface/.cache/huggingface \
    && mkdir -p /home/huggingface/input \
    && mkdir -p /home/huggingface/output

# Copy the entry-point script and the Hugging Face access token into the image.
COPY docker-entrypoint.py /usr/local/bin
COPY token.txt /home/huggingface
ENTRYPOINT [ "docker-entrypoint.py" ]
We used the following requirements.txt file:
diffusers[torch]==0.17.1
onnxruntime==1.15.1
safetensors==0.3.1
torch==2.0.1+cu118
transformers==4.30.1
xformers==0.0.20
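Inside the container, the text-to-image step itself reduces to a few lines with the diffusers library. The following is a minimal sketch assuming the runwayml/stable-diffusion-v1-5 checkpoint; the docker-entrypoint.py script referenced in the Dockerfile wraps logic along these lines.
# Minimal sketch of Stable Diffusion text-to-image generation with diffusers.
# The checkpoint name is an assumption; gated checkpoints also require a
# Hugging Face access token.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe("Astronaut riding a horse on Mars").images[0]
image.save("output/astronaut.png")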
The following figure is an image generated using the prompt "Astronaut riding a horse on Mars":
Figure 11. Stable Diffusion generated image for prompt "Astronaut riding a horse on Mars"
Model validation using AI21 RTE
We used the Language Model Evaluation Test Suite from AI21 Labs, specifically the RTE (Recognizing Textual Entailment) suite, to validate NeMo model inference. The goal is to reproduce the results made available in the NeMo model card. RTE includes datasets that prompt for a True or False response.
The following text is an example of a prompt and the response:
Prompt: Tropical Storm Irene on August 11, 2005 at 16:15 UTC. Tropical Storm Irene will increase in strength over the next several days, possibly developing into a hurricane that will hit the east coast of the United States, said the National Hurricane Center of Miami, Florida in a report today. Irene was located approximately 975 kilometers south-southeast of Bermuda at 16:00 UTC today. Forecasters say that the storm is now moving in a west-northwest direction with top sustained winds of 40 miles per hour. A storm called Irene is going to approach the east coast of the US. True or False?
Response: True
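Scoring the RTE suite amounts to checking how often the model's True or False answer matches the reference label. The following sketch is illustrative only; query_model is a hypothetical stand-in for the actual inference call (for example, the Triton client request shown earlier), and the dataset handling is a placeholder.
# Illustrative sketch of computing an RTE-style accuracy score.
# query_model() is a hypothetical stand-in for the deployed model's inference call.
def query_model(prompt: str) -> str:
    raise NotImplementedError("send the prompt to the deployed model here")

def rte_accuracy(examples):
    """examples: iterable of (prompt, gold_label) pairs, gold_label in {"True", "False"}."""
    correct = 0
    total = 0
    for prompt, gold in examples:
        answer = query_model(prompt).strip()
        # Keep only the first word so that "True." or "True, because..." still counts.
        predicted = answer.split()[0].rstrip(".,") if answer else ""
        correct += int(predicted.lower() == gold.lower())
        total += 1
    return correct / total if total else 0.0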
The following table shows the score that we measured. Comparing this score to other models provides a mechanism to compare the accuracy of the models.
Table 7. RTE score
Model | RTE score |
NeMo GPT 20B | 0.527076 |
NeMo GPT-2B-001 | 0.519856 |
We validated the NeMo models for binary text classification using the Stanford Politeness Corpus dataset, which contains text phrases labeled as polite or impolite. We evaluated the models with a fixed test dataset of 700 phrases; each prompt asks the model to classify the phrase as polite or impolite.
The following text is an example of a prompt and the response:
Prompt: What do you mean? How am I supposed to vindicate myself of this ridiculous accusation? Question: polite or impolite?
Response: Impolite
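Given the model's polite or impolite prediction for each of the 700 test phrases, the per-label precision, recall, and F1 scores shown in the next table can be computed with scikit-learn; the two label lists below are placeholders.
# Illustrative sketch of computing per-label precision, recall, and F1 with
# scikit-learn. The two lists are placeholders for the 700 reference labels
# and the corresponding model predictions.
from sklearn.metrics import classification_report

gold_labels = ["polite", "impolite", "polite"]      # placeholder reference labels
predictions = ["polite", "impolite", "impolite"]    # placeholder model outputs

print(classification_report(gold_labels, predictions,
                            labels=["polite", "impolite"], digits=4))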
The following table summarizes the scores that we obtained for the NeMo models. An explanation of the performance metrics can be found here.
Table 8. Score summary
Model | Label | Precision | Recall | F1 |
NeMo GPT-345M | Polite | 95.36 | 93.51 | 94.43 |
NeMo GPT-345M | Impolite | 93.29 | 95.21 | 94.24 |
NeMo GPT-2B-001 | Polite | 94.67 | 92.21 | 93.42 |
NeMo GPT-2B-001 | Impolite | 92.00 | 94.52 | 93.24 |
NeMo GPT 20B | Polite | 94.08 | 92.86 | 93.46 |
NeMo GPT 20B | Impolite | 92.57 | 93.84 | 93.20 |
Measuring the accuracy of LLMs is vital for assessing their performance, comparing different models, and tracking progress in the field of NLP. It will guide users in determining which foundation model to use for further customization and fine-tuning. Additionally, accurate evaluation metrics validate research findings, aid in model optimization, and address potential biases or ethical concerns. Ultimately, accurate assessment ensures informed and responsible deployment of LLMs in diverse real-world applications.
Every day, newer and better-performing LLMs are released. Furthermore, the evaluation metrics for LLMs have evolved over time to address the unique challenges posed by these intricate models and to assess their performance more effectively across a wide spectrum of natural language processing tasks. Our validation focused primarily on publicly available NeMo models. In the future, we will update this section to incorporate recently released models, using metrics and tools that are widely recognized and accepted in the field. We will also update the section to demonstrate the deployment of other openly available models from Hugging Face using Triton Inference Server.