Part III | How to Run Llama-2 via Hugging Face Models on AMD ROCm™ with Dell PowerEdge™?

Scalers AI, Mohan Rokkam, Delmar Hernandez

Thu, 09 Nov 2023 23:21:47 -0000


PowerEdge R7615

AMD Instinct MI210 Accelerator


In our second blog, we provided a step-by-step guide on how to get models running on AMD ROCm™, set up TensorFlow and PyTorch, and deploy GPT-2. In this guide, we explore how to set up a leading large language model (LLM), Llama-2, using Hugging Face.

Dell™ PowerEdge™ offers a rich portfolio of AMD ROCm™ solutions, including Dell™ PowerEdge™ R7615, R7625, and R760xa servers. 

We implemented the following Dell PowerEdge system configuration:

Operating system: Ubuntu 22.04.3 LTS

Kernel version: 5.15.0-86-generic

Docker Version: Docker version 24.0.6, build ed223bc

ROCm version: 5.7

Server: Dell PowerEdge R7615

CPU: AMD EPYC™ 9354P 32-Core Processor

GPU: AMD Instinct™ MI210


Step-by-Step Guide

1. First, install the AMD ROCm™ driver, libraries, and tools. Follow the detailed installation instructions for your Linux-based platform.

To ensure these installations are successful, check the GPU info using `rocm-smi`.

2. Next, we will select code snippets from Hugging Face, which offers the most comprehensive set of developer tools for leading AI models. Follow the steps in Blog II to start the AMD ROCm™ PyTorch Docker container.

Running a chatbot with the Llama2-7B-chat model and Gradio ChatInterface:

The Llama-2-7b-chat model from Hugging Face is a large language model developed by Meta AI, designed for text generation tasks. It is part of the Llama2 series, featuring 6.74 billion parameters, and is primarily used for creating AI chatbots and generating human-like text.

Gradio ChatInterface is Gradio's high-level abstraction for creating chatbot UIs; it allows you to create a web-based demo around the Llama2-7B-chat model in a few lines of code.

Install Prerequisites:

Bash


pip3 install transformers sentencepiece accelerate gradio protobuf
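To confirm the prerequisites installed correctly, you can check the package versions from Python (a quick sanity check; exact versions will vary with your environment):

Python

import transformers
import gradio

# Print installed versions; any recent releases of both should work for this guide
print(transformers.__version__, gradio.__version__)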

Request an access token:

Request access to Llama-2 7B Chat Model:  Llama-2-7B-Chat-HF

Log in to Hugging Face CLI and enter your access token when prompted:

Bash


huggingface-cli login
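Alternatively, if you prefer to authenticate from within Python, the Hugging Face Hub library provides a login helper (a minimal sketch; the "hf_..." string below is a placeholder for your own access token):

Python

from huggingface_hub import login

# Paste your own access token here; "hf_..." is a placeholder
login(token="hf_...")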

Run the following Python code:

Python


import time

import torch
import gradio as gr
from transformers import LlamaForCausalLM, LlamaTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
torch_dtype = torch.bfloat16
max_new_tokens = 500

# Initialize and load the tokenizer and model
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name, torch_dtype=torch_dtype, device_map="auto")


def chat(message, history):
    input_text = message

    # Encode the input text using the tokenizer
    encoded_input = tokenizer.encode(input_text, return_tensors='pt')
    # On ROCm builds of PyTorch, the 'cuda' device maps to the AMD GPU
    encoded_input = encoded_input.to('cuda')

    # Inference
    start_time = time.time()
    outputs = model.generate(encoded_input, max_new_tokens=max_new_tokens)
    end_time = time.time()
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Calculate the number of tokens generated
    # (note: outputs[0] also includes the prompt tokens)
    num_tokens = len(outputs[0].detach().cpu().numpy().flatten())
    inference_time = end_time - start_time
    token_per_sec = num_tokens / inference_time
    print(f"Inference latency: {inference_time} sec")
    print(f"Token per sec: {token_per_sec}")
    return generated_text


# Launch a Gradio-based chat interface
demo = gr.ChatInterface(fn=chat, title="Llama2 chatbot")
demo.launch()
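By default, demo.launch() serves the app locally. If you want to reach the chatbot from another machine on your network, Gradio's standard launch parameters can be used (a sketch; adjust the host and port to your environment):

Python

# Bind to all interfaces on port 7860 so the demo is reachable over the network
demo.launch(server_name="0.0.0.0", server_port=7860)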

Here is the output conversation on the chatbot, showing the prompt and results:

 

Here is a view of AMD GPU utilization with `rocm-smi`:

 

As you can see, using the Hugging Face integration with AMD ROCm™, we can now deploy leading large language models, in this case Llama-2. Furthermore, the performance of the AMD Instinct™ MI210 meets our target performance threshold for LLM inference of <100 milliseconds per token.
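For reference, per-token latency is the reciprocal of the throughput the script prints: for example, a hypothetical 25 tokens per second works out to 1000 / 25 = 40 milliseconds per token, comfortably under the threshold.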

“Scalers AI was thrilled to see the robust ecosystem emerging around ROCm that provides us with critical choice and exceeds our <100 millisecond per user latency target on leading 7B parameter large language models!”

 - Chetan Gadil, CTO, Scalers AI  

In our next blog, we explore the performance of AMD ROCm™ and how it can accelerate AI research progress across industries.

Authors

Steen Graham, CEO of Scalers AI

Delmar Hernandez, Dell PowerEdge Technical Marketing

Mohan Rokkam, Dell PowerEdge Technical Marketing

 


Part II | How to Run Hugging Face Models with AMD ROCm™ on Dell™ PowerEdge™?

Scalers AI, Mohan Rokkam, Delmar Hernandez

Tue, 14 Nov 2023 16:27:00 -0000


In case you’re interested in learning more about how Dell and Hugging Face are working together, check out the November 14 announcement detailing how the two companies are simplifying GenAI with on-premises IT.  

PowerEdge R7615

AMD Instinct MI210 Accelerator


In our first blog, we explored the readiness of the AMD ROCm™ ecosystem to run modern Generative AI workloads. This blog provides a step-by-step guide to running Hugging Face models on AMD ROCm™ and insights on setting up TensorFlow, PyTorch, and GPT-2.

Dell PowerEdge offers a rich portfolio of AMD ROCm™ solutions, including Dell™ PowerEdge™ R7615, R7625, and R760xa servers.

For this blog, we selected the Dell PowerEdge R7615. 

System Configuration Details

Operating system: Ubuntu 22.04.3 LTS

Kernel version: 5.15.0-86-generic

Docker Version: Docker version 24.0.6, build ed223bc

ROCm version: 5.7

Server: Dell™ PowerEdge™ R7615

CPU: AMD EPYC™ 9354P 32-Core Processor

GPU: AMD Instinct™ MI210

Step-by-Step Guide

1. First, install the AMD ROCm™ driver, libraries, and tools. Follow the detailed installation instructions for your Linux-based platform.

To ensure these installations are successful, check the GPU info using `rocm-smi`.

2. Next, we will select code snippets from Hugging Face. Hugging Face offers the most comprehensive developer tools for leading AI models. We will choose GPT2 code snippets for both TensorFlow and PyTorch.

Running GPT2 on AMD ROCm™ with TensorFlow

Here, we use the AMD ROCm™ docker image for TensorFlow and launch GPT2 inference on an AMD GPU.

3. Use docker images for TensorFlow with AMD ROCm™ backend support to expedite the setup:

Bash

sudo docker run -it \
--network=host \
--device=/dev/kfd \
--device=/dev/dri \
--ipc=host \
--shm-size 16G \
--group-add video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--workdir=/dockerx \
-v $HOME/dockerx:/dockerx rocm/tensorflow:latest /bin/bash
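Before running the model, it is worth confirming that TensorFlow can see the GPU inside the container (a quick sanity check; a working ROCm setup should list at least one GPU device):

Python

import tensorflow as tf

# List the GPUs visible to TensorFlow; expect at least one entry on a working ROCm setup
print(tf.config.list_physical_devices('GPU'))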

4. Run the following TensorFlow code from Hugging Face to infer GPT2 inside the Docker container on the AMD GPU:

Python

from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

# Load the GPT2 tokenizer and TensorFlow model from Hugging Face
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2")

prompt = "What is Quantum Computing?"

# Tokenize the prompt and generate up to 100 tokens
input_ids = tokenizer.encode(prompt, return_tensors='tf')
output = GPT2.generate(input_ids, max_length=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
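If you want a rough sense of generation speed, you can wrap the generate call with a timer (a minimal sketch reusing the objects defined above):

Python

import time

# Time a single generate call; not a rigorous benchmark, just a quick check
start = time.time()
output = GPT2.generate(input_ids, max_length=100)
print(f"Generation took {time.time() - start:.2f} sec")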

Running GPT2 on AMD ROCm™ with PyTorch

5. Use docker images for PyTorch with AMD ROCm™ backend support to expedite the setup:

Bash

sudo docker run -it \
--network=host \
--device=/dev/kfd \
--device=/dev/dri \
--ipc=host \
--shm-size 16G \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--workdir=/dockerx \
-v $HOME/dockerx:/dockerx rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1 /bin/bash
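As with TensorFlow, a quick sanity check inside the container confirms that PyTorch can see the AMD GPU (on ROCm builds of PyTorch, the CUDA API serves as the portability layer for AMD devices):

Python

import torch

# Expect True and an AMD Instinct device string on a working ROCm setup
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))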

6. Use the snippet below to run a PyTorch script from Hugging Face in the Docker container:

Python

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the GPT2 tokenizer and PyTorch model; device_map="auto" places the model on the GPU
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2', device_map="auto")

prompt = "What is Quantum Computing?"

# On ROCm builds of PyTorch, the 'cuda' device maps to the AMD GPU
encoded_input = tokenizer(prompt, return_tensors='pt')
encoded_input = encoded_input.to('cuda')

output = model.generate(**encoded_input, max_length=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
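Greedy decoding, as above, always returns the same completion. To get more varied output, you can enable sampling through standard generate parameters (a sketch; the values shown are illustrative, not tuned):

Python

# Sample instead of greedy decoding; top_p and temperature values are illustrative
output = model.generate(**encoded_input, max_length=100, do_sample=True, top_p=0.9, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))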

As you can see, AMD ROCm™ has a rich ecosystem of support for leading AI frameworks like PyTorch, TensorFlow, and Hugging Face, making it straightforward to set up and deploy industry-leading transformer models.

If you are interested in trying different models from Hugging Face, you can refer to the comprehensive set of transformer models supported here: https://huggingface.co/docs/transformers/index
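For many of these models, the high-level pipeline API is the quickest way to experiment (a minimal sketch; the model name is just an example and can be swapped for any text-generation checkpoint):

Python

from transformers import pipeline

# Build a text-generation pipeline; "gpt2" here is an example checkpoint
generator = pipeline('text-generation', model='gpt2')
print(generator("What is Quantum Computing?", max_length=50)[0]['generated_text'])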

Our next blog shows you how to run Llama-2, arguably the leading large language model available to developers today, in a chat application using Hugging Face.

Blog III


Authors:

Steen Graham, CEO of Scalers AI

Delmar Hernandez, Dell PowerEdge Technical Marketing

Mohan Rokkam, Dell PowerEdge Technical Marketing

 


Part I: Is AMD ROCm™ Ready to Deploy Leading AI Workloads?

Scalers AI, Mohan Rokkam, Delmar Hernandez

Thu, 09 Nov 2023 23:21:48 -0000


PowerEdge R7615

AMD Instinct MI210 Accelerator


Today, innovation is GPU-constrained, and we are seeing explosive growth in AI workloads, namely transformer-based models for Generative AI. This blog explores AMD ROCm™ software and AMD GPUs, and their readiness for prime time.

AMD ROCm™, or Radeon Open eCosystem (ROCm), was launched in 2016 as an open-source software foundation for GPU computing on Linux, providing developers with tools to leverage GPUs' compute capacity to advance their workloads across applications including high-performance computing and advanced rendering. It provides a comprehensive set of tools and libraries for programming GPUs in a variety of languages, including C++, Python, and R.

AMD ROCm can be used to accelerate a variety of workloads, such as:

  • Scientific computing and computer-aided design (CAD): AMD ROCm™ can accelerate scientific simulations, such as molecular dynamics and computational fluid dynamics.
  • Artificial Intelligence: AMD ROCm™ can be used to train and deploy AI models faster and more efficiently.
  • Data science: AMD ROCm™ can accelerate data processing and analytics tasks.
  • Graphics and visualization: AMD ROCm™ can create and render high-performance graphics and visualizations.

With the broad and rising adoption of Generative AI driving the need for the parallel computational power of GPUs to train, fine-tune, and deploy deep learning models, AMD ROCm™ has expanded support for leading AI frameworks, including TensorFlow, PyTorch, ONNX Runtime, and, more recently, Hugging Face.

Hugging Face and AMD announced a collaboration to support AMD ROCm™ and hardware platforms to deliver leadership transformer performance on AMD CPUs and GPUs for training and inference. The initial focus will be on AMD Instinct™ MI2xx and MI3xx series GPUs¹.

AMD and Hugging Face plan to support transformer architectures for natural language processing, computer vision, and speech. Plans also include traditional computer vision models and recommendation models.

“We will integrate AMD ROCm SDK seamlessly in our open-source libraries, starting with the transformers library.”

Further, Hugging Face highlighted plans for a new Optimum library dedicated to AMD¹. In addition to the growing ecosystem of AI software support for AMD ROCm™, Dell™ offers a portfolio of leading-edge PowerEdge™ hardware supporting AMD ROCm™ and the AMD MI210 across Dell™ PowerEdge™ R760xa and R7615 servers.

The breadth of hardware offerings gives enterprise users of AMD ROCm™ robust hardware choices to pair with fast-advancing software support.

The architecture above showcases the robust availability of AMD ROCm™ software and Hugging Face integration, allowing developers to run leading transformer models optimized for AMD Instinct™ GPUs today. Dell™ offers a robust portfolio of PowerEdge™ servers that support AMD ROCm™-compatible GPUs.

This enables customers to easily get the hardware needed to test, develop, and deploy AI solutions with AMD ROCm™.

So is AMD ROCm™ Ready for AI Workloads?

Though AMD ROCm™ adoption and ecosystem maturity are nascent, the support of leading AI frameworks and collaboration with key ecosystem partners such as Hugging Face, paired with AMD advancements in GPU hardware, make it ready to take on the leading AI workloads today.

In part II of this blog series, we will put the architecture to the test and develop an LLM-based chatbot on Dell™ PowerEdge™ servers with AMD ROCm™ and AMD GPUs.

Blog II

References

https://huggingface.co/blog/huggingface-and-amd

Authors

Steen Graham, CEO of Scalers AI

Delmar Hernandez, Dell PowerEdge Technical Marketing

Mohan Rokkam, Dell PowerEdge Technical Marketing
