AMD ROCm is an open-source software platform designed to optimize the performance of AMD accelerators for AI. ROCm is powered by the Heterogeneous-computing Interface for Portability (HIP). It supports programming models such as OpenMP and OpenCL, includes the necessary open-source compilers, debuggers, and libraries, and is fully integrated into machine learning frameworks such as PyTorch and TensorFlow. ROCm also facilitates scale-out deployments with containerization tools and supports various programming languages for AI and HPC workloads. In addition, it integrates with AMD Instinct accelerators to enhance AI and HPC applications and is backed by comprehensive management tools for development and system performance analysis.
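Because ROCm builds of PyTorch expose HIP devices through PyTorch's standard CUDA interface, existing GPU code typically runs unchanged. The following minimal sketch, assuming a ROCm build of PyTorch on a node with AMD Instinct accelerators, verifies that the framework can see the devices:

```python
import torch

# On ROCm builds of PyTorch, torch.version.hip is set and HIP devices are
# exposed through the familiar torch.cuda API, so existing GPU code runs as-is.
print("HIP runtime:", torch.version.hip)

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"Accelerator {i}: {torch.cuda.get_device_name(i)}")

    # A small tensor operation on the first accelerator confirms end-to-end setup.
    x = torch.randn(1024, 1024, device="cuda:0")
    print("Matmul OK:", (x @ x).shape)
else:
    print("No AMD accelerators visible to PyTorch.")
```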
Dell Omnia is open-source software for deploying and managing high-performance clusters for HPC, AI, and data analytics workloads. Omnia installs Kubernetes and Slurm for managing jobs and enables installation of many other packages and services for running diverse workloads on the same converged solution. Developers are continually extending Omnia to speed deployment of new infrastructure into resource pools that can be allocated and reallocated to different workloads easily. Omnia can make it faster and easier for IT to provide the right tools for the right job on the right infrastructure at the right time.
Organizations seeking comprehensive model workflow and life cycle management can optionally deploy MLOps platforms and toolsets such as Kubeflow, cnvrg.io, and MLflow.
MLOps integrates machine learning with software engineering for efficient deployment and management of models in real-world applications. In generative AI, MLOps can automate model deployment, ensure continuous integration, monitor performance, optimize resources, handle errors, and ensure compliance. It can also manage model versions, detect data drift, and provide model explainability. These practices ensure generative models operate reliably and efficiently in production, which is critical for interactive tasks like content generation and customer service chatbots.
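As a minimal illustration of the experiment-tracking side of MLOps, the following sketch uses MLflow (one of the optional toolsets named above) to record hyperparameters, metrics, and an artifact for a fine-tuning run; the parameter names and values are illustrative only:

```python
import mlflow

# Illustrative values; in practice these come from the actual training job.
params = {"base_model": "llama-3-8b", "lora_rank": 8, "learning_rate": 2e-4}

with mlflow.start_run(run_name="lora-finetune-demo"):
    mlflow.log_params(params)

    # Log metrics per epoch so runs can be compared in the MLflow UI.
    for epoch, loss in enumerate([1.92, 1.41, 1.18], start=1):
        mlflow.log_metric("train_loss", loss, step=epoch)

    # Any file (config, adapter weights, evaluation report) can be stored as an artifact.
    with open("run_notes.txt", "w") as f:
        f.write("Demo run for MLOps tracking example.\n")
    mlflow.log_artifact("run_notes.txt")
```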
PyTorch torchtune is a PyTorch-native library for easily creating, fine-tuning, and experimenting with LLMs. The library provides native PyTorch implementations of well-known LLMs, built from flexible and interchangeable components. It simplifies fine-tuning with straightforward, customizable training recipes for techniques such as LoRA and QLoRA, all within the PyTorch ecosystem and with no external trainers or frameworks required. torchtune also offers YAML configurations to streamline the setup of training, evaluation, quantization, and inference. To facilitate a quick start, it includes comprehensive support for many prevalent dataset formats and a collection of prompt templates.
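As a small sketch of the interchangeable-component approach, the snippet below instantiates a LoRA-enabled Llama 2 7B model with a torchtune model builder. The builder name and arguments follow the torchtune tutorials but may differ between releases, and checkpoint download and loading (normally handled by a torchtune recipe and YAML config) are omitted:

```python
from torchtune.models.llama2 import lora_llama2_7b

# Build a Llama 2 7B architecture with LoRA adapters on the attention
# query/value projections. Weights are randomly initialized here; in practice
# a checkpoint is loaded and training is driven by a torchtune recipe/config.
model = lora_llama2_7b(
    lora_attn_modules=["q_proj", "v_proj"],
    lora_rank=8,
    lora_alpha=16,
)

total = sum(p.numel() for p in model.parameters())
print(f"Model instantiated with {total:,} parameters")
```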
Transformer Reinforcement Learning (TRL) from Hugging Face is a comprehensive library designed to fine-tune and align transformer language models using Reinforcement Learning (RL). It facilitates the training process through steps such as Supervised Fine-tuning (SFT), Reward Modeling (RM), and Proximal Policy Optimization (PPO), and it is integrated with the Transformers library. TRL supports various models, including GPT-2, BLOOM, and GPT-Neo, which can be optimized using PPO. The library emphasizes efficiency and scalability, allowing training to scale from a single GPU to a multi-node cluster. It includes features such as a CLI for code-free interaction, Trainers for applying fine-tuning methods, and AutoModel classes that add a value head to the model for RL training.
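A minimal supervised fine-tuning (SFT) sketch with TRL is shown below. It follows the pattern from the TRL documentation, though argument names have shifted between releases; the small GPT-2 model and the IMDB dataset slice are stand-ins chosen only to keep the example self-contained:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# A small public dataset and model keep the example lightweight; swap in your
# own instruction dataset and base LLM for real fine-tuning.
dataset = load_dataset("imdb", split="train[:1%]")

training_args = SFTConfig(
    output_dir="./sft-gpt2",
    dataset_text_field="text",
    max_steps=50,
    per_device_train_batch_size=2,
)

trainer = SFTTrainer(
    model="gpt2",  # a model name or a preloaded transformers model
    train_dataset=dataset,
    args=training_args,
)
trainer.train()
```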
Composer is an open-source deep learning training library from MosaicML (Databricks). Built on top of PyTorch, Composer makes it easier to implement distributed training workflows on large-scale clusters. It is optimized for scalability and usability, integrating best practices for efficient, multi-node training. By abstracting away low-level complexities such as parallelism techniques, distributed data loading, and memory optimization, Composer lets you focus on training modern ML models and running experiments without being slowed down by infrastructure details.
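The sketch below loosely follows the Composer quickstart: it wraps a standard torchvision model in a ComposerClassifier and hands it to the Trainer, which manages devices, distributed data loading, and checkpointing. Class and argument names are taken from the Composer documentation but should be checked against the installed release; the CIFAR-10 dataset and ResNet-18 model are placeholders:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

from composer import Trainer
from composer.models import ComposerClassifier

# Wrap a standard torchvision model so Composer can train it.
model = ComposerClassifier(module=models.resnet18(num_classes=10), num_classes=10)

train_dataset = datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transforms.ToTensor()
)
train_dataloader = DataLoader(train_dataset, batch_size=128, shuffle=True)

# max_duration accepts human-readable values such as "1ep" (one epoch).
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="1ep",
)
trainer.fit()
```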
The following sections describe the inference components.
LLMs promise to fundamentally change how we use AI across all industries. However, serving these models is challenging and can be surprisingly slow, even on expensive hardware. vLLM is an open-source library for fast LLM inference and serving. vLLM uses PagedAttention, an attention algorithm that manages attention keys and values efficiently. Equipped with PagedAttention, vLLM sets a new state of the art in LLM serving, delivering up to 24 times higher throughput than Hugging Face Transformers without requiring any model architecture changes.
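The following sketch shows vLLM's offline batched-inference API; the small OPT model is used only to keep the example easy to run, and any Hugging Face causal language model that vLLM supports can be substituted:

```python
from vllm import LLM, SamplingParams

# vLLM loads the model once and serves batched requests, with PagedAttention
# managing the key/value cache.
llm = LLM(model="facebook/opt-125m")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "The key benefit of PagedAttention is",
    "Enterprise generative AI deployments require",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text.strip())
```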
KServe (formerly known as KFServing) is a cloud-native, open-source framework designed to serve machine learning models in a Kubernetes environment. It simplifies the process of deploying, monitoring, and managing machine learning models at scale. KServe provides a serverless inferencing solution that supports multiple machine learning frameworks, such as TensorFlow, PyTorch, XGBoost, and scikit-learn. It uses Kubernetes autoscaling capabilities, including scaling down to zero, to optimize resource use. KServe also includes capabilities for canary rollouts, multiframework serving, and integration with model monitoring systems, making it a versatile and efficient choice for machine learning operations (MLOps) workflows.
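As a sketch of how an InferenceService can be defined programmatically, the example below uses the KServe Python client to deploy a scikit-learn model from the KServe samples. The class names follow the KServe SDK documentation, while the namespace and storage URI are assumptions to adapt to your cluster:

```python
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
)

# Define an InferenceService that serves a scikit-learn model from object storage.
isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(name="sklearn-iris", namespace="kserve-test"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(
                storage_uri="gs://kfserving-examples/models/sklearn/1.0/model"
            )
        )
    ),
)

# Submit to the cluster; KServe creates the deployment, autoscaler, and route.
KServeClient().create(isvc)
```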
The following section describes a RAG component.
LlamaIndex is a data framework designed to augment the functionality of LLMs. It offers flexibility in the use of LLMs, which can be used as autocomplete tools, chatbots, or semiautonomous agents. LlamaIndex simplifies this process by providing various tools. Data connectors allow for the ingestion of existing data from various sources and formats, such as APIs, PDFs, and SQL. Data indexes then structure this data into intermediate representations that are easily consumable by LLMs. Engines provide natural language access to your data, with query engines offering powerful question-answering interfaces and chat engines facilitating multi-turn conversational interactions. Agents, powered by LLMs and augmented by various tools, function as knowledge workers. LlamaIndex also includes observability and evaluation integrations for rigorous experimentation, evaluation, and monitoring of your application.
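A minimal retrieval-augmented generation sketch with LlamaIndex is shown below. It assumes documents are present in a local ./data directory and that an LLM and embedding model are configured; by default LlamaIndex uses OpenAI models, but locally served models (for example, through an OpenAI-compatible vLLM endpoint) can be configured instead:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Ingest local files (PDF, text, and so on) with a data connector.
documents = SimpleDirectoryReader("./data").load_data()

# Build a vector index, the intermediate representation queried at runtime.
index = VectorStoreIndex.from_documents(documents)

# A query engine answers one-off questions; a chat engine keeps conversation state.
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key points in these documents.")
print(response)

chat_engine = index.as_chat_engine()
print(chat_engine.chat("What should I read first?"))
```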
The AMD Infinity Hub contains a collection of advanced software containers and deployment guides curated by AMD for AI applications. This hub is a comprehensive resource that enables researchers, scientists, and engineers to accelerate their time to productivity by providing ready-to-use tools and detailed instructions for deploying and optimizing their AI and computational workloads.
For more information, see the AMD Infinity Hub.