We use a “base container” with all the relevant components as the foundation for LLM tasks, including inference, RAG, and fine-tuning. The Dockerfile builds a container image based on Ubuntu 22.04 and provides an environment for ROCm and ROCm-enabled PyTorch. The main actions are summarized below, with an illustrative sketch of each step after the list:
- Base image and arguments: Starts from the ubuntu:22.04 image and defines build arguments for the versions of BNXT, AMD ROCm, PyTorch ROCm, UCX, and OpenMPI
- Package installation: Updates the package index and installs the packages required to build and run the ROCm environment and its other dependencies
- Configuration of Broadcom network drivers: Copies, builds, and installs libbnxt_re, the RoCE user-space library for the Broadcom NICs
- AMD drivers: Downloads and installs the AMD drivers and ROCm components for graphics and compute functionality
- Python and PyTorch installation: Installs Python development packages and pip, then installs PyTorch, torchvision, and torchaudio with ROCm support
- UCX and OpenMPI installation: Clones and builds Unified Communication X (UCX) and OpenMPI from their respective repositories, each configured with ROCm support
- RCCL tests: Clones and builds the RCCL tests with MPI support
- torchtune installation: Installs torchtune, a PyTorch-native library for fine-tuning LLMs, from its Git repository
- Final image creation: Uses a multistage build to produce a final image that contains only the necessary files, squashing the earlier build layers to reduce the image size
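
The sketches below illustrate what each of these steps might look like in the Dockerfile. First, the base image, build arguments, and package installation; the version values and package list are placeholders rather than the exact ones used in our image:

```dockerfile
# Hypothetical opening of the base-container Dockerfile; version values are placeholders.
FROM ubuntu:22.04 AS build

# Build arguments controlling the component versions used throughout the build
ARG BNXT_VERSION=230.2.54.0
ARG ROCM_VERSION=6.1.2
ARG PYTORCH_ROCM_VERSION=6.1
ARG UCX_VERSION=v1.16.0
ARG OMPI_VERSION=v5.0.3

# Build tools and libraries needed to compile the ROCm-aware components below
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential cmake git wget curl ca-certificates \
        autoconf automake libtool pkg-config libnuma-dev \
    && rm -rf /var/lib/apt/lists/*
```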
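The Broadcom libbnxt_re step might look like the following; the archive name and the autotools-style build are assumptions, since the packaging of the RoCE library can vary between Broadcom releases:

```dockerfile
# Hypothetical libbnxt_re installation; archive name and build system are assumptions.
COPY libbnxt_re-${BNXT_VERSION}.tar.gz /tmp/
RUN cd /tmp \
    && tar xzf libbnxt_re-${BNXT_VERSION}.tar.gz \
    && cd libbnxt_re-${BNXT_VERSION} \
    && sh autogen.sh \
    && ./configure --prefix=/usr \
    && make -j"$(nproc)" \
    && make install \
    && ldconfig
```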
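The AMD driver and ROCm components can be installed through AMD's amdgpu-install tool, roughly as follows; the installer file name and use-case list follow AMD's public installation guide but are placeholders here:

```dockerfile
# Hypothetical ROCm installation via amdgpu-install; the .deb file name must
# match the ROCM_VERSION build argument and is a placeholder in this sketch.
RUN wget -O /tmp/amdgpu-install.deb \
        https://repo.radeon.com/amdgpu-install/${ROCM_VERSION}/ubuntu/jammy/amdgpu-install_6.1.60102-1_all.deb \
    && apt-get update \
    && apt-get install -y /tmp/amdgpu-install.deb \
    # --no-dkms skips the kernel driver build, which is not needed inside a container
    && amdgpu-install -y --usecase=rocm,hiplibsdk --no-dkms \
    && rm -rf /var/lib/apt/lists/* /tmp/amdgpu-install.deb
```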
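Installing Python and the ROCm builds of PyTorch, torchvision, and torchaudio follows PyTorch's published pattern for ROCm wheels:

```dockerfile
# Python tooling plus PyTorch wheels built against ROCm
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3-dev python3-pip \
    && rm -rf /var/lib/apt/lists/* \
    && pip3 install --no-cache-dir torch torchvision torchaudio \
        --index-url https://download.pytorch.org/whl/rocm${PYTORCH_ROCM_VERSION}
```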
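Building UCX and OpenMPI from source with ROCm support might look like the following; the --with-rocm and --with-ucx options are the projects' documented configure flags, while the install prefixes and branch tags are assumptions:

```dockerfile
# UCX built from source with ROCm support
RUN git clone --depth 1 -b ${UCX_VERSION} https://github.com/openucx/ucx.git /tmp/ucx \
    && cd /tmp/ucx \
    && ./autogen.sh \
    && ./configure --prefix=/opt/ucx --with-rocm=/opt/rocm \
    && make -j"$(nproc)" && make install

# OpenMPI built from source against the UCX above, with ROCm support
RUN git clone --depth 1 -b ${OMPI_VERSION} --recursive https://github.com/open-mpi/ompi.git /tmp/ompi \
    && cd /tmp/ompi \
    && ./autogen.pl \
    && ./configure --prefix=/opt/ompi --with-ucx=/opt/ucx --with-rocm=/opt/rocm \
    && make -j"$(nproc)" && make install
```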
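The RCCL tests build with MPI support could resemble the following; the make variables mirror the upstream nccl-tests project that rccl-tests derives from, so treat the exact invocation as an assumption (recent releases may use CMake instead):

```dockerfile
# Hypothetical RCCL tests build with MPI support
RUN git clone https://github.com/ROCm/rccl-tests.git /tmp/rccl-tests \
    && cd /tmp/rccl-tests \
    && make MPI=1 MPI_HOME=/opt/ompi HIP_HOME=/opt/rocm -j"$(nproc)" \
    && mkdir -p /opt/rccl-tests \
    && cp -r build/* /opt/rccl-tests/
```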
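Installing torchtune from its Git repository is a short step; the clone path is an assumption:

```dockerfile
# torchtune installed from source rather than from PyPI
RUN git clone https://github.com/pytorch/torchtune.git /tmp/torchtune \
    && pip3 install --no-cache-dir /tmp/torchtune
```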
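Finally, a multistage build copies only the installed trees into a clean final stage, which discards the intermediate build layers; the copied paths and runtime packages shown are assumptions:

```dockerfile
# Hypothetical final stage: start clean and copy only what the earlier stage installed
FROM ubuntu:22.04 AS final

# Minimal runtime dependencies for the copied components
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 libnuma1 \
    && rm -rf /var/lib/apt/lists/*

COPY --from=build /opt/rocm /opt/rocm
COPY --from=build /opt/ucx /opt/ucx
COPY --from=build /opt/ompi /opt/ompi
COPY --from=build /opt/rccl-tests /opt/rccl-tests
COPY --from=build /usr/local/lib/python3.10/dist-packages /usr/local/lib/python3.10/dist-packages

# Make the copied toolchains visible at runtime
ENV PATH=/opt/rocm/bin:/opt/ompi/bin:/opt/ucx/bin:${PATH}
ENV LD_LIBRARY_PATH=/opt/rocm/lib:/opt/ompi/lib:/opt/ucx/lib
```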