Home > Workload Solutions > Container Platforms > Red Hat OpenShift Container Platform > Archive > Running ML/DL Workloads Using Red Hat OpenShift Container Platform v3.11
The NVIDIA drivers are compiled from source when the packages are installed. To complete the build process, install the driver packages and unload the in-tree nouveau driver:
yum -y install nvidia-driver nvidia-driver-cuda nvidia-modprobe
modprobe -r nouveau
Installing the NVIDIA driver package blacklists the nouveau driver by adding nouveau.modeset=0 rd.driver.blacklist=nouveau video=vesa:off to the kernel command line. This ensures that the nouveau driver is not loaded on subsequent reboots.
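After the next reboot, the blacklist can be verified by inspecting the kernel command line. A minimal sketch, using a sample string in place of the real /proc/cmdline:

```shell
# Sketch: check a kernel command line for the nouveau blacklist options.
# CMDLINE is a sample string standing in for $(cat /proc/cmdline) on a real host.
CMDLINE="ro quiet nouveau.modeset=0 rd.driver.blacklist=nouveau video=vesa:off"
for opt in nouveau.modeset=0 rd.driver.blacklist=nouveau video=vesa:off; do
  case " $CMDLINE " in
    *" $opt "*) echo "found: $opt" ;;
    *)          echo "missing: $opt" ;;
  esac
done
```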
nvidia-modprobe && nvidia-modprobe -u
nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 | sed -e 's/ /-/g'
This command outputs the name of the GPU on the server – in this example, Tesla-V100-SXM2-32GB. This name can be used to label the node in OpenShift.
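The hyphenated name can then be attached to the node as a label. A sketch, reproducing the sed transformation above on a sample name; the node name placeholder and the label key gpu-type are assumptions, not mandated by OpenShift:

```shell
# Reproduce the hyphenation from the nvidia-smi query above on a sample name.
GPU_NAME=$(echo "Tesla V100-SXM2-32GB" | sed -e 's/ /-/g')
echo "$GPU_NAME"
# Then label the node (placeholder node name; run against a real cluster):
#   oc label node <node-name> gpu-type="$GPU_NAME"
```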
Steps 1 to 6 of this procedure describe installation of the NVIDIA GPU driver from source. At the time of writing, NVIDIA and Red Hat have announced a technical preview of new packages for GPU drivers for select Red Hat Enterprise Linux versions. These packages eliminate the need to have compilers and a full software development toolchain installed on each system that is running NVIDIA GPUs, simplifying the management experience for the user. To get started with the new packages, follow the instructions in this README.
Add the nvidia-container-runtime-hook
The version of Docker that is shipped by Red Hat includes support for OCI runtime hooks; therefore, we need to install only the nvidia-container-runtime-hook package.
curl -s -L https://nvidia.github.io/nvidia-container-runtime/centos7/nvidia-container-runtime.repo | tee /etc/yum.repos.d/nvidia-container-runtime.repo
An OCI prestart hook makes NVIDIA libraries and binary files available in a container by bind-mounting them in from the host. The prestart hook is triggered by the presence of certain environment variables in the container, such as NVIDIA_DRIVER_CAPABILITIES=compute,utility.
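How the hook reacts to that variable can be sketched in plain shell. This is only a simulation of the decision logic, not the real hook, which runs inside the container runtime; an ordinary shell variable stands in for the container's environment:

```shell
# Simulated decision: the real prestart hook inspects the container's
# environment for NVIDIA_* variables before bind-mounting anything.
NVIDIA_DRIVER_CAPABILITIES="compute,utility"
case ",$NVIDIA_DRIVER_CAPABILITIES," in
  *,compute,*) echo "would bind-mount CUDA compute libraries" ;;
esac
case ",$NVIDIA_DRIVER_CAPABILITIES," in
  *,utility,*) echo "would bind-mount utilities such as nvidia-smi" ;;
esac
```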
yum -y install nvidia-container-runtime-hook
cat <<'EOF' > /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
{
"hasbindmounts": true,
"hook": "/usr/bin/nvidia-container-runtime-hook",
"stage": [ "prestart" ]
}
EOF
An SELinux policy tailored for running CUDA GPU workloads is required to run NVIDIA containers that are confined rather than privileged.
Install the SELinux policy module on all GPU worker nodes by running:
wget https://raw.githubusercontent.com/zvonkok/origin-ci-gpu/master/selinux/nvidia-container.pp
semodule -i nvidia-container.pp
The new SELinux policy relies on correct labeling of the host. Ensure that the files that are needed have the correct SELinux label by running the following commands:
nvidia-container-cli -k list | restorecon -v -f -
restorecon -Rv /dev
restorecon -Rv /var/lib/kubelet
The system is now set up to run a GPU-enabled container.
To verify correct operation of the driver and container enablement, run a cuda-vector-add container with Docker or Podman:
docker run --user 1000:1000 --security-opt=no-new-privileges --cap-drop=ALL \
--security-opt label=type:nvidia_container_t \
docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
Trying to pull repository docker.io/mirrorgooglecontainers/cuda-vector-add ...
v0.1: Pulling from docker.io/mirrorgooglecontainers/cuda-vector-add
5d9a20cbabf3: Pull complete
84b2e9f421b6: Pull complete
6f94649104a2: Pull complete
6c16e819a84a: Pull complete
9822cda4c699: Pull complete
1bc138ea32ad: Pull complete
ade909bfe2a5: Pull complete
e70e5ba470d6: Pull complete
ab71e6b7eb90: Pull complete
925740434ebd: Pull complete
2f93605342b5: Pull complete
fe61ad4992f7: Pull complete
Digest: sha256:0705cd690bc0abf54c0f0489d82bb846796586e9d087e9a93b5794576a456aea
Status: Downloaded newer image for docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
If you see the message Test PASSED, the drivers, hooks, and container runtime are functioning correctly, and you can proceed to configuring OpenShift Container Platform.