Installing the GPU device plug-in
Install the GPU device plug-in after a successful installation of OpenShift 3.11.
Follow these steps:
Label each node that contains a GPU:
oc label node <node-with-gpu> openshift.com/gpu-accelerator=true
The label is used in the next stage of the installation. For example, label the four storage nodes that contain GPUs:
oc label node stor1.r5a.local openshift.com/gpu-accelerator=true
oc label node stor2.r5a.local openshift.com/gpu-accelerator=true
oc label node stor3.r5a.local openshift.com/gpu-accelerator=true
oc label node stor4.r5a.local openshift.com/gpu-accelerator=true
Clone the repository that contains the sample device plug-in daemonset, and change to its directory:
git clone https://github.com/redhat-performance/openshift-psap.git
cd openshift-psap/blog/gpu/device-plugin
The sample daemonset device-plugin/nvidia-device-plugin.yml uses the label that you created in the previous step to schedule the device plug-in on nodes that include GPUs, so that the plug-in pods run only where GPU hardware is available.
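The essential parts of such a daemonset are a nodeSelector that matches the label and a mount of the kubelet device-plugin socket directory. The following is a minimal sketch of that shape; the field values and image tag here are illustrative, and the file in the repository is authoritative:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      nodeSelector:
        openshift.com/gpu-accelerator: "true"   # the label applied earlier
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvidia/k8s-device-plugin:1.11    # illustrative tag
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins  # kubelet device-plugin socket dir
```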
oc create -f nvidia-device-plugin.yml
oc get pods -n kube-system
NAME READY STATUS RESTARTS AGE
nvidia-device-plugin-daemonset-czzbs 1/1 Running 0 32s
nvidia-device-plugin-daemonset-hz5kr 1/1 Running 0 32s
nvidia-device-plugin-daemonset-w9bxj 1/1 Running 0 32s
nvidia-device-plugin-daemonset-xql7z 1/1 Running 0 32s
Four pods are running because we labeled four storage nodes in the preceding step. Check the logs of one of the pods:
oc logs nvidia-device-plugin-daemonset-czzbs -n kube-system
2019/07/12 2:19:45 Loading NVML
2019/07/12 2:19:45 Fetching devices.
2019/07/12 2:19:45 Starting FS watcher.
2019/07/12 2:19:45 Starting OS watcher.
2019/07/12 2:19:45 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2019/07/12 2:19:45 Registered device plugin with Kubelet
The node now advertises the nvidia.com/gpu extended resource in its capacity:
oc describe node stor1.r5a.local | egrep 'Capacity|Allocatable|gpu'
Capacity:
nvidia.com/gpu: 1
Allocatable:
nvidia.com/gpu: 1
Nodes that do not have GPUs installed do not advertise GPU capacity.
Use the device-plugin/cuda-vector-add.yaml file as a pod description for running the cuda-vector-add image in OpenShift. The last line of the file requests one NVIDIA GPU from OpenShift. The OpenShift scheduler sees this request and schedules the pod to a node that has a free GPU. After the pod creation request arrives at the node, the kubelet coordinates with the device plug-in to start the pod with a GPU resource.
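A minimal sketch of what such a pod description looks like follows; the image reference is illustrative, and the cuda-vector-add.yaml file in the repository is authoritative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: cuda-vector-add:v0.1   # illustrative image reference
    resources:
      limits:
        nvidia.com/gpu: 1   # request one GPU; the scheduler places the pod on a node with a free GPU
```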
To run a GPU-enabled container on the cluster:
oc new-project nvidia
oc create -f cuda-vector-add.yaml
The container runs to completion:
oc get pods
NAME READY STATUS RESTARTS AGE
cuda-vector-add 0/1 Completed 0 3s
nvidia-device-plugin-daemonset-czzbs 1/1 Running 0 9m
View the container output:
oc logs cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
This output is the same as when we ran the container directly using Podman or Docker.
The following table shows the labels of the files that are needed for a working GPU container:
Table 2. File labels for a GPU container
File | SELinux label
/dev/nvidia* | xserver_misc_device_t
/usr/bin/nvidia-* | xserver_exec_t
/var/lib/kubelet/*/* | container_file_t