Installing the GPU device plug-in
Install the GPU device plug-in after a successful installation of OpenShift 3.11.
Follow these steps:
Label each node that contains a GPU:
oc label node <node-with-gpu> openshift.com/gpu-accelerator=true
The label is used in the next stage of the installation. For example, label the four storage nodes that contain GPUs:
oc label node stor1.r5a.local openshift.com/gpu-accelerator=true
oc label node stor2.r5a.local openshift.com/gpu-accelerator=true
oc label node stor3.r5a.local openshift.com/gpu-accelerator=true
oc label node stor4.r5a.local openshift.com/gpu-accelerator=true
Clone the repository that contains the sample device plug-in daemonset, and change to its directory:
git clone https://github.com/redhat-performance/openshift-psap.git
cd openshift-psap/blog/gpu/device-plugin
The sample daemonset device-plugin/nvidia-device-plugin.yml uses the label that you created in the previous step to schedule the device plug-in on nodes that include GPUs, so that the plug-in pods run only where GPU hardware is available.
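The essential parts of such a daemonset are a nodeSelector that matches the label and a mount of the kubelet device-plugin socket directory. The following is a minimal sketch of that shape; the field values and image tag here are illustrative, and the file in the repository is authoritative:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      nodeSelector:
        openshift.com/gpu-accelerator: "true"   # the label applied earlier
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvidia/k8s-device-plugin:1.11    # illustrative tag
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins  # kubelet device-plugin socket dir
```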
oc create -f nvidia-device-plugin.yml
oc get pods -n kube-system
NAME READY STATUS RESTARTS AGE
nvidia-device-plugin-daemonset-czzbs 1/1 Running 0 32s
nvidia-device-plugin-daemonset-hz5kr 1/1 Running 0 32s
nvidia-device-plugin-daemonset-w9bxj 1/1 Running 0 32s
nvidia-device-plugin-daemonset-xql7z 1/1 Running 0 32s
Four pods are running because we labeled four storage nodes in the preceding step. Check the logs of one of the pods:
oc logs nvidia-device-plugin-daemonset-czzbs -n kube-system
2019/07/12 2:19:45 Loading NVML
2019/07/12 2:19:45 Fetching devices.
2019/07/12 2:19:45 Starting FS watcher.
2019/07/12 2:19:45 Starting OS watcher.
2019/07/12 2:19:45 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2019/07/12 2:19:45 Registered device plugin with Kubelet
The node now advertises the nvidia.com/gpu extended resource in its capacity:
oc describe node stor1.r5a.local | egrep 'Capacity|Allocatable|gpu'
Capacity:
nvidia.com/gpu: 1
Allocatable:
nvidia.com/gpu: 1
Nodes that do not have GPUs installed do not advertise GPU capacity.
Use the device-plugin/cuda-vector-add.yaml file as a pod description for running the cuda-vector-add image in OpenShift. The last line of the file requests one NVIDIA GPU from OpenShift. The OpenShift scheduler sees this request and schedules the pod to a node that has a free GPU. After the pod creation request arrives at the node, the kubelet coordinates with the device plug-in to start the pod with a GPU resource.
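A minimal sketch of what such a pod description looks like follows; the image reference is illustrative, and the cuda-vector-add.yaml file in the repository is authoritative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: cuda-vector-add:v0.1   # illustrative image reference
    resources:
      limits:
        nvidia.com/gpu: 1   # request one GPU; the scheduler places the pod on a node with a free GPU
```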
To run a GPU-enabled container on the cluster:
oc new-project nvidia
oc create -f cuda-vector-add.yaml
The container runs to completion:
oc get pods
NAME READY STATUS RESTARTS AGE
cuda-vector-add 0/1 Completed 0 3s
nvidia-device-plugin-daemonset-czzbs 1/1 Running 0 9m
View the container output:
oc logs cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
This output is the same as when we ran the container directly using Podman or Docker.
The following table shows the labels of the files that are needed for a working GPU container:
Table 2. File labels for a GPU container
File | SELinux label
/dev/nvidia* | xserver_misc_device_t
/usr/bin/nvidia-* | xserver_exec_t
/var/lib/kubelet/*/* | container_file_t