The NVIDIA GPU Operator deploys several pods that are used to manage and enable GPUs for use in OpenShift Container Platform. Some of these pods require packages that are not available by default in the Universal Base Image (UBI) that OpenShift Container Platform uses. To make packages available to the NVIDIA GPU driver container, you must enable cluster-wide entitled container builds in OpenShift.
At a high level, enabling entitled builds involves three steps:
1. Obtain an entitlement certificate from the Red Hat Customer Portal.
2. Create a MachineConfig that deploys the certificate to the cluster nodes.
3. Validate that entitled builds work cluster-wide.
The following sections elaborate on these steps. For more information about entitled builds in OpenShift, see this Red Hat blog post.
Obtain your subscription certificates under the Subscriptions tab in the Red Hat Customer Portal, as shown in the following figure:
Unpack the downloaded archive:

$ unzip cert_20210202.zip

The working directory now contains the following files:

cert_20210202.zip  consumer_export.zip  openshift  signature

Next, unpack the consumer export:

$ unzip consumer_export.zip

The output identifies the Candlepin export:

Candlepin export for beedde9c-a893-4a58-8313-4862e78806e5
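The entitlement certificate that you need in the next step is the PEM file inside the consumer export. In a standard Candlepin export it sits under export/entitlement_certificates (this directory layout is an assumption based on the standard export format; adjust the path to match your archive):

$ ls export/entitlement_certificates/
<entitlement_id>.pem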
Create the entitlement MachineConfig
$ sed "s/BASE64_ENCODED_PEM_FILE/$(base64 -w 0 </path/to/certificate_file.pem>)/g" 0003-cluster-wide-machineconfigs.yaml.template > 0003-cluster-wide-machineconfigs.yaml
$ oc create -f 0003-cluster-wide-machineconfigs.yaml
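Applying the MachineConfig causes the Machine Config Operator to update and reboot the affected nodes. You can watch the rollout with, for example:

$ oc get machineconfigpool worker

Wait until the pool reports UPDATED=True before validating the entitlement.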
To validate that the cluster-wide entitlement is working, create a test pod and check its logs:

$ oc create -f 0004-cluster-wide-entitled-pod.yaml
$ oc logs cluster-entitled-build-pod
Figure 2. Locating the kernel-dev packages
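For reference, the validation pod runs a UBI image and searches for packages that are visible only to entitled systems. The following is a minimal sketch of such a pod (the image tag and the exact command are assumptions; the actual 0004-cluster-wide-entitled-pod.yaml may differ):

apiVersion: v1
kind: Pod
metadata:
  name: cluster-entitled-build-pod
spec:
  restartPolicy: Never
  containers:
  - name: cluster-entitled-build
    image: registry.access.redhat.com/ubi8:latest
    # kernel-devel packages are only resolvable when the node is entitled
    command: ["/bin/sh", "-c", "dnf search kernel-devel --showduplicates"]

If entitlement is working, the pod logs list the available kernel-devel packages, as shown in the preceding figure.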
Installing the NFD Operator
To install the NFD Operator, you must log in to the OpenShift cluster through the web console (see the Dell EMC Ready Stack for Red Hat OpenShift Container Platform 4.6 Deployment Guide). Follow these steps:
$ oc new-project gpu-operator-resources
The Install Operator page opens, as shown in the following figure:
Figure 3. NFD Operator installation page
After the NFD pods are running, NFD applies additional feature labels to the compute nodes.
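You can confirm that the labels are present by inspecting a compute node. For example (the node name is a placeholder; on nodes with NVIDIA GPUs, expect a PCI vendor label such as feature.node.kubernetes.io/pci-10de.present=true):

$ oc describe node <node_name> | grep feature.node.kubernetes.io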
Installing the NVIDIA GPU Operator
Figure 4. NVIDIA GPU Operator pods status
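You can also check the pod status from the command line:

$ oc get pods -n gpu-operator-resources

All pods must reach the Running or Completed state before you proceed.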
A new nvidia.com/gpu resource is displayed in the NodeSpec for nodes with GPUs.
$ oc get node <gpu_node> -o yaml | grep -i nvidia.com/gpu
The following output is displayed:
Figure 5. Resources present
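In text form, the grep output resembles the following sketch (shown here assuming a node with four GPUs; the value reflects the GPU count of the node):

  nvidia.com/gpu: "4"
  nvidia.com/gpu: "4"

The value appears twice because the resource is reported under both the capacity and allocatable fields of the node.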
Providing GPU resources to a pod
- image: nvcr.io/nvidia/tensorflow:19.09-py3
  args: ["git clone https://github.com/tensorflow/benchmarks.git; cd benchmarks/scripts/tf_cnn_benchmarks; python3 tf_cnn_benchmarks.py --num_gpus=2 --data_format=NHWC --batch_size=32 --model=resnet50 --variable_update=parameter_server"]
As shown in the sample Pod Spec, you provide GPUs to a pod by specifying the nvidia.com/gpu resource and requesting the number of GPUs that you want. This number must not exceed the number of GPUs that are available on a single node, because the scheduler must place the pod on one node that can satisfy the entire request.
The NVIDIA GPU Operator also deploys gpu-feature-discovery pods on each compute node. These pods label the node with information about its GPUs, such as the GPU type, family, and count. You can use these labels in a Pod Spec to schedule workloads based on criteria such as the GPU product name, as shown under nodeSelector in the sample Pod Spec.
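To list the labels that gpu-feature-discovery applied to a node, for example (the node name is a placeholder):

$ oc describe node <gpu_node> | grep nvidia.com/gpu

Typical labels include nvidia.com/gpu.product, nvidia.com/gpu.family, nvidia.com/gpu.count, and nvidia.com/gpu.memory.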