NVIDIA GPU drivers are required to provision GPUs on OpenShift Container Platform. These drivers enable CUDA and allow workloads to consume GPU resources. The Dell engineering team provisioned an NVIDIA A100 GPU on a compute node to validate the deployment.
This chapter describes the prerequisites and steps for installing the NVIDIA GPU operator.
Prerequisites
The Node Feature Discovery (NFD) operator is a prerequisite for the NVIDIA GPU operator.
Install the Node Feature Discovery (NFD) operator
Install the NFD operator from the Red Hat OperatorHub.
- In the OpenShift Container Platform web console, select Operators > OperatorHub.
- Choose Node Feature Discovery from the list of available operators, and then click Install.
Figure 51. Installing NFD operator
- On the Install Operator page, select a specific namespace on the cluster, and then click Install.
The recommended namespace, openshift-nfd, is created during operator installation.
- Select Operators > Installed Operators.
- Ensure that Node Feature Discovery is listed in the openshift-nfd project with a status of Succeeded.
- Review the status of the nfd-controller-manager pod, as shown in the following figure:
Figure 52. NFD pod status
- Create an instance of Node Feature Discovery using the NodeFeatureDiscovery tab.
- Click Operators > Installed Operators from the navigation bar.
- Find the Node Feature Discovery entry.
- Click NodeFeatureDiscovery under Provided APIs.
- Click Create NodeFeatureDiscovery.
- Click Create to start labeling the nodes in the cluster that have GPUs.
The NFD operator uses PCI vendor IDs to identify hardware in a node. The PCI vendor ID for NVIDIA is 10de.
- Use the OpenShift Container Platform web console or the CLI to verify that the NFD operator is functioning correctly. Run oc describe node <node name> to verify that the new labels have been added.
Figure 53. Checking the node labels
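The NFD checks above can also be run from the CLI. The following is a minimal sketch, assuming that oc is logged in with cluster administrator rights and that NFD was installed in the openshift-nfd namespace. The helper function names are hypothetical, and the label key feature.node.kubernetes.io/pci-10de.present is an assumption based on vendor-ID PCI labeling; confirm the exact key against the output of oc describe node on your cluster.

```shell
# Sketch only: helper names are hypothetical, and the label key below is an
# assumption about how NFD labels nodes by PCI vendor ID (10de = NVIDIA).
NVIDIA_PCI_LABEL="feature.node.kubernetes.io/pci-10de.present=true"

# Show the NFD pods in the namespace used during operator installation.
show_nfd_pods() {
  oc get pods -n openshift-nfd
}

# Print the names of the nodes that carry the NVIDIA PCI label.
list_gpu_nodes() {
  oc get nodes -l "$NVIDIA_PCI_LABEL" -o name
}

# Print only the PCI-related lines from the description of one node.
show_pci_labels() {
  oc describe node "$1" | grep 'pci-'
}
```

For example, run list_gpu_nodes to list the GPU nodes, and then run show_pci_labels <node name> against one of the returned nodes.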
Installing the NVIDIA GPU operator
As a cluster administrator, install the NVIDIA GPU operator from the OpenShift Container Platform CLI or the web console.
Install the NVIDIA GPU operator from the OpenShift web console
In the OpenShift web console:
- Select Operators > OperatorHub > All Projects.
- Search for and select NVIDIA GPU Operator.
Figure 54. Installing the NVIDIA GPU operator
- Click Install.
The suggested namespace to use is nvidia-gpu-operator.
- Review the status of the GPU operator pod, as shown in the following figure:
Figure 55. GPU operator pod status
During the installation, a custom resource definition (CRD) for ClusterPolicy is created. The ClusterPolicy configures the GPU stack, including image names and repositories, pod restrictions, and credentials.
- From the navigation menu in the web console, select Operators > Installed Operators > NVIDIA GPU Operator.
- Select the ClusterPolicy tab, and then click Create ClusterPolicy.
The platform assigns the default name gpu-cluster-policy.
- Click Create.
Figure 56. Creating ClusterPolicy
The GPU operator installs all the required components to set up the NVIDIA GPUs in the cluster.
- Wait for the installation to finish. Allow approximately 20 minutes.
When the installation is complete, the status of the deployed ClusterPolicy gpu-cluster-policy for the NVIDIA GPU operator changes to State: ready.
Figure 57. ClusterPolicy status
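The ClusterPolicy state can also be polled from the CLI instead of being watched in the web console. The following is a minimal sketch, assuming that the ClusterPolicy reports its state in the .status.state field and that the default name gpu-cluster-policy was kept. The helper names, the 30-second polling interval, and the 1200-second (20-minute) timeout are illustrative choices, not values from the validated deployment.

```shell
# Sketch only: helper names are hypothetical. Assumes the ClusterPolicy
# exposes its state at .status.state (for example, "ready").

# Print the current state of the named ClusterPolicy.
clusterpolicy_state() {
  oc get clusterpolicy "${1:-gpu-cluster-policy}" -o jsonpath='{.status.state}'
}

# Poll until the state is "ready" or the timeout (in seconds) elapses.
# Arguments: name (default gpu-cluster-policy), timeout (default 1200),
# polling interval (default 30).
wait_for_clusterpolicy() {
  name="${1:-gpu-cluster-policy}"
  timeout="${2:-1200}"
  interval="${3:-30}"
  elapsed=0
  state=""
  while [ "$elapsed" -lt "$timeout" ]; do
    state=$(clusterpolicy_state "$name")
    if [ "$state" = "ready" ]; then
      echo "ClusterPolicy $name is ready"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "Timed out waiting for ClusterPolicy $name (last state: $state)" >&2
  return 1
}
```

Running wait_for_clusterpolicy with no arguments checks gpu-cluster-policy every 30 seconds for up to 20 minutes and returns a nonzero exit status if the state never becomes ready.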
- To validate that GPU resources are available for pods, create a pod to run the nvidia-smi command. See this sample file in GitHub for the pod specification.
The following figure shows the command output:
Figure 58. Output of nvidia-smi command
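The validation pod can be sketched as follows. This is not the sample file referenced above; it is a minimal illustration, assuming that the GPU operator advertises GPUs as the nvidia.com/gpu resource. The pod name nvidia-smi-test, the helper names, and the default project are hypothetical, and the container image is passed in as an argument because the image used in the validated deployment is not stated here. Supply an image from which nvidia-smi can run, such as a CUDA base image that your cluster can pull.

```shell
# Sketch only: pod name and helper names are hypothetical, and the image is
# supplied by the caller. Requests one GPU through the nvidia.com/gpu resource.

# Print a pod manifest that runs nvidia-smi once. Argument: container image.
nvidia_smi_pod_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: $1
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

# Create the pod. Arguments: container image, project (default: default).
run_nvidia_smi_pod() {
  nvidia_smi_pod_manifest "$1" | oc apply -n "${2:-default}" -f -
}
```

After the pod completes, oc logs nvidia-smi-test -n <project> prints the nvidia-smi output, and oc delete pod nvidia-smi-test -n <project> removes the pod.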