NVIDIA GPU drivers are required to provision GPUs on OpenShift Container Platform. These drivers enable CUDA and allow workloads to consume GPU resources. The Dell OpenShift engineering team provisioned an NVIDIA H100 GPU on a compute node to validate the solution deployment.
Install the NFD operator
The Node Feature Discovery (NFD) operator is required for the NVIDIA GPU operator. To install the NFD operator:
- In the OpenShift Container Platform web console, select Operators > OperatorHub.
- Choose Node Feature Discovery from the list of available operators, and then click Install.
Figure 25. Installing the NFD operator
- On the Install Operator page, select a specific namespace on the cluster, and then click Install. The recommended namespace is openshift-nfd.
- Select Operators > Installed Operators.
- Ensure that Node Feature Discovery is listed in the openshift-nfd project with a status of Succeeded.
- Ensure that the status of the nfd-controller-manager pod is Running, as shown in the following figure:
Figure 26. NFD pod status
- Create an NFD instance using the NodeFeatureDiscovery tab:
- From the navigation bar, click Operators > Installed Operators and find the Node Feature Discovery entry.
- Click NodeFeatureDiscovery under Provided APIs.
- Click Create NodeFeatureDiscovery.
- Click Create to start labeling the nodes in the cluster that have GPUs.
The NFD operator uses vendor PCI IDs to identify hardware in a node. NVIDIA uses the PCI ID 10de.
- Use the OpenShift Container Platform CLI to review the status of the NFD.
- To verify that the new node labels have been added, run the following command:
oc describe node <node name>
Figure 27. Verifying NFD node labels
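The label check above can also be narrowed to GPU nodes only. The following is a minimal sketch, not a validated procedure: it assumes that NFD applies its default PCI label key for vendor ID 10de (feature.node.kubernetes.io/pci-10de.present), which this guide does not state explicitly, and the helper name list_nvidia_nodes is hypothetical.

```shell
# Sketch: list nodes that NFD has labeled as carrying an NVIDIA PCI device.
# Assumption: NFD uses its default label key for PCI vendor ID 10de.
NVIDIA_PCI_LABEL="feature.node.kubernetes.io/pci-10de.present=true"

# list_nvidia_nodes is a hypothetical helper name, not part of the oc CLI.
list_nvidia_nodes() {
  # -l filters by label selector; -o name prints one "node/<name>" per line.
  oc get nodes -l "${NVIDIA_PCI_LABEL}" -o name
}

# Usage (requires a logged-in oc session):
#   list_nvidia_nodes
```

If the command returns no nodes, confirm the actual label key in the oc describe node output before assuming that NFD has failed.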
Install the NVIDIA GPU operator
You can perform this task as the cluster administrator from the OpenShift Container Platform web console:
- Select Operators > OperatorHub > All Projects.
- Search for NVIDIA GPU Operator and select it.
Figure 28. Installing the NVIDIA GPU operator
- Click Install.
The suggested namespace is nvidia-gpu-operator.
- Review the status of the GPU operator pod by running:
oc get pods -n nvidia-gpu-operator
Installing the operator creates the ClusterPolicy custom resource definition (CRD). A ClusterPolicy configures the GPU stack, including the image names and repository, pod restrictions, and credentials.
- From the web console, select Operators > Installed Operators > NVIDIA GPU Operator.
- On the ClusterPolicy tab, click Create ClusterPolicy.
The platform assigns the default name gpu-cluster-policy to the policy.
- Click Create.
Figure 29. Creating the ClusterPolicy
The GPU operator installs the components that are required to set up the NVIDIA GPUs in the cluster.
After approximately 20 minutes, the status of the deployed gpu-cluster-policy for the NVIDIA GPU operator changes to State: ready.
Figure 30. NVIDIA GPU operator ClusterPolicy status
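The same readiness check can be scripted instead of watching the web console. The following is a minimal sketch under stated assumptions: the policy uses the default name gpu-cluster-policy, its state is exposed at .status.state, and the helper name wait_for_cluster_policy and its polling defaults (60 attempts, 30 seconds apart) are hypothetical.

```shell
# Sketch: poll the ClusterPolicy until it reports the "ready" state.
# Assumptions: the default policy name gpu-cluster-policy, and a state value
# exposed at .status.state. The function name and defaults are hypothetical.
wait_for_cluster_policy() {
  local name="${1:-gpu-cluster-policy}"
  local attempts="${2:-60}"   # number of polls before giving up
  local interval="${3:-30}"   # seconds to wait between polls
  local state=""
  local i=1

  while [ "${i}" -le "${attempts}" ]; do
    # No namespace flag is passed here; the policy is looked up by name only.
    state="$(oc get clusterpolicy "${name}" -o jsonpath='{.status.state}' 2>/dev/null || true)"
    if [ "${state}" = "ready" ]; then
      echo "ClusterPolicy ${name} is ready"
      return 0
    fi
    echo "Attempt ${i}/${attempts}: state is '${state:-unknown}'"
    sleep "${interval}"
    i=$((i + 1))
  done

  echo "Timed out waiting for ClusterPolicy ${name}" >&2
  return 1
}

# Usage (requires a logged-in oc session):
#   wait_for_cluster_policy gpu-cluster-policy 60 30
```

The defaults allow up to 30 minutes, which leaves some margin over the roughly 20 minutes noted above.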
- To verify that GPU resources are available for pods, create a pod to run the nvidia-smi command. See this sample file in GitHub for the pod specification.
The following figure shows sample output from the command:
Figure 31. Sample nvidia-smi command output
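For orientation, a pod specification for this check might look like the following. This is a minimal sketch and not the sample file referenced above: the pod name nvidia-smi-check and the helper name render_smi_pod are hypothetical, and the container image is left as an argument because this guide does not name one. The nvidia.com/gpu resource limit is what asks the scheduler for one GPU.

```shell
# Sketch: emit a minimal pod manifest that requests one GPU and runs nvidia-smi.
# Assumptions: the pod name and function name are hypothetical; the image
# argument must be a CUDA-capable image that you choose and can pull.
render_smi_pod() {
  local image="${1:-}"
  if [ -z "${image}" ]; then
    echo "usage: render_smi_pod <cuda-image>" >&2
    return 1
  fi
  # The heredoc expands ${image} and writes the manifest to standard output.
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-check
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: ${image}
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

# Usage (requires a logged-in oc session and a project you can deploy to):
#   render_smi_pod <cuda-image> | oc apply -f -
#   oc logs pod/nvidia-smi-check
```

Depending on the security context constraints in the target project, the pod may need additional securityContext settings before it is admitted.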