Dell Technologies tested a pair of NVIDIA A100 GPUs as whole, nonpartitioned devices with driver version 470.82.01. After the Symcloud Platform cluster had been provisioned, the publicly available NVIDIA GPU Operator was installed using its Helm chart, which enabled Kubernetes to discover the devices. To make the GPUs available within the Symcloud environment, Dell Technologies ran the robin host probe --rediscover <nodename> command. The GPUs were then available for deploying application bundles that require GPU resources.
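The GPU Operator installation described above typically follows the standard NVIDIA Helm workflow. The commands below are a sketch of that workflow; the release name and namespace are illustrative, not the exact values used in this validation:

```shell
# Add the NVIDIA Helm repository and refresh the local chart index.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator into its own namespace. Because the A100s are
# used as whole, nonpartitioned devices, no MIG configuration is applied.
helm install --wait gpu-operator \
  --namespace gpu-operator --create-namespace \
  nvidia/gpu-operator
```

Once the operator's pods are running, the nodes advertise an nvidia.com/gpu resource that Kubernetes schedulers and, after rediscovery, the Symcloud Platform can allocate.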
The Dell Validated Design for Analytics — Modern Data Stack uses a Spark 3.4.1 bundle with an optional worker node that requires one or more GPUs. Enabling this worker when instantiating the application requests the GPU and causes the worker to register with the Spark master, indicating the extra resource.
Running jobs against the deployed Spark application stack spreads the work across all nodes. If the workload requests GPU resources for acceleration, the job is scheduled only on workers that have registered a GPU with the Spark master. This scheduling behavior was verified by running tasks that use Spark SQL RAPIDS calls:
/opt/spark/bin/spark-shell --jars /opt/spark/jars/rapids-4-spark_2.12-22.06.0.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
  --conf spark.executor.resource.gpu.vendor=nvidia.com \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.rapids.shims-provider-override=com.nvidia.spark.rapids.shims.spark330.SparkShimServiceProvider
scala> val df = sc.makeRDD(1 to 10000000, 6).toDF
df: org.apache.spark.sql.DataFrame = [value: int]
scala> val df2 = sc.makeRDD(1 to 10000000, 6).toDF
df2: org.apache.spark.sql.DataFrame = [value: int]
scala> df.select( $"value" as "a").join(df2.select($"value" as "b"), $"a" === $"b").count
res0: Long = 10000000
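As an additional check beyond the original transcript, the physical plan of the query can be inspected from the same spark-shell session. When the RAPIDS plugin has taken over an operator, it appears with a Gpu prefix (for example, GpuShuffledHashJoin) in the explain output; a plain CPU plan would indicate that the plugin did not engage:

```scala
// Hypothetical follow-up in the same spark-shell session: print the
// physical plan for the join. With spark.rapids.sql.enabled=true,
// GPU-accelerated operators are shown with a "Gpu" prefix.
df.select($"value" as "a")
  .join(df2.select($"value" as "b"), $"a" === $"b")
  .explain()
```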
Running the /usr/bin/nvidia-smi tool on the physical worker that hosts the GPU resources shows the GPU that is used for processing.
[root@worker6 ~]# nvidia-smi
Mon Apr 10 14:46:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   28C    P0    31W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:CA:00.0 Off |                    0 |
| N/A   30C    P0    36W / 250W |  39902MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A   1310307      C   ....0-openjdk-amd64/bin/java    39895MiB |
+-----------------------------------------------------------------------------+