The testing objective was to run sample checks to validate the functionality of different components and services that Dell Technologies installed as parts of the Dell Validated Design for Analytics — Data Lakehouse.
Dell Technologies tested a pair of NVIDIA A100 GPUs as whole, nonpartitioned devices with driver version 470.82.01. After the Symcloud Platform cluster had been provisioned, the publicly available NVIDIA GPU Operator was installed using its Helm chart, which enabled Kubernetes to discover the devices. To make the GPUs available within the Symcloud environment, Dell Technologies ran the following command:
robin host probe --rediscover <nodename>
After rediscovery, the GPUs were available for deploying application bundles that require GPU resources.
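The overall sequence resembles the following sketch; the Helm repository URL, release name, and namespace are taken from the public NVIDIA GPU Operator documentation and are assumptions rather than the exact values used in this validation:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace
robin host probe --rediscover <nodename>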
The Dell Validated Design for Analytics — Data Lakehouse uses a Spark 3.3.2 bundle with an optional worker node that requires one or more GPUs. Enabling this worker when instantiating the application requests the GPU and causes the worker to register with the Spark master, indicating the extra resource.
Running jobs against the deployed Spark application stack spreads the work across all nodes. If the workload requests GPU resources for acceleration, the job is scheduled only on workers that have registered with a GPU. This job scheduling was verified by running tasks that use Spark SQL RAPIDS calls.
/opt/spark/bin/spark-shell --jars /opt/spark/jars/rapids-4-spark_2.12-22.06.0.jar \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
--conf spark.executor.resource.gpu.vendor=nvidia.com \
--conf spark.rapids.sql.enabled=true \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.rapids.shims-provider-override=com.nvidia.spark.rapids.shims.spark330.SparkShimServiceProvider
scala> val df = sc.makeRDD(1 to 10000000, 6).toDF
df: org.apache.spark.sql.DataFrame = [value: int]
scala> val df2 = sc.makeRDD(1 to 10000000, 6).toDF
df2: org.apache.spark.sql.DataFrame = [value: int]
scala> df.select( $"value" as "a").join(df2.select($"value" as "b"), $"a" === $"b").count
res0: Long = 10000000
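Inspecting the physical plan for the same join in the spark-shell session offers an additional confirmation that the RAPIDS plugin ran the work on the GPU; Gpu-prefixed operators in the plan output indicate GPU execution:
scala> df.select( $"value" as "a").join(df2.select($"value" as "b"), $"a" === $"b").explain()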
Running the /usr/bin/nvidia-smi tool on the physical worker that holds the GPU resources shows the GPU that is used for processing.
[root@worker6 ~]# nvidia-smi
Mon Apr 10 14:46:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   28C    P0    31W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:CA:00.0 Off |                    0 |
| N/A   30C    P0    36W / 250W |  39902MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A    1310307      C   ....0-openjdk-amd64/bin/java   39895MiB |
+-----------------------------------------------------------------------------+
Dell Technologies tested the Delta Lake data lakehouse that was deployed on the Symcloud Platform. Dell Technologies also tested integrating Dell PowerScale and Dell ECS storage arrays with Delta Lake. This platform supports running various data analytics applications, such as Spark, Kafka, and Splunk, on Symcloud Platform with different storage systems.
Dell Technologies used Delta Lake 2.3.0 with Spark 3.3.2 to validate all data lakehouse functionality. All the features of Delta Lake with Spark were validated.
These tests validate running a Spark bundle application on the Symcloud Platform using Delta Lake as its data lakehouse storage. Dell Technologies used the Apache Spark 3.3.2 distribution prebuilt for Hadoop with Delta Lake 2.3.0.
Dell PowerScale and Dell ECS served as the Delta Lake data lakehouse storage for read and write operations. Dell Technologies also validated Spark with an NVIDIA GPU on this platform.
Data can be written to or read from Spark using different API protocols, such as HDFS for PowerScale and S3 for ECS. The hadoop-aws:3.2.3 library was used to access the data from ECS through the S3 API protocol. Spark with PowerScale testing consisted of the following steps:
/opt/spark/bin/pyspark --packages \
io.delta:delta-core_2.12:2.3.0,org.apache.hadoop:hadoop-aws:3.3.4 --conf \
"spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf \
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
>>> df1 = spark.read.format("delta").load('hdfs://<ip_address>/data/hdfs-test-table')
>>> df1.show(1)
+--------+--------+------+------+---------------+---------------+--------+-------------------+--------------------+--------------------+-----------------+------------------+----+------------+-------+--------+
|msisdn_1|msisdn_2|cell_1|cell_2|     operator_1|     operator_2|duration|          timestamp|termination_status_1|termination_status_2|          value_1|           value_2|type|transit_type|  tac_1|   tac_2|
+--------+--------+------+------+---------------+---------------+--------+-------------------+--------------------+--------------------+-----------------+------------------+----+------------+-------+--------+
|       1|       1|   259|   259|TelecomMobile 1|TelecomMobile 1|     694|2023-04-10T03:31:58|                   1|                   1|6.321016642869488|6.7185260289517865|CALL|         INT|1203900|35203107|
+--------+--------+------+------+---------------+---------------+--------+-------------------+--------------------+--------------------+-----------------+------------------+----+------------+-------+--------+
>>> df2 = df1.write.format("delta").mode("overwrite").save("hdfs://<ip_address>/data/hdfs-test-table1")
>>> df3 = spark.read.format("delta").load('hdfs://<ip_address>/data/hdfs-test-table1')
>>> df3.show(1)
+--------+--------+------+------+---------------+---------------+--------+-------------------+--------------------+--------------------+-----------------+------------------+----+------------+-------+--------+
|msisdn_1|msisdn_2|cell_1|cell_2|     operator_1|     operator_2|duration|          timestamp|termination_status_1|termination_status_2|          value_1|           value_2|type|transit_type|  tac_1|   tac_2|
+--------+--------+------+------+---------------+---------------+--------+-------------------+--------------------+--------------------+-----------------+------------------+----+------------+-------+--------+
|       1|       1|   259|   259|TelecomMobile 1|TelecomMobile 1|     694|2023-04-10T03:31:58|                   1|                   1|6.321016642869488|6.7185260289517865|CALL|         INT|1203900|35203107|
+--------+--------+------+------+---------------+---------------+--------+-------------------+--------------------+--------------------+-----------------+------------------+----+------------+-------+--------+
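As a further illustrative check of Delta Lake table features, an earlier version of the table can be read back with time travel; the version number below is only an example:
>>> df4 = spark.read.format("delta").option("versionAsOf", 0).load('hdfs://<ip_address>/data/hdfs-test-table1')
>>> df4.count()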
/opt/spark/bin/spark-shell --packages \
io.delta:delta-core_2.12:2.3.0,org.apache.hadoop:hadoop-aws:3.3.4 --conf \
"spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf \
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
sc.hadoopConfiguration.set("fs.s3a.access.key", "access key id")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "secret key")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://<s3api address: port>")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", \
"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
spark.range(5).repartition(1).write.format("delta").save("s3a://<bucketname>/")
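Reading the table back from the same bucket in the spark-shell session confirms the round trip through the S3 API:
spark.read.format("delta").load("s3a://<bucketname>/").show()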
This test validates deploying a Kafka application bundle on Symcloud Platform with Delta Lake. Dell Technologies integrated Kafka with Spark to stream the input data from Kafka, perform data transformations in Spark, and store the result in Delta Lake. Kafka testing consisted of the following steps:
kubectl exec -it kakfa7-broker-01 -- /bin/bash
kubectl exec -it kakfa7-broker-01 -- /bin/bash -c "/usr/bin/kafka-console-producer --broker-list kakfa7-broker-01:9092 --topic test < /home/appuser/TestDataOne.csv"
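To confirm that the test data reached the topic, a console consumer can be run in the same broker pod; this check is a sketch that reuses the pod and topic names above:
kubectl exec -it kakfa7-broker-01 -- /bin/bash -c "/usr/bin/kafka-console-consumer --bootstrap-server kakfa7-broker-01:9092 --topic test --from-beginning --max-messages 5"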
/spark/bin/pyspark --packages \
io.delta:delta-core_2.12:2.3.0,org.apache.hadoop:hadoop-aws:3.3.4,\
org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.2 --conf \
"spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf \
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
In the Spark container:
import sys
import os
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.functions import *
from delta.tables import *

# Configure S3A access to the ECS bucket
hadoopConf = spark._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "<access_key>")
hadoopConf.set("fs.s3a.secret.key", "<secret_key>")
hadoopConf.set("fs.s3a.endpoint", "http://<s3_api_address>")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")

# Stream records from the Kafka topic
df = (spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kakfa7-broker-01:9092")
      .option("subscribe", "kafkaspark")
      .option("startingOffsets", "latest")
      .load())

# Write the stream to a Delta table on ECS
df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("mergeSchema", "true") \
    .option("checkpointLocation", "/tmp/kafaktest") \
    .start("s3a://sparkdelta/kafkaspark2/")

# Stream the raw Kafka values to the console for inspection
# (a streaming DataFrame cannot be displayed directly with show())
df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start()
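After the streaming query has committed a few micro-batches, the Delta output can be read back as a batch table to confirm that records arrived; this read uses the same bucket path as the stream above:
df_out = spark.read.format("delta").load("s3a://sparkdelta/kafkaspark2/")
df_out.selectExpr("CAST(value AS STRING)").show(5)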
This test validates deploying Symcloud Platform in high availability (HA) mode. For a highly available Kubernetes cluster, Dell Technologies recommends configuring a Kubernetes cluster with a minimum of three control plane nodes.
Symcloud uses the keepalived and HAProxy services to provide high availability to the Kubernetes API server. The keepalived service is responsible for managing a Virtual IP address (VIP) where all requests to the Kubernetes API server are sent. The HAProxy service is responsible for redirecting API server requests to instances of the API server running on each of the control plane nodes.
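Two quick checks of this arrangement, assuming the default API server port of 6443 and the VIP supplied at installation time, are to query the API server through the VIP and to see which control plane node currently holds the address:
curl -k https://<vip_address>:6443/version
ip addr | grep <vip_address>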
Dell Technologies used the following command to deploy Symcloud Platform in HA mode:
gorobin_5.4.3-120 onprem install-ha --hosts hosts.json --config-json config.json \
--gorobintar gorobintar-5.4.3-120.tar --vip <ip_address> --vrid 5 --ignore-warnings
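After the installer completes, a simple sanity check (assuming kubectl is configured to use the VIP) is to confirm that all three control plane nodes report Ready:
kubectl get nodes -o wide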
Dell Technologies verified the basic Symcloud Platform cluster HA functionality by manually simulating failover of a Kubernetes control plane node. When the primary manager node fails or becomes unhealthy, one of the secondary manager nodes takes over as primary. Data integrity is maintained for key metadata that is related to storage management for the Symcloud Platform cluster and for deployed applications, and a mechanism is provided for recovering from hard failures.
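One way to run this simulation, sketched here with environment-specific node names omitted, is to take the current primary manager node offline and then confirm from a surviving control plane node that the cluster remains reachable through the VIP:
# On the current primary manager node, simulate a hard failure
shutdown -h now
# From a surviving control plane node, confirm that the API server still responds
kubectl get nodes -w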