The testing objective was to run sample checks to validate the functionality of different components and services that Dell Technologies installed as parts of the Dell Validated Design for Analytics — Data Lakehouse.
Dell Technologies tested a pair of NVIDIA A100 GPUs as whole, nonpartitioned devices with driver version 470.82.01. After the Symcloud Platform cluster had been provisioned, the publicly available NVIDIA GPU Operator was installed using its Helm chart, which enabled Kubernetes to discover the devices. To make the GPUs available within the Symcloud environment, Dell Technologies ran the following command:
robin host probe --rediscover <nodename>
After rediscovery, the GPUs were available for deploying application bundles that require GPU resources.
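The overall sequence resembles the following sketch; the Helm repository URL, release name, and namespace are taken from the public NVIDIA GPU Operator documentation and are assumptions rather than the exact values used in this validation:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace
robin host probe --rediscover <nodename>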
The Dell Validated Design for Analytics — Data Lakehouse uses a Spark 3.3.2 bundle with an optional worker node that requires one or more GPUs. Enabling this worker when instantiating the application requests the GPU and causes the worker to register with the Spark master, indicating the extra resource.
Running jobs against the deployed Spark application stack spreads the work across all nodes. If the workload requests GPU resources for acceleration, the job is scheduled only on workers that have registered with a GPU. This job scheduling was verified by running tasks that use Spark SQL RAPIDS calls.
/opt/spark/bin/spark-shell --jars /opt/spark/jars/rapids-4-spark_2.12-22.06.0.jar \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
--conf spark.executor.resource.gpu.vendor=nvidia.com \
--conf spark.rapids.sql.enabled=true \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.rapids.shims-provider-override=com.nvidia.spark.rapids.shims.spark330.SparkShimServiceProvider
scala> val df = sc.makeRDD(1 to 10000000, 6).toDF
df: org.apache.spark.sql.DataFrame = [value: int]
scala> val df2 = sc.makeRDD(1 to 10000000, 6).toDF
df2: org.apache.spark.sql.DataFrame = [value: int]
scala> df.select( $"value" as "a").join(df2.select($"value" as "b"), $"a" === $"b").count
res0: Long = 10000000
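Inspecting the physical plan for the same join in the spark-shell session offers an additional confirmation that the RAPIDS plugin ran the work on the GPU; Gpu-prefixed operators in the plan output indicate GPU execution:
scala> df.select( $"value" as "a").join(df2.select($"value" as "b"), $"a" === $"b").explain()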
Running the /usr/bin/nvidia-smi tool on the physical worker that holds the GPU resources shows the GPU that is used for processing.
[root@worker6 ~]# nvidia-smi
Mon Apr 10 14:46:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   28C    P0    31W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:CA:00.0 Off |                    0 |
| N/A   30C    P0    36W / 250W |  39902MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A    1310307      C   ....0-openjdk-amd64/bin/java   39895MiB |
+-----------------------------------------------------------------------------+
Dell Technologies tested the Delta Lake data lakehouse that was deployed on the Symcloud Platform. Dell Technologies also tested integrating Dell PowerScale and Dell ECS storage arrays with Delta Lake. This platform supports running various data analytics applications, such as Spark, Kafka, and Splunk, on Symcloud Platform with different storage systems.
Dell Technologies used Delta Lake 2.3.0 with Spark 3.3.2 to validate all data lakehouse functionality. All the features of Delta Lake with Spark were validated.
These tests validate running a Spark bundle application on the Symcloud Platform using Delta Lake as its data lakehouse storage. Dell Technologies used the Apache Spark 3.3.2 distribution prebuilt for Hadoop with Delta Lake 2.3.0.
Dell PowerScale and Dell ECS served as the Delta Lake data lakehouse storage for read and write operations. Dell Technologies also validated Spark with an NVIDIA GPU on this platform.
Data can be written to or read from Spark using different API protocols, such as HDFS for PowerScale and S3 for ECS. The hadoop-aws:3.2.3 library was used to access the data from ECS through the S3 API protocol. Spark with PowerScale testing consisted of the following steps:
/opt/spark/bin/pyspark --packages \
io.delta:delta-core_2.12:2.3.0,org.apache.hadoop:hadoop-aws:3.3.4 --conf \
"spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf \
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
>>> df1 = spark.read.format("delta").load('hdfs://<ip_address>/data/hdfs-test-table')
>>> df1.show(1)
+--------+--------+------+------+---------------+---------------+--------+-------------------+--------------------+--------------------+-----------------+------------------+----+------------+-------+--------+
|msisdn_1|msisdn_2|cell_1|cell_2|     operator_1|     operator_2|duration|          timestamp|termination_status_1|termination_status_2|          value_1|           value_2|type|transit_type|  tac_1|   tac_2|
+--------+--------+------+------+---------------+---------------+--------+-------------------+--------------------+--------------------+-----------------+------------------+----+------------+-------+--------+
|       1|       1|   259|   259|TelecomMobile 1|TelecomMobile 1|     694|2023-04-10T03:31:58|                   1|                   1|6.321016642869488|6.7185260289517865|CALL|         INT|1203900|35203107|
+--------+--------+------+------+---------------+---------------+--------+-------------------+--------------------+--------------------+-----------------+------------------+----+------------+-------+--------+
>>> df2 = df1.write.format("delta").mode("overwrite").save("hdfs://<ip_address>/data/hdfs-test-table1")
>>> df3 = spark.read.format("delta").load('hdfs://<ip_address>/data/hdfs-test-table1')
>>> df3.show(1)
+--------+--------+------+------+---------------+---------------+--------+-------------------+--------------------+--------------------+-----------------+------------------+----+------------+-------+--------+
|msisdn_1|msisdn_2|cell_1|cell_2|     operator_1|     operator_2|duration|          timestamp|termination_status_1|termination_status_2|          value_1|           value_2|type|transit_type|  tac_1|   tac_2|
+--------+--------+------+------+---------------+---------------+--------+-------------------+--------------------+--------------------+-----------------+------------------+----+------------+-------+--------+
|       1|       1|   259|   259|TelecomMobile 1|TelecomMobile 1|     694|2023-04-10T03:31:58|                   1|                   1|6.321016642869488|6.7185260289517865|CALL|         INT|1203900|35203107|
+--------+--------+------+------+---------------+---------------+--------+-------------------+--------------------+--------------------+-----------------+------------------+----+------------+-------+--------+
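As a further illustrative check of Delta Lake table features, an earlier version of the table can be read back with time travel; the version number below is only an example:
>>> df4 = spark.read.format("delta").option("versionAsOf", 0).load('hdfs://<ip_address>/data/hdfs-test-table1')
>>> df4.count()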
/opt/spark/bin/spark-shell --packages \
io.delta:delta-core_2.12:2.3.0,org.apache.hadoop:hadoop-aws:3.3.4 --conf \
"spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf \
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
sc.hadoopConfiguration.set("fs.s3a.access.key", "access key id")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "secret key")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://<s3api address: port>")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", \
"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
spark.range(5).repartition(1).write.format("delta").save("s3a://<bucketname>/")
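Reading the table back from the same bucket in the spark-shell session confirms the round trip through the S3 API:
spark.read.format("delta").load("s3a://<bucketname>/").show()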
This test validates deploying a Kafka application bundle on Symcloud Platform with Delta Lake. Dell Technologies integrated Kafka with Spark to stream the input data from Kafka, perform data transformations in Spark, and store the result in Delta Lake. Kafka testing consisted of the following steps:
kubectl exec -it kakfa7-broker-01 -- /bin/bash
kubectl exec -it kakfa7-broker-01 -- /bin/bash -c "/usr/bin/kafka-console-producer --broker-list kakfa7-broker-01:9092 --topic test < /home/appuser/TestDataOne.csv"
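To confirm that the test data reached the topic, a console consumer can be run in the same broker pod; this check is a sketch that reuses the pod and topic names above:
kubectl exec -it kakfa7-broker-01 -- /bin/bash -c "/usr/bin/kafka-console-consumer --bootstrap-server kakfa7-broker-01:9092 --topic test --from-beginning --max-messages 5"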
/spark/bin/pyspark --packages \
io.delta:delta-core_2.12:2.3.0,org.apache.hadoop:hadoop-aws:3.3.4,\
org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.2 --conf \
"spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf \
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
In the Spark container:
import sys
import os
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.functions import *
from delta.tables import *

# Configure S3A access to the ECS bucket
hadoopConf = spark._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "<access_key>")
hadoopConf.set("fs.s3a.secret.key", "<secret_key>")
hadoopConf.set("fs.s3a.endpoint", "http://<s3_api_address>")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")

# Stream records from the Kafka topic
df = (spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kakfa7-broker-01:9092")
      .option("subscribe", "kafkaspark")
      .option("startingOffsets", "latest")
      .load())

# Write the stream to a Delta table on ECS
df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("mergeSchema", "true") \
    .option("checkpointLocation", "/tmp/kafaktest") \
    .start("s3a://sparkdelta/kafkaspark2/")

# Stream the raw Kafka values to the console for inspection
# (a streaming DataFrame cannot be displayed directly with show())
df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start()
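After the streaming query has committed a few micro-batches, the Delta output can be read back as a batch table to confirm that records arrived; this read uses the same bucket path as the stream above:
df_out = spark.read.format("delta").load("s3a://sparkdelta/kafkaspark2/")
df_out.selectExpr("CAST(value AS STRING)").show(5)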
This test validates deploying Symcloud Platform in high availability (HA) mode. For a highly available Kubernetes cluster, Dell Technologies recommends configuring a Kubernetes cluster with a minimum of three control plane nodes.
Symcloud uses the keepalived and HAProxy services to provide high availability to the Kubernetes API server. The keepalived service is responsible for managing a Virtual IP address (VIP) where all requests to the Kubernetes API server are sent. The HAProxy service is responsible for redirecting API server requests to instances of the API server running on each of the control plane nodes.
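Two quick checks of this arrangement, assuming the default API server port of 6443 and the VIP supplied at installation time, are to query the API server through the VIP and to see which control plane node currently holds the address:
curl -k https://<vip_address>:6443/version
ip addr | grep <vip_address>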
Dell Technologies used the following command to deploy Symcloud Platform in HA mode:
gorobin_5.4.3-120 onprem install-ha --hosts hosts.json --config-json config.json \
--gorobintar gorobintar-5.4.3-120.tar --vip <ip_address> --vrid 5 --ignore-warnings
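After the installer completes, a simple sanity check (assuming kubectl is configured to use the VIP) is to confirm that all three control plane nodes report Ready:
kubectl get nodes -o wide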
Dell Technologies verified the basic Symcloud Platform cluster HA functionality by manually simulating failover of a Kubernetes control plane node. When the primary manager node fails or becomes unhealthy, one of the secondary manager nodes takes over as primary. Data integrity is maintained for key metadata that is related to storage management for the Symcloud Platform cluster and for deployed applications, and a mechanism is provided for recovering from hard failures.
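One way to run this simulation, sketched here with environment-specific node names omitted, is to take the current primary manager node offline and then confirm from a surviving control plane node that the cluster remains reachable through the VIP:
# On the current primary manager node, simulate a hard failure
shutdown -h now
# From a surviving control plane node, confirm that the API server still responds
kubectl get nodes -w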