Component validation
Dell Technologies conducted and documented sample workload tests for each component. These tests verified component compatibility and basic functionality with the recommended hardware and software configurations in Cloudera CDP Private Cloud Base.
TeraSuite validation
About this task
TeraSuite is a suite of programs that generate, sort, and validate a large dataset to benchmark the performance of a Hadoop cluster. It consists of TeraGen, TeraSort, and TeraValidate, which are part of the Apache Hadoop examples package. Dell Technologies ran TeraSuite programs to validate the HDFS and MapReduce layers of the Hadoop cluster.
Steps
1. Run TeraGen to generate the input dataset (10 billion 100-byte rows, approximately 1 TB) in the teragen output directory:
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/\
hadoop-mapreduce-examples.jar teragen 10000000000 teragen
2. Run TeraSort to sort the generated data into the terasort output directory:
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/\
hadoop-mapreduce-examples.jar terasort teragen terasort
3. Run TeraValidate to verify that the sorted output is globally ordered:
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/\
hadoop-mapreduce-examples.jar teravalidate terasort teravalidate
TestDFSIO validation
About this task
TestDFSIO is a distributed I/O benchmark tool that measures the read and write throughput of HDFS using MapReduce jobs. Dell Technologies ran TestDFSIO to validate HDFS I/O functionality.
Steps
1. Run the TestDFSIO write test to create 5,000 files of 128 MB each:
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/\
hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 5000 \
-size 128MB
2. Run the TestDFSIO read test against the same files:
yarn jar /opt/cloudera/parcels/CDH/jars/\
hadoop-mapreduce-client-jobclient-3.1.1.7.1.7.0-551-tests.jar TestDFSIO -read \
-nrFiles 5000 -size 128MB
Hadoop Streaming validation
About this task
Hadoop Streaming lets MapReduce jobs use any executable as the mapper or the reducer. Dell Technologies ran a sample Python mapper and reducer against a text file in HDFS to validate the streaming functionality.
Steps
1. Submit the streaming job with the Python mapper and reducer scripts:
yarn jar /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p2000.37147774/jars/\
hadoop-streaming-3.1.1.7.1.7.2000-305.jar -file /root/final_test/mapper.py -mapper \
/root/final_test/mapper.py -file /root/final_test/reducer.py -reducer \
/root/final_test/reducer.py -input /user/root/words.txt -output /user/root/mp_reduce
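The mapper and reducer scripts are not reproduced in this guide; only their paths appear in the command above. A minimal word-count pair along the following lines would work with that command (the file contents are an assumption, not the original scripts):

#!/usr/bin/env python
# mapper.py - emit "<word>\t1" for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py - sum the counts per word; MapReduce delivers input sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Both scripts must be marked executable (chmod +x) before the job is submitted.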
Spark validation
About this task
Apache Spark is a distributed processing system for big data analytics workloads. It delivers high performance for both batch and streaming data by leveraging in-memory caching and optimized query execution across the cluster.
Steps
1. Submit the SparkPi example application to YARN in cluster deploy mode:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode cluster /opt/cloudera/parcels/CDH/jars/\
spark-examples_2.11-2.4.7.7.1.7.2000-305.jar
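The same functional check can also be scripted in Python. The sketch below is a hypothetical pi.py, not part of the documented validation; it estimates Pi with a Monte Carlo sample and can be submitted with the same spark-submit options:

# pi.py - minimal PySpark job that estimates Pi by Monte Carlo sampling;
# submit with: spark-submit --master yarn --deploy-mode cluster pi.py
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PiEstimate").getOrCreate()
n = 1000000  # number of random points to sample

def inside(_):
    x, y = random.random(), random.random()
    return x * x + y * y <= 1.0

count = spark.sparkContext.parallelize(range(n), 10).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()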
Spark3 on GPU validation
About this task
Dell Technologies validated the RAPIDS Accelerator for Apache Spark by launching a Spark 3 shell with GPU task scheduling and RAPIDS SQL execution enabled:
spark3-shell --master yarn --conf spark.task.resource.gpu.amount=1 --conf \
spark.rapids.sql.concurrentGpuTasks=1 --conf spark.sql.files.maxPartitionBytes=256m \
--conf spark.locality.wait=0s --conf spark.sql.adaptive.enabled=true \
--conf spark.rapids.memory.pinnedPool.size=2G --conf "spark.rapids.sql.enabled=true" \
--conf "spark.executor.memoryOverhead=5g" \
--conf spark.sql.adaptive.advisoryPartitionSizeInBytes=1
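spark3-shell starts a Scala session; the quick check below assumes an equivalent PySpark session launched with the same --conf flags (the pyspark3 launcher name and the session itself are assumptions). With the RAPIDS plugin active, GPU-enabled physical plans contain Gpu* operators:

# Run inside a PySpark session started with the same RAPIDS --conf flags
# (launcher name pyspark3 is an assumption). The shell predefines `spark`.
df = spark.range(0, 10000000).selectExpr("id % 10 AS k", "id AS v")
agg = df.groupBy("k").sum("v")
agg.explain()  # a GPU plan shows Gpu* operators, for example GpuHashAggregate
agg.show()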
Hive validation
About this task
Hive is a data warehouse system that enables SQL-like queries on large datasets that are stored on the Hadoop cluster. It uses Apache Tez or MapReduce as the execution engine.
Dell Technologies ran simple table creation, insert, and select queries to provide a functional validation of the Hive service.
Steps
1. Start the Hive shell:
hive
2. Create a test database and table:
CREATE DATABASE TEST;
CREATE TABLE TEST.Sales_Data(StoreLocation VARCHAR(30), Product VARCHAR(30),
OrderDate DATE, Revenue DECIMAL(10,2));
3. Insert sample rows:
INSERT INTO TEST.Sales_Data VALUES
('Bangalore','Nutella','2023-05-16',7455.67),
('Bangalore','Peanut Butter','2023-05-16',5316.89),
('Bangalore','Milk','2023-05-16',2433.76),
('Hyderabad','Bananas','2023-05-16',9456.01),
('Hyderabad','Nutella','2023-05-16',3644.33),
('Hyderabad','Peanut Butter','2023-05-16',8988.64),
('Hyderabad','Milk','2023-05-16',1621.58);
4. Query the table to confirm that the data is accessible:
SELECT * FROM TEST.Sales_Data;
Results
The SELECT query returned the seven inserted rows, confirming basic Hive functionality.
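The same check can be scripted outside the Hive shell. The sketch below uses the PyHive client, which is not part of the documented procedure; the HiveServer2 host and the authentication settings are placeholders that depend on the cluster configuration:

# check_hive.py - functional query against HiveServer2 with PyHive
# (pip install 'pyhive[hive]'); host and user are placeholders.
from pyhive import hive

conn = hive.Connection(host="hs2.example.com", port=10000, username="hive")
cur = conn.cursor()
cur.execute("SELECT StoreLocation, SUM(Revenue) FROM TEST.Sales_Data GROUP BY StoreLocation")
for row in cur.fetchall():
    print(row)
conn.close()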
HBase validation
About this task
HBase is a column-oriented, nonrelational database management system. It uses HDFS as its distributed storage layer and provides a fault-tolerant mechanism for storing sparse datasets.
HBase commands that created tables, wrote rows, and read them back were used to validate the HBase service.
Steps
1. Start the HBase shell:
hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.23, rf42302b28aceaab773b15f234aa8718fff7eea3c, Tue May 16 00:55:22 UTC 2023
hbase(main):001:0>
2. Create a table with two column families:
create 'history', 'home', 'away'
0 row(s) in 1.1300 seconds
=> Hbase::Table - history
3. Write two cells to a row and read the row back:
put 'history','row1','home:name','jim'
put 'history','row1','home:city','Boston'
get 'history','row1'
COLUMN      CELL
 home:city   timestamp=1417524216501, value=Boston
 home:name   timestamp=1417524185058, value=jim
4. Disable and drop the table, and confirm that it no longer exists:
disable 'history'
drop 'history'
0 row(s) in 0.3060 seconds
exists 'history'
Table history does not exist
0 row(s) in 0.0730 seconds
5. For the large-table test, tune the following HBase settings in Cloudera Manager:
Maximum Number of HStoreFiles Compaction: 20
HStore Blocking Store Files: 200
HBase Memstore Block Multiplier: 4
HBase Memstore Flush Size: 256 MB
6. Create a table for the one-billion-row test:
create 'staff', 'id', 'name', 'age', 'city', 'department', 'salary'
0 row(s) in 1.1400 seconds
=> Hbase::Table - staff
7. Load the data into the table with the ImportTsv utility:
/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' \
-Dimporttsv.columns=HBASE_ROW_KEY,<column names> <tablename> \
<location of file from HDFS>
8. Count the rows to verify the load:
count 'staff', INTERVAL => 1000000
Current count: 1000000, row: 100899997
Current count: 2000000, row: 101799997
...
Current count: 999000000, row: 999099999
Current count: 1000000000, row: id1000000000
1000000000 row(s)
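A programmatic spot-check of the table is also possible. The sketch below uses the happybase client, which requires the HBase Thrift server; neither the library nor the Thrift endpoint is part of the documented procedure, and the hostname is a placeholder:

# check_hbase.py - write one row to the staff table and read it back
# (pip install happybase); requires the HBase Thrift server, default port 9090.
import happybase

conn = happybase.Connection(host="hbase-thrift.example.com", port=9090)
table = conn.table("staff")
table.put(b"id1000000001", {b"name:first": b"Jane", b"city:name": b"Austin"})
print(table.row(b"id1000000001"))  # expect the two cells just written
conn.close()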
Hue validation
About this task
Dell Technologies performed the following steps to validate Hue functionality:
Steps
1. Run a SELECT query in the Editor to verify table creation and accessibility.
Ranger validation
About this task
Dell Technologies performed the following steps to validate Ranger functionality:
Steps
Atlas validation
About this task
Dell Technologies performed the following steps to validate Atlas functionality:
Steps
1. In the Hive shell, create a test table:
CREATE TABLE employee (ssn STRING, name STRING, location STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
2. Generate a sample data file and copy it into the table's HDFS location:
printf "111-111-111,James,San Jose\n222-222-222,Christian,Santa Clara\n333-333-333,George,Fremont" \
> employeedata.txt
hdfs dfs -copyFromLocal employeedata.txt /warehouse/tablespace/managed/hive/employee
3. Create a second table from the first so that Atlas captures the lineage:
CREATE TABLE employee_alt AS SELECT name, location FROM employee;
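After the CREATE TABLE ... AS SELECT statement runs, Atlas should register the new table and its lineage from employee. One way to confirm that the entity exists is the Atlas v2 basic-search REST endpoint, sketched below; the host and the admin credentials are placeholders:

# check_atlas.py - confirm that Atlas registered the new Hive table
# (host and credentials are placeholders; 21000 is the default Atlas port).
import requests

resp = requests.get(
    "https://atlas.example.com:21000/api/atlas/v2/search/basic",
    params={"typeName": "hive_table", "query": "employee_alt"},
    auth=("admin", "admin"),
    verify=False,  # lab-only: skip TLS verification
)
resp.raise_for_status()
for entity in resp.json().get("entities", []):
    print(entity["attributes"]["qualifiedName"])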
Atlas and Ranger integration validation
About this task
Dell Technologies implemented tag-based access control policies on sample data to validate the functionality provided by the integration of Atlas and Ranger.
Dell Technologies performed the following steps to validate the Atlas and Ranger integration:
Steps
Ozone validation
About this task
HDFS has a single NameNode that manages both the namespace and the block metadata. Ozone, by contrast, separates namespace management from block space management through the Ozone Manager (OM) and the Storage Container Manager (SCM), so it can theoretically handle far more files and objects than HDFS.
Dell Technologies performed the following steps to validate Ozone:
Steps
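The validation steps themselves are not reproduced in this extract. As an illustration only, one common way to exercise Ozone is through its S3-compatible gateway; the sketch below uses boto3 against a placeholder endpoint with placeholder credentials, and assumes the S3 Gateway role is running (default port 9878):

# check_ozone.py - functional check through the Ozone S3 Gateway with boto3
# (pip install boto3); endpoint and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://ozone-s3g.example.com:9878",
    aws_access_key_id="testuser",
    aws_secret_access_key="testsecret",
)
s3.create_bucket(Bucket="validation")
s3.put_object(Bucket="validation", Key="hello.txt", Body=b"hello ozone")
obj = s3.get_object(Bucket="validation", Key="hello.txt")
print(obj["Body"].read())  # expect b'hello ozone'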