The testing objective was to run sample workloads to validate the functionality of the different components and services that Dell Technologies installed as part of Cloudera CDP Private Cloud Base.
The TeraSuite workload tool combines testing of the HDFS and MapReduce layers of a Hadoop cluster. Its goal is to generate, sort, and validate a configurable amount of data as quickly as possible, exercising the compute and local storage configurations with concurrent HDFS access. In this validation, teragen produces 10 billion 100-byte rows (approximately 1 TB), which terasort then sorts and teravalidate verifies.
time hadoop jar hadoop-mapreduce-examples-3.1.1.7.1.7.0-551.jar teragen \
-Ddfs.blocksize=536870912 -Dmapreduce.job.maps=240 -Dmapreduce.job.reduces=120 \
-Dmapreduce.map.speculative=true -Dmapreduce.map.output.compress=true 10000000000 \
hdfs://pvcmaster1.orange.local:8020/user/root/teragen1
time hadoop jar hadoop-mapreduce-examples-3.1.1.7.1.7.0-551.jar terasort \
-Ddfs.blocksize=536870912 -Dmapreduce.job.maps=240 -Dmapreduce.job.reduces=120 \
-Dmapreduce.map.speculative=true -Dmapreduce.map.output.compress=true \
/user/root/teragen1 /user/root/terasort1
time hadoop jar hadoop-mapreduce-examples-3.1.1.7.1.7.0-551.jar teravalidate \
-Ddfs.blocksize=536870912 -Dmapreduce.job.maps=240 -Dmapreduce.job.reduces=120 \
-Dmapreduce.map.speculative=true -Dmapreduce.map.output.compress=true \
/user/root/terasort1 /user/root/teravalidate
TestDFSIO is a distributed I/O benchmark in which each map task writes or reads one file; it was used to validate HDFS throughput.
yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-jobclient-3.1.1.7.1.7.0-551-tests.jar \
TestDFSIO -write -nrFiles 5000 -size 128MB
yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-jobclient-3.1.1.7.1.7.0-551-tests.jar \
TestDFSIO -read -nrFiles 5000 -size 128MB
About this task
Hadoop MapReduce is a programming model that is used to process bulk data. MapReduce programs run in parallel, delivering high-performance, large-scale data analysis across the cluster.
Step
hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-streaming-3.1.1.7.1.7.0-551.jar \
-file /root/final_test/mapper.py -mapper /root/final_test/mapper.py \
-file /root/final_test/reducer.py -reducer /root/final_test/reducer.py \
-input /user/root/words.txt -output /user/root/mp_reduce
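The guide does not list the contents of mapper.py and reducer.py. A minimal word-count pair along the following lines would exercise the streaming path; the scripts are our sketch, assuming plain-text input such as words.txt.

#!/usr/bin/env python
# mapper.py - illustrative word-count mapper for Hadoop Streaming
# (our sketch; the guide does not include the script contents).
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Emit each word with a count of 1, tab-separated for the shuffle
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py - illustrative word-count reducer for Hadoop Streaming.
# Streaming delivers keys sorted, so equal words arrive contiguously.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))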
Results
At the conclusion of this test, Dell Technologies had validated the functionality of the MapReduce service.
About this test
Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
Step
spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
/opt/cloudera/parcels/CDH/jars/spark-examples_2.11-2.4.7.7.1.7.0-551.jar 10
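The SparkPi class estimates Pi by Monte Carlo sampling; the trailing argument 10 sets the number of partitions. For reference, a functionally similar PySpark job (our sketch; the file name and sample count are illustrative, not from the guide) could be submitted the same way with spark-submit --master yarn pi_estimate.py.

# pi_estimate.py - minimal PySpark sketch mirroring the SparkPi example above.
import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PiEstimate").getOrCreate()
partitions = 10
n = 100000 * partitions  # total random samples

def inside(_):
    # Draw a point in the unit square; count it if it lands in the quarter circle
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(n), partitions).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()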
Results
At the conclusion of this test, Dell Technologies had validated the functionality of the Spark service.
Hive is a data warehouse software project that is built on top of Hadoop to provide data query and analysis. Hive provides a SQL-like interface to query data that is stored in various databases and file systems that integrate with Hadoop.
The following example queries were used for functional validation of the Hive service.
!connect jdbc:hive2://<Namenode>:10000/default
CREATE DATABASE TEST;
CREATE TABLE TEST.Sales_Data(StoreLocation VARCHAR(30), Product VARCHAR(30),
  OrderDate DATE, Revenue DECIMAL(10,2));
INSERT INTO TEST.Sales_Data VALUES
  ('Bangalore','Nutella','2021-07-16',7455.67),
  ('Bangalore','Peanut Butter','2021-07-16',5316.89),
  ('Bangalore','Milk','2021-07-16',2433.76),
  ('Hyderabad','Bananas','2021-07-16',9456.01),
  ('Hyderabad','Nutella','2021-07-16',3644.33),
  ('Hyderabad','Peanut Butter','2021-07-16',8988.64),
  ('Hyderabad','Milk','2021-07-16',1621.58);
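As a read-back check (our addition, not part of the guide), the inserted rows can also be queried programmatically over the same HiveServer2 endpoint, for example with the PyHive client.

# read_back.py - illustrative PyHive sketch (our addition; assumes the
# pyhive package is installed and HiveServer2 listens at <Namenode>:10000).
from pyhive import hive

conn = hive.Connection(host="<Namenode>", port=10000, database="test")
cur = conn.cursor()
# Aggregate revenue per store to confirm the inserted rows are queryable
cur.execute("SELECT StoreLocation, SUM(Revenue) FROM Sales_Data GROUP BY StoreLocation")
for store, total in cur.fetchall():
    print(store, total)
conn.close()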
HBase is a column-oriented, nonrelational database management system that runs on top of HDFS. HBase provides a fault-tolerant way to store sparse datasets, which are common in many big data use cases.
The following example HBase commands were used for functional validation of the HBase service.
cd /usr/localhost/
cd Hbase
./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.23, rf42302b28aceaab773b15f234aa8718fff7eea3c, Mon Jan 31 00:55:22 UTC 2022
hbase(main):001:0>
create 'history', 'home', 'away'
0 row(s) in 1.1300 seconds
=> Hbase::Table - history
put 'history','row1','home:name','jim'
put 'history','row1','home:city','Boston'
scan 'history'
ROW     COLUMN+CELL
 row1   column=home:city, timestamp=1417524216501, value=Boston
 row1   column=home:name, timestamp=1417524185058, value=jim
disable 'history'
drop 'history'
0 row(s) in 0.3060 seconds
exists 'history'
Table history does not exist
0 row(s) in 0.0730 seconds
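Beyond the shell, the same basic operations can be driven programmatically. A sketch with the HappyBase client follows (our addition, not part of the guide; it assumes the HBase Thrift server is running on its default port 9090).

# hbase_check.py - illustrative HappyBase sketch (our addition; assumes
# the HBase Thrift server is reachable at <Thrift host>:9090).
import happybase

conn = happybase.Connection("<Thrift host>")
conn.create_table("history", {"home": dict(), "away": dict()})
table = conn.table("history")
table.put(b"row1", {b"home:name": b"jim", b"home:city": b"Boston"})
for key, data in table.scan():
    print(key, data)
conn.delete_table("history", disable=True)
conn.close()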
The following HBase settings were applied for this test:
Maximum Number of HStoreFiles Compaction: 20
HStore Blocking Store Files: 200
HBase Memstore Block Multiplier: 4
HBase Memstore Flush Size: 256 MiB
create 'staff', 'id', 'name', 'age', 'city', 'department', 'salary'
0 row(s) in 1.1400 seconds
=> Hbase::Table - staff
./bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' \
-Dimporttsv.columns=HBASE_ROW_KEY,<column names> <tablename> \
<location of file from HDFS>
count 'staff', INTERVAL => 1000000
Current count: 1000000, row: 100899997
Current count: 2000000, row: 101799997
. . .
Current count: 999000000, row: 999099999
Current count: 1000000000, row: id
1000000000 row(s)
Took 16498.6309 seconds
=> 1000000000
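The input file for the ImportTsv bulk load is not shown in the guide. A generator along the following lines (our sketch; the column order and value ranges are assumptions, with the first field serving as HBASE_ROW_KEY) produces compatible comma-separated data.

# gen_staff_csv.py - illustrative input generator for the ImportTsv bulk load
# (our addition; column order and value ranges are assumptions).
import csv
import random

CITIES = ["Bangalore", "Hyderabad", "Boston"]
DEPTS = ["sales", "engineering", "support"]

with open("staff.csv", "w", newline="") as f:
    w = csv.writer(f)
    for i in range(1, 1001):  # scale the row count up for a full-size test
        w.writerow([
            100000000 + i,                  # row key (HBASE_ROW_KEY)
            "emp%d" % i,                    # name
            random.randint(21, 65),         # age
            random.choice(CITIES),          # city
            random.choice(DEPTS),           # department
            random.randint(30000, 200000),  # salary
        ])

# Copy to HDFS before running ImportTsv, for example:
#   hdfs dfs -put staff.csv /user/root/staff.csv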