A suite of three industry-standard workload tools was used to validate the configurations.
The TestDFSIO workload tool, a utility included with CDP, is a read and write benchmark for HDFS. TestDFSIO was run with a large number of files so that many execution threads ran concurrently. This benchmark acts as a "fire hose" test for the environment and demonstrates that an optimal network architecture is in place.
This test is designed to test the aggregate bandwidth between CDP clients and PowerScale nodes.
Dell Technologies ran the read and write commands for each test case. The sum of the read and write execution times determined the ranking; lower is better. Example write and read commands include:
yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-jobclient-3.1.1.7.1.6.0-297-tests.jar TestDFSIO -write -nrFiles 1000 -size 1GB
yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-jobclient-3.1.1.7.1.6.0-297-tests.jar TestDFSIO -read -nrFiles 1000 -size 1GB
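The per-run execution time that feeds this ranking appears in the TestDFSIO job summary as a "Test exec time sec" line. The following is a minimal sketch of extracting and summing those values; the log file names and timing values are hypothetical excerpts, not measured results.

```shell
#!/usr/bin/env sh
# Illustrative TestDFSIO summary excerpts; real runs append these lines
# to the job output. The timing values below are hypothetical.
cat > dfsio-write.log <<'EOF'
----- TestDFSIO ----- : write
    Test exec time sec: 412.3
EOF
cat > dfsio-read.log <<'EOF'
----- TestDFSIO ----- : read
    Test exec time sec: 305.7
EOF

# Extract each "Test exec time sec" value and sum them -- the lower the
# total, the better the configuration ranks.
total=$(awk -F: '/Test exec time sec/ {sum += $2} END {print sum}' \
        dfsio-write.log dfsio-read.log)
echo "Total read+write exec time: ${total}s"
```

The same extraction works against the full job logs, since only the timing lines match the pattern.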
Figure 5. Aggregate network bandwidth between compute nodes and PowerScale
The Terasuite workload tool exercises both the HDFS and MapReduce layers of a Hadoop cluster. The goal is to generate, sort, and validate 1 TB of data (or another chosen amount) as fast as possible.
This test is designed to exercise the compute and local storage configurations with concurrent HDFS access.
Dell Technologies ran the Terasuite commands for each of the test cases. The sum of the Teragen, Terasort, and Teravalidate execution times determined the ranking; lower is better.
Example commands for Teragen, Terasort, and Teravalidate include:
time yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen -Dmapred.map.tasks=320 10000000000 /benchmarks/tera32-32-run3
time yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /benchmarks/tera32-32-run3 /benchmarks/terasort32-32-run3
time yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teravalidate /benchmarks/terasort32-32-run3 /benchmarks/teraval32-32-r
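One simple way to capture the per-phase and total wall-clock times behind this ranking is a small timing wrapper. The following is a minimal sketch; `run_phase` is a hypothetical helper, and the `sleep` placeholders stand in for the yarn commands shown above.

```shell
#!/usr/bin/env sh
# Sketch: time each Terasuite phase and report the per-phase and total
# wall-clock times used for ranking (lower is better).
total=0
run_phase() {
  label=$1; shift
  start=$(date +%s)
  "$@"                              # e.g. yarn jar ... teragen ...
  elapsed=$(( $(date +%s) - start ))
  total=$(( total + elapsed ))
  echo "${label}: ${elapsed}s"
}

# Placeholders stand in for the actual Teragen, Terasort, and
# Teravalidate commands.
run_phase teragen      sleep 1
run_phase terasort     sleep 1
run_phase teravalidate sleep 1
echo "Total: ${total}s"
```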
Figure 6. The Terasort job and the aggregate network bandwidth
Figure 7. Host information during the Terasort job
YCSB was used to characterize the performance of the HBase NoSQL database. The simulated YCSB workloads used were workload A (update heavy), workload B (read mostly), and workload D (read latest).
Figure 8. YCSB testing and the HBASE metrics from Cloudera Manager
One billion records were loaded, and then workloads A, B, and D were run. The YCSB results were ranked by the average of the overall operations per second across the workloads; a higher rate of operations is better.
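The ranking metric can be computed from the `[OVERALL], Throughput(ops/sec), …` line that YCSB prints at the end of each run. The following is a minimal sketch; the log file names and throughput values are hypothetical, not measured results.

```shell
#!/usr/bin/env sh
# Illustrative "[OVERALL], Throughput(ops/sec), N" lines as emitted by
# YCSB at the end of each run; the values below are hypothetical.
cat > ycsb-a.log <<'EOF'
[OVERALL], Throughput(ops/sec), 52000.0
EOF
cat > ycsb-b.log <<'EOF'
[OVERALL], Throughput(ops/sec), 61000.0
EOF
cat > ycsb-d.log <<'EOF'
[OVERALL], Throughput(ops/sec), 58000.0
EOF

# Average the overall ops/sec across workloads A, B, and D -- the higher
# the average, the better the configuration ranks.
avg=$(awk -F', ' '/\[OVERALL\], Throughput/ {sum += $3; n++}
                  END {print sum / n}' ycsb-a.log ycsb-b.log ycsb-d.log)
echo "Average throughput: ${avg} ops/sec"
```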
Three different scenarios were evaluated, based on the PowerFlex compute configuration and using five virtualized CDP Worker Nodes.
Table 8. Test scenarios
| Test scenario | Physical cores | Worker VM cores | YARN container cores |
|---|---|---|---|
| Scenario 1 | 16 | 16 | 16 |
| Scenario 2 | 16 | 16 | 32 |
| Scenario 3 | 16 | 32 | 32 |
Note: YARN container cores were changed by modifying the YARN configuration parameters yarn.nodemanager.resource.cpu-vcores and yarn.scheduler.maximum-allocation-vcores.
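These settings are rendered into yarn-site.xml on each worker node. The following is a minimal sketch of the rendered fragment and a quick check of its values; the local fragment file is written here purely for illustration, and on a live node the same grep would target /etc/hadoop/conf/yarn-site.xml.

```shell
#!/usr/bin/env sh
# Sketch of the two yarn-site.xml properties behind the YARN settings;
# the fragment is written locally only to illustrate the rendered form.
cat > yarn-site-fragment.xml <<'EOF'
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>32</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>32</value>
</property>
EOF

# On a live worker, run the same check against /etc/hadoop/conf/yarn-site.xml.
grep -A1 '<name>yarn' yarn-site-fragment.xml
```

The value 32 corresponds to Scenarios 2 and 3; Scenario 1 would use 16 for both properties.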
Scenario 1 used an exact 1:1 mapping of worker VM cores and YARN container cores to the available physical cores. Scenario 2 used a 2:1 mapping between YARN container cores and physical cores, allowing YARN to manage task parallelism across hyperthreads. Scenario 3 used a 2:1 mapping of both VM cores and YARN container cores to physical cores, allowing the VMware hypervisor to schedule across hyperthreads.
The DFSIO, Terasuite, and YCSB tests were run for each scenario, and the results were evaluated.