Amazon EMR on PowerScale as default HDFS file system
In this solution, PowerScale is the default HDFS file system for the Amazon EMR big data cluster, and all data is placed on PowerScale. Only temporary and scratch space is retained on the master and core worker nodes of the EMR cluster.
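The full integration procedure is outside the scope of this overview, but as a minimal sketch the default file system can be pointed at the PowerScale cluster through the core-site configuration classification when the EMR cluster is created, using the powerscale_fqdn placeholder from the rest of this document:

[{"Classification": "core-site", "Properties": {"fs.defaultFS": "hdfs://powerscale_fqdn:8020"}}]

The setting can be verified from any cluster node with hdfs getconf -confKey fs.defaultFS, which should return hdfs://powerscale_fqdn:8020.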
The following table outlines the procedures for validating the HDFS, YARN, MapReduce, and Spark services of Amazon EMR on the PowerScale HDFS storage cluster; illustrative commands for these steps follow the table.
Note: For a list of specific test steps and results, see Appendix A: Test steps and output.
Test case name | Step | Description |
MapReduce | 1 | Create an HDFS home directory on the Amazon EMR and PowerScale clusters. |
 | 2 | Put the local file /etc/redhat-release on PowerScale HDFS as the input dataset. |
 | 3 | Run a MapReduce WordCount job with input and output on PowerScale HDFS. |
Spark (line count and word count) | 1 | Put the local file /etc/passwd on PowerScale HDFS as the input dataset. |
 | 2 | Run a Spark LineCount job with input and output on PowerScale HDFS. |
 | 3 | Run a Spark WordCount job with input and output on PowerScale HDFS. |
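The commands below are an illustrative sketch of these steps rather than the captured test output; the example jar locations under /usr/lib/hadoop-mapreduce and /usr/lib/spark/examples/jars and the wordcount-out output directory are assumptions that may differ on a given EMR release.

sudo -u hdfs hdfs dfs -mkdir -p /user/hdfs
sudo -u hdfs hdfs dfs -put /etc/redhat-release /user/hdfs/
sudo -u hdfs hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/hdfs/redhat-release /user/hdfs/wordcount-out
sudo -u hdfs hdfs dfs -put /etc/passwd /user/hdfs/
sudo -u hdfs spark-submit --class org.apache.spark.examples.JavaWordCount /usr/lib/spark/examples/jars/spark-examples.jar /user/hdfs/passwd

A simple line count can also be run from spark-shell with sc.textFile("/user/hdfs/passwd").count().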
For the detailed test cases and their results, see Validation results.
This section includes steps for validating the Hive service. The Hive session below creates a database and a table on the PowerScale HDFS cluster and loads data into the table.
Note: For a list of specific test steps and output, see Appendix A: Test steps and output.
CREATE database remote_DB COMMENT 'Holds all the tables data in PowerScale hdfs cluster' LOCATION 'hdfs://powerscale_fqdn:8020/user/hive/remote_DB'
OK
Time taken: 0.045 seconds
USE remote_DB
OK
Time taken: 0.036 seconds
CREATE TABLE passwd_int_nonpart (user_name STRING, password STRING, user_id STRING, group_id STRING, user_id_info STRING, home_dir STRING, shell STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ':'
OK
Time taken: 0.211 seconds
LOAD data local inpath '/etc/passwd' into TABLE passwd_int_nonpart
Loading data to table remote_db.passwd_int_nonpart
Table remote_db.passwd_int_nonpart stats: [numFiles=1, numRows=0, totalSize=1808, rawDataSize=0]
OK
Time taken: 0.261 seconds
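As an additional check, which is not part of the captured output above, a simple query against the loaded table confirms that Hive can read the data back from PowerScale HDFS:

SELECT COUNT(*) FROM passwd_int_nonpart;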
For the detailed test cases and their results, see Validation results.
This section includes steps for validating DistCp with the PowerScale HDFS cluster.
Note: For a list of specific test steps and output, see Appendix A: Test steps and output.
Run DistCp to copy a sample file between the local DAS HDFS and the remote PowerScale HDFS cluster. In the example shown, /tmp/redhat-release is copied to the PowerScale HDFS, and the listing verifies the file at the destination.
sudo -u hdfs hadoop distcp -pc /tmp/redhat-release hdfs://powerscale_fqdn:8020/user/hdfs/
sudo -u hdfs hdfs dfs -ls -R hdfs://powerscale_fqdn:8020/user/hdfs/redhat-release
-rw-r--r-- 3 hdfs hdfs 27 2020-03-07 13:26 hdfs://powerscale_fqdn:8020/user/hdfs/redhat-release
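A copy in the opposite direction follows the same pattern. The das_namenode_fqdn host name below is a hypothetical placeholder for the local DAS HDFS NameNode and is not part of the captured test output:

sudo -u hdfs hadoop distcp hdfs://powerscale_fqdn:8020/user/hdfs/redhat-release hdfs://das_namenode_fqdn:8020/tmp/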
For the detailed test cases and their results, see Validation results.
This section describes detailed results of the validation procedures as PASS or FAIL.
Category | Test case name | Expected test results | Status |
Amazon EMR setup | PowerScale as default HDFS file system | Deployment is completed successfully | PASS |
 | Usability and functionality test of UI | Service check runs successfully | PASS |
MapReduce | Word count | Command runs successfully | PASS |
Spark | Line count and word count | Command runs successfully | PASS |
Hive and Hive on Tez | DDL operations | Command runs successfully | PASS |
 | DML operations | Command runs successfully | PASS |
 | JOIN tables in db1 database | Command runs successfully | PASS |
 | JOIN tables in db2 database | Command runs successfully | PASS |
 | JOIN tables between db1 and db2 | Command runs successfully | PASS |
 | Local temporary table from db1 table | Command runs successfully | PASS |
 | IMPORT and EXPORT operations | Command runs successfully | PASS |
 | Table-level and column-level statistics | Command runs successfully | PASS |
DistCp | DistCp and backup script | Command runs successfully | PASS |