In this solution, PowerScale provides a parallel HDFS file system for the Amazon EMR big data cluster; data can be placed on and accessed from PowerScale. The Amazon EMR local HDFS file system remains the default HDFS file system and hosts hot data, along with temporary and scratch space, for the big data cluster.
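In this arrangement, both tiers are reachable from any EMR node through the standard `hdfs` CLI: the default (local) HDFS by plain paths, and the PowerScale tier by its full `hdfs://` URI. A minimal sketch follows; the PowerScale namenode FQDN and port are placeholders, so adjust them for your environment.

```shell
# Placeholder PowerScale HDFS endpoint -- substitute your SmartConnect FQDN.
PSCALE_NN="hdfs://powerscale_fqdn:8020"

if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -ls /                # default tier: local EMR HDFS
    hdfs dfs -ls "$PSCALE_NN/"    # PowerScale tier, addressed by full URI
else
    echo "hdfs CLI not found; commands shown for reference only"
fi
```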
The following table outlines the procedures for validating the HDFS, YARN, MapReduce, and Spark services of Amazon EMR, with EMRFS as the default file system and PowerScale HDFS as the Hadoop tiered storage.
Note: For a list of specific test steps and results, see Appendix A: Test steps and output.
| Test case name | Step | Description |
|---|---|---|
| MapReduce | 1 | Create a hdp-user1 home directory on the Amazon EMR (local HDFS) and PowerScale clusters |
| | 3 | Put the local file /etc/redhat-release on the Amazon EMR (local HDFS) file system |
| | 4 | Put the local file /etc/redhat-release on the PowerScale HDFS |
| | 5 | Run a MapReduce WordCount job on input from the Amazon EMR (local HDFS), with output to the PowerScale HDFS |
| | 6 | Run a MapReduce WordCount job on input from the PowerScale HDFS, with output to the Amazon EMR (local HDFS) |
| Spark (line count and word count) | 1 | Put the local file /etc/passwd on the Amazon EMR (local HDFS) file system |
| | 2 | Put the local file /etc/passwd on the PowerScale HDFS |
| | 3 | Run a Spark LineCount/WordCount job on input from the primary Amazon EMR (local HDFS), with output to the secondary PowerScale HDFS |
| | 4 | Run a Spark LineCount/WordCount job on input from the secondary PowerScale HDFS, with output to the primary Amazon EMR (local HDFS) |
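The MapReduce steps above can be sketched with the stock Hadoop examples jar. The PowerScale namenode address and the jar path below are assumptions, not values from this validation, so substitute your own.

```shell
# Placeholders -- adjust for your cluster.
PSCALE_NN="hdfs://powerscale_fqdn:8020"
EXAMPLES_JAR="/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"

if command -v hadoop >/dev/null 2>&1; then
    # Stage the sample file on both file systems.
    hdfs dfs -put /etc/redhat-release /user/hdp-user1/
    hdfs dfs -put /etc/redhat-release "$PSCALE_NN/user/hdp-user1/"

    # Input from local EMR HDFS, output to PowerScale HDFS.
    hadoop jar "$EXAMPLES_JAR" wordcount \
        /user/hdp-user1/redhat-release "$PSCALE_NN/user/hdp-user1/wc-out1"

    # Input from PowerScale HDFS, output to local EMR HDFS.
    hadoop jar "$EXAMPLES_JAR" wordcount \
        "$PSCALE_NN/user/hdp-user1/redhat-release" /user/hdp-user1/wc-out2
else
    echo "hadoop CLI not found; commands shown for reference only"
fi
```

The same input/output cross-tier pattern applies to the Spark LineCount/WordCount cases: only the job submission command changes, not the HDFS URIs.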
For the detailed test cases and their results, see Validation results.
This section includes steps for validating the Hive service.
Note: For a list of specific test steps and output, see Appendix A: Test steps and output.
CREATE DATABASE remote_DB COMMENT 'Holds all the tables data in remote location Hadoop cluster' LOCATION 'hdfs://powerscale_fqdn:8020/user/hive/remote_DB';
OK
Time taken: 0.045 seconds
USE remote_DB;
OK
Time taken: 0.036 seconds
CREATE TABLE passwd_int_nonpart (user_name STRING, password STRING, user_id STRING, group_id STRING, user_id_info STRING, home_dir STRING, shell STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';
OK
Time taken: 0.211 seconds
LOAD DATA LOCAL INPATH '/etc/passwd' INTO TABLE passwd_int_nonpart;
Loading data to table remote_db.passwd_int_nonpart
Table remote_db.passwd_int_nonpart stats: [numFiles=1, numRows=0, totalSize=1808, rawDataSize=0]
OK
Time taken: 0.261 seconds
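A quick way to confirm the load landed in the PowerScale-backed database is to query the table back. This is a hypothetical verification step, not part of the recorded test output; it assumes the hive CLI is on the PATH.

```shell
# Hypothetical check: count rows loaded into the remote_DB table.
QUERY="SELECT COUNT(*) FROM remote_db.passwd_int_nonpart;"

if command -v hive >/dev/null 2>&1; then
    hive -e "$QUERY"
else
    echo "hive CLI not found; query shown for reference only"
fi
```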
For the detailed test cases and their results, see Validation results.
This section includes steps for validating DistCp between the Amazon EMR HDFS and PowerScale clusters.
Note: For a list of specific test steps and output, see Appendix A: Test steps and output.
Run DistCp to copy a sample file from the remote PowerScale HDFS to the local DAS HDFS:
sudo -u hdfs hadoop distcp -pc hdfs://powerscale_fqdn:8020/tmp/redhat-release /tmp/
sudo -u hdfs hdfs dfs -ls -R /tmp/redhat-release
-rw-r--r-- 3 hdfs hdfs 27 2020-03-07 13:26 /tmp/redhat-release
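The copy can also run in the opposite direction, pushing the sample file from the local EMR (DAS) HDFS back to PowerScale. This reverse-direction sketch reuses the placeholder namenode address from above and is not part of the recorded output.

```shell
# Reverse copy: local EMR HDFS -> PowerScale HDFS (placeholder endpoint).
SRC="/tmp/redhat-release"
DST="hdfs://powerscale_fqdn:8020/tmp/"

if command -v hadoop >/dev/null 2>&1; then
    # -pc preserves checksum type across the copy, as in the forward run.
    sudo -u hdfs hadoop distcp -pc "$SRC" "$DST"
else
    echo "hadoop CLI not found; command shown for reference only"
fi
```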
For the detailed test cases and their results, see Validation results.
This section describes detailed results of the validation procedures as PASS or FAIL.
| Category | Test case name | Expected test results | Status |
|---|---|---|---|
| Amazon EMR setup | PowerScale as default HDFS file system | Deployment is completed successfully | PASS |
| | Usability and functionality test of UI | Service check runs successfully | PASS |
| MapReduce | Word count | Command runs successfully | PASS |
| Spark | Line count and word count | Command runs successfully | PASS |
| Hive and Hive on Tez | DDL operations | Command runs successfully | PASS |
| | DML operations | Command runs successfully | PASS |
| | JOIN tables in local database | Command runs successfully | PASS |
| | JOIN tables in remote database | Command runs successfully | PASS |
| | JOIN tables between local_db and remote_db | Command runs successfully | PASS |
| | Local temporary table from remote_db table | Command runs successfully | PASS |
| | IMPORT and EXPORT operations | Command runs successfully | PASS |
| | Table-level and column-level statistics | Command runs successfully | PASS |
| DistCp | DistCp and backup script | Command runs successfully | PASS |