In this solution, PowerScale provides a parallel HDFS file system for the Amazon EMR big data cluster; data can be placed on and accessed from PowerScale. The Amazon EMR local HDFS file system remains the default HDFS file system and hosts hot data, along with temporary and scratch space, for the big data cluster.
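In this arrangement, both tiers are reachable from any EMR node through the standard `hdfs` CLI: the default (local) HDFS by plain paths, and the PowerScale tier by its full `hdfs://` URI. A minimal sketch follows; the PowerScale namenode FQDN and port are placeholders, so adjust them for your environment.

```shell
# Placeholder PowerScale HDFS endpoint -- substitute your SmartConnect FQDN.
PSCALE_NN="hdfs://powerscale_fqdn:8020"

if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -ls /                # default tier: local EMR HDFS
    hdfs dfs -ls "$PSCALE_NN/"    # PowerScale tier, addressed by full URI
else
    echo "hdfs CLI not found; commands shown for reference only"
fi
```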
The following table outlines the procedures for validating the HDFS, YARN, MapReduce, and Spark services of Amazon EMR, with EMRFS as the default file system and PowerScale HDFS as the Hadoop tiered storage.
Note: For a list of specific test steps and results, see Appendix A: Test steps and output.
| Test case name | Step | Description |
|---|---|---|
| MapReduce | 1 | Create a hdp-user1 home directory on the Amazon EMR (local HDFS) and PowerScale clusters |
| | 3 | Put the local file /etc/redhat-release on the Amazon EMR (local HDFS) file system |
| | 4 | Put the local file /etc/redhat-release on the PowerScale HDFS |
| | 5 | Run a MapReduce WordCount job on input from the Amazon EMR (local HDFS), with output to the PowerScale HDFS |
| | 6 | Run a MapReduce WordCount job on input from the PowerScale HDFS, with output to the Amazon EMR (local HDFS) |
| Spark (line count and word count) | 1 | Put the local file /etc/passwd on the Amazon EMR (local HDFS) file system |
| | 2 | Put the local file /etc/passwd on the PowerScale HDFS |
| | 3 | Run a Spark LineCount/WordCount job on input from the primary Amazon EMR (local HDFS), with output to the secondary PowerScale HDFS |
| | 4 | Run a Spark LineCount/WordCount job on input from the secondary PowerScale HDFS, with output to the primary Amazon EMR (local HDFS) |
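The MapReduce steps above can be sketched with the stock Hadoop examples jar. The PowerScale namenode address and the jar path below are assumptions, not values from this validation, so substitute your own.

```shell
# Placeholders -- adjust for your cluster.
PSCALE_NN="hdfs://powerscale_fqdn:8020"
EXAMPLES_JAR="/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"

if command -v hadoop >/dev/null 2>&1; then
    # Stage the sample file on both file systems.
    hdfs dfs -put /etc/redhat-release /user/hdp-user1/
    hdfs dfs -put /etc/redhat-release "$PSCALE_NN/user/hdp-user1/"

    # Input from local EMR HDFS, output to PowerScale HDFS.
    hadoop jar "$EXAMPLES_JAR" wordcount \
        /user/hdp-user1/redhat-release "$PSCALE_NN/user/hdp-user1/wc-out1"

    # Input from PowerScale HDFS, output to local EMR HDFS.
    hadoop jar "$EXAMPLES_JAR" wordcount \
        "$PSCALE_NN/user/hdp-user1/redhat-release" /user/hdp-user1/wc-out2
else
    echo "hadoop CLI not found; commands shown for reference only"
fi
```

The same input/output cross-tier pattern applies to the Spark LineCount/WordCount cases: only the job submission command changes, not the HDFS URIs.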
For the detailed test cases and their results, see Validation results.
This section includes steps for validating the Hive service.
Note: For a list of specific test steps and output, see Appendix A: Test steps and output.
CREATE DATABASE remote_DB COMMENT 'Holds all the tables data in remote location Hadoop cluster' LOCATION 'hdfs://powerscale_fqdn:8020/user/hive/remote_DB';
OK
Time taken: 0.045 seconds
USE remote_DB;
OK
Time taken: 0.036 seconds
CREATE TABLE passwd_int_nonpart (user_name STRING, password STRING, user_id STRING, group_id STRING, user_id_info STRING, home_dir STRING, shell STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';
OK
Time taken: 0.211 seconds
LOAD DATA LOCAL INPATH '/etc/passwd' INTO TABLE passwd_int_nonpart;
Loading data to table remote_db.passwd_int_nonpart
Table remote_db.passwd_int_nonpart stats: [numFiles=1, numRows=0, totalSize=1808, rawDataSize=0]
OK
Time taken: 0.261 seconds
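A quick way to confirm the load landed in the PowerScale-backed database is to query the table back. This is a hypothetical verification step, not part of the recorded test output; it assumes the hive CLI is on the PATH.

```shell
# Hypothetical check: count rows loaded into the remote_DB table.
QUERY="SELECT COUNT(*) FROM remote_db.passwd_int_nonpart;"

if command -v hive >/dev/null 2>&1; then
    hive -e "$QUERY"
else
    echo "hive CLI not found; query shown for reference only"
fi
```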
For the detailed test cases and their results, see Validation results.
This section includes steps for validating DistCp between the Amazon EMR HDFS and PowerScale clusters.
Note: For a list of specific test steps and output, see Appendix A: Test steps and output.
Run DistCp to copy a sample file from the remote PowerScale HDFS to the local DAS HDFS:
sudo -u hdfs hadoop distcp -pc hdfs://powerscale_fqdn:8020/tmp/redhat-release /tmp/
sudo -u hdfs hdfs dfs -ls -R /tmp/redhat-release
-rw-r--r-- 3 hdfs hdfs 27 2020-03-07 13:26 /tmp/redhat-release
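The copy can also run in the opposite direction, pushing the sample file from the local EMR (DAS) HDFS back to PowerScale. This reverse-direction sketch reuses the placeholder namenode address from above and is not part of the recorded output.

```shell
# Reverse copy: local EMR HDFS -> PowerScale HDFS (placeholder endpoint).
SRC="/tmp/redhat-release"
DST="hdfs://powerscale_fqdn:8020/tmp/"

if command -v hadoop >/dev/null 2>&1; then
    # -pc preserves checksum type across the copy, as in the forward run.
    sudo -u hdfs hadoop distcp -pc "$SRC" "$DST"
else
    echo "hadoop CLI not found; command shown for reference only"
fi
```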
For the detailed test cases and their results, see Validation results.
This section describes detailed results of the validation procedures as PASS or FAIL.
| Category | Test case name | Expected test results | Status |
|---|---|---|---|
| Amazon EMR setup | PowerScale as default HDFS file system | Deployment is completed successfully | PASS |
| | Usability and functionality test of UI | Service check runs successfully | PASS |
| MapReduce | Word count | Command runs successfully | PASS |
| Spark | Line count and word count | Command runs successfully | PASS |
| Hive and Hive on Tez | DDL operations | Command runs successfully | PASS |
| | DML operations | Command runs successfully | PASS |
| | JOIN tables in local database | Command runs successfully | PASS |
| | JOIN tables in remote database | Command runs successfully | PASS |
| | JOIN tables between local_db and remote_db | Command runs successfully | PASS |
| | Local temporary table from remote_db table | Command runs successfully | PASS |
| | IMPORT and EXPORT operations | Command runs successfully | PASS |
| | Table-level and column-level statistics | Command runs successfully | PASS |
| DistCp | DistCp and backup script | Command runs successfully | PASS |