Amazon EMR on PowerScale as default HDFS file system
In this solution, PowerScale is the default HDFS file system for the Amazon EMR big data cluster, and all data is placed on PowerScale. Only temporary and scratch space is retained on the master and core worker nodes of the EMR cluster.
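The full integration procedure is outside the scope of this overview, but as a minimal sketch the default file system can be pointed at the PowerScale cluster through the core-site configuration classification when the EMR cluster is created, using the powerscale_fqdn placeholder from the rest of this document:

[{"Classification": "core-site", "Properties": {"fs.defaultFS": "hdfs://powerscale_fqdn:8020"}}]

The setting can be verified from any cluster node with hdfs getconf -confKey fs.defaultFS, which should return hdfs://powerscale_fqdn:8020.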
The following table outlines the procedures for validating the HDFS, YARN, MapReduce, and Spark services of Amazon EMR on the PowerScale HDFS storage cluster; illustrative commands for these steps follow the table.
Note: For a list of specific test steps and results, see Appendix A: Test steps and output.
Test case name | Step | Description |
MapReduce | 1 | Create an HDFS home directory on the Amazon EMR and PowerScale clusters. |
 | 2 | Put the local file /etc/redhat-release on PowerScale HDFS as the input dataset. |
 | 3 | Run a MapReduce WordCount job with input and output on PowerScale HDFS. |
Spark (line count and word count) | 1 | Put the local file /etc/passwd on PowerScale HDFS as the input dataset. |
 | 2 | Run a Spark LineCount job with input and output on PowerScale HDFS. |
 | 3 | Run a Spark WordCount job with input and output on PowerScale HDFS. |
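The commands below are an illustrative sketch of these steps rather than the captured test output; the example jar locations under /usr/lib/hadoop-mapreduce and /usr/lib/spark/examples/jars and the wordcount-out output directory are assumptions that may differ on a given EMR release.

sudo -u hdfs hdfs dfs -mkdir -p /user/hdfs
sudo -u hdfs hdfs dfs -put /etc/redhat-release /user/hdfs/
sudo -u hdfs hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/hdfs/redhat-release /user/hdfs/wordcount-out
sudo -u hdfs hdfs dfs -put /etc/passwd /user/hdfs/
sudo -u hdfs spark-submit --class org.apache.spark.examples.JavaWordCount /usr/lib/spark/examples/jars/spark-examples.jar /user/hdfs/passwd

A simple line count can also be run from spark-shell with sc.textFile("/user/hdfs/passwd").count().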
For the detailed test cases and their results, see Validation results.
This section includes steps for validating the Hive service. The Hive session below creates a database and a table on the PowerScale HDFS cluster and loads data into the table.
Note: For a list of specific test steps and output, see Appendix A: Test steps and output.
CREATE database remote_DB COMMENT 'Holds all the tables data in PowerScale hdfs cluster' LOCATION 'hdfs://powerscale_fqdn:8020/user/hive/remote_DB'
OK
Time taken: 0.045 seconds
USE remote_DB
OK
Time taken: 0.036 seconds
CREATE TABLE passwd_int_nonpart (user_name STRING, password STRING, user_id STRING, group_id STRING, user_id_info STRING, home_dir STRING, shell STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ':'
OK
Time taken: 0.211 seconds
LOAD data local inpath '/etc/passwd' into TABLE passwd_int_nonpart
Loading data to table remote_db.passwd_int_nonpart
Table remote_db.passwd_int_nonpart stats: [numFiles=1, numRows=0, totalSize=1808, rawDataSize=0]
OK
Time taken: 0.261 seconds
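As an additional check, which is not part of the captured output above, a simple query against the loaded table confirms that Hive can read the data back from PowerScale HDFS:

SELECT COUNT(*) FROM passwd_int_nonpart;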
For the detailed test cases and their results, see Validation results.
This section includes steps for validating DistCp with the PowerScale HDFS cluster.
Note: For a list of specific test steps and output, see Appendix A: Test steps and output.
Run DistCp to copy a sample file between the local DAS HDFS and the remote PowerScale HDFS cluster. In the example shown, /tmp/redhat-release is copied to the PowerScale HDFS, and the listing verifies the file at the destination.
sudo -u hdfs hadoop distcp -pc /tmp/redhat-release hdfs://powerscale_fqdn:8020/user/hdfs/
sudo -u hdfs hdfs dfs -ls -R hdfs://powerscale_fqdn:8020/user/hdfs/redhat-release
-rw-r--r-- 3 hdfs hdfs 27 2020-03-07 13:26 hdfs://powerscale_fqdn:8020/user/hdfs/redhat-release
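A copy in the opposite direction follows the same pattern. The das_namenode_fqdn host name below is a hypothetical placeholder for the local DAS HDFS NameNode and is not part of the captured test output:

sudo -u hdfs hadoop distcp hdfs://powerscale_fqdn:8020/user/hdfs/redhat-release hdfs://das_namenode_fqdn:8020/tmp/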
For the detailed test cases and their results, see Validation results.
This section describes detailed results of the validation procedures as PASS or FAIL.
Category | Test case name | Expected test results | Status |
Amazon EMR setup | PowerScale as default HDFS file system | Deployment is completed successfully | PASS |
 | Usability and functionality test of UI | Service check runs successfully | PASS |
MapReduce | Word count | Command runs successfully | PASS |
Spark | Line count and word count | Command runs successfully | PASS |
Hive and Hive on Tez | DDL operations | Command runs successfully | PASS |
 | DML operations | Command runs successfully | PASS |
 | JOIN tables in db1 database | Command runs successfully | PASS |
 | JOIN tables in db2 database | Command runs successfully | PASS |
 | JOIN tables between db1 and db2 | Command runs successfully | PASS |
 | Local temporary table from db1 table | Command runs successfully | PASS |
 | IMPORT and EXPORT operations | Command runs successfully | PASS |
 | Table-level and column-level statistics | Command runs successfully | PASS |
DistCp | DistCp and backup script | Command runs successfully | PASS |