The main objective of this solution is to deploy an Amazon EMR big data cluster on PowerScale, where PowerScale supplies the HDFS storage for the EMR compute cluster.
Before you launch an Amazon EMR cluster, ensure you complete the tasks in Setting Up Amazon EMR.
In this step, plan for and launch a simple Amazon EMR cluster with selected services such as Apache Spark, Hive, and HBase installed. The setup process includes creating a PowerScale HDFS storage area to store a sample script (such as a PySpark script), an input dataset, and the cluster output.
A PowerScale HDFS access zone is used here to store job scripts such as a PySpark script, input data, and output data.
This step uploads a sample script or job, such as a PySpark script, to the PowerScale HDFS location. This is the most common way to prepare an application for Amazon EMR. EMR lets you specify the PowerScale HDFS location of the script when you submit work to your cluster. You can also upload sample input data to PowerScale HDFS for the script to process.
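As a sketch, one way to perform this upload programmatically is through the standard Hadoop WebHDFS REST API, which PowerScale HDFS supports. The hostname, port, user, and paths below are placeholders, not values from this solution; verify the WebHDFS port configured on your OneFS cluster.

```python
# Sketch: building the WebHDFS URL for uploading a PySpark script to
# PowerScale HDFS. All names here (host, port 8082, paths, user) are
# assumed placeholders -- adjust them for your environment.
from urllib.parse import urlencode

def webhdfs_create_url(host: str, port: int, hdfs_path: str, user: str) -> str:
    """Build the WebHDFS CREATE URL for writing a file to HDFS."""
    query = urlencode({"op": "CREATE", "overwrite": "true", "user.name": user})
    return f"http://{host}:{port}/webhdfs/v1{hdfs_path}?{query}"

url = webhdfs_create_url("powerscale_fqdn", 8082,
                         "/emr/scripts/health_violations.py", "hdfs")
# An HTTP PUT to this URL uploads the file; per the WebHDFS protocol,
# the namenode first answers with a 307 redirect to the write location.
```

Equivalently, from any node with a Hadoop client you can run `hdfs dfs -put` against the PowerScale HDFS root.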
Now that you have completed the prework, you can launch a sample cluster with services such as Apache Spark installed, using the latest Amazon EMR release.
Launch a cluster with Spark installed using Quick Options:
Note: Choose the applications that you want to run on your Amazon EMR cluster before you launch the cluster. You cannot add or remove applications from a cluster after it has been launched.
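A minimal launch request can be sketched with boto3 (the AWS SDK for Python). The request below only builds the payload; the release label, key pair, instance types, and counts are illustrative placeholders, not values prescribed by this solution.

```python
# Sketch: a minimal Amazon EMR launch request with Spark installed.
# The values below (release label, key pair, instance types) are
# assumed placeholders -- substitute your own.
def build_cluster_request(release: str = "emr-6.3.0") -> dict:
    """Build a minimal run_job_flow request body for boto3."""
    return {
        "Name": "EMR-on-PowerScale",
        "ReleaseLabel": release,
        "Applications": [{"Name": "Spark"}],  # cannot be changed after launch
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
            ],
            "Ec2KeyName": "my-key-pair",            # placeholder
            "KeepJobFlowAliveWhenNoSteps": True,    # keep cluster Waiting
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# With credentials configured, the launch itself is:
#   import boto3
#   emr = boto3.client("emr")
#   response = emr.run_job_flow(**build_cluster_request())
```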
Note: The following step is essential to make this solution functional. Here we change the default file system of Amazon EMR to the PowerScale HDFS file system and use it for services such as YARN, Spark, and Hive.
This is important because it replaces the default HDFS file system with PowerScale HDFS. To make this change, you can supply a configuration object. You can either use a shorthand syntax to provide the configuration, or reference the configuration object in a JSON file. Configuration objects consist of a classification, properties, and optional nested configurations. Properties correspond to the application settings you want to change. You can specify multiple classifications for multiple applications in a single JSON object.
The following is an example JSON file for a list of configurations.
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.defaultFS": "hdfs://powerscale_fqdn:8020"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "fs.defaultFS": "hdfs://powerscale_fqdn:8020"
    }
  }
]
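This JSON list maps directly to the Configurations parameter of the EMR API. As a sketch, it can be loaded and passed through as-is; powerscale_fqdn stands in for your PowerScale SmartConnect zone name.

```python
# Sketch: the configuration list from the JSON file above, parsed so it
# can be passed to boto3. powerscale_fqdn is a placeholder.
import json

configurations = json.loads("""
[
  {"Classification": "core-site",
   "Properties": {"fs.defaultFS": "hdfs://powerscale_fqdn:8020"}},
  {"Classification": "hive-site",
   "Properties": {"fs.defaultFS": "hdfs://powerscale_fqdn:8020"}}
]
""")
# Pass this list as Configurations=configurations in the boto3
# run_job_flow call, or save it to a file and reference it with
# `aws emr create-cluster --configurations file://./configurations.json`.
```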
When the status progresses to Waiting, your cluster is running and ready to accept work.
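The wait for the Waiting state can be sketched as a simple polling loop. The describe callable below stands in for boto3's emr.describe_cluster; with a real client you could instead use emr.get_waiter("cluster_running"), which returns once the cluster is running or waiting.

```python
# Sketch: poll cluster status until it reaches WAITING (ready for work).
# `describe` is any callable with the shape of boto3's
# emr.describe_cluster; intervals and limits are illustrative.
import time

def wait_until_waiting(describe, cluster_id: str,
                       poll_seconds: int = 30, max_polls: int = 60) -> str:
    """Return once the cluster state is WAITING; fail on termination."""
    for _ in range(max_polls):
        state = describe(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
        if state == "WAITING":
            return state
        if state in ("TERMINATED", "TERMINATED_WITH_ERRORS"):
            raise RuntimeError(f"Cluster ended in state {state}")
        time.sleep(poll_seconds)
    raise TimeoutError("Cluster did not reach WAITING in time")
```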
Now that your cluster is running, connect to it and manage it. You can also submit work to your running cluster to process and analyze data.
With your cluster up and running, you can submit health_violations.py as a step. A step is a unit of cluster work made up of one or more jobs. For example, you might submit a step to compute values, or to transfer and process data.
You can submit multiple steps to accomplish a set of tasks on a cluster when you create the cluster, or after it is already running. For more information, see Submit Work to a Cluster.
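Submitting the sample script as a step can be sketched as follows. The step uses EMR's built-in command-runner.jar to invoke spark-submit; the cluster ID and the script, input, and output locations on PowerScale HDFS are placeholders.

```python
# Sketch: one EMR step that runs a PySpark script via spark-submit.
# All HDFS paths below are assumed placeholders on PowerScale HDFS.
def build_spark_step(script: str, input_uri: str, output_uri: str) -> dict:
    """Build one EMR step definition for add_job_flow_steps."""
    return {
        "Name": "Run health_violations.py",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's built-in command runner
            "Args": ["spark-submit", script, input_uri, output_uri],
        },
    }

step = build_spark_step(
    "hdfs://powerscale_fqdn:8020/scripts/health_violations.py",
    "hdfs://powerscale_fqdn:8020/data/input.csv",
    "hdfs://powerscale_fqdn:8020/output/",
)
# With a real client and cluster ID:
#   emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```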
After a step runs successfully, you can view its output results in the PowerScale HDFS output folder you specified when you submitted the step.
This step is not required, but you can connect to cluster nodes with Secure Shell (SSH) for tasks like issuing commands, running applications interactively, and reading log files.
Now that you have submitted work to your cluster and viewed the results of your sample application, you can shut the cluster down. You have the option to retain or delete the data from the PowerScale HDFS cluster.
Shutting down a cluster stops all its associated Amazon EMR charges and Amazon EC2 instances.
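Because the data lives on PowerScale rather than on cluster-local storage, it persists after the cluster terminates; deleting it is a separate, optional action. As a sketch, the output folder can be removed over WebHDFS and the cluster terminated with boto3. Host, port, paths, and the cluster ID are placeholders.

```python
# Sketch: optional cleanup of the output folder on PowerScale HDFS,
# mirroring the WebHDFS upload sketch. Host, port 8082, path, and user
# are assumed placeholders.
from urllib.parse import urlencode

def webhdfs_delete_url(host: str, port: int, hdfs_path: str, user: str) -> str:
    """Build the WebHDFS DELETE URL for recursively removing a path."""
    query = urlencode({"op": "DELETE", "recursive": "true", "user.name": user})
    return f"http://{host}:{port}/webhdfs/v1{hdfs_path}?{query}"

url = webhdfs_delete_url("powerscale_fqdn", 8082, "/output", "hdfs")
# Send this as an HTTP DELETE to remove the output folder; then shut
# the cluster down, e.g.:
#   boto3.client("emr").terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])
```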
This cleanup is optional: you can delete or retain the input dataset, cluster output, script, and log files.