Preparing OneFS
Complete the following steps to configure your PowerScale OneFS cluster for use with an Amazon EMR big data cluster. Preparing OneFS requires you to configure DNS, SmartConnect, and access zones so that the Amazon EMR big data cluster can connect to the PowerScale OneFS cluster. If you skip these preparation steps, the subsequent configuration steps might fail.
Note: For validation purposes, this section skips DNS and SmartConnect configuration. It describes how to set up an access zone (optional) and how to use the IP address of the PowerScale endpoint from Faction Cloud.
As a prerequisite step, complete the procedures in the section “Preparing OneFS” in the document PowerScale Powered by Azure Databricks and Faction to accelerate data-driven innovations.
We recommend that you maintain consistent names and numeric IDs for all users and groups on the OneFS cluster and your Hadoop clients. This consistency is important in multiprotocol environments because the HDFS protocol refers to users and groups by name, whereas NFS refers to them by their numeric IDs (UIDs and GIDs). Maintaining this parity is critical to the behavior of OneFS multiprotocol file access.
When Hadoop is installed with Amazon EMR, the installer creates all required system accounts on every client if they do not already exist. For example, the Hadoop system account YARN is created with a UID of 502 and a GID of 502 on the Hadoop cluster nodes. To ensure parity, you can pre-create these accounts on all nodes that will join the Hadoop cluster, which lets you control when and how the local system accounts are created. Because the Hadoop installer cannot create local accounts directly on OneFS, you must create them manually. In this example, create a local user YARN with a UID of 502 and a GID of 502 in the OneFS access zone through which YARN accesses data, so that access and permissions stay consistent.
For guidance and more information about maintaining parity between OneFS and Hadoop local users and UIDs, see the article Isilon and Hadoop Local User UID Parity.
There are many methods of achieving UID and GID parity. You can use Tools for Using Hadoop with OneFS, perform manual matching, or create scripts that parse users and create the equivalent users. Whichever method you choose, the sequence of steps depends on your deployment methodology and management practices. We recommend that each account keeps the same name, UID, and GID on the Hadoop cluster and on OneFS (for example, hdfs=hdfs, yarn=yarn, hbase=hbase, and so on).
Manually create the following users on the OneFS cluster before deploying Amazon EMR.
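As an illustration of the scripted approach, the following minimal Python sketch compares two passwd-style account listings and reports any account whose UID or GID differs. It assumes that both sides can be exported as passwd-format text (for example, `getent passwd` output from a Hadoop node); how you export the OneFS local users is left open. The function names and the sample data are hypothetical.

```python
"""Report accounts whose UID or GID differs between two passwd-style listings."""


def parse_passwd(text):
    """Return {name: (uid, gid)} from passwd-format lines (name:pw:uid:gid:...)."""
    accounts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # skip malformed lines rather than guess at their meaning
        accounts[fields[0]] = (int(fields[2]), int(fields[3]))
    return accounts


def parity_mismatches(hadoop_accounts, onefs_accounts, names):
    """Return {name: (hadoop_ids, onefs_ids)} for every account lacking parity.

    A side that does not have the account is reported as None.
    """
    mismatches = {}
    for name in names:
        hadoop_ids = hadoop_accounts.get(name)
        onefs_ids = onefs_accounts.get(name)
        if hadoop_ids is None or hadoop_ids != onefs_ids:
            mismatches[name] = (hadoop_ids, onefs_ids)
    return mismatches


if __name__ == "__main__":
    # Illustrative data only; the yarn entry reuses the 502/502 example above.
    hadoop_side = parse_passwd(
        "yarn:x:502:502::/home/yarn:/bin/bash\n"
        "hdfs:x:990:988::/home/hdfs:/bin/bash\n"
    )
    onefs_side = parse_passwd(
        "yarn:x:502:502::/ifs/home/yarn:/bin/sh\n"
        "hdfs:x:990:900::/ifs/home/hdfs:/bin/sh\n"
    )
    report = parity_mismatches(hadoop_side, onefs_side, ["yarn", "hdfs", "hbase"])
    for name, (hadoop_ids, onefs_ids) in report.items():
        print(f"{name}: hadoop={hadoop_ids} onefs={onefs_ids}")
```

Running a check like this before and after account creation makes it easier to catch a mismatched GID before it surfaces as a permissions problem on multiprotocol access.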
Table 1 Users on OneFS cluster
Item # | User | UID | Group | GID | Other groups | Proxy user |
1 | hadoop | 996 | hadoop | 1000 | | |
2 | zookeeper | 995 | zookeeper | 993 | | |
3 | ganglia | 994 | ganglia | 992 | | |
4 | nginx | 993 | nginx | 991 | | |
5 | hue | 992 | hue | 990 | hadoop, hive | hadoop |
6 | yarn | 991 | yarn | 989 | hadoop | hadoop |
7 | hdfs | 990 | hdfs | 988 | hadoop | hadoop |
8 | mapred | 989 | mapred | 987 | hadoop | |
9 | hbase | 988 | hbase | 986 | hadoop | |
10 | spark | 987 | spark | 985 | | |
11 | hive | 986 | hive | 984 | hadoop | hue, spark, presto |
12 | kms | 983 | kms | 983 | | |
13 | oozie | 984 | oozie | 982 | | |
14 | httpfs | 983 | httpfs | 981 | hadoop | |
15 | presto | 982 | presto | 980 | | |
16 | livy | 981 | livy | 979 | hadoop | |
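The account creation for Table 1 can be scripted rather than typed by hand. The following Python sketch only generates command lines; it does not run them. It assumes the `isi auth groups create`, `isi auth users create`, and `isi auth groups modify` commands with the `--gid`, `--uid`, `--primary-group`, `--add-user`, `--provider`, and `--zone` options, so verify the exact syntax against the CLI reference for your OneFS release before use. The zone name `zone1-cdh` is a placeholder, only three rows of the table are included, and the proxy user column is not covered.

```python
"""Generate (but do not run) OneFS account-creation commands from table rows."""
import shlex

# (user, uid, primary group, gid, other groups), copied from three rows of Table 1.
ACCOUNTS = [
    ("hadoop", 996, "hadoop", 1000, []),
    ("yarn", 991, "yarn", 989, ["hadoop"]),
    ("hdfs", 990, "hdfs", 988, ["hadoop"]),
]


def build_commands(accounts, zone):
    """Return command lines ordered as: all groups, then users, then memberships.

    Creating every group first means a supplemental group already exists
    by the time a user is added to it.
    """
    def cmd(*args):
        return " ".join(shlex.quote(str(arg)) for arg in args)

    groups, users, memberships = [], [], []
    for user, uid, group, gid, other_groups in accounts:
        groups.append(cmd("isi", "auth", "groups", "create", group,
                          "--gid", gid, "--provider", "local", "--zone", zone))
        users.append(cmd("isi", "auth", "users", "create", user,
                         "--uid", uid, "--primary-group", group,
                         "--provider", "local", "--zone", zone))
        for other in other_groups:
            memberships.append(cmd("isi", "auth", "groups", "modify", other,
                                   "--add-user", user,
                                   "--provider", "local", "--zone", zone))
    return groups + users + memberships


if __name__ == "__main__":
    # Placeholder zone name; replace it with your own access zone.
    for line in build_commands(ACCOUNTS, "zone1-cdh"):
        print(line)
```

Printing the commands instead of executing them lets you review the UIDs and GIDs against the table before anything changes on the cluster.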
For any Hadoop cluster, hdfs is the superuser. In the PowerScale HDFS cluster, the hdfs user must be a superuser for the hdfs access zone. Create a role with backup and restore privileges and add the hdfs user as a member; the resulting setup should look similar to the following example. For detailed information about this process, see the section “Configuring HDFS user for OneFS 8.2 and later versions” in the document PowerScale and Cloudera Data Platform Private Cloud Base.
For example:
isi auth roles view HdfsAccess --zone=zone1-cdh
  Name: HdfsAccess
  Description: Bypass FS permissions
  Members: - hdfs
  Privileges
    ID: ISI_PRIV_IFS_BACKUP
    Read Only: True
    ID: ISI_PRIV_IFS_RESTORE
    Read Only: True
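A quick way to confirm the role is to check the command output for the expected member and both privileges. The following Python sketch parses text in the layout shown in the example above; it is an assumption that your OneFS release prints the same field labels, and the parser handles only the single-member form shown here. The function names are hypothetical.

```python
"""Check captured `isi auth roles view` output for the hdfs member and privileges."""

REQUIRED_PRIVILEGES = {"ISI_PRIV_IFS_BACKUP", "ISI_PRIV_IFS_RESTORE"}


def parse_role_view(text):
    """Return (members, privilege_ids) from output laid out like the example."""
    members, privileges = [], []
    for raw in text.splitlines():
        line = raw.strip()  # indentation is ignored, so either layout parses
        if line.startswith("Members:"):
            value = line.split(":", 1)[1].strip()
            if value.startswith("-"):
                value = value[1:].strip()  # drop the leading list marker
            if value:
                members.append(value)
        elif line.startswith("ID:"):
            privileges.append(line.split(":", 1)[1].strip())
    return members, privileges


def role_is_ready(text, user="hdfs"):
    """True when the user is a member and both required privileges are present."""
    members, privileges = parse_role_view(text)
    return user in members and REQUIRED_PRIVILEGES.issubset(privileges)


if __name__ == "__main__":
    # Sample text copied from the example output in this section.
    sample = """\
Name: HdfsAccess
Description: Bypass FS permissions
Members: - hdfs
Privileges
ID: ISI_PRIV_IFS_BACKUP
Read Only: True
ID: ISI_PRIV_IFS_RESTORE
Read Only: True
"""
    print("role ready:", role_is_ready(sample))
```

The check reads saved output only, so it can run on any workstation that has a copy of the command's text.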