As an option, this architecture supports the use of Dell PowerScale storage, a highly flexible scale-out network-attached storage solution that can be used as the primary HDFS storage.
Compute and storage can be scaled independently using this alternative architecture. The PowerScale storage nodes provide the HDFS NameNode and DataNode services, so those services do not run on the Master nodes and Worker nodes. The Worker nodes include only enough local storage for runtime operations such as shuffle-sort spill files and cache.
This alternative architecture reduces the HDFS bandwidth requirements for the Cluster Data network. PowerScale OneFS implements data durability internally and uses a private back-end network for its internal operations. When Worker nodes write to HDFS, only a single copy of the data is transferred to the PowerScale storage nodes, so no replication traffic occurs on the Cluster Data network. HDFS recovery traffic for failed drives or nodes also stays off the Cluster Data network.
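The bandwidth effect described above can be illustrated with a rough back-of-the-envelope calculation. The following Python sketch is not part of the validated design. It assumes the stock HDFS replication factor of 3 and the default placement behavior in which the first replica is written to the local DataNode when the writer runs on a DataNode. It ignores protocol overhead, erasure coding, and rack topology. The function names are hypothetical and chosen only for illustration.

```python
def native_hdfs_network_bytes(payload_bytes, replication_factor=3,
                              first_replica_local=True):
    """Estimate bytes placed on the Cluster Data network by one write
    to native (direct-attached) HDFS.

    Assumption: each replica that is not stored on the writer's own
    node crosses the Cluster Data network exactly once.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    local_copies = 1 if first_replica_local else 0
    network_copies = replication_factor - local_copies
    return payload_bytes * network_copies


def powerscale_network_bytes(payload_bytes):
    """Estimate bytes placed on the Cluster Data network by one write
    when PowerScale provides HDFS.

    Assumption: a single copy travels to the storage nodes, and data
    protection traffic stays on the private back-end network.
    """
    return payload_bytes


if __name__ == "__main__":
    gib = 1024 ** 3
    payload = 100 * gib  # illustrative write size, not a measured value

    native = native_hdfs_network_bytes(payload)
    powerscale = powerscale_network_bytes(payload)

    print(f"Native HDFS: {native / gib:.0f} GiB on the Cluster Data network")
    print(f"PowerScale:  {powerscale / gib:.0f} GiB on the Cluster Data network")
```

Under these assumptions, the write-path savings depend on where the writer runs: a writer co-located with a DataNode saves one network copy per write, and a writer on a node without a DataNode saves two. Recovery traffic after drive or node failures is not modeled here.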
For clusters that use PowerScale as their primary HDFS storage, Dell Technologies recommends the PowerScale H7000 hybrid configuration. See the associated Data Management with Cloudera Data Platform on Dell Infrastructure Design Guides for more information about this configuration.