Cluster sizing and scaling are distinct but related considerations. Sizing ensures that the cluster meets the workload requirements for storage and processing throughput. Scaling addresses growth of the cluster over time as capacity needs increase. Because Cloudera CDP Private Cloud Base is a parallel scale-out system, some sizing requirements can be addressed by adding nodes, while others must be addressed through node-level sizing.
Sizing and scaling of a CDP Private Cloud Base cluster are complex topics that require knowledge of the workloads. This section highlights the main considerations that are involved but does not provide detailed recommendations. Your Dell Technologies or authorized partner sales representative can help with detailed sizing calculations.
Many parameters are involved in cluster sizing. The primary parameters are:
- Storage capacity
  - Storage capacity is usually the first parameter used to size a cluster. It is straightforward to calculate, but it should be calculated with the other sizing parameters in mind so that the cluster keeps a balance between storage and processing capability.
- Data ingest volumes and growth rates
  - Data ingest volume and growth rates each affect cluster sizing in several ways. Storage capacity should account for growth from data ingest and from increases in ingest volume. Data ingest also affects utilization of the Edge and Cluster Data networks. The network bandwidth and the amount of processing that ingest requires determine the number of Edge nodes that the cluster needs.
- Memory and processor capacity
  - Memory and processor requirements for jobs running on the cluster must be considered when sizing. Memory and processor capacity increase as nodes are added to the cluster. Workloads such as HBase or Spark jobs with large memory requirements may affect the sizing of individual nodes as well as the overall size of the cluster.
- Service level agreements
  - Production cluster sizing must meet any performance requirements that service level agreements (SLAs) specify. If critical path jobs must meet a specific execution time or throughput, the cluster sizing and the balance between compute and storage may have to be adjusted accordingly. Overall cluster throughput is as important as storage capacity, and often determines the number of nodes independently of the required storage capacity.
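
The interaction between these parameters can be illustrated with a rough back-of-the-envelope calculation. The Python sketch below is not a Dell Technologies or Cloudera sizing method; it only shows how storage growth, replication, and a throughput target might combine into a node count. All function names, the 25% headroom figure, the per-node capacities, and the throughput numbers are hypothetical placeholders. The replication factor of 3 matches the HDFS default, but a given cluster may use a different value or erasure coding.

```python
import math


def projected_data_tb(initial_tb, daily_ingest_tb, annual_ingest_growth, years):
    """Logical data volume after `years`, with daily ingest growing once per year."""
    total = initial_tb
    ingest = daily_ingest_tb
    for _ in range(years):
        total += ingest * 365
        ingest *= 1 + annual_ingest_growth
    return total


def worker_nodes_for_storage(data_tb, raw_tb_per_node, replication=3, headroom=0.25):
    """Worker nodes needed to hold `data_tb` of logical data.

    replication: copies kept of each block (3 is the HDFS default).
    headroom: assumed fraction of raw disk kept free for temporary and
              intermediate data (a placeholder, not a recommendation).
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    raw_tb = data_tb * replication / (1 - headroom)
    return math.ceil(raw_tb / raw_tb_per_node)


def worker_nodes_for_throughput(required_gb_per_hour, gb_per_hour_per_node):
    """Worker nodes needed to meet an SLA-driven processing throughput."""
    return math.ceil(required_gb_per_hour / gb_per_hour_per_node)


def edge_nodes_for_ingest(peak_ingest_gbps, usable_gbps_per_edge_node):
    """Edge nodes needed to carry peak ingest bandwidth (at least one)."""
    return max(1, math.ceil(peak_ingest_gbps / usable_gbps_per_edge_node))


# Hypothetical workload: 100 TB today, 1 TB/day ingest growing 20% per year,
# sized for three years on nodes with 96 TB of raw disk each.
data_tb = projected_data_tb(100, 1, 0.20, 3)
storage_nodes = worker_nodes_for_storage(data_tb, raw_tb_per_node=96)
throughput_nodes = worker_nodes_for_throughput(5000, 60)

# The larger of the two requirements drives the worker node count.
worker_nodes = max(storage_nodes, throughput_nodes)
edge_nodes = edge_nodes_for_ingest(18, 10)

print(f"storage-driven nodes:    {storage_nodes}")
print(f"throughput-driven nodes: {throughput_nodes}")
print(f"worker nodes to deploy:  {worker_nodes}")
print(f"edge nodes:              {edge_nodes}")
```

With these placeholder inputs the throughput target, not the storage requirement, determines the worker node count, which is the situation the SLA item describes. A calculation like this ignores memory and processor sizing of individual nodes, so it is a starting point for discussion rather than a substitute for detailed sizing.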