The resource requirements of the intended cluster workloads must be factored into node and cluster sizing. Detailed sizing of workload requirements is complex. However, once the workload requirements are known, mapping them to cluster requirements is straightforward. The flexibility of the platform also allows for ongoing adjustment and fine-tuning, so the sizing does not have to be exact.
Apache Spark is used here as an example of how workload sizing should be mapped to cluster requirements. Example Spark instance requirements summarizes the resource requirements for three sample Spark clusters.
| Spark cluster resources | Small instance: Worker | Small instance: Cluster | Medium instance: Worker | Medium instance: Cluster | Large instance: Worker | Large instance: Cluster |
| --- | --- | --- | --- | --- | --- | --- |
| Number of pods (Spark workers) | | 4 | | 8 | | 12 |
| Memory (GB) | 8 | 32 | 16 | 128 | 32 | 384 |
| Cores | 4 | 16 | 6 | 48 | 8 | 96 |
| Modern data stack storage (GB) | 2 | 8 | 8 | 64 | 2 | 24 |
| Ephemeral storage (GB) | 8 | 32 | 16 | 128 | 32 | 384 |
The table above includes three clusters of varying resource requirements and scale. The resources for each Spark worker are specified, along with the expected number of worker pods. From these per-worker figures, the total resource requirements for each cluster are calculated.
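The worker-to-cluster arithmetic can be expressed as a short calculation. The sketch below is illustrative only; the class and function names are not part of any product API, and the medium-instance figures are taken from the table above.

```python
from dataclasses import dataclass


@dataclass
class SparkWorkerSpec:
    """Per-worker-pod resource requirements, as in the table above."""
    memory_gb: int
    cores: int
    data_storage_gb: int        # net new modern data stack storage per worker
    ephemeral_storage_gb: int


def cluster_totals(worker: SparkWorkerSpec, num_workers: int) -> dict:
    """Scale per-worker requirements by the expected number of worker pods."""
    return {
        "memory_gb": worker.memory_gb * num_workers,
        "cores": worker.cores * num_workers,
        "data_storage_gb": worker.data_storage_gb * num_workers,
        "ephemeral_storage_gb": worker.ephemeral_storage_gb * num_workers,
    }


# Medium instance: 8 worker pods, each with 16 GB memory, 6 cores,
# 8 GB data storage, and 16 GB ephemeral storage.
medium = cluster_totals(
    SparkWorkerSpec(memory_gb=16, cores=6, data_storage_gb=8, ephemeral_storage_gb=16),
    num_workers=8,
)
print(medium)
# {'memory_gb': 128, 'cores': 48, 'data_storage_gb': 64, 'ephemeral_storage_gb': 128}
```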
The calculation uses the amount of net new modern data stack storage that is required. If the jobs are expected to process existing data, no additional storage is required. If the jobs generate data, significant storage may be required. In the large instance, only 24 GB of modern data stack storage is estimated, while the medium instance requires 64 GB: the medium instance is expected to generate more data than the large instance even though it uses fewer compute resources.
Based on these calculations, the cluster-level resources can be determined. For this example, the pilot cluster configuration can support four medium Spark clusters (4 × 48 = 192 cores) before it runs out of cores, or it can support two large instances (2 × 96 = 192 cores).
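A similar back-of-the-envelope check shows how many Spark instances fit into a given cluster. In the sketch below, the pilot cluster capacity figures are assumed values chosen only to be consistent with the statement above; substitute the actual schedulable resources of the target cluster.

```python
# Assumed schedulable capacity of the pilot cluster (hypothetical figures,
# chosen only to match the "four medium or two large instances" example).
PILOT_CORES = 200
PILOT_MEMORY_GB = 768

# Per-instance cluster totals (cores, memory in GB) from the table above.
SPARK_INSTANCES = {
    "small": (16, 32),
    "medium": (48, 128),
    "large": (96, 384),
}


def max_instances(size: str) -> int:
    """Number of instances of this size that fit before cores or memory run out."""
    cores, memory_gb = SPARK_INSTANCES[size]
    return min(PILOT_CORES // cores, PILOT_MEMORY_GB // memory_gb)


for size in SPARK_INSTANCES:
    print(f"{size}: {max_instances(size)}")
# With the assumed capacity: small: 12, medium: 4 (core-limited), large: 2
```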