All the available cluster resources across all worker nodes are pooled and allocated on demand. This configuration provides an abstraction where workloads can be mapped to available resources independent of the physical node used. Accelerators are considered a resource, and any workload pod that requires accelerator resources must run on a node with an available accelerator.
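As a minimal sketch of how a workload pod requests an accelerator (assuming NVIDIA GPUs exposed through the NVIDIA device plugin, which advertises the nvidia.com/gpu extended resource; the pod name and image below are placeholders), the accelerator is declared alongside CPU and memory, and the scheduler places the pod only on a node that can satisfy the request:

    apiVersion: v1
    kind: Pod
    metadata:
      name: accelerated-workload        # hypothetical example name
    spec:
      containers:
      - name: trainer
        image: registry.example.com/analytics/trainer:latest   # placeholder image
        resources:
          requests:
            cpu: "4"
            memory: 16Gi
            nvidia.com/gpu: 1           # extended resource advertised by the GPU device plugin
          limits:
            nvidia.com/gpu: 1           # extended resources cannot be overcommitted, so request and limit match

Pods that make no accelerator request are free to land on any node in the pool, with or without accelerators.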
The recommended worker node sizes in this design are based on general-purpose usage. These worker nodes can support various analytics workloads without modification. However, in some scenarios it is appropriate to adjust the configuration to match the intended workloads.
Heterogeneous node configurations are possible. A cluster can include nodes with different memory, compute, and storage sizes. The resources from all of these nodes are added to the overall resource pool.
From a resource point of view, there is little difference between many small nodes and a few large nodes. As long as each node has enough resources to handle the largest expected pod resource request, the difference between nodes is transparent to workloads. However, three additional considerations are involved in this tradeoff: network bandwidth, fault zones, and operational overhead.
Aggregate network bandwidth is proportional to the number of nodes, because each node contributes its own network interfaces. A few large nodes therefore have less total bandwidth than many small nodes, even if the aggregate memory and compute resources are the same. The bandwidth requirements of the workloads should be factored into cluster and node sizing.
Fault zones are important for the overall reliability of the infrastructure. Although the cluster can continue running when a node fails, the resources from that node are lost on failure. In a small cluster of large nodes, a single node failure can remove a substantial share of the available resources, even if the failure is temporary. Sizing should ensure that the loss of any single node affects only a small proportion of the overall cluster capacity.
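One hedged way to limit the blast radius of a node failure is a Kubernetes topology spread constraint, which keeps a workload's replicas from concentrating on a single node (the workload name, labels, replica count, and image below are illustrative, not part of this design):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: query-workers               # hypothetical workload name
    spec:
      replicas: 6
      selector:
        matchLabels:
          app: query-workers
      template:
        metadata:
          labels:
            app: query-workers
        spec:
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: kubernetes.io/hostname     # spread replicas evenly across nodes
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app: query-workers
          containers:
          - name: worker
            image: registry.example.com/analytics/worker:latest   # placeholder image
            resources:
              requests:
                cpu: "2"
                memory: 8Gi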
Operational overhead is another sizing consideration. Every node adds operational overhead for maintenance and monitoring, so fewer, larger nodes can be more efficient to operate. One larger node can also be more energy-efficient than several smaller nodes. Operational capacity should be part of the overall sizing effort.
For parallel, scale-out workloads such as Apache Spark and Starburst Enterprise, resources are allocated from the cluster-level pool as the workload launches multiple pods. As a result, workload pods can run on any physical node that meets their resource requirements. Depending on the Spark job, many small pods or a few large pods may be appropriate, and the container platform is flexible in this respect. It is possible to deploy Spark clusters dynamically, sized for the job itself, instead of maintaining a fixed Spark cluster that must be tuned for many types of jobs.
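As an illustrative sketch of deploying a Spark cluster sized for a single job, assuming the community Kubernetes Operator for Apache Spark (sparkoperator.k8s.io) is installed and using its v1beta2 SparkApplication schema; the job name, image, class, artifact path, and sizing values are placeholders:

    apiVersion: sparkoperator.k8s.io/v1beta2
    kind: SparkApplication
    metadata:
      name: nightly-etl                 # hypothetical job name
    spec:
      type: Scala
      mode: cluster
      image: registry.example.com/spark/spark:3.5.0         # placeholder image
      mainClass: com.example.etl.Main                        # placeholder class
      mainApplicationFile: local:///opt/spark/jobs/etl.jar   # placeholder artifact
      sparkVersion: "3.5.0"
      driver:
        cores: 1
        memory: 4g
      executor:
        instances: 8        # sized for this job: many small executors
        cores: 2
        memory: 8g

The same job could instead request a few large executors (fewer instances with more cores and memory each) without changing the underlying nodes, provided each executor pod still fits on a single node.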
Some workloads, such as Starburst, may have large per-pod memory requirements that cannot be met by scaling out. You may have to increase the memory size of some or all nodes to accommodate the largest expected memory allocation for that workload.
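A hedged sketch of steering such a workload onto high-memory nodes combines a large memory request with a nodeSelector that matches a label applied to those nodes (the label, pod name, image, and sizes below are illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      name: large-memory-coordinator    # hypothetical workload name
    spec:
      nodeSelector:
        node-role.example.com/highmem: ""   # illustrative label applied to the high-memory nodes
      containers:
      - name: coordinator
        image: registry.example.com/analytics/coordinator:latest   # placeholder image
        resources:
          requests:
            cpu: "16"
            memory: 192Gi      # must fit within a single node's allocatable memory
          limits:
            memory: 192Gi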
The platform can support up to 250 pods per worker node and up to 2,000 worker nodes in a single OpenShift cluster. Cluster and node sizing should aim to stay substantially below these limits.