Domino Data Science Platform uses pools to provide self-service access to the heterogeneous blend of hardware configurations that are used in data science. Generally, at least two pools are defined: one for CPU workloads and another for GPU workloads. Additional pools can be created in the compute grid as necessary to provide specialized execution hardware. In the Design for Domino Data Science Platform, we created three pools: a default pool, a GPU pool, and a CPU pool.
To size the compute grid for this design, we first established the criteria that determine which pool is best suited to a specific workload:
After determining the pool assignment for each type of workload, we determined the expected number of instances of each workload and how long each is expected to run. To size the Design for Domino Data Science Platform, we assumed that our data scientists work in five separate teams of four. We then estimated the average blend of workloads that a typical enterprise data science team runs. To finalize the workload definition, we also incorporated these assumptions:
Our default pool needs 100 cores and 320 GB of memory to support user workloads. Development environments and web applications do not process data 100 percent of the time, so oversubscription might be possible after monitoring the utilization metrics of the hosts in the pool with tools such as perf, which is included with Red Hat Enterprise Linux. Although the host operating system and the services running on the host consume some resources, our evaluation determined that this minor overhead was unlikely to affect these workloads.
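As a rough illustration of how a total such as 100 cores and 320 GB can be derived, the arithmetic can be sketched as follows. The per-environment figures of 5 cores and 16 GB are assumptions chosen for illustration, not numbers taken from the design:

```python
# Hypothetical sizing sketch for the default pool.
# Assumes 20 data scientists (five teams of four), each running one
# development environment; the per-environment figures below are
# illustrative assumptions, not design requirements.
users = 5 * 4                       # five teams of four data scientists
cores_per_env = 5                   # assumed cores per development environment
memory_gb_per_env = 16              # assumed memory per development environment

pool_cores = users * cores_per_env           # 100 cores
pool_memory_gb = users * memory_gb_per_env   # 320 GB
print(f"Default pool: {pool_cores} cores, {pool_memory_gb} GB memory")
```

Oversubscription would raise the effective capacity of the same hardware; for example, a 1.5x oversubscription factor on cores would let the pool schedule 150 requested cores, which is why measured utilization data matters before committing to it.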
Our GPU workload requirement is to support three separate training jobs for each team, each week. By having three NVIDIA V100 GPUs available in the compute grid, we enable approximately 100 hours of GPU runtime per team per week, or roughly 33 hours per model candidate.
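The GPU-hour budget follows directly from the figures above: three GPUs running around the clock yield 504 GPU-hours per week, which divides across five teams and three weekly jobs per team.

```python
# GPU-hour budget for the GPU pool, derived from the figures in the text.
gpus = 3                        # NVIDIA V100 GPUs in the compute grid
hours_per_week = 7 * 24         # a GPU-week is 168 hours
teams = 5
jobs_per_team_per_week = 3      # training jobs per team, per week

total_gpu_hours = gpus * hours_per_week                   # 504 GPU-hours/week
hours_per_team = total_gpu_hours / teams                  # ~100 hours per team
hours_per_job = hours_per_team / jobs_per_team_per_week   # ~33 hours per job
print(f"{total_gpu_hours} GPU-hours/week total, "
      f"~{hours_per_team:.0f} per team, ~{hours_per_job:.0f} per model candidate")
```

This assumes every GPU-hour is schedulable; in practice, job scheduling gaps and maintenance reduce the usable total somewhat.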
The CPU pool is characterized by its high core count and its large amount of available memory. Because CPU resources can be assigned per core, we first tried to size the CPU pool by determining the number of cores and the amount of memory that each team might need. This determination proved difficult to model for several reasons:
We opted to use a single PowerEdge C6420 chassis. Unlike the GPU pool, in which resource contention is binary (a GPU is either available or unavailable), CPU resource contention involves a tradeoff among several metrics and requires a more in-depth evaluation of the specific blend of workloads that is appropriate for the default pool and the CPU pool.
In practice, the requirements of each team are unlikely to be uniform. For example, full model training might take 90 hours to converge but occur only once a month. Domino Data Science Platform can report resource utilization, which helps you determine whether a compute pool must grow to meet user demand.
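The kind of check that utilization reporting enables can be sketched as follows. This is a hypothetical illustration, not a Domino API: it amortizes the example 90-hour monthly training run into an average weekly GPU-hour demand and compares the result against the roughly 100 GPU-hours per team per week budgeted earlier.

```python
# Hypothetical capacity check: amortize a monthly 90-hour full training
# run into weekly GPU-hours and compare against the per-team budget.
WEEKS_PER_MONTH = 52 / 12           # ~4.33 weeks per month

def weekly_demand(job_hours, jobs_per_month):
    """Average weekly GPU-hours implied by a monthly workload."""
    return job_hours * jobs_per_month / WEEKS_PER_MONTH

routine = 3 * 33                          # three weekly jobs at ~33 hours each
full_training = weekly_demand(90, 1)      # ~21 GPU-hours per week, amortized
total_demand = routine + full_training

budget = 100                              # GPU-hours per team per week
print(f"demand ~{total_demand:.0f} h/week vs budget {budget} h/week")
# If demand regularly exceeds the budget, the utilization reports
# provide the evidence needed to justify growing the pool.
```

A team that adds the monthly full-training run on top of its routine jobs overshoots the weekly budget on average, which is exactly the signal that would prompt resizing the GPU pool.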