Ready Solutions for AI give our customers a flexible platform for their AI and data science workloads. To create the Design for Domino Data Science Platform, Dell EMC engineering deployed the Domino 4.0 and 4.03 releases on Dell EMC infrastructure in the lab. We created several use cases and advanced through the project life cycle using the Domino Data Science Platform. These use cases are described in Building AI on Domino use cases. Based on this evaluation, we developed best practices for configuration, sizing, and operations and incorporated them into our design.
Domino Data Science Platform version 4.0 and later is designed to run on an existing Kubernetes platform. Kubernetes and container-based workloads are increasingly common in the enterprise. However, Kubernetes is a relatively new orchestration engine, and not all features are present in every distribution. To get access to all features, we recommend the upstream Kubernetes release, which can make local GPU resources available to the containers in the compute grid. For administrators who are not experienced with running a Kubernetes cluster and do not require GPU resources for deep learning, we recommend Pivotal PKS.
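Domino relies on the Kubernetes scheduler to place GPU executions, so GPU worker nodes must advertise their devices as allocatable resources. The following is a minimal sketch of inspecting that data; the node names and resource figures are hypothetical, while `nvidia.com/gpu` is the resource name registered by the NVIDIA device plugin. In a live cluster, the same per-node allocatable map comes from the Kubernetes API (`status.allocatable`).

```python
# Illustrative sketch: find which nodes advertise NVIDIA GPUs to the
# Kubernetes scheduler. Node data below is hypothetical; a live cluster
# reports it through the Kubernetes API under status.allocatable.

GPU_RESOURCE = "nvidia.com/gpu"  # resource name registered by the NVIDIA device plugin

def gpu_nodes(allocatable_by_node):
    """Return {node_name: gpu_count} for nodes exposing at least one GPU."""
    return {
        node: int(resources.get(GPU_RESOURCE, "0"))
        for node, resources in allocatable_by_node.items()
        if int(resources.get(GPU_RESOURCE, "0")) > 0
    }

# Hypothetical allocatable resources for a platform node and a GPU worker.
nodes = {
    "platform-1": {"cpu": "32", "memory": "48Gi"},
    "gpu-worker-1": {"cpu": "36", "memory": "192Gi", GPU_RESOURCE: "3"},
}
print(gpu_nodes(nodes))  # only gpu-worker-1 is GPU-capable
```

If this check returns an empty result on a cluster that has GPU hardware, the device plugin is typically not running or the GPU drivers are not installed on the worker.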
The Design for Domino Data Science Platform supports 20 data science users with access to CPU- and GPU-based execution environments. This design also supports access to an external Hadoop cluster for integrations with HDFS and Apache Spark.
The following figure shows the hardware architecture:
Figure 7. Design for Domino Data Science Platform hardware architecture
The following table lists the recommended configuration for Kubernetes master nodes and for Domino platform nodes:
Table 1. Recommended configuration for platform nodes
| Component | Description |
|---|---|
| Server | Dell EMC PowerEdge R640 |
| Quantity | 3 |
| Processor | Intel Xeon Silver 4216 2.1G, 16C/32T |
| Memory | 48 GB DDR4-2400 |
| Storage | 2x 240 GB or 480 GB SATA SSD |
| Network | 25 GbE Mellanox ConnectX-5 NDC |
The default pool configuration, as shown in the following table, provides a balanced configuration of CPU cores, clock speed, and memory. Most users perform their development in the default pool in the compute grid. Hosted applications and models are well suited for the default pool configuration. Only run hosted applications and models on a specialized pool in the compute grid after determining that the default pool does not provide the necessary performance.
Table 2. Recommended configuration for the default pool
| Component | Description |
|---|---|
| Server | Dell EMC PowerEdge R640 |
| Quantity | 2 or more |
| Processor | 2x Intel Xeon Gold 6230 2.1G, 20C/40T |
| Memory | 192 GB DDR4-2933 |
| Storage | 2x 240 GB or 480 GB SATA SSD (boot); 1 TB+ SSD for project files |
| Network | 25 GbE Mellanox ConnectX-5 NDC |
The following table shows the recommended hardware tiers for this configuration:
Table 3. Recommended hardware tiers for the default pool
| Name | Cores | Memory (GB) |
|---|---|---|
| Default.small | 2 | 2 |
| Default.medium | 2 | 4 |
| Default.large | 4 | 4 |
| Default.xl | 4 | 8 |
| Default.xxl | 4 | 16 |
| Default.xxxl | 8 | 16 |
For the GPU pool, we selected the PowerEdge R740 server, which supports up to three accelerators rated for 300 W, such as the NVIDIA V100 GPU, as shown in the following table. It can also be equipped with up to six 75 W NVIDIA T4 GPUs. The PowerEdge C4140 is also supported as a member of the compute grid; it supports up to four 300 W accelerators as well as NVLink.
Table 4. Recommended configuration for the GPU pool
| Component | Description |
|---|---|
| Server | Dell EMC PowerEdge R740 |
| Quantity | 1 or more |
| Processor | 2x Intel Xeon Gold 6254 3.1G, 18C/36T |
| Accelerator | 3x NVIDIA V100 or 6x NVIDIA T4 |
| Memory | 192 GB DDR4-2933 |
| Storage | 2x 240 GB or 480 GB SATA SSD (boot); 1 TB+ SSD for project files |
| Network | 25 GbE Mellanox ConnectX-5 NDC |
Only one hardware tier is necessary per GPU-enabled pool. If V100 and T4 configurations are present in the compute grid, they must exist in separate pools. The following table shows the recommended hardware tiers for this configuration:
Table 5. Recommended hardware tiers for the default GPU pool
| Name | Cores | Memory (GB) |
|---|---|---|
| Gpu.v100 | 12 | 64 |
| Gpu.T4 | 6 | 32 |
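The GPU tier sizes follow directly from the node configuration in Table 4: each execution claims one GPU plus an even share of the node's cores and memory. The sketch below shows that arithmetic; it is illustrative only, since the actual tiers are defined in the Domino administration interface.

```python
# How the GPU tier sizes in Table 5 follow from the node configuration:
# each execution gets one GPU plus an even share of cores and memory.

def per_gpu_tier(node_cores, node_mem_gb, gpus):
    """Cores and memory available to each GPU-backed execution."""
    return node_cores // gpus, node_mem_gb // gpus

# R740 with 2x Xeon Gold 6254 (36 cores total) and 192 GB of memory:
print(per_gpu_tier(36, 192, 3))  # 3x V100 -> (12, 64), matching Gpu.v100
print(per_gpu_tier(36, 192, 6))  # 6x T4   -> (6, 32), matching Gpu.T4
```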
The following table shows the configuration that is used for the CPU pool in our compute grid. The CPU pool is designed to support data science workloads that require high performance but do not benefit from the parallel execution pattern of the GPU. The PowerEdge C6420, a 2U chassis containing four individual compute nodes, was selected for maximum density for CPU-bound workloads.
Table 6. Recommended configuration for the CPU pool
| Component | Description |
|---|---|
| Server | Dell EMC PowerEdge C6420 |
| Quantity | 4 nodes per chassis; 1 or more chassis |
| Processor | 2x Intel Xeon Platinum 8280 2.7G, 28C/56T |
| Memory | 768 GB DDR4-2933 |
| Storage | 2x 240 GB or 480 GB SATA SSD (boot); 1 TB+ SSD for project files |
| Network | 25 GbE Mellanox ConnectX-5 NDC |
The following table shows the recommended hardware tiers for this configuration:
Table 7. Recommended hardware tiers for the CPU pool
| Name | Cores | Memory (GB) |
|---|---|---|
| Cpu.small | 4 | 16 |
| Cpu.medium | 8 | 32 |
| Cpu.large | 8 | 64 |
| Cpu.xl | 16 | 64 |
| Cpu.xxl | 28 | 128 |
| Cpu.max | 28 | 384 |
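Because the C6420 packs four nodes into one chassis, it is convenient to size the CPU pool at the chassis level. The sketch below estimates total concurrent runs per chassis using the node figures from Table 6 (2x Xeon Platinum 8280 gives 56 physical cores and 768 GB per node) and the tier sizes from Table 7, again ignoring OS and Kubernetes overhead.

```python
# Aggregate concurrency for one fully populated C6420 chassis: 4 nodes,
# each with 2x Xeon Platinum 8280 (56 physical cores) and 768 GB RAM.
# Tier sizes come from Table 7; overhead is ignored, so these are
# upper bounds.

NODES_PER_CHASSIS = 4
NODE_CORES, NODE_MEM_GB = 56, 768

def chassis_capacity(tier_cores, tier_mem_gb):
    """Concurrent runs of one tier across the whole chassis."""
    per_node = min(NODE_CORES // tier_cores, NODE_MEM_GB // tier_mem_gb)
    return NODES_PER_CHASSIS * per_node

print(chassis_capacity(4, 16))    # Cpu.small -> 56 runs per chassis
print(chassis_capacity(28, 128))  # Cpu.xxl   -> 8 runs per chassis
print(chassis_capacity(28, 384))  # Cpu.max   -> 8 runs per chassis
```

Note that Cpu.xxl and Cpu.max yield the same concurrency: both are core-bound at two runs per node, and the extra memory of Cpu.max does not change that.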