GPU-accelerated modern data stack worker nodes support the platform runtime services and customer workloads that benefit from GPU acceleration. Symcloud Platform Compute and Storage roles define the runtime services, while user deployments define the workloads. Dell Technologies recommends the configuration in GPU-accelerated modern data stack worker node configuration as a starting point for GPU-accelerated workloads.
Table 8. GPU-accelerated modern data stack worker node configuration

| Component | Details |
| --- | --- |
| Platform | PowerEdge R760 server |
| Chassis | 2.5-inch chassis with up to 16 SAS or SATA drives, Smart Flow, front PERC 12, two CPUs |
| Chassis configuration | Riser configuration 5, full-length, with two full-height 16-channel slots (Gen4), two low-profile 16-channel slots (Gen4), one full-height 16-channel slot (Gen5), and one full-height, double-wide, 16-channel, GPU-capable slot (Gen5) |
| Power supply | Dual, hot-plug, 2400 W redundant, configuration D, mixed-mode power supplies |
| Processor | Intel Xeon Gold 6438Y, 2.0 GHz, 32 cores/64 threads, 16 GT/s, 60 MB cache, Turbo, HT (205 W), DDR5-4800 |
| Memory capacity | 512 GB (sixteen 32 GB RDIMMs, 3200 MT/s, dual rank) |
| Internal RAID storage controllers | Dell PERC H965i with rear load bracket |
| Disk (SSD) | Six 3.84 TB hot-plug, SAS, mixed-use, up to 24 Gbps, 512e, 2.5-inch, Federal Information Processing Standard (FIPS) Self-Encrypting Drives (SEDs) |
| Boot-optimized storage cards | BOSS-N1 controller card with two M.2 960 GB SSDs (RAID 1) |
| Network interface controllers | NVIDIA ConnectX-6 Lx dual-port 10/25 GbE SFP28 adapter, PCIe low profile |
| GPU, FPGA, or acceleration cards | NVIDIA Ampere A30, PCIe, 165 W, 24 GB, passive, double-wide, full-height GPU with cable |
Dell Technologies recommends the disk volume and partition layouts that are listed in GPU-accelerated modern data stack worker node volumes and GPU-accelerated modern data stack worker node partitions for this set of machines.
Table 9. GPU-accelerated modern data stack worker node volumes

| Volume | RAID level | Drives | Disk numbers |
| --- | --- | --- | --- |
| Operating system | RAID 1 | Two 960 GB SSDs | 0 |
| Symcloud Storage | No RAID | Six 3.84 TB SAS SSDs | 1-6 |
Table 10. GPU-accelerated modern data stack worker node partitions

| Partition | Size | File system | Disk | Type | Description |
| --- | --- | --- | --- | --- | --- |
| /boot | 1024 MB | XFS | 0 | Primary | This partition contains BIOS start-up files that must be within the first 2 GB of the disk. |
| /boot/efi | 650 MB | VFAT | 0 | Extended | This partition contains EFI start-up files. |
| / | Approximately 100 GB | XFS | 0 | LVM | This partition contains the root file system. |
| /home | 300 GB | XFS | 0 | LVM | This partition contains the user home directories. |
| /var | 400 GB | XFS | 0 | LVM | This partition contains variable data such as system logging files, databases, mail and printer spool directories, and transient and temporary files. |
| None | Six 3.84 TB | Symcloud Storage | 1-6 | Raw partitions | Symcloud Storage manages these partitions. |
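As a quick check of how the recommended layout fits on the boot device, the sketch below sums the partition sizes from the table and reports the headroom that remains on the 960 GB RAID 1 operating system volume for later LVM growth. The usable-capacity figure is an assumption based on one 960 GB member of the mirror; it is not a value defined by this guide.

```python
# Illustrative sketch only: confirm that the recommended OS partitions fit on
# the BOSS-N1 RAID 1 volume. Sizes come from the table above; the usable
# capacity is assumed to be one 960 GB member of the RAID 1 mirror.
partitions_gb = {
    "/boot": 1.0,        # 1024 MB
    "/boot/efi": 0.65,   # 650 MB
    "/": 100.0,          # approximate
    "/home": 300.0,
    "/var": 400.0,
}

usable_gb = 960.0  # assumption: RAID 1 exposes the capacity of a single 960 GB M.2 SSD
allocated_gb = sum(partitions_gb.values())
headroom_gb = usable_gb - allocated_gb

print(f"Allocated: {allocated_gb:.1f} GB of {usable_gb:.0f} GB")
print(f"Unallocated headroom for LVM growth: {headroom_gb:.1f} GB")
```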
GPU-accelerated worker nodes have capabilities that are similar to general-purpose worker nodes, while adding GPU acceleration. Dell Technologies recommends four worker nodes for a minimum deployment. You can use any mix of general-purpose and GPU-accelerated nodes.
The configuration includes two network ports to support high-availability (HA) networking. These ports can come from a single network card or from a pair of network cards for additional adapter-level HA.
Two M.2 960 GB SSDs in a RAID 1 configuration are used for the operating system volume. The home directories are allocated in a separate small partition since user files are not stored at the operating system level on production nodes. Most of the storage is allocated to the /var partition for runtime files. You can use LVM to adjust the storage allocation between /, /home, and /var for specific needs.
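When a workload profile needs a different split, the reallocation is a standard LVM resize followed by an XFS grow. The sketch below is illustrative only: the volume group and logical volume names (vg_root, lv_var) are hypothetical placeholders that depend on how the node was provisioned, and XFS file systems can be grown online but not shrunk, so plan any reductions before provisioning.

```python
# Illustrative sketch, not part of the guide: grow the /var logical volume into
# free space in the volume group and expand the XFS file system to match.
# The volume group and logical volume names below are assumptions; substitute
# the names used on your nodes.
import subprocess

VG = "vg_root"      # hypothetical volume group name
LV = "lv_var"       # hypothetical logical volume backing /var
GROW_BY = "100G"    # example amount to add to /var

def run(cmd):
    """Echo a command, run it, and stop on failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Extend the logical volume by the requested amount.
run(["lvextend", "-L", f"+{GROW_BY}", f"/dev/{VG}/{LV}"])

# Grow the mounted XFS file system to fill the enlarged volume.
run(["xfs_growfs", "/var"])
```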
Six SSDs are allocated for use by Symcloud Storage. The services to support this storage are deployed with the Symcloud Storage role. This storage is exposed to workloads running on the cluster through the Kubernetes CSI interface. The recommended configuration provides approximately 23 TB of storage per node. This capacity is enough for typical runtime storage in a modern data stack environment where the bulk of the data is stored on external storage. The external storage can be either HDFS provided by PowerScale, or object storage provided by ECS. If more local storage is needed, up to 10 more SSDs can be added, and drive sizes can be increased.
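Workloads consume this capacity by requesting persistent volumes through the CSI storage class that the Symcloud Storage services register with Kubernetes. The sketch below, using the Kubernetes Python client, shows the general shape of such a request; the storage class name (symcloud-storage) and namespace (analytics) are placeholders rather than names defined by this guide.

```python
# Illustrative sketch: request a persistent volume from the CSI-backed storage
# class provided by Symcloud Storage. The storage class name and namespace are
# placeholders; use the names configured on your cluster.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="spark-scratch"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="symcloud-storage",  # placeholder storage class name
        resources=client.V1ResourceRequirements(
            requests={"storage": "500Gi"}
        ),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="analytics", body=pvc
)
```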
Memory has been sized to support all the required services in a production deployment with enough headroom for user workloads. The most common change is to increase the memory size to support more containers or workloads requiring more memory.
The processors have been chosen to support compute-intensive AI and ML workloads and include dual Intel Advanced Vector Extensions (AVX) units for maximum compute speed. Other processor choices are possible but should be made with memory requirements and overall power consumption in mind.
The GPU has been chosen to support Spark acceleration of SQL and dataframe operations using the NVIDIA RAPIDS Accelerator for Apache Spark. These workload operations are typical in a modern data stack environment. One or two GPUs can be used in this configuration. AI and ML workloads may benefit from alternative GPU models.
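As a sketch of what enabling this acceleration looks like from the Spark side, the PySpark session below loads the RAPIDS Accelerator plugin and requests one GPU per executor. The jar path, discovery script location, and resource amounts are deployment-specific assumptions; consult the RAPIDS Accelerator for Apache Spark documentation for the settings that match your Spark and CUDA versions.

```python
# Illustrative sketch: enable the NVIDIA RAPIDS Accelerator for Apache Spark on
# a GPU-accelerated worker. Paths, resource amounts, and the discovery script
# are placeholders that depend on how Spark is deployed on the cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rapids-accelerated-etl")
    # Load the RAPIDS Accelerator SQL plugin.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    # One A30 GPU per executor; fractional task amounts let tasks share the GPU.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    # Placeholder paths: the plugin jar and the GPU discovery script shipped with it.
    .config("spark.jars", "/opt/sparkRapidsPlugin/rapids-4-spark.jar")
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/sparkRapidsPlugin/getGpusResources.sh")
    .getOrCreate()
)

# SQL and DataFrame operations such as joins and aggregations are candidates
# for GPU execution once the plugin is active.
df = spark.range(0, 100_000_000).withColumnRenamed("id", "key")
df.groupBy((df.key % 1000).alias("bucket")).count().show(5)
```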