Solution sizing guidance
DL workloads vary significantly in their compute, memory, disk, and I/O demands, often by orders of magnitude. Precise sizing guidance for GPU quantity and ECS node configuration can therefore be provided only when these resource requirements are known in advance. Even so, it is useful to have a few data points on the ratio of GPUs per ECS EXF900 node for common image classification benchmarks.
The goal of this test is to identify the number of GPUs that a single ECS node can support without performance degradation. Performance degradation is identified by comparing the images/sec throughput measured throughout the workload with one and five ECS nodes and watching for a drop-off. It can also be identified by monitoring GPU utilization. In a normal, efficient workload, GPU utilization should stay between 98% and 100%. If a GPU is not fully utilized, a bottleneck somewhere in the infrastructure is the likely cause.
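The utilization check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the test harness used for this paper: the function name `underutilized_gpus`, the shape of the `samples` dictionary, and the sample values are all assumptions made for the example. Only the 98% threshold comes from the text.

```python
from statistics import mean

# Threshold taken from the text: an efficient workload keeps GPUs at 98-100%.
UTILIZATION_THRESHOLD = 98.0


def underutilized_gpus(samples, threshold=UTILIZATION_THRESHOLD):
    """Return {gpu_id: mean utilization} for GPUs averaging below the threshold.

    `samples` maps a GPU index to a list of utilization percentages collected
    during the workload (hypothetical input format; collect it however suits
    your monitoring stack).
    """
    flagged = {}
    for gpu_id, values in samples.items():
        if not values:
            continue  # no readings for this GPU, nothing to judge
        average = mean(values)
        if average < threshold:
            flagged[gpu_id] = average
    return flagged


# Made-up readings for illustration only; these are not measured results.
example_samples = {
    0: [99, 100, 99],
    1: [91, 88, 93],
}
flagged_example = underutilized_gpus(example_samples)
```

With these made-up readings, only GPU 1 is flagged, which would be a hint to look for a bottleneck elsewhere in the infrastructure.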
Throughout the test, the images/sec throughput was monitored on one and on five ECS nodes. Based on the results from the TensorFlow benchmark, Figure 3 compares the image throughput when N GPUs were assigned to one ECS node and to five ECS nodes, where N is the number of GPUs used during the test. For example, with two GPUs, two tests were run: one against a single ECS node and one against five ECS nodes. Performance degradation began to show with ten GPUs assigned to a single ECS node: the logs showed a drop in the images/sec number that continued as the number of GPUs scaled up. The degradation is more visible in the graph with eleven and twelve GPUs against a single ECS EXF900 node.
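The one-node versus five-node comparison can be expressed as a small Python sketch. It assumes the five-node throughput at each GPU count serves as the unconstrained baseline and that a shortfall of more than 5% counts as degradation. The 5% tolerance, the function name `first_degraded_gpu_count`, and the throughput numbers are hypothetical and chosen only to make the logic visible; they are not the values plotted in Figure 3.

```python
def first_degraded_gpu_count(single_node, five_node, tolerance=0.05):
    """Return the smallest GPU count at which single-node throughput falls
    more than `tolerance` below the five-node throughput, or None.

    Both arguments map a GPU count N to the measured images/sec for that run.
    The tolerance is an assumed cutoff, not a figure from the benchmark.
    """
    for gpu_count in sorted(single_node):
        baseline = five_node.get(gpu_count)
        if baseline is None or baseline <= 0:
            continue  # no comparable five-node run for this GPU count
        if single_node[gpu_count] < baseline * (1 - tolerance):
            return gpu_count
    return None


# Synthetic images/sec values for illustration only (not measured results).
single_node_example = {8: 8000, 9: 9000, 10: 9200, 11: 9250}
five_node_example = {8: 8000, 9: 9000, 10: 10000, 11: 11000}

degraded_at = first_degraded_gpu_count(single_node_example, five_node_example)
```

With the synthetic values above, the function returns 10, because that is the first GPU count where the single-node run falls more than 5% short of the five-node run.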
This test shows that a single ECS EXF900 node can support up to nine NVIDIA A100 80GB GPUs without performance degradation.
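As a rough planning aid, the nine-GPUs-per-node result can be turned into a first-pass node count. The sketch below assumes the ratio holds linearly as nodes are added and that the workload resembles the image classification benchmark used here. Neither assumption is established by this test for other workloads, so treat the output as a starting point rather than a sizing rule. The function name `min_ecs_nodes` is hypothetical.

```python
import math

# From the test result: one ECS EXF900 node supported up to nine
# NVIDIA A100 80GB GPUs without performance degradation.
GPUS_PER_NODE = 9


def min_ecs_nodes(gpu_count, gpus_per_node=GPUS_PER_NODE):
    """Estimate the minimum number of ECS nodes for `gpu_count` GPUs.

    Assumes the per-node ratio scales linearly, which is an assumption of
    this sketch and not a measured property.
    """
    if gpu_count < 0:
        raise ValueError("gpu_count must be non-negative")
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    return math.ceil(gpu_count / gpus_per_node)


nodes_for_9 = min_ecs_nodes(9)    # fits on a single node
nodes_for_10 = min_ecs_nodes(10)  # one GPU over the ratio needs a second node
nodes_for_32 = min_ecs_nodes(32)
```

For workloads with a heavier I/O profile than the benchmark, pass a lower `gpus_per_node` value derived from your own measurements.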
Note: The tests did not show any performance degradation with the 5-node test configuration.