Here are some key considerations for DL training that we observed during the tests:
- Different CNN models generate different storage throughput requirements: Comparing the test results, the choice of CNN model can greatly change the storage bandwidth requirement. Complex CNN models (such as DeepLabv3+) may generate less storage throughput than simple CNN models (such as SSD), because each image requires more computation and the GPUs therefore consume data more slowly. A high-resolution dataset can also drive higher storage bandwidth needs even for the same CNN model. Dell Technologies published a whitepaper, Dell EMC PowerScale and NVIDIA DGX A100 Systems for Deep Learning, which provides sizing guidelines for different CNN models.
- Multiple GPUs and systems are particularly beneficial for large-dataset training: In our tests, the average training time per epoch on a 2.8 TB dataset dropped from 18 minutes to 3 minutes when scaling to 32 GPUs. In other words, multiple DGX A100 systems reduced the training time by 6x for large-dataset DL model development.
- Storage throughput requirements grow linearly with the number of GPUs used in training: Our test results show that storage throughput grows linearly with the number of GPUs participating in model training. When sizing DL infrastructure, it is therefore important to plan storage bandwidth for future GPU growth as well; see the sizing sketch after this list.
- Larger datasets generate higher bandwidth requirements during training: We observed that a dataset which exceeds the compute system's cache consistently generates higher bandwidth requirements, because the data must be re-read from storage on every epoch rather than served from memory. When sizing infrastructure for ADAS development, it is crucial to account for future data growth.
- NVIDIA DALI allows the training to run at full speed: GPU utilization is generally much higher than CPU utilization; in most of our training cases, the average GPU utilization reached up to 97%. Low GPU utilization is often caused by a CPU bottleneck: the CPUs are busy fetching data from storage into memory or performing many small pre-processing operations. DALI mitigates this bottleneck by offloading pre-processing to the GPU and overlapping it with training, which reduces latency and training time. DALI is primarily designed to run pre-processing on the GPU, but most of its operators also have a fast CPU implementation. A minimal pipeline sketch is shown after this list.
- Batch size should be large enough to reduce training time: We observed that a larger batch size decreases training time and yields higher GPU utilization. A larger batch size also increases storage throughput, so eliminating storage bottlenecks is likewise required to accelerate training. In addition, research by Facebook has shown that in distributed training, the linear learning-rate scaling rule is surprisingly effective across a broad range of batch sizes: when the batch size is multiplied by k, multiply the learning rate by k. For more information, refer to the Facebook paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, and see the scaling sketch after this list.
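
To illustrate the linear throughput scaling noted above, the following back-of-the-envelope calculation sizes storage bandwidth for a planned GPU count. The per-GPU read rate is a hypothetical placeholder; substitute the rate measured for your own model and dataset.

```python
# Back-of-the-envelope storage sizing sketch, assuming storage throughput
# scales linearly with GPU count as observed in the tests above.
per_gpu_read_mbps = 400            # hypothetical per-GPU read rate in MB/s
current_gpus, planned_gpus = 8, 32

print(f"today:  {per_gpu_read_mbps * current_gpus / 1000:.1f} GB/s")
print(f"future: {per_gpu_read_mbps * planned_gpus / 1000:.1f} GB/s")
```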
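The following is a minimal sketch of a DALI pipeline that decodes and augments images on the GPU so that pre-processing overlaps with training. The dataset path, batch size, and augmentation parameters are illustrative assumptions, not values from our tests.

```python
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=128, num_threads=4, device_id=0)
def training_pipeline(data_dir="/data/train"):  # hypothetical dataset path
    # Read encoded JPEGs from storage; labels come from the directory layout.
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True,
                                    name="Reader")
    # "mixed" starts decoding on the CPU and finishes on the GPU (nvJPEG).
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    # Typical augmentation and normalization, all executed on the GPU.
    images = fn.random_resized_crop(images, size=(224, 224))
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        mirror=fn.random.coin_flip())
    return images, labels.gpu()

pipe = training_pipeline()
pipe.build()
images, labels = pipe.run()  # each call returns one pre-processed batch
```

In practice the pipeline is wrapped in a framework iterator (for example, DALI's PyTorch or TensorFlow plugin) so that batches flow directly into the training loop.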
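Below is a minimal sketch of that linear scaling rule. The baseline values (learning rate 0.1 at batch size 256) follow the ResNet-50 recipe in the Facebook paper, while the target batch size is illustrative.

```python
def scaled_lr(base_lr: float, base_batch_size: int,
              global_batch_size: int) -> float:
    """Linear scaling rule: when the global batch size is multiplied
    by k, multiply the learning rate by k."""
    return base_lr * global_batch_size / base_batch_size

# Example: baseline lr 0.1 at batch size 256 (as in the paper), scaled
# to a global batch of 32 GPUs x 64 samples = 2048 samples per step.
print(scaled_lr(0.1, 256, 2048))  # -> 0.8
```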