Here are some key considerations for DL training that we observed during the tests:
- Different CNN models generate different throughput requirements: Comparing the test results, the choice of CNN model can greatly change the storage bandwidth requirement. A complex CNN model (like DeepLabv3+) may generate less storage throughput than a simpler model (like SSD), because the GPUs spend more compute time per image and therefore read fewer images per second. A high-resolution dataset also generates higher storage bandwidth needs for the same CNN model. Dell Technologies published a whitepaper, Deep Learning with Dell Isilon, which provides sizing guidelines for different CNN models.
- Multiple GPUs and servers are particularly beneficial for large-dataset training: In our tests, the training time for one epoch on a 3 TB dataset dropped from 42 minutes to 4 minutes when scaling to 64 GPUs, roughly a 10x reduction in training time for large-dataset DL model development.
- Storage throughput requirements grow linearly with the number of GPUs: Our test results show that storage throughput grows linearly with the number of GPUs used during model training. When sizing DL infrastructure, it is important to plan storage bandwidth for future GPU growth as well.
- Larger datasets generate higher bandwidth requirements during training: We observed that a dataset that exceeds the compute server's cache consistently generates a higher storage bandwidth requirement, because reads must come from storage rather than from memory. When sizing your infrastructure, it is crucial to account for future data growth, especially for ADAS development.
- NVIDIA DALI allows the training to run at full speed: GPU utilization is generally much higher than CPU utilization; in most of our training cases, average GPU utilization reached up to 97%. Low GPU utilization is often caused by a CPU bottleneck: the CPU is busy fetching data from storage into memory or performing many small operations. DALI mitigates this bottleneck by overlapping pre-processing with training, which reduces latency and training time and keeps the GPUs running at full speed. DALI is primarily designed to run pre-processing on the GPU, but most operators also have a fast CPU implementation.
- Batch size should be large enough to reduce training time: We observed that a larger batch size decreases training time and raises GPU utilization. A larger batch size also increases storage throughput, so eliminating storage bottlenecks is likewise required to accelerate training. Research from Facebook has also shown that in distributed training, the linear learning-rate scaling rule is surprisingly effective across a broad range of batch sizes: when the batch size is multiplied by k, multiply the learning rate by k. For more information, refer to the Facebook paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
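The linear relationship between GPU count and storage throughput noted above can be turned into a rough sizing sketch. The per-GPU throughput and headroom values below are hypothetical planning numbers, not measurements from these tests; substitute figures measured on your own workload:

```python
def required_storage_throughput_mbps(num_gpus, per_gpu_mbps=200.0, headroom=1.2):
    """Estimate aggregate storage bandwidth for a training cluster.

    per_gpu_mbps: assumed read throughput a single GPU drives during training
                  (hypothetical planning value; measure your own workload).
    headroom:     safety factor for bursts and future growth.
    """
    return num_gpus * per_gpu_mbps * headroom

# Because the requirement scales linearly, 8x the GPUs needs 8x the bandwidth:
print(required_storage_throughput_mbps(8))   # 1920.0 (MB/s)
print(required_storage_throughput_mbps(64))  # 15360.0 (MB/s)
```

Planning storage for the GPU count you expect to grow into, rather than the count you start with, follows directly from this linear relationship.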
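The overlap between pre-processing and training that DALI provides can be illustrated with a minimal prefetching sketch in plain Python. This is not the DALI API; it only demonstrates the pipelining idea of preparing the next batch in the background while the current one is being consumed (`load_and_preprocess` and the inline train step are hypothetical stand-ins):

```python
import queue
import threading

def load_and_preprocess(batch_id):
    # Stand-in for the decode/augment work normally done by the CPU or DALI.
    return [x * x for x in range(batch_id, batch_id + 4)]

def producer(batch_queue, num_batches):
    # Background thread: prepares batches ahead of the training loop.
    for i in range(num_batches):
        batch_queue.put(load_and_preprocess(i))
    batch_queue.put(None)  # sentinel: no more batches

def train(num_batches=3, prefetch_depth=2):
    batch_queue = queue.Queue(maxsize=prefetch_depth)
    threading.Thread(
        target=producer, args=(batch_queue, num_batches), daemon=True
    ).start()
    results = []
    while True:
        batch = batch_queue.get()
        if batch is None:
            break
        # Stand-in train step: the GPU would consume this batch while the
        # producer thread is already preparing the next one.
        results.append(sum(batch))
    return results

print(train())  # [14, 30, 54]
```

The bounded queue is the key design choice: it lets the producer run ahead by `prefetch_depth` batches so the consumer rarely waits, without buffering the whole dataset in memory.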
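The linear learning-rate scaling rule from the Facebook paper can be written down in a few lines. The base learning rate and base batch size below are illustrative assumptions, not settings from these tests:

```python
def scaled_learning_rate(batch_size, base_lr=0.1, base_batch_size=256):
    """Linear scaling rule: when the batch size is multiplied by k,
    multiply the learning rate by k (base values are illustrative)."""
    return base_lr * batch_size / base_batch_size

print(scaled_learning_rate(256))   # 0.1
print(scaled_learning_rate(2048))  # 0.8
```

In practice the scaled rate is usually reached via a short warmup period rather than applied from the first iteration, as the paper describes.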