To measure the training time and bandwidth requirements of the distributed DL platform, we executed several training procedures using different configuration setups.
Real-time object detection model training test methodology
Here are some key test methodologies that we used during the training benchmark:
- To measure the distributed training performance of the solution, we used the SSD (Single Shot MultiBox Detector) model from the NVIDIA DL Examples GitHub repository. This benchmark trains on augmented Cityscapes labeled images; the dataset contains 972,825 training images totaling 2.8 TB and is commonly used by automotive DL researchers for benchmarking and comparison studies. The solution used CityscapesScripts to convert the annotations from the standard PNG format to COCO format for SSD training; an illustrative sketch of that conversion follows.
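As a rough illustration of what that conversion involves (this is not the CityscapesScripts code itself, and the function name is hypothetical): Cityscapes instance-ID PNGs encode each instance as classId * 1000 + instanceNumber, which maps naturally onto COCO-style bounding-box annotations:

# Illustrative sketch only: derive COCO-style bounding boxes from a
# Cityscapes instance-ID PNG. Not the CityscapesScripts implementation.
import numpy as np
from PIL import Image

def png_to_coco_annotations(png_path, image_id, start_ann_id=1):
    ids = np.array(Image.open(png_path))       # per-pixel instance IDs
    anns = []
    ann_id = start_ann_id
    for inst in np.unique(ids):
        if inst < 1000:                        # IDs below 1000 carry no instance
            continue
        ys, xs = np.nonzero(ids == inst)
        x, y = int(xs.min()), int(ys.min())
        w, h = int(xs.max()) - x + 1, int(ys.max()) - y + 1
        anns.append({
            "id": ann_id,
            "image_id": image_id,
            "category_id": int(inst) // 1000,  # class is encoded in the ID
            "bbox": [x, y, w, h],              # COCO format: [x, y, width, height]
            "area": int(xs.size),
            "iscrowd": 0,
        })
        ann_id += 1
    return anns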
- Training SSD requires computationally costly augmentations. To fully utilize the GPUs during training, we used the NVIDIA DALI library to accelerate the data preparation pipeline; a minimal pipeline sketch follows.
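The sketch below shows the shape of such a pipeline: decoding and augmentation run on the GPU so that data preparation keeps pace with training. It is an assumption-laden illustration, not the exact pipeline from the NVIDIA DL Examples SSD code; the reader layout, input size, and normalization constants are assumed (the usual SSD300/ImageNet values).

# Minimal DALI pipeline sketch; constants and file layout are assumptions.
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@pipeline_def(batch_size=16, num_threads=4, device_id=0)
def ssd_input_pipeline(data_dir):
    images, labels = fn.readers.file(file_root=data_dir, random_shuffle=True)
    images = fn.decoders.image(images, device="mixed")       # decode on GPU where possible
    images = fn.resize(images, resize_x=300, resize_y=300)   # SSD300 input size
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        mirror=fn.random.coin_flip(),                        # random horizontal flip
    )
    return images, labels

pipe = ssd_input_pipeline(data_dir="/mnt/cityscapes/ssd3T")
pipe.build()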
- Scale-out training runs were performed on hardware configurations ranging from one DGX A100 system with eight A100 GPUs to four DGX A100 systems with 32 A100 GPUs. This enabled us to measure training time and throughput and provided a basic understanding of the training performance.
- To measure the training time across multiple GPUs and determine the corresponding bandwidth requirement, we trained the SSD model for multiple epochs with the following setup (a sketch of the resulting optimizer configuration follows the list):
- SGD with momentum: 0.9
- learning rate: 2.6e-3 * number of GPUs * (batch_size / 32)
- batch size: 16 per GPU
- number of worker threads: 20
- no warmup
- backbone: ResNet-50
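Put together, the setup above amounts to the following optimizer configuration (a sketch with a placeholder model; the real run uses the SSD network with its ResNet-50 backbone):

# Sketch of the optimizer configuration described above.
import torch

num_gpus = 32                                 # four DGX A100 systems, eight GPUs each
batch_size = 16                               # per-GPU batch size
lr = 2.6e-3 * num_gpus * (batch_size / 32)    # linear scaling rule from the list above

model = torch.nn.Linear(8, 8)                 # placeholder for the SSD network
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
print(lr)                                     # 0.0416 at 32 GPUs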
- Prior to each execution of the benchmark, the L1 and L2 caches on the Isilon F800 were flushed with the command isi_for_array isi_flush. It is worth noting, however, that the training process reads the same files repeatedly, so after just a few minutes much of the data is served from one of these caches.
- Multi-GPU training with Distributed Data Parallel: the NVIDIA model uses Apex's DDP to implement efficient multi-GPU training with NCCL; a minimal sketch of the wiring follows.
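The sketch below shows the assumed per-process wiring, not the repository's exact code; torch.distributed.launch starts one such process per GPU.

# Minimal Apex DDP sketch. Assumes the launcher exports LOCAL_RANK
# (torch.distributed.launch does so with --use_env; newer PyTorch versions
# set it unconditionally).
import os
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")       # NCCL handles the GPU collectives

model = torch.nn.Linear(8, 8).cuda()          # placeholder for the SSD network
model = DDP(model)                            # gradients are all-reduced across GPUs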
- In addition, we used the following command to run the training benchmark:
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=3 --node_rank=0 \
    main.py --batch-size 32 \
    --mode training \
    --num-workers 20 \
    --epochs 20 \
    --data /mnt/cityscapes/ssd3T
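Note that this command starts eight worker processes (one per GPU) only on the node where it is run; the same command must be launched on every node in the job, with --node_rank set to a unique value from 0 to nnodes-1 and --master_addr/--master_port pointing at the rank-0 node so that the processes can rendezvous.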
Semantic segmentation model training test methodology
Here are some key test methodologies that we used during the training benchmark:
- To measure the distributed training performance of the solution, we used the NVIDIA semantic segmentation model from its GitHub repository to train the model. This training was performed on augmented Cityscapes labeled images; the dataset contains 2,025,975 training images totaling 5.4 TB.
- Scale-out training runs were performed to measure training time and throughput and provide a basic understanding of the training performance. Hardware configurations for this testing ranged from one NVIDIA DGX A100 system with eight A100 GPUs to four NVIDIA DGX A100 systems with 32 A100 GPUs.
- To measure the training time across multiple GPUs and evaluate the bandwidth requirement, we trained the semantic segmentation model for multiple epochs with the following setup:
- batch size: 8 per GPU
- number of worker threads: 10
- no warmup
- architecture: DeepV3Plus (deepv3.DeepV3PlusW38)
- Prior to each execution of the benchmark, the L1 and L2 caches on the Isilon F800 were flushed with the command isi_for_array isi_flush. In addition, the Linux buffer cache was flushed on all compute nodes by running sync; echo 3 > /proc/sys/vm/drop_caches. Note, however, that the training process reads the same files repeatedly, so after just a few minutes much of the data is served from one of these caches.
- Multi-GPU training with Distributed Data Parallel: the NVIDIA model uses Apex's DDP to implement efficient multi-GPU training with NCCL. For example, we used the following command to run the training benchmark:
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=4 --node_rank=0 \
    train.py --bs_trn 8 \
    --apex \
    --fp16 \
    --crop_size "800,800" \
    --num_workers 10 \
    --max_epoch 20 \
    --arch deepv3.DeepV3PlusW38
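The --apex and --fp16 flags enable NVIDIA Apex automatic mixed precision. A minimal sketch of that wiring (assumed, not the repository's exact code; a placeholder model and loss stand in for DeepV3PlusW38 and the segmentation loss) looks like this:

# Minimal Apex mixed-precision sketch.
import torch
from apex import amp

model = torch.nn.Conv2d(3, 19, kernel_size=1).cuda()   # 19 Cityscapes classes
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# Patch the model/optimizer so forward passes run in FP16 where safe ("O1").
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

images = torch.randn(8, 3, 800, 800, device="cuda")    # matches --crop_size "800,800"
logits = model(images)
loss = logits.float().mean()                           # placeholder loss

# Scale the loss so FP16 gradients do not underflow, then step as usual.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()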