To measure the training time and bandwidth requirements of the distributed DL platform, different training procedures were carefully executed using different configuration setups.
Real-time object detection model training test methodology
Here are some key test methodologies that we used during the training benchmark:
- To measure the distributed training performance of the solution, we used the SSD (Single Shot MultiBox Detector) model from the NVIDIA DL Examples GitHub repository (refer to the link). This benchmark performs training on augmented, labeled Cityscapes images. The dataset contains 972,825 training images totaling 3 TB and is commonly used by automotive DL researchers for benchmarking and comparison studies. The solution used CityscapesScripts to convert the annotations from the standard PNG format to the COCO format required for SSD training.
- Training SSD requires computationally costly augmentations. To fully utilize the GPUs during training, we used the NVIDIA DALI library to accelerate the data preparation pipelines (see the pipeline sketch after the benchmark command at the end of this list).
- Scale-out training runs were performed from one Dell C4140 compute node with four V100 GPUs up to 16 Dell C4140 compute nodes with 64 V100 GPUs to measure training time and throughput and provide a basic understanding of the training performance.
- To measure the training time across multiple GPUs and the bandwidth requirement, we trained the SSD model for 1 epoch and 2 epochs with the following setup (a worked example of the learning-rate scaling follows the benchmark command at the end of this list):
SGD with momentum: 0.9
Learning rate: 2.6e-3 * number of GPUs * (batch_size / 32)
Batch size: 16 per GPU
Number of worker threads: 20
No warmup
ResNet-50 is used as the backbone
- Prior to each execution of the benchmark, the L1 and L2 caches on the Isilon F800 were flushed with the command isi_for_array isi_flush. In addition, the Linux buffer cache was flushed on all compute nodes by running sync; echo 3 > /proc/sys/vm/drop_caches. Note, however, that the training process reads the same files repeatedly, so after just a few minutes much of the data is served from one of these caches.
- Multi-GPU training with Distributed Data Parallel: the NVIDIA model uses Apex's DistributedDataParallel (DDP) to implement efficient multi-GPU training over NCCL (a minimal sketch follows the benchmark command below).
- In addition to training the model, we also used the following command to run the training benchmark:
python -m torch.distributed.launch --nproc_per_node={number of GPUs} --nnodes={number of compute nodes} --node_rank={node_rank} \
main.py --batch-size 16 \
--mode benchmark-training \
--benchmark-warmup 100 \
--benchmark-iterations 200 \
--data {data}
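The following is a minimal sketch of the kind of DALI pipeline used to accelerate data preparation for SSD training, as noted in the DALI bullet above. It is illustrative only: the operator parameters, file paths, and function name are assumptions rather than the exact pipeline from the NVIDIA SSD example, and it assumes a recent DALI release.

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@pipeline_def(batch_size=16, num_threads=20, device_id=0)
def ssd_train_pipeline(data_dir, annotations_file):
    # Read images plus COCO-format bounding boxes and labels
    images, bboxes, labels = fn.readers.coco(
        file_root=data_dir,
        annotations_file=annotations_file,
        random_shuffle=True,
        name="Reader")
    # Decode on the GPU ("mixed" = CPU parsing, GPU decode)
    images = fn.decoders.image(images, device="mixed")
    # Resize to the 300x300 input expected by SSD300 and normalize
    images = fn.resize(images, resize_x=300, resize_y=300)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images, bboxes, labels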
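As a worked example of the hyperparameters listed above, the snippet below shows the SGD optimizer with momentum 0.9 and the learning-rate scaling rule evaluated for the largest configuration; the linear layer is only a placeholder for the actual SSD300 model with ResNet-50 backbone.

import torch

num_gpus = 64                                 # e.g., 16 x C4140 nodes with 4 x V100 each
batch_size = 16                               # per-GPU batch size
lr = 2.6e-3 * num_gpus * (batch_size / 32)    # = 0.0832 for the 64-GPU configuration

model = torch.nn.Linear(128, 128)             # placeholder for the SSD300 model
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)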
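The Apex distributed data parallel bullet above refers to the following general pattern. This sketch assumes each process is started by torch.distributed.launch as in the command above; the model is a placeholder, not the NVIDIA example code.

import argparse
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

# torch.distributed.launch passes --local_rank to every process it spawns
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")  # NCCL for GPU all-reduce

model = torch.nn.Linear(128, 128).cuda()  # placeholder for the SSD300 model
model = DDP(model)                        # gradients are all-reduced across GPUs during backward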
Semantic image segmentation model training test methodology
Here are some key test methodologies that we used during the training benchmark:
- To measure the distributed training performance of the solution, we used the deeplab-v3-plus model from the GitHub repository. Training was performed on augmented, labeled Cityscapes images. The dataset contains 1,657,075 training images totaling 5 TB.
- Scale-out training runs were performed to measure training time and throughput and provide a basic understanding of the training performance. Hardware configurations used for this testing ranged from one Dell C4140 compute node with four V100 GPUs to 16 Dell C4140 compute nodes with 64 V100 GPUs.
- To measure the training time across multiple GPUs and evaluate the bandwidth requirement, we trained the deeplab-v3-plus model for 1 epoch and 2 epochs with the following setup:
Batch size: 16 per GPU
Number of worker threads: 20
No warmup
ResNet-101 is used as the backbone
- Prior to each execution of the benchmark, the L1 and L2 caches on the Isilon F800 were flushed with the command isi_for_array isi_flush. In addition, the Linux buffer cache was flushed on all compute nodes by running sync; echo 3 > /proc/sys/vm/drop_caches. Note, however, that the training process reads the same files repeatedly, so after just a few minutes much of the data is served from one of these caches.
- Uber Horovod was used for distributed data parallel training. Horovod is a distributed DL training framework for PyTorch and other DL frameworks. Horovod uses Message Passing Interface (MPI) for high-speed, low-latency communication between processes (a minimal Horovod setup sketch follows the command below). For example, we used the following command to run the training benchmark:
mpirun -np {number of processes} \
--mca oob_tcp_if_include {NIC}\
--mca btl_tcp_if_include {NIC}\
-H nodexx:4,nodexx:4,nodexx:4,nodexx:4,nodexx:4,nodexx:4,nodexx:4,nodexx:4,nodexx:4,nodexx:4,nodexx:4,nodexx:4,nodexx:4,nodexx:4,nodexx:4,nodexx:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python trainhorovod.py --dataset cityscapes --workers 20 --batch-size 16 --epochs 2
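As a reference for the Horovod bullet above, the sketch below shows the typical Horovod initialization performed in each MPI process started by the mpirun command; the model, learning rate, and variable names are placeholders rather than the exact code from the deeplab-v3-plus repository.

import torch
import horovod.torch as hvd

hvd.init()                               # one MPI process per GPU, launched by mpirun above
torch.cuda.set_device(hvd.local_rank())  # pin each process to its local GPU

model = torch.nn.Linear(128, 128).cuda() # placeholder for the DeepLab-v3+ model (ResNet-101 backbone)
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01 * hvd.size(),  # learning-rate scaling here is an assumption
                            momentum=0.9)

# Average gradients across all workers and broadcast the initial state from rank 0
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)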