Benchmark methodology
To measure the performance of the solution, the image classification benchmark from the TensorFlow Benchmarks repository was run. This benchmark trains an image classification convolutional neural network (CNN) on labeled images; essentially, the system learns whether an image contains a cat, dog, car, or train. The well-known ILSVRC2012 dataset (often referred to as ImageNet) was used. It contains 1,281,167 training images totaling 144.8 GB, grouped into 1,000 categories or classes, and is commonly used by DL researchers for benchmarking and comparison studies.
The individual JPG images in the ImageNet dataset were converted into 1,024 TFRecord files. A TFRecord file is a binary container of records; each record here is a serialized Protocol Buffers (tf.train.Example) message that combines a JPG image with its metadata (the label and the bounding box used for cropping). Because the JPG bytes are stored as-is, the conversion preserves the compression offered by the JPG format, and the total size of the dataset remained roughly the same (148 GB). The average image size was 115 KB.
There are many ways to parallelize model training to take advantage of multiple GPUs across multiple servers. In our tests, we used MPI and Horovod.
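Horovod combines the gradients computed on each GPU after every training step with an allreduce, typically implemented as a ring allreduce (a reduce-scatter phase followed by an allgather phase). The following pure-Python toy simulates that communication pattern on plain lists standing in for per-worker gradient tensors; it illustrates the algorithm only and is not Horovod's actual implementation:

```python
def ring_allreduce(grads):
    """Toy simulation of ring-allreduce averaging across n workers.

    grads: one equal-length list of numbers per worker. Returns each
    worker's final copy, which is the elementwise mean of all inputs.
    """
    n, m = len(grads), len(grads[0])
    bounds = [i * m // n for i in range(n + 1)]  # n contiguous chunks
    data = [list(g) for g in grads]

    # Reduce-scatter: at step t, worker w sends chunk (w - t) % n to its
    # ring neighbor, which adds it. After n-1 steps, worker w holds the
    # complete sum of chunk (w + 1) % n.
    for step in range(n - 1):
        sends = [(w, (w - step) % n) for w in range(n)]
        payloads = [data[w][bounds[c]:bounds[c + 1]] for w, c in sends]
        for (w, c), payload in zip(sends, payloads):
            dst = (w + 1) % n
            for i, v in enumerate(payload):
                data[dst][bounds[c] + i] += v

    # Allgather: circulate each completed chunk around the ring so that
    # every worker ends up with the full summed gradient.
    for step in range(n - 1):
        for w in range(n):
            c = (w + 1 - step) % n
            data[(w + 1) % n][bounds[c]:bounds[c + 1]] = \
                data[w][bounds[c]:bounds[c + 1]]

    # Divide by the worker count to turn sums into averages.
    return [[v / n for v in d] for d in data]
```

Each worker exchanges only 2(n-1)/n of the gradient size per allreduce regardless of worker count, which is why this pattern scales well across many GPUs and servers.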
Prior to each execution of the benchmark, the L1 and L2 caches on PowerScale were flushed with the command isi_for_array isi_flush. In addition, the Linux buffer cache was flushed on all R7535 servers by running sync; echo 3 > /proc/sys/vm/drop_caches.
The following commands were used to perform the ResNet-50 (V1.0) training with 23 GPUs.
vardate=$(/bin/date '+%Y-%m-%d-%H-%M-%S')
mkdir -p /mnt/isilon/data/imagenet-scratch/train_dir/${vardate}-resnet50
mpirun --n 23 \
--allow-run-as-root \
--host hop-r7525-01:3,hop-r7525-03:3,hop-r7525-04:3,hop-r7525-05:3,hop-r7525-06:3,hop-r7525-07:3,hop-r7525-08:3,hop-r7525-02:2 \
--report-bindings \
-bind-to none \
-map-by slot \
-x LD_LIBRARY_PATH \
-x PATH \
-mca plm_rsh_agent ssh \
-mca plm_rsh_args "-p 2222" \
-mca pml ob1 \
-mca btl ^openib \
-mca btl_tcp_if_include ens3f1 \
-x NCCL_DEBUG=INFO \
-x NCCL_IB_HCA=mlx5_1 \
-x NCCL_SOCKET_IFNAME=^docker0,lo,eno,ens3f0 \
/mnt/isilon/data/ai-benchmark-util/round_robin_mpi.py \
python3 -u /root/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--model=resnet50 \
--batch_size=512 \
--batch_group_size=20 \
--num_batches=500 \
--nodistortions \
--num_gpus=1 \
--device=gpu \
--force_gpu_compatible=True \
--fuse_decode_and_crop=True \
--data_format=NCHW \
--use_fp16=True \
--use_tf_layers=True \
--data_name=imagenet \
--use_datasets=True \
--num_intra_threads=1 \
--num_inter_threads=40 \
--datasets_prefetch_buffer_size=40 \
--datasets_num_private_threads=4 \
--train_dir=/mnt/isilon/data/imagenet-scratch/train_dir/${vardate}-resnet50 \
--sync_on_finish=True \
--summary_verbosity=1 \
--save_summaries_steps=100 \
--save_model_secs=600 \
--variable_update=horovod \
--horovod_device=gpu \
--data_dir=/mnt/isilon1/data/imagenet-scratch/tfrecords \
--data_dir=/mnt/isilon2/data/imagenet-scratch/tfrecords \
--data_dir=/mnt/isilon3/data/imagenet-scratch/tfrecords
The script round_robin_mpi.py selects a single --data_dir value for each process, distributing the processes across the three different mount points.
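The idea can be sketched as follows. This is a hypothetical reimplementation (the real round_robin_mpi.py may differ): using the MPI rank, keep exactly one of the repeated --data_dir arguments and drop the rest before handing the command line to the benchmark:

```python
def pick_data_dir(argv, rank):
    """Keep one of the repeated --data_dir=... arguments, chosen
    round-robin by MPI rank; pass every other argument through.

    Hypothetical sketch of the idea behind round_robin_mpi.py.
    """
    dirs = [a for a in argv if a.startswith("--data_dir=")]
    keep = dirs[rank % len(dirs)]
    return [a for a in argv
            if not a.startswith("--data_dir=") or a == keep]
```

With three mount points, ranks 0, 3, 6, ... read through the first mount, ranks 1, 4, 7, ... through the second, and so on, spreading the NFS traffic across the cluster's front-end interfaces.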
For different numbers of GPUs, only the --n parameter was changed. Note that the -map-by slot setting causes MPI to use all 3 GPUs (slots) on an R7525 server before it begins using the next server.
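That slot-filling placement can be modeled in a few lines. The following toy function (not part of the benchmark tooling) reproduces the behavior described above for the --host list used in our command:

```python
def map_by_slot(host_slots, n):
    """Toy model of the -map-by slot placement described in the text:
    fill every slot on a host before moving to the next host in the
    --host list, stopping after n ranks.

    host_slots: list of (hostname, slot_count) in --host order.
    Returns the host assigned to each rank 0..n-1.
    """
    placement = []
    for host, slots in host_slots:
        placement.extend([host] * slots)  # ranks fill this host first
    return placement[:n]
```

For our 23-GPU run, ranks 0-2 land on hop-r7525-01, ranks 3-5 on hop-r7525-03, and so on, with the final two ranks on hop-r7525-02 (the host given only 2 slots).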
Note that during our tests, only ResNet-50 v1.0 was used, for a simple reason: this model generates the highest throughput, and our goal was to evaluate storage performance. If you are interested, you can easily test other models by changing the --model parameter value (for example, to resnet152, vgg16, inception3, inception4, or googlenet).