Benchmark methodology
To measure the performance of the solution, the image classification benchmark from the TensorFlow Benchmarks repository was run. This benchmark trains an image classification convolutional neural network (CNN) on labeled images; essentially, the system learns whether an image contains a cat, dog, car, or train. The well-known ILSVRC2012 image dataset (often referred to as ImageNet) was used. This dataset contains 1,281,167 training images totaling 144.8 GB, grouped into 1,000 categories or classes. DL researchers commonly use this dataset for benchmarking and comparison studies.
The individual JPG images in the ImageNet dataset were converted to 1024 TFRecord files. The TFRecord file format is a Protocol Buffers binary format that combines multiple JPG image files with their metadata (bounding box for cropping and label) into one binary file. It preserves the image compression offered by the JPG format, so the total size of the dataset remained roughly the same (148 GB). The average image size was 115 KB.
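For illustration, the following is a minimal sketch of how such TFRecord files can be read back with the tf.data API. The file pattern and the feature keys (image/encoded, image/class/label) are assumptions based on the commonly used ImageNet conversion scripts; this is not part of the benchmark code itself.

import tensorflow as tf

# Sketch: read ImageNet-style TFRecord files and decode one JPG image per record.
# Adjust the feature keys and file pattern if your records were produced differently.
def parse_record(serialized_example):
    features = tf.io.parse_single_example(
        serialized_example,
        features={
            'image/encoded': tf.io.FixedLenFeature([], tf.string),
            'image/class/label': tf.io.FixedLenFeature([], tf.int64),
        })
    image = tf.io.decode_jpeg(features['image/encoded'], channels=3)
    label = features['image/class/label']
    return image, label

# Illustrative path only; the benchmark reads the records from s3://imagenet/train.
files = tf.data.Dataset.list_files('s3://imagenet/train/train-*')
dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=8)
dataset = dataset.map(parse_record,
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)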
There are various ways to parallelize model training to take advantage of multiple GPUs across multiple servers. In our tests, we used MPI and Horovod.
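The tf_cnn_benchmarks script implements this distribution internally when --variable_update=horovod is set. For reference, the general Horovod data-parallel pattern with TensorFlow 1.x looks roughly like the following sketch; it is not code from the benchmark, and the optimizer and learning rate are placeholders.

import tensorflow as tf
import horovod.tensorflow as hvd

# Generic Horovod pattern: one MPI process per GPU.
hvd.init()

# Pin each process to a single GPU, selected by its local rank on the server.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across all processes with allreduce.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

# Broadcast the initial variables from rank 0 so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]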
Prior to each execution of the benchmark, the Linux buffer cache was flushed on all DSS 8440 servers by running:
sync; echo 3 > /proc/sys/vm/drop_caches
The following commands were used to perform ResNet-50 (v1.0) training with sixteen GPUs.
vardate=$(/bin/date '+%Y-%m-%d-%H-%M-%S')
mkdir /imagenet-scratch/train_dir/${vardate}-resnet50
env AWS_LOG_LEVEL=0 \
S3_USE_HTTPS=0 \
S3_VERIFY_SSL=0 \
AWS_ACCESS_KEY_ID=imagenet \
AWS_SECRET_ACCESS_KEY="mysecretkey" \
mpirun \
--n 16 \
-allow-run-as-root \
--host dl-worker-01:2,dl-worker-02:2,dl-worker-03:2,dl-worker-04:2,dl-worker-05:2,dl-worker-06:2,dl-worker-07:2,dl-worker-08:2 \
--report-bindings \
-bind-to none \
-map-by slot \
...
./round_robin_ecs.py \
python3 \
-u \
/tensorflow-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--model=resnet50 \
--batch_size=1280 \
--num_batches=500 \
--nodistortions \
--num_gpus=1 \
--device=gpu \
--data_format=NCHW \
--use_fp16=True \
--data_name=imagenet \
--use_datasets=True \
--datasets_num_private_threads=32 \
--train_dir=/imagenet-scratch/train_dir/${vardate}-resnet50 \
--sync_on_finish=True \
--summary_verbosity=1 \
--variable_update=horovod \
--horovod_device=gpu \
--data_dir=s3://imagenet/train \
--ecs_node=10.1.2.1 \
--ecs_node=10.1.2.2 \
--ecs_node=10.1.2.3 \
--ecs_node=10.1.2.4 \
--ecs_node=10.1.2.5
The script round_robin_ecs.py was used to parse the --ecs_node parameters and distribute the processes across five different ECS nodes. The script sets the S3_ENDPOINT environment variable to a different --ecs_node value in each process, based on the OMPI_COMM_WORLD_RANK that mpirun assigns at run time. For more details about the OMPI_COMM_WORLD_RANK environment variable, see the official Open MPI documentation.
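The exact contents of round_robin_ecs.py are not reproduced here, but based on the behavior described above, a minimal sketch of such a wrapper could look like the following. The argument handling and the bare host value assigned to S3_ENDPOINT are assumptions for illustration only.

#!/usr/bin/env python3
# Sketch: round-robin the S3_ENDPOINT across ECS nodes, one per MPI rank.
# Strips the --ecs_node arguments, picks one node based on OMPI_COMM_WORLD_RANK,
# exports it as S3_ENDPOINT, and then execs the remaining command line.
import os
import sys

ecs_nodes = []
command = []
for arg in sys.argv[1:]:
    if arg.startswith('--ecs_node='):
        ecs_nodes.append(arg.split('=', 1)[1])
    else:
        command.append(arg)

rank = int(os.environ.get('OMPI_COMM_WORLD_RANK', '0'))
os.environ['S3_ENDPOINT'] = ecs_nodes[rank % len(ecs_nodes)]

# Replace this process with the wrapped command (for example, python3 -u tf_cnn_benchmarks.py ...).
os.execvp(command[0], command)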
For different numbers of GPUs, only the --n parameter was changed. The -map-by slot setting causes MPI to fill all GPU slots on the current server before moving on to the next server; for example, with --n 8 and two slots per host, ranks 0 and 1 run on dl-worker-01, ranks 2 and 3 on dl-worker-02, and so on.
During our tests, only ResNet-50 v1.0 was used, for a simple reason: this model generates the highest throughput, and our goal was to evaluate storage performance. If you are interested, you can easily test other models by changing the --model parameter value (resnet50, resnet152, vgg16, inception3, inception4, googlenet, and so on).