Storage profile of deep learning training
The storage I/O profile of the training benchmark is relatively simple. From the perspective of the storage subsystem, each 1.41 GB TFRecord file is opened and read sequentially in its entirety, from beginning to end, and multiple TFRecord files are read concurrently. The order in which files are opened is effectively random. As measured with the Linux lsof (list open files) command, a four-GPU compute node reads 120 files concurrently (30 files per GPU), so four such compute nodes would read 480 files concurrently.
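The concurrency arithmetic above can be captured in a small helper. This is a sketch, not part of the benchmark code; the function name is ours, and the default of 30 files per GPU is the value measured with lsof in the text:

```python
def concurrent_open_files(num_nodes, gpus_per_node=4, files_per_gpu=30):
    """Estimate the number of TFRecord files read concurrently across the
    cluster, based on the lsof measurement of 30 open files per GPU."""
    return num_nodes * gpus_per_node * files_per_gpu

# One four-GPU node reads 120 files; four such nodes read 480.
print(concurrent_open_files(1))  # 120
print(concurrent_open_files(4))  # 480
```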
If the images/sec throughput is known, the required storage throughput can be estimated by multiplying it by the average image size of 113 KB. For instance, 1,000 images/sec * 113 KB/image = 113 MB/sec. Note that this does not account for caching at the compute nodes or the Isilon nodes, which would reduce the required disk throughput.
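The sizing rule above can be written as a one-line helper. This is an illustrative sketch (the function name is ours); it uses decimal units (1 MB = 1000 KB), matching the 1,000 images/sec to 113 MB/sec example in the text:

```python
def required_storage_throughput_mb_per_sec(images_per_sec, avg_image_kb=113):
    """Required storage throughput in MB/sec for a given training rate.

    Ignores caching at the compute or Isilon nodes, so it is an upper
    bound on the disk throughput actually needed.
    """
    return images_per_sec * avg_image_kb / 1000.0

print(required_storage_throughput_mb_per_sec(1000))  # 113.0 MB/sec
```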
Another aspect to consider when designing deep learning infrastructure is the generation of the TFRecord files. Although this step is not part of the TensorFlow CNN benchmark results, it is a very I/O-intensive operation that must be completed before training can begin. To generate the TFRecord files, the directories containing the individual JPEG images are listed, the file names are shuffled into random order, and multiple processes then read the JPEG files while writing TFRecord files. This work can be parallelized effectively across multiple compute nodes with MPI or another parallel computing framework.
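The shuffle-and-shard step above can be sketched in Python. This is a minimal single-node illustration, not the benchmark's own conversion script: the file paths are hypothetical, the worker is a placeholder for the real JPEG-to-TFRecord conversion (which would use tf.io.TFRecordWriter), and process-level parallelism stands in for MPI:

```python
import random
from multiprocessing import Pool

def shard_files(jpeg_files, num_shards, seed=0):
    """Shuffle the JPEG file list and split it into roughly equal shards.

    Each shard corresponds to one output TFRecord file; shuffling up
    front gives every shard a random mix of images, as described above.
    """
    files = list(jpeg_files)
    random.Random(seed).shuffle(files)
    return [files[i::num_shards] for i in range(num_shards)]

def write_shard(args):
    """Placeholder worker: a real pipeline would read each JPEG here and
    serialize it into a TFRecord shard (e.g. via tf.io.TFRecordWriter)."""
    shard_index, files = args
    # ... read JPEGs, build tf.train.Example records, write the shard ...
    return shard_index, len(files)

if __name__ == "__main__":
    # Hypothetical input listing; in practice this comes from listing the
    # directories of individual JPEG images.
    jpegs = [f"train/img_{i:05d}.jpg" for i in range(1000)]
    shards = shard_files(jpegs, num_shards=8)
    with Pool(processes=4) as pool:
        results = pool.map(write_shard, enumerate(shards))
```

Because each shard is independent, the same sharding can be distributed across compute nodes by assigning shard indices to MPI ranks instead of local processes.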