Performance-based sizing considerations
Deep learning workloads vary significantly in their demands on compute, memory, disk, and I/O, often by orders of magnitude. Sizing guidance for GPU quantity and Isilon node configuration can only be given when these resource requirements are known in advance. Even so, it is useful to have a few data points on the ratio of GPUs per Isilon node for common image classification benchmarks.
The following table shows a few such data points:
| Storage performance demanded | Benchmark | Required storage throughput per V100 GPU (MB/s/GPU) | V100 GPUs per Isilon F800 node |
|---|---|---|---|
| Low | ResNet50, FP32 | 40 | 60 |
| Medium | ResNet50, FP16 | 80 | 30 |
| High | AlexNet | 200 | 13 |
To illustrate, ResNet50 at FP16 precision on 16 GPUs needs a storage system that can sustain reads of 16 × 80 = 1,280 MB/s. As another example, ResNet50 at FP16 on 120 GPUs would require 120 / 30 = 4 Isilon F800 nodes (one chassis).
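The arithmetic above can be sketched as a small helper. The per-GPU throughput figures and GPUs-per-node ratios are taken from the table (measured for training, not inferencing); the function and dictionary names are illustrative, not part of any Dell EMC tooling.

```python
import math

# MB/s of read throughput each V100 GPU demands, from the table above
THROUGHPUT_PER_GPU_MB_S = {
    "resnet50_fp32": 40,
    "resnet50_fp16": 80,
    "alexnet": 200,
}

# How many V100 GPUs a single Isilon F800 node can feed, from the table above
GPUS_PER_F800_NODE = {
    "resnet50_fp32": 60,
    "resnet50_fp16": 30,
    "alexnet": 13,
}

def required_throughput_mb_s(benchmark: str, gpu_count: int) -> int:
    """Total storage read throughput (MB/s) the GPUs will demand."""
    return THROUGHPUT_PER_GPU_MB_S[benchmark] * gpu_count

def required_f800_nodes(benchmark: str, gpu_count: int) -> int:
    """Minimum F800 node count, rounded up to whole nodes."""
    return math.ceil(gpu_count / GPUS_PER_F800_NODE[benchmark])

print(required_throughput_mb_s("resnet50_fp16", 16))   # 16 * 80 = 1280 MB/s
print(required_f800_nodes("resnet50_fp16", 120))       # 120 / 30 = 4 nodes
```

Note that `math.ceil` rounds up to whole nodes, so 121 GPUs on the same workload would already call for 5 nodes; this sketch also omits the headroom caveats discussed below.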
The above data points were generated using the hardware configuration specified in Appendix A: Hardware and software, for training the models (NOT for inferencing). They do not account for node failure, drive failure, network failure, administrative jobs, or any other concurrent workloads supported by Isilon, so the actual ratio of GPUs per Isilon node will vary from these data points based on several factors.
Understanding the I/O throughput demanded per GPU for the specific workload, along with the total storage capacity requirements, enables better guidance on Isilon node count and configuration. Reach out to the Isilon account and SME teams for guidance tailored to your specific deep learning workload, throughput, and storage requirements.