Deep Learning applications have grown in scale and complexity over the last few years, driven by significant breakthroughs in algorithms and system architecture. As a result, there is growing demand for training complex Deep Learning networks on high-resolution data. This drives up memory consumption during the model training phase, often beyond the limits of the system. It has therefore become important for application users to reliably predict a model's memory requirements before starting the training process.
In this paper, we present a multi-parameter modeling approach, based on symbolic regression, that generates an accurate memory consumption model. Reliably predicting a model's memory requirements prior to runtime makes it possible to optimize the workload distribution without exceeding system memory capacity. Likewise, when scaling model training across multiple nodes, knowing the peak memory consumption ahead of time helps distribute the workload optimally and achieve peak throughput. We evaluated our modeling approach on a 3D U-net deep learning application and generated a memory consumption model for the network. We then validated the model's accuracy by predicting peak memory consumption on held-out data points, achieving a Mean Absolute Percentage Error (MAPE) of 4.15%. Finally, a comparative evaluation against other machine-learning-based regression techniques confirmed that the symbolic regression approach generates more accurate models of 3D U-net peak memory consumption when only a limited set of performance samples is available. Going forward, we would like to extend our modeling approach to predict the runtime performance of other large-scale DL workloads.
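To make the workflow concrete, the sketch below shows one way such a symbolic-regression memory model could be fitted and validated. This is not the paper's implementation: it assumes the gplearn library for symbolic regression and scikit-learn for the MAPE metric, and the feature names (batch size, input edge length, base filter count) and all sample values are hypothetical placeholders.

```python
# Illustrative sketch only; the paper does not disclose its implementation.
# Assumes gplearn for symbolic regression and scikit-learn for the metric.
# All feature names and numeric values below are hypothetical.
import numpy as np
from gplearn.genetic import SymbolicRegressor
from sklearn.metrics import mean_absolute_percentage_error

# Hypothetical training samples:
# [batch_size, input_edge_length (voxels), base_filter_count] -> peak memory (GB).
X_train = np.array([
    [1,  64, 16], [2,  64, 16], [1, 128, 16],
    [2, 128, 16], [1, 128, 32], [4,  64, 32],
])
y_train = np.array([3.1, 5.9, 11.8, 23.2, 22.7, 24.0])  # made-up measurements

# Evolve a closed-form expression relating the parameters to peak memory.
est = SymbolicRegressor(
    population_size=2000,
    generations=20,
    function_set=("add", "sub", "mul", "div"),
    parsimony_coefficient=0.001,   # penalize overly complex expressions
    random_state=0,
)
est.fit(X_train, y_train)
print(est._program)  # the learned closed-form memory model

# Validate on held-out configurations, mirroring the paper's evaluation step.
X_test = np.array([[2, 128, 32], [4, 128, 16]])
y_test = np.array([44.9, 46.1])  # made-up measurements
mape = mean_absolute_percentage_error(y_test, est.predict(X_test))
print(f"MAPE: {mape:.2%}")
```

A practical appeal of symbolic regression in this setting is that it yields an interpretable closed-form expression rather than a black-box predictor, which helps it remain usable when only a small number of performance samples can be collected.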