3D U-nets are typically used in medical imaging to process three-dimensional volumetric data. A 3D U-net has a typical encoder-decoder structure: the encoder path analyzes the input image and performs dimensionality reduction, and the decoder path performs up-convolution to produce a full image segmentation. Both encoder and decoder paths involve 3D convolutions, max pooling layers, and batch normalization. Figure 2 shows our 3D U-net architecture. By design, 3D U-net is a symmetric network, meaning the model can be trained and run for inference with different image sizes. Because training uses high-resolution, three-dimensional images, and depending on the chosen network hyper-parameters (such as batch size, filter dimensions, and number of layers), memory consumption can quickly escalate beyond the capacity of the underlying hardware, as shown in Figure 1.
To understand the memory consumption behavior of 3D U-net and to develop a baseline memory consumption model for it, we first identified the key contributors to memory consumption. The two memory objects that dominate during the model training phase are (i) intermediate tensors (activation maps) and (ii) model weights. Activation maps are the tensors produced by each convolution and max pooling layer. The size of these tensors depends on four key parameters:
During the forward pass in the model training phase, multiple activation maps are generated for each image in the batch. As a result, the memory consumed by activation maps scales linearly with the batch size. Equation 1 specifies the memory consumed, in bytes, by activation maps:
Another key contributor to memory consumption is the model parameters. Unlike activation maps, the memory consumed by model parameters is fixed for a given network and does not depend on the input image dimensions or the batch size. The total number of model parameters, which includes both weights and biases, depends on (i) the number of filters, (ii) the filter dimensions of the convolution and concatenation layers, and (iii) the number of layers. Equation 2 specifies the memory consumed, in bytes, by the model weights for the 3D U-net architecture shown in Figure 2:
Figure 5 compares the memory consumption predicted by the baseline analytical models in Equations 1 and 2 with the corresponding measurements acquired through memory profiling. The analytical model combines the memory consumed by the activation maps and the model weights to predict peak memory consumption. As the figure shows, the analytical model captures the trend in memory consumption with respect to input image dimension and batch size, but it does not accurately predict the peak memory consumption during the training phase. Specifically, for larger input image sizes, where memory consumption escalates to hundreds of gigabytes, the analytical model (which was derived by studying the intermediate tensors and model parameters) falls significantly short in predicting the actual peak runtime memory consumption. This example demonstrates a practical disadvantage of the purely analytical modeling approach, which requires detailed knowledge of both the application and the target architecture. Even with expert knowledge, the resulting model often deviates from actual execution because of machine- and tool-specific behaviors that are difficult to predict or understand.
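To make the structure of such a baseline concrete, the following is a minimal sketch of an analytical estimate that adds up activation-map bytes and parameter bytes. It is not a reproduction of Equations 1 and 2. It assumes float32 tensors, a toy encoder-only path with one same-padded convolution and one 2x2x2 max pool per level, and a plain sum of the two contributions. All function names and the example configuration are hypothetical and chosen only for illustration.

```python
from math import prod

BYTES_PER_ELEMENT = 4  # assumption: float32 activations and parameters


def conv3d_param_count(in_ch, out_ch, kernel=3):
    """Weights plus biases of one 3D convolution with a cubic kernel."""
    return out_ch * in_ch * kernel ** 3 + out_ch


def activation_bytes(batch, channels, spatial):
    """Bytes held by one activation map of shape (batch, channels, D, H, W)."""
    return batch * channels * prod(spatial) * BYTES_PER_ELEMENT


def estimate_encoder_memory(batch, in_ch, spatial, filters, kernel=3):
    """Return (activation_bytes, parameter_bytes) for a toy encoder path.

    Hypothetical layout: each level applies one same-padded convolution
    followed by a 2x2x2 max pool, and both outputs are kept in memory.
    """
    act_bytes = 0
    param_count = 0
    channels = in_ch
    dims = tuple(spatial)
    for out_ch in filters:
        param_count += conv3d_param_count(channels, out_ch, kernel)
        act_bytes += activation_bytes(batch, out_ch, dims)  # conv output
        dims = tuple(d // 2 for d in dims)                  # pooling halves D, H, W
        act_bytes += activation_bytes(batch, out_ch, dims)  # pool output
        channels = out_ch
    return act_bytes, param_count * BYTES_PER_ELEMENT


# Example: doubling the batch doubles activation bytes but not parameter bytes.
acts_b1, params_b1 = estimate_encoder_memory(1, 1, (64, 64, 64), (16, 32, 64))
acts_b2, params_b2 = estimate_encoder_memory(2, 1, (64, 64, 64), (16, 32, 64))
print(acts_b1, acts_b2, params_b1, params_b2)
```

In this sketch the activation term grows linearly with batch size and with the number of voxels, while the parameter term stays constant, which matches the qualitative behavior described above. The sketch deliberately leaves out gradients, optimizer state, and framework workspace buffers, so it should be read as an illustration of the bookkeeping rather than as a predictor of peak memory.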
In the next section, we present an empirical multi-parameter modeling approach that leverages symbolic regression principles to generate accurate memory consumption models.