Our results and observations are as follows:
For our recommendations, we explicitly assessed both performance and operational flexibility. Our performance comparisons were completed with the MIG feature enabled on the A100 GPUs, which resulted in at most a 5 percent reduction in workload performance. A MIG-enabled GPU lets you configure and reconfigure vGPU profiles on one or more VMs with no operational downtime for the host server. You can reassign and modify the profiles of VMs that are not running jobs while other VMs continue running workloads on other partitions of the shared GPU.
Each partition performed in proportion to the dedicated resources available to it. We were unable to run ResNet training on the grid_a100-2-10c and grid_a100-1-5c partitions. These partitions are suited for inference only and are not recommended for neural network training.
The following list describes the eight scenarios:

1. A single VM configured with the grid_a100-7-40c profile
2. A single VM configured with the grid_a100-4-20c profile
3. A single VM configured with the grid_a100-3-20c profile
4. A single VM configured with the grid_a100-2-10c profile
5. A single VM configured with the grid_a100-1-5c profile
6. Three VMs configured with the grid_a100-4-20c, grid_a100-2-10c, and grid_a100-1-5c profiles
7. Three VMs, each configured with the grid_a100-2-10c profile, and a fourth VM configured with the grid_a100-1-5c profile
8. Seven VMs, each configured with the grid_a100-1-5c profile
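The multi-VM scenarios above work because the profiles together never exceed the seven compute slices that a MIG-enabled A100 exposes; the first number in each profile name (grid_a100-&lt;slices&gt;-&lt;framebuffer&gt;c) is its slice count. The following minimal sketch checks this budget; the `SLICES` table and `fits_on_one_gpu` helper are illustrative assumptions, not part of the study:

```python
# Compute slices per vGPU profile, read from the profile names
# (grid_a100-<slices>-<framebuffer>c). Illustrative helper, not from the study.
SLICES = {
    "grid_a100-7-40c": 7,
    "grid_a100-4-20c": 4,
    "grid_a100-3-20c": 3,
    "grid_a100-2-10c": 2,
    "grid_a100-1-5c": 1,
}

TOTAL_SLICES = 7  # a MIG-enabled A100 exposes seven compute slices


def fits_on_one_gpu(profiles):
    """Return True if the listed vGPU profiles can share one A100 GPU."""
    return sum(SLICES[p] for p in profiles) <= TOTAL_SLICES


# The three multi-VM scenarios above each sum to exactly seven slices:
print(fits_on_one_gpu(["grid_a100-4-20c", "grid_a100-2-10c", "grid_a100-1-5c"]))  # True
print(fits_on_one_gpu(["grid_a100-2-10c"] * 3 + ["grid_a100-1-5c"]))              # True
print(fits_on_one_gpu(["grid_a100-1-5c"] * 7))                                    # True
```

Any combination whose slice counts sum to more than seven (for example, grid_a100-7-40c plus any other profile) cannot be placed on the same GPU.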
The following figure shows the results of the eight scenarios:
The results show that the ResNet inference job cannot fully utilize the grid_a100-7-40c and grid_a100-4-20c profiles: the observed images/second are not significantly higher than those of the smaller profiles.