Our results and observations are as follows:
For our recommendations, we explicitly assessed both performance and operational flexibility. We ran our performance comparisons with the MIG feature enabled on the A100 GPUs, which reduced workload performance by at most 5 percent. The advantage of a MIG-enabled GPU is that you can configure and reconfigure vGPU profiles on one or more VMs with no operational downtime for the host server. Profiles assigned to VMs that are not running jobs can be reassigned or modified while other VMs are running workloads on other partitions of the shared GPU.
MIG performance analysis with ResNet training—We use ResNet training to study the performance of MIG partitions. The following figure shows three scenarios, each with a single VM that is assigned a profile as indicated by the x-axis. For information about profiles, see Multi-Instance GPU feature.
The performance of each partition scales with the size of the dedicated resources available to that partition. We were unable to run ResNet training on the grid_a100-2-10c and grid_a100-1-5c partitions. These partitions are suited to inference and are not recommended for neural network training.
MIG performance analysis with ResNet inference—We use ResNet inference to study the performance of MIG partitions.
The following list describes the eight scenarios:
A single VM configured with the grid_a100-7-40c profile
A single VM configured with the grid_a100-4-20c profile
A single VM configured with the grid_a100-3-20c profile
A single VM configured with the grid_a100-2-10c profile
A single VM configured with the grid_a100-1-5c profile
Three VMs configured with the grid_a100-4-20c, grid_a100-2-10c, and grid_a100-1-5c profiles, one profile per VM
Four VMs: three configured with the grid_a100-2-10c profile and one configured with the grid_a100-1-5c profile
Seven VMs, each configured with the grid_a100-1-5c profile
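The multi-VM scenarios above each combine profiles that together fill one GPU. The following Python sketch illustrates that bookkeeping. It assumes that a profile name follows the pattern grid_a100-<compute slices>-<memory in GB>c and that one A100 40 GB GPU exposes seven compute slices and 40 GB of memory. The helper names (parse_profile, fits_on_one_gpu) are hypothetical and are not part of any NVIDIA or VMware tool. The check only sums resources and does not model MIG placement rules, so treat it as a rough sanity check rather than a validation of a real configuration.

```python
import re

# Assumed capacity of a single A100 40 GB GPU in MIG mode.
A100_COMPUTE_SLICES = 7
A100_MEMORY_GB = 40

# Assumed naming pattern: grid_a100-<compute slices>-<memory GB>c
PROFILE_PATTERN = re.compile(r"^grid_a100-(\d+)-(\d+)c$")


def parse_profile(name):
    """Return (compute_slices, memory_gb) parsed from a profile name."""
    match = PROFILE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized profile name: {name}")
    return int(match.group(1)), int(match.group(2))


def fits_on_one_gpu(profiles):
    """Check that summed slices and memory stay within one GPU.

    This is a capacity check only; it ignores MIG placement constraints.
    """
    parsed = [parse_profile(p) for p in profiles]
    total_slices = sum(slices for slices, _ in parsed)
    total_memory = sum(memory for _, memory in parsed)
    return total_slices <= A100_COMPUTE_SLICES and total_memory <= A100_MEMORY_GB


# The three multi-VM scenarios from the list above.
multi_vm_scenarios = {
    "three VMs": ["grid_a100-4-20c", "grid_a100-2-10c", "grid_a100-1-5c"],
    "four VMs": ["grid_a100-2-10c"] * 3 + ["grid_a100-1-5c"],
    "seven VMs": ["grid_a100-1-5c"] * 7,
}

for label, profiles in multi_vm_scenarios.items():
    slices = sum(parse_profile(p)[0] for p in profiles)
    print(f"{label}: {slices} compute slices, fits={fits_on_one_gpu(profiles)}")
```

Under these assumptions, each of the three multi-VM scenarios uses all seven compute slices of one GPU, whereas a combination such as two grid_a100-4-20c profiles would exceed the slice count.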
The following figure shows the results of the eight scenarios:
The results show the following: