Our results and observations are as follows:
Figure 6. ResNet Training validation setup throughput comparison
The partitions performed in proportion to the size of the dedicated resources available to each partition. We were unable to run ResNet training on the A100-40GB-1-5C and A100-40GB-2-10C partitions; these partitions are suited for inference only and are not recommended for neural network training.
Figure 7. ResNet Inference validation setup throughput comparison
The results presented in the preceding figure are averaged over three inference runs. The throughput numbers show how the setup affects performance: the larger the GPU profile used in the environment, the higher the throughput.
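For context, the reported numbers are images per second averaged across runs. The following is a minimal sketch of how such a figure can be derived from a timed inference loop; the batch size, run count, and predict function are illustrative assumptions, not the exact benchmark harness used for these results:

```python
import time
import numpy as np

def measure_throughput(predict_fn, images, batch_size=64, runs=3):
    """Average images/second over several inference runs."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        for i in range(0, len(images), batch_size):
            predict_fn(images[i:i + batch_size])  # one inference batch
        elapsed = time.perf_counter() - start
        rates.append(len(images) / elapsed)       # images/sec for this run
    return sum(rates) / len(rates)                # mean across runs

# Example with a dummy predict function; replace with the real model's predict.
dummy_images = np.zeros((512, 224, 224, 3), dtype=np.float32)
print(f"{measure_throughput(lambda x: x.mean(), dummy_images):.1f} images/sec")
```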
To validate a specific use case, we used cnvrg.io to run the AI radiologist use case, a machine learning problem in which a model is trained to identify various pathologies from chest X-rays. After the model is trained, it is used to run inference on unseen test or validation images. For this section, we validated cnvrg.io by running both the training and inference stages of the AI radiologist use case.
For the validation, we created four templates in cnvrg.io. These templates use the same resources specified in Table 6. We used the templates with smaller GPU profiles for inference and the larger templates for training. Training and inference were run using the NVAIE 3.0 TensorFlow 1 container.
We validated training and inference using two scenarios:
The following figures show the sample training performance of the AI radiologist use case using the ResNet50 model. They indicate the training loss and accuracy.
Figure 8. Loss and accuracy during training
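As a point of reference, here is a minimal sketch of a ResNet50 training setup of this kind in Keras. The class count, data pipeline names, and hyperparameter values are illustrative assumptions, not the exact code behind these results:

```python
import tensorflow as tf

NUM_CLASSES = 14          # assumed pathology count (ChestX-ray14-style labels)
IMG_SHAPE = (224, 224, 3)

# ResNet50 backbone with a sigmoid head: each pathology is an independent
# yes/no prediction, so this is a multi-label classification problem.
base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet",
    input_shape=IMG_SHAPE, pooling="avg")
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid")(base.output)
model = tf.keras.Model(base.input, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",           # per-label loss for multi-label targets
    metrics=[tf.keras.metrics.AUC(name="auc"), "accuracy"])

# train_ds / val_ds are assumed tf.data pipelines yielding (image, labels) pairs.
# history.history then holds the loss/accuracy curves shown in Figure 8.
# history = model.fit(train_ds, validation_data=val_ds, epochs=7)
```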
After the model is trained, we run inference on the validation data to understand how the model performs on unseen data.
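A minimal sketch of this inference step follows, assuming the trained Keras model from the sketch above and a validation pipeline that yields image and ground-truth label batches; scikit-learn's roc_auc_score is used here only to illustrate how a validation AUC can be computed:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_model(model, val_ds):
    """Run inference on unseen validation data and report a macro-averaged AUC."""
    y_true, y_pred = [], []
    for images, labels in val_ds:               # assumed (image, labels) batches
        y_pred.append(model.predict(images))    # sigmoid scores per pathology
        y_true.append(np.asarray(labels))
    y_true = np.concatenate(y_true)
    y_pred = np.concatenate(y_pred)
    # Macro-average across pathologies, one AUC per label.
    return roc_auc_score(y_true, y_pred, average="macro")
```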
Figure 9. Inference accuracy, latency and throughput
The preceding figure shows the inference results for a model trained for seven epochs. As the validation AUC makes clear, the model needs improvement; hyperparameter tuning and training for more epochs can improve its performance. The model must generalize well to perform well on unseen test or validation data. cnvrg.io enables hyperparameter tuning through its experiment functionality. The following figure shows an example of how we can perform a grid search and run multiple experiments based on parameters specified by the user.
Figure 10. Running grid search for hyperparameter tuning
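Conceptually, a grid search like the one above enumerates every combination of the user-specified parameter values and launches one experiment per combination. A minimal sketch of that enumeration follows; the parameter names, their values, and the train_and_evaluate helper are hypothetical, standing in for whatever the user configures in cnvrg.io:

```python
import itertools

# Illustrative parameter grid; the actual values are chosen by the user.
grid = {
    "learning_rate": [1e-3, 1e-4, 1e-5],
    "batch_size": [16, 32],
}

# A grid search enumerates the Cartesian product of all parameter values
# and runs one experiment (training run) per combination.
for lr, bs in itertools.product(grid["learning_rate"], grid["batch_size"]):
    print(f"experiment: learning_rate={lr}, batch_size={bs}")
    # train_and_evaluate(learning_rate=lr, batch_size=bs)  # hypothetical helper
```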