The following table describes the hardware and software configuration used for validating this design:
Table 7. Validation setup

| Category | Supported components |
|---|---|
| Servers | 4 x PowerEdge R750xa, each with 2 x NVIDIA A100 80 GB |
| Storage for VMs and Kubernetes cluster | vSAN |
| Storage for AI datasets | PowerScale F810 as NFS storage |
| Network switches | |
| Virtualization and container orchestration | |
| Virtualized GPUs and AI software suite | NVIDIA AI Enterprise 1.1 |
| MLOps platform | cnvrg.io (version 3.11) |
| Tanzu Kubernetes configuration | |
For validation, we used multiple worker node configurations. In practice, data scientists typically use a single worker node configuration for their experiments.
The validation is performed with the AI Radiologist use case. The objective of this use case is to train a deep learning (DL) model to classify pathologies from a patient's frontal-view chest X-rays. Our project is based on the Stanford ChexNet project, which was developed to detect pneumonia from a chest X-ray.
The dataset used is ChestX-ray14, one of the largest publicly available chest X-ray datasets, released by the National Institutes of Health (NIH). We use ChexNet, a 121-layer Densely Connected Convolutional Network (DenseNet-121), as the baseline model for our project.
Consider a team with multiple data scientists collaborating in an ML project to create a solution to classify pathologies in a chest X-ray.
The MLOps workflow for this project consists of iterative steps such as data preparation, model training, evaluation, and deployment. Because these steps are iterative, complexity increases when multiple team members work on the same project, each focused on different steps of the MLOps pipeline.
Another challenge is managing efficient resource allocation across these workloads. An optimized and effective tool is therefore needed to handle such requests from data scientists and ML engineers.
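On a Tanzu Kubernetes cluster, GPU allocation for such workloads is ultimately expressed as a resource request that the scheduler satisfies. The fragment below is an illustrative sketch only: the pod and container names, the container image, and the command are hypothetical, not values from this design.

```yaml
# Illustrative pod spec requesting one GPU for a training job.
apiVersion: v1
kind: Pod
metadata:
  name: chexnet-train              # hypothetical job name
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:22.04-py3   # example NGC image
    command: ["python", "train.py"]            # hypothetical entry point
    resources:
      limits:
        nvidia.com/gpu: 1          # one GPU; the scheduler places the pod
                                   # on a worker node with a free GPU
```

An MLOps platform such as cnvrg.io generates requests of this kind on behalf of users, so data scientists do not have to write pod specs by hand.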