Overview

Environment

The following table describes the hardware and software of the two configurations that we used to validate this design. The software versions listed were the latest available at the time of validation. Newer versions might have been released since this document was published and are fully supported.

Table 7. Validation setup

Category	Components in VMware vSphere with Tanzu	Components in Symworld Cloud Native Platform
Servers	4 x PowerEdge R750xa servers, each with 2 x NVIDIA A100 80 GB	4 x PowerEdge R7525 servers, each with 2 x NVIDIA A100 80 GB
Virtualization and container orchestration	VMware vSphere 7.0U3 Standard Edition; VMware vSphere with Tanzu (required for container orchestration)	Symworld Cloud Native Platform (version 5.3.11-217)
Storage for VMs and Kubernetes cluster	vSAN	Symworld Cloud Native Storage
Storage for AI datasets	PowerScale F810 as NFS storage	PowerScale F810 as NFS storage
Network switches	Dell S5248F-ON (for workload and management); Dell S4148T-ON (for OOB management)	Dell S5248F-ON (for workload and management); Dell S4148T-ON (for OOB management)
Virtualized GPUs and AI software suite	NVIDIA AI Enterprise (version 1.1)	NVIDIA GPU Operator (version 1.11.1)
MLOps platform	cnvrg.io (version 3.11)	cnvrg.io (version 4.7.40)

Additionally, we used the following Kubernetes configuration for VMware vSphere with Tanzu:

3 x Kubernetes control plane nodes
2 x worker nodes with VM Class GPU-train-medium
2 x worker nodes with VM Class GPU-train-large
2 x worker nodes with VM Class GPU-infer-small
2 x worker nodes with VM Class GPU-infer-medium
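
The node layout above can be written down as a small inventory and tallied. This is only an illustrative sketch: the dictionary layout and the helper functions `count_nodes` and `workers_for` are assumptions made here and are not part of any VMware or Tanzu API. Only the VM class names and node counts come from the list above, and reading "train" and "infer" in the class names as training and inference roles is likewise an assumption.

```python
# Node counts are taken from the list above; the data layout is hypothetical.
CONTROL_PLANE_NODES = 3
WORKER_NODES_BY_VM_CLASS = {
    "GPU-train-medium": 2,
    "GPU-train-large": 2,
    "GPU-infer-small": 2,
    "GPU-infer-medium": 2,
}


def count_nodes(control_plane, workers_by_class):
    """Return (worker_total, cluster_total) for the given layout."""
    worker_total = sum(workers_by_class.values())
    return worker_total, control_plane + worker_total


def workers_for(purpose, workers_by_class):
    """Count workers whose VM class name contains the given purpose.

    Assumes (hypothetically) that "-train-" and "-infer-" in a class name
    mark training and inference workers.
    """
    marker = f"-{purpose}-"
    return sum(n for name, n in workers_by_class.items() if marker in name)


if __name__ == "__main__":
    workers, total = count_nodes(CONTROL_PLANE_NODES, WORKER_NODES_BY_VM_CLASS)
    print(f"workers={workers}, total nodes={total}")
    print(f"training workers={workers_for('train', WORKER_NODES_BY_VM_CLASS)}")
```

Keeping the layout in one structure like this makes it easy to check that a deployed cluster matches the validated configuration.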

Approach

The validation uses the AI Radiologist use case, whose objective is to train a deep learning (DL) model to classify pathologies from a patient’s frontal-view chest X-rays. Our project is based on the Stanford CheXNet project, which was developed to detect pneumonia from chest X-rays.

The dataset is ChestX-ray14, one of the largest publicly available chest X-ray datasets, released by the National Institutes of Health (NIH). Our baseline model is CheXNet, a 121-layer Dense Convolutional Neural Network (DenseNet).
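
Because a single X-ray can show more than one pathology, classification on ChestX-ray14 is commonly framed as a multi-label problem with one independent output per pathology. The sketch below illustrates that framing in plain Python. It assumes one sigmoid output per label and a binary cross-entropy loss; this document does not state the output layer or loss used in the validated project, so treat both as assumptions. The `predict` and `binary_cross_entropy` helpers and the 0.5 threshold are hypothetical.

```python
import math

# The 14 pathology labels of the ChestX-ray14 dataset.
PATHOLOGIES = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict(logits, threshold=0.5):
    """Map one logit per pathology to the list of predicted labels.

    Each label is decided independently (assumed multi-label setup).
    """
    if len(logits) != len(PATHOLOGIES):
        raise ValueError("expected one logit per pathology")
    return [name for name, z in zip(PATHOLOGIES, logits) if sigmoid(z) >= threshold]


def binary_cross_entropy(logits, targets):
    """Mean per-label binary cross-entropy for a single image."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for z, t in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(logits)
```

For example, logits that are strongly negative everywhere except for Effusion and Pneumonia yield exactly those two labels from `predict`, which a single-label softmax classifier could not express.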

Consider a team of data scientists collaborating on an ML project to create a solution that classifies pathologies in chest X-rays.

The MLOps workflow for this project consists of steps such as:

Data exploration
Building and training a Deep Neural Network model
Hyperparameter tuning
Deploying the best model
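
The tuning and selection steps above can be sketched as a small search loop. This is a minimal illustration, not the workflow used in the validation: grid search is just one possible tuning strategy, the hyperparameter names and values are made up, and `train_and_score` is a hypothetical stand-in that returns a synthetic score instead of training a network.

```python
from itertools import product


def train_and_score(learning_rate, batch_size):
    """Hypothetical stand-in for "train a model, return a validation score".

    Uses a synthetic formula so the sketch runs without data or GPUs.
    """
    return 1.0 - abs(learning_rate - 1e-3) * 100 - abs(batch_size - 32) / 1000


def tune(grid):
    """Exhaustive grid search: try every combination, keep the best score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Illustrative search space; the values are assumptions, not from the source.
GRID = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [16, 32, 64],
}

if __name__ == "__main__":
    params, score = tune(GRID)
    print("best configuration to deploy:", params)
```

In a real project, each call to `train_and_score` would be a full training run, which is why tuning tends to dominate resource demand.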

These steps are iterative, so the work becomes more complex when multiple team members work on the same project while focusing on different steps of the MLOps pipeline.

Another challenge is allocating resources efficiently across the various workloads. This calls for an optimized and effective tool to handle resource requests from data scientists and ML engineers.
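
To make the allocation problem concrete, the sketch below models a fixed pool of GPUs shared by several jobs, with requests served strictly in arrival order. It is a toy model under stated assumptions: the `GpuPool` class, the job names, and the FIFO policy are all hypothetical and do not describe how cnvrg.io, Tanzu, or Symworld schedule workloads. The pool size of 8 simply reflects the 4 servers with 2 GPUs each listed in the validation setup.

```python
from collections import deque


class GpuPool:
    """Toy FIFO allocator over a fixed number of GPUs (illustration only)."""

    def __init__(self, total_gpus):
        self.free = total_gpus
        self.running = {}        # job name -> GPUs currently held
        self.waiting = deque()   # (job name, GPUs requested), arrival order

    def submit(self, job, gpus):
        """Queue a request and start it immediately if it fits."""
        self.waiting.append((job, gpus))
        self._schedule()

    def finish(self, job):
        """Release a job's GPUs and try to start queued work."""
        self.free += self.running.pop(job)
        self._schedule()

    def _schedule(self):
        # Strict FIFO: start jobs from the head of the queue while they fit.
        while self.waiting and self.waiting[0][1] <= self.free:
            job, gpus = self.waiting.popleft()
            self.free -= gpus
            self.running[job] = gpus


if __name__ == "__main__":
    pool = GpuPool(total_gpus=8)
    pool.submit("train-a", 4)   # hypothetical job names
    pool.submit("tune-b", 4)
    pool.submit("infer-c", 1)   # must wait: all GPUs are in use
    pool.finish("train-a")      # frees 4 GPUs, so "infer-c" can start
    print(pool.running, pool.free)
```

Even this simple policy shows the trade-off that a real scheduler has to manage: a small inference request can sit idle behind large training jobs unless the tool supports priorities or finer-grained sharing.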
