AI training and AI model validation take place on the same VM: when training completes, the resulting models are validated on that same VM.
The methodology used for training and model validation is similar to the method used in the AI-assisted Radiology Using Distributed Deep Learning document.
Model weights are saved at each of the 15 training epochs, and each checkpoint is later validated with the same technique as the referenced paper: the average AUC-ROC (Area Under the Curve - Receiver Operating Characteristic) is calculated across all disease categories in the dataset. The checkpoint with the highest average AUC-ROC can then be used for inferencing on real-world data.
Note: Validation was performed against the entire dataset used for training, not against a withheld (holdout) portion of the dataset, which is the more common validation technique. The central aim of this paper is to drive parallel VDI and AI workloads; another aim is to determine whether both can coexist within an acceptable performance envelope while maximizing hardware utilization. Testing therefore used full-dataset validation.
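The checkpoint-selection step can be illustrated with a minimal sketch using scikit-learn. The 15 checkpoints match the methodology above; the 14 disease categories, the random scores standing in for model predictions, and the array shapes are placeholders, not details taken from the referenced paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

N_EPOCHS = 15  # one checkpoint saved per training epoch, per the methodology


def average_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    # Per-category AUC-ROC averaged across all disease categories:
    # a "macro" average over a multi-label prediction matrix.
    return roc_auc_score(y_true, y_score, average="macro")


# Toy stand-ins: random labels/scores in place of real checkpoint outputs.
# Per the note above, validation runs against the full training dataset.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 14))            # 14 disease categories (illustrative)
checkpoint_scores = [rng.random((1000, 14)) for _ in range(N_EPOCHS)]

aucs = [average_auc(y_true, scores) for scores in checkpoint_scores]
best_epoch = int(np.argmax(aucs))
print(f"Best checkpoint: epoch {best_epoch + 1}, mean AUC-ROC = {aucs[best_epoch]:.3f}")
```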
To ensure that the user experience was not compromised, the team monitored the following important resources:
- Compute host servers—VMware vCenter for VMware vSphere gathers key data (CPU, memory, disk, GPU, and network usage) from each compute host during each test run. This data is exported to .csv files for single hosts and then consolidated to show data from all hosts (see the export sketch after this list). While the report does not include specific performance metrics for the management host servers, these servers are monitored during testing to ensure that they perform at the expected level with no bottlenecks.
- Hardware resources—Resource overutilization can cause poor EUE. The Dell VDI solutions team monitored the relevant resource utilization parameters and compared them to relatively conservative thresholds. The thresholds, as shown in the following table, were selected based on industry best practices and our experience. These thresholds provide an optimal trade-off between good EUE and cost-per-user while also allowing sufficient burst capacity for seasonal or intermittent spikes in demand.
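The document does not describe the team's actual export tooling, but a per-host metrics pull of this kind can be sketched with the pyVmomi SDK. The vCenter hostname, credentials, counter list, and sampling window below are placeholders.

```python
import csv
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab environments only
si = SmartConnect(host="vcenter.example.com",            # placeholder
                  user="administrator@vsphere.local",    # placeholder
                  pwd="changeme",                        # placeholder
                  sslContext=ctx)
content = si.RetrieveContent()
perf = content.perfManager

# Map readable counter names such as "cpu.usage.average" to counter IDs.
name_to_id = {f"{c.groupInfo.key}.{c.nameInfo.key}.{c.rollupType}": c.key
              for c in perf.perfCounter}
wanted = ["cpu.usage.average", "mem.consumed.average", "net.usage.average"]
id_to_name = {name_to_id[n]: n for n in wanted}

# Enumerate every compute host in the vCenter inventory.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
hosts = list(view.view)
view.Destroy()

with open("host_metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["host", "counter", "timestamp", "value"])
    for host in hosts:
        metric_ids = [vim.PerformanceManager.MetricId(counterId=cid, instance="")
                      for cid in id_to_name]
        spec = vim.PerformanceManager.QuerySpec(
            entity=host, metricId=metric_ids,
            intervalId=20,    # 20-second real-time samples
            maxSample=180)    # most recent hour
        for result in perf.QueryPerf(querySpec=[spec]):
            for series in result.value:
                counter = id_to_name[series.id.counterId]
                for info, value in zip(result.sampleInfo, series.value):
                    writer.writerow([host.name, counter, info.timestamp, value])

Disconnect(si)
```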
Table 4. Test performance metrics

| Component | Test tool | Metrics |
|-----------|-----------|---------|
| Host | Login Enterprise | Host CPU usage (%), Host CPU core utilization (%), Host CPU readiness (%), Host memory consumed (GB), Host memory active (GB), Memory ballooning (GB), Host memory swap used (GB), Host network usage (Mbps), GPU utilization (%); additional: VSI Max score |
| Storage | Login Enterprise | Datastore IOPS peak, Datastore IOPS steady-state average, Datastore latency peak, Datastore latency steady-state average |
| AI compute | Time to process | Accuracy validation (%) |
Note: The VDI solutions team recommends that average CPU utilization not exceed 85 percent in a production environment. A 5 percent margin of error was allocated for this validation effort, so CPU utilization sometimes exceeds our recommended percentage. Because of the nature of Login VSI testing, these exceptions are reasonable for determining our sizing guidance.
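A hypothetical post-processing check against the 85 percent CPU target plus the 5 percent margin could look like the following. The CSV layout matches the pyVmomi sketch earlier in this section and is an assumption, not the team's actual tooling.

```python
import pandas as pd

TARGET_PCT = 85.0   # recommended production ceiling
MARGIN_PCT = 5.0    # margin of error allowed during validation

df = pd.read_csv("host_metrics.csv")
cpu = df[df["counter"] == "cpu.usage.average"].copy()
# vCenter reports percentage counters in hundredths of a percent.
cpu["pct"] = cpu["value"] / 100.0

# Averages over the whole export; isolating the steady-state window of a
# Login Enterprise run would require the test's start/end timestamps.
for host, avg in cpu.groupby("host")["pct"].mean().items():
    verdict = "within margin" if avg <= TARGET_PCT + MARGIN_PCT else "exceeds margin"
    print(f"{host}: average CPU {avg:.1f}% ({verdict})")
```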