We performed GPU and non-GPU tests with the NVIDIA nVector Knowledge Worker workload to compare metrics between the two test cases. For the GPU test, we used a single compute host provisioned with six NVIDIA T4 GPUs and ran 48 virtual machines on it, each configured with the NVIDIA T4-2B vGPU profile. The vGPU scheduling policy was set to "Fixed Share Scheduler." For the non-GPU test, we ran 48 virtual machines on a compute host without vGPU profiles enabled.
The compute host was part of a 4-node VMware vSAN software-defined storage cluster. We used the Citrix MCS linked-clone provisioning method to provision the desktop VMs, and Citrix ThinWire Plus was the remote display protocol.
Both tests were performed with the NVIDIA nVector Knowledge Worker workload. Table 13 compares the host utilization metrics gathered from vCenter for both GPU and non-GPU test cases, while Table 14 compares the EUE metrics generated by the nVector tool. For definitions of the nVector EUE metrics, see the "Measuring EUE" section. For definitions of host utilization metrics, see the Login VSI "Summary of test results" section.
The host utilization metrics in both tests were well below the threshold that we set (see Table 11). Both tests produced the same image quality (SSIM value 0.99). However, with GPUs enabled, the frames per second (FPS) rate increased by 7.7 percent, and the end-user latency decreased by 22.6 percent.
Table 13. Host utilization metrics for the GPU and non-GPU test cases

| Test case | Server configuration | nVector workload | Density per host | Average CPU usage | Average GPU usage | Average active memory | Average IOPS per user | Average network Mbps per user |
|---|---|---|---|---|---|---|---|---|
| GPU | Density Optimized + six NVIDIA T4s | Knowledge Worker (NVIDIA T4-2B) | 48 | 46% | 18% | 385 GB | 7.56 | 2.8 |
| Non-GPU | Density Optimized | Knowledge Worker | 48 | 65% | N/A | 67 GB | 11 | 2.7 |
Table 14. nVector EUE metrics for the GPU and non-GPU test cases

| Test configuration | nVector workload | GPU profile | Density per host | End-user latency (ms) | Frame rate (FPS) | Image quality (SSIM) |
|---|---|---|---|---|---|---|
| GPU | Knowledge Worker | NVIDIA T4-2B | 48 | 82 | 14 | 0.99 |
| Non-GPU | Knowledge Worker | N/A | 48 | 106 | 13 | 0.99 |
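The improvement percentages cited earlier follow directly from the Table 14 values; the short sketch below reproduces that arithmetic (the inputs are from the table, the rounding is ours).

```python
# Percentage deltas between the GPU and non-GPU test cases (values from Table 14).
gpu = {"latency_ms": 82, "fps": 14}
non_gpu = {"latency_ms": 106, "fps": 13}

fps_gain = (gpu["fps"] - non_gpu["fps"]) / non_gpu["fps"] * 100
latency_drop = (non_gpu["latency_ms"] - gpu["latency_ms"]) / non_gpu["latency_ms"] * 100

print(f"FPS increase with vGPU:     {fps_gain:.1f}%")      # ~7.7%
print(f"End-user latency reduction: {latency_drop:.1f}%")  # ~22.6%
```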
We performed this test on a 4-node VxRail cluster. Host 1 was configured with six NVIDIA T4 GPUs and ran 48 virtual desktop VMs, each configured with the NVIDIA T4-2B vGPU profile. Host 2 ran the nVector launcher VMs. A launcher is an endpoint VM from which the desktop sessions are initiated. Host 2 was configured with three NVIDIA P40 GPUs, and the launcher VMs on host 2 were GPU-enabled through the NVIDIA P40-1B profile. The nVector tool requires launcher VMs to be GPU-enabled. Hosts 3 and 4 carried no load.
The total GPU frame buffer available on compute host 1 was 96 GB. With the NVIDIA T4-2B profile allocating 2 GB of frame buffer per VM, the maximum number of GPU-enabled users that compute host 1 can support is 48.
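The 48-user figure follows from straightforward frame-buffer arithmetic: each T4 provides 16 GB of frame buffer and the T4-2B profile consumes 2 GB per VM. A minimal sketch of that sizing calculation:

```python
# vGPU density sizing for compute host 1 in this test.
gpus_per_host = 6             # NVIDIA T4 cards in compute host 1
frame_buffer_per_gpu_gb = 16  # frame buffer per T4
profile_size_gb = 2           # the T4-2B profile allocates 2 GB per VM

total_frame_buffer_gb = gpus_per_host * frame_buffer_per_gpu_gb  # 96 GB
max_vgpu_vms = total_frame_buffer_gb // profile_size_gb          # 48 VMs
print(f"{total_frame_buffer_gb} GB of frame buffer supports {max_vgpu_vms} T4-2B VMs")
```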
The following figure shows CPU utilization across the GPU-enabled host 1 and launcher host 2 during the testing. We can see a spike in CPU usage for compute host 1 during linked-clone creation and the login phase. During the steady state phase, an average CPU utilization of 46 percent was recorded on the GPU-enabled compute host 1. This value was lower than the pass/fail threshold we set for average CPU utilization (see Table 11).
As shown in the following figure, the CPU readiness percentage was well below the 10 percent threshold that we set. The CPU readiness percentage was low throughout testing, indicating that the VMs had no significant delays in scheduling CPU time.
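vCenter reports CPU ready time as a summation in milliseconds per sampling interval rather than as a percentage. The sketch below shows the commonly used conversion, assuming the 20-second interval of vCenter's real-time charts; treat it as an illustration rather than part of the nVector toolchain.

```python
# Convert a vCenter CPU ready summation (milliseconds per sample) into a percentage.
def cpu_ready_percent(ready_ms: float, interval_s: int = 20, vcpus: int = 1) -> float:
    # Divide by the vCPU count to get a per-vCPU figure when the summation
    # covers a whole VM rather than a single vCPU.
    return ready_ms / (interval_s * 1000 * vcpus) * 100

# Example: a 2-vCPU desktop VM reporting 400 ms of ready time in a 20 s sample.
print(f"CPU readiness: {cpu_ready_percent(400, vcpus=2):.1f}%")  # 1.0%
```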
As shown in the following figure, the average steady state CPU core utilization was 41 percent on the GPU-enabled compute host 1.
The following figure shows the GPU usage across the six NVIDIA T4 GPUs configured on the GPU-enabled compute host 1. The GPU usage during the steady state period across the six GPUs averaged approximately 18 percent.
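For reference, per-GPU utilization like this can be sampled on the host with the nvidia-smi utility that ships with the NVIDIA vGPU host driver. The sketch below is one illustrative way to collect and average a single sample across the cards; it is our assumption of how such a sample could be gathered, not part of the nVector toolchain.

```python
# Take one utilization sample per GPU with nvidia-smi and average across the cards.
import subprocess

output = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,utilization.gpu", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

utilizations = []
for line in output.strip().splitlines():
    index, util = (field.strip() for field in line.split(","))
    utilizations.append(int(util))
    print(f"GPU {index}: {util}% utilization")

print(f"Average across {len(utilizations)} GPUs: {sum(utilizations) / len(utilizations):.1f}%")
```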
We observed no memory constraints during the testing on the compute host or the management host. Of the 768 GB of memory available per node, compute host 1 reached a maximum consumed memory of 439 GB.
Active memory usage reached a maximum of 386 GB. Memory usage did not vary throughout the test because the memory of all vGPU-enabled VMs is fully reserved. There was no memory ballooning or swapping on the hosts.
Network bandwidth was not an issue in this test. An average network usage of 134 Mbps was recorded during the steady state phase of the testing. The busiest period for network traffic was the VM creation phase, when we recorded a maximum network usage of 8,775 Mbps on compute host 1. The steady state average network usage per user was 2.8 Mbps.
As shown in the following figure, the cluster IOPS reached a maximum value of 28,454 for read IOPS and 4,905 for write IOPS during the testing. The average steady state read IOPS was 111, and the average steady state write IOPS was 252. The average steady state IOPS (read and write) per user was 7.6.
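The per-user figures here and in Table 13 are the steady state cluster averages divided by the 48-session density; a minimal sketch of that normalization:

```python
# Normalize steady state averages by the number of user sessions (GPU test case).
sessions = 48
read_iops, write_iops = 111, 252
network_mbps = 134

print(f"IOPS per user:    {(read_iops + write_iops) / sessions:.2f}")  # ~7.56
print(f"Network per user: {network_mbps / sessions:.1f} Mbps")         # ~2.8
```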
As shown in the following figure, cluster disk latency reached a maximum read latency of 1.51 milliseconds and a maximum write latency of 2.76 milliseconds. The average steady state read latency was 0.15 milliseconds, and the average steady state write latency was 0.8 milliseconds.
We ran this non-graphics test to compare the performance of a GPU-enabled host with that of a non-GPU host running the nVector Knowledge Worker workload. We performed this test on a 4-node VxRail cluster. Compute host 1 ran 48 desktop VMs with no GPUs configured. Host 2 ran the nVector launcher VMs. We configured host 2 with three NVIDIA P40 GPUs, and the launcher VMs on host 2 were GPU-enabled through the NVIDIA GRID P40-1B profile. The nVector tool requires launcher VMs to be GPU-enabled. Hosts 3 and 4 carried no load.
The following figure shows the CPU utilization across the desktop host 1 and launcher host 2 during the testing. During the steady state phase, an average CPU utilization of 65 percent was recorded on compute host 1. This value was lower than the pass/fail threshold we set for average CPU utilization (see Table 11). The launcher host 2 had very low CPU usage during the steady state phase.
As shown in the following figure, the CPU readiness was well below the 5 percent threshold that we set.
As shown in the following figure, the average steady state CPU core utilization on the compute host 1 was 56 percent.
We observed no memory constraints during the testing on the compute or the management host. There was no memory ballooning or swapping on hosts. The steady state average consumed memory on the compute host 1 was 434 GB.
As shown in the following figure, the average steady state active memory usage was 67 GB.
Network bandwidth was not an issue in this test. An average network usage of 131 Mbps was recorded on compute host 1 during the steady state phase of the testing. The average network usage per user on compute host 1 was 2.7 Mbps.
As shown in the following figure, the cluster IOPS reached a maximum value of 24,279 for read IOPS and 3,968 for write IOPS during the testing. The steady state average read IOPS was 198, and the steady state average write IOPS was 333. The average steady state disk IOPS (read and write) per user was 11.
As shown in the following figure, cluster disk latency reached a maximum read latency of 1.48 milliseconds and a maximum write latency of 2.34 milliseconds during the testing. The average steady state read latency was 0.12 milliseconds, and the average steady state write latency was 0.47 milliseconds.