The following sections compare the results from MLPerf v2.0 and MLPerf v2.1, and provide performance improvements.
The following figure shows the performance gains:
These performance gains are primarily due to software improvements. We see that there is an approximate 8.64 percent improvement with the server scenario and 6.36 percent improvement with the offline scenario respectively for the ResNet50 model.
The 3D-UNET model shows an approximate 3.39 percent improvement whereas the RNNT server scenario shows an approximate 2.25 percent improvement.
The following graph shows the performance improvement percentage of different workloads:
The preceding graphs show the performance improvement with MLPerf Inference v2.1 compared to MLPerf Inference v2.0. The RNN-T Offline scenario had the most performance improvement with the PowerEdge XE8545 server followed by the ResNet scenario. The PowerEdge XE8545 server delivered considerable performance gains from software improvements. For more information about the software improvements, see NVIDIA’s inference 2.1 blog.
For some customers, optimizing power efficiency is a primary concern. It becomes more essential in an inference deployment setting, sometimes while simultaneously competing for space and power. We understand the concern and provide performance and power results. This section demonstrates the performance and power improvement that we observed in MLPerf Inference v2.1 over MLPerf Inference v2.0.
In this round of MLPerf Inference v2.1, we made submissions that tune for a higher number of inferences per watt. The major difference between MaxQ and non-MaxQ systems is the power measurement along with performance. You might wonder why the throughput of an identical non-MaxQ system is higher than that of a MaxQ system in our submissions. We made some tradeoffs to suit the system to optimally deliver high performance/watt as opposed to only high performance. We noticed significant improvements in performance/watt that were made in MLPerf Inference v2.1.
For power submissions, we derive a metric to measure the power efficiency of the system. We divide the throughput by the power used during the full run (in watts) and obtain performance/watt also known as inferences per watt.
The y axis in the above figure describes the delta in inferences per watt between inference v2.1 and v2.0 submissions. DLRM Server delivered highest improvement in perf/watt at approximately 32 percent followed by RNN-T Offline.
Performance/watt profile such as DAPC was set in the iDRAC during our runs, it helps to improve the power efficiency of the system.
The PowerEdge XE8545 server with four A100 SXM GPUs delivered improvements in performance/watt with the ResNet Server. Because there are no references for performance/watt from MLPerf Inference v2.0 to compare for other than ResNet benchmark, the preceding graph only has ResNet benchmarks. However, with MLPerf Inference v2.1, the PowerEdge XE8545 server delivers outstanding performance/watt even with other workloads scores as seen in Figure 3.