MLPerf Inference results are reported in terms of samples and queries. A sample is the unit on which inference is run, such as an image or a sentence. A query is a set of samples that are issued to an inference system together; for example, a single query might contain eight images. If an R7515 delivers 23,290 samples per second in a resnet50 offline scenario benchmark, it can classify 23,290 images per second. For a detailed explanation of the definitions, rules, and constraints of MLPerf Inference, see https://github.com/mlperf/inference_policies/blob/master/inference_rules.adoc#constraints-for-the-closed-division
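The relationship between queries, samples, and throughput can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the MLPerf tooling: the function names and the one-million-image workload are assumptions made for the example, while the 23,290 samples per second figure is the R7515 result quoted above.

```python
def samples_per_second(num_queries: int, samples_per_query: int,
                       elapsed_seconds: float) -> float:
    """Throughput in samples/s when every query carries the same number of samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_queries * samples_per_query / elapsed_seconds


def seconds_to_process(num_samples: int, throughput: float) -> float:
    """Time needed to process num_samples at a given throughput (samples/s)."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return num_samples / throughput


# Hypothetical run: 10 queries of 8 images each, completed in 2 seconds.
example_throughput = samples_per_second(10, 8, 2.0)

# R7515 resnet50 offline figure quoted in the text (samples per second).
R7515_RESNET50_OFFLINE = 23_290

# Hypothetical workload: one million images to classify.
batch_seconds = seconds_to_process(1_000_000, R7515_RESNET50_OFFLINE)
print(f"{example_throughput:.1f} samples/s, {batch_seconds:.2f} s for 1M images")
```

At the quoted rate, a batch of one million images would take roughly 43 seconds, which is the kind of estimate the offline scenario is meant to support.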
In the charts below, “Default Accuracy” refers to a configuration where the model reaches at least 99% of the reference model's accuracy, and “High Accuracy” refers to a configuration where it reaches at least 99.9% of the reference model's accuracy.
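A minimal sketch of how the two accuracy levels translate into pass/fail thresholds follows. It assumes, as in the MLPerf closed division rules, that the percentages are measured relative to the reference model's accuracy. The function names and the reference value of 80.0 are hypothetical placeholders, not figures from these benchmarks.

```python
# Ratio of the reference accuracy that each configuration must reach.
ACCURACY_RATIOS = {"default": 0.99, "high": 0.999}


def accuracy_target(reference_accuracy: float, mode: str) -> float:
    """Minimum accuracy required for the given mode ('default' or 'high')."""
    return reference_accuracy * ACCURACY_RATIOS[mode]


def meets_target(measured: float, reference_accuracy: float, mode: str) -> bool:
    """True if the measured accuracy reaches the target for the given mode."""
    return measured >= accuracy_target(reference_accuracy, mode)


# Hypothetical reference accuracy (percent); substitute the real reference value.
reference = 80.0
measured = 79.5

for mode in ("default", "high"):
    target = accuracy_target(reference, mode)
    print(f"{mode}: target {target:.2f}, passes: {meets_target(measured, reference, mode)}")
```

With these placeholder numbers the measured accuracy clears the default target but not the high-accuracy one, which is why the two configurations are charted separately.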
Figures 1 to 4 below show the inference capabilities of the PowerEdge R7515 and PowerEdge R7525 configured with different NVIDIA GPUs. Each bar shows the relative number of inference operations completed in a set amount of time within the latency constraints. The taller the bar, the higher the inference capability of the platform. Details on the different scenarios used in MLPerf inference tests (server and offline) are available at the MLPerf website. The offline scenario represents use cases where inference runs as a batch job (for instance, using AI for photo sorting), while the server scenario represents interactive inference (for instance, a translation app).
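One common way to produce relative-performance bars of this kind is to divide each configuration's throughput by that of a chosen baseline. The sketch below shows that normalization under stated assumptions: the text does not say which baseline the figures use, the configuration labels are hypothetical, and the second throughput value is a placeholder. Only the 23,290 samples per second figure comes from the text.

```python
def relative_performance(results: dict[str, float], baseline: str) -> dict[str, float]:
    """Scale each throughput by the baseline's throughput (baseline becomes 1.0)."""
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline throughput must be positive")
    return {name: value / base for name, value in results.items()}


# Throughput in samples/s. "config_a" uses the R7515 figure quoted in the text;
# "config_b" is a made-up placeholder for a second configuration.
throughputs = {
    "config_a": 23_290.0,
    "config_b": 46_580.0,  # hypothetical value, not a measured result
}

for name, ratio in relative_performance(throughputs, baseline="config_a").items():
    print(f"{name}: {ratio:.2f}x")
```

Normalizing this way keeps the comparison readable when configurations differ by large factors, at the cost of hiding the absolute samples-per-second values.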
The relative performance of the different servers is plotted to show the inference capabilities and flexibility that can be achieved using these platforms.