MLPerf performance
MLPerf is a deep learning benchmarking suite developed by the MLCommons community. More information about MLPerf is available from MLCommons.
AI inference models in production are typically run as single-node instances. For example, Large Language Models (LLMs) are typically deployed as a single instance per node, with a GPU memory requirement that depends on the parameter count and the data type (integer or floating point, and its precision). Popular LLMs have their parameter counts specifically tuned (such as 7B or 13B) so that their memory footprint fits within the capacity of a single consumer-grade GPU.
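As a rough illustration of this sizing relationship (not part of the MLPerf suite), the following Python sketch estimates the weights-only memory footprint from parameter count and data type. The bytes-per-parameter values are the standard sizes for each precision; the model sizes are the examples from the text, and the estimate deliberately ignores KV cache, activations, and framework overhead.

```python
# Weights-only GPU memory estimate for an LLM. Excludes KV cache,
# activations, and framework overhead, so real requirements are higher.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(params_billions: float, dtype: str = "fp16") -> float:
    """Approximate weight footprint in GB for a given parameter count."""
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

for size in (7, 13, 70):
    print(f"{size}B @ fp16 ~ {weight_memory_gb(size):.0f} GB")
# 7B @ fp16 ~ 14 GB fits a single 24 GB consumer GPU; 13B (~26 GB)
# needs int8/int4 quantization to fit; 70B (~140 GB) needs multiple
# GPUs or a server-class accelerator.
```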
Multiple users are typically time-sliced, with queries served first in, first out. As user counts grow (typically to 10-50 concurrent users per instance before performance is deemed unacceptable), the LLM service is typically scaled out by deploying additional isolated instances of the same model to support additional users, rather than scaling up by adding nodes to the existing instance(s). This limits latency penalties, and an additional advantage of this approach is that failed instances can be discarded and restarted without affecting other users, as the sketch below illustrates. As a result, the MLPerf Inference Suite run in a single-node configuration was selected as a representative benchmark.
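A minimal Python sketch of this scale-out pattern follows, using a shared FIFO queue drained by isolated worker instances. The `run_inference` function is a hypothetical stand-in for a real model call, and the instance count is an assumption for illustration; capacity is added by starting more instances, not by enlarging an existing one.

```python
import queue
import threading

requests: "queue.Queue[str]" = queue.Queue()

def run_inference(prompt: str) -> str:
    # Hypothetical stand-in for a single-node LLM instance's model call.
    return f"response to {prompt!r}"

def instance_worker(instance_id: int) -> None:
    # Each isolated instance drains the shared queue independently.
    while True:
        prompt = requests.get()      # FIFO: oldest query is served first
        try:
            print(instance_id, run_inference(prompt))
        finally:
            requests.task_done()     # a failed instance can simply be
                                     # restarted; other instances and
                                     # their users are unaffected

NUM_INSTANCES = 2                    # scale out by raising this count
for i in range(NUM_INSTANCES):
    threading.Thread(target=instance_worker, args=(i,), daemon=True).start()

for n in range(4):
    requests.put(f"query {n}")
requests.join()                      # wait until all queries are served
```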
The MLPerf Inference Suite is a set of benchmark models assumed to be in a previously trained state; inferencing is therefore primarily concerned with output performance and accuracy. The different scenarios and use cases include: