Abstract

Dell Technologies, AMD, and Deci AI recently submitted results to MLPerf Inference v2.1 in the open division. This blog showcases our first successful three-way submission and describes how the software and hardware of each party was best used to achieve optimal performance for the MLPerf BERT-Large model.

Introduction

MLCommons™ is a consortium of companies whose mission is to accelerate machine learning innovation to benefit everyone. The organization focuses on benchmarking to enable the display of fair performance measurements and makes datasets open and available, since models in the benchmarks are only as good as the data. It also shares best practices to initiate standardization of sharing and communication among machine learning stakeholders.

The MLPerf Inference v2.1 submission falls under the benchmarking road map for MLCommons. Submissions made to the closed division warrant an equitable comparison of hardware platforms and software frameworks. The submissions must use the same model and optimizer as the reference implementation. Additionally, no retraining is permitted in the closed division. On the other hand, the open division promotes faster models and optimizers as it allows benchmark implementations that use a different model for the same task. Any machine learning approach is permitted if it meets the target quality. Results submitted to the open division showcase exciting technologies that are being developed.

This blog highlights an offline submission made in the open division BERT 99.9 category for the natural language processing (NLP) task. The goal of the submission was to maximize throughput while keeping the accuracy within a 0.1 percent margin of error from the baseline accuracy, which is 90.874 F1 (Stanford Question Answering Dataset (SQuAD)).

Dell PowerEdge R7525 Server Powered with AMD EPYC™ processors

Since MLPerf benchmarking results are a showcase of the joint performance of both the software and underlying hardware, Deci AI’s optimized BERT-Large model, known as DeciBERT-Large, was run using ONNXRT on the Dell PowerEdge R7525 rack server populated with two 64-core AMD EPYC 7773X processors.

The PowerEdge R7525 rack server is a highly scalable and adaptable two-socket 2U rack server that delivers powerful performance and flexible configurations. It is ideal for traditional and emerging workloads and applications that include flash software-defined storage (SDS), virtual desktop infrastructure (VDI), and data analytics (DA) workloads. As this blog’s MLPerf submission shows, the PowerEdge R7525 rack server is also well suited for AI workloads such as deep learning inference.

The PowerEdge R7525 server is an excellent server choice for several reasons. Some of the high-level specifications to meet performance demands include up to 24 directly connected NVMe drives that support all flash AF8 vSAN Ready Nodes. The 4 TB of memory and two AMD EPYC processors enable optimal performance. Also, the PowerEdge R7525 server has maximized IOPS, storage, and memory configurations enabled by up to eight PCIe Gen4 slots. Furthermore, AMD Instinct™ MI100 and MI200 series accelerators and other double-width GPUs can provide additional levels of acceleration.

AMD EPYC processors with AMD 3D V-Cache™ Technology were launched in March 2022. This innovative new lineup of server-class AMD EPYC processors was positioned for accelerating technical computing workloads, including computational fluid dynamics (CFD), electronic design automation (EDA), and finite element analysis (FEA).

With this joint MLPerf submission, a first for AMD EPYC processors with AMD 3D V-Cache Technology, AMD demonstrates the applicability of the new AMD EPYC processors and their extra L3 cache for deep learning inference workloads.

Deci AI DeciBERT-Large Model Comparisons and Metrics

Deci AI used their proprietary AutoNAC™ (Automated Neural Architecture Construction) Engine to generate an optimized BERT-Large model, called DeciBERT-Large, tuned specifically for the underlying PowerEdge R7525 server and two 64-core AMD EPYC 7773X processors. The Deci AI algorithm reduces the reference BERT-Large model size by nearly three times, from 340 million parameters in the standard BERT-Large model down to 115 million parameters, while achieving compelling performance and accuracy.

From a memory capacity perspective, the parameter count reduction also contributes to similarly significant space savings with the DeciBERT-Large model. The ONNX DeciBERT-Large model size is 378 MB in FP32 and 95 MB in INT8 compared to 1.4 GB of the reference BERT-Large model implementation from MLCommons.

By pairing the optimized, smaller DeciBERT-Large model with the extended L3 cache capacity of the AMD EPYC processors with 3D V-Cache, more of the model can be stored in the cache at a time. This method of leveraging the additional L3 cache enables near compute and lower latency memory accesses compared to DRAM.

The following tables highlight the data points collected by Deci AI on the PowerEdge R7525 server with two AMD EPYC 7773X processors. The application of the Deci AI AutoNAC algorithm to generate the DeciBERT-Large model highlights a 6.33 times improvement in FP32 performance and a 6.64 times improvement in INT8 performance, while achieving an INT8 F1 score of 91.08, which is higher than the F1 score of 90.07 of the reference BERT-Large implementation in INT8.

Table 1: BERT-Large comparisons – FP32

	F1 accuracy on SQuAD (FP32)	Parameter count	Model size	Throughput (QPS) ONNX runtime FP32
BERT-Large	90.87	340 million	1.4 GB	12
DeciBERT-Large	91.08	115 million	378 MB or 0.378 GB	76
Deci Gains	0.21 better F1 accuracy	66.2% improvement	73% size reduction	6.33 times throughput improvement

Table 2: BERT-Large comparisons – INT8

	F1 accuracy on SQuAD (INT8)	Parameter count	Model size	Throughput (QPS) ONNX runtime INT8
BERT-Large	90.07	340 million	1.4 GB	18
DeciBERT-Large	91.08	115 million	95 MB or 0.095 GB	116
Deci Gains	1.01 better F1 accuracy	66.2% improvement	93.2% size reduction	6.44 times throughput improvement

The following figure shows that Deci AI’s implementations of the BERT-Large models compiled with ONNXRT are critical in enabling competitive performance in both FP32 and INT8 precisions:

Figure 1: Performance comparison between the reference BERT-Large model implementation and Deci AI’s optimized DeciBERT-Large model

FP32 is commonly used for running deep learning models as it is the default floating datatype in programming languages. It consists of 32 bits of ones and zeros, of which the first bit is the sign bit, representing whether the value is positive. The next eight bits are the exponent of the number, and the last 23 bits are the fraction or mantissa of the number. FP32, or floating point 32, uses nine bits for range and 23 bits for accuracy. The dynamic range of FP32, or the quantity of representable numbers using this datatype, reaches nearly four billion values.

INT8 has become a popular datatype for deep learning inference. Since INT8 has fewer bits and a smaller dynamic range (256 values compared to the four billion values representable by FP32), INT8 compute requirements are considerably reduced compared to FP32. Typically, latencies are lower and throughputs are higher when using INT8 models compared to FP32 models. However, the increased throughput and lower latency tends to come at the cost of accuracy degradation.

Most MLPerf BERT-Large submissions in the 99.9 percent accuracy category use 32-bit or 16-bit quantization because 8-bit quantization is lossy and typically reduces model accuracy below the 99.9 percent threshold. For example, while applying INT8 quantization to the baseline BERT-Large model is an option that accelerates throughput from 12 FPS to 18 FPS, it no longer meets the MLPerf 99.9 percent accuracy constraints.

Deci AI AutoNac Engine and Optimization

The Deci AI AutoNAC engine guarantees that the model designed meets the accuracy requirements set by MLPerf and pursues the most performant variation of the specific model within those constraints, allowing INT8 quantization to be leveraged for the submission.

The Deci AI AutoNAC engine begins by generating a dynamic search space that accounts for parameters such as the baseline accuracy, inference performance targets, underlying hardware, compilers, and quantization, among others. A fast and accurate multiconstraints search algorithm is initiated and creates a new model architecture that delivers the highest performance given the defined constraints.

From a computation time perspective, the AutoNAC search process is approximately three times longer than standard training, depending on the task. For example, training the DeciBert model to perform the SQuAD NLP task requires approximately 60 GPU hours. The search for this DeciBERT model required approximately 180 GPU hours, and the computation involved was parallelized. Therefore, the computation of AutoNAC is commercially affordable for almost any organization.

In summary, Deci AI generated a model using AutoNAC that was specifically designed to deliver optimal performance within the MLPerf constraints when running on a Dell server with AMD EPYC processors with 3D V-Cache.

The following figure shows the AutoNAC optimization process:

Figure 2: Deci AI’s AutoNac process

Conclusion

The Deci AI AutoNAC engine generates optimized deep learning inference models that meet customer accuracy and dataset requirements while maximizing performance. The increased performance, combined with the significant reduction in parameter count and memory size, positions Deci AI optimized models as highly efficient for a range of applications. The DeciBERT-Large model is an optimized version of the state-of-the-art BERT-Large model for NLP applications. Applying that to real-world scenarios, call centers are examples of customers that can take advantage of deep learning insights in the areas of sentiment analysis, live transcription and translation, and question answering. The DeciBERT-Large model, as developed for MLPerf v2.1 by Deci AI, can be easily tuned for a call center’s own dataset and application, and deployed in production today to improve performance, shorten time to insights, and enable the deployment of smaller optimized models with reduced compute requirements, which becomes particularly beneficial in power or cost constrained environments.

Your Browser is Out of Date

The First MLPerf Inference v2.1 Performance Result on AMD EPYC™ CPU-Based PowerEdge Servers