Dell PowerEdge with Intel 4th Gen Xeon Built-In Accelerators – Choosing the Right SKU
Tue, 24 Oct 2023 20:21:02 -0000
Executive Summary
Intel’s 4th Gen Xeon introduces several built-in acceleration engines that have meaningful performance implications for use cases directly relevant to the modern and evolving data center. In this Direct from Development (DfD) tech note, we present a brief introduction to these accelerators and then provide a comprehensive listing of all 4th Gen Xeon FCLGA4677 socketed SKUs presently offered by Dell Technologies and the accelerator support each provides.
Before the overview of the built-in accelerator engines, the following chart describes the suffixes found on Intel’s 4th Gen Xeon processors:
Options | 4th Generation Intel® Xeon® Processors (formerly Sapphire Rapids-SP) |
H | Database and Analytics up to 4S and 8S depending on SKU |
M | Processor specifications optimized for AI and media processing workloads |
N | Network/5G/Edge (High TPT/Low Latency): Processor specifications optimized for communications/networking/NFV (Network Function(s) Virtualization) workloads and operating environments |
P | Processor specifications optimized for IaaS cloud environments such as orchestration efficiency in high-frequency VM environments |
Q | Lower Tcase SKUs, targeted towards liquid cooling |
S | Storage-optimized SKU with full accelerators enabled (DSA, QAT, DLB) |
T | Support for up to 10-year reliability and support for higher Tcase. These SKUs are often used in operating environments with long-life use requirements and require Network Equipment Building System (NEBS)–Thermal friendly specification support |
U | Supported in one-socket configurations only |
V | Processor specifications optimized for SaaS cloud environments |
Y | |
+ | Feature Plus (+) SKU with one of each accelerator enabled (DSA, DLB, QAT, IAA) |
Intel’s 4th Gen Xeon Acceleration Engines
DSA “Data Streaming Accelerator”
Intel® DSA is a high-performance data copy and transformation accelerator integrated in 4th Gen Intel® Xeon® processors. It targets the streaming data movement and transformation operations common in high-performance storage, networking, persistent memory, and various data processing applications.
IAA “In-Memory Analytics Accelerator”
The Intel® In‐Memory Analytics Accelerator (Intel® IAA) is a hardware accelerator that provides very high throughput compression and decompression combined with primitive analytic functions.
QAT “Quick Assist Technology”
Intel Quick Assist Technology is a high-performance data security and compression acceleration solution provided by Intel. It offloads symmetric and asymmetric encryption computations, DEFLATE lossless compression, and other computation-intensive tasks from the CPU cores to the QAT hardware, lowering CPU utilization and raising overall platform performance.
DLB “Dynamic Load Balancer”
Intel® DLB is a Peripheral Component Interconnect Express (PCIe) device that provides load-balanced, prioritized scheduling of events (packets) across CPU cores/threads, enabling efficient core-to-core communication. It is a hardware accelerator located inside the latest Intel® Xeon® CPUs offered by Intel. Under the hood, Intel® DLB is a hardware managed system of queues and arbiters connecting producers and consumers.
List of Xeon Gen 4 SKUs and Accelerator Engine Support
The following chart illustrates Xeon Gen 4 CPUs and the quantity of built-in Accelerator Engines featured on each SKU.
Related Documents
Intel 4th Gen Xeon featuring QAT 2.0 Technology Delivers Massive Performance Uplift in Common Cipher Suites
Sat, 27 Apr 2024 15:07:09 -0000
Intel QAT Hardware v2.0 acceleration running on 16G PowerEdge delivers on performance for ISPs - Lab Tested and Proven
Introduction
The Internet as we know it would simply not be possible without encryption technologies. This technology lets us perform secure communication and information exchange over public networks. If you buy a pair of shoes from an online retailer, the payment information you provide is encrypted with such a high level of security that extracting your credit card information from ciphertext would be nearly an impossible task for even a supercomputer. The shoes might not end up fitting, but if the requisite encryption and secure communication tech is properly implemented, your payment information remains a secret known only to you and the entity receiving payment.
This domain of security requires hardware that is up to the task of performing handshakes, key exchanges, and other algorithmic tasks at an expeditious speed.
As we’ll demonstrate through extensive testing and proven results in our lab, Intel’s QAT 2.0 Hardware Accelerator featured on 4th Gen Xeon processors is a performant and developer-friendly choice to supercharge your encryption workloads. This feature is readily available on our current products across the PowerEdge Server portfolio.
What is QAT?
QAT, or “Quick Assist Technology,” is an Intel technology that accelerates two common use cases: encryption and compression/decompression. In this tech note, we look at the encryption side of the QAT feature set and explore leveraging QAT to speed up cipher suites used in deployments of OpenSSL, a common software library used by a vast array of websites and applications to secure their communications.
But before we start, let’s briefly touch on the lineage and history of QAT. QAT was introduced back in 2007, initially available as a discrete add-in PCIe card. A little further on in its evolution, QAT found a home in Intel chipsets. Now, with the introduction of the 4th Gen Xeon processor, the silicon required to enable QAT acceleration has been added to the SoC. Placing the hardware this close to the processor has increased performance and reduced the logistical complexity of having to source and manage an external device.
For a complete list of the QAT Hardware v2.0’s cryptosystem and algorithms support, see: https://github.com/intel/QAT_Engine/blob/master/docs/features.md#qat_hw-features
QAT hardware acceleration may not be the fastest method for every cipher or algorithm. With this in mind, QAT Hardware Acceleration (also called QAT_HW) can co-exist with QAT Software Acceleration (or QAT_SW). This configuration, while somewhat complex, is well supported by clear documentation. Fundamentally, it aims to extract the maximum performance for all inputs given the resources available on the system: an algorithm bitmap is used to dynamically choose between and prioritize QAT_HW and QAT_SW based on hardware availability and on which method offers the best performance.
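The bitmap idea can be sketched in a few lines of Python. This is only an illustration of the selection logic described above: the bit assignments, the `Algo` names, and the example policy are hypothetical and do not reflect the QAT Engine's actual bitmap layout or defaults.

```python
from enum import IntFlag


class Algo(IntFlag):
    """Hypothetical bit assignments; the real QAT Engine layout may differ."""
    RSA = 0x1
    ECDSA_P256 = 0x2
    ECDSA_P384 = 0x4
    AES_GCM = 0x8


def choose_backend(algo, hw_bitmap, sw_bitmap, hw_available):
    """Pick a backend for one algorithm: QAT_HW first, then QAT_SW, then default."""
    if hw_available and (hw_bitmap & algo):
        return "QAT_HW"
    if sw_bitmap & algo:
        return "QAT_SW"
    return "default"


# Illustrative policy only: public-key work prefers hardware,
# while AES-GCM (and RSA as a fallback) may use the software path.
HW_BITMAP = Algo.RSA | Algo.ECDSA_P256 | Algo.ECDSA_P384
SW_BITMAP = Algo.AES_GCM | Algo.RSA

if __name__ == "__main__":
    for algo in Algo:
        print(algo.name, choose_backend(algo, HW_BITMAP, SW_BITMAP, hw_available=True))
```

With `hw_available=False`, RSA in this sketch falls back to QAT_SW and ECDSA P-384 falls through to the default engine, which is the kind of per-algorithm prioritization the bitmap makes possible.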
Next we'll look at setting up QATlib and see what the performance looks like using OpenSSL Speed and a few common cipher suites.
Lab Test Setup and Notes
For this test we use a Dell PowerEdge R760. This is Dell’s mainstream 2U dual-socket 4th Gen Xeon offering and features support for nearly all of Intel’s QAT-enabled CPUs. 4th Gen Xeon CPUs that feature on-chip QAT HW 2.0 have 1, 2, or 4 QAT endpoints per socket. We selected the Intel(R) Xeon(R) Gold 5420+ CPU, which features 1 QAT endpoint, for our testing. All else being equal, more endpoints allow more QAT hardware acceleration work to be done and deliver greater performance per socket in QAT HW accelerated use cases.
This is not a deployment guide, so we keep the environment simple: RHEL 9.2 as the operating system, running bare metal for our tests. Our primary resource for setting up QAT Hardware Version 2.0 acceleration is the excellent QAT documentation found on Intel’s GitHub here: https://intel.github.io/quickassist/index.html
Following the guide, we can simply install from RPM sources, ensure the kernel drivers are loaded, and we’re about ready to go.
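As a minimal sketch of the "ensure kernel drivers are loaded" step, the Python helper below parses `/proc/modules`-style text and reports which expected modules are missing. The module names `intel_qat` and `qat_4xxx` are an assumption on our part (they are the upstream Linux driver names we would expect for 4th Gen Xeon QAT); confirm the names for your kernel against Intel's guide.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def missing_qat_modules(proc_modules_text, required=("intel_qat", "qat_4xxx")):
    """List required modules that are not loaded (names are assumptions)."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in required if name not in loaded]


def read_proc_modules(path="/proc/modules"):
    """Read the kernel's module list; only meaningful on a Linux host."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

On the test host, `missing_qat_modules(read_proc_modules())` returning an empty list would suggest the drivers are in place.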
Performance
First up, we’ll take a look at probably the most common public-key (asymmetric) cipher suite, RSA. On the Internet, RSA finds its home as a key exchange and signature method used to secure communication and confirm identities. In these graphs we compare the speed of RSA Sign and Verify operations using asynchronous QAT_HW against synchronous operation with QAT off (using OpenSSL’s default engine).
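For readers who want to try a comparison of this kind, here is a hedged sketch that only builds the two `openssl speed` command lines (synchronous default engine versus asynchronous QAT engine, as described in the legal disclosures). The engine id `qatengine` follows Intel's QAT Engine documentation; the RSA key size, the `async_jobs` value, and the helper name are illustrative assumptions, since the exact parameters used in the lab are not stated here.

```python
import shlex


def speed_command(algorithm, use_qat, async_jobs=64, openssl="openssl"):
    """Build an `openssl speed` argument list for one algorithm.

    async_jobs=64 is an arbitrary illustrative value, not the lab setting.
    """
    cmd = [openssl, "speed", "-elapsed"]
    if use_qat:
        # Asynchronous run through the QAT engine.
        cmd += ["-engine", "qatengine", "-async_jobs", str(async_jobs)]
    cmd.append(algorithm)
    return cmd


if __name__ == "__main__":
    # rsa2048 is an assumed key size; ecdsap384 matches the P-384 curve.
    for algo in ("rsa2048", "ecdsap384"):
        print(shlex.join(speed_command(algo, use_qat=False)))
        print(shlex.join(speed_command(algo, use_qat=True)))
```

The printed commands can be reviewed and run by hand on a system where the QAT engine is installed.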
The following graphic shows a representation of a TLS handshake. This provides a bit of context concerning the role of the server in key exchange and handshakes.
Greater than 240% performance increase in OpenSSL RSA Verify using QAT Hardware Acceleration Engine vs Default OpenSSL Engine.(1)
Testing in our labs shows that enabling QAT delivers a greater than 240% increase in RSA Verify operations compared with the default OpenSSL engine. This improvement could allow greater security capacity per node without the risk of a negative impact on QoS.
Next we’ll look at the industry-standard Elliptic Curve Digital Signature Algorithm (ECDSA), specifically P-384. QAT HW supports both P-256 and P-384, with both offering exceptional performance vs the default OpenSSL engine. ECDSA is commonly used by many Internet messaging apps to sign and authenticate key exchanges.
ECDSA example
Over 30x improvement in ECDSA P-384 Sign in OpenSSL using QAT Hardware Acceleration Engine vs Default OpenSSL Engine(2)
Both RSA and ECDSA provide the level of protection that today’s server security specialists require, although they differ in many respects, including their underlying mathematics and key sizes.
This vast performance improvement in secure key exchange offers more secure and uncompromised communication without degrading performance.
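The RSA and ECDSA headlines are expressed in different units (a percentage increase versus a multiple), so a quick conversion helps when comparing them. The sketch below assumes "240% increase" means 240% on top of the baseline and "30x" means thirty times the baseline throughput; both readings are our interpretation of the figures.

```python
def percent_increase_to_ratio(percent):
    """Convert 'N% increase' into a throughput ratio (new / baseline)."""
    return 1.0 + percent / 100.0


def ratio_to_percent_increase(ratio):
    """Convert a throughput ratio (new / baseline) into 'N% increase'."""
    return (ratio - 1.0) * 100.0


if __name__ == "__main__":
    print(f"240% increase is a ratio of {percent_increase_to_ratio(240):.1f}x")
    print(f"30x is a {ratio_to_percent_increase(30):.0f}% increase")
```

Under those assumptions, the RSA Verify result corresponds to roughly 3.4 times the baseline rate, and the ECDSA P-384 Sign result to an increase of about 2,900%.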
Conclusion
Intel’s QAT 2.0 Hardware acceleration offers substantial performance improvements for algorithms found in commonly used cipher suites. Also, QAT’s ample documentation and long history of use, coupled with these new findings on performance, should remove any reservations that customers might have about deploying these security accelerators. Security at the server silicon level is critical to a modern and uncompromised data center. There is definite value in deploying QAT, and customers have a clear path towards realizing accelerated performance in their data center environments.
Legal disclosures
(1) Based on August 2023 Dell labs testing subjecting the PowerEdge R760 to OpenSSL Speed test running synchronously with default engine vs asynchronous with QAT Hardware Engine. Actual results will vary.
(2) Based on August 2023 Dell labs testing subjecting the PowerEdge R760 to OpenSSL Speed test running synchronously with default engine vs asynchronous with QAT Hardware Engine. Actual results will vary.
Next-Generation Dell PowerEdge XR Server CPU Improvements
Fri, 03 Mar 2023 19:57:24 -0000
Summary
Dell Technologies has recently introduced the next generation of Dell PowerEdge XR servers. Powered by 4th Gen Intel® Xeon® Scalable processors with the MCC SKU stack, these servers deliver advanced performance in an energy-efficient design. Dell continues to provide scalability and flexibility with its latest portfolio of short-depth XR servers. These servers integrate technologies such as 4th Gen Intel CPUs, PCIe Gen5, DDR5, NVMe drives, and GPU slots, and they are compliance-tested for NEBS and MIL-STD.
This tech note discusses our CPU performance benchmark testing of the next-generation PowerEdge XR server portfolio and the test results that show improvements over previous PowerEdge XR servers powered by 3rd Gen Intel Xeon Scalable processors and Xeon D processors.
Benchmarks
4th Gen Intel Xeon Scalable processors with the MCC SKU stack were tested using the STREAM and HPL benchmarks and compared with the CPU of the previous generation of XR servers.
STREAM
The STREAM benchmark is a simple, synthetic benchmark designed to measure sustainable memory bandwidth (in MB/s) and a corresponding computation rate for four simple vector kernels: Copy, Scale, Add, and Triad. The STREAM benchmark is designed to work with datasets much larger than the available cache on any system so that the results are (presumably) more indicative of the performance of very large, vector-style applications. Ultimately, we get a reference for compute performance.
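To make the four kernels concrete, here is a minimal pure-Python sketch of Copy, Scale, Add, and Triad with STREAM-style byte counting (two arrays touched for Copy and Scale, three for Add and Triad). It is not the STREAM benchmark itself: interpreter overhead dominates, the default array size is far too small to defeat caches, and the function name and parameters are our own.

```python
import time
from array import array


def stream_kernels(n=200_000, scalar=3.0):
    """Run each kernel once over n doubles and return MB/s per kernel."""
    a = array("d", [1.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [0.0]) * n
    word = a.itemsize  # bytes per double (8 on common platforms)

    def copy():
        for i in range(n):
            c[i] = a[i]

    def scale():
        for i in range(n):
            b[i] = scalar * c[i]

    def add():
        for i in range(n):
            c[i] = a[i] + b[i]

    def triad():
        for i in range(n):
            a[i] = b[i] + scalar * c[i]

    # (kernel, number of arrays read or written per iteration)
    kernels = [("Copy", copy, 2), ("Scale", scale, 2), ("Add", add, 3), ("Triad", triad, 3)]
    rates = {}
    for name, body, arrays_touched in kernels:
        start = time.perf_counter()
        body()
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
        rates[name] = arrays_touched * word * n / elapsed / 1e6  # MB/s
    return rates


if __name__ == "__main__":
    for kernel, mbps in stream_kernels().items():
        print(f"{kernel:6s} {mbps:10.1f} MB/s")
```

The real benchmark repeats each kernel several times over much larger arrays and reports the best rate, typically from compiled, multithreaded code.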
HPL
HPL is a high-performance LINPACK benchmark implementation. The code solves a uniformly random system of linear equations and reports time and floating-point operations per second using a standard formula for operation count. It also helps to provide a reference for a system’s compute performance.
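As a small worked example of how a rate falls out of an operation count, the sketch below uses the count commonly quoted for LU-based LINPACK runs, 2/3·n³ + 3/2·n² floating-point operations for an n×n system. The text above does not spell out its formula, so treat this count and the helper names as assumptions.

```python
def hpl_flop_count(n):
    """Nominal operation count for solving an n x n dense linear system."""
    return (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2


def hpl_gflops(n, seconds):
    """Reported rate in GFLOP/s for problem size n solved in `seconds`."""
    return hpl_flop_count(n) / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical run: problem size 40,000 solved in 60 seconds.
    print(f"{hpl_gflops(40_000, 60.0):.1f} GFLOP/s")
```

Because the count is fixed by the problem size, halving the solve time for the same n doubles the reported rate.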
Performance results
Benchmark testing showed significant performance increases with the 4th Gen Intel Xeon Scalable MCC SKU stack when it was compared with both the Intel Xeon D SKU and the 3rd Gen Intel Xeon Scalable MCC SKU.
Comparison of 4th Gen Intel Xeon Scalable MCC SKU with Intel Xeon D SKU
In our tests, the single-socket PowerEdge XR servers with the 4th Gen Intel Xeon Scalable CPU (32 core) MCC SKU stack delivered a 253 percent increase in HPL performance and a 182 percent increase in STREAM performance. Thus, these servers are faster at the network edge or enterprise edge than the previous-generation PowerEdge XR servers powered by the Intel Xeon D (16 core) SKU.
Figure 1 and Figure 2 show the results of the benchmark tests that compared the performance of the 4th Gen Intel Xeon Scalable processor MCC SKU stack with the Intel Xeon D SKU.
Figure 1. HPL performance comparison: Intel Xeon D SKU and 4th Gen Intel Xeon Scalable MCC SKU
Figure 2. STREAM performance comparison: Intel Xeon D SKU and 4th Gen Intel Xeon Scalable MCC SKU
Comparison of 4th Gen and 3rd Gen Intel Xeon Scalable MCC SKU
In our tests, the single-socket PowerEdge XR servers with the 4th Gen Intel Xeon Scalable CPU (32 core) MCC SKU stack delivered a 52 percent increase in STREAM performance and a 72 percent increase in CPU FP rate base performance (floating point performance for the CPU). Thus, these servers are faster for compute at the network edge or enterprise edge than the previous generation of PowerEdge XR servers powered by the 3rd Gen Intel Xeon Scalable MCC SKU.
Figure 3 and Figure 4 show the results of the benchmark tests that compared the performance of the 4th Gen and 3rd Gen Intel Xeon Scalable processor MCC SKU stacks.
Figure 3. STREAM performance for 4th and 3rd Gen Intel Xeon Scalable processors
Figure 4. CPU FP rate base performance for 4th and 3rd Gen Intel Xeon Scalable processors
Conclusion
The Dell PowerEdge XR portfolio continues to provide CPU-based improvements and a streamlined approach for various edge and telecom deployment options. The XR portfolio provides a solution to the challenge of needing a small form factor at the edge with industry-standard rugged certifications (NEBS). It provides a compact solution for scalability along with flexibility for operating in temperatures ranging from –5°C to +55°C.
References