
The PowerEdge XE8545: Performance Summary
Mon, 16 Jan 2023 19:50:52 -0000
AI Infrastructure for Computing Without Compromise
Summary
This document is a brief summary of the performance advantages that customers can gain when using the PowerEdge XE8545 acceleration server. All performance figures and characteristics discussed are based on performance characterization conducted in the Americas Data Center (CET) labs. Results are accurate as of 3/15/2021. Ad Ref #G21000042
Dell’s Latest Acceleration Offering
The PowerEdge XE8545 is Dell EMC’s response to the needs of high-performance machine learning customers who immediately want all the innovation and horsepower provided by the latest NVIDIA GPU technology, without the need to make major cooling-related changes to their data center. Its air-cooled design delivers four A100 40GB/400W GPUs with a low-latency, switchless SXM4 NVLink interconnect, while letting the data center maintain an energy-efficient 35°C ambient temperature. It also has an 80GB/500W GPU option that has been shown to deliver 13-15% higher performance than 400W GPUs at only a slightly lower ambient input temperature (28°C).
500W Advantage Delivers More Performance
- The XE8545 can unleash 500W power with its 80GB GPU to outperform any 400W-based competition by 13-15%
Unlike competitors, Dell worked with NVIDIA early in the design process to ensure that the XE8545 could run at 500 watts of power when using the high-capacity 80GB A100 GPUs, and still be air-cooled. This 80GB/500W GPU option allows the XE8545 to drive each GPU harder and derive more performance from it. Using the common industry benchmark ResNet50 v1.5 model to measure image classification speed with a standard batch size, the 500W GPU took 67.78 minutes to train, compared to 73.32 minutes for the 400W GPU, a 7.56% reduction in training time. When the batch size is doubled, the advantage grows to as much as 13-15%. When speed of results is a customer’s primary concern, the XE8545 can deliver the power needed to get those results faster.
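The standard-batch figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only, and the variable names are ours. It suggests that the 7.56% figure describes the reduction in training time, which corresponds to a throughput gain of a little over 8%.

```python
# Training times quoted above for ResNet50 v1.5 at the standard batch size.
time_500w_min = 67.78  # 80GB/500W A100 option
time_400w_min = 73.32  # 400W A100 option

# Reduction in wall-clock training time, relative to the 400W run.
time_saved_pct = (time_400w_min - time_500w_min) / time_400w_min * 100

# Equivalent gain in throughput (work completed per unit of time).
throughput_gain_pct = (time_400w_min / time_500w_min - 1) * 100

print(f"Training time reduced by {time_saved_pct:.2f}%")
print(f"Throughput increased by {throughput_gain_pct:.2f}%")
```

The two percentages differ because one is measured against the slower run's time and the other against the faster run's time.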
Training – A Generational Leap in Performance
- XE8545 trains ResNet50 image classification to top-quality accuracy in less than HALF (1/2) the time of the previous generation PowerEdge Systems
It is clear from the chart above that an XE8545 with 40GB memory is more than twice as fast as the previous generation C4140 when training an image classification model. In fact, it is faster than two C4140s running in parallel, and the 80GB GPU option is faster still. This is a great illustration of the combined power of the new technologies packed into the XE8545: the latest NVIDIA GPUs, the latest AMD CPUs, and the latest generation of PCIe IO fabric. Further gains in performance can be achieved by workloads that take advantage of improvements in how the A100 performs the matrix multiplication involved in machine learning, specifically by better accounting for “sparsity”: the occurrence of many zeros in a matrix, which previously resulted in many time-consuming “multiplying-by-zero” operations that had no effect on the final result.
And as with all operations for the XE8545, it delivers the very top-level performance using only air-cooling. It does not require liquid cooling.
Inference Analysis of Images
- XE8545 can analyze over 150k images per second – that’s 1.46x more images per second on each SXM4 A100 GPU compared to previous generation PowerEdge
Inference tends to scale linearly, as there is no peer-to-peer GPU communication involved, and the XE8545 has proven to have exceptional linear scalability. So, it is not surprising that the XE8545 produces excellent high-performance inference results. As with training, the 80GB/500W A100 GPU has a performance edge: it is 10% faster than the 400W GPU (at a proportional power increase).
MIG - Multi-Instance GPUs - 7X Faster for Inferencing
The innovative Multi-Instance GPU (MIG) technology introduced with the A100 GPU allows the XE8545 to partition each A100 GPU into as many as seven “slices”, each fully isolated with its own high-bandwidth memory, cache, and compute cores. So, if fully utilized, an XE8545 server can be running 28 separate high-performance instances of inferencing. Each of those instances has been determined by NVIDIA to provide performance equivalent to the previous generation V100. So, an A100 GPU can be thought of as 7 times faster than the previous generation, specifically for inferencing, where peer-to-peer communication does not come into play.
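The figure of 28 instances follows directly from the GPU count and the maximum number of MIG slices per GPU. A minimal sketch of that arithmetic, with constant and function names chosen here for illustration:

```python
GPUS_PER_SERVER = 4          # the XE8545 carries four A100 GPUs
MAX_MIG_SLICES_PER_GPU = 7   # each A100 can be split into up to seven slices


def max_mig_instances(gpus: int = GPUS_PER_SERVER,
                      slices_per_gpu: int = MAX_MIG_SLICES_PER_GPU) -> int:
    """Upper bound on isolated inference instances when every GPU is fully partitioned."""
    return gpus * slices_per_gpu


print(max_mig_instances())  # fully partitioned server
```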
NVIDIA Certified - Faster Deployment of Machine Learning Environments
The XE8545 has undergone NVIDIA’s comprehensive certification program for Datacenter AI: NVIDIA GPU Cloud (NGC). It is now certified to run at the latest Gen4 networking speeds and can take advantage of the NGC catalog that hosts frameworks and containers for the top AI, ML, and HPC software, already tuned, tested, and optimized. With NGC certification, data centers can quickly and easily deploy machine learning environments with confidence and get results faster. More details are available in NVIDIA’s information on NVIDIA-certified systems.
A New Server - New Technologies - New Levels of Accelerated Performance
The PowerEdge XE8545 introduces the latest industry technologies in a combination that delivers the kind of high-performance, accelerated computing that can handle even the most demanding Artificial Intelligence and Machine Learning workloads or scientific high-performance computing analysis. It provides the highest levels of power and performance in an air-cooled environment, simplifying operational continuity in enterprise data centers.
Related Documents

Dell PowerEdge MX760c servers handle 19.7% more SQL Server work and support 25% more VMs
Thu, 16 Mar 2023 17:12:23 -0000
Principled Technologies tested a VMware cluster of new 16th generation Dell PowerEdge MX servers. The cluster processed more OLTP orders per minute and supported higher VM density than a similarly configured cluster of 15th generation MX750c servers.
Figure 1: Performance of the previous-generation cluster with 24 VMs and the new-generation cluster with 30 VMs. Higher OPM and lower latency are better.
A sample OLTP database workload was used in the test because OLTP plays a crucial role in many digital business processes, such as retail transactions, inventory tracking, customer relationship management, and other business operations. Based on DVD Store 3 running on Microsoft SQL Server 2019 and hosted in multiple virtual machines, the workload testing targeted the maximum orders per minute (OPM) each cluster could achieve. This was done by increasing the thread count and decreasing the think time until performance degraded. The testing found that customers can accelerate existing virtual machines through a simple lift-and-shift, vMotion-type migration to the newer compute sled. In addition, because of the extra performance available from the new MX760c platform, thanks in part to its 4th Generation Intel Xeon Scalable processors, additional workloads could be added while still maintaining the required performance.
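The ramp procedure described above (raise the thread count, lower the think time, stop when performance degrades) can be sketched as a simple loop. This is a hypothetical illustration, not the harness Principled Technologies used: `find_peak_opm`, `fake_measure`, and the OPM numbers in `FAKE_RESULTS` are all invented stand-ins for a real benchmark run.

```python
from typing import Callable, Iterable, Optional, Tuple

LoadStep = Tuple[int, float]  # (thread count, think time in seconds)


def find_peak_opm(
    measure_opm: Callable[[int, float], float],
    load_steps: Iterable[LoadStep],
) -> Tuple[int, float, float]:
    """Apply load steps in order of increasing pressure and return the
    (threads, think_time, opm) of the last step before OPM stopped rising."""
    best: Optional[Tuple[int, float, float]] = None
    for threads, think_time in load_steps:
        opm = measure_opm(threads, think_time)
        if best is not None and opm <= best[2]:
            break  # performance degraded or plateaued: stop ramping
        best = (threads, think_time, opm)
    if best is None:
        raise ValueError("at least one load step is required")
    return best


# Synthetic stand-in for a real benchmark run; these OPM values are made up.
FAKE_RESULTS = {(8, 0.05): 1000.0, (16, 0.03): 1800.0,
                (32, 0.02): 2100.0, (64, 0.01): 1900.0}


def fake_measure(threads: int, think_time: float) -> float:
    return FAKE_RESULTS[(threads, think_time)]


peak = find_peak_opm(fake_measure, list(FAKE_RESULTS))
print("Peak load step:", peak)

# VM density gain reported for the new cluster: 30 VMs versus 24 VMs.
vm_density_gain_pct = (30 - 24) / 24 * 100
print(f"VM density gain: {vm_density_gain_pct:.0f}%")
```

The last two lines reproduce the 25% VM density figure in the article title from the 24-VM and 30-VM cluster sizes.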
Figure 2: Total orders per minute the two clusters achieved with 24 VMs.
Figure 3: Total orders per minute the two clusters achieved with 30 VMs.
Regarding workloads, new-generation Dell PowerEdge MX760c servers can offer compelling performance gains. 16th Generation Dell PowerEdge MX760c servers can simultaneously provide increased workload performance and VM density.

BIOS Settings for Optimized Performance on Next-Generation Dell PowerEdge Servers
Fri, 03 Mar 2023 16:41:27 -0000
Summary
Dell PowerEdge servers provide a wide range of tunable parameters to allow customers to achieve top performance. The information in this paper outlines the tunable parameters available in the latest generation of PowerEdge servers (for example, R660, R760, MX760, and C6620) and provides recommended settings for different workloads.
Figure 1. PowerEdge R660
Figure 2. PowerEdge R760
The following tables provide the BIOS setting recommendations for the latest generation of PowerEdge servers.
Table 1. BIOS setting recommendations—System profile settings
System setup screen | Setting | Default | Recommended setting for performance | Recommended setting for low latency, Stream, and MLC environments | Recommended |
System profile settings | System Profile | Performance Per Watt [1] | Performance Optimized | First select Performance Optimized and then select Custom [1] | Custom |
System profile settings | CPU Power Management | System DBPM | Maximum Performance | Maximum Performance | Maximum Performance |
System profile settings | Memory Frequency | Maximum Performance | Maximum Performance | Maximum Performance | Maximum Performance |
System profile settings | Turbo Boost [2] | Enabled | Enabled | Enabled | Enabled |
System profile settings | C1E | Enabled | Disabled | Disabled | Disabled |
System profile settings | C States | Enabled | Disabled | Disabled | Autonomous or Disabled [6] |
System profile settings | Monitor/Mwait | Enabled | Enabled | Disabled [3] | Enabled |
System profile settings | Memory Patrol Scrub | Standard | Standard [4] | Standard/Disabled [4] | Disabled |
System profile settings | Memory Refresh Rate | 1x | 1x | 1x | 1x |
System profile settings | Uncore Frequency | Dynamic | Maximum [5] | Maximum [5] | Dynamic |
System profile settings | Energy Efficient Policy | Balanced Performance | Performance | Performance | Performance |
System profile settings | CPU Interconnect Bus Link Power Management | Enabled | Disabled | Disabled | Disabled |
System profile settings | PCI ASPM L1 Link Power Management | Enabled | Disabled | Disabled | Disabled |
[1] Depends on how the system was ordered. Other System Profile defaults are driven by this choice and may differ from the examples listed. Select Performance Profile first, and then select Custom to load optimal profile defaults for further modification.
[2] SST Turbo Boost Technology is substantially better than previous generations for latency-sensitive environments, but specific Turbo residency cannot be guaranteed under all workload conditions. Evaluate Turbo Boost Technology in your own environment to choose which setting is most appropriate for your workload, and consider the Dell Controlled Turbo option in parallel.
[3] Monitor/Mwait should only be disabled in parallel with disabling Logical Processor. This will prevent the Linux intel_idle driver from enforcing C-states.
[4] You can test your own environment to determine whether disabling Memory Patrol Scrub is helpful.
[5] Dynamic selection can provide more TDP headroom at the expense of dynamic uncore frequency. Optimal setting is workload dependent.
[6] Autonomous on air-cooled systems or Disabled on liquid-cooled systems.
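One way to work with Table 1 is to treat a column as a target and list only the settings that differ from what a system currently has. The sketch below does this for the "Default" and "Recommended setting for performance" columns. It is illustrative only: the dictionary keys are the display labels from the table, not firmware attribute names, the function name is hypothetical, and actual defaults depend on how the system was ordered (see note [1]).

```python
from typing import Dict, Tuple

# Values copied from Table 1; keys are table labels, not BIOS attribute names.
TABLE1_DEFAULTS: Dict[str, str] = {
    "System Profile": "Performance Per Watt",
    "CPU Power Management": "System DBPM",
    "Memory Frequency": "Maximum Performance",
    "Turbo Boost": "Enabled",
    "C1E": "Enabled",
    "C States": "Enabled",
    "Monitor/Mwait": "Enabled",
    "Memory Patrol Scrub": "Standard",
    "Memory Refresh Rate": "1x",
    "Uncore Frequency": "Dynamic",
    "Energy Efficient Policy": "Balanced Performance",
    "CPU Interconnect Bus Link Power Management": "Enabled",
    "PCI ASPM L1 Link Power Management": "Enabled",
}

TABLE1_PERFORMANCE: Dict[str, str] = {
    "System Profile": "Performance Optimized",
    "CPU Power Management": "Maximum Performance",
    "Memory Frequency": "Maximum Performance",
    "Turbo Boost": "Enabled",
    "C1E": "Disabled",
    "C States": "Disabled",
    "Monitor/Mwait": "Enabled",
    "Memory Patrol Scrub": "Standard",
    "Memory Refresh Rate": "1x",
    "Uncore Frequency": "Maximum",
    "Energy Efficient Policy": "Performance",
    "CPU Interconnect Bus Link Power Management": "Disabled",
    "PCI ASPM L1 Link Power Management": "Disabled",
}


def settings_to_change(current: Dict[str, str],
                       recommended: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {setting: (current value, recommended value)} for every mismatch."""
    return {
        name: (current.get(name, "<unknown>"), wanted)
        for name, wanted in recommended.items()
        if current.get(name) != wanted
    }


changes = settings_to_change(TABLE1_DEFAULTS, TABLE1_PERFORMANCE)
for name, (have, want) in changes.items():
    print(f"{name}: {have} -> {want}")
```

Keeping the table as data makes it straightforward to add the other recommendation columns as further dictionaries and to review a change list before touching a system.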
Table 2. BIOS setting recommendations—Memory, processor, and iDRAC settings
System setup screen | Setting | Default | Recommended setting for performance | Recommended setting for low latency, Stream, and MLC environments | Recommended |
Memory settings | Memory Operating Mode | Optimizer | Optimizer [1] | Optimizer [1] | Optimizer [1] |
Memory settings | Memory Node Interleave | Disabled | Disabled | Disabled | Disabled |
Memory settings | DIMM Self Healing | Enabled | Disabled | Disabled | Disabled |
Memory settings | ADDDC setting | Disabled [2] | Disabled [2] | Disabled [2] | Disabled [2] |
Memory settings | Memory Training | Fast | Fast | Fast | Fast |
Memory settings | Correctable Error Logging | Enabled | Disabled | Disabled | Disabled |
Processor settings | Logical Processor | Enabled | Disabled [3] | Disabled [3] | Enabled |
Processor settings | Virtualization Technology | Enabled | Disabled | Disabled | Disabled |
Processor settings | CPU Interconnect Speed | Maximum Data Rate | Maximum Data Rate | Maximum Data Rate | Maximum Data Rate |
Processor settings | Adjacent Cache Line Prefetch | Enabled | Enabled | Enabled | Enabled |
Processor settings | Hardware Prefetcher | Enabled | Enabled | Enabled | Enabled |
Processor settings | DCU Streamer Prefetcher | Enabled | Enabled | Disabled | Disabled |
Processor settings | DCU IP Prefetcher | Enabled | Enabled | Enabled | Enabled |
Processor settings | Sub NUMA Cluster | Disabled | SNC 2 | SNC 4 on XCC; SNC 2 on MCC | SNC 4 on XCC; SNC 2 on MCC |
Processor settings | UPI Prefetch | Enabled | Enabled | Enabled | Enabled |
Processor settings | Dell Controlled Turbo | Disabled | Disabled | Enabled [4] | Disabled |
Processor settings | Dell Controlled Turbo Optimizer mode | Disabled | Enabled [5] | Enabled [5] | Enabled [5] |
Processor settings | XPT Prefetch | Enabled | Disabled | Disabled | Enabled |
Processor settings | UPI Prefetch | Enabled | Disabled | Disabled | Enabled |
Processor settings | LLC Prefetch | Disabled | Enabled | Disabled | Disabled |
Processor settings | DeadLine LLC Alloc | Enabled | Enabled | Enabled | Disabled |
Processor settings | Directory AtoS | Disabled | Disabled | Disabled | Disabled |
Processor settings | Dynamic SST Perf Profile | Disabled | Disabled | Enabled | Disabled |
Processor settings | SST-Perf-profile | Operating Point 1 | Operating Point 1 | Operating Point ? [6] | Operating Point 1 |
iDRAC settings | Thermal Profile | Default | Maximum Performance | Maximum Performance | Maximum Performance |
[1] Use Optimizer Mode for memory-bandwidth-sensitive workloads. Fault Resilient Mode can reduce memory bandwidth by up to 33%.
[2] Only available when x4 DIMMs are installed in the system.
[3] Logical Processor (Hyper-Threading) tends to benefit throughput-oriented workloads such as SPEC CPU2017 INT and FP_RATE. Many HPC workloads disable this option. It benefits SPEC FP_rate only if the thread count scales to the total logical processor count.
[4] Dell Controlled Turbo helps to keep core frequency at the maximum all-cores Turbo frequency, which reduces jitter. Disable it if Turbo is disabled.
[5] Option is available on liquid-cooled systems only.
[6] Depends on whether your program is sensitive to Base and Turbo frequency. Changing the operating point can reduce the CPU core count in exchange for higher Base and Turbo frequencies.
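The Sub NUMA Cluster row in Table 2 changes how many NUMA domains the operating system sees, which matters when pinning workloads. The sketch below estimates that count. It assumes that "SNC 2" and "SNC 4" create two and four clusters per socket respectively, and that each cluster appears as its own NUMA domain. The function and mapping names are ours.

```python
# Assumption: SNC n splits each socket into n clusters, each its own NUMA domain.
SNC_CLUSTERS_PER_SOCKET = {"Disabled": 1, "SNC 2": 2, "SNC 4": 4}


def numa_domains(sockets: int, snc_mode: str) -> int:
    """Estimated number of NUMA domains exposed to the operating system."""
    if sockets < 1:
        raise ValueError("sockets must be at least 1")
    return sockets * SNC_CLUSTERS_PER_SOCKET[snc_mode]


for mode in SNC_CLUSTERS_PER_SOCKET:
    print(f"2 sockets, {mode}: {numa_domains(2, mode)} NUMA domains")
```

On a NUMA-aware operating system, the result can be compared against the node count the OS actually reports before tuning thread or memory placement.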
iDRAC recommendations
- In thermally challenged environments, increase fan speed through the iDRAC Thermal section.
- Remove all power capping in performance-sensitive environments.
BIOS settings glossary
- System Profile: (Default=Performance Per Watt)—It can be difficult to set each individual power/performance feature for a specific environment. Because of this, a menu option is provided that can help a customer optimize the system for things such as minimum power usage/acoustic levels, maximum efficiency, Energy Star optimization, or maximum performance.
- Performance Per Watt DAPC (Dell Advanced Power Control)—This mode uses Dell presets to maximize the performance/watt efficiency with a bias towards power savings. It provides the best features for reducing power and increasing performance in applications where maximum bus speeds are not critical. It is expected that this will be the favored mode for SPECpower testing. "Efficiency–Favor Power" mode maintains backwards compatibility with systems that included the preset operating modes before Energy Star for servers was released.
- Performance Per Watt OS—This mode optimizes the performance/watt efficiency with a bias towards performance. It is the favored mode for Energy Star. Note that this mode is slightly different than "Performance Per Watt DAPC" mode. In this mode, no bus speeds are derated as they are in the Performance Per Watt DAPC mode, leaving the operating system in control of those changes.
- Performance—This mode maximizes the absolute performance of the system without regard for power. In this mode, power consumption is not considered. Things like fan speed and heat output of the system, in addition to power consumption, might increase. Efficiency of the system might go down in this mode, but the absolute performance might increase depending on the workload that is running.
- Custom—Custom mode allows the user to individually modify any of the low-level settings that are preset and unchangeable in any of the other four preset modes.
- C-States—C-states reduce CPU idle power. There are three options in this mode:
- Enabled: When “Enabled” is selected, the operating system initiates the C-state transitions. Some operating system software might defeat the ACPI mapping (for example, intel_idle driver).
- Autonomous: When "Autonomous" is selected, HALT and C1 requests get converted to C6 requests in hardware.
- Disabled: When "Disabled" is selected, only C0 and C1 are used by the operating system. C1 gets enabled automatically when an OS auto-halts.
- C1 Enhanced Mode—Enabling C1E (C1 enhanced) state can save power by halting CPU cores that are idle.
- Turbo Mode—Enabling turbo mode can boost the overall CPU performance when not all CPU cores are being fully utilized. A CPU core can run above its rated frequency for a short period of time when it is in turbo mode.
- Hyper-Threading—Enabling Hyper-Threading lets the operating system address two virtual or logical cores for each physical core. Workloads can be shared between virtual or logical cores when possible. The main function of hyper-threading is to increase the number of independent instructions in the pipeline so that processor resources are used more efficiently.
- Execute Disable Bit—The execute disable bit allows memory to be marked as executable or non-executable when used with a supporting operating system. This can improve system security by configuring the processor to raise an error to the operating system when code attempts to run in non-executable memory.
- DCA—DCA capable I/O devices such as network controllers can place data directly into the CPU cache, which improves response time.
- Power/Performance Bias—Power/performance bias determines how aggressively the CPU will be power managed and placed into turbo. With "Platform Controlled," the system controls the setting. Selecting "OS Controlled" allows the operating system to control it.
- Per Core P-state—When per-core P-states are enabled, each physical CPU core can operate at separate frequencies. If disabled, all cores in a package will operate at the highest resolved frequency of all active threads.
- CPU Frequency Limits—The maximum turbo frequency can be restricted with turbo limiting to a frequency that is between the maximum turbo frequency and the rated frequency for the CPU installed.
- Energy Efficient Turbo—When energy efficient turbo is enabled, the CPU's optimal turbo frequency will be tuned dynamically based on CPU utilization.
- Uncore Frequency Scaling—When enabled, the CPU uncore will dynamically change speed based on the workload.
- MONITOR/MWAIT—MONITOR/MWAIT instructions are used to engage C-states.
- Sub-NUMA Cluster (SNC)—SNC breaks up the last level cache (LLC) into disjoint clusters based on address range, with each cluster bound to a subset of the memory controllers in the system. SNC improves average latency to the LLC and memory. SNC is a replacement for the cluster on die (COD) feature found in previous processor families. For a multi-socketed system, all SNC clusters are mapped to unique NUMA domains. (See also IMC interleaving.) Values for this BIOS option can be:
- Disabled: The LLC is treated as one cluster when this option is disabled.
- Enabled: Uses LLC capacity more efficiently and reduces latency due to core/IMC proximity. This might provide performance improvement on NUMA-aware operating systems.
- Snoop Preference—Select the appropriate snoop mode based on the workload. There are two snoop modes:
- HS w. Directory + OSB + HitME cache: Best overall for most workloads (default setting)
- Home Snoop: Best for BW sensitive workloads
- XPT Prefetcher—XPT prefetch is a mechanism that enables a read request that is being sent to the last level cache to speculatively issue a copy of that read to the memory controller prefetcher.
- UPI Prefetcher—UPI prefetch is a mechanism to get the memory read started early on DDR bus. The UPI receive path will spawn a memory read to the memory controller prefetcher.
- Patrol Scrub—Patrol scrub is a memory RAS feature that runs a background memory scrub against all DIMMs. This feature can negatively affect performance.
- DCU Streamer Prefetcher—DCU (Level 1 Data Cache) streamer prefetcher is an L1 data cache prefetcher. Lightly threaded applications and some benchmarks can benefit from having the DCU streamer prefetcher enabled. Default setting is Enabled.
- LLC Dead Line Allocation—In some Intel CPU caching schemes, mid-level cache (MLC) evictions are filled into the last level cache (LLC). If a line is evicted from the MLC to the LLC, the core can flag the evicted MLC lines as "dead." This means that the lines are not likely to be read again. This option allows dead lines to be dropped and never fill the LLC if the option is disabled. Values for this BIOS option can be:
- Disabled: Disabling this option can save space in the LLC by never filling MLC dead lines into the LLC.
- Enabled: Opportunistically fill MLC dead lines in LLC, if space is available.
- Adjacent Cache Prefetch—Lightly threaded applications and some benchmarks can benefit from having the adjacent cache line prefetch enabled. Default is Enabled.
- Intel Virtualization Technology—Intel Virtualization Technology allows a platform to run multiple operating systems and applications in independent partitions, so that one computer system can function as multiple virtual systems. Default is Enabled.
- Hardware Prefetcher—Lightly threaded applications and some benchmarks can benefit from having the hardware prefetcher enabled. Default is Enabled.
- Trusted Execution Technology—Enable Intel Trusted Execution Technology (Intel TXT). Default is Disabled.
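The C-States and MONITOR/MWAIT entries above note that operating system software such as the Linux intel_idle driver can influence which C-states are actually used. On Linux, one way to see what the OS ended up with is to read the cpuidle entries under sysfs. The sketch below is a minimal example that assumes the standard Linux cpuidle sysfs layout. The function name is ours, and the path is a parameter so the code can be pointed at a test directory.

```python
from pathlib import Path
from typing import Dict, List, Optional, Union


def cpuidle_summary(
    sysfs_cpu: str = "/sys/devices/system/cpu",
) -> Dict[str, Union[Optional[str], List[str]]]:
    """Report the active cpuidle driver and the idle states listed for cpu0.

    Missing files are tolerated: the driver is None and the state list is
    empty when the cpuidle entries are absent.
    """
    root = Path(sysfs_cpu)

    driver_file = root / "cpuidle" / "current_driver"
    driver = driver_file.read_text().strip() if driver_file.is_file() else None

    states: List[str] = []
    state_root = root / "cpu0" / "cpuidle"
    if state_root.is_dir():
        # Keep only stateN directories and sort them numerically (state2 < state10).
        state_dirs = [
            p for p in state_root.iterdir()
            if p.name.startswith("state") and p.name[5:].isdigit()
        ]
        for state_dir in sorted(state_dirs, key=lambda p: int(p.name[5:])):
            name_file = state_dir / "name"
            if name_file.is_file():
                states.append(name_file.read_text().strip())

    return {"driver": driver, "states": states}
```

Calling `cpuidle_summary()` before and after a BIOS change gives a quick comparison. The exact states reported depend on the kernel, the idle driver, and the BIOS settings in effect.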