Dell Next Generation PowerEdge Servers: Designed with DDR5 to Deliver Future-Ready Bandwidth
Fri, 03 Mar 2023 17:38:38 -0000
Summary
This Direct from Development (DfD) tech note describes the DDR5 memory technology used in Dell’s latest generation PowerEdge Server portfolio. It provides a high-level overview of DDR5, including information about generational performance improvement.
Overview
DDR5 Memory technology is the next big advancement in the world of DRAM Memory and is launching on the latest generation PowerEdge Servers.
DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory) is a type of DRAM that is packaged on a DIMM. DDR means that data is transferred on both the rising and falling edges of the clock signal. SDRAM differs from asynchronous RAM in that it is synchronized to the clock of the processor, and hence to the bus. Today, virtually all SDRAM is manufactured in compliance with standards established by JEDEC, an electronics industry association that adopts open standards to facilitate the interoperability of electronic components. This makes DDR5 an important specification for any standard server.
DDR5 is the fifth major iteration of this standard. Compared to its predecessors, DDR5 provides higher bandwidth and increased bandwidth efficiency.
Core counts grow with every new generation of CPU. DDR4 has reached its limit in terms of memory bandwidth and density: it supports at most 16Gb density and a speed of 3200MT/s. DDR5 addresses this limit by meeting customer needs for greater memory capacity per core and greater bandwidth per core.
DDR5 offers a 50% increase in bandwidth at 4800MT/s, compared to DDR4 at 3200MT/s[1]. It also supports densities of up to 32Gb (a density that is not available in the latest PowerEdge generation launch), compared to 16Gb in the previous generation. DDR5 also offers 2x the burst length, 2x bank groups, 2x banks, Decision Feedback Equalization, two independent 40-bit channels per DIMM, and optimized power management on the DIMM.
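The ratio of the two transfer rates is consistent with the 50% figure above. The following Python snippet is a minimal sketch of that arithmetic, not a reproduction of the lab tests cited in [1]. It assumes 64 data bits per DIMM (for DDR5, two channels of 32 data bits each, with the remaining 8 bits of each 40-bit channel used for ECC); the function names are hypothetical and chosen only for illustration.

```python
# Sketch: theoretical peak bandwidth per DIMM, derived from the transfer rate.
# Assumption (not stated in the tech note): 64 data bits per DIMM, ECC bits excluded.

DATA_BITS_PER_DIMM = 64  # DDR4: one 64-bit channel; DDR5: two 32-bit data channels


def peak_bandwidth_gbs(transfer_rate_mts: int, data_bits: int = DATA_BITS_PER_DIMM) -> float:
    """Return theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = data_bits // 8
    # MT/s * bytes per transfer = MB/s; divide by 1000 to get GB/s.
    return transfer_rate_mts * bytes_per_transfer / 1000


def percent_increase(new: float, old: float) -> float:
    """Return the relative increase of new over old, in percent."""
    return (new - old) / old * 100


ddr4 = peak_bandwidth_gbs(3200)
ddr5 = peak_bandwidth_gbs(4800)
increase = percent_increase(ddr5, ddr4)

print(f"DDR4-3200: {ddr4:.1f} GB/s per DIMM")
print(f"DDR5-4800: {ddr5:.1f} GB/s per DIMM")
print(f"Increase:  {increase:.0f}%")
```

These are theoretical peaks at ideal configurations; sustained bandwidth in a real system depends on the platform and its DIMM population.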
The following table provides information about the latest Dell PowerEdge portfolio for DDR5, including capacity, bandwidth, DIMM type, and Dell part numbers. Note that Dell does not support DIMM capacity mixing on the latest generation. The listed speeds represent maximum bandwidth at ideal configurations. CPU vendors may reduce bandwidth capability based on their respective DIMM population rules. Total system bandwidth is expected to vary between platforms based on population capability, such as on 8 x 1 DPC Intel® CPU-based platforms.
Table 1. Details about the latest Dell PowerEdge portfolio for DDR5
DIMM Capacity (GB) | DIMM Speed (MT/s) | DIMM Type | Dell PN* | Ranks per DIMM | Data Width | Density | Technology |
16 | 4800 | RDIMM | 1V1N1 | 1 | x8 | 16Gb | SDP |
32 | 4800 | RDIMM | W08W9 | 2 | x8 | 16Gb | SDP |
64 | 4800 | RDIMM | J52K5 | 2 | x4 | 16Gb | SDP |
128 | 4800 | RDIMM | MMWR9 | 4 | x4 | 16Gb | 3DS |
256 | 4800 | RDIMM | PCFCR | 8 | x4 | 16Gb | 3DS |
* Part numbers are subject to change. Additional part numbers may be required.
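The capacities in Table 1 follow from the other columns. The sketch below shows one way to check this, assuming 64 data bits per rank and that ECC devices do not count toward usable capacity (both are assumptions, not statements from the table). For the 3DS parts, the rank count is treated as the number of logical ranks. The function name `dimm_capacity_gb` and the `TABLE_1` layout are hypothetical.

```python
# Sketch: derive DIMM capacity from ranks, device data width, and DRAM density.
# Assumption: 64 data bits per rank; ECC devices are excluded from usable capacity.

DATA_BITS_PER_RANK = 64


def dimm_capacity_gb(ranks: int, device_width_bits: int, density_gbit: int) -> int:
    """Return usable DIMM capacity in GB."""
    devices_per_rank = DATA_BITS_PER_RANK // device_width_bits  # x8 -> 8, x4 -> 16
    total_gbit = ranks * devices_per_rank * density_gbit
    return total_gbit // 8  # 8 gigabits per gigabyte


# (listed capacity in GB, ranks per DIMM, data width in bits, density in Gb)
TABLE_1 = [
    (16, 1, 8, 16),
    (32, 2, 8, 16),
    (64, 2, 4, 16),
    (128, 4, 4, 16),
    (256, 8, 4, 16),
]

for listed_gb, ranks, width, density in TABLE_1:
    computed_gb = dimm_capacity_gb(ranks, width, density)
    status = "matches" if computed_gb == listed_gb else "differs"
    print(f"{ranks} rank(s), x{width}, {density}Gb -> {computed_gb} GB ({status} listed {listed_gb} GB)")
```

Under these assumptions, every row of the table is reproduced by the same formula, which makes the relationship between rank count, x4 or x8 devices, and capacity easier to see.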
Dell Customer Experience improvement for PowerEdge Servers
In March 2022, on previous-generation PowerEdge platforms, Dell Technologies began a journey to improve the customer experience related to memory errors. The following key improvements were made at that time and are also included in the latest generation of PowerEdge servers.
- Single-Bit Correctable Error Messaging – This style of the message has been removed. Working with vendor partners across the industry and studying our own field performance, we could find no relationship between correctable error reporting and subsequent uncorrectable errors on the same DIMM in the same system. To avoid concerning alert messaging and potential unnecessary downtime, we have eliminated this messaging.
- Uncorrectable Error Messaging – Previously, after an uncorrectable error, we recommended performing memory self-healing. That is still a recommended action, and it occurs automatically on the next reset after the error is detected. We have now determined that an uncorrectable error causes a loss of confidence in the long-term health of the memory hardware. Customer data on this hardware is always critical, and for that reason we recommend scheduling a replacement as soon as an uncorrectable error is detected, to avoid any doubt about future system health.
- Self-Health Messaging – Previous messaging gave a notification recommending scheduling a reset to perform self-healing. However, customers notified us that we did not give an adequate indication of the urgency of the reset, and it is very costly to take down the server for this action. Upon further consideration, we schedule the self-healing to occur in the background opportunistically on the next reset because the action is typically not urgent and can wait until scheduled downtime. We will no longer send self-heal messaging in logs requesting a customer action for this reason.
- Revise Diagnostic Messaging – Certain benign system events in the prior design would trigger a MEM5100 “OEM Diagnostic Event” message with encoded details. When this occurs frequently in customer logs it can cause concern. What do these messages mean? Should I replace the DIMM? These events do not indicate a degradation of DIMM health or an early indication of DIMM failure, but the messaging was left too ambiguous for customers.
We have updated the language to clearly state the action and intent. For example, such a message might be “An event has been completed successfully in the memory device at <location>. The server and device are operating normally; no action is required.” An extended ID code is then provided for internal teams to reference when required.
The latest generation of PowerEdge improvements
Quality and a premier customer experience with Dell PowerEdge servers continue to be a focus in our latest generation design. Our specific goals are to reduce log chattiness and to give clear, crisp messaging on the health of the memory hardware. With that in mind, we have continued to refine our messaging strategy so that we can act swiftly to identify and diagnose issues without filling customer logs with verbose diagnostic memory error messages. Here are a few additional changes exclusive to the latest PowerEdge server design:
- Debug logging to TSR/Support Assist – We have enabled a new pipeline of diagnostic data that is displayed only in the Support Assist log. This data is collected in real time by our iDRAC BMC but is harvested and logged only when Support Assist is requested. This eliminates the need to log continuous “bread crumb” type information into the SEL and LC logs while maintaining diagnosability when we need it most.
- Enhanced SPD Error Logging – The SPD for DDR5 is significantly bigger than for DDR4. Dell Technologies has a proprietary logging format that resides on each Dell DIMM device. Expanding beyond what was possible in DDR4, we have enhanced logging to include new events such as detail about the health of the on-DIMM PMIC and more robust logging of CPU data when memory errors occur. When problems arise, we understand it can be chaotic and memory could be swapped between systems or labeled in the wrong box by mistake when returning to Dell for diagnosis. This enhanced logging will help us see the history of the DIMM itself. We can identify trends and previous problems to get to a solution quickly.
- Out-of-Band Access Improvement – The information provided by iDRAC has always been available out-of-band, but beginning in this latest generation of PowerEdge server it is also available even when the system is off with the power cord plugged in. This is strategically useful for diagnosing memory health because memory is often one of the most critical components for a successful power-up sequence. What if the system hangs for some reason in the OS? What if you need to keep the system offline due to rack power constraints but you need detail about the health and history of the memory? In the latest PowerEdge servers you can still remote into the iDRAC BMC of the system and collect health and status information while the system is offline.
Figure 1. DDR5 inserted in the Dell PowerEdge Chassis
Conclusion
With improved bandwidth and continuous improvements to the customer experience on memory, all provided in the dense form factor of DDR5, Dell Technologies continues to deliver best-in-class features and specifications for its constantly evolving PowerEdge Server portfolio.
References
[1] These tests were performed in the Solutions and Performance Analysis Lab at Dell Technologies in December 2022.