Hasnain Shabbir

Hasnain Shabbir is a thermal control design and implementation expert with a passion for developing innovative solutions to meet the needs of data center platforms. With extensive experience in power and performance optimization, he helps businesses reduce their carbon footprint and minimize costs while achieving their sustainability goals. His focus on thermal design and validation has enabled Dell Technologies to become a trusted advisor to customers seeking to enhance their operational efficiency.


LinkedIn: www.linkedin.com/in/hshabbir


The Future of Server Cooling - Part 2: New IT hardware Features and Power Trends

Robert Curtis, Chris Peterson, David Moss, David Hardy, Rick Eiland, Tim Shedd, Eric Tunks, Hasnain Shabbir, Todd Mottershead

Fri, 03 Mar 2023 17:21:25 -0000


Summary

Part 1 of this three-part series, titled The Future of Server Cooling, covered the history of server and data center cooling technologies.

Part 2 of this series covers new IT hardware features and power trends with an overview of the cooling solutions that Dell Technologies provides to keep IT infrastructure cool.

Overview

The Future of Server Cooling was written because future generations of PowerEdge servers may require liquid cooling to enable certain CPU or GPU configurations. Our intent is to educate customers about why the transition to liquid cooling may be required, and to prepare them ahead of time for these changes. Integrating liquid cooling solutions on future PowerEdge servers will allow for significant performance gains from new technologies, such as next-generation Intel® Xeon® and AMD EPYC CPUs, and NVIDIA, Intel, and AMD GPUs, as well as the emerging segment of DPUs.  

Part 1 of this three-part series reviewed some major historical cooling milestones and the evolution of cooling technologies over time, both in the server and in the data center.

Part 2 of this series describes the power and cooling trends in the server industry and Dell Technologies’ response to the challenges through intelligent hardware design and technology innovation.

Part 3 of this series will focus on technical details aimed at helping customers prepare for the introduction, optimization, and evolution of these technologies within their current and future datacenters.

Increasing power requirements and heat generation trends within servers

CPU TDP trends over time – Over the past ten years, significant innovations in CPU design have included increased core counts, advancements in frequency management, and performance optimizations. As a result, CPU Thermal Design Power (TDP) has nearly doubled over just a few processor generations and is expected to continue increasing. 

 

Figure 1.  TDP trends over time

Emergence of GPUs – Workloads such as Artificial Intelligence (AI) and Machine Learning (ML) capitalize on the parallel processing capabilities of Graphics Processing Units (GPUs). These subsystems require significant power and generate significant amounts of heat. As with CPUs, the power consumption of GPUs has rapidly increased. For example, while the power of an NVIDIA A100 GPU in 2021 was 300W, NVIDIA H100 GPUs are releasing soon at up to 700W. GPUs up to 1000W are expected in the next three years.

Memory – As CPU capabilities have increased, memory subsystems have also evolved to provide increased performance and density. A 128GB LRDIMM installed in an Intel-based Dell 14G server would operate at 2666MT/s and could require up to 11.5W per DIMM. The addition of 256GB LRDIMMs for subsequent Dell AMD platforms pushed the performance to 3200MT/s but required up to 14.5W per DIMM. The latest Intel- and AMD-based platforms from Dell operate at 4800MT/s, with 256GB RDIMMs consuming 19.2W each. Intel-based systems can support up to 32 DIMMs, which could require over 600W of power for the memory subsystem alone.
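As a quick worked example of where that 600W figure comes from (simple arithmetic using the per-DIMM numbers quoted above):

```python
# Worst-case memory subsystem power, using the per-DIMM figures quoted above.
dimm_power_w = {
    "14G 128GB LRDIMM @ 2666MT/s": 11.5,
    "AMD 256GB LRDIMM @ 3200MT/s": 14.5,
    "Latest 256GB RDIMM @ 4800MT/s": 19.2,
}

dimm_slots = 32  # fully populated Intel-based system
worst_case_w = dimm_slots * dimm_power_w["Latest 256GB RDIMM @ 4800MT/s"]
print(f"{worst_case_w:.1f} W")  # 614.4 W for the memory subsystem alone
```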

Storage – Data storage is a key driver of power and cooling. Fewer than ten years ago, a 2U server could only support up to 16 2.5” hard drives. Today a 2U server can support up to 24 2.5” drives. In addition to the increased power and cooling that this trend has driven, these higher drive counts have resulted in significant airflow impedance on both the inlet side and exhaust side of the system. With the latest generation of PowerEdge servers, a new form factor called E3 (also known as EDSFF, or “Enterprise & Data Center SSD Form Factor”) brings the drive count to 16 in some models but reduces the width and height of the storage device, which gives more space for airflow. The “E3” family of devices includes “Short” (E3.S), “Short – Double Thickness” (E3.S 2T), “Long” (E3.L), and “Long – Double Thickness” (E3.L 2T). While traditional 2.5” SAS drives can require up to 25W, these new EDSFF designs can require up to 70W as shown in the following table.

(Source: https://members.snia.org/document/dl/26716, page 25.)

Innovative Dell Technologies design elements and cooling techniques to help manage these trends

“Smart Flow” configurations

Dell ISG engineering teams have architected new system storage configurations to allow increased system airflow for high power configurations. These high flow configurations are referred to as “Smart Flow”. The high airflow aspect of Smart Flow is achieved using new low impedance airflow paths, new storage backplane ingredients, and optimized mechanical structures all tuned to provide up to a 15% higher airflow compared to traditional designs. Smart Flow configurations allow Dell’s latest generation of 1U and 2U servers to support new high-power CPUs, DDR5 DIMMs, and GPUs with minimal tradeoffs.  

Figure 2.  R660 “Smart Flow” chassis

 

Figure 3.  R760 “Smart Flow” chassis

Fresh air GPU configurations

The R750xa and R760xa continue the legacy of the Dell C4140, with GPUs located in the “first-class” seats at the front of the system. Dell thermal and system architecture teams designed these next-generation GPU-optimized systems with the GPUs at the front so that they receive fresh (non-preheated) air. These systems also incorporate larger 60x76mm fans to provide the high airflow rates required by the GPUs and CPUs in the system. Look for additional fresh air GPU architectures in future Dell systems.

 

Figure 4.  R760xa chassis showing “first class seats” for GPU at the front of the system

4th Generation DLC with leak detection

Dell’s latest generation of servers continues to expand on already extensive support for direct liquid cooling (DLC). In fact, a total of 12 Dell platforms have a DLC option, including an all-new offering of DLC in the MX760c. Dell’s 4th generation liquid cooling solution has been designed for robust operation under the most extreme conditions. If an excursion occurs, Dell has you covered: all platforms supporting DLC utilize Dell’s proprietary Leak Sensor solution. This solution is capable of detecting and differentiating small and large leaks, which can be associated with configurable actions including email notification, event logging, and system shutdown.

 

Figure 5.  2U chassis with Direct Liquid Cooling heatsink and tubing

Application optimized designs 

Dell closely monitors not only the hardware configurations that customers choose but also the application environments they run on them. This information is used to determine when design changes might help customers to achieve a more efficient design for power and cooling with various workloads. 

An example of this is in the Smart Flow designs discussed previously, in which engineers reduced the maximum storage potential of the designs to deliver more efficient air flow in configurations that do not require maximum storage expansion.

Another example is in the design of the “xs” (R650xs, R660xs, R750xs, and R760xs) platforms. These platforms are designed to be optimized specifically for virtualized environments. Using the R750xs as an example, it supports a maximum of 16 hard drives. This reduces the density of power supplies that must be supported and allows for the use of lower cost fans. This design supports a maximum of 16 DIMMs which means that the system can be optimized for a lower maximum power threshold, yet still deliver enough capacity to support large numbers of virtual machines. Dell also recognized that the licensing structure of VMware supports a maximum of 32 cores per license. This created an opportunity to reduce the power and cooling loads even further by supporting CPUs with a maximum of 32 cores which have a lower TDP than the higher core count CPUs.

Software design

As power and cooling requirements increase, Dell is also investing in software controls to help customers manage these new environments. iDRAC and OpenManage Enterprise (OME) with the Power Manager plug-in both provide power capping. OME Power Manager will automatically adjust power based on policies set by the customer. In addition, iDRAC, OME Power Manager, and CloudIQ all report power usage, giving customers the flexibility to monitor and adapt power usage based on their unique requirements.
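As a rough sketch of what a single-server power cap looks like programmatically (an assumption-based example using the standard DMTF Redfish Power resource on an iDRAC9; the exact endpoint, supported properties, and license requirements vary by platform and firmware, and OME Power Manager applies such policies across many servers automatically):

```python
# Illustrative single-server power cap over Redfish (not a Dell-documented recipe;
# endpoint and properties assumed from the standard DMTF Power schema).
import requests

IDRAC_HOST = "https://192.168.0.120"   # hypothetical iDRAC address
AUTH = ("root", "calvin")              # replace with real credentials
POWER_URI = "/redfish/v1/Chassis/System.Embedded.1/Power"

payload = {"PowerControl": [{"PowerLimit": {"LimitInWatts": 450}}]}  # example cap

resp = requests.patch(IDRAC_HOST + POWER_URI, json=payload, auth=AUTH, verify=False)
print(resp.status_code)
```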

Conclusion

As server technology evolves, power and cooling challenges will continue. Fan power in air-cooled servers is one of the largest contributors to wasted power. Minimizing fan power under typical operating conditions is the key to a thermally efficient server and has a large impact on a customer's sustainability footprint.

As the industry adopts liquid cooling solutions, Dell is ensuring that air cooling potential is maximized to protect customer infrastructure investments in air-cooled data centers around the globe. The latest generation of Dell servers required advanced engineering simulations and analysis to improve system design and increase system airflow per watt of fan power compared to the previous generation of platforms, not only maximizing air cooling potential but keeping it efficient as well. Smart Flow configurations enable additional air-cooling opportunities, allowing higher CPU bins to be air cooled rather than requiring liquid cooling. A large number of thermal and power sensors, combined with Dell proprietary adaptive closed-loop algorithms, manage both power and thermal transients, maximizing cooling at the lowest fan power state and protecting systems during excursion conditions through closed-loop power management.


“Thermal Manage” Features and Benefits

Hasnain Shabbir, Rick Hall, Doug Iler

Mon, 16 Jan 2023 17:06:35 -0000


Summary

This Tech Note covers the features and benefits of using the “Thermal Manage” features within the iDRAC Datacenter license.

Introduction

With increasing server densities and the desire to maximize compute power per unit area at the datacenter level, there is an increasing need for better telemetry and controls related to power and thermals to manage and optimize data center efficiency.

“Thermal Manage” is a set of features included in the iDRAC Datacenter license that provides key thermal telemetry and associated control features to help address deployment and customization challenges.

Thermal Manage – Feature Overview

Thermal Manage allows customers to customize the thermal operation of their PowerEdge servers with the following benefits:

  • Optimize server-related power and cooling efficiencies across their datacenters.
  • Integrate seamlessly with OpenManage Enterprise Power Manager for an optimized management experience.
  • Use a state-of-the-art PCIe cooling management dashboard.

The following diagram (Figure 1) and list summarize these features and their uses.

  1. System Airflow Consumption: Displays the real-time system airflow consumption (in CFM), allowing airflow balancing at the rack and datacenter level.
  2. Custom Delta-T: Limit the air temperature rise from inlet air to exhaust to right-size your infrastructure-level cooling (a simple airflow/Delta-T relation is sketched after this list).
  3. Exhaust Temperature Control: Specify the temperature limit of the air exiting the server to match your datacenter needs.
  4. Custom PCIe inlet temperature: Choose the right input inlet temperature to match 3rd party device requirements.
  5. PCIe airflow settings: Provides a comprehensive PCIe device cooling view of the server and allows cooling customization of 3rd party cards.
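The System Airflow Consumption and Custom Delta-T features are tied together by basic airflow physics: for a given heat load, the airflow a server consumes and its inlet-to-exhaust temperature rise trade off against each other. The sketch below is a simplification (assuming dry air near sea level; it is not a Dell utility) showing how a target Delta-T translates into required CFM and vice versa.

```python
# Rough airflow/Delta-T sketch (illustrative only, not a Dell utility).
# Assumes dry air near sea level: density ~1.2 kg/m^3, cp ~1005 J/(kg*K).

AIR_DENSITY = 1.2            # kg/m^3
AIR_CP = 1005.0              # J/(kg*K)
M3S_PER_CFM = 0.000471947    # cubic meters per second in one CFM

def required_cfm(heat_load_w: float, delta_t_c: float) -> float:
    """Airflow (CFM) needed to remove heat_load_w at a given inlet-to-exhaust rise."""
    m3_per_s = heat_load_w / (AIR_DENSITY * AIR_CP * delta_t_c)
    return m3_per_s / M3S_PER_CFM

def delta_t_for_cfm(heat_load_w: float, cfm: float) -> float:
    """Inlet-to-exhaust temperature rise (deg C) for a given airflow."""
    m3_per_s = cfm * M3S_PER_CFM
    return heat_load_w / (AIR_DENSITY * AIR_CP * m3_per_s)

# Example: a 700 W server held to a 20 C rise needs roughly 60 CFM.
print(round(required_cfm(700, 20), 1))
```

Allowing a larger Delta-T (within exhaust temperature and hot aisle limits) therefore reduces the CFM the server demands, which is exactly the trade-off these features expose.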

Details and Use Cases

By default, the Dell server thermal controls algorithm works to minimize system airflow consumption and maximize exhaust air temperature.

The higher the exhaust air temperature returning to the HVAC (CRAC) units, the higher the cooling capacity they exhibit.

  • CRAC capacity is directly proportional to the temperature difference between the return air (exhaust) and the cooling coil for a given coil flow rate, as captured by the relation after this list.
  • This could result in lower CRAC capital costs if you can cool more with fewer CRAC units, and in operational savings from cooling with less equipment.
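That proportionality is simply the sensible-heat relation for the air stream (standard HVAC physics, not a Dell-specific formula):

$$\dot{Q}_{\mathrm{CRAC}} = \dot{m}\, c_p \,\bigl(T_{\text{return}} - T_{\text{coil}}\bigr)$$

where $\dot{Q}_{\mathrm{CRAC}}$ is the heat removed, $\dot{m}$ is the air mass flow through the coil, and $c_p$ is the specific heat of air. At a fixed coil flow rate, capacity scales linearly with the return-air-to-coil temperature difference, which is why hotter exhaust returned to the CRAC units yields more cooling capacity per unit.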

Some customers, however, have challenges with high exhaust temperatures in the hot aisle, namely:

  • Technicians don’t like the extra heat while working in the hot aisle.
  • Components in the hot aisle (PDUs, cables, network switches) may exceed their rated ambient temperatures.

Figure 1.  Thermal Manage features and their uses

In either case, we allow customization of this exhaust temperature via iDRAC interfaces.

Using the real-time airflow telemetry, a datacenter can create a good balance of airflow delivery vs. airflow demand at the server. A reduction in CFM also can be monetized on a dollar/CFM basis.

  • In an example analysis using a 17 KW rack, a drop in CFM by 10% could result in capital savings (CRAC costs of $257/rack) and an annual operational savings of $93 per rack based on the typical energy cost and data center efficiencies assumed.
  • However, the greater benefit is the potential ability to fit more racks on the floor (or more servers in a rack), if airflow balancing is achieved by closely matching the server/rack airflow consumption.

 iDRAC Thermal Manage features require an iDRAC Datacenter license. Here is an image from the iDRAC GUI showing the thermal telemetry and customization options:


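Beyond the GUI, the same class of telemetry can be pulled programmatically. The sketch below is an assumption-based example: it reads the standard DMTF Redfish Thermal resource on an iDRAC9 (hypothetical address and credentials shown), and the specific OEM properties that surface CFM and Delta-T vary by platform, firmware, and license.

```python
# Minimal Redfish thermal telemetry poll (illustrative sketch, not a Dell tool).
# Assumes an iDRAC9 reachable at IDRAC_HOST with Redfish enabled; CFM/Delta-T
# values are exposed through Dell OEM properties that vary by firmware/license.
import requests

IDRAC_HOST = "https://192.168.0.120"          # hypothetical iDRAC address
AUTH = ("root", "calvin")                     # replace with real credentials
THERMAL_URI = "/redfish/v1/Chassis/System.Embedded.1/Thermal"

resp = requests.get(IDRAC_HOST + THERMAL_URI, auth=AUTH, verify=False)
resp.raise_for_status()
thermal = resp.json()

for fan in thermal.get("Fans", []):
    print(fan.get("Name"), fan.get("Reading"), fan.get("ReadingUnits"))

for sensor in thermal.get("Temperatures", []):
    print(sensor.get("Name"), sensor.get("ReadingCelsius"), "C")
```

Inlet and exhaust temperature sensors appear in the Temperatures collection; their difference is the server Delta-T that the Custom Delta-T feature constrains.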
Deploying 3rd party PCIe cards in PowerEdge servers is a common practice. The PCIe airflow settings feature allows a better understanding of the cooling state of the PCIe devices. This helps customers protect their high-value PCIe card with the right amount of cooling. Additionally, this optimizes system airflow, which ties into the earlier point of data center airflow management.

By default, the presence of a 3rd party card may cause the system fan speeds to increase based on internal algorithms. However, this additional cooling may be more or less than required and hence the need for allowing customers to customize airflow delivery to their custom card.

In the iDRAC GUI under PCIe Airflow Settings (Dashboard » System » Overview » Cooling » Configure Cooling – see the example snapshot below), the system displays high-level cooling details for each slot in which a card is present. It also displays the maximum airflow capability of each slot. This airflow information is provided in units of LFM (Linear Feet per Minute), the industry standard for defining the airflow needs of a card. For 3rd party cards only, customers can see the minimum LFM value delivered to the card and can either disable the custom cooling response for that card, or disable it and then set the desired custom LFM value (based on card vendor specifications).

NOTE: For Dell standard devices, the correct power and cooling requirements are part of the iDRAC code, which allows for the appropriate airflow.
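Purely as an illustration of how a custom LFM target might be applied outside the GUI, the sketch below PATCHes Dell system attributes over Redfish. The attribute group and names shown are assumptions for illustration only; confirm the exact names against the iDRAC attribute registry or RACADM documentation for your firmware before use.

```python
# Hypothetical sketch: set a custom LFM target for a 3rd party card in slot 3
# by PATCHing Dell system attributes over Redfish. Attribute names below are
# illustrative assumptions; verify them in the iDRAC attribute registry.
import requests

IDRAC_HOST = "https://192.168.0.120"   # hypothetical iDRAC address
AUTH = ("root", "calvin")              # replace with real credentials
ATTR_URI = "/redfish/v1/Managers/iDRAC.Embedded.1/Oem/Dell/DellAttributes/System.Embedded.1"

payload = {
    "Attributes": {
        "PCIeSlotLFM.3.LFMMode": "Custom",   # assumed attribute name
        "PCIeSlotLFM.3.CustomLFM": 300,      # assumed attribute name; LFM per card vendor spec
    }
}

resp = requests.patch(IDRAC_HOST + ATTR_URI, json=payload, auth=AUTH, verify=False)
print(resp.status_code, resp.text)
```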

In Conclusion

The Thermal Manage features within the iDRAC Datacenter license provide industry-leading custom thermal control options, delivering valuable cooling customization and efficiency optimization at both the system and the data center level.


Multi Vector Cooling 2.0 for Next-Generation PowerEdge Servers

Matt Ogle, Hasnain Shabbir

Mon, 16 Jan 2023 13:44:20 -0000


Summary

Next-generation PowerEdge servers (15G) support the latest compute, storage, and networking technologies with the help of innovations in hardware and thermal controls design that build on the foundations of the previous-generation (14G) MVC 1.0 solution. This DfD outlines the new MVC 2.0 innovations in both hardware thermal design and system thermal controls that enable maximum system performance, with an eye on thermal efficiency and the key customizations customers need to tune a system to their deployment needs and challenges.

Introduction

Next-generation PowerEdge servers (15G) support higher-performance CPUs, DIMMs and networking components that will greatly increase the servers’ capabilities. However, as capabilities increase, so does the need for continued innovation to keep the system cool and running efficiently.

Multi Vector Cooling (MVC) is not any specific feature – rather, it is a term that captures all of the thermal innovations implemented in PowerEdge platforms. MVC 2.0 for next-generation PowerEdge servers builds upon existing innovations with additional support in hardware design, improved system layout, and cutting-edge thermal controls. These improvements address the needs of an ever-changing compute landscape demanding "green" performance and a low carbon footprint, as well as adding customization levers to optimize not only at the server level but also at the data center level, generally around airflow handling and power delivery.

Hard Working Hardware

While most of the innovations for MVC 2.0 center around optimizing thermal controls and management, the advancement of physical cooling hardware and its architecture layout is clearly essential:

  • Fans - In addition to the cost-effective standard fans, multiple tiers of high-performance, Dell-designed fans are supported to increase system cooling. The high-performance silver and gold fans can be configured in next-generation PowerEdge servers to support increased compute density. Figure 1 below depicts the airflow increase (in CFM) for these high-performance fans compared to baseline fans.


Figure 1 – Comparison of airflow output in CFM

  • Heatsinks - The improved CPU heatsink design not only improves CPU cooling capability, but also helps in streamlining airflow and air temperature distribution across the chassis. Innovative heatsink ‘arms’ with high performance heat pipes and optimized fin spacing achieve this goal.
  • Layout - The T-shaped system motherboard layout, along with PSUs that are now located at each corner of the chassis, allows improved airflow balancing and system cooling, and consequently improved system cooling efficiency. This layout also improves PSU cooling by reducing the risk of high pre-heat coming from CPU heatsinks, and the streamlined airflow helps with PCIe cooling as well, enabling support for PCIe Gen4 adapters. Lastly, this layout creates a better cable routing experience on the PDU side of the racks, where power cables are generally separated by grid assignments for redundancy.

 

AI Based Thermal Controls

To best supplement the improved cooling hardware, the PowerEdge engineering team focused on developing a more autonomous environment. Key features from prior generations were expanded upon to deliver autonomous thermal solutions capable of cooling next-generation PowerEdge servers. Our AI-based, proprietary and patented fuzzy-logic-driven adaptive closed loop controller has been expanded beyond fan speed control based on thermal sensor input and is now also utilized for power management. This allows system performance to be optimized, especially for transient workloads and for systems operating in challenging thermal environments, by automating the power management that is required beyond fan speed control for thermal management.
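Dell's production controller is a proprietary, patented fuzzy-logic algorithm; purely to illustrate the closed-loop concept of escalating from fan speed control to power management, here is a toy sketch with made-up gains and thresholds (it is not Dell's algorithm).

```python
# Toy closed-loop thermal controller (illustration only; Dell's production
# controller is a proprietary fuzzy-logic algorithm, not this).

FAN_MIN, FAN_MAX = 20.0, 100.0        # fan duty cycle limits, percent
GAIN = 4.0                            # made-up proportional gain, %/deg C
TEMP_LIMIT_C = 85.0                   # illustrative component limit

def next_step(temp_c: float, fan_pct: float, power_cap_w: float) -> tuple[float, float]:
    """One control iteration: returns (new fan duty %, new power cap W)."""
    error = temp_c - TEMP_LIMIT_C
    fan_pct = min(FAN_MAX, max(FAN_MIN, fan_pct + GAIN * error))

    # If fans are already maxed and temperature is still over the limit,
    # fall back to trimming the component power cap (e.g. CPU or DIMM domain).
    if fan_pct >= FAN_MAX and error > 0:
        power_cap_w = max(150.0, power_cap_w - 10.0)
    elif error < -5.0:
        power_cap_w = min(350.0, power_cap_w + 10.0)   # relax cap when cool
    return fan_pct, power_cap_w

# Example: component at 90 C with fans pinned at 100% -> the power cap is reduced.
print(next_step(90.0, 100.0, 350.0))
```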

 
Figure 2 – Each operating environment has unique challenges 

This automation, with granular power capping capability across various supported sub-system power domains (specifically CPU and DIMM), ensures thermal compliance with minimal performance impact in challenging thermal conditions. Figure 2 illustrates the areas where the new controls solution optimizes system performance and uptime.

iDRAC Datacenter Thermal Management Features and OME

With the introduction of the iDRAC Datacenter license and OME Power Manager's one-to-many capabilities, customers can monitor and tackle challenges associated with server customization as well as deployment in their datacenter (power and airflow centric). The list below highlights some of the key features:

  1. System Airflow Consumption - Users can view real-time system airflow consumption (in CFM), allowing airflow balancing at the rack and datacenter level with newly added integration in the OME Power Manager
  2. Custom Delta-T - Users can limit the air temperature rise from the inlet to exhaust to right-size their infrastructure level cooling
  3. Custom PCIe inlet temperature - Users can choose the right input inlet temperature to match 3rd party device requirements
  4. Exhaust Temperature Control - Users can specify the temperature limit of the air exiting the server to match their datacenter hot aisle needs or limitations (personnel presence, networking/power hardware)
  5. PCIe airflow settings - Users are provided a comprehensive PCIe devices cooling view of the server that informs and allows cooling customization of 3rd party cards

Figure 3 illustrates how the features previously mentioned work together at a system level:


Figure 3 – iDRAC thermal management features and customizations

 

Channel Card Support

Dell Technologies also offers flexibility for customers who want to implement non-Dell channel cards. Comprehensive support is provided for PCIe communication standards such as PLDM and NC-SI, as well as for custom implementations by GPU and accelerator vendors such as NVIDIA, AMD, and Intel, enabling temperature monitoring and closed loop system fan control. Channel cards that follow these standards will therefore have optimal thermal and power management behavior in PowerEdge servers. Future updates would also include support for the new open loop cooling levels defined in the latest release of the PCI-SIG standards documents.

 

Conclusion

The Dell Technologies MVC 2.0 solution enables next-generation (15G) PowerEdge servers to support dense configurations and demanding workloads with higher-performance cooling hardware, increased automation, simplified yet advanced management, and channel card flexibility. By expanding upon the existing MVC 1.0 design strategy, the MVC 2.0 solution resolves new thermal challenges so that PowerEdge customers can fully utilize their datacenters while managing deployment constraints like airflow and power delivery in an optimal fashion.