Talking CloudIQ: PowerEdge
Wed, 08 Nov 2023 16:32:28 -0000
Introduction
In my previous blogs, I have focused on a specific feature in CloudIQ. This blog talks about various CloudIQ features for Dell’s PowerEdge servers. Dell CloudIQ continues to expand its feature set for PowerEdge assets. CloudIQ integrates with Dell’s OpenManage Enterprise at each of your sites, to efficiently collect and aggregate telemetry data to give you a multisite, enterprise-wide view of all your PowerEdge servers and chassis. And with OpenManage Enterprise 4.0, onboarding your PowerEdge servers to CloudIQ is easier than ever!
Health, inventory, and performance
Since the introduction of PowerEdge support in CloudIQ, health, inventory, and performance monitoring for PowerEdge servers have all been available. CloudIQ provides an overall health score for each PowerEdge server and recommended remediation when an issue is identified. Inventory reporting provides numerous properties about each server, including contract status, component firmware versions, licensing information, and hardware listings to name a few. CloudIQ displays key performance metrics and not only shows historical trends but identifies performance anomalies and provides performance forecasting. This information allows you to see unexpected performance patterns, and plan future resource needs based on trending workloads.
Figure 1. Example of a performance forecasting chart for PowerEdge
Cybersecurity
Cybersecurity is a feature in CloudIQ that allows you to compare your existing security configuration settings to a predefined set of desired security configuration settings. The configuration is continuously monitored, notifying you when a configuration does not meet its desired setting. Cybersecurity monitors up to 31 server configuration settings and 18 chassis configuration settings tied to NIST security standards. Without automated continuous checking, it's impractical to manually check all settings on all servers every day. Lab tests show that it takes six minutes on average to manually check just 15 settings on a single server.
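The comparison described above can be pictured as a diff between a desired-settings table and what a server currently reports. The following is a minimal sketch, not CloudIQ's implementation: the setting names and values are hypothetical, and the time estimate assumes the lab figure (six minutes per 15 settings on one server) scales linearly.

```python
# Hypothetical setting names and values, for illustration only.
DESIRED = {
    "secure_boot": "Enabled",
    "telnet": "Disabled",
    "tls_min_version": "1.2",
}


def find_deviations(desired, actual):
    """Return {setting: (desired_value, actual_value)} for every mismatch."""
    return {
        name: (want, actual.get(name))
        for name, want in desired.items()
        if actual.get(name) != want
    }


def manual_check_minutes(servers, settings_per_server, minutes_per_15_settings=6.0):
    """Estimate manual effort, assuming the lab figure scales linearly."""
    return servers * settings_per_server * minutes_per_15_settings / 15


# One server whose telnet setting has drifted from the desired value.
actual = {"secure_boot": "Enabled", "telnet": "Enabled", "tls_min_version": "1.2"}
deviations = find_deviations(DESIRED, actual)
print(deviations)

# Daily manual effort for 100 servers with 31 settings each, in minutes.
print(manual_check_minutes(servers=100, settings_per_server=31))
```

Under that linear assumption, checking 31 settings on 100 servers by hand would take 1,240 minutes, or more than 20 hours, for a single pass.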
Users can also see a list of applicable Dell Security Advisories (DSAs) for their PowerEdge systems. By intelligently matching attributes like models and code versions, users can quickly see which DSAs are applicable to their systems, allowing them to take immediate action to remediate these security vulnerabilities.
Figure 2. The Security Assessment page for a PowerEdge chassis
System Management
You can now initiate BIOS and firmware updates for PowerEdge servers and chassis from CloudIQ. Users with a Server Admin role in CloudIQ can initiate these upgrades across multiple systems with just a few clicks. This feature simplifies the process of keeping your fleet of servers consistent and secure.
Figure 3. Multisystem update for PowerEdge servers and chassis
Virtualization View
The integration of PowerEdge into the Virtualization View consolidates and simplifies resource information about PowerEdge servers running ESXi. Available details include the OS version, model, resource consumption per virtual machine, and health issues with recommendations for remediation. A hyperlink lets you quickly navigate to the system details page for the PowerEdge server for more troubleshooting. Another hyperlink directs you to vCenter to perform virtualized resource administration.
Figure 4. PowerEdge support in the Virtualization View
Carbon footprint monitoring
CloudIQ has introduced carbon footprint analysis support for PowerEdge servers and chassis. CloudIQ takes power and energy metrics and calculates carbon emissions based on international standards and conversion factors for location. CloudIQ Administrators can override and customize these values with their own unique location emission factors.
Figure 5. Energy, power, and carbon emissions for a PowerEdge server
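The arithmetic behind this kind of report is energy multiplied by a location emission factor. The sketch below illustrates that relationship only; the function names, the site name, and every number in it are made up for the example, and it does not reproduce the conversion factors that CloudIQ uses.

```python
# Hypothetical default factor per location, in kg CO2e per kWh.
DEFAULT_FACTORS = {"site-a": 0.5}


def factor_for(location, overrides=None):
    """Use an administrator override when one exists, else the default."""
    overrides = overrides or {}
    return overrides.get(location, DEFAULT_FACTORS[location])


def energy_kwh_from_average_power(average_watts, hours):
    """Convert an average power draw over a period into energy in kWh."""
    return average_watts * hours / 1000.0


def carbon_emissions_kg(energy_kwh, emission_factor_kg_per_kwh):
    """Emissions = energy consumed x location emission factor."""
    if energy_kwh < 0 or emission_factor_kg_per_kwh < 0:
        raise ValueError("energy and emission factor must be non-negative")
    return energy_kwh * emission_factor_kg_per_kwh


# Illustrative only: a server averaging 400 W for 30 days.
energy = energy_kwh_from_average_power(400, 24 * 30)
default_kg = carbon_emissions_kg(energy, factor_for("site-a"))
custom_kg = carbon_emissions_kg(energy, factor_for("site-a", {"site-a": 0.25}))
print(energy, default_kg, custom_kg)
```

Overriding the factor changes only the last multiplication, which is why a site-specific factor can be swapped in without touching the collected power and energy metrics.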
Custom reports and IT integrations
You can generate custom reports using both tables and charts for PowerEdge servers:
- Tables are available to provide lists of assets, code versions, contract information, capacity metrics, and average performance metrics.
- Charts can be used to see historical performance trends and performance anomalies.
You can also take advantage of custom tags in your reports. For example, you can create a list of PowerEdge servers in a certain business unit with their BIOS and firmware versions, contract expiration dates, average power consumption, and service tags. And with Webhooks and REST API access, you can integrate data and events from CloudIQ with ServiceNow, Slack, and other IT tools to help you monitor your entire Dell IT infrastructure.
Figure 6. Custom reporting table for PowerEdge with custom tags
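As a rough sketch of the webhook integration mentioned above, the following turns an incoming event into a Slack incoming-webhook message. The event field names (`severity`, `system_name`, `summary`) and the sample values are hypothetical and are not the CloudIQ webhook schema; only the `{"text": ...}` payload shape on the Slack side is standard.

```python
import json
import urllib.request


def to_slack_message(event):
    """Build a Slack incoming-webhook payload from an event dictionary.

    The event keys used here are hypothetical, not the CloudIQ schema.
    """
    text = "[{severity}] {system}: {summary}".format(
        severity=event.get("severity", "UNKNOWN"),
        system=event.get("system_name", "unknown system"),
        summary=event.get("summary", "no summary"),
    )
    return {"text": text}


def post_to_slack(webhook_url, message, timeout=10):
    """POST the message as JSON to a Slack incoming-webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Made-up event; nothing is sent unless post_to_slack is called.
sample_event = {
    "severity": "CRITICAL",
    "system_name": "pe-r750-01",
    "summary": "Health score dropped",
}
slack_message = to_slack_message(sample_event)
print(slack_message)
```

Keeping the translation step separate from the HTTP call makes it easy to point the same events at a different tool by replacing only `post_to_slack`.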
Conclusion
As IT resources become more remote and isolated, it has become increasingly time consuming to maintain, manage, and secure resources in the data center and at the edge. CloudIQ simplifies monitoring and management by providing a single portal to view all your PowerEdge servers across your entire environment. With cybersecurity monitoring of PowerEdge servers and chassis, you can quickly see where security configuration settings may be incorrectly set or accidentally changed, opening those systems to cyberattacks, and receive instructions to remediate. With the new maintenance and management features, CloudIQ simplifies the process of keeping your entire fleet at consistent, secure, and desired BIOS and firmware versions. The carbon footprint page in CloudIQ helps you meet sustainability goals. And with Webhook and REST API support, CloudIQ can be integrated with other IT tools to help you monitor not only your PowerEdge servers, but your entire Dell IT portfolio.
Resources
This Knowledge Base Article discusses how to onboard PowerEdge devices to CloudIQ.
For a quick demo about CloudIQ PowerEdge support, see the CloudIQ videos section on the Info Hub.
Direct from Development Tech Note: Dell CloudIQ Cybersecurity for PowerEdge: The Benefits of Automation
See other informative blogs: Overview of CloudIQ, Proactive Health Scores, Capacity Monitoring and Planning, Cybersecurity, and Custom Reports and Tags.
How do you become more familiar with Dell Technologies and CloudIQ? The Dell Technologies Info Hub site provides expertise that helps to ensure customer success with Dell Technologies platforms. We have CloudIQ demos, white papers, and videos available at the Dell Technologies CloudIQ page. Also, feel free to reference the white paper CloudIQ: A Detailed Overview which provides an in-depth summary of CloudIQ.
Author: Derek Barboza, Senior Principal Engineering Technologist
Related Blog Posts
Choosing a PowerEdge Server and NVIDIA GPUs for AI Inference at the Edge
Fri, 05 May 2023 16:38:19 -0000
Dell Technologies submitted several benchmark results for the latest MLCommons™ Inference v3.0 benchmark suite. One objective was to provide information to help customers choose a favorable server and GPU combination for their workload. This blog reviews the Edge benchmark results and provides information about how to determine the best server and GPU configuration for different types of ML applications.
Results overview
For computer vision workloads, which are widely used in security systems, industrial applications, and even in self-driving cars, ResNet and RetinaNet results were submitted. ResNet is an image classification task and RetinaNet is an object detection task. The following figures show that for intensive processing, the NVIDIA A30 GPU, which is a double-wide card, provides the best performance with almost twice as many images per second as the NVIDIA L4 GPU. However, the NVIDIA L4 GPU is a single-wide card that requires only 43 percent of the power of the NVIDIA A30 GPU, based on the nominal Thermal Design Power (TDP) of each GPU. This low energy consumption provides a great advantage for applications that need lower power consumption or in environments that are more challenging to cool. The NVIDIA L4 GPU is the replacement for the best-selling NVIDIA T4 GPU, and provides twice the performance with the same form factor. Therefore, we see that this card is the best option for most Edge AI workloads.
Conversely, the NVIDIA A2 GPU has the lowest price, power consumption (TDP), and performance level of the options available. Therefore, if the application is compatible with this GPU, it has the potential to deliver the lowest total cost of ownership (TCO).
Figure 1: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the ResNet Offline benchmark
Figure 2: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the RetinaNet Offline benchmark
The 3D-UNet benchmark is the other computer vision image-related benchmark. It uses medical images for volumetric segmentation. We saw the same results for default accuracy and high accuracy. Moreover, the NVIDIA A30 GPU delivered significantly better performance over the NVIDIA L4 GPU. However, the same comparison between energy consumption, space, and cooling capacity discussed previously applies when considering which GPU to use for each application and use case.
Figure 3: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the 3D-UNet Offline benchmark
Another important benchmark is for BERT, which is a Natural Language Processing model that performs tasks such as question answering and text summarization. We observed similar performance differences between the NVIDIA A30, L4, T4, and A2 GPUs. The higher the value, the better.
Figure 4: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the BERT Offline benchmark
MLPerf benchmarks also include latency results, which are the time that systems take to process requests. For some use cases, this processing time can be more critical than the number of requests that can be processed per second. For example, if it takes several seconds to respond to a conversational algorithm or an object detection query that needs a real-time response, this time can be particularly impactful on the experience of the user or application.
As shown in the following figures, the NVIDIA A30 and NVIDIA L4 GPUs have similar latency results. Which of the two provides the lowest latency depends on the workload. For customers planning to replace the NVIDIA T4 GPU or seeking a better response time for their applications, the NVIDIA L4 GPU is an excellent option. The NVIDIA A2 GPU can also be used for applications that require low latency because it performed better than the NVIDIA T4 GPU in single-stream workloads. The lower the value, the better.
Figure 5: Latency comparison of NVIDIA A30, L4, T4, and A2 GPUs for the ResNet single-stream and multistream benchmark
Figure 6: Latency comparison of NVIDIA A30, L4, T4, and A2 GPUs for the RetinaNet single-stream and multistream benchmark and the BERT single-stream benchmark
Dell Technologies submitted results for various benchmarks to help identify which configuration is the most environmentally friendly, because the carbon footprint of the data center is a concern today. This concern is especially relevant at the edge, where some locations have power and cooling limitations. Therefore, it is important to understand performance relative to power consumption.
The following figure shows that the NVIDIA L4 GPU has equal or better performance per watt compared to the NVIDIA A2 GPU, even with higher power consumption. For Throughput and Perf/watt values, higher is better; for Power (watt) values, lower is better.
Figure 7: NVIDIA L4 and A2 GPU power consumption comparison
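Performance per watt is throughput divided by power draw, so a GPU that draws more power can still come out ahead. The sketch below shows the calculation with made-up numbers; the GPU names and values are placeholders and are not MLPerf results.

```python
def perf_per_watt(throughput, power_watts):
    """Throughput (for example, samples per second) divided by power draw."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return throughput / power_watts


# Placeholder (throughput, watts) pairs; NOT measured benchmark values.
candidates = {
    "gpu_x": (1000.0, 50.0),
    "gpu_y": (2400.0, 80.0),
}

# Rank from most to least efficient.
ranked = sorted(
    candidates,
    key=lambda name: perf_per_watt(*candidates[name]),
    reverse=True,
)
for name in ranked:
    print(name, perf_per_watt(*candidates[name]))
```

In this example, `gpu_y` draws more power than `gpu_x` but delivers 30 units of throughput per watt against 20, which is the pattern the comparison above describes.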
Conclusion
With measured workload benchmarks on MLPerf Inference 3.0, we can conclude that all NVIDIA GPUs tested for Edge workloads have characteristics that address several use cases. Customers must evaluate size, performance, latency, power consumption, and price. Depending on the requirements of the application, one of the evaluated GPUs will provide the best result for the final use case.
Another important conclusion is that the NVIDIA L4 GPU can be considered as an exceptional upgrade for customers and applications running on NVIDIA T4 GPUs. The migration to this new GPU can help consolidate the amount of equipment, reduce the power consumption, and reduce the TCO; one NVIDIA L4 GPU can provide twice the performance of the NVIDIA T4 GPU for some workloads.
With this benchmark submission, Dell Technologies demonstrates a broad portfolio that provides the infrastructure for any type of customer requirement.
The following blogs provide analyses of other MLPerf™ benchmark results:
- Dell Servers Excel in MLPerf™ Inference 3.0 Performance
- Dell Technologies’ NVIDIA H100 SXM GPU submission to MLPerf™ Inference 3.0
- Empowering Enterprises with Generative AI: How Does MLPerf™ Help Support
- Comparison of Top Accelerators from Dell Technologies’ MLPerf™
References
For more information about Dell PowerEdge servers, go to the following links:
- Dell’s PowerEdge XR7620 for Telecom/Edge Compute
- Dell’s PowerEdge XR5610 for Telecom/Edge Compute
- PowerEdge XR4520c Compute Sled specification sheet
- PowerEdge XE2420 Spec Sheet
For more information about NVIDIA GPUs, go to the following links:
MLCommons™ Inference v3.0 results presented in this document are based on the following system IDs:
ID | Submitter | Availability | System |
---|---|---|---|
2.1-0005 | Dell Technologies | Available | Dell PowerEdge XE2420 (1x T4, TensorRT) |
2.1-0017 | Dell Technologies | Available | Dell PowerEdge XR4520c (1x A2, TensorRT) |
2.1-0018 | Dell Technologies | Available | Dell PowerEdge XR4520c (1x A30, TensorRT) |
2.1-0019 | Dell Technologies | Available | Dell PowerEdge XR4520c (1x A2, MaxQ, TensorRT) |
2.1-0125 | Dell Technologies | Preview | Dell PowerEdge XR5610 (1x L4, TensorRT, MaxQ) |
2.1-0126 | Dell Technologies | Preview | Dell PowerEdge XR7620 (1x L4, TensorRT) |
Table 1: MLPerf™ system IDs
Sweet 16 ways OpenManage helps customers to maximize their investment in PowerEdge
Wed, 12 Apr 2023 01:27:49 -0000
As we at Dell announce details of the new wave of PowerEdge servers (details here), we want to highlight 16 examples of how the OpenManage portfolio of systems management software enhances our server range. Like I always say, where there are servers, there are server management requirements.
The OpenManage portfolio exists to save customers of any size time and money, eliminating the necessity of high-touch, manual steps to deliver efficiency. Designed to scale, with integrated security, Dell’s OpenManage strategy is to give customers a choice by using orchestration, automation, and integration, leveraging APIs with open standards.
#1 – Server health monitoring—This is server management 101. However, given the fact that PowerEdge servers are the foundation of the modern data center, this basic element is critical to application and services uptime. OpenManage solutions have many ways to get this information from the agent-free iDRAC directly (GUI/SNMP/SMTP/syslog/API and more) or through the Dell OpenManage Enterprise console, OpenManage mobile, Dell CloudIQ, VMware vCenter integration, Microsoft System Center, and leading third-party management software such as Nagios.
#2 – Remote access to servers—If deep one-to-one control for troubleshooting, deployment, configuration, console access, and so on is needed, then iDRAC is the answer. Dell's unique iDRAC9 offers out-of-band remote server connection, including firmware configuration, full server console remote control through eHTML5 (sometimes called vKVM) GUI, virtual media, and server telemetry. iDRAC agentless architecture offers server monitoring and control from anywhere without the need to install any software. There are many additional features, from basic power on/off control offered through the GUI, CLI, or API to advanced server profile configuration to ensure that servers have the correct firmware configuration settings.
#3 – Server deployment—The time between when a server is racked and powered until it is live (time to value) can be greatly reduced by leveraging the automation integrated into OpenManage. Starting with streamlining one-to-one deployments, the iDRAC features a lifecycle controller that rapidly configures elements such as RAID storage configurations and populates deployments with up-to-date operating system drivers. In addition, iDRAC also features a zero-touch deployment to automatically download a server configuration profile (SCP) and even complete an unattended operating system installation the first time the server powers up on a customer’s network. Beyond one-to-one solutions, OpenManage offers a broad number of deployment solutions, including: OpenManage Enterprise, offering firmware setting configuration and supporting agnostic operating system installation through ISO images; Microsoft System Center integration; and deeper customizable VMware installations through OpenManage Enterprise for VMware vCenter. Finally, for customers using tools such as Ansible, Terraform, or Prometheus, OpenManage supplies integration packs and sample code leveraging Dell's APIs.
#4 – Manage and update firmware—There are multiple methods to update PowerEdge server firmware, depending on needs. Methods range from one-to-one, using iDRAC/Lifecycle Controller, to console-based methods for updating multiple servers. Leveraging large-scale automation, these tools can audit existing servers, compare online catalogs, then download and apply the correct updates quickly and consistently with massive time savings compared to manual methods. One example is the integration into VMware using OpenManage Enterprise for VMware vCenter, which offers cluster-aware updates, updating one cluster node at a time using DRS to keep workloads up and running. Dell supplies Repository Manager to build custom firmware catalogs like the packaged interpretable ISOs that are used by other Dell updating tools where servers are isolated or air gapped. And, of course, Dell supplies an Ansible module offering firmware updates to the DevOps user base.
#5 – Configuration drift detection—OpenManage Enterprise provides compliance features that detect, highlight, and remediate configuration drift issues, with simple processes for both firmware versions and firmware configuration settings.
#6 – Secure supply chain assurance—Using Dell’s Secure Component Verification (SCV) allows organizations to ensure that their new servers are delivered with the same components installed at Dell Technologies’ manufacturing facility, using a digital, cryptographically secured signed inventory certificate.
#7 – Power usage reporting (and carbon emissions calculations)—There are multiple ways to view server power consumption data, depending on needs and preferences. One way is to open the iDRAC web GUI, while another way is to use scripts, either RACADM or Redfish, to retrieve the data. iDRAC can also send data to the OpenManage Enterprise Power Manager plug-in, where power data, including carbon emissions, is processed and grouped, and can be displayed, reported, and actioned. OpenManage Enterprise can also forward this information to CloudIQ for PowerEdge for additional analysis and visualization. For those customers looking for maximum data, iDRAC9 can stream these power statistics as telemetry data to analytics solutions such as Splunk or ELK Stack for real-time in-depth analysis.
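As a minimal sketch of the Redfish scripting route, the following reads the current power draw from a Redfish Power resource. The `PowerControl` array and its `PowerConsumedWatts` property come from the DMTF Redfish Power schema; the chassis path `System.Embedded.1`, the use of HTTP basic authentication, and the sample values are assumptions that may not match a given system or firmware level.

```python
import base64
import json
import urllib.request

# Assumed resource path; the chassis ID can differ between systems.
POWER_PATH = "/redfish/v1/Chassis/System.Embedded.1/Power"


def power_consumed_watts(power_resource):
    """Extract the current draw from a Redfish Power resource dictionary."""
    controls = power_resource.get("PowerControl") or []
    if not controls:
        raise ValueError("no PowerControl entries in Power resource")
    return controls[0]["PowerConsumedWatts"]


def fetch_power_resource(host, username, password, timeout=10):
    """GET the Power resource over HTTPS with basic authentication.

    Certificate verification is left enabled, so the controller needs a
    certificate that the client trusts.
    """
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    request = urllib.request.Request(
        f"https://{host}{POWER_PATH}",
        headers={
            "Authorization": "Basic " + token.decode("ascii"),
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Trimmed, made-up payload in the shape of a Redfish Power resource.
sample = {"PowerControl": [{"Name": "System Power Control", "PowerConsumedWatts": 312}]}
watts = power_consumed_watts(sample)
print(watts)
```

Against a live controller, the same parser would be applied to the result of `fetch_power_resource(host, username, password)`.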
#8 – Power usage control—Power consumption capping ability is integrated into iDRAC. OpenManage Enterprise Power Manager adds the capability to apply power caps to individual servers or groups of servers. This power capping can be permanent, scheduled at particular times for specific weekends, or ad hoc in response to an incident when reduction in power consumption is required, such as when running on UPS or on-premises generators.
#9 – Thermal event management—While thermal monitoring alerting and even shutdown is integrated into PowerEdge servers through the iDRAC, OpenManage Enterprise Power Manager augments this through powerful Emergency Power Reduction (EPR) policies. This feature reduces the power consumption of servers through a power cap policy to throttle a group of servers. EPR policies can be used as a permanent or scheduled method to limit server power consumption or as an immediate temporary measure during a thermal emergency, for example, CRAC unit failure.
#10 – Performance monitoring—From the iDRAC GUI, CLI, and API, server performance telemetry data can be obtained. OpenManage Enterprise Power Manager can consume and report this data, automatically highlighting idle servers. Telemetry information can be passed to third-party solutions such as Splunk. Finally, CloudIQ can analyze information and present the information in a dashboard format with graphical visualization, and, for key metrics, highlight anomalies based on historic seasonality data.
#11 – Enterprise secure key management—iDRAC provides a standards-based Key Management Interoperability Protocol (KMIP) to encrypt data at rest on self-encrypting SSDs or self-encrypting hard drives and pass the key to a key management system. Solutions such as Thales CipherTrust Manager offer centralized key management for multiple PowerEdge servers and many other products.
#12 – Detailed server telemetry—iDRAC9 provides more than 180 data metrics that can integrate advanced server hardware operation telemetry. Many of these can be reported and visualized in CloudIQ or streamed to analytics solutions such as Splunk. This server telemetry data allows customers to access detailed information to avoid failure events, optimize server operation, and enhance cyber resiliency.
#13 – Automatic call and ticket creation—This ranges from the Dell services plug-in for OpenManage Enterprise, which offers the creation of a support case directly with Dell without any human intervention, to integration with ServiceNow by Dell’s integration pack. Alternatively, OpenManage Enterprise offers a flexible set of actions, including running scripts, forwarding SNMP traps and syslog events, and sending email, based on the monitoring of SNMP events. This automation can be used to pass information to a third-party solution for incident management.
#14 – Capacity planning—The iDRAC provides a large amount of performance statistics. This data can be collected and analyzed by the Dell CloudIQ AIOps solution to produce a forward-looking capacity analysis on items such as CPU usage based on real historical data values for a given server and workload.
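To illustrate what a forward-looking analysis from historical values can look like, the sketch below fits a straight line to past CPU usage samples and estimates when the trend would cross a threshold. The forecasting method that CloudIQ uses is not described here; ordinary least squares is a stand-in chosen for simplicity, and the sample values are made up.

```python
def linear_fit(values):
    """Least-squares slope and intercept for evenly spaced samples."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(values, threshold):
    """Periods after the last sample until the trend reaches the threshold.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = linear_fit(values)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


# Made-up weekly average CPU usage, in percent.
cpu_usage = [40, 42, 44, 46, 48]
slope, intercept = linear_fit(cpu_usage)
weeks_left = periods_until(cpu_usage, threshold=80)
print(slope, intercept, weeks_left)
```

With these samples the trend rises two points per week, so an 80 percent threshold would be reached about 16 weeks after the last sample. Real workloads are rarely this linear, which is one reason a production forecast would account for seasonality as well.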
#15 – Cloud-based infrastructure management—Dell's AIOps solution, CloudIQ, can not only consolidate multiple instances of OpenManage Enterprise, but it can also integrate Dell storage, server, data protection, networking, HCI, and CI products. Hosted in Dell’s secure data center, CloudIQ combines proactive monitoring, machine learning, and predictive analytics to reduce risk, plan ahead, and improve productivity from core to edge.
#16 – Cybersecurity from concept to retirement—Dell Cyber Resilient Architecture 2.0 includes features such as iDRAC silicon-based root of trust, dynamic USB port management, UEFI Secure Boot, and signed firmware updates. All these features are controlled by OpenManage tools that let customers protect, detect, and recover in response to security threats.
We hope that this list has given you a few suggestions on how the OpenManage portfolio can help your organization. Servers are a vital element of organizations’ infrastructure and the foundation of modern business, and it’s critical to manage and monitor them to deliver visibility, productivity, and control. Server management tools not only make tasks easier, faster, and more consistent but also decrease failures and increase efficiency. Remember, don't just manage, automate.
Is your organization using all the features that Dell OpenManage offers and getting the maximum benefits from investing in PowerEdge servers? Ask your account manager for more details.
References
#2 Support for Integrated Dell Remote Access Controller 9 (iDRAC9)
#3 How to create and deploy a Server Template in OpenManage Enterprise (video)
#4 Updating Firmware and Drivers on Dell PowerEdge Servers
#5 Improve Operational Efficiency Through OME Server Drift Management
#6 Dell Technologies Secured Component Verification for PowerEdge
#7, #8, #9 Server Power Consumption Reporting and Management
#10 CloudIQ Provides Data Driven Server Management Decisions
#11 OpenManage Secure Enterprise Key Manager Solutions Brief
#12 Transform Datacenter Analytics with iDRAC9 Telemetry Streaming
#13 Support for OpenManage Integration with ServiceNow
#14 Talking CloudIQ: Capacity Monitoring and Planning
#15 CloudIQ: AIOps for Intelligent IT Infrastructure Insights
#16 Cyber Resilient Security in Dell PowerEdge Servers
Additional resources
- Dell server management portfolio: OpenManage microsite
- API catalog (interactive support resource): Dell Technologies Developer
- Ansible, Python, and PowerShell module libraries and code examples: Dell Technologies GitHub
- Dell systems management offerings: Dell Systems Management Overview Guide