Talking CloudIQ: Proactive Health Scores
Fri, 05 Aug 2022 20:29:33 -0000
Introduction
This is the second in a series of blogs discussing CloudIQ. In my first blog, I provided a high-level overview of CloudIQ and some of its key features. I will continue with a series of blogs, each talking about one of the key features in more detail. This blog discusses one of CloudIQ’s key differentiating features: the Proactive Health Score.
Proactive Health Score
The Proactive Health Score consolidates various factors into a single score that summarizes a system’s health. Health scores are based on up to five categories: Components, Configuration, Capacity, Performance, and Data Protection. Based on the resulting health score, the system is put into one of three risk categories: Poor, Fair, or Good. The score starts at 100 and is reduced only by the issue with the highest deduction.
A system in the Poor category has a score of 0 to 70 and poses an imminent critical risk. It could be a storage pool that is overprovisioned and full, meaning that systems will be trying to write to storage that is unavailable. Or it could be a significant component failure. Whatever the issue, it is something that requires your immediate attention.
A system in the Fair category has a score of 71 to 94. Systems in this category have an issue that should be looked at, but not one that requires you to get out of bed at 3:00 a.m. to address it. It could be something like a storage pool predicted to be full in a week or a system inlet temperature that exceeds the upper warning threshold on a PowerEdge server.
A system in the Good category has a score of 95 to 100 and is doing fine. There may be a minor issue that you need to look at, but nothing significant that is expected to cause any near-term problems. An example would be a fibre port with a warning status on a Connectrix switch.
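The three score ranges above can be expressed as a small lookup. This is a minimal sketch for illustration only, not CloudIQ code; the function name `risk_category` is hypothetical, and the boundaries are taken directly from the ranges described above.

```python
def risk_category(score: int) -> str:
    """Map a health score (0-100) to the risk category described above.

    Hypothetical helper for illustration; not part of any CloudIQ API.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score <= 70:
        return "Poor"   # 0 to 70: imminent critical risk
    if score <= 94:
        return "Fair"   # 71 to 94: should be looked at, not urgent
    return "Good"       # 95 to 100: at most a minor issue


for example_score in (70, 71, 94, 95):
    print(example_score, risk_category(example_score))
```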
Now what happens if there are multiple issues on a system? We hinted at this earlier: the score is affected only by the issue with the largest deduction. Let’s say that there are four issues on a system: one 30-point deduction, one 10-point deduction, and two 5-point deductions. In this case, the health score is 70 (100 minus 30). When the 30-point issue is addressed, the score becomes 90 (100 minus the next-largest deduction of 10). We do this to prevent a system with several minor issues from appearing at high risk or at a higher risk than a system with a significant issue.
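The arithmetic in this example can be sketched in a few lines. This is an illustration of the rule as described in this blog (start at 100, subtract only the largest deduction), not CloudIQ's actual implementation; the function name `health_score` is hypothetical.

```python
def health_score(deductions: list[int]) -> int:
    """Start at 100 and subtract only the single largest deduction.

    Hypothetical helper for illustration. An empty list means no open
    issues, so the score stays at 100.
    """
    return max(0, 100 - max(deductions, default=0))


issues = [30, 10, 5, 5]
print(health_score(issues))   # 70: only the 30-point issue counts

issues.remove(30)             # the 30-point issue is resolved
print(health_score(issues))   # 90: the 10-point issue is now the largest
```

Note that summing the same four deductions would give 50, which is the kind of result this rule is designed to avoid for a system with one significant issue and a few minor ones.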
Figure 1. System Health page
Recommended resolution
So now that we have been notified of an issue on a system, what do we do next? CloudIQ offers recommended remediation actions to address the issue before it has a significant impact on the environment. These may come in the form of a recommended configuration change or other action, a knowledge base article with a resolution, or some commands to run to gather the information needed to resolve the issue.
Figure 2. Recommended remediation
Health Score History
CloudIQ also tracks the history of the Proactive Health Score. We can see both new and resolved issues along a chart with a selectable date range. Details of the issues are listed below the chart. By providing the history of the health score, CloudIQ allows users to identify possible recurring issues in the environment.
Figure 3. Health Score history
Notifications
What if we do not want to log in to CloudIQ on a daily or weekly basis to check our systems? We can easily be notified by email any time a system health change occurs. These notifications can be set up for a configurable set of systems, allowing users to receive notifications only for the systems for which they are responsible.
For the more motivated user, CloudIQ supports Webhooks. With this feature, CloudIQ can send a Webhook for any health change notification, which lets users integrate with third-party tools such as ServiceNow, Slack, or Teams. Webhooks are sent for both open and closed issues, each with a unique identifier. This allows users to correlate a resolved issue with the corresponding open issue and automatically close out any incident that was created. Some Webhook integration examples can be found here.
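The open/closed correlation described above might look like the following on the receiving side. This is only a sketch under stated assumptions: the payload field names `issue_id` and `state` are hypothetical placeholders, not the documented CloudIQ Webhook schema, and incident creation is stubbed with an in-memory dictionary rather than a real ServiceNow, Slack, or Teams call.

```python
from itertools import count

# Stand-in for a ticketing system: maps issue identifier -> incident number.
_incident_numbers = count(1)
open_incidents: dict[str, str] = {}


def handle_health_event(event: dict) -> str:
    """Open or close an incident based on one health change notification.

    The keys "issue_id" and "state" are assumed for illustration; check
    the actual Webhook payload for the real field names.
    """
    issue_id = event["issue_id"]
    state = event["state"]

    if state == "open":
        # Create an incident only once per unique issue identifier.
        if issue_id not in open_incidents:
            open_incidents[issue_id] = f"INC{next(_incident_numbers):04d}"
        return f"opened {open_incidents[issue_id]}"

    if state == "closed":
        # Use the identifier to find the incident created for the open issue.
        incident = open_incidents.pop(issue_id, None)
        if incident is None:
            return "ignored: no matching open incident"
        return f"closed {incident}"

    raise ValueError(f"unexpected state: {state!r}")
```

Keying everything on the unique identifier is what makes the automatic close-out possible: an open notification creates one incident, and the closed notification carrying the same identifier closes that same incident.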
Conclusion
Whether it be for storage, networking, hyperconverged, servers, or data protection, the Proactive Health Score summarizes the health of a system into a single number, providing an immediate indication of the status of each system. Any issues identified for a system are accompanied by recommended remediation, developed in tandem with experts from each product team, to help with self-service and quickly reduce risk. And with email notifications and Webhooks, users can be notified proactively any time an issue is identified.
Resources
How do you become more familiar with Dell Technologies and CloudIQ? The Dell Technologies Info Hub site provides expertise that helps to ensure customer success with Dell Technologies platforms. We have CloudIQ demos, white papers, and videos available at the Dell Technologies CloudIQ page. Also, feel free to reference the CloudIQ Overview Whitepaper, which provides an in-depth summary of CloudIQ. Interested in DevOps? Go to our public API page for information about integrating CloudIQ with other IT tools using Webhooks or the REST API.
Stay tuned for my next blog, where I'll talk about capacity forecasting and capacity anomaly detection in CloudIQ.
Author: Derek Barboza, Senior Principal Engineering Technologist