Recovering Clusters Faster: VxRail Serviceability
Tue, 08 Nov 2022 20:13:28 -0000
This is the fifth article in a series introducing VxRail concepts.
Every tool or piece of equipment out there requires maintenance of some kind. That’s as true for the cars we drive as it is for the servers, storage, and switches that power our data centers. However, a lot of data center maintenance is reactive. Look at hardware failure as an example. If a drive were to fail in one of your clusters, nothing would happen until IT staff respond. VxRail offers the ability to automate some of these responses. Let’s talk about what happens when things go sideways in a cluster’s life.
Help Righting the Ship
One of the roles that the VxRail Manager VM fills is that of a centralized alert collector. VxRail integrates with the iDRAC to monitor hardware health and with vCenter to monitor the VMware software, in addition to raising its own internal alerts and events. By monitoring all of this information together, VxRail creates a more holistic monitoring system for the cluster. This obviously benefits IT staff, but this multi-level integration carries some additional benefits that other solutions might struggle to match.
VxRail uses a service called Secure Connect Gateway to establish a persistent connection to Dell data centers. This connection enables a lot of functionality on VxRail, including CloudIQ multi-cluster management, but that’s the subject of a later discussion. It also helps technical support become more proactive in helping you recover your clusters. For example, say a disk failed. With Secure Connect Gateway enabled, VxRail would dial home and create a support case automatically. Support could then use that case to confirm the disk failure and verify that no other hardware or software issues were being raised. Depending on what warranty services you have, you could even opt to have a replacement drive dispatched automatically. It wasn’t uncommon for me to see support cases where we were the first to let the administrators know there was an issue, and it was definitely nice to be able to tell them a fix was already on its way.
These phone-home events that go through Secure Connect Gateway add value beyond automating parts of some dispatches. The gateway also improves the support experience in a few ways, including an integrated support chat applet, accessible from the VxRail Support tab in vCenter. Secure Connect Gateway also facilitates the transfer of the system logs needed to troubleshoot nearly any problem in the VxRail stack. These logs include the VxRail Manager virtual machine logs, vCenter logs, ESXi logs, iDRAC logs, and platform logs. The vCenter and ESXi logs cover the software powering the cluster, while the iDRAC and platform logs contain the hardware inventory, LCM activity, the out-of-band hardware log, and more.
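As a rough sketch of what requesting one of these log bundles programmatically might look like, the helper below assembles a request body for a hypothetical log-collection API call. The field names ("types", "nodes") and the set of valid log types are illustrative assumptions on my part, not the documented VxRail schema; check the VxRail API Guide for the actual call on your release.

```python
def build_log_bundle_request(node_hostnames, log_types):
    """Assemble a request body for a hypothetical log-collection call.

    The field names and the set of valid log types here are illustrative
    assumptions, not the documented VxRail API schema.
    """
    valid = {"vxm", "vcenter", "esxi", "idrac", "platform"}
    unknown = set(log_types) - valid
    if unknown:
        raise ValueError(f"unsupported log types: {sorted(unknown)}")
    return {"types": sorted(log_types), "nodes": sorted(node_hostnames)}

# Hypothetical cluster nodes -- replace with your own hostnames.
body = build_log_bundle_request(["node01", "node02"], ["vxm", "esxi", "idrac"])
print(body)
```

A wrapper like this is mostly useful for catching typos in the requested log types before the request ever leaves your script.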
I’ve touched on a lot of topics surrounding the support experience, but there’s one more that absolutely needs to be mentioned—that’s the people in support! The technical support staff standing behind VxRail are a talented and knowledgeable group. Many of these agents are VMware Certified Professionals, and some are pursuing higher levels of certification, such as the VCIX, one of VMware’s expert-level certifications. This cumulative knowledge pool allows our support team to resolve over 95% of the incidents they encounter without needing a higher-level VMware engagement. In the instances where a VMware engagement is needed, say when a bug is discovered in vCenter, VxRail support can escalate to VMware on the customer’s behalf. This creates a continuity in the support experience that solutions without VxRail’s jointly engineered nature might be missing.
Servicing clusters can become a challenge in any environment. Hardware and software both encounter failures that require a response from IT staff, and as environments grow and scale, the challenge of keeping them healthy grows too. To help meet this expanding problem, VxRail automatically collects events and alerts from the hardware and from both the VMware and VxRail software. This information can then be compressed into log bundles and shared with support. Contacting support is even easier, thanks to an integrated chat connecting your cluster to VxRail support staff. These support specialists know both the VMware and VxRail software and can resolve the vast majority of issues without involving a second vendor. Our final discussion will cover the extensibility of VxRail, featuring CloudIQ and the VxRail API. See you there!
Related Blog Posts
Scaling Up VxRail: Managing an Ecosystem
Tue, 08 Nov 2022 20:13:27 -0000
This is the sixth article in a series introducing VxRail concepts.
The engineering team behind VxRail has done a fantastic job building cluster and life cycle management tools into our software. The cluster update process is an excellent example of one of these enhancements. However, we need to go further. The value of these enhancements decays as you add more and more clusters, each requiring its own management actions, an end result that is antithetical to the entire idea behind VxRail. VxRail avoids reintroducing this complexity through the features and functionality of the VxRail API and CloudIQ. The API scales out management operations by providing access to many of the same software calls that VxRail itself makes. Then we have CloudIQ, a cloud-based management utility that interfaces with a range of Dell infrastructure and that VxRail uses to improve cluster management as environments scale out.
Expanding and Automating Your VxRail Environment
For readers who aren’t familiar with APIs, the acronym stands for Application Programming Interface. APIs exist to help two or more pieces of software communicate with each other. VxRail has its own API that works in conjunction with the VMware APIs and the Redfish API for the iDRAC and hardware, enabling management of the hardware and both the VMware and VxRail software at scale. The VxRail API Guide shows the full range of calls available to developers; there are dozens of them, and the last count I saw was over 70 individual calls. The API offers more than comprehensiveness, though: it’s also simple to use. It can be exercised through the Swagger web interface or through a PowerShell module that provides the kind of command line IT staff are already familiar with.
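To make this concrete, here is a minimal sketch of building an authenticated request against a VxRail Manager REST endpoint. The hostname, credentials, and exact endpoint path (`/rest/vxm/v1/system`) below are illustrative assumptions; consult the VxRail API Guide for the calls and authentication scheme supported on your release.

```python
import base64
import urllib.request

def vxm_request(host, path, username, password):
    """Build a Basic-auth request for a VxRail Manager REST endpoint.

    The path and auth scheme are illustrative assumptions; check the
    VxRail API Guide for what your release actually supports.
    """
    url = f"https://{host}{path}"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={
        "Authorization": f"Basic {token}",
        "Accept": "application/json",
    })

# Hypothetical cluster details -- replace with your own.
req = vxm_request("vxm.example.local", "/rest/vxm/v1/system",
                  "administrator@vsphere.local", "changeme")
print(req.full_url)  # https://vxm.example.local/rest/vxm/v1/system
# urllib.request.urlopen(req) would then return the JSON response.
```

The same request shape works from the Swagger UI or PowerShell; the point is that every call is an ordinary HTTPS request that any tooling can issue.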
The API can help customers of any size, but the customers I see benefit most are large ones with tens to hundreds of nodes across many clusters. The scale of these environments creates a need for automation that links VxRail clusters with existing management tools and practices. Example use cases include node discovery, to see what hardware is available and what versions it is running, and examining node and cluster health across the data center. The API can also enable infrastructure-as-code projects, such as automatically spinning up and winding down clusters as needed. Even automating simple tasks, like shutting down clusters in a way that maintains data consistency, provides massive value to VxRail customers.
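As a sketch of the health-check use case, the snippet below post-processes the kind of inventory a host-listing call might return. The payload shape and field names (`hostname`, `health`, `firmware`) are illustrative assumptions, not the documented schema; in practice you would feed in the JSON returned by the API.

```python
# Sample payload standing in for a real API response; the fields
# below are illustrative assumptions, not the documented schema.
sample_hosts = [
    {"hostname": "node01", "health": "Healthy",  "firmware": "2.12.2"},
    {"hostname": "node02", "health": "Healthy",  "firmware": "2.12.2"},
    {"hostname": "node03", "health": "Critical", "firmware": "2.10.5"},
]

def unhealthy_hosts(hosts):
    """Return the hostnames of any node not reporting a healthy state."""
    return [h["hostname"] for h in hosts if h["health"] != "Healthy"]

def firmware_drift(hosts):
    """Group hostnames by firmware version to spot drift across a cluster."""
    drift = {}
    for h in hosts:
        drift.setdefault(h["firmware"], []).append(h["hostname"])
    return drift

print(unhealthy_hosts(sample_hosts))  # ['node03']
print(firmware_drift(sample_hosts))
```

Run on a schedule across every cluster, a few lines like this turn scattered health data into a single daily report.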
CloudIQ: Helping Manage Your Ecosystem
VxRail has more than the API to aid in managing large environments. As great as the API is, it takes a bit of preparation to use, whereas CloudIQ is ready for use as soon as Secure Connect Gateway is enabled and clusters are enrolled. If you haven’t heard of CloudIQ, I recommend checking out the CloudIQ simulator. The simulator doesn’t provide access to the complete feature set of CloudIQ but makes for an excellent introduction to what the product can do.
CloudIQ is a cloud-based application that monitors and helps resolve problems with Dell storage, server, data protection, networking, HCI, and CI products, as well as APEX services. You might see CloudIQ referred to as an AIOps (artificial intelligence for IT operations) application. In the case of VxRail, telemetry is sent in by customers’ clusters through Secure Connect Gateway, and CloudIQ then performs analytics on it. The output of these analytics can be used to create custom reports, estimate storage utilization, reduce IT risk, and recover from problems faster. Beginning in May and continuing into June, Dell ran a survey of CloudIQ users. These users were able to accelerate IT recovery by anywhere from 2x to 10x, saving them about an entire workday per week on average. CloudIQ provides all of this with no financial or IT overhead, because it is freely available to Dell customers connecting to the Dell cloud.
Growth is exciting, but it comes with new challenges, and old ones don’t go away; they get bigger. VxRail provides customers with an API designed to work with the iDRAC and VMware APIs to provide automation throughout the entire cluster stack. This helps customers reduce repetitive manual tasks and build infrastructure-as-code projects. Then with CloudIQ, IT staff can view their Dell infrastructure from a single pane of glass. For VxRail, this includes software versions, cluster health scores, the ability to initiate updates, and other functionality. While the API offers most of its value to customers with very large VxRail footprints, nearly all customers can benefit from CloudIQ to view multiple clusters along with the rest of their Dell infrastructure.
VxRail Cluster Integrity and Resilience
Tue, 01 Nov 2022 13:30:05 -0000
This is the fourth article in a series introducing VxRail concepts.
Maintaining cluster integrity is an important task to ensure normal business operations. Some readers might not have a great understanding of what cluster integrity means, so let’s quickly define what we’re discussing. Cluster integrity describes a state where a cluster remains free of hardware and software errors. One of the primary challenges to maintaining cluster integrity is handling change through the cluster life cycle. It’s very likely that an administrator will need to make some kind of change—whether to address something minor, like a disk failure, or something more complex, like needing to add new hardware to a cluster. Software updates are a different kind of challenge. Each cluster node has drivers, firmware, an operating system, and the more elaborate VMware and VxRail system software, adding up to a considerable number of files to consider. VxRail helps make patching more holistic and successful.
The hardware life cycle
As a Dell Support specialist, I spoke with customers about many challenges, and one of the biggest was moving through the hardware life cycle. For customers working in a nonclustered environment, hardware refreshes frequently come with highly involved migrations. Administrators can build clusters from their nodes to better provision IT resources, but do-it-yourself clusters are made from off-the-shelf hardware that isn’t necessarily intended to work together. VxRail HCI system software simplifies scale-out changes such as node and drive additions.
Part of the VxRail advantage is its ability to give administrators confidence while facilitating change. Continuously Validated States help achieve this by presenting administrators with hardware choices they can select, knowing that their cluster’s current state was built to support that new hardware. For example, maybe you have a 3-node cluster and are ready to expand it to a 5-node cluster, but a matching NIC isn’t available. Continuously Validated States define other NICs that will work without creating compatibility problems. Automation scripting, used when the new nodes are added to the cluster, scans the network, identifies the available hosts, and orchestrates assigning the nodes to the cluster. This allows customers to scale out a cluster quite easily.
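The filtering this automation performs can be illustrated with a toy stand-in for the compatibility checks behind Continuously Validated States. The field names, NIC models, and version strings below are all hypothetical; the real validation covers far more dimensions than this sketch.

```python
def compatible_hosts(discovered, validated_nics, cluster_version):
    """Filter discovered nodes down to ones a validated state would accept.

    A toy stand-in for Continuously Validated States: the fields and the
    two rules checked here are illustrative assumptions only.
    """
    return [
        h["serial"] for h in discovered
        if h["nic_model"] in validated_nics
        and h["vxrail_version"] == cluster_version
    ]

# Hypothetical discovery results -- serials, NICs, and versions are made up.
discovered = [
    {"serial": "SN100", "nic_model": "NIC-A", "vxrail_version": "7.0.400"},
    {"serial": "SN101", "nic_model": "NIC-C", "vxrail_version": "7.0.400"},
    {"serial": "SN102", "nic_model": "NIC-B", "vxrail_version": "7.0.350"},
]
print(compatible_hosts(discovered, {"NIC-A", "NIC-B"}, "7.0.400"))  # ['SN100']
```

The value of doing this check up front is that incompatible nodes are rejected before an expansion ever starts, not discovered mid-change.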
Nodes can also be removed from a cluster with similar scripting. Used together with the node-addition automation, this scripting helps clusters make intergenerational migrations. Once the new-generation hosts arrive at the data center, they are added to the cluster with one wizard, and the old hosts are removed with another to complete the life cycle move. However, VxRail also supports heterogeneous clusters, allowing you to continue using the older nodes for as long as they are needed and comply with each of the cluster’s Continuously Validated States as the cluster progresses through its life cycle.
Continuing with our travel analogy from a previous blog in this series, if the update process is a vehicle, then each patch is a piece of cargo that an administrator has to worry about. These patches can quickly build into backlogs for IT staff, even if only some of them need to be applied to a given cluster. VxRail life cycle management improves patch control by consolidating these independent release cycles into single bundles with confirmed compatibility among the new software packages. This promotes cluster integrity by creating order among the patches. The bundles become VxRail releases that are made available to cluster owners within 30 days of VMware’s releases. IT staff can then be more selective in the patching process and treat these cycles as opportunities to add new features and functionality to clusters, rather than as a task with no clearly defined benefit or purpose.
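The bundling idea can be shown with a small sketch that compares installed component versions against a bundle manifest to see what an update would actually change. The component names and version strings are hypothetical, and a real bundle manifest carries much more metadata than a flat version map.

```python
def components_to_update(installed, bundle):
    """Compare installed component versions against a bundle manifest.

    Returns {component: (installed_version, bundle_version)} for every
    component the bundle would change. Both inputs map component names to
    version strings; the names and versions used below are illustrative.
    """
    return {
        name: (installed[name], version)
        for name, version in bundle.items()
        if name in installed and installed[name] != version
    }

# Hypothetical state of one cluster versus a new release bundle.
installed = {"esxi": "7.0U3c", "vcenter": "7.0.3.00500", "idrac": "5.10.00.00"}
bundle    = {"esxi": "7.0U3g", "vcenter": "7.0.3.00500", "idrac": "6.00.02.00"}
print(components_to_update(installed, bundle))
```

Because the bundle is validated as a unit, whatever this diff reports is a combination that has already been tested together, which is exactly the order-among-patches benefit described above.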
Maintaining cluster integrity means maintaining the stable and productive working state that businesses need nodes to be in. An HCI cluster built internally faces additional challenges maintaining this integrity, especially as software and hardware changes are needed in the environment. VxRail Continuously Validated States help to both broaden the viable hardware options and provide software patch control through update bundles that bake in the patches. Orchestration automation serves to control the cluster as patches are applied or when hardware changes, such as adding new nodes to the cluster, are made. The next article in this series will discuss serviceability and include topics like disk replacement, alerts and events, the overall support experience, and more!