Moving Through the VxRail Cluster Life Cycle
Tue, 01 Nov 2022 13:29:21 -0000
This is the third article in a series introducing VxRail concepts.
The last entry in this series discussed Continuously Validated States and the benefits of having new cluster states tested before they ever reach your environment. This article is about movement: specifically, movement to new cluster states with new software, firmware, and drivers. If Continuously Validated States provide known-good destination states and create our map, then the VxRail enhanced update process is the vehicle used to move from one state to the next. Let’s dive into some of the specifics that illustrate the advantage of the VxRail process over traditional update processes and VMware’s vSphere Lifecycle Manager (vLCM).
The first step in an update cycle is to define a new state, so let’s discuss that first. Whether updating a single server or an HCI solution you built yourself, the first step in building this state is identifying all the hardware so that nothing gets missed in the cycle. Once all hardware is accounted for, administrators can begin to collect the updated driver and firmware packages. Depending on the volume of hardware, updating a single node might well require around 15 different packages to touch all the system software, drivers, and firmware. If the environment has different hardware among nodes, then this task becomes more complex with more components to account for.
Where administrators would previously construct an update themselves, VxRail users perform updates by using prebuilt packages. These prebuilt packages contain the components needed to move a cluster to its next Continuously Validated State and are intended to service the VxRail family, as opposed to a specific cluster. This means that whether you’re working with smaller clusters with different hardware or large, monolithic clusters, you can use a single bundle to bring the entire cluster up to date. In addition to the individual update packages, the bundle also carries the new Continuously Validated State with it. Because nobody has to assemble the update by hand, IT staff are freed up for tasks with a greater business impact.
Life cycle management prechecks
In addition to compressing updates into single packages, the VxRail update process performs a series of readiness prechecks to ensure that the cluster is in a state where it is ready to accept an update. These tasks are examples of VxRail automation that obviously wouldn’t be present in an IT-designed HCI solution or traditional infrastructure. Let’s talk about what some of these prechecks are and what they can do for you as a user.
The precheck process examines more than 200 different items, so I won’t go into all of them here. However, I would like to highlight a few areas. Let’s start with hardware and work our way up. The hardware prechecks run a full range of tests to confirm cluster health. For example, memory is checked for bit errors that could cause a host to crash during the update. Inventory checks confirm that the hardware profile hasn’t changed to include components that our bundles can’t address, such as an unsupported NIC or another PCI device.
Prechecks extend to software versions as well. Software prechecks examine items such as whether a host successfully entered maintenance mode or if services are in the proper state to begin an update. These prechecks, in some cases, replace user interaction, as with the ability to cycle hosts into maintenance mode.
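To make the idea of a precheck pass more concrete, here is a minimal sketch in Python. It is not VxRail code: the host dictionaries, the field names (`memory_bit_errors`, `pci_devices`, `services`), and the three checks are all hypothetical stand-ins for the kinds of hardware, inventory, and service checks described above. The only point is the pattern: run every check on every host, and refuse to proceed if anything fails.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Set

Host = Dict[str, Any]  # hypothetical host snapshot, not a VxRail data structure


@dataclass
class CheckResult:
    host: str
    check: str
    passed: bool
    detail: str = ""


def check_memory(host: Host) -> CheckResult:
    # Hardware check: any recorded bit errors fail the host.
    errors = host.get("memory_bit_errors", 0)
    return CheckResult(host["name"], "memory", errors == 0, f"{errors} bit errors")


def check_inventory(host: Host, supported: Set[str]) -> CheckResult:
    # Inventory check: flag devices the update bundle cannot address.
    unknown = sorted(set(host.get("pci_devices", [])) - supported)
    return CheckResult(host["name"], "inventory", not unknown, f"unsupported: {unknown}")


def check_services(host: Host) -> CheckResult:
    # Software check: every listed service must report "running".
    stopped = sorted(s for s, state in host.get("services", {}).items() if state != "running")
    return CheckResult(host["name"], "services", not stopped, f"not running: {stopped}")


def run_prechecks(hosts: List[Host], supported: Set[str]) -> List[CheckResult]:
    """Run every check on every host and return only the failures."""
    results = []
    for host in hosts:
        results.append(check_memory(host))
        results.append(check_inventory(host, supported))
        results.append(check_services(host))
    return [r for r in results if not r.passed]


# Illustrative data only: node-2 has memory errors and an unsupported NIC.
hosts = [
    {"name": "node-1", "memory_bit_errors": 0, "pci_devices": ["nic-a"],
     "services": {"mgmt": "running"}},
    {"name": "node-2", "memory_bit_errors": 3, "pci_devices": ["nic-a", "nic-x"],
     "services": {"mgmt": "running"}},
]
failures = run_prechecks(hosts, supported={"nic-a"})
for failure in failures:
    print(f"{failure.host}: {failure.check} failed ({failure.detail})")
```

An empty failure list is what "ready to accept an update" means in this sketch. The real precheck process covers far more items than these three.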
After the prechecks are complete, VxRail shows users all the hardware and software affected by the update. As indicated in the screenshot, this summary helps users understand exactly what is changing in the cluster.
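Conceptually, this kind of change summary is a comparison between what is installed now and what the target state calls for. The Python sketch below illustrates that comparison under stated assumptions: the component names and version strings are invented for the example, and the function is not how VxRail itself computes its summary.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, Optional[str], str]


def change_report(current: Dict[str, str], target: Dict[str, str]) -> List[Change]:
    """List (component, installed version, target version) for each component that will change."""
    changes = []
    for component in sorted(target):
        installed = current.get(component)  # None means the component is not installed yet
        if installed != target[component]:
            changes.append((component, installed, target[component]))
    return changes


# Hypothetical component names and versions, for illustration only.
current = {"hypervisor": "1.0", "bios": "2.1", "nic-firmware": "5.0"}
target = {"hypervisor": "1.1", "bios": "2.1", "nic-firmware": "5.2", "disk-firmware": "3.0"}

for component, old, new in change_report(current, target):
    print(f"{component}: {old or 'not installed'} -> {new}")
```

Components already at the target version (here, `bios`) are left out of the report, so the reader sees only what will actually change.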
Launching an update
Users have two options for launching an update—they can update the cluster immediately, or they can schedule it to run at a planned time. A lot of customers I worked with liked to schedule their updates to run over weekends. Users might think that this is largely analogous to VMware’s vLCM. vLCM does offer automation benefits, but users must still create their own cluster profiles, create their own images, and perform their own testing. So, while vLCM certainly offers some automation advantages, VxRail takes this further by enhancing the update package collection and application processes. VxRail clusters can also be updated through the API or with CloudIQ.
Hopefully, this has helped illuminate some of the value that VxRail can provide to the cluster update cycle. Users get the benefit of consolidated update packages, saving the time and effort of collecting these files themselves. An in-depth series of prechecks then combs through cluster hardware and software to confirm that a cluster is in an ideal state to accept update packages. Once this is complete, change analysis scripting specifies the changes to be made to the environment. Finally, with the application of the update, VxRail sequentially moves node to node and cycles each through the update list, placing the nodes into maintenance mode and having vMotion move workloads to other available nodes. Taken together, these services, which are under continuous improvement by VxRail Engineering, help to make the update cycle as easy as possible.
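The node-by-node sequence described above can be sketched as a simple loop. This is a toy simulation in Python, not VxRail's implementation: the node dictionaries are hypothetical, and "moving a workload" here just means moving a name from one list to another, where a real cluster would rely on vMotion.

```python
from typing import Any, Dict, List

Node = Dict[str, Any]  # hypothetical node record: name, version, vms, maintenance


def rolling_update(nodes: List[Node], target_version: str) -> List[str]:
    """Update nodes one at a time, evacuating workloads to peers first."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to evacuate workloads")
    log = []
    for node in nodes:
        peers = [n for n in nodes if n is not node]
        # Enter maintenance mode and move each workload to the least-loaded peer.
        node["maintenance"] = True
        while node["vms"]:
            vm = node["vms"].pop()
            least_loaded = min(peers, key=lambda n: len(n["vms"]))
            least_loaded["vms"].append(vm)
        # Apply the update, then return the node to service before moving on.
        node["version"] = target_version
        node["maintenance"] = False
        log.append(f"{node['name']} updated to {target_version}")
    return log


# Illustrative three-node cluster.
cluster = [
    {"name": "node-1", "version": "1.0", "vms": ["vm-a", "vm-b"], "maintenance": False},
    {"name": "node-2", "version": "1.0", "vms": ["vm-c"], "maintenance": False},
    {"name": "node-3", "version": "1.0", "vms": [], "maintenance": False},
]
for line in rolling_update(cluster, "1.1"):
    print(line)
```

Only one node is ever in maintenance mode at a time, so workloads always have somewhere to go. That property is what lets the cluster keep serving while each node is updated.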
Related Blog Posts
Recovering Clusters Faster: VxRail Serviceability
Tue, 08 Nov 2022 20:13:28 -0000
This is the fifth article in a series introducing VxRail concepts.
Every tool or piece of equipment out there requires maintenance of some kind. That’s as true for the cars we drive as it is for the servers, storage, and switches that power our data centers. However, a lot of data center maintenance is reactive. Look at hardware failure as an example. If a drive were to fail in one of your clusters, nothing would happen until IT staff respond. VxRail offers the ability to automate some of these responses. Let’s talk about what happens when things go sideways in a cluster’s life.
Help Righting the Ship
One of the roles that the VxRail Manager VM fills is that of a centralized alert collector. VxRail integrates with the iDRAC to monitor hardware health and with vCenter to monitor VMware software, in addition to tracking its own internal alerts and events. By bringing all this information together, VxRail creates a more holistic monitoring system for a cluster. This obviously benefits IT staff, but there are some additional benefits to this multi-level integration that other solutions might struggle to match.
VxRail uses a service called “Secure Connect Gateway” to establish a static connection to Dell data centers. This enables a lot of functionality on VxRail, including CloudIQ integration for multi-cluster management, but that’s the subject of a future discussion. This static connection helps technical support become more proactive in helping you recover your clusters. For example, say you had a disk fail. If Secure Connect Gateway is enabled, VxRail would dial home and create a case automatically. Support could then use this case to confirm the disk failure and check that no other hardware or software issues are being raised. Depending on what warranty services you have, you could even opt to have a replacement hard drive sent out automatically. It wasn’t uncommon for me to see support cases where we were the first to let the administrators know that there was an issue. It was definitely nice to be able to tell them a replacement was already on its way out to them.
The dial-home connection through Secure Connect Gateway does more than help automate parts of some dispatches. The gateway also improves the support experience. It does this in a few ways, including providing an integrated support chat applet, accessible from the VxRail Support tab in vCenter. Secure Connect Gateway also facilitates the transfer of the system logs needed to troubleshoot almost any problem in the VxRail stack. These logs include the VxRail Manager virtual machine logs, vCenter logs, ESXi logs, iDRAC logs, and platform logs. The vCenter and ESXi logs are specific to the software powering the cluster. The iDRAC and platform logs contain the hardware inventory, LCM activity, out-of-band hardware log, and more.
I’ve touched on a lot of topics surrounding the support experience, but there’s one more that absolutely needs to be mentioned: the people in support! The technical support staff standing behind VxRail are a very talented and knowledgeable group of folks. Many of these agents are VMware Certified Professionals, and some are pursuing higher levels of certification, like the VCIX, one of VMware’s expert-level certifications. This cumulative knowledge pool allows our support team to resolve over 95% of the incidents they encounter without needing a higher-level VMware engagement. In the instances where a VMware engagement is needed, for example when a bug is discovered in vCenter, VxRail support can escalate to VMware on the end customer’s behalf. This helps to create continuity in the support experience that might be missing from a solution without the jointly engineered nature of VxRail.
Servicing clusters can become a challenge, no matter the environment. Hardware and software both encounter failures that require an IT staff response. As environments grow and scale, the challenge of keeping them healthy grows, too. To help meet this expanding problem, VxRail automatically collects events and alerts from the hardware and from both VMware and VxRail software. This information can then be compressed into log bundles that can be shared with support. Contacting support is even easier, thanks to an integrated chat connecting you to VxRail support staff. These support staff are specialists in both VMware and VxRail software, capable of resolving the vast majority of issues with a single vendor. Our final discussion will be on the extensibility of VxRail, featuring CloudIQ and the VxRail API. See you there!
Scaling Up VxRail: Managing an Ecosystem
Tue, 08 Nov 2022 20:13:27 -0000
This is the sixth article in a series introducing VxRail concepts.
The engineering team behind VxRail has done a fantastic job building cluster and life cycle management tools into our software. The cluster update process is an excellent example of one of these software enhancements. However, we need to go further. The value of these enhancements decays a bit as you add more and more clusters, because each one adds to the actions required to manage the environment. That outcome would be antithetical to the entire idea behind VxRail. The VxRail API and CloudIQ exist to keep this complexity from creeping back in. The API scales out management operations by providing access to many of the same software calls that VxRail itself makes. Then we have CloudIQ, a cloud-based management utility that interfaces with a range of Dell infrastructure products, VxRail included, to improve cluster management as environments scale out.
Expanding and automating your VxRail environment
For readers who aren’t familiar with APIs, the acronym stands for “Application Programming Interface.” APIs exist to help two, or sometimes more, pieces of software communicate with each other. VxRail has its own API that works in conjunction with VMware APIs and the Redfish API for the iDRAC and hardware. This enables the management of hardware and of both VMware and VxRail software at scale. The VxRail API Guide shows the full range of calls available to developers. There are dozens of them; the last number I saw was over 70 individual calls. There’s more to the API than its breadth, though. It is also simple to use: the API can be exercised through the Swagger web interface, and a PowerShell module provides the kind of simple command line interface that IT staff are familiar with.
The API can help customers of any size, but the customers I see benefiting the most are large ones that might have tens to hundreds of nodes across many clusters. The scale of these environments creates a need for further automation that can link VxRail clusters with management tools and practices. Some example use cases are node discovery, to see what hardware is available and which versions are running on it, and examining node and cluster health throughout the data center. The API can also enable infrastructure-as-code projects, such as automatically spinning up and winding down clusters as needed. Even automating simple tasks, like shutting down clusters in a way that maintains data consistency, provides massive value to VxRail customers.
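As a rough illustration of the discovery and health use cases, the Python sketch below summarizes an inventory gathered from several clusters. The JSON shape and field names (`nodes`, `serial`, `version`, `health`) are assumptions made up for this example and are not the VxRail API's actual response schema. In practice, the data would come from calls documented in the VxRail API Guide.

```python
import json
from collections import Counter
from typing import List, Tuple

# Hypothetical inventory payload; the real API's schema may differ.
SAMPLE_INVENTORY = """
[
  {"name": "cluster-east", "nodes": [
    {"serial": "AAA111", "version": "1.1", "health": "Healthy"},
    {"serial": "AAA112", "version": "1.1", "health": "Warning"}
  ]},
  {"name": "cluster-west", "nodes": [
    {"serial": "BBB221", "version": "1.0", "health": "Healthy"}
  ]}
]
"""


def summarize(payload: str) -> Tuple[Counter, List[Tuple[str, str]]]:
    """Count nodes per software version and list nodes that are not healthy."""
    versions: Counter = Counter()
    unhealthy = []
    for cluster in json.loads(payload):
        for node in cluster["nodes"]:
            versions[node["version"]] += 1
            if node["health"] != "Healthy":
                unhealthy.append((cluster["name"], node["serial"]))
    return versions, unhealthy


versions, unhealthy = summarize(SAMPLE_INVENTORY)
print("Nodes per version:", dict(versions))
print("Needs attention:", unhealthy)
```

At tens to hundreds of nodes, a report like this is the sort of thing that is tedious to assemble by hand and trivial to script once the data is available programmatically.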
CloudIQ: Helping Manage Your Ecosystem
VxRail has more than the API to aid in managing large environments. As great as the API is, it takes a bit of preparation to use, whereas CloudIQ is ready for use as soon as Secure Connect Gateway is enabled and clusters are enrolled. If you haven’t heard of CloudIQ, I recommend checking out the CloudIQ simulator. The simulator doesn’t provide access to the complete feature set of CloudIQ but makes for an excellent introduction to what the product can do.
CloudIQ is a cloud-based application that monitors and helps resolve problems with Dell storage, server, data protection, networking, HCI, and CI products, and APEX services. You might see CloudIQ referred to as an AIOps application, which is short for artificial intelligence for IT operations. In the case of VxRail, customers’ clusters send data through Secure Connect Gateway to CloudIQ, which then performs analytics on it. The results of these analytics can be used to create custom reports, estimate storage utilization, reduce IT risk, and recover from problems faster. Beginning in May and continuing into June, Dell ran a survey of CloudIQ users. These users reported resolving IT issues anywhere from 2x to 10x faster, which saved them about an entire workday per week on average. CloudIQ provides all this with no added financial or IT overhead, because it is freely available to Dell customers who connect to the Dell cloud.
Growth is exciting, but it comes with new challenges, and old ones don’t go away; they get bigger. VxRail provides customers with an API designed to work with the iDRAC and VMware APIs to provide automation throughout the entire cluster stack. This helps customers reduce repetitive labor and build infrastructure-as-code projects. Then with CloudIQ, IT staff can get a view of their Dell infrastructure equipment from one pane of glass. For VxRail, this includes software versions, cluster health scores, the ability to initiate updates, and other functionality. While the API offers most of its value to customers with very large VxRail footprints, almost all customers can benefit from using CloudIQ to view multiple clusters as well as the remainder of their Dell infrastructure equipment.