Exploring the customer experience with lifecycle management for vSAN Ready Nodes and VxRail clusters
Thu, 24 Sep 2020 19:41:49 -0000
The difference between VMware vSphere LCM (vLCM) and Dell EMC VxRail LCM is still a trending topic that most HCI customers and prospects want more information about. While we compared the two methods at a high level in our previous blog post, let’s dive into the more technical aspects of the LCM operations of VMware vLCM and VxRail LCM. The detailed explanation in this blog post should give you a more complete understanding of your role as an administrator for cluster lifecycle management with vLCM versus VxRail LCM.
Even though vLCM has introduced a vast improvement in automating cluster updates, lifecycle management is more than executing cluster updates. With vLCM, lifecycle management is still very much a customer-driven endeavor. By contrast, VxRail’s overarching goal for LCM is operational simplicity, achieved by leveraging Continuously Validated States to drive cluster LCM for the customer. This is a large part of why VxRail has gained more than 8,600 customers since its launch in early 2016.
In this blog post, I’ll explain the four major areas of LCM:
- Defining the initial baseline configuration
- Planning for a cluster update
- Executing the cluster update
- Sustaining cluster integrity over the long term
Defining the initial baseline configuration
The baseline configuration is a vital part of establishing a steady state for the life of your cluster. The baseline configuration is the current known good state of your HCI stack. In this configuration, all the component software and firmware versions are compatible with one another. Interoperability testing has validated full stack integrity for application performance and availability while also meeting security standards in place. This is the ‘happy’ state for you and your cluster. Any changes to the configuration use this baseline to know what needs to be rectified to return to the ‘happy’ state.
How is it done with vLCM?
vLCM depends on the hardware vendor to provide a Hardware Management Services virtual machine. Dell provides this support for its Dell EMC PowerEdge servers, including vSAN ReadyNodes. I’ll use this implementation to explain the overall process. Dell EMC vSAN ReadyNodes use the OpenManage Integration for VMware vCenter (OMIVV) plugin to connect to and register with the vCenter Server.
Once the VM is deployed and registered, you need to create a credential-based profile. This profile captures two accounts: one for the out-of-band hardware interface, the iDRAC, and the other for the root credentials for the ESXi host. Future changes to the passwords require updating the profile accordingly.
With the VM connection and profile in place, vLCM uses a Catalog XML file to define the initial baseline configuration. To create the Catalog XML file, you need to install and configure the Dell Repository Manager (DRM) to build the hardware profile. Once a profile is defined to your specification, it must be exported and stored on an NFS or CIFS share. The profile is then used to populate the Repository Profile data in the OMIVV UI. If you are unsure of your configuration, refer to the vSAN Hardware Compatibility List (HCL) for the specific supported firmware versions. Once the hardware profile is created, you can associate it with the cluster profile. With the cluster profile defined, you can enable drift detection. Any future change to the Catalog XML file is made within the DRM.
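Conceptually, drift detection is just a comparison of each component’s running version against the known good version recorded in the baseline. Here is a minimal sketch in Python (the component names and version strings are invented for illustration; this is not VMware or Dell code):

```python
# Conceptual sketch of baseline drift detection.
# Component names and version strings below are invented for illustration.

def detect_drift(baseline: dict, actual: dict) -> dict:
    """Return {component: (expected, found)} for every out-of-baseline component."""
    return {
        component: (expected, actual.get(component))
        for component, expected in baseline.items()
        if actual.get(component) != expected
    }

baseline = {"esxi": "7.0.0", "nic_firmware": "19.5.12", "hba_driver": "4.1.2"}
actual   = {"esxi": "7.0.0", "nic_firmware": "18.8.9",  "hba_driver": "4.1.2"}

drift = detect_drift(baseline, actual)
print(drift)  # {'nic_firmware': ('19.5.12', '18.8.9')}
```

An empty result means the cluster is at its known good state; anything else tells you exactly what must be remediated to return to the ‘happy’ state.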
It’s important to note that vLCM was introduced in vSphere 7.0. To use vLCM, you must first update or deploy your clusters to run vSphere 7.x.
How is it done with VxRail LCM?
With VxRail, when the cluster arrives at the customer data center, it’s already running in a ‘happy’ state. For VxRail, the ‘happy’ state is called Continuously Validated States. The term is pluralized because VxRail defines all the ‘happy’ states that your cluster will update to over time. This means that your cluster is always running in a ‘happy’ state without you needing to research, define, and test your way to Continuously Validated States throughout the life of your cluster. VxRail (specifically, the VxRail engineering team) does it for you. This has been a central tenet of VxRail since the product first launched with vSphere 6.0. Since then it has helped customers transition to vSphere 6.5, 6.7, and now 7.0.
Once VxRail cluster initialization is complete, use your Dell EMC Support credentials to configure the VxRail repository setting within vCenter. The VxRail Manager plugin for vCenter then automatically connects to the VxRail repository at Dell EMC and pulls down the next available update package.
Figure 1 Defining the initial baseline configuration
Planning for a cluster update
Updates are a constant in IT: VMware regularly adds new capabilities and product or security fixes that require moving to newer software versions. Take, for example, the vSphere 7.0 Update 1 release that VMware and Dell Technologies just announced. Those eye-opening features are available to you when you update to that release. You can check out just how often VMware has historically updated vSphere here: https://kb.vmware.com/s/article/2143832.
As you know, planning for a cluster update is an iterative process with inherent risk associated with it. Failing to plan diligently can cause adverse effects on your cluster, ranging from network outages and node failure to data unavailability or data loss. That said, it’s important to mitigate the risk where you can.
How is it done with vLCM?
With vLCM, the responsibility of planning for a cluster update rests on the customer’s shoulders, including the risk. Understanding the Bill of Materials that makes up your server’s hardware profile is paramount to success. Once all the components are known and a target version of vSphere ESXi is specified, the supported driver and firmware versions need to be investigated and documented. You must consult the VMware Compatibility Guide to find out which drivers and firmware are supported for each ESXi release.
It is important to note that although vLCM gives you the toolset to apply firmware and driver updates, it does not validate compatibility or support for each combination for you, except for the HBA driver. This task is firmly in the customer’s domain. It is advisable to validate and test each combination in a separate test environment to ensure that no performance regressions or issues are introduced into the production environment. Interoperability testing can be an extensive and expensive undertaking. Customers should create and define robust testing processes to ensure that full interoperability and compatibility are met for all components managed and upgraded by vLCM.
With Dell EMC vSAN Ready Nodes, customers can rest assured that the HCL certification and compatibility validation steps have been performed. However, the customer is still responsible for interoperability testing.
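The compatibility research described above amounts to a lookup: for a target ESXi release, every device must resolve to a supported driver/firmware pair in the compatibility guide. A toy sketch of that check (the table entries are invented, not real HCL data):

```python
# Toy compatibility lookup; the entries are invented, not real HCL data.
HCL = {
    ("ESXi 7.0", "ExampleNIC 25GbE"): {"driver": "1.9.5", "firmware": "214.0.253"},
    ("ESXi 7.0", "ExampleHBA 330"): {"driver": "17.0", "firmware": "16.17.00"},
}

def supported(esxi: str, device: str, driver: str, firmware: str) -> bool:
    """True only if the exact driver/firmware pair is listed for this release."""
    entry = HCL.get((esxi, device))
    return entry == {"driver": driver, "firmware": firmware}

print(supported("ESXi 7.0", "ExampleNIC 25GbE", "1.9.5", "214.0.253"))  # True
print(supported("ESXi 7.0", "ExampleNIC 25GbE", "1.9.5", "210.0.100"))  # False
```

The real work, of course, is populating and maintaining such a table for every device across every ESXi release — which is exactly the burden the planning phase places on the customer.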
How is it done with VxRail LCM?
VxRail engineering has taken a unique approach to LCM. Rather than leaving time-consuming LCM planning to already overburdened IT departments, it drastically reduces the risk through a comprehensive regression test plan backed by an investment of over $60 million, more than 25,000 hours of testing for major releases, and a team of more than 100 members. This plan is completed prior to every VxRail code release. (This is in addition to the testing and validation performed for PowerEdge, on which VxRail nodes are built.)
Dell EMC VxRail engineering performs this testing within 30 days of any new VMware release (even quicker for express patches), so that customers can continually benefit from the latest VMware software innovations and confidently address security vulnerabilities. You may have heard this called “synchronous release”.
The outcome of this effort is a single update bundle that is used to update the entire HCI stack, including the operating system, the hardware’s drivers and firmware, and management components such as VxRail Manager and vCenter. This allows VxRail to define the declarative configuration we mentioned previously (“Continuously Validated States”), allowing us to move easily from one validated state to the next with each update.
Figure 2 Planning for a cluster update
Executing the cluster update
The biggest improvement with vLCM is its ability to orchestrate and automate a full stack HCI cluster update. This simplifies the update operation and brings enormous time savings. This process is showcased in a recent study performed by Principled Technologies on PowerEdge servers running vSphere (not including vSAN).
How is it done with vLCM?
The first step is to import the ESXi ISO via the vLCM tab in the vCenter Server UI. Once it is uploaded, select the relevant cluster and ensure that the cluster profile (created during the initial baseline configuration phase) is associated with the cluster being updated. Now you can apply the target configuration by editing the ESXi image and, from the OMIVV UI, choosing the correct firmware and driver to apply to the hardware profile. Once a compliance scan is complete, you have the option to remediate all hosts.
If there are multiple homogeneous clusters you need to update, it can be as easy as reusing the same cluster profile for each cluster update. However, if the next cluster has a different hardware configuration, you have to perform the above steps all over again. Customers with varying hardware and software requirements for their clusters will have to repeat many of these steps, including the planning tasks, to ensure stack integrity.
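At a high level, the remediation just described is a rolling loop over the cluster’s hosts: check compliance, and for each non-compliant host enter maintenance mode, apply the image, and exit. A skeleton of that orchestration (the dictionary updates are placeholders standing in for real vLCM operations, not actual API calls):

```python
# Skeleton of a rolling cluster remediation loop. The dictionary updates are
# placeholders standing in for real vLCM operations, not actual API calls.

def is_compliant(host: dict, target_image: str) -> bool:
    return host["image"] == target_image

def remediate_cluster(hosts: list, target_image: str) -> list:
    """Remediate hosts one at a time; return the names of hosts that were updated."""
    updated = []
    for host in hosts:
        if is_compliant(host, target_image):
            continue                      # already at the target image, skip
        host["maintenance_mode"] = True   # evacuate VMs, enter maintenance mode
        host["image"] = target_image      # apply ESXi image plus firmware/drivers
        host["maintenance_mode"] = False  # exit maintenance mode, move to next host
        updated.append(host["name"])
    return updated

hosts = [
    {"name": "esx01", "image": "7.0-baseline", "maintenance_mode": False},
    {"name": "esx02", "image": "6.7-old", "maintenance_mode": False},
]
print(remediate_cluster(hosts, "7.0-baseline"))  # ['esx02']
```

Remediating one host at a time keeps the rest of the cluster serving VMs, which is why the wall-clock savings from automating this loop are so significant.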
How is it done with VxRail LCM?
With VxRail and Continuously Validated States, updating from one configuration to another is even simpler. You can access the VxRail Manager directly within the vCenter Server UI to initiate the update. The LCM operation automatically retrieves the update bundle from the VxRail repository, runs a full stack pre-update health check, and performs the cluster update.
With VxRail, performing multi-cluster updates is as simple as performing a single-cluster update. The same LCM cluster update workflow is followed. While different hardware configurations on separate clusters will add more labor for IT staff for vSAN Ready Nodes, this doesn’t apply to VxRail. In fact, in the latest release of our SaaS multi-cluster management capability set, customers can now easily perform cluster updates at scale from our cloud-based management platform, MyVxRail.
Figure 3 Executing a cluster update
Sustaining cluster integrity over the long term
The long-term integrity of a cluster outlasts the software and hardware in it. As mentioned earlier, because new releases are made available frequently, software has a very short life span. While hardware has more staying power, it won’t outlast some of the applications running on it. New hardware platforms will emerge. New hardware devices will enter the market that enable new workloads, such as machine learning, graphics rendering, and visualization workflows. You will need the cluster to evolve non-disruptively to deliver the application performance, availability, and diversity your end users require.
How is it done with vLCM?
In its current form, vLCM struggles with long-term cluster lifecycle management. In particular, its inability to support heterogeneous nodes (nodes with different hardware configurations) in the same cluster limits application diversification and the ability to take advantage of new hardware platforms without impacting end users.
How is it done with VxRail LCM?
VxRail LCM touts its ability to allow customers to grow non-disruptively and to scale their clusters over time. That includes adding non-identical nodes into the clusters for new applications, adding new hardware devices for new applications or more capacity, or retiring old hardware from the cluster.
Figure 4 Comparing vSphere LCM and VxRail LCM cluster update operations driven by the customer
The VMware vLCM approach empowers customers who are looking for more configuration flexibility and control. They have the option to select their own hardware components and firmware to build the cluster profile. With this freedom comes the responsibility to define the HCI stack and make investments in equipment and personnel to ensure stack integrity. vLCM supports this customer-driven approach with improvements in cluster update execution for faster outcomes.
Dell EMC VxRail LCM continues to take a more comprehensive approach to optimizing operational efficiency from the point of view of the customer. VxRail customers value its LCM capabilities because they reduce operational time and effort, which can be redirected to other areas of need in IT. VxRail takes on the responsibility of driving stack integrity for the lifecycle management of the cluster with Continuously Validated States. And VxRail sustains stack integrity throughout the life of the cluster, allowing you to simply and predictably evolve with technology trends.
Related Blog Posts
What is the difference between VMware vSphere LCM and VxRail LCM
Fri, 13 Nov 2020 19:08:38 -0000
Update to VxRail 7.0.100 and Unleash the Performance Within It
Mon, 02 Nov 2020 15:50:28 -0000
What could be better than faster storage? How about faster storage, more capacity, and better durability?
Last week at Dell Technologies we released VxRail 7.0.100. This release brings support for the latest versions of VMware vSphere and vSAN, 7.0 Update 1. Typically, an update release includes a new feature or two, but this time VMware outdid themselves and crammed into this update not only a load of new or significantly enhanced features but also some game-changing performance enhancements. My peers at VMware have already done a fantastic job of explaining these features, so I won’t attempt to replicate their work — you can find links to the blogs on the features that caught my attention in the reference section below. Rather, I want to draw attention to the performance gains and ask the question: could RAID5 with compression only be the new normal?
Don’t worry, I can already hear the cries of “max performance needs RAID1, RAID5 has IO amplification and parity overhead, data reduction services have drawbacks”, but bear with me a little. I’m not suggesting that RAID5 with compression only be used for all workloads; some workloads are definitely unsuitable (streams of compressed video come to mind). Rather, I’m merely suggesting that after our customers go through the painless process of updating their cluster to VxRail 7.0.100 from one of our 36 previous releases over the past two years (yes, you can leap straight from 4.5.211 to 7.0.100 in a single update, and yes, we do handle the converging and decommissioning of the Platform Services Controller), they check out the reduction in storage IO latency that their existing workload is putting on their VxRail cluster and investigate what it represents: in short, more storage performance headroom.
As customers buy VxRail clusters to support production workloads, they can’t exactly load them up with a variety of benchmark workload tests to see how far they can be pushed. But at VxRail we are fortunate to have our own dedicated performance team, who have enough VxRail nodes to run a mid-sized enterprise and access to a large library of components, so that they can replicate almost any VxRail configuration we sell (and a few we don’t). So there is data behind my outrageous suggestion; it isn’t just back-of-the-napkin mathematics.

Grab a copy of the performance team’s recent findings in their whitepaper, Harnessing the performance of Dell EMC VxRail 7.0.100: A lab-based performance analysis of VxRail, and skip to figure 3. There you’ll find some very telling before-and-after latency curves, with and without data reduction services, for an RDBMS workload. Spoiler: 58% more peak IOPS and almost 40% lower latency without data reduction services; with compression only, this drops to a still very significant 45% more peak IOPS with 30% lower latency. (For those of you screaming “but failure domains”, check out the blog Space Efficiency Using the New “Compression only” Option, where Pete Koehler explains the issue and how it no longer exists with compression only.) But what about RAID5? Skip up to figure 1, which summarizes the across-the-board performance gains for IOPS and throughput — impressive, right? Now slide down to figure 2 to compare the throughput; in particular, compare RAID1 on 7.0 with RAID5 on 7.0 U1. The read performance is almost identical, while the gap in write performance has narrowed. Write performance on RAID5 will likely always lag RAID1 due to IO amplification, but VMware is clearly looking to narrow that gap as much as possible.
If nothing else, the whitepaper should tell you that a simple hassle-free upgrade to VxRail 7.0.100 will unlock additional performance headroom on your vSAN cluster without any additional costs, and that the tradeoffs associated with RAID5 and data reduction services (compression only) are greatly reduced. There are opportunistic space savings to be had from compression only, but they require committing to a cluster-wide change, which is something that should not be taken lightly. However, the guaranteed 33% capacity savings of RAID5 can be unlocked per virtual machine and reverted just as easily, which represents a lower risk. I opened by asking whether RAID5 with compression only could be the new normal, and I firmly believe the data indicates that this is a viable option for many more workloads.
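The 33% figure falls out of simple arithmetic: with FTT=1, RAID1 stores two full copies of the data (2x overhead), while RAID5 stores three data blocks plus one parity block (1.33x overhead). A quick check:

```python
# Capacity overhead at FTT=1: RAID1 mirrors all data; RAID5 writes 3 data + 1 parity.
raid1_overhead = 2.0        # two full copies of every block
raid5_overhead = 4.0 / 3.0  # four blocks stored for every three blocks of data

raw_tb = 100  # usable capacity from 100 TB of raw capacity
print(round(raw_tb / raid1_overhead, 1))  # 50.0 TB usable under RAID1
print(round(raw_tb / raid5_overhead, 1))  # 75.0 TB usable under RAID5

savings = 1 - raid5_overhead / raid1_overhead
print(f"{savings:.0%}")  # 33% less capacity consumed by RAID5 for the same data
```

Put differently, the same raw capacity holds half again as much data under RAID5 as under RAID1, which is why the per-VM policy change is such an easy win when the performance gap is small.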
My peers at VMware (John Nicholson, Pete Flecha (these two are the voices and brains behind the vSpeakingPodcast – definitely worth listening to), Teodora Todorova Hristov, Pete Koehler and Cedric Rajendran) have written great and in-depth blogs about these features that caught my attention:
vSAN HCI Mesh – eliminates stranded storage by enabling VMs registered to cluster A to access storage from cluster B
Shared Witness for 2-Node Deployments – reduces administration time and infrastructure costs through the use of a single witness for up to sixty-four 2-node clusters
Enhanced vSAN File Services – adds SMB v2.1 and v3 support for Windows and Mac clients, and adds Kerberos authentication for existing NFS v4.1
Space Efficiency: Compression only option - For demanding workloads that cannot take advantage of deduplication. Compression only has higher throughput, lower latency, and significantly reduced impact on write performance compared to deduplication and compression. Compression only has a reduced failure domain and 7x faster data rebuild rates.
Spare Capacity Management – the slack space guidance of 25% has been replaced with a calculated Operational Reserve that requires less space and decreases with scale. An additional option enables a host rebuild reserve; the VxRail Sizing Tool reserves this by default when sizing configurations with the Add HA filter
Enhanced Durability during Maintenance Mode – data intended for a host in maintenance mode is temporarily recorded in a delta file on another host, preserving the configured FTT during maintenance mode operations