Exploring the customer experience with lifecycle management for vSAN Ready Nodes and VxRail clusters
Thu, 24 Sep 2020 19:41:49 -0000
The difference between VMware vSphere Lifecycle Manager (vLCM) and Dell EMC VxRail LCM is still a trending topic that most HCI customers and prospects want more information about. While we compared the two methods at a high level in our previous blog post, let’s dive into the more technical aspects of LCM operations with vLCM and with VxRail LCM. The detailed explanation in this blog post should give you a more complete understanding of your role as an administrator for cluster lifecycle management with vLCM versus VxRail LCM.
Even though vLCM has introduced a vast improvement in automating cluster updates, lifecycle management is more than executing cluster updates. With vLCM, lifecycle management is still very much a customer-driven endeavor. By contrast, VxRail’s overarching goal for LCM is operational simplicity, leveraging Continuously Validated States to drive cluster LCM for the customer. This is a large part of why VxRail has gained over 8,600 customers since its launch in early 2016.
In this blog post, I’ll explain the four major areas of LCM:
- Defining the initial baseline configuration
- Planning for a cluster update
- Executing the cluster update
- Sustaining cluster integrity over the long term
Defining the initial baseline configuration
The baseline configuration is a vital part of establishing a steady state for the life of your cluster. It is the current known-good state of your HCI stack: all the component software and firmware versions are compatible with one another, and interoperability testing has validated full stack integrity for application performance and availability while also meeting the security standards in place. This is the ‘happy’ state for you and your cluster. Any change to the configuration is measured against this baseline to determine what needs to be rectified to return to the ‘happy’ state.
How is it done with vLCM?
vLCM depends on the hardware vendor to provide a hardware management services virtual machine (the hardware support manager). Dell provides this support for its Dell EMC PowerEdge servers, including vSAN Ready Nodes, and I’ll use this implementation to explain the overall process. Dell EMC vSAN Ready Nodes use the OpenManage Integration for VMware vCenter (OMIVV) plugin to connect to and register with the vCenter Server.
Once the VM is deployed and registered, you need to create a credential-based profile. This profile captures two accounts: one for the out-of-band hardware interface, the iDRAC, and the other for the root credentials for the ESXi host. Future changes to the passwords require updating the profile accordingly.
With the VM connection and profile in place, vLCM uses a Catalog XML file to define the initial baseline configuration. To create the Catalog XML file, you need to install and configure the Dell Repository Manager (DRM) to build the hardware profile. Once a profile is defined to your specification, it must be exported and stored on an NFS or CIFS share. The profile is then used to populate the Repository Profile data in the OMIVV UI. If you are unsure of your configuration, refer to the vSAN Hardware Compatibility List (HCL) for the specific supported firmware versions. Once the hardware profile is created, you can associate it with the cluster profile. With the cluster profile defined, you can enable drift detection. Any future change to the Catalog XML file is made within the DRM.
It’s important to note that vLCM was introduced in vSphere 7.0. To use vLCM, you must first update or deploy your clusters to run vSphere 7.x.
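To make drift detection a bit more concrete, here is a minimal sketch of querying a cluster’s image compliance through the vSphere Automation REST API. The endpoint paths follow VMware’s published vSphere 7 API reference, but the hostname, credentials, and cluster ID are placeholders, and you should verify the exact paths and response fields against your vCenter version:

```python
# Minimal sketch (not an official tool): run a vLCM drift-detection scan
# and read cluster image compliance via the vSphere Automation REST API.
# VCENTER, CLUSTER, and the credentials below are placeholders.
import requests

VCENTER = "vcenter.example.com"  # placeholder vCenter FQDN
CLUSTER = "domain-c8"            # placeholder cluster managed object ID

# Create an API session (vSphere 7.0 U2+ style; uses HTTP basic auth).
token = requests.post(
    f"https://{VCENTER}/api/session",
    auth=("administrator@vsphere.local", "changeme"),
    verify=False,  # lab convenience only; verify certificates in production
).json()
headers = {"vmware-api-session-id": token}

# Kick off a scan of all hosts against the cluster's desired image.
# vmw-task=true makes the call asynchronous and returns a task ID.
scan_task = requests.post(
    f"https://{VCENTER}/api/esx/settings/clusters/{CLUSTER}/software"
    "?action=scan&vmw-task=true",
    headers=headers, verify=False,
).json()
print("Scan task ID:", scan_task)

# Read the last-computed compliance (drift) result for the cluster.
compliance = requests.get(
    f"https://{VCENTER}/api/esx/settings/clusters/{CLUSTER}/software/compliance",
    headers=headers, verify=False,
).json()
print("Cluster image compliance:", compliance.get("status"))
```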
How is it done with VxRail LCM?
With VxRail, when the cluster arrives at the customer data center, it’s already running in a ‘happy’ state. For VxRail, the ‘happy’ state is called Continuously Validated States. The term is pluralized because VxRail defines all the ‘happy’ states that your cluster will update to over time. This means that your cluster is always running in a ‘happy’ state without you needing to research, define, and test to arrive at Continuously Validated States throughout the life of your cluster. VxRail (specifically, the VxRail engineering team) does it for you. This has been a central tenet of VxRail since the product first launched with vSphere 6.0. Since then, it has helped customers transition to vSphere 6.5, 6.7, and now 7.0.
Once VxRail cluster initialization is complete, use your Dell EMC Support credentials to configure the VxRail repository setting within vCenter. The VxRail Manager plugin for vCenter then automatically connects to the VxRail repository at Dell EMC and pulls down the next available update package.
Figure 1: Defining the initial baseline configuration
Planning for a cluster update
Updates are a constant in IT: VMware is continually adding new capabilities and product and security fixes that require updating to newer versions of software. Take, for example, the vSphere 7.0 Update 1 release that VMware and Dell Technologies just announced. Those eye-opening features are available to you when you update to that release. You can check out just how often VMware has historically updated vSphere here: https://kb.vmware.com/s/article/2143832.
As you know, planning for a cluster update is an iterative process with inherent risk. Failing to plan diligently can cause adverse effects on your cluster, ranging from network outages and node failure to data unavailability or data loss. Therefore, it’s important to mitigate the risk where you can.
How is it done with vLCM?
With vLCM, the responsibility of planning for a cluster update rests on the customer’s shoulders, including the risk. Understanding the Bill of Materials that makes up your server’s hardware profile is paramount to success. Once all the components are known and a target version of vSphere ESXi is specified, the supported driver and firmware versions need to be investigated and documented. You must consult the VMware Compatibility Guide to find out which drivers and firmware are supported for each ESXi release.
It is important to note that although vLCM gives you the toolset to apply firmware and driver updates, it does not validate compatibility or support for each combination for you, except for the HBA driver. This task is firmly in the customer’s domain. It is advisable to validate and test the combination in a separate test environment to ensure that no performance regressions or issues are introduced into the production environment. Interoperability testing can be an extensive and expensive undertaking. Customers should create and define robust testing processes to ensure that full interoperability and compatibility are met for all components managed and upgraded by vLCM.
With Dell EMC vSAN Ready Nodes, customers can rest assured that the HCL certification and compatibility validation steps have been performed. However, the customer is still responsible for interoperability testing.
How is it done with VxRail LCM?
VxRail engineering has taken a unique approach to LCM. Rather than leaving the time-consuming LCM planning to already overburdened IT departments, the team has drastically reduced the risk by investing over $60 million, more than 25,000 hours of testing for major releases, and more than 100 team members in a comprehensive regression test plan. This plan is completed prior to every VxRail code release. (This is in addition to the testing and validation performed by PowerEdge, on which VxRail nodes are built.)
Dell EMC VxRail engineering performs this testing within 30 days of any new VMware release (even quicker for express patches), so that customers can continually benefit from the latest VMware software innovations and confidently address security vulnerabilities. You may have heard this called “synchronous release”.
The outcome of this effort is a single update bundle that is used to update the entire HCI stack, including the operating system, the hardware’s drivers and firmware, and management components such as VxRail Manager and vCenter. This allows VxRail to define the declarative configuration we mentioned previously (“Continuously Validated States”), allowing us to move easily from one validated state to the next with each update.
Figure 2: Planning for a cluster update
Executing the cluster update
The biggest improvement with vLCM is its ability to orchestrate and automate a full-stack HCI cluster update. This simplifies the update operation and brings enormous time savings. The process is showcased in a recent study by Principled Technologies using PowerEdge servers running vSphere (not including vSAN).
How is it done with vLCM?
The first step is to import the ESXi ISO via the vLCM tab in the vCenter Server UI. Once it is uploaded, select the relevant cluster and ensure that the cluster profile (created in the initial baseline configuration phase) is associated with the cluster being updated. Now you can apply the target configuration by editing the ESXi image and, from the OMIVV UI, choosing the correct firmware and driver to apply to the hardware profile. Once a compliance scan is complete, you have the option to remediate all hosts.
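If you prefer scripting to clicking, the same scan-and-remediate flow is exposed through the vSphere Automation REST API. The following is a hedged sketch only: the session token, host, and cluster ID are placeholders, the apply spec fields can vary by release, and the endpoint paths should be checked against your vCenter’s API reference:

```python
# Minimal sketch: remediate all hosts in a cluster against its desired image.
# Placeholder values throughout; confirm endpoints and the apply spec schema
# against the vSphere Automation API reference for your release.
import time
import requests

VCENTER = "vcenter.example.com"  # placeholder
CLUSTER = "domain-c8"            # placeholder
HEADERS = {"vmware-api-session-id": "<session-token>"}  # from POST /api/session

# Start remediation (asynchronous task). The apply spec schema varies by
# release; accept_eula is shown here as an illustrative field.
task_id = requests.post(
    f"https://{VCENTER}/api/esx/settings/clusters/{CLUSTER}/software"
    "?action=apply&vmw-task=true",
    json={"accept_eula": True},
    headers=HEADERS, verify=False,  # lab convenience only
).json()

# Poll the task until remediation succeeds or fails.
while True:
    task = requests.get(
        f"https://{VCENTER}/api/cis/tasks/{task_id}",
        headers=HEADERS, verify=False,
    ).json()
    print("Remediation status:", task.get("status"))
    if task.get("status") in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(30)
```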
If you need to update multiple homogeneous clusters, it can be as easy as reusing the same cluster profile to execute each cluster update. However, if the next cluster has a different hardware configuration, you have to perform the above steps all over again. Customers with varying hardware and software requirements for their clusters will have to repeat many of these steps, including the planning tasks, to ensure stack integrity.
How is it done with VxRail LCM?
With VxRail and Continuously Validated States, updating from one configuration to another is even simpler. You can access the VxRail Manager directly within the vCenter Server UI to initiate the update. The LCM operation automatically retrieves the update bundle from the VxRail repository, runs a full stack pre-update health check, and performs the cluster update.
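VxRail Manager also exposes this workflow through its public REST API, so the same check and update can be scripted or folded into existing tooling. The sketch below is illustrative, not authoritative: it assumes the documented /rest/vxm/v1 endpoints, the hostname and credentials are placeholders, and the exact request and response shapes vary by VxRail release:

```python
# Minimal sketch: check the installed VxRail version and the update bundles
# available from the Dell EMC repository via the VxRail Manager REST API.
# Host, credentials, and response fields shown are assumptions; consult the
# VxRail API documentation for your release.
import requests

VXM = "vxrail-manager.example.com"                  # placeholder VxRail Manager FQDN
AUTH = ("administrator@vsphere.local", "changeme")  # vCenter SSO credentials

# Installed software version for the cluster (its current Validated State).
system = requests.get(f"https://{VXM}/rest/vxm/v1/system",
                      auth=AUTH, verify=False)  # lab convenience only
print("Installed version:", system.json().get("version"))

# Update bundles the cluster can move to next.
available = requests.get(f"https://{VXM}/rest/vxm/v1/system/available-versions",
                         auth=AUTH, verify=False)
for bundle in available.json():  # response shape varies by release
    print("Available update:", bundle)

# The update itself is started with POST /rest/vxm/v1/lcm/upgrade; its request
# body (bundle path plus infrastructure credentials) is release-specific, so
# it is omitted from this sketch.
```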
With VxRail, performing multi-cluster updates is as simple as performing a single-cluster update. The same LCM cluster update workflow is followed. While different hardware configurations on separate clusters add more labor for IT staff with vSAN Ready Nodes, this doesn’t apply to VxRail. In fact, with the latest release of our SaaS multi-cluster management capability set, customers can now easily perform cluster updates at scale from our cloud-based management platform, MyVxRail.
Figure 3: Executing a cluster update
Sustaining cluster integrity over the long term
The long-term integrity of a cluster outlasts the software and hardware in it. As mentioned earlier, because new releases are made available frequently, software has a very short life span. While hardware has more staying power, it won’t outlast some of the applications running on it. New hardware platforms will emerge. New hardware devices will enter the market that enable new workloads, such as machine learning, graphics rendering, and visualization workflows. You will need the cluster to evolve non-disruptively to deliver the application performance, availability, and diversity your end-users require.
How is it done with vLCM?
In its current form, vLCM will struggle with long-term cluster lifecycle management. In particular, its inability to support heterogeneous nodes (nodes with different hardware configurations) in the same cluster limits application diversification and the ability to take advantage of new hardware platforms without impacting end-users.
How is it done with VxRail LCM?
VxRail LCM touts its ability to allow customers to grow non-disruptively and to scale their clusters over time. That includes adding non-identical nodes into the clusters for new applications, adding new hardware devices for new applications or more capacity, or retiring old hardware from the cluster.
Figure 4: Comparing vSphere LCM and VxRail LCM cluster update operations driven by the customer
The VMware vLCM approach empowers customers who are looking for more configuration flexibility and control. They have the option to select their own hardware components and firmware to build the cluster profile. With this freedom comes the responsibility to define the HCI stack and make investments in equipment and personnel to ensure stack integrity. vLCM supports this customer-driven approach with improvements in cluster update execution for faster outcomes.
Dell EMC VxRail LCM continues to take a more comprehensive approach to optimizing operational efficiency from the point of view of the customer. VxRail customers value its LCM capabilities because it reduces operational time and effort, which can be diverted to other areas of need in IT. VxRail takes on the responsibility of driving stack integrity for the lifecycle management of the cluster with Continuously Validated States. And VxRail sustains stack integrity throughout the life of the cluster, allowing you to simply and predictably evolve with technology trends.
Related Blog Posts
Our fastest and biggest launch ever! - We’ve also made it simpler
Tue, 13 Jul 2021 17:41:25 -0000
With this hardware launch, we at VxRail are refreshing our mainline platforms: our “everything” E Series, our performance-focused P Series, and our virtualization-accelerated V Series. You’ve probably already guessed that these nodes are faster and bigger. This is always the case with new hardware in the tech industry, thanks to Moore’s Law of “cramming more components onto integrated circuits,” but we’ve also made this hardware release simpler. Let’s dig into these changes, what they mean to you, the consumer, and what choices you may need to consider.
The headline here could well be the 3rd Generation Intel Xeon Scalable processor (code-named Ice Lake) with its increased cores and performance. After all, the CPU is the heart of every computing device, from the nebulous public cloud to the smart refrigerator in your kitchen. But there is more to CPUs and servers than cores and clock speeds. The most significant addition, in my opinion, is support for the fourth generation of the PCIe bus. PCIe Gen 3 was introduced on 12th Generation PowerEdge servers in 2012, so the arrival of PCIe Gen 4, with double the bandwidth and 33% more lanes, is very much appreciated. The PCIe bus is the highway network that connects everything together; this increase in bandwidth and lanes drives change and enables improvements in many other components.
The most significant impact for VxRail is the performance that it unlocks with PCIe Gen 4 NVMe drives, available on all the new nodes, including the V Series. With vSAN’s distributed architecture, all writes go to multiple cache drives on multiple nodes. Anything that improves cache performance, be it higher bandwidth, lower-latency networking, or faster cache drives, will drive overall application performance and increased densities. For the relatively small price premium of NVMe cache drives over SAS cache drives, VxRail can deliver up to 35% higher IOPS and up to 14% lower latency (OLTP 32K on RAID 1). NVMe cache drives also reduce the performance impact of enabling data services like deduplication, compression, and encryption at rest. For more information, check out this paper from our performance team last year (did you know that VxRail has its own performance testing team?) comparing the performance impact of deduplication and compression, compression only, and no data reduction. That data highlights the small impact that compression only has on performance and the benefit of NVMe cache drives.
Staying with storage, the new SAS HBA has double the number of lanes, which doubles the bandwidth available to drives. Don’t assume that this means twice the storage performance; wait for my next post, where I’ll delve into those details with ESG. It is a topic worthy of its own post and well worth the wait, I promise! The SAS HBA has been moved to the front of the node, right behind the drive bay. This is noteworthy because it frees up a PCIe slot on some configurations. We also freed up a PCIe slot on all configurations with the new Boot Optimized Storage Solution (BOSS) device (more on the new BOSS device below). These changes deliver a third PCIe slot on the E Series and flexibility on the V Series, with support for six GPUs while still offering PCIe slots for networking and FC expansion. Some would argue you can never have enough PCIe slots, but we sacrificed these gains on the P Series in favor of delivering four additional capacity drive slots, providing 184 TB of raw storage capacity in 2U. Don’t worry, there are still plenty of PCIe slots for additional networking or Fibre Channel cards. Yes, in case you missed it, you can add Fibre Channel storage to your favorite HCI platform, extending the storage offerings for your various workloads, through the addition of QLogic or Emulex 16/32Gb Fibre Channel cards. These are also PCIe Gen 4 to drive maximum performance.
PCIe Gen 4 also enables network cards to drive more throughput. With this new generation of VxRail, we are launching with an onboard quad-port 25GbE networking card, 2.5 times the bandwidth the previous generation launched with. See the “Get thee to 25GbE” section in my recent post for a trilogy of reasons why you need to be looking at 25GbE NICs today, even if you are not upgrading your network switches just yet. With this release, VxRail is shifting our onboard networking to the Open Compute Project (OCP) spec 3.0 form factor. For you, the customer, this means greater choice in onboard network cards, with 10 cards from three vendors available at launch and more to come. If you are not familiar with OCP, check it out. OCP is a large cross-company organization that started as an internal project at Facebook but now has a diverse membership of almost 100 companies working “collaboratively on redesigning hardware technology to efficiently support the growing demands on compute infrastructure.” The quad-port 25GbE NIC consumes only half of the bandwidth that OCP 3.0 can support, so we all have an interesting networking future.
This hardware release is not just faster and bigger; we have also made these VxRail nodes simpler. Simplicity, like beauty, is in the eye of the beholder; there isn’t an industry benchmark for it, but I think you’ll agree with me that these changes will make life simpler in the data center. The new BOSS-S2 device is located at the rear of the node and is hot-pluggable. In the event of a failure of a RAID 1 protected M.2 SATA drive, it can easily and non-disruptively be replaced without powering off and opening the node. We’ve also relocated the power supplies; there is now one on each side of the chassis. This improves air flow and cooling and enables easier and tidier cabling (we’ve all seen those rats’ nests of cables in the data center). Moving around to the front, we’ve added a Quick Resource Locator (QRL) to the chassis luggage tag. Scanning it with an Android or iOS app displays system and warranty details and also provides links to SolVe procedures and documentation. Sticking with mobile applications, we’ve added OpenManage Mobile with Quick Sync 2, which enables, at the press of the Wireless Activation button, access to iDRAC and all the troubleshooting help it provides. No more dragging a crash cart across the data center.
VxRail is more than the sum of its components, be it through Lifecycle Management, simpler cloud operations, or ongoing product education. The value it delivers is seen daily by our 12.4K customers around the globe. Today we celebrate not just our successes and our new release, but also the successes and achievements of the giants that hoist us up to stand on their shoulders and enable VxRail and our customers to reach for the stars. Join us as we continue our journey and Reimagine HCI.
More GPUs, CPUs and performance - oh my!
Mon, 14 Jun 2021 11:18:50 -0000
Continuous hardware and software changes deployed with VxRail’s Continuously Validated State
A wonderful aspect of software-defined-anything, particularly when built on world class PowerEdge servers, is speed of innovation. With a software-defined platform like VxRail, new technologies and improvements are continuously added to provide benefits and gains today, and not a year or so in the future. With the release of VxRail 7.0.200, we are at it again! This release brings support for VMware vSphere and vSAN 7.0 Update 2, and for new hardware: 3rd Gen AMD EPYC processors (Milan), and more powerful hardware from NVIDIA with their A100 and A40 GPUs.
VMware, as always, does a great job of detailing the many enhanced or new features in a release, from high-level What’s New corporate and personal blog posts to in-depth videos by Duncan Epping. However, there are a few changes that I want to highlight:
Get thee to 25GbE: A trilogy of reasons - Storage, load-balancing, and pricing.
vSAN is a distributed storage system. To that end, anything that improves the network or networking efficiency improves storage performance and application performance, but there is more to networking than big, low-latency pipes. RDMA has been a part of vSphere since the 6.5 release; it is only with 7.0 Update 2 that it is leveraged by vSAN. John Nicholson explains the nuts and bolts of vSAN RDMA in this blog post but only touches on the performance gains. From our performance testing on VxRail, I can share the gains we have seen: up to 5% reduction in CPU utilization, up to 25% lower latency, and up to 18% higher IOPS, along with increases in read and write throughput. It should be noted that even with medium-block IO, vSAN is more than capable of saturating a 10GbE port; RDMA pushes performance beyond that, and we’ve yet to see what Intel 3rd Generation Xeon processors will bring. The only fly in the ointment for vSAN RDMA is the currently small list of approved network cards; no doubt more will be added soon.
vSAN is not the only feature that enjoys large low-latency pipes. Niels Hagoort describes the changes in vSphere 7.0 Update 2 that have made vMotion faster, thus making Balancing Workloads Invisible and the lives of virtualization administrators everywhere a lot better. Aside: Can I say how awesome it is to see VMware continuing to enhance a foundational feature that they first introduced in 2003, a feature that for many was that lightbulb Aha! moment that started their virtualization journey.
One last nudge: pricing. The cost delta between 10GbE and 25GbE network hardware is minimal, so for greenfield deployments the choice is easy; you may not need it today, but workloads and demands continue to grow. For brownfield deployments, where the existing network is not due for replacement, the choice is still easy: 25GbE NICs and switch ports can negotiate down to 10GbE, making a phased migration (VxRail nodes now, switches in the future) possible. The inverse is also possible: upgrade the network to 25GbE switches while still connecting your existing VxRail 10GbE SFP+ NIC ports.
Is 25GbE in your infrastructure upgrade plans yet? If not, maybe it should be.
A duo of AMD goodness
Last year we released two AMD-based VxRail platforms, the E665/F and the P675F/N, so I’m delighted to see CPU scheduler optimizations for AMD EPYC processors, as described in Aditya Sahu’s blog post. What is even better is the 29-page performance study Aditya links to; the depth of detail on how ESXi CPU scheduling works, and didn’t work, with AMD EPYC processors is truly educational. The extensive performance testing VMware continuously runs and the results they share (spoiler: they achieved very significant gains) are also a worthwhile read. In our testing, we’ve seen that with just these scheduler optimizations alone, VxRail 7.0.200 can provide up to 27% more IOPS and up to 27% lower latency for both RAID 1 and RAID 5 with relational database (RDBMS22K 60R/40W 100%Random) workloads.
VxRail begins shipping the 3rd generation AMD EPYC processors (also known as Milan) in VxRail E665 and P675 nodes later this month. These are not a replacement for the current 2nd Gen EPYC processors we offer, but rather the addition of higher-performing 24-core, 32-core, and 64-core choices to the VxRail lineup, delivering up to 33% more IOPS and 16% lower latency across a range of workloads and block sizes. Check out this VMware blog post for the performance gains they showcase with the VMmark benchmarking tool.
HCI Mesh – only recently introduced, yet already getting better
When VMware released HCI Mesh just last October, it enabled stranded storage on one VxRail cluster to be consumed by another VxRail cluster. With the release of VxRail 7.0.200, this has been expanded to make it applicable to more customers by enabling any vSphere cluster to also consume that excess storage capacity. These remote clusters do not require a vSAN license and consume the storage in the same manner they would any other datastore. This opens up some interesting multi-cluster use cases, for example:
In solutions where an application’s software licensing requires each core or socket in the vSphere cluster to be licensed, the licensing cost can easily dwarf other costs. Now this application can be deployed on a small compute-only cluster while consuming storage from the larger VxRail cluster. Or, where the density of storage per socket didn’t make VxRail viable, it can now be achieved with a smaller VxRail cluster plus a separate compute-only cluster. If only all the goodness that is VxRail were available in a compute-only cluster – now that would be something dynamic…
A GPU for every workload
GPUs, once the domain of PC gamers, are now a data center staple, with their parallel processing capabilities accelerating a variety of workloads. The versatile VxRail V Series has multiple NVIDIA GPUs to choose from, and we’ve added two more: the NVIDIA A40 and A100. The A40 is for sophisticated visual computing workloads (think large, complex CAD models), while the A100 is optimized for deep learning inference workloads for high-end data science.
Evolution of hardware in a software-defined world
PowerEdge took a big step forward with its recent release built on 3rd Gen Intel Xeon Scalable processors. Software-defined principles enable VxRail not only to quickly leverage this big step forward, but also to quickly leverage all the small steps in hardware changes throughout a generation. Building on the latest PowerEdge servers, we are reimagining HCI with VxRail through the next-generation VxRail E660/F, P670F, and V670F. Plus, what’s great about VxRail is that you can seamlessly integrate this latest technology into your existing infrastructure environment. This is an exciting release, but equally exciting are all the incremental changes that VxRail software-defined infrastructure will get along the way with PowerEdge and VMware.
VxRail, flexibility is at its core.
- VxRail systems with Intel 3rd Generation Xeon processors will be globally available in July 2021.
- VxRail systems with AMD 3rd Generation EPYC processors will be globally available in June 2021.
- VxRail HCI System Software updates will be globally available in July 2021.
- VxRail dynamic nodes will be globally available in August 2021.
- VxRail self-deployment options will begin availability in North America through an early access program in August 2021.
- Blog: Reimagine HCI with VxRail
- Attend our launch webinar to learn more.
- Press release: Dell Technologies Reimagines Dell EMC VxRail to Offer Greater Performance and Storage Flexibility