The Case for Elastic Stack on HCI
Thu, 11 Jun 2020 21:34:33 -0000
The Elastic Stack, also known as the “ELK Stack”, is a widely used collection of open-source-based software products for searching, analyzing, and visualizing data. The Elastic Stack is useful for a wide range of applications, including observability (logging, metrics, APM), security, and general-purpose enterprise search. Dell Technologies is an Elastic Technology Partner [1]. This blog covers some basics of hyper-converged infrastructure (HCI), some Elastic Stack fundamentals, and the benefits of deploying the Elastic Stack on HCI.
HCI integrates the compute and storage resources of a cluster of servers, using virtualization software to pool CPU and disk and deliver flexible, scalable performance and capacity on demand. The breadth of server offerings in the Dell PowerEdge portfolio gives system architects many options for designing the right blend of compute and storage resources. Local resources from each server in the cluster are combined to create virtual pools of compute and storage with multiple performance tiers.
VxFlex is a hypervisor-agnostic HCI platform developed by Dell Technologies and integrated with high-performance, software-defined block storage. VxFlex OS is the software that creates a server- and IP-based SAN from direct-attached storage as an alternative to a traditional SAN infrastructure. Dell Technologies also offers the VxRail HCI platform for VMware-centric environments. VxRail is the only fully integrated, pre-configured, and pre-tested VMware HCI system powered by VMware vSAN. We show below why both HCI offerings are highly efficient and effective platforms for a truly scalable Elastic Stack deployment.
Elastic Stack Overview
The Elastic Stack is a collection of four open-source projects: Elasticsearch, Logstash, Kibana, and Beats. Elasticsearch is an open-source, distributed, scalable, enterprise-grade search engine based on Lucene. Together, these projects form an end-to-end solution for searching, analyzing, and visualizing machine data from diverse source formats. With the Elastic Stack, organizations can collect data from across the enterprise, normalize the format, and enrich the data as desired. Platforms designed for scale-out performance running the Elastic Stack provide the ability to analyze and correlate data in near real-time.
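To make that flow concrete, here is a minimal sketch of indexing and searching an event through the Elasticsearch REST API from Python. The endpoint, index name, and document fields are illustrative assumptions, not values from the validated design:

```python
import requests

ES = "http://localhost:9200"  # assumed endpoint for a test cluster

# Index a log event; Elasticsearch creates the index and infers a mapping
# for the new fields on first write. refresh=true makes the document
# immediately searchable, which is convenient for a demo.
doc = {
    "@timestamp": "2020-06-11T21:34:33Z",
    "host": "app01",
    "level": "ERROR",
    "message": "connection refused to db01:5432",
}
requests.post(f"{ES}/app-logs/_doc", params={"refresh": "true"},
              json=doc).raise_for_status()

# Full-text search across the indexed events in near real-time.
query = {"query": {"match": {"message": "connection refused"}}}
hits = requests.get(f"{ES}/app-logs/_search", json=query).json()
print(hits["hits"]["total"])
```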
Elastic Stack on HCI
In March 2020, Dell Technologies validated the Elastic Stack running on our VxFlex family of HCI [2]. Below we show how the features of HCI provide distinct benefits and cost savings as an integrated solution for the Elastic Stack. The Elastic Stack, and Elasticsearch specifically, is designed for scale-out. Data nodes can be added to an Elasticsearch cluster to provide additional compute and storage resources. HCI also uses a scale-out deployment model that allows easy, seamless horizontal scaling by adding nodes to the cluster(s). However, unlike bare-metal deployments, HCI also scales vertically, adding resources dynamically to Elasticsearch data nodes or any other Elastic Stack role through virtualization. On VxFlex, admins use their preferred hypervisor with VxFlex OS; on VxRail, this is done with VMware ESXi and vSAN. Additionally, the Elastic Stack can be deployed on Kubernetes clusters, so admins can also choose to leverage VMware Tanzu for Kubernetes management.
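As a small example of what Elasticsearch scale-out looks like operationally, the sketch below lists the nodes in a cluster after new data nodes are added; the endpoint is an assumed local test instance:

```python
import requests

ES = "http://localhost:9200"  # assumed endpoint

# The _cat/nodes API returns one row per node; 'node.role' shows whether a
# node carries data (d), is master-eligible (m), ingests (i), and so on.
# A newly added data node appears here as soon as it joins the cluster.
resp = requests.get(
    f"{ES}/_cat/nodes",
    params={"v": "true", "h": "name,node.role,heap.percent,disk.used_percent"},
)
print(resp.text)
```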
Virtualization has long been a strategy for achieving more efficient resource utilization and data center density. Elasticsearch data nodes tend to have average allocations of 8-16 cores and 64GB of RAM. With the current ability to support up to 112 cores and 6TB of RAM in a single 2RU Dell server, Elasticsearch is an attractive application for virtualization. Additionally, the Elastic Stack is significantly more CPU-efficient than some alternative products, improving the cost-effectiveness of deploying Elastic with VMware or other virtualization technologies. We recommend sizing for 1 physical CPU to 1 virtual CPU (vCPU) for the Elasticsearch Hot Tier, along with the management and control plane resources. While this is admittedly similar to the VMware guidance for some comparable analytics platforms, these VMs tend to consume a significantly smaller CPU footprint per data node. The Elastic Stack tends to take advantage of hyperthreading and resource overcommitment more effectively. While needs will vary by customer use case, our experience shows the efficiencies in the Elastic Stack and Elastic data lifecycle management allow the Elasticsearch Warm Tier, Kibana, and proxy servers to be supported at 1 physical CPU to 2 vCPUs, and the Cold Tier at upwards of 4 vCPUs per physical CPU.
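To illustrate how those ratios translate into vCPU budgets, here is a back-of-the-envelope helper. The host core count and host counts are hypothetical inputs, not a validated configuration:

```python
# vCPU budgets for the tier ratios suggested above: 1 vCPU per physical CPU
# for Hot, 2:1 for Warm/Kibana/proxy, and up to 4:1 for Cold.
PHYSICAL_CORES_PER_HOST = 112  # e.g., a dense dual-socket 2RU PowerEdge

VCPU_PER_CORE = {"hot": 1, "warm": 2, "cold": 4}

def vcpu_budget(tier: str, hosts: int) -> int:
    """Total vCPUs that may be allocated to a tier across `hosts` servers."""
    return PHYSICAL_CORES_PER_HOST * VCPU_PER_CORE[tier] * hosts

for tier in VCPU_PER_CORE:
    print(f"{tier}: {vcpu_budget(tier, hosts=4)} vCPUs across 4 hosts")
```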
Because Elasticsearch tiers data across independent data nodes rather than across multiple mount points on a single data node or indexer, the multiple types and classes of software-defined storage defined for independent HCI clusters can easily be leveraged between Elasticsearch clusters to address data temperatures. Note that Elastic does not currently recommend non-block storage (S3, NFS, and so on) for Elasticsearch except as a target for Elasticsearch Snapshot and Restore. (It is possible to use S3 or NFS on Isilon or ECS, for example, as a retrieval target for Logstash, but that is a subject for a later blog.) For example, vSAN in VxRail provides Optane, NVMe, SSD, and HDD storage options. A user can deploy their primary Elastic Stack environment, with its Hot Elasticsearch data nodes, Kibana, and the Elastic Stack management and control plane, on an all-flash VxRail cluster, and then leverage a storage-dense hybrid vSAN cluster for Elasticsearch cold data.
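One common way to steer indices to the right temperature tier is Elasticsearch shard allocation filtering. The sketch below assumes the data nodes were started with a custom attribute (for example, node.attr.temp: hot or cold in elasticsearch.yml); the endpoint and index names are illustrative:

```python
import requests

ES = "http://localhost:9200"  # assumed endpoint

# Keep the actively written index on hot (all-flash) data nodes...
requests.put(
    f"{ES}/app-logs-000042/_settings",
    json={"index.routing.allocation.require.temp": "hot"},
).raise_for_status()

# ...and later move an aged index to cold (storage-dense hybrid) nodes;
# Elasticsearch relocates the shards automatically.
requests.put(
    f"{ES}/app-logs-000001/_settings",
    json={"index.routing.allocation.require.temp": "cold"},
).raise_for_status()
```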
Image 1. Example Logical Elastic Stack Architecture on HCI
Software-defined storage in HCI provides native enterprise capabilities, including data encryption and data protection. Because VxFlex OS and vSAN provide HA via the software-defined storage, Replica Shards in Elastic are not required for data protection. Elasticsearch splits each index into multiple primary shards for parallel processing (five by default before version 7.0; one by default in 7.0 and later), but Replica Shards for data protection are optional. Because we had data protection at the storage layer, we did not use Replicas in our validation of VxFlex, and we saw no impact on performance.
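As a minimal sketch, assuming a local test endpoint and illustrative index names, the replica count can be set to zero when the storage layer already protects the data:

```python
import requests

ES = "http://localhost:9200"  # assumed endpoint

# Create an index with replicas disabled; the primary shard count is an
# illustrative choice, and HA is delegated to VxFlex OS / vSAN.
settings = {"settings": {"number_of_shards": 5, "number_of_replicas": 0}}
requests.put(f"{ES}/metrics-hot", json=settings).raise_for_status()

# An existing index can be adjusted the same way (replica count is dynamic).
requests.put(
    f"{ES}/app-logs/_settings",
    json={"index": {"number_of_replicas": 0}},
).raise_for_status()
```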
HCI enables customers to expand and efficiently manage a rapidly growing Elastic environment with dynamic resource expansion and improved infrastructure management tools, allowing new use cases and new insights to be adopted quickly. HCI reduces the datacenter sprawl, cost, and inefficiency associated with adopting Elastic on bare metal. Ultimately, HCI can deliver a turnkey experience that enables our customers to continuously innovate through insights derived from the Elastic Stack.
1. Elastic Technology and Cloud Partners - https://www.elastic.co/about/partners/technology
2. Elastic Stack Solution on Dell EMC VxFlex Family - https://www.dellemc.com/en-in/collaterals/unauth/white-papers/products/converged-infrastructure/elastic-on-vxflex.pdf
3. Elasticsearch Sizing and Capacity Planning Webinar - https://www.elastic.co/webinars/elasticsearch-sizing-and-capacity-planning
About the Author
Keith Quebodeaux is an Advisory Systems Engineer and analytics specialist with Dell Technologies Advanced Technology Solutions (ATS) organization. He has worked in various capacities with Dell Technologies for over 20 years including managed services, converged and hyper-converged infrastructure, and business applications and analytics. Keith is a graduate of the University of Oregon and Southern Methodist University.
I would like to gratefully acknowledge Craig G., Rakshith V., and Chidambara S. for their input and review of this blog. I would especially like to thank Phil H., Principal Engineer with Dell Technologies, whose detailed and extensive advice and assistance provided clarity and focus to my meandering evangelism. Your support was invaluable. As with anything, the faults are all my own.
Related Blog Posts
Exploring the customer experience with lifecycle management for vSAN Ready Nodes and VxRail clusters
Thu, 24 Sep 2020 19:41:49 -0000
The difference between VMware vSphere LCM (vLCM) and Dell EMC VxRail LCM is still a trending topic that most HCI customers and prospects want more information about. While we compared the two methods at a high level in our previous blog post, let’s dive into the more technical aspects of the LCM operations of VMware vLCM and VxRail LCM. The detailed explanation in this blog post should give you a more complete understanding of your role as an administrator for cluster lifecycle management with vLCM versus VxRail LCM.
Even though vLCM has introduced a vast improvement in automating cluster updates, lifecycle management is more than executing cluster updates. With vLCM, lifecycle management is still very much a customer-driven endeavor. By contrast, VxRail’s overarching goal for LCM is operational simplicity, leveraging Continuously Validated States to drive cluster LCM for the customer. This is a large part of why VxRail has gained over 8,600 customers since its launch in early 2016.
In this blog post, I’ll explain the four major areas of LCM:
- Defining the initial baseline configuration
- Planning for a cluster update
- Executing the cluster update
- Sustaining cluster integrity over the long term
Defining the initial baseline configuration
The baseline configuration is a vital part of establishing a steady state for the life of your cluster. The baseline configuration is the current known good state of your HCI stack. In this configuration, all the component software and firmware versions are compatible with one another. Interoperability testing has validated full-stack integrity for application performance and availability while also meeting the security standards in place. This is the ‘happy’ state for you and your cluster. Any change to the configuration is measured against this baseline to determine what must be rectified to return to the ‘happy’ state.
How is it done with vLCM?
vLCM depends on the hardware vendor to provide a Hardware Management Services virtual machine. Dell provides this support for its Dell EMC PowerEdge servers, including vSAN Ready Nodes, and I’ll use this implementation to explain the overall process. Dell EMC vSAN Ready Nodes use the OpenManage Integration for VMware vCenter (OMIVV) plugin to connect to and register with the vCenter Server.
Once the VM is deployed and registered, you need to create a credential-based profile. This profile captures two accounts: one for the out-of-band hardware interface, the iDRAC, and the other for the root credentials for the ESXi host. Future changes to the passwords require updating the profile accordingly.
With the VM connection and profile in place, vLCM uses a Catalog XML file to define the initial baseline configuration. To create the Catalog XML file, you need to install and configure the Dell Repository Manager (DRM) to build the hardware profile. Once a profile is defined to your specification, it must be exported and stored on an NFS or CIFS share. The profile is then used to populate the Repository Profile data in the OMIVV UI. If you are unsure of your configuration, refer to the vSAN Hardware Compatibility List (HCL) for the specific supported firmware versions. Once the hardware profile is created, you can associate it with the cluster profile. With the cluster profile defined, you can enable drift detection. Any future change to the Catalog XML file is made within the DRM.
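Conceptually, drift detection compares what is installed on each host against the catalog baseline. The following is only a toy illustration of that idea (it is not the OMIVV or DRM API; the component names and versions are made up):

```python
# Toy drift detection: compare installed component versions against the
# baseline catalog and report anything that must be remediated. Hypothetical
# component names and version strings for illustration only.
BASELINE = {"BIOS": "2.8.1", "iDRAC": "4.20.20.20", "HBA330": "16.17.01.00"}

def drift(installed: dict) -> dict:
    """Return {component: (installed, expected)} for out-of-baseline items."""
    return {c: (v, BASELINE[c]) for c, v in installed.items()
            if c in BASELINE and v != BASELINE[c]}

host = {"BIOS": "2.6.4", "iDRAC": "4.20.20.20", "HBA330": "16.17.01.00"}
print(drift(host))  # {'BIOS': ('2.6.4', '2.8.1')} -> remediation needed
```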
It’s important to note that vLCM was introduced in vSphere 7.0. To use vLCM, you must first update or deploy your clusters to run vSphere 7.x.
How is it done with VxRail LCM?
With VxRail, when the cluster arrives at the customer data center, it’s already running in a ‘happy’ state. For VxRail, the ‘happy’ state is called Continuously Validated States. The term is pluralized because VxRail defines all the ‘happy’ states that your cluster will update to over time. This means that your cluster is always running in a ‘happy’ state without you needing to research, define, and test your way to Continuously Validated States throughout the life of your cluster. VxRail (specifically, the VxRail engineering team) does it for you. This has been a central tenet of VxRail since the product first launched with vSphere 6.0. Since then, it has helped customers transition to vSphere 6.5, 6.7, and now 7.0.
Once the VxRail cluster initialization is completed, use your Dell EMC Support credentials to configure the VxRail repository setting within vCenter. The VxRail Manager plugin for vCenter automatically connects to the VxRail repository at Dell EMC and pulls down the next available update package.
Figure 1 Defining the initial baseline configuration
Planning for a cluster update
Updates are a constant in IT, and VMware is constantly adding new capabilities and product/security fixes that require updating to newer versions of software. Take, for example, the vSphere 7.0 Update 1 release that VMware and Dell Technologies just announced. Its eye-opening features become available to you when you update to that release. You can check out just how often VMware has historically updated vSphere here: https://kb.vmware.com/s/article/2143832.
As you know, planning for a cluster update is an iterative process with inherent risk associated with it. Failing to plan diligently can cause adverse effects on your cluster, ranging from network outages and node failure to data unavailability or data loss. That said, it’s important to mitigate the risk where you can.
How is it done with vLCM?
With vLCM, the responsibility of planning for a cluster update rests on the customer’s shoulders, including the risk. Understanding the Bill of Materials that makes up your server’s hardware profile is paramount to success. Once all the components are known and a target version of vSphere ESXi is specified, the supported driver and firmware versions need to be investigated and documented. You must consult the VMware Compatibility Guide to find out which drivers/firmware are supported for each ESXi release.
It is important to note that although vLCM gives you the toolset to apply firmware and driver updates, it does not validate compatibility or support for each combination for you, except for the HBA driver. This task is firmly in the customer’s domain. It is advisable to validate and test the combination in a separate test environment to ensure that no performance regressions or issues are introduced into the production environment. Interoperability testing can be an extensive and expensive undertaking. Customers should define robust testing processes to ensure that full interoperability and compatibility are met for all components managed and upgraded by vLCM.
With Dell EMC vSAN Ready Nodes, customers can rest assured that the HCL certification and compatibility validation steps have been performed. However, the customer is still responsible for interoperability testing.
How is it done with VxRail LCM?
VxRail engineering has taken a unique approach to LCM. Rather than leaving the time-consuming LCM planning to already overburdened IT departments, they have drastically reduced the risk by investing over $60 million and more than 100 team members in a comprehensive regression test plan, with more than 25,000 hours of testing for major releases. This plan is completed prior to every VxRail code release. (This is in addition to the testing and validation performed by the PowerEdge team, on whose servers VxRail nodes are built.)
Dell EMC VxRail engineering performs this testing within 30 days of any new VMware release (even quicker for express patches), so that customers can continually benefit from the latest VMware software innovations and confidently address security vulnerabilities. You may have heard this called “synchronous release”.
The outcome of this effort is a single update bundle that is used to update the entire HCI stack, including the operating system, the hardware’s drivers and firmware, and management components such as VxRail Manager and vCenter. This allows VxRail to define the declarative configuration we mentioned previously (“Continuously Validated States”), allowing us to move easily from one validated state to the next with each update.
Figure 2 Planning for a cluster update
Executing the cluster update
The biggest improvement with vLCM is its ability to orchestrate and automate a full-stack HCI cluster update. This simplifies the update operation and brings enormous time savings. This process is showcased in a recent study performed by Principled Technologies using PowerEdge servers running vSphere (not including vSAN).
How is it done with vLCM?
The first step is to import the ESXi ISO via the vLCM tab in the vCenter Server UI. Once it is uploaded, select the relevant cluster and ensure that the cluster profile (created in the initial baseline configuration phase) is associated with the cluster being updated. Now you can apply the target configuration by editing the ESXi image and, from the OMIVV UI, choosing the correct firmware and driver to apply to the hardware profile. Once a compliance scan is complete, you will have the option to remediate all hosts.
If there are multiple homogeneous clusters you need to update, it can be as easy as using the same cluster profile for each cluster update. However, if the next cluster has a different hardware configuration, you must perform the above steps all over again. Customers with varying hardware and software requirements for their clusters will have to repeat many of these steps, including the planning tasks, to ensure stack integrity.
How is it done with VxRail LCM?
With VxRail and Continuously Validated States, updating from one configuration to another is even simpler. You can access the VxRail Manager directly within the vCenter Server UI to initiate the update. The LCM operation automatically retrieves the update bundle from the VxRail repository, runs a full stack pre-update health check, and performs the cluster update.
With VxRail, performing multi-cluster updates is as simple as performing a single-cluster update. The same LCM cluster update workflow is followed. While different hardware configurations on separate clusters will add more labor for IT staff for vSAN Ready Nodes, this doesn’t apply to VxRail. In fact, in the latest release of our SaaS multi-cluster management capability set, customers can now easily perform cluster updates at scale from our cloud-based management platform, MyVxRail.
Figure 3 Executing a cluster update
Sustaining cluster integrity over the long term
The long-term integrity of a cluster outlasts the software and hardware in it. As mentioned earlier, because new releases are made available frequently, software has a very short life span. While hardware has more staying power, it won’t outlast some of the applications running on it. New hardware platforms will emerge. New hardware devices will enter the market that enable new workloads, such as machine learning, graphics rendering, and visualization workflows. You will need the cluster to evolve non-disruptively to deliver the application performance, availability, and diversity your end users require.
How is it done with vLCM?
In its current form, vLCM will struggle with long-term cluster lifecycle management. In particular, its inability to support heterogeneous nodes (nodes with different hardware configurations) in the same cluster will limit application diversity and the ability to take advantage of new hardware platforms without impacting end users.
How is it done with VxRail LCM?
VxRail LCM touts its ability to allow customers to grow non-disruptively and to scale their clusters over time. That includes adding non-identical nodes into the clusters for new applications, adding new hardware devices for new applications or more capacity, or retiring old hardware from the cluster.
Figure 4 Comparing vSphere LCM and VxRail LCM cluster update operations driven by the customer
The VMware vLCM approach empowers customers who are looking for more configuration flexibility and control. They have the option to select their own hardware components and firmware to build the cluster profile. With this freedom comes the responsibility to define the HCI stack and make investments in equipment and personnel to ensure stack integrity. vLCM supports this customer-driven approach with improvements in cluster update execution for faster outcomes.
Dell EMC VxRail LCM continues to take a more comprehensive approach, optimizing operational efficiency from the point of view of the customer. VxRail customers value its LCM capabilities because they reduce operational time and effort, which can be diverted to other areas of need in IT. VxRail takes on the responsibility of driving stack integrity for the lifecycle management of the cluster with Continuously Validated States. And VxRail sustains stack integrity throughout the life of the cluster, allowing you to simply and predictably evolve with technology trends.
VxRail & Intel Optane for Extreme Performance
Fri, 07 Aug 2020 15:33:49 -0000
Enabling high performance for HCI workloads is exactly what happens when VxRail is configured with Intel Optane Persistent Memory (PMem). Optane PMem provides compute and storage performance to better serve applications and business-critical workloads. So, what is Intel Optane Persistent Memory? Persistent memory is memory that can also be used as storage, providing RAM-like performance with very low latency and high bandwidth. It’s great for applications that require or consume large amounts of memory, such as SAP HANA, and has many other use cases, as shown in Figure 1. VxRail is certified for SAP HANA as well as for Intel Optane PMem.
Moreover, PMem can be used as block storage where data can be written persistently; a great example is DBMS log files. A key advantage of this technology is that you can start small with a single PMem card (or module), then scale and grow as needed, with the ability to add up to 12 cards. Customers can take advantage of PMem immediately because there’s no need to make major hardware or configuration changes, nor to budget for a large capital expenditure.
There are a wide variety of use cases today including those you see here:
Figure 1: Intel Optane PMem Use Cases
PMem offers two very different operating modes, Memory and App Direct, and App Direct can in turn be used in two very different ways.
First, Intel Optane PMem in Memory mode is not yet supported by VxRail. This mode acts as volatile system memory and provides a significantly lower cost per GB than traditional DRAM DIMMs. A follow-on update to this blog will describe this mode and its test results in much more detail once it is supported.
As for App Direct mode (supported today), PMem is consumed by virtual machines either as a block storage device, known as vPMemDisk, or as byte-addressable memory, known as virtual NVDIMM. Both provide great benefit to the applications running in a virtual machine, just in very different ways. vPMemDisk can be used with any virtual machine hardware version and any guest OS. Since it’s presented as a block device, it is treated like any other virtual disk, and applications and/or data can then be placed on it. The second consumption method, NVDIMM, has the advantage of being addressed in the same way as regular RAM; however, it can retain its data through reboots or power failures. This is a considerable plus for large in-memory databases like SAP HANA, where cache warm-up or the time to load tables into memory can be significant!
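To make the byte-addressable (NVDIMM) consumption method concrete, here is a minimal sketch from inside a guest VM, assuming the virtual NVDIMM is exposed through a DAX-mounted filesystem; the mount path is an assumption:

```python
import mmap
import os

PMEM_PATH = "/mnt/pmem/state.bin"  # assumed DAX mount point in the guest

fd = os.open(PMEM_PATH, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, 4096)                      # reserve one page
buf = mmap.mmap(fd, 4096, mmap.MAP_SHARED)  # map it into the address space

buf[0:5] = b"hello"  # ordinary byte-addressable store, no block I/O path
buf.flush()          # msync: make the write durable across reboots
buf.close()
os.close(fd)
```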
However, it’s important to note that, like any other memory module, a PMem module does not provide data redundancy. This may not be an issue for some data files of commonly used applications that can be re-created in case of a host failure. But a key principle when using PMem, either as block storage or as byte-addressable memory, is that the applications are responsible for handling data replication to provide durability.
New data redundancy options are expected for applications that use PMem, and they should be well understood before deployment.
First, we’ll look at test results using PMem as a virtual disk (vPMemDisk). Our engineering team tested VxRail with PMem in App Direct mode and ran comparison tests against an all-flash VxRail (P570F series platform). The testing simulated a typical 4K OLTP workload with a 70/30 read/write ratio. Our results achieved more than 1.8M IOPS, 6X more than the all-flash VxRail system. That equates to 93% faster response times (lower latency) and 6X greater throughput, as shown here:
Figure 2: VxRail PMem App Direct versus VxRail all-flash
This latency difference indicates the potential to improve the performance of legacy applications by placing specific data files on a PMem module, for example, placing log files on PMem. To verify the benefit of this log-acceleration use case, we ran a TPC-C benchmark comparing a VxRail configured with its log file on a vPMemDisk to an all-flash VxRail vSAN, and we saw a 46% improvement in the number of transactions per minute.
Figure 3: Log file acceleration use case
For the second consumption method, we tested PMem in App Direct mode using the NVDIMM consumption method. We performed tests using 1, 2, 4, 8, and then 12 PMem modules. All testing has been evaluated and validated by ESG (Enterprise Strategy Group). The certified white paper has been published, as highlighted in the resources section.
Figure 4: NVDIMM device testing (vSAN not-optimized versus optimized PMem NVDIMM)
The results show near-linear scalability as we increase the number of modules from 1 to 12. With 12 PMem modules, VxRail achieves 80 times more IOPS for the 4K random read (RR) workload than when running against non-optimized vSAN (meaning all-flash VxRail vSAN with no PMem involved), and 100X for the 4K random write (RW) workload. The right half of the graphic depicts throughput results for very large IO, 64KB. When PMem is optimized on 12 modules, we saw 28X higher throughput for the 64KB random read workload, and PMem is 13 times faster for the 64KB random write.
What you see here is amazing performance on a single VxRail host and almost linear scalability when adding PMem!! Yes, that warrants a double bang. If you were to max out a 64-node cluster, the potential scalability is phenomenal and game changing!
So, what does all this mean? Key takeaways are:
- The local performance of VxRail with Intel Optane PMem (12 x 128GB modules in a single host) can scale to more than 12M read IOPS and more than 4M write IOPS, or 70GB/s read throughput and 22GB/s write throughput.
- The use of PMem modules doesn’t affect regular activity on vSAN datastores, and it extends the value of your VxRail platform in many ways:
- It can be used to accelerate legacy applications, such as RDBMS Log acceleration
- It enables the deployment of in memory databases and applications that can benefit from the higher IO throughput provided by PMEM while still taking the benefit of vSAN characteristics in the VxRail platform
- It not only increases the performance of traditional HCI workloads such as VDI, but also supports performance-intensive transactional and analytics workloads
- It offers orders-of-magnitude faster performance than traditional storage
- It provides more memory for less cost as PMem is much less costly than DRAM
The validation testing was completed by ESG (Enterprise Strategy Group). White papers and other resources on VxRail for extreme performance are available via the links listed below.
By: KJ Bedard – VxRail Technical Marketing Engineer