Microsoft HCI Solutions from Dell Technologies: Designed for extreme resilient performance
Wed, 02 Jun 2021 02:31:13 -0000
Dell EMC Integrated System for Microsoft Azure Stack HCI (Azure Stack HCI) is a fully productized HCI solution built on our flexible AX node family.
Before I get into some exciting performance test results, let me set the stage. Azure Stack HCI combines the software-defined compute, storage, and networking features of the Microsoft Azure Stack HCI OS with AX nodes from Dell Technologies to deliver the right balance of performance, resilience, and cost-effectiveness in software-defined infrastructure.
Figure 1 illustrates our broad portfolio of AX node configurations with a wide range of component options to meet the requirements of nearly any use case – from the smallest remote or branch office to the most demanding database workloads.
Figure 1: current platforms supporting our Microsoft HCI Solutions from Dell Technologies
Each chassis, drive, processor, DIMM module, and network adapter, along with the associated BIOS, firmware, and driver versions, has been carefully selected and tested by the Dell Technologies Engineering team to optimize the performance and resiliency of Microsoft HCI Solutions from Dell Technologies. Our Integrated Systems are designed for 99.9999% hardware availability*.
* Based on Bellcore component reliability modeling for AX-740xd nodes and S5248F-ON switches, a) in 2- to 4-node clusters configured with N + 1 redundancy, and b) in 4- to 16-node clusters configured with N + 2 redundancy, March 2021.
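To put six nines in perspective, here is a quick back-of-the-envelope conversion from availability to expected annual downtime (generic arithmetic for illustration only, not part of the Bellcore modeling):

```python
# Illustrative only: convert an availability target into expected annual downtime.
# This is generic arithmetic, not the Bellcore reliability modeling cited above.
availability = 0.999999                       # "six nines"
seconds_per_year = 365.25 * 24 * 3600
downtime_seconds = (1 - availability) * seconds_per_year
print(f"~{downtime_seconds:.0f} seconds of downtime per year")   # ~32 seconds
```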
Comprehensive management with Dell EMC OpenManage Integration with Windows Admin Center, rapid time to value with Dell EMC ProDeploy options, and solution-level Dell EMC ProSupport complete this modern portfolio.
You'll notice in Figure 1 that we have a new addition, the AX-7525: a dual-socket, AMD-based platform designed for extreme performance and high scalability.
The AX-7525 features direct-attach NVMe drives with no PCIe switch, giving each storage device its full PCIe Gen4 bandwidth and resulting in massive IOPS and throughput at minimal latency.
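To see why the direct-attach design matters, here is a rough sketch of the theoretical NVMe bandwidth available to a single node. The drive count matches the test configuration described below; the x4 lane width and per-lane rate are assumptions for illustration, not measured values:

```python
# Rough, illustrative estimate of aggregate NVMe bandwidth per AX-7525 node.
# Assumptions (not measured values): each NVMe drive uses a x4 PCIe Gen4 link,
# and PCIe Gen4 provides roughly 1.97 GB/s of usable bandwidth per lane per direction.
drives_per_node = 24
lanes_per_drive = 4
gen4_gb_per_lane = 1.97

per_drive_gb = lanes_per_drive * gen4_gb_per_lane
per_node_gb = drives_per_node * per_drive_gb
print(f"~{per_drive_gb:.1f} GB/s per drive, ~{per_node_gb:.0f} GB/s theoretical per node")
# Real-world throughput is lower (drive media limits, CPU, software stack), but with
# no PCIe switch in the path, every drive keeps its full Gen4 link to the CPU.
```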
To get an idea of how performant and resilient this platform is, our Dell Technologies experts put a 4-node AX-7525 cluster to the test. Each node had the following configuration:
- 24 NVMe drives (PCIe Gen 4)
- Dual-socket AMD EPYC 7742 64-Core Processor (128 cores)
- 1 TB RAM
- 1 Mellanox ConnectX-6 100 GbE RDMA NIC
The easy headline would be that this setup consistently delivered nearly 6M IOPS at sub-1 ms latency. You might think we doctored these performance tests to achieve such impressive figures with just a 4-node cluster!
The reality is that we sought to establish the ‘hero numbers’ as a baseline – ensuring that our cluster was configured optimally. However, we didn’t stop there. We wanted to find out how this configuration would perform with real-world IO patterns. This blog won’t get into the fine-grained details of the white paper, but we’ll review the test methodology for those different scenarios and explain the performance results.
Figure 2 shows the 4-node cluster and fully converged network topology that we built for the lab:
Figure 2: Lab setup
We performed two distinct sets of tests in this environment:
- Tests with IO profiles aimed at identifying the maximum IOPS and throughput thresholds of the cluster:
  - Test 1: Using a healthy 4-node cluster
- Tests with IO profiles that are more representative of real-life workloads (online transaction processing (OLTP), online analytical processing (OLAP), and mixed workload types):
  - Test 2: Using a healthy 4-node cluster
  - Test 3: Using a degraded 4-node cluster, with a single-node failure
  - Test 4: Using a degraded 4-node cluster, with a two-node failure
We chose the three-way mirror resiliency type for the volumes we created with VMFleet because of its superior performance versus erasure coding options in Storage Spaces Direct.
Now that we have a clearer idea of the lab setup and the testing methodology, let’s move on to the results for the four tests.
Test 1: IO profile to push the limits on a healthy 4-node cluster with 64 VMs per node
Here are the workload profiles we tested (see the white paper for the full results table):
- 100% random read
- 100% random write
- 100% sequential read
- 100% sequential write
* The reason for this slightly higher latency is that we are pushing too many outstanding IOs after performance has already plateaued. We noticed that even with 32 VMs we hit the same IOPS; from that point on, all we are doing is adding more load that a) isn't driving any additional IOs and b) just adds to the latency.
This test sets the bar for the limits and maximum performance we can obtain from this 4-node cluster: almost 6 million read IOPS, 700K write IOPS, and throughput of 105 GB/s for reads and 8 GB/s for writes.
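The footnote above is really just Little's Law at work: outstanding IO, IOPS, and latency are tied together, so once the cluster hits its IOPS plateau, extra queue depth only shows up as latency. A minimal sketch with illustrative numbers (not our measured results):

```python
# Little's Law for a storage queue: outstanding_io = IOPS x latency.
# Past the IOPS plateau, adding outstanding IO no longer increases IOPS;
# it only inflates latency.
def expected_latency_ms(outstanding_io: int, iops: float) -> float:
    return outstanding_io / iops * 1000

plateau_iops = 6_000_000    # illustrative plateau, roughly the read ceiling above
for outstanding in (2_000, 4_000, 8_000):
    print(f"{outstanding} outstanding IOs -> "
          f"{expected_latency_ms(outstanding, plateau_iops):.2f} ms")
# Doubling outstanding IO past the plateau roughly doubles latency at constant IOPS.
```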
Test 2: real-life workload IO profile on a healthy 4-node cluster with 32 VMs per node
The IO profiles for this test encompass a broad range of real-life scenarios:
- OLTP-oriented: we tested a wide spectrum of block sizes, ranging from 4K to 32K, and write IO ratios varying from 20% to 50%.
- OLAP-oriented: the most common OLAP IO profile is large block sizes and sequential access. Other workloads that follow a similar pattern are file backups and video streaming. We tested 64K to 512K block sizes and 20% to 50% write IO ratios. (A simple sketch of the resulting test matrix follows this list.)
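For reference, here is a minimal sketch of how such a test matrix can be enumerated. The block size and write ratio endpoints come from the ranges above; the intermediate steps are illustrative assumptions, and the actual tests were driven by VMFleet (the exact parameters are documented in the white paper):

```python
# Minimal sketch: enumerate the OLTP- and OLAP-oriented IO profiles described above.
# Endpoints match the ranges in the text; intermediate steps are illustrative.
from itertools import product

oltp_block_sizes_kib = [4, 8, 16, 32]        # small blocks, random access
olap_block_sizes_kib = [64, 128, 256, 512]   # large blocks, sequential access
write_ratios = [0.20, 0.30, 0.40, 0.50]      # 20% to 50% writes

profiles = (
    [("random", b, w) for b, w in product(oltp_block_sizes_kib, write_ratios)]
    + [("sequential", b, w) for b, w in product(olap_block_sizes_kib, write_ratios)]
)

for pattern, block_kib, write_ratio in profiles:
    print(f"{pattern:<10} {block_kib:>3} KiB  {int(write_ratio * 100)}% write")
```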
The following figure shows the details and results we obtained for all the different tested IO patterns:
Figure 3: Test 2 results
These are super impressive results. Notice (on the left) the 1.6 million IOPS at 1.2 milliseconds average latency for the typical OLTP IO profile of 8 KB block size and 30% random write. Even at a 32K block size and a 50% write IO ratio, we measured 400,000 IOPS at under 7 milliseconds latency.
Also remarkable is the extreme throughput we witnessed across all the tests, especially the 29.65 GB/s achieved with an IO profile of 512K block size and 20% write ratio.
Tests 3 and 4: push the limits and real-life workload IO profiles on a degraded 4-node cluster
To simulate a one-node failure (Test 3), we shut down node 4, which caused node 2 to take additional ownership of the 32 restarted VMs from node 4, for a total of 64 VMs on node 2.
Similarly, to simulate a two-node failure (Test 4), we shut down nodes 3 and 4, leading to a VM reallocation process from node 3 to node 1, and from node 4 to node 2. Nodes 1 and 2 ended up with 64 VMs each.
The cluster environment continued to produce impressive results even in this degraded state. The table below compares the testing scenarios that used IO profiles aimed at identifying the maximum thresholds.
Table: maximum-threshold test results compared for the one-node failure and two-node failure scenarios
Figure 4 illustrates the test results for real-life workload scenarios for the healthy cluster and for the one-node and two-node degraded states.
Figure 4: Test 3 and 4 results
Once more, we continued to see outstanding performance results from an IO, latency, and throughput perspective, even with one or two nodes failing.
An important observation: for the 4K and 8K block sizes, IOPS decrease and latency increases as you would expect, whereas for the 32K and larger block sizes we found that:
- Latency was less variable across the failure scenarios because write IOs did not need to be committed across as many nodes in the cluster.
- After the two-node failure, there was actually an increase in IOs (20-30%) and throughput (52% on average)!
There are two reasons for this (a quick back-of-the-envelope calculation follows this list):
- The 3-way mirrored volumes became 2-way mirrored volumes on the two surviving nodes. This effect led to 33% fewer backend drive write IOs. The overall drive write latency decreased, driving higher read and write IOs. This only applied when CPU was not the bottleneck.
- Each of the remaining nodes doubled the number of running VMs (from 32 to 64), which directly translated into greater potential for more IOs.
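That first point is easy to make concrete with some quick arithmetic (the write IOPS figure below is illustrative, not a measured result):

```python
# Back-of-the-envelope: backend drive write IOs per frontend write for mirrored volumes.
# A three-way mirror commits every write to 3 copies; after a two-node failure, the
# surviving volumes effectively behave like a two-way mirror (2 copies).
frontend_write_iops = 100_000    # illustrative figure only

backend_3way = frontend_write_iops * 3
backend_2way = frontend_write_iops * 2
reduction = (backend_3way - backend_2way) / backend_3way
print(f"Backend drive write IOs drop by {reduction:.0%}")   # 33% fewer backend writes
# Fewer backend writes lowers drive write latency, which in turn lets the surviving
# nodes push more read and write IOs, provided that CPU is not the bottleneck.
```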
We are happy to share these figures, which demonstrate the extremely resilient performance our integrated systems deliver, both during normal operations and in the event of failures.
Dell EMC Integrated System for Microsoft Azure Stack HCI, especially with the AX-7525 platform, is an outstanding solution for customers struggling to support their organization's increasingly heavy demand for resource-intensive workloads while maintaining or improving their service level agreements (SLAs).
Related Blog Posts
Dell EMC OpenManage Integration with Microsoft Windows Admin Center v2.0 Technical Walkthrough
Thu, 18 Mar 2021 19:29:13 -0000
Dell EMC Integrated System for Microsoft Azure Stack HCI is a fully integrated HCI system for hybrid cloud environments that delivers a modern, cloud-like operational experience on-premises from a mature market leader.
The integrated system is built on our flexible AX node family as the foundation and combines Dell Technologies full stack life cycle management with the Microsoft Azure Stack HCI operating system.
This blog focuses on one of the most important and critical parts of Azure Stack HCI: the management layer. Check this blog for additional background.
We will show how Dell Technologies makes a good thing, Microsoft Windows Admin Center (WAC), even better through our OpenManage Integration with Microsoft Windows Admin Center v2.0 (OMIMSWAC).
The following diagram illustrates a typical Dell Technologies Azure Stack HCI setup:
To learn more about Microsoft HCI Solutions from Dell Technologies and get details on each of the different components, check out this video where our Dell Technologies experts examine the solution thoroughly from the bottom up.
Windows Admin Center Extensions from Microsoft
WAC provides the option to leverage easy-to-use workflows to perform many tasks, including automatic deployments (coming soon) and updates.
Dell Technologies has developed specialized snap-ins that integrate OpenManage with WAC to further extend the capabilities of Microsoft’s WAC extensions.
The following table describes the three key elements highlighted in the previous diagram as (1), (2), and (3). We examine each in detail in the next three sections.
|Item|Type|Integrates with|Developed by|Description|
|---|---|---|---|---|
|Microsoft Cluster Aware Updating extension|Microsoft Failover Cluster Tool Extension, 1.250.0.nupkg release*|WAC|Microsoft|WAC workflow to apply cluster aware OS updates|
|Dell EMC Integrated Full Stack Cluster Aware Updating|Integration snap-in|Microsoft CAU extension|Dell Technologies|Snap-in to the main CAU workflow that provides BIOS, firmware, and driver updates while performing OS updates|
|OMIMSWAC v2.0 Standalone extension|WAC extension|WAC|Dell Technologies|OpenManage WAC extension for infrastructure life cycle management, plus cluster monitoring, inventory, and troubleshooting|
|Cluster Creation extension|Microsoft Cluster Creation Extension*|WAC|Microsoft|WAC workflow to create Azure Stack HCI clusters|
|Integrated Deployment and Update (coming soon)|Integration snap-in|Microsoft IDU extension|Dell Technologies|Snap-in to the main Cluster Creation workflow that provides BIOS, firmware, and driver updates during the cluster creation process|

* Minimum version validated
Windows Admin Center extensions and integrations
You can install the Microsoft Cluster Aware Updating extension within WAC by selecting the "Gear" icon in the top-right corner; then, under "Gateway", navigate to "Extensions". Under "Available extensions", find the desired extension and select "Install". For details, see the install guide, and refer to the extensions' product documentation for the latest updates.
Microsoft Cluster Aware Updating extension
To get to the Microsoft WAC Azure Stack HCI Cluster Aware Updating extension, log in to WAC and follow these steps:
- Click the cluster you want to connect to. This takes you to the cluster Dashboard.
- On the left pane, under “Tools”, select “Update”.
- In the “Updates” window, click on “Check for updates”, which will pop up the “Install updates” window.
- Here we are presented with a three-step process where we select, in order:
- Operating system updates
- Hardware updates
- Proceed with the installation
It is important to note that you can either run only one operation at a time by skipping the other, or run both in a single process with one reboot.
Select any available operating system updates and click "Next: Hardware updates".
This takes us to the second step of the sequence, Hardware updates, which is a key phase of the automated end-to-end cluster aware update process.
This is where the Dell Technologies snap-in integrates with Microsoft’s original workflow, allowing us to seamlessly provide automated BIOS, firmware, and driver updates (and OS updates if also selected) to all the nodes in the cluster with a single reboot. Let’s look at this process in detail in the next section.
Dell EMC Integrated Full Stack Cluster Aware Updating
Once you click "Next: Hardware updates" in Microsoft's original Azure Stack HCI Cluster Aware Updating workflow, you are taken to the Dell EMC Cluster Aware Updating integration.
If the integration is not installed, there is an option to install it from inside the workflow.
Click “Get updates”.
Our snap-in for Cluster Aware Updating (CAU) takes us through the following sequence of five steps.
1. Prerequisites
A validation process occurs, checking that all AX nodes are:
- Supported in the HCL
- Same model
- OpenManage Premium License for MSFT HCI Solutions compliant (included in AX node base solution)
- Compatible with cluster creation
Click “Next: Update source”.
2. Update source
Here we can select the source for our BIOS, firmware, and driver repository, whether online [Update Catalog for Microsoft HCI Solutions] or offline, for edge or disconnected scenarios [Dell EMC Repository Manager Catalog]. Dell Technologies creates and maintains these solution catalogs.
Click “Next: Compliance report”.
3. Compliance report
Now we can check how compliant our nodes are and select for BIOS, firmware, and/or driver remediation. All the recommended components are selected by default.
The compliance operation runs in parallel for all nodes, and the report is shown consolidated across nodes.
Click “Next: Summary”.
4. Summary
All selections from all nodes are shown in the Summary for review before we click "Next: Download updates".
5. Download updates
This window provides the statistics regarding the download process (start time, download status).
When all downloads are completed, we can click "Next: Install", which takes us back to Step 3 of the main workflow ("Install") to begin the installation of OS and hardware updates (if both were selected) on the target nodes.
If any of the updates requires a restart, servers are rebooted one at a time, moving cluster roles such as VMs between servers to prevent downtime and guarantee business continuity.
Once the process is finished for all the nodes, we can go back to “Updates” to check for the latest update status and/or Update history for previous updates.
It is important to note that the Cluster Aware Updating extension is supported only for Dell EMC Integrated System for Microsoft Azure Stack HCI.
OMIMSWAC v2.0 Standalone extension
The standalone extension applies to Windows Server HCI and Azure Stack HCI, and continues to provide monitoring, inventory, troubleshooting, and hardware updates with CAU.
New to OMIMSWAC 2.0 is the option to schedule updates during a programmed maintenance window for greater flexibility and control during the update process.
It is important to note that the OMIMSWAC Standalone version provides the Cluster Aware Updating feature for the hardware (BIOS, firmware, and drivers) in a single reboot, although this process is not integrated with operating system updates. It provides full lifecycle management for the hardware only, not the OS layer.
Another key takeaway is that the OMIMSWAC Standalone version fully supports Dell EMC HCI Solutions for Microsoft Windows Server and even certain qualified previous solutions (Dell EMC Storage Spaces Direct Ready Nodes).
Dell Technologies has developed OMIMSWAC to make integrated systems' lifecycle management a seamless and easy process. It guarantees controlled, end-to-end cluster hardware and software update processes throughout the lifespan of the service.
The Dell EMC OMIMSWAC automated and programmatic approach provides obvious benefits: it mitigates the risk introduced by manual intervention, requires significantly fewer steps to update clusters, and demands significantly less focused attention time from IT administrators. In small 4-node cluster deployments, this can mean up to 80% fewer steps and up to 90% less focused attention from an IT operator.
Full details on the benefits of performing these operations automatically through OMIMSWAC versus doing it manually are explained in this white paper.
Thank you for reading this far and stay tuned for more blog updates in this space!
Azure Stack HCI Stretch Clustering: because automatic disaster recovery matters
Mon, 29 Mar 2021 18:19:31 -0000
If history has taught us anything, it’s that disasters are always around the corner and tend to appear in any shape or form when they’re least expected.
To overcome these circumstances, we need the appropriate tools and technologies that can guarantee resuming operations back to normal in a secure, automatic, and timely manner.
Traditional disaster recovery (DR) processes are often complex and require a significant infrastructure investment. They are also labor intensive and prone to human error.
Since December 2020, the situation has changed. Thanks to the new release of Microsoft Azure Stack HCI, version 20H2, we can leverage the new Azure Stack HCI stretched cluster feature on Dell EMC Integrated System for Microsoft Azure Stack HCI (Azure Stack HCI).
The integrated system is built on our flexible AX node family as the foundation and combines Dell Technologies full stack life cycle management with the Microsoft Azure Stack HCI operating system.
It is important to note that this technology is only available for the integrated system offering under the certified Azure Stack HCI catalog.
Azure Stack HCI stretch clustering provides an easy and automatic solution (with no human interaction, if desired) that ensures transparent failover of disaster-impacted production workloads to a safe secondary site.
It can also be leveraged for planned operations (such as an entire site migration or disaster avoidance) that, until now, required labor-intensive and error-prone human effort to execute.
Stretch clustering is one type of Storage Replica configuration. It allows customers to split a single cluster between two locations—rooms, buildings, cities, or regions. It provides synchronous or asynchronous replication of Storage Spaces Direct volumes to provide automatic VM failover if a site disaster occurs.
There are two different topologies:
- Active-Passive: All the applications and workloads run on the primary (preferred) site while the infrastructure at the secondary site remains idle until a failover occurs.
- Active-Active: There are active applications in both sites at any given time and replication occurs bidirectionally from either site. This setup tends to be a more efficient use of an organization’s investment in infrastructure because resources in both sites are being used.
Azure Stack HCI stretch clustering topologies: Active-Passive and Active-Active
To be truly cost-effective, the best data protection strategies incorporate a combination of different technologies (deduplicated backup, archive, data replication, business continuity, and workload mobility) to deliver the right level of data protection for each business application.
The following diagram highlights that only a small subset of data holds the most valuable information. This is the sweet spot for stretch clustering.
For a real-life experience, our Dell Technologies experts put Azure Stack HCI stretched clustering to the test in the following lab setup:
Test lab cluster network topology
Note these key considerations regarding the lab network architecture:
- The Storage Replica, management, and VM networks in each site were unique Layer 3 subnets. In Active Directory, we configured two sites—Bangalore (Site 1) and Chennai (Site 2)—based on these IP subnets so that the correct sites appeared in Failover Cluster Manager on configuration of the stretched cluster. No additional manual configuration of the cluster fault domains was required.
- Average latency between the two sites was less than 5 milliseconds, which is required for synchronous replication. (A quick way to sanity-check these latency requirements is sketched after this list.)
- Cluster nodes could reach a file share witness within the 200-millisecond maximum roundtrip latency requirement.
- The subnets in both sites could reach Active Directory, DNS, and DHCP servers.
- Software-defined networking (SDN) on a multisite cluster is not currently supported and was not used for this testing.
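As a quick pre-deployment sanity check against those latency requirements, you can measure round-trip times from a cluster node to the remote site and to the witness. The sketch below is illustrative: the hostnames and port are placeholders, and TCP connect time is only a rough proxy for the latency that replication traffic will actually see:

```python
# Minimal sketch: measure average TCP connect round-trip time and compare it with the
# stretched-cluster guidance (about 5 ms average between sites for synchronous
# replication, and a 200 ms maximum round trip to the file share witness).
# The hostnames and port below are placeholders for your own environment.
import socket
import time

def avg_rtt_ms(host: str, port: int = 445, samples: int = 10) -> float:
    total = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=2):
            pass
        total += (time.perf_counter() - start) * 1000
    return total / samples

site2_rtt = avg_rtt_ms("node1.site2.example.com")    # placeholder hostname
witness_rtt = avg_rtt_ms("witness.example.com")      # placeholder hostname
print(f"Site-to-site RTT: {site2_rtt:.2f} ms (synchronous replication target: < 5 ms)")
print(f"Witness RTT:      {witness_rtt:.2f} ms (requirement: < 200 ms)")
```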
For all the details, see this white paper: Adding Flexibility to DR Plans with Stretch Clustering for Azure Stack HCI.
In this blog though, I only want to focus on summarizing the results we obtained in our labs for the following four scenarios:
- Scenario 1: Unplanned node failure
- Scenario 2: Unplanned site failure
- Scenario 3: Planned failover
- Scenario 4: Life cycle management
Scenario 1: Unplanned node failure
- Simulated failure or maintenance event: Node 1 in Site 1 power-down
- Expected behavior: Impacted VMs should fail over to another local node
- Result: In around 5 minutes, all 10 VMs from Node 1 in Site 1 fully restarted on Node 2 in Site 1. This is expected behavior because Site 1 was configured as the preferred site; otherwise, the active volume could have been moved to Site 2, and the VMs would have been restarted on a cluster node in Site 2.

Scenario 2: Unplanned site failure
- Simulated failure or maintenance event: Outage in Site 1 (simultaneous power-down of Nodes 1 and 2 in Site 1)
- Expected behavior: Impacted VMs should fail over to nodes in the secondary site
- Result: In 25 minutes, all VMs were restarted and the included web application was fully responsive. The volumes owned by the nodes in Site 2 remained online throughout this failure scenario, while the replica volumes remained offline until Site 1 was restored to full health. Once Site 1 was back online, synchronous replication began again from the source volumes in Site 2 to their destination replica partners in Site 1.

Scenario 3: Planned failover
- Simulated failure or maintenance event: Switch Direction operation on a volume from Windows Admin Center
- Expected behavior: Selected VMs and workloads should transparently move to the secondary site
- Result: Within 0 to 3 minutes, the application hosted by the affected VMs was reachable without service interruption (the time depends on whether IP reassignment is required). First, the owner node for the volumes changed to Node 2 in Site 2, and the owner node for the replica volumes changed to Node 2 in Site 1, with no service interruption. At this point, the test VM was still running in Site 1, but the volume holding its virtual disk was now owned in Site 2. Performance problems can result because I/O traverses the replication links between sites. After approximately 10 minutes, a Live Migration of the test VM occurs automatically (if not manually initiated earlier) so that the VM runs on the same node as its virtual disk.

Scenario 4: Life cycle management
- Simulated failure or maintenance event: Update all nodes in the cluster by using single-click full stack Cluster Aware Updating (CAU) in Windows Admin Center
- Expected behavior: The stretched cluster and CAU should work seamlessly together to provide a full stack cluster update without service interruption, with local-only workload mobility for the live-migrated VMs
- Result: The total process of applying the operating system and firmware updates to the stretched cluster took approximately 3 hours, and it had no application impact. Each node was drained, and its VMs were live migrated to the other node in the same site. The intersite links between Site 1 and Site 2 were never used during update operations, and the process required only a single reboot per node. This behavior was consistent throughout the update of all the nodes in the stretched cluster.
To sum up, Azure Stack HCI Stretch Clustering has been shown to work as expected under difficult circumstances. It can easily be leveraged to cover a wide range of data protection scenarios, such as:
- restoring your organization's IT within minutes after an unplanned event
- transparently moving running workloads between sites to avoid incoming disasters or other planned operations
- automatically failing over VMs and workloads of individual failed nodes
This technology can make the difference for businesses that need to get back on their feet automatically after disaster strikes, a total game changer in the automatic disaster recovery landscape.
Thank you for your time reading this blog, and don't forget to check out the full white paper!