The cluster configuration that we deployed for the DBaaS platform included hardware components, BIOS, firmware, and drivers that our engineering teams deliberately selected and validated to balance the solution's design goals.
The integrated system consisted of four AX-740xd nodes from Dell Technologies, two Dell EMC PowerSwitch S5248F-ON Top of Rack (ToR) network switches, and one Dell EMC PowerSwitch S3048-ON for out-of-band (OOB) management. Our engineering organization validated all AX nodes and network topologies. The AX nodes were factory pre-installed with Azure Stack HCI, version 20H2.
We set up a four-node cluster for the best possible resiliency, allowing us to sustain a failure of two entire nodes without impacting the running workloads. To ensure that we could use larger VM sizes to deploy the AKS-HCI workload clusters, we populated the cluster with 192 CPU cores and 1.5 TB of RAM. Larger AKS-HCI workload clusters allowed us to provision more Azure Arc-enabled SQL Managed Instances and Microsoft SQL databases for functional and performance testing.
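As a quick sanity check of that aggregate capacity, a short PowerShell sketch along the following lines can tally cores and memory across the cluster nodes. This is illustrative only; it assumes the Failover Clustering tools are installed on the management host and uses the lab cluster name ASHCLUSTER:

# Tally physical cores and installed memory across all cluster nodes
$nodes = (Get-ClusterNode -Cluster ASHCLUSTER).Name
$cores = ($nodes | ForEach-Object { Get-CimInstance Win32_Processor -ComputerName $_ } |
    Measure-Object NumberOfCores -Sum).Sum
$memGB = ($nodes | ForEach-Object { Get-CimInstance Win32_PhysicalMemory -ComputerName $_ } |
    Measure-Object Capacity -Sum).Sum / 1GB
"Cluster total: $cores cores, $memGB GB RAM"   # Expect 192 cores and 1,536 GB for the four AX-740xd nodes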
Note: Azure Stack HCI, version 21H2, introduces dynamic processor compatibility mode. However, Dell Technologies continues to strongly recommend that all nodes in an Azure Stack HCI cluster be homogeneous, with identical CPUs, memory, disk drives, and Ethernet adapters. BIOS, firmware, and driver revisions should also be identical on the components that are provided in the Azure Stack HCI solutions catalog from Dell Technologies.
For the storage subsystem, we selected a single-tier, all-flash SSD configuration because of its high performance and ease of configuration and maintenance. We chose the three-way mirror resiliency type for the volumes because it maximizes database performance compared with the parity options in Storage Spaces Direct. The following tables detail the cluster configuration and AX-740xd node specifications that we built in the lab.
Table 1. ASHCLUSTER configuration
Cluster design elements | Description |
Cluster node model | AX-740xd |
Number of cluster nodes | 4 |
Network switch model | Dell EMC PowerSwitch S5248F-ON featuring 48 x 25 GbE SFP28 + 2 x 200 GbE QSFP28-DD + 4 x 100 GbE QSFP28 ports |
Number of ToR network switches | 2 |
Network topology | Scalable, nonconverged |
Volume resiliency | Three-way mirror |
Usable storage capacity | Approximately 16 TB |
Table 2. AXNode01 – AXNode04 node specifications
Resources per cluster node | Description |
CPU | Dual-socket Intel Xeon Gold 6248R 3.0 GHz, 24C/48T processors |
Memory | 384 GB |
Storage controller for operating system | BOSS-S1 controller card |
Physical drives for operating system | 2 x M.2 240 GB SATA drives configured as RAID 1 |
Storage controller for Storage Spaces Direct | HBA330 Controller Adapter, Low Profile |
Physical drives for Storage Spaces Direct | 8 x 1.92 TB SSD SAS Mixed Use |
Network adapter for management and VM traffic | Intel X710 Dual Port 10 GbE SFP+ rNDC |
Network adapter for storage traffic | 1 x QLogic FastLinQ 41262 Dual Port 10/25GbE SFP28 Adapter |
Operating system | Microsoft Azure Stack HCI, version 20H2 |
Solution update catalog version for deployment | February 2021 |
We followed the HCI Deployment Guide to prepare and deploy the integrated system. The guide provides the PowerShell commands necessary to automate significant portions of the deployment. We followed the post-deployment procedures, such as Azure onboarding for Azure Stack HCI, version 20H2, creating virtual disks, and managing and monitoring with Windows Admin Center, in the Managing and Monitoring the Solution Infrastructure Life Cycle Operations Guide.
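For reference, the post-deployment steps mentioned above reduce to a handful of cmdlets. The following is a minimal sketch rather than the guide's full procedure; the subscription ID, volume name, and size are placeholders, and the storage pool name assumes the default that Enable-ClusterStorageSpacesDirect creates ("S2D on <cluster name>"):

# Register the cluster with Azure (requires the Az.StackHCI module; prompts for Azure sign-in)
Register-AzStackHCI -SubscriptionId "<subscription-id>" -ComputerName AXNode01

# Create a Cluster Shared Volume with three-way mirror resiliency
New-Volume -StoragePoolFriendlyName "S2D on ASHCLUSTER" -FriendlyName "Volume01" `
    -FileSystem CSVFS_ReFS -ResiliencySettingName Mirror -PhysicalDiskRedundancy 2 -Size 2TB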
The following figures depict the integrated system’s networking configuration. We employed a scalable, nonconverged network topology and the S5248F-ON ToR switches for their high port density, which ensured that we could scale to the maximum cluster size of 16 AX nodes as database and application demands grew. Management and VM traffic traversed the dual-port rNDC, configured in Hyper-V as a Switch Embedded Team (SET). Storage traffic passed through a dedicated dual-port adapter with no teaming in Hyper-V. Storage networking was configured to use iWARP for RDMA.
Note: In the test configuration, we found that iWARP was a better option than RoCE because it did not require any additional configuration steps on the ToR switches. Enabling iWARP on the AX nodes required only BIOS and driver setting changes, which are noted in the HCI Deployment Guide.
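For illustration, the driver-level change amounts to selecting iWARP as the RDMA transport on the QLogic storage ports and enabling RDMA on them. The adapter names and advanced-property strings below are examples only; the exact display names vary by driver version, so follow the HCI Deployment Guide for the authoritative values:

# Select iWARP as the RDMA mode on each storage port (property name is driver-specific)
Set-NetAdapterAdvancedProperty -Name "Storage1","Storage2" `
    -DisplayName "NetworkDirect Technology" -DisplayValue "iWarp"

# Enable RDMA on the storage adapters and confirm that it is operational
Enable-NetAdapterRdma -Name "Storage1","Storage2"
Get-NetAdapterRdma -Name "Storage*"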
Figure 3. Physical network connectivity
Figure 4. Hyper-V virtual networking configuration
We followed the Network Integration and Host Network Configuration Options Guide to implement this scalable, nonconverged network topology. This guide provides the PowerShell commands that are required to configure the networking on each AX node. We leveraged the sample switch configurations listed in the Switch Configurations - iWARP Only (QLogic Cards) Guide for the S5248F-ON switch configurations.
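As a sketch of the host networking pattern described above, the two rNDC ports are teamed into a SET switch for management and VM traffic, while the two storage ports remain unteamed and carry only RDMA traffic. Adapter names and IP addressing here are placeholders; the guide's scripts use the actual rNDC and QLogic port names:

# Create a Switch Embedded Team (SET) across the two rNDC ports for management and VM traffic
New-VMSwitch -Name "ManagementSwitch" -NetAdapterName "NIC1","NIC2" `
    -EnableEmbeddedTeaming $true -AllowManagementOS $true

# The storage adapters are not teamed; assign each port an address on its own storage subnet
New-NetIPAddress -InterfaceAlias "Storage1" -IPAddress 192.168.101.11 -PrefixLength 24
New-NetIPAddress -InterfaceAlias "Storage2" -IPAddress 192.168.102.11 -PrefixLength 24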
The following table lists the tools that we used to perform the deployment, life cycle management, and monitoring of the infrastructure layer.
Table 3. Inventory of tools at the infrastructure layer
Tool | Version |
Microsoft Windows Admin Center | 2103.2 |
Dell EMC OpenManage Integration with Microsoft Windows Admin Center | 2.1 |
PowerShell | 5.1 or Core |
After initial deployment of ASHCLUSTER, we added it to Microsoft Windows Admin Center. We used Windows Admin Center throughout the testing scenarios to monitor processor, memory, network, and storage performance and capacity at the physical infrastructure and VM levels. We also used it for some troubleshooting and maintenance tasks. The following figure shows the integrated system added to Windows Admin Center.
Figure 5. ASHCLUSTER in Windows Admin Center dashboard
We installed the OpenManage Integration extension in Windows Admin Center using the instructions in the Operations Guide. We used the stand-alone extension to verify the health of the ASHCLUSTER hardware before we performed administrative tasks and to prepare for cluster expansion. The extension also installed a snap-in to the Microsoft Cluster-Aware Updating extension, which orchestrates operating system, BIOS, firmware, and driver updates in a single workflow with no interruption to running workloads. The operational tasks are covered in more detail in the testing scenarios section. The following figure shows the first step in the 1-click, full stack life cycle management workflow that uses Cluster-Aware Updating.
Note: We checked the OpenManage Integration with Microsoft Windows Admin Center v2.1 compatibility matrix to ensure that we were using the correct versions of all supported software. For example, we had to update the Microsoft Failover Cluster Tools extension to the 1.128.0 release to ensure that Cluster-Aware Updating would function as expected.
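Extension versions can also be checked and updated from PowerShell with the ExtensionTools module that ships with the Windows Admin Center gateway. This is a sketch only; the gateway URL is a placeholder, and the extension ID shown for Failover Cluster Tools is our assumption, so confirm it against the output of Get-Extension:

# Load the extension management module installed with the Windows Admin Center gateway
Import-Module "$env:ProgramFiles\Windows Admin Center\PowerShell\Modules\ExtensionTools"

# List installed extensions and their versions, then update the Failover Cluster Tools extension
Get-Extension "https://wac.example.com"
Update-Extension "https://wac.example.com" -ExtensionId "msft.sme.failover-cluster"   # extension ID is an assumption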
Figure 6. Full stack Cluster-Aware Updating available operating system updates
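Before starting a full stack run like the one shown in Figure 6, it can be useful to confirm from PowerShell that the Cluster-Aware Updating prerequisites are in place. The sketch below uses the built-in CAU cmdlets; the Dell hardware update plug-in itself is selected through the OpenManage Integration snap-in in Windows Admin Center and is not shown here:

# Validate CAU prerequisites (cluster service, remoting, firewall rules) for the cluster
Test-CauSetup -ClusterName ASHCLUSTER

# Preview the operating system updates that a CAU run would apply using the default Microsoft plug-in
Invoke-CauScan -ClusterName ASHCLUSTER -CauPluginName Microsoft.WindowsUpdatePlugin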