Value Optimized AX-6515 for ROBO Use Cases
Tue, 14 Jul 2020 13:09:24 -0000
Introduction
Small offices and remote branch office (ROBO) use cases present special challenges for IT organizations. The issues tend to revolve around how to implement a scalable, resilient, secure, and highly performant platform at an affordable TCO. The infrastructure must be capable enough to efficiently run a highly diversified portfolio of applications and services and yet be simple to deploy, update, and support by a local IT generalist. Dell Technologies and Microsoft help you accelerate business outcomes in these unique ROBO environments with our Dell EMC Solutions for Microsoft Azure Stack HCI.
In this blog post, we share VMFleet results observed in the Dell Technologies labs for our newest AX-6515 two-node configuration – ideal for ROBO environments. Optimized for value, the small but powerful AX-6515 node packs a dense, single-socket 2nd Gen AMD EPYC processor into a 1RU chassis, delivering peak performance and excellent TCO. We also included the Dell EMC PowerSwitch S5212F-ON in our testing to provide 25GbE network connectivity for storage, management, and VM traffic in a small form factor. We followed the Dell EMC Solutions for Azure Stack HCI Deployment Guide to construct the test lab; the guide applies only to infrastructure built with validated and certified AX nodes from Dell Technologies running Microsoft Windows Server 2019 Datacenter.
We were quite impressed with the VMFleet results. First, we stressed the cluster’s storage subsystem to its limits using scenarios aimed at identifying maximum IOPS, latency, and throughput. Then, we adjusted the test parameters to be more representative of real-world workloads. The following summary of findings indicated to us that this two-node, AMD-based, all-flash cluster could meet or exceed the performance requirements of workload profiles often found in ROBO environments:
- Achieved over 1 million IOPS at sub-millisecond latency (245 microseconds) using a 4k block size and a 100% random-read IO pattern.
- Achieved over 400,000 IOPS at 4-millisecond latency using a 4k block size and a 100% random-write IO pattern.
- Using 512k block sizes, drove 6 GB/s and 12 GB/s of throughput for 100% sequential-write and 100% sequential-read IO patterns, respectively.
- Using a range of real-world scenarios, the cluster achieved hundreds of thousands of IOPS at under 7 milliseconds of latency and drove between 5 and 12 GB/s of sustained throughput.
Lab Setup
The following diagram illustrates the environment created in the Dell Technologies labs for the VMFleet testing. Ancillary services required for cluster operations such as DNS, Active Directory, and a file server for cluster quorum are not depicted.
Figure 1 Network topology
Table 1 Cluster configuration
Cluster Design Elements | Description |
Number of cluster nodes | 2 |
Cluster node model | AX-6515 nodes |
Number of network switches for RDMA and TCP/IP traffic | 2 |
Network switch model | Dell EMC PowerSwitch S5212F-ON |
Network topology | Fully-converged network configuration. RDMA and TCP/IP traffic traversing 2 x 25GbE network connections from each host. |
Network switch for OOB management | Dell EMC PowerSwitch S3048-ON |
Resiliency option | Two-way mirror |
Usable storage capacity | Approximately 12 TB |
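For reference, the usable capacity figure in Table 1 can be derived from the raw drive capacity, the two-way mirror overhead, and a repair reserve. The arithmetic below is a back-of-the-envelope sketch that assumes the common practice of reserving the equivalent of one capacity drive per node; it is not an exact figure from the lab.

```powershell
# Back-of-the-envelope usable capacity for the two-node AX-6515 cluster (values in TB)
$raw      = 2 * 8 * 1.92     # 2 nodes x 8 drives x 1.92 TB = 30.72 TB raw
$mirrored = $raw / 2         # a two-way mirror stores two copies of all data -> 15.36 TB
$reserve  = 2 * 1.92         # reserve roughly one capacity drive per node for rebuilds
$usable   = $mirrored - $reserve
$usable                      # ~11.5 TB, or "approximately 12 TB"
```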
Table 2 Cluster node resources
Resources per Cluster Node | Description |
CPU | Single-socket AMD EPYC 7702P 64-Core Processor |
Memory | 256 GB DDR4 RAM |
Storage controller for OS | BOSS-S1 adapter card |
Physical drives for OS | 2 x Intel 240 GB M.2 SATA drives configured as RAID 1 |
Storage controller for Storage Spaces Direct (S2D) | HBA330 Mini |
Physical drives | 8 x 1.92 TB Mixed Use KIOXIA SAS SSDs |
Network adapter | Mellanox ConnectX-5 Dual Port 10/25GbE SFP28 Adapter |
Operating System | Windows Server 2019 Datacenter |
The architectures of Azure Stack HCI solutions are highly opinionated and prescriptive. Each design is extensively tested and validated by Dell Technologies Engineering. Here is a summary of the key quality attributes that define these architectures followed by a section devoted to our performance findings.
- Efficiency – Many customers are interested in improving performance and gaining efficiencies by modernizing their aging virtualization platforms with HCI. Using Azure Stack HCI helps avoid a DIY approach to IT infrastructure, which is prone to human error and is more labor intensive.
- Maintainability – Our solution makes it simple to incorporate hybrid capabilities to reduce operational burden using Microsoft Windows Admin Center (WAC). Services in Azure can also be leveraged to avoid additional on-premises investments for management, monitoring, BCDR, security, and more. We have also developed the Dell EMC OpenManage Integration with Microsoft Windows Admin Center to assist with hardware monitoring and to simplify updates with Cluster-Aware Updating (CAU).
- Availability – Using a two-way mirror, we always have two copies of our data. This configuration can survive a single drive failure in one node or the failure of an entire node, but it cannot survive simultaneous failures in both nodes. If greater resiliency is required, volumes can be created using nested resiliency, which is discussed in more detail in the "Optional modifications to the architecture" section later in this blog post. A volume-creation sketch follows this list.
- Supportability – Support is provided by dedicated Dell Technologies ProSupport Plus and ProSupport for Software technicians who have expertise specifically tailored to Azure Stack HCI solutions.
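As noted in the Availability bullet above, the lab volumes used a two-way mirror. The following is a minimal sketch of creating such a volume with Storage Spaces Direct; the pool wildcard, volume name, and size are illustrative placeholders, not the exact values used in our lab.

```powershell
# Create a two-way mirrored, ReFS-formatted Cluster Shared Volume.
# On a two-node cluster, the Mirror resiliency setting defaults to a two-way mirror.
New-Volume -StoragePoolFriendlyName "S2D*" `
           -FriendlyName "Volume01" `
           -FileSystem CSVFS_ReFS `
           -ResiliencySettingName Mirror `
           -Size 2TB
```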
Testing Results
We leveraged VMFleet to benchmark the storage subsystem of our 2-node cluster. Many Microsoft customers and partners rely on this tool to help them stress test their Azure Stack HCI clusters. VMFleet consists of a set of PowerShell scripts that deploy virtual machines to a Hyper-V cluster and execute Microsoft’s DiskSpd within those VMs to generate IO. The following table presents the range of VMFleet and DiskSpd parameters used during our testing in the Dell Technologies labs.
Table 3 Test parameters
VMFleet and DiskSpd Parameters | Values |
Number of VMs running per node | 20 |
vCPUs per VM | 2 |
Memory per VM | 8 GB |
VHDX size per VM | 40 GB |
VM Operating System | Windows Server 2019 |
Block sizes (B) | 4k – 512k |
Thread count (T) | 2 |
Outstanding IOs (O) | 32 |
Write percentages (W) | 0, 20, 50, 100 |
IO patterns (P) | Random, Sequential |
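To illustrate how these parameters translate into load generation, the following is a hypothetical example of the DiskSpd command a single fleet VM would run for the 4k, 100% random-read case (labeled B4-T2-O32-W0-PR in Table 4 below). The target file path and duration are illustrative; only the flags map directly to the table above.

```powershell
# DiskSpd flags used for the B4-T2-O32-W0-PR scenario:
#   -b4K  block size            -t2   threads per target
#   -o32  outstanding IOs       -w0   write percentage (0 = 100% reads)
#   -r    random IO pattern     -d300 test duration in seconds
#   -Sh   disable software caching and hardware write caching
#   -L    capture latency statistics
.\diskspd.exe -b4K -t2 -o32 -w0 -r -d300 -Sh -L C:\run\testfile.dat
```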
We first selected DiskSpd scenarios aimed at identifying the maximum IOPS, latency, and throughput thresholds of the cluster. By pushing the limits of the storage subsystem, we confirmed that the networking, compute, operating systems, and virtualization layer were configured correctly according to our Deployment Guide and Network Integration and Host Network Configuration Options guide. This also ensured that no misconfiguration occurred during initial deployment that could skew the real-world storage performance results. Our results are depicted in Table 4.
Table 4 Maximums test results
Scenario | Parameter Values Explained | Performance Metric |
B4-T2-O32-W0-PR | Block size: 4k Thread count: 2 Outstanding IO: 32 IO pattern: 100% random read | IOPS: 1,146,948 Read latency: 245 microseconds CPU utilization: 48% |
B4-T2-O32-W100-PR | Block size: 4k Thread count: 2 Outstanding IO: 32 IO pattern: 100% random write | IOPS: 417,591 Write latency: 4 milliseconds CPU utilization: 25% |
B512-T2-O2-W0-PSI | Block size: 512k Thread count: 2 Outstanding IO: 8 IO pattern: 100% sequential read | Throughput: 12 GB/s |
B512-T2-O2-W100-PSI | Block size: 512k Thread count: 2 Outstanding IO: 8 IO pattern: 100% sequential write | Throughput: 6 GB/s |
We then stressed the storage subsystem using IO patterns more reflective of the types of workloads found in a ROBO use case. These applications are typically characterized by smaller block sizes, random IO patterns, and a variety of read/write ratios. Examples include general enterprise and small office line-of-business (LOB) applications and online transaction processing (OLTP) workloads. The testing results in Figure 2 below indicate that the cluster has the potential to accelerate OLTP workloads and make enterprise applications highly responsive to end users.
Figure 2 Performance results with smaller block sizes
Other services like backups, streaming video, and large dataset scans have larger block sizes and sequential IO patterns. With these workloads, throughput becomes the key performance indicator to analyze. The results shown in the following graph indicate an impressive sustained throughput that can greatly benefit this category of IT services and applications.
Figure 3 Performance results with larger block sizes
Optional modifications to the architecture
Customers could modify the lab configuration to accommodate different requirements in the ROBO use case. For example, Dell Technologies fully supports a dual-link full mesh topology for Azure Stack HCI. This non-converged, storage-switchless topology eliminates the need for network switches for storage communications and enables you to use existing infrastructure for management and VM traffic. This approach results in similar or better performance than the metrics reported in this blog because of the 2 x 25GbE direct connections between the nodes and the isolation of storage traffic on these dedicated links.
Figure 4 Two-node back-to-back architecture option
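If you adopt the switchless topology, the direct-attached storage links on each node still need IP addressing and RDMA enabled. The following sketch uses illustrative adapter names and subnets; your environment and the deployment guide dictate the actual values.

```powershell
# Assign point-to-point addresses to the two direct-attached storage links
# (adapter names and subnets are examples only), then enable RDMA on them.
New-NetIPAddress -InterfaceAlias "Storage1" -IPAddress 192.168.101.1 -PrefixLength 24
New-NetIPAddress -InterfaceAlias "Storage2" -IPAddress 192.168.102.1 -PrefixLength 24
Enable-NetAdapterRdma -Name "Storage1", "Storage2"
```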
There may be situations in ROBO scenarios where there are no IT generalists near the site to address hardware failures. When a drive or an entire node fails, it may take days or weeks before someone can service the nodes and return the cluster to full functionality. In these cases, consider nested resiliency instead of two-way mirroring to handle multiple failures on a two-node cluster. Nested resiliency is inspired by RAID 5+1 technology; workloads remain online and accessible even in the following circumstances:
Figure 5 Nested resiliency option
Be aware that there is a capacity efficiency cost when using nested resiliency. Two-way mirroring is 50% efficient, meaning 1 TB of data takes up 2 TB of physical storage capacity. Depending on the type of nested resiliency you configure, capacity efficiency ranges between 25% and 40%. For example, a nested two-way mirror keeps four copies of the data, so 1 TB of data consumes 4 TB of physical capacity (25% efficiency). Therefore, ensure you have an adequate amount of raw storage capacity if you intend to use this technology. Performance is also affected when using nested resiliency, especially for workloads with a higher percentage of write IO, because more copies of the data must be maintained on the cluster.
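If you opt for nested resiliency, volumes are created from a storage tier template that specifies the additional data copies. The sketch below shows the nested two-way mirror case; the pool wildcard, tier name, volume name, and size are illustrative placeholders.

```powershell
# Define a nested two-way mirror tier (four data copies spread across the two nodes),
# then create a volume on that tier.
New-StorageTier -StoragePoolFriendlyName "S2D*" `
                -FriendlyName "NestedMirror" `
                -ResiliencySettingName Mirror `
                -MediaType SSD `
                -NumberOfDataCopies 4

New-Volume -StoragePoolFriendlyName "S2D*" `
           -FriendlyName "Volume02" `
           -StorageTierFriendlyNames "NestedMirror" `
           -StorageTierSizes 500GB
```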
If you need greater flexibility in cluster resources, Dell Technologies offers Azure Stack HCI configurations to meet any workload profile and business requirement. The table below shows the different resource options available for each AX node. For more detailed specifications about these configurations, review our product page.
Table 5 Azure Stack HCI configuration options
Visit our website for more details on Dell EMC Solutions for Azure Stack HCI.
Related Blog Posts

Boost Performance on Dell EMC Solutions for Microsoft Azure Stack HCI using Intel Optane Persistent Memory
Tue, 14 Jul 2020 13:09:24 -0000
Modern IT applications have a broad range of performance requirements. Some of the most demanding applications use Online Transaction Processing (OLTP) database technology. Typical organizations have many mission-critical business services reliant on workloads powered by these databases. Examples of such services include online banking in the financial sector and online shopping in the retail sector. If the response time of these systems is slow, customers will likely suffer a poor user experience and may take their business to competitors. Dissatisfied customers may also express their frustration through social media outlets, resulting in incalculable damage to a company’s reputation.
The challenge in maintaining an exceptional consumer experience is providing databases with performant infrastructure while also balancing capacity and cost. Traditionally, there have been few cost-effective options for caching database workloads, which would greatly improve end-user response times. Intel Optane persistent memory (Intel Optane PM) offers an innovative path to accelerating database workloads. Intel Optane PM performs almost as well as DRAM, and the data is preserved after a power cycle. We were interested in quantifying these claims in our labs with Dell EMC Solutions for Microsoft Azure Stack HCI.
Azure Stack HCI running Microsoft Windows Server 2019 provides industry-leading virtual machine performance with Microsoft Hyper-V and Microsoft Storage Spaces Direct technology. The platform supports Non-Volatile Memory Express (NVMe), Intel Optane PM, and Remote Direct Memory Access (RDMA) networking. Azure Stack HCI is a fully productized, validated, and supported HCI solution that enables enterprises to modernize their infrastructure for improved application uptime and performance, simplified management and operations, and lower total cost of ownership. AX nodes from Dell EMC, powered by industry-leading PowerEdge server platforms, offer a high-performance, scalable, and secure foundation on which to build a software-defined infrastructure.
In our lab testing, we wanted to observe the impact on performance when Intel Optane PM was added as a caching tier to an Azure Stack HCI cluster. We set up two clusters to compare. One cluster was configured as a two-tier storage subsystem with Intel Optane PM in the caching tier and SATA read-intensive SSDs in the capacity tier. We inserted 12 x 128 GB Intel Optane PM modules into each node of this cluster, for a total of 1.5 TB of persistent memory per node. The other cluster’s storage subsystem was configured as a single tier of SATA read-intensive SSDs. With respect to CPU selection, memory, and Ethernet adapters, the two clusters were configured identically.
Only the Dell EMC AX-640 nodes currently accommodate Intel Optane PM. The clusters were configured as follows:
Cluster Resources | Without Intel Optane PM | With Intel Optane PM |
Number of nodes | 4 | 4 |
CPU | 2 x Intel Xeon Gold 6248 CPU @ 2.50 GHz (3.90 GHz with Turbo Boost) | 2 x Intel Xeon Gold 6248 CPU @ 2.50 GHz (3.90 GHz with Turbo Boost) |
Memory | 384 GB RAM | 384 GB RAM |
Disks | 10 x 2.5 in. 1.92 TB Intel S4510 RI SATA SSD | 10 x 2.5 in. 1.92 TB Intel S4510 RI SATA SSD |
NICs | Mellanox ConnectX-5 EX Dual Port 100 GbE | Mellanox ConnectX-5 EX Dual Port 100 GbE |
Persistent memory | None | 12 x 128 GB Intel Optane PM per node |
Volumes were created using three-way mirroring for the best balance between performance and resiliency. Three-way mirroring protects data by enabling the cluster to safely tolerate two hardware failures. For example, data on a volume would be successfully preserved even after the simultaneous loss of an entire node and a drive in another node.
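For reference, a three-way mirrored volume on this four-node cluster can be created and verified with something like the following sketch; names and sizes are placeholders. On clusters with four or more nodes, the Mirror resiliency setting defaults to a three-way mirror.

```powershell
# Create a three-way mirrored, ReFS-formatted Cluster Shared Volume
# (Mirror defaults to three data copies on a four-node cluster).
New-Volume -StoragePoolFriendlyName "S2D*" `
           -FriendlyName "MirrorVol01" `
           -FileSystem CSVFS_ReFS `
           -ResiliencySettingName Mirror `
           -Size 4TB

# Confirm the resiliency setting and number of data copies.
Get-VirtualDisk | Select-Object FriendlyName, ResiliencySettingName, NumberOfDataCopies
```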
Intel Optane PM has two operating modes: Memory Mode and App Direct Mode. Our tests used App Direct Mode. In App Direct Mode, the operating system uses Intel Optane PM as persistent memory distinct from DRAM. This mode enables extremely high-performing storage that is byte-addressable, memory coherent, and cache coherent. Cache coherence is important because it ensures that data is a uniformly shared resource across all nodes. In the four-node Azure Stack HCI cluster, cache coherence ensured that when data was read or written from one node, the same data was available across all nodes.
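As a quick sanity check that the persistent memory modules are visible to the operating system and serving as the cache tier, something like the following can be run on each node. This is a sketch; the exact output depends on the configuration.

```powershell
# List the physical persistent memory modules recognized by Windows Server 2019.
Get-PmemPhysicalDevice

# In Storage Spaces Direct, persistent memory is reported with MediaType "SCM";
# cache devices show Usage "Journal" while capacity SSDs show "Auto-Select".
Get-PhysicalDisk | Group-Object MediaType, Usage -NoElement | Sort-Object Count
```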
VMFleet is a storage load generation tool designed to perform I/O and capture performance metrics for Microsoft failover clusters. For the small block test, we used VMFleet to generate 100 percent reads at a 4K block size. The baseline configuration without Intel Optane PM sustained 2,103,412 IOPS at 1.5-millisecond (ms) average read latency. These baseline performance metrics demonstrated outstanding performance. However, OLTP databases target 1 ms or less latency for reads.
Comparatively, the Intel Optane PM cluster delivered 43 percent higher IOPS and 53 percent lower average read latency. Overall, this cluster sustained slightly over 3 million IOPS at 0.7 ms average latency. Benefits include:
- Significant performance improvement in IOPS means transactional databases and similar workloads will improve in scalability.
- Applications reading from storage will receive data faster, thus improving transactional response times.
- Intel Optane PM coherent cache provides substantial performance benefits without sacrificing availability.
When exploring storage responsiveness, testing large block read and write requests is also important. Data warehouses and decision-support systems are examples of workloads that read larger blocks of data. For this testing, we used 512 KB block sizes and sequential reads as part of the VMFleet testing. This test provided insight into the ability of Intel Optane PM cache to improve storage system throughput.
The cluster populated with Intel Optane PM was 109% faster than the baseline system. Our comparisons of 512 KB sequential reads found total throughput of 11 GB/s for the system without Intel Optane PM and 23 GB/s for the system with Intel Optane PM caching. Benefits include:
- Greater throughput enables faster scans of data for data warehouse systems, decision-support systems, and similar workloads.
- The benefit to the business is faster reporting and analytics.
- Intel Optane PM coherent cache provides substantial throughput benefits without sacrificing availability.
Overall, the VMFleet tests were impressive. Both Azure Stack HCI configurations had 40 SSDs across the four nodes, for approximately 76 TB of performant storage. Accelerating the entire cluster required 12 x 128 GB Intel Optane PM modules per server, for a total of 48 modules (about 6 TB of persistent memory) across the four nodes. Test results show that both OLTP and data-warehouse type workloads would exhibit significant performance improvements.
Testing 100 percent reads of 4K blocks showed:
- 43 percent performance improvement in IOPS.
- 53 percent decrease in average read latency.
- Improved scaling and faster transaction processing. Overall, application performance would be significantly accelerated, improving end-user experience.
Testing 512 KB sequential reads showed:
- 109 percent increased throughput.
- Faster reporting and faster time to analytics and data insights.
The configuration presented in this lab testing scenario will not be appropriate for every application. Any Azure Stack HCI solution must be properly scoped and sized to meet or exceed the performance and capacity requirements of its intended workloads. Work with your Dell Technologies account team to ensure that your system is correctly configured for today’s business challenges and ready for expansion in the future. To learn more about our solutions for Azure Stack HCI, visit the Dell EMC Solutions for Microsoft Azure Stack HCI InfoHub.

HPC Application Performance on Dell EMC PowerEdge R7525 Servers with the AMD MI100 Accelerator
Tue, 15 Dec 2020 14:23:27 -0000
Overview
The Dell EMC PowerEdge R7525 server supports the AMD MI100 GPU Accelerator. The server is a two-socket, 2U rack-based server that is designed to run complex workloads using highly scalable memory, I/O capacity, and network options. The system is based on the 2nd Gen AMD EPYC processor (up to 64 cores), has up to 32 DIMMs, and has PCI Express (PCIe) 4.0-enabled expansion slots. The server supports SATA, SAS, and NVMe drives and up to three double-wide 300 W accelerators.
The following figure shows the front view of the server:
Figure 1. Dell EMC PowerEdge R7525 server
The AMD Instinct™ MI100 accelerator is one of the world’s fastest HPC GPUs on the market. It offers innovations that deliver higher performance for HPC applications through the following key technologies:
- AMD Compute DNA (CDNA)—Architecture optimized for compute-oriented workloads
- AMD ROCm—An Open Software Platform that includes GPU drivers, compilers, profilers, math and communication libraries, and system resource management tools
- Heterogeneous-Computing Interface for Portability (HIP)—An interface that enables developers to convert CUDA code to portable C++ so that the same source code can run on AMD GPUs
This blog focuses on the performance characteristics of a single PowerEdge R7525 server with AMD MI100-32G GPUs. We present results from the general matrix multiplication (GEMM) microbenchmarks, the LAMMPS benchmarks, and the NAMD benchmarks to showcase performance and scalability.
The following table provides the configuration details of the PowerEdge R7525 system under test (SUT):
Table 1. SUT hardware and software configurations
Component | Description |
Processor | AMD EPYC 7502 32-core processor |
Memory | 512 GB (32 GB 3200 MT/s * 16) |
Local disk | 2 x 1.8 TB SSD (No RAID) |
Operating system | Red Hat Enterprise Linux Server 8.2 |
GPU | 3 x AMD MI100-PCIe-32G |
Driver version | 3204 |
ROCm version | 3.9 |
Processor Settings > Logical Processors | Disabled |
System profiles | Performance |
NUMA node per socket | 4 |
NAMD benchmark | Version: NAMD 3.0 ALPHA 6 |
LAMMPS (KOKKOS) benchmark | Version: LAMMPS patch_18Sep2020+AMD patches |
The following table lists the AMD MI100 GPU specifications:
Table 2. AMD MI100 PCIe GPU specification
Component | |
GPU architecture | MI100 |
Peak Engine Clock (MHz) | 1502 |
Stream processors | 7680 |
Peak FP64 (TFLOPS) | 11.5 |
Peak FP64 Tensor DGEMM (TFLOPS) | 11.5 |
Peak FP32 (TFLOPS) | 23.1 |
Peak FP32 Tensor SGEMM (TFLOPS) | 46.1 |
Memory size (GB) | 32 |
Memory ECC support | Yes |
TDP (Watt) | 300 |
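For context, the peak figures in Table 2 follow from the stream-processor count and the peak engine clock, assuming one fused multiply-add (two floating-point operations) per stream processor per clock, with vector FP64 at half the FP32 rate and the matrix (SGEMM) path at twice the FP32 rate:

$$\text{Peak FP32} \approx 7680 \times 2 \times 1.502\,\text{GHz} \approx 23.1\ \text{TFLOPS}$$
$$\text{Peak FP64} \approx \tfrac{1}{2} \times 23.1 \approx 11.5\ \text{TFLOPS},\qquad \text{Peak FP32 Matrix} \approx 2 \times 23.1 \approx 46.1\ \text{TFLOPS}$$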
GEMM Microbenchmarks
The GEMM benchmark is a simple, multithreaded dense matrix-matrix multiplication benchmark that can be used to test the performance of GEMM on a single GPU. The rocblas-bench binary compiled from https://github.com/ROCmSoftwarePlatform/rocBLAS was used to collect DGEMM and SGEMM results. These results reflect the performance of an ideal application that runs nothing but matrix multiplication, expressed as the sustained TFLOPS the GPU can deliver. Although GEMM benchmark results might not represent real-world application performance, they are still a good way to demonstrate the performance capability of different GPUs.
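The TFLOPS figures that rocblas-bench reports are computed by counting the floating-point operations in the multiply and dividing by the measured kernel time; for an m-by-n-by-k GEMM the conventional count is:

$$\text{TFLOPS} = \frac{2\,m\,n\,k}{t_{\text{kernel}} \times 10^{12}}$$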
The following figure shows the observed numbers of DGEMM and SGEMM:
Figure 2. DGEMM and SGEMM for both AMD MI100 peak and AMD-PCIe sustained
The results indicate:
- In the DGEMM (double-precision GEMM) benchmark, the theoretical peak performance of the AMD MI100 GPU is 11.5 TFLOPS and the measured sustained performance is 7.9 TFLOPS. As shown in Table 2, the standard double-precision (FP64) theoretical peak and the FP64 tensor DGEMM peak performance are both 11.5 TFLOPS. Because most real-world HPC applications are not dominated by DGEMM or other matrix operations, this high standard FP64 capability also boosts performance for non-matrix double-precision math.
- For FP32 Tensor operations in the SGEMM (single-precision GEMM) benchmark, the theoretical peak performance of the AMD MI100 GPU is 46.1 TFLOPS, and the measured sustained performance is approximately 30 TFLOPS.
The LAMMPS Benchmark
The Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) runs threads in parallel using message-passing techniques. This benchmark measures the scalability and performance of large, parallel systems across multiple GPUs.
The following figure shows that the KOKKOS implementation of LAMMPS scaled relatively linearly as AMD MI100 GPUs were added, across four datasets: EAM, LJ, Tersoff, and ReaxFF/C.
Figure 3. LAMMPS benchmark showing scaling of multiple AMD MI100 GPUs
The NAMD Benchmark
Nanoscale Molecular Dynamics (NAMD) is a parallel molecular dynamics application designed for the simulation of large biomolecular systems. The NAMD benchmark stresses the scaling and performance aspects of the server and GPU configuration.
The following figure plots the results of the NAMD microbenchmark:
Figure 4. NAMD benchmark performance
Because the Alpha builds of the NAMD 3.0 binary do not scale beyond a single accelerator, we report aggregate data across multiple GPU cards. Three replica simulations were launched in parallel on the same server, one on each GPU. NAMD was CPU-bound in previous versions; the new 3.0 version has reduced this CPU dependence. As a result, the three-copy simulation scaled linearly, performing three times faster across all datasets.
To explore the remaining CPU dependence, the NAMD benchmark numbers in the following figure show the relative performance when different numbers of CPU cores are used for the STMV dataset:
Figure 5. CPU core dependency on NAMD
The optimum configuration for the AMD MI100 GPU was four CPU cores per GPU.
Conclusion
The AMD MI100 accelerator delivers industry-leading performance, and it is a well-positioned performance-per-dollar GPU for both FP32 and FP64 HPC parallel codes.
- FP32 applications perform well on the AMD MI100 GPU, as shown by the SGEMM, LAMMPS, and NAMD benchmark results, by using the matrix (tensor) cores and native FP32 compute cores.
- FP64 applications perform well on the AMD MI100 GPU by using its native FP64 compute cores.
Next Steps
In the future, we plan to test other HPC and Deep Learning applications. We also plan to research using “Hipify” tools to port CUDA sources to HIP.