Ready Solution for HPC PixStor Storage Capacity Expansion, HDR100 Update
Tue, 17 Nov 2020 21:43:49 -0000
Introduction
Today’s HPC environments have ever-increasing demands for very high-speed storage that frequently must also provide high capacity and distributed access via standard protocols such as NFS and SMB. These demanding HPC requirements are typically covered by parallel file systems that provide concurrent access to a single file, or to a set of files, from multiple nodes, very efficiently and securely distributing data to multiple LUNs across several servers using the fastest network technology available.
Solution Architecture
This blog is a technology update covering the use of InfiniBand HDR100 on the Dell EMC Ready Solution for HPC PixStor Storage, a Parallel File System (PFS) solution for HPC environments in which PowerVault ME484 EBOD arrays are used to increase the capacity of the solution. Figure 1 presents the reference architecture, depicting the capacity expansion SAS additions to the existing PowerVault ME4084 storage arrays and the replacement of InfiniBand EDR components with HDR100: ConnectX-6 HCAs and QM8700 switches. The PixStor solution includes the widely used General Parallel File System, also known as Spectrum Scale, as the PFS component, in addition to many other ArcaStream software components such as advanced analytics, simplified administration and monitoring, efficient file search, and advanced gateway capabilities.
Figure 1 Reference Architecture
Solution Components
This solution was released with the latest 2nd generation Intel Xeon Scalable CPUs (Cascade Lake), and some of the servers use the fastest RAM available to them (2933 MT/s). However, due to hardware availability during testing, the solution prototype used servers with 1st generation Intel Xeon Scalable CPUs (Skylake) and slower RAM to characterize performance. Since the bottleneck of the solution is at the SAS controllers of the Dell EMC PowerVault ME40x4 arrays, no significant performance disparity is expected once the Skylake CPUs and RAM are replaced with Cascade Lake CPUs and faster RAM. In addition, the solution was updated to the latest version of PixStor (5.1.3.1), which supports RHEL 7.7 and OFED 5.0-2.1.8.
Table 1 lists the main components of the solution, where the first column has the components used at release time (and therefore available to customers) and the last column has the components actually used for characterizing the performance of the solution. The drives listed for data (12TB NLS) and metadata (960GB SSD) are the ones used for performance characterization; faster drives can provide better random IOPS and may improve create/removal metadata operations.
Finally, for completeness, the list of possible data HDDs and metadata SSDs was included, which is based on the drives supported as enumerated on the DellEMC PowerVault ME4 support matrix, available online.
Table 1 Components to be used at release time and those used in the test bed
Solution Component | Released | Test Bed
Internal Mgmt Connectivity | Dell Networking S3048-ON Gigabit Ethernet
Data Storage Subsystem | 1x to 4x Dell EMC PowerVault ME4084
Optional High Demand Metadata Storage Subsystem | 1x to 2x (max 4x) Dell EMC PowerVault ME4024
RAID Storage Controllers | Redundant 12 Gbps SAS
Capacity w/o Expansion | Raw: 4032 TB (3667 TiB or 3.58 PiB) with 12TB HDDs
Capacity w/ Expansion | Raw: 8064 TB (7334 TiB or 7.16 PiB) with 12TB HDDs; Formatted ~6144 TB (5588 TiB or 5.46 PiB)
Processor: Gateway/Ngenea, High Demand Metadata, and Storage Nodes (R740) | 2x Intel Xeon Gold 6230 2.1G, 20C/40T, 10.4GT/s, 27.5M Cache, Turbo, HT (125W), DDR4-2933 | 2x Intel Xeon Gold 6136 @ 3.00GHz, 12 cores
Processor: Management Node (R440) | 2x Intel Xeon Gold 5220 2.2G, 18C/36T, 10.4GT/s, 24.75M Cache, Turbo, HT (125W), DDR4-2666 | 2x Intel Xeon Gold 5118 @ 2.30GHz, 12 cores
Memory: Gateway/Ngenea, High Demand Metadata, and Storage Nodes (R740) | 12x 16GiB 2933 MT/s RDIMMs (192 GiB) | 24x 16GiB 2666 MT/s RDIMMs (384 GiB)
Memory: Management Node (R440) | 12x 16GB 2666 MT/s RDIMMs (192 GiB) | 12x 8GiB 2666 MT/s RDIMMs (96 GiB)
Operating System | CentOS 7.7
Kernel version | 3.10.0-1062.12.1.el7.x86_64
PixStor Software | 5.1.3.1
OFED Version | Mellanox OFED 5.0-2.1.8
High Performance Network Connectivity | Mellanox ConnectX-6 Dual-Port InfiniBand VPI HDR100/100 GbE, and 10 GbE
High Performance Switch | 2x Mellanox QM8700 (HA – Redundant)
Local Disks (OS & Analysis/monitoring) | All servers except Management node: 3x 480GB SSD SAS3 (RAID1 + HS) for OS, PERC H730P RAID controller. Management node: 3x 480GB SSD SAS3 (RAID1 + HS) for OS & Analysis/Monitoring, PERC H740P RAID controller | All servers except Management node: 2x 300GB 15K SAS3 (RAID 1) for OS, PERC H330 RAID controller. Management node: 5x 300GB 15K SAS3 (RAID 5) for OS & Analysis/Monitoring, PERC H740P RAID controller
Systems Management | iDRAC 9 Enterprise + Dell EMC OpenManage
Performance Characterization
To characterize this Ready Solution, we used the hardware specified in the last column of Table 1, including the optional High Demand Metadata Module. In order to assess the solution performance, the following benchmarks were used:
- IOzone N to N sequential
- IOzone random
- IOR N to 1 sequential
- MDtest
For all benchmarks listed above, the test bed had the clients described in Table 2 below. Since only 16 compute nodes were available for testing, when a higher number of threads was required, those threads were distributed equally across the compute nodes (that is, 32 threads = 2 threads per node, 64 threads = 4 threads per node, 128 threads = 8 threads per node, 256 threads = 16 threads per node, 512 threads = 32 threads per node, 1024 threads = 64 threads per node). The intention was to simulate a higher number of concurrent clients with the limited number of compute nodes available. Since the benchmarks support a high number of threads, a maximum of up to 1024 threads was used (specified for each test), while avoiding excessive context switching and other related side effects that can affect the performance results.
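As an illustration of that round-robin placement, the following minimal bash sketch builds a thread-to-node list; the host names compute-01 through compute-16 are placeholders, and the file produced here is only a plain node list (the exact formats expected by IOzone's -+m option and by the MPI hostfiles differ slightly):
#!/bin/bash
# Minimal illustration of the round-robin thread placement described above.
THREADS=${1:-64}        # total number of threads requested
NODES=16                # compute nodes available in the test bed
: > threadlist
for ((t=0; t<THREADS; t++)); do
  node=$(( (t % NODES) + 1 ))         # cycle through the nodes
  printf "compute-%02d\n" "$node" >> threadlist
done
# Result: 32 threads -> 2 per node, 64 -> 4 per node, 128 -> 8 per node, and so on.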
Table 2 Client test bed
Number of Client nodes | 16 |
Client node | C6320 |
Processors per client node | 2x Intel(R) Xeon(R) E5-2697 v4, 18 cores @ 2.30GHz |
Memory per client node | 12 x 16GiB 2400 MT/s RDIMMs |
High Performance Adapter | Mellanox ConnectX-4 InfiniBand VPI |
Operating System | CentOS 7.6 |
OS Kernel | 3.10.0-957.10.1 |
PixStor Software | 5.1.3.1 |
OFED Version | Mellanox OFED 5.0-1.0.0 |
Sequential IOzone Performance N clients to N files
Sequential N clients to N files performance was measured with IOzone version 3.487. Tests executed varied from a single thread up to 512 threads on the capacity-expanded solution (4x ME4084s + 4x ME484s); results from the EDR testing are contrasted with the HDR100 update.
Caching effects were minimized by setting the file system tunable pagepool to 8 GiB on the clients and 24 GiB on the servers, and by using files totaling twice the memory size of the clients or the servers (whichever value is larger). It is important to note that, for this file system, the pagepool tunable sets the maximum amount of memory used for caching data, regardless of the amount of RAM installed and free. It is also important to note that, while in previous Dell EMC HPC solutions the block size for large sequential transfers was 1 MiB, this file system was formatted with 8 MiB blocks, and therefore that value is used in the benchmark for optimal performance. An 8 MiB block may seem too large and likely to waste space, but the file system uses subblock allocation to prevent that situation. In the current configuration, each block is subdivided into 256 subblocks of 32 KiB each.
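For reference, the following illustrative Spectrum Scale (GPFS) commands show how settings like these are typically applied and verified; the file system name gpfs0 and the node class names are assumptions, not taken from the tested configuration:
# Illustrative Spectrum Scale commands; names are placeholders.
mmchconfig pagepool=8G -N clientNodeClass     # clients: 8 GiB page pool
mmchconfig pagepool=24G -N serverNodeClass    # servers: 24 GiB page pool
mmlsfs gpfs0 -B                               # block size (8 MiB in this configuration)
mmlsfs gpfs0 -f                               # minimum fragment (subblock) size (32 KiB here)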
The following commands were used to execute the benchmark for writes and reads, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and threadlist was the file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.
./iozone -i0 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist
./iozone -i1 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist
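A minimal wrapper of the kind that could drive these runs is sketched below, assuming the threadlist file is already in place; it simply sweeps the thread count in powers of two as described above:
# Simple sweep over the thread counts described above.
for Threads in 1 2 4 8 16 32 64 128 256 512; do
  ./iozone -i0 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist   # sequential writes
  ./iozone -i1 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist   # sequential reads
done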
Figure 2 N to N Sequential Performance
From the results we can observe that performance rises fast with the number of clients used and then reaches a plateau that is fairly stable up to the maximum number of threads that IOzone allows; therefore, large file sequential performance is stable except at 512 concurrent threads (about 8% lower). The maximum read performance of 23.8 GB/s at 32 threads was still limited by the bandwidth of the two IB HDR100 links used on the storage nodes, starting at 8 threads. Read performance at 4 threads is considerably lower, and at high thread counts it is slightly lower than with EDR (less than 5%), but the results are reproducible. Since the sequential N to 1 test using IOR uses the same data size and similar parameters but on a single file (adding locking overhead), the big drop in read performance at 4 threads (and, to a much smaller degree, at high thread counts) may be due to IOzone using calls that work less efficiently than the IOR calls, but more work is needed to find the reason for the different behavior.
The highest write performance of 21 GB/s was achieved at 512 threads. It is important to remember that, for the PixStor file system, the preferred mode of operation is scatter, and the solution was formatted to use that mode. In this mode, blocks are allocated from the very beginning of operation in a pseudo-random fashion, spreading data across the whole surface of each HDD. While the obvious disadvantage is a lower initial maximum performance, that performance is maintained fairly constant regardless of how much space is used on the file system. That is in contrast to other parallel file systems that initially use the outer tracks, which can hold more data (sectors) per disk revolution and therefore provide the highest possible performance the HDDs can deliver, but as the system uses more space, inner tracks with less data per revolution are used, with a consequent reduction in performance.
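As an illustrative aside, the block allocation mode in Spectrum Scale is selected when the file system is created and can be listed afterwards; the device name gpfs0 below is an assumption:
mmlsfs gpfs0 -j      # block allocation type (cluster or scatter); gpfs0 is a placeholder name
# When creating the file system, the mode is chosen explicitly, for example:
# mmcrfs gpfs0 -F stanzafile -B 8M -j scatter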
Sequential IOR Performance N clients to 1 file
Sequential N clients to a single shared file performance was measured with IOR version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from a single thread up to 512 threads (since there were not enough cores for 1024 threads), and results are contrasted with the solution without the capacity expansion.
Caching effects were minimized by setting the file system pagepool tunable to 8 GiB on the clients and 24 GiB on the servers, and by using a total data size larger than twice the total memory size of the clients or the servers (whichever is larger). This benchmark used 8 MiB transfers for optimal performance; the previous performance test section has a more complete explanation of those settings.
The following commands were used to execute the benchmark for writes and reads, where Threads was the variable with the number of threads used (1 to 512, incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -w -s 1 -t 8m -b 128G
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -r -s 1 -t 8m -b 128G
Figure 3 N to 1 Sequential Performance
Performance rises fast with the number of clients used and then reaches a plateau that is fairly stable for reads and writes from 8 threads all the way to the maximum number of threads used in this test. The maximum read performance was 24.2 GB/s at 32 threads, where the bottleneck was the pair of InfiniBand HDR100 interfaces on the storage nodes, operating near their aggregate line speed. Similarly, notice that the maximum write performance of 19.9 GB/s was reached at 16 threads. An important data point is at 4 threads: even though this test uses the same data size and similar parameters as IOzone, with the extra burden of locking a single shared file, no drop in performance is observed at that point as there was for IOzone.
Random small blocks IOzone Performance N clients to N files
Random N clients to N files performance was measured with IOzone 3.487. Tests executed varied from 16 threads up to 512 threads, since there were not enough client cores for 1024 threads. Lower thread counts were not tested at this time because they take a very long time to execute, IOzone does not report results until the test completes in its entirety, and the most important information tends to be the maximum IOPS the solution can provide. Each thread used a different file, and the threads were assigned to the client nodes in a round-robin fashion. This benchmark used 4 KiB blocks to emulate small-block traffic.
Caching effects were minimized by setting the file system pagepool tunable to 8 GiB on the clients and 24 GiB on the servers, and by using a total data size larger than twice the total pagepool size of the clients or the servers (whichever is larger). It is important to note that the pagepool tunable sets the maximum amount of memory used by the file system for caching data, regardless of the amount of RAM installed and free.
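The exact IOzone invocation is not listed in this section; a plausible sketch, assuming the same threadlist mechanism as the sequential runs, would use IOzone's random read/write mode (-i 2) with 4 KiB records and report results in operations per second (-O):
# Hypothetical invocation; file layout assumed to exist from a previous -i0 pass,
# and the per-thread file size (-s) is only a placeholder.
./iozone -i2 -O -c -e -w -r 4K -s 16G -t $Threads -+n -+m ./threadlist   # random read/write, 4 KiB records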
Figure 4 N to N Random Performance
From the results we can observe that write performance starts at a high value of 23.4K IOPS, remains under 25K IOPS steadily up to 128 threads, and peaks at 27.4K IOPS at 256 threads. Read performance, on the other hand, starts at 1.3K IOPS and increases almost linearly with the number of threads used (keep in mind that the number of threads is doubled for each data point), reaching the maximum performance of 33.8K IOPS at 512 threads. Using more threads would require more than the 16 compute nodes, or more cores per node, to avoid loss of performance due to process context switching, data locality, and similar effects. ME4 arrays require higher IO pressure (queue or IO depth) to reach their maximum random IOPS, so this test shows a lower apparent performance; the arrays could in fact deliver more IOPS when using tests like FIO that can control the IO depth per process.
Metadata performance with MDtest using empty files
Metadata performance was measured with MDtest version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from a single thread up to 512 threads. The benchmark was used for files only (no directory metadata), getting the number of creates, stats, and removes the solution can handle, and results were contrasted with the previous EDR results.
To properly evaluate the solution in comparison to other Dell EMC HPC storage solutions and to the previous blog results, the optional High Demand Metadata Module was used, but with a single ME4024 array, even though the large configuration tested in this work is designated to have two ME4024s.
This High Demand Metadata Module can support up to four ME4024 arrays, and it is suggested to increase the number of ME4024 arrays to four before adding another metadata module. Each additional ME4024 array is expected to increase metadata performance, except perhaps for Stat operations: since those IOPS numbers are already very high, at some point the CPUs will become a bottleneck and performance will not continue to increase linearly.
The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes. Similar to the Random IO benchmark, the maximum number of threads was limited to 512, since there are not enough cores for 1024 threads and context switching would affect the results, reporting a number lower than the real performance of the solution.
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F
Since performance results can be affected by the total number of IOPS, the number of files per directory, and the number of threads, it was decided to keep the total number of files fixed at 2 Mi files (2^21 = 2,097,152), the number of files per directory fixed at 1024, and to vary the number of directories as the number of threads changed, as shown in Table 3.
Table 3 MDtest distribution of files on directories
Number of Threads | Number of directories per thread | Total number of files |
1 | 2048 | 2,097,152 |
2 | 1024 | 2,097,152 |
4 | 512 | 2,097,152 |
8 | 256 | 2,097,152 |
16 | 128 | 2,097,152 |
32 | 64 | 2,097,152 |
64 | 32 | 2,097,152 |
128 | 16 | 2,097,152 |
256 | 8 | 2,097,152 |
512 | 4 | 2,097,152 |
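The $Directories value used in the mdtest command above follows directly from those fixed totals; a one-line sketch of the arithmetic is shown here for clarity:
# 2^21 files in total and 1024 files per directory give 2048 directories overall,
# split evenly across the threads:
Directories=$(( 2097152 / 1024 / Threads ))   # e.g. Threads=16 -> 128 directories per thread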
Figure 5 Metadata Performance - Empty Files
First, note that the scale chosen was logarithmic with base 10, to allow comparing operations that differ by several orders of magnitude; otherwise, some of the operations would look like a flat line close to 0 on a linear graph. A log graph with base 2 could be more appropriate, since the number of threads is increased in powers of 2, but the graph would look very similar, and people tend to handle and remember numbers based on powers of 10 more easily.
The system gets very good results for Stat operations, reaching their peak value of 6M op/s at 256 threads. Removal operations attained their maximum of 189.7K op/s at 32 threads, and Create operations achieved their peak of 266.8K op/s at 512 threads. Stat operations have more variability, but once they reach their peak value, performance does not drop below 3M op/s. Create and Removal are more stable once they reach a plateau, remaining above 160K op/s for Removal and 128K op/s for Create.
Metadata performance with MDtest using 4 KiB files
This test is almost identical to the previous one, except that instead of empty files, small files of 4KiB were used.
The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F -w 4K -e 4K
Figure 6 Metadata Performance - Small files (4K)
The system gets very good results for Stat operations, reaching a peak value of almost 4.9M op/s at 512 threads. Removal operations attained their maximum of 442.7K op/s at 128 threads, and Create operations achieved their peak of 75K op/s at 512 threads, apparently without reaching a plateau yet. Stat and Removal operations have more variability, but once they reach their peak value, performance does not drop below 3.5M op/s for Stats and 315K op/s for Removals. Create and Read have less variability and keep increasing as the number of threads grows.
Since these numbers are for a metadata module with a single ME4024, performance will increase with each additional ME4024 array; however, we cannot simply assume a linear increase for each additional array. Unless the whole file fits inside its inode, the data targets on the ME4084s will be used to store part of each 4K file, limiting performance to some degree. Since the inode size is 4 KiB and it still needs to store metadata, only files of around 3 KiB will fit inside an inode, and any file bigger than that will use data targets.
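As an illustrative check (the device name gpfs0 is an assumption), the inode size of a Spectrum Scale file system can be listed as follows, which is how the 4 KiB figure above would typically be confirmed:
mmlsfs gpfs0 -i      # inode size in bytes (4096 in this configuration); gpfs0 is a placeholder name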
Conclusions and Future Work
The solution delivers performance similar to that observed with the InfiniBand EDR technology. An overview of the HDR100 performance can be seen in Table 4; it is expected to be stable from an empty file system until it is almost full, because of the use of scatter allocation across the whole surface area of all the HDDs. Furthermore, the solution scales in capacity and performance linearly as more storage node modules are added, and a similar performance increase can be expected from the optional High Demand Metadata Module. This solution provides HPC customers with a very reliable parallel file system used by many Top 500 HPC clusters. In addition, it provides exceptional search capabilities and advanced monitoring and management. With the addition of optional gateway nodes, it allows file sharing via ubiquitous standard protocols like NFS and SMB to as many clients as needed. Finally, Ngenea nodes allow efficient access to other cost-effective storage tiers, such as ECS, Isilon enterprise NAS, and cloud solutions, using different protocols.
Table 4 Peak & Sustained Performance
Benchmark | Peak Write | Peak Read | Sustained Write | Sustained Read
Large Sequential N clients to N files | 21.0 GB/s | 23.8 GB/s | 20.5 GB/s | 23.0 GB/s
Large Sequential N clients to single shared file | 19.9 GB/s | 24.2 GB/s | 19.1 GB/s | 23.4 GB/s
Random Small blocks N clients to N files | 27.4K IOps | 33.8K IOps | 23.0K IOps | 33.8K IOps
Metadata Create empty files | 266.8K IOps | 128K IOps
Metadata Stat empty files | 6M IOps | 3M IOps
Metadata Remove empty files | 189.7K IOps | 160K IOps
Metadata Create 4KiB files | 75K IOps | 75K IOps
Metadata Stat 4KiB files | 4.9M IOps | 3.5M IOps
Metadata Remove 4KiB files | 442.7K IOps | 315K IOps
Performance for the gateway nodes was measured and will be reported in a new blog. Finally, high performance NVMe nodes are being tested and results will also be released in a different blog.
Related Blog Posts
Dell Validated Design for HPC BeeGFS High Capacity Storage with 16G Servers
Fri, 03 May 2024 22:02:08 -0000
The Dell Validated Design (DVD) for HPC BeeGFS High Capacity Storage is a fully supported, high-throughput, scale-out, parallel file system storage solution. The system is highly available, and capable of supporting multi-petabyte storage. This blog discusses the solution architecture, how it is tuned for HPC performance, and presents I/O performance using IOZone sequential and MDtest benchmarks.
BeeGFS high-performance storage solutions built on NVMe devices are designed as a scratch storage solution for datasets which are usually not retained beyond the lifetime of the job. For more information, see HPC Scratch Storage with BeeGFS.
Figure 1. DVD for BeeGFS High Capacity - Large Configuration
The system shown in Figure 1 uses a PowerEdge R660 as the management server and entry point into the system. There are two pairs of R760 servers, each in an active-active high availability configuration. One pair serves as Metadata Servers (MDS) connected to a PowerVault ME5024 storage array, which hosts the Metadata Targets (MDTs). The other pair serves as Object Storage Servers (OSS) connected to up to four PowerVault ME5084 storage arrays, which host the Storage Targets (STs) for the BeeGFS filesystem. The ME5084 arrays support hard disk drives (HDDs) of up to 22TB. To scale beyond four ME5084s, additional OSS pairs are needed.
This solution uses Mellanox InfiniBand HDR (200 Gb/s) for the data network on the OSS pair, and InfiniBand HDR-100 (100 Gb/s) for the MDS pair. The clients and servers are connected to the 1U Mellanox Quantum HDR Edge Switch QM8790, which supports up to 80 ports of HDR100 using splitter cables, or 40 ports of unsplit HDR200. Additionally, a single switch can be configured to have any combination of split and unsplit ports.
Hardware and Software Configuration
The following tables describe the hardware specifications and software versions validated for the solution.
Table 1. Testbed configuration
Management Server | 1x Dell PowerEdge R660 |
Metadata Servers (MDS) | 2x Dell PowerEdge R760 |
Object Storage Servers (OSS) | 2x Dell PowerEdge R760 |
Processor | 2x Intel(R) Xeon(R) Gold 6430 |
Memory | 16x 16GB DDR5, 4800MT/s RDIMMs – 256 GB |
InfiniBand HCA | MGMT: 1x Mellanox ConnectX-6 single port HDR-100 MDS: 2x Mellanox ConnectX-6 single port HDR-100 OSS: 2x Mellanox ConnectX-6 single port HDR |
External Storage Controller | MDS: 2x Dell 12Gb/s SAS HBA per R760 server OSS: 4x Dell 12Gb/s SAS HBA per R760 server |
Data Storage Enclosure | Small: 1x Dell PowerVault ME5084; Medium: 2x Dell PowerVault ME5084; Large: 4x Dell PowerVault ME5084. Drive configuration is described in the section Storage Service |
Metadata Storage Enclosure | 1x Dell PowerVault ME5024 enclosure fully populated with 24 drives Drive configuration is described in the section Metadata Service |
RAID controllers | Duplex SAS RAID controllers in the ME5084 and ME5024 enclosures |
Hard Disk Drives | Per ME5084 enclosure: 84x SAS3 drives supported by ME5084 Per ME5024 enclosure: 24x SAS3 SSDs supported by ME5024 |
Operating System | Red Hat Enterprise Linux release 8.6 (Ootpa) |
Kernel Version | 4.18.0-372.9.1.el8.x86_64 |
Mellanox OFED version | MLNX_OFED_LINUX-5.6-2.0.9.0 |
Grafana | 10.3.3-1 |
InfluxDB | 1.8.10-1 |
BeeGFS File System | 7.4.0p1 |
Solution Configuration Details
The BeeGFS architecture consists of four main services:
- Management Service
- Metadata Service
- Storage Service
- Client Service
There is also an optional BeeGFS Monitoring Service.
Except for the client service, which is a kernel module, the management, metadata, and storage services are user space processes. It is possible to run any combination of BeeGFS services (client and server components) together on the same machine, and it is also possible to run multiple instances of any BeeGFS service on the same machine. In the Dell High Capacity configuration of BeeGFS, the metadata servers run the metadata services (one per Metadata Target), as well as the monitoring service and the management service. It is important to note that the management service runs on the metadata server, not on the management server; this ensures that there is a redundant node to fail over to in case of a single server outage. The storage service (only one per server) runs on the storage servers.
Management Service
The beegfs-mgmtd service is handled by the metadata servers (not the management server) and handles registration of resources (storage, metadata, monitoring) and clients. The mgmtd store is initialized as a small partition on the first metadata target. In a healthy system, it will be started on MDS-1.
Metadata Service
The ME5024 storage array used for metadata storage is fully populated with 24x 960 GB SSDs in this evaluation. These drives are configured in 12x linear RAID1 disk groups of two drives each as shown in Figure 2. Each RAID1 group is a metadata target (MDT).
Figure 2. Fully Populated ME5024 array with 12 MDTs
In BeeGFS, each metadata service handles only a single MDT. Since there are 12 MDTs, there are 12 metadata service instances, and each of the two metadata servers runs six instances of the metadata service. The metadata targets are formatted with an ext4 file system (ext4 performs well with small files and small-file operations). Additionally, BeeGFS stores information in extended attributes and directly on the inodes of the file system to optimize performance, both of which work well with the ext4 file system.
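A minimal sketch of how one such MDT might be formatted and registered with the standard BeeGFS setup helper is shown below; the device path, mount point, service ID, and management host name are placeholders, and the actual multi-instance deployment (six services per server) requires per-instance configuration beyond this:
# Format one RAID1 volume as an ext4 metadata target and register it (illustrative names).
mkfs.ext4 /dev/mapper/mdt01
mkdir -p /data/mdt01 && mount /dev/mapper/mdt01 /data/mdt01
/opt/beegfs/sbin/beegfs-setup-meta -p /data/mdt01 -s 1 -m mgmt-host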
Storage Service
The data storage solution evaluated in this blog is distributed across four PowerVault ME5084 arrays, which makes up a large configuration. There are also a medium configuration (comprising two arrays) and a small configuration (consisting of a single array), which are not assessed in this document. Linear RAID-6 disk groups of 10 drives (8+2) each are created on each array, and a single volume using all of the space is created for every disk group. This results in 8 disk groups/volumes per array. Each ME5 array has 84 drives and handles 8 RAID-6 disk groups; the four drives left over are configured as global hot spares across the array. The storage targets are formatted with an XFS filesystem, as it delivers high write throughput and scalability with RAID arrays.
There are a total of 32x RAID-6 volumes across 4x ME5084 in the base configuration shown in Figure 1. Each of these RAID-6 volumes are configured as a Storage Target (ST) for the BeeGFS file system, resulting in a total of 32 STs.
Each ME5084 array has 84 drives, with drives numbered 0-41 in the top drawer and 42-83 in the bottom drawer. In Figure 3, the drives are labeled 1-8 to indicate which RAID-6 disk group each belongs to. The drives marked “S” are configured as global hot spares, which automatically take the place of any failed drive in a disk group.
Figure 3. RAID 6 (8+2) disk group layout on one ME5084
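A minimal sketch of how one such RAID-6 volume might be formatted and added as a storage target follows; the device path, mount point, IDs, and host name are assumptions, and the su/sw values simply mirror the 128k chunk size and eight data disks described in this blog:
# Format one RAID-6 (8+2) volume as XFS and register it as a BeeGFS storage target (illustrative names).
mkfs.xfs -d su=128k,sw=8 -L ost101 /dev/mapper/oss1_vol1
mkdir -p /data/ost101 && mount -o noatime,nodiratime /dev/mapper/oss1_vol1 /data/ost101
/opt/beegfs/sbin/beegfs-setup-storage -p /data/ost101 -s 1 -i 101 -m mgmt-host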
Client Service
The BeeGFS client module is loaded on all the hosts which require access to the BeeGFS file system. When the BeeGFS module is loaded and the beegfs-client service is started, the service mounts the file systems defined in /etc/beegfs/beegfs-mounts.conf instead of the usual approach based on /etc/fstab. With this approach, the beegfs-client service starts, like any other Linux service, through the service startup script and enables the automatic recompilation of the BeeGFS client module after system updates.
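As an illustration, an entry in /etc/beegfs/beegfs-mounts.conf simply pairs a mount point with the client configuration file to use for it; the mount point below is an assumption:
# /etc/beegfs/beegfs-mounts.conf: <mountpoint> <client config file>
/mnt/beegfs /etc/beegfs/beegfs-client.conf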
Monitoring Service
The BeeGFS monitoring service (beegfs-mon.service) collects BeeGFS statistics and provides them to the user, using the time series database InfluxDB. For visualization of data, beegfs-mon-grafana provides predefined Grafana dashboards that can be used out of the box. Figure 4 provides an overview of the main dashboard, showing BeeGFS storage server statistics during a short benchmark run. There are charts that display how much network traffic the servers are handling, how much data is being read and written to the disk, etc.
The monitoring capabilities in this release have been upgraded with Telegraf, allowing users to view system-level metrics such as CPU load (per-core and aggregate), memory usage, process count, and more.
Figure 4. Grafana Dashboard - BeeGFS Storage Server
High Availability
In this release, the high-availability capabilities of the solution have been expanded to improve the resiliency of the filesystem. The high availability software now takes advantage of dual network rings, so that it can communicate via both the management Ethernet interface and InfiniBand. Additionally, the primary quorum of decision-making nodes has been expanded to include all five base servers, instead of the previous three (MGMT, MDS-1, MDS-2).
Performance Evaluation
This section presents the performance characteristics of the DVD for HPC BeeGFS High-Capacity Storage using IOZone sequential and MDTest. More information about performance characterization using IOR N-1, IOZone random, IO500, and StorageBench, will be available in an upcoming white paper.
The storage performance was evaluated using the IOZone benchmark, which measured sequential read and write throughput. Table 2 describes the configuration of the PowerEdge C6420 nodes used as BeeGFS clients for these performance studies. These client nodes use a mix of processors due to available resources; everything else is identical.
Table 2. Client Configuration
Clients | 16x Dell PowerEdge C6420 |
Processor per client | 12 clients: 2x Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz 3 clients: 2x Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz 1 client: 2x Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz |
Memory | 12x 16GB DDR4 2933MT/s DIMMs – 192GB |
Operating System | Red Hat Enterprise Linux release 8.4 (Ootpa) |
Kernel Version | 4.18.0-305.el8.x86_64 |
Interconnect | 1x Mellanox Connect-X 6 Single Port HDR100 Adapter |
OFED Version | MLNX_OFED_LINUX-5.6-2.0.9.0 |
The clients are connected over an HDR-100 network, while the servers are connected with a mix of HDR-100 and HDR. This configuration is described in Table 3.
Table 3. Network Configuration
InfiniBand Switch | QM8790 Mellanox Quantum HDR Edge Switch – 1U with 40x HDR 200Gb/s ports (ports can also be configured into split 2xHDR-100 100Gb/s) |
Management Switch | PowerSwitch N3248TE-ON – 1U with 48x 1Gb is recommended. Due to resource availability, this DVD was tested with Dell PowerConnect 6248 Switch – 1U with 48x 1Gb |
InfiniBand HCA | MGMT: 1x Mellanox ConnectX-6 single port HDR-100 MDS: 2x Mellanox ConnectX-6 single port HDR-100 OSS: 2x Mellanox ConnectX-6 single port HDR Clients: 1x Mellanox ConnectX-6 single port HDR-100 |
Sequential Reads and Writes N-N
The sequential reads and writes were measured using IOZone (v3.492). Tests were conducted with multiple thread counts starting at one thread and increasing in powers of two, up to 1,024 threads. This test is N-N; one file will be generated per thread. The processes were distributed across the 16 physical client nodes in a round-robin manner so that requests were equally distributed with load balancing.
For thread counts two and above, an aggregate file size of 8TB was chosen to minimize the effects of caching from the servers as well as from the BeeGFS clients. For one thread, a file size of 1TB was chosen. Within any given test, the aggregate file size used was equally divided among the number of threads. A record size of 1MiB was used for all runs. The benchmark was split into two commands: a write benchmark and a read benchmark, which are shown below.
iozone -i 0 -c -e -w -C -r 1m -s $Size -t $Thread -+n -+m /path/to/threadlist #write
iozone -i 1 -c -e -C -r 1m -s $Size -t $Thread -+n -+m /path/to/threadlist # read
OS Caches were dropped on the servers and clients between iterations as well as between write and read tests by running the following command:
sync; echo 3 > /proc/sys/vm/drop_caches
Figure 5. N-N Sequential Performance
Figure 5 illustrates the performance scaling of read and write operations. The peak read throughput of 34.55 GB/s is achieved with 512 threads, while the peak write throughput of 31.89 GB/s is reached with 1024 threads. The performance shows a steady increase as the number of threads increases, with read performance beginning to stabilize at 32 threads and write performance at 64 threads. Beyond these points, the performance increases at a gradual rate, maintaining an approximate throughput of 34 GB/s for reads and 31 GB/s for writes. Regardless of the thread count, read operations consistently match or exceed write operations.
Tuning Parameters
The following tuning parameters were in place while carrying out the performance characterization of the solution.
The default stripe count for BeeGFS is 4; however, the chunk size and number of targets per file (stripe count) can be configured on a per-directory or per-file basis. For all of these tests, the BeeGFS stripe size was set to 1MiB and the stripe count set to 1, as shown below:
[root@node025 ~]# beegfs-ctl --getentryinfo --mount=/mnt/beegfs/ /mnt/beegfs/stripe1 --verbose
Entry type: directory
EntryID: 0-65D529C1-7
ParentID: root
Metadata node: mds-1-numa1-2 [ID: 5]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 1M
+ Number of storage targets: desired: 1
+ Storage Pool: 1 (Default)
Inode hash path: 4E/25/0-65D529C1-7
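For reference, a directory's stripe settings like the ones shown above are typically applied with beegfs-ctl's setpattern mode; a minimal sketch using the directory path from the output above:
# Apply a 1 MiB chunk size and a single storage target per file to this directory.
beegfs-ctl --setpattern --chunksize=1m --numtargets=1 /mnt/beegfs/stripe1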
Transparent huge pages were disabled, and the following virtual memory settings were configured on the metadata and storage servers:
/proc/sys/vm/dirty_background_ratio = 5
/proc/sys/vm/dirty_ratio = 20
/proc/sys/vm/min_free_kbytes = 262144
/proc/sys/vm/vfs_cache_pressure = 50
The following tuning options were used for the storage block devices on the storage servers:
IO scheduler: deadline
Number of schedulable requests: 2048
Maximum amount of read ahead data: 4096
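A sketch of how settings like these are commonly applied through sysfs is shown below; the device name sdX is a placeholder and, on RHEL 8 kernels using blk-mq, the scheduler name would typically be mq-deadline rather than deadline:
# Per-device block layer tuning via sysfs (sdX is a placeholder device).
echo mq-deadline > /sys/block/sdX/queue/scheduler     # IO scheduler (deadline/mq-deadline)
echo 2048 > /sys/block/sdX/queue/nr_requests          # schedulable requests
echo 4096 > /sys/block/sdX/queue/read_ahead_kb        # read-ahead, in KiB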
In addition to the above, the following BeeGFS-specific tuning options were used:
beegfs-meta.conf
connMaxInternodeNum = 96
logLevel = 3
tuneNumWorkers = 4
tuneTargetChooser = roundrobin
tuneUsePerUserMsgQueues = true
beegfs-storage.conf
connMaxInternodeNum = 32
tuneBindToNumaZone =
tuneFileReadAheadSize = 0m
tuneFileReadAheadTriggerSize = 4m
tuneFileReadSize = 1M
tuneFileWriteSize = 1M
tuneFileWriteSyncSize = 0m
tuneNumWorkers = 10
tuneUsePerTargetWorkers = true
tuneUsePerUserMsgQueues = true
tuneWorkerBufSize = 4m
beegfs-client.conf
connRDMABufSize = 524288
connRDMABufNum = 4
Additionally, the ME5084s were initialized with a chunksize of 128k.
Conclusion and Future Work
This blog presents the performance characteristics of the Dell Validated Design for HPC High Capacity BeeGFS Storage with 16G hardware. This DVD provides a peak performance of 34.6 GB/s for reads and 31.9 GB/s for writes using the IOZone sequential benchmark.
Major improvements have been made in this release to security, high availability, deployment, and telemetry monitoring. In future projects, we will evaluate other benchmarks, including MDTest, IOZone random (random IOP/s), IOR N to 1 (N threads to a single file), NetBench (clients to servers, omitting ME5 devices), and StorageBench (servers to disks directly, omitting the network).
The Dell PowerEdge C6615: Maximizing Value and Minimizing TCO for Dense Compute and Scale-out Workloads
Mon, 02 Oct 2023 21:35:01 -0000
In the ever-evolving landscape of data centers and IT infrastructure, meeting the demands of scale-out workloads is a continuous challenge. Organizations seek solutions that not only provide superior performance but also optimize Total Cost of Ownership (TCO).
Enter the new Dell PowerEdge C6615, a modular node designed to address these challenges with innovative solutions. Let's delve into the key features and benefits of this groundbreaking addition to the Dell PowerEdge portfolio.
Industry challenges
- Maximizing Rack utilization: One of the primary challenges in the data center world is maximizing rack utilization. The Dell PowerEdge C6615 addresses this by offering dense compute options.
- Cutting-edge processors: High-performance processors are crucial for scalability and security. The C6615 is powered by a 4th Generation AMD EPYC 8004 series processor, ensuring top-tier performance.
- Total Cost of Ownership (TCO): TCO is a critical consideration that encompasses power and cooling efficiency, licensing costs, and seamless integration with existing data center infrastructure. The C6615 is designed to reduce TCO significantly.
Introducing the Dell PowerEdge C6615
The Dell PowerEdge C6615 is a modular node designed to revolutionize data center infrastructure. Here are some key highlights:
- Price-performance ratio: The C6615 offers outstanding price per watt for scale-out workloads, with up to a 315% improvement compared to a one-socket (1S) server with AMD EPYC 9004 Series server processor.
- Optimized thermal solution: It features an optimized thermal solution that allows for air-cooling configurations with up to 53% improved cooling performance compared to the previous generation chassis.
- Density-optimized compute: The C6615's architecture is tailored for scale-out WebTech workloads, offering exceptional performance with reduced TCO.
- High-speed NVMe storage: It provides high-speed NVMe storage for applications with intensive IOPS requirements, ensuring efficient performance.
- Efficient scalability: With 40% more cores per rack compared to the AMD EPYC 9004 Series server processors, the C6615 allows for quicker and more efficient scalability.
- SmartNIC: It includes a SmartNIC with hardware-accelerated networking and storage, saving CPU cycles and enhancing security.
Key features
To maximize efficiency and reduce environmental impact, the PowerEdge C6615 incorporates several key features:
- Power and thermal efficiency: The 2U chassis with four nodes enhances power and thermal efficiency, eliminating the need for liquid cooling.
- Flexible I/O options: It supports up to two PCIe Gen5 slots and one x16 PCIe Gen5 OCP 3.0 slot for network cards, ensuring versatile connectivity.
- Security: Security is integrated at every phase of the PowerEdge lifecycle, from supply chain protection to Multi-Factor Authentication (MFA) and role-based access controls.
Accelerating performance
In benchmark testing, the C6615 outperforms the competition:
- HPL Benchmark: It showcases up to a 335% improvement in performance per watt per dollar and a 210% increase in performance per CPU dollar compared to other 1S systems with the AMD EPYC 9004 Series server processor.
Figure 1. HPL benchmark performance
- SPEC_CPU2017 Benchmark: Results demonstrate up to a 205% improvement in performance per watt per dollar and a remarkable 128% increase in performance per CPU dollar compared to similar systems.
Figure 2. SPEC_CPU2017 benchmark performance
Final thoughts
The seamless integration of the Dell PowerEdge C6615 into existing processes and toolsets is facilitated by comprehensive iDRAC9 support for all components. This ensures a smooth transition while leveraging the full potential of your server infrastructure.
Dell's commitment to environmental sustainability is evident in its use of recycled materials and energy-efficient options, helping to reduce carbon footprints and operational costs.
In conclusion, the Dell PowerEdge C6615 emerges as a leading dense compute solution, delivering exceptional value and unmatched performance. For more information, visit the PowerEdge Servers Powered by AMD site and explore how this innovative solution can transform your data center operations.
Note: Performance results may vary based on specific configurations and workloads. It's recommended to consult with Dell or an authorized partner for tailored solutions.
Author: David Dam