
Dell Validated Design for HPC BeeGFS High Capacity Storage with 16G Servers

Brendan Hanlon

Fri, 03 May 2024 22:02:08 -0000


The Dell Validated Design (DVD) for HPC BeeGFS High Capacity Storage is a fully supported, high-throughput, scale-out, parallel file system storage solution. The system is highly available and capable of supporting multi-petabyte storage. This blog describes the solution architecture and how it is tuned for HPC performance, and presents sequential I/O performance measured with the IOZone benchmark.

BeeGFS high-performance storage solutions built on NVMe devices are designed as a scratch storage solution for datasets which are usually not retained beyond the lifetime of the job. For more information, see HPC Scratch Storage with BeeGFS.


Figure 1. DVD for BeeGFS High Capacity - Large Configuration

The system shown in Figure 1 uses a PowerEdge R660 as the management server and entry point into the system. There are two pairs of R760 servers, each pair in an active-active high availability configuration. One pair serves as the Metadata Servers (MDS), connected to a PowerVault ME5024 storage array that hosts the Metadata Targets (MDTs). The other pair serves as the Object Storage Servers (OSS), connected to up to four PowerVault ME5084 storage arrays that host the Storage Targets (STs) for the BeeGFS file system. The ME5084 array supports hard disk drives (HDDs) of up to 22TB. To scale beyond four ME5084s, additional OSS pairs are needed.

This solution uses Mellanox InfiniBand HDR (200 Gb/s) for the data network on the OSS pair, and InfiniBand HDR-100 (100 Gb/s) for the MDS pair. The clients and servers are connected to the 1U Mellanox Quantum HDR Edge Switch QM8790, which supports up to 80 ports of HDR100 using splitter cables, or 40 ports of unsplit HDR200. Additionally, a single switch can be configured to have any combination of split and unsplit ports.

Hardware and Software Configuration

The following tables describe the hardware specifications and software versions validated for the solution.

Table 1. Testbed configuration

Management Server: 1x Dell PowerEdge R660
Metadata Servers (MDS): 2x Dell PowerEdge R760
Object Storage Servers (OSS): 2x Dell PowerEdge R760
Processor: 2x Intel(R) Xeon(R) Gold 6430
Memory: 16x 16GB DDR5 4800MT/s RDIMMs (256 GB)
InfiniBand HCA: MGMT: 1x Mellanox ConnectX-6 single-port HDR-100; MDS: 2x Mellanox ConnectX-6 single-port HDR-100; OSS: 2x Mellanox ConnectX-6 single-port HDR
External Storage Controller: MDS: 2x Dell 12Gb/s SAS HBAs per R760 server; OSS: 4x Dell 12Gb/s SAS HBAs per R760 server
Data Storage Enclosure: Small: 1x Dell PowerVault ME5084; Medium: 2x Dell PowerVault ME5084; Large: 4x Dell PowerVault ME5084 (drive configuration is described in the section Storage Service)
Metadata Storage Enclosure: 1x Dell PowerVault ME5024 enclosure fully populated with 24 drives (drive configuration is described in the section Metadata Service)
RAID Controllers: Duplex SAS RAID controllers in the ME5084 and ME5024 enclosures
Drives: Per ME5084 enclosure: 84x SAS3 HDDs; per ME5024 enclosure: 24x SAS3 SSDs
Operating System: Red Hat Enterprise Linux release 8.6 (Ootpa)
Kernel Version: 4.18.0-372.9.1.el8.x86_64
Mellanox OFED Version: MLNX_OFED_LINUX-5.6-2.0.9.0
Grafana: 10.3.3-1
InfluxDB: 1.8.10-1
BeeGFS File System: 7.4.0p1

Solution Configuration Details

The BeeGFS architecture consists of four main services:

  1. Management Service
  2. Metadata Service
  3. Storage Service
  4. Client Service

There is also an optional BeeGFS Monitoring Service.

Except for the client service, which is a kernel module, the management, metadata, and storage services are user space processes. Any combination of BeeGFS services (client and server components) can run together on the same machine, and multiple instances of any BeeGFS service can run on the same machine. In the Dell High Capacity configuration of BeeGFS, the metadata servers run the metadata services (one per Metadata Target), as well as the monitoring and management services. Note that the management service runs on the metadata servers, not on the management server; this ensures that there is a redundant node to fail over to in case of a single server outage. The storage service (one instance per server) runs on the storage servers.
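
As a quick way to see this layout on a running system, the BeeGFS services active on a node can be listed with systemctl; the sketch below assumes the standard BeeGFS systemd unit names. On a metadata server in this design, the management, monitoring, and metadata services appear alongside each other, while a storage server shows only the storage service.

systemctl list-units 'beegfs-*' --type=service --no-pager    # list the BeeGFS services active on the local node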

Management Service

The beegfs-mgmtd service runs on the metadata servers (not the management server) and handles registration of resources (storage, metadata, monitoring) and clients. The mgmtd store is initialized as a small partition on the first metadata target. In a healthy system, the service is started on MDS-1.
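
Whether the servers and clients have registered with the management service can be verified with the standard beegfs-ctl tool from any node with the BeeGFS utilities installed; the commands below are generic BeeGFS queries shown here for illustration.

beegfs-ctl --listnodes --nodetype=management --nicdetails    # show the management node and its network interfaces
beegfs-ctl --listnodes --nodetype=client                     # list the registered clients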

Metadata Service

The ME5024 storage array used for metadata storage is fully populated with 24x 960 GB SSDs in this evaluation. These drives are configured in 12x linear RAID1 disk groups of two drives each as shown in Figure 2. Each RAID1 group is a metadata target (MDT).


Figure 2. Fully Populated ME5024 array with 12 MDTs

In BeeGFS, each metadata service handles only a single MDT. Since there are 12 MDTs, there are 12 metadata service instances, and each of the two metadata servers runs six of them. The metadata targets are formatted with an ext4 file system, which performs well with small files and small file operations. Additionally, BeeGFS stores information in extended attributes and directly in the inodes of the file system to optimize performance, both of which work well with ext4.
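
The mapping of MDTs to metadata service instances can be inspected with beegfs-ctl once the file system is running; this is a generic BeeGFS query rather than a step specific to this design.

beegfs-ctl --listtargets --nodetype=meta --state --longnodes    # show the 12 MDTs, their owning metadata services, and their state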

Storage Service

The data storage solution evaluated in this blog is distributed across four PowerVault ME5084 arrays, which makes up the large configuration. There are also a medium configuration (two arrays) and a small configuration (a single array), which are not assessed in this document. On each array, linear RAID-6 disk groups of 10 drives (8+2) are created, and a single volume using all the space is created for every disk group, resulting in eight disk groups/volumes per 84-drive array. The four remaining drives are configured as global hot spares for the array. The storage targets are formatted with an XFS file system, as it delivers high write throughput and scalability with RAID arrays.

There are a total of 32 RAID-6 volumes across the four ME5084 arrays in the large configuration shown in Figure 1. Each of these RAID-6 volumes is configured as a Storage Target (ST) for the BeeGFS file system, resulting in a total of 32 STs.
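
After the volumes are added as storage targets, the target-to-server mapping and remaining capacity can be checked with beegfs-ctl; again, this is a standard BeeGFS query included here as an illustration.

beegfs-ctl --listtargets --nodetype=storage --spaceinfo --longnodes    # list all 32 STs with their owning OSS and free space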

Each ME5084 array has 84 drives, with drives numbered 0-41 in the top drawer and 42-83 in the bottom drawer. In Figure 3, each drive is labeled 1-8 to indicate which RAID-6 disk group it belongs to. The drives marked "S" are configured as global hot spares, which automatically take the place of any failed drive in a disk group.


Figure 3. RAID 6 (8+2) disk group layout on one ME5084

Client Service

The BeeGFS client module is loaded on all hosts that require access to the BeeGFS file system. When the module is loaded and the beegfs-client service is started, the service mounts the file systems defined in /etc/beegfs/beegfs-mounts.conf instead of the usual approach based on /etc/fstab. With this approach, the beegfs-client service starts like any other Linux service through its startup script, and it enables automatic recompilation of the BeeGFS client module after system updates.
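
As a hedged illustration of this mechanism, a minimal beegfs-mounts.conf pairs a mount point with the client configuration file to use, one mount per line; the paths below are the common BeeGFS defaults and not necessarily those used in this solution.

# /etc/beegfs/beegfs-mounts.conf: <mount point> <client config file>
/mnt/beegfs /etc/beegfs/beegfs-client.conf

systemctl enable --now beegfs-client    # builds/loads the kernel module and mounts the file system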

Monitoring Service

The BeeGFS monitoring service (beegfs-mon.service) collects BeeGFS statistics and provides them to the user using the time series database InfluxDB. For visualization, beegfs-mon-grafana provides predefined Grafana dashboards that can be used out of the box. Figure 4 shows the main dashboard with BeeGFS storage server statistics during a short benchmark run, including charts of how much network traffic the servers are handling, how much data is being read from and written to disk, and so on.

The monitoring capabilities in this release have been upgraded with Telegraf, allowing users to view system-level metrics such as CPU load (per-core and aggregate), memory usage, process count, and more.


Figure 4. Grafana Dashboard - BeeGFS Storage Server
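
For a quick check that monitoring data is flowing, the beegfs-mon service and the underlying InfluxDB 1.8 instance can be queried directly; the database name below is the beegfs-mon default and is an assumption here, since the deployed name may differ.

systemctl status beegfs-mon                                  # confirm the monitoring service is running
influx -database beegfs_mon -execute 'SHOW MEASUREMENTS'     # list the metrics beegfs-mon is writing (assumed default database name)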

High Availability

In this release, the high availability capabilities of the solution have been expanded to improve the resiliency of the file system. The high availability software now takes advantage of dual network rings, so it can communicate over both the management Ethernet interface and InfiniBand. Additionally, the quorum of decision-making nodes has been expanded to include all five base servers, instead of the previous three (MGMT, MDS-1, and MDS-2).
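
Assuming a Pacemaker/Corosync cluster stack (an assumption for illustration; the specific HA software is not named here), the dual rings and the five-node quorum could be checked along these lines.

corosync-cfgtool -s    # show the status of both rings (management Ethernet and InfiniBand)
pcs quorum status      # confirm that all five base servers are quorum members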

Performance Evaluation

This section presents the performance characteristics of the DVD for HPC BeeGFS High Capacity Storage using the IOZone sequential benchmark. More information about performance characterization using MDTest, IOR N-1, IOZone random, IO500, and StorageBench will be available in an upcoming white paper.

Storage performance was evaluated using the IOZone benchmark, which measured sequential read and write throughput. Table 2 describes the configuration of the PowerEdge C6420 nodes used as BeeGFS clients for these performance studies. The client nodes use a mix of processors due to available resources; everything else is identical.

Table 2. Client Configuration

Clients: 16x Dell PowerEdge C6420
Processor per client: 12 clients: 2x Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz; 3 clients: 2x Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz; 1 client: 2x Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Memory: 12x 16GB DDR4 2933MT/s DIMMs (192GB)
Operating System: Red Hat Enterprise Linux release 8.4 (Ootpa)
Kernel Version: 4.18.0-305.el8.x86_64
Interconnect: 1x Mellanox ConnectX-6 single-port HDR-100 adapter
OFED Version: MLNX_OFED_LINUX-5.6-2.0.9.0

The clients are connected over an HDR-100 network, while the servers are connected with a mix of HDR-100 and HDR. This configuration is described in Table 3.

Table 3. Network Configuration

InfiniBand Switch: QM8790 Mellanox Quantum HDR Edge Switch, 1U with 40x HDR 200Gb/s ports (ports can also be split into 2x HDR-100 100Gb/s)
Management Switch: Dell PowerSwitch N3248TE-ON (1U with 48x 1GbE ports) is recommended; due to resource availability, this DVD was tested with a Dell PowerConnect 6248 switch (1U with 48x 1GbE ports)
InfiniBand HCA: MGMT: 1x Mellanox ConnectX-6 single-port HDR-100; MDS: 2x Mellanox ConnectX-6 single-port HDR-100; OSS: 2x Mellanox ConnectX-6 single-port HDR; Clients: 1x Mellanox ConnectX-6 single-port HDR-100

Sequential Reads and Writes N-N

The sequential reads and writes were measured using IOZone (v3.492). Tests were conducted with multiple thread counts starting at one thread and increasing in powers of two, up to 1,024 threads. This test is N-N; one file will be generated per thread. The processes were distributed across the 16 physical client nodes in a round-robin manner so that requests were equally distributed with load balancing.

For thread counts two and above, an aggregate file size of 8TB was chosen to minimize the effects of caching from the servers as well as from the BeeGFS clients. For one thread, a file size of 1TB was chosen. Within any given test, the aggregate file size used was equally divided among the number of threads. A record size of 1MiB was used for all runs. The benchmark was split into two commands: a write benchmark and a read benchmark, which are shown below.

iozone -i 0 -c -e -w -C -r 1m -s $Size -t $Thread -+n -+m /path/to/threadlist  # write
iozone -i 1 -c -e    -C -r 1m -s $Size -t $Thread -+n -+m /path/to/threadlist  # read
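
The file passed with -+m tells IOZone where to place each thread: every line names a client host, the working directory on that client, and the path to the iozone binary there. A hypothetical threadlist for four threads spread across the first four clients might look like the following (hostnames and paths are illustrative only).

# threadlist: <client hostname> <working directory> <path to iozone binary>
node001 /mnt/beegfs/iozone /usr/bin/iozone
node002 /mnt/beegfs/iozone /usr/bin/iozone
node003 /mnt/beegfs/iozone /usr/bin/iozone
node004 /mnt/beegfs/iozone /usr/bin/iozone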

OS caches were dropped on the servers and clients between iterations, as well as between the write and read tests, by running the following command:

sync; echo 3 > /proc/sys/vm/drop_caches
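
Because caches must be cleared on every server and client, a parallel shell can issue the command to all nodes at once; pdsh and the host range below are assumptions used for illustration.

pdsh -w node[001-016] 'sync; echo 3 > /proc/sys/vm/drop_caches'    # drop page cache, dentries, and inodes on all clients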


Figure 5. N-N Sequential Performance

Figure 5 illustrates the performance scaling of read and write operations. The peak read throughput of 34.55 GB/s is achieved with 512 threads, while the peak write throughput of 31.89 GB/s is reached with 1024 threads. The performance shows a steady increase as the number of threads increases, with read performance beginning to stabilize at 32 threads and write performance at 64 threads. Beyond these points, the performance increases at a gradual rate, maintaining an approximate throughput of 34 GB/s for reads and 31 GB/s for writes. Regardless of the thread count, read operations consistently match or exceed write operations.

Tuning Parameters 

The following tuning parameters were in place while carrying out the performance characterization of the solution.

The default stripe count for BeeGFS is 4; however, the chunk size and number of targets per file (stripe count) can be configured on a per-directory or per-file basis. For all of these tests, the BeeGFS stripe size was set to 1MiB and the stripe count to 1, as shown below:

[root@node025 ~]# beegfs-ctl --getentryinfo --mount=/mnt/beegfs/ /mnt/beegfs/stripe1 --verbose
Entry type: directory
EntryID: 0-65D529C1-7
ParentID: root
Metadata node: mds-1-numa1-2 [ID: 5]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 1M
+ Number of storage targets: desired: 1
+ Storage Pool: 1 (Default)
Inode hash path: 4E/25/0-65D529C1-7
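
For reference, a stripe pattern like the one shown above can be applied to a directory with beegfs-ctl --setpattern; new files created in the directory inherit the pattern. The path below is the test directory from the listing above.

beegfs-ctl --setpattern --chunksize=1m --numtargets=1 /mnt/beegfs/stripe1    # 1MiB chunks, stripe count of 1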

Transparent huge pages were disabled, and the following virtual memory settings were configured on the metadata and storage servers:

/proc/sys/vm/dirty_background_ratio = 5
/proc/sys/vm/dirty_ratio = 20
/proc/sys/vm/min_free_kbytes = 262144
/proc/sys/vm/vfs_cache_pressure = 50
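
One way to apply these settings persistently (a sketch, not necessarily how the validated systems were configured) is a sysctl drop-in file, with transparent huge pages disabled at runtime:

# /etc/sysctl.d/90-beegfs-tuning.conf (apply with: sysctl --system)
vm.dirty_background_ratio = 5
vm.dirty_ratio = 20
vm.min_free_kbytes = 262144
vm.vfs_cache_pressure = 50

echo never > /sys/kernel/mm/transparent_hugepage/enabled    # disable THP until the next reboot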

The following tuning options were used for the storage block devices on the storage servers:

I/O scheduler: deadline
Number of schedulable requests: 2048
Maximum amount of read-ahead data: 4096
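
These settings map to sysfs attributes of each block device and could be applied roughly as follows; the device name is a placeholder, and on RHEL 8 the deadline scheduler is exposed as mq-deadline.

echo mq-deadline > /sys/block/sdX/queue/scheduler     # I/O scheduler (sdX is a placeholder for each ME5 device)
echo 2048        > /sys/block/sdX/queue/nr_requests   # number of schedulable requests
echo 4096        > /sys/block/sdX/queue/read_ahead_kb # read-ahead in KiB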


In addition to the above, the following BeeGFS-specific tuning options were used:

beegfs-meta.conf

connMaxInternodeNum           = 96
logLevel                      = 3
tuneNumWorkers                = 4
tuneTargetChooser             = roundrobin
tuneUsePerUserMsgQueues       = true

beegfs-storage.conf

connMaxInternodeNum           = 32
tuneBindToNumaZone            =
tuneFileReadAheadSize         = 0m
tuneFileReadAheadTriggerSize = 4m
tuneFileReadSize              = 1M
tuneFileWriteSize             = 1M
tuneFileWriteSyncSize         = 0m
tuneNumWorkers                = 10
tuneUsePerTargetWorkers       = true
tuneUsePerUserMsgQueues       = true
tuneWorkerBufSize             = 4m

beegfs-client.conf

connRDMABufSize = 524288
connRDMABufNum = 4

Additionally, the ME5084s were initialized with a chunksize of 128k.

Conclusion and Future Work

This blog presents the performance characteristics of the Dell Validated Design for HPC High Capacity BeeGFS Storage with 16G hardware. This DVD provides a peak performance of 34.6 GB/s for reads and 31.9 GB/s for writes using the IOZone sequential benchmark.

Major improvements have been made in this release to security, high availability, deployment, and telemetry monitoring. In future projects, we will evaluate other benchmarks, including MDTest, IOZone random (random IOPS), IOR N-to-1 (N threads to a single file), NetBench (clients to servers, omitting the ME5 devices), and StorageBench (servers directly to disks, omitting the network).