WRF Performance with AMD EPYC 7003 Series processors On Dell EMC PowerEdge servers

The Weather Research and Forecasting (WRF) model is an open-source mesoscale weather prediction model that is predominantly used in multi-node compute environments for atmospheric research and operational forecasts. The model performs well on the 3rd Gen AMD EPYC (7003 Series) processor family, code-named Milan. In this blog, we highlight the performance improvement of the WRF application on Dell EMC PowerEdge servers with AMD Milan processors.

This blog follows up on the first blog in this series, where we introduced the AMD Milan processor architecture, key BIOS tuning options, and baseline microbenchmark performance. Here, we compare the performance of Dell EMC PowerEdge servers based on the latest AMD EPYC Milan (7003 Series) processors with that of Dell EMC PowerEdge servers based on the second-generation AMD EPYC Rome (7002 Series) processors. The testbed hardware and software details are outlined in the following table:

Table 1: Testbed hardware and software details

Server	Dell EMC PowerEdge 2-socket servers (with AMD Milan processors)			Dell EMC PowerEdge 2-socket servers (with AMD Rome processors)	
Processor model	7763	7713	7543	7662	7542
Cores/socket	64	64	32	64	32
Frequency (Base-Boost)	2.45 GHz – 3.5 GHz	2.0 GHz – 3.7 GHz	2.8 GHz – 3.7 GHz	2.0 GHz – 3.35 GHz	2.9 GHz – 3.4 GHz
TDP	280 W	225 W	225 W	200 W	225 W
L3 cache	256 MB	256 MB	256 MB	256 MB	128 MB
Processor bus speed	16 GT/s	16 GT/s	16 GT/s	16 GT/s	16 GT/s
Operating system	Red Hat Enterprise Linux 8.3 (4.18.0-240.el8.x86_64)
Memory	DDR4 256 GB (16 x 16 GB) 3200 MT/s
Interconnect	NVIDIA Mellanox HDR
BIOS/CPLD	2.2.5 / 1.1.12 (AMD 7763, AMD 7713, AMD 7543); 2.1.6 / 1.1.12 (AMD 7662); 2.1.5 / 0.10.3 (AMD 7542)
Applications	WRF v3.9.1.1, WRF v4.2.2
Benchmark datasets	conus 2.5km, new conus 2.5km, wrf_large 3km

The following figure shows the domain for the tested datasets:

Figure 1: Domain configuration for new conus 2.5 km, conus 2.5 km, and wrf_large datasets.

The following table provides a brief description of each dataset:

Table 2: Configuration for the new conus 2.5 km, conus 2.5 km, and wrf_large datasets

	conus 2.5 km	new conus 2.5 km	wrf_large
Run hours	3	3	2
Resolution (m)	2500	2500	3000
Vertical layers	35	35	50
Grid points	1501 x 1201	1901 x 1301	1500 x 1500
interval_seconds	10800	10800	21600

We measured performance by averaging the WRF computation time per timestep reported in the rsl.error.0000 output file.
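As an illustration of this measurement, the following minimal Python sketch averages the per-timestep times from an rsl.error.0000 file. It assumes the usual "Timing for main: ... elapsed seconds" line format that WRF writes for each step; the function names are hypothetical, and the blog does not state whether any steps (for example, the first one) were excluded from the average, so this sketch includes them all.

```python
import re

# Assumed format of a WRF per-timestep timing line, for example:
# Timing for main: time 2019-11-26_12:00:15 on domain   1:    1.23456 elapsed seconds
TIMING_RE = re.compile(
    r"Timing for main: time \S+ on domain\s+(\d+):\s+([\d.]+) elapsed seconds"
)


def mean_step_time(lines, domain=1):
    """Return the mean elapsed seconds per timestep for one domain."""
    times = []
    for line in lines:
        match = TIMING_RE.search(line)
        # Lines such as "Timing for Writing ..." do not match and are skipped.
        if match and int(match.group(1)) == domain:
            times.append(float(match.group(2)))
    if not times:
        raise ValueError("no 'Timing for main' lines found for this domain")
    return sum(times) / len(times)


def mean_step_time_from_file(path="rsl.error.0000", domain=1):
    """Convenience wrapper (hypothetical helper) that reads the log from disk."""
    with open(path) as handle:
        return mean_step_time(handle, domain)
```

A lower mean time per timestep corresponds to higher application performance.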

Single node performance

The following figures show the application performance for the datasets mentioned in Table 2. In each figure, the numbers over the bars represent the relative change in the application performance compared to the application performance obtained on the AMD 7542 Rome processor model.

Figure 2: Relative difference in WRF performance by dataset and by the processor models listed in Table 1

WRF was compiled with the "dm + sm" (distributed-memory plus shared-memory) configuration, and all available cores were subscribed during the WRF simulation runs. To optimize performance, we tried different combinations of MPI process count and OpenMP thread count, along with different tiling options (WRF_NUM_TILES). For single-node tests, two MPI processes per Core Complex Die (CCD) delivered the best results for the conus 2.5 km and new conus 2.5 km datasets. We used eight processes per CCD for the wrf_large dataset.
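To make the process and thread arithmetic concrete, here is a small Python sketch that derives the MPI rank count per node and the OpenMP thread count per rank from the number of processes per CCD, while keeping every core subscribed. It assumes eight CCDs per socket on the 64-core parts (eight cores per CCD); the function name and parameters are hypothetical and are not part of WRF or any MPI launcher.

```python
def hybrid_layout(cores_per_socket, ccds_per_socket, ranks_per_ccd, sockets=2):
    """Return (mpi_ranks_per_node, omp_threads_per_rank) using every core."""
    cores_per_ccd, rem = divmod(cores_per_socket, ccds_per_socket)
    if rem:
        raise ValueError("cores do not divide evenly across CCDs")
    threads, rem = divmod(cores_per_ccd, ranks_per_ccd)
    if rem or threads == 0:
        raise ValueError("ranks per CCD must divide the cores per CCD")
    ranks = sockets * ccds_per_socket * ranks_per_ccd
    return ranks, threads


# 64-core processor, ASSUMING 8 CCDs per socket with 8 cores each.
print(hybrid_layout(64, 8, 2))  # (32, 4): two ranks per CCD, four threads each
print(hybrid_layout(64, 8, 8))  # (128, 1): eight ranks per CCD, one thread each
```

The thread count would typically be passed to the run through the standard OMP_NUM_THREADS environment variable, with WRF_NUM_TILES tuned separately for each layout.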

Depending on the dataset, the AMD 7763 processor can deliver up to 14 percent better performance than the AMD 7543 processor. In the previous blog, we observed larger performance improvements on the 32-core Milan processor model with memory-bandwidth-bound benchmarks such as HPCG and STREAM. WRF is also a memory-bandwidth-bound application, and it shows a notable performance improvement on the 32-core processor model: the AMD 7543 delivers up to 26 percent better performance than the AMD 7542 processor.

From the performance shown in Figure 2 and the average power usage data shown in Figure 3, we noted that the AMD 7713 processor can deliver up to 58 percent better performance per watt than the AMD 7662 processor.

Figure 3: Power used by platform and processor type: average idle server power usage was 305 W (7542), 338 W (7662), 305 W (7543), 258 W (7713), and 272 W (7763)
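A performance-per-watt comparison of this kind can be sketched as follows. The sketch assumes that performance is the inverse of the mean time per timestep and that power is the average server power during the run; the input values are hypothetical placeholders, not measurements from this study, and the function name is ours.

```python
def perf_per_watt_gain(time_new, power_new, time_ref, power_ref):
    """Relative performance-per-watt gain of 'new' over 'ref', in percent.

    Performance is taken as 1 / (mean seconds per timestep), so
    performance per watt is 1 / (time * power).
    """
    ppw_new = 1.0 / (time_new * power_new)
    ppw_ref = 1.0 / (time_ref * power_ref)
    return (ppw_new / ppw_ref - 1.0) * 100.0


# Hypothetical inputs: seconds per timestep and average watts.
gain = perf_per_watt_gain(time_new=2.0, power_new=500.0,
                          time_ref=2.5, power_ref=600.0)
print(f"{gain:.1f}%")  # 50.0%
```

Because performance per watt is inversely proportional to time multiplied by power, a processor can gain on this metric by running faster, by drawing less power, or both.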

Multi-node scalability

To evaluate the scalability of WRF, we used eight nodes. Each node is equipped with AMD 7713 processors, and the nodes are interconnected using NVIDIA Mellanox HDR. All nodes used for benchmarking were connected to the same HDR switch. Table 1 provides details about the servers and software that were used for the test. In Figure 4, the labels above each line show the relative change in application performance on 2, 4, and 8 nodes compared with the performance obtained on a single node.

Figure 4: Multi-node performance of WRF on an AMD Milan 7713 processor for datasets listed in Table 1

The scalability numbers have been rounded off to a single digit. We observed good scalability with all the datasets listed in Table 1.
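For readers who want to reproduce this kind of scalability figure from their own runs, the sketch below computes speedup and parallel efficiency relative to the smallest node count. The timings are hypothetical placeholders rather than results from this study, and the function name is an assumption.

```python
def scaling_table(step_times):
    """Map node count -> (speedup, parallel efficiency) vs. the smallest node count."""
    base_nodes = min(step_times)
    base_time = step_times[base_nodes]
    table = {}
    for nodes in sorted(step_times):
        speedup = base_time / step_times[nodes]
        # Ideal speedup equals the ratio of node counts.
        efficiency = speedup / (nodes / base_nodes)
        table[nodes] = (speedup, efficiency)
    return table


# Hypothetical mean seconds per timestep on 1, 2, 4, and 8 nodes.
times = {1: 8.0, 2: 4.2, 4: 2.2, 8: 1.25}
for nodes, (speedup, eff) in scaling_table(times).items():
    print(f"{nodes} nodes: {speedup:.1f}x speedup, {eff:.0%} efficiency")
```

An efficiency close to 100 percent at each node count indicates near-linear scaling.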

Conclusions and recommendations

WRF delivers better performance and better performance per watt on AMD Milan processors than on AMD Rome processors. There is a significant performance improvement on the 32-core Milan processor model, and the WRF simulations scale well with the datasets described in this blog. However, scalability might vary depending on the dataset and the node count being tested. Before running production workloads, test the impact of the tile size, the MPI process count, and the number of OpenMP threads per process on your own dataset.

We will continue to post new blogs on this site as updates arise.
