WRF Performance with 3rd Generation Intel Xeon Scalable Processors On Dell EMC PowerEdge servers

Many sectors, such as aviation, travel, tourism, energy, and transportation, rely heavily on timely and accurate weather predictions from weather forecast centers. These operational forecast centers use numerical weather prediction (NWP) models to predict the weather from current conditions. The Weather Research and Forecasting (WRF) model is one of the most widely used NWP systems. For the WRF model to deliver timely forecasts, a high-performance computing (HPC) cluster needs a suitable combination of robust computational resources, a high-speed network, and high-throughput storage.

In this blog, we highlight the performance improvement for WRF with Intel Ice Lake processors compared with Intel Cascade Lake processors on Dell EMC PowerEdge servers. The tests were carried out on two-socket Dell EMC PowerEdge servers with the BIOS set to the HPC workload profile. The testbed hardware and software details are outlined in the following table:

Table 1: Testbed hardware and software details

Component	Dell EMC PowerEdge R750 server	Dell EMC PowerEdge R650 server	Dell EMC PowerEdge C6420 server	Dell EMC PowerEdge C6420 server
SKU	8380	6338	8280	6252
Cores/Socket	40	32	28	24
Frequency (Base-Max Turbo)	2.30 – 3.40 GHz	2.0 – 3.20 GHz	2.70 – 4.0 GHz	2.10 – 3.70 GHz
TDP	270 W	205 W	205 W	150 W
L3 Cache	60 MB	48 MB	38.5 MB	37.75 MB
Operating System	Red Hat Enterprise Linux 8.3 4.18.0-240.22.1.el8_3.x86_64	Red Hat Enterprise Linux 8.3 4.18.0-240.22.1.el8_3.x86_64	Red Hat Enterprise Linux 8.3 4.18.0-240.el8.x86_64	Red Hat Enterprise Linux 8.3 4.18.0-240.el8.x86_64
Memory	32 GB x 16 (2Rx8) 3200 MT/s	32 GB x 16 (2Rx8) 3200 MT/s	16 GB x 12 (2Rx8) 2933 MT/s	16 GB x 12 (2Rx8) 2933 MT/s
BIOS/CPLD	1.2.4/1.0.5		2.11.2/1.1.0
Interconnect	NVIDIA Mellanox HDR	NVIDIA Mellanox HDR	NVIDIA Mellanox HDR100	NVIDIA Mellanox HDR100
Compiler	Intel Parallel Studio XE 2020 (Update 4)
Datasets	conus 2.5km, new conus 2.5km, wrf_large 3km

We benchmarked WRF-V3.9.1.1 with the conus 2.5km and new conus 2.5km datasets and WRF-V4.2.2 with new conus 2.5km and wrf_large 3km datasets. The following figure shows the simulation domain for the tested datasets:

Figure 1: Domain configuration for new conus 2.5 km, conus 2.5 km, and wrf_large datasets.

The following table provides a brief description of each dataset:

Table 2: Configuration for new conus 2.5 km, conus 2.5 km and wrf_large datasets

	conus 2.5 km	new conus 2.5 km	wrf_large
Run hours	3	3	2
Resolution(m)	2500	2500	3000
Vertical layers	35	35	50
Grid points	1501 x 1201	1901 x 1301	1500 x 1500
interval_seconds	10800	10800	21600

The results were measured by averaging the WRF computation time of each timestep reported in the rsl.error.0000 output file. Timesteps that included file reads or writes (of wrfinput* or wrfout* files) were excluded from the average.
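As an illustration of this measurement, the following Python sketch averages the per-timestep compute time from the lines of an rsl.error.0000 file. It assumes the usual "Timing for main: ... elapsed seconds" log lines, and it assumes that the timestep reported immediately after a "Timing for Writing" or "Timing for processing" line is the one affected by file I/O. The function name, the skip rule, and the sample lines are ours for illustration; this is not the exact script used for the results in this blog.

```python
import re

# Per-timestep compute line, for example:
# "Timing for main: time 2018-06-17_00:00:30 on domain   1:    2.00000 elapsed seconds"
MAIN_RE = re.compile(
    r"Timing for main: time \S+ on domain\s+\d+:\s+([0-9.]+) elapsed seconds"
)
# I/O lines (wrfout writes, wrfinput/boundary processing). Assumption: the
# "Timing for main" line that follows one of these includes the I/O cost.
IO_RE = re.compile(r"Timing for (Writing|processing)", re.IGNORECASE)


def mean_step_time(lines):
    """Average compute seconds per timestep, skipping I/O-affected steps."""
    times = []
    skip_next = False
    for line in lines:
        if IO_RE.search(line):
            skip_next = True
            continue
        match = MAIN_RE.search(line)
        if match is None:
            continue
        if skip_next:
            skip_next = False  # drop the step that absorbed the I/O time
            continue
        times.append(float(match.group(1)))
    if not times:
        raise ValueError("no usable 'Timing for main' lines found")
    return sum(times) / len(times)


# Made-up sample lines in the assumed format (not measured data).
sample = [
    "Timing for processing wrfinput file (stream 0) for domain        1:    5.00000 elapsed seconds",
    "Timing for main: time 2018-06-17_00:00:15 on domain   1:   10.00000 elapsed seconds",
    "Timing for main: time 2018-06-17_00:00:30 on domain   1:    2.00000 elapsed seconds",
    "Timing for main: time 2018-06-17_00:00:45 on domain   1:    4.00000 elapsed seconds",
    "Timing for Writing wrfout_d01_2018-06-17_00:01:00 for domain        1:    3.00000 elapsed seconds",
    "Timing for main: time 2018-06-17_00:01:00 on domain   1:    9.00000 elapsed seconds",
    "Timing for main: time 2018-06-17_00:01:15 on domain   1:    3.00000 elapsed seconds",
]
print(mean_step_time(sample))
```

To use it on a real run, pass the open file object instead of the sample list, for example `mean_step_time(open("rsl.error.0000"))`.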

Single node performance

The following figures show the application performance for the datasets mentioned in Table 2. In each figure, the numbers over the bars represent performance relative to the Intel 6252 Cascade Lake processor model. The blue and green bars represent application performance obtained with Ice Lake and Cascade Lake processors, respectively.

Figure 2: Relative performance of WRF by processor and dataset type mentioned in Table 1

WRF was compiled with the "dm + sm" (distributed-memory plus shared-memory, that is, MPI plus OpenMP) configuration with AVX2 instructions and serial NetCDF support (io_form* set to 2). All available cores were subscribed during the WRF simulation runs. To optimize performance, we tested different combinations of MPI process count and OpenMP thread count, as well as different tiling schemes (WRF_NUM_TILES).
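The search space for such a sweep can be sketched as follows. This minimal Python example lists the (MPI ranks, OpenMP threads) pairs that subscribe every core of a two-socket node exactly once. The helper name is hypothetical, and the restriction that a rank's threads do not span sockets is our assumption, not a rule stated for these tests. Each pair would then be run with OMP_NUM_THREADS set to the thread count and one or more WRF_NUM_TILES values.

```python
def hybrid_layouts(cores_per_socket, sockets=2):
    """Return (mpi_ranks, omp_threads) pairs that use every core exactly once.

    Assumption: the thread count divides the cores per socket, so the
    threads of one MPI rank never span two sockets.
    """
    total_cores = cores_per_socket * sockets
    layouts = []
    for threads in range(1, cores_per_socket + 1):
        if cores_per_socket % threads == 0:
            layouts.append((total_cores // threads, threads))
    return layouts


# Example: a two-socket node with 40 cores per socket (the 8380 in Table 1).
for ranks, threads in hybrid_layouts(40):
    print(f"{ranks:3d} MPI ranks x {threads:2d} OpenMP threads")
```

The best pair is dataset dependent, which is why each candidate layout has to be timed rather than chosen in advance.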

Depending on the dataset, the 8380 processor model delivers up to 19 percent better performance than the 6338 processor model. Relative to Cascade Lake, the Ice Lake architecture has more memory channels and offers higher aggregate memory bandwidth. WRF, which is typically memory-bandwidth bound, can take advantage of this additional bandwidth, and the results show up to 65 percent performance improvement over the Cascade Lake counterparts. Table 3 compares instructions per cycle (IPC) and DRAM bandwidth utilization on Intel Ice Lake and Cascade Lake processors, collected with the Intel oneAPI VTune Profiler.

Table 3: Metrics collected using the Intel oneAPI VTune Profiler

Dataset	8380 IPC	8380 Bandwidth (GB/s)	8280 IPC	8280 Bandwidth (GB/s)
conus 2.5km (WRFV3)	0.99	257.32	0.86	128.30
new conus 2.5km (WRFV3)	1.57	192.18	1.48	120.96
new conus 2.5km (WRFV4)	1.36	191.43	1.14	115.46
wrf_large (WRFV4)	1.09	64.80	0.90	62.55

Intel’s Ice Lake is expected to deliver around 20 percent better IPC than Cascade Lake (8380 versus 8280). With the datasets covered in this blog, we found that the Intel 8380 processor reports 6 to 19 percent better IPC than the Intel 8280 processor.
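The relative gains can be recomputed from the published Table 3 values with a few lines of Python. This is only a sketch: the dictionary layout and function name are ours, and the inputs are the rounded table entries, so the results are approximate.

```python
# (IPC, DRAM bandwidth in GB/s) per dataset, copied from Table 3.
TABLE3 = {
    "conus 2.5km (WRFV3)":     {"8380": (0.99, 257.32), "8280": (0.86, 128.30)},
    "new conus 2.5km (WRFV3)": {"8380": (1.57, 192.18), "8280": (1.48, 120.96)},
    "new conus 2.5km (WRFV4)": {"8380": (1.36, 191.43), "8280": (1.14, 115.46)},
    "wrf_large (WRFV4)":       {"8380": (1.09, 64.80),  "8280": (0.90, 62.55)},
}


def relative_gains(table):
    """Return {dataset: (IPC gain in percent, bandwidth ratio)} for 8380 vs 8280."""
    gains = {}
    for dataset, row in table.items():
        ipc_new, bw_new = row["8380"]
        ipc_old, bw_old = row["8280"]
        gains[dataset] = ((ipc_new / ipc_old - 1.0) * 100.0, bw_new / bw_old)
    return gains


for dataset, (ipc_gain, bw_ratio) in relative_gains(TABLE3).items():
    print(f"{dataset:26s} IPC +{ipc_gain:4.1f}%   bandwidth x{bw_ratio:.2f}")
```

With the rounded table entries, the IPC gains come out at roughly 15, 6, 19, and 21 percent in table order, and the measured bandwidth ratio ranges from about 1.04 for wrf_large to about 2.0 for conus 2.5km.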

Figure 3 shows the power consumption as a box-and-whisker plot while the system was being benchmarked with the four tests shown in Figure 2. Each box indicates the spread of the central 50 percent of the power data, and the central line represents the median power value. The dots show the outlier power values, most of which were recorded during the initialization and finalization phases of the tests.
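For readers unfamiliar with this plot type, the following Python sketch computes the quantities that a box-and-whisker plot draws. The power samples are made-up illustrative values, not measurements from these tests, and the 1.5 x IQR outlier rule is a common convention that we assume here; the blog does not state which rule Figure 3 uses.

```python
import statistics


def box_stats(samples):
    """Return quartiles, median, and outliers using the 1.5 x IQR convention."""
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box: the central 50 percent of the data
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in samples if x < low_fence or x > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical power readings in watts: a low value such as might occur
# during initialization, steady-state values, and one high spike.
power_watts = [100, 500, 510, 520, 530, 540, 550, 900]
print(box_stats(power_watts))
```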

Figure 3: Power used by platform and processor type

The average operating frequency for the 8380, 6338, 8280, and 6252 processors was around 2.9, 2.5, 3.0, and 2.5 GHz, respectively, for all datasets.

Multi-node Scalability

We used eight nodes to evaluate the scalability of WRF. Each node is equipped with Intel 8380 processors and connected with the NVIDIA Mellanox HDR interconnect. All nodes used for benchmarking were connected to the same HDR switch. Table 1 provides details about the server and software that were used for the test. The text on top of each bar in Figure 4 represents the performance on two, four, and eight nodes relative to the performance on a single node.

Figure 4: Multi-node performance of WRF on an Intel 8380 processor model for datasets listed in Table 1

The scalability numbers have been rounded off to a single digit. We observed good scalability with all the datasets listed in Table 1.
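Relative performance of this kind is the single-node time per timestep divided by the N-node time per timestep. The Python sketch below shows the arithmetic together with the parallel efficiency (relative performance divided by node count). The timings are hypothetical placeholders, not results from Figure 4, and rounding to one decimal place is our reading of the rounding used.

```python
def scaling(times_by_nodes):
    """Map node count -> (relative performance, parallel efficiency).

    times_by_nodes: average seconds per timestep, keyed by node count,
    and must include the single-node baseline under key 1.
    """
    base = times_by_nodes[1]
    result = {}
    for nodes, seconds in sorted(times_by_nodes.items()):
        relative = base / seconds
        result[nodes] = (round(relative, 1), round(relative / nodes, 2))
    return result


# Hypothetical seconds per timestep on 1, 2, 4, and 8 nodes.
timings = {1: 8.0, 2: 4.1, 4: 2.1, 8: 1.1}
for nodes, (relative, efficiency) in scaling(timings).items():
    print(f"{nodes} node(s): {relative}x relative performance, efficiency {efficiency}")
```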

Conclusions and recommendations

For WRF, Intel Ice Lake processors demonstrate a significant performance improvement over Intel Cascade Lake processors. WRF simulations scale well with the datasets described in this blog. Scalability might vary depending on the dataset and the node count being tested. For the best performance with WRF, evaluate the impact of the tile size, the MPI process count, and the number of threads per process.
