Processing Six Human 50x WGS per day with 3rd Gen Intel Xeon Scalable Processors
Mon, 24 May 2021 22:07:44 -0000
Overview
Intel® Xeon® Scalable Processors have a track record of consistent, stable performance across many workload types. The new 3rd Generation Intel® Xeon® Scalable Processors, code-named Ice Lake, also perform exceptionally well on a BWA-GATK pipeline. In this study, we tested two Ice Lake processors, the Platinum 8352Y and the Platinum 8358; the test server configuration is summarized in Table 1.
Table 1. Test server configuration

| Component | Details |
|---|---|
| Server | Dell EMC PowerEdge C6520 |
| CPU | 2x Intel® Xeon® Platinum 8352Y, 32 cores, 2.20 GHz base / 3.40 GHz boost, TDP 205 W, 48 MB L3 cache; 2x Intel® Xeon® Platinum 8358, 32 cores, 2.60 GHz base / 3.40 GHz boost, TDP 250 W, 48 MB L3 cache |
| RAM | 512 GB DDR4 (32 GB x 12), 3200 MT/s |
| Operating system | RHEL 8.3 (4.18.0-240.22.1) |
| Filesystem network | NVIDIA Mellanox InfiniBand HDR100 |
| Filesystem | Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage |
| BIOS system profile | Performance Optimized |
| Logical processor | Disabled |
| Virtualization technology | Disabled |
| BWA | |
| Sambamba | |
| Samtools | |
| GATK | |
The test data was chosen from one of Illumina’s Platinum Genomes. ERR194161 was sequenced on an Illumina HiSeq 2000, submitted by Illumina, and can be obtained from EMBL-EBI. The sample identifier for this individual is NA12878. The description on the linked website states a depth of coverage of >30x for this sample; the actual coverage is ~53x.
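For readers who want to reproduce the input, here is a minimal sketch of fetching the paired-end FASTQ files from the EMBL-EBI ENA mirror. The directory layout shown follows the usual ENA convention and is an assumption; verify the exact paths against the ERR194161 record before starting a long download.

```python
# Minimal sketch: download the ERR194161 paired-end FASTQ files from the
# EMBL-EBI ENA FTP mirror. The path layout follows the usual ENA convention;
# verify it against the ERR194161 record first (expect >100 GB for ~50x WGS).
import urllib.request

BASE = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194161"
for name in ("ERR194161_1.fastq.gz", "ERR194161_2.fastq.gz"):
    print(f"downloading {BASE}/{name} ...")
    urllib.request.urlretrieve(f"{BASE}/{name}", name)
```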
Performance evaluation
Single sample performance
Table 2 summarizes the overall runtime and the per-step runtimes of our 9-step BWA-GATK pipeline for a single sample. All runtimes are in hours.
The mapping and sorting step is the only step in Table 2 that exposes the true performance variation across the different CPUs. Rough estimates of the overall performance improvement from the 6248R (6248) to the 8352Y and the 8358 are 3.8 (9.0) % and 4.8 (10.0) %, respectively. The 6248R tests ran on a Dell EMC PowerEdge R640 server with 394 GB of RAM and local storage; the configuration details for the 6248 tests can be found through the embedded link.
The mapping and sorting step shows a decent ~36 % runtime reduction, thanks to the good scalability of BWA. The base recalibration step also benefits from the higher core count of the Ice Lake CPUs.
Table 2. Single-sample runtimes in hours

| Step | 8352Y 32c 2.2 GHz | 8358 32c 2.6 GHz | 6248R 24c 3.0 GHz | 6248 20c 2.5 GHz |
|---|---|---|---|---|
| Mapping and sorting | 3.23 (32) | 3.23 (32) | 5.04 (24) | 5.22 (20) |
| Mark duplicates | 1.16 (13) | 1.16 (13) | 1.14 (13) | 1.29 (13) |
| Generate realigning targets | 0.47 (32) | 0.46 (32) | 0.16 (24) | 0.42 (20) |
| Insertion and deletion realigning | 8.16 (1) | 7.97 (1) | 7.20 (1) | 7.87 (1) |
| Base recalibration | 2.06 (32) | 2.07 (32) | 2.41 (24) | 2.30 (20) |
| HaplotypeCaller | 8.01 (16) | 7.96 (16) | 8.06 (16) | 8.25 (16) |
| Genotype GVCFs | 0.01 (32) | 0.01 (32) | 0.01 (24) | 0.01 (20) |
| Variant recalibration | 0.20 (1) | 0.20 (1) | 0.19 (1) | 0.23 (1) |
| Apply variant recalibration | 0.01 (1) | 0.01 (1) | 0.01 (1) | 0.01 (1) |
| Total runtime (hours) | 23.32 | 23.07 | 24.23 | 25.61 |

Note: The number of cores used for each step is shown in parentheses.
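To make the step names concrete, here is a hedged sketch of how such a 9-step pipeline can be driven. The tool invocations, flags, and the GATK 3.x-style indel realignment are assumptions inferred from the step names and the tool list in Table 1, not the exact tested pipeline.

```python
# Hedged sketch of the 9-step BWA-GATK pipeline driven from Python.
# Tool flags, file names, and the GATK 3.x-style realignment commands are
# illustrative assumptions; substitute the versions listed in Table 1.
import subprocess

def run(cmd: str) -> None:
    # Run through bash with pipefail so a failure anywhere in a pipe aborts.
    subprocess.run(cmd, shell=True, check=True, executable="/bin/bash")

ref, s = "hs37d5.fa", "ERR194161"

# 1. Mapping and sorting (piped to avoid writing intermediate SAM files)
run(f"set -o pipefail; bwa mem -t 32 {ref} {s}_1.fastq.gz {s}_2.fastq.gz"
    f" | samtools sort -@ 32 -o {s}.sorted.bam -")
# 2. Mark duplicates
run(f"sambamba markdup -t 13 {s}.sorted.bam {s}.dedup.bam")
# 3-4. Generate realigning targets, then realign indels (GATK 3.x style)
run(f"gatk3 -T RealignerTargetCreator -nt 32 -R {ref} -I {s}.dedup.bam"
    f" -o {s}.intervals")
run(f"gatk3 -T IndelRealigner -R {ref} -I {s}.dedup.bam"
    f" -targetIntervals {s}.intervals -o {s}.realn.bam")
# 5-9. Base recalibration, HaplotypeCaller, GenotypeGVCFs, variant
# recalibration, and applying the recalibration follow the same pattern.
```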
Multiple-sample performance – throughput
A typical way of running an NGS pipeline is to process multiple samples on each compute node and to use multiple compute nodes to maximize throughput. This time, however, the tests were performed on a single compute node because of the limited number of servers available.
The current pipeline invokes many pipe operations in the first step to minimize the writing of intermediate files. Although this saves roughly a day of runtime and lowers storage usage significantly, the cost of invoking the pipes is quite heavy, which limits the number of samples that can be processed concurrently. Typically, a process fails silently when not enough resources remain to start an additional process.
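Because one stage of a pipe can die without the rest of the pipeline noticing, it helps to build the pipe explicitly and check every stage's exit code. A minimal sketch follows; the bwa/samtools invocation is illustrative rather than the exact tested command line.

```python
# Sketch: build the mapping-and-sorting pipe explicitly so that a failure in
# any stage (for example, bwa killed by the OOM killer) is detected instead
# of passing silently. Tool flags are illustrative.
import subprocess

bwa = subprocess.Popen(
    ["bwa", "mem", "-t", "32", "hs37d5.fa", "r1.fastq.gz", "r2.fastq.gz"],
    stdout=subprocess.PIPE)
sort = subprocess.Popen(
    ["samtools", "sort", "-@", "32", "-o", "sample.sorted.bam", "-"],
    stdin=bwa.stdout)
bwa.stdout.close()  # let bwa receive SIGPIPE if samtools exits early
for name, proc in (("samtools sort", sort), ("bwa mem", bwa)):
    if proc.wait() != 0:
        raise RuntimeError(f"{name} exited with code {proc.returncode}")
```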
As Table 3 shows for the 8352Y, the maximum number of samples that can be processed simultaneously is around 14. Although a 14-sample test was not performed on this CPU, 14 is likely the maximum because two pipelines failed in the 16-sample test. In other words, a throughput of ~6 genomes per day is achievable with the 8352Y. The 8358 also shows two failed processes when 16 samples were processed simultaneously, while its throughput reached ~7 genomes per day (Table 4).
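The throughput tests amount to launching one pipeline per sample concurrently and counting the failures. Here is a sketch of that pattern, with a hypothetical run_pipeline() wrapper standing in for the 9 steps:

```python
# Sketch: process N samples concurrently on one node and count failures,
# mirroring the "Samples failed" rows in Tables 3 and 4.
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_pipeline(sample: str) -> None:
    """Hypothetical wrapper: run all 9 steps for one sample and raise if
    any step exits non-zero."""
    ...

samples = [f"sample_{i:02d}" for i in range(16)]
failed = []
with ProcessPoolExecutor(max_workers=len(samples)) as pool:
    futures = {pool.submit(run_pipeline, s): s for s in samples}
    for fut in as_completed(futures):
        try:
            fut.result()
        except Exception as exc:
            failed.append((futures[fut], exc))
print(f"{len(failed)} of {len(samples)} samples failed")
```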
Table 3. 8352Y runtimes in hours for various numbers of concurrently processed samples

| Step | 1 | 2 | 4 | 8 | 12 | 16 |
|---|---|---|---|---|---|---|
| Samples failed | 0 | 0 | 0 | 0 | 0 | 2 |
| Mapping and sorting | 2.84 | 4.20 | 7.11 | 13.44 | 20.77 | 26.62 |
| Mark duplicates | 1.17 | 1.18 | 1.29 | 1.77 | 2.49 | 3.05 |
| Generate realigning targets | 0.46 | 0.51 | 0.52 | 0.77 | 1.09 | 1.25 |
| Insertion and deletion realigning | 7.94 | 8.04 | 8.02 | 8.00 | 8.26 | 8.11 |
| Base recalibration | 2.00 | 2.16 | 2.83 | 4.41 | 6.04 | 7.20 |
| HaplotypeCaller | 8.00 | 7.93 | 9.10 | 9.24 | 9.31 | 9.26 |
| Genotype GVCFs | 0.02 | 0.02 | 0.03 | 0.02 | 0.03 | 0.04 |
| Variant recalibration | 0.17 | 0.20 | 0.21 | 0.20 | 0.19 | 0.23 |
| Apply variant recalibration | 0.01 | 0.02 | 0.02 | 0.02 | 0.02 | 0.03 |
| Total runtime (hours) | 22.60 | 24.26 | 29.12 | 37.89 | 48.20 | 55.78 |
| Genomes per day | 1.06 | 1.98 | 3.30 | 5.07 | 5.98 | 6.02 |
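The throughput rows in Tables 3 and 4 follow directly from the totals: genomes per day = 24 × (number of successful samples) / total runtime in hours. For example, the 12-sample run on the 8352Y gives 24 × 12 / 48.20 ≈ 5.98, and the 16-sample run, with 2 of the 16 samples failed, gives 24 × 14 / 55.78 ≈ 6.02.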
Table 4. 8358 runtimes in hours for various numbers of concurrently processed samples

| Step | 1 | 8 | 12 | 14 | 16 |
|---|---|---|---|---|---|
| Samples failed | 0 | 0 | 0 | 0 | 2 |
| Mapping and sorting | 2.67 | 11.79 | 18.26 | 22.84 | 24.34 |
| Mark duplicates | 1.16 | 1.51 | 2.18 | 2.59 | 2.65 |
| Generate realigning targets | 0.43 | 0.70 | 0.96 | 1.17 | 1.15 |
| Insertion and deletion realigning | 7.97 | 8.00 | 7.99 | 8.20 | 8.19 |
| Base recalibration | 1.94 | 4.05 | 5.65 | 6.47 | 6.56 |
| HaplotypeCaller | 8.00 | 8.21 | 8.22 | 8.24 | 8.25 |
| Genotype GVCFs | 0.02 | 0.03 | 0.03 | 0.03 | 0.02 |
| Variant recalibration | 0.18 | 0.25 | 0.14 | 0.30 | 0.30 |
| Apply variant recalibration | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 |
| Total runtime (hours) | 22.37 | 34.55 | 43.44 | 49.86 | 51.49 |
| Genomes per day | 1.07 | 5.56 | 6.63 | 6.74 | 6.53 |
Conclusion
The field of NGS data analysis is moving fast in terms of data growth and data variety. Most open-source applications in NGS data analysis cannot take advantage of accelerator technology and do not scale well with the number of cores, so users need to think about how to tackle this problem. One simple way around it is data-level parallelization. Although deciding where to split the data is hard, it is tractable with careful intervention in an existing BWA-GATK pipeline and without diluting statistical power. If each smaller data chunk goes through an individual pipeline on its own core and the results are merged at the end, better single-sample performance becomes possible. This performance gain could lead to higher throughput if the overall runtime is reduced significantly.
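As an illustration of the idea, here is a hedged sketch of scattering the HaplotypeCaller step by chromosome and gathering the per-interval results. The GATK invocations follow standard conventions but are assumptions, not the tested pipeline; in practice, finer intervals and careful boundary handling would be needed.

```python
# Sketch of data-level parallelization for the HaplotypeCaller step:
# scatter by chromosome with -L, run the shards concurrently, then gather.
# Flags follow GATK conventions but are illustrative.
import subprocess
from concurrent.futures import ThreadPoolExecutor

REF, BAM = "hs37d5.fa", "sample.recal.bam"
chroms = [str(c) for c in range(1, 23)] + ["X", "Y"]

def call_region(chrom: str) -> str:
    out = f"sample.{chrom}.g.vcf"
    subprocess.run(
        ["gatk", "HaplotypeCaller", "-R", REF, "-I", BAM,
         "-L", chrom, "-ERC", "GVCF", "-O", out],
        check=True)
    return out

# One shard per chromosome, each on its own core.
with ThreadPoolExecutor(max_workers=len(chroms)) as pool:
    shards = list(pool.map(call_region, chroms))

# Gather: concatenate the per-chromosome GVCFs (order matters).
subprocess.run(["gatk", "GatherVcfs", "-O", "sample.g.vcf"]
               + [arg for s in shards for arg in ("-I", s)], check=True)
```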
Nonetheless, 3rd Generation Intel® Xeon® Scalable Processors, especially the 8352Y and the 8358, are excellent choices for both the highest variant calling throughput and single-sample analysis.