Processing Six Human 50x WGS per day with 3rd Gen Intel Xeon Scalable Processors
Mon, 24 May 2021 22:07:44 -0000
Overview
Intel® Xeon® Scalable Processors have a track record of consistent, stable performance across many workload types. The new 3rd Generation Intel® Xeon® Scalable Processors, code-named Ice Lake, also perform exceptionally well on a BWA-GATK pipeline. In this study, we tested two Ice Lake processors, the Platinum 8352Y and the Platinum 8358; the test server configuration is summarized in Table 1.
Table 1. Test server configuration

| Component | Details |
|---|---|
| Server | Dell EMC PowerEdge C6520 |
| CPU | 2x Intel® Xeon® Platinum 8352Y, 32 cores, 2.20 GHz base / 3.40 GHz boost, TDP 205 W, 48 MB L3 cache; 2x Intel® Xeon® Platinum 8358, 32 cores, 2.60 GHz base / 3.40 GHz boost, TDP 250 W, 48 MB L3 cache |
| RAM | 512 GB DDR4 (32 GB x 12), 3200 MT/s |
| Operating system | RHEL 8.3 (4.18.0-240.22.1) |
| Filesystem network | NVIDIA Mellanox InfiniBand HDR100 |
| Filesystem | Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage |
| BIOS system profile | Performance Optimized |
| Logical processor | Disabled |
| Virtualization technology | Disabled |
| BWA | |
| Sambamba | |
| Samtools | |
| GATK | |
The test data was chosen from one of Illumina’s Platinum Genomes. ERR194161 was sequenced on an Illumina HiSeq 2000, submitted by Illumina, and can be obtained from EMBL-EBI. The sample identifier for this individual is NA12878. The description on the linked website states a depth of coverage of >30x for this sample; the actual coverage is ~53x.
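For readers who want to reproduce the input, here is a minimal sketch of fetching the paired-end FASTQ files from the EMBL-EBI ENA mirror. The directory layout shown follows the usual ENA convention and is an assumption; verify the exact paths against the ERR194161 record before starting a long download.

```python
# Minimal sketch: download the ERR194161 paired-end FASTQ files from the
# EMBL-EBI ENA FTP mirror. The path layout follows the usual ENA convention;
# verify it against the ERR194161 record first (expect >100 GB for ~50x WGS).
import urllib.request

BASE = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194161"
for name in ("ERR194161_1.fastq.gz", "ERR194161_2.fastq.gz"):
    print(f"downloading {BASE}/{name} ...")
    urllib.request.urlretrieve(f"{BASE}/{name}", name)
```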
Performance evaluation
Single sample performance
Table 2 summarizes the overall runtime and the per-step runtimes of our 9-step BWA-GATK pipeline for a single sample. All runtimes are in hours.
The mapping and sorting step is the only step in Table 2 that exposes the true performance variation across the different CPUs. Rough estimates of the overall performance improvement from the 6248R (6248) to the 8352Y and the 8358 are 3.8 (9.0) % and 4.8 (10.0) %, respectively. The 6248R tests ran on a Dell EMC PowerEdge R640 server with 394 GB of RAM and local storage; the configuration details for the 6248 tests can be found through the embedded link.
The mapping and sorting step shows a decent ~36 % runtime reduction, thanks to the good scalability of BWA. The base recalibration step also benefits from the higher core count of the Ice Lake CPUs.
Table 2. Single-sample runtimes in hours

| Step | 8352Y 32c 2.2 GHz | 8358 32c 2.6 GHz | 6248R 24c 3.0 GHz | 6248 20c 2.5 GHz |
|---|---|---|---|---|
| Mapping and sorting | 3.23 (32) | 3.23 (32) | 5.04 (24) | 5.22 (20) |
| Mark duplicates | 1.16 (13) | 1.16 (13) | 1.14 (13) | 1.29 (13) |
| Generate realigning targets | 0.47 (32) | 0.46 (32) | 0.16 (24) | 0.42 (20) |
| Insertion and deletion realigning | 8.16 (1) | 7.97 (1) | 7.20 (1) | 7.87 (1) |
| Base recalibration | 2.06 (32) | 2.07 (32) | 2.41 (24) | 2.30 (20) |
| HaplotypeCaller | 8.01 (16) | 7.96 (16) | 8.06 (16) | 8.25 (16) |
| Genotype GVCFs | 0.01 (32) | 0.01 (32) | 0.01 (24) | 0.01 (20) |
| Variant recalibration | 0.20 (1) | 0.20 (1) | 0.19 (1) | 0.23 (1) |
| Apply variant recalibration | 0.01 (1) | 0.01 (1) | 0.01 (1) | 0.01 (1) |
| Total runtime (hours) | 23.32 | 23.07 | 24.23 | 25.61 |

Note: The number of cores used for each step is shown in parentheses.
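To make the step names concrete, here is a hedged sketch of how such a 9-step pipeline can be driven. The tool invocations, flags, and the GATK 3.x-style indel realignment are assumptions inferred from the step names and the tool list in Table 1, not the exact tested pipeline.

```python
# Hedged sketch of the 9-step BWA-GATK pipeline driven from Python.
# Tool flags, file names, and the GATK 3.x-style realignment commands are
# illustrative assumptions; substitute the versions listed in Table 1.
import subprocess

def run(cmd: str) -> None:
    # Run through bash with pipefail so a failure anywhere in a pipe aborts.
    subprocess.run(cmd, shell=True, check=True, executable="/bin/bash")

ref, s = "hs37d5.fa", "ERR194161"

# 1. Mapping and sorting (piped to avoid writing intermediate SAM files)
run(f"set -o pipefail; bwa mem -t 32 {ref} {s}_1.fastq.gz {s}_2.fastq.gz"
    f" | samtools sort -@ 32 -o {s}.sorted.bam -")
# 2. Mark duplicates
run(f"sambamba markdup -t 13 {s}.sorted.bam {s}.dedup.bam")
# 3-4. Generate realigning targets, then realign indels (GATK 3.x style)
run(f"gatk3 -T RealignerTargetCreator -nt 32 -R {ref} -I {s}.dedup.bam"
    f" -o {s}.intervals")
run(f"gatk3 -T IndelRealigner -R {ref} -I {s}.dedup.bam"
    f" -targetIntervals {s}.intervals -o {s}.realn.bam")
# 5-9. Base recalibration, HaplotypeCaller, GenotypeGVCFs, variant
# recalibration, and applying the recalibration follow the same pattern.
```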
Multiple-sample performance – throughput
A typical way of running an NGS pipeline is to process multiple samples on each compute node and to use multiple compute nodes to maximize throughput. This time, however, the tests were performed on a single compute node because of the limited number of servers available.
The current pipeline invokes many pipe operations in the first step to minimize the writing of intermediate files. Although this saves roughly a day of runtime and lowers storage usage significantly, the cost of invoking the pipes is quite heavy, which limits the number of samples that can be processed concurrently. Typically, a process fails silently when not enough resources remain to start an additional process.
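Because one stage of a pipe can die without the rest of the pipeline noticing, it helps to build the pipe explicitly and check every stage's exit code. A minimal sketch follows; the bwa/samtools invocation is illustrative rather than the exact tested command line.

```python
# Sketch: build the mapping-and-sorting pipe explicitly so that a failure in
# any stage (for example, bwa killed by the OOM killer) is detected instead
# of passing silently. Tool flags are illustrative.
import subprocess

bwa = subprocess.Popen(
    ["bwa", "mem", "-t", "32", "hs37d5.fa", "r1.fastq.gz", "r2.fastq.gz"],
    stdout=subprocess.PIPE)
sort = subprocess.Popen(
    ["samtools", "sort", "-@", "32", "-o", "sample.sorted.bam", "-"],
    stdin=bwa.stdout)
bwa.stdout.close()  # let bwa receive SIGPIPE if samtools exits early
for name, proc in (("samtools sort", sort), ("bwa mem", bwa)):
    if proc.wait() != 0:
        raise RuntimeError(f"{name} exited with code {proc.returncode}")
```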
As Table 3 shows for the 8352Y, the maximum number of samples that can be processed simultaneously is around 14. Although a 14-sample test was not performed on this CPU, 14 is likely the maximum because two pipelines failed in the 16-sample test. In other words, a throughput of ~6 genomes per day is achievable with the 8352Y. The 8358 also shows two failed processes when 16 samples were processed simultaneously, while its throughput reached ~7 genomes per day (Table 4).
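The throughput tests amount to launching one pipeline per sample concurrently and counting the failures. Here is a sketch of that pattern, with a hypothetical run_pipeline() wrapper standing in for the 9 steps:

```python
# Sketch: process N samples concurrently on one node and count failures,
# mirroring the "Samples failed" rows in Tables 3 and 4.
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_pipeline(sample: str) -> None:
    """Hypothetical wrapper: run all 9 steps for one sample and raise if
    any step exits non-zero."""
    ...

samples = [f"sample_{i:02d}" for i in range(16)]
failed = []
with ProcessPoolExecutor(max_workers=len(samples)) as pool:
    futures = {pool.submit(run_pipeline, s): s for s in samples}
    for fut in as_completed(futures):
        try:
            fut.result()
        except Exception as exc:
            failed.append((futures[fut], exc))
print(f"{len(failed)} of {len(samples)} samples failed")
```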
Table 3. 8352Y runtimes in hours for various numbers of concurrently processed samples

| Step | 1 | 2 | 4 | 8 | 12 | 16 |
|---|---|---|---|---|---|---|
| Samples failed | 0 | 0 | 0 | 0 | 0 | 2 |
| Mapping and sorting | 2.84 | 4.20 | 7.11 | 13.44 | 20.77 | 26.62 |
| Mark duplicates | 1.17 | 1.18 | 1.29 | 1.77 | 2.49 | 3.05 |
| Generate realigning targets | 0.46 | 0.51 | 0.52 | 0.77 | 1.09 | 1.25 |
| Insertion and deletion realigning | 7.94 | 8.04 | 8.02 | 8.00 | 8.26 | 8.11 |
| Base recalibration | 2.00 | 2.16 | 2.83 | 4.41 | 6.04 | 7.20 |
| HaplotypeCaller | 8.00 | 7.93 | 9.10 | 9.24 | 9.31 | 9.26 |
| Genotype GVCFs | 0.02 | 0.02 | 0.03 | 0.02 | 0.03 | 0.04 |
| Variant recalibration | 0.17 | 0.20 | 0.21 | 0.20 | 0.19 | 0.23 |
| Apply variant recalibration | 0.01 | 0.02 | 0.02 | 0.02 | 0.02 | 0.03 |
| Total runtime (hours) | 22.60 | 24.26 | 29.12 | 37.89 | 48.20 | 55.78 |
| Genomes per day | 1.06 | 1.98 | 3.30 | 5.07 | 5.98 | 6.02 |
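The throughput rows in Tables 3 and 4 follow directly from the totals: genomes per day = 24 × (number of successful samples) / total runtime in hours. For example, the 12-sample run on the 8352Y gives 24 × 12 / 48.20 ≈ 5.98, and the 16-sample run, with 2 of the 16 samples failed, gives 24 × 14 / 55.78 ≈ 6.02.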
Table 4. 8358 runtimes in hours for various numbers of concurrently processed samples

| Step | 1 | 8 | 12 | 14 | 16 |
|---|---|---|---|---|---|
| Samples failed | 0 | 0 | 0 | 0 | 2 |
| Mapping and sorting | 2.67 | 11.79 | 18.26 | 22.84 | 24.34 |
| Mark duplicates | 1.16 | 1.51 | 2.18 | 2.59 | 2.65 |
| Generate realigning targets | 0.43 | 0.70 | 0.96 | 1.17 | 1.15 |
| Insertion and deletion realigning | 7.97 | 8.00 | 7.99 | 8.20 | 8.19 |
| Base recalibration | 1.94 | 4.05 | 5.65 | 6.47 | 6.56 |
| HaplotypeCaller | 8.00 | 8.21 | 8.22 | 8.24 | 8.25 |
| Genotype GVCFs | 0.02 | 0.03 | 0.03 | 0.03 | 0.02 |
| Variant recalibration | 0.18 | 0.25 | 0.14 | 0.30 | 0.30 |
| Apply variant recalibration | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 |
| Total runtime (hours) | 22.37 | 34.55 | 43.44 | 49.86 | 51.49 |
| Genomes per day | 1.07 | 5.56 | 6.63 | 6.74 | 6.53 |
Conclusion
The field of NGS data analysis is moving fast in terms of data growth and data variety. Most open-source applications in NGS data analysis cannot take advantage of accelerator technology and do not scale well with the number of cores, so users need to think about how to tackle this problem. One simple way around it is data-level parallelization. Although deciding where to split the data is hard, it is tractable with careful intervention in an existing BWA-GATK pipeline and without diluting statistical power. If each smaller data chunk goes through an individual pipeline on its own core and the results are merged at the end, better single-sample performance becomes possible. This performance gain could lead to higher throughput if the overall runtime is reduced significantly.
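As an illustration of the idea, here is a hedged sketch of scattering the HaplotypeCaller step by chromosome and gathering the per-interval results. The GATK invocations follow standard conventions but are assumptions, not the tested pipeline; in practice, finer intervals and careful boundary handling would be needed.

```python
# Sketch of data-level parallelization for the HaplotypeCaller step:
# scatter by chromosome with -L, run the shards concurrently, then gather.
# Flags follow GATK conventions but are illustrative.
import subprocess
from concurrent.futures import ThreadPoolExecutor

REF, BAM = "hs37d5.fa", "sample.recal.bam"
chroms = [str(c) for c in range(1, 23)] + ["X", "Y"]

def call_region(chrom: str) -> str:
    out = f"sample.{chrom}.g.vcf"
    subprocess.run(
        ["gatk", "HaplotypeCaller", "-R", REF, "-I", BAM,
         "-L", chrom, "-ERC", "GVCF", "-O", out],
        check=True)
    return out

# One shard per chromosome, each on its own core.
with ThreadPoolExecutor(max_workers=len(chroms)) as pool:
    shards = list(pool.map(call_region, chroms))

# Gather: concatenate the per-chromosome GVCFs (order matters).
subprocess.run(["gatk", "GatherVcfs", "-O", "sample.g.vcf"]
               + [arg for s in shards for arg in ("-I", s)], check=True)
```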
Nonetheless, 3rd Generation Intel® Xeon® Scalable Processors, especially the 8352Y and the 8358, are excellent choices for both the highest variant calling throughput and single-sample analysis.