BWA-GATK Pipeline Performance on Dell EMC PowerEdge R6525 Server

Tue, 30 Mar 2021 18:34:08 -0000

Kihoon Yoon

Overview

We have been speculating that AMD Milan, whose Zen 3 cores allow more cores to share the same L3 cache, could perform better for Next Generation Sequencing (NGS) applications. Compared with its predecessor, AMD EPYC Rome, the number of cores sharing an L3 cache doubles from 4 to 8 on the 64-core processor model. In addition, the L1 and L2 caches are upgraded with new prefetchers, and memory bandwidth is improved.
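
To make the cache-sharing change concrete, the short sketch below works out how many L3 groups a 64-core part has and how much L3 each group sees. It assumes the 256 MB L3 figure listed in Table 1 is split evenly across the groups; the function name is illustrative.

```python
def l3_layout(total_cores, cores_sharing_l3, total_l3_mb):
    """Return (number of L3 groups, MB of L3 per group), assuming an even split."""
    groups = total_cores // cores_sharing_l3
    return groups, total_l3_mb / groups


# 64-core models with 256 MB of L3 (values taken from Table 1)
rome = l3_layout(total_cores=64, cores_sharing_l3=4, total_l3_mb=256)
milan = l3_layout(total_cores=64, cores_sharing_l3=8, total_l3_mb=256)

print("Rome :", rome)   # 16 groups of 4 cores
print("Milan:", milan)  # 8 groups of 8 cores
```

Under this assumption, a single thread on Milan can reach twice as much L3 as on Rome, even though the total cache size is unchanged.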

Because Milan and Rome share the same SP3 socket, the Dell EMC PowerEdge R6525 was selected for the case study, which minimizes system-to-system variation. The test server configuration is summarized in Table 1.

Table 1. Tested compute node configuration

Dell EMC PowerEdge R6525
CPU	Tested AMD Milan: 2x 7763 (Milan), 64 Cores, 2.45 GHz – 3.5 GHz Base-Boost, TDP 280 W, 256 MB L3 Cache; 2x 7713 (Milan), 64 Cores, 2.0 GHz – 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 Cache; 7543 (Milan), 32 Cores, 2.8 GHz – 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 Cache. Tested AMD Rome: 7702 (Rome), 64 Cores, 2.0 GHz – 3.35 GHz Base-Boost, TDP 200 W, 256 MB L3 Cache
RAM	DDR4 256 GB (16 GB x 16) 3200 MT/s
OS	RHEL 8.3 (4.18.0-240.el8.x86_64)
Filesystem Network	Mellanox InfiniBand HDR100
Filesystem	Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage
BIOS System Profile	Performance Optimized
Logical Processor	Disabled
Virtualization Technology	Disabled
BWA	0.7.15-r1140
Sambamba	0.7.0
Samtools	1.6
GATK	3.6-0-g89b7209

The test data was chosen from one of Illumina’s Platinum Genomes. ERR194161 was sequenced on an Illumina HiSeq 2000, was submitted by Illumina, and can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. The data description on the linked website states that this sample has a >30x depth of coverage; it actually reaches ~53x.

Performance Evaluation

Characterizing steps in BWA-GATK Pipeline

In a typical BWA-GATK pipeline, there are multiple steps, and each step consists of various applications that behave differently. As shown in Table 2, the applications in some steps do not support multi-threading. These steps are problematic because there are only a few ways to improve their performance.

Table 2. The steps in BWA-GATK pipeline and tools

Steps	Applications	Multi-threading support
Mapping & Sorting	BWA, samtools, sambamba	Yes
Mark Duplicates	Sambamba	Yes
Generate Realigning Targets	GATK RealignerTargetCreator	Yes
Insertion/Deletion Realigning	GATK IndelRealigner	No
Base Recalibration	GATK BaseRecalibrator	Yes
HaplotypeCaller	GATK HaplotypeCaller	Yes
Genotype GVCFs	GATK GenotypeGVCFs	Yes
Variant Recalibration	GATK VariantRecalibrator	No
Apply Variant Recalibration	GATK ApplyRecalibration	No

Single-threaded applications, especially those in the Variant Recalibration and Apply Variant Recalibration steps, show no runtime variation because the algorithms are deterministic and the inputs for these steps are small. Hence, these two steps are not reported in Figure 1. Genotype GVCFs is also omitted from Figure 1 for a similar reason, although it supports multi-threading. The first step, Mapping & Sorting, scales as the number of cores increases (Figure 1 (a)).

Burrows-Wheeler Aligner (BWA) is one of the most popular short sequence aligners for non-gapped aligning analysis. BWA scales well up to 32 cores; beyond that, CPU usage drops dramatically and the runtime improvement becomes marginal. Using more than 80 cores for this step is a waste of resources.

Sambamba, which is compatible with Picard, is used for marking duplicate reads. Its behavior is plotted in Figure 1 (b). Because of its highly parallel design, memory consumption increases as more hash tables are created for additional threads. Remarkably, a 50x human whole genome sequence (WGS) is not big enough for this well-designed software to use more than 13 cores.

After the Mark Duplicates step, the Genome Analysis Toolkit (GATK), written in Java, plays a critical role both in the performance measurements and in producing the results. These steps do not scale at all, as shown in Figure 1 (c), (d), (e), and (f). A better approach to handling this behavior in multi-core and multi-socket environments will be discussed in future work.

Figure 1. Runtimes of the 7702 with various numbers of cores for each step. Milan CPUs show similar behavior.

Single Sample Performance

Socket to Socket Comparison

This test is not a fair comparison because the majority of steps do not take advantage of all the cores, except on the 32-core 7543. However, this comparison helps to decide which CPU could be best for the throughput test.

Table 3 summarizes the overall runtimes for the BWA-GATK pipeline; it is hard to say which CPU is better in terms of total runtime. Many more tests are required to resolve the performance differences in the GATK steps. Also, the results for the 7502 and 7402 come from previous tests in different environments.

The Mapping & Sorting step is the only step in Table 3 in which we can see true performance variation across different CPUs. A rough estimate of the performance improvement is 7% from the 7702 to the 7763 and 5% from the 7702 to the 7713.

Surprisingly, the Base Recalibration step showed results similar to the Mapping & Sorting step, with improvements of 8% and 3%, respectively.
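
The percentages quoted above can be reproduced from the Table 3 runtimes. The sketch below assumes that "improvement" means the relative reduction in runtime against the 7702 baseline; the function name is illustrative.

```python
def runtime_reduction_pct(baseline_hours, new_hours):
    """Relative runtime reduction versus a baseline, in percent."""
    return (baseline_hours - new_hours) / baseline_hours * 100.0


# Runtimes in hours, copied from Table 3
mapping_sorting = {"7763": 2.44, "7713": 2.49, "7702": 2.63}
base_recalibration = {"7763": 2.27, "7713": 2.38, "7702": 2.46}

for step, runtimes in (("Mapping & Sorting", mapping_sorting),
                       ("Base Recalibration", base_recalibration)):
    for cpu in ("7763", "7713"):
        pct = runtime_reduction_pct(runtimes["7702"], runtimes[cpu])
        print(f"{step}: 7702 -> {cpu}: {pct:.1f}% shorter runtime")
```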

Table 3. BWA-GATK performance comparisons between Milan and Rome. The number of cores used for the test is parenthesized.

Steps	Runtime (hours)
Steps	AMD 7763 64c 2.45GHz	AMD 7713 64c 2.0GHz	AMD 7543 32c 2.8GHz	AMD 7702 64c 2.0GHz	AMD 7502 32c 2.5GHz	AMD 7402 24c 3.0GHz
Mapping & Sorting	2.44 (64)	2.49 (64)	3.69 (32)	2.63 (64)	4.68 (32)	5.73 (24)
Mark Duplicates	1.07 (13)	1.10 (13)	1.01 (13)	1.01 (13)	0.93 (13)	0.94 (13)
Generate Realigning Targets	0.55 (32)	0.56 (32)	0.50 (32)	0.58 (32)	0.45 (32)	0.44 (32)
Insertion/Deletion Realigning	8.73 (1)	9.13 (1)	7.73 (1)	8.78 (1)	8.30 (1)	8.21 (1)
Base Recalibration	2.27 (32)	2.38 (32)	2.17 (32)	2.46 (32)	2.52 (32)	2.67 (24)
HaplotypeCaller	10.20 (16)	10.57 (16)	9.15 (16)	9.02 (16)	9.33 (16)	9.05 (16)
Genotype GVCFs	0.02 (32)	0.02 (32)	0.01 (32)	0.02 (32)	0.01 (32)	0.01 (24)
Variant Recalibration	0.31 (1)	0.20 (1)	0.17 (1)	0.12 (1)	0.21 (1)	0.13 (1)
Apply Variant Recalibration	0.01 (1)	0.01 (1)	0.01 (1)	0.01 (1)	0.01 (1)	0.01 (1)
Total Runtime (hours)	25.59	26.47	24.44	24.64	26.46	27.25

Multiple Sample Performance - Throughput

A typical way of running an NGS pipeline is to process multiple samples on a compute node and use multiple compute nodes to maximize the throughput. However, this time the tests were performed on a single compute node because of the limited number of servers available at the moment.

The current pipeline uses a large number of pipe operations in the first step to minimize the writing of intermediate files. Although this saves a day of runtime and lowers storage usage significantly, the cost of invoking pipes is quite heavy and limits the number of samples that can be processed concurrently. Typically, a process fails silently when there are not enough resources left to start an additional process.

However, the failures experienced during this study are quite different from previous observations. Of 10 pipelines started on an R6525 with 2x 7763, only 6 on average run to completion with the 50x human WGS. Four pipelines fail with broken pipe errors, which suggests a problem with file operations. The BeeGFS storage used for the test is designed for high capacity, with a theoretical sequential write bandwidth of 25 GB/s. However, roughly 16 GB/s is achievable when this shared storage is not under heavy load. This is not an ideal environment for benchmarking; however, the results are quite helpful for seeing how these systems perform in real life.

As shown in Table 4, the maximum number of samples that can be processed at the same time is around 4 or 5, and a throughput of ~4.79 50x human whole genomes per day is achievable in the current environment.
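
Because piped stages can fail without an obvious error, it helps to collect the exit code of every stage rather than only the last one. The following is a minimal sketch of that idea, not the pipeline used in this study: `mapping_stages`, `run_piped`, and all file names are hypothetical, and the only tool options shown are the thread options `bwa mem -t` and `samtools sort -@` together with `samtools sort -o`.

```python
import subprocess
import sys


def mapping_stages(reference, reads_1, reads_2, out_bam, threads):
    """Build argv lists for a two-stage 'align | sort' pipe (illustrative only)."""
    return [
        ["bwa", "mem", "-t", str(threads), reference, reads_1, reads_2],
        ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"],
    ]


def run_piped(stages):
    """Run argv lists as one pipeline; return (exit codes, last stage's stdout)."""
    procs = []
    upstream = None
    for argv in stages:
        proc = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
        if upstream is not None:
            # Close the parent's copy so the upstream stage sees a broken
            # pipe if this stage exits early.
            upstream.close()
        upstream = proc.stdout
        procs.append(proc)
    output, _ = procs[-1].communicate()
    codes = [p.wait() for p in procs]
    return codes, output


def failed_stage_indices(codes):
    """Indices of stages that exited with a non-zero status."""
    return [i for i, code in enumerate(codes) if code != 0]


if __name__ == "__main__":
    # Stand-in stages so the sketch runs without BWA or samtools installed.
    demo = [
        [sys.executable, "-c", "print('acgt')"],
        [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    ]
    codes, out = run_piped(demo)
    print(codes, out.strip(), failed_stage_indices(codes))
```

Passing the result of `mapping_stages(...)` to `run_piped` would launch the real tools; checking `failed_stage_indices` afterwards shows which stage of the pipe broke.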

Table 4. Throughput test for Milan 7763

Steps	Runtime (hours)
Steps	1 Sample	2 Samples	4 Samples	6 Samples	10 Samples
Number of Samples Failed	0	0	0	1	4
Mapping & Sorting	2.44	2.91	4.33	5.86	8.33
Mark Duplicates	1.07	1.40	1.69	1.31	5.51
Generate Realigning Targets	0.55	0.88	1.77	0.50	2.07
Insertion/Deletion Realigning	8.73	8.97	8.92	8.92	9.70
Base Recalibration	2.27	2.50	2.79	3.26	3.67
HaplotypeCaller	10.20	10.57	10.27	9.91	9.96
Genotype GVCFs	0.02	0.11	0.10	0.10	0.15
Variant Recalibration	0.31	0.25	0.20	0.21	0.36
Apply Variant Recalibration	0.01	0.02	0.01	0.01	0.03
Total Runtime (hours)	25.59	27.62	30.08	30.08	39.79
Genomes per day	0.94	1.74	4.79	3.99	3.62
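
The "Genomes per day" row can be approximated from the other rows of Table 4. The sketch below assumes throughput is the number of samples that finished (started minus failed) scaled to 24 hours of the total runtime; this formula is an assumption, not one stated in the text. It reproduces the 1, 2, 6, and 10 sample columns, while the four-sample column works out to roughly 3.19 under this formula, which does not match the 4.79 listed in the table.

```python
def genomes_per_day(started, failed, total_runtime_hours):
    """Completed samples scaled to a 24-hour day (assumed formula)."""
    return (started - failed) * 24.0 / total_runtime_hours


# (samples started, samples failed, total runtime in hours) from Table 4
table_4 = [
    (1, 0, 25.59),
    (2, 0, 27.62),
    (4, 0, 30.08),
    (6, 1, 30.08),
    (10, 4, 39.79),
]

for started, failed, hours in table_4:
    rate = genomes_per_day(started, failed, hours)
    print(f"{started:>2} samples: {rate:.2f} genomes per day")
```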

Conclusion

The field of NGS data analysis has been moving fast in terms of data growth and data variety. However, the community has not done much work to adopt newly available technologies such as accelerators. Instead of seeing code quality improve, the community is faced with analyzing data without multi-threaded processing, since GATK version 4 and later no longer supports multi-threading, while the number of cores per CPU is increasing fast.

It is time for users to think about how this problem can be tackled. One simple way to avoid it is data-level parallelization. Although deciding where to split the data is difficult, it is certainly tractable with careful changes to an existing BWA-GATK pipeline, without diluting statistical power. If each smaller data chunk goes through an individual pipeline on its own core and the results are merged at the end, it could be possible to achieve better performance on a single sample. This performance gain could lead to higher throughput if the overall runtime is reduced significantly.

Nonetheless, the Milan 7763 and 7713 are excellent candidates for covering both current multi-threading-based pipelines and future pipelines driven by data-level parallelization, which can use the additional cores.
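
The scatter-gather idea can be sketched as follows. This is only an illustration under simplifying assumptions: data is split into fixed-size genomic intervals, boundary effects between chunks are ignored, and `process_chunk` is a hypothetical placeholder for launching one single-threaded pipeline instance on one interval. A thread pool is used here because, in a real setup, each worker would simply wait on an external process.

```python
from concurrent.futures import ThreadPoolExecutor


def split_intervals(contig_lengths, chunk_size):
    """Yield (contig, start, end) 1-based inclusive intervals covering each contig."""
    for contig, length in contig_lengths.items():
        start = 1
        while start <= length:
            end = min(start + chunk_size - 1, length)
            yield (contig, start, end)
            start = end + 1


def process_chunk(interval):
    """Placeholder for running one pipeline instance on one interval."""
    contig, start, end = interval
    return {"interval": interval, "bases": end - start + 1}


def run_scatter_gather(contig_lengths, chunk_size, workers):
    """Scatter intervals over workers, then gather results in genomic order."""
    intervals = list(split_intervals(contig_lengths, chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of the inputs.
        return list(pool.map(process_chunk, intervals))


if __name__ == "__main__":
    # Toy contig lengths, not real chromosome sizes.
    for result in run_scatter_gather({"chrA": 250, "chrB": 100}, 100, workers=4):
        print(result)
```

The merge step here is just list concatenation in genomic order; a real pipeline would merge the per-interval output files instead.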

Tuxedo Pipeline Performance on Dell EMC PowerEdge R6525

Overview

Gene expression analysis is as important as identifying Single Nucleotide Polymorphism (SNP), insertion/deletion (indel), or chromosomal restructuring. Ultimately, all physiological and biochemical events depend on the final gene expression products, proteins. Although most mammals have an additional controlling layer before protein expression, knowing how many transcripts exist in a system helps to characterize the biochemical status of a cell. Ideally, technology would enable us to quantify all proteins in a cell, which would significantly advance the progress of Life Science. However, we are far from achieving this.

This blog provides the test results of one popular RNA-Seq data analysis pipeline known as the Tuxedo pipeline (1). The Tuxedo pipeline suite offers a set of tools for analyzing a variety of RNA-Seq data, including short-read mapping, identification of splice junctions, transcript and isoform detection, differential expression, visualizations, and quality control metrics. The tested workflow is a differentially expressed gene (DEG) analysis, and the detailed steps in the pipeline are shown in Figure 1.

Figure 1. Updated tuxedo pipeline with cuffquant step

A single-node study was performed with AMD EPYC 7002 series (Rome) and AMD EPYC 7003 series (Milan) processors on a Dell EMC PowerEdge R6525 server. The configurations of the test system are summarized in Table 1.

Table 1. Tested compute node configuration

Dell EMC PowerEdge R6525
CPU	Tested AMD Milan: 2x 7763 (Milan), 64 Cores, 2.45 GHz – 3.5 GHz Base-Boost, TDP 280 W, 256 MB L3 Cache; 2x 7713 (Milan), 64 Cores, 2.0 GHz – 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 Cache; 7543 (Milan), 32 Cores, 2.8 GHz – 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 Cache. Tested AMD Rome: 7702 (Rome), 64 Cores, 2.0 GHz – 3.35 GHz Base-Boost, TDP 200 W, 256 MB L3 Cache
RAM	DDR4 256 GB (16 GB x 16) 3200 MT/s
Operating system	RHEL 8.3 (4.18.0-240.el8.x86_64)
Interconnect	Mellanox InfiniBand HDR100
Filesystem	Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage
BIOS system profile	Performance optimized
Logical processor	Disabled
Virtualization technology	Disabled
tophat	2.1.1
bowtie2	2.2.5
R	3.6
bioconductor cummerbund	2.26.0

A performance study of the RNA-Seq pipeline is not trivial because the workflow requires input files that are non-identical yet similar in size. Hence, 185 RNA-Seq paired-end read datasets were collected from a public data repository. All the read data files contain around 25 Million Fragments (MF) and have similar read lengths. The samples for each test are randomly selected from the pool of 185 paired-end read files. Although these test data do not have any biological meaning, their high level of noise certainly puts the tests in a worst-case scenario.
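
Random selection without replacement from the pool can be sketched as below. The sample identifiers and the helper name are hypothetical, and the optional seed is an addition that only makes a draw repeatable.

```python
import random


def pick_samples(pool, n, seed=None):
    """Draw n distinct samples from the pool; a seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(pool, n)


# Hypothetical identifiers standing in for the 185 paired-end read files
pool = [f"sample_{i:03d}" for i in range(1, 186)]

for n in (2, 4, 8):
    print(n, pick_samples(pool, n, seed=n))
```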

Performance Evaluation

Throughput test - Single pipeline with more than two samples, biological, and technical duplicates

Typical RNA-Seq studies consist of multiple samples, sometimes hundreds of different samples: normal versus disease, or untreated versus treated. These samples tend to have a high level of noise for biological reasons; hence, the analysis requires rigorous data preprocessing.

Various numbers of samples were processed, with different RNA-Seq data selected from the 185 paired-end read dataset, to see how much data a single node can process. Typically, the runtime of the Tuxedo pipeline increases with the number of samples. However, as shown in Figure 2, the runtimes of the two-sample tests on the 7713 are higher than the runtimes of the four-sample tests. The standard error from five repeated runs does not overlap with the four- and eight-sample results. Interference from other users may cause this large variance. The current testing environment, especially a shared file system designed for large capacity, is not ideal for a Next Generation Sequencing (NGS) data analysis benchmark.

Figure 2. Runtime comparisons among various AMD EPYC 7003 Series processors: Standard error is estimated from the sample standard deviation (STDEV.S function in Excel)
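
For reference, the standard error described in the caption is the sample standard deviation divided by the square root of the number of runs. The sketch below uses Python's `statistics.stdev`, which computes the sample standard deviation; the runtimes are made-up placeholders, not measurements from this study.

```python
import math
import statistics


def standard_error(runtimes):
    """Sample standard deviation divided by sqrt(n)."""
    return statistics.stdev(runtimes) / math.sqrt(len(runtimes))


# Placeholder runtimes (hours) for five repeated runs; not measured data
example_runs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(f"mean = {statistics.mean(example_runs):.2f} h, "
      f"standard error = {standard_error(example_runs):.3f} h")
```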

The eight sample test results show that AMD Milan processors perform better than one of the Rome processors (7702) in a higher workload.

Conclusion

More tests are still required to gain better insight into AMD Milan processors for NGS data analysis. Unfortunately, the tests could not exceed eight samples because of storage limitations. However, there seems to be plenty of room for higher throughput when processing more than eight samples together. In the eight-sample tests for the Tuxedo pipeline shown in Figure 2, the AMD Milan 7763 performed 20% better than the AMD Rome 7702, and the AMD Milan 7713 performed 18% better.

The Dell PowerEdge C6615: Maximizing Value and Minimizing TCO for Dense Compute and Scale-out Workloads

Mon, 02 Oct 2023 21:35:01 -0000

In the ever-evolving landscape of data centers and IT infrastructure, meeting the demands of scale-out workloads is a continuous challenge. Organizations seek solutions that not only provide superior performance but also optimize Total Cost of Ownership (TCO).

Enter the new Dell PowerEdge C6615, a modular node designed to address these challenges with innovative solutions. Let's delve into the key features and benefits of this groundbreaking addition to the Dell PowerEdge portfolio.

Industry challenges

Maximizing rack utilization: One of the primary challenges in the data center world is maximizing rack utilization. The Dell PowerEdge C6615 addresses this by offering dense compute options.
Cutting-edge processors: High-performance processors are crucial for scalability and security. The C6615 is powered by a 4th Generation AMD EPYC 8004 series processor, ensuring top-tier performance.
Total Cost of Ownership (TCO): TCO is a critical consideration that encompasses power and cooling efficiency, licensing costs, and seamless integration with existing data center infrastructure. The C6615 is designed to reduce TCO significantly.

Introducing the Dell PowerEdge C6615

The Dell PowerEdge C6615 is a modular node designed to revolutionize data center infrastructure. Here are some key highlights:

Price-performance ratio: The C6615 offers outstanding price per watt for scale-out workloads, with up to a 315% improvement compared to a one-socket (1S) server with AMD EPYC 9004 Series server processor.
Optimized thermal solution: It features an optimized thermal solution that allows for air-cooling configurations with up to 53% improved cooling performance compared to the previous generation chassis.
Density-optimized compute: The C6615's architecture is tailored for scale-out WebTech workloads, offering exceptional performance with reduced TCO.
High-speed NVMe storage: It provides high-speed NVMe storage for applications with intensive IOPS requirements, ensuring efficient performance.
Efficient scalability: With 40% more cores per rack compared to the AMD EPYC 9004 Series server processors, the C6615 allows for quicker and more efficient scalability.
SmartNIC: It includes a SmartNIC with hardware-accelerated networking and storage, saving CPU cycles and enhancing security.

Key features

To maximize efficiency and reduce environmental impact, the PowerEdge C6615 incorporates several key features:

Power and thermal efficiency: The 2U chassis with four nodes enhances power and thermal efficiency, eliminating the need for liquid cooling.
Flexible I/O options: It supports up to two PCIe Gen5 slots and one x16 PCIe Gen5 OCP 3.0 slot for network cards, ensuring versatile connectivity.
Security: Security is integrated at every phase of the PowerEdge lifecycle, from supply chain protection to Multi-Factor Authentication (MFA) and role-based access controls.

Accelerating performance

In benchmark testing, the C6615 outperforms the competition:

HPL Benchmark: It showcases up to a 335% improvement in performance per watt per dollar and a 210% increase in performance per CPU dollar compared to other 1S systems with the AMD EPYC 9004 Series server processor.

Figure 1. HPL benchmark performance

SPEC_CPU2017 Benchmark: Results demonstrate up to a 205% improvement in performance per watt per dollar and a remarkable 128% increase in performance per CPU dollar compared to similar systems.

Figure 2. SPEC_CPU2017 benchmark performance

Final thoughts

The seamless integration of the Dell PowerEdge C6615 into existing processes and toolsets is facilitated by comprehensive iDRAC9 support for all components. This ensures a smooth transition while leveraging the full potential of your server infrastructure.

Dell's commitment to environmental sustainability is evident in its use of recycled materials and energy-efficient options, helping to reduce carbon footprints and operational costs.

In conclusion, the Dell PowerEdge C6615 emerges as a leading dense compute solution, delivering exceptional value and unmatched performance. For more information, visit the PowerEdge Servers Powered by AMD site and explore how this innovative solution can transform your data center operations.

Note: Performance results may vary based on specific configurations and workloads. It's recommended to consult with Dell or an authorized partner for tailored solutions.

Author: David Dam

