Tuxedo Pipeline Performance on Dell EMC PowerEdge R6525
Tue, 27 Apr 2021 03:48:30 -0000
Gene expression analysis is as important as identifying single nucleotide polymorphisms (SNPs), insertions/deletions (indels), or chromosomal rearrangements. Ultimately, all physiological and biochemical events depend on the final gene expression products: proteins. Although most mammals have an additional layer of regulation before protein expression, knowing how many transcripts exist in a system helps characterize the biochemical status of a cell. Ideally, technology would enable us to quantify all the proteins in a cell, which would significantly advance the life sciences; however, we are far from achieving this.
This blog presents test results for one popular RNA-Seq data analysis pipeline known as the Tuxedo pipeline (1). The Tuxedo suite offers a set of tools for analyzing a variety of RNA-Seq data, including short-read mapping, identification of splice junctions, transcript and isoform detection, differential expression, visualization, and quality control metrics. The tested workflow is a differentially expressed gene (DEG) analysis; the detailed steps of the pipeline are shown in Figure 1.
Figure 1. Updated Tuxedo pipeline with the cuffquant step
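To make the per-sample steps in Figure 1 concrete, the following is a minimal driver sketch in Python. The binary names and options come from the standard TopHat/Cufflinks suite, but the index prefix, annotation, file names, and thread count are placeholders, and the cuffmerge step is omitted for brevity; exact flags can vary by tool version.

```python
import subprocess

# Placeholder inputs: a Bowtie index prefix, a reference annotation (GTF),
# and one paired-end sample; THREADS is the per-stage core count.
INDEX, GTF, THREADS = "genome_index", "genes.gtf", "32"

def run(cmd):
    """Run one pipeline stage, failing fast on a non-zero exit code."""
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Short-read mapping and splice-junction discovery (TopHat).
run(["tophat", "-p", THREADS, "-G", GTF, "-o", "tophat_s1",
     INDEX, "s1_R1.fastq", "s1_R2.fastq"])

# 2. Transcript and isoform assembly (Cufflinks).
run(["cufflinks", "-p", THREADS, "-o", "cufflinks_s1",
     "tophat_s1/accepted_hits.bam"])

# 3. Per-sample quantification (cuffquant); writes abundances.cxb so
#    that cuffdiff can reuse it instead of re-reading the BAM files.
run(["cuffquant", "-p", THREADS, "-o", "cuffquant_s1",
     GTF, "tophat_s1/accepted_hits.bam"])

# 4. Differential expression (cuffdiff) between two conditions, given
#    a second sample quantified the same way.
run(["cuffdiff", "-p", THREADS, "-o", "diff_out", GTF,
     "cuffquant_s1/abundances.cxb", "cuffquant_s2/abundances.cxb"])
```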
We performed a single-node study with AMD EPYC 7002 series (Rome) and AMD EPYC 7003 series (Milan) processors on the Dell EMC PowerEdge R6525 server. Table 1 summarizes the configuration of the test system.
Table 1. Tested compute node configuration
| Component | Details |
|---|---|
| Server | Dell EMC PowerEdge R6525 |
| CPUs tested (AMD Milan) | 2x EPYC 7763: 64 cores, 2.45 GHz base / 3.5 GHz boost, 280 W TDP, 256 MB L3 cache |
| | 2x EPYC 7713: 64 cores, 2.0 GHz base / 3.7 GHz boost, 225 W TDP, 256 MB L3 cache |
| | EPYC 7543: 32 cores, 2.8 GHz base / 3.7 GHz boost, 225 W TDP, 256 MB L3 cache |
| CPU tested (AMD Rome) | EPYC 7702: 64 cores, 2.0 GHz base / 3.35 GHz boost, 200 W TDP, 256 MB L3 cache |
| RAM | 256 GB DDR4 (16 GB x 16), 3200 MT/s |
| Operating system | RHEL 8.3 (4.18.0-240.el8.x86_64) |
| Interconnect | Mellanox InfiniBand HDR100 |
| File system | Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage |
| BIOS system profile | Performance Optimized |
A performance study of an RNA-Seq pipeline is not trivial because the nature of the workflow requires input files that are non-identical yet similar in size. Hence, 185 RNA-Seq paired-end read datasets were collected from a public data repository. All the read data files contain around 25 million fragments (MF) and have similar read lengths. The samples for each test are randomly selected from this pool of 185 paired-end read files. Although these test data have no biological meaning, their high level of noise certainly puts the tests in a worst-case scenario.
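As an illustration of this sampling scheme, here is a small Python sketch that draws a random test set from such a pool; the file-name pattern is hypothetical.

```python
import random

# Hypothetical pool of 185 paired-end samples, each ~25 million fragments.
POOL = [f"sample_{i:03d}" for i in range(1, 186)]

def pick_samples(n, seed=None):
    """Randomly select n paired-end samples (R1/R2 FASTQ pairs)."""
    rng = random.Random(seed)
    return [(f"{s}_R1.fastq", f"{s}_R2.fastq") for s in rng.sample(POOL, n)]

# Example: draw the inputs for an eight-sample throughput test.
for r1, r2 in pick_samples(8, seed=42):
    print(r1, r2)
```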
Throughput test: single pipeline with more than two samples, including biological and technical replicates
Typical RNA-Seq studies consist of multiple samples, sometimes hundreds of different samples: normal versus disease, or untreated versus treated. These samples tend to have a high level of noise for biological reasons; hence, the analysis requires a rigorous data preprocessing procedure.
Varying numbers of samples, drawn from the pool of 185 paired-end read files, were processed to see how much data a single node can handle. Typically, the runtime of the Tuxedo pipeline increases with the number of samples. However, as shown in Figure 2, the runtimes of the two-sample tests on the 7713 are higher than the runtimes of the four-sample tests, and the standard error from five repeated runs does not overlap with the four- and eight-sample results. Interference from other users may explain this large variance: the current testing environment, especially a shared file system designed for large capacity, is not ideal for a Next Generation Sequencing (NGS) data analysis benchmark.
Figure 2. Runtime comparisons among various AMD EPYC 7003 series processors: standard errors are estimated from the sample standard deviation (the STDEV.S function in Excel)
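For reference, error bars like those in Figure 2 can be reproduced with a few lines of Python; the runtimes below are made-up placeholders, not measured values.

```python
from math import sqrt
from statistics import mean, stdev  # stdev is the sample standard
                                    # deviation, equivalent to STDEV.S

def standard_error(runtimes):
    """Standard error of the mean from repeated runs."""
    return stdev(runtimes) / sqrt(len(runtimes))

runs = [412.0, 389.5, 455.2, 401.8, 397.3]  # five repeats, minutes (placeholder)
print(f"mean = {mean(runs):.1f} min, SE = {standard_error(runs):.1f} min")
```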
The eight-sample test results show that the AMD Milan processors perform better than the tested Rome processor (7702) under a higher workload.
Many more tests are required to gain better insight into the AMD Milan processors for NGS data analysis, and unfortunately the tests could not exceed eight samples due to storage limitations. However, there seems to be plenty of room for higher throughput by processing more than eight samples together. As shown in Figure 2, in the eight-sample Tuxedo pipeline tests the AMD Milan 7763 performed 20 percent better, and the AMD Milan 7713 18 percent better, than the AMD Rome 7702.
Related Blog Posts
HPC Application Performance on Dell EMC PowerEdge R7525 Servers with the AMD MI100 Accelerator
Tue, 15 Dec 2020 14:23:27 -0000
The Dell EMC PowerEdge R7525 server supports the AMD MI100 GPU accelerator. It is a two-socket, 2U rack-based server designed to run complex workloads using highly scalable memory, I/O capacity, and network options. The system is based on the 2nd Gen AMD EPYC processor (up to 64 cores), has up to 32 DIMMs, and has PCI Express (PCIe) 4.0-enabled expansion slots. The server supports SATA, SAS, and NVMe drives and up to three double-wide 300 W accelerators.
The following figure shows the front view of the server:
Figure 1. Dell EMC PowerEdge R7525 server
The AMD Instinct™ MI100 accelerator is one of the fastest HPC GPUs on the market. It offers innovations to obtain higher performance for HPC applications through the following key technologies:
- AMD Compute DNA (CDNA)—Architecture optimized for compute-oriented workloads
- AMD ROCm—An Open Software Platform that includes GPU drivers, compilers, profilers, math and communication libraries, and system resource management tools
- Heterogeneous-Computing Interface for Portability (HIP)—An interface that enables developers to convert CUDA code to portable C++ so that the same source code can run on AMD GPUs
This blog focuses on the performance characteristics of a single PowerEdge R7525 server with AMD MI100-32G GPUs. We present results from the general matrix multiplication (GEMM) microbenchmarks, the LAMMPS benchmarks, and the NAMD benchmarks to showcase performance and scalability.
The following table provides the configuration details of the PowerEdge R7525 system under test (SUT):
Table 1. SUT hardware and software configurations
| Component | Details |
|---|---|
| Processor | AMD EPYC 7502 32-core processor |
| Memory | 512 GB (16 x 32 GB, 3200 MT/s) |
| Local storage | 2 x 1.8 TB SSD (no RAID) |
| Operating system | Red Hat Enterprise Linux Server 8.2 |
| Accelerators | 3 x AMD MI100-PCIe-32G |
| BIOS settings | Processor Settings > Logical Processors; NUMA nodes per socket |
| NAMD benchmark | Version: NAMD 3.0 Alpha 6 |
| LAMMPS (KOKKOS) benchmark | Version: LAMMPS patch_18Sep2020 + AMD patches |
The following table lists the AMD MI100 GPU specifications:
Table 2. AMD MI100 PCIe GPU specification
| Specification | Value |
|---|---|
| Peak engine clock (MHz) | 1502 |
| Peak FP64 (TFLOPS) | 11.5 |
| Peak FP64 Tensor DGEMM (TFLOPS) | 11.5 |
| Peak FP32 (TFLOPS) | 23.1 |
| Peak FP32 Tensor SGEMM (TFLOPS) | 46.1 |
| Memory size (GB) | 32 |
| Memory ECC support | Yes |
The GEMM benchmark is a simple, multithreaded dense matrix-to-matrix multiplication benchmark that can be used to test the performance of GEMM on a single GPU. The rocblas-bench binary compiled from https://github.com/ROCmSoftwarePlatform/rocBLAS was used to collect the DGEMM and SGEMM results. These tests reflect the performance of an ideal application that runs only matrix multiplications, reported as the peak TFLOPS the GPU can deliver. Although GEMM benchmark results might not represent real-world application performance, GEMM is still a good benchmark to demonstrate the performance capability of different GPUs.
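As a sanity check on the reported numbers, sustained GEMM throughput follows directly from the problem size and wall time, since an m x n x k GEMM performs roughly 2*m*n*k floating-point operations. The problem size and timing below are placeholders, not our measured rocblas-bench output.

```python
def gemm_tflops(m, n, k, seconds):
    """Sustained TFLOPS for one C = alpha*A*B + beta*C GEMM call:
    the multiply-accumulate count is roughly 2*m*n*k FLOPs."""
    return 2.0 * m * n * k / seconds / 1e12

# Placeholder square DGEMM: 2 * 8640^3 FLOPs in 0.163 s is about 7.9 TFLOPS.
print(f"{gemm_tflops(8640, 8640, 8640, seconds=0.163):.1f} TFLOPS")
```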
The following figure shows the observed numbers of DGEMM and SGEMM:
Figure 2. DGEMM and SGEMM for both AMD MI100 peak and AMD-PCIe sustained
The results indicate:
- In the DGEMM (double-precision GEMM) benchmark, the theoretical peak performance of the AMD MI100 GPU is 11.5 TFLOPS and the measured sustained performance is 7.9 TFLOPS. As shown in Table 2, the standard double-precision (FP64) theoretical peak and the FP64 tensor DGEMM peak performance are both 11.5 TFLOPS. Because most real-world HPC applications are not implemented primarily with DGEMM or other matrix operations, this high standard FP64 capability also boosts performance for non-matrix double-precision math calculations.
- For FP32 Tensor operations in the SGEMM (single-precision GEMM) benchmark, the theoretical peak performance of the AMD MI100 GPU is 46.1 TFLOPS, and the measured sustained performance is approximately 30 TFLOPS.
The LAMMPS benchmark
The Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) runs threads in parallel using message-passing techniques. This benchmark measures the scalability and performance of large, parallel systems across multiple GPUs.
The following figure shows that the KOKKOS implementation of LAMMPS scaled relatively linearly as AMD MI100 GPUs were added, across four datasets: EAM, LJ, Tersoff, and ReaxFF/C.
Figure 3. LAMMPS benchmark showing scaling of multiple AMD MI100 GPUs
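The scaling in Figure 3 can be summarized as parallel efficiency (speedup divided by GPU count). A short sketch with placeholder throughput numbers follows; the real values are in the figure.

```python
# Placeholder throughputs (timesteps/s) for 1, 2, and 3 GPUs.
throughput = {1: 100.0, 2: 195.0, 3: 285.0}

for gpus, rate in sorted(throughput.items()):
    speedup = rate / throughput[1]
    print(f"{gpus} GPU(s): speedup {speedup:.2f}x, "
          f"efficiency {speedup / gpus:.0%}")
```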
The NAMD Benchmark
Nanoscale Molecular Dynamics (NAMD) is a parallel molecular dynamics system designed for simulation of large biomolecular systems. The NAMD benchmark stresses the scaling and performance aspects of the server and GPU configuration.
The following figure plots the results of the NAMD microbenchmark:
Figure 4. NAMD benchmark performance
Aggregate data across multiple GPU cards is reported because the Alpha builds of the NAMD 3.0 binary do not scale beyond a single accelerator. Three replica simulations were launched in parallel on the same server, one on each GPU. NAMD was CPU-bound in previous versions, but the new 3.0 version has reduced this CPU dependence. As a result, the three-copy simulation scaled linearly, performing three times faster across all datasets.
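A sketch of that three-replica launch is shown below. The binary name, the `+devices` device-selection flag, and the four-core count per replica (see Figure 5) follow NAMD's CUDA-build conventions and are assumptions for this AMD build.

```python
import subprocess

# Launch one NAMD replica per GPU on the same server, in parallel.
procs = []
for gpu in range(3):
    log = open(f"replica_{gpu}.log", "w")
    procs.append(subprocess.Popen(
        ["namd3", "+p4", "+devices", str(gpu), "stmv.namd"],
        stdout=log))  # +p4: four CPU cores per replica (assumed flag)

for p in procs:
    p.wait()  # aggregate throughput = sum of ns/day across the three logs
```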
As part of the optimization, the NAMD benchmark numbers in the following figure show the relative performance difference using different numbers of CPU cores for the STMV dataset:
Figure 5. CPU core dependency on NAMD
The AMD MI100 GPU performed best with four CPU cores per GPU.
The AMD MI100 accelerator delivers industry-leading performance, and it is well positioned on performance per dollar for both FP32 and FP64 HPC parallel codes.
- FP32 applications perform well on the AMD MI100 GPU, as shown by the SGEMM, LAMMPS, and NAMD benchmark results, by using tensor cores and native FP32 compute cores.
- FP64 applications perform well on the AMD MI100 GPU by using native FP64 compute cores.
In the future, we plan to test other HPC and Deep Learning applications. We also plan to research using “Hipify” tools to port CUDA sources to HIP.
HPC Application Performance on Dell PowerEdge R7525 Servers with NVIDIA A100 GPGPUs
Tue, 24 Nov 2020 17:39:49 -0000