Typical RNA-Seq studies consist of multiple samples, sometimes hundreds. Examples include comparisons of normal and diseased samples or of untreated and treated samples. These samples tend to be noisy for biological reasons; hence, the analysis requires rigorous data preprocessing.
We tested various numbers of samples (all different RNA-Seq data selected from 185 paired-end read datasets) to see how much data a single node can process. Typically, the runtime of the Tuxedo pipeline increases with the number of samples, as shown in Figure 10. Unlike in BWA-GATK pipelines, some steps cannot be parallelized and must wait for all preceding parallelized jobs to complete (Steps 3 and 5 in Figure 3). The runtimes of Cuffmerge and Cuffdiff grow exponentially as the number of samples increases, even though no other jobs compete for computational resources at these steps.
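The dependency pattern described above, where per-sample jobs run in parallel and a later step waits for all of them, can be sketched as follows. This is a minimal illustration, not the pipeline used in the tests: `per_sample_step`, `merge_step`, and `run_pipeline` are hypothetical placeholders that only build strings, and no real Tuxedo tool is invoked.

```python
from concurrent.futures import ThreadPoolExecutor


def per_sample_step(sample: str) -> str:
    """Hypothetical stand-in for a job that runs independently per sample."""
    return f"{sample}.assembled"


def merge_step(outputs: list[str]) -> str:
    """Hypothetical stand-in for a step that needs every sample's output."""
    return "merged(" + ",".join(sorted(outputs)) + ")"


def run_pipeline(samples: list[str], max_workers: int = 4) -> str:
    # Fan out: each sample is processed independently, so these jobs
    # can run side by side.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(per_sample_step, samples))
    # Barrier: the merge cannot start until every per-sample job has
    # finished, so it runs once, after the pool has drained.
    return merge_step(outputs)


if __name__ == "__main__":
    print(run_pipeline(["sample_a", "sample_b", "sample_c"]))
```

Because the merge step runs only once and consumes every sample's output, adding nodes or cores shortens the fan-out phase but not the merge phase, which is consistent with the runtime growth described above.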
The throughput test results show that a 16-node PowerEdge C6620 cluster can process roughly 12.8 billion fragments through the Tuxedo pipeline illustrated in Figure 3: 512 samples with around 50 million paired reads (25 million fragments) each. Because the Tuxedo pipeline is faster than other popular pipelines, it is hard to generalize these results or use them to size an HPC system accurately. However, the results can support a rough estimate of the HPC system's size.
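The fragment total quoted above follows directly from the sample count and the per-sample size. The short calculation below reproduces it and derives the per-node share. It is a rough sizing aid only, and it assumes the samples are spread evenly across the 16 nodes, which the text does not state.

```python
# Figures taken from the throughput test description.
NODES = 16
SAMPLES = 512
PAIRED_READS_PER_SAMPLE = 50_000_000

# Each fragment is sequenced from both ends, so 50 million paired reads
# correspond to 25 million fragments.
FRAGMENTS_PER_SAMPLE = PAIRED_READS_PER_SAMPLE // 2

total_fragments = SAMPLES * FRAGMENTS_PER_SAMPLE

# Assumption: samples are distributed evenly across nodes.
samples_per_node = SAMPLES // NODES
fragments_per_node = total_fragments // NODES

print(f"Total fragments:    {total_fragments:,}")
print(f"Samples per node:   {samples_per_node}")
print(f"Fragments per node: {fragments_per_node:,}")
```

Under the even-distribution assumption, each node handles 32 samples, or about 800 million fragments.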