Dell Validated Design for Health Care and Life Sciences with Dell PowerEdge C6520 and XE8545: Introduction
DNA is the code of life. This molecule carries the genetic instructions for the growth, development, and reproduction of all living organisms.
The building block of DNA is a four-letter code: A, T, G, or C. These four letters are nucleotides, also known as bases. The human genome consists of three billion bases, and the specific order of those bases is responsible for every phenotype, traits such as eye color or drug sensitivity. DNA sequencing is the process of writing out the order of the bases for an organism of interest, and the entire complement of DNA for an organism is its genome. Sequencing the first human genome took approximately ten years, multiple teams across universities and government labs, and over $2.7 billion USD (NHGRI, 2019). Today, a single technician can sequence an entire human genome in one to two days for less than $1,000 USD.
As NGS platforms continue to drive down the cost of sequencing a whole human genome, genomics plays an increasingly important role in clinical practice for patient care. Genomics is also a critical tool for public health initiatives. The information encoded in an individual's genome is the foundation of precision medicine: because of person-to-person variability, it drives diagnosis and supports therapeutic decisions for disease treatment and sometimes prevention strategies (Suwinski, 2019). Identifying genetic variants, or differences within a genome, is done by comparing an individual's genome to a reference genome, a process known as "secondary analysis." Generating this list of genetic variants can take minutes to days depending on the size of the data and the available software, computing, and storage resources.
Extending this approach to assess the genetic variability of patient populations requires operating the latest NGS instrumentation and computing resources at scale. For example, the latest Illumina NovaSeq 6000 system can output approximately five times more DNA bases than the previous generation of instrumentation (Illumina Inc., 2019). One Illumina NovaSeq system can produce approximately 1.5 TB to 2.5 TB of raw data per day, representing approximately 20 to 48 whole genome sequences (WGS) per day. Today, it is common for life science organizations to operate more than one NGS instrument and routinely process hundreds to tens of thousands of WGS samples per week. To avoid analysis bottlenecks, an organization must have computing and storage resources matched to the output capacity of its fleet of sequencing instruments. This capacity ensures that the rate of secondary analysis keeps pace with the rate of raw NGS data generation.
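The storage implications of these output rates can be estimated with simple arithmetic. A minimal sketch, using the per-instrument rates above; the fleet size and the choice of the upper-bound rate are illustrative assumptions, not figures from a validated configuration:

```python
def raw_output_tb(instruments: int, tb_per_instrument_per_day: float,
                  days: int = 7) -> float:
    """Estimate raw NGS data (in TB) generated by a sequencer fleet over a period."""
    return instruments * tb_per_instrument_per_day * days

# Three NovaSeq systems at the upper bound of ~2.5 TB of raw data per day each:
weekly_tb = raw_output_tb(instruments=3, tb_per_instrument_per_day=2.5)
print(weekly_tb)  # 52.5 TB of raw data per week
```

Even a small fleet therefore generates tens of terabytes of raw data weekly, before any intermediate or result files from secondary analysis are counted.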
The desired product of whole-genome sequencing (WGS) is a catalog of all genetic variants within a given sample. Although motivations may differ, minimizing the time and cost to generate this catalog of variants is a goal shared across most healthcare and life science organizations. Research organizations competing for grant awards must be competitive in the time and cost of generating the most comprehensive variant catalog possible. To recognize revenue, a sequencing service provider must return a list of variants to its customers within agreed-on timelines while containing costs to maximize returns. In a clinical setting, a diagnostic variant report must be delivered with the accuracy and speed required to affect the care of a patient.
Due to the size of individual sample data and the volume of samples, WGS secondary analysis is a compute- and storage-intensive process. The most commonly used and cited methods for secondary analysis include the Burrows-Wheeler Aligner (BWA-MEM) (Li, 2009) and the Genome Analysis Toolkit (GATK) (McKenna, 2010). The Broad GATK Best Practices workflow (pipeline) requires over 30 hours to process a 40x WGS sample (Goyal, 2017).
Note: The configuration used included a 48-core Intel Xeon E5-2697v2 12C, 2.7 GHz processors with 128 GB RAM, 3.2 TB SSD, and CentOS 6.6.
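The BWA-MEM plus GATK workflow cited above follows a well-known sequence of steps: alignment, duplicate marking, base-quality recalibration, and variant calling. The dry-run sketch below assembles (but does not execute) the canonical commands; the sample name, file paths, and thread count are placeholders for illustration and are not part of the validated design:

```python
# Sketch of the GATK Best Practices germline workflow; commands are
# assembled as strings only. All file names and paths are placeholders.
sample, ref, threads = "sampleA", "ref.fa", 32

steps = [
    # 1. Align reads to the reference and coordinate-sort the output.
    f"bwa mem -t {threads} {ref} {sample}_R1.fq.gz {sample}_R2.fq.gz "
    f"| samtools sort -@ {threads} -o {sample}.sorted.bam -",
    # 2. Mark PCR and optical duplicates.
    f"gatk MarkDuplicates -I {sample}.sorted.bam -O {sample}.dedup.bam "
    f"-M {sample}.dup_metrics.txt",
    # 3. Recalibrate base quality scores against known variant sites.
    f"gatk BaseRecalibrator -I {sample}.dedup.bam -R {ref} "
    f"--known-sites known.vcf.gz -O {sample}.recal.table",
    f"gatk ApplyBQSR -I {sample}.dedup.bam -R {ref} "
    f"--bqsr-recal-file {sample}.recal.table -O {sample}.recal.bam",
    # 4. Call variants to produce the catalog of genetic variants.
    f"gatk HaplotypeCaller -I {sample}.recal.bam -R {ref} "
    f"-O {sample}.g.vcf.gz -ERC GVCF",
]

for cmd in steps:
    print(cmd)
```

Each step reads and writes whole-genome-scale files, which is why both the compute and the storage subsystem govern end-to-end pipeline time.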
Dell Technologies' most recent test results, part of the endeavor to build a Dell Validated Design (DVD), show roughly 24 hours to process a 50x WGS sample on a single server with Intel® Xeon® Platinum 8358 processors. See https://infohub.delltechnologies.com/p/processing-six-human-50x-wgs-per-day-with-3rd-gen-intel-xeon-scalable-processors/.
Note: The configuration used included Dell PowerEdge C6520 with two Intel® Xeon® Platinum 8358 processors, 32 cores, 2.60 GHz, and 512 GB RAM. It also included the DVD for HPC BeeGFS High Capacity Storage, and Red Hat Enterprise Linux 8.3 (4.18.0-240.22.1).
Analyzing a few genomes per day is far from ideal when a modern, high-throughput NGS instrument can generate unanalyzed, raw NGS data for 20 or more WGS per day.
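This gap between sequencing rate and analysis rate determines how much compute an organization must deploy. A minimal sizing sketch, where the instrument count and per-node rate are illustrative figures drawn from the numbers above, not a recommendation:

```python
import math

def nodes_to_keep_pace(instruments: int,
                       wgs_per_instrument_per_day: float,
                       wgs_per_node_per_day: float) -> int:
    """Minimum compute nodes so secondary analysis matches sequencer output."""
    daily_output = instruments * wgs_per_instrument_per_day
    return math.ceil(daily_output / wgs_per_node_per_day)

# Two instruments at ~20 WGS/day each, with one dual-socket server
# processing roughly one 50x WGS per day (illustrative figures):
print(nodes_to_keep_pace(2, 20, 1.0))  # 40 nodes
```

Raising per-node throughput, for example through accelerated secondary-analysis software, shrinks this node count proportionally.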
Organizations must consider all the critical variables that may impact total secondary analysis (wall-clock) time when choosing technologies for secondary analysis of NGS data. These variables are wide-ranging and include the type of NGS sequencing application, the sequencing coverage per sample, the supporting analysis software, application-specific strategies, output file types, application file access patterns, and the number and type of available computing resources.
When planning time and resources to complete secondary analysis, be aware of the sequencing depth of coverage for sample data, as it impacts analysis time per sample. Coverage describes the average number of sequencing reads that align to, or cover, a known reference sequence. Coverage often determines whether a variant can be identified with a certain degree of confidence at a specific genomic location. Coverage requirements vary by sequencing application. For example, 30x to 60x coverage is common for human WGS applications (Illumina, 2019), whereas the analysis of cancer genomes may require sequencing to a depth of coverage higher than 100x to achieve the sensitivity and specificity needed to detect rare, low-frequency variants (Griffith, 2015).
Coverage is also a measure of the amount of data per sample: as coverage increases, so does the amount of data. For example, a 50x WGS sample contains approximately five times more data than a 10x WGS sample. Secondary analysis time also increases roughly in proportion to the amount of data.
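The relationship between coverage and data volume follows directly from the definition: total sequenced bases are approximately coverage multiplied by genome size. A small sketch using the three-billion-base human genome figure from the text:

```python
HUMAN_GENOME_BASES = 3_000_000_000  # ~3 billion bases, per the text

def sequenced_bases(coverage: float, genome_size: int = HUMAN_GENOME_BASES) -> float:
    """Approximate total bases sequenced for a sample at a given depth of coverage."""
    return coverage * genome_size

# A 50x WGS sample carries five times the data of a 10x sample:
ratio = sequenced_bases(50) / sequenced_bases(10)
print(ratio)  # 5.0
```

The same proportionality is why a 100x+ cancer sample costs substantially more compute, storage, and analysis time than a 30x germline sample.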
Dell Technologies and NVIDIA Clara Parabricks have created a modular, easy-to-scale reference architecture to meet these needs. This architecture simplifies and streamlines technology choices, reducing secondary analysis times while keeping pace with NGS data generation.