Genomics analysis
DNA is the code of life. This molecule carries the genetic instructions for the growth, development, and reproduction of all living organisms. DNA is written in a four-letter code: adenine (A), thymine (T), guanine (G), and cytosine (C). These four letters represent nucleotides, also known as bases. The human genome consists of approximately 3 billion bases. The specific order in which the A, T, G, and C bases appear underlies an organism's phenotypes (traits such as eye color or drug sensitivity). The entire complement of DNA for an organism is its genome, and DNA sequencing is the process of writing out the order of the bases for an organism of interest. Multiple teams across universities and government labs took approximately ten years and over $2.7 billion (USD) to sequence the first human genome (NHGRI, 2019). Today, a single technician can sequence a whole human genome in one to two days for less than $1,000 (USD).
Next-generation sequencing (NGS) platforms continue to drive down the cost of sequencing a whole human genome. Genomics now plays an increasingly important role in clinical practice for patient care and is a critical tool for public health initiatives. The information encoded in the genome of an individual is the foundation of precision medicine: it drives diagnosis and supports therapeutic decisions for disease treatment. In some cases, it also supports prevention strategies that account for person-to-person variability (Suwinski, 2019). Genetic variants, or differences within a genome, are identified by comparing the genome of an individual to a reference genome. This process of generating a list of genetic variants, also known as “secondary analysis,” can take minutes to days. The time required depends on the size of the dataset and on the available software, computing, and storage resources.
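The idea of comparing an individual's genome to a reference can be illustrated with a deliberately small sketch. It assumes two short sequences that are already aligned and of equal length, and it reports only single-base differences. The names `Variant` and `call_snvs` and the example sequences are hypothetical and chosen for illustration. Production secondary analysis aligns very large numbers of sequencing reads and applies statistical variant callers, which this toy example does not attempt.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    position: int  # 1-based position in the reference
    ref: str       # base in the reference
    alt: str       # base observed in the sample


def call_snvs(reference: str, sample: str) -> List[Variant]:
    """List single-base differences between two pre-aligned sequences.

    Illustrative assumption: both sequences have the same length and are
    already aligned, so position i in one corresponds to position i in
    the other. Insertions and deletions are not handled.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned to the same length")
    return [
        Variant(i, r, s)
        for i, (r, s) in enumerate(zip(reference.upper(), sample.upper()), start=1)
        if r != s
    ]


# Hypothetical example sequences, not real genomic data.
reference = "ATGGCCTTAGC"
sample = "ATGACCTTGGC"

for v in call_snvs(reference, sample):
    print(f"position {v.position}: {v.ref}>{v.alt}")
```

Even in this simplified form, the cost of the comparison grows with the length of the sequences, which is one reason dataset size and available computing resources determine how long secondary analysis takes.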
Extending this approach to assess the genetic variability of patient populations requires operating the latest NGS instrumentation and computing resources at scale. For example, the Illumina NovaSeq 6000 system can output approximately five times more DNA bases than the previous generation of instrumentation (Illumina Inc., 2021). One Illumina NovaSeq system can produce between approximately 1.5 TB and 2.5 TB of raw data per day, which represents approximately 20 to 48 whole genome sequences (WGS) per day. Today, it is common for life science organizations to operate more than one NGS instrument and to routinely process from hundreds to tens of thousands of WGS samples per week. An organization must therefore match its computing and storage resources to the output capacity of its fleet of sequencing instruments. That way, it can avoid analysis bottlenecks, and the rate of secondary analysis can keep pace with the rate of raw NGS data generation.
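The sizing argument above comes down to simple arithmetic, sketched below. The per-instrument figures (1.5 TB to 2.5 TB and 20 to 48 genomes per day) come from the text. The fleet size, the number of analysis nodes, the function names, and the assumption that each node analyzes one genome at a time are hypothetical and used only for illustration.

```python
from typing import Tuple

# Per-instrument daily output ranges quoted in the text.
RAW_TB_PER_DAY = (1.5, 2.5)
GENOMES_PER_DAY = (20, 48)


def weekly_fleet_output(instruments: int, days: int = 7) -> Tuple[Tuple[float, float], Tuple[int, int]]:
    """Return ((low TB, high TB), (low genomes, high genomes)) for the fleet."""
    raw_tb = tuple(instruments * tb * days for tb in RAW_TB_PER_DAY)
    genomes = tuple(instruments * g * days for g in GENOMES_PER_DAY)
    return raw_tb, genomes


def max_hours_per_genome(instruments: int, genomes_per_day: int, analysis_nodes: int) -> float:
    """Longest per-genome analysis time that still keeps pace with sequencing.

    Illustrative assumption: each analysis node processes one genome at a
    time and runs around the clock.
    """
    return 24 * analysis_nodes / (instruments * genomes_per_day)


# Hypothetical fleet: 4 instruments feeding 8 analysis nodes.
raw_tb, genomes = weekly_fleet_output(instruments=4)
print(f"raw data per week: {raw_tb[0]:.0f} to {raw_tb[1]:.0f} TB")
print(f"genomes per week: {genomes[0]} to {genomes[1]}")
print(f"time budget per genome at peak: {max_hours_per_genome(4, 48, 8):.1f} hours")
```

Under these assumptions, any secondary analysis run that takes longer than the computed time budget per genome would cause a backlog to build up behind the instruments.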