Processing Six Human 50x WGS per day with 3rd Gen Intel Xeon Scalable Processors
Mon, 24 May 2021 22:07:44 -0000
Overview
Intel® Xeon® Scalable Processors have proven to deliver consistent, stable performance across many workload types. The new 3rd Generation Intel® Xeon® Scalable Processors, code-named Ice Lake, perform exceptionally well on a BWA-GATK pipeline. In this study, we tested two Ice Lake processors, the 8352Y and the 8358; the test server configuration is summarized in Table 1.
| Dell EMC PowerEdge C6520 | |
| --- | --- |
| CPU | Tested 3rd Gen Intel® Xeon® Scalable Processors: 2x Intel® Xeon® Platinum 8352Y Processor, 32 cores, 2.20 GHz – 3.40 GHz Base–Boost, TDP 205 W, 48 MB L3 Cache; 2x Intel® Xeon® Platinum 8358 Processor, 32 cores, 2.60 GHz – 3.40 GHz Base–Boost, TDP 250 W, 48 MB L3 Cache |
| RAM | 512 GB DDR4 (32 GB x 16), 3200 MT/s |
| Operating system | RHEL 8.3 (4.18.0-240.22.1) |
| Filesystem network | NVIDIA Mellanox InfiniBand HDR100 |
| Filesystem | Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage |
| BIOS system profile | Performance Optimized |
| Logical processor | Disabled |
| Virtualization technology | Disabled |
| BWA | |
| Sambamba | |
| Samtools | |
| GATK | |
The test data was chosen from one of Illumina’s Platinum Genomes. ERR194161 was sequenced on an Illumina HiSeq 2000, submitted by Illumina, and can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. The description on the linked website states that this sample has a >30x depth of coverage; in practice it reaches ~53x.
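Depth of coverage follows directly from the total number of sequenced bases divided by the genome size. A minimal sketch in Python; the read-pair count and read length below are illustrative assumptions for demonstration, not figures taken from the ERR194161 accession:

```python
def depth_of_coverage(read_pairs, read_length_bp, genome_size_bp):
    """Average depth = total sequenced bases / genome size."""
    total_bases = read_pairs * 2 * read_length_bp  # paired-end: 2 reads per pair
    return total_bases / genome_size_bp

# Illustrative values: ~820M read pairs of 2x101 bp against a ~3.1 Gb human genome
depth = depth_of_coverage(820e6, 101, 3.1e9)
print(f"~{depth:.0f}x")  # ~53x
```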
Performance evaluation
Single sample performance
Table 2 summarizes the overall runtimes and the comparisons between each step for our 9-step BWA-GATK pipeline with a single sample.
The mapping and sorting step is the only step in Table 2 that reveals the true performance variation across the different CPUs. Rough estimates of the overall performance improvement from the 6248R (6248) to the 8352Y and the 8358 are 3.8 % (9.0 %) and 4.8 % (10.0 %), respectively. The 6248R tests ran on a Dell EMC PowerEdge R640 server with 384 GB of RAM and local storage; the configuration details for the 6248 can be found from the embedded link.
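These rough improvement figures follow from the total runtimes in Table 2, taking the runtime reduction relative to the older CPU. A quick check in Python (the formula is inferred from the reported numbers):

```python
def improvement_pct(old_hours, new_hours):
    """Runtime reduction relative to the older CPU's runtime, in percent."""
    return (old_hours - new_hours) / old_hours * 100

# Total runtimes (hours) from Table 2
print(round(improvement_pct(24.23, 23.32), 1))  # 6248R -> 8352Y: 3.8
print(round(improvement_pct(24.23, 23.07), 1))  # 6248R -> 8358: 4.8
```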
The mapping and sorting step shows a decent ~36 % runtime reduction thanks to BWA’s good scalability. The base recalibration step also takes advantage of the higher core count of the Ice Lake CPUs.
| Steps | 8352Y 32c 2.2 GHz | 8358 32c 2.6 GHz | 6248R 24c 3.0 GHz | 6248 20c 2.5 GHz |
| --- | --- | --- | --- | --- |
| Mapping and sorting | 3.23 (32) | 3.23 (32) | 5.04 (24) | 5.22 (20) |
| Mark duplicates | 1.16 (13) | 1.16 (13) | 1.14 (13) | 1.29 (13) |
| Generate realigning targets | 0.47 (32) | 0.46 (32) | 0.16 (24) | 0.42 (20) |
| Insertion and deletion realigning | 8.16 (1) | 7.97 (1) | 7.20 (1) | 7.87 (1) |
| Base recalibration | 2.06 (32) | 2.07 (32) | 2.41 (24) | 2.30 (20) |
| HaplotypeCaller | 8.01 (16) | 7.96 (16) | 8.06 (16) | 8.25 (16) |
| Genotype GVCFs | 0.01 (32) | 0.01 (32) | 0.01 (24) | 0.01 (20) |
| Variant recalibration | 0.20 (1) | 0.20 (1) | 0.19 (1) | 0.23 (1) |
| Apply variant recalibration | 0.01 (1) | 0.01 (1) | 0.01 (1) | 0.01 (1) |
| Total runtime (hours) | 23.32 | 23.07 | 24.23 | 25.61 |
Note: The number of cores used for each step is shown in parentheses.
Multiple-sample performance – throughput
A typical way of running an NGS pipeline is to process multiple samples on each compute node and to use multiple compute nodes to maximize throughput. However, these tests were performed on a single compute node due to the limited number of servers available.
The current pipeline invokes many pipe operations in the first step to minimize the writing of intermediate files. Although this saves a day of runtime and lowers storage usage significantly, the cost of invoking pipes is substantial, which limits the number of samples that can be processed concurrently. Typically, a process fails silently when there are not enough resources left to start an additional process.
As shown in Table 3 for the 8352Y test, the maximum number of samples that can be processed simultaneously is around 14. Although a 14-sample test was not performed, 14 is likely the maximum, because two pipelines failed in the 16-sample test. In other words, a throughput of ~6 genomes per day is achievable with the 8352Y. The 8358 also shows two failed processes when 16 samples were processed simultaneously, while its throughput reached ~7 genomes per day (Table 4).
Per-step runtime, in hours, with a varying number of samples (8352Y):

| Number of samples | 1 | 2 | 4 | 8 | 12 | 16 |
| --- | --- | --- | --- | --- | --- | --- |
| Number of failed samples | 0 | 0 | 0 | 0 | 0 | 2 |
| Mapping and sorting | 2.84 | 4.20 | 7.11 | 13.44 | 20.77 | 26.62 |
| Mark duplicates | 1.17 | 1.18 | 1.29 | 1.77 | 2.49 | 3.05 |
| Generate realigning targets | 0.46 | 0.51 | 0.52 | 0.77 | 1.09 | 1.25 |
| Insertion and deletion realigning | 7.94 | 8.04 | 8.02 | 8.00 | 8.26 | 8.11 |
| Base recalibration | 2.00 | 2.16 | 2.83 | 4.41 | 6.04 | 7.20 |
| HaplotypeCaller | 8.00 | 7.93 | 9.10 | 9.24 | 9.31 | 9.26 |
| Genotype GVCFs | 0.02 | 0.02 | 0.03 | 0.02 | 0.03 | 0.04 |
| Variant recalibration | 0.17 | 0.20 | 0.21 | 0.20 | 0.19 | 0.23 |
| Apply variant recalibration | 0.01 | 0.02 | 0.02 | 0.02 | 0.02 | 0.03 |
| Total runtime (hours) | 22.60 | 24.26 | 29.12 | 37.89 | 48.20 | 55.78 |
| Genomes per day | 1.06 | 1.98 | 3.30 | 5.07 | 5.98 | 6.02 |
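The genomes-per-day figures in Table 3 follow directly from the batch size and total batch runtime, scaled to a 24-hour day:

```python
def genomes_per_day(num_samples, total_runtime_hours):
    """Throughput: samples completed per batch, scaled to a 24-hour day."""
    return num_samples * 24 / total_runtime_hours

# Spot-check against Table 3 (8352Y)
print(round(genomes_per_day(12, 48.20), 2))  # 5.98
print(round(genomes_per_day(8, 37.89), 2))   # 5.07
```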
Per-step runtime, in hours, with a varying number of samples (8358):

| Number of samples | 1 | 8 | 12 | 14 | 16 |
| --- | --- | --- | --- | --- |
| Number of failed samples | 0 | 0 | 0 | 0 | 2 |
| Mapping and sorting | 2.67 | 11.79 | 18.26 | 22.84 | 24.34 |
| Mark duplicates | 1.16 | 1.51 | 2.18 | 2.59 | 2.65 |
| Generate realigning targets | 0.43 | 0.70 | 0.96 | 1.17 | 1.15 |
| Insertion and deletion realigning | 7.97 | 8.00 | 7.99 | 8.20 | 8.19 |
| Base recalibration | 1.94 | 4.05 | 5.65 | 6.47 | 6.56 |
| HaplotypeCaller | 8.00 | 8.21 | 8.22 | 8.24 | 8.25 |
| Genotype GVCFs | 0.02 | 0.03 | 0.03 | 0.03 | 0.02 |
| Variant recalibration | 0.18 | 0.25 | 0.14 | 0.30 | 0.30 |
| Apply variant recalibration | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 |
| Total runtime (hours) | 22.37 | 34.55 | 43.44 | 49.86 | 51.49 |
| Genomes per day | 1.07 | 5.56 | 6.63 | 6.74 | 6.53 |
Conclusion
The field of NGS data analysis has been moving fast in terms of data growth and data variety. The majority of open-source applications in NGS data analysis cannot take advantage of accelerator technology and do not scale well with the number of cores. Users need to start thinking about how to tackle this problem. One simple way to avoid it is data-level parallelization. Although deciding when and where to split the data is difficult, it is tractable with careful intervention in an existing BWA-GATK pipeline, without diluting statistical power. If each smaller data chunk goes through an individual pipeline on its own core and the results are merged at the end, it may be possible to achieve better single-sample performance. This gain could in turn lead to higher throughput if the overall runtime drops significantly.
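The data-level parallelization described above is a scatter-gather pattern: split the input into per-core chunks, run each chunk through an independent pipeline instance, and merge the results at the end. A minimal sketch in Python; the chunking and the per-chunk worker here are placeholders for illustration, not the actual BWA-GATK commands:

```python
from multiprocessing import Pool

def split_into_chunks(items, n_chunks):
    """Scatter: divide the input into roughly equal chunks, one per worker."""
    k, m = divmod(len(items), n_chunks)
    return [items[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n_chunks)]

def process_chunk(chunk):
    """Placeholder for running one pipeline instance on one data chunk."""
    return [x * 2 for x in chunk]  # stand-in for real per-chunk work

def scatter_gather(items, n_workers):
    chunks = split_into_chunks(items, n_workers)
    with Pool(n_workers) as pool:
        results = pool.map(process_chunk, chunks)  # each chunk on its own core
    return [x for part in results for x in part]   # gather: merge in order

if __name__ == "__main__":
    print(scatter_gather(list(range(10)), 4))
    # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Because the chunks are merged back in their original order, the gathered result is identical to processing the whole input serially, which is the property a split BWA-GATK run would also need to preserve.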
Nonetheless, 3rd Generation Intel® Xeon® Scalable Processors, especially the 8352Y and the 8358, are excellent choices for both the highest variant-calling throughput and single-sample analysis.
Related Blog Posts
Six Years of Tower Servers: Accelerate Business Insights with AI Inferencing and the PowerEdge T560
Mon, 13 Nov 2023 19:44:02 -0000
Tasked with describing PowerEdge tower servers in three words, ChatGPT landed on “Reliable, Versatile, Scalable,” perfectly capturing the key qualities of PowerEdge towers. In the following blog, we’ll cover scalability in terms of – you guessed it – AI inferencing workloads.
Our deep learning and AI inferencing benchmarks revealed that the PowerEdge T560 performs up to 15.8x better than the T440 and up to 3.8x better than the T550. Even with over triple the performance, the T560 had nearly 74% lower latency than the T550 for the same workload. The rest of this blog highlights why the 2-socket T560 is well suited for AI inferencing on CPU and provides greater detail on the benchmarks – TensorFlow and OpenVINO – that we tested in our lab.
In case you missed it in our last post, we covered exceptional database workload performance gains across the PowerEdge T440, T550, and T560. Make sure to give that a read to learn how these towers represent six years of innovation since the launch of 14th Generation PowerEdge servers.
PowerEdge towers and AI – a perfect pair
Databases, business applications, and virtualization are the use cases commonly associated with tower servers. While the PowerEdge tower portfolio is designed to accelerate these more traditional workloads, it also matches the exploding business demand for AI solutions. In fact, IDC projects $154 billion in global AI spending this year, with retail and banking topping the industries with the greatest AI investment.
It is important to note that not all AI workloads look the same; they vary widely in scope and necessary compute power. Use cases range from predicting cancerous regions on CT scans to identifying the most trafficked aisles in a retail store. Irrespective of the specific application, McKinsey reveals organizations that adopted AI for specific functions in 2022 are already seeing a return on investment in 2023. Specifically, across all functions, an average of 59% of organizations report revenue increases from AI adoption and 42% report cost decreases.
Whether a business has a clearly defined need for AI compute power or anticipates having one in the future, the PowerEdge T560 scales with evolving industry demands. The key product features that drive the PowerEdge T560’s “AI-readiness” include:
- 2x Intel® Xeon® Scalable Processors
- Up to six single-width or two double-width GPUs
- PCIe Gen 5 and DDR5 memory
Figure 1. PowerEdge T560 AI accelerators
Testing details and benchmark information
For our testing, we evaluated two AI inferencing performance benchmarks, TensorFlow and Intel’s OpenVINO, on the PowerEdge T440, T550, and T560 using Phoronix Test Suites. Inferencing, a subset of AI workloads, refers to the use of input data and an associated trained model to make real-time predictions. Common applications include detecting faces and monitoring traffic for incoming vehicles and pedestrians.
Both TensorFlow and OpenVINO are image-based, and we ran both on CPU. All systems tested were equipped with Intel® Xeon® processors, which is especially relevant to inferencing given that Intel reports “up to 70% of CPUs installed for inferencing are Intel Xeon processors.” While the T560’s GPU capacity allows businesses to scale up their AI workloads, our results show that inferencing on CPU alone still lends itself to impressive performance.
The full testing configurations are listed in the following table. Each system has a Gold-class Intel® Xeon® processor, equal memory capacity, and storage to reflect industry transitions. All testing was conducted in a Dell Technologies lab.
Note: We set the System Profile BIOS setting to “Performance” on all systems, which has been shown to boost out-of-the-box performance by up to 10%. Check out this paper for more details and other ways to simply and quickly optimize your AI workload performance.
Table 1. Testing configurations
| | PowerEdge T440 | PowerEdge T550 | PowerEdge T560 |
| --- | --- | --- | --- |
| CPU | Intel® Xeon® Gold 5222, 4C/8T, TDP 105 W | Intel® Xeon® Gold 6338N, 32C/64T, TDP 185 W | Intel® Xeon® Gold 6448Y, 32C/64T, TDP 225 W |
| Storage | 4x 800 GB SAS SSD (RAID 5) | 4x 960 GB SAS SSD | 4x 1.6 TB NVMe |
| Memory | 512 GB DDR4 | 512 GB DDR4 | 512 GB DDR5 |
PowerEdge T560 inferencing performance “clean sweep”
We report TensorFlow inferencing performance results for three common deep learning architectures: AlexNet, VGG-16, and ResNet-50. Performance – or in this case throughput – is measured as the number of images processed per second. The higher the images-per-second value, the better the inferencing performance.
As shown in Figure 2, the PowerEdge T560 processed significantly more images per second than both prior-gen towers across all three architectures. Most notably, the T560 demonstrated up to 318% higher throughput than the T440.
Figure 2. TensorFlow benchmark performance
Table 2 provides more details about the performance improvements across all systems and architectures tested.
Table 2. TensorFlow benchmark results
| CPU - Batch Size[1] - Architecture | T440 to T550 | T550 to T560 | T440 to T560 |
| --- | --- | --- | --- |
| CPU - 512 - ResNet-50 | 171.32% | 22.11% | 231.32% |
| CPU - 512 - VGG-16 | 234.12% | 25.13% | 318.08% |
| CPU - 16 - AlexNet | 175.21% | 20.54% | 231.74% |

All values are the percent uplift in throughput.
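The generation-over-generation uplifts in Table 2 compound multiplicatively: applying the T440-to-T550 uplift and then the T550-to-T560 uplift reproduces the T440-to-T560 column. A quick check in Python:

```python
def compound_uplift(pct_a, pct_b):
    """Combine two successive percentage uplifts into one overall uplift."""
    return ((1 + pct_a / 100) * (1 + pct_b / 100) - 1) * 100

# VGG-16 row from Table 2: 234.12% then 25.13% compounds to ~318%
print(round(compound_uplift(234.12, 25.13), 1))  # 318.1
```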
In a similar vein, we report OpenVINO performance results for four computer vision use cases:
- Person Detection
- Face Detection
- Age & Gender Recognition in Retail
- Person, Vehicle & Bike Detection
Performance is measured by both throughput in number of frames processed per second (FPS) and latency in milliseconds (ms). The higher the FPS value, the better the inferencing performance. Conversely, a lower latency indicates a quicker system response and therefore better performance.
The figures below illustrate changes in FPS for the four use cases across all three generations of tower servers. For Face Detection specifically, the T560 has 15.8x the FPS compared to the T440 and almost 4x the FPS compared to the T550.
Figure 3. Face Detection and Person Detection OpenVINO FPS
Figure 4. Age Gender Recognition Retail OpenVINO FPS
Figure 5. Person Vehicle Bike Detection OpenVINO FPS
The following table provides the FPS values for the use cases and all three systems tested.
Table 3. OpenVINO frames per second results
| Model | PowerEdge T440 | PowerEdge T550 | PowerEdge T560 |
| --- | --- | --- | --- |
| Face Detection FP16 | 3.54 | 14.77 | 55.94 |
| Person Detection FP16 | 1.94 | 7.6 | 17.37 |
| Person Vehicle Bike Detection FP16 | 249.62 | 701.76 | 2732.94 |
| Age Gender Recognition Retail FP16 | 8396.74 | 34131.92 | 80733.72 |

Throughput is in frames per second; more is better.
Lastly, the T560 reduces inferencing latency by up to 73% compared to the T550 on these same models, as illustrated in Figure 6.
Figure 6. Percent decrease in latency
The following table presents the latency values in ms for the T550 and T560.
Table 4. OpenVINO latency results
| Model | PowerEdge T550 | PowerEdge T560 | Latency Reduction from T550 to T560 |
| --- | --- | --- | --- |
| Face Detection | 2164.53 | 570.48 | -73.64% |
| Person Detection | 4130.79 | 1833.29 | -55.62% |
| Person Vehicle Bike Detection | 45.56 | 23.4 | -48.64% |
| Age Gender Recognition Retail | 1.73 | 0.72 | -58.38% |

Latency is in ms; less is better.
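The headline figures quoted earlier (15.8x the FPS and nearly 74% lower latency for Face Detection) follow directly from the values in Tables 3 and 4:

```python
def speedup(new_fps, old_fps):
    """Throughput ratio between two systems (higher FPS is better)."""
    return new_fps / old_fps

def latency_reduction_pct(old_ms, new_ms):
    """Latency reduction relative to the older system, in percent."""
    return (old_ms - new_ms) / old_ms * 100

# Face Detection: FPS from Table 3, latency from Table 4
print(round(speedup(55.94, 3.54), 1))                    # 15.8 (T560 vs. T440)
print(round(latency_reduction_pct(2164.53, 570.48), 2))  # 73.64 (T550 -> T560)
```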
Concluding Thoughts
Emerging AI workloads have taken numerous industries by storm, and the latest-gen PowerEdge T560 is built for businesses looking to scale up and reap the benefits of AI-generated insights. Between support for 4th Gen Intel® Xeon® Scalable Processors, up to 6 graphics cards, and DDR5 memory, this tower can handle both CPU- and GPU-heavy use cases.
Our recent AI inferencing testing on CPU revealed that the PowerEdge T560 has:
- Up to 318% better inferencing performance than the T440 for the TensorFlow benchmark
- Up to 15.8x the inferencing performance of the T440 and almost 4x that of the T550 for the OpenVINO benchmark
- Up to 73% lower latency than the T550 for the OpenVINO benchmark
While this concludes our blog series on “Six Years of Tower Servers,” we hope we have left you wanting to learn more about the PowerEdge T560. Don’t forget to check out our previous blog detailing exceptional database workload performance gains across tower servers. We’ll part ways with this short unboxing video for a look under the lid of the server:
Resources
- Six Years of Tower Servers: Exceptional Database Performance with PowerEdge T560 | Dell Technologies Info Hub
- Worldwide Spending on AI-Centric Systems Forecast to Reach $154 Billion in 2023, According to IDC
- The state of AI in 2023: Generative AI’s breakout year | McKinsey
- Tensorflow Benchmark - OpenBenchmarking.org
- OpenVINO Benchmark - OpenBenchmarking.org
- Optimize Inference with Intel® CPU Technology
[1] This is a manually set parameter, ranging from 16 to 512. Read about the parameter meaning here.
Legal Disclosures
Based on September 2023 Dell labs testing subjecting the PowerEdge T440, T550, and T560 tower servers to AI inference benchmarks – OpenVINO and TensorFlow via the Phoronix Test Suite. Actual results will vary.
Authors: Olivia Mauger, Jeremy Johnson, Delmar Hernandez | Compute Tech Marketing
Discover Your Servers’ Untapped AI Potential: PowerEdge Offering Accelerated AI Adoption
Wed, 24 Apr 2024 15:32:16 -0000
Imagine a world where machines comprehend our needs before we even express them—where AI drives innovation, and data centers form the beating heart of cutting-edge technology. That world is becoming more real with every passing hour. In a world where technology is advancing at lightning speed, artificial intelligence (AI) has emerged as a game-changer. By delivering solutions that range from self-driving cars to personalized recommendations, AI is transforming industries and changing lives. But with great power comes great demand for computing resources. As the thirst for AI-driven solutions grows, so does the need for powerful servers that can handle the intense computational demands. The journey to this AI-driven utopia starts with a simple question: How can we turbocharge existing servers to tackle the AI revolution?
And what if we told you that the solution lies within your very own data center? That's right. The key to unlocking the true power of AI workloads might be hiding in plain sight. Welcome to the world of Dell PowerEdge servers.
At Dell Labs, we are obsessed with innovation. Our team of passionate engineers embarked on a mission to push the boundaries of what our PowerEdge servers could do. And boy, did we make a discovery that left us in awe!
By exploring the depths of server BIOS settings and fine-tuning them for AI workloads, we witnessed an astounding transformation. The performance boost was nothing short of extraordinary. Our engineers tinkered with these BIOS settings and uncovered hidden gems. These BIOS settings, often used for multitasking, proved to be the secret sauce for AI inferencing tasks. When we thoroughly tested industry-standard AI workloads against these settings, the server's performance skyrocketed like never before.
So, what does this mean for you? It means you may already have a secret weapon in your data center—Dell PowerEdge servers with Intel CPUs! Whether you're already harnessing the power of the latest PowerEdge servers or planning your next investment, our proven optimizations are set to maximize your returns. We've done rigorous testing and distilled the results into simple, accessible settings that you can apply yourself. Alternatively, let us tailor your server's configuration at the point of purchase to unlock its full potential right out of the box. Imagine the possibilities: real-time facial recognition, ultra-fast data analysis, and predictions that can change the game for your business. The hidden power of your servers is waiting to be unleashed, and it's easier than you might think.
AI is the future, and you have a front-row seat to the revolution. With Dell PowerEdge servers optimized for AI workloads, you can embrace the full potential of AI without breaking a sweat. Your existing infrastructure holds the key to supercharging your journey in the fast-paced world of AI, where every advantage counts. PowerEdge servers have the untapped potential to fuel your AI ambitions like never before. The key to unlocking this hidden power lies in BIOS tuning—a simple yet powerful technique. Embrace this hidden power within your servers and join the ranks of AI pioneers to tap into the immense possibilities that await. The future is now, and your Dell PowerEdge servers are ready to lead the way.
Ready to dive into the secrets of BIOS tuning and its impact on AI performance? Our team has shared the magic of BIOS tuning with you in our comprehensive white paper. Let’s explore the fascinating world of AI together and see what's possible when you unleash the full power of your Dell PowerEdge servers.
To learn more about this groundbreaking solution and how it can revolutionize AI, see this Direct from Development Tech Note: Unlock the Power of PowerEdge Servers for AI Workloads: Experience Up to 177% Performance Boost!
Author:
Shreena Bhati, Product Management Intern