To obtain an industry-leading result, every effort was made to ensure that software application settings aligned with the underlying hardware infrastructure and that no bottlenecks existed in the system. The following settings were optimized as part of this work:
- Server BIOS and firmware
- Disk controller and RAID configuration
- Operating system
- Data analytics platform (Cloudera Data Platform)
- Application-level settings (Hive)
Dell Performance Engineers work closely with OEM partners and follow their guidelines to configure and tune these settings. For this test, they used guidelines and best practices from Intel for server BIOS and firmware settings. Similarly, they followed recommendations from Red Hat for operating system configuration.
This test optimized the Cloudera Data Platform (CDP) by creating YARN queues. Using YARN queues raised the number of streams that could run concurrently from 4 to 12, which significantly improved performance in the throughput phase.
In YARN, the Capacity Scheduler is the central component that enables operators to manage the resources that can be assigned to applications [1]. The Capacity Scheduler allocates cluster resources to applications and limits how much of those resources each application can consume. It is responsible for scheduling applications so that cluster utilization is maximized according to operator-defined criteria. Using the Capacity Scheduler was crucial to achieving the results presented in this section.
The fundamental unit of scheduling in YARN is a queue. For this benchmark test, the YARN Queue Manager UI included with CDP was used to configure the Capacity Scheduler. A queue hierarchy was created that maximized the use of the cluster throughout the different phases of the benchmark, particularly the throughput test. Because the throughput test simulates multiple users submitting applications (that is, queries) concurrently, resource consumption must be balanced across queries so that no query is starved of resources. Each queue must be configured with a capacity that dictates the percentage of cluster resources that its applications can use.
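
The relationship between queues and capacity percentages can be illustrated with a minimal Python sketch. It is not the configuration used in this test: the queue names (`default`, `power`, `throughput`), their percentages, and the helper functions are hypothetical and chosen only for illustration. The sketch assumes the Capacity Scheduler's percentage mode, in which the capacities of sibling queues under one parent add up to 100, and it emits the standard `yarn.scheduler.capacity.<queue-path>.capacity` properties that a tool such as the Queue Manager UI would normally manage on the operator's behalf.

```python
from typing import Dict

# Hypothetical queue layout: names and percentages are illustrative only,
# not the values used in the benchmark.
QUEUES: Dict[str, float] = {"default": 10.0, "power": 30.0, "throughput": 60.0}


def capacity_properties(queues: Dict[str, float], parent: str = "root") -> Dict[str, str]:
    """Build Capacity Scheduler properties for the child queues of `parent`."""
    total = sum(queues.values())
    # Assumption: percentage mode, where sibling capacities sum to 100.
    if abs(total - 100.0) > 1e-9:
        raise ValueError(f"capacities under {parent} sum to {total}, expected 100")
    # List the child queues, then set one capacity property per queue.
    props = {f"yarn.scheduler.capacity.{parent}.queues": ",".join(queues)}
    for name, pct in queues.items():
        props[f"yarn.scheduler.capacity.{parent}.{name}.capacity"] = str(pct)
    return props


def guaranteed_memory_gb(cluster_memory_gb: float, pct: float) -> float:
    """Memory share implied by a queue's capacity percentage."""
    return cluster_memory_gb * pct / 100.0


if __name__ == "__main__":
    for key, value in capacity_properties(QUEUES).items():
        print(f"{key}={value}")
```

In this sketch, a queue with a larger percentage is entitled to a proportionally larger share of the cluster, which is how an operator could reserve most resources for concurrent query streams while keeping a small share for other work. The validation step reflects why capacities have to be planned together: raising one queue's share means lowering a sibling's.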