Executive Summary

To establish a thorough solution collateral for the Dell PowerEdge HS5610 platform integrated with Cloudera software, we are commencing benchmarking initiatives this year. These benchmarks will form the foundational baseline for our future testing endeavors, and it's essential to emphasize that we will not be making comparisons with previous generations.

This initiative holds great significance, prompted directly by Dell's explicit request to craft this reference solution. Intel has taken charge of executing the benchmark tests and generously shared their Best-Known Methods (BKMs), providing invaluable guidance for this critical undertaking.

What are the key takeaways?

Cloudera Data Platform built on Dell’s 16G PowerEdge servers with Intel® 4th Generation Xeon processor architecture can accommodate growing enterprise data workloads and efficiently manage increasing demands for analytics and machine learning in a smaller footprint. Cloudera Data Platform delivers easier data management and scalability for data anywhere with optimal performance, scalability, and security.

As organizations create more diverse and more user-focused data products and services, there is a growing need for machine learning, which can be used to develop personalization, recommendations, and predictive insights. But as organizations amass greater volumes and greater varieties of data, data scientists are spending most of their time supporting their infrastructure instead of building the models to solve their data problems. To help solve this problem, as an integrated part of Cloudera’s platform, Spark provides a general machine learning library that is designed for simplicity, scalability, and easy integration with other tools. With the scalability, language compatibility, simple administration and compliance-ready security and governance provided through cloudera, data scientists can solve and iterate through their data problems faster.

Spark MLlib

Spark MLlib is a distributed machine learning framework built on top of Spark Core. The key benefit of MLlib is that it allows data scientists to focus on their data problems and models instead of solving the complexities surrounding distributed data. MLlib leverages the advantages of in-memory computation and is optimized for matrix and vector operations, aligning its capabilities with specific algorithmic requirements for the given use case.

A blue and white logo Description automatically generated with medium confidence

K-means Overview

Clustering stands as a fundamental exploratory data analysis technique, providing valuable insights into the inherent structure of data. One prominent algorithm for this purpose is K-means, widely recognized for partitioning data points into a predefined number of clusters. This technique finds extensive applications in diverse domains including market segmentation, document clustering, image segmentation, search engines, real estate, anomaly detection and image compression, highlighting it’s versatility and importance in data analysis.

K-Means Overview

K-means clustering performance

We achieved remarkable results by clustering 10 million samples with 1,000 dimensions in just 283 seconds. This accomplishment was made possible through the application of the K-Means algorithm from Spark's ML library, which was provided by Cloudera 7.1.8 and deployed on Dell PowerEdge HS5610 platform.

We conducted a performance evaluation of Spark's MLlib K-Means algorithm using the HiBench Benchmark.

For detailed information on our benchmarking process, you can refer to Intel GitHub repository: https://github.com/Intel-bigdata/HiBench

Note - This result is not compared against any other platform hardware or software. We will use this as baseline for future products.

Text Box

Configuration Details

Workload Configuration
Platform	Dell PowerEdge HS5610
CPU	6448Y
Memory	512 GB (16 x 32GB DDR5-4800)
Boot Device	Dell EMC™ Boot Optimized Server Storage (BOSS-N1) with 2 x 480 GB M.2 NVMe SSDs (RAID1)
HDFS Data Disk	2 x Dell Ent NVMe P5500 RI U.2 3.84TB
HDFS Namenode Disk	1 x Dell Ent NVMe P5500 RI U.2 3.84TB
Yarn Cache Disk	1 x Dell Ent NVMe P5500 RI U.2 3.84TB
Network Interface Controller	NetXtreme BCM5720 Gigabit Ethernet PCIe
Cluster size	1
Cloudera Distribution	Cloudera Data Platform 7.1.8
Compute Engine	Spark 3.2.0
Workload	Hibench 7.1.1 – Kmeans Algorithm
Iterations and result choice	3 iterations, average
Spark Configuration
spark.deploy.mode	yarn
Executor Numbers	16
Executor cores	8
spark.executor.memory	24g
spark.executor.memoryOverhead	4g
spark.driver.memory	20g
spark default parallelism	128
spark.driver.maxResultSize	20g
spark.serializer	org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max	1g
spark.network.timeout	1200s
K-means Configuration
Number of clusters	5
Dimensions	1,000
Number of Samples	10,000,000
Samples per inputfile	10,000
Number of Iterations	40
k	300

Your Browser is Out of Date

Spark Machine Learning on Dell HS5610 Platform with Cloudera

Executive Summary

What are the key takeaways?

Spark MLlib