Insights on Cloudera Data Platform on VMware Cloud Foundation Powered by VMware vSAN
Thu, 05 Oct 2023 19:34:38 -0000
Summary
This joint paper briefly discusses the key hardware considerations for a successful deployment and recommends configurations based on 15th Generation PowerEdge servers.
Market positioning
VMware Cloud Foundation is built on VMware’s leading hyperconverged architecture, VMware vSAN, with all-flash performance and enterprise-class storage services including deduplication, compression, and erasure coding. vSAN implements a hyperconverged storage architecture that delivers elastic storage and simplifies storage management.
VMware vSAN is the market leader in hyperconverged infrastructure (HCI), enabling low-cost, high-performance next-generation HCI solutions. It converges traditional IT infrastructure silos onto industry-standard servers, virtualizes physical infrastructure to help customers easily evolve their infrastructure without risk, improves TCO over traditional resource silos, and scales to tomorrow with support for new hardware, applications, and cloud strategies.
Cloudera Data Platform (CDP) Private Cloud Base supports a variety of hybrid solutions where compute tasks are separated from data storage and where data can be accessed from remote clusters, including workloads created using CDP Private Cloud Experiences. This hybrid approach provides a foundation for containerized applications by managing storage, table schema, authentication, authorization, and governance.
Key Considerations
- Often, enterprises have at least a development CDP cluster, a preproduction staging CDP cluster, and a production cluster. With virtualization, there is the flexibility to share the hardware for these Hadoop clusters. The CDP version for the development cluster is likely more current than that of the others because developers like to work with the newer versions. Dedicating a set of hardware to one version of a Hadoop vendor’s product does not make the best use of resources.
- Co-locating CDP VMs on host servers with VMs supporting different workloads is also possible, particularly for situations that are not performance critical. Doing this can balance the use of the system. This often enables better overall utilization by consolidating applications that either use different kinds of hardware resources or use the hardware resources at different times of the day or night.
- Efficiency: VMware enables easy and efficient deployment of CDP on an existing virtual infrastructure as well as consolidation of otherwise dedicated CDP cluster hardware into a data center or cloud environment.
- Availability and fault tolerance: vSphere features such as VMware vSphere High Availability (vSphere HA) and VMware vSphere Fault Tolerance (vSphere FT) can protect the CDP components from server failure and improve availability. Resource management tools such as VMware vSphere vMotion can provide availability during planned server downtime and maintenance windows.
Available Configurations
Cloudera Data Platform on VMware Cloud Foundation (VCF) with vSAN
| VCF Management Domain (4 nodes required) | VCF Workload Domain for Cloudera Data Platform Base (4 minimum, up to 64 nodes per workload domain; up to 15 workload domains, including management domain) |
Platform | PowerEdge R650 supporting 10 NVMe drives (direct), or VxRail E660N | PowerEdge R650 supporting 10 NVMe drives (direct), or VxRail E660N |
CPU | 2x Intel® Xeon® Gold 5318Y processor (2.1 GHz, 24 cores) | 2x Intel Xeon Gold 6348 processor (2.6GHz, 28 cores 4 GHz) |
DRAM | 256 GB (16x 16 GB DDR4-3200) or more | 512 GB (16x 32 GB DDR4-3200) or more |
Boot Device | Dell BOSS-S2 with 2x 240 GB or 2x 480 GB M.2 SATA SSD (RAID 1) | Dell BOSS-S2 with 2x 240 GB or 2x 480 GB M.2 SATA SSD (RAID 1) |
Cache Tier Drives | 2x 400 GB Intel Optane P5800X (PCIe Gen4) | 2x 400 GB Intel Optane P5800X (PCIe Gen4) |
Capacity Tier Drives (1) | 6x (up to 8x) 1.92 TB Enterprise NVMe Read Intensive AG Drive U.2 Gen4 | 8x 1.92 TB or 3.84 TB Enterprise NVMe Read Intensive AG Drive U.2 Gen4 |
Network Interface Controller | Intel E810-XXVDA2 for OCP3 (dual-port 25Gb) | Intel E810-XXVDA2 for OCP3 (dual-port 25Gb), or Intel E810-CQDA2 PCIe (dual-port 100Gb) |
Note: For more than 7 workload domains, each node needs a minimum of 512GB DRAM (16x 32GB) and more capacity (use 3.84TB drives instead of 1.92TB).
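The limits in the table and note above can be expressed as a simple plan check. The following is a minimal sketch, not a Dell or VMware tool: the `NodeSpec` class and `check_vcf_plan` function are hypothetical names introduced for illustration, and the sketch assumes that "more than 7 workload domains" counts workload domains without the management domain.

```python
from __future__ import annotations

from dataclasses import dataclass

# Limits taken from the configuration table and note in this paper.
MIN_WORKLOAD_NODES = 4        # minimum nodes per workload domain
MAX_WORKLOAD_NODES = 64       # maximum nodes per workload domain
MAX_DOMAINS = 15              # workload domains, including the management domain
LARGE_DEPLOYMENT_DOMAINS = 7  # above this count, the note asks for larger nodes
LARGE_MIN_DRAM_GB = 512       # 16x 32 GB
LARGE_MIN_DRIVE_TB = 3.84     # use 3.84 TB capacity drives instead of 1.92 TB


@dataclass
class NodeSpec:
    """Hypothetical per-node description used only for this sketch."""
    dram_gb: int
    capacity_drive_tb: float


def check_vcf_plan(workload_domain_nodes: list[int], node: NodeSpec) -> list[str]:
    """Return a list of human-readable problems; an empty list means the plan fits."""
    problems: list[str] = []

    # The domain limit includes the management domain, so add one.
    total_domains = len(workload_domain_nodes) + 1
    if total_domains > MAX_DOMAINS:
        problems.append(f"{total_domains} domains exceeds the limit of {MAX_DOMAINS}")

    for index, count in enumerate(workload_domain_nodes, start=1):
        if not MIN_WORKLOAD_NODES <= count <= MAX_WORKLOAD_NODES:
            problems.append(
                f"workload domain {index} has {count} nodes "
                f"(allowed: {MIN_WORKLOAD_NODES} to {MAX_WORKLOAD_NODES})"
            )

    # Assumption: the threshold counts workload domains only.
    if len(workload_domain_nodes) > LARGE_DEPLOYMENT_DOMAINS:
        if node.dram_gb < LARGE_MIN_DRAM_GB:
            problems.append(f"each node needs at least {LARGE_MIN_DRAM_GB} GB DRAM")
        if node.capacity_drive_tb < LARGE_MIN_DRIVE_TB:
            problems.append(f"use {LARGE_MIN_DRIVE_TB} TB capacity drives")

    return problems


if __name__ == "__main__":
    # Eight workload domains on the smaller node configuration.
    for problem in check_vcf_plan([4] * 8, NodeSpec(dram_gb=256, capacity_drive_tb=1.92)):
        print(problem)
```

A check like this is only a planning aid; the authoritative limits are those published for the specific VMware Cloud Foundation release in use.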
This solution can be deployed on either Dell PowerEdge based vSAN ReadyNodes or VxRail appliances.
Solution adopted from https://core.vmware.com/resource/cloudera-data-platform-vmware-cloud-foundation-powered-vmware-vsan.
For more information and specifications, contact a Dell representative. Alternative storage configurations can be considered.
Authors: Todd Mottershead (Dell), Seamus Jones (Dell), Esther Baldwin (Intel), Krzysztof Cieplucha (Intel), Teck Joo (Intel), Amandeep Raina (Intel), and Patryk Wolsza (Intel)
Related Documents
Extracting Insights on a Scalable and Security-Enabled Data Platform from Cloudera
Fri, 14 Jul 2023 19:48:55 -0000
Summary
This joint paper briefly discusses the key hardware considerations for a successful deployment and recommends configurations based on the most recent PowerEdge server portfolio offerings.
Market positioning
Cloudera Data Platform (CDP) Private Cloud is a scalable data platform that allows data to be managed across its lifecycle, from ingestion to analysis, without leaving the data center. It comprises two products: Cloudera Private Cloud Base (the on-premises portion built on Dell PowerEdge servers) and Cloudera Private Cloud Data Services. The Data Services provide containerized compute analytics applications that scale dynamically and can be upgraded independently. This platform simplifies managing the growing volume and variety of data in your enterprise, and unleashes the business value of that data. By disaggregating compute and storage and supporting a container-based environment, CDP Private Cloud helps enhance business agility and flexibility. The platform also includes secure user access and data governance features.
Key considerations
- Data throughput - CDP Private Cloud on Dell PowerEdge servers is built on high-performing Intel architecture. Intel® Ethernet network controllers, adapters, and accessories enable agility in the data center and support high throughput. Unlike many other point solutions, CDP Private Cloud is an end-to-end platform for data, from collecting and engineering to reporting and using AI capabilities.
- Balanced system configuration - CDP Private Cloud can handle multiple varying workloads, including analytics and machine learning (ML). Its capabilities are supported by generation-over-generation improvements in underlying Intel technologies that offer more cores and higher memory capacity.
- Data latency - As data grows and needs to be accessed across the cluster, data-access response times are critical, especially for real-time analytics applications.
Available configurations
Table 1. Cloudera Data Platform (CDP) Private Cloud Base Cluster
Note: For a storage-only configuration (HDFS/Ozone), customers can still choose traditional high-density storage nodes with high-capacity rotational HDDs based on the PowerEdge R740xd2 platform, although external storage systems, such as Dell PowerScale or Dell ECS, are recommended. Customers should be aware that using large capacity HDDs increases the time of background scans (bit-rot detection) and block report generation for HDFS. It also significantly increases recovery time after a full node failure. Also, using nodes with more than 100 TB of storage is not recommended by Cloudera. Source: https://blog.cloudera.com/disk-and-datanode-size-in-hdfs/. For more information and specifications, contact a Dell representative.
Table 2. CDP Private Cloud Data Services (Red Hat OpenShift Kubernetes)/Embedded Container Service (ECS) Cluster
Learn more
Contact your Dell Technologies or Intel account team for a customized quote at 1-877-289-3355.
Note: This document may contain language from third-party content that is not under Dell Technologies’ control and is not consistent with current guidelines for Dell Technologies’ own content. When such third-party content is updated by the relevant third parties, this document will be revised accordingly.
Extract Insights on a Scalable and Security-Enabled Data Platform from Cloudera
Mon, 29 Jan 2024 22:48:44 -0000
Summary
This joint paper outlines the key hardware considerations when configuring a data platform based on Dell’s most recent 16th Generation PowerEdge server portfolio offerings.
Market positioning
Cloudera® Data Platform (CDP) Private Cloud is a scalable data platform that allows data to be managed across its life cycle, from ingestion to analysis, without leaving the data center. It consists of two products: Cloudera Private Cloud Base (the on-premises portion built on Dell PowerEdge™ servers) and Cloudera Private Cloud Data Services. The Data Services provide containerized compute analytic applications that scale dynamically and can be upgraded independently. This platform simplifies managing the growing volume and variety of data in your enterprise, unleashing the business value of that data. CDP Private Cloud helps enhance business agility and flexibility by disaggregating compute and storage and supporting a container-based environment. The platform also includes secure user access and data governance features.
Key Considerations
- Scalability and Performance: The CDP Platform is built on Dell’s 16th Generation PowerEdge servers with 4th Generation Intel® Xeon® processor architecture. It can accommodate growing enterprise data workloads and efficiently handle increasing demands for analytics and machine learning in a smaller footprint.
- Compatibility and Integration: Compatibility and seamless integration between CDP Private Cloud and the hardware components are essential for a successful deployment in a cloud environment. Intel architecture-based Dell PowerEdge servers are well suited to run the CDP Platform on a private cloud, which helps deliver faster time-to-market and minimize the total cost of ownership.
- Availability and Resilience: The reliability and resilience features of the 16th Generation PowerEdge servers (such as redundant power supplies, hardware monitoring, and failover capabilities) are critical for maintaining the reliability and availability of the CDP Platform.
Available Configurations
The new Dell PowerEdge HS5610 is a 1U, two-socket rack server purpose-built for Cloud Service Providers’ most popular IT applications. It also lends itself well to Hybrid Cloud Edge deployments. This scalable server optimizes technology without the financial and operational burden of supporting extreme configurations. With tailored performance, I/O flexibility, and open ecosystem system management, you gain simplicity for large-scale, heterogeneous SaaS, PaaS, and IaaS data centers.
Some of the benefits include:
- Faster performance by using 4th generation Intel® Xeon® Scalable processors with up to 32 cores per socket
- Accelerated in-memory applications with up to 16 DDR5 RDIMMs with speeds up to 4800 MT/s
- Designed to take up less space than traditional servers, which makes them a good option for data centers with limited space and for cloud service providers
- Designed to be cooled efficiently, which can help to prevent overheating and ensure the longevity of the servers in cloud and on-premises deployments
- Power efficient, which can help to reduce the overall operating costs of a data center
- Configurations that can easily scale to meet changing demand, which can help to optimize the cost of a data center
- Long-lived instances for space and cost reductions
- Validated workloads that reduce data center costs and overhead
- Resilient architecture for Zero Trust IT environments and operations
Cloudera® Data Platform (CDP) Private Cloud Base Cluster
| Edge Node (1 Node) + Master Nodes (Minimum of Three Nodes Required) | Worker Nodes for Use with External Storage System (Minimum of Three Nodes Required) | Worker Nodes with Local All-Flash Storage (Minimum of Three Nodes Required) | Worker Nodes with Local HDDs (Minimum of Three Nodes Required) |
Functions | Edge node: Apache Hadoop® clients, NameNode, Resource Manager, Apache ZooKeeper | DataNode, NodeManager, CDP DC (YARN) workloads | DataNode, NodeManager, CDP DC (YARN) workloads | DataNode, NodeManager, CDP DC (YARN) workloads |
Platform | Dell PowerEdge HS5610 (1RU) Chassis with up to 10x 2.5" SAS/SATA/NVMe Direct Drives | Dell PowerEdge HS5610 (1RU) Chassis with up to 10x 2.5" SAS/SATA/NVMe Direct Drives | Dell PowerEdge HS5620 (2RU) Chassis with up to 16x 2.5" SAS/SATA and 8x 2.5" NVMe | Dell PowerEdge HS5620 (2RU) Chassis with up to 12x 3.5" Drives and 2x 2.5" rear storage (NVMe) |
CPU | 2 x 4th Gen Intel® Xeon® Gold 6426Y processor | 2 x 4th Gen Intel® Xeon® Gold 6448Y processor | 2 x 4th Gen Intel® Xeon® Gold 6448Y processor | 2 x 4th Gen Intel® Xeon® Gold 6448Y processor |
DRAM | 256 GB (16 x 16 GB DDR5-4800) | 512 GB (16 x 32 GB DDR5-4800) | 512 GB (16 x 32 GB DDR5-4800) | 512 GB (16 x 32 GB DDR5-4800) |
Boot Device | Dell EMC™ Boot Optimized Server Storage (BOSS-N1) with 2 x 480 GB M.2 NVMe SSDs (RAID 1) | Dell EMC™ Boot Optimized Server Storage (BOSS-N1) with 2 x 480 GB M.2 NVMe SSDs (RAID 1) | Dell EMC™ Boot Optimized Server Storage (BOSS-N1) with 2 x 480 GB M.2 NVMe SSDs (RAID 1) | Dell EMC™ Boot Optimized Server Storage (BOSS-N1) with 2 x 480 GB M.2 NVMe SSDs (RAID 1) |
Storage Adapter | Dell PERC H755N NVMe RAID adapter | None | Dell HBA355i | Dell HBA355i |
Storage HDFS/Ozone | 2x (up to 4x) 3.84 TB SATA Read Intensive SSD 2.5in AG Drive, 1DWPD | Not required. Use an external storage system instead | 8x (up to 16x) 3.84 TB SATA Read Intensive SSD 2.5in AG Drive, 1DWPD | 12x 4 TB (or larger) 7.2K RPM NLSAS 12 Gbps 512n 3.5" hot plug HDD |
Storage Fast Cache | 1 x 1.6 TB or 3.2 TB Enterprise NVMe Mixed Use AG Drive U.2 Gen4 | 1 x 3.2 TB Enterprise NVMe Mixed Use AG Drive U.2 Gen4 | 1 x 3.2 TB Enterprise NVMe Mixed Use AG Drive U.2 Gen4 | 1 x 3.2 TB Enterprise NVMe Mixed Use AG Drive U.2 Gen4 |
Network Interface Controller | Intel Ethernet Network Adapter E810-XXVDA2 for OCP3 (dual-port 10/25 GbE) | Intel Ethernet Network Adapter E810-XXVDA2 for OCP3 (dual-port 10/25 GbE) | Intel Ethernet Network Adapter E810-XXVDA2 for OCP3 (dual-port 10/25 GbE) | Intel Ethernet Network Adapter E810-XXVDA2 for OCP3 (dual-port 10/25 GbE) |
Additional NIC | None | Intel Ethernet Network Adapter E810-XXV (dual-port 10/25 GbE), or | None | None |
Note: For a storage-only configuration (Hadoop Distributed File System/Ozone), customers can still choose traditional high-density storage nodes with high-capacity rotational HDDs based on the HS5610 platform; however, external storage systems such as Dell PowerScale or ECS are recommended. Customers should be aware that using large-capacity HDDs increases the time of background scans (bit-rot detection) and block report generation for HDFS and significantly increases recovery time after a full node failure. Also, Cloudera does not recommend using nodes with more than 100 TB of storage. Source: https://blog.cloudera.com/disk-and-datanode-size-in-hdfs/. For more information and specifications, contact a Dell representative.
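The 100 TB per-node guidance in the note above can be checked with simple arithmetic on the drive count and drive size. The following is a minimal sketch: the function names are hypothetical, it considers raw capacity only (no replication or file system overhead), and the 12 TB drive in the last example is an illustrative value that does not appear in the configuration table.

```python
# Guidance cited in the note: nodes above 100 TB of storage are not recommended.
CLOUDERA_MAX_NODE_TB = 100.0


def raw_node_capacity_tb(drive_count: int, drive_size_tb: float) -> float:
    """Raw per-node capacity in TB, before replication or formatting overhead."""
    return drive_count * drive_size_tb


def exceeds_node_guidance(drive_count: int, drive_size_tb: float,
                          limit_tb: float = CLOUDERA_MAX_NODE_TB) -> bool:
    """True when the raw per-node capacity is above the cited limit."""
    return raw_node_capacity_tb(drive_count, drive_size_tb) > limit_tb


if __name__ == "__main__":
    # (drive count, drive size in TB); the first two come from the table above.
    candidates = [
        (12, 4.0),    # worker with local HDDs: 12x 4 TB
        (16, 3.84),   # all-flash worker at its maximum: 16x 3.84 TB
        (12, 12.0),   # hypothetical larger HDDs, for illustration only
    ]
    for count, size in candidates:
        total = raw_node_capacity_tb(count, size)
        flag = "exceeds guidance" if exceeds_node_guidance(count, size) else "within guidance"
        print(f"{count} x {size} TB = {total:.2f} TB raw ({flag})")
```

Both tabled worker configurations stay well under the limit; the check matters mainly when substituting larger HDDs, which the table permits ("4 TB (or larger)").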
CDP Private Cloud Data Services (Red Hat® OpenShift® Kubernetes®)/Embedded Container Service (ECS) Cluster
| Container Services Administration Host | Master Nodes (Three Nodes Required) | Worker Nodes (10 Nodes or More) |
Functions | OpenShift administration services | OpenShift services, Kubernetes services | Kubernetes operators, Cloudera® Data Platform (CDP) Private Cloud workload pods |
Platform | Dell PowerEdge HS5610 (1RU) Chassis with up to 10x 2.5" SAS/SATA/NVMe Direct Drives |
CPU | 2 x 4th Gen Intel® Xeon® Gold 6426Y processor | 2 x 4th Gen Intel® Xeon® Gold 6448Y processor |
DRAM | 128 GB (16 x 8 GB DDR5-4800) | Standard configuration: 512 GB (16 x 32 GB DDR5-4800); large memory configuration: 1024 GB (16 x 64 GB DDR5-4800) |
Boot Device | Dell EMC™ Boot Optimized Server Storage (BOSS-N1) with 2 x 480 GB M.2 NVMe SSDs (RAID 1) |
Storage Adapter | Not required for all-NVMe configuration. |
Storage (NVMe) | 1 x 1.6 TB Enterprise NVMe Mixed Use AG Drive U.2 Gen4 | 1 x 3.2 TB Enterprise NVMe Mixed Use AG Drive U.2 Gen4 | 1 x 6.4 TB Enterprise NVMe Mixed Use AG Drive U.2 Gen4 |
Network Interface Controller | Intel Ethernet Network Adapter E810-XXVDA2 for OCP3 (dual-port 10/25 GbE) |
Additional NIC | Intel Ethernet Network Adapter E810-XXV (dual-port 10/25 GbE) |
Learn More
Contact your Dell account team for a customized quote at 1-877-289-3355, or go to the Intel and Cloudera solutions page.
- For workloads requiring high network bandwidth, customers might use an Intel Ethernet Network Adapter E810-CQDA2 with PCIe (dual-port 100 GbE) and 100 GbE top-of-rack (ToR) switches.
- An additional NIC is recommended for connectivity to an external storage system using a dedicated storage network.