CDP Data Center is a comprehensive, on-premises platform for integrated data analytics that encompasses ingest, processing, analysis, experimentation, and deployment. It integrates the best of CDH and HDP to deliver the latest open-source data management and analytics technologies. CDP Data Center is optimized for deployment within the data center and is ready for private cloud.
A core layer of CDP Data Center is Cloudera Shared Data Experience (SDX), which provides uniform data catalog, schema, replication, security, and governance capabilities.
Cloudera SDX includes the following capabilities:
Schema | Automatic capture and storage of all schema and metadata definitions as platform workloads use and create them. |
Replication | Delivery of the data copies and data policies that the enterprise requires, with complete consistency and security. |
Security | Role-based access control applied consistently across the platform, including full stack encryption and key management. |
Governance | Enterprise-grade auditing, lineage, and governance capabilities applied across the platform with rich extensibility for partner integrations. |
Figure 2 shows a high-level view of the CDP Data Center architecture. CDP Data Center Runtime consists of a large set of software components, including Apache HDFS, Apache Hive 3, Apache HBase, and Apache Impala, along with many other components for specialized workloads. The full list is shown in Table 3.
Several preconfigured packages of services, sometimes known as cluster shapes, are available for common workloads. These packages include:
Data Engineering | Provides the ability to ingest, transform, and analyze data. Services include: HDFS, YARN, YARN Queue Manager, Ranger, Atlas, Hive, Hive on Tez, Spark, Oozie, Hue, and Data Analytics Studio. |
Data Mart | Enables you to browse, query, and explore your data in an interactive way. Services include: HDFS, Ranger, Atlas, Hive, Impala, and Hue. |
Operational Database | Provides low-latency writes, reads, and persistent access to data for online transaction processing (OLTP) use cases and real-time insights. Services include: HDFS, Ranger, Atlas, and HBase. |
Figure 2: CDP Data Center high-level architecture
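As an illustration only, the service lists of the three preconfigured packages above can be written down as plain Python sets to see which services every package has in common. This is a minimal sketch: the names `CLUSTER_SHAPES`, `shared_services`, and `shapes_with` are hypothetical and are not part of Cloudera Manager or any Cloudera API. Only the service lists themselves come from the text above.

```python
# Service lists copied from the preconfigured packages described above.
# CLUSTER_SHAPES and both helper functions are hypothetical names used
# only for this illustration; they are not a Cloudera API.
CLUSTER_SHAPES = {
    "Data Engineering": {
        "HDFS", "YARN", "YARN Queue Manager", "Ranger", "Atlas", "Hive",
        "Hive on Tez", "Spark", "Oozie", "Hue", "Data Analytics Studio",
    },
    "Data Mart": {"HDFS", "Ranger", "Atlas", "Hive", "Impala", "Hue"},
    "Operational Database": {"HDFS", "Ranger", "Atlas", "HBase"},
}


def shared_services(shapes):
    """Return the set of services that appear in every cluster shape."""
    return set.intersection(*shapes.values())


def shapes_with(service, shapes):
    """Return the sorted names of the cluster shapes that include a service."""
    return sorted(name for name, services in shapes.items() if service in services)


if __name__ == "__main__":
    print(sorted(shared_services(CLUSTER_SHAPES)))
    print(shapes_with("Hive", CLUSTER_SHAPES))
```

With the lists as given, the services common to all three packages are HDFS, Ranger, and Atlas, while Hive appears only in the Data Engineering and Data Mart packages.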
You can also create custom services and clusters from Cloudera Manager, which deploys any combination of supported services that you select. Looking ahead to the CDP Private Cloud release, many of these preconfigured and custom packages will become containerized services known as Analytic Experiences.
Streaming Data | Using the custom services option in Cloudera Manager, you can create either a simple or a full Kafka cluster for data ingest and streams messaging, with monitoring and replication. This provides the capabilities that Cloudera calls Stream Processing and Streams Messaging. Services include: Kafka, Schema Registry, Streams Messaging Manager, Streams Replication Manager, Cruise Control, and ZooKeeper. | Following the initial CDP Data Center release, Cloudera will release Cloudera Flow Management (CFM), with support for the latest Apache NiFi and NiFi Registry releases, to be followed by Edge Management and Streaming Analytics with Apache Flink, Kafka Streams, and Spark Streaming. Collectively, these products will eventually become the Cloudera DataFlow (CDF) platform. |
Machine Learning | Machine Learning (ML) capabilities, also available with CDP Data Center, include support for the Cloudera Data Science Workbench (CDSW), a platform for collaborative data science at scale. CDSW enables data scientists and IT professionals to build and manage their own analytics pipelines and to quickly deploy models and interactive visual apps. | |
Key features, improvements, and benefits of CDP Data Center 7.1.1 include:
Streams Messaging | A comprehensive Kafka streaming experience that improves operational efficiency, business continuity, and scalability. |
Data Engineering | Improved performance and interoperability for Apache Spark, and improved management of data engineering workflows and pipeline creation. |
Data Warehouse | Faster SQL analytics on larger data sets, deeper understanding from unstructured data sources, and easier visualizations of business insights. |
Machine Learning | Data Science Workbench is now available on CDP Data Center with advanced control over experiments and model deployment. |
Operational Database | Improved performance, policy management, and availability. |
SDX | Enhanced security, compliance, and consistency across CDP. |
Support for in-place upgrades and migrations | From CDH 5.x and HDP 2.x to CDP Data Center. |
All the features and capabilities that are new to users migrating or upgrading from CDH or HDP are described in CDP Data Center components.