Cloudera Runtime is the core open-source software distribution within CDP, which Cloudera maintains, supports, versions, and packages as a single entity. It comprises more than 40 open-source projects that together provide the core data management capabilities of CDP. Cloudera Runtime also includes Cloudera Manager, which is used to configure and monitor clusters that are managed in CDP.
Table 3 shows the major software components that constitute Cloudera Runtime 7.1.1 for CDP Data Center, along with a brief description of each. For more information, see the Cloudera documentation. The table is followed by an explanation of the changes and differences that users can expect when migrating to CDP Data Center from either CDH or HDP.
Infrastructure summary describes where these components are deployed across the various nodes in this reference architecture design.
Cloudera Manager is a web application that administrators and others can use to configure, manage, and monitor CDP clusters and Cloudera Runtime services. You can also use the Cloudera Manager API to programmatically perform management tasks.
Accumulo is a sorted, distributed key-value store that provides robust, scalable data storage and retrieval.
Atlas provides data governance capabilities for Hadoop. Atlas is also a common metadata store, which is designed to exchange metadata within and outside of the Hadoop stack.
Arrow is a cross-language development platform for in-memory data.
A subproject of Apache Calcite, Avatica is a framework for building database drivers.
Avro is a row-oriented remote procedure call and data serialization framework for Apache Hadoop.
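As a small illustration (not taken from this reference architecture): Avro schemas are declared in plain JSON, so they can be inspected with nothing more than a JSON parser. The record and field names below are hypothetical.

```python
import json

# A minimal, illustrative Avro record schema, written as JSON.
schema_json = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
"""

# Parsing the schema with the standard library is enough to inspect it;
# an actual Avro library (e.g. fastavro) would be used to serialize data.
schema = json.loads(schema_json)
field_names = [f["name"] for f in schema["fields"]]
```

Because the schema travels with the data, Avro readers can resolve records written by older or newer schema versions.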
Calcite is a framework for building databases and data management systems; it includes a SQL parser, a query validator, and a customizable query optimizer with pluggable rules.
DataFu is a collection of libraries for working with large-scale data in Hadoop.
Druid is a distributed data store that creates a unified system for real-time analytics by combining ideas from data warehouses, timeseries databases, and search systems.
Cruise Control automates the dynamic workload rebalancing and self-healing of a Kafka cluster.
Apache Hadoop is a framework that enables distributed processing of large datasets across clusters of systems, using simple programming models. Apache Hadoop is designed to scale out from single servers to thousands of servers.
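The "simple programming model" Hadoop popularized is MapReduce, which can be sketched in a few lines of plain Python. This is an illustration of the model only, not Hadoop's actual Java API; the function names are hypothetical.

```python
from collections import Counter
from itertools import chain

def map_phase(line):
    # Emit (word, 1) pairs, as a word-count mapper would.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Sum counts per key, as reducers would after the shuffle phase
    # groups all pairs sharing a key onto the same node.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data", "big clusters"]
word_counts = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
```

In a real cluster, the map and reduce phases run in parallel across many nodes, with the framework handling data distribution and failure recovery.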
HBase provides random, persistent access to data as a natively nonrelational database. HBase is ideal for scenarios that require real-time analysis and tabular data for end-user applications.
Hadoop Distributed File System is a Java-based file system that provides scalable, reliable data storage for large volumes of data.
Hive is a data warehouse system for summarizing, querying, and analyzing huge, disparate datasets.
Hue is a web-based, interactive query editor that enables users to interact with data warehouses.
Impala provides high-performance, low-latency SQL queries on data stored in Apache Hadoop file formats.
Kafka is a high-performance, highly available, redundant streaming message platform. Kafka functions like a publish-subscribe messaging system, but with better throughput, built-in partitioning, replication, and fault tolerance.
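To make the publish-subscribe model concrete, here is a toy in-memory broker in Python. This is a sketch of the messaging pattern only, not Kafka's client API (a real application would use a Kafka client library against a broker cluster), and all names are hypothetical.

```python
from collections import defaultdict

class MiniPubSub:
    """Toy publish-subscribe broker: each topic fans messages out to
    every subscriber, decoupling producers from consumers."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)

received = []
broker = MiniPubSub()
broker.subscribe("clicks", received.append)     # one consumer records messages
broker.subscribe("clicks", lambda m: None)      # a second, independent consumer
broker.publish("clicks", {"user": 1, "page": "/home"})
```

Kafka adds what this toy lacks: durable partitioned logs, replication across brokers, and consumer offsets that let subscribers replay history.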
Knox is an application gateway for interacting in a secure way with the REST APIs and user interfaces of one or more Hadoop clusters.
Kudu combines fast inserts and updates, and efficient columnar scans, to enable multiple real-time analytic workloads across a single storage layer. Kudu provides fast analytics on fast data.
Livy is a service that enables easy interaction with a Spark cluster over a REST interface.
Oozie is a workflow and coordination service for managing Apache Hadoop jobs.
Optimized Row Columnar (ORC) is a self-describing, type-aware columnar file format designed for Hadoop workloads.
Ozone is a scalable, redundant, and distributed object store optimized for big data workloads. In this release, Ozone is a Beta feature and is not for production use.
Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
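As a rough illustration of the columnar idea (a sketch of the concept, not Parquet's on-disk format): storing values column by column lets a query read only the columns it needs, rather than whole records. The data below is invented.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "city": "Austin", "temp": 31},
    {"id": 2, "city": "Boston", "temp": 24},
    {"id": 3, "city": "Austin", "temp": 33},
]

# Column-oriented layout: each column's values are stored contiguously,
# which is the organizing principle of formats like Parquet and ORC.
columns = {
    "id":   [r["id"] for r in rows],
    "city": [r["city"] for r in rows],
    "temp": [r["temp"] for r in rows],
}

# An aggregate over one column touches only that column's values.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
```

Contiguous same-typed values also compress and encode far better than mixed records, which is why columnar formats dominate analytic workloads.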
Phoenix is an add-on for Apache HBase that provides a programmatic ANSI SQL interface.
Ranger is a CDP security component that enables you to control access to CDP services. Ranger also provides access auditing and reporting.
Schema Registry is a distributed storage layer for schemas that uses Kafka as its underlying storage mechanism.
Cloudera Search uses Apache Solr to provide integrated full text search and natural language access to data stored in, or ingested into, Hadoop, HBase, or cloud storage.
Solr provides natural language access to data stored in, or ingested into, Hadoop, HBase, or cloud storage.
Spark is a distributed, in-memory data processing engine designed for large-scale data processing and analytics.
Sqoop is a CLI-based tool for bulk transfers of data between relational databases and HDFS or cloud object stores.
Streams Messaging Manager is an operations monitoring and management tool that provides end-to-end visibility into an enterprise Apache Kafka environment.
Streams Replication Manager is an enterprise-grade replication solution that enables fault-tolerant, scalable, and robust cross-cluster Kafka topic replication.
Tez is an extensible framework, coordinated by YARN in Apache Hadoop, for building high-performance batch and interactive data processing applications.
YARN is the processing layer for managing distributed applications that run on multiple machines in a network.
Zeppelin is a multipurpose, web-based notebook that brings data ingestion, data exploration, visualization, and collaboration features to Hadoop and Spark.
ZooKeeper is a centralized service used for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
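To illustrate the kind of namespace ZooKeeper exposes (a toy sketch of a hierarchical path-to-data store, not the real ZooKeeper or kazoo client API; all names are hypothetical):

```python
class ToyZNodeStore:
    """Toy hierarchical namespace: slash-separated paths map to small
    data blobs, loosely mimicking ZooKeeper's znode tree."""

    def __init__(self):
        self.nodes = {"/": b""}  # the root node always exists

    def create(self, path, data=b""):
        # Like ZooKeeper, a node can only be created under an existing parent.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        self.nodes[path] = data

    def get(self, path):
        return self.nodes[path]

store = ToyZNodeStore()
store.create("/app")
store.create("/app/config", b"max_workers=8")
```

The real service adds what coordination requires: replicated consensus across an ensemble, watches that notify clients of changes, and ephemeral nodes that vanish when a client session ends.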