Cloudera Runtime is the core open-source software distribution within CDP that Cloudera maintains, supports, versions, and packages as a single entity. Cloudera Runtime includes over 40 open-source projects that constitute the core distribution of data management tools within CDP. Cloudera Runtime also includes Cloudera Manager, which is used to configure and monitor clusters that are managed in CDP.
Table 3 lists the major software components that constitute Cloudera Runtime 7.1.1 for CDP Data Center, along with a brief description of each. For more information, see the Cloudera documentation. The table is followed by an explanation of the changes and differences that users can expect when migrating to CDP Data Center from either CDH or HDP.
Infrastructure summary describes where these components are deployed across the various nodes in this reference architecture design.
Table 3: CDP Data Center software components
Component | Version | Description |
Cloudera Manager | 7.1.1 | Cloudera Manager is a web application that administrators and others can use to configure, manage, and monitor CDP clusters and Cloudera Runtime services. You can also use the Cloudera Manager API to programmatically perform management tasks. |
Apache Accumulo | 1.7.0 | Accumulo is a sorted, distributed key-value store that provides robust, scalable data storage and retrieval. |
Apache Atlas | 2.0.0 | Atlas provides data governance capabilities for Hadoop. Atlas is also a common metadata store, which is designed to exchange metadata within and outside of the Hadoop stack. |
Apache Arrow | 0.8.0 | Arrow is a cross-language development platform for in-memory data. |
Apache Avatica | 1.10.0 | A subproject of Apache Calcite, Avatica is a framework for building database drivers. |
Apache Avro | 1.8.2 | Avro is a row-oriented remote procedure call and data serialization framework for Apache Hadoop. |
Apache Calcite | 1.19.0 | Calcite is a framework for building databases and data management systems. |
Apache DataFu | 1.3.0 | DataFu is a collection of libraries for working with large-scale data in Hadoop. |
Apache Druid | 0.15.1 | Druid is a distributed data store designed as a unified system for real-time analytics. |
Cruise Control | 2.0.100 | Cruise Control automates the dynamic workload rebalancing and self-healing of a Kafka cluster. |
Apache Hadoop | 3.1.1 | Apache Hadoop is a framework that enables distributed processing of large datasets across clusters of systems, using simple programming models. Apache Hadoop is designed to scale out from single servers to thousands of servers. |
Apache HBase | 2.2.3 | HBase provides random, persistent access to data as a natively nonrelational database. HBase is ideal for scenarios that require real-time analysis and tabular data for end-user applications. |
Apache HDFS | 3.1.1 | Hadoop Distributed File System is a Java-based file system that provides scalable, reliable data storage for large volumes of data. |
Apache Hive | 3.1.3000 | Hive is a data warehouse system for summarizing, querying, and analyzing huge, disparate datasets. |
Hue | 4.5.0 | Hue is a web-based, interactive query editor that enables users to interact with data warehouses. |
Apache Impala | 3.4.0 | Impala provides high-performance, low-latency SQL queries on data stored in Apache Hadoop file formats. |
Apache Kafka | 2.4.1 | Kafka is a high-performance, highly available, redundant streaming message platform. Kafka functions like a publish-subscribe messaging system. |
Apache Knox | 1.3.0 | Knox is an application gateway for interacting in a secure way with the REST APIs and user interfaces of one or more Hadoop clusters. |
Apache Kudu | 1.12.0 | Kudu combines fast inserts and updates, and efficient columnar scans, to enable multiple real-time analytic workloads across a single storage layer. Kudu provides fast analytics on fast data. |
Apache Livy | 0.6.0 | Livy is a service that enables easy interaction with a Spark cluster over a REST interface. |
Apache Oozie | 5.1.0 | Oozie is a workflow and coordination service for managing Apache Hadoop jobs. |
Apache ORC | 1.5.1 | Optimized Row Columnar (ORC) is a self-describing, type-aware columnar file format designed for Hadoop workloads. |
Apache Ozone (Beta) | 0.5.0 | Ozone is a scalable, redundant, and distributed object store optimized for big data workloads. Beta components are not intended for production use. |
Apache Parquet | 1.10.99 | Parquet is a columnar storage format available to any project in the Hadoop ecosystem. |
Apache Phoenix | 5.0.0 | Phoenix is an add-on for Apache HBase that provides a programmatic ANSI SQL interface. |
Apache Ranger | 2.0.0 | Ranger is a CDP security component that enables you to control access to CDP services. Ranger also provides access auditing and reporting. |
Schema Registry | 0.8.1 | Schema Registry is a distributed storage layer for schemas that uses Kafka as its underlying storage mechanism. |
Cloudera Search | 1.0.0 | Cloudera Search uses Apache Solr to provide integrated full text search and natural language access to data stored in, or ingested into, Hadoop, HBase, or cloud storage. |
Apache Solr | 8.4.1 | Solr provides natural language access to data stored in, or ingested into, Hadoop, HBase, or cloud storage. |
Apache Spark | 2.4.0 | Spark is a distributed, in-memory data processing engine designed for large-scale data processing and analytics. |
Apache Sqoop | 1.4.7 | Sqoop is a CLI-based tool for bulk transfers of data between relational databases and HDFS or cloud object stores. |
Streams Messaging Manager | 2.1.0 | Streams Messaging Manager is an operations monitoring and management tool that provides end-to-end visibility in an enterprise Apache Kafka environment. |
Streams Replication Manager | 1.0.0 | Streams Replication Manager is an enterprise-grade replication solution that enables fault tolerant, scalable, robust cross-cluster Kafka topic replication. |
Apache Tez | 0.9.1 | Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. |
Apache YARN | 3.1.1 | YARN is the processing layer for managing distributed applications that run on multiple machines in a network. |
Apache Zeppelin | 0.8.2 | Zeppelin is a multipurpose, web-based notebook for Hadoop and Spark. |
Apache ZooKeeper | 3.5.5 | ZooKeeper is a centralized service that is used for maintaining configuration information, naming, distributed synchronization, and group services. |
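
The Cloudera Manager entry in Table 3 notes that management tasks can also be performed programmatically through the Cloudera Manager API. The following is a minimal Python sketch of that idea, not a definitive client. It assumes the API is reachable over HTTP basic authentication on the default Cloudera Manager port (7180), that `GET /api/version` returns the highest supported API version as plain text, and that `GET /api/<version>/clusters` returns a JSON object with an `items` list. The helper names (`cm_api_url`, `basic_auth_header`, `cm_get`, `list_cluster_names`) and the host name and credentials in the usage note are hypothetical and are used only for illustration.

```python
import base64
import json
import urllib.request


def cm_api_url(host, path, port=7180, scheme="http"):
    """Build a Cloudera Manager REST URL, for example http://host:7180/api/version."""
    return f"{scheme}://{host}:{port}/api/{path.lstrip('/')}"


def basic_auth_header(user, password):
    """Return an HTTP basic authentication header for the given credentials."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def cm_get(host, path, user, password, port=7180, scheme="http"):
    """Send a GET request to the API and return the response body as text."""
    request = urllib.request.Request(
        cm_api_url(host, path, port=port, scheme=scheme),
        headers=basic_auth_header(user, password),
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def list_cluster_names(host, user, password):
    """Return the names of the clusters managed by this Cloudera Manager instance."""
    # Assumption: /api/version returns a plain-text version string such as "v<N>".
    version = cm_get(host, "version", user, password).strip()
    # Assumption: the cluster list is a JSON object with an "items" array.
    clusters = json.loads(cm_get(host, f"{version}/clusters", user, password))
    return [item["name"] for item in clusters["items"]]
```

With a reachable Cloudera Manager instance, a call such as `list_cluster_names("cm.example.com", "admin", "admin")` would return the cluster names as a list of strings. In a TLS-enabled deployment, `cm_get` would need `scheme="https"` and the corresponding port instead of the defaults.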