Hadoop is an ecosystem of open-source components that fundamentally change the way enterprises store, process, and analyze their data. Unlike traditional systems, Hadoop enables multiple types of analytic workloads to run on the same data, simultaneously, at massive scale on industry-standard hardware[2].
Hadoop’s infinitely scalable flexible architecture is based on the HDFS file system. HDFS allows organizations to store and analyze unlimited amounts and types of data in a single open-source platform on industry-standard hardware. Hadoop can quickly integrate with existing applications and transform complex data using multiple data access options like Apache Hive, or Apache Pig for batch processing, or Apache Spark for fast in-memory processing.
While there are several vendors for Hadoop, our tested configuration implemented Cloudera Hadoop.
There are two steps to configure Cloudera Hadoop:
Both steps are described in the following sections