The proliferation of data science tools and platforms over the last decade has left many organizations stalled in long cycles of software and hardware evaluation. Dell EMC has shown how Spark hosted in a Kubernetes-managed cluster can provide a wide range of data science capabilities that IT can install, configure, and manage.
The use case showed that data scientists and data engineers can collaborate to build a full analytics pipeline without having to go outside the Spark ecosystem.
The demonstration of Jupyter notebooks and the Jupyter notebook server adds rapid prototyping and visualization capabilities to the data science team's workflow. This approach uses the same container and Kubernetes management tool set as all of the other Spark-specific services.
This reference architecture also showed IT professionals how they can leverage the growing capabilities of Kubernetes to manage infrastructure for Spark analytics. Distributed analytics workloads have historically been difficult to migrate to containers and Kubernetes. This document offered lab-tested recommendations that should save organizations that are new to running Spark on Kubernetes countless hours of research and prototyping.
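As a minimal sketch of the pattern described above, a Spark application can be submitted directly to a Kubernetes API server with `spark-submit`, letting Kubernetes schedule the driver and executor pods. The API server address, namespace, service account, and container image below are illustrative placeholders, not values taken from the reference lab:

```shell
# Submit a Spark application to a Kubernetes cluster in cluster deploy mode.
# All endpoints, names, and image tags here are placeholders; substitute the
# values appropriate to your own environment.
spark-submit \
  --master k8s://https://k8s-apiserver.example.com:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=4 \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=registry.example.com/spark:v3.0.0 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar
```

In cluster deploy mode, the driver itself runs as a pod, so Kubernetes resource quotas, namespaces, and service accounts govern the Spark workload in the same way as any other containerized service.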
Finally, all the specifics of the reference lab hardware, software, and configurations used for the retail inventory management use case demonstration were documented. This document provided guidance for a representative reference architecture that Dell EMC considers appropriate for general-purpose data analytics spanning all stages of an Apache Spark analytics pipeline.
Dell EMC chose the Dell EMC Ready Stack for Red Hat OpenShift Container Platform 4.2 Design Guide as a base implementation for Kubernetes. The detailed Red Hat OpenShift Container Platform information and the Spark background and configuration details in this reference architecture can jump-start any organization wanting to test an analytics platform that appeals to both data scientists and IT professionals.