The on-disk files in a Kubernetes container are ephemeral.
A container always starts with a clean state, including after a crash and restart, and its files cannot be shared with other containers in the same pod. This ephemeral storage is allocated from the local storage of the host on which the container runs.
A typical Spark environment needs storage beyond this ephemeral allocation, including persistent storage. It also needs access to external big data stores, data lakes, and databases.
This reference architecture uses Kubernetes Volumes to provide access to any additional run-time storage that is needed.
The Kubernetes Container Storage Interface (CSI) enables storage from local drives and supported external storage systems to be mapped as volumes into a container at run time. These volumes can be ephemeral or persistent. A full list of the CSI-compatible storage systems that OpenShift supports can be found in the Dell EMC Ready Stack for Red Hat OpenShift Container Platform 4.2 Design Guide. This reference architecture uses volumes backed by local storage and Dell EMC Isilon.
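As an illustration, the sketch below shows how a Spark application submitted to Kubernetes can request both kinds of volumes through Spark's pod volume configuration properties. The API server URL, claim name, volume names, and mount paths are placeholders for illustration, not values from this reference architecture.

```python
from pyspark.sql import SparkSession

# Placeholder API server URL, PVC name, and mount paths; substitute values
# appropriate for your cluster.
spark = (
    SparkSession.builder
    .appName("volume-mount-sketch")
    .master("k8s://https://api.example.com:6443")
    # Persistent storage: mount a CSI-provisioned PersistentVolumeClaim
    # into each executor pod.
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim."
            "data.options.claimName", "spark-data-pvc")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim."
            "data.mount.path", "/mnt/data")
    # Ephemeral storage: an emptyDir volume allocated from the local storage
    # of the host the executor lands on.
    .config("spark.kubernetes.executor.volumes.emptyDir."
            "scratch.mount.path", "/tmp/scratch")
    .getOrCreate()
)
```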
The bulk of the source data and results for this reference architecture is stored on HDFS.
This reference architecture uses Spark's native HDFS support to connect to a data lake hosted on Isilon. This data path runs directly over TCP/IP and does not require any special Kubernetes support beyond the network layer. The necessary HDFS client libraries are compiled into Spark and included in the container image.
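A minimal sketch of this data path follows, assuming a placeholder Isilon HDFS endpoint and dataset layout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read-sketch").getOrCreate()

# Spark's built-in Hadoop client speaks the HDFS protocol directly over
# TCP/IP to the Isilon endpoint; no Kubernetes volume is involved.
# The hostname, port, dataset path, and column name are placeholders.
df = spark.read.parquet("hdfs://isilon.example.com:8020/datalake/events")
df.groupBy("event_type").count().show()
```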
This reference architecture uses Spark's native JDBC support to access external databases.
Like the HDFS path, this capability does not require any special Kubernetes support, and the necessary libraries are included in the container image.
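A minimal sketch of the JDBC path, with a placeholder PostgreSQL endpoint, table, and credentials (the JDBC driver for the target database is assumed to be on the Spark classpath, for example baked into the image):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read-sketch").getOrCreate()

# Placeholder connection details; substitute the URL, table, and
# credentials for the external database you are accessing.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "spark")
    .option("password", "example-password")
    .load()
)
orders.show(5)
```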