A Spark application generally runs on Kubernetes the same way as it runs under other cluster managers, with a driver program and executors. Under Kubernetes, the driver program and each executor run in their own Kubernetes pods.
All containers within a Kubernetes pod share the same network namespace and IP address, can share storage volumes, and are always scheduled together on the same node.
The Kubernetes scheduler allocates the pods across available nodes in the cluster.
Kubernetes requires users to provide a container image that is deployed and run inside the containers. This image should include the Spark runtime, the application code, and any libraries and files the application depends on.
Dell EMC built a custom image for the inventory management example. See Spark image for the details.
Launching a Spark program under Kubernetes requires a program or script that uses the Kubernetes API (through the Kubernetes API server) to allocate the driver and executor pods and manage the life cycle of the application.
There are two ways to launch a Spark program under Kubernetes: the spark-submit script included with Apache Spark, and the Spark Operator for Kubernetes.
Dell EMC uses spark-submit as the primary method of launching Spark programs.
The spark-submit script that is included with Apache Spark supports multiple cluster managers, including Kubernetes. Most Spark users are already familiar with spark-submit, and it works well with Kubernetes.
With Kubernetes, the --master argument should specify the Kubernetes API server address and port, using a k8s:// prefix. The --deploy-mode argument should be cluster. All the other Kubernetes-specific options are passed as part of the Spark configuration. You can run spark-submit from outside the cluster, or from a container running on the cluster.
Example 1 shows the use of spark-submit to launch a time-series model training job with Spark on Kubernetes.
Example 1 spark-submit
spark-submit \
--master k8s://https://100.84.118.17:6443/ \
--deploy-mode cluster \
--name tsmodel \
--conf spark.executor.extraClassPath=/opt/spark/examples/jars/ \
--conf spark.driver.extraClassPath=/opt/spark/examples/jars/ \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=hdfs://isilon.tan.lab/history/spark-logs \
--conf spark.kubernetes.namespace=spark-jobs \
--conf spark.executor.instances=4 \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.container.image=infra.tan.lab/tan/ \
--conf spark.kubernetes.authenticate.submission.caCertFile=/etc/ \
Operators are software extensions to Kubernetes that are used to manage applications and their components. Operators all follow the same design pattern and provide a uniform interface to Kubernetes across workloads.
The Spark Operator for Kubernetes can be used to launch Spark applications. The Spark Operator uses a declarative specification for the Spark job, and manages the life cycle of the job. Internally, the Spark Operator uses spark-submit, but it manages the life cycle and provides status and monitoring using Kubernetes interfaces.
Dell EMC also uses the Spark Operator to launch Spark programs. It works well for the application, but it is relatively new and not as widely used as spark-submit.
The following examples describe using the Spark Operator:
Example 2 Operator YAML file (sparkop-ts_model.yaml) used to launch an application
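A minimal SparkApplication specification for this job could look like the following sketch. The name, namespace, executor count, and event-log settings are taken from the other examples in this section; the application type, Spark version, image tag, main application file path, and resource sizes are illustrative assumptions rather than the exact contents of sparkop-ts_model.yaml.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: sparkop-tsmodel
  namespace: spark-jobs
spec:
  type: Scala                                # assumed application type
  mode: cluster
  image: infra.tan.lab/tan/spark:latest      # illustrative image name and tag
  imagePullPolicy: Always
  mainApplicationFile: local:///opt/spark/examples/jars/app.jar  # placeholder path
  sparkVersion: "2.4.4"                      # illustrative version
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "hdfs://isilon.tan.lab/history/spark-logs"
  driver:
    cores: 1
    memory: 1g
    serviceAccount: spark                    # assumes a service account named spark exists
  executor:
    instances: 4
    cores: 1
    memory: 1g

Applying this file with kubectl, as shown in Example 3, submits the application and lets the operator manage its life cycle.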
Example 3 Launching a Spark application
k8s1:~/SparkOperator$ kubectl apply -f sparkop-ts_model.yaml
k8s1:~/SparkOperator$ k get sparkApplications
Example 4 Checking status of a Spark application
k8s1:~/SparkOperator$ kubectl describe sparkApplications sparkop-tsmodel
API Version: sparkoperator.k8s.io/v1beta2
Normal SparkExecutorPending 9s (x2 over 9s) spark-operator Executor sparkop-tsmodel-1575645577306-exec-7 is pending
Normal SparkExecutorPending 9s (x2 over 9s) spark-operator Executor sparkop-tsmodel-1575645577306-exec-8 is pending
Normal SparkExecutorRunning 7s spark-operator Executor sparkop-tsmodel-1575645577306-exec-7 is running
Normal SparkExecutorRunning 7s spark-operator Executor sparkop-tsmodel-1575645577306-exec-6 is running
Normal SparkExecutorRunning 6s spark-operator Executor sparkop-tsmodel-1575645577306-exec-8 is running
Example 5 Checking Spark application logs
k8s1:~/SparkOperator$ kubectl logs tsmodel-1575512453627-driver
The prior examples include both interactive and batch execution.
Dell EMC uses Jupyter for interactive analysis and connects to Spark from within Jupyter notebooks. The image that was created earlier includes Jupyter. The Jupyter image runs in its own container on the Kubernetes cluster independent of the Spark jobs.
When a job is submitted to the cluster, the OpenShift scheduler is responsible for identifying the most suitable compute node on which to host the pods. The default scheduler is policy-based, and uses constraints and requirements to determine the most appropriate node.
Platform administrators can control the scheduling with advanced features including pod and node affinity, node selectors, and overcommit rules. The user specifies the requirements and constraints when submitting the job, but the platform administrator controls how the request is ultimately handled.
In general, the scheduler abstracts the physical nodes, and the user has little control over which physical node a job runs on. OpenShift MachineSets and node selectors can be used to get finer-grained control over placement on specific nodes, as illustrated in the sketch below.
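For example, a node selector can be added to the SparkApplication specification so that the driver and executor pods are scheduled only onto nodes that carry a matching label. The label key and value below are illustrative assumptions; an administrator would apply the label to the target nodes (or to the nodes in a MachineSet) in advance.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: sparkop-tsmodel
  namespace: spark-jobs
spec:
  # ... other fields as in Example 2 ...
  nodeSelector:        # applied to both driver and executor pods
    disktype: ssd      # illustrative label

When launching with spark-submit, the same constraint can be expressed with the spark.kubernetes.node.selector.<labelKey> property in the Spark configuration.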