Understanding containers and images

The concepts of containers and images are crucial when creating and managing containerized software. Containers are instances of container images running in real time, while images contain ready-to-run software. Containers in OpenShift Container Platform are based on OCI- or Docker-formatted container images. OpenShift Container Platform provides redundancy and horizontal scaling to a service packaged into an image by deploying the same image into containers across multiple hosts.

The podman or docker CLIs can also be used to build images. OpenShift Container Platform also provides builder images that help you create images by adding your code or configuration to existing images.

Once the image build process completes, you must push these images to the Image registry. A registry contains a collection of one or more image repositories, which contain one or more tagged images. OpenShift Container Platform can also provide its own OpenShift image registry for managing custom container images, similar to the Red Hat registry at registry.redhat.io.

Dell Technologies implemented and deployed workloads as part of the solution verification process. The following sections describe the implementations of three of these workloads.

Apache Spark

Spark can run on Kubernetes managed clusters. With this feature, Spark uses the native Kubernetes Scheduler which is more robust than the stand-alone scheduler that is packaged with Spark by default. It can be further enhanced with better resource allocation, monitoring, and logging, making Kubernetes an ideal Spark companion.

The process can be as simple as supplying Spark applications to Kubernetes with the spark-submit command. As a result, resources are created inside a Kubernetes pod and orchestration activities are passed to the Kubernetes API server and Scheduler.

An alternative solution is to use the Operator to manage and monitor your Spark jobs in a declarative manner. You do not have to check job statuses or log files, or track job versions, as it streamlines the process for tracking jobs. ML experts can take advantage of the best of both worlds when Spark and Kubernetes are used together.

See the Red Hat OpenShift Container Platform blog Getting Started running Spark workloads on OpenShift for more information.

The Spark operator available on Red Hat Marketplace OperatorHub helps managing the life cycle of Spark applications on a Kubernetes cluster. However, the operator available on Red Hat OperatorHub at the time of this release provides older versions of Spark and other dependencies that are required to run the Delta Lake stack. Dell Technologies has created its own customized image with the following versions:

Apache Spark 3.4.1
Apache Hadoop client libraries 3.3.4 (for HDFS access)
Apache Hadoop AWS libraries 3.3.4 (for S3 access)
Delta Lake 2.4.0 and 3.0.0rc1 libraries
Apache Iceberg 1.3.1

A screenshot of the Spark operator from Red Hat Marketplace — Figure 7. Spark operator

Several executors can be specified, along with their resource requirements, when running the spark-submit command. For example:

spark-submit \
  --packages io.delta:delta-core_2.12:3.0.0,org.apache.hadoop:/
    hadoop-aws:3.3.4,org.apache.hadoop:hadoop-common:3.3.4 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql./
    delta.catalog.DeltaCatalog \
  --master k8s://https://your-kubernetes-api-server:port \
  --deploy-mode client \
  --conf spark.kubernetes.driver.request.cores=1 \
  --conf spark.kubernetes.executor.request.cores=1 \
  --conf spark.kubernetes.executor.limit.cores=2 \
  --conf spark.kubernetes.executor.request.memory=2g \
  --conf spark.kubernetes.executor.limit.memory=4g \
  --conf spark.executor.instances=4 \
  local:///your-spark-python-script.py

In the example above Dell Technologies uses four Spark executors, each with one vCPU and 4 GB of memory.

Confluent Kafka

Apache Kafka is a widely used technology for event and data streaming, which is at the core of Confluent Platform. Kafka is designed for building next-generation, event-driven applications. It provides organizations with a secure, enterprise-ready platform for real-time and historical event processing.

The Kafka on Kubernetes platform helps solve data pipeline scaling issues. It reduces the onboarding time of publishers and subscribers from several weeks to a few days. Kafka clusters can be scaled on demand on the OpenShift Container Platform using the HorizontalPodAutoscaler (HPA).

Confluent for Kubernetes is a Red Hat Certified OpenShift Operator available in the Red Hat Marketplace OperatorHub. This operator includes standard Kafka components, including:

Kafka Broker
Kafka Schema Registry
Kafka REST proxy
Confluent Control Center
ZooKeeper (for runtime support)
ksqlDB

A screenshot of the Confluent Kafka operator from Red Hat Marketplace — Figure 8. Confluent Kafka operator

The persistent volume storage class that is derived from Dell PowerStore storage is used to fulfill Kafka pods Persistent Volume Claim (PVC) requirements. The number of cores and memory per broker can be configured using:

The OpenShift Web UI
The Kafka broker configuration YAML file, provided in the wizard while deploying Kafka brokers

Using Confluent Kafka, Dell Technologies built a sample ETL pipeline that demonstrates how to stream live data into Kafka topics. It also demonstrates how to subscribe to a Spark instance that includes Delta Lake and Iceberg libraries. The modern data stack is configured on top of Dell PowerScale, ECS, or ObjectScale storage to write data into Delta Lake or Iceberg formats.

Starburst

Starburst is a SQL-based Massively Parallel Processing (MPP) query engine for data lakes, data warehouses, and data meshes. Customers can address data silos and speed of access issues with enterprise and cloud solutions by Starburst. The Trino open-source query engine (formerly Presto) is the foundation of Starburst, which aggregates data from distributed data sources.Starburst provides access to over 50 enterprise data sources ranging from data lakes and warehouses to streaming systems, relational database systems, and more. Starburst Enterprise enables you to use different protocols for querying your different data sources, regardless of whether they are structured, semistructured, or unstructured.

See Starburst Connectors for more information about the list of available connectors.

A screenshot of the Starburst operator from the Red Hat Marketplace — Figure 9. Starburst operators

The Dell Technologies and Starburst Multicloud Data Analytics solution uses the Starburst federated query engine to enable customers to:

Run their business intelligence, analytics, and data science workloads.
Access data spread across the enterprise.

This operator includes standard Starburst components:

Starburst Hive
Starburst Enterprise
- Starburst Coordinator
- Starburst Workers

See Multicloud Data Analytics with Dell Technologies Powered by Starburst for more information about the Dell Technologies and Starburst published referenced architecture.

Your Browser is Out of Date

Workload design

Workload design

Understanding containers and images

Apache Spark

Confluent Kafka

Starburst