Accelerate DevOps and Cloud Native Apps with the Dell Validated Platform for Red Hat OpenShift
Thu, 15 Sep 2022 13:28:43 -0000
Today we announce the release of the Dell Validated Platform for Red Hat OpenShift. This platform has been jointly validated by Red Hat and Dell, and is an evolution of the design referenced in the white paper “Red Hat OpenShift 4.6 with CSI PowerFlex 1.3.0 Deployment on Dell EMC PowerFlex Family”.
Figure 1: The Dell Validated Platform for Red Hat OpenShift
The world is moving faster, and with that comes the struggle not just to maintain processes, but to streamline them and accelerate deliverables. We are no longer in the age of semi-annual or quarterly releases; some industries need multiple releases a day to meet their goals. Accomplishing this requires a mix of technology and process changes … enter the world of containers.
Containerization is not a new technology, but in recent years it has picked up a tremendous amount of steam. It is no longer a fringe technology reserved for those on the bleeding edge; it has become mainstream and is being used by organizations large and small. However, technology alone will not solve everything. To be successful, your processes must change with the technology, and this is where DevOps comes in.
DevOps is a different approach to Information Technology; it blends resources that are usually separated into different teams with different reporting structures and often different goals. It systematically looks to eliminate process bottlenecks and applies automation to help organizations move faster than they ever thought possible. DevOps is not a single process, but a methodology, and one that can be challenging to implement.
Why Red Hat OpenShift?
Red Hat OpenShift is an enterprise-grade container orchestration and management platform based on Kubernetes. While many organizations understand the value of moving to containerization and are familiar with the name Kubernetes, most don’t have a full grasp of what Kubernetes is and what it isn’t. OpenShift builds on its own Kubernetes distribution and layers critical enterprise features on top, such as:
- Built-in underlying hardware management and scaling, integrated with Dell iDRAC
- Multi-Cluster deployment, management, and shift-left security enforcement
- Developer Experience – CI/CD, GitOps, Pipelines, Logging, Monitoring, and Observability
- Integrated Networking including ServiceMesh and multi-cluster networking
- Integrated Web Console with distinct Admin and Developer views
- Automated Platform Updates and Upgrades
- Multiple workload options – containers, virtual machines, and serverless
- Operators for extending and managing additional capabilities
All these capabilities mean that you have a full container platform with a rigorously tested and certified toolchain that can accelerate your development and reduce the costs associated with maintenance and downtime. This is what has made OpenShift the number one container platform in the market.
Figure 2: Realizing business value from a hybrid strategy - Source: IDC White Paper, sponsored by Red Hat, "The Business Value of Red Hat OpenShift", doc # US47539121, February 2021.
Meeting the performance needs
Scalable container platforms like Red Hat OpenShift work best when paired with a fast, scalable infrastructure platform, which is why OpenShift and Dell PowerFlex are the perfect team. With PowerFlex, organizations can have a single software-defined platform for all their workloads: bare metal, virtualized, and containerized, all on blazing-fast infrastructure that can scale to thousands of nodes. The API-driven architecture of PowerFlex also fits perfectly in a methodology centered on automation. To help jumpstart customers on their automation journey, we have already created robust infrastructure and DevOps automation through extensive tooling that includes:
- Dell Container Storage Modules (CSM)/Container Storage Interface (CSI) Plugins
- Ansible Modules
- AppSync Integration
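As a hedged illustration of the CSI integration above, developers can consume PowerFlex storage through an ordinary StorageClass and PersistentVolumeClaim. The sketch below assumes the provisioner name and storage pool parameter used by the Dell CSI driver for PowerFlex; verify both against the documentation for the driver version you deploy:

```yaml
# Hypothetical sketch: a StorageClass backed by the Dell CSI driver for PowerFlex.
# The provisioner name and parameters are assumptions; verify them against your
# deployed driver version.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: powerflex-storage
provisioner: csi-vxflexos.dellemc.com
parameters:
  storagepool: pool1            # assumed storage pool name
---
# A claim that application pods can reference for persistent storage.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: powerflex-storage
  resources:
    requests:
      storage: 8Gi
```

Once the StorageClass exists, provisioning is fully dynamic: each claim results in a new PowerFlex volume with no manual storage administration.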
Being software-defined means that PowerFlex can deliver linear performance by being able to balance data across all nodes. This ensures that you can spread the work out over the cluster to scale well beyond the limits of the individual hardware components. This also allows PowerFlex to be incredibly resilient, capable of seamlessly recovering from individual component or node failures.
Putting it all together
Introducing the Dell Validated Platform for Red Hat OpenShift, the latest collaboration in the long 22-year partnership between Red Hat and Dell. This platform brings together the power of Red Hat OpenShift with the flexibility and performance of Dell PowerFlex into a single package.
Figure 3: The Dell Validated Platform for Red Hat OpenShift Architecture
This platform uses PowerFlex in a 2-tier architecture to give you optimal performance, and the ability to scale storage and compute independently, up to thousands of nodes. We are also taking advantage of Red Hat capabilities to run PowerFlex Manager and its accompanying services in OpenShift Virtualization to make efficient use of compute nodes and minimize the required hardware footprint.
The combined platform gives you the ability to become more agile and increase productivity through the extensive automation already available, along with the documented APIs to extend that automation or create your own.
This platform has been fully validated by both Dell and Red Hat, so you can run it with confidence. We have also streamlined the ordering process, so the entire platform, including the Red Hat software and subscriptions, can be acquired directly from Dell. Deployment is handled through Dell’s ProDeploy services to ensure an optimal implementation and get you up and running sooner, so you can start realizing the value of the platform faster while reducing risk.
If you are interested in getting more information about the Dell Validated Platform for Red Hat OpenShift, please contact your Dell representative.
Michael Wells, PowerFlex Engineering Technologist
Rhys Oxenham, Director, Customer & Field Engagement
Related Blog Posts
Driving Innovation with the Dell Validated Platform for Red Hat OpenShift and IBM Instana
Wed, 14 Dec 2022 21:20:39 -0000
“There is no innovation and creativity without failure. Period.” – Brené Brown
In the Information Technology field today, it seems like it’s impossible to go five minutes without someone using some variation of the word innovate. We are constantly told we need to innovate to stay competitive and remain relevant. I don’t want to spend time arguing the importance of innovation, because if you’re reading this then you probably already understand its importance.
What I do want to focus on is the role that failure plays in innovation. One of the biggest barriers to innovation is the fear of failure. We have all experienced some level of failure in our lives, and the costly mistakes can be particularly memorable. To create a culture that fosters innovation, we need to create an environment that reduces the costs associated with failure – these can be financial costs, time costs, or reputation costs. This is why one of the core tenets of modern application architecture is “fail fast”. Put simply, it means to identify mistakes quickly and adjust. The idea is that a flawed process or assumption will cost more to fix the longer it is present in the system. With traditional waterfall processes, that flaw could be present and undetected for months during the development process, and in some cases, even make it through to production.
While the benefits of fail fast can be easy to see, implementing it can be a bit harder. It involves streamlining not just the development process, but also the build process, the release process, and having proper instrumentation all the way through from dev to production. This last part, instrumentation, is the focus of this article. Instrumentation means monitoring a system to allow the operators to:
- See current state
- Identify application performance
- Detect when something is not operating as expected
While the need for instrumentation has always been present, developers are often faced with difficult timelines and the first feature areas that tend to be cut are testing and instrumentation. This can help in the short term, but it often ends up costing more down the road, both financially and in the end-user experience.
IBM Instana is a tool that provides observability of complete systems, with support for over 250 different technologies. This means that you can deploy Instana into the environment and start seeing valuable information without requiring any code changes. If you are supporting web-based applications, you can also take things further by including basic script references in the code to gain insights from client statistics as well.
Announcing Support for Instana on the Dell Validated Platform for Red Hat OpenShift
Installing IBM Instana into the Dell Validated Platform for Red Hat OpenShift can be done by Operator, Helm Chart, or YAML File.
The simplest way is to use the Operator. This consists of the following steps:
- Create the instana-agent project
- Set the policy permissions for the instana-agent service account
- Install the Operator
- Apply the Operator Configuration using a custom resource YAML file
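As a sketch of that final step, the Operator is configured through an InstanaAgent custom resource along the lines of the following. The API version, field names, and values here are assumptions based on typical agent configuration; consult the IBM Instana documentation for the authoritative schema:

```yaml
# Hypothetical InstanaAgent custom resource; field names and values
# are assumptions, placeholders come from your Instana tenant.
apiVersion: instana.io/v1
kind: InstanaAgent
metadata:
  name: instana-agent
  namespace: instana-agent
spec:
  zone:
    name: dell-openshift            # assumed zone label for this cluster
  cluster:
    name: ocp-powerflex             # assumed cluster display name
  agent:
    key: <your-agent-key>           # placeholder: agent key from your tenant
    endpointHost: <endpoint-host>   # IBM cloud endpoint or private endpoint
    endpointPort: "443"
```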
You can configure IBM Instana to point to IBM’s cloud endpoint or, for high-security environments, to a private IBM Instana endpoint hosted internally.
Figure 1. Infrastructure view of the OpenShift Cluster
Once configured, the IBM Instana agent starts sending data to the endpoint for analysis. The graphical view in Figure 1 shows the overall health of the Kubernetes cluster, and the node on which each resource is located. The resources in a normal state are gray: any resource requiring attention would appear in a different color.
Figure 2: Cluster View
We can also see the metrics across the cluster, including CPU and Memory statistics. The charts are kept in time sync, so if you highlight a given area or narrow the time period, all of the charts remain in the same context. This makes it easy to identify correlations between different metrics and events.
Figure 3: Application Calls View
Looking at the application calls allows you to see how a given application is performing over time. Being able to narrow down to a one second granularity means that you can actually follow individual calls through the system and see things like the parameters passed in the call. This can be incredibly helpful for troubleshooting intermittent application issues.
Figure 4: Application Dependencies View
The dependencies view gives you a graphical representation of all the components within a system and how they relate to each other, in a dependency diagram. This is critically important in modern application design because as you implement a larger number of more focused services, often created by different DevOps teams, it can be difficult to keep track of what services are being composed together.
Figure 5: Application Stack Traces
The application stack trace allows you to walk the stack of an application to see what calls were made, and how much time each call took to complete. Knowing that a page load took five seconds can help indicate a problem, but being able to walk the stack and identify that 4.8 seconds was spent running a database query (and exactly what query that was) means that you can spend less time troubleshooting, because you already know exactly what needs to be fixed.
For more information about the Dell Validated Platform for Red Hat OpenShift, see our launch announcement: Accelerate DevOps and Cloud Native Apps with the Dell Validated Platform for Red Hat OpenShift | Dell Technologies Info Hub.
Author: Michael Wells, PowerFlex Engineering Technologist
NVIDIA AI Enterprise on Red Hat OpenShift
Wed, 15 Nov 2023 14:20:48 -0000
Red Hat OpenShift Container Platform is an enterprise-grade Kubernetes platform for deploying and managing secure and hardened Kubernetes clusters at scale. This Kubernetes distribution enables users to easily configure and use GPU resources to accelerate deep learning (DL) and machine learning (ML) workloads.
The NVIDIA H100 Tensor Core GPU, an integral part of the NVIDIA data center platform, is a high-performance GPU that is designed and optimized for AI workloads that are intended for data center and cloud-based applications. The GPU features major advances to accelerate AI, HPC, memory bandwidth, interconnect, and communication at data center scale. For more information, see NVIDIA H100 Tensor Core GPU.
NVIDIA AI Enterprise
NVIDIA AI Enterprise is an end-to-end, secure, cloud-native suite of AI software that enables organizations to solve new challenges while increasing operational efficiency. NVIDIA AI Enterprise accelerates the data science pipeline and streamlines development and deployment of production AI, including generative AI, computer vision, speech AI, and more. For more information, see NVIDIA AI Enterprise.
NVIDIA NGC catalog
The NVIDIA NGC catalog is a curated set of GPU-optimized software for AI, HPC, and Visualization. The NGC catalog simplifies building, customizing, and integrating GPU-optimized software into workflows on a variety of platforms, accelerating the time to solutions for users. The catalog includes containers, pre-trained models, Helm charts for Kubernetes deployments, and industry-specific AI toolkits. These toolkits consist of software development kits (SDKs) for NVIDIA AI Enterprise that can be deployed on OpenShift Container Platform.
Prerequisites for installing NVIDIA AI Enterprise on OpenShift Container Platform
- An OpenShift cluster with a minimum of three nodes, at least one of which has an NVIDIA-supported GPU. For the list of supported GPUs, see the NVIDIA Product Support Matrix.
- A service instance for licenses. This blog briefly describes how to deploy a containerized DLS instance on OpenShift Container Platform that serves licenses to the clients.
NVIDIA license system
The NVIDIA license system is used to provide software licenses to licensed NVIDIA software products. The licenses are available from the NVIDIA Licensing Portal (access requires NVIDIA login credentials). The NVIDIA license system supports the following types of service instances: a Cloud License Service (CLS) instance that is hosted on the NVIDIA Licensing Portal, and a Delegated License Service (DLS) instance that is hosted on-premises at a location that is accessible from your private network, such as inside your data center.
A DLS instance is fully disconnected from the NVIDIA Licensing Portal. Licenses are downloaded from the portal and uploaded manually to the instance. The following figure depicts the flow:
The following DLS software image types are available:
- A virtual appliance image to be installed in a virtual machine on a supported hypervisor.
- A containerized software image for bare-metal deployment on a supported container orchestration platform.
Setting up a DLS instance
1. Download the latest "NLS License Server (DLS) 2.1 for Container Platforms" software from the NVIDIA Licensing Portal.
2. To import DLS appliance and PostgreSQL, run the following commands:
podman load --input dls_appliance_2.1.0.tar.gz
podman load --input dls_pgsql_2.1.0.tar.gz
3. Upload the DLS appliance artifact and the PostgreSQL database artifact images to a private repository.
4. Edit the deployment files for the DLS appliance and the PostgreSQL database so that they pull these artifacts from the private repository.
You must provide an IP address for DLS_PUBLIC_IP. Optionally, you can edit the DLS default ports in the nls-si-0-deployment.yaml and nls-si-0-service.yaml deployment files. If a registry secret is required to pull the images from the private repository, edit the deployment files for the DLS appliance and the PostgreSQL database to reference the secret.
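For illustration, the edited portion of the nls-si-0-deployment.yaml file might resemble the following. Only DLS_PUBLIC_IP and the optional registry secret reference come from the steps above; the container name, image path, and IP address are placeholders:

```yaml
# Hypothetical excerpt from nls-si-0-deployment.yaml after editing;
# names other than DLS_PUBLIC_IP are placeholders.
spec:
  template:
    spec:
      imagePullSecrets:
        - name: private-registry-secret   # only if your repository requires it
      containers:
        - name: nls-si-0
          image: <private-repo>/dls_appliance:2.1.0
          env:
            - name: DLS_PUBLIC_IP
              value: "10.0.0.50"          # IP address reachable by license clients
```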
5. Create a Postgres instance by running the following command:
oc create -f directory/postgres-nls-si-0-deployment.yaml
6. Fetch the IP address of the Postgres pod that you created in the previous step, set the DLS_DB_HOST environment variable in the nls-si-0-deployment.yaml file to that IP address, and then create the DLS instance:
oc create -f directory/nls-si-0-deployment.yaml
7. Access the DLS instance at https://<worker-node-ip>:30001. Register the default admin user dls_admin with a new password during the first login.
8. Create a license server on the NVIDIA Licensing Portal, and then add the licenses for the products that you want to allot to this license server.
9. Register the on-premises DLS instance by uploading the DLS token file dls_instance_token_mm-dd-yyyy-hh-mm-ss.tok to the NVIDIA Licensing Portal. Bind the license server that you created in the preceding step to the registered service instance.
10. Download the license file license_mm-dd-yyyy-hh-mm-ss.bin from the license server on the portal and upload it to your on-premises DLS instance. The licenses on the server are made available to the DLS instance.
11. Generate the client configuration token file from the DLS instance. The client configuration token contains information about the service instance, license servers, and fulfillment conditions to be used to serve a license in response to a client request.
12. Copy the client configuration token to clients so that the service instance has the necessary information to serve licenses to clients.
Installing NVIDIA AI Enterprise on OpenShift
1. Install the Node Feature Discovery (NFD) operator.
Install the NFD operator from the embedded Red Hat OperatorHub. After the operator is installed, create an NFD API so that the NFD operator can label the cluster nodes that have GPUs.
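The NFD API mentioned above is created as a NodeFeatureDiscovery custom resource. A minimal sketch follows; the instance name is an assumption, and the operator typically fills in sane operand defaults:

```yaml
# Minimal NodeFeatureDiscovery instance; the name is an assumption,
# and the operator supplies default operand settings.
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec: {}
```

Once this resource exists, nodes with GPUs receive feature labels that the GPU operator uses to target its driver and toolkit pods.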
2. Install the NVIDIA GPU operator.
Install the NVIDIA GPU operator from the embedded Red Hat OperatorHub. The GPU operator enables Kubernetes cluster engineers to manage GPU nodes just like CPU nodes in the cluster. The operator installs and manages the life cycle of software components so that GPU-accelerated applications can be run on Kubernetes. This operator is installed in the nvidia-gpu-operator namespace by default.
3. Create an NGC secret.
Create an image pull secret object in the nvidia-gpu-operator namespace. This object stores the NGC API key used to authenticate your access to the NGC container registry. Generate the API key from the NGC catalog.
Use the following credentials for the NGC secret:
- Authentication type: Image registry credentials
- Registry server address: nvcr.io/nvaie
- Username: $oauthtoken
- Password: the generated API key
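Equivalently, the pull secret can be expressed as a manifest. In this sketch the secret name is an assumption and <NGC_API_KEY> is a placeholder for your generated key:

```yaml
# Hypothetical image pull secret for the NGC registry; the secret name
# is an assumption, <NGC_API_KEY> is a placeholder.
apiVersion: v1
kind: Secret
metadata:
  name: ngc-secret
  namespace: nvidia-gpu-operator
type: kubernetes.io/dockerconfigjson
stringData:
  .dockerconfigjson: |
    {"auths": {"nvcr.io/nvaie": {"username": "$oauthtoken", "password": "<NGC_API_KEY>"}}}
```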
Figure 3. NGC secret
4. Create a ConfigMap with configuration data.
Create a configmap in the nvidia-gpu-operator namespace with the client configuration token as data.
gridd.conf: '# empty file'
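Put together, the ConfigMap might look like the following. The ConfigMap name and the token key name are assumptions, and the token contents come from the client configuration token generated during the DLS setup:

```yaml
# Hypothetical licensing ConfigMap; the name and token key are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: licensing-config
  namespace: nvidia-gpu-operator
data:
  client_configuration_token.tok: "<contents of the client configuration token>"
  gridd.conf: '# empty file'
```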
5. Create a Cluster Policy Custom Resource instance.
When you install the NVIDIA GPU operator in OpenShift Container Platform, a custom resource definition for a cluster policy is created. The policy configures the GPU stack that will be deployed, configuring the image names and repository, pod restrictions or credentials, and so on. When creating the cluster policy from the OpenShift web console, make the following customizations:
1. In the NVIDIA GPU/vGPU driver configuration section, enter the name of the configmap containing the client configuration token that you created, and enable the NLS.
2. Enable the deployment of the NVIDIA driver through the operator. The image repository is nvcr.io/nvaie.
3. Enter the NGC secret name in the driver configuration.
4. Specify the image name and NVIDIA vGPU driver version in the NVIDIA GPU/vGPU driver configuration section. Get this information from the NGC catalog, as shown in the following figure:
Figure 4. Configmap with Client configuration token
For a cluster on OpenShift Container Platform version 4.12, the NVIDIA GPU driver image is vgpu-guest-driver-3-1 and the version is 525.105.17. The GPU operator installs all the components that are required to set up the NVIDIA GPUs in the OpenShift cluster.
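As a sketch, the driver section of the resulting ClusterPolicy might resemble the following. The field names follow the GPU operator's conventions but should be treated as assumptions; the repository, image, and version values are the ones quoted above, and the secret and ConfigMap names are placeholders:

```yaml
# Hypothetical driver section of the ClusterPolicy custom resource;
# field names are assumptions, values are from the validated environment.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    enabled: true
    repository: nvcr.io/nvaie
    image: vgpu-guest-driver-3-1
    version: "525.105.17"
    imagePullSecrets:
      - ngc-secret                      # assumed NGC secret name
    licensingConfig:
      configMapName: licensing-config   # assumed licensing ConfigMap name
      nlsEnabled: true
```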
Environment overview: The Dell OpenShift validation team used Dell PowerEdge servers hosting Red Hat OpenShift Container Platform 4.12 to validate NVIDIA AI Enterprise on OpenShift. The validated environment consisted of three compute nodes hosted on PowerEdge R760, R750, and R7525 servers, equipped respectively with NVIDIA H100, A40, and A100 GPUs. For more information about deploying an OpenShift cluster on Dell-powered bare metal servers, see the .
A containerized DLS instance is present on the same OpenShift cluster with all the required licenses.
The team created a TensorFlow pod using the "tensorflow-3-1" image from the nvcr.io/nvaie repository, referencing the image in the pod specification as follows:
- image: nvcr.io/nvaie/tensorflow-3-1:23.03-tf1-nvaie-3.1-py3
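In context, a minimal pod specification using that image might look like the following. Only the image reference comes from the text; the pod name, command, and GPU request are assumptions:

```yaml
# Hypothetical pod spec for the TensorFlow test; only the image reference
# comes from the text, the rest are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-test
spec:
  restartPolicy: Never
  containers:
    - name: tensorflow
      image: nvcr.io/nvaie/tensorflow-3-1:23.03-tf1-nvaie-3.1-py3
      command: ["sleep", "infinity"]    # keep the pod up for interactive runs
      resources:
        limits:
          nvidia.com/gpu: 1             # request one GPU for the benchmark
```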
The ResNet-50 convolutional neural network with FP32 and FP16 precision from inside the TensorFlow pod ran successfully.
To run the test, the team used the following commands:
python resnet.py --layers 50 -b 64 -i 200 -u batch --precision fp16
python resnet.py --layers 50 -b 64 -i 300 -u batch --precision fp32