Short topics related to data storage essentials.
SolutionPack for iDRAC PowerEdge
Wed, 01 Mar 2023 17:16:08 -0000
Dell Storage Resource Manager (SRM) provides comprehensive monitoring, reporting, and analysis for heterogeneous block, file, object, and virtualized storage environments. It enables you to visualize application-to-storage dependencies and to monitor and analyze configurations and capacity growth. It has visibility into the environment’s physical and virtual relationships to ensure consistent service levels.
To enable storage administrators to monitor their physical and virtual compute environment, Dell provides SRM solution packs. These solution packs include SolutionPack for Physical Hosts, Microsoft Hyper-V, IBM LPAR, Brocade FC Switch and Cisco MDS/Nexus with passive host discovery options, VMware vSphere & vSAN, and Dell VxRail.
With the new SolutionPack for iDRAC PowerEdge, we can monitor the status of server hardware components such as power supplies, temperature probes, cooling fans, and batteries. We can also gather historical information about electrical energy usage and other key performance indicators that measure the proper functioning of a server device.
To illustrate SRM’s cross-domain functionality, we examine the most common use case, where Dell PowerEdge physical servers are deployed as part of VMware hypervisor clusters.
SolutionPack for VMware vSphere & vSAN provides capacity, performance, and relationship data for all VMware discovered components, such as VMs, hypervisors, clusters, and datastores, as well as their relationship with fabric and backend storage arrays. Here is one example of the end-to-end topology of the virtualized environment:
Figure 1. Example of end-to-end topology of a virtualized environment
To gain physical access to the PowerEdge servers and their hardware components, we rely on the integrated Dell Remote Access Controller (iDRAC), a baseboard management controller that is embedded in PowerEdge servers.
iDRAC exposes hardware component data through several APIs, one of which is SNMP. With the SRM SNMP collector, which is part of the SolutionPack for iDRAC PowerEdge, we discover the iDRACs from which we pull PowerEdge server data. This data includes electrical energy usage (Wh), probe temperature (C), power supply output (W), and cooling device speed (RPM). It also includes the status of power supplies, batteries, cooling devices, and temperature probes, as well as server availability. SRM provides historical reports for all of these metrics, with a maximum 7-year data retention for weekly aggregates.
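Before adding an iDRAC to the SRM SNMP collector, you can confirm that it answers SNMP queries from any Linux host with the net-snmp tools. This is only a reachability check: the IP address and community string below are placeholders, and the OID is simply the Dell enterprise subtree rather than a specific iDRAC metric.

# Placeholder IP and community string; 1.3.6.1.4.1.674 is the Dell enterprise OID subtree
snmpwalk -v 2c -c public 192.168.10.50 1.3.6.1.4.1.674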
With the data available from the iDRAC PowerEdge, VMware vSphere & vSAN, and relevant fabric and storage array solution packs, users can seamlessly navigate from the context of physical server hardware component reports to the context of the physical server reports within the broader SAN environment.
Let’s examine the component status data, performance data, and alerts provided by the SolutionPack for iDRAC PowerEdge.
The Summary page Card View and Table View for PowerEdge servers show hardware component status (temperature probes, cooling devices, battery, power supply), server availability, daily electrical energy usage (kWh), energy cost ($), and daily carbon emission (kgCO2e). Energy cost and carbon footprint metrics are calculated based on server location. In the following example, we see a significant difference in daily carbon emission between Poland and Germany, even though there is only a small difference in daily energy usage. The same applies to energy costs.
Figure 2. Card view of hardware component status
Figure 3. Table view of hardware component status (first 10 columns)
Figure 4. Table view of hardware component status (final columns—continuation of preceding figure)
Energy cost and carbon emissions per country are calculated dynamically based on data enrichment enabled on SRM collectors. Metrics collected from each iDRAC are automatically tagged with location, carbon intensity, and energy cost properties. Here is an example of data enrichment configuration from the SRM admin UI:
Figure 5. SRM admin UI showing data enrichment configuration
CSV files that contain values for energy cost and carbon intensity per country are available publicly and can be transferred automatically through FTP to SRM collectors as part of the data enrichment process. Here is a CSV file excerpt that contains kWh cost ($) per country:
Figure 6. Excerpt of kwh-cost-per-country CSV file
And here is a CSV file excerpt that contains carbon intensity per kWh per country:
Figure 7. Excerpt of carbon-intensity-by-country CSV file
The CSV file for data enrichment with the device-to-location mapping is specific to each customer.
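Purely as an illustration (the actual column layout is defined by each customer’s enrichment configuration, so treat the header and values below as assumptions), a device-to-location mapping could look like this:

device,location
idrac-server-01,Poland
idrac-server-02,Germany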
From the initial Card View or Table View, you can drill down to the PowerEdge server end-to-end topology map. This is a host-based landing page where you can see the server’s relationship with the rest of the SAN components, as well as server attributes, performance, capacity, alerts, and inventory data. This is an example of SRM’s powerful capability to aggregate and visualize data from multiple contexts within the same reporting UI.
Figure 8. End-to-end topology map
The iDRAC PowerEdge Inventory report shows servers’ hardware component names, quantities, server hostname, serial number, operating system version, model, and IP address:
Figure 9. Inventory report (first six columns)
Figure 10. Inventory report (final columns—continuation of preceding figure)
Drilling down from the preceding table leads to the daily status dashboard of a selected server’s hardware components. Here are a few examples:
Figure 11. Status of cooling devices
Figure 12. Power supply output watts
Figure 13. Energy usage (Wh)
The iDRAC PowerEdge Performance report shows key metric values for servers’ hardware components, such as probe temperature (C), temperature lower and upper thresholds, cooling device speed (RPM), and cooling device critical and non-critical thresholds. Selecting a row interactively plots historical performance data on the charts below the table, including server electrical energy usage (Wh), probe temperature (C), and cooling device speed (RPM).
Figure 14. Trend chart—Electrical energy usage (Wh)
Figure 15. Trend chart—Probes temperature (C) values plotted alongside threshold values
The following trend chart shows cooling device (RPM) values plotted on the same chart with the active alerts relevant to the cooling device. The alert is displayed as a black dot with pop-up details of the issue that caused the alert. This feature greatly improves troubleshooting and is another example of SRM’s powerful capability to aggregate and visualize data from multiple contexts within the same reporting UI.
Figure 16. Trend chart—Cooling device (RPM) values plotted on the same chart with the active alerts relevant to the cooling device
The following bar charts show Carbon Emission, Energy Cost ($), Cooling (RPM), Energy Usage (kWh), and Temperature (C) per location during the last month. You can drill down on each bar chart to see reports for each location to analyze the top 10 contributing items per device type (hypervisor, host) and per server.
Figure 17. Carbon Emission and Energy Cost bar charts
Figure 18. Energy Usage and Temperature bar charts
The iDRAC PowerEdge Operations report shows currently active alerts received from iDRAC as SNMP traps. The solution pack contains 80 certified alert definitions that cover iDRAC System Health and Storage category alerts, including AmperageProbe, Battery, Cable, CMC, Fan, FC, LinkStatus, MemoryDevice, Network, OS, PhysicalDisk, PowerSupply, PowerUsage, TemperatureProbe, TemperatureStatistics, VoltageProbe, LiquidCoolingLeak, and others.
You can enable any or all alerts on each iDRAC under Configuration > System Settings > Alert Configuration > Alerts. You can configure SNMP trap receivers under Configuration > System Settings > Alert Configuration > SNMP Traps. In this case, the SNMP trap receiver is the SRM collector server.
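To sanity-check that the SRM collector actually receives traps on UDP port 162 before waiting for a real hardware event, you can send a generic test trap from any host with the net-snmp tools. The collector address below is a placeholder, and the trap is the standard coldStart notification, not a Dell-specific alert.

# Placeholder collector IP; 1.3.6.1.6.3.1.1.5.1 is the generic coldStart trap OID
snmptrap -v 2c -c public 192.168.10.60:162 '' 1.3.6.1.6.3.1.1.5.1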
Figure 19. Active alerts on iDRAC PowerEdge Operations report
By right-clicking an alert row, you can acknowledge, assign, close, take ownership of, or assign a ticket ID to the alert.
Figure 20. Acting on an alert
By clicking on an alert row, you can see a detailed report about the alert. Also, the SRM alerting module includes functionality to forward selected alerts to external applications, such as ServiceNow ITSM through a Webhook API or fault management applications through an SNMP trap or email.
You can navigate directly from the alerts report to the affected server’s landing page by clicking the device name link in the Device column of the All Alerts report. SRM relates alert-specific data with the time-series data originating from the same device and lets you navigate seamlessly through the corresponding reports. The following figure shows an affected server’s summary report with the topology and underlying Operations section showing the server’s active alerts.
Figure 21. Server summary report with topology and active alerts
SRM’s powerful framework allows storage administrators to easily integrate environmental data for PowerEdge physical servers into the existing end-to-end SAN inventory, performance, capacity, and alert reports. SRM reduces the time that is required to identify the cause of issues occurring in the data center.
With the new SolutionPack for iDRAC PowerEdge, administrators can monitor PowerEdge hardware components and obtain historical information about energy usage and other key performance indicators.
The iDRAC PowerEdge Solution Pack supports:
Author: Dejan Stojanovic
Talking CloudIQ: Cybersecurity
Mon, 20 Feb 2023 21:08:34 -0000
This is the fourth in a series of blogs discussing CloudIQ. Previous blogs provide an overview of CloudIQ and discuss proactive health scores and capacity monitoring and planning. This blog discusses the cybersecurity feature in CloudIQ. Cyber-attacks have become a significant issue for all companies across all industries. The immediate economic consequences, combined with the longer-term impact of the loss of organizational reputation, can have both immediate and lasting effects.
Misconfigurations of infrastructure systems can open your organization to cyber intrusion and are a leading threat to data security. The CloudIQ cybersecurity feature proactively monitors infrastructure security configurations for Dell PowerStore and PowerMax storage systems and PowerEdge servers, and notifies users of security risks. A risk level is assigned to each system, placing the system into one of four categories, depending on the number and severity of the issues: Normal, Low, Medium, or High.
Figure 1. Cybersecurity system risk levels
When a security risk is found, remediation instructions are provided to help you address the issue as quickly as possible.
Figure 2. Cybersecurity details with remediation
CloudIQ evaluates Dell Security Advisories (DSAs) as they are issued and intelligently notifies users when those advisories are applicable to their specific Dell system models with specific system software and firmware versions. This eliminates the need for users to investigate whether a Security Advisory applies to their systems and allows them to immediately focus on remediation.
Figure 3. Dell Security Advisory listing
By using CloudIQ Cybersecurity policy templates, users can quickly set up security configuration evaluation tests and assign them to large numbers of systems with just a few clicks. Once assigned, the test plan is evaluated against each associated system, and the system administrator is notified in minutes of any unwanted configuration settings.
Testing has shown that it takes less than 3 minutes to set policies and automate security configuration checking for 1 to 1,000 systems. That’s a dramatic time savings versus the 6 minutes that it would take to manually check each individual system’s security configuration.1
Figure 4. Evaluation plan templates
Cybersecurity has clearly become a challenge and priority for companies of all sizes. With the large and growing number of systems distributed across core and edge locations, it is impractical for any IT organization to manually check those systems for misconfigurations. Dell CloudIQ eliminates manual checking by automating it and recommending how to quickly mitigate misconfiguration risks that can lead to unwanted intrusions threatening data security. With the intelligent evaluation of Dell Security Advisories, CloudIQ identifies applicable DSAs, further saving time and expediting remediation.
For additional cybersecurity related information, see the following documents:
How do you become more familiar with Dell Technologies and CloudIQ? The Dell Technologies Info Hub provides expertise that helps to ensure customer success with Dell platforms. We have CloudIQ demos, white papers, and videos available at the Dell Technologies CloudIQ page. Also, you can refer to the CloudIQ: A Detailed Review white paper, which provides an in-depth summary of CloudIQ.
Author: Derek Barboza, Senior Principal Engineering Technologist
1Dell CloudIQ Cybersecurity for PowerEdge: The Benefits of Automation
Dell Container Storage Modules 1.5 Release
Thu, 12 Jan 2023 19:27:23 -0000
Made available on December 20th, 2022, the 1.5 release of our flagship cloud-native storage management products, Dell CSI Drivers and Dell Container Storage Modules (CSM), is here!
See the official changelog in the CHANGELOG directory of the CSM repository.
First, this release extends support for Red Hat OpenShift 4.11 and Kubernetes 1.25 to every CSI Driver and Container Storage Module.
Featured in the previous CSM release (1.4), avid customers may recall a few new additions to the portfolio made available in tech preview. Primarily:
Building on these three new modules, Dell Technologies is adding deeper capabilities and major improvements as part of today’s 1.5 release for CSM, including:
For the platform updates included in today’s 1.5 release, the major new features are:
This feature is named “Auto RDM over FC” in the CSI/CSM documentation.
The concept is that the CSI driver connects to both the Unisphere and vSphere APIs to create the respective objects.
When deployed with “Auto-RDM,” the driver can only function in that mode. It is not possible to combine iSCSI and FC access within the same driver installation.
The same limitation applies to RDM usage. You can learn more about it at RDM Considerations and Limitations on the VMware website.
That’s all for CSM 1.5! Feel free to share feedback or send questions to the Dell team on Slack: https://dell-csm.slack.com.
Author: Florian Coulombel
Velero Backup to PowerScale S3 Bucket
Fri, 23 Dec 2022 21:50:39 -0000
Velero is one of the most popular tools for backup and restore of Kubernetes resources.
You can use Velero for different backup options to protect your Kubernetes cluster. The three modes are:
In all cases, Velero syncs the information (YAML definitions and restic data) to an object store.
PowerScale is Dell Technologies’ leading scale-out NAS solution. It supports many different access protocols including NFS, SMB, HTTP, FTP, HDFS, and, in the case that interests us, S3!
Note: PowerScale is not 100% compatible with the AWS S3 protocol (for details, see the PowerScale OneFS S3 API Guide).
For a simple backup solution of a few terabytes of Kubernetes data, PowerScale and Velero are a perfect duo.
To deploy this solution, you need to configure PowerScale and then install and configure Velero.
Prepare PowerScale to be a target for the backup as follows:
1. Verify that the S3 service is enabled.
You can check that in the UI under Protocols > Object Storage (S3) > Global Settings or in the CLI.
In the UI:
In the CLI:
PS1-1% isi s3 settings global view
         HTTP Port: 9020
        HTTPS Port: 9021
        HTTPS only: No
S3 Service Enabled: Yes
2. Create a bucket with the permission to write objects (at a minimum).
That action can also be done from the UI or CLI.
In the UI:
In the CLI:
See isi s3 buckets create in the PowerScale OneFS CLI Command Reference.
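For reference, creating the bucket from the CLI looks roughly like the following. The bucket name matches the one used in the Velero installation later in this post, but the path, owner, and flags are assumptions for illustration; verify the exact syntax against the CLI Command Reference for your OneFS release.

# Illustrative only -- verify options against your OneFS version
isi s3 buckets create velero-backup --path=/ifs/data/velero-backup --owner=admin --create-path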
3. Create a key for the user that will be used to upload the objects.
Important notes:
Now that PowerScale is ready, we can proceed with the Velero deployment.
We assume that the Velero binary is installed and has access to the Kubernetes cluster. If not, see the Velero installation document for the deployment instructions.
Configure Velero:
$ cat ~/credentials-velero
[default]
aws_access_key_id = 1_admin_accid
aws_secret_access_key = 0**************************i
…
$ velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.5.1 \
    --bucket velero-backup \
    --secret-file ./credentials-velero \
    --use-volume-snapshots=false \
    --cacert ./ps2-cacert.pem \
    --backup-location-config region=powerscale,s3ForcePathStyle="true",s3Url=https://192.168.1.21:9021
…
The preceding command shows the simplest and most secure way to use Velero.
It is possible to add parameters to enable protection with snapshots. Every Dell CSI driver has snapshot support. To take advantage of that support, we use the install command with this addition:
velero install \
    --features=EnableCSI \
    --plugins=velero/velero-plugin-for-aws:v1.5.1,velero/velero-plugin-for-csi:v0.3.0 \
    --use-volume-snapshots=true ...
Now that CSI snaps are enabled, we can enable restic to move data out of those snapshots into our backup target by adding:
--use-restic
As you can see, we are using the velero/velero-plugin-for-aws:v1.5.1 image, which is the latest available at the time of the publication of this article. You can obtain the current version from GitHub: https://github.com/vmware-tanzu/velero-plugin-for-aws
After the Velero installation is done, check that everything is correct:
kubectl logs -n velero deployment/velero
If you have an error with the certificates, you should see it quickly.
You can now back up and restore your Kubernetes resources with the usual Velero commands. For example, to protect the entire Kubernetes except kube-system, including the data with PV snapshots:
velero backup create backup-all --exclude-namespaces kube-system
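Restores use the same standard Velero commands. For example, to list the backups stored in the PowerScale bucket and then restore the one created above:

velero backup get
velero restore create --from-backup backup-all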
You can check the actual content directly from PowerScale file system explorer:
Here is a demo:
Conclusion
For easy protection of small Kubernetes clusters, Velero combined with PowerScale S3 is a great solution. If you are looking for broader features (for a greater amount of data or more destinations that go beyond Kubernetes), look to Dell PowerProtect Data Manager, a next-generation, comprehensive data protection solution.
Interestingly, Dell PowerProtect Data Manager uses the Velero plug-in to protect Kubernetes resources!
Talking CloudIQ: Capacity Monitoring and Planning
Fri, 09 Dec 2022 15:37:42 -0000
This is the third in a series of blogs discussing CloudIQ. In my first blog, I provided a high-level overview of CloudIQ and some of its key features. My second blog talked about the CloudIQ Proactive Health Score. I will continue the series with a discussion of the capacity monitoring and planning features in CloudIQ.
Capacity monitoring helps you plan for expansions of storage arrays, data protection appliances, storage-as-a-service, and hyperconverged infrastructure (HCI) to help overcome unexpected spikes in storage consumption. CloudIQ uses advanced analytics to provide short-term capacity prediction analysis, longer-term capacity forecasting, and capacity anomaly detection. Capacity anomaly detection is the identification of a sudden surge in utilization that may result in a space full condition in less than 24 hours.
The CloudIQ Home page displays the Capacity Approaching Full tile which identifies storage entities that are full or expected to be full in each of the following time ranges:
Figure 1. The Capacity Approaching Full tile
In situations where there is a storage entity in the Imminent category, CloudIQ identifies the components of the entity that are experiencing the sudden increase in utilization. This gives users the necessary information about where to look to correct the offending behavior. In the following example, CloudIQ has identified a storage pool that is expected to run out of space in five hours. The pool details page identifies the file systems and LUNs that are the top contributors to the expected rise in utilization.
Figure 2. Capacity Forecast for a pool that has a capacity anomaly
Two other CloudIQ features help you quickly find a solution for storage that is fast approaching full. First, there is the identification of reclaimable storage that shows you where you can recover unused capacity in a system. Second, there is the multisystem capacity view that lets you scan all your storage systems to pinpoint which have excess capacity to relieve approaching-full systems of their workloads.
CloudIQ identifies different types of storage that are potentially reclaimable. The following criteria are used to identify reclaimable storage:
Users can quickly see the storage objects, where the object resides, and the amount of reclaimable space. The Last IO Time is provided for block and file objects that have no detected IO activity in the last week. For VMs that have been shut down for at least a week, the storage object on which the VM resides along with the vCenter and time that the VM was shut down is available. The following figure shows an example of reclaimable storage for block objects that have had no front-end IO activity in the past week.
Figure 3. The Reclaimable Storage page – Block Objects with no front end IO activity
The multisystem capacity view provides a quick view of physical usable, used, free, and storage efficiencies across all storage, HCI, and data protection systems monitored by CloudIQ. This allows users to see quickly which systems are low on usable space, determine which systems are good targets for workload migration, and verify that their storage efficiencies and data reduction numbers are what they are expecting.
Figure 4. Multisystem capacity view for storage
Detailed capacity views for storage systems and storage objects provide additional information, including data efficiencies and data reduction metrics. The following figure shows the physical and logical storage breakdown and data reduction charts for a PowerStore cluster.
Figure 5. PowerStore cluster storage details
For APEX block storage service subscriptions, CloudIQ provides both subscribed and physical storage views. Subscribed views provide the storage usage including base and on-demand storage usage.
Figure 6. APEX block storage services subscription view
With custom reports and the use of custom tags, users can create meaningful business reports and schedule those reports to be delivered to the required end users. Reports can include both line charts and tables and can be filtered on any field. The following figure shows a simple table that includes used and free capacities, data reduction values, and several custom tags.
Figure 7. Custom report for storage
CloudIQ’s intelligence and predictive analytics help users proactively manage and accurately plan data storage and workload expansions, and act quickly to avoid rapidly approaching capacity-full conditions. Custom reports and tagging allow users to create, schedule, and deliver reports with technical and business information tailored to a wide variety of stakeholders. And for users looking to integrate data from CloudIQ with existing IT management tools, CloudIQ provides a public REST API.
How do you become more familiar with Dell Technologies and CloudIQ? The Dell Technologies Info Hub site provides expertise that helps to ensure customer success with Dell Technologies platforms. We have CloudIQ demos, white papers, and videos available at the Dell Technologies CloudIQ page. Also, feel free to reference the CloudIQ Overview Whitepaper which provides an in-depth summary of CloudIQ. Interested in DevOps? Go to our public API page for information about integrating CloudIQ with other IT tools using Webhooks or REST API.
Author: Derek Barboza, Senior Principal Engineering Technologist
Integrating CloudIQ Webhooks with BigPanda Events
Tue, 22 Nov 2022 17:27:08 -0000
This tutorial blog demonstrates how to use CloudIQ Webhooks to integrate CloudIQ health notifications with BigPanda (https://www.bigpanda.io/), an event management processing tool. This allows users to integrate CloudIQ notifications with events from other IT tools into BigPanda. We will show how to create a REST API Integration in BigPanda and provide an example of intermediate code that uses Google Cloud functions to process Webhooks.
BigPanda offers a solution that puts a modern twist on the event management process. The main product consists of a fully customizable cloud-hosted event management console for event integration, reporting, correlation, and enrichment.
A CloudIQ Webhook is a notification that is sent when a health issue changes. CloudIQ sends the Webhook notification when a new or resolved health issue is identified in CloudIQ. A Webhook is an HTTP post composed of a header and JSON payload that is sent to a user configured destination. Webhooks are available under the Admin > Integrations menu in the CloudIQ UI. Users must have the CloudIQ DevOps role to access the Integrations menu.
A Webhook consists of data in the header and the payload. The header includes control information; the payload is a JSON data structure that includes useful details about the notification and the health issue. Examples of the header and payload JSON files can be found here.
In CloudIQ, we enable Webhook integration by configuring a name, destination, and the secret to sign the payload.
In BigPanda, we have two possibilities for third-party integration: the Alerts REST API and the Open Integration Hub.
In our example, we use the REST API. Note that some of the requirements of the Open Integration Hub (alert severity, configurable application key, and so on) are not configurable today in CloudIQ Webhooks.
The main challenge when integrating CloudIQ health events with BigPanda alerts is implementing a mapping function to translate CloudIQ fields to BigPanda fields.
To do this, we will use a serverless function to:
In this integration, the serverless function is a Google Cloud Function. Any other serverless framework can work.
The first step is to create an application for integration in BigPanda. Do the following:
1. Log into the BigPanda console.
2. Click the Integrations button at the top of the console.
3. Click the blue New Integration button.
4. Select Alerts Rest API (the first card).
5. Set an integration name, then click Generate App Key.
6. Save the generated app key and bearer token.
If you forgot to save the “application key” or “token”, you can obtain them later by selecting `Review Instructions`.
Note that the “application key” and “token” will be needed later to configure the trigger to post data to that endpoint.
This step is very similar to what has been presented in the CloudIQ to Slack tutorial. The only changes are that we are using a golang runtime and we store the authentication token in a secret instead of in a plain-text environment variable.
1. In the Google Cloud console, create a new secret in Secret Manager.
2. Provide a name (BP_TOKEN in this example).
3. Paste the Authorization token from the HTTP headers section of the BigPanda integration into the ‘Secret value’ field.
4. Select Create Function and provide a function name (ciq-bigpanda-integration in this example).
5. Under the Trigger section, keep a trigger type of HTTP and select Allow unauthenticated invocations.
6. Take note of the Trigger URL because it will be used as the Payload URL when configuring the Webhook in CloudIQ.
7. Select SAVE.
8. Expand the RUNTIME, BUILD AND CONNECTIONS SETTINGS section.
9. Under the RUNTIME tab, click the + ADD VARIABLE button to create the following variable:
BP_APP_KEY. The value is set to the application key obtained after creating the BigPanda integration.
10. Select the SECURITY AND IMAGE REPO tab.
11. Select REFERENCE A SECRET.
12. Select the BP_TOKEN secret from the pulldown.
13. Select Exposed as environment variable from the Reference Method pulldown.
14. Enter BP_TOKEN as the environment variable name.
15. Select DONE, then click Next.
16. Select Go 1.16 from the Runtime pulldown.
17. Change the Entry point to CiqEventToBigPandaAlert.
18. Replace the code for function.go with the example function.go code.
19. Replace the go.mod with the example go.mod code.
20. Select DEPLOY.
Using Go's statically typed approach, we have clearly defined structs for the input (`CiqHealthEvent`) and output (`BigPandaAlerts`).
Most of the logic consists of mapping one field to the other.
func CiqEventMapping(c *CiqHealthEvent, bp *BigPandaClient) *BigPandaAlerts {
    log.Println("mapping input CloudIQ event: ")
    log.Printf("%+v", c)
    alert := BigPandaAlerts{
        AppKey:  bp.AppKey,
        Cluster: "CloudIQ",
        Host:    c.SystemName,
    }
    if len(c.NewIssues) > 0 {
        for _, v := range c.NewIssues {
            alert.Alerts = append(alert.Alerts, BigPandaAlert{
                Status:             statusForScore(c.CurrentScore),
                Timestamp:          c.Timestamp,
                Host:               c.SystemName,
                Description:        v.Description,
                Check:              v.RuleID,
                IncidentIdentifier: v.ID,
            })
        }
    }
    return &alert
}
Two things to note here:
1. Because CloudIQ doesn't have the notion of severity, we convert the score to a status using the code below.
2. CloudIQ has an event identifier that will help to deduplicate the alert in BigPanda or reopen a closed event in case of a re-notify.
// BigPanda status values: ok, ok-suspect, warning, warning-suspect, critical, critical-suspect,
// unknown, acknowledged, oksuspect, warningsuspect, criticalsuspect, ok_suspect, warning_suspect,
// critical_suspect, ok suspect, warning suspect, critical suspect
func statusForScore(s int) string {
    if s == 100 {
        return "ok"
    } else if s <= 99 && s > 95 {
        return "ok suspect"
    } else if s <= 95 && s > 70 {
        return "warning"
    } else if s <= 70 {
        return "critical"
    } else {
        return "unknown"
    }
}
Behind the scenes, the GCP Cloud Functions are built and executed as a container. To develop and test the code locally (instead of doing everything in the GCP Console), we can develop locally and then build the package using buildpack (https://github.com/googlecloudplatform/buildpacks) as GCP does:
pack build \
    --builder gcr.io/buildpacks/builder:v1 \
    --env GOOGLE_RUNTIME=go \
    --env GOOGLE_FUNCTION_SIGNATURE_TYPE=http \
    --env GOOGLE_FUNCTION_TARGET=ciq-bigpanda-integration \
    ciq-bigpanda-integration
After the build is successful, we can test it with something similar to:
docker run --rm -p 8080:8080 -e BP_TOKEN=xxxxx -e BP_APP_KEY=yyyyy ciq-bigpanda-integration
Alternatively, you can create a “main.go” and run it with:
FUNCTION_TARGET=CiqEventToBigPandaAlert go run cmd/main.go
Users can choose to deploy the function outside of the GCP console. You can publish it with:
gcloud functions deploy ciq-bigpanda-integration --runtime go116 --entry-point ciq-bigpanda-integration --trigger-http --allow-unauthenticated
It is time to point the CloudIQ Webhook to the GCP Function trigger URL. From the Admin > Integrations menu in CloudIQ, go to the Webhooks tab.
To ease the simulation of a Webhook event, go to the CloudIQ Integration and click the TEST WEBHOOK button. This sends a ping request to the destination. You can also go to CloudIQ and redeliver an existing event.
For an actual event and not just a `ping`, use the `easy_post.sh` script after configuring the appropriate ENDPOINT.
#!/bin/bash
HEADERS_FILE=${HEADERS_FILE-./headers.json}
PAYLOAD_FILE=${PAYLOAD_FILE-./payload.json}
ENDPOINT=${ENDPOINT-https://webhook.site/6fd7d650-1b5b-4b8c-9781-2043005bdf2d}

mapfile -t HEADERS < <(jq -r '. | to_entries[] | "-H \(.key):\(.value)"' < ${HEADERS_FILE})

curl -k -H "Content-Type: application/json" ${HEADERS[@]} --request POST --data @${PAYLOAD_FILE} ${ENDPOINT}
If everything flows correctly, you will see the health alerts delivered to the BigPanda console. This allows users to consolidate CloudIQ notifications with events from other IT tools into a single monitoring interface.
Author: Derek Barboza
Dell Container Storage Modules—A GitOps-Ready Platform!
Mon, 26 Sep 2022 15:17:45 -0000
One of the first things I do after deploying a Kubernetes cluster is to install a CSI driver to provide persistent storage to my workloads; coupled with a GitOps workflow, it takes only seconds to be able to run stateful workloads.
The GitOps process is nothing more than a few principles:
Nonetheless, to ensure that the process runs smoothly, you must make certain that the application you will manage with GitOps complies with these principles.
This article describes how to use the Microsoft Azure Arc GitOps solution to deploy the Dell CSI driver for Dell PowerMax and affiliated Container Storage Modules (CSMs).
The platform we will use to implement the GitOps workflow is Azure Arc with GitHub. Still, other solutions are possible using Kubernetes agents such as Argo CD, Flux CD, and GitLab.
Azure GitOps itself is built on top of Flux CD.
The first step is to onboard your existing Kubernetes cluster within the Azure portal.
The Azure Arc agent needs to connect to the Internet. In my case, the installation of the Arc agent failed from the Dell network with the error described here: https://docs.microsoft.com/en-us/answers/questions/734383/connect-openshift-cluster-to-azure-arc-secret-34ku.html
Certain URLs (even when bypassing the corporate proxy) don't play well when communicating with Azure. I have seen some services receive a self-signed certificate, which causes the issue.
The solution for me was to put an intermediate transparent proxy between the Kubernetes cluster and the corporate proxy. That way, we have better control over the responses given by the proxy.
In this example, we install Squid on a dedicated box with the help of Docker. To make it work, I used the Squid image by Ubuntu and made sure that Kubernetes requests were direct with the help of always_direct:
docker run -d --name squid-container ubuntu/squid:5.2-22.04_beta
docker cp squid-container:/etc/squid/squid.conf ./
egrep -v '^#' squid.conf > my_squid.conf
docker rm -f squid-container
Then add the following section:
acl k8s port 6443    # k8s https
always_direct allow k8s
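A minimal way to run Squid again with the trimmed configuration (assuming my_squid.conf now contains the acl and always_direct lines above) is to mount it over the image's default config. The container name and published port here are arbitrary choices, not values from the original setup:

docker run -d --name squid-proxy -p 3128:3128 \
    -v $(pwd)/my_squid.conf:/etc/squid/squid.conf \
    ubuntu/squid:5.2-22.04_beta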
You can now install the agent per the following instructions: https://docs.microsoft.com/en-us/azure/azure-arc/kubernetes/quickstart-connect-cluster?tabs=azure-cli#connect-using-an-outbound-proxy-server.
export HTTP_PROXY=http://mysquid-proxy.dell.com:3128
export HTTPS_PROXY=http://mysquid-proxy.dell.com:3128
export NO_PROXY=https://kubernetes.local:6443

az connectedk8s connect --name AzureArcCorkDevCluster \
    --resource-group AzureArcTestFlorian \
    --proxy-https http://mysquid-proxy.dell.com:3128 \
    --proxy-http http://mysquid-proxy.dell.com:3128 \
    --proxy-skip-range 10.0.0.0/8,kubernetes.default.svc,.svc.cluster.local,.svc \
    --proxy-cert /etc/ssl/certs/ca-bundle.crt
If everything worked well, you should see the cluster with detailed info from the Azure portal:
To benefit from all the features that Azure Arc offers, give the agent the privileges to access the cluster.
The first step is to create a service account:
kubectl create serviceaccount azure-user
kubectl create clusterrolebinding demo-user-binding --clusterrole cluster-admin --serviceaccount default:azure-user
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: azure-user-secret
  annotations:
    kubernetes.io/service-account.name: azure-user
type: kubernetes.io/service-account-token
EOF
Then, from the Azure UI, when you are prompted to give a token, you can obtain it as follows:
kubectl get secret azure-user-secret -o jsonpath='{$.data.token}' | base64 -d | sed $'s/$/\\\n/g'
Then paste the token in the Azure UI.
The GitOps agent installation can be done with a CLI or in the Azure portal.
As of now, the Microsoft documentation presents the CLI-based deployment in detail, so let's see how it works with the Azure portal:
The Git repository organization is a crucial part of the GitOps architecture. It hugely depends on how internal teams are organized, the level of information you want to expose and share, the location of the different clusters, and so on.
In our case, the requirement is to connect multiple Kubernetes clusters owned by different teams to a couple of PowerMax systems using only the latest and greatest CSI driver and affiliated CSM for PowerMax.
Therefore, the monorepo approach is suited.
The organization follows this structure:
.
├── apps
│ ├── base
│ └── overlays
│ ├── cork-development
│ │ ├── dev-ns
│ │ └── prod-ns
│ └── cork-production
│ └── prod-ns
├── clusters
│ ├── cork-development
│ └── cork-production
└── infrastructure
├── cert-manager
├── csm-replication
├── external-snapshotter
└── powermax
You can see all files in https://github.com/coulof/fluxcd-csm-powermax.
Note: The GitOps agent comes with multi-tenancy support; therefore, we cannot cross-reference objects between namespaces. The Kustomization and HelmRelease objects must be created in the same namespace as the agent (here, flux-system) and use targetNamespace to point to the namespace where the resource is installed.
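To make that note concrete, here is a minimal sketch of a HelmRelease that lives in the agent's namespace while installing a chart elsewhere. The chart name, source reference, and namespaces are assumptions for illustration and are not taken from the repository linked above:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: csi-powermax
  namespace: flux-system        # same namespace as the GitOps agent
spec:
  interval: 10m
  targetNamespace: powermax     # namespace where the driver is actually installed
  chart:
    spec:
      chart: csi-powermax
      sourceRef:
        kind: HelmRepository
        name: dell
        namespace: flux-system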
This article is the first of a series exploring the GitOps workflow. Next, we will see how to manage application and persistent storage with the GitOps workflow, how to upgrade the modules, and so on.
Network Design for PowerScale CSI
Tue, 23 Aug 2022 17:09:57 -0000
|Read Time: 0 minutes
Network connectivity is an essential part of any infrastructure architecture. When it comes to how Kubernetes connects to PowerScale, there are several options to configure the Container Storage Interface (CSI). In this post, we will cover the concepts and configuration you can implement.
The story starts with CSI plugin architecture.
Like all other Dell storage CSI, PowerScale CSI follows the Kubernetes CSI standard by implementing functions in two components.
The CSI controller plugin is deployed as a Kubernetes Deployment, typically with two or three replicas for high availability, with only one instance acting as the leader. The controller is responsible for communicating with PowerScale, using the Platform API to manage volumes (on PowerScale, this means creating and deleting directories, NFS exports, and quotas), to update the NFS client list when a Pod moves, and so on.
The CSI node plugin is a Kubernetes DaemonSet, running on all nodes by default. It is responsible for mounting the NFS export from PowerScale and mapping the NFS mount path into a Pod as persistent storage, so that applications and users in the Pod can access the data on PowerScale.
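Assuming the driver was installed in a namespace named isilon (the namespace and resource names depend on your installation values, so treat them as placeholders), you can see both components side by side, with the controller as a Deployment and the node plugin as a DaemonSet that runs one pod per worker:

# Namespace is a placeholder -- use the one chosen at driver installation
kubectl get deployment,daemonset,pods -n isilon -o wide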
Because CSI needs to access both PAPI (PowerScale Platform API) and NFS data, a single user role typically isn’t secure enough: the role for PAPI access will need more privileges than normal users.
According to the PowerScale CSI manual, CSI requires a user that has the following privileges to perform all CSI functions:
Privilege | Type
ISI_PRIV_LOGIN_PAPI | Read Only
ISI_PRIV_NFS | Read Write
ISI_PRIV_QUOTA | Read Write
ISI_PRIV_SNAPSHOT | Read Write
ISI_PRIV_IFS_RESTORE | Read Only
ISI_PRIV_NS_IFS_ACCESS | Read Only
ISI_PRIV_IFS_BACKUP | Read Only
Among these privileges, ISI_PRIV_SNAPSHOT and ISI_PRIV_QUOTA are only available in the System zone. And this complicates things a bit. To fully utilize these CSI features, such as volume snapshot, volume clone, and volume capacity management, you have to allow the CSI to be able to access the PowerScale System zone. If you enable the CSM for replication, the user needs the ISI_PRIV_SYNCIQ privilege, which is a System-zone privilege too.
By contrast, there isn’t any specific role requirement for applications or users in Kubernetes to access data: the data is shared over the standard NFS protocol. As long as they have the right ACLs to access the files, they are good. For this data access requirement, a non-System zone is suitable and recommended.
These two access zones are defined in different places in CSI configuration files:
If an admin really cannot expose their System zone to the Kubernetes cluster, they have to disable the snapshot and quota features in the CSI installation configuration file (values.yaml). In this way, the PAPI access zone can be a non-System access zone.
The following diagram shows how the Kubernetes cluster connects to PowerScale access zones.
Normally a Kubernetes cluster comes with many networks: a pod inter-communication network, a cluster service network, and so on. Luckily, the PowerScale network doesn’t have to join any of them. The CSI pods can access a host’s network directly, without going through the Kubernetes internal network. This also has the advantage of providing a dedicated high-performance network for data transfer.
For example, on a Kubernetes host, there are two NICs: IP 192.168.1.x and 172.24.1.x. NIC 192.168.1.x is used for Kubernetes, and is aligned with its hostname. NIC 172.24.1.x isn’t managed by Kubernetes. In this case, we can use NIC 172.24.1.x for data transfer between Kubernetes hosts and PowerScale.
Because, by default, the CSI driver uses the IP that is aligned with its hostname, to let CSI recognize the second NIC (172.24.1.x) we have to explicitly set the IP range in “allowedNetworks” in the values.yaml file during the CSI driver installation. For example:
allowedNetworks: [172.24.1.0/24]
Also, in this network configuration, it’s unlikely that the Kubernetes internal DNS can resolve the PowerScale FQDN. So, we also have to make sure the “dnsPolicy” has been set to “ClusterFirstWithHostNet” in the values.yaml file. With this dnsPolicy, the CSI pods will reach the DNS server in /etc/resolv.conf in the host OS, not the internal DNS server of Kubernetes.
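Putting the two settings together, the relevant excerpt of values.yaml looks something like the following; the exact placement of these keys can vary slightly between driver versions, so treat this as a sketch:

allowedNetworks: [172.24.1.0/24]
dnsPolicy: ClusterFirstWithHostNet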
The following diagram shows the configuration mentioned above:
Please note that the “allowedNetworks” setting only affects the data access zone, and not the PAPI access zone. In fact, CSI just uses this parameter to decide which host IP should be set as the NFS client IP on the PowerScale side.
Regarding the network routing, CSI simply follows the OS route configuration. Because of that, if we want the PAPI access zone to go through the primary NIC (192.168.1.x), and have the data access zone to go through the second NIC (172.24.1.x), we have to change the route configuration of the Kubernetes host, not this parameter.
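For example, on each Kubernetes host you could add a static route so that only traffic to the PowerScale data subnet leaves through the second NIC. The subnet, gateway, and interface name below are placeholders for illustration:

# Placeholder subnet, gateway, and interface -- adapt to your environment
ip route add 172.24.2.0/24 via 172.24.1.1 dev eth1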
Hopefully this blog helps you understand the network configuration for PowerScale CSI better. Stay tuned for more information on Dell Containers & Storage!
Authors: Sean Zhan, Florian Coulombel
Talking CloudIQ: Proactive Health Scores
Fri, 05 Aug 2022 20:29:33 -0000
This is the second in a series of blogs discussing CloudIQ. In my first blog, I provided a high-level overview of CloudIQ and some of its key features. I will continue with a series of blogs, each talking about one of the key features in more detail. This blog discusses one of CloudIQ’s key differentiating features: the Proactive Health Score.
The Proactive Health Score uses various factors to provide a consolidated view of a system’s health into a single health score. Health scores are based on up to five categories: Components, Configuration, Capacity, Performance, and Data Protection. Based on the resulting health score, the system is put into one of three risk categories: Poor, Fair, or Good. The score starts at 100 and is reduced by the issue with the highest deduction.
A system in the Poor category has a score of 0 to 70 and poses an imminent critical risk. It could be a storage pool that is overprovisioned and full, meaning that systems will be trying to write to storage that is unavailable. Or it could be a significant component failure. Whatever the issue, it is something that requires your immediate attention.
A system in the Fair category has a score of 71 to 94. Systems in this category have an issue that should be looked at, but certainly not something that requires you to get out of bed at 3:00am to address immediately. It could be something like a storage pool predicted to be full in a week or a system inlet temperature that exceeds the upper warning threshold on a PowerEdge server.
A system in the Good category has a score of 95 to 100 and is doing fine. There may be a minor issue that you need to look at, but nothing significant that is expected to cause any near-term problems. An example would be a fibre port with a warning status on a Connectrix switch.
Now what happens if there are multiple issues on a system? We hinted at this earlier. The score is only affected by the most critical issue. Let’s say that there are four issues on a system: one 30-point deduction, one 10-point deduction, and two 5-point deductions. In this case, the health score is 70. When the 30-point deduction is addressed, the score would become 90. We do this to prevent a system with several minor issues from appearing at high risk or at a higher risk than a system with a significant issue.
Figure 1. System Health page
So now that we have been notified of an issue on a system, what do we do next? Well, with CloudIQ, we will offer up recommended remediation actions to address the issue before it has a significant impact on the environment. This may come in the form of a recommended configuration change or other action, a knowledge base article with a resolution, or some commands to run to gain the necessary information to resolve the issue.
Figure 2. Recommended remediation
CloudIQ also tracks the history of the Proactive Health Score. We can see both new and resolved issues along a chart with a selectable date range. Details of the issues are listed below the chart. By providing the history of the health score, CloudIQ allows users to identify possible recurring issues in the environment.
Figure 3. Health Score history
What if we do not want to log in to CloudIQ on a daily or weekly basis to check our systems? We can easily be notified by email any time a system health change occurs. These notifications can be set up for a configurable set of systems, allowing users to receive notifications only for those systems for which they are responsible.
For the more motivated user, CloudIQ supports Webhooks. With this feature, users can send a Webhook for any health change notification to integrate with third-party tools such as ServiceNow, Slack, or Teams. Webhooks are sent for both open and closed issues with a unique identifier. This allows users to correlate the resolved issue with the open issue to automatically close out any created incident. Some Webhook integration examples can be found here.
Whether it be for storage, networking, hyperconverged, servers, or data protection, the Proactive Health Score summarizes the health of a system into a single number, providing an immediate indication of the status of each system. Developed in tandem with experts from each product team, any issues identified for a system are accompanied by recommended remediation to help with self-service and quickly reduce risk. And with email notifications and Webhooks, users can be notified proactively any time an issue is identified.
How do you become more familiar with Dell Technologies and CloudIQ? The Dell Technologies Info Hub site provides expertise that helps to ensure customer success with Dell Technologies platforms. We have CloudIQ demos, white papers, and videos available at the Dell Technologies CloudIQ page. Also, feel free to reference the CloudIQ Overview Whitepaper which provides an in-depth summary of CloudIQ. Interested in DevOps? Go to our public API page for information about integrating CloudIQ with other IT tools using Webhooks or REST API.
Stay tuned for my next blog, where I'll talk about capacity forecasting and capacity anomaly detection in CloudIQ.
Author: Derek Barboza, Senior Principal Engineering Technologist
Explore Real-World Cases with the Dell SRM Interactive Demo
Thu, 17 Nov 2022 15:04:10 -0000
At Dell Technologies, we are proud to announce a new interactive demo for Storage Resource Manager (SRM), located here:
This interactive demo is based on the SRM release 4.7.0.0, which introduces several new features, enhancements, and platform supports.
The landing page of the interactive demo provides a summary of the use cases and features covered. This demo has the same look and feel as the actual HTML-based SRM user interface, where you can scroll up and down the page and click on each page object.
Dell SRM provides insight into data center operations from application to storage. Through automated discovery and reporting, Dell SRM breaks down the silos. Its simple use-case driven user interface simplifies tasks such as:
There are eight independent interactive demo modules available, each of which covers a main SRM use case or feature:
Here is a peek inside each of the eight demo modules:
The data that is available in this comprehensive eight module demo is from the following supported vendors and technologies:
Enjoy this demo and let us know how you like it!
Author: Dejan Stojanovic
CloudIQ: Cloud-based Monitoring for your Dell Technologies IT Environment
Wed, 25 May 2022 19:49:28 -0000
CloudIQ is Dell’s cloud-based AIOps application for monitoring Dell core, edge, and cloud infrastructure. Born out of the Dell Unity storage product group several years ago, CloudIQ has quickly grown to cover a broad range of Dell Technologies products. With the latest addition of PowerSwitch, CloudIQ now covers Dell’s entire infrastructure portfolio, including compute, networking, CI/HCI, data protection, and storage systems.
According to a survey conducted last year, IT organizations were able to resolve infrastructure issues two to ten times faster and save a day per week on average with CloudIQ.1
Figure 1. CloudIQ Supported Platforms
CloudIQ has a variety of innovative features based on machine learning and other algorithms that help you reduce risk, plan ahead, and improve productivity. These features include the proactive health score, performance impact and anomaly detection, workload contention identification, capacity forecasting and anomaly detection, cybersecurity monitoring, reclaimable storage identification, and VMware integration.
With custom reporting features, Webhooks, and a REST API, you can integrate data from CloudIQ into ticketing, collaboration, and automation tools and processes that you use in day-to-day IT operations.
Best of all, CloudIQ comes with your standard Dell ProSupport and ProSupport Plus contracts at no extra cost.
Keep an eye out for follow up blogs discussing CloudIQ’s key features in more detail!
Figure 2. CloudIQ Overview Page
With the addition of PowerSwitch support, CloudIQ now gives users the ability to monitor the full range of their Dell Technologies IT infrastructure from a single user interface. And the fact that it is a cloud offering hosted in a secure Dell IT environment means that it is accessible from virtually anywhere. Simply open a web browser, point to https://cloudiq.dell.com, and log in with your Dell support credentials. As a cloud-based application, it also means that you always have access to the latest features because CloudIQ’s agile development process allows for continuous and seamless updates without any effort from you. There is also a mobile app, so you can take it anywhere.
How do you become more familiar with Dell Technologies and CloudIQ? The Dell Technologies Info Hub site provides expertise that helps to ensure customer success with Dell Technologies platforms. We have CloudIQ demos, white papers, and videos available at the Dell Technologies CloudIQ page. Also, feel free to reference the CloudIQ Overview Whitepaper which provides an in-depth summary of CloudIQ.
[1] Based on a Dell Technologies survey of CloudIQ users conducted May-June 2021. Actual results may vary.
Author: Derek Barboza, Senior Principal Engineering Technologist
How to Build a Custom Dell CSI Driver
Wed, 20 Apr 2022 21:28:38 -0000
With all the Dell Container Storage Interface (CSI) drivers and dependencies being open-source, anyone can tweak them to fit a specific use case.
As a practical example, this blog shows how to create a patched version of the Dell CSI Driver for PowerScale that supports a longer mounted path.
The CSI Specification defines that a driver must accept a maximum path length of at least 128 bytes:
// SP SHOULD support the maximum path length allowed by the operating
// system/filesystem, but, at a minimum, SP MUST accept a max path
// length of at least 128 bytes.
Dell drivers use the gocsi library as a common boilerplate for CSI development. That library enforces the 128-byte maximum path length.
The PowerScale hardware supports path lengths up to 1023 characters, as described in the File system guidelines chapter of the PowerScale spec. We’ll therefore build a csi-powerscale driver that supports that maximum path length.
The Dell CSI drivers are all built with golang and, obviously, run as a container. As a result, the prerequisites are relatively simple. You need:
The first thing to do is to clone the official csi-powerscale repository in your GOPATH source directory.
cd $GOPATH/src/github.com/
git clone git@github.com:dell/csi-powerscale.git dell/csi-powerscale
cd dell/csi-powerscale
You can then pick the version of the driver you want to patch; git tag gives the list of versions.
In this example, we pick the v2.1.0 with git checkout v2.1.0 -b v2.1.0-longer-path.
The next step is to obtain the library we want to patch.
gocsi and every other open-source component maintained for Dell CSI are available on https://github.com/dell.
The following figure shows how to fork the repository on your private github:
Now we can get the library with:
cd $GOPATH/src/github.com/
git clone git@github.com:coulof/gocsi.git coulof/gocsi
cd coulof/gocsi
To simplify the maintenance and merging of future commits, it is wise to add the original repo as an upstream remote with:
git remote add upstream git@github.com:dell/gocsi.git
The next important step is to pick and choose the correct library version used by our version of the driver.
We can check the csi-powerscale dependency file with: grep gocsi $GOPATH/src/github.com/dell/csi-powerscale/go.mod and create a branch of that version. In this case, the version is v1.5.0, and we can branch it with: git checkout v1.5.0 -b v1.5.0-longer-path.
Now it’s time to hack our patch! Which is… just a oneliner:
--- a/middleware/specvalidator/spec_validator.go
+++ b/middleware/specvalidator/spec_validator.go
@@ -770,7 +770,7 @@ func validateVolumeCapabilitiesArg(
}
const (
- maxFieldString = 128
+ maxFieldString = 1023
maxFieldMap = 4096
maxFieldNodeId = 256
)
We can then commit and push our patched library with a nice tag:
git commit -a -m 'increase path limit'
git push --set-upstream origin v1.5.0-longer-path
git tag -a v1.5.0-longer-path
git push --tags
With the patch committed and pushed, it’s time to build the CSI driver binary and its container image.
Let’s go back to the csi-powerscale main repo: cd $GOPATH/src/github.com/dell/csi-powerscale
As mentioned in the introduction, we can take advantage of the replace directive in the go.mod file to point to the patched lib. In this case we add the following:
diff --git a/go.mod b/go.mod
index 5c274b4..c4c8556 100644
--- a/go.mod
+++ b/go.mod
@@ -26,6 +26,7 @@ require (
)
replace (
+ github.com/dell/gocsi => github.com/coulof/gocsi v1.5.0-longer-path
k8s.io/api => k8s.io/api v0.20.2
k8s.io/apiextensions-apiserver => k8s.io/apiextensions-apiserver v0.20.2
k8s.io/apimachinery => k8s.io/apimachinery v0.20.2
When that is done, we obtain the new module from the online repo with: go mod download
Note: If you want to test the changes locally only, we can use the replace directive to point to the local directory with:
replace github.com/dell/gocsi => ../../coulof/gocsi
We can then build our new driver binary locally with: make build
After compiling it successfully, we can create the image. The shortest path to do that is to replace the csi-isilon binary from the dellemc/csi-isilon docker image with:
cat << EOF > Dockerfile.patch
FROM dellemc/csi-isilon:v2.1.0
COPY "csi-isilon" .
EOF
docker build -t coulof/csi-isilon:v2.1.0-long-path -f Dockerfile.patch .
Alternatively, you can rebuild an entire docker image using provided Makefile.
By default, the driver uses a Red Hat Universal Base Image minimal. That base image sometimes misses dependencies, so you can use another flavor, such as:
BASEIMAGE=registry.fedoraproject.org/fedora-minimal:latest REGISTRY=docker.io IMAGENAME=coulof/csi-powerscale IMAGETAG=v2.1.0-long-path make podman-build
The image is ready to be pushed to whatever image registry you prefer; in this case, hub.docker.com: docker push coulof/csi-isilon:v2.1.0-long-path.
The last step is to replace the driver image used in your Kubernetes cluster with your custom one.
Again, multiple solutions are possible, and the one to choose depends on how you deployed the driver.
If you used the helm installer, you can add the following block at the top of the myvalues.yaml file:
images:
  driver: docker.io/coulof/csi-powerscale:v2.1.0-long-path
Then update or uninstall/reinstall the driver as described in the documentation.
If you decided to use the Dell CSI Operator, you can simply point to the new image:
apiVersion: storage.dell.com/v1
kind: CSIIsilon
metadata:
  name: isilon
spec:
  driver:
    common:
      image: "docker.io/coulof/csi-powerscale:v2.1.0-long-path"
...
Or, if you want to do a quick and dirty test, you can create a patch file (here named path_csi-isilon_controller_image.yaml) with the following content:
spec:
  template:
    spec:
      containers:
        - name: driver
          image: docker.io/coulof/csi-powerscale:v2.1.0-long-path
You can then apply it to your existing install with: kubectl patch deployment -n powerscale isilon-controller --patch-file path_csi-isilon_controller_image.yaml
In all cases, you can check that everything works by first making sure that the Pod is started:
kubectl get pods -n powerscale
and that the logs are clean:
kubectl logs -n powerscale -l app=isilon-controller -c driver
As demonstrated, thanks to open source, it’s easy to fix and improve Dell CSI drivers or Dell Container Storage Modules.
Keep in mind that Dell officially supports (through tickets, Service Requests, and so on) the official images and binaries, but not custom builds.
Thanks for reading and stay tuned for future posts on Dell Storage and Kubernetes!
Author: Florian Coulombel
Announcing: the New Dell SRM Hands-on Lab
Thu, 07 Apr 2022 14:26:51 -0000
|Read Time: 0 minutes
We are happy to announce the release of the new SRM hands-on lab:
This new SRM hands-on lab is based on the latest SRM release (4.7.0.0), which introduced many new features, enhancements, and newly supported platforms.
To find this lab, go to the demo center (https://democenter.delltechnologies.com) and enter “srm” in the search box. This link to the lab will appear:
The welcome screen on the lab looks like this. It includes a network diagram and a comprehensive lab guide:
In the first module, called “What’s New”, the lab focuses on the following new features, enhancements, and newly supported platforms:
The rest of the modules cover the in-depth SRM use cases listed below, as well as some of the main SRM features. Each module is independent, so that you can focus on your area of interest:
Check out some of the SRM dashboards available:
The lab includes a great variety of SRM reports containing data from supported vendors and technologies:
The SRM 4.7.0.0 hands-on lab helps you experience SRM use cases and features by browsing through the powerful user interface and exploring data from multiple vendors and technologies.
Enjoy the SRM hands-on lab! If you have any questions, please contact us at support@democenter.dell.com.
Author: Dejan Stojanovic
Looking Ahead: Dell Container Storage Modules 1.2
Mon, 21 Mar 2022 14:42:31 -0000
|Read Time: 0 minutes
The quarterly update for Dell CSI Drivers & Dell Container Storage Modules (CSM) is here! Here’s what we’re planning.
Dell Container Storage Modules (CSM) add data services and features that are not in the scope of the CSI specification today. The new CSM Operator simplifies the deployment of CSMs. With an ever-growing ecosystem and added features, deploying a driver and its affiliated modules needs to be carefully planned before beginning the deployment.
The new CSM Operator:
In the short to medium term, the CSM Operator will deprecate the experimental CSM Installer.
For disaster recovery protection, PowerScale implements data replication between appliances by means of the SyncIQ feature. SyncIQ replicates the data between two sites, where one is read-write while the other is read-only, similar to other Dell storage backends with asynchronous or synchronous replication.
The role of the CSM replication module and underlying CSI driver is to provision the volume within Kubernetes clusters and prepare the export configurations, quotas, and so on.
CSM Replication for PowerScale has been designed and implemented in such a way that it won’t collide with your existing Superna Eyeglass DR utility.
A live-action demo will be posted in the coming weeks on our VP’s YouTube channel: https://www.youtube.com/user/itzikreich/.
In this release, each CSI driver:
Kubernetes v1.19 introduced fsGroupPolicy to give the CSI driver more control over how the fsGroup from the Pod securityContext is applied to volume permissions.
There are three possible options: None, File, and ReadWriteOnceWithFSType (the default).
In all cases, Dell CSI drivers let the kubelet perform the ownership change operations and do not do it at the driver level.
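As a minimal illustrative sketch (not taken from the driver’s own manifests), the policy is declared on the CSIDriver object; the driver name below matches the PowerScale driver, and the chosen value is only an example:
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi-isilon.dellemc.com
spec:
  # With ReadWriteOnceWithFSType, the kubelet changes volume ownership to the Pod fsGroup
  # only for RWO volumes that define an fsType
  fsGroupPolicy: ReadWriteOnceWithFSType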
Drivers for PowerFlex and Unity can now be installed with the help of the install scripts we provide under the dell-csi-installer directory.
A standalone Helm chart helps to easily integrate the driver installation with the agent for Continuous Deployment like Flux or Argo CD.
Note: To ensure that you install the driver on a supported Kubernetes version, the Helm charts take advantage of the kubeVersion field. Some Kubernetes distributions report suffixed versions in kubectl version (such as v1.21.3-mirantis-1 and v1.20.7-eks-1-20-7) that do not match the constraint and require manually editing the kubeVersion field.
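For illustration only, here is a hypothetical Chart.yaml excerpt showing how such a constraint is expressed; suffixed versions like v1.21.3-mirantis-1 are treated as pre-releases by semver and will not match it unless the field is edited:
apiVersion: v2
name: csi-isilon
version: 2.1.0
# Hypothetical constraint: plain versions such as v1.21.3 match,
# but pre-release style suffixes (v1.21.3-mirantis-1) do not
kubeVersion: ">= 1.21.0 < 1.24.0"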
Drivers for PowerFlex and Unity implement Volume Health Monitoring.
This feature is currently alpha in Kubernetes (as of Q1 2022) and is disabled in a default installation.
Once enabled, the drivers expose standard storage metrics, such as capacity usage and inode usage, through the Kubernetes /metrics endpoint. The metrics flow natively into popular dashboards like the ones built into OpenShift Monitoring:
All Dell drivers and dependencies like gopowerstore, gobrick, and more are now on GitHub and will be fully open-sourced. The umbrella project is and remains https://github.com/dell/csm, from which you can open tickets and see the roadmap.
The Dell partnership with Google continues, and the latest CSI drivers for PowerScale and PowerStore support Anthos v1.9.
Both CSI PowerScale and CSI PowerStore now allow setting the default permissions for newly created volumes. To do this, you can use POSIX octal notation or ACLs.
For more details you can:
Author: Florian Coulombel
PowerMax and PowerStore Cyber Security
Tue, 15 Mar 2022 19:24:40 -0000
|Read Time: 0 minutes
Dell Technologies takes a comprehensive approach to cyber resiliency and is committed to helping customers achieve their security objectives and requirements. Storage Engineering Technologists Richard Pace, Justin Bastin, and Derek Barboza worked together, cross platform, to deliver three independent cyber security white papers for PowerMax, Mainframe, and PowerStore:
Each paper acts as a single point where customers can gain an understanding of the respective robust features and data services available to safeguard sensitive and mission-critical data in the event of a cyber crime. All three papers leverage CloudIQ and the CyberSecurity feature to provide customers insight into anomaly detection.
The following figure shows a CloudIQ anomaly that indicates unusual behavior in a customer’s environment:
Backed by CyberSecurity in CloudIQ, we can see how quickly CloudIQ detects the issue and provides the details for manual remediation.
Dell has an ingrained culture of security. We follow a ‘shift-left’ approach that ensures that security is baked into every process in the development life cycle. The Dell Secure Development Lifecycle (SDL) defines security controls, based on industry standards, that Dell product teams adopt while developing new features and functionality. Our SDL includes both analysis activities and prescriptive proactive controls around key risk areas.
Dell strives to help our customers minimize risk associated with security vulnerabilities in our products. Our goal is to provide customers with timely information, guidance, and mitigation options to address vulnerabilities. The Dell Product Security Incident Response Team (Dell PSIRT) is chartered and responsible for coordinating the response and disclosure for all product vulnerabilities that are reported to Dell. Dell employs a rigorous process to continually evaluate and improve our vulnerability response practices, and regularly benchmarks these against the rest of the industry.
Authors: Richard Pace, Justin F. Bastin
CSI drivers 2.0 and Dell EMC Container Storage Modules GA!
Thu, 14 Oct 2021 11:40:35 -0000
|Read Time: 0 minutes
The quarterly update for Dell CSI Driver is here! But today marks a significant milestone because we are also announcing the availability of Dell EMC Container Storage Modules (CSM). Here’s what we’re covering in this blog:
Dell Container Storage Modules is a set of modules that aims to extend Kubernetes storage features beyond what is available in the CSI specification.
The CSM modules expose enterprise storage features directly within Kubernetes, so developers are empowered to leverage them for their deployments in a seamless way.
Most of these modules are released as sidecar containers that work with the CSI driver for the Dell storage array technology you use.
CSM modules are open source and freely available from https://github.com/dell/csm.
Many stateful apps can run on top of multiple volumes. For example, we can have a transactional DB like Postgres with a volume for its data and another for the redo log, or Cassandra that is distributed across nodes, each having a volume, and so on.
When you want to take a recoverable snapshot of such an application, it is vital to snapshot all of its volumes consistently, at the exact same time.
Dell CSI Volume Group Snapshotter solves that problem for you. With the help of a CustomResourceDefinition, an additional sidecar to the Dell CSI drivers, and leveraging vanilla Kubernetes snapshots, you can manage the life cycle of crash-consistent snapshots. This means you can create a group of volumes for which the drivers create snapshots, restore them, or move them simultaneously in a single operation!
To take a crash-consistent snapshot, you can either use labels on your PersistentVolumeClaims, or be explicit and pass the list of PVCs that you want to snapshot. For example:
apiVersion: volumegroup.storage.dell.com/v1alpha2
kind: DellCsiVolumeGroupSnapshot
metadata:
  # Name must be 13 characters or less in length
  name: "vg-snaprun1"
spec:
  driverName: "csi-vxflexos.dellemc.com"
  memberReclaimPolicy: "Retain"
  volumesnapshotclass: "poweflex-snapclass"
  pvcLabel: "vgs-snap-label"
  # pvcList:
  #   - "pvcName1"
  #   - "pvcName2"
For the first release, CSI for PowerFlex supports Volume Group Snapshot.
The CSM Observability module is delivered as an OpenTelemetry agent that collects array-level metrics so that they can be scraped into a Prometheus database.
The integration is as easy as creating a Prometheus ServiceMonitor. For example:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  namespace: powerstore
spec:
  endpoints:
    - path: /metrics
      port: exporter-https
      scheme: https
      tlsConfig:
        insecureSkipVerify: true
  selector:
    matchLabels:
      app.kubernetes.io/instance: karavi-observability
      app.kubernetes.io/name: otel-collector
With the Observability module, you gain visibility into the capacity of the volumes you manage with Dell CSI drivers and their performance, in terms of bandwidth, IOPS, and response time.
Thanks to pre-canned Grafana dashboards, you can browse these metrics’ history and see the topology from a Kubernetes PersistentVolume (PV) down to its translation as a LUN or file share in the backend array.
The Kubernetes admin can also collect array-level metrics to check the overall capacity and performance directly from the familiar Prometheus/Grafana tools.
For the first release, Dell EMC PowerFlex and Dell EMC PowerStore support CSM Observability.
Each Dell storage array supports replication capabilities. It can be asynchronous with an associated recovery point objective, synchronous replication between sites, or even active-active.
Each replication type serves a different purpose related to the use-case or the constraint you have on your data centers.
The Dell CSM replication module allows creating a persistent volume that can be of any of three replication types -- synchronous, asynchronous, and metro -- assuming the underlying storage box supports it.
The Kubernetes architecture can build on a stretched cluster between two sites or on two or more independent clusters. The module itself is composed of three main components:
The usual workflow is to create a PVC that is replicated with a classic Kubernetes directive by just picking the right StorageClass. You can then use repctl or edit the DellCSIReplicationGroup CRD to launch operations like Failover, Failback, Reprotect, Suspend, Synchronize, and so on.
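As an illustrative sketch, the Kubernetes side of that workflow is an ordinary PVC that simply references a replication-enabled StorageClass; the StorageClass name below is hypothetical and would be created by the storage admin with the replication parameters for the target array:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
  namespace: prod
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  # Hypothetical StorageClass prepared with CSM replication parameters
  storageClassName: powerstore-replicated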
For the first release, Dell EMC PowerMax and Dell EMC PowerStore support CSM Replication.
With CSM Authorization we are giving back more control of storage consumption to the storage administrator.
The authorization module is an independent service, installed and owned by the storage admin.
Within that module, the storage administrator will create access control policies and storage quotas to make sure that Kubernetes consumers are not overconsuming storage or trying to access data that doesn’t belong to them.
CSM Authorization makes multi-tenant architecture real by enforcing Role-Based Access Control on storage objects coming from multiple and independent Kubernetes clusters.
The authorization module acts as a proxy between the CSI driver and the backend array. Access is granted with an access token that can be revoked at any point in time. Quotas can be changed on the fly to limit or increase storage consumption from the different tenants.
For the first release, Dell EMC PowerMax and Dell EMC PowerFlex support CSM Authorization.
When dealing with stateful applications, if a node goes down, vanilla Kubernetes is pretty conservative.
Indeed, from the Kubernetes control plane, the failing node is seen as NotReady. That can be because the node is down, because of network partitioning between the control plane and the node, or simply because the kubelet is down. In the latter two scenarios, the stateful application is still running and possibly writing data to disk. Therefore, Kubernetes won’t take action and lets the admin manually trigger a Pod deletion if desired.
The CSM Resiliency module (sometimes named PodMon) aims to improve that behavior with the help of metrics collected from the array.
Because the driver has access to the storage backend from pretty much all the nodes, we can see the volume status (mapped or not) and its activity (whether there are IOPS or not). So when a node goes into the NotReady state and we see no IOPS on the volume, Resiliency relocates the Pod to a new node and cleans up whatever leftover objects might exist.
The entire process happens in seconds between the moment a node is seen down and the rescheduling of the Pod.
To protect an app with the resiliency module, you only have to add the label podmon.dellemc.com/driver to it, and it is then protected.
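For example, here is a sketch of a StatefulSet whose Pod template carries that label; the label value and the rest of the manifest are illustrative only (check the module documentation for the exact value expected for your driver):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-db
spec:
  serviceName: my-db
  replicas: 1
  selector:
    matchLabels:
      app: my-db
  template:
    metadata:
      labels:
        app: my-db
        # Label watched by CSM Resiliency (PodMon); the value shown is illustrative
        podmon.dellemc.com/driver: csi-vxflexos
    spec:
      containers:
        - name: db
          image: postgres:14
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 5Gi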
For more details on the module’s design, you can check the documentation here.
For the first release, Dell EMC PowerFlex and Dell EMC Unity support CSM Resiliency.
Each module above is released either as an independent Helm chart or as an option within the CSI drivers.
For more complex deployments, which may involve multiple Kubernetes clusters or a mix of modules, it is possible to use the CSM Installer.
The CSM Installer, built on top of Carvel, gives the user a single command line to create their CSM-CSI applications and to manage them from outside the Kubernetes cluster.
For the first release, all drivers and modules support the CSM Installer.
For each driver, this release provides:
VMware Tanzu offers storage management by means of its CNS-CSI driver, but that driver doesn’t support the ReadWriteMany access mode.
If your workload needs concurrent access to the filesystem, you can now rely on the CSI drivers for PowerStore, PowerScale, and Unity through the NFS protocol. The three platforms are officially supported and qualified on Tanzu.
The NFS support for PowerStore, PowerScale, and Unity has also been tested and works when the Kubernetes cluster is behind a private network.
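For instance, a minimal sketch of a ReadWriteMany claim backed by one of these NFS-capable drivers; the StorageClass name is hypothetical and would map to a PowerStore, PowerScale, or Unity NFS class defined in your cluster:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  # Hypothetical NFS-backed StorageClass name
  storageClassName: powerscale-nfs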
By default, the CSI driver creates volumes with 777 POSIX permissions on the directory.
Now, with the isiVolumePathPermissions parameter, you can use ACLs or a custom set of POSIX permissions.
The isiVolumePathPermissions parameter can be configured as part of the ConfigMap with the PowerScale settings, or at the StorageClass level. The accepted values are private_read, private, public_read, public_read_write, and public for ACLs, or any POSIX mode combination.
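As a sketch only (assuming the usual csi-isilon.dellemc.com provisioner name and omitting the other PowerScale parameters), a StorageClass could set the parameter like this:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: powerscale-restricted
provisioner: csi-isilon.dellemc.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  # Either one of the ACL values listed above, or a POSIX mode such as "0755"
  isiVolumePathPermissions: "0755"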
For more details you can:
Author: Florian Coulombel