OneFS HealthCheck Auto-updates
Tue, 21 May 2024 17:11:27 -0000
Prior to OneFS 9.4, Healthchecks were frequently regarded by storage administrators as yet another patch that needed to be installed on a PowerScale cluster. As a result, their adoption was routinely postponed or ignored, potentially jeopardizing a cluster’s well-being. To address this, OneFS HealthCheck auto-updates enable new Healthchecks to be automatically downloaded and non-disruptively installed on a PowerScale cluster without any user intervention.
The automated HealthCheck update framework helps accelerate the adoption of OneFS Healthchecks, by removing the need for manual checks, downloads, and installation. In addition to reducing management overhead, the automated Healthchecks integrate with CloudIQ to update the cluster health score - further improving operational efficiency, while avoiding known issues that affect cluster availability.
In OneFS 9.4 and later, what were formerly known as Healthcheck patches, or RUPs, are renamed ‘Healthcheck definitions’. The Healthcheck framework checks for updates to these definitions using Dell Secure Remote Services (SRS).
An auto-update configuration setting in the OneFS SRS framework controls whether the Healthcheck definitions are automatically downloaded and installed on a cluster. A OneFS platform API endpoint has been added to verify the Healthcheck version, and Healthchecks also optionally support OneFS compliance mode.
Healthcheck auto-update is enabled by default in OneFS 9.4 and later, and is available for both existing and new clusters. It can also be disabled from the CLI. If auto-update is on and SRS is enabled, the Healthcheck definition is downloaded to the desired staging location and then installed on the cluster automatically and without disruption. Any Healthcheck definitions that are automatically downloaded are signed and verified before being applied, to ensure their security and integrity.
So, the Healthcheck auto-update execution process itself is as follows:
On the cluster, the Healthcheck auto-update utility isi_healthcheck_update checks for a new package once a night, by default. This Python script uses SRS to check the cluster’s current Healthcheck definition version and the availability of new updates. Next, it compares the installed version with the version of the available package, after which the new definition is downloaded and installed. Telemetry data is sent, and the /var/db/healthcheck_version.json file is created if it’s not already present. This JSON file is then updated with the new Healthcheck version info.
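The version comparison and version-file update described above can be illustrated with a minimal Python sketch. This is not the isi_healthcheck_update source: the function names are hypothetical, the dotted-integer version format is assumed from the ‘32.0.3’ example later in this article, and the demonstration writes to a temporary file rather than /var/db/healthcheck_version.json.

```python
import json
import os
import tempfile


def parse_version(text):
    """Convert a dotted version string such as '32.0.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def update_needed(installed, available):
    """Return True when the available definition is newer than the installed one."""
    return parse_version(available) > parse_version(installed)


def read_version(path):
    """Return the recorded version, or None if the version file does not exist yet."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)["version"]


def record_version(path, version):
    """Create the JSON version file if needed and store the new version in it."""
    with open(path, "w") as handle:
        json.dump({"version": version}, handle)


# Demonstration against a temporary file (assumption: stands in for the real path).
with tempfile.TemporaryDirectory() as workdir:
    version_file = os.path.join(workdir, "healthcheck_version.json")
    installed = read_version(version_file) or "0"
    available = "32.0.3"  # hypothetical version reported by the update check
    if update_needed(installed, available):
        # The real utility would download and install the package at this point.
        record_version(version_file, available)
    print(read_version(version_file))
```

Comparing tuples of integers rather than raw strings matters here: as strings, "32.0.10" would sort before "32.0.3", whereas the tuple comparison orders them correctly.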
To configure and use the Healthcheck auto-update functionality, you must perform the following steps:
- Upgrade the cluster to OneFS 9.4 or later and commit the upgrade.
- To use the isi_healthcheck script, OneFS needs to be licensed and connected to the ESRS gateway. OneFS 9.4 also introduces a new option for ESRS, ‘SRS Download Enabled’, which must be set to ‘Yes’ (the default value) to allow the isi_healthcheck_update utility to run. To do this, use the following syntax (in this example, using 10.12.15.50 as the primary ESRS gateway):
# isi esrs modify --enabled=yes --primary-esrs-gateway=10.12.15.50 --srs-download-enabled=true
Confirm the ESRS configuration as follows:
# isi esrs view
Enabled: Yes
Primary ESRS Gateway: 10.12.15.50
Secondary ESRS Gateway:
Alert on Disconnect: Yes
Gateway Access Pools: -
Gateway Connectivity Check Period: 60
License Usage Intelligence Reporting Period: 86400
Download Enabled: No
SRS Download Enabled: Yes
ESRS File Download Timeout Period: 50
ESRS File Download Error Retries: 3
ESRS File Download Chunk Size: 1000000
ESRS Download Filesystem Limit: 80
Offline Telemetry Collection Period: 7200
Gateway Connectivity Status: Connected
- Next, use the CloudIQ web interface to onboard the cluster. This requires creating a site, and then from the Add Product page, configuring the serial number of each node in the cluster, along with the product type ISILON_NODE, the site ID, and then selecting Submit.
CloudIQ cluster onboarding typically takes a couple of hours. When complete, the Product Details page shows the CloudIQ Status, ESRS Data, and CloudIQ Data fields as Enabled.
- Examine the cluster status to verify that the cluster is available and connected in CloudIQ.
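As a way of double-checking the ESRS prerequisites from the steps above, the following Python sketch parses text in the ‘isi esrs view’ format and tests the three fields this article relies on. The function names are hypothetical, and the sample is an abbreviated copy of the output shown in this article rather than anything captured from a live cluster.

```python
SAMPLE_VIEW = """\
Enabled: Yes
Primary ESRS Gateway: 10.12.15.50
Secondary ESRS Gateway:
Alert on Disconnect: Yes
Gateway Access Pools: -
Download Enabled: No
SRS Download Enabled: Yes
Gateway Connectivity Status: Connected
"""


def parse_view(text):
    """Turn 'Key: value' lines into a dict, splitting on the first colon only."""
    settings = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if separator:
            settings[key.strip()] = value.strip()
    return settings


def autoupdate_prereqs_met(settings):
    """Check the ESRS fields that the auto-update prerequisites depend on."""
    return (
        settings.get("Enabled") == "Yes"
        and settings.get("SRS Download Enabled") == "Yes"
        and settings.get("Gateway Connectivity Status") == "Connected"
    )


settings = parse_view(SAMPLE_VIEW)
print(autoupdate_prereqs_met(settings))
```

Note that ‘Download Enabled’ and ‘SRS Download Enabled’ are separate fields, so the check looks up the full field name rather than matching on a substring.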
When these prerequisite steps are complete, use the new isi_healthcheck_update CLI command to enable auto-update. For example:
# isi_healthcheck_update --enable
2022-05-02 22:21:27,310 - isi_healthcheck.auto_update - INFO - isi_healthcheck_update started
2022-05-02 22:21:27,513 - isi_healthcheck.auto_update - INFO - Enable autoupdate
Similarly, you can disable auto-update by turning off SRS downloads:
# isi esrs modify --srs-download-enabled=false
Auto-update also has the following gconfig global config options and default values:
# isi_gconfig -t healthcheck
Default values:
healthcheck_autoupdate.enabled (bool) = true
healthcheck_autoupdate.compliance_update (bool) = false
healthcheck_autoupdate.alerts (bool) = false
healthcheck_autoupdate.max_download_package_time (int) = 600
healthcheck_autoupdate.max_install_package_time (int) = 3600
healthcheck_autoupdate.number_of_failed_upgrades (int) = 0
healthcheck_autoupdate.last_failed_upgrade_package (char*) =
healthcheck_autoupdate.download_directory (char*) = /ifs/data/auto_upgrade_healthcheck/downloads
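To make the types and units of these options easier to see, the following Python sketch parses lines in the ‘name (type) = value’ format shown above into typed values. The parser and its names are hypothetical, and reading the two time limits as seconds is an assumption, although it is consistent with the 60-minute installation limit mentioned in the troubleshooting steps later in this article.

```python
import re

SAMPLE_GCONFIG = """\
Default values:
healthcheck_autoupdate.enabled (bool) = true
healthcheck_autoupdate.compliance_update (bool) = false
healthcheck_autoupdate.alerts (bool) = false
healthcheck_autoupdate.max_download_package_time (int) = 600
healthcheck_autoupdate.max_install_package_time (int) = 3600
healthcheck_autoupdate.number_of_failed_upgrades (int) = 0
healthcheck_autoupdate.last_failed_upgrade_package (char*) =
healthcheck_autoupdate.download_directory (char*) = /ifs/data/auto_upgrade_healthcheck/downloads
"""

# Matches "name (type) = value"; the value may be empty.
LINE = re.compile(r"^(?P<key>\S+)\s+\((?P<type>[^)]+)\)\s*=\s*(?P<value>.*)$")


def parse_gconfig(text):
    """Return a dict of option name to a value converted according to its type."""
    settings = {}
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # skips headers such as "Default values:"
        kind, value = match["type"], match["value"]
        if kind == "bool":
            settings[match["key"]] = value == "true"
        elif kind == "int":
            settings[match["key"]] = int(value)
        else:
            settings[match["key"]] = value
    return settings


options = parse_gconfig(SAMPLE_GCONFIG)
# Assumption: the limit is in seconds, so 3600 corresponds to 60 minutes.
print(options["healthcheck_autoupdate.max_install_package_time"] // 60)
```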
The isi_healthcheck_update Python utility is scheduled by cron and executed across all the nodes in the cluster, as follows:
# grep -i healthcheck /etc/crontab
# Nightly Healthcheck update
0 1 * * * root /usr/bin/isi_healthcheck_update -s
This default /etc/crontab entry executes auto-update once daily at 1am. However, this schedule can be adjusted to meet the needs of the local environment.
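For readers less familiar with the system crontab format, the following Python sketch splits the default entry into its fields and builds an equivalent entry for a different time of day. It only manipulates the text of the entry; the function names are hypothetical, and how a modified schedule should be applied and persisted on a cluster is outside the scope of this sketch.

```python
DEFAULT_ENTRY = "0 1 * * * root /usr/bin/isi_healthcheck_update -s"


def parse_crontab_entry(line):
    """Split a system crontab line into its five schedule fields, user, and command."""
    minute, hour, day, month, weekday, user, command = line.split(None, 6)
    return {
        "minute": minute,
        "hour": hour,
        "day": day,
        "month": month,
        "weekday": weekday,
        "user": user,
        "command": command,
    }


def reschedule(line, hour, minute=0):
    """Return the same entry with a different daily run time."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    entry = parse_crontab_entry(line)
    entry["hour"], entry["minute"] = str(hour), str(minute)
    return "{minute} {hour} {day} {month} {weekday} {user} {command}".format(**entry)


print(parse_crontab_entry(DEFAULT_ENTRY)["hour"])
print(reschedule(DEFAULT_ENTRY, hour=3, minute=30))
```

In the default entry, the first two fields (minute 0, hour 1) give the 1am run time, and the three asterisks mean every day of the month, every month, and every day of the week.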
Auto-update checks whether a new package is available, compares the installed version with the version of the new package, and downloads the new package. The package is then installed, telemetry data is sent, and the healthcheck_version.json file is updated with the new version.
After the Healthcheck update process has completed, you can use the following CLI command to view any automatically downloaded Healthcheck packages. For example:
# isi upgrade patches list
Patch Name               Description                               Status
-----------------------------------------------------------------------------
HealthCheck_9.4.0_32.0.3 [9.4.0 UHC 32.0.3] HealthCheck definition Installed
-----------------------------------------------------------------------------
Total: 1
Additionally, viewing the JSON version file will also confirm this:
# cat /var/db/healthcheck_version.json
{"version": "32.0.3"}
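The two confirmations above can be cross-checked against each other. The following Python sketch assumes, based only on the single example in this article, that the patch name follows the pattern HealthCheck_<OneFS release>_<definition version>. The function names are hypothetical.

```python
import json


def split_patch_name(name):
    """Split 'HealthCheck_9.4.0_32.0.3' into its prefix, OneFS release, and definition version."""
    prefix, release, definition = name.split("_", 2)
    return prefix, release, definition


def versions_agree(patch_name, version_json):
    """Return True when the patch name and the JSON version file report the same version."""
    definition = split_patch_name(patch_name)[2]
    return definition == json.loads(version_json)["version"]


patch_name = "HealthCheck_9.4.0_32.0.3"   # from the patch list output
version_json = '{"version": "32.0.3"}'    # contents of the JSON version file
print(versions_agree(patch_name, version_json))
```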
In the unlikely event that auto-updates run into issues, the following troubleshooting steps can be useful:
- Confirm that Healthcheck auto-update is actually enabled:
Check the ESRS global config settings and verify they are set to ‘True’.
# isi_gconfig -t esrs esrs.enabled
esrs.enabled (bool) = true
# isi_gconfig -t esrs esrs.srs_download_enabled
esrs.srs_download_enabled (bool) = true
If not, run:
# isi_gconfig -t esrs esrs.enabled=true
# isi_gconfig -t esrs esrs.srs_download_enabled=true
- If an auto-update patch installation is not completed within 60 minutes, OneFS increments the unsuccessful installations counter for the current patch, and re-attempts installation the following day.
- If the unsuccessful installations counter exceeds five attempts, the installation will be aborted. However, you can reset the following auto-update gconfig values, as follows, to re-enable the installation:
# isi_gconfig -t healthcheck healthcheck_autoupdate.last_failed_upgrade_package=""
# isi_gconfig -t healthcheck healthcheck_autoupdate.number_of_failed_upgrades=0
- If a patch installation status is reported as ‘failed’, the recommendation is to contact Dell Support to diagnose and resolve the issue:
# isi upgrade patches list
Patch Name               Description                               Status
-----------------------------------------------------------------------------
HealthCheck_9.4.0_32.0.3 [9.4.0 UHC 32.0.3] HealthCheck definition Failed
-----------------------------------------------------------------------------
Total: 1
However, the following CLI command can be used, with care, to repair the patch system by attempting to abort the most recent failed action:
# isi upgrade patches abort
The isi upgrade archive --clear command stops the current upgrade and prevents it from being resumed:
# isi upgrade archive --clear
When the upgrade status is reported as ‘unknown’, run:
# isi upgrade patches uninstall
- The file /var/log/isi_healthcheck.log is also a useful source of detailed auto-update information.
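The retry behavior described in the troubleshooting steps above can be modeled with a short Python sketch. This is an illustration built from the article’s description, not the OneFS implementation. The class and function names are hypothetical, the fields mirror the two gconfig options that are reset during troubleshooting, and treating the installation time limit as 3600 seconds is an assumption based on the max_install_package_time default.

```python
from dataclasses import dataclass

MAX_INSTALL_SECONDS = 3600  # assumed to mirror max_install_package_time (60 minutes)
MAX_FAILED_ATTEMPTS = 5     # threshold taken from the troubleshooting text


@dataclass
class AutoUpdateState:
    number_of_failed_upgrades: int = 0
    last_failed_upgrade_package: str = ""


def should_attempt(state, package):
    """Installation is abandoned once the counter for this package exceeds the threshold."""
    if state.last_failed_upgrade_package != package:
        return True
    return state.number_of_failed_upgrades <= MAX_FAILED_ATTEMPTS


def reset(state):
    """Equivalent of resetting the two gconfig values to re-enable installation."""
    state.number_of_failed_upgrades = 0
    state.last_failed_upgrade_package = ""


def record_attempt(state, package, elapsed_seconds):
    """Record one nightly attempt; return True if it finished within the time limit."""
    if elapsed_seconds <= MAX_INSTALL_SECONDS:
        reset(state)
        return True
    if state.last_failed_upgrade_package != package:
        state.last_failed_upgrade_package = package
        state.number_of_failed_upgrades = 0
    state.number_of_failed_upgrades += 1
    return False


state = AutoUpdateState()
package = "HealthCheck_9.4.0_32.0.3"
nights = 0
while should_attempt(state, package):
    record_attempt(state, package, elapsed_seconds=4000)  # every install times out
    nights += 1
print(nights, state.number_of_failed_upgrades)
```

In this model a package that always times out is retried on successive nights until its counter exceeds five, after which no further attempts are made until the state is reset.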
Author: Nick Trimbee