For optimal cluster performance, Dell Technologies recommends observing the following OneFS Job Engine best practices:
- Schedule jobs to run during the cluster’s low-usage hours, such as overnight and on weekends.
- Use the default priority, impact, and scheduling settings for each job, when possible.
- To complement the four default impact profiles, create additional profiles such as daytime_medium, after_hours_medium, weekend_medium, and so on, to fit specific environment needs.
- Ensure that the cluster, including any individual node pools, is less than 90 percent full, so performance is not affected and sufficient space is always available to reprotect data in case of drive failures. Also enable virtual hot spare (VHS) to reserve space in case you need to smartfail devices.
- If SmartPools is licensed, ensure that spillover is enabled (default setting).
- Configure and monitor alerts. Set up event notification rules so that you are notified of events, for example when the cluster begins to reach capacity thresholds. Enter a current email address to ensure that you receive the notifications.
- Do not disable the snapshot delete job. In addition to preventing unused disk space from being freed, disabling the snapshot delete job can cause performance degradation.
- Delete snapshots in order, beginning with the oldest. Do not delete snapshots from the middle of a time range. Newer snapshots are mostly pointers to older snapshots, and they look larger than they really are.
- If you need to delete snapshots and the cluster has down or smartfailed devices, or the cluster is in an otherwise “degraded protection” state, contact Dell Technical Support for assistance.
- Run the FSAnalyze job only if you are using InsightIQ. FSAnalyze creates data for the InsightIQ file system analytics tools, providing details about data properties and space usage within /ifs. Unlike SmartQuotas, FSAnalyze only updates its views when the FSAnalyze job runs. Since FSAnalyze is a fairly low-priority job, it can sometimes be preempted by higher-priority jobs and, therefore, take a long time to gather all the data.
- Schedule deduplication jobs to run every 10 days or so, depending on the size of the dataset.
- In a heterogeneous cluster, tune job priorities and impact policies to the level of the lowest performance tier.
- Before running a major (non-rolling) OneFS upgrade, allow active jobs to complete, where possible, and cancel any jobs that are still running.
- TreeDelete can delete a directory to which a quota has been applied, with the use of the --delete-quotas flag. For example:
# isi job start TreeDelete --paths=/ifs/quota --delete-quotas
- If FlexProtect is running, allow it to finish completely before powering down any nodes or the entire cluster. While shutting down the cluster during restripe will not hurt anything directly, it does increase the risk of a second device failure before FlexProtect finishes reprotecting data.
- In OneFS 9.3 and later, node exclusions allow up to 10 percent of a cluster’s nodes to be removed from the job pool. This 10 percent limit is the default and is configurable.
- When enabling and using the FilePolicy job, continue running the SmartPools job at a reduced frequency.
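
The 90 percent capacity guideline above lends itself to a simple automated check. The following is a minimal sketch, not a OneFS tool: it does not query a cluster, and the `PoolUsage` class, the function names, and the sample figures are all hypothetical. It assumes that used and total capacity per node pool have already been collected by some other means.

```python
from dataclasses import dataclass

# Guideline from the list above: keep the cluster and each node pool
# below 90 percent full.
FULL_THRESHOLD = 0.90


@dataclass
class PoolUsage:
    """Capacity figures for one node pool (hypothetical structure)."""
    name: str
    used_bytes: int
    total_bytes: int

    @property
    def fraction_used(self) -> float:
        # Treat an empty or unreported pool as 0 percent used.
        return self.used_bytes / self.total_bytes if self.total_bytes else 0.0


def pools_over_threshold(pools, threshold=FULL_THRESHOLD):
    """Return the names of pools at or above the fullness threshold."""
    return [p.name for p in pools if p.fraction_used >= threshold]


def cluster_fraction_used(pools):
    """Return overall fullness across all pools as a fraction."""
    total = sum(p.total_bytes for p in pools)
    used = sum(p.used_bytes for p in pools)
    return used / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        PoolUsage("pool_a", used_bytes=95, total_bytes=100),
        PoolUsage("pool_b", used_bytes=10, total_bytes=100),
    ]
    print("Pools at or above threshold:", pools_over_threshold(sample))
    print("Cluster fraction used:", cluster_fraction_used(sample))
```

Checking each pool separately as well as the cluster total reflects the wording of the guideline, which applies to individual node pools too: a cluster can be well under 90 percent overall while one pool is nearly full.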