For optimal cluster performance, keep in mind the following OneFS Job Engine considerations:
- When reconfiguring the default priority, schedule, and impact profile for a job, consider the following questions:
  - What resources will be affected?
  - What would I be gaining or losing if I reprioritized this job?
  - What are my impact options and their respective benefits and drawbacks?
  - How long will the job run and what other jobs will it contend with?
- SyncIQ, the OneFS replication product, does not use Job Engine. However, it has both influenced and been strongly influenced by the Job Engine’s design. SyncIQ also terms its operations "jobs," and its processes and terminology bear some similarity to Job Engine. The Job Engine impact management framework is aware of the resources consumed by SyncIQ, in addition to client load, and throttles jobs accordingly.
- A job with a name suffixed by Lin, for example FlexProtectLin, scans an SSD-based copy of the LIN tree metadata, rather than accessing the hard drives themselves. This process can significantly improve job performance, depending on the specific workflow.
- OneFS 9.2 and later versions default to running the Lin-based version of the AutoBalance and FlexProtect jobs when these jobs start automatically, where appropriate.
- When more than three jobs with the same priority level and no exclusion set restrictions are scheduled to run simultaneously, the three jobs with the lowest job ID value run, and the remainder are paused.
- Only one mark cookie is available for jobs within the marking exclusion set. So, if marking job A is interrupted by marking job B, job A will be automatically canceled when it resumes after the completion of job B.
- If mixed-node (heterogeneous) clusters do not have a license for the OneFS SmartPools data tiering module, the SetProtectPlus job will run instead and apply the default file policy. SetProtectPlus will be automatically disabled if a valid SmartPools license key is added to the cluster.
- In OneFS 8.2 and later, FlexProtect does not pause when there is only one temporarily unavailable device in a disk pool, or when a device is smart failed or dead. In OneFS versions earlier than 8.2, FlexProtect is the only job allowed to run on a cluster in degraded mode. Other jobs are automatically paused and do not resume until FlexProtect has completed and the cluster is healthy again.
- Restriping jobs block each other only when the current phase might perform restriping. This is most evident with MultiScan, whose final phase only sweeps rather than restripes. Similarly, MediaScan, which rarely restripes, can usually run to completion without contending with other restriping jobs.
- MediaScan restripes in phases 3 and 5 of the job and only if there are disk errors (ECCs) that require data reprotection. If MediaScan reaches phase 3 with ECCs, it will pause until AutoBalanceLin is no longer running. A MediaScan priority in the range of 1 through 3 would cause AutoBalanceLin to pause instead.
- Disabling default services such as MediaScan can result in file system integrity issues if the job is not allowed to complete a full scan for hardware errors and failures.
- If two jobs reach their restriping phases simultaneously and the jobs have different priorities, the higher-priority job (that is, the job with a priority value closer to 1) will continue to run, and the other job will pause. If the two jobs have the same priority, the job already in its restriping phase will continue to run, and the job that is newly entering its restriping phase will pause.
- During MediaScan’s verify and repair phases, attempts to re-read bad sectors can occasionally cause drives to stall briefly while trying to correct the error. This interruption is typically brief and limited.
- When a cluster is low on free space, new jobs are not started, and any jobs that are not space-saving are paused. Once the cluster is no longer space-constrained, any paused jobs are automatically resumed.
- MultiScan starts when data is unbalanced within one or more disk pools or when drives have been unavailable for long enough to warrant a Collect job run.
- The FilePolicy and FSAnalyze jobs automatically share the same snapshots and index, created and managed by the IndexUpdate job.
- When a cluster running FSAnalyze is upgraded to OneFS 8.2 or later, the legacy FSAnalyze index and snapshots are removed and replaced by new snapshots the first time that IndexUpdate runs. The new index stores considerably more file and snapshot attributes than the old FSA index. Until the IndexUpdate job makes this change, FSA keeps running on the old index and snapshots.
- OneFS uses the TreeDelete job to remove a writable snapshot and unlink all its contents. Running the isi snapshots writable delete CLI command automatically queues a TreeDelete instance, which the Job Engine runs asynchronously to remove and clean up a writable snapshot’s namespace and contents. However, the TreeDelete job processing, and hence the data deletion, is not instantaneous. Instead, the writable snapshot’s directories and files are moved to a temporary *.deleted directory.
- In OneFS 9.3, the Job Engine and restriping jobs support writable snapshots. In general, most jobs can be run from inside a writable snapshot’s path. However:
  - Jobs involving tree walks do not perform copy-on-read for LINs under writable snapshots.
  - The PermissionsRepair job cannot fix files under a writable snapshot that have not yet undergone copy-on-read. Before starting the PermissionsRepair job, you can run the find CLI command (which searches for files in a directory hierarchy) on the writable snapshot’s root directory to populate the writable snapshot’s namespace.
  - The TreeDelete job works for subdirectories under a writable snapshot. Running TreeDelete on or above a writable snapshot does not remove the root, or head, directory of the writable snapshot (unless scheduled through the writable snapshot library).
  - The ChangeList, FileSystemAnalyze, and IndexUpdate jobs cannot see files in a writable snapshot. As such, the FilePolicy job, which relies on the IndexUpdate job, cannot manage files in a writable snapshot.
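
The concurrency and restriping rules described in the list above can be illustrated with a small Python model. This is a sketch only, not Job Engine code. The `Job` class, both function names, and the job IDs and priority values are hypothetical. Ordering jobs of different priorities by priority value is an assumption, based on the statement that a value closer to 1 means higher priority. Exclusion sets, impact policies, and free-space checks are not modeled.

```python
from dataclasses import dataclass

# Concurrency limit taken from the "more than three jobs" rule above.
MAX_CONCURRENT_JOBS = 3


@dataclass(frozen=True)
class Job:
    """Hypothetical stand-in for a queued job (not a OneFS type)."""
    job_id: int
    name: str
    priority: int  # 1 is the highest priority


def select_running(jobs, limit=MAX_CONCURRENT_JOBS):
    """Split jobs into (running, paused).

    Jobs with the same priority are ordered by job ID, so the lowest
    IDs run first. Sorting by priority first is an assumption.
    """
    ordered = sorted(jobs, key=lambda j: (j.priority, j.job_id))
    return ordered[:limit], ordered[limit:]


def resolve_restripe_contention(in_phase, entering):
    """Return (continues, pauses) for two contending restriping jobs.

    in_phase is already in its restriping phase; entering is about to
    start one. The entering job wins only if its priority is strictly
    higher (numerically lower); on a tie, the job already restriping
    keeps running.
    """
    if entering.priority < in_phase.priority:
        return entering, in_phase
    return in_phase, entering


if __name__ == "__main__":
    # Four jobs at the same (illustrative) priority.
    queue = [Job(12, "JobA", 5), Job(10, "JobB", 5),
             Job(11, "JobC", 5), Job(13, "JobD", 5)]
    running, paused = select_running(queue)
    print("running:", [j.job_id for j in running])
    print("paused:", [j.job_id for j in paused])

    # Illustrative priorities only; check the values on your cluster.
    balance = Job(20, "AutoBalanceLin", 4)
    scan = Job(21, "MediaScan", 8)
    winner, loser = resolve_restripe_contention(balance, scan)
    print(winner.name, "continues;", loser.name, "pauses")
```

With the sample queue, the three jobs with the lowest IDs run and the fourth is paused. In the second case, giving the hypothetical MediaScan entry a priority value lower than that of AutoBalanceLin reverses the outcome, which matches the MediaScan priority behavior described above.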