For optimal cluster performance, bear in mind the following OneFS Job Engine considerations:
- When reconfiguring the default priority, schedule, and impact profile for a job, consider the following questions:
  - What resources am I impacting?
  - What would I be gaining or losing if I reprioritized this job?
  - What are my impact options and their respective benefits and drawbacks?
  - How long will the job run and what other jobs will it contend with?
- SyncIQ, the OneFS replication product, does not use Job Engine. However, it has both influenced, and been strongly influenced by, the Job Engine design. SyncIQ also terms its operations "jobs," and its processes and terminology bear some similarity to Job Engine. The Job Engine impact management framework is aware of the resources consumed by SyncIQ, in addition to client load, and will throttle jobs accordingly.
- A job with a name suffixed by ‘Lin’, for example FlexProtectLin, indicates that this job will scan an SSD-based copy of the LIN tree metadata, rather than access the hard drives themselves. This can significantly improve job performance, depending on the specific workflow.
- When more than three jobs with the same priority level and no exclusion set restrictions are scheduled to run simultaneously, the three jobs with the lowest job ID value will run. The remainder will be paused.
- There is only one ‘mark cookie’ available for marking jobs. So, if marking job A is interrupted by another marking job B, job A will automatically be cancelled when it resumes after the completion of B.
- For mixed node (heterogeneous) clusters that do not have a license for the SmartPools data tiering module, the SetProtectPlus job will run instead of the SmartPools job and apply the default file policy. SetProtectPlus will be automatically disabled if a valid SmartPools license key is added to the cluster.
- By default, FlexProtect is the only job allowed to run if a cluster is in degraded mode. Other jobs will automatically be paused and will not resume until FlexProtect has completed and the cluster is healthy again.
- Restriping jobs only block each other when the current phase might perform restriping. This is most evident with MultiScan, whose final phase only sweeps rather than restripes. Similarly, MediaScan, which rarely restripes, can usually run to completion without contending with other restriping jobs.
- MediaScan restripes in phases 3 and 5 of the job, and only if there are disk errors (ECCs) which require data reprotection. If MediaScan reaches its third phase with ECCs, it will pause until AutoBalanceLin is no longer running. If MediaScan's priority were in the range 1-3, it would cause AutoBalanceLin to pause instead.
- If two jobs happen to reach their restriping phases simultaneously and the jobs have different priorities, the higher priority job (priority value closer to “1”) will continue to run, and the other will pause. If the two jobs have the same priority, the job that is already in its restriping phase will continue to run, and the job newly entering its restriping phase will pause.
- During MediaScan’s verify and repair phases, attempts to re-read bad sectors can occasionally cause drives to stall while they try to correct the error. This interruption is typically brief and limited.
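
The concurrency and restriping rules described above can be sketched as a small model. This is an illustrative Python sketch only, not Job Engine code: the `Job` class, the function names, and the job IDs and priority values in the usage note are all hypothetical, and the model ignores exclusion sets, impact policies, and degraded mode.

```python
from dataclasses import dataclass

# Limit taken from the text above: at most three jobs of the same
# priority (with no exclusion set restrictions) run at once.
MAX_CONCURRENT_JOBS = 3


@dataclass(frozen=True)
class Job:
    """Hypothetical stand-in for a Job Engine job."""
    job_id: int
    name: str
    priority: int  # 1 is the highest priority


def select_running(same_priority_jobs):
    """Split jobs of equal priority into (running, paused).

    The jobs with the lowest job IDs run; the remainder are paused.
    """
    ordered = sorted(same_priority_jobs, key=lambda job: job.job_id)
    return ordered[:MAX_CONCURRENT_JOBS], ordered[MAX_CONCURRENT_JOBS:]


def resolve_restripe_contention(current, incoming):
    """Return (continues, pauses) for two jobs contending to restripe.

    `current` is already in its restriping phase and `incoming` is just
    entering one. A strictly higher priority (lower value) wins;
    on a tie, the job already restriping keeps running.
    """
    if incoming.priority < current.priority:
        return incoming, current
    return current, incoming
```

For example, with four same-priority jobs whose (hypothetical) IDs are 12, 7, 9, and 15, `select_running` keeps jobs 7, 9, and 12 running and pauses job 15. Likewise, if AutoBalanceLin is modeled at priority 4 and MediaScan at priority 8 (values assumed here for illustration), `resolve_restripe_contention` pauses MediaScan, whereas a MediaScan priority of 2 would pause AutoBalanceLin instead.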
For more information, see the PowerScale OneFS Job Engine white paper.