As jobs are processed, the coordinator consolidates the task status from the constituent nodes and periodically writes the results to checkpoint files. These checkpoint files allow jobs to be paused and resumed, either proactively or if a cluster outage occurs. For example, if the node on which the Job Engine coordinator is running goes offline for any reason, a new coordinator automatically starts on another node. This new coordinator reads the last consistency checkpoint file, and job control and task processing resume across the cluster with no work lost.
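The checkpoint-and-resume flow can be pictured with a short sketch. The snippet below is an illustrative model only, not OneFS code: the JSON checkpoint format, the file name, and the `write_checkpoint`/`resume_from_checkpoint` helpers are assumptions made for the example.

```python
import json
import os
import tempfile

# Hypothetical checkpoint location and format for illustration only.
CHECKPOINT = "/ifs/.ifsvar/modules/jobengine/cp/example_job/results/checkpoint.json"

def write_checkpoint(state: dict, path: str = CHECKPOINT) -> None:
    """Persist the consolidated task state so a new coordinator can resume."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename: readers never see a partial checkpoint

def resume_from_checkpoint(path: str = CHECKPOINT) -> dict:
    """A newly elected coordinator reads the last checkpoint and picks up where it left off."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"completed_tasks": [], "pending_tasks": []}  # no checkpoint yet: start fresh
```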
Job Engine checkpoint files for a given job are stored in the results and tasks subdirectories under the path /ifs/.ifsvar/modules/jobengine/cp/<job_id>/. On large clusters, or when a job is running at HIGH impact, many checkpoint files can be accessed from all nodes at once, which might result in contention on these directories. In OneFS 8.2 and later, checkpoints are split across 16 subdirectories under both tasks and results to alleviate this bottleneck.
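To see how the layout fans out on disk, the following sketch walks a job's checkpoint directory and counts files per subdirectory. It assumes only the path structure described above; the job ID passed in is a placeholder, so substitute an actual job ID from the cluster.

```python
from collections import Counter
from pathlib import Path

def checkpoint_file_counts(job_id: str) -> Counter:
    """Count checkpoint files under each tasks/results subdirectory for one job."""
    base = Path(f"/ifs/.ifsvar/modules/jobengine/cp/{job_id}")
    counts = Counter()
    for kind in ("tasks", "results"):
        # In OneFS 8.2 and later, each of these holds 16 subdirectories.
        for subdir in sorted((base / kind).glob("*")):
            if subdir.is_dir():
                counts[f"{kind}/{subdir.name}"] = sum(
                    1 for p in subdir.iterdir() if p.is_file()
                )
    return counts

if __name__ == "__main__":
    for name, count in checkpoint_file_counts("1234").items():  # "1234" is a placeholder job ID
        print(f"{name}: {count}")
```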