Job engine monitoring

Monitoring job performance and resource utilization is critical for the successful administration and troubleshooting of large clusters. A variety of Job Engine-specific metrics are available through the OneFS CLI, including per-job disk usage. For example, worker statistics and job-level resource usage can be viewed with the ‘isi job statistics list’ CLI command. Additionally, the status of the Job Engine workers is available through the ‘isi job statistics view’ command.
Job events, including pause/resume, waiting, phase completion, job success, and failure, are reported. A comprehensive job report is also provided for each phase of a job. This report contains detailed information about runtime, CPU, drive, and memory utilization, the number of data and metadata objects scanned, and other work details or errors specific to the job type. While a job is running, an Active Job Details report is also available. It provides contextual information such as elapsed time, current job phase, and job progress status.
For inode (LIN) based jobs, progress as an estimated percentage completion is also displayed, based on processed LIN counts.
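As a rough illustration of a count-based progress estimate, the Python sketch below divides processed LINs by an expected total. The function name and the clamping behavior are assumptions made for this example; the exact calculation OneFS performs is not described here.

```python
def estimate_percent_complete(processed_lins: int, total_lins: int) -> float:
    """Estimate job progress as a percentage of LINs processed.

    Hypothetical helper for illustration only; not a OneFS API.
    """
    if total_lins <= 0:
        # No usable total is known yet, so report no progress
        # rather than dividing by zero.
        return 0.0
    # Clamp at 100 in case the total was an underestimate.
    return min(100.0, 100.0 * processed_lins / total_lins)


if __name__ == "__main__":
    print(f"{estimate_percent_complete(250, 1000):.1f}%")
```

Because the total is itself an estimate, a value computed this way is best read as approximate.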
Detailed, granular job performance information and statistics are available in a job’s report. These include CPU and memory utilization for each job phase (minimum, maximum, and average), and total read and write IOPS and throughput.
In addition to detailed NFS, SMB, and S3 protocol and workflow breakdowns, OneFS also includes a real-time job performance resource monitoring framework, which provides statistics for the resources used by jobs, both cluster-wide and per-node. This information is provided by the ‘isi statistics workload’ CLI command. Available in a ‘top’ format, this command displays the top jobs and processes, and periodically updates the information.
For example, the following syntax shows, and indefinitely refreshes, the top five processes on a cluster:
# isi statistics workload --limit 5 --format=top
last update: 2018-06-11T06:45:25 (s)ort: default
CPU   Reads  Writes  L2    L3     Node  SystemName  JobType
1.4s  9.1k   0.0     3.5k  497.0  2     Job: 237    IntegrityScan[0]
1.2s  85.7   714.7   4.9k  0.0    1     Job: 238    Dedupe[0]
1.2s  9.5k   0.0     3.5k  48.5   1     Job: 237    IntegrityScan[0]
1.2s  7.4k   541.3   4.9k  0.0    3     Job: 238    Dedupe[0]
1.1s  7.9k   0.0     3.5k  41.6   2     Job: 237    IntegrityScan[0]
The resource statistics tracked per job, per job phase, and per node include CPU, reads, writes, and L2 and L3 cache hits. Unlike the output of the generic ‘top’ command, this per-job breakdown makes it easier to diagnose resource issues with individual jobs.
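To show how the per-node rows in the sample output above could be rolled up into cluster-wide totals per job, here is a minimal Python sketch. It assumes the row layout matches the sample exactly (nine whitespace-separated tokens), that a trailing ‘k’ means a multiple of 1,000, and that a trailing ‘s’ marks seconds; the function names are hypothetical and not part of OneFS.

```python
from collections import defaultdict

# Data rows copied from the sample output above (header lines omitted).
SAMPLE = """\
1.4s 9.1k 0.0  3.5k 497.0 2 Job: 237  IntegrityScan[0]
1.2s 85.7 714.7  4.9k 0.0 1 Job: 238  Dedupe[0]
1.2s 9.5k 0.0  3.5k 48.5 1 Job: 237  IntegrityScan[0]
1.2s 7.4k 541.3  4.9k 0.0 3 Job: 238  Dedupe[0]
1.1s 7.9k 0.0  3.5k 41.6 2 Job: 237  IntegrityScan[0]
"""


def to_number(token: str) -> float:
    """Convert tokens such as '1.4s' or '9.1k' to a float.

    Assumption: 's' is a seconds suffix and 'k' means x1000.
    """
    token = token.rstrip("s")
    if token.endswith("k"):
        return float(token[:-1]) * 1000
    return float(token)


def parse_row(line: str) -> dict:
    """Split one data row into named fields."""
    cpu, reads, writes, l2, l3, node, _label, job_id, job_type = line.split()
    return {
        "cpu": to_number(cpu),
        "reads": to_number(reads),
        "writes": to_number(writes),
        "l2": to_number(l2),
        "l3": to_number(l3),
        "node": int(node),
        "job_id": int(job_id),
        "job_type": job_type,
    }


def totals_by_job(text: str) -> dict:
    """Sum CPU, reads, and writes across nodes for each job ID."""
    totals = defaultdict(lambda: {"cpu": 0.0, "reads": 0.0, "writes": 0.0})
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = parse_row(line)
        for key in ("cpu", "reads", "writes"):
            totals[row["job_id"]][key] += row[key]
    return dict(totals)


if __name__ == "__main__":
    for job_id, stats in sorted(totals_by_job(SAMPLE).items()):
        print(job_id, stats)
```

Summing across nodes gives a quick cluster-wide view per job, while the individual rows remain useful for spotting a single node that is doing most of the work.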
