When a Job Engine job must work on a large portion of the file system, it uses one of the following primary methods:
The most straightforward access method is through metadata, using a logical inode (LIN) scan. Besides being simple to parallelize, a LIN scan also provides an accurate way of determining the amount of work required.
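To illustrate why a LIN scan is easy to run in parallel, the following minimal sketch divides an inclusive LIN range into contiguous chunks, one per task. The function name `split_lin_range` and the assumption that LINs form a dense integer range are hypothetical, chosen only for illustration; this is not the Job Engine's actual work-division logic.

```python
def split_lin_range(first_lin, last_lin, num_tasks):
    """Split the inclusive range [first_lin, last_lin] into contiguous chunks.

    Hypothetical helper: assumes LINs are dense integers, which is a
    simplification made for this sketch.
    """
    total = last_lin - first_lin + 1
    base, extra = divmod(total, num_tasks)
    chunks = []
    start = first_lin
    for i in range(num_tasks):
        # The first `extra` chunks take one additional LIN each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more tasks than LINs: skip empty chunks
        chunks.append((start, start + size - 1))
        start += size
    return chunks


chunks = split_lin_range(1, 1000, 4)
print(chunks)  # [(1, 250), (251, 500), (501, 750), (751, 1000)]
```

Because the total number of LINs is known up front, each chunk has a known size, which is also what makes progress easy to estimate.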
A directory tree walk is the traditional access method. It works similarly to common UNIX utilities such as find, albeit in a far more distributed way. For parallelism, each job task is assigned a separate subdirectory tree. Unlike LIN scans, tree walks can be heavily unbalanced, because subdirectory depths and file counts vary.
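The imbalance described above can be seen in a small sketch. Here a directory tree is modeled as a nested dictionary (a value of `None` marks a file), and each top-level subdirectory is treated as one task's share of the work. The tree contents and the `count_files` helper are hypothetical and exist only to illustrate the point.

```python
def count_files(tree):
    """Recursively count files in a nested dict; a value of None marks a file."""
    total = 0
    for child in tree.values():
        if child is None:
            total += 1
        else:
            total += count_files(child)
    return total


# Hypothetical directory tree: three top-level subdirectories of very
# different depth and file count.
root = {
    "home": {
        "alice": {
            "a.txt": None,
            "b.txt": None,
            "src": {"main.c": None, "util.c": None},
        },
        "bob": {"notes.md": None},
    },
    "tmp": {},
    "logs": {"app.log": None},
}

# One task per top-level subdirectory, as in a simple tree walk.
work_per_task = {name: count_files(subtree) for name, subtree in root.items()}
print(work_per_task)  # {'home': 5, 'tmp': 0, 'logs': 1}
```

In this example the task assigned to `home` does five times the work of the task assigned to `logs`, while the `tmp` task finishes immediately, and none of this is known until the walk has been performed.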
Disk drives provide excellent linear read access, so a drive scan can deliver orders-of-magnitude better performance than a directory tree walk or LIN scan for jobs that do not require insight into file system structure. As such, drive scans are ideal for jobs such as MediaScan, which linearly traverses each node’s disks looking for bad disk sectors.
Some Job Engine jobs use a changelist rather than LIN-based scanning. The changelist approach compares two snapshots to find the LINs that changed between them (the delta), and then determines the exact changes.
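The two-phase idea behind a changelist (find the changed LINs first, then work out what changed) can be sketched as follows. Each snapshot is modeled as a dictionary mapping a LIN to a metadata tuple; this representation, the function names, and the sample data are all assumptions made for illustration, and the brute-force comparison shown here is not how OneFS computes snapshot deltas.

```python
def changed_lins(old_snap, new_snap):
    """Phase 1: return the sorted LINs whose metadata differs between snapshots."""
    all_lins = old_snap.keys() | new_snap.keys()
    return sorted(
        lin for lin in all_lins if old_snap.get(lin) != new_snap.get(lin)
    )


def describe_change(lin, old_snap, new_snap):
    """Phase 2: classify the exact change for one LIN from phase 1."""
    if lin not in old_snap:
        return "added"
    if lin not in new_snap:
        return "removed"
    return "modified"


# Hypothetical snapshots: LIN -> (name, size in bytes).
old_snap = {100: ("a.txt", 10), 101: ("b.txt", 20), 102: ("c.txt", 30)}
new_snap = {100: ("a.txt", 10), 101: ("b.txt", 25), 103: ("d.txt", 5)}

delta = changed_lins(old_snap, new_snap)
changelist = {lin: describe_change(lin, old_snap, new_snap) for lin in delta}
print(changelist)  # {101: 'modified', 102: 'removed', 103: 'added'}
```

Only the three LINs in the delta are examined in the second phase; LIN 100, which is identical in both snapshots, is never revisited.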