For optimal cluster performance, Dell Technologies recommends observing the following SmartDedupe best practices:
- Deduplication is most effective when applied to datasets with a low rate of change—for example, archived data.
- Enable SmartDedupe to run at subdirectory levels below /ifs rather than on the /ifs root itself (see the path-configuration sketch after this list).
- Avoid adding more than 10 subdirectory paths to the SmartDedupe configuration policy.
- SmartDedupe is ideal for home directories, departmental file shares, and warm and cold archive datasets.
- Run SmartDedupe against a smaller sample dataset first to evaluate the performance impact relative to the space savings achieved.
- Schedule deduplication jobs to run during the cluster's low-usage hours, such as overnight or on weekends (see the scheduling sketch after this list).
- After the initial deduplication job has completed, schedule incremental deduplication jobs to run every two weeks or so, depending on the size and rate of change of the dataset.
- Always run SmartDedupe with the default low-impact Job Engine policy (also shown in the scheduling sketch after this list).
- Run the deduplication assessment job on a single root directory at a time. If multiple directory paths are assessed in the same job, the job reports a single combined savings estimate, so you cannot determine which directory should be deduplicated (see the assessment sketch after this list).
- When replicating deduplicated data, verify that the logical size of the dataset (that is, the storage space saved plus the storage space actually consumed) does not exceed the total available space on the target cluster; otherwise the target can run out of space. The statistics sketch after this list shows how to view these figures.
- Run a deduplication job on an appropriate dataset before enabling a snapshots schedule.
- Where possible, perform any snapshot restores (reverts) before running a deduplication job, and run a deduplication job directly after restoring a prior snapshot version.
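
As a minimal sketch of the subdirectory-level configuration described above, the following OneFS CLI session points SmartDedupe at specific paths below /ifs. The directory paths are hypothetical examples, and exact flag names can vary between OneFS releases, so verify them with `isi dedupe settings modify --help` on your cluster.

```shell
# Configure SmartDedupe to run against specific subdirectories rather
# than the /ifs root (paths here are hypothetical; keep the total
# number of configured paths under 10, per the guidance above).
isi dedupe settings modify --paths=/ifs/data/home,/ifs/data/archive

# Confirm the configured deduplication paths.
isi dedupe settings view
```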
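To size up a candidate dataset one root directory at a time, the assessment job can be run before committing to deduplication. This is a sketch assuming the `--assess-paths` setting and the `DedupeAssessment` job type; the path is again a hypothetical example.

```shell
# Assess a single root directory so the report unambiguously attributes
# the projected savings to that directory.
isi dedupe settings modify --assess-paths=/ifs/data/home
isi job jobs start DedupeAssessment

# After the job completes, review the projected space savings.
isi dedupe reports list
```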
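The scheduling and impact-policy recommendations can be combined along these lines. The schedule-string syntax below is an assumption and differs between OneFS releases, so check `isi job types view Dedupe` for the format your cluster accepts.

```shell
# Start the initial deduplication job manually, keeping the default
# low-impact Job Engine policy.
isi job jobs start Dedupe --policy LOW

# Schedule subsequent incremental runs for low-usage hours, roughly
# every two weeks (schedule-string syntax is release-dependent).
isi job types modify Dedupe --schedule "every other saturday at 23:00"
```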
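For the replication capacity check, cluster-wide deduplication statistics report both the space saved and the space actually consumed; their sum approximates the logical size that the target cluster must be able to accommodate.

```shell
# Report cluster-wide deduplication savings; the logical size of the
# dataset is approximately the deduplicated usage plus the space saved.
isi dedupe stats
```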