Deduplication is a compromise: to gain increased storage efficiency, additional cluster resources (CPU, memory, and disk I/O) are consumed to find and share common data blocks.
Another important performance consideration with deduplication is the potential for data fragmentation. After deduplication, files that previously had a contiguous on-disk layout often have their blocks spread across less optimal file system regions. This can slightly increase latency when these files are read directly from disk rather than from cache. To reduce this risk, SmartDedupe does not share blocks across node pools or data tiers, and it does not attempt to deduplicate files smaller than 32 KB. At the other end of the spectrum, the largest contiguous region that is matched is 4 MB.
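To make these constraints concrete, the following is a minimal Python sketch of a candidate filter that applies the same rules (32 KB minimum file size, no sharing across node pools or tiers, and a 4 MB cap on a contiguous matched region). The names and structure here are hypothetical and illustrative only; they do not reflect the actual SmartDedupe implementation.

```python
# Illustrative sketch of deduplication candidate filtering, using the
# thresholds described above. Names are hypothetical, not OneFS internals.

from dataclasses import dataclass

MIN_FILE_SIZE = 32 * 1024            # files smaller than 32 KB are skipped
MAX_MATCH_REGION = 4 * 1024 * 1024   # largest contiguous matched region: 4 MB

@dataclass
class FileInfo:
    path: str
    size: int
    node_pool: str   # blocks are never shared across node pools or tiers

def is_dedupe_candidate(f: FileInfo) -> bool:
    """Return True if a file is eligible for deduplication scanning."""
    return f.size >= MIN_FILE_SIZE

def can_share_blocks(a: FileInfo, b: FileInfo) -> bool:
    """Blocks are only shared between files in the same node pool/tier."""
    return a.node_pool == b.node_pool

def split_match_regions(match_length: int) -> list[int]:
    """Split a run of matching bytes into regions of at most 4 MB each."""
    regions = []
    while match_length > 0:
        chunk = min(match_length, MAX_MATCH_REGION)
        regions.append(chunk)
        match_length -= chunk
    return regions
```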
Because deduplication is a data efficiency product rather than a performance-enhancing tool, the main consideration is usually managing its impact on the cluster. This impact has two aspects: client data access performance, because multiple files share common data blocks by design, and deduplication job processing, because additional cluster resources are consumed to detect and share commonality.
The first deduplication job often takes a substantial amount of time to complete because it must scan all files under the specified directories to generate the initial index and then create the appropriate shadow stores. However, deduplication job performance typically improves significantly on the second and subsequent (incremental) runs, once the initial index and the bulk of the shadow stores have already been created.
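As a conceptual model of why incremental runs are faster (this is not SmartDedupe's actual implementation), picture a persistent block-fingerprint index that only the first run must populate from scratch; later runs only do new work for blocks whose fingerprints are not yet indexed:

```python
# Conceptual model: the first pass hashes every block and fills the index
# (the expensive run); incremental passes reuse it. Purely illustrative.

import hashlib

def block_fingerprint(block: bytes) -> str:
    return hashlib.sha1(block).hexdigest()

def dedupe_pass(blocks: list[bytes], index: dict[str, int]) -> int:
    """Scan blocks against a persistent index; return count of new entries."""
    new_entries = 0
    for i, block in enumerate(blocks):
        fp = block_fingerprint(block)
        if fp not in index:
            index[fp] = i   # stand-in for a reference into a shadow store
            new_entries += 1
    return new_entries

index: dict[str, int] = {}
data = [b"a" * 8192, b"b" * 8192, b"a" * 8192]
print(dedupe_pass(data, index))  # first run: 2 new entries (one duplicate)
print(dedupe_pass(data, index))  # incremental run: 0 new entries
```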
If incremental deduplication jobs take a long time to complete, this most likely indicates a dataset with a high rate of change. If a deduplication job is paused or interrupted, it automatically resumes the scan from where it left off.
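The resume behavior can be sketched as simple checkpointing of scan progress. The pattern below is illustrative only; the file name, helpers, and granularity are assumptions, not the actual OneFS job engine logic:

```python
# Illustrative checkpointing pattern: persist the last completed position
# so an interrupted scan resumes where it left off.

import json
import os

CHECKPOINT_FILE = "dedupe_scan.checkpoint"  # hypothetical path

def load_checkpoint() -> int:
    """Return the index of the next file to scan (0 if starting fresh)."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["next_index"]
    return 0

def save_checkpoint(next_index: int) -> None:
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"next_index": next_index}, f)

def scan_files(files: list[str]) -> None:
    start = load_checkpoint()
    for i in range(start, len(files)):
        process(files[i])        # hypothetical per-file dedupe scan
        save_checkpoint(i + 1)   # persist progress after each file

def process(path: str) -> None:
    pass  # placeholder for scanning a file's blocks
```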
As mentioned previously, deduplication is a long-running process that involves multiple job phases run iteratively. SmartDedupe typically processes around 1 TB of data per day, per node.
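Given this rate, a rough first-pass duration estimate can be derived as dataset size divided by aggregate throughput. The sketch below assumes the job scales linearly across nodes, which is a simplifying assumption rather than a documented guarantee; actual duration depends on cluster load, impact policy, and the dataset itself:

```python
# Back-of-the-envelope estimate from the ~1 TB/day/node rate cited above.
# Linear scaling across nodes is assumed here for illustration.

def estimated_days(dataset_tb: float, node_count: int,
                   tb_per_node_per_day: float = 1.0) -> float:
    return dataset_tb / (node_count * tb_per_node_per_day)

# Example: a 100 TB dataset on a 4-node cluster
print(f"{estimated_days(100, 4):.1f} days")  # -> 25.0 days
```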