Deduplication engine – sampling, fingerprinting, and matching | Dell PowerScale OneFS: Data Reduction and Storage Efficiency

One of the most fundamental components of SmartDedupe, and of deduplication in general, is ‘fingerprinting’. In this part of the deduplication process, a unique digital signature, or fingerprint, is calculated with the SHA-1 hashing algorithm for each 8 KB data block in the sampled set.
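As a minimal sketch of the fingerprinting step, the following Python computes one SHA-1 digest per 8 KB block of an in-memory byte string. It assumes every block is fingerprinted (no sampling) and skips a trailing partial block. The names `BLOCK_SIZE` and `fingerprint_blocks` are illustrative and are not part of OneFS.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # 8 KB block size, as described in the text


def fingerprint_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[tuple[int, str]]:
    """Return (block_number, sha1_hex) for each full block in data.

    Hypothetical helper for illustration only; a trailing partial
    block is ignored.
    """
    fingerprints = []
    for offset in range(0, len(data) - block_size + 1, block_size):
        block = data[offset:offset + block_size]
        fingerprints.append((offset // block_size, hashlib.sha1(block).hexdigest()))
    return fingerprints
```

Two blocks with identical contents always produce the same fingerprint, which is what makes the fingerprint usable as a candidate-match key.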
When SmartDedupe runs for the first time, it scans the data set and selectively samples data blocks from it to create the fingerprint index. This index contains a sorted list of the digital fingerprints, or hashes, and their associated blocks. After the index is created, the fingerprints are checked for duplicates. When a match is found, the sharing phase performs a byte-by-byte comparison of the blocks to verify that they are identical and that the match is not the result of a hash collision. If the blocks are identical, the duplicate block’s pointer is updated to reference the existing data block, and the duplicate data block is released.
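The index-then-verify flow can be sketched as follows. This is a simplified model, not the OneFS implementation: blocks live in a Python list, every block is indexed rather than sampled, and a "pointer" is just a list index. The function names `build_index` and `share_blocks` are hypothetical.

```python
import hashlib


def build_index(blocks: list[bytes]) -> list[tuple[bytes, int]]:
    """Sorted list of (sha1_digest, block_number), so equal digests are adjacent."""
    return sorted((hashlib.sha1(b).digest(), i) for i, b in enumerate(blocks))


def share_blocks(blocks: list[bytes]) -> list[int]:
    """Return a pointer table: pointers[i] is the block that holds block i's data."""
    pointers = list(range(len(blocks)))  # initially every block points to itself
    index = build_index(blocks)
    # Walk adjacent index entries; equal digests mark candidate duplicates.
    for (prev_digest, prev_i), (digest, i) in zip(index, index[1:]):
        if digest != prev_digest:
            continue
        target = pointers[prev_i]  # block the previous entry resolves to
        # A hash match is only a hint; confirm with a full byte comparison.
        if blocks[target] == blocks[i]:
            pointers[i] = target  # share the existing block; block i can be released
    return pointers
```

For example, with six blocks whose contents are A, B, A, C, B, A, the resulting pointer table is `[0, 1, 0, 3, 1, 0]`: only three distinct blocks need to remain stored.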
Hash computation and comparison are used only during the sampling phase; the block sharing phase employs full data comparison. The deduplication job phases are covered in detail below. SmartDedupe also operates on the premise of variable-length deduplication, where the block matching window is increased to encompass larger runs of contiguous matching blocks.
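One way to picture the widening of the matching window is the sketch below. Starting from a single confirmed matching block pair, it grows the match in both directions using full data comparison while neighboring blocks remain identical. Whether SmartDedupe extends backward as well as forward is an assumption here, and `extend_match` is a hypothetical name.

```python
def extend_match(a: list[bytes], b: list[bytes], i: int, j: int) -> tuple[int, int, int]:
    """Grow a confirmed match a[i] == b[j] into the longest contiguous run.

    Returns (start_a, start_b, length), with all values in blocks.
    Illustrative only; not the OneFS algorithm.
    """
    if a[i] != b[j]:
        raise ValueError("blocks at (i, j) do not match")
    start_a, start_b = i, j
    # Extend backward while the preceding blocks are identical.
    while start_a > 0 and start_b > 0 and a[start_a - 1] == b[start_b - 1]:
        start_a -= 1
        start_b -= 1
    end_a, end_b = i + 1, j + 1
    # Extend forward while the following blocks are identical.
    while end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]:
        end_a += 1
        end_b += 1
    return start_a, start_b, end_a - start_a
```

Treating a run of matching blocks as a single unit means one sampled fingerprint hit can lead to sharing many neighboring blocks that were never fingerprinted themselves.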