Dell PowerProtect Data Domain SISL Scaling Architecture
The basic algorithm for deduplication is to break the incoming data stream into segments in a repeatable way and compute a unique fingerprint for the segment. This fingerprint is then compared to all others in the system to determine whether it is unique or redundant. Only unique data is stored to disk. To its clients, the system appears to store the data in the usual way, but internally it does not use disk space to store the same segment repeatedly. Instead, it creates additional references to the previously stored unique segment.
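The basic loop above can be sketched in a few lines. This is a toy illustration, not the Data Domain implementation: it uses fixed-size segments and SHA-256 fingerprints, whereas the real system segments the stream in a content-dependent, variable-size way.

```python
import hashlib

def deduplicate(stream, segment_size=8 * 1024):
    """Toy dedup: split into segments, fingerprint each, store unique only.
    Fixed-size splitting is a simplification of real segmentation."""
    store = {}       # fingerprint -> segment data (unique segments only)
    references = []  # ordered fingerprints that reconstruct the stream
    for i in range(0, len(stream), segment_size):
        segment = stream[i:i + segment_size]
        fp = hashlib.sha256(segment).hexdigest()
        if fp not in store:       # unique segment: store it once
            store[fp] = segment
        references.append(fp)     # redundant segment: just add a reference
    return store, references

data = b"abcd" * 8192             # 32 KB stream with heavy repetition
store, refs = deduplicate(data)
# Four 8 KB segments arrive, but all are identical, so only one is stored;
# the client-visible stream is rebuilt from the four references.
```

Reconstructing the stream is then just concatenating `store[fp]` for each fingerprint in `references`, which is why clients see the data stored "in the usual way."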
For good data reduction, segments should be small: smaller segments are more likely to recur in more places. But smaller segments also mean more segments, and therefore more fingerprints to compute and compare. Data Domain deduplication technology uses a relatively small average segment size (8 KB, variable in size). This provides strong deduplication results and a flexible, application-independent store. After identifying unique segments, local compression, such as LZ, gz, or gzfast, is applied, and only that compressed data is stored to disk.
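The "compress only what is unique" step can be sketched as follows. This uses zlib as a stand-in for the LZ/gz/gzfast compressors named above; the store layout is illustrative only.

```python
import hashlib
import zlib

def store_segment(store, segment):
    """Store a segment only if its fingerprint is new, applying local
    compression (zlib here as a stand-in) to the unique data."""
    fp = hashlib.sha256(segment).digest()
    if fp not in store:
        store[fp] = zlib.compress(segment)  # only unique data hits disk
    return fp

store = {}
seg = b"A" * 8192                 # an 8 KB, highly compressible segment
fp = store_segment(store, seg)
# The compressed unique segment occupies far less than 8 KB on "disk",
# and any later identical segment adds no storage at all.
```

Redundant segments never reach the compressor, so local compression only spends CPU on data that will actually be written.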
The fingerprint index in this kind of approach can be an order of magnitude bigger than system RAM. As a result, it is typically stored on disk, so index lookups typically require a disk read for each incoming segment. That is where things get tricky.
This would mean that for 100 MB/s of throughput, a typical hash-based system would need about 100 disks. Here's why: a 500 GB SATA disk can sustain approximately 120 random disk accesses per second for index lookups. With an 8 KB segment size, a single disk can therefore support an incoming data rate of about 120 x 8 KB, or about 1 MB/s. To go faster, more disks are required to spread the access load. Such a system would be too expensive to compete with tape in most situations.
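The arithmetic behind those figures can be checked directly:

```python
# Back-of-the-envelope: why on-disk index lookups cap throughput.
lookups_per_sec = 120   # random accesses/s one SATA disk can sustain
segment_size_kb = 8     # average segment size from the text

# Each incoming segment costs one index lookup, i.e. one disk access.
per_disk_mb_s = lookups_per_sec * segment_size_kb / 1024   # ~0.94 MB/s

target_mb_s = 100
disks_needed = target_mb_s / per_disk_mb_s                 # ~107 disks
```

The result, roughly 1 MB/s per disk and on the order of 100 disks for a 100 MB/s target, matches the figures above.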
Simple alternatives are sub-optimal. One option is to use a much larger average segment size, but that significantly worsens deduplication and again makes the system uncompetitive with tape automation on configured price. Alternatively, faster Fibre Channel disks could be used, but for twice the speed they often cost 3x-5x more per gigabyte.
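The larger-segment option can be quantified with the same per-disk model (120 random accesses per second per SATA disk, as above); the segment sizes swept here are illustrative, not sizes the product uses:

```python
# Larger segments raise per-disk index-lookup throughput linearly,
# since each disk access admits more data -- but at a dedup cost,
# because large segments are far less likely to recur.
lookups_per_sec = 120
per_disk_mb_s = {kb: lookups_per_sec * kb / 1024 for kb in (8, 64, 128)}
# 8 KB segments  -> ~0.9 MB/s per disk
# 128 KB segments -> ~15 MB/s per disk, with much worse deduplication
```

The throughput gain is real, but it trades away exactly the small-segment property that makes the data reduction competitive in the first place.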
As shown in Figure 1, in a fingerprinting approach using high-capacity, low-cost SATA disks, random lookups of fingerprints for segments that average 8 KB limit throughput to about 1 MB/s per disk.