Appropriate capacity
If 100 disks are too much, how much is enough? A dedupe process with traditional compression and a conventional onsite retention period generally achieves >10x aggregate data reduction. A common onsite retention policy stores about 10x the amount of data in the starting set (for example, weekly full backups with daily incrementals, retained for two months). With roughly 10x logical data retained and roughly 10x reduction, it is reasonable to assume the dedupe store needs about as much physical capacity as the starting set of primary data being backed up to it.
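To make the arithmetic concrete, here is a minimal sketch of that sizing logic in Python. The 10x retention and 10x reduction figures are the illustrative assumptions above; the 5 TB primary dataset and the variable names are ours.

```python
# Sizing sketch using the illustrative numbers above: ~10x logical data
# retained on-site and ~10x aggregate dedupe + compression.

primary_tb = 5.0              # assumed size of the primary dataset (example)
retention_multiplier = 10     # fulls + incrementals kept on-site (~10x)
reduction_factor = 10         # aggregate dedupe + compression (>10x)

logical_stored_tb = primary_tb * retention_multiplier
physical_store_tb = logical_stored_tb / reduction_factor

print(f"Logical data retained:  {logical_stored_tb:.1f} TB")  # 50.0 TB
print(f"Physical dedupe store:  {physical_store_tb:.1f} TB")  # 5.0 TB
# The store ends up roughly the size of the primary dataset.
```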
If performance, rather than capacity, is the limiting factor in fitting a dataset into a backup window, the weekly full backup and its window typically determine how fast the system must be. The most demanding throughput configuration concentrates all full backups on one weekend day. With a 16-hour weekend window (leaving time to restart if a problem is found), at 100 MB/s the starting dataset must be smaller than about 5.76 TB (16 hours × 100 MB/s). A dedupe storage system using 500 GB drives would need only 12 drives for that capacity, before accounting for RAID parity and spares. Even with RAID 6, which adds two parity disks for a total of 14, that works out to 100 MB/s ÷ 14, or about 7 MB/s per disk. Projecting forward, if each disk stores 1 TB, half as many disks would be needed, so each would have to run twice as fast to stay on the curve dictated by capacity. Disk throughput does not scale this way.
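The same numbers can be checked with a short back-of-the-envelope script. The window, ingest rate, and drive sizes are the ones used in the text; the variable names are illustrative.

```python
# Back-of-the-envelope version of the backup-window math above.
import math

window_s = 16 * 3600          # 16-hour weekend backup window, in seconds
speed_mb_s = 100              # required ingest rate, MB/s

dataset_mb = window_s * speed_mb_s
print(f"Dataset that fits the window: {dataset_mb / 1e6:.2f} TB")     # ~5.76 TB

drive_gb = 500
data_disks = math.ceil(dataset_mb / (drive_gb * 1000))                # 12 drives
total_disks = data_disks + 2                                          # + 2 RAID 6 parity
print(f"Disks for capacity (RAID 6):  {total_disks}")                 # 14
print(f"Required per-disk rate:       {speed_mb_s / total_disks:.1f} MB/s")  # ~7 MB/s

# With 1 TB drives, half as many disks must each sustain twice the rate.
```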
As shown in Figure 2, in a scalable deduplication system the fingerprints must be indexed in an on-disk structure. In a naive design, determining whether a fingerprint is new and unique or redundant requires a disk seek per lookup. With typical average seek times and a small segment size for good compression, more disks are required to reach the target speed than are needed for capacity (Figure 2 assumes 500 GB, 7.2k rpm SATA disks and an average 8 KB segment size).
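A rough model shows why seeks, not capacity, set the disk count in a naive design. The ~10 ms random-I/O time assumed below is a typical ballpark for a 7.2k rpm SATA drive, not a figure from the text; with it, one disk sustains on the order of 1 MB/s of seek-bound index lookups at 8 KB segments, so reaching 100 MB/s takes on the order of 100 disks, versus 14 for capacity.

```python
# Seek-bound index lookups: why a naive on-disk fingerprint index needs far
# more spindles than capacity does. The seek time is an assumption; exact
# drive numbers vary.

seek_time_s = 0.010            # ~10 ms per random I/O (assumed ballpark)
segment_kb = 8                 # average segment size, as in Figure 2
target_mb_s = 100              # desired ingest rate

lookups_per_disk = 1 / seek_time_s                  # ~100 index lookups/s/disk
mb_per_disk = lookups_per_disk * segment_kb / 1000  # ~0.8 MB/s ingest per disk
disks_for_speed = target_mb_s / mb_per_disk

print(f"Per-disk seek-limited ingest: {mb_per_disk:.1f} MB/s")   # ~0.8 MB/s
print(f"Disks needed for 100 MB/s:    {disks_for_speed:.0f}")    # ~125, vs. 14 for capacity
```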