SmartDedupe and OneFS feature integration
When deduplicated files are replicated to another cluster using SyncIQ, or backed up to a tape device, the deduplicated files are inflated (or rehydrated) back to their original size, since they no longer share blocks on the target cluster. However, once replicated data has landed, SmartDedupe can be run on the target cluster to provide the same space efficiency benefits as on the source.
Shadow stores are not transferred to target clusters or backup devices. Because of this, deduplicated files do not consume less space than non-deduplicated files when they are replicated or backed up. To avoid running out of space on target clusters or tape devices, verify that the total amount of storage space saved plus the storage space consumed does not exceed the available space on the target cluster or tape device. To reduce the amount of storage space consumed on a target cluster, you can configure deduplication for the target directories of your replication policies. Although this deduplicates data on the target directory, it does not allow SyncIQ to transfer shadow stores; deduplication is still performed post-replication, by a deduplication job running on the target cluster.
Because files are backed up as if they were not deduplicated, backup and replication operations are not faster for deduplicated data. You can still deduplicate data while it is being replicated or backed up.
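A practical consequence of rehydration is that target capacity should be planned against the logical (pre-deduplication) size of the data, not its physical footprint on the source. The following Python sketch illustrates this arithmetic with hypothetical figures; the function name and numbers are illustrative, not OneFS internals:

```python
# Hypothetical illustration: deduplicated data rehydrates during
# replication or backup, so the target must hold the full logical size
# until a deduplication job is run there.

def required_target_capacity(logical_bytes, dedupe_saved_bytes):
    """Return (physical bytes on source, bytes the target must hold)."""
    physical_on_source = logical_bytes - dedupe_saved_bytes
    # Files land rehydrated on the target: plan for the logical size.
    return physical_on_source, logical_bytes

source_physical, target_needed = required_target_capacity(
    logical_bytes=100 * 2**40,       # 100 TiB of logical data
    dedupe_saved_bytes=30 * 2**40,   # 30 TiB saved by SmartDedupe
)
print(source_physical / 2**40)  # 70.0 TiB consumed on the source
print(target_needed / 2**40)    # 100.0 TiB needed on the target
```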
Note: OneFS NDMP backup data will not be deduplicated unless deduplication is provided by the backup vendor’s DMA software. However, compression is often provided natively by the backup tape or VTL device.
SmartDedupe will not deduplicate the data stored in a snapshot. However, snapshots can be created of deduplicated data. If a snapshot is taken of a deduplicated directory, and then the contents of that directory are modified, the shadow stores will be transferred to the snapshot over time. Because of this, more space will be saved on a cluster if deduplication is run prior to enabling snapshots.
If deduplication is enabled on a cluster that already has a significant amount of data stored in snapshots, it will take time before the snapshot data is affected by deduplication. Newly created snapshots will contain deduplicated data, but older snapshots will not.
It is also good practice to revert a snapshot before running a deduplication job. Restoring a snapshot will cause many of the files on the cluster to be overwritten. Any deduplicated files are reverted to normal files if they are overwritten by a snapshot revert. However, once the snapshot revert is complete, deduplication can be run on the directory again and the resulting space savings will persist on the cluster.
Deduplication of writable snapshot data is not supported. SmartDedupe will ignore the files under writable snapshots.
SmartDedupe is also fully compatible with SmartLock, OneFS’ data retention and compliance solution. SmartDedupe delivers storage efficiency for immutable archives and write once, read many (or WORM) protected data sets.
OneFS SmartQuotas accounts for deduplicated files as if they consumed both shared and unshared data. From the quota side, deduplicated files appear no differently than regular files to standard quota policies. However, if the quota is configured to include data-protection overhead, the additional space used by the shadow store will not be accounted for by the quota.
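To illustrate that accounting, a standard quota charges a deduplicated file at its full logical size, shared blocks included. The sketch below is a simplified model, not SmartQuotas internals; the function and figures are hypothetical:

```python
# Hypothetical sketch: standard quota accounting treats a deduplicated
# file as if it consumed all of its blocks, shared or not.

def quota_usage(files):
    """files: list of (logical_size, shared_bytes) tuples."""
    # Shared (shadow-store) bytes are still charged to the quota.
    return sum(logical for logical, _shared in files)

files = [
    (10 * 2**20, 8 * 2**20),  # 10 MiB file, 8 MiB shared via shadow store
    (10 * 2**20, 8 * 2**20),  # identical twin sharing the same blocks
]
print(quota_usage(files) // 2**20)  # 20 -> quota sees 20 MiB, not 12 MiB
```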
SmartDedupe does not deduplicate files that span SmartPools node pools or tiers, or that have different protection levels, access patterns, or caching configurations set. This is to avoid potential performance or protection asymmetry which could occur if portions of a file live on different classes of storage.
However, a deduplicated file that is moved by SmartPools to a different pool or tier retains its shadow references to the shadow store on the original pool. This breaks the rule against deduplicating across different disk pool policies, but it is less impactful than rehydrating files that are moved. Further dedupe activity on that file will no longer be allowed to reference any blocks in the original shadow store; the file will need to be deduplicated against other files in the same disk pool policy. If the file had not yet been deduplicated, the dedupe index may have knowledge of the file and will still assume it is on the original pool. This is discovered and corrected when a match is made against blocks in the file.
Because the moved file has already been deduped, the dedupe index will have knowledge of the shadow store only. Since the shadow store has not moved, it will not cause problems for further matching. However, if the shadow store is moved as well (but not both files), then a similar situation occurs and the SmartDedupe job will discover this and purge knowledge of the shadow store from the dedupe index.
SmartDedupe post-process deduplication is compatible with inline compression, currently available on the PowerScale F910, F900, F810, F710, F600, F210, F200, H700/7000, H5600, and A300/3000 platforms, and vice versa. Inline compression can compress OneFS shadow stores. However, for SmartDedupe to process compressed data, the SmartDedupe job must first decompress it in order to perform deduplication. In general, the additional capacity savings may not warrant the overhead of running SmartDedupe on node pools with inline deduplication enabled.
While OneFS has offered a native file system deduplication solution for several years, until OneFS 8.2.1 this was always accomplished by scanning the data after it has been written to disk, or post-process. With inline data reduction, deduplication is now performed in real time as data is written to the cluster. Storage efficiency is achieved by scanning the data for identical blocks as it is received and then eliminating the duplicates using shadow stores.
Since inline dedupe and SmartDedupe use different hashing algorithms, the indexes for each are not shared directly. However, the work performed by each dedupe solution can be leveraged by the other. For instance, if SmartDedupe writes data to a shadow store, then when those blocks are read, the read-hashing component of inline dedupe will see them and index them.
When a match is found, inline dedupe performs a byte-by-byte comparison of each block to be shared, to avoid the possibility of a hash collision. Data is prefetched prior to the byte-by-byte check and compared against the L1 cache buffer directly, avoiding unnecessary data copies and adding minimal overhead. Once the matching blocks have been compared and verified as identical, they are shared by writing the matching data to a common shadow store and creating references from the original files to this shadow store.
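The match-verify-share flow above can be sketched as follows. This is a simplified illustration, not OneFS internals: the dict stands in for the dedupe index, the list stands in for a shadow store, and SHA-1 is used only as a placeholder for whatever hash inline dedupe actually employs.

```python
import hashlib

# Hypothetical sketch of the inline-dedupe match flow: hash each block,
# and on an index hit verify byte-by-byte before sharing, so a hash
# collision can never cause two different blocks to be merged.

BLOCK_SIZE = 8192   # OneFS uses 8 KiB file system blocks

index = {}          # hash -> canonical block (stand-in for the dedupe index)
shadow_store = []   # shared blocks (stand-in for a shadow store)

def write_block(block: bytes):
    digest = hashlib.sha1(block).digest()
    match = index.get(digest)
    # Byte-by-byte comparison guards against a collision falsely
    # merging two different blocks that happen to share a hash.
    if match is not None and match == block:
        if match not in shadow_store:
            shadow_store.append(match)
        return ("shadow_ref", shadow_store.index(match))
    index[digest] = block
    return ("data", block)   # unique block written normally

first = write_block(b"x" * BLOCK_SIZE)
second = write_block(b"x" * BLOCK_SIZE)  # duplicate -> shadow reference
print(first[0], second[0])  # data shadow_ref
```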
Inline dedupe samples every whole block written and handles each block independently, so it can aggressively locate duplicate blocks. If a contiguous run of matching blocks is detected, inline dedupe merges the results into regions and processes them efficiently.
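The region-merging step can be sketched as a simple run-coalescing pass. Again a hypothetical illustration, not the actual OneFS implementation:

```python
# Hypothetical sketch of coalescing per-block matches into regions:
# contiguous runs of matching block numbers are merged so they can be
# shared as one extent rather than block by block.

def merge_matches(matched_blocks):
    """matched_blocks: sorted block numbers that matched the index."""
    regions = []
    for blk in matched_blocks:
        if regions and blk == regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], blk)  # extend the current run
        else:
            regions.append((blk, blk))           # start a new run
    return regions

regions = merge_matches([4, 5, 6, 9, 10, 15])
print(regions)  # [(4, 6), (9, 10), (15, 15)]
```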
Inline dedupe also detects dedupe opportunities from the read path: blocks are hashed as they are read into L1 cache and inserted into the index. If an entry already exists for that hash, inline dedupe knows there is a block sharing opportunity between the block it just read and the one previously indexed. It combines that information and queues a request to an asynchronous dedupe worker thread. As such, it is possible to deduplicate a data set purely by reading it all. To help mitigate the performance impact, all the hashing is performed out-of-band in the prefetch path, rather than in the latency-sensitive read path.
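The read-path behavior can be sketched as below. This is an illustrative model only: the dict stands in for the index, the queue for the asynchronous worker's work list, and SHA-1 for the real hash.

```python
import hashlib
from queue import Queue

# Hypothetical sketch: blocks are hashed as they enter cache, and when
# the index already holds that hash a share request is queued for an
# asynchronous dedupe worker instead of being handled in the
# latency-sensitive read path.

index = {}               # hash -> block address of the first sighting
dedupe_queue = Queue()   # work for the async dedupe worker thread

def on_read_into_cache(address: int, block: bytes):
    digest = hashlib.sha1(block).digest()
    seen_at = index.get(digest)
    if seen_at is not None and seen_at != address:
        # Sharing opportunity between this block and one indexed earlier.
        dedupe_queue.put((seen_at, address))
    else:
        index[digest] = address

on_read_into_cache(100, b"a" * 8192)
on_read_into_cache(200, b"a" * 8192)   # same data at a different address
item = dedupe_queue.get()
print(item)  # (100, 200)
```

Note that reading alone populates the queue, which is why a data set can be deduplicated purely by reading it all.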
Small File Storage Efficiency (SFSE) is mutually exclusive with the other shadow store consumers (file clones, inline dedupe, and SmartDedupe). Files can either be packed with SFSE or cloned/deduplicated, but not both. Inlined files (small files with their data stored in the inode) are not deduplicated, and non-inlined data files that have been deduplicated will not be inlined afterward.
InsightIQ, the Dell PowerScale multi-cluster reporting and trending analytics suite, is integrated with SmartDedupe. Included in the data provided by the File Systems Analytics module is a report detailing the space savings efficiency delivered by deduplication.