Dell Unity Data Reduction works the same for both Block and File storage resources. Data reduction uses a software algorithm to analyze and achieve space savings within a storage resource. The figure below is a high-level diagram of a storage resource with data reduction enabled. As shown, data reduction occurs inline between System Cache and the storage resource.
When data is written to the system, the data is saved in System Cache and the write is acknowledged to the host. The data reduction algorithm is not invoked for write I/Os at this point, to provide the fastest response to the host. Figure 2 below outlines an example of a write to a storage resource with data reduction enabled. No data has been written to the drives within the Pool at this time.
In Dell Unity, before a write is saved in System Cache, the system ensures space is available and allocated for the I/O within the target storage resource. Because all back-end allocations and lookups within the target resource are deferred until after a write is accepted into System Cache and the host is acknowledged, a portion of the private space within the storage resource’s overhead is tracked and reserved as a possible location for the I/O when data is accepted into cache. A storage resource’s private space is fixed in size and is allocated when the storage resource is created. After the I/O is acknowledged, the normal cache cleaning process occurs: space within the storage resource is utilized or allocated, if needed, and the data is saved to disk. This caching behavior applies not only to data reduction enabled resources but to all Block and File storage resources (excluding vVols).
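This write path lends itself to a short sketch. The following Python is a hypothetical illustration only, not Unity code; all class and method names are assumptions. It shows a write being reserved against the resource’s fixed private space, staged in System Cache, and acknowledged before any back-end allocation or data reduction occurs.

```python
# Hypothetical sketch of the write path described above; names are illustrative.
class StorageResource:
    def __init__(self, private_space_slots):
        # Private space is fixed in size and allocated at resource creation.
        self.private_slots = private_space_slots

    def reserve_private_slot(self):
        # Track a possible landing spot for the I/O before it enters cache.
        if self.private_slots <= 0:
            raise RuntimeError("no private space available for incoming I/O")
        self.private_slots -= 1


class SystemCache:
    def __init__(self):
        self.dirty_pages = []

    def write(self, resource, offset, data):
        resource.reserve_private_slot()           # ensure space before caching
        self.dirty_pages.append((resource, offset, data))
        return "ACK"                              # host acknowledged here; no
                                                  # disk I/O, no data reduction
```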
For data reduction enabled storage resources, the data reduction process occurs during System Cache’s proactive cleaning operations, or when System Cache is flushing cache pages to the drives within the Pool. The data in this scenario may be new to the storage resource, or it may be an update to existing blocks of data currently residing on disk. In either case, the data reduction algorithm runs before the data is written to the drives within the Pool. During the data reduction process, multiple blocks are aggregated together and sent through the algorithm. After determining whether savings can be achieved or the data needs to be written to disk, space within the Pool is allocated if needed and the data is written to the drives. A high-level diagram of this operation is displayed in the figure below.
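Continuing the sketch above, a hypothetical cleaning pass might look like the following. Here `reduce_blocks` stands in for the data reduction algorithm detailed in the rest of this section, and the toy `Pool` class is an assumption; the key point is that Pool space is allocated only after reduction has run.

```python
# Hypothetical proactive-cleaning pass; reduce_blocks is a stand-in for the
# data reduction algorithm sketched later in this section.
class Pool:
    def __init__(self):
        self.extents = []

    def allocate_and_write(self, payload):
        self.extents.append(payload)          # toy allocation: append an extent
        return len(self.extents) - 1          # return the new extent's index


def clean_cache(cache, pool, reduce_blocks):
    batch, cache.dirty_pages = cache.dirty_pages, []   # aggregate dirty blocks
    for resource, offset, data in batch:
        result = reduce_blocks(data)          # dedupe/compress before the drives
        if result is not None:                # None means a dedupe hit: only
            pool.allocate_and_write(result)   # metadata is updated, no write
```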
Dell Unity’s Data Reduction feature includes multiple space efficiency algorithms to help reduce the total space occupied by a dataset: deduplication, compression, and optionally Advanced Deduplication. The figure below is an overview of the data reduction feature with Advanced Deduplication enabled. Before data is sent to the Data Reduction algorithm, it is first segmented into 8KB blocks. As an 8KB block of data passes through the algorithm, it may or may not touch all efficiency algorithms within data reduction. If a block can be deduplicated, the remainder of the data reduction algorithms are skipped, saving time and processing overhead. Each of the algorithms within the data reduction feature is discussed in detail later in this section.
With Advanced Deduplication disabled, a block of data entering the data reduction feature is only passed through the deduplication and compression algorithms. The compression algorithm is only reached when zeros or common patterns are not detected in the block of data. An example of data reduction with Advanced Deduplication disabled is shown in Figure 5, and a sketch of the decision flow follows.
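The following is a minimal sketch of that decision flow, assuming an 8KB block size as described above. The fingerprint function is a stand-in (the hash Unity actually uses is not specified here), and a plain dictionary stands in for the fingerprint cache discussed later.

```python
import hashlib

BLOCK_SIZE = 8 * 1024
ZERO_BLOCK = bytes(BLOCK_SIZE)

def fingerprint(block: bytes) -> str:
    # Stand-in fingerprint; Unity's actual hash is not documented here.
    return hashlib.sha256(block).hexdigest()

def reduce_block(block, advanced_dedup_enabled, fingerprint_cache):
    # Stage 1: zero/common-pattern deduplication. A hit skips all later stages.
    if block == ZERO_BLOCK:
        return ("pattern", "zeros")          # metadata only; nothing hits disk
    # Stage 2: Advanced Deduplication, only when enabled for the resource.
    if advanced_dedup_enabled:
        fp = fingerprint(block)
        if fp in fingerprint_cache:
            return ("dedup", fingerprint_cache[fp])   # reference existing data
    # Stage 3: compression (sketched at the end of this section).
    return ("compress", block)
```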
When new data enters the data reduction logic, it is first passed through the deduplication algorithm. The deduplication algorithm is a lightweight software algorithm that analyzes blocks of data for known patterns. A pattern may be a block of zeros written by the host, or a common pattern found in Dell Unity’s many use cases, such as virtual environments. If a pattern is detected, the private space within the storage resource is updated to record that the particular block is a pattern, along with the information needed to recreate the data block if it is accessed in the future. No data is written to disk in this scenario, which helps reduce storage consumption and drive wear. Also, when deduplication finds a pattern match, the remainder of the data reduction feature is skipped for those blocks, which saves system resources. If no pattern is detected, the data is passed to Advanced Deduplication if it is enabled. If the Advanced Deduplication option is disabled, the data is passed to the compression logic within the data reduction algorithm.
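Because a pattern hit stores only metadata, a read of such a block is served by recreating the data rather than fetching it from the drives. A hypothetical read-path counterpart, continuing the sketch above (the `PATTERN_TABLE` structure is an assumption for illustration):

```python
# Recreating a pattern block on read; PATTERN_TABLE is an assumed structure.
PATTERN_TABLE = {"zeros": bytes(8 * 1024)}

def read_block(mapping_entry, pool_extents):
    kind, ref = mapping_entry
    if kind == "pattern":
        return PATTERN_TABLE[ref]        # rebuilt in memory; no drive access
    return pool_extents[ref]             # otherwise read the stored extent
```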
If Advanced Deduplication is enabled and deduplication did not detect a pattern, the data is passed through Advanced Deduplication. Advanced Deduplication is a dynamic deduplication algorithm that reduces storage consumption by eliminating duplicate 8KB blocks within a storage resource. Advanced Deduplication only compares and detects duplicate data found within a single storage resource, such as a LUN or File System. The Advanced Deduplication algorithm utilizes fingerprints created for each block of data to quickly identify duplicate data within the dataset. The figure below shows the Advanced Deduplication algorithm in detail.
The fingerprint cache is a component of the Advanced Deduplication algorithm: a region in system memory reserved for storing fingerprints for each storage resource with Advanced Deduplication enabled. There is one fingerprint cache per Storage Processor (SP), and it contains the fingerprints for the storage resources residing on that SP. Through machine learning and statistics, the fingerprint cache determines which fingerprints to keep and which to replace with new fingerprints. The fingerprint cache algorithm learns which resources have high deduplication rates and allows those resources to consume more fingerprint locations.
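The exact replacement policy is not published, so the following toy cache is only an assumption that captures the behavior described: a bounded per-SP map that tracks per-resource hit counts, so resources that deduplicate well tend to keep more fingerprints resident.

```python
from collections import OrderedDict

class FingerprintCache:
    """Toy per-SP fingerprint cache; the real policy is ML/statistics driven."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()     # fingerprint -> (resource_id, extent)
        self.hits = {}                   # resource_id -> observed dedupe hits

    def lookup(self, fp):
        entry = self.entries.get(fp)
        if entry is not None:
            self.hits[entry[0]] = self.hits.get(entry[0], 0) + 1
            self.entries.move_to_end(fp)      # keep productive fingerprints hot
        return entry

    def insert(self, fp, resource_id, extent):
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently useful
        self.entries[fp] = (resource_id, extent)
```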
If an 8KB block is not deduplicated by the zero and common pattern deduplication algorithm, the data is passed into the fingerprint calculation portion of the Advanced Deduplication algorithm. Each 8KB block receives a fingerprint, which is compared against the fingerprints stored for the storage resource. If a matching fingerprint is found, deduplication occurs and the private space within the resource is updated to include a reference to the block of data residing on disk. No data is written to disk at this time. Savings compound because deduplication can reference compressed blocks on disk. If a match is not found, the data is passed to the compression algorithm.
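Putting the previous sketches together, a hypothetical match path for one 8KB block might look like the following, assuming the `fingerprint` helper and `FingerprintCache` sketched above. Note the same-resource check, since Advanced Deduplication only matches within a single storage resource.

```python
def advanced_dedup(block, resource_id, cache):
    # Fingerprint the block and look for a match within the same resource.
    fp = fingerprint(block)
    hit = cache.lookup(fp)
    if hit is not None and hit[0] == resource_id:
        return ("dedup", hit[1])      # reference the on-disk block (possibly
                                      # compressed), so savings compound
    return None                       # no match: fall through to compression
```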
As blocks enter the compression algorithm, they are passed through the compression software. If savings can be achieved, space matching the compressed size of the data is allocated within the Pool, the data is compressed, and the data is written to the Pool. When Advanced Deduplication is enabled, the fingerprint for the block of data is also stored with the compressed data on disk, and the fingerprint cache is updated to include the fingerprint for the new data. Compression does not compress data if no savings can be achieved; in that case, the original block of data is written to the Pool. Waiting to allocate space within the resource until after the compression algorithm completes helps avoid over-allocating space within the storage resource.
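A final sketch of the compression stage, using Python’s zlib as a stand-in for Unity’s unspecified compressor and reusing the toy `Pool` and `FingerprintCache` from the earlier sketches. The allocate-after-compression ordering mirrors the text: the extent is sized and written only once the compressed size is known.

```python
import zlib

def compress_stage(block, fp, pool, cache, resource_id):
    compressed = zlib.compress(block)
    # Only store the compressed form if it is actually smaller.
    payload = compressed if len(compressed) < len(block) else block
    extent = pool.allocate_and_write((payload, fp))  # fingerprint kept on disk
    cache.insert(fp, resource_id, extent)            # new data is now findable
    return ("stored", extent)
```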