The SISL Process

When considering a newly arrived segment, the system first checks the summary vector. If the summary vector indicates the segment is new and needs to be stored, the system, informed by the stream itself, adds the segment to the current segment locality in the order it appears. Otherwise, the segment is probably a duplicate and the system looks in a fingerprint cache held in RAM.
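The first check can be sketched as follows. This is a minimal illustration, not the product's implementation: it assumes the summary vector behaves like a Bloom filter (a "no" answer is definite, a "yes" answer is only probable), and the names `SummaryVector`, `fingerprint`, and `classify`, as well as the choice of SHA-1 and the filter sizes, are hypothetical.

```python
import hashlib


def fingerprint(segment: bytes) -> bytes:
    # Hypothetical fingerprint function; the hash choice is an assumption.
    return hashlib.sha1(segment).digest()


class SummaryVector:
    """Toy Bloom-filter-style summary kept in RAM (assumed structure)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits          # must be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fp: bytes):
        # Derive several bit positions from one fingerprint by salting the hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + fp).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fp: bytes) -> None:
        for pos in self._positions(fp):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fp: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fp))


def classify(segment: bytes, summary: SummaryVector) -> str:
    """Return 'new' when the segment is definitely unseen, else 'probable duplicate'."""
    fp = fingerprint(segment)
    if not summary.might_contain(fp):
        summary.add(fp)                   # record it; the segment will be stored
        return "new"
    return "probable duplicate"           # next step: consult the fingerprint cache
```

With a structure like this, a new segment never triggers an index lookup; only probable duplicates proceed to the cache.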
In backup/restore, most segments are accessed very infrequently. A full backup passes an entire file system serially through the backup process and references huge numbers of segments that will not be referenced again until the next full backup. Therefore, a conventional caching strategy based on data recently accessed would not be effective.
With SISL, when a segment is not found in the cache, the system looks it up in the on-disk index and then prefetches the fingerprints of an entire stream-informed locality into the cache. Most of the following segments in the incoming backup stream are then typically found in the cache without further disk accesses.
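A rough sketch of this lookup path is shown below. It is an illustration under stated assumptions: the on-disk index and the locality metadata are stood in for by plain dictionaries, the class name `FingerprintCache` and its fields are hypothetical, and the least-recently-used eviction of whole localities is an assumed policy that the text does not specify.

```python
from collections import OrderedDict


class FingerprintCache:
    """Caches whole localities of fingerprints rather than single entries."""

    def __init__(self, index, localities, max_localities=2):
        self.index = index              # fingerprint -> locality id (stand-in for the on-disk index)
        self.localities = localities    # locality id -> fingerprints (stand-in for on-disk metadata)
        self.max_localities = max_localities
        self.cached = OrderedDict()     # locality id -> set of fingerprints, oldest first
        self.disk_lookups = 0           # counts simulated index accesses

    def is_duplicate(self, fp) -> bool:
        # Fast path: the fingerprint is already in a prefetched locality.
        hit = next((loc for loc, fps in self.cached.items() if fp in fps), None)
        if hit is not None:
            self.cached.move_to_end(hit)
            return True

        # Slow path: one index lookup, then prefetch the whole locality.
        self.disk_lookups += 1
        loc = self.index.get(fp)
        if loc is None:
            return False                # the summary vector gave a false positive
        self.cached[loc] = set(self.localities[loc])
        if len(self.cached) > self.max_localities:
            self.cached.popitem(last=False)   # evict the least recently used locality
        return True


# Two localities written by an earlier backup (made-up fingerprints).
localities = {0: ["a", "b", "c"], 1: ["x", "y"]}
index = {fp: loc for loc, fps in localities.items() for fp in fps}
cache = FingerprintCache(index, localities)

# Replaying the same stream needs one index lookup per locality, not per segment.
results = [cache.is_duplicate(fp) for fp in ["a", "b", "c", "x", "y"]]
```

In this toy run, five duplicate segments cost two simulated index lookups, because each miss pulls in the fingerprints of the segments that follow it in the stream.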
Together, these techniques and others make it possible to find duplicate segments at high speed in an application-independent way while minimizing the array hardware. The approach requires neither huge amounts of RAM nor large numbers of disk drives. The summary vector avoids pointless index lookups for new segments. Localities organize segments and segment fingerprints on disk, so each disk access fetches data that is relevant for a sequence of segments.
Prefetching brings these localities into the cache so that most duplicate segments are found at high speed in the cache. In long-running experiments with real backup data, these techniques together eliminate up to 98 percent of the disk reads and deliver balanced performance using the full capacity of low-cost SATA disk drives, making inline deduplication possible.
Figure 4. Segment localities on disk
As shown in Figure 4, new data segments for a backup stream are stored together in units called localities that, along with their fingerprints and other metadata, are packed into a container and appended to the log of containers. The fingerprints for the segments in the localities are kept together in a metadata section of the container, along with other file system structural elements. This keeps fingerprints and data that were written together close together on disk for efficient access during writes when looking for duplicates and during reads when reconstructing the deduplicated stream.
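The layout described above might be modeled roughly as follows. This is a simplified sketch, not the actual on-disk format: it packs a single locality into each container, omits the other file system structural elements, and the names `Container`, `ContainerLog`, and `segments_per_container` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Container:
    fingerprints: list = field(default_factory=list)  # metadata section
    segments: list = field(default_factory=list)      # data section, in stream order


class ContainerLog:
    """Append-only log of containers; new segments are stored in arrival order."""

    def __init__(self, segments_per_container=3):
        self.segments_per_container = segments_per_container
        self.containers = []            # sealed containers, oldest first
        self.open = Container()         # container currently being filled

    def append_segment(self, segment: bytes) -> None:
        # Keep the fingerprint and its segment at the same position in the container.
        self.open.fingerprints.append(hashlib.sha1(segment).digest())
        self.open.segments.append(segment)
        if len(self.open.segments) >= self.segments_per_container:
            self.seal()

    def seal(self) -> None:
        # Append the filled container to the log and start a new one.
        if self.open.segments:
            self.containers.append(self.open)
            self.open = Container()


log = ContainerLog(segments_per_container=3)
for seg in [b"s1", b"s2", b"s3", b"s4"]:
    log.append_segment(seg)
log.seal()
```

Because the fingerprints of a container sit together in its metadata section, a single read of that section yields every fingerprint needed to recognize the segments that were written alongside it.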
