The time that a storage system takes to rebuild data from a failed disk drive is crucial to that system's data reliability. With the advent of multi-terabyte drives, and the creation of increasingly large single volumes and file systems, typical recovery times for multi-terabyte SATA drive failures now stretch to multiple days or even weeks. Throughout this extended rebuild window, the storage system is vulnerable to additional drive failures and the resulting data loss and downtime, which directly degrades its Mean Time to Data Loss (MTTDL).
Because OneFS is built on a highly distributed architecture, it can use the CPU, memory, and spindles of multiple nodes to reconstruct data from failed drives in a highly parallel and efficient manner. Because the cluster is not bound by the speed of any single drive, OneFS can recover from drive failures extremely quickly, and this efficiency scales with cluster size. As a result, a failed drive within a cluster is rebuilt an order of magnitude faster than on hardware RAID-based storage devices. OneFS also has no need for dedicated hot-spare drives.
Because OneFS protects each file individually with FEC erasure coding, the wall-clock time required to reprotect data after a drive failure depends on several factors, chiefly the amount of affected data and the I/O performance of the disk pool involved.
While rebuild time is difficult to predict with high accuracy, OneFS strives to keep drive rebuild rates high enough to maintain an MTTDL of at least 5,000 years, depending on the configured protection level.
Most of a reprotection job's total run time is typically spent scanning every file on the cluster to determine which ones need repair. Because only a small fraction of files on most clusters require attention, the size of those files and the I/O performance of the affected disk pool largely determine the remainder of the run time.
For flash drives, there is a negligible difference between TLC and QLC media, or between drive sizes, because the drives themselves are not the bottleneck. As a result, the reprotection time for flash drives (SSDs) of any type or capacity is fairly uniform. It can be expressed as:
Data size on the drive / 1,500 GiB/hr (1.46 TiB/hr)
where 1,500 GiB/hr is the lower-bound rebuild rate observed in internal testing.
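As a rough illustration of this estimate, the minimal Python sketch below divides the amount of data on the failed SSD by the 1,500 GiB/hr lower-bound rate cited above. The function name, default rate parameter, and the 12 TiB example drive occupancy are assumptions for illustration only and are not part of OneFS.

```python
def estimated_reprotect_hours(data_gib: float, rate_gib_per_hr: float = 1500.0) -> float:
    """Rough reprotect-time estimate for a failed SSD.

    Uses the ~1,500 GiB/hr lower-bound rebuild rate noted above as the default;
    actual rates vary with cluster size, load, and configured protection level.
    """
    return data_gib / rate_gib_per_hr

# Hypothetical example: an SSD carrying 12 TiB (12,288 GiB) of data
# 12,288 GiB / 1,500 GiB/hr ≈ 8.2 hours to reprotect
print(f"{estimated_reprotect_hours(12 * 1024):.1f} hours")
```

Because the rate used is a lower bound, this calculation yields a conservative (worst-case) estimate; in practice, larger clusters tend to reprotect data faster.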