A typical SQL Server storage architecture consists of a single SQL Server host connected through redundant storage paths to a storage appliance such as Dell PowerStore. This architecture has many redundant features and typically provides roughly 99.9%, or “three nines”, of availability, which corresponds to about 8.77 hours of downtime per year. For many organizations, this is perfectly acceptable. The system is taken offline for short periods to apply upgrades, and if something happens to the SQL Server host, the recovery plan is to restore the database from a backup. Often, all of this is achievable within the allotted downtime.
Figure 1. Basic SQL Server storage configuration
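The relationship between an availability percentage and the yearly downtime it allows is simple arithmetic. The following is a minimal sketch, not part of any Dell or Microsoft tooling. The function name `downtime_hours_per_year` is hypothetical, and the year length of 365.25 days is an assumption chosen because it reproduces the 8.77-hour figure quoted for three nines.

```python
# Assumption: an average year of 365.25 days (8766 hours).
HOURS_PER_YEAR = 365.25 * 24


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year allowed by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    # The unavailable fraction of the year, expressed in hours.
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Print the downtime budget for three to six nines.
    for pct in (99.9, 99.99, 99.999, 99.9999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% -> {hours:.4f} hours ({hours * 60:.1f} minutes) per year")
```

Each additional nine divides the downtime budget by ten, so a six-nines design allows well under a minute of downtime per year under the same assumption.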
The remaining single points of failure are the SQL Server host itself and the storage appliance as a whole. The host or the storage appliance may also need to be taken offline for an upgrade. If a site disaster occurs, the whole system must be reestablished at another site.
SQL Server relies on resilient storage hardware for reliable data storage. If the underlying storage hardware becomes unavailable, SQL Server takes databases offline in order to preserve data integrity. This action results in an outage to the SQL Server application and its users.
Resilient data storage architectures use redundant components wherever possible to reduce the impact of component failure. Dell PowerStore uses fully redundant hardware and high-availability features to achieve a design availability of 99.9999%.¹ Complete information about Dell PowerStore availability can be found in the Dell PowerStore: Clustering and High Availability paper.
________________
¹ Based on the Dell Technologies specification for Dell EMC PowerStore, April 2020. Actual system availability may vary.
However, the application stack contains many components besides PowerStore, such as power, cooling, networking, and application servers. An outage of any single component can impact availability.
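To illustrate why the rest of the stack matters, the sketch below multiplies component availabilities for a chain in which every component must be up. It assumes failures are independent, which real systems only approximate. Apart from the 99.9999% storage design value, every figure in `stack` is a hypothetical placeholder rather than a measured or published number, and the function name is illustrative.

```python
from math import prod


def series_availability(component_availabilities) -> float:
    """Availability of a chain in which every component must be up.

    Assumes independent failures; inputs are fractions between 0 and 1.
    """
    values = list(component_availabilities)
    if not values:
        raise ValueError("at least one component is required")
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError("availabilities must be between 0 and 1")
    # With independent components in series, availabilities multiply.
    return prod(values)


if __name__ == "__main__":
    # Hypothetical figures for illustration only (except the storage design value).
    stack = {
        "storage": 0.999999,
        "network": 0.9999,
        "sql_server_host": 0.999,
        "power_and_cooling": 0.9999,
    }
    overall = series_availability(stack.values())
    print(f"overall availability: {overall:.6f}")
```

Under these assumptions, the combined availability is always lower than that of the weakest component, so a six-nines storage platform cannot by itself lift a three-nines host above three nines.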
Designing additional fault tolerance into an architecture to improve uptime and add “nines” of availability typically involves eliminating single points of failure and adding nondisruptive transitions to redundant components whenever possible.
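The effect of removing a single point of failure can be approximated with the standard formula for redundant components, 1 - (1 - A)^n. The sketch below is an idealized model only: it assumes the redundant copies fail independently and that the transition between them is instantaneous and nondisruptive. The function name and the 99.9% host figure are illustrative assumptions.

```python
def redundant_availability(single_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when any one of them suffices.

    Idealized: assumes independent failures and instantaneous failover.
    """
    if not 0.0 <= single_availability <= 1.0:
        raise ValueError("single_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single_availability) ** copies


if __name__ == "__main__":
    # Hypothetical example: a single host at 99.9%, then two and three redundant hosts.
    for n in (1, 2, 3):
        print(f"{n} host(s): {redundant_availability(0.999, n):.9f}")
```

In this model, pairing two three-nines hosts yields roughly six nines for the host tier. Real failover takes time and failures can be correlated, so actual gains are smaller, which is why nondisruptive transitions matter as much as the redundancy itself.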
The next few sections of this paper describe approaches to improving fault tolerance using SQL Server features and Dell PowerStore metro node, and how to combine the two.