Question: What is the expected behavior during the loss of one or more disks?
Data that exists on failed disks will be reconstructed on available capacity across other nodes and disks, using either the remaining erasure coded data and parity fragments or the replica copies. When the failed disk is replaced, it is used as free capacity.
For more details on node failures, see the Dell ECS: High Availability Design white paper.
Question: What is the expected behavior during loss of node(s)?
Any request for system metadata owned by an unresponsive node triggers redistribution of that metadata ownership across the remaining nodes in the site. When the redistribution completes, the request for system metadata completes successfully. Data that exists on disks in the unresponsive node is reconstructed using either the remaining erasure-coded data and parity fragments or the replica copies.
The following chart describes the expected behavior for erasure-coded content within a single site.
EC scheme | Nodes in VDC | Concurrent failure | One-by-one failure |
12+4 | 5 nodes | Loss of 1 node: reads and writes are successful, erasure coding continues. Loss of 2 or 3 nodes: some reads will fail, new writes will stop, erasure coding stops and new writes will be triple mirrored. | Loss of 1 node: reads and writes are successful, erasure coding continues. Loss of 2 or 3 nodes: all reads will succeed, new writes will stop, erasure coding stops and new writes will be triple mirrored. |
10+2 | 6 nodes | Loss of 1 node: reads and writes are successful, erasure coding stops and new writes will be triple mirrored. Loss of 2 nodes: some reads will fail, new writes will be successful. Loss of 3 nodes: some reads and writes will fail. | Loss of 1 node: reads and writes are successful, erasure coding stops and new writes will be triple mirrored. |
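The concurrent-failure read behavior in the chart is consistent with simple fragment counting: an erasure-coded unit with k data and m parity fragments stays readable as long as no more than m fragments are lost. The sketch below is illustrative only and is not ECS code. It assumes, which this FAQ does not state, that the fragments of one unit are spread as evenly as possible across all nodes. The function names are hypothetical.

```python
def worst_case_lost_fragments(data: int, parity: int, nodes: int, failed_nodes: int) -> int:
    """Upper bound on fragments of one erasure-coded unit lost in a concurrent failure.

    Assumption (not from the FAQ): the data + parity fragments are spread
    as evenly as possible across all nodes.
    """
    total = data + parity
    base, extra = divmod(total, nodes)
    # `extra` nodes hold base + 1 fragments; the worst case loses those nodes first.
    failed = min(failed_nodes, nodes)
    heavy = min(failed, extra)
    light = failed - heavy
    return heavy * (base + 1) + light * base


def all_reads_survive(data: int, parity: int, nodes: int, failed_nodes: int) -> bool:
    """True if every unit can still be rebuilt, that is, at most `parity` fragments are lost."""
    return worst_case_lost_fragments(data, parity, nodes, failed_nodes) <= parity


if __name__ == "__main__":
    for scheme, nodes in (((12, 4), 5), ((10, 2), 6)):
        for failed in (1, 2, 3):
            ok = all_reads_survive(*scheme, nodes, failed)
            print(f"{scheme[0]}+{scheme[1]} on {nodes} nodes, {failed} failed: "
                  f"{'all reads succeed' if ok else 'some reads may fail'}")
```

Under this assumption, losing one node costs at most 4 of 16 fragments for 12+4 on five nodes and 2 of 12 fragments for 10+2 on six nodes, so reads succeed. Losing two or more nodes at once can exceed the parity count. The sketch covers only concurrent failures. It does not model one-by-one failures, where reconstruction between failures restores protection.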
The basic rules for determining what operations fail in a single site with multiple node failures include:
For more details on node failures, see the Dell ECS: High Availability Design white paper.
Question: What are the types of site outages and how does ECS handle them?
Site outages can be classified as a temporary site outage (TSO) or a permanent site outage (PSO). A TSO is a failure of the WAN connection between two sites, or a temporary failure of an entire site (for example, a power failure). A site can be brought back online after a TSO, and ECS can detect and automatically handle these temporary site failures. A PSO occurs when an entire site becomes permanently unrecoverable, such as after a disaster. In this case, the System Administrator must permanently fail over the site from the federation to initiate failover processing. VDCs in a geo-replicated environment use a heartbeat mechanism. Sustained loss of heartbeats for a configurable duration (by default, 15 minutes) indicates a network or site outage, and the system then identifies the condition as a TSO.
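The heartbeat rule described above can be sketched as a timeout check. This is a minimal illustration, not the ECS implementation. The class and method names are hypothetical, and only the 15-minute default comes from the text.

```python
from dataclasses import dataclass

# Default from the text: 15 minutes of sustained heartbeat loss (configurable).
TSO_THRESHOLD_SECONDS = 15 * 60


@dataclass
class HeartbeatMonitor:
    """Hypothetical tracker for the last heartbeat received from a remote VDC."""
    threshold: float = TSO_THRESHOLD_SECONDS
    last_heartbeat: float = 0.0

    def record_heartbeat(self, now: float) -> None:
        # Any heartbeat resets the outage timer.
        self.last_heartbeat = now

    def in_tso(self, now: float) -> bool:
        # Heartbeats have been missing for at least the configured duration.
        return now - self.last_heartbeat >= self.threshold


if __name__ == "__main__":
    monitor = HeartbeatMonitor()
    monitor.record_heartbeat(now=0.0)
    print(monitor.in_tso(now=10 * 60))  # 10 minutes of silence: below the threshold
    print(monitor.in_tso(now=15 * 60))  # 15 minutes of silence: threshold reached
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to test.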
If a disaster occurs, an entire site can become unrecoverable; ECS refers to this as a permanent site outage (PSO). ECS treats the unrecoverable site as a temporary site failure, but only if the entire site is down or unreachable over the WAN. If the failure is permanent, the System Administrator must permanently fail over the site from the federation to initiate failover processing. Failover processing resynchronizes and reprotects the objects that were stored on the failed site. The recovery tasks run as a background process.
Starting with version 3.7, ECS supports recovery from multiple simultaneous site failures (N-1 sites), which shortens data recovery time. The customer must contact Dell for support with this operation. It is supported only for replication groups that are configured to replicate to all sites.
For more details, see the Dell ECS: High Availability Design white paper.
Question: What is the expected behavior during loss of a site?
If a single site in a replication group containing more than one site is temporarily unavailable, some operations will be limited, such as:
For more details on site failures, see the Dell ECS: High Availability Design white paper.