Fault tolerance


Question: What is the expected behavior during the loss of one or more disks?

Data on failed disks is reconstructed on available capacity across other nodes and disks, using either the remaining erasure-coded data and parity fragments or the replica copies. After a failed disk is replaced, the new disk is used as free capacity.

For more details on node failures, see the Dell ECS: High Availability Design white paper.

Question: What is the expected behavior during loss of node(s)?

Any request for system metadata owned by a node that is not responding triggers redistribution of the requested metadata ownership across the remaining nodes in the site. When this redistribution completes, the request for system metadata completes successfully. Data on the disks of the unresponsive node is reconstructed using either the remaining erasure-coded data and parity fragments or the replica copies.

Concurrent failure: Nodes fail almost simultaneously, or a node fails before recovery from a previously failed node completes.
One-by-one failure: One node fails, all recovery operations complete, and then a second node fails. This can occur multiple times and is analogous to a VDC going from, for example, four sites to three sites to two sites to one site. It requires that the remaining nodes have sufficient space to complete recovery operations.
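
The distinction between the two failure modes can be sketched in a few lines of Python. This is an illustration of the definitions only, not ECS code: the `NodeFailure` record, its fields, and the `classify_failures` function are hypothetical names, and the timestamps are assumed to come from some monitoring source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeFailure:
    """Hypothetical record of one node failure."""
    failed_at: float                 # time the node failed (seconds)
    recovered_at: Optional[float]    # time recovery completed, None if still running


def classify_failures(failures: List[NodeFailure]) -> str:
    """Return 'concurrent' if any node failed before recovery from an
    earlier failure completed, otherwise 'one-by-one'."""
    ordered = sorted(failures, key=lambda f: f.failed_at)
    # Comparing each failure with the next one in time order is enough:
    # if any earlier recovery overlaps a later failure, it also overlaps
    # the failure that immediately follows it.
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.recovered_at is None or cur.failed_at < prev.recovered_at:
            return "concurrent"
    return "one-by-one"
```

For example, a second node failing at time 50 while recovery from the first failure runs until time 100 is classified as concurrent, whereas a second failure at time 200 is classified as one-by-one.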

The following table summarizes the behavior of erasure-coded content within a single site.

EC scheme	Nodes in VDC	Concurrent failure	One-by-one failure
12+4	5 nodes	Loss of 1 node: reads and writes are successful, erasure coding continues. Loss of 2 or 3 nodes: some reads will fail, new writes will stop, erasure coding stops and new writes will be triple mirrored.	Loss of 1 node: reads and writes are successful, erasure coding continues. Loss of 2 or 3 nodes: all reads will succeed, new writes will stop, erasure coding stops and new writes will be triple mirrored.
10+2	6 nodes	Loss of 1 node: reads and writes are successful, erasure coding stops and new writes will be triple mirrored. Loss of 2 nodes: some reads will fail, new writes will be successful. Loss of 3 nodes: some reads and writes will fail.	Loss of 1 node: reads and writes are successful, erasure coding stops and new writes will be triple mirrored.

The basic rules for determining what operations fail in a single site with multiple node failures include:

If you have three or more concurrent node failures, some reads and writes will fail due to the potential loss of all three replica copies of the associated triple mirrored metadata chunks.
Writes require a minimum of three nodes.
Erasure coding will stop, and erasure-coded chunks will be converted to triple mirror protection, if the number of nodes is less than the minimum required for the erasure coding scheme. Since the default erasure coding scheme, 12+4, requires four nodes, erasure coding will stop if there are fewer than four nodes. For cold storage erasure coding, 10+2, erasure coding will stop if there are fewer than six nodes.
As an example of this conversion, in a VDC with default erasure coding and four nodes, the following would happen after a node failure:

The node failure causes four fragments to be lost.
The missing fragments are rebuilt.
Three replica copies of the chunk are created, one on each node.
The erasure-coded copy is deleted.
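
The rules and the four-node example above can be expressed as a short Python sketch. It encodes only the thresholds stated in the rules (three nodes for writes, four nodes for 12+4, six nodes for 10+2, three concurrent failures for metadata risk) and does not attempt to reproduce every cell of the table. The function names are hypothetical, and `fragments_per_node` assumes that fragments are spread as evenly as possible across nodes.

```python
import math

# Thresholds taken from the rules above.
MIN_NODES_FOR_EC = {"12+4": 4, "10+2": 6}
MIN_NODES_FOR_WRITES = 3
CONCURRENT_FAILURES_RISKING_METADATA = 3


def single_site_state(scheme: str, nodes_remaining: int,
                      concurrent_failures: int) -> dict:
    """Apply the basic single-site rules to a node count (illustrative only)."""
    return {
        # Writes require a minimum of three nodes.
        "writes_possible": nodes_remaining >= MIN_NODES_FOR_WRITES,
        # Below the scheme minimum, EC stops and chunks become triple mirrored.
        "erasure_coding_active": nodes_remaining >= MIN_NODES_FOR_EC[scheme],
        # Three or more concurrent failures may lose all three metadata replicas.
        "metadata_at_risk": concurrent_failures >= CONCURRENT_FAILURES_RISKING_METADATA,
    }


def fragments_per_node(scheme: str, nodes: int) -> int:
    """Largest number of fragments of one chunk held by a single node,
    assuming an even spread (an assumption, not a documented guarantee)."""
    data, parity = (int(part) for part in scheme.split("+"))
    return math.ceil((data + parity) / nodes)
```

With 12+4 and four nodes, `fragments_per_node("12+4", 4)` gives 4, which matches the four fragments lost in the example. After that failure, `single_site_state("12+4", 3, 1)` reports that writes are still possible but erasure coding is no longer active.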

For more details on node failures, see the Dell ECS: High Availability Design white paper.

Question: What are the types of site outages and how does ECS handle them?

Site outages can be classified as a temporary site outage (TSO) or a permanent site outage (PSO).

A TSO is a failure of the WAN connection between two sites, or a temporary failure of an entire site (for example, a power failure). A site can be brought back online after a TSO. ECS can detect and automatically handle these types of temporary site failures.

A PSO is when an entire site becomes permanently unrecoverable, such as when a disaster occurs. In this case, the System Administrator must permanently fail over the site from the federation to initiate failover processing.

VDCs in a geo-replicated environment have a heartbeat mechanism. Sustained loss of heartbeats for a configurable duration (by default, 15 minutes) indicates a network or site outage, and the system then identifies the condition as a TSO.
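
The heartbeat timeout described above can be illustrated with a minimal Python sketch. This is not ECS code or an ECS API: the function name, its parameters, and the way the last heartbeat time is obtained are all assumptions. Only the 15-minute default comes from the text.

```python
from datetime import datetime, timedelta

# Default duration from the text; ECS describes it as configurable.
DEFAULT_TSO_THRESHOLD = timedelta(minutes=15)


def is_tso_suspected(last_heartbeat: datetime, now: datetime,
                     threshold: timedelta = DEFAULT_TSO_THRESHOLD) -> bool:
    """Return True when no heartbeat has been seen from a remote VDC
    for at least `threshold` (hypothetical helper, for illustration)."""
    return now - last_heartbeat >= threshold
```

A last heartbeat at 12:00 with the current time at 12:15 meets the default threshold, while a current time of 12:14 does not.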

If a disaster makes an entire site unrecoverable, ECS refers to this as a permanent site outage (PSO). ECS treats the unrecoverable site as a temporary site failure, but only if the entire site is down or unreachable over the WAN. If the failure is permanent, the System Administrator must permanently fail over the site from the federation to initiate failover processing. Failover initiates resynchronization and reprotection of the objects that are stored on the failed site. The recovery tasks run as a background process.

Starting with version 3.7, ECS supports recovery from the simultaneous failure of multiple sites (N-1 sites), which shortens the data recovery time. The customer must contact Dell for support with this operation. It is supported only for replication groups that are configured to replicate to all sites.

For more details, see the Dell ECS: High Availability Design white paper.

Question: What is the expected behavior during loss of a site?

If a single site in a replication group containing more than one site is temporarily unavailable, some operations are limited. For example:

File systems within NFS buckets that are owned by the unavailable site are read-only.
Buckets, namespaces, object users, authentication providers, replication groups, and NFS user and group mappings cannot be created, deleted, or updated from any site (replication groups can be removed from a VDC during a permanent site failover).
You cannot list buckets for a namespace when the namespace owner site is not reachable.
OpenStack Swift users cannot log in to OpenStack during a TSO because ECS cannot authenticate Swift users during the TSO. After the TSO, Swift users must re-authenticate.
Creating, reading, and updating objects, and listing objects in a bucket, may be interrupted depending on the replication group options configured on the bucket.

For more details on site failures, see the Dell ECS: High Availability Design white paper.
