ECS node failure
ECS constantly performs health checks on nodes. To maintain system availability, the ECS distributed architecture allows any node to accept client requests. If a node is down, a client can be redirected, either manually or automatically (for example, by DNS or a load balancer), to another node that can service the request.
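As an illustration, client-side failover across nodes can be as simple as retrying against a list of node endpoints. This is a minimal sketch only; the hostnames and port are placeholders, and in practice DNS or a load balancer usually provides this indirection:

```python
import requests

# Hypothetical ECS node endpoints; hostnames and port are placeholders.
ECS_NODES = [
    "http://ecs-node1.example.com:9020",
    "http://ecs-node2.example.com:9020",
    "http://ecs-node3.example.com:9020",
]

def get_object(bucket: str, key: str) -> bytes:
    """Try each node in turn; any ECS node can service the request,
    so an unresponsive node is simply skipped."""
    last_error = None
    for node in ECS_NODES:
        try:
            resp = requests.get(f"{node}/{bucket}/{key}", timeout=5)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException as err:
            last_error = err  # node down or unhealthy; try the next one
    raise RuntimeError(f"all ECS nodes unavailable: {last_error}")
```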
To avoid triggering reconstruction for spurious events, a full reconstruction operation does not start unless a node fails a set number of sequential health checks; the default window is 60 minutes. If an I/O request arrives for the unresponsive node before full reconstruction is triggered, the request is redirected to another node that can service it.
After a node fails sequential health checks for the full window (60 minutes by default), the node is deemed down. This determination automatically triggers re-creation of the partition tables and the chunk fragments on the disks owned by the failed node.
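Conceptually, the health-check gate works like the sketch below. The 60-minute window comes from the text above; the probe interval, function names, and callbacks are illustrative assumptions:

```python
import time

CHECK_INTERVAL_S = 60        # illustrative probe interval
FAILURE_WINDOW_S = 60 * 60   # default threshold from the text: 60 minutes

def monitor_node(node_id, is_healthy, trigger_recovery):
    """Deem a node down only after it fails every health check across a
    full window, so transient blips never trigger a full reconstruction."""
    first_failure = None
    while True:
        if is_healthy(node_id):
            first_failure = None   # any successful check resets the window
        elif first_failure is None:
            first_failure = time.monotonic()
        elif time.monotonic() - first_failure >= FAILURE_WINDOW_S:
            # Node is deemed down: re-create partition tables and the
            # chunk fragments on the disks it owned.
            trigger_recovery(node_id)
            return
        time.sleep(CHECK_INTERVAL_S)
```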
As part of the re-creation operation, a notification is sent to chunk manager, which starts a parallel recovery of all chunk fragments stored on the failed node’s disks. This recovery can include chunks containing object data, custom client-provided metadata, and ECS metadata. If the failed node comes back online, an updated status is sent to chunk manager, and any incomplete recovery operations are canceled. For more details about chunk fragment recovery, see Disk failure.
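ECS's internal chunk manager API is not public, so the following is a purely hypothetical sketch of the parallel recovery and its cancellation path, with all names assumed for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def recover_failed_node(fragments, rebuild_fragment, node_is_back):
    """Rebuild, in parallel, every chunk fragment that lived on the failed
    node's disks; cancel outstanding work if the node comes back online."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(rebuild_fragment, frag) for frag in fragments]
        for future in futures:
            if node_is_back():
                # Node rejoined: cancel every rebuild that has not started yet.
                for pending in futures:
                    pending.cancel()
                return
            future.result()  # wait for this rebuild; propagate any error
```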
Besides monitoring hardware, ECS also monitors all services and data tables on each node.
If a node or service remains down, ECS redistributes ownership of the tables that it owned across the remaining nodes in the VDC (site). The ownership change updates the vnest information on the remaining nodes with the new partition table owners and re-creates the memory tables that the failed node owned.
The memory tables from the failed node are re-created by replaying journal entries that were written after the latest successful journal checkpoint.
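Checkpoint-plus-journal replay is the standard write-ahead-log recovery pattern. A minimal sketch, assuming dict-based tables and illustrative loader callbacks:

```python
def rebuild_memory_table(load_checkpoint, read_journal):
    """Rebuild a failed node's memory table on its new owner: start from
    the last durable checkpoint, then replay every journal entry written
    after it to reach the pre-failure state."""
    table, checkpoint_seq = load_checkpoint()   # (dict, last sequence number)
    for seq, key, value in read_journal(after=checkpoint_seq):
        table[key] = value                      # re-apply each logged write
    return table

# Toy stand-ins for checkpoint and journal storage:
table = rebuild_memory_table(
    load_checkpoint=lambda: ({"k1": "v1"}, 10),
    read_journal=lambda after: [(11, "k2", "v2"), (12, "k1", "v3")],
)
assert table == {"k1": "v3", "k2": "v2"}
```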
In some scenarios, multiple nodes within a site fail, either one by one (each node fails after recovery from the previous failure has completed) or concurrently (a node fails while recovery from an earlier failure is still in progress). The impact of the failure depends on which nodes go down. Because one-by-one failures give recovery time to re-protect the data between failures, a site tolerates more cumulative one-by-one failures than concurrent ones. The following tables describe the best-case failure tolerances of a single site, based on the erasure coding scheme and the number of nodes in the VDC.
Default erasure coding scheme (12+4):

| Number of nodes in VDC at creation | Total number of failed nodes since VDC creation | Status after concurrent failures | Status after most recent one-by-one failure | Current VDC state after previous one-by-one failures |
|---|---|---|---|---|
| 5 | 1 | | | 5-node VDC with 1 node failing |
| 5 | 2 | | | VDC previously went from 5 to 4 nodes; 4-node VDC with 1 node failing |
| 5 | 3–4 | | | VDC previously went from 5 to 4 to 3 nodes |
| 6 | 1 | | | 6-node VDC with 1 node failing |
| 6 | 2 | | | VDC previously went from 6 to 5 nodes; 5-node VDC with 1 node failing |
| 6 | 3 | | | VDC previously went from 6 to 5 to 4 nodes; 4-node VDC with 1 node failing |
| 6 | 4–5 | | | VDC previously went from 6 to 5 to 4 to 3 nodes |
| 8 | 1–2 | | | 8-node VDC or VDC that went from 8 to 7 nodes |
| 8 | 3–4 | | | VDC previously went from 8 to 7 to 6 nodes |
| 8 | 5 | | | VDC previously went from 8 to 7 to 6 to 5 to 4 nodes; 4-node VDC with 1 node failing |
| 8 | 6–7 | | | VDC previously went from 8 to 7 to 6 to 5 to 4 to 3 nodes |
Cold storage erasure coding scheme (10+2):

| Number of nodes in VDC at creation | Total number of failed nodes since VDC creation | Status after concurrent failures | Status after most recent one-by-one failure | Current VDC state after previous one-by-one failures |
|---|---|---|---|---|
| 6 | 1 | | | 6-node VDC with 1 node failing |
| 6 | 2 | | | VDC previously went from 6 to 5 nodes; 5-node VDC with 1 node failing |
| 6 | 3 | | | VDC previously went from 6 to 5 to 4 nodes; 4-node VDC with 1 node failing |
| 6 | 4–5 | | | VDC previously went from 6 to 5 to 4 to 3 nodes |
| 8 | 1 | | | 8-node VDC with 1 node failing |
| 8 | 2 | | | VDC previously went from 8 to 7 nodes; 7-node VDC with 1 node failing |
| 8 | 3–5 | | | VDC previously went from 8 to 7 to 6 nodes |
| 8 | 6–7 | | | VDC previously went from 8 to 7 to 6 to 5 to 4 to 3 nodes |
| 12 | 1–2 | | | 12-node VDC or VDC that went from 12 to 11 nodes |
| 12 | 3–6 | | | VDC previously went from 12 to 11 to 10 nodes |
| 12 | 7–9 | | | VDC previously went from 12 to 11 to 10 to 9 to 8 to 7 to 6 nodes |
| 12 | 10–11 | | | VDC previously went from 12 to 11 to 10 to 9 to 8 to 7 to 6 to 5 to 4 to 3 nodes |
The following basic rules determine what operations fail in a single site with multiple node failures: a chunk's data remains recoverable only while no more than the coding fragments (4 of 16 with the default 12+4 scheme) are lost, and because fragments are spread across nodes, the tolerance depends on how many fragments each failed node held. As an example, with 6 nodes and default erasure coding, the 16 fragments of a chunk are spread with at most 3 fragments on any one node, so one failed node costs at most 3 fragments and the data remains readable, while two concurrent node failures can cost up to 6 fragments and exceed the 4-fragment tolerance.
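The arithmetic behind this example can be checked with a short calculation. The even-spread placement assumed here is a simplification of ECS's actual fragment placement:

```python
import math

def max_tolerable_node_loss(data: int, coding: int, nodes: int) -> int:
    """Worst-case number of concurrent node failures a site survives with
    all chunks still readable, assuming the data + coding fragments are
    spread as evenly as possible across the nodes."""
    fragments = data + coding
    per_node = math.ceil(fragments / nodes)  # most fragments on any one node
    return coding // per_node                # up to `coding` fragments may be lost

print(max_tolerable_node_loss(12, 4, 6))  # -> 1 (at most 3 fragments per node)
print(max_tolerable_node_loss(12, 4, 8))  # -> 2 (at most 2 fragments per node)
```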