Permanent site outage (PSO)
If a disaster occurs at a site and the administrator determines that the site is unrecoverable, the administrator can initiate a permanent site failover, which removes the VDC from the federation. When a permanent site failover is initiated, all chunks from the failed site are recovered on the remaining sites to reestablish data durability.
During recovery, each remaining site scans its local chunk manager table for entries that reference the failed site.
Any that are found with a chunk type of:
Once permanent site failover starts, access to the data owned by the failed site will not be available until after the permanent site failover process is complete. Replication of data is separate from failover operations; as such, replication does not have to be complete for access to the data owned by the failed site to be restored.
This example is of a three-site configuration where Site 1 fails. Table 22 and Table 23 are the chunk manager tables for the remaining sites.
Chunk ID | Primary site | Secondary site | Type |
C1 | Site 1 | Site 2 | Copy |
C2 | Site 2 | Site 3 | Local |
C3 | Site 1 | Site 3 | Remote |
C4 | Site 2 | Site 1 | Local |
In this example, Site 2 acts as follows:
Chunk ID | Primary site | Secondary site | Type |
C1 | Site 1 | Site 2 | Remote |
C2 | Site 2 | Site 3 | Encoded |
C3 | Site 1 | Site 3 | Encoded |
C4 | Site 2 | Site 1 | Remote |
C5 | Site 3 | | Parity (C2 and C3) |
Site 3 acts as follows:
Site 3 becomes the new primary site.
The site with the new chunk becomes the new secondary site.
Table 24 and Table 25 are the chunk manager tables for the two remaining sites after permanent site failover is complete.
Chunk ID | Primary site | Secondary site | Type |
C1 | Site 2 | Site 3 | Local |
C2 | Site 2 | Site 3 | Local |
C3 | Site 3 | Site 2 | Copy |
C4 | Site 2 | Site 3 | Local |
Chunk ID | Primary site | Secondary site | Type |
C1 | Site 2 | Site 3 | Copy |
C2 | Site 2 | Site 3 | Copy |
C3 | Site 3 | Site 2 | Local |
C4 | Site 2 | Site 3 | Copy |
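As an illustration of the scan described at the start of this section, the following minimal Python sketch looks for chunk manager entries that reference a failed site. The `ChunkEntry` type and the function names are hypothetical and are not part of any ECS API. The rows are copied from the first pre-failover table and the first post-failover table above, which are assumed here to belong to Site 2. The sketch only identifies affected entries; it does not model the per-type recovery actions.

```python
from typing import List, NamedTuple, Optional


class ChunkEntry(NamedTuple):
    """One row of a chunk manager table (hypothetical representation)."""
    chunk_id: str
    primary: str
    secondary: Optional[str]
    chunk_type: str


def chunks_referencing(table: List[ChunkEntry], site: str) -> List[str]:
    """Return the IDs of chunks whose primary or secondary site is `site`."""
    return [e.chunk_id for e in table if site in (e.primary, e.secondary)]


# Rows from the first pre-failover table (assumed to be Site 2's table).
site2_before = [
    ChunkEntry("C1", "Site 1", "Site 2", "Copy"),
    ChunkEntry("C2", "Site 2", "Site 3", "Local"),
    ChunkEntry("C3", "Site 1", "Site 3", "Remote"),
    ChunkEntry("C4", "Site 2", "Site 1", "Local"),
]

# Rows from the first post-failover table (same site, after recovery).
site2_after = [
    ChunkEntry("C1", "Site 2", "Site 3", "Local"),
    ChunkEntry("C2", "Site 2", "Site 3", "Local"),
    ChunkEntry("C3", "Site 3", "Site 2", "Copy"),
    ChunkEntry("C4", "Site 2", "Site 3", "Local"),
]

affected = chunks_referencing(site2_before, "Site 1")
leftover = chunks_referencing(site2_after, "Site 1")
print(affected)  # ['C1', 'C3', 'C4']
print(leftover)  # []
```

In this sketch, C2 is the only chunk that needs no work because neither of its sites is the failed one. Once failover is complete, no entry refers to Site 1.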
Permanent site outage is handled slightly differently for data that is replicated using passive geo-replication. Passively geo-replicated data does not reestablish data durability during a PSO; instead, data durability is reestablished after a new third site is added to the replication group. The PSO operations differ depending on whether the permanently failed site is one of the source sites or the replication target site.
The recovery process still involves the remaining sites scanning their local chunk manager table looking for references to sites that include the failed site.
Any that a site finds with a chunk type of:
No secondary sites are created until after a third site is added to the replication group.
Post-PSO, a third site can be added to reestablish data durability and protect against site-wide failures. After a third site is added to the passive geo-replication group, the previous two sites scan their local chunk manager table looking for chunks with no secondary chunk listed. The following activities occur:
Once PSO starts on a source site, access to the data owned by the failed site is not available until after the permanent site failover process is complete. Once PSO is complete, access to the data is restored. Until a third site is added to the replication group, any new writes to the online source site are replicated to the replication target. However, XOR operations will not occur because XOR only runs on chunks from two different source sites. Because all new chunks come from the same source site, XOR cannot run.
After PSO is complete, a third site can be added to the replication group to restore data durability and protect against site-wide failures. Also, the replication target can resume running XOR operations on any two chunks from different source sites marked as type Copy.
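The XOR constraint above can be illustrated with a short Python sketch. It shows generic XOR parity (the parity of two equal-length chunks can rebuild either chunk from the other) and a pairing check that only accepts Copy chunks from different source sites. The function names, the byte values, and the chunk ID `C5` are hypothetical; this is not ECS code and makes no claim about how ECS selects or encodes chunks internally.

```python
from itertools import combinations
from typing import List, Tuple


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    if len(a) != len(b):
        raise ValueError("chunks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def xor_candidates(copies: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Pairs of Copy chunks whose source (primary) sites differ.

    Each item in `copies` is (chunk_id, source_site).
    """
    return [
        (a_id, b_id)
        for (a_id, a_site), (b_id, b_site) in combinations(copies, 2)
        if a_site != b_site
    ]


# Hypothetical chunk contents, kept tiny for readability.
c1 = b"\x01\x02\x03\x04"
c2 = b"\x10\x20\x30\x40"
parity = xor_bytes(c1, c2)
recovered_c1 = xor_bytes(parity, c2)  # rebuild C1 from parity and C2

# Two source sites online: pairs from different sources exist.
two_sources = [("C1", "Site 1"), ("C2", "Site 2"), ("C3", "Site 1")]
# Only one source site online (C5 is a hypothetical new write).
one_source = [("C2", "Site 2"), ("C5", "Site 2")]

pairs_two = xor_candidates(two_sources)
pairs_one = xor_candidates(one_source)
print(pairs_two)  # [('C1', 'C2'), ('C2', 'C3')]
print(pairs_one)  # []
```

With a single source site, the candidate list is empty, which corresponds to XOR being unable to run until another source site joins the replication group.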
In the next example, Site 1 and Site 2 are the source sites, and Site 3 is the replication target site. Table 26 and Table 27 are the chunk manager tables of the two source sites. Table 28 is the chunk manager table of the replication target site.
Chunk ID | Primary site | Secondary site | Type |
C1 | Site 1 | Site 3 | Local |
C2 | Site 2 | Site 3 | Remote |
C3 | Site 1 | Site 3 | Local |
Chunk ID | Primary site | Secondary site | Type |
C1 | Site 1 | Site 3 | Remote |
C2 | Site 2 | Site 3 | Local |
C3 | Site 1 | Site 3 | Remote |
Chunk ID | Primary site | Secondary site | Type |
C1 | Site 1 | Site 3 | Encoded |
C2 | Site 2 | Site 3 | Encoded |
C3 | Site 1 | Site 3 | Copy |
C4 | Site 3 | | Parity (C1 & C2) |
Example 1: If Site 3 is removed because of a PSO, the secondary sites would become empty, but the primary sites and types would remain. Until a new replication target is added, any new writes would have a primary site listed but no secondary site.
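Example 1 can be sketched in a few lines of Python. The function name and the tuple layout are hypothetical, and the rows are taken from the first source-site table in this example (assumed to be Site 1's table). The sketch only shows the bookkeeping effect described above: the secondary site is cleared while the primary site and type are kept.

```python
from typing import List, Optional, Tuple

# (chunk_id, primary_site, secondary_site, chunk_type)
Row = Tuple[str, str, Optional[str], str]


def remove_replication_target(table: List[Row], failed_target: str) -> List[Row]:
    """Clear the secondary site on every row that pointed at the failed target."""
    return [
        (chunk_id, primary, None if secondary == failed_target else secondary, chunk_type)
        for chunk_id, primary, secondary, chunk_type in table
    ]


source_table: List[Row] = [
    ("C1", "Site 1", "Site 3", "Local"),
    ("C2", "Site 2", "Site 3", "Remote"),
    ("C3", "Site 1", "Site 3", "Local"),
]

after_pso = remove_replication_target(source_table, "Site 3")
for row in after_pso:
    print(row)
```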
Example 2: If Site 1 is removed because of a PSO, the following activities would occur:
After a permanent site failover of Site 1 is complete, the chunk manager tables of the two remaining sites would resemble Table 29 and Table 30.
Chunk ID | Primary site | Secondary site | Type |
C1 | Site 3 | | Remote |
C2 | Site 2 | Site 3 | Local |
C3 | Site 3 | | Remote |
Chunk ID | Primary site | Secondary site | Type |
C1 | Site 3 | | Local |
C2 | Site 2 | Site 3 | Copy |
C3 | Site 3 | | Local |
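The scan for chunks with no secondary site, described earlier for the case where a new site joins the replication group, can be sketched against the rows of the post-failover table above. The function name and tuple layout are hypothetical, and the sketch only lists the chunks that would need a secondary site; it does not model the replication itself.

```python
from typing import List, Optional, Tuple

# (chunk_id, primary_site, secondary_site, chunk_type)
Row = Tuple[str, str, Optional[str], str]


def chunks_without_secondary(table: List[Row]) -> List[str]:
    """Return the IDs of chunks that have no secondary site listed."""
    return [chunk_id for chunk_id, _primary, secondary, _type in table if secondary is None]


# Rows from the replication target's table after the PSO of Site 1.
target_after_pso: List[Row] = [
    ("C1", "Site 3", None, "Local"),
    ("C2", "Site 2", "Site 3", "Copy"),
    ("C3", "Site 3", None, "Local"),
]

unprotected = chunks_without_secondary(target_after_pso)
print(unprotected)  # ['C1', 'C3']
```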
Until a new source site is added, any new writes will have a primary site of Site 2 with a type of Local and a secondary site of Site 3 with a type of Copy.
After a new source site is added to the replication group, data durability to protect against site-wide failure will be reestablished by adding a secondary site and replicating the data to it. XOR operations will also resume on the replication target. Table 31, Table 32, and Table 33 show the new chunk manager tables.
Chunk ID | Primary site | Secondary site | Type |
C1 | Site 1 | Site 3 | Local |
C2 | Site 2 | Site 3 | Remote |
C3 | Site 1 | Site 3 | Local |
Chunk ID | Primary site | Secondary site | Type |
C1 | Site 1 | Site 3 | Remote |
C2 | Site 2 | Site 3 | Local |
C3 | Site 1 | Site 3 | Remote |
Chunk ID | Primary site | Secondary site | Type |
C1 | Site 1 | Site 3 | Encoded |
C2 | Site 2 | Site 3 | Encoded |
C3 | Site 1 | Site 3 | Copy |
C4 | Site 3 | | Parity (C1 & C2) |
ECS can recover from multiple site failures if both PSO and data recovery operations are complete between the site outages. If the second site fails before recovery is complete:
In a four-site scenario, successful recovery from the loss of all but one site occurs as follows (assumes sufficient space exists to store all the data in the remaining site):
A three-site federation, containing Site 1, Site 2, and Site 3, remains.
A two-site federation, containing Site 1 and Site 3, remains.
A single-site federation, containing Site 3, remains.
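The shrinking federation in this scenario can be traced with a small Python sketch. The function is hypothetical, and the failure order (Site 4, then Site 2, then Site 1) is inferred from the federations listed above rather than stated explicitly. The sketch assumes each PSO and its data recovery finish before the next site fails.

```python
from typing import List, Tuple


def sequential_pso(federation: List[str], failures: List[str]) -> List[Tuple[str, ...]]:
    """Return the federation membership after each completed PSO."""
    remaining = list(federation)
    history = []
    for site in failures:
        if len(remaining) <= 1:
            raise ValueError("cannot remove the last remaining site")
        remaining.remove(site)  # raises ValueError if the site is unknown
        history.append(tuple(remaining))
    return history


history = sequential_pso(
    ["Site 1", "Site 2", "Site 3", "Site 4"],
    ["Site 4", "Site 2", "Site 1"],  # inferred failure order
)
for members in history:
    print(members)
```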
This example walked through multiple site failures. However, this example is not a normal scenario; permanent site failures are generally caused by disaster scenarios such as earthquakes and fires and do not commonly occur in multiple locations in short succession. Typically, after a single site permanently fails, a new site is added before a subsequent site fails.
Beginning with version 3.7, ECS supports recovery from multiple (N-1) simultaneous site failures. This procedure, which can shorten recovery time, only supports the replication group setting with replication to all sites. To get support for this operation, contact your Dell representative.