Permanent site outage (PSO)

If a disaster occurs at a site and the administrator determines the site is unrecoverable, the administrator can initiate a permanent site failover (remove the VDC from the federation). When a permanent site failover is initiated, all chunks from the failed site are recovered on the remaining sites to reestablish data durability.
During recovery, each remaining site scans its local chunk manager table for chunks that reference the failed site.
Each chunk that is found is handled according to its type:
- Encoded:
  1. For chunks whose type is Encoded and whose primary site is online, ECS re-creates the data locally using the data from the primary site. When re-creation is complete, it marks this chunk as a Copy type.
  2. ECS re-creates the Encoded chunk whose primary site is the failed site by performing an XOR operation of the previously re-created Copy types with the Parity chunk. This site becomes the chunk’s primary site and the chunk is marked as Local type.
  3. These chunks are added to the replication queue to be replicated to other sites listed within the replication group.
- Copy: If the chunk's primary site is the failed site, the site holding the copy becomes the new primary site. It then adds the chunk to its replication queue to be replicated to a new secondary site.
- Local: If its secondary site is the failed site, a task is inserted to replicate the chunk to a new secondary site.
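Step 2 of the Encoded case works because XOR is its own inverse: if the parity was computed as A ⊕ B, then A ⊕ parity yields B. The following is a minimal Python sketch of that property only. The payloads and the helper name `xor_bytes` are hypothetical, and the buffers are assumed to have equal length; it does not represent how ECS stores or processes chunks internally.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Hypothetical chunk payloads of equal length (illustration only).
copy_chunk = b"chunk-two-data.."   # re-created Copy; its primary site is online
lost_chunk = b"chunk-three-data"   # chunk whose primary site has failed

# Parity as it would have been computed before the outage.
parity = xor_bytes(copy_chunk, lost_chunk)

# Recovery: XOR the re-created Copy with the Parity chunk.
rebuilt = xor_bytes(copy_chunk, parity)
```

Here `rebuilt` holds the same bytes as `lost_chunk`, which is why the site can rebuild the chunk without contacting the failed site.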
Once permanent site failover starts, access to the data owned by the failed site will not be available until after the permanent site failover process is complete. Replication of data is separate from failover operations; as such, replication does not have to be complete for access to the data owned by the failed site to be restored.
This example is of a three-site configuration where Site 1 fails. Table 22 and Table 23 are the chunk manager tables for the remaining sites.
Table 22. Site 2 chunk manager table

Chunk ID	Primary site	Secondary site	Type
C1	Site 1	Site 2	Copy
C2	Site 2	Site 3	Local
C3	Site 1	Site 3	Remote
C4	Site 2	Site 1	Local

In this example, Site 2 acts as follows:
1. Adds chunk C1 to the replication queue to be replicated. Site 2 becomes the new primary site, and the site with the new chunk becomes the new secondary site.
2. Adds chunk C4 to the replication queue to be replicated and updates the secondary site in the table.
Table 23. Site 3 chunk manager table

Chunk ID	Primary site	Secondary site	Type
C1	Site 1	Site 2	Remote
C2	Site 2	Site 3	Encoded
C3	Site 1	Site 3	Encoded
C4	Site 2	Site 1	Remote
C5	Site 3	(none)	Parity (C2 and C3)

Site 3 acts as follows:
1. Re-creates chunk C2 data locally using the data from the primary site (Site 2) and changes chunk type to Copy.
2. Reconstructs chunk C3 using C2 data and the C5 parity data using the XOR operation C2 ⊕ C5. Site 3 becomes the new primary site.
3. Deletes chunk C5.
4. Adds chunk C3 to the replication queue to be re-replicated. The site with the new chunk becomes the new secondary site.
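The per-type rules can be sketched as a small planning function that is run against each remaining site's table. This is an illustrative Python model only: the `Chunk` class, the `members` field, the action strings, and `plan_pso` are hypothetical names, and the sketch assumes each parity chunk covers exactly two chunks of which at most one is lost. The data mirrors Table 22 and Table 23.

```python
from dataclasses import dataclass, field
from typing import List, Optional

REFETCH = "re-create from primary, mark Copy"
REBUILD = "rebuild by XOR, mark Local, queue replication"
PROMOTE = "become primary, queue replication"
REPLICATE = "queue replication to new secondary"


@dataclass
class Chunk:
    chunk_id: str
    primary: Optional[str]
    secondary: Optional[str]
    type: str  # Local, Copy, Remote, Encoded, or Parity
    members: List[str] = field(default_factory=list)  # Parity chunks only


def plan_pso(table, failed):
    """Return (chunk_id, action) pairs for one site's chunk manager table."""
    by_id = {c.chunk_id: c for c in table}
    plan = []
    for c in table:
        if c.type == "Parity":
            lost = [m for m in c.members if by_id[m].primary == failed]
            if not lost:
                continue
            # Encoded chunks with an online primary are fetched again first.
            plan += [(m, REFETCH) for m in c.members if m not in lost]
            # The lost chunk is then rebuilt from those copies and the parity.
            plan += [(m, REBUILD) for m in lost]
        elif c.type == "Copy" and c.primary == failed:
            plan.append((c.chunk_id, PROMOTE))
        elif c.type == "Local" and c.secondary == failed:
            plan.append((c.chunk_id, REPLICATE))
    return plan


site2 = [
    Chunk("C1", "Site 1", "Site 2", "Copy"),
    Chunk("C2", "Site 2", "Site 3", "Local"),
    Chunk("C3", "Site 1", "Site 3", "Remote"),
    Chunk("C4", "Site 2", "Site 1", "Local"),
]
site3 = [
    Chunk("C1", "Site 1", "Site 2", "Remote"),
    Chunk("C2", "Site 2", "Site 3", "Encoded"),
    Chunk("C3", "Site 1", "Site 3", "Encoded"),
    Chunk("C4", "Site 2", "Site 1", "Remote"),
    Chunk("C5", "Site 3", None, "Parity", members=["C2", "C3"]),
]

site2_plan = plan_pso(site2, "Site 1")
site3_plan = plan_pso(site3, "Site 1")
```

With this data, `site2_plan` contains actions for C1 and C4 and `site3_plan` contains actions for C2 and C3, matching the two action lists above. Deleting the parity chunk C5 is left out of the sketch.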
Table 24 and Table 25 are the chunk manager tables for the two remaining sites after permanent site failover is complete.
Table 24. Site 2 chunk manager table after PSO is complete

Chunk ID	Primary site	Secondary site	Type
C1	Site 2	Site 3	Local
C2	Site 2	Site 3	Local
C3	Site 3	Site 2	Copy
C4	Site 2	Site 3	Local

Table 25. Site 3 chunk manager table after PSO is complete

Chunk ID	Primary site	Secondary site	Type
C1	Site 2	Site 3	Copy
C2	Site 2	Site 3	Copy
C3	Site 3	Site 2	Local
C4	Site 2	Site 3	Copy

PSO with passive geo-replication
Permanent site outage is handled slightly differently for data that is replicated using passive geo-replication. Passively geo-replicated data will not reestablish data durability during a PSO; instead, data durability is reestablished after a new third site is added to the replication group. The PSO operations are different, based on whether the site that has permanently failed is one of the source sites or the replication target site.
During recovery, each remaining site still scans its local chunk manager table for chunks that reference the failed site.
Each chunk that a site finds is handled according to its type:
- Encoded (exists on the replication target site):
  1. For chunks whose type is Encoded and whose primary site is online, ECS re-creates the data locally using the data from the primary site. When the re-creation is complete, it marks this chunk as a Copy type.
  2. ECS re-creates the Encoded chunk whose primary site is the failed source site by performing an XOR operation of the previously re-created Copy types with the Parity chunk. This site now becomes the chunk’s primary site and the chunk is marked as Local type.
  No secondary sites are created until after a third site is added to the replication group.
- Copy: If the chunk's primary site is the failed site, the site holding the copy becomes the new primary site, and the chunk type is changed to Local. No secondary sites are created until after a third site is added to the replication group.
- Local: If its secondary site (the replication target) is the failed site, no new secondary sites are created until after a third site is added to the replication group.
Post-PSO, a third site can be added to reestablish data durability and protect against site-wide failures. After a third site is added to the passive geo-replication group, the previous two sites scan their local chunk manager table looking for chunks with no secondary chunk listed. The following activities occur:
- Local chunks on a source site with no secondary chunk listed initiate the replication of a secondary chunk to the new replication target site. The chunk manager table is updated to include the new secondary chunk location.
- Local chunks on the replication target site initiate a replication of the chunk to a new source site. After replication is complete, the replication target site type changes from Local to Copy, and the source site type changes from Copy to Local. XOR operations continue on the destination as normal.
Once PSO starts on a source site, access to the data owned by the failed site is not available until after the permanent site failover process is complete. Once PSO is complete, access to the data is restored. Until a third site is added to the replication group, any new writes to the online source site are replicated to the replication target. However, XOR operations do not occur, because XOR runs only on chunks from two different source sites and all new chunks come from the same source site.
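The rule that XOR needs chunks from two different source sites can be illustrated with a simple pairing routine. This is a hedged Python sketch: the function name `pair_for_xor` and the first-come pairing order are assumptions made for illustration and are not a description of how ECS selects chunks.

```python
def pair_for_xor(copies):
    """Pair Copy chunks that come from different source sites.

    copies: list of (chunk_id, source_site) tuples in arrival order.
    Returns a list of (chunk_id, chunk_id) pairs eligible for XOR.
    """
    waiting = {}  # source site -> chunk IDs still waiting for a partner
    pairs = []
    for chunk_id, source in copies:
        # Look for a waiting chunk that came from a different source site.
        partner_site = next(
            (s for s, ids in waiting.items() if s != source and ids), None
        )
        if partner_site is None:
            waiting.setdefault(source, []).append(chunk_id)
        else:
            pairs.append((waiting[partner_site].pop(0), chunk_id))
    return pairs


# Two source sites online: C1 and C2 can be encoded, C3 waits as a Copy.
two_sources = pair_for_xor([("C1", "Site 1"), ("C2", "Site 2"), ("C3", "Site 1")])

# Only one source site online after a PSO: nothing can be paired.
one_source = pair_for_xor([("C5", "Site 2"), ("C6", "Site 2")])
```

With two source sites the routine returns one pair and leaves the third chunk unpaired; with a single source site it returns an empty list, which corresponds to XOR not running until a new source site is added.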
After PSO is complete, a third site can be added to the replication group to restore data durability and protect against site-wide failures. Also, the replication target can resume running XOR operations on any two chunks from different source sites marked as type Copy.
In the next example, Site 1 and Site 2 are the source sites, and Site 3 is the replication target site. Table 26 and Table 27 are the chunk manager tables of the two source sites. Table 28 is the chunk manager table of the replication target site.
Table 26. Site 1 source site chunk manager table

Chunk ID	Primary site	Secondary site	Type
C1	Site 1	Site 3	Local
C2	Site 2	Site 3	Remote
C3	Site 1	Site 3	Local

Table 27. Site 2 source site chunk manager table

Chunk ID	Primary site	Secondary site	Type
C1	Site 1	Site 3	Remote
C2	Site 2	Site 3	Local
C3	Site 1	Site 3	Remote

Table 28. Site 3 replication target chunk manager table

Chunk ID	Primary site	Secondary site	Type
C1	Site 1	Site 3	Encoded
C2	Site 2	Site 3	Encoded
C3	Site 1	Site 3	Copy
C4	Site 3	(none)	Parity (C1 & C2)

Example 1: If Site 3 is removed because of a PSO, the secondary sites would become empty, but the primary sites and types would remain. Until a new replication target is added, any new writes would have a primary site listed but no secondary site.
Example 2: If Site 1 is removed because of a PSO, the following activities would occur:
1. Site 3 becomes the new primary site with a type of Local for chunk C3; no site is listed as the secondary site.
2. Site 3 re-creates chunk C2 data using the data from the primary site (Site 2) and changes its chunk type to Copy.
3. Site 3 reconstructs chunk C1 using C2 data and the C4 parity data using the XOR operation C2 ⊕ C4. Site 3 becomes the new primary site; no secondary site is listed.
4. Site 3 deletes chunk C4.
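The steps in Example 2 can be traced with a small table transformation. The Python below is a sketch under stated assumptions: the dictionary layout, the `members` key, and the function name `passive_pso_on_target` are hypothetical, and only the metadata changes are modeled, not the data movement. The input mirrors Table 28.

```python
from copy import deepcopy


def passive_pso_on_target(table, failed, target):
    """Return the target site's table after a PSO of a failed source site."""
    new = deepcopy(table)
    for chunk_id, row in table.items():
        if row["type"] == "Parity":
            if not any(table[m]["primary"] == failed for m in row["members"]):
                continue  # parity set is unaffected by this outage
            for m in row["members"]:
                if table[m]["primary"] == failed:
                    # Rebuilt by XOR; the target becomes primary, no secondary.
                    new[m].update(primary=target, secondary=None, type="Local")
                else:
                    # Re-created from its online primary site.
                    new[m]["type"] = "Copy"
            del new[chunk_id]  # the parity chunk is deleted afterwards
        elif row["type"] == "Copy" and row["primary"] == failed:
            new[chunk_id].update(primary=target, secondary=None, type="Local")
    return new


table_28 = {
    "C1": {"primary": "Site 1", "secondary": "Site 3", "type": "Encoded"},
    "C2": {"primary": "Site 2", "secondary": "Site 3", "type": "Encoded"},
    "C3": {"primary": "Site 1", "secondary": "Site 3", "type": "Copy"},
    "C4": {"primary": "Site 3", "secondary": None, "type": "Parity",
           "members": ["C1", "C2"]},
}

after_pso = passive_pso_on_target(table_28, failed="Site 1", target="Site 3")
```

In `after_pso`, C1 and C3 have Site 3 as primary with type Local and no secondary site, C2 is a Copy of Site 2's chunk, and C4 is gone.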
Table 29 and Table 30 show the chunk manager tables of the two remaining sites after the permanent site failover of Site 1 is complete.
Table 29. Site 2, source site chunk manager table after PSO is complete

Chunk ID	Primary site	Secondary site	Type
C1	Site 3	(none)	Remote
C2	Site 2	Site 3	Local
C3	Site 3	(none)	Remote

Table 30. Site 3, replication target site chunk manager table after PSO is complete

Chunk ID	Primary site	Secondary site	Type
C1	Site 3	(none)	Local
C2	Site 2	Site 3	Copy
C3	Site 3	(none)	Local

Until a new source site is added, any new writes will have a primary site of Site 2 with a type of Local and a secondary site of Site 3 with a type of Copy.
After a new source site is added to the replication group, data durability to protect against site-wide failure will be reestablished by adding a secondary site and replicating the data to it. XOR operations will also resume on the replication target. Table 31, Table 32, and Table 33 show the new chunk manager tables.
- Chunks C1 and C3 are replicated to the new source site, Site 1. After replication is complete, the primary site is listed as Site 1 and the secondary site as Site 3.
- Site 3 performs XOR encoding on chunks C1 and C2, resulting in a new C4 chunk with a type of Parity. Chunks C1 and C2 types change to Encoded.
Table 31. New Site 1 chunk manager table after data durability is reestablished

Chunk ID	Primary site	Secondary site	Type
C1	Site 1	Site 3	Local
C2	Site 2	Site 3	Remote
C3	Site 1	Site 3	Local

Table 32. Site 2 chunk manager table after new Site 1 is added and data durability is reestablished

Chunk ID	Primary site	Secondary site	Type
C1	Site 1	Site 3	Remote
C2	Site 2	Site 3	Local
C3	Site 1	Site 3	Remote

Table 33. Site 3 replication target site chunk manager table after new Site 1 is added and data durability is reestablished

Chunk ID	Primary site	Secondary site	Type
C1	Site 1	Site 3	Encoded
C2	Site 2	Site 3	Encoded
C3	Site 1	Site 3	Copy
C4	Site 3	(none)	Parity (C1 & C2)
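
The change on the replication target from Table 30 to Table 33 can be traced in two steps: hand the Local chunks over to the new source site, then XOR-encode two Copy chunks from different source sites. This Python sketch models metadata only; the dictionary layout and the function names `restore_durability` and `xor_encode` are hypothetical.

```python
def restore_durability(target_table, new_source, target):
    """Hand Local chunks with no secondary over to the new source site."""
    new = {cid: dict(row) for cid, row in target_table.items()}
    for row in new.values():
        if row["type"] == "Local" and row["secondary"] is None:
            # After replication, the new source site owns the chunk and the
            # replication target holds a Copy.
            row.update(primary=new_source, secondary=target, type="Copy")
    return new


def xor_encode(table, a, b, parity_id, target):
    """Mark two Copy chunks from different sources as Encoded; add parity."""
    if table[a]["type"] != "Copy" or table[b]["type"] != "Copy":
        raise ValueError("only Copy chunks can be encoded")
    if table[a]["primary"] == table[b]["primary"]:
        raise ValueError("chunks must come from different source sites")
    table[a]["type"] = table[b]["type"] = "Encoded"
    table[parity_id] = {"primary": target, "secondary": None,
                        "type": "Parity", "members": [a, b]}


table_30 = {
    "C1": {"primary": "Site 3", "secondary": None, "type": "Local"},
    "C2": {"primary": "Site 2", "secondary": "Site 3", "type": "Copy"},
    "C3": {"primary": "Site 3", "secondary": None, "type": "Local"},
}

table_33 = restore_durability(table_30, new_source="Site 1", target="Site 3")
xor_encode(table_33, "C1", "C2", parity_id="C4", target="Site 3")
```

After both steps, `table_33` lists C1 and C2 as Encoded, C3 as a Copy with Site 1 as primary, and a new parity chunk C4 on Site 3.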

Recoverability from multiple site failures
ECS can recover from multiple site failures if both PSO and data recovery operations are complete between the site outages. The following constraints apply:
- For permanent site outage recovery operations to run, all other sites in the system must be online. If multiple sites fail concurrently, all but one must recover from the temporary site outage (TSO) before a PSO can be run on the remaining failed site.
- If a second site failure occurs after PSO is complete but before data recovery is complete, some data might be lost.
In a four-site scenario, successful recovery from the loss of all but one site occurs as follows (assumes sufficient space exists to store all the data in the remaining site):
- Site 4 fails.
  1. Administrator initiates a PSO operation to remove Site 4.
  2. Data is recovered on the remaining sites to reestablish data durability.
  A three-site federation, containing Site 1, Site 2, and Site 3, remains.
- A second site (Site 2) fails sometime after PSO and data recovery are complete.
  1. Administrator initiates a PSO operation to remove Site 2.
  2. Data is recovered on the remaining sites to reestablish data durability.
  A two-site federation, containing Site 1 and Site 3, remains.
- A third site (Site 1) fails sometime after PSO and data recovery are complete.
  1. Administrator initiates a PSO operation to remove Site 1.
  A single-site federation, containing Site 3, remains.
This example walked through multiple site failures. However, this example is not a normal scenario; permanent site failures are generally caused by disaster scenarios such as earthquakes and fires and do not commonly occur in multiple locations in short succession. Typically, after a single site permanently fails, a new site is added before a subsequent site fails.
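The precondition that all other sites must be online before a PSO can run can be modeled with a small state machine. This is an illustrative Python sketch only: the `Federation` class and its methods are hypothetical and are not an ECS API, and the sketch does not model whether data recovery has finished between outages.

```python
class Federation:
    """Toy model of site membership during sequential site failures."""

    def __init__(self, sites):
        self.online = set(sites)
        self.failed = set()

    def fail(self, site):
        self.online.remove(site)
        self.failed.add(site)

    def recover(self, site):
        """The site returns from a temporary site outage."""
        self.failed.remove(site)
        self.online.add(site)

    def pso(self, site):
        """Permanently remove a failed site from the federation."""
        if site not in self.failed:
            raise ValueError(f"{site} has not failed")
        if self.failed - {site}:
            # Every other site must be online before a PSO can run.
            raise RuntimeError("other sites are still offline")
        self.failed.remove(site)


# The four-site walk-through: one failure and one PSO at a time.
fed = Federation(["Site 1", "Site 2", "Site 3", "Site 4"])
for site in ["Site 4", "Site 2", "Site 1"]:
    fed.fail(site)
    fed.pso(site)
remaining = sorted(fed.online)
```

In this model `remaining` ends up containing only Site 3, and calling `pso` while a second site is still offline raises an error until that site has recovered.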
Beginning with version 3.7, ECS supports recovery from multiple (N-1) simultaneous site failures. This procedure, which can shorten recovery time, only supports the replication group setting with replication to all sites. To get support for this operation, contact your Dell representative.
