
Announcing Drain-based Nondisruptive Upgrades (NDUs)

Thu, 16 Sep 2021 17:43:39 -0000


Vincent Shen

During a nondisruptive upgrade (NDU) workflow, nodes are rebooted or their protocol services are stopped temporarily. Until now, this has disrupted any clients connected to the affected node.

A drain-based NDU provides a mechanism by which nodes are prevented from rebooting or restarting protocol services until all SMB clients have disconnected from the node. Because a single SMB client that does not disconnect can cause the upgrade to be delayed indefinitely, the user is now provided with options to reboot the node despite persisting clients.

A drain-based upgrade supports the following scenarios and is available through the WebUI, CLI, and PAPI:

SMB protocol
OneFS upgrades
Firmware upgrades
Cluster reboots
Combined upgrades (OneFS and firmware)

A drain-based upgrade is built on the parallel upgrade workflow, introduced in OneFS 8.2.2.0, which upgrades and reboots nodes in parallel across node neighborhoods. It upgrades at most one node per neighborhood at any time. This shortens the overall upgrade time while ensuring that end users retain access to their data. The more node neighborhoods a cluster has, the more parallel activity can occur.

Figure 1 shows how it works. In this example, there are two neighborhoods in a 6-node PowerScale cluster. Nodes 1 through 3 belong to Neighborhood 1; Nodes 4 through 6 belong to Neighborhood 2.

Figure 1: An example of drain-based NDU
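
To make the one-node-per-neighborhood rule concrete, here is a minimal Python sketch that groups the six nodes from the example into upgrade rounds. The function name `plan_upgrade_rounds` and the lockstep "rounds" are illustrative assumptions, not OneFS internals; OneFS hands out reservations itself and decides the actual node order.

```python
from collections import defaultdict


def plan_upgrade_rounds(node_to_neighborhood):
    """Group nodes into rounds with at most one node per neighborhood per round.

    Hypothetical helper for illustration only; it models the rule, not OneFS.
    """
    # Build one ordered queue of nodes per neighborhood.
    queues = defaultdict(list)
    for node, neighborhood in sorted(node_to_neighborhood.items()):
        queues[neighborhood].append(node)

    rounds = []
    while any(queues.values()):
        # Take the next waiting node from each neighborhood that still has one.
        current = [queues[n].pop(0) for n in sorted(queues) if queues[n]]
        rounds.append(current)
    return rounds


# Layout from the example: Nodes 1-3 in Neighborhood 1, Nodes 4-6 in Neighborhood 2.
layout = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2}
rounds = plan_upgrade_rounds(layout)
for number, nodes in enumerate(rounds, start=1):
    print(f"round {number}: nodes {nodes}")
```

With two neighborhoods, the six nodes fit into three rounds instead of six sequential reboots. A cluster with a single neighborhood would gain no parallelism under this model.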

You can use the following command to identify the correlation between your PowerScale nodes and neighborhoods (failure domains):

# sysctl efs.lin.lock.initiator.coordinator_weights

Once the drain-based upgrade starts, at most one node from each neighborhood gets a reservation, which allows the reserved nodes to upgrade simultaneously. OneFS will not reboot these nodes until the number of SMB clients is “0”. In this example, Node 3 and Node 4 get the reservation for upgrading at the same time. However, Node 3 has one SMB connection and Node 4 has two. Neither node can reboot until its SMB connection count drops to “0”. At this stage, there are three options:

Wait - Wait until the number of SMB connections reaches “0” or the drain timeout expires. The drain timeout is a configurable parameter of each upgrade process and sets the maximum waiting period. A drain timeout of “0” means wait forever.
Delay drain - Add the node to the delay list to delay client draining. The upgrade process will continue on another node in this neighborhood. After all the non-delayed nodes are upgraded, OneFS will return to the node in the delay list.
Skip drain - Stop waiting for clients to migrate away from the draining node and reboot immediately.
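
The three options can be summarized as a small decision function. This is a sketch under stated assumptions, not OneFS code: the names `Action` and `drain_decision` are hypothetical, and it assumes that a node proceeds to reboot once the drain timeout expires, which the description of the timeout as a maximum waiting period suggests.

```python
from enum import Enum


class Action(Enum):
    """Operator choices for a node that still has SMB clients (hypothetical names)."""
    WAIT = "wait"
    DELAY = "delay"
    SKIP = "skip"


def drain_decision(smb_connections, waited_minutes, drain_timeout_minutes,
                   action=Action.WAIT):
    """Return what happens next to a node holding an upgrade reservation."""
    if smb_connections == 0:
        return "reboot"        # fully drained, safe to reboot
    if action is Action.SKIP:
        return "reboot"        # stop waiting and reboot immediately
    if action is Action.DELAY:
        return "delay"         # put the node on the delay list, revisit it later
    # Action.WAIT: a drain timeout of 0 means wait forever.
    if drain_timeout_minutes > 0 and waited_minutes >= drain_timeout_minutes:
        return "reboot"        # assumption: timeout expiry ends the wait
    return "keep_waiting"


print(drain_decision(1, 10, 60))               # Node 3 early in the wait
print(drain_decision(2, 10, 60, Action.SKIP))  # Node 4 after the operator skips
```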

To run the drain-based NDU, follow these steps:

1. In the OneFS CLI, run the following command to perform the drain-based upgrade. In this example, the drain timeout is set to 60 minutes and the alert timeout to 45 minutes. This means that if any SMB connection is still open after 45 minutes, a CELOG notification is sent to the administrator.

# isi upgrade start --parallel --skip-optional --install-image-path=/ifs/data/<installation-file-name> --drain-timeout=60m --alert-timeout=45m

When the draining service detects an active SMB connection between a client and the PowerScale node, it waits for further action (wait, delay, or skip) from the end user.
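
As a rough illustration of how the two timeouts in this example relate, the following Python sketch reports the state of a node whose SMB clients never disconnect. The function `drain_status` and its status strings are hypothetical; only the 45-minute alert, the 60-minute drain timeout, and the wait-forever meaning of “0” come from the text above.

```python
def drain_status(minutes_waited, alert_timeout=45, drain_timeout=60):
    """Describe a node that still has SMB connections after minutes_waited.

    Illustrative model only; defaults mirror the example command
    (--alert-timeout=45m, --drain-timeout=60m).
    """
    # A drain timeout of 0 means wait forever, so it never expires.
    if drain_timeout > 0 and minutes_waited >= drain_timeout:
        return "drain timeout reached"
    if minutes_waited >= alert_timeout:
        return "CELOG alert raised, still waiting"
    return "waiting"


for minute in (30, 45, 60):
    print(minute, drain_status(minute))
```

Setting the alert timeout below the drain timeout, as in the example, gives the administrator a window (here 15 minutes) to choose Skip or Delay before the maximum waiting period ends.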

2. In the OneFS WebUI, navigate to Upgrade under Cluster management. In this window you will see the node waiting for draining clients. You can either specify Skip or Delay. In this case, Skip is selected as shown in Figure 2. In the prompt window, click the Skip button to skip draining.

Figure 2. Skip the draining clients

Conclusion

Drain-based NDU can minimize the business impact during the OneFS upgrade process by allowing you to control how and when clients disconnect from the PowerScale cluster. This new feature can significantly improve the user experience and business continuity.

Author: Vincent Shen


