Combined with the dynamic allocation method, dynamic failover provides high availability by transparently migrating IP addresses to another node when an interface is not available. If a node becomes unavailable, all the IP addresses it was hosting are reallocated across the remaining available nodes in accordance with the configured failover load-balancing policy. The default IP address failover policy is round robin, which evenly distributes IP addresses from the unavailable node across the available nodes. Because an IP address stays the same regardless of which node hosts it, failover is transparent to the client.
The other available IP address failover policies are the same as the initial client connection balancing policies: connection count, throughput, and CPU usage. In most scenarios, round robin is both the best option and the most common one. However, the other failover policies are available for specific workflows. As with the initial load-balancing policy, test the IP failover policies in a lab environment to find the best option for a specific workflow.
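The round-robin reallocation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how SmartConnect implements it: the function name, node names, and IP addresses are hypothetical, and the order in which surviving nodes receive addresses is an assumption.

```python
from itertools import cycle


def fail_over_round_robin(allocation, failed_node):
    """Return a new node -> IP list mapping after failed_node goes offline.

    allocation maps a node name to the list of IP addresses it hosts.
    The failed node's addresses are handed to the surviving nodes in turn.
    """
    survivors = sorted(node for node in allocation if node != failed_node)
    if not survivors:
        raise ValueError("no surviving nodes to receive the IP addresses")
    # Copy the surviving nodes' lists so the input mapping is not modified.
    result = {node: list(allocation[node]) for node in survivors}
    # zip stops when the failed node's addresses run out.
    for ip, target in zip(allocation[failed_node], cycle(survivors)):
        result[target].append(ip)
    return result


# Hypothetical three-node pool with two addresses per node.
pool = {
    "node1": ["10.0.0.1", "10.0.0.4"],
    "node2": ["10.0.0.2", "10.0.0.5"],
    "node3": ["10.0.0.3", "10.0.0.6"],
}
after = fail_over_round_robin(pool, "node3")
for node, ips in after.items():
    print(node, ips)
```

In this sketch, node1 and node2 each pick up one of the two addresses from node3, so both end with three addresses. The addresses themselves never change, which is why clients do not need to reconnect to a different IP.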
The following examples illustrate how IP addresses move during a dynamic failover and how the number of IP addresses affects the user experience during a failover. Use these guidelines when determining how many IP addresses to allocate.
In this scenario, a four-node cluster has one IP address per node, and 150 clients are actively connected to each node over NFS using a connection policy of round robin. Most NFSv3 clients perform an nslookup only the first time that they mount and never perform another one to check for an updated IP address. If the IP address changes, those clients keep using the original IP address and are left with a stale mount.
Suppose that one of the nodes fails, as shown in the following figure:
A SmartConnect zone with dynamic allocation for IP addresses immediately hot-moves the one IP address on the failed node to one of the other three nodes in the cluster. It sends out several gratuitous Address Resolution Protocol (ARP) requests to the connected switch, so that client I/O continues uninterrupted.
Although all four IP addresses are still online, two of them, and therefore 300 clients, are now connected to one node. SmartConnect can move a single IP address to only one node, and each of the other nodes already hosts one IP address and 150 clients. The failed node has therefore doubled the load on one of the three remaining nodes while leaving the other two unaffected. Client performance declines, but not equally for all clients. The goal of any scale-out NAS solution must be consistency, and doubling the I/O on one node but not on the others is inconsistent.
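The imbalance in this first scenario can be reproduced with a short calculation. The sketch below is illustrative only: the node names and the function are hypothetical, and it assumes every IP address carries the same 150 clients and that the failed node's addresses are handed out round robin.

```python
CLIENTS_PER_IP = 150  # from the scenario: 150 clients per node, one IP per node

# node -> number of IP addresses hosted (hypothetical node names)
ips_per_node = {"node1": 1, "node2": 1, "node3": 1, "node4": 1}


def clients_after_failure(ips_per_node, failed_node, clients_per_ip):
    """Return node -> client count once failed_node's IPs have moved."""
    survivors = sorted(n for n in ips_per_node if n != failed_node)
    counts = {n: ips_per_node[n] for n in survivors}
    # Hand out the failed node's addresses one at a time, round robin.
    for i in range(ips_per_node[failed_node]):
        counts[survivors[i % len(survivors)]] += 1
    return {n: c * clients_per_ip for n, c in counts.items()}


load = clients_after_failure(ips_per_node, "node4", CLIENTS_PER_IP)
print(load)  # {'node1': 300, 'node2': 150, 'node3': 150}
```

With only one address to move, there is no way to split the failed node's 150 clients: one survivor takes all of them.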
To handle failover behavior, dynamic SmartConnect zones require, at a minimum, more IP addresses than there are nodes. In the following example, the formula used to calculate the number of IP addresses required is N*(N-1), where N is the number of nodes. The formula is for illustration purposes only. It demonstrates how IP addresses, and in turn clients, move from one node to another, and how this movement can lead to an imbalance across nodes. Every workflow and cluster is unique, and this formula does not apply to every scenario.
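One way to see why N*(N-1) balances a single-node failure is that it gives each node N-1 addresses, which matches the N-1 nodes that survive, so each survivor inherits exactly one address. The following sketch tabulates this for a few cluster sizes. The function name is hypothetical, and, like the formula itself, the output is illustrative rather than a sizing rule.

```python
def ips_for_even_failover(node_count):
    """Illustrative N*(N-1) count from the text, not a sizing rule."""
    if node_count < 2:
        raise ValueError("failover needs at least two nodes")
    return node_count * (node_count - 1)


print("nodes  total_ips  ips_per_node  ips_per_survivor_after_failure")
for n in (3, 4, 5, 8):
    total = ips_for_even_failover(n)
    per_node = total // n               # N-1 addresses on each node
    survivors = n - 1                   # nodes left after one failure
    extra_each = per_node // survivors  # addresses each survivor inherits
    print(n, total, per_node, per_node + extra_each)
```

For a four-node cluster this gives 12 addresses, three per node, and four per surviving node after one failure.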
This example considers the same four-node cluster as the previous example, but now follows the rule of N*(N-1). In this case, 4*(4-1) = 12 IP addresses, or three IP addresses per node, as shown in the following figure:
When the same failure event as in the previous example occurs, the three IP addresses on the failed node are spread across all the other nodes in that SmartConnect zone, one to each. Assuming that the 150 clients on each node are spread evenly across its three IP addresses (about 50 clients per address), each remaining node ends up with four IP addresses and 200 clients. Although performance might degrade to some degree, the effect is likely less drastic than in the first scenario, and the experience is consistent for all users, as shown in the following figure: