Other OneFS group management considerations include:
- Multiple drive outages or failures can cause considerable group state churn, so the best practice is to replace failed drives promptly.
- Within OneFS, the /ifs/.ifsvar directory contains most of the cluster’s configuration and contextual information. On a cluster with a high node count that is heavily degraded, GMP can still have quorum with eight nodes down. In that situation, portions of the /ifs/.ifsvar directory structure may be unavailable.
- Assuming all nodes are connected to the network, be aware of the impact of group changes. For instance, if a node is rebooted, the back-end updates between nodes can be disruptive for some protocols and applications.
- Other protocols, such as NFS and SMB3 with continuous availability, will handle the disruption gracefully.
- Avoid direct (non-dynamic) IP connections to the cluster; use the SmartConnect VIP instead.
- At large cluster scale, a group change resulting from adding, removing, or rebooting a node can impact I/O for 15 seconds or more. Similarly, a drive stall event can delay I/O by two or more seconds.
- The following sysctls can help reduce excessive GMP activity on a busy extra-large cluster by increasing its tolerance to ping timeouts. Only modify these settings under the direction and supervision of Dell Support.
# isi_sysctl_cluster efs.rbm.dwt_handle_pings=1
# isi_sysctl_cluster net.inet.sdp.fin_wait_timeout=10
# isi_sysctl_cluster net.inet.sdp.time_wait_timeout=3
- In OneFS, no repair job starts automatically when a node is lost; starting one requires manual intervention. Earlier OneFS releases had a ‘down for time’ timeout, after which FlexProtect would start if a node was still down. This did not work well in practice given the transient nature of some node failures (and planned maintenance), and it ended up causing more repair work: the initial repair, plus the ‘un-repair’ when the node returned to the group.
- With newer nodes having swappable journals, and with disk tango becoming a more frequent operation, fixing a node and returning it to service is more commonplace nowadays.
- The SmartConnect process will continue to give out IP addresses during a group merge or split.
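
The quorum remark above (GMP keeping quorum with eight nodes down) can be made concrete with a little arithmetic. The following is a minimal sketch, assuming quorum means a strict majority of the cluster's nodes (more than half must be up). The function names are hypothetical and are not part of any OneFS tooling.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest node count that is a strict majority of the cluster (assumed quorum rule)."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, nodes_down: int) -> bool:
    """True if the nodes still up form a strict majority."""
    return total_nodes - nodes_down >= quorum_size(total_nodes)


def min_cluster_size_tolerating(nodes_down: int) -> int:
    """Smallest cluster that keeps quorum with `nodes_down` nodes offline.

    A strict majority must survive: N - d > N / 2, which gives N > 2d.
    """
    return 2 * nodes_down + 1


if __name__ == "__main__":
    down = 8
    n = min_cluster_size_tolerating(down)
    print(f"Smallest cluster that keeps quorum with {down} nodes down: {n}")
    print(f"Quorum size for that cluster: {quorum_size(n)}")
```

Under this assumption, a cluster needs at least 17 nodes to retain quorum with eight nodes down, which is why the scenario only arises at higher node counts. Quorum alone does not guarantee that every part of /ifs/.ifsvar remains reachable.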