With large clusters, the group state output can often be complex and difficult to parse. For example, consider the following group state report from an extra-large cluster:
# sysctl efs.gmp.group
efs.gmp.group: <47,3501>: { 1:0-22, 2-3:0-17, 4, 5:0-11, 6-10:0-22, 11-13:0-23, 14:0-11,13-19,21-23, 15:0-22, 16-31:0-23, 32-38:0-17, 39-41:0-35, 42:0-14,16-35, 43-45:0-33, 46-48:0-35, 49-53:0-22, 54-69, 70-80:0-11, 81:0-10, 82-89,91-93,95-126:0-11, 127-129:0-10, 130-133:0-11, diskless: 4, 54-69, smb: 1-89,91-93,95-133, nfs: 1-89,91-93,95-133, hdfs: 1-89,91-93,95-133, all_enabled_protocols: 1-89,91-93,95-133, isi_cbind_d: 1-89,91-93,95-133, lsass: 1-89,91-93,95-133, s3: 1-89,91-93,95-133 }
From this output, we can make determinations such as the following (a short parsing sketch after this list illustrates the range arithmetic):
- The cluster consists of 131 nodes, with IDs 90 and 94 unused (1-89,91-93,95-133)
- NFS, HDFS, SMB, and S3 protocols are running on all nodes (all_enabled_protocols: 1-89,91-93,95-133)
- Seventeen of the cluster's nodes are diskless accelerators (diskless: 4, 54-69)
- Node 1 reports 23 drives (1:0-22), indicating either one SSD or a failed drive
- Node 14 has drives 12 and 20 missing (14:0-11,13-19,21-23)
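The node and drive lists in the group output are plain comma-separated ranges, so determinations like these are easy to verify programmatically. As a quick illustration only (a standalone Python sketch, not a OneFS utility), the following expands a range string and confirms the counts above:

def expand_ranges(spec):
    # Expand a GMP range string such as '1-89,91-93,95-133' into a list of ints.
    ids = []
    for part in spec.split(','):
        part = part.strip()
        if '-' in part:
            lo, hi = part.split('-')
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids

nodes = expand_ranges('1-89,91-93,95-133')
print(len(nodes))                               # 131 -- total node count
print(sorted(set(range(1, 134)) - set(nodes)))  # [90, 94] -- unused node IDs

drives = expand_ranges('0-11,13-19,21-23')      # node 14's drive list
print(sorted(set(range(24)) - set(drives)))     # [12, 20] -- missing drives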
If more detail is desired, the following command will report extensive current GMP info:
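# sysctl efs.gmp.current_info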
Additional OneFS group management considerations include:
- Multiple drive outages or failures can cause considerable group state churn, so the best practice is to replace any failed drives promptly.
- Within OneFS, the /ifs/.ifsvar directory contains most of the cluster’s configuration and contextual information. With a high node count and under heavily degraded conditions, GMP can still hold quorum with eight nodes down; in such a situation, portions of the /ifs/.ifsvar directory structure may be unavailable.
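For context, GMP quorum follows the standard majority rule: strictly more than half of the cluster's nodes must remain connected. A quick illustrative check (plain Python, assuming only that majority definition):

def has_quorum(total_nodes, nodes_up):
    # Quorum requires strictly more than half of the nodes to be up.
    return nodes_up > total_nodes // 2

print(has_quorum(131, 123))  # True: quorum holds with 8 nodes down
print(has_quorum(131, 65))   # False: majority lost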
- Assuming all nodes are connected to the network, be aware of the impact of group changes. For instance, if a node is rebooted, the back-end updates between nodes can be disruptive for some protocols and applications.
- Other protocols, such as NFS and SMB3 with continuous availability, will handle the disruption gracefully.
- Avoid direct (non-dynamic) IP connections to the cluster; use the SmartConnect VIP instead.
- At large cluster scale, a group change resulting from adding, removing, or rebooting a node can impact I/O for 15 seconds or more. Similarly, a drive stall event can delay I/O for two or more seconds.
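Applications that cannot tolerate a pause of this length may prefer to retry rather than fail outright. As a hypothetical illustration (client-side Python, not a OneFS API; the retry_io helper and its 20-second window are assumptions sized to exceed the ~15-second figure above):

import time

def retry_io(operation, window=20.0, delay=1.0):
    # Retry a failing I/O callable until the time window elapses.
    deadline = time.monotonic() + window
    while True:
        try:
            return operation()
        except OSError:
            if time.monotonic() >= deadline:
                raise          # window exhausted; surface the error
            time.sleep(delay)  # wait out the group change, then retry

# Example: read a file, riding out a transient group change.
# data = retry_io(lambda: open('/ifs/data/somefile').read())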
- The following sysctls can help reduce excessive GMP activity on a busy extra-large cluster by increasing its tolerance to ping timeouts. Only modify these settings under the direction and supervision of Dell Support.
# isi_sysctl_cluster efs.rbm.dwt_handle_pings=1
# isi_sysctl_cluster net.inet.sdp.fin_wait_timeout=10
# isi_sysctl_cluster net.inet.sdp.time_wait_timeout=3
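Before changing these settings, the current values can be checked on a node with standard sysctl reads, for example:
# sysctl efs.rbm.dwt_handle_pings
# sysctl net.inet.sdp.fin_wait_timeout
# sysctl net.inet.sdp.time_wait_timeout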
- In OneFS, no automatic repair job is started when a node is lost; manual intervention is required. In the past, OneFS did have a ‘down for time’ timeout, after which FlexProtect would start in the presence of a down node. This did not work well in practice, given the transient nature of some node failures (plus maintenance), and it often resulted in more total repair work (the initial repair, plus the ‘un-repair’ when the node was returned to the group).
- With newer nodes having swappable journals, and with the disk tango (drive replacement) procedure more frequently employed, fixing a node and returning it to service is now commonplace.
- The SmartConnect process will continue to give out IP addresses during a group merge or split.
- OneFS 8.2 and later replaces isi_boot_d with isi_array_d and adopts the Paxos consensus protocol as part of the infrastructure scale enhancements to support 252-node clusters.
- OneFS 9.0 sees the addition of the S3 object protocol.
Further information is available in the OneFS Cluster Composition, Quorum, and Group State white paper.