Interpreting group changes

Group changes may be caused by drive removals or replacements, node additions, node removals, node reboots or shutdowns, backend (internal) network events, and the transition of a node into read-only mode. For debugging purposes, group change messages can be reviewed to determine whether any devices are in a failure state.
When a group change occurs, a cluster-wide process writes a message describing the new group membership to /var/log/messages on every node. Similarly, if a cluster ‘splits’, the newly formed subclusters behave in the same way: each node records its group membership to /var/log/messages. When a cluster splits, it breaks into multiple clusters (multiple groups). This is rarely, if ever, a desirable event. A cluster is defined by its group members. Nodes or drives which lose sight of other group members no longer belong to the same group and therefore no longer belong to the same cluster.
The ‘grep’ CLI utility can be used to view group changes from one node’s perspective by searching /var/log/messages for the expression ‘new group’. This extracts the group change statements from the logfile. The output from this command may be lengthy, so it can be piped to the ‘tail’ command to limit it to the preferred number of lines.
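Where the same filtering needs to be scripted rather than typed at the prompt, the grep-and-tail pipeline can be approximated in Python. The following is a minimal sketch only; the function name last_group_changes is illustrative and is not a OneFS tool.

```python
from collections import deque
from typing import Iterable, List


def last_group_changes(lines: Iterable[str], count: int = 10) -> List[str]:
    """Return the last `count` lines that contain 'new group' (case-insensitive)."""
    # A bounded deque keeps only the most recent matches, much like piping to tail.
    recent = deque(maxlen=count)
    for line in lines:
        if "new group" in line.lower():
            recent.append(line.rstrip("\n"))
    return list(recent)
```

The function accepts any iterable of lines, so an open file handle for /var/log/messages can be passed to it directly.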
For the sake of clarity, the protocol information has been removed from the end of each group string in all the following examples. For example:
{ 1-3:0-11, smb: 1-3, nfs: 1-3, hdfs: 1-3, all_enabled_protocols: 1-3, isi_cbind_d: 1-3, lsass: 1-3, s3: 1-3 }
is represented as:
{ 1-3:0-11 }
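The same simplification can be applied to raw log output with a short script. This Python sketch is an illustration under stated assumptions: the list of protocol labels is taken only from the example string above, the helper name strip_protocols is hypothetical, and sections are assumed to be separated by a comma followed by a space.

```python
import re

# Protocol and service labels copied from the example group string above.
PROTOCOL_KEYS = {"smb", "nfs", "hdfs", "all_enabled_protocols",
                 "isi_cbind_d", "lsass", "s3"}

# A section label is a word followed by a colon and a space, for example "smb: ".
_LABEL = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*): ")


def strip_protocols(group: str) -> str:
    """Drop protocol sections from a group string and keep everything else."""
    body = group.strip().lstrip("{").rstrip("}").strip()
    kept = []
    dropping = False
    for token in body.split(", "):
        match = _LABEL.match(token)
        if match:
            # A labelled token starts a new section; decide whether to keep it.
            dropping = match.group(1) in PROTOCOL_KEYS
        if not dropping:
            kept.append(token)
    return "{ " + ", ".join(kept) + " }"
```

Sections such as ‘down’ are not in the label list, so they pass through unchanged.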
In the following example, the ‘tail -n 10’ command limits the output to the last ten group changes reported in the file:
tme-1# grep -i 'new group' /var/log/messages | tail -n 10
2024-04-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-4, down: 1:5-11, 2-3 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-5, down: 1:6-11, 2-3 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-6, down: 1:7-11, 2-3 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-7, down: 1:8-11, 2-3 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-8, down: 1:9-11, 2-3 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-9, down: 1:10-11, 2-3 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-10, down: 1:11, 2-3 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 1:0-11, down: 2-3 }
2024-04-15-T08:07:51 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-merge") new group: : { 1:0-11, 3:0-7,9-12, down: 2 }
2024-04-15-T08:07:52 -04:00 <0.4> tme-1 (id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-merge") new group: : { 1-2:0-11, 3:0-7,9-12 }
All the group changes in this set occur within two seconds of each other, so it is also worth looking further back in the logs, before the incident being investigated.
Here are some useful data points that can be gleaned from the example above:
1. The last line shows that the cluster’s nodes are operational and belong to the group. No nodes or drives report as down or split. (At some point in the past, drive ID 8 on node 3 was replaced, and the replacement disk was added successfully.)
2. Node 1 rebooted. In the first eight lines, each group change adds a drive on node 1 back into the group, while nodes 2 and 3 are inaccessible. This occurs on node reboot prior to any attempt to join an active group and is indicative of healthy behavior.
3. Node 3 forms a group with node 1 before node 2 does. This could suggest that node 2 rebooted while node 3 remained up.
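The group notation read in the points above can also be expanded programmatically. The following Python sketch is illustrative only: expand and parse_group are hypothetical helper names, not OneFS utilities, and the parsing assumes that entries are separated by a comma and a space while lists inside an entry use a bare comma, as in the examples shown here.

```python
from typing import Dict, List


def expand(ranges: str) -> List[int]:
    """Expand a compact list such as '0-7,9-12' into individual integers."""
    values: List[int] = []
    for part in ranges.split(","):
        low, _, high = part.partition("-")
        values.extend(range(int(low), int(high or low) + 1))
    return values


def parse_group(group: str) -> Dict[str, Dict[int, List[int]]]:
    """Map each section ('up', 'down', ...) to {node ID: [drive IDs]}.

    An empty drive list means the entry named a whole node without drives.
    """
    result: Dict[str, Dict[int, List[int]]] = {"up": {}, "down": {}}
    section = "up"  # entries before any label are the active group members
    body = group.strip().strip("{}").strip()
    for token in body.split(", "):
        label, sep, rest = token.partition(": ")
        if sep and label.isidentifier():
            # For example 'down: 1:5-11' switches to the 'down' section.
            section, token = label, rest
            result.setdefault(section, {})
        if not token:
            continue
        nodes, _, drives = token.partition(":")
        for node in expand(nodes):
            result[section].setdefault(node, []).extend(expand(drives) if drives else [])
    return result


final = parse_group("{ 1-2:0-11, 3:0-7,9-12 }")
print(sorted(final["up"]), final["down"])  # [1, 2, 3] {}
print(8 in final["up"][3])                 # False
```

Applied to the last line of the example, the sketch shows all three nodes in the group, nothing in the down section, and no drive ID 8 on node 3.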
A review of group changes from the other nodes’ logs can confirm this. In this case node 3’s logs show:
# grep -i 'new group' /var/log/messages | tail -10
2024-04-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-4, down: 1-2, 3:5-7,9-12 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-5, down: 1-2, 3:6-7,9-12 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-6, down: 1-2, 3:7,9-12 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-7, down: 1-2, 3:9-12 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-7,9, down: 1-2, 3:10-12 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-7,9-10, down: 1-2, 3:11-12 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-7,9-11, down: 1-2, 3:12 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1814="kt: gmp-drive-updat") new group: : { 3:0-7,9-12, down: 1-2 }
2024-04-15-T08:07:50 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1828="kt: gmp-merge") new group: : { 1:0-11, 3:0-7,9-12, down: 2 }
2024-04-15-T08:07:52 -04:00 <0.4> tme-3(id3) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1828="kt: gmp-merge") new group: : { 1-2:0-11, 3:0-7,9-12 }
Since node 3 rebooted at the same time as node 1, it is worth checking node 2's logs to see whether it also rebooted simultaneously. In this instance, the logfiles confirm that it did. Given that all three nodes rebooted at once, it is highly likely that this was a cluster-wide event rather than a single-node issue. OneFS ‘software watchdog’ timeouts (also known as softwatch or swatchdog), for example, can cause cluster-wide reboots. However, these are typically staggered rather than simultaneous. The softwatch process monitors the kernel and dumps a stack trace and/or reboots the node when the node is not responding. This helps to protect the cluster from the impact of heavy CPU starvation and aids issue detection and resolution.
If a cluster experiences multiple, staggered group changes, it can be extremely helpful to construct a timeline of the order and duration in which nodes are up or down. This info can then be cross-referenced with panic stack traces and other system logs to help diagnose the root cause of an event.
For example, in the following log excerpt, a fifteen-node cluster experiences six different node reboots over a twenty-minute period. These are the group change messages from node 14, which stayed up for the whole duration:
# grep -i 'new group' /var/log/messages
2024-04-10-T14:54:00 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1060="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21, diskless: 6-8, 19-21 }
2024-04-15-T06:44:38 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1060="kt: gmp-split") new group: : { 1-2:0-11, 6-8, 13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 9}
2024-04-15-T06:44:58 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-split") new group: : { 1:0-11, 6-8, 13-14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 9, 15}
2024-04-15-T06:45:20 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-split") new group: : { 1:0-11, 6-8, 14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 9, 13, 15}
2024-04-15-T06:47:09 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-merge") new group: : { 1:0-11, 6-8, 9,14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 13, 15}
2024-04-15-T06:47:27 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-split") new group: : { 6-8, 9,14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 1-2, 13, 15}
2024-04-15-T06:48:11 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 2102="kt: gmp-split") new group: : { 6-8, 9,14:0-11, 16:0,2-12, 17:0-11, 19-21, down: 1-2, 13, 15, 18}
2024-04-15-T06:50:55 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 2102="kt: gmp-merge") new group: : { 6-8, 9,13-14:0-11, 16:0,2-12, 17:0-11, 19-21, down: 1-2, 15, 18}
2024-04-15-T06:51:26 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 85396="kt: gmp-merge") new group: : { 2:0-11, 6-8, 9,13-14:0-11, 16:0,2-12, 17:0-11, 19-21, down: 1, 15, 18}
2024-04-15-T06:51:53 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 85396="kt: gmp-merge") new group: : { 2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17:0-11, 19-21, down: 1, 18}
2024-04-15-T06:54:06 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 85396="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17:0-11, 19-21, down: 18}
2024-04-15-T06:56:10 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 2102="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21}
2024-04-15-T06:59:54 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 85396="kt: gmp-split") new group: : { 1-2:0-11, 6-8, 9,13-15,17-18:0-11, 19-21, down: 16}
2024-04-15-T07:05:23 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21}
First, run the ‘isi_nodes "%{name}: LNN %{lnn}, Array ID %{id}"’ CLI command to map the cluster’s node names to their respective Array IDs.
Before the cluster node outage event on April 15, there was a group change on April 10:
2024-04-10-T14:54:00 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1060="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21, diskless: 6-8, 19-21 }
After that, all nodes came back online, and the cluster could be considered healthy. The cluster contains nine nodes with twelve drives apiece and six diskless nodes (accelerators). The Array IDs now extend to 21, and Array IDs 3 through 5 and 10 through 12 are missing. This confirms that six nodes were added to or removed from the cluster.
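The gap check described above can be expressed in a few lines. This Python sketch is illustrative; the function name missing_array_ids is hypothetical, and it assumes that Array IDs are numbered from 1.

```python
from typing import Iterable, List


def missing_array_ids(present: Iterable[int]) -> List[int]:
    """Return the Array IDs below the highest one seen that are absent from the group."""
    seen = set(present)
    if not seen:
        return []
    return [array_id for array_id in range(1, max(seen) + 1) if array_id not in seen]


# Array IDs listed in the April 10 group string above.
april_10 = [1, 2, 6, 7, 8, 9, 13, 14, 15, 16, 17, 18, 19, 20, 21]
print(missing_array_ids(april_10))  # [3, 4, 5, 10, 11, 12]
```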
The first event occurs at 06:44:38 on 15 April:
2024-04-15-T06:44:38 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1060="kt: gmp-split") new group: : { 1-2:0-11, 6-8, 13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 9, diskless: 6-8, 19-21 }
Node 14 identifies Array ID 9 (LNN 6) as having left the group.
Next, twenty seconds later, two more nodes (Array IDs 2 and 15) are marked as offline:
2024-04-15-T06:44:58 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-split") new group: : { 1:0-11, 6-8, 13-14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 9, 15, diskless: 6-8, 19-21 }
Twenty-two seconds later, another node goes offline:
2024-04-15-T06:45:20 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-split") new group: : { 1:0-11, 6-8, 14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 9, 13, 15, diskless: 6-8, 19-21 }
At this point, four nodes (LNNs 2, 6, 7, and 9, which correspond to Array IDs 2, 9, 13, and 15) are marked as being offline.
Almost two minutes later, the first node to go down (LNN 6, Array ID 9) rejoins the group:
2024-04-15-T06:47:09 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-merge") new group: : { 1:0-11, 6-8, 9,14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 2, 13, 15, diskless: 6-8, 19-21 }
However, eighteen seconds after LNN 6 comes back, node 1 leaves the group:
2024-04-15-T06:47:27 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-split") new group: : { 6-8, 9,14:0-11, 16:0,2-12, 17-18:0-11, 19-21, down: 1-2, 13, 15, diskless: 6-8, 19-21 }
Finally, the group returns to its original composition:
2024-04-15-T07:05:23 -04:00 <0.4> tme-14(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 1066="kt: gmp-merge") new group: : { 1-2:0-11, 6-8, 9,13-15:0-11, 16:0,2-12, 17-18:0-11, 19-21, diskless: 6-8, 19-21 }
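The timeline walked through above can also be derived by comparing the ‘down’ list of each message with that of the previous one. The following Python sketch is a rough illustration rather than a supported tool: the names down_nodes and timeline are hypothetical, the timestamp layout is inferred from the excerpts in this section, and the node numbers it reports are Array IDs, not LNNs.

```python
import re
from datetime import datetime
from typing import Iterable, List, Set, Tuple

# Timestamp layout inferred from the excerpts above; '%z' accepts '-04:00' on Python 3.7+.
STAMP = "%Y-%m-%d-T%H:%M:%S %z"
LINE = re.compile(r"^(\S+ \S+) .*new group: : \{(.*)\}\s*$")
DOWN = re.compile(r"down: (.*?)(?:, [A-Za-z_]+: |$)")

Event = Tuple[datetime, int, List[int], List[int]]


def down_nodes(group_body: str) -> Set[int]:
    """Return the Array IDs of whole nodes listed in the 'down:' section."""
    nodes: Set[int] = set()
    match = DOWN.search(group_body.strip())
    if not match:
        return nodes
    for token in match.group(1).split(", "):
        if not token or ":" in token:
            continue  # 'node:drives' entries describe down drives, not whole nodes
        for part in token.split(","):
            low, _, high = part.partition("-")
            nodes.update(range(int(low), int(high or low) + 1))
    return nodes


def timeline(lines: Iterable[str]) -> List[Event]:
    """For each message after the first, return
    (timestamp, seconds since the previous message, nodes that left, nodes that rejoined)."""
    events: List[Event] = []
    previous = None
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue
        when = datetime.strptime(match.group(1), STAMP)
        down = down_nodes(match.group(2))
        if previous is not None:
            prev_when, prev_down = previous
            gap = int((when - prev_when).total_seconds())
            events.append((when, gap, sorted(down - prev_down), sorted(prev_down - down)))
        previous = (when, down)
    return events
```

Passing the ‘new group’ lines from a node’s logfile to timeline yields one entry per group change, which can then be cross-referenced with panic stack traces and other system logs.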