Constructing an event timeline

When investigating a cluster issue, it can be helpful to build a human-readable timeline of what occurred. This is particularly useful when an event involves multiple, non-simultaneous group changes. The timeline should record which nodes went down or came back up, and can be interleaved with panic stack summaries to describe an event.
As such, a timeline of the cluster event above could read:
1. April 15 06:44:38 6 down
2. April 15 06:44:58 2, 9 down (6 still down)
3. April 15 06:45:20 7 down (2, 6, 9 still down)
4. April 15 06:47:09 6 up (2, 7, 9 still down)
5. April 15 06:47:27 1 down (2, 7, 9 still down)
6. April 15 06:48:11 12 down (1, 2, 7, 9 still down)
7. April 15 06:50:55 7 up (1, 2, 9, 12 still down)
8. April 15 06:51:26 2 up (1, 9, 12 still down)
9. April 15 06:51:53 9 up (1, 12 still down)
10. April 15 06:54:06 1 up (12 still down)
11. April 15 06:56:10 12 up (none down)
12. April 15 06:59:54 10 down
13. April 15 07:05:23 10 up (none down)
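The bookkeeping behind a timeline like this can be sketched in a few lines of Python. This is only an illustration, not a OneFS tool: the function name `build_timeline`, the event tuple layout, and the year 2024 in the sample data are assumptions made for the example.

```python
from datetime import datetime

def build_timeline(events):
    """Render up/down events as timeline lines.

    events: iterable of (timestamp, node_ids, state), state being "up" or "down".
    This tuple layout is a hypothetical format chosen for the sketch.
    """
    down = set()   # nodes currently considered down
    lines = []
    for ts, nodes, state in sorted(events, key=lambda e: e[0]):
        changed = set(nodes)
        if state == "down":
            still_down = sorted(down - changed)  # nodes that were already down
            down |= changed
        else:
            down -= changed
            still_down = sorted(down)
        names = ", ".join(str(n) for n in sorted(changed))
        if still_down:
            note = " (" + ", ".join(str(n) for n in still_down) + " still down)"
        elif state == "up":
            note = " (none down)"
        else:
            note = ""
        lines.append(f"{ts:%B %d %H:%M:%S} {names} {state}{note}")
    return lines

# First four events of the timeline above (the year is assumed).
events = [
    (datetime(2024, 4, 15, 6, 44, 38), [6], "down"),
    (datetime(2024, 4, 15, 6, 44, 58), [2, 9], "down"),
    (datetime(2024, 4, 15, 6, 45, 20), [7], "down"),
    (datetime(2024, 4, 15, 6, 47, 9), [6], "up"),
]
for line in build_timeline(events):
    print(line)
```

Tracking the set of down nodes as the events are replayed is what produces the "still down" annotations, so the same function can be reused for each node's view of the event.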
The next step would be to review the logs from the other nodes in the cluster for this time period and construct a similar timeline for each. Once done, these can be distilled into one comprehensive, cluster-wide timeline.
Before triangulating log events across a cluster, it is important to ensure that the constituent nodes' clocks are all synchronized. To check this, run the 'isi_for_array -q date' command and confirm that the times reported by all nodes match. If they do not, apply each node's time offset to the timestamps in its log files.
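As a rough sketch of that offset correction and merge, the following Python assumes each node's events have already been extracted into a sorted list of (timestamp, message) pairs, and that the offset for each node is known in seconds (node clock minus reference clock). The function name `merge_node_timelines` and both input layouts are hypothetical.

```python
import heapq
from datetime import datetime, timedelta

def merge_node_timelines(per_node_events, offsets_seconds):
    """Merge per-node event lists into one cluster-wide, time-ordered list.

    per_node_events: {node_id: [(timestamp, message), ...]}, each list sorted by time.
    offsets_seconds: {node_id: node clock minus reference clock, in seconds}.
    Both structures are assumptions made for this sketch.
    """
    corrected = []
    for node, node_events in per_node_events.items():
        shift = timedelta(seconds=offsets_seconds.get(node, 0))
        # Subtracting a constant keeps each per-node list sorted.
        corrected.append([(ts - shift, node, msg) for ts, msg in node_events])
    return list(heapq.merge(*corrected))

# Node 2's clock is assumed to run 10 seconds ahead of node 1's.
merged = merge_node_timelines(
    {
        1: [(datetime(2024, 4, 15, 6, 44, 38), "6 down")],
        2: [(datetime(2024, 4, 15, 6, 44, 45), "6 down")],
    },
    {1: 0, 2: 10},
)
for ts, node, msg in merged:
    print(ts.isoformat(), f"node {node}:", msg)
```

After correction, node 2's entry lands at 06:44:35 and sorts ahead of node 1's, which is the ordering the raw timestamps would have hidden.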
Here is another example of how to interpret a series of group events in a cluster. Consider the following group info excerpt from the logs on node 1 of the cluster:
2024-04-15-T18:01:17 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 5681="kt: gmp-config") new group: <1,270>: { 1:0-11, down: 2, 6-11, diskless: 6-8 }
2024-04-15-T18:02:05 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 5681="kt: gmp-config") new group: <1,271>: { 1-2:0-11, 6-8, 9-11:0-11, soft_failed: 11, diskless: 6-8 }
2024-04-15-T18:08:56 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 10899="kt: gmp-split") new group: <1,272>: { 1-2:0-11, 6-8, 9-10:0-11, down: 11, soft_failed: 11, diskless: 6-8 }
2024-04-15-T18:08:56 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 10899="kt: gmp-config") new group: <1,273>: { 1-2:0-11, 6-8, 9-10:0-11, diskless: 6-8 }
2024-04-15-T18:09:49 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 10998="kt: gmp-config") new group: <1,274>: { 1-2:0-11, 6-8, 9-10:0-11, soft_failed: 10, diskless: 6-8 }
2024-04-15-T18:15:34 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 12863="kt: gmp-split") new group: <1,275>: { 1-2:0-11, 6-8, 9:0-11, down: 10, soft_failed: 10, diskless: 6-8 }
2024-04-15-T18:15:34 -04:00 <0.4> tme-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1863] (pid 12863="kt: gmp-config") new group: <1,276>: { 1-2:0-11, 6-8, 9:0-11, diskless: 6-8 }
The timeline of events here can be interpreted as follows:
1. In the first line, node 1 has rebooted: node 1 is up, and all other nodes that are part of the cluster are down. (Nodes with Array IDs 3 through 5 were removed from the cluster prior to this sequence of events.)
2. The second line indicates that all the nodes have returned to the group, with Array ID 11 now marked as smartfailed.
3. In the third line, Array ID 11 is both smartfailed and offline.
4. Moments later, in the fourth line, Array ID 11 has been removed from the cluster entirely.
5. Less than a minute later, the node with Array ID 10 is smartfailed, and the same sequence of events occurs.
6. After the smartfail finishes, the cluster group shows node 10 as down, then removed entirely.
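The interpretation above can be partly automated. The following Python sketch parses the "new group" portion of a log line into sets of Array IDs and reports which IDs disappeared between two consecutive groups. It is based only on the excerpt shown here: the helper names (`parse_group`, `removed_nodes`), the "members" label for the leading entries, and the assumption that every keyed section looks like `down:`, `soft_failed:`, or `diskless:` are illustrative, not a documented OneFS format.

```python
import re

GROUP_RE = re.compile(r"new group: <(\d+),(\d+)>: \{(.*)\}")

def expand(rng):
    """Expand "6-8" to {6, 7, 8} and "11" to {11}."""
    lo, _, hi = rng.partition("-")
    return set(range(int(lo), int(hi or lo) + 1))

def parse_group(line):
    """Return ((a, b), sets) for a 'new group' log line, or None if absent."""
    m = GROUP_RE.search(line)
    if m is None:
        return None
    sets = {"members": set()}   # leading entries: nodes currently in the group
    key = "members"
    for token in m.group(3).split(","):
        token = token.strip()
        if token and token[0].isalpha():      # e.g. "down: 11" starts a new section
            key, _, token = token.partition(":")
            token = token.strip()
            sets.setdefault(key, set())
        if token:
            nodes = token.split(":")[0]       # drop the drive range, "1-2:0-11" -> "1-2"
            sets[key] |= expand(nodes)
    return (int(m.group(1)), int(m.group(2))), sets

def removed_nodes(prev, curr):
    """Array IDs listed anywhere in the previous group but nowhere in the current one."""
    def known(group):
        return set().union(*group.values())
    return known(prev) - known(curr)

line_272 = 'new group: <1,272>: { 1-2:0-11, 6-8, 9-10:0-11, down: 11, soft_failed: 11, diskless: 6-8 }'
line_273 = 'new group: <1,273>: { 1-2:0-11, 6-8, 9-10:0-11, diskless: 6-8 }'
_, g272 = parse_group(line_272)
_, g273 = parse_group(line_273)
print(removed_nodes(g272, g273))
```

Comparing each group with its predecessor in this way flags the point at which Array ID 11 leaves the cluster, matching the fourth step in the list above.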
Because group changes document the cluster's actual configuration from OneFS’ perspective, they are a vital tool in understanding which devices the cluster considers available, and which it considers failed, at a specific point in time. This information, when combined with other data from cluster logs, can provide a succinct but detailed cluster history, simplifying both debugging and failure analysis.
