We ran the following tests with the SmartConnect configuration in place and the SMB shares mounted using the SmartConnect zone name:
- Video playback at 20%
- As video is being written to the storage, we recall or review the video at a rate equal to 20% of the write rate.
- Disk Failure test
A single disk failure is the most common failure affecting storage systems today. When a disk fails, that disk is removed and replaced. The replacement disk is then reconstructed.
The Isilon cluster was protected using a +2 protection scheme that allows for two simultaneous disk failures. In our test, we failed and recovered two disks. The SmartFail process started and the CPU utilization of the node increased, with no observed effect on the write streams.
- NIC Failure test
We performed the hard NIC failure test by removing one NIC cable from the active node that was involved in active recording. After the NIC failure, writing to the same node failed. When the network fails, the server must recognize the failure and then establish a new connection. In addition, TCP socket connections are left open and remain open on the cluster until Isilon's OneFS forces them closed, after which the server can continue writing.
We can force the open TCP sockets to close in less than 2 minutes by reducing the TCP keep idle and TCP keep interval timeouts to the values recommended by Isilon Engineering.
To significantly reduce the duration of video loss caused by the open TCP socket condition, set the following persistent values in the sysctl.config file:
isi_sysctl_cluster net.inet.tcp.keepidle=61000
isi_sysctl_cluster net.inet.tcp.keepintvl=5000
Refer to KB article 000089232 for further information about how to configure these parameters.
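As a rough check on the "less than 2 minutes" figure, the sketch below estimates how long a dead, idle connection can stay open under these settings. It assumes the FreeBSD-style keepalive model (the idle time plus a number of unanswered probes sent one interval apart) and a probe count of 8, which is the FreeBSD default for net.inet.tcp.keepcnt. The probe count on a given OneFS release is an assumption here, and the function name is hypothetical.

```shell
#!/bin/sh
# Estimate the worst-case time (in milliseconds) before a dead, idle TCP
# connection is dropped. Assumed model: keepidle + keepcnt * keepintvl.
keepalive_timeout_ms() {
    keepidle_ms=$1    # idle time before the first keepalive probe
    keepintvl_ms=$2   # spacing between unanswered probes
    keepcnt=$3        # number of unanswered probes before the drop
    echo $(( keepidle_ms + keepcnt * keepintvl_ms ))
}

KEEPIDLE_MS=61000   # value from the text above
KEEPINTVL_MS=5000   # value from the text above
KEEPCNT=8           # ASSUMPTION: FreeBSD default, not stated in this document

TOTAL_MS=$(keepalive_timeout_ms "$KEEPIDLE_MS" "$KEEPINTVL_MS" "$KEEPCNT")
echo "Estimated socket teardown time: $(( TOTAL_MS / 1000 )) seconds"
```

Under these assumptions the estimate is 101 seconds, which is consistent with the stated target of under 2 minutes. A different probe count would change the result proportionally.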
Note: The impact of a NIC failure can be overcome by using NIC aggregation in Active/Passive Failover aggregation mode, which is explained in the next test case. Connectivity to the nodes that are not affected by the network outage remains available throughout the test scenario, and no impact on those nodes was observed.
- NIC Failure test with NIC aggregation in Active/Passive
We performed a hard NIC failure test with Active/Passive aggregation by removing the cable from the active NIC port. After the network failure, writing to the same node continued and the passive NIC immediately became the active NIC. The NIC failure caused no apparent loss.
Note: NIC aggregation in Active/Passive mode remedies only a network disconnection/NIC failure that happens on the Isilon node or the corresponding switch port where it is connected.
- Node Poweroff Test
To simulate an unexpected single-node hard failure, we held down the power button until the node powered off. This forces the servers that were writing to that node to reconnect to a new node. In our tests, the servers that had been writing to the failed node reconnected to a new node, but the aggregate time to reconnect and resume writing to the SMB share was up to 52 seconds.
The second issue is that the removal or addition of a node causes an interruption to the cluster. Therefore, video servers writing to the other nodes might experience a short interruption. The duration of the interruption can be reduced by modifying the OneFS environment variables.
The following code makes group changes to the cluster that reduce the interruption from between 30 seconds and 1 minute to 9 seconds or less. The changes affect the "lazy queue" and other cache-related operations on each node.
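These changes are required to reduce the interruption caused by removing or adding a node:
declare -i COUNT MDS
BASE=10000
# Integer ratios give the same values as 1.01 * BASE and 0.75 * BASE, and also work in shells without floating-point arithmetic (such as bash).
COUNT=$((BASE * 101 / 100))
MDS=$((BASE * 3 / 4))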
isi_sysctl_cluster kern.maxvnodes=$BASE
isi_sysctl_cluster kern.minvnodes=$BASE
isi_sysctl_cluster efs.lin.lock.initiator.lazy_queue_goal=$COUNT
isi_sysctl_cluster efs.ref.initiator.lazy_queue_goal=$COUNT
isi_sysctl_cluster efs.mds.block_lock.initiator.lazy_queue_goal=$MDS
isi_sysctl_cluster efs.bam.datalock.initiator.lazy_queue_goal=$MDS
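To review the resulting values before applying them, the following sketch computes the same goals with integer-only arithmetic and prints each command instead of running it. The `run` wrapper and the `DRY_RUN` variable are hypothetical helpers introduced for illustration; they are not part of OneFS. Setting `DRY_RUN=0` would execute the commands, which only makes sense on a cluster node where `isi_sysctl_cluster` is available.

```shell
#!/bin/sh
# Dry-run sketch: compute the lazy queue goals and show the commands.
BASE=10000
COUNT=$(( BASE * 101 / 100 ))   # 1.01 * BASE, using integer arithmetic
MDS=$(( BASE * 3 / 4 ))         # 0.75 * BASE, using integer arithmetic
DRY_RUN=${DRY_RUN:-1}           # hypothetical switch: 1 = print only

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run isi_sysctl_cluster kern.maxvnodes="$BASE"
run isi_sysctl_cluster kern.minvnodes="$BASE"
run isi_sysctl_cluster efs.lin.lock.initiator.lazy_queue_goal="$COUNT"
run isi_sysctl_cluster efs.ref.initiator.lazy_queue_goal="$COUNT"
run isi_sysctl_cluster efs.mds.block_lock.initiator.lazy_queue_goal="$MDS"
run isi_sysctl_cluster efs.bam.datalock.initiator.lazy_queue_goal="$MDS"
```

With BASE=10000, the computed goals are 10100 for COUNT and 7500 for MDS.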