We ran tests with the SmartConnect™ configuration in place and the SMB shares were mounted using the SmartConnect zone name.

Video playback test

While video is written to the storage and viewed live, recorded video is simultaneously recalled (reviewed) at a rate equal to 20 percent of the write rate. The tests were run with the SmartConnect™ configuration in place and the SMB shares mounted using the SmartConnect zone name.
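
As a rough illustration of the 20 percent review load, the arithmetic can be sketched in shell. The function name review_rate_mbps and the 400 MB/s write rate are hypothetical values chosen for the example, not figures from the tests.

```shell
# Derive the review (read) load from a recorder's aggregate write rate.
# The 20 percent ratio comes from the test description; the write rate
# passed in below is a made-up value for illustration only.
review_rate_mbps() {
    write_mbps=$1
    echo $(( write_mbps * 20 / 100 ))
}

review_rate_mbps 400
```

With a 400 MB/s write load, the sketch yields a review load of 80 MB/s running alongside the writes.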

The review did not affect the write rate or video quality, and it did not result in dropped video.

Disk failure test

A single disk failure is the most common failure affecting storage systems today. When a disk fails, that disk is removed and replaced. The replacement disk is then reconstructed.

The Unity and SC block storage arrays were protected using RAID with hot spare disks. For the test, disk failure scenarios were induced and the data rebuild to the hot spare disks was observed for its effect on write bandwidth.

The Isilon cluster was protected using a +2 protection scheme that allows for two simultaneous disk failures. For the test, two disks were failed and then recovered. The SmartFail process started and the CPU utilization of the node increased, with no observed effect on the write streams and no video loss.

NIC failure test

The Unity block storage arrays were configured with multiple paths to the recorders using Microsoft MPIO. Multiple NICs were configured on the recorders and controllers for redundancy. The Unity hard NIC failure test removed one NIC cable from the array. Recorders that were configured with multipathing reconnected to the volume across another available path. To reduce the reconnection time and eliminate video loss, adjust the TCP retransmission timers. For more information, see the Dell EMC SAN Storage with Video Management Systems - Configuration Best Practices Guide.

The Isilon hard NIC failure test removed one NIC cable from the active node that was involved in active recording. After the NIC failure, writes to that node failed. When the network fails, the server must first recognize the failure and then establish a new connection. In addition, the TCP socket connections are left open on the cluster and remain open until Isilon's OneFS forces them closed; only then can the server continue writing.

We can force the open TCP sockets to close in less than 2 minutes by reducing the TCP keep idle and TCP keep interval timeouts to the optimum values recommended by Isilon Engineering.

To significantly reduce the duration of video loss caused by the open TCP socket condition, set the following persistent values in the sysctl.config file:

isi_sysctl_cluster net.inet.tcp.keepidle=61000
isi_sysctl_cluster net.inet.tcp.keepintvl=5000
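
To see how these two timers relate to the "less than 2 minutes" figure, the worst-case time before a dead connection is dropped can be estimated as the idle time plus the probe interval multiplied by the number of probes. The sketch below assumes the timers are in milliseconds and that the probe count is 8, which is the stock FreeBSD default for net.inet.tcp.keepcnt. Neither assumption is stated in this document, and the function name is hypothetical.

```shell
# Estimate how long an idle, dead TCP connection can linger before
# keepalive probing gives up. The probe count is an assumption (FreeBSD
# default), not a value taken from this document.
keepalive_worst_case_seconds() {
    keepidle_ms=$1    # idle time before the first probe, in milliseconds
    keepintvl_ms=$2   # gap between unanswered probes, in milliseconds
    keepcnt=$3        # probes sent before the connection is dropped (assumed)
    echo $(( (keepidle_ms + keepintvl_ms * keepcnt) / 1000 ))
}

keepalive_worst_case_seconds 61000 5000 8
```

Under those assumptions, the estimate is 101 seconds, which is within the 2-minute bound described above.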

NIC failure test with NIC aggregation in Active/Passive

The hard NIC failure test with aggregated links was run by removing one NIC. After the network failure, writing to the same node continued. The NIC failure caused no apparent loss.

Node poweroff test

We simulated an unexpected single node hard failure, which causes the servers that were writing to that node to reconnect to a new node.

In our tests, the OneFS changes described in this section eliminated the global loss of video during high-impact cluster interruptions caused by removing a node (intentionally or through failure) or adding a node to the cluster. The changes also reduced the duration of server reconnections after a node was removed.

The first issue is that the servers on the failed node reconnected to a new node but did not resume writing immediately. In our tests, the aggregate duration (reconnecting and starting to write again) was up to 52 seconds while the servers waited for writing to the SMB share to restart.

The second issue is that the removal or addition of a node interrupts the cluster. Therefore, video servers writing to the other nodes might experience a short interruption. The duration of the interruption can be reduced by modifying the OneFS environment variables.

The following code makes group changes to the cluster that reduce the global interruption from 30 seconds to no video loss. The cameras writing to the failed node saw a 50 percent reduction in the recovery duration, from 1 minute to 30 seconds. The changes affect the "lazy queue" and other cache-related operations on each node.

The following changes are required to modify the remove or add node interruption:

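# Integer variables for the derived queue goals.
declare -i COUNT MDS
BASE=10000
# Integer arithmetic for 1.01 x BASE and 0.75 x BASE, so the script does
# not depend on floating-point support in the shell.
COUNT=$(( BASE * 101 / 100 ))
MDS=$(( BASE * 3 / 4 ))
# Set both vnode limits to BASE.
isi_sysctl_cluster kern.maxvnodes=$BASE
isi_sysctl_cluster kern.minvnodes=$BASE
# Set the lazy queue goals for the lock initiators.
isi_sysctl_cluster efs.lin.lock.initiator.lazy_queue_goal=$COUNT
isi_sysctl_cluster efs.ref.initiator.lazy_queue_goal=$COUNT
isi_sysctl_cluster efs.mds.block_lock.initiator.lazy_queue_goal=$MDS
isi_sysctl_cluster efs.bam.datalock.initiator.lazy_queue_goal=$MDS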
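
Before changing a production cluster, it can help to review the exact commands that a given BASE value produces. The following sketch only prints the commands and does not run them. The helper name preview_lazy_queue_settings is hypothetical, and the 1.01 and 0.75 multipliers are taken from the code above.

```shell
# Print (not run) the sysctl commands derived from a given BASE value.
# preview_lazy_queue_settings is a hypothetical helper for review purposes.
preview_lazy_queue_settings() {
    base=$1
    count=$(( base * 101 / 100 ))   # 1.01 x BASE, integer math
    mds=$(( base * 3 / 4 ))         # 0.75 x BASE, integer math
    for key in kern.maxvnodes kern.minvnodes; do
        echo "isi_sysctl_cluster ${key}=${base}"
    done
    for key in efs.lin.lock.initiator.lazy_queue_goal \
               efs.ref.initiator.lazy_queue_goal; do
        echo "isi_sysctl_cluster ${key}=${count}"
    done
    for key in efs.mds.block_lock.initiator.lazy_queue_goal \
               efs.bam.datalock.initiator.lazy_queue_goal; do
        echo "isi_sysctl_cluster ${key}=${mds}"
    done
}

preview_lazy_queue_settings 10000
```

For a BASE of 10000, the preview lists six commands, with the lock and ref queue goals at 10100 and the MDS and datalock queue goals at 7500.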

Note: During an abrupt failure of a node, the recorders writing to that node reconnect to any other available node using SmartConnect. During the node failure testing, we observed that the reconnected recorder lost about 45 seconds of video.

Warning: If running a mixed workload, these changes can adversely affect the other workloads that might be present on the cluster.
