We ran tests with the SmartConnect configuration in place and the SMB shares were mounted using the SmartConnect zone name.
As video is being written to storage, video is simultaneously recalled or reviewed at a rate equal to a set percentage of the write rate. The review did not affect the write rate or video quality, nor did it result in dropped video frames.
A single disk failure is the most common failure affecting storage systems today. When a disk fails, that disk is removed and replaced. The replacement disk is then reconstructed.
The Unity and SC series block storage arrays were protected using RAID with hot spare disks. For the test, disk failure scenarios were induced and the data rebuild to the hot spare disks was observed for any effect on write bandwidth.
The Isilon cluster was protected using a +2 protection scheme that allows for two simultaneous disk failures. For the test, two disks were failed and then recovered. The SmartFail process started and the CPU utilization of the node increased, with no observed effect on the write streams.
ECS employs a hybrid model of triple mirroring data, metadata, and indexing. Erasure coding is also used for enhanced data protection and reduction of storage overhead. For data integrity, ECS uses checksums.
When the system labels a drive as FAILED, the data protection logic rebuilds the data from that drive onto other drives in the system. The FAILED drive no longer participates in the system in any way. ECS requires a minimum of four nodes to conduct the default erasure coding and six nodes for the cold archive option.
The disk rebuild operation did not affect the write rate, video quality, or result in dropped video.
The Unity and SC series block storage arrays were configured with multiple paths to the recorders using Microsoft MPIO. Multiple NICs were configured on the recorders and controllers for redundancy. The Unity and SC series hard NIC failure test removes one NIC cable from the array. Recorders that were configured with multipathing reconnected to the volume across another available path. To reduce the reconnection time and eliminate video loss, adjust the TCP retransmission timers. For more information, see the Dell EMC SAN Storage with Video Management Systems - Configuration Best Practices Guide.
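To see why the retransmission timers matter, the following is a rough model, not taken from this guide, of TCP retransmission backoff: the retransmission timeout (RTO) doubles on each retry, so the time before a dead path is abandoned and MPIO can fail over grows quickly with the retry count. The initial RTO and retry counts below are illustrative assumptions, not measured Windows defaults.

```python
# Rough model of TCP retransmission backoff: each retry doubles the
# retransmission timeout (RTO). The initial RTO and retry counts are
# illustrative assumptions, not measured Windows defaults.
def time_to_abandon(initial_rto_s, max_retries):
    """Total seconds spent retransmitting before the connection is dropped."""
    total, rto = 0.0, initial_rto_s
    for _ in range(max_retries):
        total += rto
        rto *= 2
    return total

# Fewer retries means the recorder abandons the dead path sooner,
# letting MPIO fail over to a surviving path with less video loss.
print(time_to_abandon(3.0, 5))  # 3+6+12+24+48 = 93.0
print(time_to_abandon(3.0, 3))  # 3+6+12 = 21.0
```

Under these assumptions, trimming two retries cuts the failover delay from about 93 seconds to about 21 seconds, which is why the best practices guide recommends tuning these timers.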
The Isilon hard NIC failure test removes one NIC cable from a node that was involved in active recording. After the NIC failure, writing to the same node failed. When the network fails, the server must recognize the failure and then establish a new connection. In addition, TCP socket connections that were open at the time of the failure remain open on the cluster until Isilon's OneFS forces them closed, at which point the server can continue writing.
We can force the open TCP sockets to close in less than 2 minutes by reducing the TCP keepidle and TCP keepintvl timeouts to the optimum values recommended by Isilon Engineering.
To reduce the video loss duration due to the TCP socket open condition, set the following persistent values in the sysctl.conf file:
isi_sysctl_cluster net.inet.tcp.keepidle=61000
isi_sysctl_cluster net.inet.tcp.keepintvl=5000
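As a rough sanity check on the "less than 2 minutes" claim, the worst-case time for TCP keepalive to detect and close a dead socket is the idle threshold plus the probe interval times the probe count. The probe count of 8 used below is FreeBSD's usual net.inet.tcp.keepcnt default and is an assumption here; verify the value on your cluster.

```python
# Estimate worst-case dead-socket detection time from TCP keepalive settings.
# keepidle/keepintvl are the recommended OneFS values (milliseconds);
# keepcnt = 8 is an assumed FreeBSD default, not confirmed for every release.
keepidle_ms = 61000   # net.inet.tcp.keepidle: idle time before first probe
keepintvl_ms = 5000   # net.inet.tcp.keepintvl: interval between probes
keepcnt = 8           # unanswered probes before the socket is dropped

detect_seconds = (keepidle_ms + keepcnt * keepintvl_ms) / 1000
print(f"Worst-case detection time: {detect_seconds:.0f} s")  # 101 s, under 2 min
```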
The ECS hard NIC failure test simulates the failure scenario by removing one NIC cable from a node that was involved in active recording.
The Dell Technologies Safety & Security Lab uses two 10 GbE, 24-port or 52-port Arista switches to transfer data to and from customer applications and to carry internal node-to-node communications. These switches are connected to the ECS nodes in the same rack and employ the Multi-Chassis Link Aggregation (MLAG) feature, which logically links the switches, enabling active-active paths between the nodes and customer applications. This configuration results in higher bandwidth while preserving resiliency and redundancy in the data path. Any networking device supporting static LAG or IEEE 802.3ad LACP can connect to this MLAG switch pair. Because the switches are configured as an MLAG pair, they appear and act as one large switch.
The NIC failure tests did not affect the write rate, video quality, or result in dropped video.
The hard NIC failure test with active/passive aggregation was run by removing the cable from the active NIC port. After the network failure, writing to the same node continued and the passive NIC immediately became the active NIC. The NIC failure caused no apparent video loss.
TCP retransmission timers can be adjusted to reduce reconnection times during NIC failures on recorders that use Microsoft MPIO. For more information, see the Dell EMC SAN Storage with Video Management Systems - Configuration Best Practices Guide.
An unexpected single node hard failure was simulated, which causes the servers that were writing to that node to reconnect to a new node.
During the tests, the servers writing to the failed node reconnected to a new node but did not start writing again for an aggregate (reconnect and start writing) duration of up to 52 seconds while waiting for writing to the SMB share to be restarted.
Also, the removal or addition of a node causes an interrupt to the cluster. Therefore, video servers writing to the other nodes might experience a short interruption. The duration of the interruption can be reduced by modifying the OneFS environment variables.
The following changes reduce the interruption when a node is removed or added:
declare -i COUNT MDS
BASE=10000
COUNT=$((BASE * 101 / 100))   # 1.01 * BASE, using integer arithmetic
MDS=$((BASE * 3 / 4))         # 0.75 * BASE, using integer arithmetic
isi_sysctl_cluster kern.maxvnodes=$BASE
isi_sysctl_cluster kern.minvnodes=$BASE
isi_sysctl_cluster efs.lin.lock.initiator.lazy_queue_goal=$COUNT
isi_sysctl_cluster efs.ref.initiator.lazy_queue_goal=$COUNT
isi_sysctl_cluster efs.mds.block_lock.initiator.lazy_queue_goal=$MDS
isi_sysctl_cluster efs.bam.datalock.initiator.lazy_queue_goal=$MDS
If running a mixed workload, these changes can adversely affect the other workloads that might be present on the cluster.
Erasure coding provides enhanced data protection from a disk or node failure while being more storage-efficient than conventional protection schemes. The ECS storage engine implements the Reed-Solomon 12+4 erasure-coding scheme, in which a chunk is broken into 12 data fragments and 4 coding fragments for parity. These 16 fragments are then dispersed across nodes at the local site. The data and coding fragments for each chunk are equally distributed across nodes in the cluster. For example, with 8 nodes, each node stores 2 of the 16 fragments. The storage engine can then reconstruct a chunk from any 12 fragments of the original 16.
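The storage-efficiency point above can be checked with simple arithmetic, a minimal sketch using only the 12+4 figures stated in the text:

```python
# Compare the storage overhead of triple mirroring vs. Reed-Solomon 12+4
# erasure coding, and show how the 16 fragments spread across a cluster.
data_fragments, coding_fragments = 12, 4
total_fragments = data_fragments + coding_fragments

mirror_overhead = 3.0                           # three full copies of the data
ec_overhead = total_fragments / data_fragments  # 16/12, about 1.33x

print(f"mirroring: {mirror_overhead:.2f}x, erasure coding: {ec_overhead:.2f}x")

# With 8 nodes, the fragments are distributed evenly: 2 per node. Losing one
# node removes only 2 of the 16 fragments, and the chunk is still readable
# from any 12 of the remaining 14.
nodes = 8
print(total_fragments // nodes)  # 2 fragments per node
```

So erasure coding stores about 1.33 bytes per byte of data versus 3 bytes for triple mirroring, while still tolerating the loss of any 4 fragments.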
One of the ECS nodes was manually shut down. The GeoDrive tool load balanced the traffic across all the available nodes and the recorders bypassed the failed node. The node failure did not affect the write rate, video quality, or result in dropped video.
One of the ECS nodes was manually restarted to simulate a node reboot. The GeoDrive tool load balanced the traffic across all the available nodes and the recorders bypassed the rebooting node. The node reboot did not affect the write rate, video quality, or result in dropped video.