The recommended approach for measuring and optimizing performance is as follows:
For releases prior to OneFS 8.0, the number of primary and secondary workers is calculated from two factors across both clusters. First, the lower node count of the two clusters is taken. That node count is then multiplied by the number of workers per node, which is a configurable value that defaults to three. SyncIQ randomly distributes the workers across the cluster, with each node receiving at least one worker. If the number of workers is less than the number of nodes, not all nodes participate in the replication. An example calculation is illustrated in the following figure.
Figure 42. Calculating primary and secondary workers for releases prior to OneFS 8.0
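As a rough illustration of this pre-OneFS 8.0 calculation, the following Python sketch computes the worker count for a hypothetical 5-node source cluster replicating to a 3-node target cluster, using the default of three workers per node. The node counts are assumptions chosen only to show the arithmetic.

```python
# Illustration only: pre-OneFS 8.0 worker calculation.
# The worker count is driven by the smaller of the two clusters,
# multiplied by the configurable workers-per-node value (default 3).

def worker_count(source_nodes: int, target_nodes: int, workers_per_node: int = 3) -> int:
    """Return the number of primary (and secondary) workers for a policy."""
    return min(source_nodes, target_nodes) * workers_per_node

# Hypothetical example: 5-node source replicating to a 3-node target.
print(worker_count(5, 3))  # 3 nodes * 3 workers per node = 9 workers
```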
In OneFS 8.0, the limits have been increased to provide additional scalability and capability in line with larger cluster sizes and the higher-performing nodes now available. The maximum number of workers and the maximum number of workers per policy both scale as the number of nodes in the cluster increases. The defaults should be changed only with the guidance of PowerScale Technical Support.
Note: The source and target clusters must have the same number of workers, as each pair of source and target workers creates a TCP session. Any inconsistency in the number of workers results in failed sessions. As stated above, the maximum number of target workers is 100 per node, implying that the total number of source workers is also 100 per node.
Note: The following example is provided to explain how a node's CPU type impacts worker count, how workers are distributed across policies, and how SyncIQ works at a higher level. The actual number of workers is calculated dynamically by OneFS based on the node type. The calculations in the example are not a tuning recommendation and are merely for illustration. If the worker counts require adjustment, contact PowerScale Technical Support, as the number of virtual cores, nodes, and other factors are considered prior to making changes.
As an example, consider a 4-node cluster with 4 cores per node, for a total of 16 cores in the cluster. Following the previous rules:
When the first policy starts, it is assigned 32 workers (out of the maximum of 64). A second policy starting is also assigned 32 workers. The maximum number of workers per policy was determined previously as 32, and there are now 64 workers in total, the maximum for this cluster. When a third policy starts, assuming the first two policies are still running, the maximum of 64 workers is redistributed evenly: 21 workers are assigned to the third policy, and the first two policies have their worker counts reduced from 32 to 21 and 22 respectively, because 64 does not split evenly into three. The three running policies therefore have 21 or 22 workers each, keeping the cluster at its maximum of 64 workers. Similarly, a fourth policy starting would result in all four policies having 16 workers each. When one of the policies completes, the reallocation again ensures that the workers are distributed evenly among the remaining running policies.
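The even split described above can be sketched in Python. The 64-worker cluster maximum and 32-worker per-policy maximum come from the example cluster; the function illustrates only the arithmetic, not the actual OneFS scheduler.

```python
# Illustration only: evenly splitting the cluster-wide worker maximum
# across the currently running policies, capped by the per-policy maximum.

def distribute_workers(num_policies: int, cluster_max: int = 64, policy_max: int = 32) -> list[int]:
    """Return the worker count assigned to each running policy."""
    base, remainder = divmod(cluster_max, num_policies)
    counts = [base + 1 if i < remainder else base for i in range(num_policies)]
    return [min(c, policy_max) for c in counts]

print(distribute_workers(1))  # [32]  a single policy is capped at the per-policy maximum
print(distribute_workers(2))  # [32, 32]
print(distribute_workers(3))  # [22, 21, 21]  64 does not split evenly into three
print(distribute_workers(4))  # [16, 16, 16, 16]
```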
Note: Any reallocation of workers on a policy occurs gradually to reduce thrashing when policies are starting and stopping frequently.
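As a conceptual illustration only (not the actual OneFS mechanism), gradual reallocation can be pictured as stepping each policy's worker count toward its new target one worker per pass, for example after a third policy completes and its workers are returned to the two remaining policies:

```python
# Illustration only: step each policy's allocation toward its target one
# worker per pass instead of reassigning all workers at once.
def step_toward(current: list[int], target: list[int]) -> list[int]:
    return [c + 1 if c < t else c - 1 if c > t else c
            for c, t in zip(current, target)]

allocation = [22, 21]   # a third policy just completed; its workers were freed
target = [32, 32]       # the remaining policies ramp back up to the per-policy maximum
for _ in range(3):
    allocation = step_toward(allocation, target)
print(allocation)       # after three passes: [25, 24], converging on [32, 32]
```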
Administrators may want to specify a limit on the number of concurrent SyncIQ jobs. Limiting the number is particularly useful during peak cluster usage and client activity. Capping the cluster resources available to SyncIQ helps ensure that clients do not experience performance degradation.
Note: Consider all factors prior to limiting the number of concurrent SyncIQ jobs, as policies may take more time to complete, impacting RPO and RTO times. As with any significant cluster update, testing in a lab environment is recommended prior to a production cluster update. Additionally, a production cluster should be updated gradually, minimizing disruption and allowing the impact to be measured.
To limit the maximum number of concurrent SyncIQ jobs, perform the following steps from the OneFS CLI:
OneFS 8.0 introduced an updated SyncIQ algorithm that takes advantage of all available cluster resources, significantly improving overall job run times. SyncIQ is highly efficient at scaling data transfer across the network and uses 2 MB TCP windows to account for WAN latency while delivering maximum performance.
Note: The steps and processes mentioned in this section may significantly impact RPO times and client workflows. Prior to updating a production cluster, test all updates in a lab environment that mimics the production environment. Only after successful lab trials should the production cluster be considered for an update. As a best practice, implement changes gradually and closely monitor the production cluster after any significant updates.
SyncIQ achieves maximum performance by utilizing all available cluster resources. If available, SyncIQ consumes the following:
Because SyncIQ consumes cluster resources, it may impact current workflows depending on the environment and available resources. If data replication is impacting other workflows, consider establishing a SyncIQ baseline by tuning the following:
For information about updating the variables above, see SyncIQ performance rules. After the baseline is configured, gradually increase each parameter and collect measurements, ensuring workflows are not impacted. Additionally, consider modifying the maximum number of SyncIQ jobs, as explained in Specifying a maximum number of concurrent SyncIQ jobs.
Note: The baseline variables provided above are only for guidance and are not a one-size-fits-all metric. Every environment varies. Carefully consider cluster resources and workflows while finding the balance between workflow impact and SyncIQ performance.