Multiserver scalability
AcuSolve is a hybrid parallel application: it can combine shared-memory thread parallelism within a node with distributed-memory domain parallelism, both within a node and across nodes. It can be challenging for customers to find the proper balance between shared-memory and distributed-memory parallelism within a node. The following figure shows the parallel performance for these models using four and eight threads per domain. The number of domains per server is the total number of cores per server divided by the number of threads per domain.
These benchmarks were carried out on a cluster of eight servers, each with two 32-core Intel Xeon Platinum 8358 processors. The results are presented as performance relative to the single-node result using four threads per domain.
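The domain arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (domains per server equals cores per server divided by threads per domain), assuming 64 cores per server from two 32-core processors. The function name `hybrid_layout` and its return keys are hypothetical and are not part of AcuSolve.

```python
# Illustrative sketch only: names here are hypothetical, not AcuSolve options.
# Assumption: each server has two 32-core processors, so 64 cores in total.
CORES_PER_SERVER = 2 * 32


def hybrid_layout(nodes, threads_per_domain, cores_per_server=CORES_PER_SERVER):
    """Return the domain counts implied by a given threads-per-domain choice."""
    if cores_per_server % threads_per_domain != 0:
        raise ValueError("threads_per_domain must divide cores_per_server evenly")
    # Rule from the text: domains per server = cores per server / threads per domain.
    domains_per_server = cores_per_server // threads_per_domain
    return {
        "domains_per_server": domains_per_server,
        "total_domains": domains_per_server * nodes,
        "total_cores": cores_per_server * nodes,
    }


if __name__ == "__main__":
    # Print the layouts for the two thread counts and the node counts 1, 2, 4, 8.
    for threads in (4, 8):
        for nodes in (1, 2, 4, 8):
            print(threads, nodes, hybrid_layout(nodes, threads))
```

Under these assumptions, four threads per domain gives 16 domains per server, and eight threads per domain gives 8 domains per server.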
These results may seem surprising for the Riser model, where the performance increases by more than a factor of two going from one to two nodes. However, this behavior can be explained by "cache effects." A cache effect occurs when the data set is distributed among enough nodes that each node's share fits entirely into cache, at which point the speed of the solver can increase dramatically. Such cache effects are highly problem specific. In general, there is a tradeoff in distributed memory parallelism: cache performance typically improves as the problem is distributed to more nodes, but communication overhead also increases, which counteracts the caching benefit.

Overall, the datasets show excellent parallel speedup up to four nodes. The largest model, ImpingingNozzle, displays nearly linear parallel scaling up to eight nodes. For the smallest model, Windmill, scaling is limited by communication overhead as the number of nodes increases, as expected for a model of this size.
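As a rough illustration of how relative performance and superlinear speedup can be identified from elapsed times, consider the sketch below. The timings are hypothetical placeholders, not measurements from this study, and the helper names are assumptions chosen for the example.

```python
# Illustrative sketch: the timings below are HYPOTHETICAL placeholders,
# not results from the benchmarks described in this paper.

def relative_performance(elapsed_by_nodes, baseline_nodes=1):
    """Performance relative to the baseline run (baseline time / run time)."""
    base = elapsed_by_nodes[baseline_nodes]
    return {n: base / t for n, t in sorted(elapsed_by_nodes.items())}


def parallel_efficiency(rel_perf, baseline_nodes=1):
    """Relative performance divided by the increase in node count."""
    return {n: s / (n / baseline_nodes) for n, s in rel_perf.items()}


def superlinear_nodes(efficiency):
    """Node counts where speedup exceeds the node ratio (efficiency > 1)."""
    return [n for n, e in sorted(efficiency.items()) if e > 1.0]


# Hypothetical elapsed solver times in seconds, keyed by node count.
timings = {1: 1000.0, 2: 400.0, 4: 220.0, 8: 125.0}

rel = relative_performance(timings)
eff = parallel_efficiency(rel)

if __name__ == "__main__":
    for n in rel:
        print(f"{n} node(s): relative performance {rel[n]:.2f}, efficiency {eff[n]:.2f}")
    print("superlinear at node counts:", superlinear_nodes(eff))
```

An efficiency above 1.0 corresponds to the greater-than-linear gain that a cache effect can produce. Efficiency that falls as nodes are added is consistent with growing communication overhead.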
The optimal number of threads depends on the model, node count, and HPC system configuration. Users are encouraged to try various thread counts to find the best values for their specific models and HPC systems.
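One way to organize such a thread-count sweep is sketched below. It assumes the user supplies a `measure` callable that runs the model with a given thread count and returns the elapsed time. That callable, and the helper names, are hypothetical; the sketch does not invoke AcuSolve or assume any of its command-line options.

```python
# Illustrative sketch: `measure` is a user-supplied, HYPOTHETICAL callable that
# runs the model with a given threads-per-domain value and returns elapsed seconds.

def candidate_thread_counts(cores_per_server):
    """Thread counts that divide the core count evenly, so no cores sit idle."""
    return [t for t in range(1, cores_per_server + 1) if cores_per_server % t == 0]


def best_thread_count(measure, candidates):
    """Time each candidate and return (fastest thread count, all timings)."""
    results = {t: measure(t) for t in candidates}
    best = min(results, key=results.get)
    return best, results


if __name__ == "__main__":
    # Placeholder cost function standing in for a real benchmark run.
    def fake_measure(threads):
        return 10.0 + abs(threads - 8)

    best, results = best_thread_count(fake_measure, candidate_thread_counts(64))
    print("candidates and timings:", results)
    print("fastest thread count:", best)
```

Because the best value depends on the model and node count, the sweep would need to be repeated for each combination of interest.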