In our testing, we discovered that during a node or NIC failure, all the recorders connected to the failed node may reconnect to a single available node. In this case, round-robin does not distribute the client connections across the available nodes because all the recorders from the failed node try to reconnect at the exact same time.
The Microsoft DNS server caches the node IP address returned for a query, with a time to live (TTL) of 1 second. If multiple recursive queries for the same DNS zone name arrive within the same second, the DNS server answers all of them with the same cached node IP, so every one of those client connection requests is directed to the same node.
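The caching effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a model of the Microsoft DNS server or of SmartConnect internals: the class names, the zone name sc.example.com, and the three node IPs are hypothetical, and time is passed in explicitly so the result is repeatable.

```python
from collections import Counter
from itertools import cycle


class RoundRobinAuthority:
    """Hypothetical stand-in for the round-robin side: each query gets the next node IP."""

    def __init__(self, node_ips):
        self._next_ip = cycle(node_ips)

    def resolve(self, zone):
        return next(self._next_ip)


class CachingResolver:
    """Hypothetical stand-in for a DNS server that caches each answer for ttl_seconds."""

    def __init__(self, authority, ttl_seconds=1.0):
        self.authority = authority
        self.ttl_seconds = ttl_seconds
        self._cache = {}  # zone name -> (node IP, expiry time)

    def resolve(self, zone, now):
        cached = self._cache.get(zone)
        if cached is not None and now < cached[1]:
            # Answer is still within its TTL: every caller gets the same node IP.
            return cached[0]
        ip = self.authority.resolve(zone)
        self._cache[zone] = (ip, now + self.ttl_seconds)
        return ip


def reconnect(recorder_count, spacing_seconds, ttl_seconds=1.0):
    """Count how many recorders land on each node when they reconnect
    spacing_seconds apart (0.0 means all at the same instant)."""
    nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # assumed example addresses
    resolver = CachingResolver(RoundRobinAuthority(nodes), ttl_seconds)
    answers = [
        resolver.resolve("sc.example.com", now=i * spacing_seconds)
        for i in range(recorder_count)
    ]
    return Counter(answers)
```

In this sketch, reconnect(6, 0.0) places all six recorders on one node, while reconnect(6, 1.0), or the same simultaneous reconnect with ttl_seconds=0.0, spreads them two per node. That mirrors the behavior observed in testing: round-robin only takes effect when each query actually reaches the round-robin source instead of being answered from the cache.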
This issue can be resolved by using an alternate DNS implementation such as BIND or dnsmasq. Another option is to use the SmartConnect service IP as the preferred DNS server and the domain DNS server as the alternate DNS server.