Datacenter networks (DCNs) underpin large-scale online services that operate at high utilization levels. Even brief downtime can lead to significant revenue loss. Given these stakes, it is essential to adopt an active model of DCN management, in which the infrastructure observes, analyzes, and corrects faults in near-real time. Because the faults take many different forms, they are difficult to debug.
Some packets may experience an exceptionally high delay between two servers, yet it can be hard to determine which link is responsible. Packets destined for a specific set of servers might be dropped even when the packet-drop counters at the switches show no packet loss. TCP connections to a Virtual IP may encounter intermittent timeouts, and the traceroute probes that would help debug the issue are blocked by the load balancers. The load may not be balanced across a group of ECMP (Equal Cost MultiPath) links.
Diagnosing such problems requires examining traffic at the granularity of individual packets. Faults such as a faulty interface or a software issue can cause random failures that are difficult to troubleshoot. Collecting and analyzing data at the packet level is called packet-level network telemetry.
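To make the idea concrete, the following minimal sketch shows how per-packet records could answer two of the questions raised above: which link adds the delay, and whether load is balanced across an ECMP group. The record layout (a list of switch and timestamp pairs per packet), the switch names, and the timestamp values are all hypothetical and chosen only for illustration. The sketch also assumes that switch clocks are synchronized, which a real deployment would have to ensure separately.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical per-packet records: each packet is a list of
# (switch, timestamp_us) hops in the order it traversed them.
# Assumes synchronized clocks; names and values are illustrative only.
packets = [
    [("S1", 0), ("A1", 10), ("S2", 22)],
    [("S1", 0), ("A2", 11), ("S2", 510)],
    [("S1", 5), ("A2", 15), ("S2", 530)],
    [("S1", 7), ("A1", 18), ("S2", 29)],
]

def per_link_delay(records):
    """Average delay per link (pair of consecutive hops) over all packets."""
    samples = defaultdict(list)
    for hops in records:
        for (src, t0), (dst, t1) in zip(hops, hops[1:]):
            samples[(src, dst)].append(t1 - t0)
    return {link: mean(values) for link, values in samples.items()}

def link_load(records, first_hop):
    """Count packets on each link leaving first_hop, e.g. an ECMP group."""
    return Counter(
        (src, dst)
        for hops in records
        for (src, _), (dst, _) in zip(hops, hops[1:])
        if src == first_hop
    )

delays = per_link_delay(packets)
slowest = max(delays, key=delays.get)   # link with the highest average delay
ecmp_load = link_load(packets, "S1")    # packets per uplink out of S1
```

In this toy data the link from A2 to S2 stands out with an average delay of roughly half a millisecond, while the two uplinks out of S1 each carry two packets. Aggregate switch counters alone would not reveal either fact.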