It's a new year and a great time to make changes for the better. In an effort to rid the world of needless application and network performance slowdowns, we turn to retransmission timeouts (RTOs). What are they and what can you do about them?
The ExtraHop platform captures a wide range of application and network performance metrics. One of the most important quantifies TCP retransmission timeouts (RTOs), which wreak havoc on network and application performance. TCP starts a retransmission timer when an outbound segment is handed down to IP. If the data in that segment is not acknowledged before the timer expires, the segment is retransmitted. TCP retransmissions occur on the network all the time, and an occasional one doesn't pose much of a problem: the timer expires, the segment is resent, and the network continues to hum along. The trouble starts when RTOs pile up, because each one leaves the connection waiting for the full timer interval before the lost data is sent again.
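To make the timer behavior described above concrete, here is a minimal sketch of how a sender might compute its retransmission timeout, following the standard calculation in RFC 6298 (smoothed RTT plus four times the RTT variance, doubled after each timeout). The class name, the 1 ms clock granularity, and the 1 s and 60 s bounds are assumptions chosen for illustration. Real TCP stacks differ in their bounds and details, and this is not a description of how the ExtraHop platform works internally.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 1 / 8    # gain for the smoothed RTT (RFC 6298)
BETA = 1 / 4     # gain for the RTT variance (RFC 6298)
K = 4            # variance multiplier (RFC 6298)
MIN_RTO = 1.0    # seconds; RFC 6298 lower bound (assumed here, stacks vary)
MAX_RTO = 60.0   # seconds; assumed upper bound


@dataclass
class RtoEstimator:
    """Hypothetical helper that tracks the retransmission timeout for one connection."""

    granularity: float = 0.001       # assumed clock granularity G, in seconds
    srtt: Optional[float] = None     # smoothed round-trip time
    rttvar: Optional[float] = None   # round-trip time variance
    rto: float = 1.0                 # initial RTO before any RTT sample

    def on_rtt_sample(self, r: float) -> float:
        """Update the estimate with a new RTT measurement (seconds) and return the RTO."""
        if self.srtt is None:
            # First measurement: seed both estimators from the sample.
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - r)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * r
        rto = self.srtt + max(self.granularity, K * self.rttvar)
        self.rto = min(max(rto, MIN_RTO), MAX_RTO)
        return self.rto

    def on_timeout(self) -> float:
        """Back off exponentially when the timer expires without an acknowledgment."""
        self.rto = min(self.rto * 2, MAX_RTO)
        return self.rto


if __name__ == "__main__":
    est = RtoEstimator()
    print("RTO after a 100 ms sample:", est.on_rtt_sample(0.1))
    print("RTO after one timeout:", est.on_timeout())
    print("RTO after a second timeout:", est.on_timeout())
```

The doubling in on_timeout is what makes repeated RTOs so costly: under these assumptions, each consecutive timeout on the same segment doubles the wait before the next attempt.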
The ExtraHop platform spots RTOs by simulating the TCP state machines at the endpoints of the connection and inferring when problems occur, detecting issues such as bad congestion avoidance, Nagle delays, PAWS drops, and excessive tinygrams.
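As a rough illustration of inferring RTOs from observed traffic, the sketch below uses a much simpler heuristic than the state-machine simulation described above: it flags a data segment as a retransmission when its starting sequence number has been seen before, and guesses that it was timer-driven when the gap since the previous transmission meets a threshold. The Segment type, the function name, the labels, and the 200 ms threshold are all assumptions made for this example. This is not ExtraHop's algorithm, and it ignores acknowledgments, partial overlaps, and sequence number wraparound.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    """Hypothetical record of one captured TCP segment in a single direction."""

    time: float   # capture timestamp, in seconds
    seq: int      # starting sequence number
    length: int   # payload length, in bytes


def classify_retransmissions(
    segments: List[Segment], rto_threshold: float = 0.2
) -> List[Tuple[Segment, str]]:
    """Label repeated data segments by the gap since they were last sent.

    rto_threshold is an assumed cutoff in seconds: gaps at or above it are
    treated as probable timeouts, shorter gaps as probable fast retransmits.
    """
    last_sent: Dict[int, float] = {}
    results: List[Tuple[Segment, str]] = []
    for seg in sorted(segments, key=lambda s: s.time):
        if seg.length == 0:
            continue  # pure ACKs carry no data and cannot be retransmitted data
        previous = last_sent.get(seg.seq)
        if previous is not None:
            gap = seg.time - previous
            label = "likely_rto" if gap >= rto_threshold else "likely_fast_retransmit"
            results.append((seg, label))
        last_sent[seg.seq] = seg.time
    return results


if __name__ == "__main__":
    trace = [
        Segment(0.00, 1000, 100),
        Segment(0.01, 1100, 100),
        Segment(0.05, 1000, 100),  # resent after 50 ms
        Segment(0.40, 1100, 100),  # resent after 390 ms
    ]
    for seg, label in classify_retransmissions(trace):
        print(seg.seq, label)
```

A timing threshold alone will misclassify some cases, which is one reason a tool that tracks full connection state at both endpoints can draw more reliable conclusions.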
In a real-world example, we received a call from one of our customers about a very high number of TCP RTOs on some key servers. He asked us to help him validate and analyze the data. We logged into the ExtraHop platform together, and sure enough, during traffic spikes, the RTO metric would climb to approximately eight million. Because the ExtraHop UI allows for easy drill-down, we were able to quickly determine that the majority of the RTOs could be traced to a single blade enclosure and two specific server instances.
Using this information, we pinpointed the problem quickly, which kept the time to resolution very short. The RTO metric helps to identify packet loss and to locate congested links. Likely causes within the network include a duplex mismatch on the switch, a bad cable, bad checksums, or a driver issue. In this customer's case, we found an incorrect flow control setting. After adjusting this setting, the RTOs dropped by more than 90%, which was a big win for our customer and for the applications running on their servers.
In summary, retransmission timeouts result in serious network stalls and performance degradation. The ExtraHop platform makes it easy to identify RTOs and eliminate them quickly.
Want to see exactly how easy it is to monitor TCP round-trip times and retransmission timeouts in ExtraHop? Explore our free online demo.
This is a companion discussion topic for the original entry at http://www.extrahop.com/company/blog/2016/retransmission-timeouts-rtos-application-performance-degradation/