It's a new year and a great time to make changes for the better. In an effort to rid the world of needless application and network performance slowdowns, we turn to retransmission timeouts (RTOs). What are they and what can you do about them?
The ExtraHop platform captures many application and network performance metrics. One of the most important quantifies TCP retransmission timeouts (RTOs), which wreak havoc on network and application performance. TCP starts a retransmission timer when an outbound segment is handed down to IP. If the data in that segment is not acknowledged before the timer expires, the segment is retransmitted. TCP retransmissions occur on the network all the time. Typically, they don't pose much of a problem: the missing packets are resent, and the network continues to hum along.
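To make the retransmission timer concrete, here is a minimal Python sketch of how a sender might compute its timeout from round-trip time samples, following the standard algorithm in RFC 6298. The class name `RtoEstimator` and its method names are made up for this illustration, and real TCP stacks differ in details such as the minimum timeout, so treat this as a sketch rather than a description of any particular stack or of the ExtraHop platform.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 1 / 8    # smoothing gain for SRTT (RFC 6298)
BETA = 1 / 4     # smoothing gain for RTTVAR (RFC 6298)
K = 4            # variance multiplier (RFC 6298)
MIN_RTO = 1.0    # seconds; RFC 6298 floor, many real stacks use a lower value
MAX_RTO = 60.0   # seconds; assumed upper bound for this sketch


@dataclass
class RtoEstimator:
    """Hypothetical helper that tracks the retransmission timeout."""
    clock_granularity: float = 0.001   # assumed timer resolution in seconds
    srtt: Optional[float] = None       # smoothed round-trip time
    rttvar: Optional[float] = None     # round-trip time variation
    rto: float = 1.0                   # initial timeout before any sample

    def on_rtt_sample(self, rtt: float) -> float:
        """Update the timeout from one measured round-trip time."""
        if self.srtt is None:
            # First measurement seeds both estimators.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - rtt)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * rtt
        raw = self.srtt + max(self.clock_granularity, K * self.rttvar)
        self.rto = min(MAX_RTO, max(MIN_RTO, raw))
        return self.rto

    def on_timeout(self) -> float:
        """The timer expired: back off by doubling the timeout."""
        self.rto = min(MAX_RTO, self.rto * 2)
        return self.rto


if __name__ == "__main__":
    est = RtoEstimator()
    print(est.on_rtt_sample(0.1))  # a fast LAN sample is clamped to the floor
    print(est.on_timeout())        # each expiry doubles the wait
```

Note how the floor dominates on a fast network: even with a 100 ms round trip, the sender in this sketch waits a full second before retransmitting, and each further expiry doubles that wait.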
A retransmission timeout (RTO), on the other hand, is quite a different beast. An RTO occurs when the sender is missing too many acknowledgments and decides to take a timeout and stop sending altogether. After some amount of time, usually at least one second, the sender cautiously starts sending again, testing the waters with just one packet at first, then two packets, and so on. As a result, each RTO typically adds a delay of at least one second on your network. We've seen sites that show millions of RTOs in a 24-hour window, with one million RTOs translating to at least 277 hours (1,000,000 seconds) of cumulative application delay. These retransmission timeouts add up to significant problems for network and application performance and certainly require some tuning and optimization.
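The arithmetic behind that figure, and the cautious ramp-up after a timeout, can be sketched in a few lines of Python. Both function names are invented for this example. The one-second cost per RTO is the lower bound quoted above, and the doubling pattern assumes classic TCP slow start with an idealized sender that loses nothing further.

```python
from typing import List


def total_rto_delay_hours(rto_count: int, seconds_per_rto: float = 1.0) -> float:
    """Cumulative stall time in hours, assuming each RTO costs a fixed delay."""
    return rto_count * seconds_per_rto / 3600


def slow_start_windows(round_trips: int) -> List[int]:
    """Packets sent per round trip after an RTO, assuming classic slow start.

    The sender restarts with one packet and doubles each round trip
    (idealized: no further loss, no receiver window limit).
    """
    return [2 ** i for i in range(round_trips)]


if __name__ == "__main__":
    hours = total_rto_delay_hours(1_000_000)
    print(f"{hours:.1f} hours of delay")   # one million one-second stalls
    print(slow_start_windows(5))           # ramp-up over five round trips
```

One million one-second stalls come to just under 278 hours, which is where the 277-hour figure comes from. The real cost is higher, because the sender also needs several round trips to get back to full speed.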
The ExtraHop platform spots RTOs by simulating the TCP state machines at the endpoints of the connection and inferring when problems occur, detecting issues such as bad congestion avoidance, Nagle delays, PAWS drops, and excessive tinygrams.
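As a rough illustration of inferring RTOs from observed traffic, the Python sketch below flags segments that were retransmitted only after a long silence. This is a deliberately simple heuristic and not how the ExtraHop platform works: it does not simulate the TCP state machine, and the record layout, the function name `find_rto_like_retransmits`, and the one-second threshold are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical trace record for one direction of one connection:
# (timestamp in seconds, TCP sequence number of the segment)
Segment = Tuple[float, int]


def find_rto_like_retransmits(
    segments: Iterable[Segment], min_gap: float = 1.0
) -> List[Tuple[int, float]]:
    """Return (sequence number, gap) for retransmissions sent after a long wait.

    A segment seen again quickly is more likely a fast retransmit; one seen
    again only after `min_gap` seconds looks like a timer expiry.
    """
    last_sent: Dict[int, float] = {}
    suspects: List[Tuple[int, float]] = []
    for ts, seq in segments:
        prev = last_sent.get(seq)
        if prev is not None and ts - prev >= min_gap:
            suspects.append((seq, ts - prev))
        last_sent[seq] = ts  # remember the most recent transmission time
    return suspects


if __name__ == "__main__":
    trace = [
        (0.00, 1000),
        (0.01, 2460),
        (0.05, 2460),   # resent after 40 ms: not flagged
        (1.20, 1000),   # resent after 1.2 s: flagged as RTO-like
    ]
    print(find_rto_like_retransmits(trace))
```

A heuristic like this cannot tell a timer expiry from an application that simply paused, which is one reason a tool that models the endpoints' TCP state can draw more reliable conclusions.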
In a real-world example, we received a call from one of our customers about a very high number of TCP RTOs on some key servers. He asked us to help him validate and analyze the data. We logged into the ExtraHop platform together, and sure enough, during traffic spikes, the RTO metric would climb to approximately eight million. Because the ExtraHop UI allows for easy drill-down, we were able to quickly determine that the majority of the RTOs could be traced to a single blade enclosure and two specific server instances.
Using this information, we pinpointed the problem quickly, which kept the mean time to resolution (MTTR) very short. The RTO metric helps to identify packet loss and to locate the congested links. There are a few likely causes within the network: a duplex mismatch on the switch, a bad cable, bad checksums, or a driver issue. In this customer's case, we found an incorrect flow control setting. After adjusting this setting, the RTOs dropped by more than 90%, which was a big win for our customer and for the applications running on their servers.
In summary, retransmission timeouts result in serious network stalls and performance degradation. The ExtraHop platform makes it easy to identify RTOs and eliminate them quickly.
Want to see exactly how easy it is to monitor TCP round-trip times and retransmission timeouts in ExtraHop? Explore our free online demo.
This is a companion discussion topic for the original entry at http://www.extrahop.com/community/blog/2016/retransmission-timeouts-rtos-application-performance-degradation/