Retransmission timeout and round-trip time metrics can reveal virtual packet loss.
This month's Performance Metric turns to retransmission timeouts (RTOs) and round-trip times, particularly for transactions traveling to virtualized systems. Both high RTOs and high levels of jitter on round-trip times in these instances likely indicate virtual packet loss, a hard-to-detect performance problem that plagues underprovisioned virtual environments.
We've encountered this problem at several customer sites with virtualized application servers. Virtual packet loss is not detected by either hypervisor VM statistics or traditional network monitoring tools. To make matters worse, the problem often appears only intermittently.
For one customer, virtual packet loss soured a new virtualization project. The Application team had recently migrated an application to a virtualized environment and was now seeing the application slow down. Because the virtualized application servers reported packet loss, the Application team naturally asked the Network team to investigate. Infrastructure monitoring tools, however, did not show any dropped packets on the switch ports or router links. Network engineers also found no problems when they spot-checked a few flows with a packet sniffer.
This problem remained unresolved for several months until the customer engaged ExtraHop in a proof-of-concept trial. Several days after the ExtraHop system was connected to the network, the virtualized application slowed again without apparent cause, showing the usual misleading symptoms of network congestion.
The ExtraHop system automatically classifies devices according to role and vendor, among other attributes. By filtering for "vmware" in the device list, for example, IT teams can easily discover all the virtual machines present on the network. Starting from this selection of devices, our sales engineer quickly navigated to the TCP analysis and pointed out the high RTOs and high levels of jitter on round-trip times, both of which indicate virtual packet loss.
The customer then checked their vCenter management console and confirmed that the virtual machines in question were using all available memory and CPU on the host. The solution was relatively simple: move some of the VMs to a less crowded physical host.
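The check the sales engineer performed by eye can be approximated in a few lines. The Python sketch below is illustrative only: the input layout, the field names (`rtt_ms`, `rtos`, `segments`), the host names, and both thresholds are assumptions made for this example, not ExtraHop metric names or recommended values. It flags hosts that show high round-trip-time jitter and a high RTO rate at the same time.

```python
from statistics import mean


def rtt_jitter(samples):
    """Mean absolute difference between consecutive RTT samples (ms)."""
    if len(samples) < 2:
        return 0.0
    return mean(abs(b - a) for a, b in zip(samples, samples[1:]))


def suspect_hosts(metrics, max_jitter_ms=20.0, max_rto_rate=0.01):
    """Return hosts whose jitter AND RTO rate both exceed the thresholds.

    `metrics` maps a host name to a dict with hypothetical keys:
      rtt_ms   - list of round-trip-time samples in milliseconds
      rtos     - number of retransmission timeouts observed
      segments - number of segments sent in the same interval
    The default thresholds are placeholders; tune them per environment.
    """
    suspects = []
    for host, m in metrics.items():
        jitter = rtt_jitter(m["rtt_ms"])
        rto_rate = m["rtos"] / m["segments"] if m["segments"] else 0.0
        if jitter > max_jitter_ms and rto_rate > max_rto_rate:
            suspects.append(host)
    return sorted(suspects)


# Made-up sample data for two virtual machines.
sample_metrics = {
    "vm-app-01": {"rtt_ms": [2, 95, 3, 120, 4], "rtos": 40, "segments": 1000},
    "vm-app-02": {"rtt_ms": [2, 3, 2, 3], "rtos": 1, "segments": 1000},
}
print(suspect_hosts(sample_metrics))
```

Requiring both signals together is a deliberate choice in this sketch: jitter alone can come from ordinary queuing, and a handful of RTOs alone can come from genuine loss, while the combination on a virtual machine is the pattern described above. A flagged host is only a candidate, and host CPU and memory usage still has to be confirmed in the hypervisor console.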
TCP connection analysis can reveal the indicators of virtual packet loss.
We've written about the causes of virtual packet loss before. In summary, CPU oversubscription at the hypervisor level delays either the delivery of incoming packets or the acknowledgments sent for them. When an acknowledgment does not arrive before the sender's retransmission timer expires, the sender resends the packets and backs off its transmission rate. It's important to note that although the network appears to be congested and losing packets, the network and hypervisor are actually delivering them; the packets are just arriving too late to satisfy TCP's timing requirements. See Virtualization Journal for a detailed technical article on virtual packet loss.
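To make the timing concrete, the following Python sketch implements the standard retransmission timer calculation from RFC 6298 and counts how often an acknowledgment that does arrive still arrives after the timer has expired. It is a simplified model, not any particular TCP stack: the 200 ms floor is an assumption (RFC 6298 itself recommends a 1 second minimum), the sample delays are made up, and the exponential timer backoff that follows a timeout is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RtoEstimator:
    """Retransmission timer per RFC 6298 (alpha = 1/8, beta = 1/4, K = 4)."""
    min_rto: float = 0.2          # assumed floor in seconds, not the RFC value
    srtt: Optional[float] = None  # smoothed round-trip time
    rttvar: float = 0.0           # round-trip time variation

    def rto(self) -> float:
        if self.srtt is None:
            return 1.0            # RFC 6298 initial RTO before any sample
        return max(self.min_rto, self.srtt + 4 * self.rttvar)

    def observe(self, rtt: float) -> None:
        if self.srtt is None:     # first measurement
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:                     # RTTVAR is updated before SRTT
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt


def count_late_acks(ack_delays, estimator=None):
    """Count ACKs that arrive after the sender's timer has already fired.

    Each such ACK means the packet was delivered, yet the sender has
    retransmitted it anyway: the "virtual" loss described above.
    """
    est = estimator or RtoEstimator()
    late = 0
    for delay in ack_delays:
        if delay > est.rto():
            late += 1             # timer fired first; segment was resent
            continue              # Karn's algorithm: skip this RTT sample
        est.observe(delay)
    return late


# 10 ms ACKs, with one ACK held up 350 ms by a starved hypervisor.
delays = [0.010] * 20 + [0.350] + [0.010] * 5
print(count_late_acks(delays))
```

In this model no packet is ever dropped, yet the sender still registers a timeout for the delayed acknowledgment. That is why the switch and router counters in the customer story showed no loss while the servers reported it.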
Because virtualization obscures traditional performance monitoring techniques, problems such as virtual packet loss are not well understood by many IT professionals. This obstacle likely explains the reluctance of many organizations to migrate their business-critical applications to virtual environments. The ExtraHop system provides a deterministic, objective view that can help companies manage application performance in dynamic virtualized environments. Read our recent white paper explaining why wire data analytics is necessary for virtualization.
Virtual packet loss and other performance-sapping problems can negate some of the benefits of virtualization technology, as well as distract IT Operations teams from tasks that add more direct business value. Want to see how wire data can help diagnose virtual packet loss in your environment? Try the free, interactive ExtraHop demo.
This is a companion discussion topic for the original entry at http://www.extrahop.com/post/blog/performance-metrics/performance-metric-month-virtual-packet-loss/