Larry Wall, the creator of the Perl programming language, championed the idea of "making easy things easy and hard things possible." This month's Performance Metric of the Month highlights how ExtraHop, with its programmatic interface to its parsing engine, makes hard things possible. In this case, a major web security firm used ExtraHop to pinpoint the cause of extreme latency experienced by a fraction of the users of its web gateway/proxy SaaS offering.
Statistical Averages and Performance Outliers

Web gateways, firewalls, load balancers, content filters, content optimizers, and other proxies break sessions into multiple transactions through network-address translation (NAT) so that A -> C becomes A -> B -> C, for example. This makes it difficult to measure key performance indicators (KPIs) such as latency and processing time per user, which are critical when trying to troubleshoot complaints of slow performance. The web security company mentioned above was receiving complaints from some of its users about extremely slow performance. Even though average performance for all users was great, the IT team knew that the performance of outliers, not statistical averages, held the key to excellent user experience. The trouble was, how to stitch together the transactions broken apart by the NATed proxy?
Challenge: Stitching Together Transactions Broken Up by NATed Proxies

Adding to the challenge, the web security company had no control over the endpoints of the proxy architecture (as would have been the case with a forward or reverse proxy) and could only control the security gateway itself. The IT Operations team considered instrumenting the application and modifying the gateway to insert unique identifier tags. Both options required development work and added complexity and cost, possibly hundreds of thousands of dollars depending on the transaction tracing tool used.

Compared to these options, the solution with ExtraHop was trivial. Using Application Inspection Triggers, the programmatic interface to ExtraHop's Context and Correlation Engine, the web security company created unique session fingerprints and stitched together HTTP flows so that they could measure each leg of every transaction, even as those transactions traversed a complicated proxy architecture.
To build a unique identifier for the request flow (1-7 in the diagram above), the team built an Application Inspection Trigger that would recognize the URI, the User-Agent header, and the client IP contained in the X-Forwarded-For HTTP header field. For the response flow (4-11 in the diagram above), the team built the unique identifier from the URI, proxy IP address, ETag header, Set-Cookie header, and Expires header. Together, these identifiers comprised a unique fingerprint for each transaction that did not depend on instrumentation of the application code or inserted tags.

It's worth noting that this type of agentless recognize-and-trace transaction tracing is even simpler in scenarios without a proxy. In those cases, IT teams can use an existing unique identifier such as a customer ID, Object ID, or the embedded tags and session IDs inserted by JSP, PHP, and Microsoft ASP. For example, ExtraHop offers a solution bundle that recognizes and traces multi-hop web-to-database transactions for SharePoint. The point of this web gateway illustration is to show the extensibility of ExtraHop wire data analysis to handle the worst-case scenario, or to put it in terms that Larry Wall would appreciate, to make easy things easy and hard things possible.
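To make the fingerprinting idea above concrete, here is a minimal Python sketch. It is not the ExtraHop trigger code or its API; the function names, the SHA-256 hashing, and the rule for picking the client IP (first X-Forwarded-For entry when the header is present, otherwise the packet's source address) are all assumptions chosen for illustration. Only the header fields that go into each fingerprint come from the description above.

```python
import hashlib
from typing import Mapping, Sequence


def _digest(parts: Sequence[str]) -> str:
    """Join the parts with a separator unlikely to appear in headers, then hash."""
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def _lowercase_keys(headers: Mapping[str, str]) -> dict:
    """HTTP header names are case-insensitive, so normalize them once."""
    return {name.lower(): value for name, value in headers.items()}


def request_fingerprint(uri: str, headers: Mapping[str, str], source_ip: str) -> str:
    """Fingerprint a request from URI, User-Agent, and original client IP.

    Assumption: behind the proxy, the first X-Forwarded-For entry is the client;
    in front of the proxy the header is absent and the source IP is the client.
    """
    h = _lowercase_keys(headers)
    xff = h.get("x-forwarded-for", "")
    client_ip = xff.split(",")[0].strip() if xff else source_ip
    return _digest([uri, h.get("user-agent", ""), client_ip])


def response_fingerprint(uri: str, headers: Mapping[str, str], proxy_ip: str) -> str:
    """Fingerprint a response from URI, proxy IP, ETag, Set-Cookie, and Expires."""
    h = _lowercase_keys(headers)
    return _digest([uri, proxy_ip, h.get("etag", ""),
                    h.get("set-cookie", ""), h.get("expires", "")])


# Hypothetical observations of the same request on both sides of the proxy.
client_leg = request_fingerprint(
    "/index.html", {"User-Agent": "ExampleBrowser/1.0"}, "203.0.113.7")
proxy_leg = request_fingerprint(
    "/index.html",
    {"User-Agent": "ExampleBrowser/1.0", "X-Forwarded-For": "203.0.113.7"},
    "198.51.100.20")
print(client_leg == proxy_leg)  # True: both legs map to the same transaction key
```

Because the two legs hash to the same key even though NAT changed the source address, records keyed this way can be joined without tagging the traffic. Whether these fields are unique enough in a given environment would need to be checked against real traffic.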
Solving Complex Performance Problems

By building a unique fingerprint for every transaction in ExtraHop, the IT Operations team at the web security company was able to answer the following questions:
- What was the latency for the complete transaction (1-11 in the diagram above)?
- What was the latency for the request across the proxy (2-6 in the diagram)?
- What was the latency for retrieving content from the destination (7-8 in the diagram)?
- What was the latency for the response across the proxy (9-10 in the diagram)?
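Once every observation of a transaction shares a fingerprint, each of the questions above reduces to subtracting two timestamps. The sketch below assumes a hypothetical record that maps the diagram's step numbers to timestamps in milliseconds; the leg names and the sample values are made up for illustration, and only the step ranges come from the list above.

```python
from typing import Dict

# Step ranges taken from the questions above; the leg names are illustrative.
LEGS = {
    "complete_transaction": (1, 11),
    "request_across_proxy": (2, 6),
    "destination_fetch": (7, 8),
    "response_across_proxy": (9, 10),
}


def leg_latencies(step_times_ms: Dict[int, int]) -> Dict[str, int]:
    """Return the elapsed milliseconds for each leg of one stitched transaction."""
    return {
        name: step_times_ms[end] - step_times_ms[start]
        for name, (start, end) in LEGS.items()
    }


# Hypothetical timestamps (ms since the first packet) for one slow transaction.
slow_transaction = {1: 0, 2: 5, 6: 12, 7: 13, 8: 80, 9: 81, 10: 9081, 11: 9086}
print(leg_latencies(slow_transaction))
```

In this made-up record, almost all of the elapsed time falls in the response leg across the proxy, which is the kind of breakdown that points at the proxy rather than the destination server.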
In ExtraHop, the IT Operations team could see the median latency and 25th to 75th percentile spread for each leg of the complete transaction, shown in the graph above. These summary statistics showed that performance was good. Viewing the ExtraHop metrics in a heatmap, however, revealed 95th-percentile outliers all the way to the 9-second mark on responses traversing the web gateway proxy. This indicated that the proxy itself was introducing the delay. With the help of the development team, the IT Operations team found a DNS reverse-lookup process that was timing out. Fixing this process eliminated the unusual latency (as much as 2 minutes!) experienced by some users.
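A small Python sketch shows why the median and interquartile range can look healthy while a minority of users suffer. The sample is invented (90 fast responses and 10 slow ones), and the function name is hypothetical; it uses only the standard library's `statistics.quantiles`.

```python
import statistics
from typing import Dict, Sequence


def latency_summary(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Return the quartiles and the 95th percentile of a latency sample."""
    p25, median, p75 = statistics.quantiles(latencies_ms, n=4)
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(latencies_ms, n=100)[94]
    return {"p25": p25, "median": median, "p75": p75, "p95": p95}


# Made-up sample: 90 responses near 100 ms and 10 responses at 9 seconds.
sample_ms = [100 + (i % 10) for i in range(90)] + [9000] * 10
summary = latency_summary(sample_ms)
print(summary)
```

With this sample the quartiles all sit near 100 ms, while the 95th percentile lands in the 9-second group. A view limited to the median and the 25th to 75th percentile band would therefore miss the slow tail entirely.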
Extensibility Is Important

While other IT monitoring solutions can do a few prescribed tasks well, ExtraHop offers a programmatic interface to its Context and Correlation Engine that enables IT teams to tackle unexpected challenges, such as transaction tracing across a NATed proxy, whether that proxy is a gateway, firewall, load balancer, content filter, or optimizer. As far as we know, there is no other solution that can do this without instrumenting the application or inserting tags.

Would you like to try out the extensible ExtraHop platform for yourself? Try our free, interactive demo. If you're already an ExtraHop pro, the transaction tracing solution bundle for NATed proxies, which provides the capabilities described above, is also available for download in the ExtraHop bundles gallery.
Watch the video below to learn about the importance of investigating anomalies and outliers in your datasets. Learn more about how ExtraHop's visualizations preserve meaning when aggregating large sets of wire data.
This is a companion discussion topic for the original entry at http://www.extrahop.com/post/blog/transaction-tracing-for-web-gateways-load-balancers-and-other-proxies/