Scenario: Reported slowness and failures on a key application
ExtraHop troubleshooting process:
An application container had already been created for this key application, so we started there. The overview immediately showed traffic falling to nothing, recovering, and falling off again. Since DB traffic makes up the largest share, we started there, but by clicking and dragging on the graph to narrow the time window we quickly realized that all traffic was falling off, not just DB traffic. With the time window locked, we pivoted to the TCP metrics and saw a spike in connections before the drop, indicating things were starting to back up (odd). We looked at TCP throttling (none) and congestion (RTOs increased both inbound and outbound). So something was wrong between this device and its peers. The server activity is SSL, so we couldn't see inside the HTTPS traffic, but that didn't matter in this scenario. We verified on DB and other client activity that there were no errors, no increased processing time, etc., indicating no application issues once traffic actually reached the device. At this point we determined that the application symptom was being caused by something else, and something catastrophic enough to impact all traffic. From the TCP view, it looked as if all application transactions were being interrupted and having to recover.
As we drilled further into RTOs, someone asked whether this device was a virtual server, and if so, whether other servers on the same host were having issues. Nailed it! The app server is virtual, and a virtual host had failed, which moved this server to another host. That is why traffic was up and down after the incident: the team was verifying the move and adjusting resources. Unfortunately, this application was not immune to that activity.
ExtraHop shows every layer of the application, including all the dependent infrastructure (storage, DB, etc.). This helps you avoid troubleshooting ratholes and gets you to root cause faster. TCP analysis is also a good barometer for issues with the underlying infrastructure.
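To make the "TCP as a barometer" idea concrete, the pattern we keyed on in this incident (a sudden jump in RTOs against a quiet baseline) can be sketched as a simple spike check over per-interval RTO counts. This is an illustrative sketch only; the function name, threshold, and sample data are hypothetical and not part of any ExtraHop API.

```python
# Hypothetical sketch: flag retransmission-timeout (RTO) spikes in a
# per-interval count series as a rough infrastructure-health signal.

def flag_rto_spikes(samples, baseline_window=5, threshold=3.0):
    """Return indices of intervals whose RTO count exceeds `threshold`
    times the mean of the preceding `baseline_window` intervals."""
    flagged = []
    for i in range(baseline_window, len(samples)):
        baseline = samples[i - baseline_window:i]
        mean = sum(baseline) / baseline_window
        if mean > 0 and samples[i] > threshold * mean:
            flagged.append(i)
        elif mean == 0 and samples[i] > 0:
            # Any RTOs after a completely clean baseline are notable.
            flagged.append(i)
    return flagged

# Illustrative per-minute RTO counts: steady, then a burst like the incident.
rtos = [2, 1, 2, 2, 1, 2, 1, 25, 40, 3, 2]
print(flag_rto_spikes(rtos))  # → [7, 8]
```

In practice you would feed this from whatever metric export you have; the point is that RTO counts rising sharply on both sides of a conversation, with no throttling, pointed us away from the application and toward the infrastructure.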