I am requesting additional clarification on how EH classifies abort types. Our description of TCP aborts (“A TCP connection is aborted when it is neither explicitly closed nor implicitly timed out: it is simply abandoned midstream”) seems vague. Does it apply only when a graceful shutdown did not occur? Additional information on what constitutes an HTTP abort and an SSL abort would also be helpful. Our documentation says an SSL abort occurs when the SSL handshake was not completed, but what about when the application payload was not completed? Would that be considered an SSL abort? More clarification would be much appreciated.
I am looking at this myself
The Trigger API guide at the time I write this has a good description of what constitutes a TCP abort and L7 abort. You can search for ‘isAborted’ from this link, but excerpting a screenshot here:
@spokompton, can you share where you found the “abandoned midstream” version of the definition? I don’t find it in the in-product help, the metric catalog, or the Trigger API Guide.
I am not sure where the “abandoned midstream” description comes from either. The customer responded that they have seen the isAborted section of the Trigger API guide, but they are looking for further clarification: “For L4, I am mainly looking for how it is registered before syn ack vs between syn ack and fin vs after fin ack.”
The flow isn’t established until the three-way handshake completes, so early aborts would be tracked as unanswered SYNs or RSTs. There is also a metric called TCP setup time, which is fairly self-explanatory.
There is some subtlety about what constitutes an abort when the connection is half closed, i.e. one side sends a FIN. Some hosts and devices will perform an “unclean shutdown” and respond to the FIN with a RST. As I recall, we do not count this as an abort. Otherwise, RSTs received during the flow will count as aborts.
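To make the three cases in the question concrete (before SYN-ACK, between SYN-ACK and FIN, after FIN), here is a minimal Python sketch of the rules described above. It is an illustration only, not ExtraHop's implementation: the function name, the event labels, and the result strings are all hypothetical, and the half-closed rule is based on my recollection rather than confirmed behavior.

```python
from enum import Enum, auto


class FlowState(Enum):
    HANDSHAKE = auto()    # SYN seen, three-way handshake not yet complete
    ESTABLISHED = auto()  # handshake complete, no FIN seen yet
    HALF_CLOSED = auto()  # at least one side has sent a FIN


def classify_flow(events):
    """Classify a simplified segment sequence (hypothetical labels).

    `events` is a list of strings in arrival order, each one of
    'SYN', 'SYN-ACK', 'ACK', 'DATA', 'FIN', 'RST'.
    """
    state = FlowState.HANDSHAKE
    seen_syn_ack = False
    fins = 0
    for ev in events:
        if ev == "RST":
            if state is FlowState.HANDSHAKE:
                return "failed_setup"      # no flow yet, so not an abort
            if state is FlowState.ESTABLISHED:
                return "abort"             # RST on an established flow
            return "unclean_shutdown"      # RST after a FIN: assumed not an abort
        if state is FlowState.HANDSHAKE:
            if ev == "SYN-ACK":
                seen_syn_ack = True
            elif ev == "ACK" and seen_syn_ack:
                state = FlowState.ESTABLISHED
        elif ev == "FIN":
            fins += 1
            state = FlowState.HALF_CLOSED
            if fins == 2:
                return "closed"            # both sides sent a FIN
    return "open"


print(classify_flow(["SYN", "SYN-ACK", "ACK", "DATA", "RST"]))
```

The point of the sketch is that the same RST lands in a different bucket depending on where the flow is in its lifecycle, which is why a reset during setup does not show up in the abort count.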
I hope this helps.
I wouldn’t think that is an abort either. When the shutdown of a socket starts and the socket moves to the FIN state, it is reasonable to assume it is done. I am currently trying to identify when the three-way handshake fails and why. We have firewalls between applications and databases, and they sometimes cut the connections on a timer. I would love to find a way to tell whether the connection breaks because of the firewall or one of the servers. If the firewall does this, I want to alert, because it will cause production issues for my company…
Proving to be a real task
I recommend paying special attention to the following metrics:
TCP - Unanswered SYNs In / Out
This should indicate when the 3-way handshake is failing.
TCP - Setup Time
Long connection setup time could also indicate a problem.
TCP - PAWS dropped SYNs In
If this is non-zero, then you might be running into an issue with wrapped sequence numbers possibly due to NAT.
TCP.handshakeTime is a trigger property for a given flow.
TCP setup time is a built-in metric that provides a dataset or distribution for handshake times associated with a particular device or endpoint. This metric is not displayed by default but can be added to a dashboard.
I hope this helps!
I have a case where I have hundreds of “syns received” but no “accepted” connections, but also 0 “unanswered syns”. Clearly the handshake is failing, but they did not qualify as unanswered syns, which is what we normally use to track when a service stops responding. Under what conditions would that occur?
Unanswered SYNs are counted when a SYN is retransmitted. I would expect any legitimate client to retry the connection while things like port scans and SYN flood attacks would not. Do you know anything about the clients that are attempting to connect?
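As a rough illustration of why hundreds of SYNs can produce zero unanswered SYNs, here is a small Python sketch that counts a SYN as unanswered only when the same SYN is seen again. The packet tuple layout and the function name are hypothetical, and the exact matching rules the product uses are not something I can confirm here.

```python
from collections import Counter


def count_unanswered_syns(packets):
    """Count retransmitted SYNs per (server, port).

    `packets` is an iterable of (client, server, port, flags, seq) tuples,
    a hypothetical simplified record where client and server are fixed
    per connection attempt regardless of packet direction.
    """
    pending = set()        # SYNs seen once and not yet answered
    unanswered = Counter()
    for client, server, port, flags, seq in packets:
        if flags == "SYN":
            key = (client, server, port, seq)
            if key in pending:
                # Same SYN seen again: the first one went unanswered.
                unanswered[(server, port)] += 1
            else:
                pending.add(key)
        elif flags == "SYN-ACK":
            # The server answered: forget pending SYNs for this attempt.
            pending = {k for k in pending if k[:3] != (client, server, port)}
    return unanswered


# A scan: three clients send one SYN each and never retry.
scan = [(f"10.0.0.{i}", "10.1.1.1", 443, "SYN", 1000 + i) for i in range(3)]
# A legitimate client retries the same SYN twice.
retry = [("10.0.0.9", "10.1.1.1", 443, "SYN", 5000)] * 3

print(count_unanswered_syns(scan))
print(count_unanswered_syns(retry))
```

Under these assumptions the scan traffic never increments the counter, while the retrying client does, which matches the behavior you are describing.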
@raychambers: a quick note on the FIN close stage and assuming it’s done… This is true, but since TCP is a full-duplex protocol, it’s only half true, at least until you’ve confirmed that both sides have sent a FIN. Hence webslinger’s comment on a half close, particularly one that’s been circumvented with a (bad-mannered, I’ll add) RST.
If you’re looking for failed setups, that’s definitely not an abort, which is incremented when we see a healthy, established socket get hammered with a RST.
I am looking for 2 things.
A request to a port that gets “connection refused” because the port is closed (server is down/broken/port scan).
Situations where a SYN flood attack or a large volume of traffic hits the server, causes a backlog, and end users get timeouts.
The resets work: I can see a lot of them when I create both situations, but I simply can’t alert on which port and client/server IP are involved.
The TCP unanswered SYNs metric addresses the case where the server is down or broken.
Port scans are stealthy and somewhat harder to detect. We show how to do this in the Threat ID bundle, which I believe takes advantage of the new distinct metric type.
Volumetric attacks can be detected by large spikes in the SYNs received metric (assuming a SYN flood attack). Legitimate users attempting to establish new connections will likely see unanswered SYNs. However, existing connections will typically just experience retransmission timeouts. As I recall, TCP won’t disconnect (with a reset) until a keep-alive probe fails, and most operating systems have fairly long TCP keep-alive timeouts.
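For the spike detection idea, a minimal Python sketch is below. It assumes you have already exported per-interval SYNs received counts for a device; the function name, window size, factor, and floor are all hypothetical values you would need to tune, and this is not how the built-in alerting works.

```python
from statistics import median


def syn_spikes(counts, window=5, factor=5.0, floor=10):
    """Return indexes of intervals whose SYN count spikes above baseline.

    The baseline is the median of the previous `window` intervals, with
    a minimum of `floor` so quiet periods do not trigger on tiny changes.
    """
    spikes = []
    for i in range(window, len(counts)):
        baseline = max(median(counts[i - window:i]), floor)
        if counts[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Hypothetical per-interval SYNs received counts.
counts = [20, 22, 19, 21, 20, 23, 400, 21]
print(syn_spikes(counts))  # [6]
```

Using a median rather than a mean keeps a single flood interval from inflating the baseline for the intervals that follow it.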
Would you be willing to help me understand the unanswered SYNs from a trigger view point?
The alert will not allow me to kick off CGI scripts or pass along the host and port that are seeing the problem. I can see that information only in the alert display of ExtraHop.
Trigger events are generally drawn from after a flow has been established, and what we record as unanswered SYNs precedes the establishment of a flow.
To this point, could the “Additional metrics in emails” option under the Notifications tab give you the detail you need?
Adding those metrics to the alert configuration yields lines like this in the notification email that gets sent:
I have to check that out. I didn’t see it when I set up that alert.
It’s been a minute since I checked on this, and we have deployed ETA since then, which has given us the opportunity to examine various abort and connection failure behaviors. I can see why there are few straight answers here: there is a lot of variety in aborts and failed connections. We had one case where there was a SYN and a SYN-ACK, then the original sender of the SYN rejected the SYN-ACK due to a subtle error in the iptables configuration. So there were no unanswered SYNs, but there was also no complete handshake, so no aborts. With HTTP aborts, there was a 10-second idle timeout causing the web session to close while the app tier was busily chugging away. Sometimes this would show up as an HTTP abort if the status 200 never came back from the app tier, but sometimes it would show up as a response time of 10+ seconds if the response came back but the web tier session was already down, so the transaction failed anyway from the client’s point of view. Those are just two of many variations we have encountered. The point is that there are so many ways a connection or a transaction can fail that a handful of metrics can’t really do the situation justice. What we are doing now is letting the metrics show us that there was a general sort of problem, then using ETA to grab the packets and see the exact behavior. It has sped up troubleshooting and allowed us to diagnose some really oddball issues that we could not otherwise have found the root cause of.
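The two HTTP outcomes we saw with the 10-second idle timeout can be summarized in a few lines of Python. This only restates the observation above; the function name and result labels are hypothetical, and it says nothing about how the product itself decides what an HTTP abort is.

```python
def classify_transaction(response_after_s, idle_timeout_s=10.0):
    """Describe how one HTTP transaction appeared in our case.

    `response_after_s` is the seconds until the app tier responded,
    or None if no response ever came back.
    """
    if response_after_s is None:
        return "http_abort"       # status never returned from the app tier
    if response_after_s >= idle_timeout_s:
        return "late_response"    # recorded as 10+ s, but the client had already lost the session
    return "ok"


for sample in (None, 12.4, 0.8):
    print(sample, classify_transaction(sample))
```

Both of the first two outcomes are failures from the client’s point of view, but only one of them shows up in the abort metric.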