The ExtraHop Nagle Counter
We count Nagle delays not just based on the presence of the (ubiquitous) algorithm, but based on the interaction between delayed acknowledgements and Nagle.
They both try to do the right thing, but can interact poorly.
Here’s the meat of the logic. We’ll start with the Wikipedia article on Nagle’s algorithm (emphasis mine):

> IF the window size >= MSS and the available data is >= MSS
> --> send complete MSS segment now
> ELSE IF there is unconfirmed data still in the pipe
> --> enqueue data in the buffer **until an ACK is received**
So the “cause” is this logic and its caveat:
- You’ve got data to send, but you’ve got extra room in the segment.
- You can fit more data into the segment if you wait a bit. If more data appears, add it to the segment until it’s full, then send.
The benefit here is pretty obvious: more efficient network utilization, etc.
But sometimes (or often) you want to send what you’ve got and don’t want to wait. In these cases Nagle can hurt you.
Let’s go back to the algorithm - see the “until ack is received” bit? THAT is the killer here, because of Delayed ACKs.
Delayed ACK is another well-intentioned algorithm: rather than sending a bare ACK right away, it waits a bit in the hope of piggybacking the ACK on outgoing data. But because part of Nagle’s algo depends on an ACK to send data, it creates a problem.
Effectively it’s a race condition - let’s imagine for a moment that your app is serving, say, lots of small (i.e. smaller than the MSS) images or small XML responses. The payloads are small and they should be delivered ASAP.
Here’s the race: You’ve got data to send. Great! But recall that Nagle is waiting until:
- It gets enough data to fill up the MSS -or-
- Its timer expires -or-
- It gets an ACK from the receive side
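The three wait conditions above reduce to a simple send-or-hold decision. Here’s a rough sketch in Python (my own simplified model, not a real TCP stack - the names are mine):

```python
def nagle_should_send(buffered_bytes, mss, unacked_in_flight, window):
    """Simplified model of Nagle's send decision (not a real TCP stack)."""
    if window >= mss and buffered_bytes >= mss:
        return True   # a full segment is ready: send it now
    if not unacked_in_flight:
        return True   # nothing awaiting an ACK: the small segment goes out
    return False      # hold the data until an ACK arrives (or the buffer fills)
```

With older data still in flight, a sub-MSS payload returns False - and that False is exactly where your response sits while delayed ACK holds up the other side.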
See #3 above? That’s the issue - delayed ACK is doing something similar!
Its logic goes like this: “Hey, if I can piggyback this ACK on an outgoing data segment, I will. Don’t send a bare ACK - we’ll be kinder to the network this way.”
Here’s the algorithm, basically:
> IF you are ready to send an ACK:
> --> wait (usually 200ms, can be up to 500ms) for data to piggyback
> ELSE IF (the delayed_ack timer fires) OR (I get another inbound segment to ACK):
> --> send the packet
And there’s the deadlock:
Delayed ACKs are waiting around to send the ACK and Nagle’s is waiting around to receive the ACK!
So the net effect? You get random stalls of 200-500ms on segments that could otherwise be sent immediately and delivered to the receive side stack and apps above it.
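To make the stall concrete, here’s a toy back-of-the-envelope model (my own simplification - real stacks have more moving parts, and the delayed-ACK timer varies by OS):

```python
def stall_ms(payload_bytes, mss=1460, unacked_in_flight=True, delayed_ack_ms=200):
    """Toy model: how long a sub-MSS send sits in the sender's buffer.

    Nagle holds the payload while older data is unacked; the receiver
    holds the ACK hoping to piggyback it on data of its own. If it has
    none, nothing moves until the receiver's delayed-ACK timer fires.
    """
    if payload_bytes >= mss or not unacked_in_flight:
        return 0              # Nagle sends immediately: no stall
    return delayed_ack_ms     # blocked until the delayed-ACK timer fires
```

A 512-byte XML response queued behind unacked data eats the full 200ms; a full-MSS segment, or one with nothing in flight ahead of it, goes out immediately.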
Many servers these days disable Nagle by setting TCP_NODELAY on their sockets, but not all. Many others expose it as a config option. Intermediate devices like proxies and load balancers often re-impose the algorithm because of “sane” defaults. They’re trying to do the right thing, but they’re not.
Basically, here are the options to minimize or eliminate the issue:
Disable Nagle by setting TCP_NODELAY on your server sockets, or via profile tweaks on your proxy / LB / ADC. This will eliminate the issue.
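If you control the server code, the per-socket switch looks like this in Python (TCP_NODELAY is standard across Linux, BSD, and Windows):

```python
import socket

# Create a TCP socket and turn Nagle off for it.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Verify: a nonzero value means Nagle is disabled on this socket.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```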
Lower the delayed ACK timer on your servers, LBs, etc. This will help, but won’t eliminate the issue.
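On Linux there’s also a per-socket knob on the receive side: TCP_QUICKACK asks the kernel to ACK immediately, though the kernel can silently flip back to delayed ACKs, so apps that care usually re-set it after each read. A sketch (Linux-only - other platforms won’t have the constant):

```python
import socket

def enable_quickack(sock):
    """Ask the kernel to ACK immediately on this socket (Linux only).

    Returns True if the option was set, False where it's unavailable.
    Note: Linux may revert to delayed ACKs on its own, so this is
    typically called again after each recv().
    """
    if not hasattr(socket, "TCP_QUICKACK"):
        return False  # non-Linux platform: option not exposed
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
    return True

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
ok = enable_quickack(sock)
sock.close()
```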
At the end of the day, it all boils down to your specific workload and the dominant traffic patterns of a service. The extreme example is the X Window System: if I move my mouse or type a few characters into my terminal session, I want to see those events echoed immediately.
COMMON PROTOCOLS that take the Nagle hit are:
- Interactive type traffic
- RPC-style calls (MS-RPC, SOAP, XMLRPC, CICS, some JMS, etc.)
- HTTP / web traffic, at least in many cases
Some examples where Nagle really helps:
- Large file transfers
- Large email messages
- Large CIFS transfers
- Large images
…the pattern should be pretty clear by now.
Hope this helps!