Add sysctl to disable Nagle's algorithm (RFC 896 – Congestion Control)

https://marc.info/?l=openbsd-tech&m=171562561424289

89 points by peter_hansteen on 2024-05-14 | 40 comments

Automated Summary

The article discusses a proposal to add a system-wide switch for disabling Nagle's algorithm, a classic TCP mechanism described in RFC 896. The algorithm coalesces small writes from userland applications into fewer, larger TCP segments, which reduces packet overhead at the cost of higher latency. Critics argue that Nagle's algorithm negatively impacts performance in modern high-speed networks. Some popular applications, such as ssh, httpd, and iscsid, have already disabled Nagle's algorithm. A post on tech@ by Job Snijders proposes a patch to implement a sysctl named net.inet.tcp.nodelay, which would disable Nagle's algorithm system-wide by setting TCP_NODELAY on all TCP sockets. The proposal is currently under discussion on tech@.

Comments

Animats on 2024-05-15

You can't turn off delayed ACKs and make them stay off, which is a related problem. The Linux API for that is very strange. It only applies for a short period.

Delayed ACKs and the Nagle algorithm should never be on at the same time. The trouble is, they're controlled at opposite ends of the connection. You can turn off the Nagle algorithm at your end, but you want to turn off delayed ACKs at the other end. That's the practical problem.

Still, what's the use case for having multiple tiny messages in flight during one RTT? Games usually send their interactive traffic over UDP. If you have delayed ACKs off, you should never have a propagation delay of more than one RTT. It's that fixed timer in delayed ACKs, set to a value that made sense for keyboard echo, that can cause delays of more than one RTT.
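
A minimal sketch of the short-lived Linux setting described above, assuming the API in question is the `TCP_QUICKACK` socket option (the comment does not name it). The tcp(7) man page documents that flag as not permanent, so the sketch re-arms it after every read. The helper name `recv_with_quickack` is hypothetical, and the option is skipped on platforms where Python does not expose it.

```python
import socket


def recv_with_quickack(sock, nbytes):
    """Read from a TCP socket, then ask the kernel to ACK promptly again.

    Assumption: Linux, where TCP_QUICKACK is a one-shot hint that the
    stack may drop later, so it has to be set again after each read.
    """
    data = sock.recv(nbytes)
    quickack = getattr(socket, "TCP_QUICKACK", None)  # Linux-only constant
    if quickack is not None:
        sock.setsockopt(socket.IPPROTO_TCP, quickack, 1)
    return data
```

Note that this only affects ACKs sent by the local end, which is the point made above: the side that suffers from delayed ACKs is usually not the side that can switch them off.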

withinboredom on 2024-05-15

I find it hilarious that people want to turn off your algorithm because most people don’t know delayed ACKs is the “real problem”.

I’d much rather see delayed ACKs disabled as the new default vs. your algorithm being disabled by default.

I’ve seen too many applications not filling packets and sending tons of tiny packets. They’d benefit from your algorithm, even with delayed ACKs on… but people also confuse latency with throughput all the time (sometimes you want/need one more than the other, and languages like Go not giving programmers this ability to decide is frustrating, mainly because for a while, those programs were the biggest offenders)…

I digress… I do find this whole thing rather amusing.

armitron on 2024-05-16

Delayed ACKs are not the "real problem". For the last 10 years I haven't encountered a single environment where Nagle's algorithm would be a net win. No heuristics, no fancy auto-sensing. It should be off by default.

If you're working in distributed systems, fintech, mobile network optimization, broadcasting, the first thing you should do is switch off Nagle's algorithm. Animats should focus on Second Life and stop holding on to what's now a very bad default.

Animats on 2024-05-16

It is not, in general, a net win. It is a preventative measure against things getting really bad.

My other work back then, on fair queuing, followed the same line - keep things from getting really bad just because something was slightly overloaded.

When I was working on this, funding was from DARPA, Defense Communications Agency, and such. They wanted networks to keep working under bad conditions. Maximum price/performance was far less important. The price of getting the last 10% in performance is usually complexity and often fragility.

withinboredom on 2024-05-16

See, this is why we can't have real conversations about this problem, because most people don't even understand what they are saying and just spout off dogma.

1. Just because you haven't "encountered a single environment" doesn't mean you haven't been in one or that you'd even know what to look for to know if you were in one where it would be a net win.

2. "If you're working in distributed systems," you likely have very fat, fast, and reliable pipes between your services. Nagle's algorithm is probably a Bad Thing[tm] in those situations. If you have a lot of Wi-Fi interference or dropped packets, Nagle's algorithm can be the difference between 50 bps and 1 Mbps (assuming your application isn't filling packets), except that delayed ACKs prevent you from realizing all that.

There's no "one size fits all" solution, but you have control over Nagle's algorithm, you do not have control over Delayed Acks.

boulos on 2024-05-15

I kind of love that this continues to haunt you :).

schoen on 2024-05-15

For anyone who doesn't know, HN user Animats is John Nagle, eponym of Nagle's algorithm.

My main mental association with that algorithm is always being asked about it in "make menuconfig" when I used to compile my own Linux kernels. One of relatively few networking concepts I can think of that's named after a person (along with Van Jacobson header compression).

Animats on 2024-05-15

I called it "tinygram prevention".

jancsika on 2024-05-15

> The trouble is, they're controlled at opposite ends of the connection. You can turn off the Nagle algorithm at your end, but you want to turn off delayed ACKs at the other end. That's the practical problem.

For an extant case of this practical problem chosen at random, I'd be curious to know-- what's the likelihood that it's just openssh on both ends of the connection?

jdougan on 2024-05-15

Do you have any sense regarding how helpful tcp_autocorking on Linux is (or isn't)?

bhaney on 2024-05-15

I'd like to be able to confidently turn off Nagle's algorithm system-wide, but I'm always going to be concerned that I'll some day run a high-traffic application that depends on it without explicitly enabling it because it's been the default for so long.

I think this is one of the rare cases where I'd prefer if the kernel had a "magic" setting that went something along the lines of "if a connection isn't explicitly setting its use of Nagle's algorithm, default it to off, but occasionally look at the connections in this category that are generating the most packets on the system and turn Nagle's on for these connections if their packets are mostly small (or other heuristics)."

Potential waste from overly fragmented packets is going to be negligible on connections that don't represent a large portion of the system's traffic, so there's no reason to bother observing and tuning them (in case there are many of them and the cost of doing so becomes noticeable). It's really only the highest traffic connections where Nagle's might help, and selectively turning it on for connections that seem to benefit from it would maintain backwards compatibility while still reaping the nodelay benefits for the majority of software that doesn't care.

Animats on 2024-05-15

> I think this is one of the rare cases where I'd prefer if the kernel had a "magic" setting...

All this stuff should be automatic. The problem is that it can take a few round trips to discover what the application is doing. This comes up with "slow start", where you need some time to discover what's going on. In a world of short-lived HTTP connections, self-adjusting algorithms don't have time to self-adjust before exit.

A delayed ACK is a bet. You're betting that the other end is going to respond with useful data before the delayed ACK timer runs out. If it does, you won the bet. If it doesn't, you lost. Nothing checks whether you're on a losing streak. But it takes a few round trips to make that decision.

When I was working on this, the object was to get from appallingly bad to acceptable performance without too much complexity. Today, people crank up things like HTTP/3 to get a few percent more performance at the cost of greatly increased complexity.

bhaney on 2024-05-15

> In a world of short-lived HTTP connections, self-adjusting algorithms don't have time to self-adjust before exit

We could potentially have the kernel group these heuristics to a process or process group, since there's no real reason to live in the network stack and have its context restricted to a single connection. Like, if most packets on a system are being generated by a few nginx processes, and most of them seem to be tiny (or are on a losing streak for any other kind of bet), enable Nagle's (or any other relevant optimization) for any connections those nginx processes create?

Animats on 2024-05-15

Yes, and then you need a UI and a dashboard and an alarm system and a policy deployment manager and a ...

As you go for the last 10% of performance, the complexity climbs rapidly.

bhaney on 2024-05-15

Ah, so you're saying this should be managed by systemd rather than the kernel ;)

kstrauser on 2024-05-15

Do not say that out loud 3 times into a mirror.

GoblinSlayer on 2024-05-15

Reset packet injection is popular in the wild, so I'm optimistic about tip to toe authenticated UDP transports, also less retarded congestion control that doesn't take network down just to see if it will break. Complexity can be moved to a proxy, and simple applications don't care about tuning anyway.

jdougan on 2024-05-15

Is the Linux tcp_autocorking setting sufficient?

https://marc.info/?l=openbsd-tech&m=171573285422908&w=2

tcp_autocorking - BOOLEAN Enable TCP auto corking : When applications do consecutive small write()/sendmsg() system calls, we try to coalesce these small writes as much as possible, to lower total amount of sent packets. This is done if at least one prior packet for the flow is waiting in Qdisc queues or device transmit queue. Applications can still use TCP_CORK for optimal behavior when they know how/when to uncork their sockets.

          Default : 1
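
The quoted documentation mentions `TCP_CORK` as the explicit alternative to auto corking. A minimal sketch of that pattern, assuming Linux: cork the socket, issue the small writes, then uncork so the kernel can send the queued data in as few segments as possible. The helper name `send_corked` is hypothetical, and on platforms without `TCP_CORK` the writes simply go out uncorked.

```python
import socket


def send_corked(sock, chunks):
    """Send several small chunks while the socket is corked (Linux)."""
    cork = getattr(socket, "TCP_CORK", None)  # Linux-only constant
    if cork is not None:
        sock.setsockopt(socket.IPPROTO_TCP, cork, 1)
    try:
        for chunk in chunks:
            sock.sendall(chunk)
    finally:
        # Uncorking flushes whatever is still queued.
        if cork is not None:
            sock.setsockopt(socket.IPPROTO_TCP, cork, 0)
```

A caller would use it as `send_corked(sock, [header, body])`, where the application knows that both pieces belong to one logical message.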

bhaney on 2024-05-15

I have no idea! This is the first I'm hearing of corking, so I don't really know how it behaves in reality. It certainly seems pretty close to what I was talking about at first glance.

londons_explore on 2024-05-15

> This is done if at least one prior packet for the flow is waiting in Qdisc queues or device transmit queue.

It should be 'or was sent less than rtt/2 ago'.

augusto-moura on 2024-05-15

Maybe there's a way to do it with BPF or eBPF? I honestly don't know what the limits are for these two.

meinersbur on 2024-05-15

Response from Theo de Raadt (https://marc.info/?l=openbsd-tech&m=171572099614639&w=2):

The proposal talks about a few applications which are better with Nagle off by default. Most of those applications have already turned off Nagle, after deciding that the cognitive load of driving their small write system calls via a single internal buffering layer is too complicated (that's ssh, that is most http services, etc). In that software, Nagle was manipulated by a developer after systematically studying & modifying the application as a whole.

But applying it to all applications, just because 'few applications prove Nagle bad'? That is backwards. It needs to prove that the entire application ecosystem is MAJORITY improved by disabling Nagle.

I strongly doubt it is improved. I suspect a majority of software is different from the few well-known ones disabling Nagle -- and I'm sure a few which intentionally leave Nagle enabled -- furthermore I suspect the majority of software gains full-system benefits from this 'teeny buffer bloat' layer.

It mostly has to do with what the internal IO subsystem of a program looks like. Does it use stdio, does it use raw writes, does it use BIO, etc. (That's where short writes come from, due to intersecting layers of API).

So I suspect "Nagle always bad" would need to be disproven before we give people a dangerous knob -- which a segment of the user community would toggle, and thus increase our cognitive load when trying to diagnose their vague bug reports in the future...
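
The "single internal buffering layer" mentioned in the quoted message can be sketched roughly as follows. This is an illustration under assumptions, not code from any of the applications named: the class `CoalescingSender` and its methods are hypothetical. Small writes are collected in user space and handed to the kernel in one call, so the application does not depend on Nagle's algorithm to merge them.

```python
import socket


class CoalescingSender:
    """Collect small writes and send them with a single system call."""

    def __init__(self, sock):
        self.sock = sock
        self.parts = []

    def write(self, data):
        # Nothing reaches the kernel yet; the bytes are only queued.
        self.parts.append(bytes(data))

    def flush(self):
        # One sendall() for the whole logical message.
        if self.parts:
            payload = b"".join(self.parts)
            self.parts.clear()
            self.sock.sendall(payload)
```

The cost is the one described in the message: every code path that writes to the socket has to go through this layer and flush at the right moments.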

kreetx on 2024-05-16

Related from a week ago: It's always TCP_NODELAY (brooker.co.za), https://news.ycombinator.com/item?id=40310896

wmf on 2024-05-15

Related discussion from 5 days ago: https://news.ycombinator.com/item?id=40310896

patrakov on 2024-05-15

Note: this is OpenBSD, not Linux.

Neil44 on 2024-05-16

A lot of stuff will have this already since it's an option that can be specified when opening the listening socket, for example Apache httpd, nginx, openlitespeed all do this.

voidfunc on 2024-05-16

Not very familiar with the openbsd dev process but noted the patches at the bottom... do they still use CVS for version control?!

kreetx on 2024-05-16

Interestingly enough, I've seen other projects switch from github to email-based patch submission (using git behind the scenes though). If I understood correctly, to receive only higher effort contributions.

natebc on 2024-05-16

They do use CVS for version control. https://cvsweb.openbsd.org/

from their github: https://github.com/openbsd/src

> Read-only git conversion of OpenBSD's official CVS src repository. Pull requests not accepted - send diffs to the tech@ mailing list.

JSDevOps on 2024-05-15

I was just looking for this the other day!

kazinator on 2024-05-15

That was my reaction to the latest story: isn't there some darned sysctl to just turn that off globally?

silverwind on 2024-05-16

Does Linux have this as well?

deaddodo on 2024-05-16

It's built into the socket library, so most high performance web apps already manually enable TCP_NODELAY. This just allows you to force it OS-wide.

Linux used to have something similar, called TCP low-latency mode, but the flag is no longer functional. Now, there are various distribution specific options or you can rebuild your kernel with multiple related build options to achieve the same.
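
For comparison with the proposed system-wide switch, a minimal sketch of the per-socket setting that applications enable manually today. `TCP_NODELAY` is a standard socket option; the helper name `connect_nodelay` is hypothetical.

```python
import socket


def connect_nodelay(host, port):
    """Open a TCP connection with Nagle's algorithm disabled."""
    sock = socket.create_connection((host, port))
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

The option can be read back with `sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)`, which returns a nonzero value once it is set.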

blueflow on 2024-05-16

With "socket library" you mean libc?

signa11 on 2024-05-16

yes, there is [typically] nothing else.

you can of course do a syscall directly, but i am not sure what value add _that_ is, `man -S 2 syscall` for more information.

edit: 'typically' because some languages / runtimes might go the syscall route.

blueflow on 2024-05-16

That's why I ask. "the socket library" is a very odd way to refer to the libc.

_flux on 2024-05-16

Some other Unix operating systems had the functions in their own socket library, e.g. SunOS: https://docs.oracle.com/cd/E19120-01/open.solaris/817-4415/s... .

I suppose though in the context of Linux it's a bit weird, but you did get the correct meaning :).

DEADMINCE on 2024-05-16

Not in context.

justincormack on 2024-05-16

Well, Go for example does syscalls directly.

dang on 2024-05-15

Url changed from https://www.undeadly.org/cgi?action=article;sid=202405140750..., which points to this.

Submitters: "Please submit the original source. If a post reports on something found on another site, submit the latter." - https://news.ycombinator.com/newsguidelines.html