SO_LINGER is not the same as Apache's "lingering close"
Have you ever wondered what SO_LINGER is actually for? What TIME_WAIT does?
What's the difference between FIN and RST, anyway? Why did web browsers
have to have pipelining disabled for so long? Why did all the original
Internet protocols have a "Quit" command, when the client could have just
closed the socket and been done with it? [1]
I've been curious about all those questions at different points in the past.
Today we ran headlong into all of them at once while testing the HTTP client
in EQL Data.
If you already know about SO_LINGER problems, then that probably doesn't
surprise you; virtually the only time anybody cares about SO_LINGER is with
HTTP. Specifically, with HTTP pipelining. And even more specifically, when
an HTTP server decides to disconnect you after a fixed number of requests,
even if there are more in the pipeline.
Here's what happens:
- Client sends request #1
- Client sends request #2
- ...
- Client sends request #100
- All those requests finally arrive at the server side, thanks to network latency.
- Server sends response #1
- ...
- Server sends response #10
- Server disconnects, because it only handles 10 queries per connection.
- Server kernel sends TCP RST because userspace didn't read all the input.
- Client kernel receives responses 1..10
- Client reads response #1
- ...
- Client reads most of response #7
- Client kernel receives RST, causing it to discard everything in the socket buffer(!!)
- Client thinks data from response 7 is cut off, and explodes.
Clearly, this is crap. The badness arises from the last two steps: it's
actually part of the TCP specification that the client has to discard
the unread input data - even though that input data has safely arrived -
just because it received an RST. (If the server had read all of its
input before doing close(), then it would have sent FIN instead of RST, and
FIN doesn't tell anyone to discard anything. So ironically, the server
discarding its input data *on purpose* has caused the client to discard its
input data *by accident*.)
Perfectly acceptable behaviour, by the way, would be for the client to
receive a polite TCP FIN of the connection after response #10 is received.
It knows that, since a) the connection closed early, and b) the connection
closed without error, everything is fine, but the server didn't feel
like answering any more requests. It also knows exactly where the server
stopped, so there's no worrying about requests accidentally being run twice,
etc. It can just open a connection and resend the failed requests.
But that's not what happened in our example above. So what do you do about it?
The "lingering close"
The obvious solution for this is what Apache calls a *lingering
close*. As you can guess from the description, the solution is on the
server side.
What you do is you change the server so, instead of closing its socket
right away, it just does a shutdown(sock, SHUT_WR) to notify TCP/IP that it
isn't planning to write any more stuff. In turn, this sends a FIN
to the client side, which (eventually) arrives and appears as an EOF - a
clean end of data marker, right after response #10. At that point, the
client can close() its socket, knowing that its input buffer is safely
empty, thus sending a FIN to the server side.
Meanwhile, the server can read all the data in its input buffer and throw it
away; it knows the client isn't expecting any more answers. It just needs
to flush all that received stuff to avoid accidentally sending an RST and
ruining everything. The server can just read until it receives its own
end-of-data marker, which we now know is coming, since the client has called
close().
Throw in a timeout here and there to prevent abuse, and you're set.
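Here's a minimal sketch of that dance in C, assuming a connected, blocking
socket; the function name and the 30-second timeout are illustrative, not
Apache's actual code:

```c
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

static void lingering_close(int sock)
{
    char buf[4096];

    /* Tell TCP we're done writing; the client eventually sees EOF
     * right after the last response. */
    shutdown(sock, SHUT_WR);

    /* Bound each read so a hostile client can't hold the connection
     * open forever (a real server also caps the total linger time). */
    struct timeval tv = { .tv_sec = 30, .tv_usec = 0 };
    setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    /* Read and discard until the client's FIN arrives (read()
     * returns 0), so that our close() sends FIN instead of RST. */
    while (read(sock, buf, sizeof(buf)) > 0)
        ;

    close(sock);
}
```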
You know what all the above isn't? The same thing as SO_LINGER.
It seems like there are a lot of people who are confused by this. I
certainly was; various Apache documentation,
including the actual comment above the actual implementation of "lingering
close" in Apache, implies that Apache's lingering code was written only
because SO_LINGER is broken on various operating systems.
Now, I'm sure it was broken on various operating systems for various reasons.
But: even when it works, it doesn't solve this problem. It's
actually a totally different thing.
SO_LINGER exists to solve exactly one simple problem, and only one problem:
the problem that if you close() a socket after writing some stuff, close()
will return right away, even if the remote end hasn't yet received
everything you wrote.
This behaviour was supposed to be a feature, I'm sure. After all, the
kernel has a write buffer; the remote kernel has a read buffer; it's going
to do all that buffering in the background anyway and manage getting all the
data from point A to point B. Why should close() arbitrarily block waiting
for that data to get sent?
Well, it shouldn't, said somebody, and he made it not block, and that was
the way it was. But then someone realized that there's an obscure chance
that the remote end will die or disappear before all the data has been sent.
In that case, the kernel can deal with it just fine, but userspace will
never know about it since it has already closed the socket and moved on.
So what does SO_LINGER do? It changes close() to wait until all the data
has been sent. (Or, if your socket is non-blocking, to tell you it can't
close yet, until all the data has been sent.)
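For contrast, here's roughly what turning on SO_LINGER looks like; a sketch,
with an arbitrary 10-second timeout:

```c
#include <sys/socket.h>
#include <unistd.h>

static void close_with_so_linger(int sock)
{
    struct linger lg = {
        .l_onoff  = 1,   /* linger on close() */
        .l_linger = 10,  /* for at most 10 seconds */
    };
    setsockopt(sock, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));

    /* close() now waits until unsent *output* is delivered (or the
     * timeout expires). It does nothing about unread input, which is
     * what triggers the RST in our story. */
    close(sock);
}
```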
What doesn't SO_LINGER do?
It doesn't read leftover data from your input buffer and throw it away,
which is what Apache's lingering close does. Even with SO_LINGER, your
server will still send an RST at the wrong time and confuse the client in
the example above.
What do the two approaches have in common?
They both involve close() and the verb "linger." However, they linger
waiting for totally different things, and they do totally different things
while they linger.
What should I do with this fun new information?
If you're lucky, nothing. Apache already seems to linger correctly, whether
because they eventually figured out why their linger implementation works
and SO_LINGER simply doesn't, or because they were lucky, or because they
were lazy. The comment in their code, though, is wrong.
If you're writing an HTTP client, hopefully nothing. Your client isn't
supposed to have to do anything here; there's no special reason for you to
linger (in either sense) on close, because that's exactly what an HTTP
client does anyway: it reads until there's nothing more to read, then
it closes the connection.
If you're writing an HTTP server: first of all, try to find an excuse not
to. And if you still find yourself writing an HTTP server, make sure you
linger on close. (In the Apache sense, not the SO_LINGER sense. The latter
is almost certainly useless to you. As an HTTP server, what do you care if
the data didn't get sent completely? What would you do about it anyway?)
Okay, Mr. Smartypants, so how did *you* get stuck with this?
Believe it or not, I didn't get stuck with this because I (or rather,
Luke) was writing an HTTP server. At least, not initially. We were
actually testing WvHttpPool
(an HTTP client that's part of WvStreams) for reliability,
and thought we (actually Luke :)) would write a trivial HTTP server in perl;
one that disconnects deliberately at random times to make sure we recover
properly.
What we learned is that our crappy test HTTP server has to linger (and I
don't mean SO_LINGER), or it totally doesn't work at all. WvStreams, it
turned out, works fine as long as you do this.
Epilogue: Lighttpd lingers incorrectly
The bad news is that the reason we were doing all this testing is that
the WvStreams http client would fail in some obscure cases. It turns out
this is the fault of lighttpd, not WvStreams.
lighttpd 1.4.19 (the version in Debian Lenny) implements its lingering
incorrectly. So does the current latest version, lighttpd 1.4.23.
lighttpd implements Apache-style lingering, as it should. Unfortunately it
stops lingering as soon as ioctl(FIONREAD) returns zero, which is wrong;
that happens when the local socket buffer is empty, but it doesn't
guarantee the remote end has finished sending yet. There might be another
packet just waiting to arrive a microsecond later, and when it does, blam:
an RST, and the client's buffer gets discarded all over again.
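Schematically, the difference between the two drain loops looks like this; a
sketch, not lighttpd's actual code:

```c
#include <sys/ioctl.h>
#include <unistd.h>

/* Wrong: FIONREAD only counts bytes already queued in the local
 * socket buffer, so this loop can stop while data is still in flight. */
static void drain_fionread(int sock)
{
    char buf[4096];
    int n;
    while (ioctl(sock, FIONREAD, &n) == 0 && n > 0)
        read(sock, buf, sizeof(buf));
}

/* Right: only read() returning 0 proves the peer has sent its FIN.
 * (Pair it with a receive timeout, as in the earlier sketch.) */
static void drain_until_eof(int sock)
{
    char buf[4096];
    while (read(sock, buf, sizeof(buf)) > 0)
        ;
}
```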
Unfortunately, once I had debugged that, I found out that they actually
forgot to linger at all except in case of actual errors. If it's
just disconnecting you because you've made too many requests, it doesn't
linger at all, and kaboom.
And once I had debugged that, I found out that it sets the linger timeout to
only one second. It should be more like 120 seconds, according to the RFCs,
though apparently most OSes use about 30 seconds. Gee.
I guess I'll send in a patch.
[1] Oh, I promised you an answer about the Quit command. Here it
is: I'm pretty sure shutdown() was invented long after close(), so the only
way to ensure a safe close was to negotiate the shutdown politely at the
application layer. If the server never disconnects until it's been asked
to Quit, and a client never sends anything after the Quit request, you never
run into any of these, ahem, "lingering problems."