socket option SO_KEEPALIVE not detecting crash?

Post by Hal Finney » Thu, 01 Apr 1993 06:30:01



I am trying to write a socket-based server program.  I want it to
detect when a client machine crashes and take some action.  I've tried
using setsockopt to set the KEEPALIVE option on the (stream) socket.
But if I connect a client, then halt and reboot the client machine by
toggling the power, the server never hears about it (it is waiting in a
select call).  Even after the machine comes up I can run a new client
and connect to the server OK, but the old connection is still being
held.

Am I right in thinking the KEEPALIVE option is supposed to prevent
this?  Just how long is the time out value in KEEPALIVE?  (I waited
more than 10 minutes.)

I am running under SunOS 4.1.2 but this code is intended to run on a
wide variety of Unix systems.  Any help will be appreciated.

Thanks -

Hal Finney
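
For reference, the setup described in this post amounts to something like the sketch below. The function names enable_keepalive and keepalive_enabled are made up for illustration; only setsockopt, getsockopt, SOL_SOCKET and SO_KEEPALIVE come from the sockets API. Note that this only switches probing on; it says nothing about how long the system waits before it starts probing.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Enable TCP keepalive probing on a stream socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt).
 * The function name is hypothetical, chosen for this sketch. */
int enable_keepalive(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, (char *)&on, sizeof(on));
}

/* Read the option back: 1 if enabled, 0 if not, -1 on error. */
int keepalive_enabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, (char *)&val, &len) < 0)
        return -1;
    return val != 0;
}
```

A server would typically call enable_keepalive() on each descriptor returned by accept() before adding it to the select set.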

 
 
 

Post by Simon Bark » Thu, 01 Apr 1993 17:15:38



>I am trying to write a socket-based server program.  I want it to
>detect when a client machine crashes and take some action.  I've tried
>using setsockopt to set the KEEPALIVE option on the (stream) socket.
>But if I connect a client, then halt and reboot the client machine by
>toggling the power, the server never hears about it (it is waiting in a
>select call).  Even after the machine comes up I can run a new client
>and connect to the server OK, but the old connection is still being
>held.

>Am I right in thinking the KEEPALIVE option is supposed to prevent
>this?  Just how long is the time out value in KEEPALIVE?  (I waited
>>10 minutes.)

>I am running under SunOS 4.1.2 but this code is intended to run on a
>wide variety of Unix systems.  Any help will be appreciated.

>Thanks -

>Hal Finney


I have heard that SunOS 4.1.2 has a "bug" in SO_KEEPALIVE in that the heartbeat
period is on the order of hours. This may be fixed in Solaris; I think it can
also be fixed by re-linking the kernel.

 
 
 

Post by Yuval Yarom » Thu, 01 Apr 1993 23:13:16


|> I am trying to write a socket-based server program.  I want it to
|> detect when a client machine crashes and take some action.  I've tried
|> using setsockopt to set the KEEPALIVE option on the (stream) socket.
|> But if I connect a client, then halt and reboot the client machine by
|> toggling the power, the server never hears about it (it is waiting in a
|> select call).  Even after the machine comes up I can run a new client
|> and connect to the server OK, but the old connection is still being
|> held.
|>
|> Am I right in thinking the KEEPALIVE option is supposed to prevent
|> this?  Just how long is the time out value in KEEPALIVE?  (I waited
|> >10 minutes.)

You should have waited about two more hours.  The keepalive timer starts after
the connection has been idle for two hours, and then transmits 8 test messages
at intervals of 75 seconds (the values are from /usr/include/netinet/tcp_timer.h).

                                Yuval
|>
|> I am running under SunOS 4.1.2 but this code is intended to run on a
|> wide variety of Unix systems.  Any help will be appreciated.
|>
|> Thanks -
|>
|> Hal Finney

--
Yuval Yarom

 
 
 

Post by W. Richard Steve » Thu, 01 Apr 1993 23:40:40


> I have heard that SunOS 4.1.2 has a "bug" in SO_KEEPALIVE in that the
> heartbeat period is in the order of hours. This may be fixed in Solaris,
> I think it can be fixed by re-linking the kernel as well.

It's not a bug, it's a requirement of RFC 1122: "This interval MUST
be configurable and MUST default to no less than two hours." (p. 101).

It is configurable under SunOS 4.1.x in the file /usr/kvm/netinet/in_proto.c,
but this is the system-wide default; you can't change it on a per-connection
basis.

Solaris 2.1 has the same default (2 hours): try
"ndd /dev/tcp tcp_keepalive_interval" (the reported units are milliseconds).
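
As a contrast to the system-wide default described above, here is a hedged sketch of per-socket tuning on systems that provide it. It assumes the TCP-level options TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT, which exist on Linux and some other later systems but not on the SunOS 4.1.x discussed in this thread; the #ifdef guards make the function fall back to plain SO_KEEPALIVE where they are missing. The name tune_keepalive and its parameters are invented for the example.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Enable keepalive and, where the platform offers per-socket knobs,
 * override the system-wide timers for this one connection.
 * Returns 0 on success, -1 on the first failing setsockopt call.
 * Hypothetical helper; the TCP_KEEP* options are an assumption about
 * the target platform and are skipped when not defined. */
int tune_keepalive(int fd, int idle_secs, int intvl_secs, int probes)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;

#ifdef TCP_KEEPIDLE   /* idle seconds before the first probe */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof(idle_secs)) < 0)
        return -1;
#else
    (void)idle_secs;
#endif

#ifdef TCP_KEEPINTVL  /* seconds between unanswered probes */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &intvl_secs, sizeof(intvl_secs)) < 0)
        return -1;
#else
    (void)intvl_secs;
#endif

#ifdef TCP_KEEPCNT    /* unanswered probes before the drop */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &probes, sizeof(probes)) < 0)
        return -1;
#else
    (void)probes;
#endif

    return 0;
}
```

For example, tune_keepalive(fd, 60, 10, 3) would ask for a first probe after one idle minute and a drop after three unanswered probes ten seconds apart, on a platform that honours all three options.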


 
 
 

Post by Gregory N. Bo » Fri, 02 Apr 1993 10:43:04


   Am I right in thinking the KEEPALIVE option is supposed to prevent
   this?  Just how long is the time out value in KEEPALIVE?  (I waited
   >10 minutes.)

An extract from /usr/include/netinet/tcp_timer.h (SunOS 4.1.something):

 * The TCPT_KEEP timer is used to keep connections alive.  If a
 * connection is idle (no segments received) for TCPTV_KEEP_INIT amount of time,
 * but not yet established, then we drop the connection.  Once the connection
 * is established, if the connection is idle for TCPTV_KEEP_IDLE time
 * (and keepalives have been enabled on the socket), we begin to probe
 * the connection.  We force the peer to send us a segment by sending:
 *      <SEQ=SND.UNA-1><ACK=RCV.NXT><CTL=ACK>
 * This segment is (deliberately) outside the window, and should elicit
 * an ack segment in response from the peer.  If, despite the TCPT_KEEP
 * initiated segments we cannot elicit a response from a peer in TCPT_MAXIDLE
 * amount of time probing, then we drop the connection.
 */
[...]
#define TCPTV_KEEP_INIT ( 75*PR_SLOWHZ)         /* initial connect keep alive */
#define TCPTV_KEEP_IDLE (120*60*PR_SLOWHZ)      /* dflt time before probing */
#define TCPTV_KEEPINTVL ( 75*PR_SLOWHZ)         /* default probe interval */
#define TCPTV_KEEPCNT   8                       /* max probes before drop */

So (if I understand this properly, which is a dubious proposition!) it
waits for two hours before starting keepalive pinging, then gives up
after 8*75 seconds (10 minutes) of keepalive probing.
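
The arithmetic can be checked with a few lines of C. The three TCPTV_ definitions are copied from the extract above; PR_SLOWHZ = 2 (slow-timer ticks per second) is an assumption taken from the BSD <sys/protosw.h>, since the extract does not show it. The helper function names are invented for this sketch.

```c
/* Timer constants from the tcp_timer.h extract, in slow-timer ticks.
 * PR_SLOWHZ = 2 is an assumption (BSD <sys/protosw.h>); it is not
 * part of the quoted extract. */
#define PR_SLOWHZ        2
#define TCPTV_KEEP_IDLE (120*60*PR_SLOWHZ)      /* dflt time before probing */
#define TCPTV_KEEPINTVL ( 75*PR_SLOWHZ)         /* default probe interval */
#define TCPTV_KEEPCNT   8                       /* max probes before drop */

/* Seconds of idleness before the first keepalive probe is sent. */
long keep_idle_seconds(void)
{
    return TCPTV_KEEP_IDLE / PR_SLOWHZ;
}

/* Seconds spent probing a silent peer before the connection is dropped. */
long keep_probe_seconds(void)
{
    return (long)TCPTV_KEEPCNT * TCPTV_KEEPINTVL / PR_SLOWHZ;
}

/* Worst case from the last received segment to the drop. */
long keep_total_seconds(void)
{
    return keep_idle_seconds() + keep_probe_seconds();
}
```

With these values the idle period is 7200 seconds and the probing period is 600 seconds, so a peer that stays silent is declared dead roughly 2 hours 10 minutes after the last traffic.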

In the case of a rebooted machine, the keepalive will kill the
connection on the first keepalive ping because the remote machine will
send a RST in response to the keepalive packet.

So the answer is you have to wait two hours or a bit more.  If more
rapid response to crashed machines is needed, you will need to do that
in the application.  One of our applications here writes a single
space character to a (non-blocking) socket that has been idle for 1
minute. This either elicits an RST (reflected as ECONNRESET?) if the
machine has rebooted, or returns EWOULDBLOCK if the remote machine has
been down and the TCP buffers are full.  The remote end ignores the
space character (it is a broadcast data feed), so we get keepalive
functionality with programmable 1 minute resolution.

Greg.
--

   Knox's 386 is slick.            Fox in Sox, on Knox's Box
   Knox's box is very quick.       Plays lots of LSL. He's sick!
(Apologies to John "Iron Bar" Mackin.)

 
 
 

1. SO_KEEPALIVE socket option?

I have two processes on different hosts communicating via
TCP sockets, and I want each of the processes to find out
"immediately" if the other becomes inaccessible, either
because the other process has terminated or closed its
socket or because the network connectivity between the
two hosts has been lost.

The SO_KEEPALIVE socket option looks suitable, but it isn't
clear from the setsockopt(3n) man page if this is really what
I want to use.  In particular, two things aren't specified:
        1. With SO_KEEPALIVE turned on, how frequently does
           a process send poll messages over the idle connection?
           Is the frequency tunable?
        2. Is the response to this polling supported transparently
           as part of the TCP protocol, or will the host at the
           other end of the connection have to have SO_KEEPALIVE
           set on its socket in order to reply to the polls?

If it matters, the processes will be running under SunOS 5.2 and/or
5.3.

Thanks for any responses,

-Brian Pane
