I have seen a particular syndrome on 3-4 Unices, with supposedly
independently developed TCP/IP stacks, and it is being a right
pain under Linux over PPP. I have no proof that the cause is the
same for each Unix, but the commonality is the original specification.
It has remained unchanged over a major Linux upgrade, too.
The symptom is that a transmission error sometimes causes the stack
to jam. Totally. Keepalive packets are not being sent, the message
is not being retried, and the normal timeouts are not occurring.
After a long time (typically 30 minutes to days, depending on the
system), there is a timeout that causes a high-level error return.
Under Linux, I get the interesting message "Invalid argument".
When this happens, the whole transport jams, which may be cause or
effect. It is SOMETIMES possible to kick-start the Linux/PPP one
by doing a separate operation (ping will do) fast enough after the
jam occurs, but not often - and I have never noticed the effect
soon enough on the other systems. I have seen the syndrome on PPP,
Ethernet, HiPPI and proprietary interconnects, so I doubt that
the main cause is in them.
So the solution is to restart the transport. That works, and I can
then use it again. But the jammed TCP/IP connexion does NOT restart,
and it eventually times out (or fails to) exactly as if the transport
were still dead. This makes me certain that, whatever else is going on,
there are similar bugs in several TCP/IP stacks. Now, it is the
jammed stack effect that I am enquiring about first, and NOT the
PPP failure as such.
My question is whether anyone knows of any instrumentation that
I can turn on, preferably on Linux. This has to be lightweight
enough that it can be turned on and left on, as this effect is very
erratic. Yes, of course, I can learn how the stack works, put my
own instrumentation in and build my own copy, but life is short ....
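To make the sort of thing I am after concrete, here is a minimal sketch of a
poller that could be left running. It is not a known fix and not part of any
stack. It assumes a Linux kernel that exposes /proc/net/tcp in the usual
column layout (sl, local_address, rem_address, st, tx_queue:rx_queue,
tr:tm->when, retrnsmt, ...). The names parse_tcp_line, suspicious and watch,
the log file name, and the heuristic itself (an ESTABLISHED connexion with
queued output but no timer pending) are my own assumptions, chosen to match
the symptom described above.

```python
import socket
import struct
import time

# State codes as printed in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def _addr(field):
    """Decode 'HEXIP:HEXPORT'; the IP is printed as a native-order integer."""
    ip_hex, port_hex = field.split(":")
    ip = socket.inet_ntoa(struct.pack("=I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def parse_tcp_line(line):
    """Parse one data line of /proc/net/tcp into a dictionary."""
    f = line.split()
    tx, rx = f[4].split(":")
    timer, when = f[5].split(":")
    return {
        "local": _addr(f[1]),
        "remote": _addr(f[2]),
        "state": TCP_STATES.get(f[3], f[3]),
        "tx_queue": int(tx, 16),      # bytes queued for output or unacknowledged
        "rx_queue": int(rx, 16),
        "timer": int(timer, 16),      # 0 means no timer is pending
        "timer_when": int(when, 16),  # time until the timer fires, in jiffies
        "retrnsmt": int(f[6], 16),    # retransmission timeouts so far
    }


def suspicious(path="/proc/net/tcp"):
    """Yield connexions that look jammed (assumed heuristic, see lead-in)."""
    with open(path) as fh:
        next(fh, None)  # skip the header line
        for line in fh:
            if not line.strip():
                continue
            c = parse_tcp_line(line)
            if c["state"] == "ESTABLISHED" and c["tx_queue"] > 0 and c["timer"] == 0:
                yield c


def watch(interval=30, logfile="tcpwatch.log"):
    """Poll forever, appending one line per suspicious connexion."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(logfile, "a") as log:
            for c in suspicious():
                log.write("%s %r\n" % (stamp, c))
        time.sleep(interval)
```

Calling watch() from a background job would log one line per poll for each
connexion that matches. That is cheap enough to leave on, but it shows only
what the kernel chooses to export, not why the timers stopped.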
I should also be interested to know if anything much can be done
along those lines in PPP. The Linux upgrade helped a great deal
by including enough logging that it diagnosed an unrelated connexion
failure, but the logging is still at too high a level. I need to
be able to find out what state the PPP transport has got into, and
whether it thinks that it has lost the modem, for example.
But being able to restart PPP and get my TCP connexion back would
be a great help.
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Tel.: +44 1223 334761 Fax: +44 1223 334679