Possible design error in TCP/IP stack

Post by Nick Maclaren » Thu, 19 Dec 2002 18:53:36



I have seen a particular syndrome on 3-4 Unices, with supposedly
independently developed TCP/IP stacks, and it is being a right
pain under Linux over PPP.  I have no proof that the cause is the
same for each Unix, but the commonality is the original specification.
It has remained unchanged over a major Linux upgrade, too.

The symptom is that a transmission error sometimes causes the stack
to jam.  Totally.  Keepalive packets are not being sent, the message
is not being retried, and the normal timeouts are not occurring.
After a long time (typically 30 minutes to days, depending on the
system), there is a timeout that causes a high-level error return.
Under Linux, I get the interesting message "Invalid argument".

When this happens, the whole transport jams, which may be cause or
effect.  It is SOMETIMES possible to kick-start the Linux/PPP one
by doing a separate operation (ping will do) fast enough after the
jam occurs, but not often - and I have never noticed the effect
soon enough on the other systems.  I have seen the syndrome on PPP,
Ethernet, HiPPI and proprietary interconnects, so I doubt that
the main cause is in them.

So the solution is to restart the transport.  That works, and I can
then use it again.  But the jammed TCP/IP connexion does NOT restart,
and it eventually times out (or not) exactly as if the transport were
still dead.  This makes me certain that, whatever else is going on,
there are similar bugs in several TCP/IP stacks.  Now, it is the
jammed stack effect that I am enquiring about first, and NOT the
PPP failure as such.

My question is whether anyone knows of any instrumentation that
I can turn on, preferably on Linux?  This has to be lightweight
enough that it can be turned on and left on, as this effect is very
erratic.  Yes, of course, I can learn how the stack works, put my
own instrumentation in and build my own copy, but life is short ....
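
One lightweight option, offered only as a sketch: on Linux the kernel
already exposes per-connection TCP state in /proc/net/tcp, and a small
script can poll it and log the state, queue sizes, active timer and
retransmit count.  The helper names below (parse_proc_net_tcp_line,
snapshot, log_snapshots) are hypothetical, the address decoding assumes
a little-endian host such as x86, and the field layout is assumed to
match the usual /proc/net/tcp format; none of this comes from the
original post.

```python
import socket
import struct
import time

# State codes as used in the "st" column of /proc/net/tcp.
TCP_STATES = {
    0x01: "ESTABLISHED", 0x02: "SYN_SENT", 0x03: "SYN_RECV",
    0x04: "FIN_WAIT1", 0x05: "FIN_WAIT2", 0x06: "TIME_WAIT",
    0x07: "CLOSE", 0x08: "CLOSE_WAIT", 0x09: "LAST_ACK",
    0x0A: "LISTEN", 0x0B: "CLOSING",
}


def _decode_endpoint(text):
    """Turn 'HEXADDR:HEXPORT' into (dotted_quad, port).

    Assumption: little-endian host, so the hex address is byte-swapped.
    """
    addr_hex, port_hex = text.split(":")
    addr = socket.inet_ntoa(struct.pack("<I", int(addr_hex, 16)))
    return addr, int(port_hex, 16)


def parse_proc_net_tcp_line(line):
    """Parse one non-header line of /proc/net/tcp into a dict."""
    fields = line.split()
    tx_queue, rx_queue = fields[4].split(":")
    timer, _when = fields[5].split(":")
    return {
        "local": _decode_endpoint(fields[1]),
        "remote": _decode_endpoint(fields[2]),
        "state": TCP_STATES.get(int(fields[3], 16), "UNKNOWN"),
        "tx_queue": int(tx_queue, 16),
        "rx_queue": int(rx_queue, 16),
        # Raw timer code: 0 none, 1 retransmit, 2 other (e.g. keepalive),
        # 3 TIME_WAIT, 4 zero-window probe.
        "timer": int(timer, 16),
        "retransmits": int(fields[6], 16),
    }


def snapshot(path="/proc/net/tcp"):
    """Return a list of parsed connections from the given proc file."""
    with open(path) as handle:
        next(handle)  # skip the column header
        return [parse_proc_net_tcp_line(l) for l in handle if l.strip()]


def log_snapshots(interval=60, path="/proc/net/tcp"):
    """Print every non-listening connection once per interval, forever."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for rec in snapshot(path):
            if rec["state"] != "LISTEN":
                print(stamp, rec)
        time.sleep(interval)
```

Calling log_snapshots(60) with output redirected to a file would leave a
record of whether a jammed connexion still had data queued and a timer
pending.  It would not explain why the stack stopped, only what the
kernel reported at the time.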

I should also be interested to know if anything much can be done
along those lines in PPP.  The Linux upgrade helped a great deal
by including enough logging that it diagnosed an unrelated connexion
failure, but the logging is still at too high a level.  I need to
be able to find out what state the PPP transport has got into, and
whether it thinks that it has lost the modem, for example.

But being able to restart PPP and get my TCP connexion back would
be a great help.

Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.

Tel.:  +44 1223 334761    Fax:  +44 1223 334679

 
 
 

Post by David Efflandt » Sat, 21 Dec 2002 12:06:01



> I have seen a particular syndrome on 3-4 Unices, with supposedly
> independently developed TCP/IP stacks, and it is being a right
> pain under Linux over PPP.  I have no proof that the cause is the
> same for each Unix, but the commonality is the original specification.
> It has remained unchanged over a major Linux upgrade, too.

> The symptom is that a transmission error sometimes causes the stack
> to jam.  Totally.  Keepalive packets are not being sent, the message
> is not being retried, and the normal timeouts are not occurring.
> After a long time (typically 30 minutes to days, depending on the
> system), there is a timeout that causes a high-level error return.
> Under Linux, I get the interesting message "Invalid argument".

Have you ever thought that it might be a hardware problem if it occurs in
different OS's?

I had a problem with a Zoom internal hardware modem that was K56flex from
before V.90 standards were finalized, that was flashed to V.90.  When
connected to a Livingston (then Lucent) Portmaster, the modem would
occasionally lock up regardless of OS (Win95/98, Linux, FreeBSD).  No OS
would even recognize that it was hung and the only way to get it working
again in any OS was to reboot.  I had NO such problem with any other PPP
connection (to work or CompuServe).

Apparently Lucent never could fix the problem with the Portmaster and when
it was retired, that same modem worked flawlessly with that same ISP for
years.

BTW when this internal modem was regularly locking up with the Portmaster,
I purchased an external V.90 modem that had no problems with it, and not a
single hang or disconnect during an 11 hr ftp install of FreeBSD.

So if you have not tested if your problem can be replicated with other
hardware, you may want to try that first.

--
David Efflandt - All spam ignored  http://www.de-srv.com/
http://www.autox.chicago.il.us/  http://www.berniesfloral.net/
http://cgi-help.virtualave.net/  http://hammer.prohosting.com/~cgi-wiz/

 
 
 

Post by Nick Maclaren » Sat, 21 Dec 2002 19:31:50





>> I have seen a particular syndrome on 3-4 Unices, with supposedly
>> independently developed TCP/IP stacks, and it is being a right
>> pain under Linux over PPP.  I have no proof that the cause is the
>> same for each Unix, but the commonality is the original specification.
>> It has remained unchanged over a major Linux upgrade, too.

>> The symptom is that a transmission error sometimes causes the stack
>> to jam.  Totally.  Keepalive packets are not being sent, the message
>> is not being retried, and the normal timeouts are not occurring.
>> After a long time (typically 30 minutes to days, depending on the
>> system), there is a timeout that causes a high-level error return.
>> Under Linux, I get the interesting message "Invalid argument".

>Have you ever thought that it might be a hardware problem if it occurs in
>different OS's?

    ...  I have seen the syndrome on PPP,
    Ethernet, HiPPI and proprietary interconnects, so I doubt that
    the main cause is in them.

The commonality of hardware between the systems on which I have seen
it is considerably less than the commonality of software!  It includes
an Intel-based PC running Linux, an IBM Nighthawk SP and an SGI
Origin.

>Apparently Lucent never could fix the problem with the Portmaster and when
>it was retired, that same modem worked flawlessly with that same ISP for
>years.

That is a very common syndrome, and almost invariably indicates one
of two things:

    There are errors in both the hardware and software/firmware
(usually in the exception handling) and the symptoms occur only when
the failures interact.

    There is a design error or ambiguity in the specification, and
the two parts have interpreted it incompatibly.  This is regrettably
common.

Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.

Tel.:  +44 1223 334761    Fax:  +44 1223 334679

 
 
 

Post by Rick Jones » Sun, 22 Dec 2002 03:45:11



> I have seen a particular syndrome on 3-4 Unices, with supposedly
> independently developed TCP/IP stacks

I suspect there are really only two or three "main branches" of TCP/IP
stack these days. BSDish, Linux and Mentat.

> The symptom is that a transmission error sometimes causes the stack
> to jam.  Totally.  Keepalive packets are not being sent, the message
> is not being retried, and the normal timeouts are not occurring.
> After a long time (typically 30 minutes to days, depending on the
> system), there is a timeout that causes a high-level error return.
> Under Linux, I get the interesting message "Invalid argument".

Transmission error - as in an error on a specific link yes?

> When this happens, the whole transport jams, which may be cause or
> effect.  It is SOMETIMES possible to kick-start the Linux/PPP one
> by doing a separate operation (ping will do) fast enough after the
> jam occurs, but not often - and I have never noticed the effect
> soon enough on the other systems.  I have seen the syndrome on PPP,
> Ethernet, HiPPI and proprietary interconnects, so I doubt that
> the main cause is in them.

The _whole_ transport - nothing through _any_ interface or any
connection works? (Have any of these been on multiple-interface
systems?)

> So the solution is to restart the transport.  That works, and I can
> then use it again.  But the jammed TCP/IP connexion does NOT
> restart, and it eventually times out (or not) exactly as if the
> transport were still dead.  This makes me certain that, whatever
> else is going on, there are similar bugs in several TCP/IP stacks.
> Now, it is the jammed stack effect that I am enquiring about first,
> and NOT the PPP failure as such.

I'm a trifle confused - you were saying the transport is jammed -
which I took to mean hung, so how can the jammed TCP/IP connection
time out? That implies that at least part of the FSM for the TCP
connection is running.

rick jones
--
Wisdom Teeth are impacted, people are affected by the effects of events.
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to raj in cup.hp.com  but NOT BOTH...

 
 
 

Post by Nick Maclaren » Sun, 22 Dec 2002 04:53:41





>> I have seen a particular syndrome on 3-4 Unices, with supposedly
>> independently developed TCP/IP stacks

>I suspect there are really only two or three "main branches" of TCP/IP
>stack these days. BSDish, Linux and Mentat.

Interesting.  It certainly used to be the case that Sun and HP were
BSDish, but were VERY different from each other in details; at least
HP had more-or-less rewritten from scratch.  This caused some, er,
interesting performance problems ....

>> The symptom is that a transmission error sometimes causes the stack
>> to jam.  Totally.  Keepalive packets are not being sent, the message
>> is not being retried, and the normal timeouts are not occurring.
>> After a long time (typically 30 minutes to days, depending on the
>> system), there is a timeout that causes a high-level error return.
>> Under Linux, I get the interesting message "Invalid argument".

>Transmission error - as in an error on a specific link yes?

In some cases, definitely.  In others, that is what I hypothesise.

>> When this happens, the whole transport jams, which may be cause or
>> effect.  It is SOMETIMES possible to kick-start the Linux/PPP one
>> by doing a separate operation (ping will do) fast enough after the
>> jam occurs, but not often - and I have never noticed the effect
>> soon enough on the other systems.  I have seen the syndrome on PPP,
>> Ethernet, HiPPI and proprietary interconnects, so I doubt that
>> the main cause is in them.

>The _whole_ transport - nothing through _any_ interface or any
>connection works? (Have any of these been on multiple-interface
>systems?)

No, just that interface.  All except my Linux system were seriously
multiple interface boxes, and the other interfaces continued as
normal.  But it took out ALL use of that interface, and not just
the application stack - i.e. ping would no longer work.  However,
the actual driver remained alive, which is why I used the word
'transport'.  I don't know what TCP/IP calls that level of its stack.
Anyway, it is the communication channel corresponding to a single
I/O connexion.

There is another phenomenon which causes one jammed interface to
cause others to block, but I understand that rather better and it
is almost certainly unrelated.

>> So the solution is to restart the transport.  That works, and I can
>> then use it again.  But the jammed TCP/IP connexion does NOT
>> restart, and it eventually times out (or not) exactly as if the
>> transport were still dead.  This makes me certain that, whatever
>> else is going on, there are similar bugs in several TCP/IP stacks.
>> Now, it is the jammed stack effect that I am enquiring about first,
>> and NOT the PPP failure as such.

>I'm a trifle confused - you were saying the transport is jammed -
>which I took to mean hung, so how can the jammed TCP/IP connection
>time out? That implies that at least part of the FSM for the TCP
>connection is running.

I am 90% certain it is a timeout in a higher level of the stack,
and it is definitely very system dependent.  For example, I get times
of somewhere in the 10-30 minutes range on Linux, and ones of about
a day on AIX.  It could well be because an 'infinite' sleep has
returned without EINTR being set, or some similar result.

To attempt to clarify.  After restarting the transport, I have a
working interface for all new uses.  The application that hung is
still stuck in an I/O call to TCP/IP.  In this state, it does NOT
cause keepalives to be sent, nor any other activity on the
interface.  I can't provoke it reliably in an environment where I
can find out exactly what the packet state was (because I am rather
prevented from using the far end when dial-up jams and it is very
rare at work!)  But, after a time that seems fairly consistent
within a system but very different between them, the I/O transfer
completes UNsuccessfully and the application exits.  And, yes, I
do mean that a TCP/IP operation to a previously established target
fails with an I/O error of some sort WITHOUT the far end being
involved.
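
As an application-level guard against the kind of indefinite block
described above, a program can ask for keepalives with shorter timers
and also put its own deadline on each I/O call.  The following is a
minimal sketch under stated assumptions: the function name
make_guarded_socket and all the timer values are illustrative, the
TCP_KEEP* options are Linux-specific (the code skips any that this
platform does not define), and if the stack really is failing to send
keepalives, as reported here, only the socket timeout would help.

```python
import socket


def make_guarded_socket(idle=60, interval=10, probes=5, io_timeout=120.0):
    """Create a TCP socket that should not block forever on a dead path.

    idle, interval and probes tune the kernel keepalive timers where the
    platform supports it; io_timeout bounds each blocking call made
    through this socket object.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel to probe an idle connexion.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket keepalive tuning; skipped where the constant is absent.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    # Independent of the kernel: recv/send raise socket.timeout after this.
    sock.settimeout(io_timeout)
    return sock
```

With this in place a recv() on a jammed connexion would raise
socket.timeout after io_timeout seconds, so the application can close
and reconnect rather than wait for whatever timer eventually fires in
the stack.  It bounds the symptom and does nothing to diagnose the cause.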

As you can gather, I don't know the internals of the TCP/IP stack,
either specification or implementations.  And I don't think that it
would be quick to become enough of an expert to deal with this
horrible issue :-(

Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.

Tel.:  +44 1223 334761    Fax:  +44 1223 334679

 
 
 

Post by Vernon Schryv » Sun, 22 Dec 2002 13:50:49






> ...
>>I suspect there are really only two or three "main branches" of TCP/IP
>>stack these days. BSDish, Linux and Mentat.

>Interesting.  It certainly used to be the case that Sun and HP were
>BSDish, but were VERY different from each other in details; at least
>HP had more-or-less rewritten from scratch.  This caused some, er,
>interesting performance problems ....

Sun has been other than BSD flavored for the many years since Sun
bosses soiled their pants about what AT&T (or certain consultants)
had done to the UCB copyright messages in the BSD TCP source that had
somehow appeared in System V.  In their pointy haired panic at what
the long sleeping but finally aroused Regents could do to AT&T/System
V, they tossed everything related to BSD TCP...or so I've been told
by Sun programmers who ought to know.  They are long time network
kernel hacks who were there and not in sales or marketing or above
2nd level management.

I've also had a little, somewhat more direct contact with the Mentat
people.  For reasons that are variously good and bad, I trust they
did keep any taint of BSD out of their STREAMS modules.  I don't know
what might have crept back in since the early 1990's.

When I've seen performance problems between BSD TCP kernel code I was
responsible for and Sun's, it was with the non-BSD flavor of Sun's code.

While Rick Jones is available, I wouldn't presume to say anything
about the history of HP's TCP code.


 
 
 

Post by Nick Maclaren » Sun, 22 Dec 2002 20:12:30




>Sun has been other than BSD flavored for the many years since Sun
>bosses soiled their pants about what AT&T (or certain consultants)
>had done to the UCB copyright messages in the BSD TCP source that had
>somehow appeared in System V.  In their pointy haired panic at what
>the long sleeping but finally aroused Regents could do to AT&T/System
>V, they tossed everything related to BSD TCP...or so I've been told
>by Sun programmers who ought to know.  They are long time network
>kernel hacks who were there and not in sales or marketing or above
>2nd level management.

Most interesting.  It could have been during that era that I hit the
interesting incompatibilities.  It was in the early 1990s.  I have
absolutely no idea whether Sun, HP, neither or both was to blame,
though I discovered that the cause was parameter incompatibility,
combined with error recovery.

>When I've seen performance problems between BSD TCP kernel code I was
>responsible for and Sun's, it was with the non-BSD flavor of Sun's code.

Could be.  I wasn't closely involved with Sun networking, except when
I had to track down this and related problems.

>While Rick Jones is available, I wouldn't presume to say anything
>about the history of HP's TCP code.

I know nothing about it, except as a user, and should be interested.
But you don't get that level of system integration, performance, error
recovery and automatic resource allocation by minor tweaking.  I am
talking about HP-UX 7-9 (the last I used).

Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.

Tel.:  +44 1223 334761    Fax:  +44 1223 334679

 
 
 

Post by Casper H.S. Di » Mon, 23 Dec 2002 05:33:45



>Sun has been other than BSD flavored for the many years since Sun
>bosses soiled their pants about what AT&T (or certain consultants)
>had done to the UCB copyright messages in the BSD TCP source that had
>somehow appeared in System V.  In their pointy haired panic at what
>the long sleeping but finally aroused Regents could do to AT&T/System
>V, they tossed everything related to BSD TCP...or so I've been told
>by Sun programmers who ought to know.  They are long time network
>kernel hacks who were there and not in sales or marketing or above
>2nd level management.

Wasn't the original SVR4 TCP stack "Lachman" TCP/IP which again
was basically BSD rewhacked into STREAMS?

That code was all gone as early as Solaris 2.1 when Sun took on
the Mentat stack (and branched it off at that point).

I'm familiar with both the BSD and Solaris stacks and they do not
look at all similar.

>While Rick Jones is available, I wouldn't presume to say anything
>about the history of HP's TCP code.

Well, you could look at:

http://groups.google.com/groups?selm=74h9lf%24ql8%243%40ocean.cup.hp....

which basically says it all.

Casper
--
Expressed in this posting are my opinions.  They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

 
 
 

Post by Vernon Schryv » Mon, 23 Dec 2002 06:38:42




> ...
>>Sun has been other than BSD flavored for the many years since Sun
>>bosses soiled their pants about what AT&T (or certain consultants)
>>had done to the UCB copyright messages in the BSD TCP source that had
>>somehow appeared in System V. ...
>Wasn't the original SVR4 TCP stack "Lachman" TCP/IP which again
>was basically BSD rewhacked into STREAMS?

I'm not sure which version of the ported BSD TCP code got into the
official SVR4.  I seem to recall an ex-Lachman employee mentioning
having done one of the SVR4 STREAMS stacks.


 
 
 

Post by nobody » Mon, 23 Dec 2002 07:19:43



> I had a problem with a Zoom internal hardware modem that was K56flex from
> before V.90 standards were finalized, that was flashed to V.90.  When
> connected to a Livingston (then Lucent) Portmaster, the modem would
> occasionally lock up regardless of OS (Win95/98, Linux, FreeBSD).  No OS
> would even recognize that it was hung and the only way to get it working
> again in any OS was to reboot.  I had NO such problem with any other PPP
> connection (to work or CompuServe).

> Apparently Lucent never could fix the problem with the Portmaster and when
> it was retired, that same modem worked flawlessly with that same ISP for
> years.

> BTW when this internal modem was regularly locking up with the Portmaster,
> I purchased an external V.90 modem that had no problems with it, and not a
> single hang or disconnect during an 11 hr ftp install of FreeBSD.

Interesting.  I know of a problem with the Portmaster PPP implementation
(circa January 1998) that caused it to incorrectly compute the FCS when a
character that had to be escaped fell on a 128-byte boundary.  This would
certainly cause a TCP connection to hang, but it was independent of the
type of modem and would not persist if the hung TCP connection was closed.
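
For readers unfamiliar with the mechanism involved, here is a minimal
sketch of PPP's HDLC-like framing as specified in RFC 1662: the 16-bit
FCS is computed over the unescaped bytes, and only afterwards are the
flag, escape and control characters byte-stuffed.  This illustrates the
technique in general; it is not the Portmaster's code, and the function
names (fcs16, escape, unescape, build_frame, check_frame) are made up
for the example.  It assumes the default async control character map,
i.e. every byte below 0x20 is escaped.

```python
FLAG = 0x7E       # frame delimiter
ESC = 0x7D        # control escape
GOOD_FCS = 0xF0B8  # residue when the FCS is run over data plus FCS


def fcs16(data, fcs=0xFFFF):
    """Bitwise PPP FCS-16 (reflected polynomial 0x8408), no final XOR."""
    for byte in data:
        fcs ^= byte
        for _ in range(8):
            fcs = (fcs >> 1) ^ 0x8408 if fcs & 1 else fcs >> 1
    return fcs


def escape(payload):
    """Byte-stuff FLAG, ESC and control characters (default ACCM)."""
    out = bytearray()
    for byte in payload:
        if byte in (FLAG, ESC) or byte < 0x20:
            out += bytes((ESC, byte ^ 0x20))
        else:
            out.append(byte)
    return bytes(out)


def unescape(stuffed):
    """Reverse escape(); raises ValueError on a dangling escape byte."""
    out = bytearray()
    stream = iter(stuffed)
    for byte in stream:
        if byte == ESC:
            following = next(stream, None)
            if following is None:
                raise ValueError("frame ends with an escape byte")
            byte = following ^ 0x20
        out.append(byte)
    return bytes(out)


def build_frame(payload):
    """FCS first (over raw bytes), then stuffing, then flags."""
    fcs = fcs16(payload) ^ 0xFFFF
    body = payload + bytes((fcs & 0xFF, fcs >> 8))  # low byte first
    return bytes((FLAG,)) + escape(body) + bytes((FLAG,))


def check_frame(frame):
    """Return True if the frame's FCS is valid after unstuffing."""
    return fcs16(unescape(frame[1:-1])) == GOOD_FCS
```

Because the FCS depends only on the unescaped bytes, where an escaped
character falls in the stream should make no difference.  A receiver
built this way accepts a frame such as
build_frame(bytes(127) + b"\x7e" + b"data"), so a sender that computes a
different FCS for that frame would see it dropped on every retransmission
of the same segment.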

nobody

 
 
 

Post by Nick Maclaren » Tue, 24 Dec 2002 02:49:24





>>While Rick Jones is available, I wouldn't presume to say anything
>>about the history of HP's TCP code.

>Well, you could look at:

>http://groups.google.com/groups?selm=74h9lf%24ql8%243%40ocean.cup.hp....

Though not as far as the oddities that I saw, which were HP-UX 9
and (I think) SunOS 4.1 or an early Solaris.

I have seen similar problems since, though not to the same extreme,
with other combinations of system.

Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.

Tel.:  +44 1223 334761    Fax:  +44 1223 334679

 
 
 

1. Packets from bottom of TCP/IP stack direct to application bypassing stack

Hello Everyone

I am working on an ADSL modem and have the following situation that I
would like some advice on.

I need to filter out some packets in the lower level of the network
stack. There are 2 types of packet: [eth | ppp | ip | udp] and [eth
| ip | udp]. The data in these packets is the same, and they can be
identified by the first 16 bits of the UDP data.

I have managed to catch these packets in the netif_rx(...) function in
/net/core/dev.c using the 16-bit ID, so I have the packets.

Now for my question: how do I, in an easy way, get these packets
directly to my application without using the network stack? I need
BOTH of these packets to reach it, and if I use a socket the one with
PPP gets thrown away somewhere, which is not so good.

I know this is not a very specific question and a little vague, but
some advice and pointers would be appreciated.

Regards
Andreas
