TCP checksum errors in Solaris 2.4

Post by Y Badr » Tue, 18 Jun 1996 04:00:00



Hello all,

About a month ago I posted about a situation where, whenever an x86 Solaris
2.4 PPP client had to retransmit packets, its retransmissions were all
ignored. Since it causes these retransmissions itself by screwing up an
original TCP transmission (sooner rather than later), that system has become
virtually unusable.
It is heavily patched, including 101946-29, 102855-01 and 102821-01, and these
problems began out of the blue, more than a month after the most recent
patch.

I later added checksum verification to the network monitor I was using, and
found that all these packets had invalid TCP checksums ... BEFORE they
left the Solaris PC.
There were also some spurious differences between the originals and their
retransmissions, e.g. the PUSH flag is almost always set in the original
and almost always cleared in the retransmission. Unfortunately, I wasn't able
to pin down an absolute pattern with respect to TCP's flags and checksums.
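
For anyone wanting to make the same comparison, here is a minimal sketch in
Python of diffing the flag bits of an original segment and its
retransmission. It is not the monitor described in this post: the names
tcp_flags and flag_diff are hypothetical, and it assumes each argument is a
raw TCP segment beginning at the TCP header (flags in byte 13).

```python
# Flag bit masks from the TCP header (byte 13 of the header).
TCP_FLAGS = (
    ("FIN", 0x01), ("SYN", 0x02), ("RST", 0x04),
    ("PSH", 0x08), ("ACK", 0x10), ("URG", 0x20),
)


def tcp_flags(segment: bytes) -> set:
    """Return the names of the flags set in a raw TCP segment."""
    bits = segment[13]
    return {name for name, mask in TCP_FLAGS if bits & mask}


def flag_diff(original: bytes, retransmission: bytes):
    """Return (flags only in original, flags only in retransmission)."""
    a, b = tcp_flags(original), tcp_flags(retransmission)
    return sorted(a - b), sorted(b - a)
```

Running flag_diff over each original/retransmission pair in a capture would
show at a glance whether PUSH is the only flag that changes.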

I ran my network monitor on the le0 interface of a Sparc system, to verify
the monitor's reliability. I felt that there should be virtually no checksum
errors there, so if my program flagged too many, it might be in the wrong
(although checksum computation is not that hard, and there is example code in
RFC 1071 and public-domain apps such as tcpdump - admittedly tcpdump only does
IP checksums, but the computation is the same).

It didn't flag very many errors, but what made me doubt its reliability was
that the ones it did flag were outgoing TCP packets on that Sparc. As I
remember, they were all Telnet packets, with 2 bytes of data ("\r\n"). Manual
computation of the checksum over a dump of selected packets (and yes, I did
remember the pseudo header) backed up the verdict of my program.
The strangest bit was that one of the "bad" TCP segments got acknowledged
(I now suspect this might be due to compensating errors in the way Solaris
computes *and* verifies checksums - see below).
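
As an illustration of the check described above, here is a minimal sketch in
Python of the RFC 1071 one's-complement sum applied to a TCP segment together
with its pseudo header. It is not the monitor's actual code; the function
names are hypothetical, IPv4 is assumed, and the addresses are passed as
4-byte strings in network order.

```python
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum of data, as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"              # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:               # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip: bytes, dst_ip: bytes, tcp_len: int) -> bytes:
    """IPv4 pseudo header: addresses, zero byte, protocol 6, TCP length."""
    return src_ip + dst_ip + struct.pack("!BBH", 0, 6, tcp_len)


def tcp_checksum(src_ip: bytes, dst_ip: bytes, segment: bytes) -> int:
    """Checksum for a segment whose checksum field is currently zero."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ~ones_complement_sum(data) & 0xFFFF


def tcp_checksum_ok(src_ip: bytes, dst_ip: bytes, segment: bytes) -> bool:
    """Verify a captured segment with its checksum field left in place."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ones_complement_sum(data) == 0xFFFF
```

A verifier only needs tcp_checksum_ok: summing the pseudo header and the whole
segment, checksum field included, should give 0xFFFF for a valid segment.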

This left me scratching my head, so eventually, I decided to scan Sunsolve
for keywords such as retransmission and checksum. And what did I find but a
plethora of bug reports on checksum errors and related problems (apparently
SunOS 5 has even lost that uncanny ability SunOS 4 had, to reassemble IP
fragments without scrambling them).
The bugs that I found were 1212710 (aka 1194355), 1224148, 1233461 and
1238993.

Sunsolve does say that the above bugs are fixed by the Solaris 2.4 jumbo
kernel patch 101945-39 (Sparc), which is dated May 31st. I haven't installed
this, but the symptoms I have found don't match these bug reports, and I do
now trust my program.
As for Solaris 2.4 x86, there hasn't been a kernel jumbo patch since Nov 2nd,
and since much of it is probably ported from the Sparc, I'm sure it has all
these bugs plus more (due to byte ordering, etc). This doesn't do much to
allay my existing suspicions that Sun are not serious about the x86 version.

I noticed several more bug reports, about PPP corrupting packets when doing
VJ compression (1151532 and 1151536), so I shall mention some other symptoms.
I turned aspppd debug up to level 9 on the Solaris PC, and it showed a mixture
of compressed and uncompressed PPP frames. I think only the retransmissions
were uncompressed. My PPP settings for compression are the default (i.e. on).

Well, at the end of this rambling post, I'm not sure if I have a definite
question to pose. Given that Sunsolve registers many related bugs, I just
wonder why I don't see more problem reports here in Usenet (not that I am
a regular reader), and if Sun want to pass any comment.

Actually, there is one concrete question, which will help to clarify my
thoughts. When a DLPI-based network monitor captures outgoing packets
from its host system, where does it get them from?
Are they:
- A: looped back as soon as they hit the PPP/le0 modules,
- B: read back off the hardware device as they are transmitted, and thus
  returned to the monitor, or
- C: looped back within the PPP/le0 module, before hitting any hardware?

All comments/suggestions welcome.
--

The above opinions are not my own

 
 
 

Post by Richard F » Thu, 20 Jun 1996 04:00:00



>Hello all,

>I posted about a month ago, about a situation where whenever an x86 Solaris
>2.4 PPP client had to retransmit packets, its retransmissions were all
>ignored. Since it causes these retransmissions by screwing up an original
>TCP transmission (sooner rather than later), that system has become virtually
>unusable.
>It is heavily patched, including 101946-29, 102855-01, 102821-01, and these
>problems suddenly began out of the blue, more than a month after the most
>recent patch.

>I later added checksum verification to the network monitor I was using, and
>have found that all these packets had invalid TCP checksums ... BEFORE they

You need to load 101946-33 or newer to address this problem.

--rich

 
 
 

Post by Y Badr » Sun, 23 Jun 1996 04:00:00




>>I posted about a month ago, about a situation where whenever an x86 Solaris
>>2.4 PPP client had to retransmit packets, its retransmissions were all
>>ignored. Since it causes these retransmissions by screwing up an original
>>TCP transmission (sooner rather than later), that system has become virtually
>>unusable.

>>I later added checksum verification to the network monitor I was using, and
>>have found that all these packets had invalid TCP checksums ... BEFORE they

>You need to load 101946-33 or newer to address this problem.

The latest version of 101946 in the Sunsolve contract area is 29. Did you
mean 101945 (Sparc)?
--

The above opinions are not my own
 
 
 

1. bad TCP checksums in tcp_retrans_try_collapse 2.4.5pre5 (at least)

Hi,

We've been pulling our hair out lately trying to figure out why
certain connections of ours have been just stalling and dying.  It
turns out that the problem occurs when TCP segments are lost and then
coalesced into a single segment for retransmission.  When that
happens, a bad checksum is computed, and then the connection dies
while one end continually retransmits the same packet with a bad
checksum.

Here's a patch against 2.4.5pre5 which should fix it:

--- tcp_output.c~       Thu Apr 12 15:11:39 2001

                if (skb->ip_summed != CHECKSUM_HW) {
                        memcpy(skb_put(skb, next_skb_size), next_skb->data, next_skb_size);
-                       skb->csum = csum_block_add(skb->csum, next_skb->csum, skb->len);
+                       skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
                }

                /* Update sequence range on original skb. */

Hopefully the problem is obvious in retrospect.  skb_put(skb,...)
modifies skb->len, so the new, larger value was being passed to
csum_block_add as the offset of the appended data instead of the
original length.  We're testing the patch now, but it seems fairly
obvious, and apparently other people have been reporting similar
problems, so I wanted to get this out there...
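
To see why the offset matters, here is a small model in Python, offered as a
sketch rather than the kernel code. It works on 16-bit folded sums, whereas
the kernel carries 32-bit partial sums; the helper names mirror the kernel's,
but their bodies and the collapse function are assumptions made for
illustration. The offset decides whether the second block's sum must be
byte-swapped because that block starts on an odd byte.

```python
def fold(total: int) -> int:
    """Fold carries so the result fits in 16 bits (one's-complement add)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def csum_partial(data: bytes) -> int:
    """16-bit one's-complement sum of a block, zero-padded if odd length."""
    if len(data) % 2:
        data += b"\x00"
    words = (int.from_bytes(data[i:i + 2], "big")
             for i in range(0, len(data), 2))
    return fold(sum(words))


def csum_block_add(csum: int, csum2: int, offset: int) -> int:
    """Add the sum of a block that starts at byte `offset` of the whole."""
    if offset & 1:                   # odd start: its bytes land swapped
        csum2 = ((csum2 << 8) & 0xFF00) | (csum2 >> 8)
    return fold(csum + csum2)


def collapse(first: bytes, second: bytes, buggy: bool = False) -> int:
    """Combine two block sums; `buggy` uses the length after the append."""
    offset = len(first) + len(second) if buggy else len(first)
    return csum_block_add(csum_partial(first), csum_partial(second), offset)
```

In this model the wrong offset only changes the result when the appended
block has an odd length, since that is the only case in which the parity of
the offset flips. That may be why the failure looked intermittent.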

Todd

p.s.  If there's followup, please Cc me directly, as I'm not
subscribed to lkml.


2. sources of unix-like commnads for DOS/Win95

3. Solaris 2.4 + TCP = input errors?

4. rpc.rstatd missing from Slackware 1.2.0

5. How to enable UDP checksums in kernel in Solaris 2.4?

6. can't get dial tone on modem

7. Solaris 2.4, Oracle 7.2.2.3, inodes, bus errors, and I/O errors

8. Solaris9..Apache startup problem

9. IP MASQ : TCP/UDP checksum errors

10. Occasional TCP checksum errors with qdisc

11. TCP Checksum error with WD8013WC on FreeBSD-2.2.5 (driver ed0)

12. IP masquerading : TCP/UDP checksum errors

13. TCP Checksum Errors