getting tcp packets larger than MTU, how is that possible??

Post by Tobias Skytt » Tue, 13 Jan 2009 20:44:32



Hi,

While speed-testing various MTU sizes (1500 and 9000) I noticed that
the packet lengths reported by tcpdump vary from the expected frame
size (1514 or 9014 bytes) up to 62702(!) bytes.
When the length is over the MTU size (e.g. 62702 bytes), the receiving
machine sends back a lot of 66-byte ACKs before receiving the next
packet.
What's up with this? How can a packet be larger than the MTU?

I have included below a short excerpt from the tcpdump output, taken
with MTU 9000 while transferring a large file (2.2 GB) over FTP.
Both machines run RH 5.2, and each has the following two NICs
installed:
Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet
Intel Corporation 82571EB Gigabit Ethernet Controller.
The NICs are connected via a crossover cable.

The results and speeds are the same with either card (e.g.
Broadcom to Broadcom or Intel to Intel).

Any hints would be much appreciated. Thanks.
Tobias
*************************************************************************************************
Capture from the sending machine, having MTU 9000 and capturing on
port 20 only:
*************************************************************************************************
11:39:20.323198 00:15:17:16:37:4b (oui Unknown) > 00:15:17:12:f0:8f (oui Unknown), ethertype IPv4 (0x0800), length 9014: 192.168.0.2.ftp-data > 192.168.0.4.59912: . 187909:196857(8948) ack 1 win 140 <nop,nop,timestamp 485338 531647>
11:39:20.323200 00:15:17:12:f0:8f (oui Unknown) > 00:15:17:16:37:4b (oui Unknown), ethertype IPv4 (0x0800), length 66: 192.168.0.4.59912 > 192.168.0.2.ftp-data: . ack 143169 win 958 <nop,nop,timestamp 531647 485338>
11:39:20.323204 00:15:17:16:37:4b (oui Unknown) > 00:15:17:12:f0:8f (oui Unknown), ethertype IPv4 (0x0800), length 35858: 192.168.0.2.ftp-data > 192.168.0.4.59912: . 196857:232649(35792) ack 1 win 140 <nop,nop,timestamp 485338 531647>
11:39:20.323443 00:15:17:12:f0:8f (oui Unknown) > 00:15:17:16:37:4b (oui Unknown), ethertype IPv4 (0x0800), length 66: 192.168.0.4.59912 > 192.168.0.2.ftp-data: . ack 161065 win 1014 <nop,nop,timestamp 531648 485338>
11:39:20.323448 00:15:17:12:f0:8f (oui Unknown) > 00:15:17:16:37:4b (oui Unknown), ethertype IPv4 (0x0800), length 66: 192.168.0.4.59912 > 192.168.0.2.ftp-data: . ack 178961 win 903 <nop,nop,timestamp 531648 485338>
11:39:20.323452 00:15:17:16:37:4b (oui Unknown) > 00:15:17:12:f0:8f (oui Unknown), ethertype IPv4 (0x0800), length 53754: 192.168.0.2.ftp-data > 192.168.0.4.59912: . 232649:286337(53688) ack 1 win 140 <nop,nop,timestamp 485338 531648>
11:39:20.323694 00:15:17:12:f0:8f (oui Unknown) > 00:15:17:16:37:4b (oui Unknown), ethertype IPv4 (0x0800), length 66: 192.168.0.4.59912 > 192.168.0.2.ftp-data: . ack 196857 win 1014 <nop,nop,timestamp 531648 485338>
11:39:20.323698 00:15:17:16:37:4b (oui Unknown) > 00:15:17:12:f0:8f (oui Unknown), ethertype IPv4 (0x0800), length 9014: 192.168.0.2.ftp-data > 192.168.0.4.59912: . 286337:295285(8948) ack 1 win 140 <nop,nop,timestamp 485339 531648>
11:39:20.323942 00:15:17:12:f0:8f (oui Unknown) > 00:15:17:16:37:4b (oui Unknown), ethertype IPv4 (0x0800), length 66: 192.168.0.4.59912 > 192.168.0.2.ftp-data: . ack 214753 win 1069 <nop,nop,timestamp 531648 485338>
11:39:20.323952 00:15:17:12:f0:8f (oui Unknown) > 00:15:17:16:37:4b (oui Unknown), ethertype IPv4 (0x0800), length 66: 192.168.0.4.59912 > 192.168.0.2.ftp-data: . ack 241597 win 903 <nop,nop,timestamp 531648 485338>
11:39:20.324193 00:15:17:12:f0:8f (oui Unknown) > 00:15:17:16:37:4b (oui Unknown), ethertype IPv4 (0x0800), length 66: 192.168.0.4.59912 > 192.168.0.2.ftp-data: . ack 259493 win 1014 <nop,nop,timestamp 531648 485338>
11:39:20.324198 00:15:17:16:37:4b (oui Unknown) > 00:15:17:12:f0:8f (oui Unknown), ethertype IPv4 (0x0800), length 17962: 192.168.0.2.ftp-data > 192.168.0.4.59912: P 340025:357921(17896) ack 1 win 140 <nop,nop,timestamp 485339 531648>
11:39:20.324442 00:15:17:12:f0:8f (oui Unknown) > 00:15:17:16:37:4b (oui Unknown), ethertype IPv4 (0x0800), length 66: 192.168.0.4.59912 > 192.168.0.2.ftp-data: . ack 277389 win 1069 <nop,nop,timestamp 531649 485338>
11:39:20.324452 00:15:17:12:f0:8f (oui Unknown) > 00:15:17:16:37:4b (oui Unknown), ethertype IPv4 (0x0800), length 66: 192.168.0.4.59912 > 192.168.0.2.ftp-data: . ack 304233 win 1119 <nop,nop,timestamp 531649 485338>
11:39:20.324456 00:15:17:16:37:4b (oui Unknown) > 00:15:17:12:f0:8f (oui Unknown), ethertype IPv4 (0x0800), length 6402: 192.168.0.2.ftp-data > 192.168.0.4.59912: . 414221:420557(6336) ack 1 win 140 <nop,nop,ti
:

 
 
 

Post by Tobias Skytt » Tue, 13 Jan 2009 20:50:03


Forgot to mention that the kernel version on both machines is:
2.6.18-92.el5

Also, on an FTP transfer of 2.2 GB with MTU 9000 I get the following
packets:
16542 packets of length 66 bytes (ACKs from the receiver)
5746 packets of 9014 bytes
2127 packets of 62702 bytes
69 packets of other sizes

Tobias

 
 
 

Post by Pascal Hambour » Wed, 14 Jan 2009 00:08:47


Hello,

Tobias Skytte wrote:

> While speedtesting various MTU sizes (1500 and 9000) I noticed that
> packet lengths, as reported by tcpdump, are varying in size from MTU
> size (1514 and 9014 bytes, up to 62702(!) bytes).
> When the length is over MTU size (e.g. 62702 bytes), the receiving
> machine sends back a lot of 66byte ACKs, before receiving the next
> packet.
> Whats up with this? how can it have a packet size greater than MTU??

Could it be caused by the NIC doing TSO (TCP segmentation offload) ?
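
One way to check is `ethtool -k <interface>`, which lists the offload settings of an interface. The sketch below parses that kind of output. The sample text is an assumption for illustration only, since the exact wording varies between ethtool versions (older releases print `tcp segmentation offload` with spaces), and the helper names are made up.

```python
import subprocess


def parse_offload_settings(text):
    """Turn 'name: on|off' lines from `ethtool -k` output into a dict of booleans."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        words = value.split()
        # Skip the "Features for eth0:" header and anything that is not on/off.
        if not words or words[0] not in ("on", "off"):
            continue
        # Normalise "tcp segmentation offload" and "tcp-segmentation-offload".
        key = name.strip().lower().replace(" ", "-")
        settings[key] = words[0] == "on"
    return settings


def tso_enabled(interface):
    """Run `ethtool -k` on a live interface (needs ethtool installed)."""
    out = subprocess.run(["ethtool", "-k", interface],
                         capture_output=True, text=True, check=True).stdout
    return parse_offload_settings(out).get("tcp-segmentation-offload", False)


# Illustrative sample only, not taken from the machines in this thread.
SAMPLE = """Features for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
generic-receive-offload: off [fixed]
"""

settings = parse_offload_settings(SAMPLE)
print(settings["tcp-segmentation-offload"])
```

If it reports on, `ethtool -K <interface> tso off` should turn it off for a comparison run.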
 
 
 

Post by Rick Jone » Wed, 14 Jan 2009 03:07:13



> Hello,
> Tobias Skytte wrote:

> > While speedtesting various MTU sizes (1500 and 9000) I noticed that
> > packet lengths, as reported by tcpdump, are varying in size from MTU
> > size (1514 and 9014 bytes, up to 62702(!) bytes).
> > When the length is over MTU size (e.g. 62702 bytes), the receiving
> > machine sends back a lot of 66byte ACKs, before receiving the next
> > packet.
> > Whats up with this? how can it have a packet size greater than MTU??
> Could it be caused by the NIC doing TSO (TCP segmentation offload) ?

Most likely, and if one were snapping the entire send, tcpdump would
probably report a botched checksum too, thanks to CKO (checksum offload) :)

Packet tracing on the sending system takes place _before_ the
packet(s) make it to the wire. On the wire, the packets will be the
"correct" size and should have the correct checksum.

If what you want to see is the on-the-wire traffic, you need to trace
with a third system that is not part of any conversations - and
perform some tricks with configuring monitor ports on switches and
whatnot.
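
The numbers in the capture fit the TSO explanation. A rough sketch, assuming a 14-byte Ethernet header, a 20-byte IPv4 header with no options, and a 32-byte TCP header (20 bytes plus the 12-byte timestamp option visible in the trace); the function name is made up for illustration. It splits one oversized frame as seen by tcpdump into the frames the NIC would be expected to put on the wire:

```python
ETH_HDR = 14   # Ethernet header, no VLAN tag (assumption)
IP_HDR = 20    # IPv4 header without options (assumption)
TCP_HDR = 32   # 20-byte TCP header + 12 bytes of timestamp option


def wire_frames(captured_len, mtu):
    """Return the frame lengths a TSO-capable NIC would emit for one captured frame."""
    headers = ETH_HDR + IP_HDR + TCP_HDR
    payload = captured_len - headers
    mss = mtu - IP_HDR - TCP_HDR          # TCP payload that fits in one MTU
    frames = []
    while payload > 0:
        chunk = min(mss, payload)
        frames.append(chunk + headers)    # every wire frame gets its own headers
        payload -= chunk
    return frames


for length in (9014, 17962, 35858, 53754, 62702):
    frames = wire_frames(length, mtu=9000)
    print(length, "->", len(frames), "frame(s) of", frames[0], "bytes")
```

The larger frames in the trace each carry a whole multiple of the 8948-byte payload (2, 4, 6 and 7 segments), which is what one would expect if the stack hands the NIC one big buffer and the NIC cuts it up.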

rick jone
--
portable adj, code that compiles under more than one compiler
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

 
 
 

Post by Blah Blah Bla » Wed, 14 Jan 2009 04:48:08


On Mon, 12 Jan 2009 18:07:13 +0000, Rick Jones faxed us with....


>> Hello,

>> Tobias Skytte wrote:

>> > While speedtesting various MTU sizes (1500 and 9000) I noticed that
>> > packet lengths, as reported by tcpdump, are varying in size from MTU
>> > size (1514 and 9014 bytes, up to 62702(!) bytes). When the length is
>> > over MTU size (e.g. 62702 bytes), the receiving machine sends back a
>> > lot of 66byte ACKs, before receiving the next packet.
>> > Whats up with this? how can it have a packet size greater than MTU??

>> Could it be caused by the NIC doing TSO (TCP segmentation offload) ?

> Most likely, and if one were snapping the entire send tcpdump would
> probably report a botched checksum too, thanks to CKO :)

> Packet tracing on the sending system takes place _before_ the packet(s)
> make it to the wire - on the wire, the packets will be the "correct"
> size and should have the correct checksum.

> If what you want to see is the on the wire stuff, you need to trace with
> a third system that is not part of any conversations - and perform some
> tricks with configuring monitor ports on switches and whatnot.

> rick jone

This made an interesting read. I was thinking Path MTU Discovery myself -
but this is much more interesting. Can we expand on this a bit?

--
Replica Watches - TRY LIDL - Cheap meds? Visit your GP

 
 
 

Post by Maxwell Lo » Wed, 14 Jan 2009 06:03:01



> Hi,

> While speedtesting various MTU sizes (1500 and 9000) I noticed that
> packet lengths, as reported by tcpdump, are varying in size from MTU
> size (1514 and 9014 bytes, up to 62702(!) bytes).

TCP and UDP packets can be fragmented into IP fragments.
IP reassembles fragments into larger units.

Are you filtering out non-TCP traffic in your tcpdump results?
If so, you won't see the IP fragments.

 
 
 

Post by Rick Jone » Wed, 14 Jan 2009 07:24:21



> On Mon, 12 Jan 2009 18:07:13 +0000, Rick Jones faxed us with....
> > Most likely, and if one were snapping the entire send tcpdump
> > would probably report a botched checksum too, thanks to CKO :)

> > Packet tracing on the sending system takes place _before_ the
> > packet(s) make it to the wire - on the wire, the packets will be
> > the "correct" size and should have the correct checksum.

> > If what you want to see is the on the wire stuff, you need to
> > trace with a third system that is not part of any conversations -
> > and perform some tricks with configuring monitor ports on switches
> > and whatnot.

> > rick jone
> This made an interesting read. I was thinking Path MTU Discovery
> myself - but this is much more interesting. Can we expand on this a
> bit?

I suppose - in which direction do you seek to see it expand?

rick jones
--
The glass is neither half-empty nor half-full. The glass has a leak.
The real question is "Can it be patched?"
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

 
 
 

Post by Blah Blah Bla » Wed, 14 Jan 2009 18:43:05


On Mon, 12 Jan 2009 22:24:21 +0000, Rick Jones faxed us with....


>> On Mon, 12 Jan 2009 18:07:13 +0000, Rick Jones faxed us with....

>> > Most likely, and if one were snapping the entire send tcpdump would
>> > probably report a botched checksum too, thanks to CKO :)

>> > Packet tracing on the sending system takes place _before_ the
>> > packet(s) make it to the wire - on the wire, the packets will be the
>> > "correct" size and should have the correct checksum.

>> > If what you want to see is the on the wire stuff, you need to trace
>> > with a third system that is not part of any conversations - and
>> > perform some tricks with configuring monitor ports on switches and
>> > whatnot.

>> > rick jone

>> This made an interesting read. I was thinking Path MTU Discovery myself
>> - but this is much more interesting. Can we expand on this a bit?

> I suppose - in which direction do you seek to see it expand?

> rick jones

Not needed Rick - but thanks. A quick google put that one to bed.

--
Replica Watches - TRY LIDL - Cheap meds? Visit your GP

 
 
 

Post by Pascal Hambour » Wed, 14 Jan 2009 23:59:50


Maxwell Lol wrote:


>> While speedtesting various MTU sizes (1500 and 9000) I noticed that
>> packet lengths, as reported by tcpdump, are varying in size from MTU
>> size (1514 and 9014 bytes, up to 62702(!) bytes).

> TCP and UDP packets can be fragmented into IP fragments.

AFAIK TCP tries not to send segments bigger than the path MTU allows in
order to avoid fragmentation.

> IP reassembles fragments into larger units.

In a "normal" (without offloading) data path, tcpdump sees packets
before they enter and after they leave the IP stack, so it should see
the fragments, not the reassembled datagrams.

> Are you filtering out non-TCP traffic in your TCPdump results?
> If so, you won't see the IP fragments.

Why not? The protocol number is in the IP header of each fragment, so
tcpdump knows the protocol of the datagram a fragment is part of.
 
 
 

Post by Maxwell Lo » Thu, 15 Jan 2009 10:36:23



> Maxwell Lol wrote:
>> Are you filtering out non-TCP traffic in your TCPdump results?
>> If so, you won't see the IP fragments.

> Why not ? The protocol number is in the IP header of each fragment, so
> tcpdump knows the protocol of the datagram a fragment is part of.

I haven't tested this, but this is my reasoning:

Tcpdump prints fragments as



It doesn't identify the fragment as UDP, TCP or whatever.

Checking the source, the frag printing routine is in print-ip.c
and not in print-tcp.c or print-udp.c

Also looking at print-ip.c it has

        switch (ipds->nh) {

---------------[snip]-------------
        case IPPROTO_TCP:
                /* pass on the MF bit plus the offset to detect fragments */
                tcp_print(ipds->cp, ipds->len, (const u_char *)ipds->ip,
                          ipds->off & (IP_MF|IP_OFFMASK));
                break;

        case IPPROTO_UDP:
                /* pass on the MF bit plus the offset to detect fragments */
                udp_print(ipds->cp, ipds->len, (const u_char *)ipds->ip,
                          ipds->off & (IP_MF|IP_OFFMASK));
                break;
---------------[snip]-------------
        case IPPROTO_IPV4:
                /* DVMRP multicast tunnel (ip-in-ip encapsulation) */
                ip_print(gndo, ipds->cp, ipds->len);
                if (! vflag) {
                        ND_PRINT((ndo, " (ipip-proto-4)"));
                        return;
                }
                break;

Which tells me that when you use "tcp" as a filter, "ip" is not
printed (unless you say "tcp and ip")

 
 
 

Post by Pascal Hambour » Thu, 15 Jan 2009 18:20:31


Maxwell Lol wrote:


>> Maxwell Lol wrote:

>>> Are you filtering out non-TCP traffic in your TCPdump results?
>>> If so, you won't see the IP fragments.

>> Why not ? The protocol number is in the IP header of each fragment, so
>> tcpdump knows the protocol of the datagram a fragment is part of.

> I haven't tested this. But this is my reasoning

> Tcpdump prints fragments as



> It doesn't identify the fragment as UDP, TCP or whatever.

Each fragment contains a complete IP header, and each IP header contains
the protocol number, so in /my/ reasoning nothing prevents tcpdump from
printing the protocol of a fragment. Of course it won't be able to print
other information such as the port numbers or ICMP type/code as they are
in the first (offset 0) fragment only.
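
As a rough illustration of that point, the sketch below packs a 20-byte IPv4 header for a non-first fragment and reads the protocol number and fragment offset back out of it. The field values are invented (addresses from the documentation range, checksum left at zero), and the helper name is hypothetical:

```python
import socket
import struct

IP_MF = 0x2000       # "more fragments" flag
IP_OFFMASK = 0x1FFF  # fragment offset, in units of 8 bytes


def parse_ipv4_header(header):
    """Extract protocol, fragment offset (bytes) and MF flag from an IPv4 header."""
    (ver_ihl, tos, total_len, ident, flags_off,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", header[:20])
    return {
        "proto": proto,
        "offset": (flags_off & IP_OFFMASK) * 8,
        "more_fragments": bool(flags_off & IP_MF),
        "total_len": total_len,
    }


# Second fragment of a UDP datagram: offset 1440 bytes, no more fragments.
header = struct.pack("!BBHHHBBH4s4s",
                     0x45, 0, 60, 35011, 1440 // 8,
                     1, socket.IPPROTO_UDP, 0,
                     socket.inet_aton("192.0.2.1"), socket.inet_aton("192.0.2.2"))

info = parse_ipv4_header(header)
print(info)
```

The protocol number is readable from the header alone; the port numbers are not, because the UDP header travelled in the first fragment.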

> Checking the source, the frag printing routine is in print-ip.c
> and not in print-tcp.c or print-udp.c

> Also looking at print-ip.c it has

>    switch (ipds->nh) {

> ---------------[snip]-------------
>    case IPPROTO_TCP:
>            /* pass on the MF bit plus the offset to detect fragments */
>            tcp_print(ipds->cp, ipds->len, (const u_char *)ipds->ip,
>                      ipds->off & (IP_MF|IP_OFFMASK));
>            break;

>    case IPPROTO_UDP:
>            /* pass on the MF bit plus the offset to detect fragments */
>            udp_print(ipds->cp, ipds->len, (const u_char *)ipds->ip,
>                      ipds->off & (IP_MF|IP_OFFMASK));
>            break;
> ---------------[snip]-------------
>    case IPPROTO_IPV4:
>            /* DVMRP multicast tunnel (ip-in-ip encapsulation) */
>            ip_print(gndo, ipds->cp, ipds->len);
>            if (! vflag) {
>                    ND_PRINT((ndo, " (ipip-proto-4)"));
>                    return;
>            }
>            break;

Hmm, it looks like IPPROTO_IPV4 is not the raw IP protocol but IPIP
tunneling encapsulation (protocol number 4?).

> Which tells me that when you use "tcp" as a filter, "ip" is not
> printed (unless you say "tcp and ip")

I cannot easily test with TCP because the MSS limits the size of TCP
segments, but I tested with UDP and ICMP traceroute, sending packets of
1500 octets over a link with the MTU set to 1460:

zenith:~# tcpdump -ntvi ppp0 udp and host y.y.y.y
tcpdump: listening on ppp0, link-type LINUX_SLL (Linux cooked), capture size 96 bytes
IP (tos 0x0, ttl   1, id 35011, offset 0, flags [+], proto: UDP (17), length: 1460) x.x.x.x.35007 > y.y.y.y.33438: UDP, length 1472
IP (tos 0x0, ttl   1, id 35011, offset 1440, flags [none], proto: UDP (17), length: 60) x.x.x.x > y.y.y.y: udp

zenith:~# tcpdump -ntvi ppp0 icmp and host y.y.y.y
tcpdump: listening on ppp0, link-type LINUX_SLL (Linux cooked), capture size 96 bytes
IP (tos 0x0, ttl   1, id 35013, offset 0, flags [+], proto: ICMP (1), length: 1460) x.x.x.x > y.y.y.y: ICMP echo request, id 35009, seq 4, length 1440
IP (tos 0x0, ttl   1, id 35013, offset 1440, flags [none], proto: ICMP (1), length: 60) x.x.x.x > y.y.y.y: icmp
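
The sizes in both traces follow from how IPv4 cuts up a datagram. A minimal sketch, assuming a 20-byte IPv4 header without options; the function name is made up:

```python
IP_HDR = 20  # IPv4 header without options (assumption)


def ip_fragments(total_len, mtu):
    """Return (offset, ip_total_length, more_fragments) for each fragment."""
    payload = total_len - IP_HDR
    # Every fragment but the last must carry a multiple of 8 payload bytes.
    per_fragment = (mtu - IP_HDR) // 8 * 8
    fragments = []
    offset = 0
    while offset < payload:
        size = min(per_fragment, payload - offset)
        more = offset + size < payload
        fragments.append((offset, size + IP_HDR, more))
        offset += size
    return fragments


for offset, length, more in ip_fragments(total_len=1500, mtu=1460):
    print("offset", offset, "length", length, "flags", "[+]" if more else "[none]")
```

For a 1500-octet datagram and an MTU of 1460 this gives a first fragment of length 1460 at offset 0 with the more-fragments flag set, and a second fragment of length 60 at offset 1440, matching the two lines tcpdump printed for each test.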

 
 
 
