Info: NAPI performance at "low" loads

Post by Manfred Spraul » Thu, 19 Sep 2002 05:00:13



NAPI network drivers mask the rx interrupts in their interrupt handler,
and reenable them in dev->poll(). In the worst case, that happens for
every packet. I've tried to measure the overhead of that operation.
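
Roughly, the pattern looks like this (a sketch only, using the 2.5-era
dev->poll() interface; the nic_* helpers and the private struct are
made-up names, not any real driver):

/* hw interrupt handler: mask rx interrupts on the nic and defer the
 * actual rx processing to dev->poll() */
static void nic_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        struct net_device *dev = dev_id;
        struct nic_priv *np = dev->priv;

        nic_mask_rx_irqs(np);           /* PIO: disable rx interrupts */
        netif_rx_schedule(dev);         /* put dev on the poll list */
}

/* dev->poll(): drain the rx ring, then re-enable rx interrupts */
static int nic_poll(struct net_device *dev, int *budget)
{
        struct nic_priv *np = dev->priv;
        int limit = *budget < dev->quota ? *budget : dev->quota;
        int done = nic_rx(dev, limit);

        *budget -= done;
        dev->quota -= done;

        if (done < limit) {             /* rx ring is empty */
                netif_rx_complete(dev); /* remove dev from the poll list */
                nic_unmask_rx_irqs(np); /* PIO: re-enable rx interrupts */
                return 0;
        }
        return 1;                       /* more packets pending, poll again */
}

At low packet rates that means one mask and one unmask per received
packet, which is the per-packet cost measured below.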

The cpu time needed to receive 50k packets/sec:

without NAPI:   53.7 %
with NAPI:      59.9 %

50k packets/sec is the limit for NAPI; at higher packet rates the forced
mitigation kicks in and every interrupt receives more than one packet.

The cpu time was measured by busy-looping in user space; the numbers
should be accurate to within 1 %.
Summary: with my setup, the overhead is around 11 % (59.9 % vs. 53.7 %,
i.e. (59.9 - 53.7) / 53.7, roughly 11.5 % more cpu time per packet).

Could someone try to reproduce my results?

Sender:
  # sendpkt <target ip> 1 <10..50, to get a good packet rate>

Receiver:
  $ loadtest

Please disable any interrupt mitigation features of your nic, otherwise
the mitigation will dramatically change the cpu time needed.
The sender sends ICMP echo reply packets, evenly spaced by
"memset(,,n*512)" between the syscalls.
The cpu load was measured with a user space app that calls
"memset(,,16384)" in a tight loop, and reports the number of loops per
second.
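
The measuring loop boils down to something like this (a sketch: the 16k
buffer matches the description above, but the once-a-second reporting
via time(2) is an assumption, not the exact loadtest code):

/* user space cpu load probe: memset a buffer in a tight loop and
 * report loops/sec; fewer loops/sec means the kernel used more cpu */
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
        static char buf[16384];
        unsigned long loops = 0;
        time_t start = time(NULL);

        for (;;) {
                memset(buf, 0, sizeof(buf));
                loops++;
                if ((loops & 0x3ff) == 0) {     /* check the clock now and then */
                        time_t now = time(NULL);
                        if (now - start >= 1) {
                                printf("%lu loops/sec\n",
                                       loops / (unsigned long)(now - start));
                                loops = 0;
                                start = now;
                        }
                }
        }
}

Comparing loops/sec on an idle box against loops/sec while receiving
packets gives the cpu percentages above.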

I've used a patched tulip driver; the current NAPI driver contains a
loop that severely slows down the nic under such loads.

The patch and my test apps are at

http://www.q-ag.de/~manfred/loadtest

hardware setup:
        Duron 700, VIA KT 133
                no IO APIC, i.e. slow 8259 XT PIC.
        Accton tulip clone, ADMtek comet.
        crossover cable
        Sender: Celeron 1.13 GHz, rtl8139

--
        Manfred


Info: NAPI performance at "low" loads

Post by David S. Miller » Thu, 19 Sep 2002 06:20:06



   Date: Tue, 17 Sep 2002 21:53:03 +0200

   Receiver:
     $ loadtest

This appears to be x86 only, sorry I can't test this out for you as
all my boxes are sparc64.

I was actually eager to try your tests out here.

Do you really need to use x86 instructions to do what you
are doing?  There are portable pthread mutexes available.

Info: NAPI performance at "low" loads

Post by Andrew Morton » Thu, 19 Sep 2002 06:40:07




>    Date: Tue, 17 Sep 2002 21:53:03 +0200

>    Receiver:
>      $ loadtest

> This appears to be x86 only, sorry I can't test this out for you as
> all my boxes are sparc64.

> I was actually eager to try your tests out here.

> Do you really need to use x86 instructions to do what you
> are doing?  There are portable pthread mutexes available.

There is a similar background loadtester at
http://www.zip.com.au/~akpm/linux/#zc .

It's fairly fancy - I wrote it for measuring networking
efficiency.  It doesn't seem to have any PCisms....

(I measured a similar regression using an ancient NAPIfied
3c59x a long time ago).

Info: NAPI performance at "low" loads

Post by David S. Miller » Thu, 19 Sep 2002 06:40:09



   Date: Tue, 17 Sep 2002 14:32:09 -0700

   There is a similar background loadtester at
   http://www.zip.com.au/~akpm/linux/#zc .

   It's fairly fancy - I wrote it for measuring networking
   efficiency.  It doesn't seem to have any PCisms....

Thanks, I'll check it out, but meanwhile I hacked up sparc-specific
assembler for manfred's code :-)

   (I measured a similar regression using an ancient NAPIfied
   3c59x a long time ago).

Well, it is due to the same problems manfred saw initially,
namely just a crappy or buggy NAPI driver implementation. :-)

Info: NAPI performance at "low" loads

Post by Andrew Morton » Thu, 19 Sep 2002 06:50:07




>    Date: Tue, 17 Sep 2002 14:32:09 -0700

>    There is a similar background loadtester at
>    http://www.zip.com.au/~akpm/linux/#zc .

>    It's fairly fancy - I wrote it for measuring networking
>    efficiency.  It doesn't seem to have any PCisms....

> Thanks, I'll check it out, but meanwhile I hacked up sparc-specific
> assembler for manfred's code :-)

>    (I measured a similar regression using an ancient NAPIfied
>    3c59x a long time ago).

> Well, it is due to the same problems manfred saw initially,
> namely just a crappy or buggy NAPI driver implementation. :-)

It was due to additional inl()'s and outl()'s in the driver fastpath.

Testcase was netperf Tx and Rx.  Just TCP over 100bT. AFAIK, this overhead
is intrinsic to NAPI.  Not to say that its costs outweigh its benefits,
but it's just there.

If someone wants to point me at all the bits and pieces to get a
NAPIfied 3c59x working on 2.5.current I'll retest, and generate
some instruction-level oprofiles.

Info: NAPI performance at "low" loads

Post by David S. Miller » Thu, 19 Sep 2002 07:00:09



   Date: Tue, 17 Sep 2002 14:45:08 -0700


   > Well, it is due to the same problems manfred saw initially,
   > namely just a crappy or buggy NAPI driver implementation. :-)

   It was due to additional inl()'s and outl()'s in the driver fastpath.

How many?  Did the implementation cache the register value in a
software state word or did it read the register each time to write
the IRQ masking bits back?

It is issues like this that make me say "crappy or buggy NAPI
implementation"

Any driver should be able to get the NAPI overhead to max out at
2 PIOs per packet.
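
I.e. something like this (register, bit mask and private struct names
are made up for illustration, not taken from a particular driver):

/* read-modify-write of the mask register: two PIOs per mask or unmask */
static void nic_mask_rx_slow(long ioaddr)
{
        u32 mask = inl(ioaddr + NIC_INTR_MASK);                 /* PIO read  */
        outl(mask & ~NIC_RX_IRQ_BITS, ioaddr + NIC_INTR_MASK);  /* PIO write */
}

/* mask bits cached in a software state word: one PIO per mask or unmask */
static void nic_mask_rx_fast(struct nic_priv *np, long ioaddr)
{
        np->intr_mask &= ~NIC_RX_IRQ_BITS;              /* memory only */
        outl(np->intr_mask, ioaddr + NIC_INTR_MASK);    /* single PIO  */
}

With the cached variant, masking in the interrupt handler and unmasking
in dev->poll() is exactly that 2-extra-PIOs worst case.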

And if the performance is really concerning, perhaps add an option to
use MEM space in the 3c59x driver too; IO instructions have a constant
cost regardless of how fast the PCI bus is :-)

Info: NAPI performance at "low" loads

Post by Jeff Garzik » Thu, 19 Sep 2002 07:00:11



> Any driver should be able to get the NAPI overhead to max out at
> 2 PIOs per packet.

Just to pick nits... my example went from 2 or 3 IOs [depending on the
presence/absence of a work loop] to 6 IOs.

Feel free to re-read my message and point out where an IO can be
eliminated...

        Jeff


Info: NAPI performance at "low" loads

Post by David S. Miller » Thu, 19 Sep 2002 07:00:13



   Date: Tue, 17 Sep 2002 17:54:42 -0400


   > Any driver should be able to get the NAPI overhead to max out at
   > 2 PIOs per packet.

   Just to pick nits... my example went from 2 or 3 IOs [depending on the
   presence/absence of a work loop] to 6 IOs.

I mean "2 extra PIOs" not "2 total PIOs".

I think it's doable for just about every driver, even tg3 with its
weird semaphore scheme takes 2 extra PIOs worst case with NAPI.

The semaphore I have to ACK at hw IRQ time anyway, and since
I keep a software copy of the IRQ masking register, mask and unmask
are each one PIO.

Info: NAPI performance at "low" loads

Post by Andrew Morton » Thu, 19 Sep 2002 07:10:05




>    Date: Tue, 17 Sep 2002 14:45:08 -0700


>    > Well, it is due to the same problems manfred saw initially,
>    > namely just a crappy or buggy NAPI driver implementation. :-)

>    It was due to additional inl()'s and outl()'s in the driver fastpath.

> How many?  Did the implementation cache the register value in a
> software state word or did it read the register each time to write
> the IRQ masking bits back?

Looks like it cached it:

-    outw(SetIntrEnb | (inw(ioaddr + 10) & ~StatsFull), ioaddr + EL3_CMD);
     vp->intr_enable &= ~StatsFull;
+    outw(vp->intr_enable, ioaddr + EL3_CMD);

> It is issues like this that make me say "crappy or buggy NAPI
> implementation"

> Any driver should be able to get the NAPI overhead to max out at
> 2 PIOs per packet.

> And if the performance is really concerning, perhaps add an option to
> use MEM space in the 3c59x driver too; IO instructions have a constant
> cost regardless of how fast the PCI bus is :-)

Yup.  But deltas are interesting.

Info: NAPI performance at "low" loads

Post by jamal » Thu, 19 Sep 2002 10:10:06


Manfred, could you please turn on MMIO (you can select it
via kernel config) and see what the new difference looks like?

I am not so sure with that 6% difference there is no other bug lurking
there; 6% seems too large for an extra two PCI transactions per packet.
If someone could test a different NIC this would be great.
Actually what would be even better is to go something like 20 kpps,
50 kpps, 80 kpps, 100 kpps and 140 kpps and see what we get.

cheers,
jamal


Info: NAPI performance at "low" loads

Post by David S. Miller » Thu, 19 Sep 2002 10:20:05



   Date: Tue, 17 Sep 2002 20:57:58 -0400 (EDT)

   I am not so sure with that 6% difference there is no other bug lurking
   there; 6% seems too large for an extra two PCI transactions per packet.

{in,out}{b,w,l}() operations have a fixed timing, therefore his
results don't sound that far off.

It is also one of the reasons I suspect Andrew saw such bad results
with 3c59x, but probably that is not the only reason.

Info: NAPI performance at "low" loads

Post by Jeff Garzik » Thu, 19 Sep 2002 11:20:04




>    Date: Tue, 17 Sep 2002 17:54:42 -0400


>    > Any driver should be able to get the NAPI overhead to max out at
>    > 2 PIOs per packet.

>    Just to pick nits... my example went from 2 or 3 IOs [depending on the
>    presence/absence of a work loop] to 6 IOs.

> I mean "2 extra PIOs" not "2 total PIOs".

> I think it's doable for just about every driver, even tg3 with its
> weird semaphore scheme takes 2 extra PIOs worst case with NAPI.

> The semaphore I have to ACK at hw IRQ time anyway, and since
> I keep a software copy of the IRQ masking register, mask and unmask
> are each one PIO.

You're looking at at least one extra get-irq-status too, at least in the
classical 10/100 drivers I'm used to seeing...

        Jeff


Info: NAPI performance at "low" loads

Post by David S. Miller » Thu, 19 Sep 2002 11:20:05



   Date: Tue, 17 Sep 2002 22:11:14 -0400

   You're looking at at least one extra get-irq-status too, at least in the
   classical 10/100 drivers I'm used to seeing...

How so?  The number of those done in the e1000 NAPI code is the same
(read register until no interesting status bits remain set, same as
pre-NAPI e1000 driver).

For tg3 it's a cheap memory read from the status block not a PIO.

Info: NAPI performance at "low" loads

Post by Andrew Morton » Thu, 19 Sep 2002 11:20:06




>    Date: Tue, 17 Sep 2002 20:57:58 -0400 (EDT)

>    I am not so sure with that 6% difference there is no other bug lurking
>    there; 6% seems too large for an extra two PCI transactions per packet.

> {in,out}{b,w,l}() operations have a fixed timing, therefore his
> results don't sound that far off.

> It is also one of the reasons I suspect Andrew saw such bad results
> with 3c59x, but probably that is not the only reason.

They weren't "very bad", iirc.  Maybe a 5% increase in CPU load.

It was all a long time ago.  Will retest if someone sends URLs.

Info: NAPI performance at "low" loads

Post by Jeff Garzik » Thu, 19 Sep 2002 11:40:04




>    Date: Tue, 17 Sep 2002 22:11:14 -0400

>    You're looking at at least one extra get-irq-status too, at least in the
>    classical 10/100 drivers I'm used to seeing...

> How so?  The number of those done in the e1000 NAPI code is the same
> (read register until no interesting status bits remain set, same as
> pre-NAPI e1000 driver).

> For tg3 it's a cheap memory read from the status block not a PIO.

Non-NAPI:

        get-irq-stat
        ack-irq
        get-irq-stat (omit, if no work loop)

NAPI:

        get-irq-stat
        ack-all-but-rx-irq
        mask-rx-irqs
        get-irq-stat (omit, if work loop)
        ...
        ack-rx-irqs
        get-irq-stat
        unmask-rx-irqs

This is the low load / low latency case only.  The number of IOs
decreases at higher loads [obviously :)]
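
Mapped onto the usual handler/poll split, the NAPI column looks roughly
like this (a sketch with made-up helper names, not the actual 3c59x or
tulip code):

/* NAPI, low load: three IOs in the hw interrupt handler ... */
static void nic_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        struct net_device *dev = dev_id;
        long ioaddr = dev->base_addr;
        u32 status = nic_get_irq_status(ioaddr);        /* IO 1: get-irq-stat */

        nic_ack_irqs(ioaddr, status & ~RX_IRQS);        /* IO 2: ack-all-but-rx-irq */
        if (status & RX_IRQS) {
                nic_mask_rx_irqs(ioaddr);               /* IO 3: mask-rx-irqs */
                netif_rx_schedule(dev);
        }
}

/* ... and three more in dev->poll() before going back to interrupt mode */
static int nic_poll(struct net_device *dev, int *budget)
{
        long ioaddr = dev->base_addr;
        int limit = *budget < dev->quota ? *budget : dev->quota;
        int done = nic_rx(dev, limit);

        *budget -= done;
        dev->quota -= done;

        if (done < limit) {                             /* rx ring drained */
                nic_ack_irqs(ioaddr, RX_IRQS);          /* IO 4: ack-rx-irqs */
                if (nic_get_irq_status(ioaddr) & RX_IRQS)   /* IO 5: get-irq-stat */
                        return 1;                       /* new packet raced in, poll again */
                netif_rx_complete(dev);
                nic_unmask_rx_irqs(ioaddr);             /* IO 6: unmask-rx-irqs */
                return 0;
        }
        return 1;
}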
