NIC lockup in 2.4.17 (SMP/APIC/Intel 82557)

Post by Stephan von Krawczynski » Fri, 01 Feb 2002 21:40:12



On Thu, 31 Jan 2002 01:27:47 +0100


> Thanks for your reaction Stephan, but I seriously doubt the change below
> would fix the problem... Also, as the problem appears randomly, and
> usually after some uptime, I obviously can not know about it being fixed
> if I constantly upgrade the kernel. I'd rather wait and see if it
> appears again in time after I did a kernel upgrade, and not trying every
> -pre while there's no mention on the mailing list of such bug being
> fixed.

> Anyway, I just rebooted with 2.4.18-pre7-ac1, we'll see if it helps.

Hello Robert,

Well, I know the changes to the driver are rather ... small :-)
But on the other hand, I would not be all that sure that the bug is a
hundred percent related to the driver itself.
I run a working config with eepro100-driver, btw.

Regards,
Stephan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Post by Ben Greear » Sat, 02 Feb 2002 01:40:10


What does the rest of the hardware-config look like?  Is
the NIC attached to a 10bt hub?  Are you using PCI-Riser
cards?



--

President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear


Post by Robbert Kouprie » Sat, 02 Feb 2002 02:00:17


The box is an Abit BP6 with dual Celeron 433s and 192 MB RAM. No
PCI-Riser cards. It is connected at 100 Mbit full duplex to a 100
Mbit switch. APIC is enabled. No kind of power management is enabled.

Below is my /proc/interrupts, lspci -vx and dmesg output.

Regards,
- Robbert

radium:/# cat /proc/interrupts
           CPU0       CPU1
  0:    2944301    2940065    IO-APIC-edge  timer
  1:         39         41    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  4:       3074       3208    IO-APIC-edge  serial
  8:          2          0    IO-APIC-edge  rtc
 14:         20         29    IO-APIC-edge  ide0
 17:     627932     628166   IO-APIC-level  eth0
 18:     121201     121973   IO-APIC-level  ide2
 19:     522304     521928   IO-APIC-level  es1371
NMI:          0          0
LOC:    5884708    5884706
ERR:        170
MIS:          0
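
For readers who want to compare per-CPU counts in output like the listing above, here is a minimal Python sketch that parses /proc/interrupts-style text. The function name parse_interrupts and the embedded sample (a few rows copied from the listing above) are illustrative assumptions and were not part of the original post.

```python
from typing import Dict

# A few rows copied from the /proc/interrupts listing in this post.
SAMPLE = """\
           CPU0       CPU1
  0:    2944301    2940065    IO-APIC-edge  timer
 17:     627932     628166   IO-APIC-level  eth0
ERR:        170
"""


def parse_interrupts(text: str) -> Dict[str, dict]:
    """Return {label: {"counts": [...], "desc": "..."}} for each row."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row names one column per CPU
    table: Dict[str, dict] = {}
    for line in lines[1:]:
        if ":" not in line:
            continue
        label, rest = line.split(":", 1)
        fields = rest.split()
        counts = []
        # Rows such as ERR carry a single counter, so stop at the
        # first non-numeric field instead of assuming ncpu columns.
        for field in fields[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        table[label.strip()] = {
            "counts": counts,
            "desc": " ".join(fields[len(counts):]),
        }
    return table


if __name__ == "__main__":
    table = parse_interrupts(SAMPLE)
    for label, row in table.items():
        print(label, sum(row["counts"]), row["desc"])
```

On a live system the same function could be fed the contents of /proc/interrupts instead of SAMPLE; taking two snapshots some time apart would show whether the eth0 counter has stopped advancing.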

radium:/# lspci -vx
00:00.0 Host bridge: Intel Corp. 440BX/ZX - 82443BX/ZX Host bridge (rev 03)
        Flags: bus master, medium devsel, latency 32
        Memory at d0000000 (32-bit, prefetchable) [size=64M]
        Capabilities: [a0] AGP version 1.0
00: 86 80 90 71 06 00 10 22 03 00 00 06 00 20 00 00
10: 08 00 00 d0 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 a0 00 00 00 00 00 00 00 00 00 00 00

00:01.0 PCI bridge: Intel Corp. 440BX/ZX - 82443BX/ZX AGP bridge (rev 03) (prog-if 00 [Normal decode])
        Flags: bus master, 66Mhz, medium devsel, latency 64
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=32
        Memory behind bridge: d4000000-d7ffffff
        Prefetchable memory behind bridge: d8000000-d8ffffff
00: 86 80 91 71 07 01 20 02 03 00 04 06 00 40 01 00
10: 00 00 00 00 00 00 00 00 00 01 01 20 f0 00 a0 22
20: 00 d4 f0 d7 00 d8 f0 d8 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 88 00

00:07.0 ISA bridge: Intel Corp. 82371AB PIIX4 ISA (rev 02)
        Flags: bus master, medium devsel, latency 0
00: 86 80 10 71 0f 00 80 02 02 00 01 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:07.1 IDE interface: Intel Corp. 82371AB PIIX4 IDE (rev 01) (prog-if 80 [Master])
        Flags: bus master, medium devsel, latency 32
        I/O ports at f000 [size=16]
00: 86 80 11 71 05 00 80 02 01 80 01 01 00 20 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 01 f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:07.2 USB Controller: Intel Corp. 82371AB PIIX4 USB (rev 01) (prog-if 00 [UHCI])
        Flags: bus master, medium devsel, latency 32, IRQ 19
        I/O ports at c000 [size=32]
00: 86 80 12 71 05 00 80 02 01 00 03 0c 00 20 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 01 c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 0c 04 00 00

00:07.3 Bridge: Intel Corp. 82371AB PIIX4 ACPI (rev 02)
        Flags: medium devsel, IRQ 9
00: 86 80 13 71 03 00 80 02 02 00 80 06 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:09.0 Multimedia audio controller: Ensoniq ES1371 [AudioPCI-97] (rev 06)
        Subsystem: Ensoniq Creative Sound Blaster AudioPCI64V, AudioPCI128
        Flags: bus master, slow devsel, latency 32, IRQ 19
        I/O ports at c400 [size=64]
        Capabilities: [dc] Power Management version 1
00: 74 12 71 13 05 01 10 34 06 00 01 04 00 20 00 00
10: 01 c4 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 74 12 71 13
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0c 01 0c 80

00:0d.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 09)
        Subsystem: Intel Corp.: Unknown device 0011
        Flags: bus master, medium devsel, latency 32, IRQ 17
        Memory at da020000 (32-bit, non-prefetchable) [size=4K]
        I/O ports at c800 [size=64]
        Memory at da000000 (32-bit, non-prefetchable) [size=128K]
        Expansion ROM at <unassigned> [disabled] [size=1M]
        Capabilities: [dc] Power Management version 2
00: 86 80 29 12 07 00 90 02 09 00 00 02 08 20 00 00
10: 00 00 02 da 01 c8 00 00 00 00 00 da 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 11 00
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0a 01 08 38

00:13.0 Unknown mass storage controller: Triones Technologies, Inc. HPT366 / HPT370 (rev 01)
        Flags: bus master, medium devsel, latency 120, IRQ 18
        I/O ports at cc00 [size=8]
        I/O ports at d000 [size=4]
        I/O ports at d400 [size=256]
        Expansion ROM at <unassigned> [disabled] [size=128K]
00: 03 11 04 00 05 00 00 02 01 00 80 01 08 78 80 00
10: 01 cc 00 00 01 d0 00 00 00 00 00 00 00 00 00 00
20: 01 d4 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 0b 01 08 08

00:13.1 Unknown mass storage controller: Triones Technologies, Inc. HPT366 / HPT370 (rev 01)
        Flags: bus master, medium devsel, latency 120, IRQ 18
        I/O ports at d800 [size=8]
        I/O ports at dc00 [size=4]
        I/O ports at e000 [size=256]
00: 03 11 04 00 07 00 00 02 01 00 80 01 08 78 80 00
10: 01 d8 00 00 01 dc 00 00 00 00 00 00 00 00 00 00
20: 01 e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 0b 02 08 08

01:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200 AGP (rev 01) (prog-if 00 [VGA])
        Subsystem: Matrox Graphics, Inc. Millennium G200 AGP
        Flags: bus master, medium devsel, latency 32, IRQ 16
        Memory at d8000000 (32-bit, prefetchable) [size=16M]
        Memory at d4000000 (32-bit, non-prefetchable) [size=16K]
        Memory at d5000000 (32-bit, non-prefetchable) [size=8M]
        Expansion ROM at <unassigned> [disabled] [size=64K]
        Capabilities: [dc] Power Management version 1
        Capabilities: [f0] AGP version 1.0
00: 2b 10 21 05 07 00 90 02 01 00 00 03 08 20 00 00
10: 08 00 00 d8 00 00 00 d4 00 00 00 d5 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 2b 10 03 ff
30: 00 00 00 00 dc 00 00 00 00 00 00 00 09 01 10 20
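
The hex rows printed by lspci -vx are the first 64 bytes of each device's PCI configuration space, and the 16-bit status register sits at offset 0x06 (little-endian). As a rough sketch, assuming the standard PCI status bit layout, the following Python decodes that register from a pasted "00:" row. The helper names status_from_hex_row and decode_status are made up for this illustration.

```python
# Error-reporting bits of the standard PCI status register.
ERROR_BITS = {
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}

# Bits 10:9 encode DEVSEL timing.
DEVSEL = {0: "fast", 1: "medium", 2: "slow"}

# "00:" rows copied from the lspci -vx output in this post.
NIC_ROW = "00: 86 80 29 12 07 00 90 02 09 00 00 02 08 20 00 00"
ES1371_ROW = "00: 74 12 71 13 05 01 10 34 06 00 01 04 00 20 00 00"


def status_from_hex_row(row: str) -> int:
    """Extract the status register (bytes 0x06-0x07) from a '00:' row."""
    _, data = row.split(":", 1)
    raw = [int(byte, 16) for byte in data.split()]
    return raw[6] | (raw[7] << 8)


def decode_status(status: int) -> dict:
    """Break a PCI status word into the fields lspci summarises."""
    return {
        "capabilities_list": bool(status & (1 << 4)),
        "66mhz_capable": bool(status & (1 << 5)),
        "fast_back_to_back": bool(status & (1 << 7)),
        "devsel": DEVSEL.get((status >> 9) & 0x3, "reserved"),
        "errors": [name for bit, name in sorted(ERROR_BITS.items())
                   if status & (1 << bit)],
    }


if __name__ == "__main__":
    status = status_from_hex_row(NIC_ROW)
    print(hex(status), decode_status(status))
```

For the 82557 row this agrees with the "medium devsel" flag lspci printed, and none of the error bits are set in this snapshot. That only describes the moment the dump was taken; it says nothing about the state of the card during a lockup.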

radium:/# cat /var/log/dmesg
Linux version 2.4.18-pre7-ac1 (root@radium) (gcc version 2.95.4 (Debian prerelease)) #1 SMP Thu Jan 31 01:06:59 CET 2002
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000000c000000 (usable)
 BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
found SMP MP-table at 000f5ae0
hm, page 000f5000 reserved twice.
hm, page 000f6000 reserved twice.
hm, page 000f1000 reserved twice.
hm, page 000f2000 reserved twice.
On node 0 totalpages: 49152
zone(0): 4096 pages.
zone(1): 45056 pages.
zone(2): 0 pages.
Intel MultiProcessor Specification v1.4
    Virtual Wire compatibility mode.
OEM ID: OEM00000 Product ID: PROD00000000 APIC at: 0xFEE00000
Processor #0 Pentium(tm) Pro APIC version 17
Processor #1 Pentium(tm) Pro APIC version 17
I/O APIC #2 Version 17 at 0xFEC00000.
Processors: 2
Kernel command line: auto BOOT_IMAGE=Linux ro root=2101
Initializing CPU#0
Detected 434.324 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 865.07 BogoMIPS
Memory: 191360k/196608k available (1316k kernel code, 4864k reserved, 383k data, 240k init, 0k highmem)
Dentry-cache hash table entries: 32768 (order: 6, 262144 bytes)
Inode-cache hash table entries: 16384 (order: 5, 131072 bytes)
Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
Buffer-cache hash table entries: 16384 (order: 4, 65536 bytes)
Page-cache hash table entries: 65536 (order: 6, 262144 bytes)
CPU: Before vendor init, caps: 0183fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 128K
CPU: After vendor init, caps: 0183fbff 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU:     After generic, caps: 0183fbff 00000000 00000000 00000000
CPU:             Common caps: 0183fbff 00000000 00000000 00000000
Enabling fast FPU save and restore... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.40 (20010327) Richard Gooch (rgo...@atnf.csiro.au)
mtrr: detected mtrr type: Intel
CPU: Before vendor init, caps: 0183fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 128K
CPU: After vendor init, caps: 0183fbff 00000000 00000000 00000000
Intel machine check reporting enabled on CPU#0.
CPU:     After generic, caps: 0183fbff 00000000 00000000 00000000
CPU:             Common caps: 0183fbff 00000000 00000000 00000000
CPU0: Intel Celeron (Mendocino) stepping 05
per-CPU timeslice cutoff: 365.86 usecs.
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Booting processor 1/1 eip 2000
Initializing CPU#1
masked ExtINT on CPU#1
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 868.35 BogoMIPS
CPU: Before vendor init, caps: 0183fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 128K
CPU: After vendor init, caps: 0183fbff 00000000 00000000 00000000
Intel machine check reporting enabled on CPU#1.
CPU:     After generic, caps: 0183fbff 00000000 00000000 00000000
CPU:             Common caps: 0183fbff 00000000 00000000 00000000
CPU1: Intel Celeron (Mendocino) stepping 05
Total of 2 processors activated (1733.42 BogoMIPS).
ENABLING IO-APIC IRQs
Setting 2 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 2 ... ok.
init IO_APIC IRQs
 IO-APIC (apicid-pin) 2-0, 2-9, 2-10, 2-11, 2-12, 2-20, 2-21, 2-22, 2-23 not connected.
..TIMER: ...


Post by Ben Greear » Sat, 02 Feb 2002 03:00:23



> The box is an Abit BP6 with Dual Celerons 433 and 192 Mb RAM. No
> PCI-Riser cards. It is connected at 100 Mbit full duplex to a 100
> Mbit switch. APIC is enabled. No kind of power management is enabled.

The only lockup problems I have run into are connecting some eepro nics to
a 10bt hub, and using (cheap arsed, it appears) PCI riser cards.  I have
heard of some SMP related issues, but nothing concrete, and I don't
have any SMP systems personally.  You could try the e100, but I have
no idea if it will be better or worse for your particular problem.

--

President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear


Post by Robbert Kouprie » Sat, 02 Feb 2002 04:20:19


I experienced the 10 Mbit half duplex problems with this card too, but
they seem to have gone away after a bugfix from Alan Cox somewhere in
2.4. Some time later I upgraded to 100 Mbit full duplex and never
experienced problems again until 2.4.17.

I think I'm going to try some older kernels and look through the diffs if I
have time.

- Robbert


Post by Edward S. Marshall » Sun, 03 Feb 2002 03:00:16



> The only lockup problems I have run into are connecting some eepro nics to
> a 10bt hub, and using (cheap arsed, it appears) PCI riser cards.  I have
> heard of some SMP related issues, but nothing concrete, and I don't
> have any SMP systems personally.  You could try the e100, but I have
> no idea if it will be better or worse for your particular problem.

I was running into the same problems here; SMP system w/PCI riser card
(HP NetServer LPr), connected to a 10/100 switch. I'd get
"wait_for_command_timeout" errors all the time under moderate network
load. Switching to the e100 driver didn't help in the slightest.
Eventually, I'd experience a complete system lockup.

Replacing the card with a 3c59x-based card put the machine back in
service (I've completely written eepro100s off as viable cards now),
although I still saw occasional PCI-related issues. Specifically:

Jan 23 10:11:37 x kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
Jan 23 10:11:37 x kernel: eth0: Host error, FIFO diagnostic register 0000.
Jan 23 10:11:37 x kernel: eth0: PCI bus error, bus status 80000020
Jan 23 10:11:37 x kernel: You probably have a hardware problem with your RAM chips
Jan 23 10:11:37 x kernel: eth0: Host error, FIFO diagnostic register 0000.
Jan 23 10:11:37 x kernel: eth0: PCI bus error, bus status 80000020

The last two messages will repeat indefinitely, usually with a hit to
the disk for each pair of log entries (resulting in a very distinctive
drive grinding). Memory problems don't seem to be the issue; with a
fairly extensive run of memtest86, everything came back clean.
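
To put a number on how often a message pair like this repeats, a syslog excerpt can be tallied per message. The following is only a sketch: the regular expression assumes the classic "Mon DD HH:MM:SS host kernel: text" syslog layout seen in the excerpt above, and the name tally_kernel_messages is invented for the example.

```python
import re
from collections import Counter

# Assumed layout: "Jan 23 10:11:37 host kernel: message text"
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) kernel: (.*)$")

# The six log lines quoted in this post, with mail line wrapping undone.
SAMPLE = """\
Jan 23 10:11:37 x kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
Jan 23 10:11:37 x kernel: eth0: Host error, FIFO diagnostic register 0000.
Jan 23 10:11:37 x kernel: eth0: PCI bus error, bus status 80000020
Jan 23 10:11:37 x kernel: You probably have a hardware problem with your RAM chips
Jan 23 10:11:37 x kernel: eth0: Host error, FIFO diagnostic register 0000.
Jan 23 10:11:37 x kernel: eth0: PCI bus error, bus status 80000020
"""


def tally_kernel_messages(text: str) -> Counter:
    """Count how many times each distinct kernel message appears."""
    counts: Counter = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(3)] += 1
    return counts


if __name__ == "__main__":
    for message, count in tally_kernel_messages(SAMPLE).most_common():
        print(count, message)
```

Run against a full log file rather than SAMPLE, the two eth0 messages would be expected to dominate the tally if they really repeat indefinitely.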

Taking a few minutes to try and rectify the situation, I started
shutting down services and manually unloading modules to see what was
causing the problem. Unloading usbcore did the trick:

Jan 26 18:41:24 x kernel: eth0: Host error, FIFO diagnostic register 0000.
Jan 26 18:41:24 x kernel: eth0: PCI bus error, bus status 80000020
Jan 26 18:41:24 x kernel: eth0: Too much work in interrupt, status e003.
Jan 26 18:41:24 x kernel: usb.c: USB disconnect on device 1
Jan 26 18:41:24 x kernel: USB bus 1 deregistered

I've rebooted the machine since then, but have always unloaded usb-uhci
and usbcore after booting. The issue hasn't cropped up again, although
it happened every couple of days previously.

The kernel in question is Red Hat's kernel-smp-2.4.9-21 build.

--

http://esm.logic.net/
-------------------------------------------------------------------------------
[                  Felix qui potuit rerum cognoscere causas.                  ]


Post by <n.. » Sun, 03 Feb 2002 03:20:14


Odd, I've got an HP LPr with an "Ethernet controller: Intel Corporation
82557 [Ethernet Pro 100] (rev 8)" on the riser.  Works fine for me under

version 2.95.4 20010319 (Debian prerelease)) #1 SMP Sun May 27 18:32:54
EST 2001.  If you'd like me to test a workload or similar, let me know; the
system is relatively low on memory, though.
        Nick




Post by Ken Brownfield » Sun, 03 Feb 2002 04:50:11


I've had LPr, LP1000r, LP2000r, and LH6000s in *heavy* production for
two years straight with nary a whimper from the eepro100, e100, or e1000
drivers.*  This is SMP with 2-6 procs, 256MB-4GB RAM, all 2.2 and 2.4
kernels under RH6.2.  On 100/1000, never 10Mb though.

Not to say that the HPs smell like roses, but I would highly suspect bad
hardware or a suspect BIOS/PCB revision, etc. in this case.

Just my US$0.02,
--
Ken.

* besides a quirky arp issue on boot that seemed to go away on its own
and wasn't card-specific.  And the long-standing I/O APIC issues. ;)


Post by Alan Cox » Sun, 03 Feb 2002 08:50:09


> Jan 23 10:11:37 x kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
> Jan 23 10:11:37 x kernel: eth0: Host error, FIFO diagnostic register 0000.
> Jan 23 10:11:37 x kernel: eth0: PCI bus error, bus status 80000020
> Jan 23 10:11:37 x kernel: You probably have a hardware problem with your RAM chips
> Jan 23 10:11:37 x kernel: eth0: Host error, FIFO diagnostic register 0000.
> Jan 23 10:11:37 x kernel: eth0: PCI bus error, bus status 80000020

Your machine took an NMI and a PCI bus diagnostic. That generally points
hard to a bus problem.