IRQ/DMA and hardware hell with 2.4.x kernel

IRQ/DMA and hardware hell with 2.4.x kernel

Post by Mikael Svenso » Thu, 05 Dec 2002 18:35:28



I have a PIII ABIT BX-133 board with a 3Com 905B NIC, an onboard
HighPoint controller and an additional HighPoint RocketRAID controller.

I have tried both the 2.4.19-ac4 and the latest 2.4.20 kernels, and in
both cases the machine hangs after it has been running for about 6-12
hours. The machine is under high network load 24/7.

These are the messages I've gotten with different PCI slot and kernel
configurations. I have also tried using an eepro100 NIC instead.

eth0: IRQ 5 is physically blocked!
-
ide_dmaproc: chipset supported ide_dma_lostirq func only
hdd: lost interrupt
-
hdh: dma_timer_expiry: dma status == 0x44
hdh: lost interrupt

So clearly there is some IRQ/DMA problem somewhere which only shows up
when the machine is under heavy disk/network load.

Any pointers on fixing the matter are appreciated. I can provide
pci/interrupt info if necessary.
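
For anyone who wants it without asking, that info comes from the usual
places; a rough sketch (hdd is just one of the drives as an example, and
the 2.4 /proc/ide layout is assumed):

    cat /proc/interrupts          # which IRQ each device actually ended up on
    lspci -v                      # PCI slots, IRQ routing and latency timers
    cat /proc/ide/hdd/settings    # current DMA/PIO settings for one drive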

Regards,
Mikael Svenson

 
 
 

IRQ/DMA and hardware hell with 2.4.x kernel

Post by alex » Fri, 06 Dec 2002 05:38:54


Mikael Svenson shorted the keyboard with drool and:

Quote:> I have tried both the 2.4.19-ac4 and the latest 2.4.20 kernels, and in
> both cases the machine hangs after it has been running for about 6-12
> hours. The machine is under high network load 24/7.

Has this machine run stable previously?  Was it up for months then suddenly
went unstable? Or have you just built it, and this is the smoke test?

Quote:> These are the messages I've gotten with different PCI slot and kernel
> configurations. I have also tried using an eepro100 NIC instead.

> eth0: IRQ 5 is physically blocked!

So does it say that no matter what NIC you use? Have you looked at BIOS
settings?

Quote:> ide_dmaproc: chipset supported ide_dma_lostirq func only
> hdd: lost interrupt
> -
> hdh: dma_timer_expiry: dma status == 0x44
> hdh: lost interrupt

Yikes! Ouch! Have you tried tweaking things with hdparm? Is this a stupid
machine that only works properly with ACPI compiled in [like my Sony
laptop]? What do the devices hdd and hdh have in common [apart from being
IDE drives...]? Are they both next to a hot piece of hardware, or do
they share a power connector?
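
By "tweaking" I mean things roughly along these lines (only a sketch;
read hdparm(8) before pointing it at drives with data you care about):

    hdparm -i /dev/hdd        # what modes the drive claims to support
    hdparm -d1 -X66 /dev/hdd  # keep DMA on, but drop to UDMA2 instead of UDMA5
    hdparm -u0 /dev/hdd       # don't unmask other IRQs during disk I/O (the safer setting)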

Quote:> So clearly there is some IRQ/DMA problem somewhere which only shows up
> when the machine is under heavy disk/network load.

But you said the network was heavily loaded all the time, so I guess that
means it's never working ;-)

Quote:> Any pointers on fixing the matter are appreciated. I can provide
> pci/interrupt info if necessary.

Seeing as it only shows up when heavily loaded, could it be a heat problem
inside the case? Maybe a thermostatic fan comes on and draws too much power
for the system? You could try underclocking it...

alexd

--
http://www.troffasky.pwp.blueyonder.co.uk/pix/
AIM:troffasky
Knives and guns are dangerous,
They don't want to play with us

 
 
 

IRQ/DMA and hardware hell with 2.4.x kernel

Post by Bill Marcu » Fri, 06 Dec 2002 12:13:26


On Wed, 04 Dec 2002 10:35:28 +0100, Mikael Svenson wrote:

Quote:> I have a PIII ABIT BX-133 board with a 3Com 905B NIC, an onboard
> HighPoint controller and an additional HighPoint RocketRAID controller.

> I have tried both the 2.4.19-ac4 and the latest 2.4.20 kernels, and in
> both cases the machine hangs after it has been running for about 6-12
> hours. The machine is under high network load 24/7.

...>
> So clearly there is some IRQ/DMA problem somewhere which only shows up
> when the machine is under heavy disk/network load.

> Any pointers on fixing the matter are appreciated. I can provide
> pci/interrupt info if necessary.

Could it be a heat problem?
 
 
 

IRQ/DMA and hardware hell with 2.4.x kernel

Post by Mikael Svenso » Sat, 07 Dec 2002 18:42:49


It was running stable with 3 internal disks up until I added a HighPoint
controller and 5 more disks. Then I started getting problems, so you
could say it started from scratch.

I have fans all around the disks, so heat shouldn't be a problem.

My current setup is as follows:

3 drives on ide0+1 in software raid0
2 drives on onboard hpt controller and 5 drives on plugin hpt controller
in one LVM volume.

When it crashed last night, I did the following after rebooting: I ran
hdparm -X66 on all drives so they run UDMA2 instead of UDMA5, and I also
used setpci to make the devices do more bursts.
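
Roughly what I ran, for the record (the bus address and latency value
here are only examples; the real ones come from the lspci output on this
box):

    hdparm -X66 /dev/hda                 # repeated for each drive: UDMA2 instead of UDMA5
    setpci -s 00:0b.0 latency_timer=40   # bump the PCI latency timer on the HPT controller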

It seems the machine has survived the night and has been up a total of 24
hours as I write this. If this solution works, I would say the problem
was the disks loading the PCI bus too much.

Whether this is a fault of my ABIT BX-133 RAID mainboard or of Linux I
really can't say, since I don't know enough about the problem.

But..... disks might not be as warm when running UDMA2(?)

-m


> Mikael Svenson shorted the keyboard with drool and:

> > I have tried both the 2.4.19-ac4 and the latest 2.4.20 kernels, and in
> > both cases the machine hangs after it has been running for about 6-12
> > hours. The machine is under high network load 24/7.

> Has this machine run stable previously?  Was it up for months then suddenly
> went unstable? Or have you just built it, and this is the smoke test?

> > These are the messages I've gotten with different PCI slot and kernel
> > configurations. I have also tried using an eepro100 NIC instead.

> > eth0: IRQ 5 is physically blocked!

> So does it say that no matter what NIC you use? Have you looked at BIOS
> settings?

> > ide_dmaproc: chipset supported ide_dma_lostirq func only
> > hdd: lost interrupt
> > -
> > hdh: dma_timer_expiry: dma status == 0x44
> > hdh: lost interrupt

> Yikes! Ouch! Have you tried tweaking things with hdparm? Is this a stupid
> machine that only works properly with ACPI compiled in [like my Sony
> laptop]? What do the devices hdd and hdh have in common [apart from being
> IDE drives...]? Are they both next to a hot piece of hardware, or do
> they share a power connector?

> > So clearly there is some IRQ/DMA problem somewhere which only shows up
> > when the machine is under heavy disk/network load.

> But you said the network was heavily loaded all the time, so I guess that
> means it's never working ;-)

> > Any pointers on fixing the matter are appreciated. I can provide
> > pci/interrupt info if necessary.

> Seeing as it only shows up when heavily loaded, could it be a heat problem
> inside the case? Maybe a thermostatic fan comes on and draws too much power
> for the system? You could try underclocking it...

> alexd

> --
> http://www.troffasky.pwp.blueyonder.co.uk/pix/
> AIM:troffasky
> Knives and guns are dangerous,
> They don't want to play with us

 
 
 

IRQ/DMA and hardware hell with 2.4.x kernel

Post by John-Paul Stewar » Sun, 08 Dec 2002 01:00:31



> It was running stable with 3 internal disks up until I added a HighPoint
> controller and 5 more disks. Then I started getting problems, so you
> could say it started from scratch.

> I have fans all around the disks, so heat shouldn't be a problem.

Yikes...3 disks...5 MORE...that's 8 disks plus all of your
fans.  Have you got a beefy enough power supply to handle
all of that?  If not, poor power could be playing a part in
your troubles.
 
 
 

IRQ/DMA and hardware hell with 2.4.x kernel

Post by alex » Sun, 08 Dec 2002 04:16:33


Mikael Svenson shorted the keyboard with drool and:

Quote:> It was running stable with 3 internal disks up until I added a HighPoint
> controller and 5 more disks. Then I started getting problems, so you
> could say it started from scratch.

Well that sounds like your problem right there! It sounds like a PSU
problem. tomshardware did a test of various PSUs recently, and concluded
that they don't all live up to their advertised rating. Investigate a new
PSU.

alexd

--
http://www.troffasky.pwp.blueyonder.co.uk/pix/
AIM:troffasky
Knives and guns are dangerous,
They don't want to play with us

 
 
 

IRQ/DMA and hardware hell with 2.4.x kernel

Post by Mikael Svenso » Wed, 11 Dec 2002 02:26:32


Right now I'm running 5 disks on one PSU and 6 on the one connected to
the mainboard. Sorry I didn't mention this right away.

After setting my disks to ATA66/UDMA2 and setting the PCI latency, the
machine stayed up for 3 days instead of just 6-12 hours. So that did
have some impact on stability.

According to Western Digital's pages, the 120GB drives use 19W on spinup,
7.75W on read/write/idle and 12W on seek. The 100GB drives use 14W on
seek. Since the machine starts fine, that should be:

4 x 14W + 2 x 12W = 80W for the disks during operation.

A fan uses a little under 2W, so say 10W tops for the fans.

That would be 90W in total. I don't know how much the disk controllers
or the CPU use. If I'm not mistaken the PSU delivers around 230W (I
don't have day-to-day access to the machine). The other PSU, for the
other 5 disks, has the same output.

The machine is running an FSB of 100MHz, with the PCI bus at 1/3 of that.

All this said, it might just be a PSU thing if my calculations are
totally off, or perhaps a heat issue on the bridge.

I guess I might try UDMA1 to see how that goes. Even though that's not
very fast, it should still be faster than the internet line the machine
is hooked up to.
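
(If I'm reading the transfer-mode numbering right, that would be
hdparm -X65 on these drives.)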

I appreciate the input so far :)

Regards,
Mikael



>>It was running stable with 3 internal disks up until I added a HighPoint
>>controller and 5 more disks. Then I started getting problems, so you
>>could say it started from scratch.

>>I have fans all around the disks, so heat shouldn't be a problem.

> Yikes...3 disks...5 MORE...that's 8 disks plus all of your
> fans.  Have you got a beefy enough power supply to handle
> all of that?  If not, poor power could be playing a part in
> your troubles.

 
 
 

IRQ/DMA and hardware hell with 2.4.x kernel

Post by John-Paul Stewar » Wed, 11 Dec 2002 03:22:11



> Right now I'm running 5 disks on one PSU and 6 on the one connected to
> the mainboard. Sorry I didn't mention this right away.

> After setting my disks to ATA66/UDMA2 and setting the PCI latency, the
> machine stayed up for 3 days instead of just 6-12 hours. So that did
> have some impact on stability.

> According to Western Digital's pages, the 120GB drives use 19W on spinup,
> 7.75W on read/write/idle and 12W on seek. The 100GB drives use 14W on
> seek. Since the machine starts fine, that should be:

> 4 x 14W + 2 x 12W = 80W for the disks during operation.

> A fan uses a little under 2W, so say 10W tops for the fans.

> That would be 90W in total. I don't know how much the disk controllers
> or the CPU use. If I'm not mistaken the PSU delivers around 230W (I
> don't have day-to-day access to the machine).

So you're running 6 disks, plus motherboard/CPU/etc. off a
230W PSU.  No way!  I'd put in at least 400W for that many
drives, personally.  (A modern CPU can draw 70W or so all by
itself, IIRC.  Plus the motherboard, RAM, any expansion
cards, etc.  For comparison, Athlons require 300W minimum
for just one or two disk systems!)  Most power supplies
don't perform very well at 100% capacity either;  it is
always good to oversize them.

Keep in mind that disks and fans draw the majority of their
power from the 12V lines.  Make sure your power supply can
handle enough amperage there---total PSU capacity in Watts
may be less important than 12V amperage in your situation.
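
To put rough numbers on it: even treating your 90W disk/fan estimate as
all 12V load, that's about 90W / 12V = 7.5A steady state, and spinup
will be well above that. Plenty of older 230W supplies are only rated
for around 8-10A on the 12V line, so the label on the PSU is worth
reading.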

 
 
 

IRQ/DMA and hardware hell with 2.4.x kernel

Post by Mikael Svenso » Wed, 11 Dec 2002 07:22:56




>>Right now I'm running 5 disks on one PSU and 6 on the one connected to
>>the mainboard. Sorry I didn't mention this right away.

>>After setting my disks to ATA66/UDMA2 and setting the PCI latency, the
>>machine stayed up for 3 days instead of just 6-12 hours. So that did
>>have some impact on stability.

>>According to Western Digital's pages, the 120GB drives use 19W on spinup,
>>7.75W on read/write/idle and 12W on seek. The 100GB drives use 14W on
>>seek. Since the machine starts fine, that should be:

>>4 x 14W + 2 x 12W = 80W for the disks during operation.

>>A fan uses a little under 2W, so say 10W tops for the fans.

>>That would be 90W in total. I don't know how much the disk controllers
>>or the CPU use. If I'm not mistaken the PSU delivers around 230W (I
>>don't have day-to-day access to the machine).

> So you're running 6 disks, plus motherboard/CPU/etc. off a
> 230W PSU.  No way!  I'd put in at least 400W for that many
> drives, personally.  (A modern CPU can draw 70W or so all by
> itself, IIRC.  Plus the motherboard, RAM, any expansion
> cards, etc.  For comparison, Athlons require 300W minimum
> for just one or two disk systems!)  Most power supplies
> don't perform very well at 100% capacity either;  it is
> always good to oversize them.

> Keep in mind that disks and fans draw the majority of their
> power from the 12V lines.  Make sure your power supply can
> handle enough amperage there---total PSU capacity in Watts
> may be less important than 12V amperage in your situation.

I'll definitely check exactly how much the PSU can deliver. It might be
more than 230W :) But I hadn't thought about the 12V issue. I know the
P4 (Socket 423) can draw around 70W, but I think the PIII uses less.

Would tuning down the disk speeds cause the machine to draw less power,
given that it's more stable in that configuration?

From all of this the problem seems to be:

1. PSU related
2. PCI bus related
3. Heat related

or any combination of the above. Not very easy to debug, especially when
the debugging might force you to buy more hardware :)

Regards,
Mikael