Floating point exception: hardware failure?


Post by Mikko Huhta » Sat, 03 Aug 2002 19:51:53



I am getting floating point exceptions seemingly at random on a
Pentium 4 / Linux / OpenMosix 2.4.17 box. Programs terminate with a
floating point exception when doing any fp arithmetic. This is not
limited to any one program or input, but occurs at random, i.e. one in
ten identical runs may complete correctly whereas the nine others
terminate with the exception at different points on the way.

The box is in a cluster and both the software and hardware setups are
identical to those of other nodes that are running without problems. The
exceptions happen both when processing files on local disks and on
network mounted file systems. It looks like hardware failure to me.
Can a failing processor or memory chip manifest as floating point
exceptions? The box has had a few unexplained system crashes, too. Is
there some system-wide setting controlling fp exceptions?

Mikko

 
 
 

Floating point exception: hardware failure?

Post by TimC » Sat, 03 Aug 2002 20:32:04


Mikko Huhtala (aka Bruce) was almost, but not quite, entirely unlike tea:

> I am getting floating point exceptions seemingly at random on a
> Pentium 4 / Linux /OpenMosix 2.4.17 box. Programs terminate with
> floating point exception when doing any fp arithmetic. This is not
> limited to any one program or input, but occurs at random, i.e. one in
> ten identical runs may complete correctly whereas the nine others
> terminated with the exception at different points on the way.

> The box is in a cluster and both the software and hardware setup is
> identical to other nodes that are running without problems. The
> exceptions happen both when processing files on local disks and on
> network mounted file systems. It looks like hardware failure to me.
> Can a failing processor or memory chip manifest as floating point
> exceptions? The box has had a few unexplained system crashes, too.

Sounds like memory failure. What does a traceback on the code reveal -
where is it crashing? If it crashes on a function like tan() that
requires valid input, and the input variable flipped a few bits due to
dodgy memory, then it can easily go and produce an exception.

> Is there some system-wide setting controlling fp exceptions?

What do you mean? Whether you get them? It is a hardware thing: the
FPU tries to do something, realises it makes no sense, and so
"excepts".

--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/

Just because they are called 'forbidden' transitions does not mean
that they are forbidden.  They are less allowed than allowed
transitions, if you see what I mean.

 
 
 

Floating point exception: hardware failure?

Post by David E. Konerdin » Sun, 04 Aug 2002 00:14:05



> I am getting floating point exceptions seemingly at random on a
> Pentium 4 / Linux /OpenMosix 2.4.17 box. Programs terminate with
> floating point exception when doing any fp arithmetic. This is not
> limited to any one program or input, but occurs at random, i.e. one in
> ten identical runs may complete correctly whereas the nine others
> terminated with the exception at different points on the way.

> The box is in a cluster and both the software and hardware setup is
> identical to other nodes that are running without problems. The
> exceptions happen both when processing files on local disks and on
> network mounted file systems. It looks like hardware failure to me.
> Can a failing processor or memory chip manifest as floating point
> exceptions? The box has had a few unexplained system crashes, too. Is
> there some system-wide setting controlling fp exceptions?

There is a per-process FP control register; see /usr/include/fpu_control.h

Other ideas: at one point somebody noticed it was possible to corrupt
fp registers because the Linux kernel wasn't properly storing them in
some obscure situation. And I think an Intel person piped up and pointed
out the processor manual was unclear and that Linux was handling the FP
registers wrong on SMP machines. However, I can't recall the details.

Yes, a failed processor or memory chip could easily manifest itself as a floating point
exception.  Can you isolate the problem a bit more by removing any traces of OpenMosix
from the test environment?

Dave

 
 
 

Floating point exception: hardware failure?

Post by Charlie Dyso » Sun, 04 Aug 2002 02:15:02



> Mikko Huhtala (aka Bruce) was almost, but not quite, entirely unlike tea:
>> I am getting floating point exceptions seemingly at random on a
>> Pentium 4 / Linux /OpenMosix 2.4.17 box. Programs terminate with
>> floating point exception when doing any fp arithmetic. This is not
>> limited to any one program or input, but occurs at random, i.e. one in
>> ten identical runs may complete correctly whereas the nine others
>> terminated with the exception at different points on the way.

>> The box is in a cluster and both the software and hardware setup is
>> identical to other nodes that are running without problems. The
>> exceptions happen both when processing files on local disks and on
>> network mounted file systems. It looks like hardware failure to me.
>> Can a failing processor or memory chip manifest as floating point
>> exceptions? The box has had a few unexplained system crashes, too.

> Sounds like memory failure. What does a traceback on the code reveal -
> where is it crashing? If it crashes on a function like tan() that
> requires valid input, and the input variable flipped a few bits due to
> dodgy memory, then it can easily go and produce an exception.

>>Is there some system-wide setting controlling fp exceptions?

> What do you mean? Whether you get them? It is a hardware thing, the
> FPU tries to do something and realises it makes no sense, and so
> "excepts".

fp-exceptions raise a signal that can be caught, as far as I'm aware.

If it is memory failure, download memtest86 - do a search at Google. It is
definitely worth running anyway - very powerful memory checker, runs off a
floppy (only on x86 processors though, which you have unless you have
something else). Also, try running a silly C program to see (forgive the
pun) what's going on:
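/* Silly.c */
#include <stdio.h>

int main(void) {
        int i;
        volatile int zero = 0;  /* volatile: stops the compiler folding the division */
        volatile float z;

        /* Plain fp arithmetic: should never raise an exception */
        for (i = 0; i < 10000; i++) {
                z = i / 25.0f;
        }
        printf("That didn't crash.\nTrying to force a crash.\n");
        /* Integer divide by zero: raises SIGFPE on x86 - should not harm system */
        z = zero / zero;
        printf("Very strange - didn't crash. Something wrong here.\n");
        return 0;
}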

See what that does.


 
 
 

Floating point exception: hardware failure?

Post by Mikko Huhta » Sun, 04 Aug 2002 04:01:28



> Mikko Huhtala (aka Bruce) was almost, but not quite, entirely unlike tea:

Bruce? I have been called a lot of things, but that was a first.

> Sounds like memory failure. What does a traceback on the code reveal -
> where is it crashing? If it crashes on a function like tan() that
> requires valid input, and the input variable flipped a few bits due to
> dodgy memory, then it can easily go and produce an exception.

In the one core dump that I looked at, it was in the C++ standard
library implementation of 'stringstream >> float', so that seems to
support your idea. I guess I should go tear out memory modules from a
working machine and try with those...

Thanks for your comment.

Mikko

 
 
 

Floating point exception: hardware failure?

Post by B. Joshua Rose » Sun, 04 Aug 2002 11:37:16





>> I am getting floating point exceptions seemingly at random on a Pentium
>> 4 / Linux /OpenMosix 2.4.17 box. Programs terminate with floating point
>> exception when doing any fp arithmetic. This is not limited to any one
>> program or input, but occurs at random, i.e. one in ten identical runs
>> may complete correctly whereas the nine others terminated with the
>> exception at different points on the way.

>> The box is in a cluster and both the software and hardware setup is
>> identical to other nodes that are running without problems. The
>> exceptions happen both when processing files on local disks and on
>> network mounted file systems. It looks like hardware failure to me. Can
>> a failing processor or memory chip manifest as floating point
>> exceptions? The box has had a few unexplained system crashes, too. Is
>> there some system-wide setting controlling fp exceptions?

> There is a per-process FP control register; see /usr/include/fpu_control.h

> Other ideas: at one point somebody noticed it was possible to corrupt
> fp registers because the Linux kernel wasn't properly storing them in
> some obscure situation. And I think an Intel person piped up and
> pointed out the processor manual was unclear and that Linux was
> handling the FP registers wrong on SMP machines. However, I can't recall
> the details.

> Yes, a failed processor or memory chip could easily manifest itself as a
> floating point exception.  Can you isolate the problem a bit more by
> removing any traces of OpenMosix from the test environment?

> Dave

Have you checked the temperature of your CPU? If it was a RAM problem
that was severe enough to cause floating point exceptions, your system
would be crashing. I'd suspect the CPU; it might just be a cooling
problem, so you need to check the die temperature. Your problem could be
as simple as a loose fan.
 
 
 

Floating point exception: hardware failure?

Post by Eric P. McC » Sun, 04 Aug 2002 12:30:52



> Have you checked the temperature of your CPU? If it was a RAM problem
> that was severe enough to cause floating point exceptions your system
> would be crashing.

It has; did you read the OP all the way through?

> I'd suspect the CPU, it might just be a cooling problem so you need
> to check the die temperature. Your problem could be as simple as a
> loose fan.

A CPU that's malfunctioning due to overheating is just as likely to
crash randomly as a system with defective memory.

--

"Last I checked, it wasn't the power cord for the Clue Generator that
was sticking up your ass." - John Novak, rasfwrj

 
 
 

Floating point exception: hardware failure?

Post by B. Joshua Rose » Sun, 04 Aug 2002 13:09:22




>> Have you checked the temperature of your CPU? If it was a RAM problem
>> that was severe enough to cause floating point exceptions your system
>> would be crashing.

> It has; did you read the OP all the way through?

>> I'd suspect the CPU, it might just be a cooling problem so you need to
>> check the die temperature. Your problem could be as simple as a loose
>> fan.

> A CPU that's malfunctioning due to overheating is just as likely to
> crash randomly as a system with defective memory.

It all depends on the timing margins on various parts of the chip. It's
entirely possible that the worst case path is in the FPU. A chip can be
just hot enough for some operations in the FPU to not work while the rest
of the chip functions normally.

If the BIOS has the ability to change the clock frequency you could try
reducing the clock frequency to see if that fixes the problem. If it
doesn't then you'll need to replace the CPU.

 
 
 

Floating point exception: hardware failure?

Post by Mikko Huhta » Mon, 05 Aug 2002 23:48:24



> exception.  Can you isolate the problem a bit more by removing any traces of OpenMosix
> from the test environment?

I did boot to both Mandrake and Debian-packaged versions of 2.4.18 and
the problem remains. It still looks like a hardware problem. OpenMosix
runs just fine on the other nodes of the cluster.

Mikko

 
 
 

Floating point exception: hardware failure?

Post by Mikko Huhta » Mon, 05 Aug 2002 23:57:04


> something else). Also, try running a silly C program to see (forgive
> the pun) what's going on:

The test program runs ok. I increased the number of steps in the loop 100
times, and it still runs ok, at least it did the dozen or so times I
tried. Then again, the program is probably not much of a memory test...

I tried memtest86, too. It has done 3 passes now and has found no
errors, so as far as I can tell, the memory seems to be in working
order.

Mikko

 
 
 

1. Floating Point Exception

Hi All !

Please, can somebody help me.

I have the following problem. I have access to two Linux/Alpha
PC21164-P7 machines, but I am not able to do anything really useful with
them, since I get "Floating Point Exceptions" and the console reports
something like:

arithmetic trap at 0000000120090248: 11 0000000800000000

It is not that the applications I am running are bad, since they are
well established and run on other platforms quite well.

There is one more thing that bothers me: whenever I run some program
(for example a compiler), the console reports messages like:

<sc 208(11ffffbba,3e8,64)>

And I don't know what these messages mean.

I am running kernel 2.0.37 and Debian/GNU Linux slink 2.1.

Does anybody have the same problems or knows how to resolve them ????

Thanks,
Tone

--
+------------------------------------------------------------------------+
| Department of Physical and Organic Chemistry Phone: x 386 61 177 3520
| Jozef Stefan Institute                         Fax: x 386 61 177 3811
| Jamova 39, SI-1000 Ljubljana
| SLOVENIA
+------------------------------------------------------------------------+

2. problem compiling diald-0.12 on last-year's Slackware and kernel 1.2.8

3. float point exception on 3.0

4. stat(2) missing in libc.5.0.9

5. Floating Point Exception error

6. 2.4.16/2.2.19 KDSKBSENT console ioctl

7. floating point exceptions

8. Problems with Belkin usb card

9. idraw floating point exception

10. floating point exception

11. gcc-2.7.0 Floating Point Exception bug

12. floating point exceptions managed in Linux?

13. Floating point exception