Difficult to Swallow...

Post by Dr. W » Fri, 16 May 1997 04:00:00



After having read so many complaints here of people having trouble with
the infamous "Signal 11 -- Seg fault" in Linux, I'm finding it very
difficult to believe that all these people have flaky hardware...  I just
talked to someone last night with new hardware who couldn't compile a
kernel because gcc would barf on Sig 11 all the time.

Now, I *know* my hardware is shot, thanks to a faulty power/surge protector
that actually exploded...  But I still have a hard time accepting that
the Sig 11 stuff can always be blamed on faulty hardware, considering the
number of claims of problems with it for no apparent reason.

I'm not a kernel hacker (don't have the time to mess with it), so I don't
know my left from my right about the Linux kernel...  But I'd think that
there's something worth looking into when it comes to this problem.

Opinions???

--
                 Linux: friends don't let friends use dos.

 
 
 

Difficult to Swallow...

Post by Dr. W » Sat, 17 May 1997 04:00:00



: > After having read so many complaints here of people having trouble with
: > the infamous "Signal 11 -- Seg fault" in Linux, I'm finding it very
: > difficult to believe that all these people have flaky hardware...  I just

: Suggestion - if you don't use it already, put thermal compound between
: the CPU and the fan. I think there are a lot of hot CPUs out there.

Well, I don't use it (actually, the only reason I *don't* is because I've
just not gotten it yet).  However, I also had the same problems with a
different CPU on a "temporary" test...  It doesn't appear that heat is
causing any trouble (it's just a 486 dx2 80MHz...)

--
                 Linux: friends don't let friends use dos.


 
 
 

Difficult to Swallow...

Post by Dr. W » Sat, 17 May 1997 04:00:00



: : After having read so many complaints here of people having trouble with
: : the infamous "Signal 11 -- Seg fault" in Linux, I'm finding it very
: : difficult to believe that all these people have flaky hardware...  I just
: : talked to someone last night with new hardware who couldn't compile a
: : kernel because gcc would barf on Sig 11 all the time.

: Well, O.K., but how do you explain the vast majority of people who
: never get signal 11s?  

I have absolutely no clue whatsoever.  I had no problems at all at first,
but I know what *caused* mine...  However, it seems that if other OS's don't
seem to have a problem with this, perhaps there's something worth examining..

--
                 Linux: friends don't let friends use dos.

 
 
 

Difficult to Swallow...

Post by Donnie Barn » Sat, 17 May 1997 04:00:00



>After having read so many complaints here of people having trouble with
>the infamous "Signal 11 -- Seg fault" in Linux, I'm finding it very
>difficult to believe that all these people have flaky hardware...  I just
>talked to someone last night with new hardware who couldn't compile a
>kernel because gcc would barf on Sig 11 all the time.

>Now, I *know* my hardware is shot, thanks to a faulty power/surge protector
>that actually exploded...  But I still have a hard time accepting that
>the Sig 11 stuff can always be blamed on faulty hardware, considering the
>number of claims of problems with it for no apparent reason.

>I'm not a kernel hacker (don't have the time to mess with it), so I don't
>know my left from my right about the Linux kernel...  But I'd think that
>there's something worth looking into when it comes to this problem.

After having built and operated over 200 different Linux boxes, I
must completely disagree with you.  Every time I've seen that error
during a build it was the fault of the hardware.  Every Single Time(tm).

--Donnie

--
 Donnie Barnes              http://www.redhat.com/~djb             "Bah."

  Challenge Diversity.  Ignore People.  Live Life.  Use Linux.  2003.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
_Things You'd NEVER Expect A Southerner To Say_ by Vic Henley:
**  Checkmate.

 
 
 

Difficult to Swallow...

Post by Ken Lat » Sat, 17 May 1997 04:00:00





>> After having read so many complaints here of people having trouble with
>> the infamous "Signal 11 -- Seg fault" in Linux, I'm finding it very
>> difficult to believe that all these people have flaky hardware...  I just
>> talked to someone last night with new hardware who couldn't compile a
>> kernel because gcc would barf on Sig 11 all the time.

> As a former Field service engineer for DEC (in the early to mid '80s), I
> have to disagree with you. I have seen and repaired things that you
> would never imagine could possibly affect a computer.

> Seen with my own eyes: A printer/terminal (LA120-AA) that would print
> garbage every other page, but only in January in Toronto. Cause: Static
> Electricity, due to sitting on one of those plastic pads that made the
> chair move easier. January because the cold winter there makes for very
> dry air inside the offices.

> Solution: You are not going to believe this, but we diluted fabric
> softener (anti-static) and sprayed it on the pad.  It worked like a
> charm.  Meanwhile, we installed about 4 logic boards & 2 power supplies.

> Seen it also with my own eyes.  A Vax mini (Vax11/8650) that would get
> floating point errors every day at 5PM.  Cause: a particularly poor
> choice in power routing that put the power to a huge industrial garbage
> compactor right behind it.  The resulting EMF was enough to cause a
> poorly shielded FPU unit to get a few errors.
> Solution: Turn the CPU cabs *Backwards* in the room & install a bunch of
> grounded shielding in the room, and in the cabs and replace the FPU with
> a newer one with slightly different shielding, and, move the power
> routing.  It was a *serious* amount of current going to that garbage
> compactor. It took corporate support and a guy with a static meter to
> figure that one out.

> A 286 PC (a VaxMate... Eeeew) that would not run a certain piece of
> software (can't remember which one) with a certain hard drive.  Cause:
> Aftermarket HD installed in regular case, too much vibration when it was
> running this database caused the edge connector between the cpu and the
> add-on HD unit to be flaky.  Solution: different brand of aftermarket
> HD.

> That big Vax had only 64MB of ram.  And it was about $300,000 of machine
> and served about 150 users.  It never got ram errors because its ram was
> error correcting.  Whenever a customer would log ECC errors they'd insist
> on us replacing the RAM.  Non-trivial for $100K worth of RAM (back
> then...).  Luckily the RAM subsystem on a Vax tracks memory errors down
> to bit and bank.  Pull one card, slide in a new one, run the memory
> exerciser, and away you go.  ECC ram is 39 bits for a 32-bit word.  I
> forget how it works, but you could tell which bit failed and correct it
> for a one-bit error, and detect a 2-bit error (and machine check..
> Dammit.)
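
For readers who, like the poster, forget how 39 bits can protect a 32-bit word: below is a minimal sketch, assuming a textbook Hamming SECDED layout (6 check bits at the power-of-two positions plus one overall parity bit, 32 + 6 + 1 = 39). The thread does not say whether the VAX memory controller used exactly this arrangement, and `encode` and `decode` are illustrative names only.

```python
# Assumed layout: position 0 = overall parity, positions 1, 2, 4, 8, 16, 32 =
# Hamming check bits, the remaining 32 positions (3, 5, 6, 7, ...) = data bits.
CHECK_POS = (1, 2, 4, 8, 16, 32)
DATA_POS = [p for p in range(1, 39) if p not in CHECK_POS]  # exactly 32 slots


def encode(data: int) -> list[int]:
    """Turn a 32-bit word into a 39-bit codeword (list of 0/1 values)."""
    bits = [0] * 39
    for i, pos in enumerate(DATA_POS):
        bits[pos] = (data >> i) & 1
    for c in CHECK_POS:
        # Each check bit makes the parity of every position sharing that
        # address bit come out even.
        bits[c] = sum(bits[q] for q in range(1, 39) if q & c and q != c) & 1
    bits[0] = sum(bits[1:]) & 1  # overall parity over the other 38 bits
    return bits


def decode(codeword: list[int]) -> tuple[str, int]:
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(codeword)
    syndrome = 0
    for pos in range(1, 39):
        if bits[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    odd_overall = sum(bits) & 1
    if syndrome == 0 and not odd_overall:
        status = "ok"
    elif odd_overall and syndrome < 39:
        # One flipped bit: the syndrome is its position (0 = the parity bit).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Two flipped bits (or worse): detected, but cannot be repaired.
        status = "uncorrectable"
    data = 0
    for i, pos in enumerate(DATA_POS):
        data |= bits[pos] << i
    return status, data


word = 0xFFFF0001
damaged = encode(word)
damaged[17] ^= 1  # simulate a single-bit RAM error
print(decode(damaged))
```

The "uncorrectable" branch corresponds to the case the post describes as ending in a machine check.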

> When you get a single bit error in a ram location, you are liable to get
> a slight bit of change in colour in a single pixel of one graphic you
> are editing.  That's a non-noticeable error. BUT, when you compile you
> also use up huge amounts of ram.  Compiling a kernel puts all sorts of
> stuff in a pipe between your hard disk, your compiler and your
> assembler.  A bit error can have your compiler branching to Efff0001
> instead of ffff0001, outside of your protected memory space, and
> therefore: Sig11.

> Your EDO ram cannot detect any errors, much less correct them.  This is
> a big reason to get at least parity ram on a big, heavily used, depended
> upon machine.  At 128Mbytes or more, your MTBF for Ram starts to get
> pretty short with non-parity RAM.  At 20 ns or less for cache ram, a
> little bit of noise from a radio source (like a cell phone) can cause
> the cache to get a 1 bit error. And there is no way to set up parity
> checking for cache.
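
The Efff0001-versus-ffff0001 case in the quoted post is exactly one flipped bit. A minimal Python sketch (not from the thread; `flip_bit` and `parity` are illustrative names) shows the flip, and how a single parity bit per word would at least detect it, which is the argument for parity RAM made above:

```python
def flip_bit(word: int, bit: int) -> int:
    """Return word with one bit inverted, as a stray RAM error would leave it."""
    return word ^ (1 << bit)


def parity(word: int) -> int:
    """Even-parity bit: 1 if the word contains an odd number of 1 bits."""
    return bin(word).count("1") & 1


good = 0xFFFF0001
bad = flip_bit(good, 28)  # bit 28 is the lowest bit of the top hex digit
print(hex(good), "->", hex(bad))
print("parity bit changed:", parity(good) != parity(bad))
```

Parity only detects an odd number of flipped bits and cannot say which bit it was; that is the gap ECC closes.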

[snip]

In my DEC Field Service days I saw even weirder things than Jay has
mentioned. I saw, on several occasions, outputs of IC AND gates
momentarily change with no change to the inputs. The PDP-8 had
instructions that would skip the next instruction. I had one that would
reliably skip more than one when the instruction executed at a particular
address and the number skipped was a function of how many instructions
immediately preceding the skip were executed. I saw a PDP-12 that would
flub an add-to-memory instruction with certain combinations of operands
and addresses. These things are a lot flakier than most folks like to
imagine.

--

   Ken Latta

        ****  If you're not running Linux, you paid too much.  ****

 
 
 

Difficult to Swallow...

Post by Jon Sno » Sat, 17 May 1997 04:00:00




> >After having read so many complaints here of people having trouble with
> >the infamous "Signal 11 -- Seg fault" in Linux, I'm finding it very
> >difficult to believe that all these people have flaky hardware...  I just
> >talked to someone last night with new hardware who couldn't compile a
> >kernel because gcc would barf on Sig 11 all the time.

> >Now, I *know* my hardware is shot, thanks to a faulty power/surge protector
> >that actually exploded...  But I still have a hard time accepting that
> >the Sig 11 stuff can always be blamed on faulty hardware, considering the
> >number of claims of problems with it for no apparent reason.

> >I'm not a kernel hacker (don't have the time to mess with it), so I don't
> >know my left from my right about the Linux kernel...  But I'd think that
> >there's something worth looking into when it comes to this problem.

My experience with sig 11 was that X *ed up after about 15 seconds
and really caused havoc, seg faults, file systems trashed etc and it all
came down to my mother(of a)board. Changing that fixed the problem, and
that kind of problem is totally independent of how new your system is.

Jon Snow

 
 
 

Difficult to Swallow...

Post by Steve Falc » Sat, 17 May 1997 04:00:00


> Suggestion - if you don't use it already, put thermal compound between
> the CPU and the fan. I think there are a lot of hot CPUs out there.

Be careful with this.  I tried thermal compound, but the oil in it bled
out over time, and attracted a lot of dust.  This caused the CPU fan to
fail.  Not good.

If you really want to use thermal compound, use only a tiny amount.

        Steve Falco

 
 
 

Difficult to Swallow...

Post by Frits Daalma » Sat, 17 May 1997 04:00:00


: >
: > After having read so many complaints here of people having trouble with
: > the infamous "Signal 11 -- Seg fault" in Linux, I'm finding it very
: > difficult to believe that all these people have flaky hardware...  I just
: > talked to someone last night with new hardware who couldn't compile a
: > kernel because gcc would barf on Sig 11 all the time.
: >
: As a former Field service engineer for DEC (in the early to mid '80s), I
: have to disagree with you. I have seen and repaired things that you
: would never imagine could possibly affect a computer.
[stories deleted]

Your stories really made me laugh! (obviously because I wasn't the one who
had to solve them).
I suggest that you post them to alt.folklore.computers or alt.cosuard,
and hopefully they'll end up in an archive somewhere!

Besides, they do sound like a good reason to persuade the computer
manufacturers to resume producing RAM and motherboards with parity or ECC.

I have had a flaky SIMM in my PC at work (showed up because the Linux kernel
compilation failed), and at first our computer maintenance
people wouldn't believe it because RAM tests didn't show anything wrong.
Then, I obtained a shareware (euh.. DOS) memory test utility which repeatedly
tested the memory in different modes all night. Apparently the errors only
showed up after the chips got warmed to operating temperature: in the morning
numerous errors were displayed when I turned the screen back on.
Now they use this utility too for all their tests :-)

Greetings, & may your SIMMs never fail,
Frits

 
 
 

Difficult to Swallow...

Post by Timothy Watso » Sat, 17 May 1997 04:00:00



> > Suggestion - if you don't use it already, put thermal compound between
> > the CPU and the fan. I think there are a lot of hot CPUs out there.

> Be careful with this.  I tried thermal compound, but the oil in it bled
> out over time, and attracted a lot of dust.  This caused the CPU fan to
> fail.  Not good.

Also get a fan that lives on top of a heat sink - then the thermal
compound is not exposed to dust, and the combinations should work
better.

--
________________________________________________________________________
T    i    m    o    t    h    y              W    a    t    s    o    n

  __/| Something there is that doesn't love a wall, that wants it down

 
 
 

Difficult to Swallow...

Post by Rauli Ruohon » Sat, 17 May 1997 04:00:00



>I have absolutely no clue whatsoever.  I had no problems at all at first,
>but I know what *caused* mine...  However, it seems that if other OS's don't
>seem to have a problem with this, perhaps there's something worth examining..

Well, Linux uses the hardware more thoroughly?
This is an example from my 486 DX2/66:

Phenomenon:

Time jumped sometimes backwards and was quite "jumpy", noticed with
interactive programs (especially games). This happened only with Linux.

Cause:

Broken timer chip. DOS/etc. used only the timer interrupt, but Linux uses
the chip to read more precise time. As it's broken, the time's completely
bogus.

Fix:

Ingo made a kluge kernel patch to work around this bug :)
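
A symptom like this can be looked for from user space by reading the clock in a tight loop and flagging any reading that is earlier than the previous one. A minimal sketch, with assumed names `sample_clock` and `backward_jumps`; note that `time.time()` can also step backwards for legitimate reasons (for example a clock adjustment), so a hit is a hint, not proof of a broken timer chip.

```python
import time


def sample_clock(n, read=time.time):
    """Take n consecutive readings from the given clock function."""
    return [read() for _ in range(n)]


def backward_jumps(samples):
    """Return (index, delta) for every reading earlier than its predecessor."""
    jumps = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta < 0:
            jumps.append((i, delta))
    return jumps


readings = sample_clock(100_000)
print(len(backward_jumps(readings)), "backward jumps in", len(readings), "readings")
```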

And think about the case where you use DOS/Windows with one or two apps and
never experience any RAM errors. Then you boot Linux and run 3 big
programs which access the hardware directly (DMA from disk to memory +
memory accesses), whereas Windows/DOS goes through the slow BIOS..

--
Prof:    So the American government went to IBM to come up with a data
         encryption standard and they came up with ...
Student: EBCDIC!

 
 
 

Difficult to Swallow...

Post by Jason Mathiso » Sun, 18 May 1997 04:00:00





> : : After having read so many complaints here of people having trouble with
> : : the infamous "Signal 11 -- Seg fault" in Linux, I'm finding it very
> : : difficult to believe that all these people have flaky hardware...  I just
> : : talked to someone last night with new hardware who couldn't compile a
> : : kernel because gcc would barf on Sig 11 all the time.

> : Well, O.K., but how do you explain the vast majority of people who
> : never get signal 11s?

> I have absolutely no clue whatsoever.  I had no problems at all at first,
> but I know what *caused* mine...  However, it seems that if other OS's don't
> seem to have a problem with this, perhaps there's something worth examining..

The problem is twofold:
1)  When Windows crashes people don't think much of it... water is wet,
    women have secrets, and Windows crashes, deal with it.

2)  When Windows crashes there is no indication of the actual cause of
    the problem.  The signal 11 is a fairly telltale sign of hardware
    problems, but Windows just crashes(tm) and at best will give you a
    register state readout.

As a good rule of thumb, if the problem is repeatable then it is
software; if it is not repeatable then the hardware is suspect.  At one
time I had a repeatable sig 11 when I was compiling my kernel; it turned
out that the Pentium-optimized compiler I was using had a bug in it.
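
The rule of thumb can be written down as a small classifier. This is only a sketch of the heuristic as stated in the post: the function name `classify` and the idea of recording the file where each build attempt died (or `None` if it succeeded) are assumptions for illustration, and the result is a suspicion, not a diagnosis.

```python
def classify(failure_points):
    """failure_points holds one entry per identical build attempt:
    None if the build succeeded, otherwise where it failed (e.g. a file name)."""
    if len(failure_points) < 2:
        return "need more runs"
    failures = [p for p in failure_points if p is not None]
    if not failures:
        return "no failures observed"
    # Same failure, same place, every single time: the repeatable case.
    if len(failures) == len(failure_points) and len(set(failures)) == 1:
        return "repeatable: suspect software"
    # Failures that come and go, or move around: the non-repeatable case.
    return "not repeatable: suspect hardware"


print(classify(["fs/inode.c", "fs/inode.c", "fs/inode.c"]))
print(classify(["fs/inode.c", None, "mm/memory.c"]))
```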

Jason Mathison
Rose-Hulman Networking Manager
"If we can't fix it, it ain't broken"

 
 
 

Difficult to Swallow...

Post by Horst von Bra » Sun, 18 May 1997 04:00:00




>>[gcc segfaults _can't_ all be due to faulty hardware]
>After having built and operated over 200 different Linux boxes, I
>must completely disagree with you.  Every time I've seen that error
>during a build it was the fault of the hardware.  Every Single Time(tm).

Second that, at least for stable kernels (I haven't installed exactly 200
machines, but...). There was a problem in one of the late testing kernels
that made gcc crash, but it was deterministic: Every time in the same file.
The "infamous signal 11" crashes in different files each time, mostly due to
memory problems.

Besides, works with DOS/Win != works right: By mistake, my current
motherboard (i586/100) came set up for 133MHz. DOS/Win worked fine (no
thorough testing, though), Linux wouldn't even boot most of the time.
--

Casilla 9G, Viña del Mar, Chile                               +56 32 672616

 
 
 

Difficult to Swallow...

Post by Wayne Schlit » Sun, 18 May 1997 04:00:00





> : >
> : > After having read so many complaints here of people having trouble with
> : > the infamous "Signal 11 -- Seg fault" in Linux, I'm finding it very
> : > difficult to believe that all these people have flaky hardware...
> : >
> : As a former Field service engineer for DEC (in the early to mid '80s), I
> : have to disagree with you. [stories deleted]
> [yet another story deleted]

In a previous life of my computer (when it was still running SVR4 and
several upgrades ago..), it would crash occasionally.  It started out
crashing once every month or two, but after it switched from being a
backup newsfeed to the main newsfeed for several sites, I added SCSI,
and I started to run Framemaker, it started to crash every couple of
days.  Sometimes, these crashes would do really * things to my
filesystem, but mostly it was just _real_ frustrating.

No memory test programs that I could find (four of them) found any
problems, I tried swapping SIMMs anyway, no luck.  I tried swapping
SCSI cards, IDE controllers, changing BIOS settings, re-installing,
etc.  No luck.

Worse, I could not reproduce the error repeatedly.  That is until I
did the following:  I restored my system off of tape into a directory,
and while it was being restored, I would compare the files to what was
already in the system.  If the files were the same, I would delete the
freshly restored file, otherwise I would move it to another
directory.  So, for several hours, my computer would be doing lots
of DMA'ed I/O from tape, writing via SCSI DMA to disk, reading from
the IDE and SCSI disks, and doing a lot of CPU crunching.  

I found that about one (1) byte out of every 150MB of data processed
would be corrupted (usually changed to an 0xff).  Setting my BIOS
memory speeds down to the very slowest setting fixed the problem.
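
The restore-and-compare test above can be approximated with a short script. This is a sketch under assumptions, not the procedure actually used in the post: it only compares two directory trees that already exist (it does not drive a tape restore or delete or move anything), and `count_mismatched_bytes` and `compare_trees` are made-up names.

```python
from pathlib import Path


def count_mismatched_bytes(a: Path, b: Path, chunk: int = 1 << 20) -> int:
    """Compare two files chunk by chunk; return how many byte positions differ.
    A difference in length counts as that many mismatched bytes."""
    mismatches = 0
    with a.open("rb") as fa, b.open("rb") as fb:
        while True:
            ca, cb = fa.read(chunk), fb.read(chunk)
            if not ca and not cb:
                return mismatches
            mismatches += sum(x != y for x, y in zip(ca, cb))
            mismatches += abs(len(ca) - len(cb))


def compare_trees(restored: Path, live: Path) -> dict[str, int]:
    """Map relative path -> mismatched byte count for files present in both
    trees whose contents differ."""
    report = {}
    for path in restored.rglob("*"):
        if path.is_file():
            rel = path.relative_to(restored)
            other = live / rel
            if other.is_file():
                n = count_mismatched_bytes(path, other)
                if n:
                    report[str(rel)] = n
    return report
```

Calling `compare_trees(Path("/restore"), Path("/"))` (both paths hypothetical) on files that should be identical gives a figure comparable to the "one byte per 150MB" above. A mismatch can also just mean a file legitimately changed since the backup.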

Morals of this story:

  1) It doesn't take very many bad bytes to really hose the system.

  2) It can take _really_ heavy loads on your _entire_ system to
     trigger an error.

  3) Problems may not always be bad hardware, it may be just that the
     configuration is wrong.  (Bus speeds too high, overclocking,
     wrong memory settings, etc.)

-wayne

--
Wayne Schlitt can not assert the truth of all statements in this
article and still be consistent.

 
 
 

Difficult to Swallow...

Post by waywar » Sun, 18 May 1997 04:00:00




> > talked to someone last night with new hardware who couldn't compile
> > a kernel because gcc would barf on Sig 11 all the time.
> Your EDO ram cannot detect any errors, much less correct them.  This is
> a big reason to get at least parity ram on a big, heavily used, depended
> upon machine.  At 128Mbytes or more, your MTBF for Ram starts to get
> pretty short with non-parity RAM.  At 20 ns or less for cache ram, a
> little bit of noise from a radio source (like a cell phone) can cause
> the cache to get a 1 bit error. And there is no way to set up parity
> checking for cache.

Another huge reason for external modems... you have several coils in the
darn thing (speaker and off/on hook) being jazzed up and the RF can be a
factor within your computer case. Applied Management, in Helena MT., had
someone come in with one of those field meters and detected more than
healthy levels of RF from the internal modem. They swore they'd never
ever use an internal again, and I've followed their advice... they build
the parking systems computer hardware and software for such airports as
Dulles, Minn. St. Paul, and such.  

You might want to start compiling a kernel in one term, and go into
another term and enter "echo ath1 > /dev/cuaX" followed by
"echo ath0 > /dev/cuaX" (X = your com port).  Repeat this several times
and see if it generates interference with your machine and compile.  I'd
give it a shot, but all 6 of my modems are external. :)  I agree with
Jay, you never know what will bite you on the ass! <g> ric

--
Home for Wayward Computers      | Home of Tippy the Wonder Dog
Ric Moore - sole proprietor and | Caldera Linux - NightmareOS MUD
Grand Poobah. BCUG President    | (304) 255-7193-94-95-96-97-98-99

                      http://www.wayward.org
 
 
 

Difficult to Swallow...

Post by Donnie Barn » Sun, 18 May 1997 04:00:00


>I have absolutely no clue whatsoever.  I had no problems at all at first,
>but I know what *caused* mine...  However, it seems that if other OS's don't
>seem to have a problem with this, perhaps there's something worth examining..

"Other" OS's don't have a problem because they have no idea how to
make the best use of the hardware.  None.  Linux can put it through
its paces in short order, and that reveals hardware problems.  In
this day of el-cheapo components, things that are "marginal" can still
work fine in Windows environments, so people sell them that way.  
Those problems are uncovered if you really exercise your hardware.

--Donnie

--
 Donnie Barnes              http://www.redhat.com/~djb             "Bah."

  Challenge Diversity.  Ignore People.  Live Life.  Use Linux.  2003.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
_Things You'd NEVER Expect A Southerner To Say_ by Vic Henley:
**  Checkmate.

 
 
 
