> : >
> : > After having read so many complaints here of people having trouble with
> : > the infamous "Signal 11 -- Seg fault" in Linux, I'm finding it very
> : > difficult to believe that all these people have flaky hardware...
> : >
> : As a former Field service engineer for DEC (in the early to mid '80s), I
> : have to disagree with you. [stories deleted]
> [yet another story deleted]
In a previous life of my computer (when it was still running SVR4 and
several upgrades ago..), it would crash occasionally. It started out
crashing once every month or two, but after it switched from being a
backup newsfeed to the main newsfeed for several sites, I added SCSI,
and I started to run FrameMaker, it started to crash every couple of
days. Sometimes, these crashes would do really * things to my
filesystem, but mostly it was just _real_ frustrating.
None of the memory test programs that I could find (four of them)
found any problems. I tried swapping SIMMs anyway. No luck. I tried
swapping SCSI cards and IDE controllers, changing BIOS settings,
re-installing, etc. No luck.
Worse, I could not reproduce the error reliably. That is, until I
did the following: I restored my system off of tape into a directory,
and while it was being restored, I would compare the files to what was
already in the system. If the files were the same, I would delete the
freshly restored file, otherwise I would move it to another
directory. So, for several hours, my computer would be doing lots
of DMA'ed I/O from tape, writing via SCSI DMA to disk, reading from
the IDE and SCSI disks, and doing a lot of CPU crunching.
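The compare-and-sift pass described above can be sketched in a few
lines of Python. This is only an illustration, not the script that was
actually used: the function name sift_restored_tree and the three
directory arguments are made up for the example, and it assumes the
restore has already finished rather than running alongside it.

```python
import filecmp
import os
import shutil


def sift_restored_tree(restored_root, live_root, suspect_root):
    """Compare each restored file with its counterpart on the live system.

    Identical files are deleted from the restored tree. Files that differ,
    or that have no live counterpart, are moved under suspect_root.
    suspect_root is assumed to lie outside restored_root.
    Returns the relative paths of the files that were moved.
    """
    moved = []
    for dirpath, _dirnames, filenames in os.walk(restored_root):
        for name in filenames:
            restored = os.path.join(dirpath, name)
            rel = os.path.relpath(restored, restored_root)
            live = os.path.join(live_root, rel)
            # shallow=False forces a byte-for-byte comparison instead of
            # trusting size and modification time.
            if os.path.isfile(live) and filecmp.cmp(restored, live, shallow=False):
                os.remove(restored)
            else:
                dest = os.path.join(suspect_root, rel)
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                shutil.move(restored, dest)
                moved.append(rel)
    return moved
```

Anything left under suspect_root afterwards is either a file that
really changed since the backup or a file that got corrupted on the
way, so a quiet system should leave very little there.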
I found that about one (1) byte out of every 150MB of data processed
would be corrupted (usually changed to an 0xff). Setting my BIOS
memory speeds down to the very slowest setting fixed the problem.
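To see which bytes actually changed in a mismatched file, something
like the following would do. Again, this is a hypothetical sketch
rather than the tool used at the time: differing_bytes and its
chunk_size parameter are names invented for the example, and only the
length the two files have in common is compared.

```python
def differing_bytes(path_a, path_b, chunk_size=1 << 20):
    """Yield (offset, byte_a, byte_b) for every position where the files differ.

    Reads both files in chunks so that large files do not have to fit in
    memory. Comparison stops at the end of the shorter file.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a or not b:
                break
            if a != b:
                # Only walk byte by byte through chunks that differ.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        yield offset + i, x, y
            offset += min(len(a), len(b))
```

Running it over a good copy and a suspect copy, and counting how many
of the reported values are 0xff, gives a rough picture of the kind of
corruption involved.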
Morals of this story:
1) It doesn't take very many bad bytes to really hose the system.
2) It can take _really_ heavy loads on your _entire_ system to
trigger an error.
3) Problems may not always be bad hardware. It may just be that the
configuration is wrong. (Bus speeds too high, overclocking,
wrong memory settings, etc.)
-wayne
--
Wayne Schlitt can not assert the truth of all statements in this
article and still be consistent.