HELP HELP HELP: SC2000 went crazy

> Our central server machine is a SC2000 with 2 CPUs, 128MB memory and with
> Solaris 2.4. Last week this machine started to panic with memory address
> alignment errors. We installed the jumbo kernel patch 101945-34.

> The panics just didn't stop. We checked the memory with SunDiag, no problem.
> We were told to check the memory at the boot prompt too, and to our great
> surprise the memory test reported that the machine had only 127MB memory!!

> OK, we removed the SIMMs four by four and put in brand new SIMMs.
> The test always said the machine had 127MB. We replaced the motherboard with
> another one - no change.

> We let the machine run with the new motherboard and after two days,
> the machine panicked again!

> Here are the messages:

> Nov 16 17:51:11 sunserv unix: BAD TRAP: cpu_id=1 type=7 <Memory address alignment> addr=0 rw=0 rp=e0ff6ce4
> Nov 16 17:51:11 sunserv unix: MMU sfsr=0x0: ft=<None>
> Nov 16 17:51:11 sunserv unix: sched: Memory address alignment
> Nov 16 17:51:11 sunserv unix: cv_block+0x14, pid=0, pc=0xe005e824, sp=0xe0ff6d30, psr=0x41400ac2, context=0
> Nov 16 17:51:11 sunserv unix: g1-g7: 1, f745cab8, 0, 0, 0, 1, e0ff6ec0
> Nov 16 17:51:11 sunserv unix: Begin traceback... sp = e0ff6d30

> Nov 16 17:51:11 sunserv unix:  args=e01c4cfc 14032 f7f74048 414000e3 f745cb0c 0

> Nov 16 17:51:11 sunserv unix:  args=e01c4cfc e01d53b4 0 0 0 0

> Nov 16 17:51:11 sunserv unix:  args=0 fffffffe 10000 f7ae3ebc 0 e01d53b4

> Nov 16 17:51:11 sunserv unix:  args=0 0 0 0 0 0
> Nov 16 17:51:11 sunserv unix: End traceback...
> Nov 16 17:51:11 sunserv unix: panic[cpu1]/thread=0xe0ff6ec0: Memory address alignment

I received suggestions to check both the hardware and the software.

Because of the randomness of the panics and because we didn't install any
new software before the panics, we checked the hardware first.
Unfortunately we ran the "memory test" at the monitor level:

> setenv diag-switch? true
> test-memory
> setenv diag-switch? false

Do not do this! The so-called 'test-memory' is as buggy as possible:
it can report different memory sizes, or even physical memory
address problems, when everything is OK. It misled us completely, and
in trying to find the problem we checked every bit of the hardware...

We enabled crash dumps and sent the analyses to Sun UK.
They answered last week, suggesting that if we run PPP we should
install patch 102854-01. The patch description reads:

Patch-ID# 102854-01
Keywords: PPP crash windows
Synopsis: SunOS 5.4: Windows 95 PPP causes Solaris to crash every time
Date: Nov/20/95

And yes, our crashes started AFTER Windows 95 was released and we use PPP.

We tested it: without the patch, as soon as the two PPPs connect,
the SC2000 panics with exactly the same messages as we had before.
(After the patch was applied we couldn't get the Windows 95 PPP to connect
to the Solaris PPP successfully.)
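
To check whether two panics really are "exactly the same", the trap
reason and the faulting function can be pulled out of the syslog lines
and compared. What follows is only a minimal sketch in Python, assuming
the message format shown in the log excerpt above. The function name
panic_signature and both regular expressions are illustrative and are
not part of Solaris or of any Sun tool.

```python
import re

# Matches the "BAD TRAP" line; format assumed from the log excerpt above.
BAD_TRAP = re.compile(
    r"BAD TRAP: cpu_id=(?P<cpu>\d+) type=(?P<type>\d+) "
    r"<(?P<reason>[^>]+)>"
)

# Matches the line that names the faulting function, offset and pc.
FAULT_PC = re.compile(
    r"(?P<func>\w+)\+0x(?P<off>[0-9a-f]+), pid=(?P<pid>\d+), "
    r"pc=0x(?P<pc>[0-9a-f]+)"
)


def panic_signature(lines):
    """Return (trap reason, function+offset) for one panic, or None."""
    reason = None
    where = None
    for line in lines:
        m = BAD_TRAP.search(line)
        if m:
            reason = m.group("reason")
        m = FAULT_PC.search(line)
        if m:
            where = f'{m.group("func")}+0x{m.group("off")}'
    if reason is None or where is None:
        return None
    return (reason, where)


# Two lines copied from the panic messages quoted above.
SAMPLE = [
    "Nov 16 17:51:11 sunserv unix: BAD TRAP: cpu_id=1 type=7 "
    "<Memory address alignment> addr=0 rw=0 rp=e0ff6ce4",
    "Nov 16 17:51:11 sunserv unix: cv_block+0x14, pid=0, pc=0xe005e824, "
    "sp=0xe0ff6d30, psr=0x41400ac2, context=0",
]

if __name__ == "__main__":
    print(panic_signature(SAMPLE))
```

Two panics whose signatures are equal failed for the same reason at the
same place in the kernel, even if the register and stack values differ.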

I would rather not comment on this Solaris bug.

Thanks for the suggestions to

