Solaris 8 on ultra10 platform keeps crashing with....................

Post by Terry Pi » Fri, 23 Apr 2004 18:23:38



Sometimes the system stays up for a whole day! What is this, NT? I don't
think so.
What's going on here? Anyone have any thoughts?

Apr 19 20:10:59 stvmm01 unix: [ID 836849 kern.notice]
Apr 19 20:10:59 stvmm01 ^Mpanic[cpu0]/thread=30000f766e0:
Apr 19 20:11:00 stvmm01 unix: [ID 799565 kern.notice] BAD TRAP: type=10 rp=2a10058ba50 addr=10035648 mmu_fsr=0
Apr 19 20:11:00 stvmm01 unix: [ID 100000 kern.notice]
Apr 19 20:11:00 stvmm01 unix: [ID 839527 kern.notice] mibiisa:
Apr 19 20:11:00 stvmm01 unix: [ID 901337 kern.notice] illegal instruction fault:
Apr 19 20:11:00 stvmm01 unix: [ID 381800 kern.notice] addr=0x10035648
Apr 19 20:11:00 stvmm01 unix: [ID 101969 kern.notice] pid=324, pc=0x10035648, sp=0x2a10058b2f1, tstate=0x4400001601, context=0x1080
Apr 19 20:11:00 stvmm01 unix: [ID 743441 kern.notice] g1-g7: 100a93b0, 0, 7f603f, 0, 0, 30b742b9, 30000f766e0
Apr 19 20:11:00 stvmm01 unix: [ID 100000 kern.notice]
Apr 19 20:11:00 stvmm01 genunix: [ID 723222 kern.notice] 000002a10058b780 unix:die+a4 (10, 2a10058ba50, 10035648, 0, 2a10058ba50, 1041c808)
Apr 19 20:11:00 stvmm01 genunix: [ID 179002 kern.notice]   %l0-3: 00000000100bd64c 0000000000000000 0000000000000040 0000000000000000
Apr 19 20:11:00 stvmm01   %l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Apr 19 20:11:00 stvmm01 genunix: [ID 723222 kern.notice] 000002a10058b860 unix:trap+1220 (20, 0, 0, 10000, 2a10058ba50, 0)
Apr 19 20:11:01 stvmm01 genunix: [ID 179002 kern.notice]   %l0-3: 0000000010441568 0000000010441570 000003000102b538 0000000000000000
Apr 19 20:11:01 stvmm01   %l4-7: 0000000000000010 0000030001019880 0000000000010000 0000000000000000
Apr 19 20:11:01 stvmm01 genunix: [ID 723222 kern.notice] 000002a10058b9a0 unix:prom_rtt+0 (0, 0, 30000f766e0, fec07e60, 0, 2)
Apr 19 20:11:01 stvmm01 genunix: [ID 179002 kern.notice]   %l0-3: 0000000000000002 0000000000001400 0000004400001601 000000001002b6a4
Apr 19 20:11:01 stvmm01   %l4-7: 0000000000000000 0000000000000000 0000000000000000 000002a10058ba50
Apr 19 20:11:01 stvmm01 unix: [ID 100000 kern.notice]
Apr 19 20:11:01 stvmm01 genunix: [ID 672855 kern.notice] syncing file systems...
Apr 19 20:11:01 stvmm01 genunix: [ID 904073 kern.notice]  done
Apr 19 20:11:02 stvmm01 genunix: [ID 353387 kern.notice] dumping to /dev/dsk/c0t0d0s1, offset 107479040
Apr 19 20:11:06 stvmm01 genunix: [ID 409368 kern.notice] ^M100% done: 4386 pages dumped, compression ratio 3.48,
Apr 19 20:11:06 stvmm01 genunix: [ID 851671 kern.notice] dump succeeded
Apr 20 14:37:07 stvmm01 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.8 Version Generic_108528-24 64-bit
Apr 20 14:37:07 stvmm01 genunix: [ID 913632 kern.notice] Copyright 1983-2003 Sun Microsystems, Inc.  All rights reserved.
Apr 20 14:37:07 stvmm01 genunix: [ID 678236 kern.info] Ethernet address = 8:0:20:f9:47:26
Apr 20 14:37:07 stvmm01 unix: [ID 389951 kern.info] mem = 262144K (0x10000000)
Apr 20 14:37:07 stvmm01 unix: [ID 930857 kern.info] avail mem = 251371520
Apr 20 14:37:07 stvmm01 rootnex: [ID 466748 kern.info] root nexus = Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 440MHz


 
 
 


Post by Tugge » Fri, 23 Apr 2004 18:42:00



> Sometimes the system stays up for a whole day !,  whats this NT, don't
> think so.
> whats going on with this......anyone have any thoughts.

> Apr 19 20:10:59 stvmm01 unix: [ID 836849 kern.notice]
> Apr 19 20:10:59 stvmm01 ^Mpanic[cpu0]/thread=30000f766e0:
> Apr 19 20:11:00 stvmm01 unix: [ID 799565 kern.notice] BAD TRAP:
> type=10 rp=2a10058ba50 addr=10035648 mmu_fsr=0

Looks a lot like a CPU error, although we have had Ultra 10s virtually
rebuilt that still had this type of problem.

Is it a 400MHz by any chance? If so, it's a known problem with the
external CPU cache.

More info here:
http://aa11.cjb.net/sun_managers/2000/09/msg00109.html

 
 
 


Post by Martin Pau » Fri, 23 Apr 2004 18:43:58



> whats going on with this......anyone have any thoughts.

> Apr 19 20:10:59 stvmm01 unix: [ID 836849 kern.notice]
> Apr 19 20:10:59 stvmm01 ^Mpanic[cpu0]/thread=30000f766e0:
> Apr 19 20:11:00 stvmm01 unix: [ID 799565 kern.notice] BAD TRAP:
> type=10 rp=2a10058ba50 addr=10035648 mmu_fsr=0
> Apr 19 20:11:00 stvmm01 unix: [ID 100000 kern.notice]
> Apr 19 20:11:00 stvmm01 unix: [ID 839527 kern.notice] mibiisa:
> Apr 19 20:11:00 stvmm01 unix: [ID 901337 kern.notice] illegal
> instruction fault:

I'd start by making sure that the most recent kernel patch and
the patch for mibiisa (108869-23) are installed. No need to hunt
for a bug that has probably been fixed long ago.

If you don't need SNMP services (which mibiisa belongs to) I'd
stop them by running "/etc/init.d/init.snmpdx stop" (you can
make this permanent by renaming /etc/rc3.d/S76snmpdx to e.g.
/etc/rc3.d/s76snmpdx).

mp.
--
Systems Administrator | Institute for Software Science | Univ. of Vienna

 
 
 


Post by Gavin Maltb » Fri, 23 Apr 2004 19:15:49


Hi


> Sometimes the system stays up for a whole day !,  whats this NT, don't
> think so.
> whats going on with this......anyone have any thoughts.

> Apr 19 20:10:59 stvmm01 unix: [ID 836849 kern.notice]
> Apr 19 20:10:59 stvmm01 ^Mpanic[cpu0]/thread=30000f766e0:
> Apr 19 20:11:00 stvmm01 unix: [ID 799565 kern.notice] BAD TRAP:
> type=10 rp=2a10058ba50 addr=10035648 mmu_fsr=0

Trap 0x10 is an illegal instruction trap.  If we execute an
illegal instruction while in the kernel ("privileged") then
we will panic (userland illegal instructions will just kill
the process but if the kernel is executing illegal instructions
then the system is hosed).

> Apr 19 20:11:00 stvmm01 unix: [ID 100000 kern.notice]
> Apr 19 20:11:00 stvmm01 unix: [ID 839527 kern.notice] mibiisa:
> Apr 19 20:11:00 stvmm01 unix: [ID 901337 kern.notice] illegal
> instruction fault:

Almost certainly nothing to do with mibiisa itself - it just happened
to be the process scheduled on cpu when we took the illegal instruction.
You'll see below that it is not an instruction of mibiisa that
faulted.

If other failures all show mibiisa being affected then this would
appear to be some software failure that could be induced by
some behaviour in mibiisa.  This is unlikely at the best of times,
but particularly so for bad instructions (on the odd occasion
that there have been bugs like that it's not been a case of the
kernel being induced into executing bad instructions - the instructions
the kernel executes are fixed and not influenced by userland behaviour).

> Apr 19 20:11:00 stvmm01 unix: [ID 381800 kern.notice] addr=0x10035648
> Apr 19 20:11:00 stvmm01 unix: [ID 101969 kern.notice] pid=324,
> pc=0x10035648, sp=0x2a10058b2f1, tstate=0x4400001601, context=0x1080

In 108528-24, which you're running, the %pc of 0x10035648 is:

 > 0x10035648/i
sun4`flush_user_windows_to_stack+0x1c:
                 call      -0x2e69c      <sun4`flush_user_windows>

The tstate value (some preserved processor state at the point we trapped)
indicates that we were privileged - so mibiisa had already trapped
into the kernel (I can't tell what for from the info we have;
the crash dump would likely tell us).
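As a rough illustration of how the tstate value can be read, here is a minimal Python sketch that splits it into fields. It assumes the generic SPARC V9 TSTATE layout (CCR in bits 39:32, ASI in 31:24, PSTATE in 19:8, CWP in 4:0) and only names the low six PSTATE bits; the helper name decode_tstate is made up for this example and is not part of any Solaris tool.

```python
# Sketch only: field positions follow the generic SPARC V9 TSTATE layout.
# decode_tstate is a hypothetical helper, not a Solaris or mdb function.
PSTATE_BITS = {0: "AG", 1: "IE", 2: "PRIV", 3: "AM", 4: "PEF", 5: "RED"}


def decode_tstate(tstate):
    """Split a TSTATE value into its CCR, ASI, PSTATE and CWP fields."""
    pstate = (tstate >> 8) & 0xFFF
    return {
        "ccr": (tstate >> 32) & 0xFF,   # condition codes at trap time
        "asi": (tstate >> 24) & 0xFF,   # address space identifier
        "pstate": pstate,               # processor state at trap time
        "cwp": tstate & 0x1F,           # current register window
        "flags": [name for bit, name in sorted(PSTATE_BITS.items())
                  if pstate & (1 << bit)],
    }


# Value taken from the panic message quoted above.
fields = decode_tstate(0x4400001601)
print(hex(fields["pstate"]), fields["flags"])
```

If PRIV shows up in the flags list, the trap was taken while already running in privileged (kernel) mode, which is the point being made about this panic.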

But the point is that this is a kernel instruction that has been corrupted.
flush_user_windows_to_stack is called pretty often for all sorts of
reasons, so we can be pretty certain that this is not the first time
we've executed the instruction.  It could have been corrupted in
all sorts of ways, eg bad memory, ecache error (we'd have expected
an error report for those - maybe one is buried in the crash dump), fried cpu,
kernel memory corruption by a wayward driver (usually not easy
to corrupt instructions) ...
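One way the corruption theory could be checked against the crash dump is to work out what the 32-bit word at the faulting %pc ought to be and compare it with what the dump actually holds there. A minimal sketch, assuming the standard SPARC call encoding (op=01 in bits 31:30, followed by a 30-bit signed word displacement); both helper names are invented for this example.

```python
# Sketch only: assumes the standard SPARC "call" format (op=01 + disp30).
# call_target and encode_call are hypothetical helper names.

def call_target(pc, byte_disp):
    """Address a pc-relative call would branch to."""
    return (pc + byte_disp) & 0xFFFFFFFFFFFFFFFF


def encode_call(byte_disp):
    """32-bit instruction word for a call with the given byte displacement."""
    if byte_disp % 4:
        raise ValueError("SPARC instructions are 4-byte aligned")
    return 0x40000000 | ((byte_disp >> 2) & 0x3FFFFFFF)


# pc and displacement as shown in the disassembly above.
pc = 0x10035648
disp = -0x2E69C

expected_word = encode_call(disp)
print(hex(call_target(pc, disp)), hex(expected_word))
```

If the word stored at that address in the dump differs from the expected one by a single bit, that would point towards memory or cache trouble; if it matches, the instruction may have been corrupted somewhere between memory and the cpu.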

A look at the other failures you've had could probably help decide.
eg, if it's always mibiisa, always the same %pc, always an
illegal instruction trap etc.  If the symptoms are pretty random
(eg, different trap types at different addresses) then hardware
would shoot right up the suspect list (and it's quite high already).
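To compare the failures, something along these lines could tabulate the trap type and address of every panic found in the messages file. This is only a sketch: the regular expression is an assumption based on the one BAD TRAP line posted here, and summarize_panics and looks_random are made-up names.

```python
import re
from collections import Counter

# Sketch only: the pattern is modelled on the single BAD TRAP line in this
# thread. \s+ also matches newlines, so wrapped log lines still match.
TRAP_RE = re.compile(
    r"BAD TRAP:\s+type=([0-9a-f]+)\s+rp=[0-9a-f]+\s+addr=([0-9a-f]+)"
)


def summarize_panics(text):
    """Count panics per (trap type, fault address) pair."""
    return Counter(
        (int(trap, 16), int(addr, 16)) for trap, addr in TRAP_RE.findall(text)
    )


def looks_random(summary):
    """True when the panics do not all share one trap type and address."""
    return len(summary) > 1


# The line from the original post; in practice the text would come from
# something like open("/var/adm/messages").read().
sample = (
    "Apr 19 20:11:00 stvmm01 unix: [ID 799565 kern.notice] BAD TRAP:\n"
    "type=10 rp=2a10058ba50 addr=10035648 mmu_fsr=0\n"
)
summary = summarize_panics(sample)
for (trap, addr), count in summary.items():
    print(f"type=0x{trap:x} addr=0x{addr:x} seen {count} time(s)")
```

A single repeated pair would suggest a reproducible fault; many different pairs would fit the hardware theory better.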

Gavin




 
 
 


Post by Gavin Maltb » Fri, 23 Apr 2004 19:28:35


[cut]

> If you don't need snmp services (which mibiisa belongs too) I'd
> stop them by running "/etc/init.d/init.snmpdx stop" (you can
> make this permanent by renaming /etc/rc3.d/S76snmpdx to e.g.
> /etc/rc3.d/s76snmpdx).

Unless (and I think it's unlikely) it's always mibiisa that is
affected I don't think this will help or is necessary.  The
illegal instruction occurred in the kernel, so unless it's
something that mibiisa could aggravate (extremely unlikely)
it's just an innocent victim.  An illegal instruction
fault for a user process will never panic the system.

Gavin

 
 
 


Post by Gavin Maltb » Fri, 23 Apr 2004 19:25:54




>> Sometimes the system stays up for a whole day !,  whats this NT, don't
>> think so.
>> whats going on with this......anyone have any thoughts.

>> Apr 19 20:10:59 stvmm01 unix: [ID 836849 kern.notice] Apr 19 20:10:59
>> stvmm01 ^Mpanic[cpu0]/thread=30000f766e0: Apr 19 20:11:00 stvmm01
>> unix: [ID 799565 kern.notice] BAD TRAP:
>> type=10 rp=2a10058ba50 addr=10035648 mmu_fsr=0

> Looks a lot like a CPU error,

(Per my other post) that's certainly the leading suspect.

> although we have had Ultra 10's virtually
> rebuilt and still having this type of problem

"This type of problem" is too vague.  Do you mean panics
from "BAD TRAP" in general (there are a zillion ways to achieve this),
or specifically illegal instruction bad trap panics?  All I'm saying
is that this is a pretty generic panic message (unless you spend
too much of your time dealing with them) so be careful in
lumping them all together as a single problem type.

> Is it a 400Mhz by any chance? If so, its a known problem with the
> external CPU cache.
> More info here:
> http://aa11.cjb.net/sun_managers/2000/09/msg00109.html

The original poster mentioned no async error reports such
as UE errors (as mentioned in the link).  The issue
mentioned there is that a parity error detected in the
external cache will panic the system.  These were more
likely on the faster cpus with bigger ecaches, but this
is an ultra10 and the ecache on those was pretty small
(I think 512K or similar vs 8MB).

I would definitely not class the observed failures as anything
to do with ecache (unless there are UE/EDP/LDP etc messages
that the OP has not mentioned).

Gavin

 
 
 


Post by Terry Pi » Sat, 24 Apr 2004 02:08:59


Thanks for the detailed analysis, I wasn't expecting that much depth,
cheers. Anyway, the server has just rebooted itself again. When it
comes back up, I will check the messages file and look for savecore.
(By the way, the only thing loaded on this box, apart from the OS and
patch revision, is BIND 9.2.2.)
I have swapped the DIMM modules and reseated them a number of times.
This did in fact work for about a week, then it started rebooting
itself daily again.
Terry
 
 
 


Post by Roland Main » Sun, 25 Apr 2004 10:31:53



> Thanks for the detailed analysis, I wasn't expecting that much depth -
> cheers, anyway, the server has just rebooted itsself again, when It
> comes back up, I will check the messages file and look for savecore.
> (by the way, the only thing loaded on this box, apart from the OS and
> patch revision, is BIND 9.2.2)
> I have swapped the DIMM modules and reseated them a number of times,
> this did in fact work for about a week - then it started booting
> itself daily again)

I remember that I had an Ultra5 long ago which was randomly running into
a PANIC after a memory upgrade... the problem was that it didn't like
the 60ns memory modules we added - replacing them with 50ns memory modules
fixed the problem...

----

Bye,
Roland

--
  __ .  . __

  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 2426 901568 FAX +49 2426 901569
 (;O/ \/ \O;)

 
 
 


Post by Roland Main » Sun, 25 Apr 2004 10:37:10


[snip]

> The original poster mentioned no async error reports such
> as UE errors (as mentioned in the link).  The issue
> mentioned there is that a parity error detected in the
> external cache will panic the system.  These were more
> likely on the faster cpus with bigger ecaches, but this
> is an ultra10 and the ecache on those was pretty small
> (I think 512K or similar vs 8MB).

AFAIK all the U5/U10 machines had 2MB Ecache except the first 360MHz
ones (if I remember it correctly they only had 256KB Ecache...
ewwwwww... ;-( ) ...
-- snip --
% /usr/platform/SUNW,Ultra-5_10/sbin/prtdiag
[snip]
                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 0     0     0      333     2.0   12       1.3
-- snip --

Newer UltraSPARC-IIi versions have smaller-but-onchip 2ndLevel caches.

----

Bye,
Roland

--
  __ .  . __

  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 2426 901568 FAX +49 2426 901569
 (;O/ \/ \O;)