sick Ultra 60

Post by Griff Miller II » Thu, 25 Jul 2002 05:32:32



A couple of months ago I made some posts asking for help diagnosing a sick
Ultra 60 at a remote site. An excerpt:

> In the past few weeks it has been more and more frequently crashing to the ok
> prompt with a "panic interrupt 14" message. If you power cycle it, it comes
> right back up (after fsck) and works fine for anywhere from an hour to a week,
> though lately the former figure is the more likely.

I have since replaced the entire machine, and have their old one on my bench.
The replacement seems to be running fine at the remote site.

I opened up the one that came back, cleaned it out (it had a lot of lint inside)
and reseated all the cards/processors/RAM. I then fired it up, and let it run.

It ran fine for a couple of weeks, but then again no one was putting it to heavy
use and we weren't making use of the Bit3 interface that is installed in our
production machines.

My point about the Bit3 is not meant to suggest anything; I'm just trying to
be as complete about the facts as I can.

Then, a few days ago, we were finally using the machine and making it sweat a
little (but still not making use of the Bit3) and it did something very
strange: it rebooted all by itself! A look in the logs after it came back up
reveals:

Jul 12 16:32:26 leo-build unix: WARNING: [AFT1] WP event on CPU0, errID 0x0000539e.5ace0078
Jul 12 16:32:26 leo-build unix:     AFSR 0x00000000.00800008<WP> AFAR 0x000001fe.01800800
Jul 12 16:32:26 leo-build unix:     AFSR.PSYND 0x0008(Score 95) AFSR.ETS 0x00 Fault_PC 0x31848
Jul 12 16:32:26 leo-build unix:     UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
Jul 12 16:32:29 leo-build unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x0000539f.25d84557
Jul 12 16:32:29 leo-build unix:     AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.a5dffbe8
Jul 12 16:32:29 leo-build unix:     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10020f98
Jul 12 16:32:29 leo-build unix:     UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Jul 12 16:32:29 leo-build unix:     UDBL Syndrome 0x3 Memory Module U1001 U1002 U1003 U1004
Jul 12 16:32:29 leo-build unix: WARNING: [AFT1] errID 0x0000539f.25d84557 Syndrome 0x3 indicates that this may not be a memory module problem
Jul 12 16:32:29 leo-build unix: [AFT2] errID 0x0000539f.25d84557 PA=0x00000000.a5dffbe8
Jul 12 16:32:29 leo-build unix:     E$tag 0x00000000.1ec014bb E$State: Exclusive E$parity 0x0f
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x00): 0x00000000.3d976458
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x08): 0x3d87671c.00000000
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x10): 0x00000000.3ddf057f
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x18): 0x3da27bda.00000000
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x20): 0x00000000.3ca69aae
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x28): 0x00000000.35b04e29 *Bad* PSYND=0x00ff
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x30): 0x00000000.00000000
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x38): 0x00000000.3e0c6549
Jul 12 16:32:29 leo-build unix: NOTICE: Scheduling clearing of error on page 0x00000000.a5dfe000
Jul 12 16:32:29 leo-build unix: [AFT3] errID 0x0000539f.25d84557 Above Error detected by protected Kernel code
Jul 12 16:32:29 leo-build unix:     that will try to clear error from system
Jul 12 16:32:29 leo-build unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x0000539f.27340c72
Jul 12 16:32:29 leo-build unix:     AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.a5dffbe8
Jul 12 16:32:29 leo-build unix:     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10020f98
Jul 12 16:32:29 leo-build unix:     UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Jul 12 16:32:29 leo-build unix:     UDBL Syndrome 0x3 Memory Module U1001 U1002 U1003 U1004
Jul 12 16:32:29 leo-build unix: WARNING: [AFT1] errID 0x0000539f.27340c72 Syndrome 0x3 indicates that this may not be a memory module problem
Jul 12 16:32:29 leo-build unix: [AFT2] errID 0x0000539f.27340c72 PA=0x00000000.a5dffbe8
Jul 12 16:32:29 leo-build unix:     E$tag 0x00000000.1ec014bb E$State: Exclusive E$parity 0x0f
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x00): 0x00000000.3d976458
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x08): 0x3d87671c.00000000
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x10): 0x00000000.3ddf057f
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x18): 0x3da27bda.00000000
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x20): 0x00000000.3ca69aae
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x28): 0x00000000.35b04e29 *Bad* PSYND=0x00ff
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x30): 0x00000000.00000000
Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x38): 0x00000000.3e0c6549
Jul 12 16:32:29 leo-build unix: NOTICE: Scheduling clearing of error on page 0x00000000.a5dfe000
Jul 12 16:32:29 leo-build unix: [AFT3] errID 0x0000539f.27340c72 Above Error detected by protected Kernel code
Jul 12 16:32:29 leo-build unix:     that will try to clear error from system
Jul 12 16:32:32 leo-build unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x0000539f.b32b04ac
Jul 12 16:32:32 leo-build unix:     AFSR 0x00000000.00200000<UE> AFAR 0x00000000.a5dffbe8
Jul 12 16:32:32 leo-build unix:     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x15dc0
Jul 12 16:32:32 leo-build unix:     UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Jul 12 16:32:32 leo-build unix:     UDBL Syndrome 0x3 Memory Module U1001 U1002 U1003 U1004
Jul 12 16:32:32 leo-build unix: WARNING: [AFT1] errID 0x0000539f.b32b04ac Syndrome 0x3 indicates that this may not be a memory module problem
Jul 12 16:32:32 leo-build unix: [AFT2] errID 0x0000539f.b32b04ac PA=0x00000000.a5dffbe8
Jul 12 16:32:32 leo-build unix:     E$tag 0x00000000.1ec014bb E$State: Exclusive E$parity 0x0f
Jul 12 16:32:32 leo-build unix: [AFT2] E$Data (0x00): 0x00000000.3d976458
Jul 12 16:32:32 leo-build unix: [AFT2] E$Data (0x08): 0x3d87671c.00000000
Jul 12 16:32:32 leo-build unix: [AFT2] E$Data (0x10): 0x00000000.3ddf057f
Jul 12 16:32:32 leo-build unix: [AFT2] E$Data (0x18): 0x3da27bda.00000000
Jul 12 16:32:32 leo-build unix: [AFT2] E$Data (0x20): 0x00000000.3ca69aae
Jul 12 16:32:32 leo-build unix: [AFT2] E$Data (0x28): 0x00000000.35b04e29 *Bad* PSYND=0x00ff
Jul 12 16:32:32 leo-build unix: [AFT2] E$Data (0x30): 0x00000000.00000000
Jul 12 16:32:32 leo-build unix: [AFT2] E$Data (0x38): 0x00000000.3e0c6549
Jul 12 16:32:32 leo-build unix: NOTICE: Scheduling clearing of error on page 0x00000000.a5dfe000
Jul 12 16:32:32 leo-build unix: [AFT3] errID 0x0000539f.b32b04ac Above Error is in User Mode
Jul 12 16:32:32 leo-build unix:     and is fatal: will reboot
Jul 12 16:32:32 leo-build unix: WARNING: [AFT1] initiating reboot due to above error in pid 5473 (crrctat)
Jul 12 16:32:36 leo-build unix: NOTICE: Previously reported error on page 0x00000000.a5dfe000 cleared
Jul 12 16:32:37 leo-build syslogd: going down on signal 15
Jul 12 16:34:59 leo-build unix: cpu0: SUNW,UltraSPARC-II (upaid 0 impl 0x11 ver 0xa0 clock 450 MHz)
Jul 12 16:34:59 leo-build unix: cpu1: SUNW,UltraSPARC-II (upaid 2 impl 0x11 ver 0xa0 clock 450 MHz)
Jul 12 16:34:59 leo-build unix: SunOS Release 5.6 Version Generic_105181-32 [UNIX(R) System V Release 4.0]
Jul 12 16:34:59 leo-build unix: Copyright (c) 1983-1997, Sun Microsystems, Inc.
Jul 12 16:34:59 leo-build unix: mem = 2097152K (0x80000000)
Jul 12 16:34:59 leo-build unix: avail mem = 2088091648

...and so on with the normal boot.

It looks like bad memory at first glance, but what's that bit about "Syndrome 0x3
indicates that this may not be a memory module problem" ?
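
For anyone wanting to pull the key fields out of a log like this, here is a minimal Python sketch. It is not a Sun tool. The bit positions in `AFSR_FLAGS` are only those that can be read off the log itself (0x00800008 is printed as WP, 0x80200000 as PRIV,UE, 0x00200000 as UE); the function names are made up for illustration, and any other AFSR bits are deliberately left out.

```python
import re

# Bit positions inferred from the log itself: 0x00800008 is printed as <WP>,
# 0x80200000 as <PRIV,UE> and 0x00200000 as <UE>. Other AFSR bits are omitted.
AFSR_FLAGS = {31: "PRIV", 23: "WP", 21: "UE"}

AFSR_LINE = re.compile(
    r"AFSR (0x[0-9a-f]{8}\.[0-9a-f]{8})<[^>]*> AFAR (0x[0-9a-f]{8}\.[0-9a-f]{8})"
)


def parse_split_hex(text):
    """Turn the kernel's '0xHHHHHHHH.LLLLLLLL' notation into one integer."""
    high, low = text[2:].split(".")
    return (int(high, 16) << 32) | int(low, 16)


def decode_afsr(value):
    """Return (flag names, parity syndrome) for an AFSR value."""
    flags = [name for bit, name in sorted(AFSR_FLAGS.items(), reverse=True)
             if value & (1 << bit)]
    return flags, value & 0xFFFF  # the low 16 bits are what the log calls AFSR.PSYND


def summarize(log_text):
    """Yield (flags, psynd, afar) for every AFSR line found in log_text."""
    for line in log_text.splitlines():
        match = AFSR_LINE.search(line.replace("^M", ""))  # tolerate pasted ^M debris
        if match:
            flags, psynd = decode_afsr(parse_split_hex(match.group(1)))
            yield flags, psynd, parse_split_hex(match.group(2))


SAMPLE = """\
Jul 12 16:32:26 leo-build unix:     AFSR 0x00000000.00800008<WP> AFAR 0x000001fe.01800800^M
Jul 12 16:32:29 leo-build unix:     AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.a5dffbe8^M
"""

for flags, psynd, afar in summarize(SAMPLE):
    print(flags, hex(psynd), hex(afar))
```

On the two sample lines this reports a WP event with parity syndrome 0x8, followed by a privileged-mode UE at physical address 0xa5dffbe8.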

Thanks in advance for any help anyone can provide.

--
Griff Miller II                   |                                           |
Manager of Information Technology | "Do Lipton employees take coffee breaks?" |
Positron Corporation              |                                           |

 
 
 

sick Ultra 60

Post by Ed Wensell III » Thu, 25 Jul 2002 11:00:02



> Jul 12 16:32:26 leo-build unix: WARNING: [AFT1] WP event on CPU0, errID 0x0000539e.5ace0078
> Jul 12 16:32:26 leo-build unix:     AFSR 0x00000000.00800008<WP> AFAR 0x000001fe.01800800
> Jul 12 16:32:26 leo-build unix:     AFSR.PSYND 0x0008(Score 95) AFSR.ETS 0x00 Fault_PC 0x31848
> Jul 12 16:32:26 leo-build unix:     UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00

I have seen similar errors on several Netra t1400 systems manufactured in
2000. The error points to the CPU cache. Sometimes patching the kernel
to the latest rev helped. Usually it took a CPU module replacement to
fix it completely. Then again, I have support contracts to help out...

I won't say there was a bad batch of CPU modules, but let's just say I
have encountered quite a few modules that had cache which were overly
sensitive to sun spots and stray electrons (in 'Telco Grade' hardware no
less). Do a Google and groups.google.com search on 'sun cpu aft1' and
you'll see more messages to the same effect.

--
Ed Wensell III
NetBSD/Alpha at home - Solaris/SPARC at work - OpenVMS in a past life
E-mail address is valid if you know the appropriate bits to drop.

 
 
 

sick Ultra 60

Post by Ron Kelle » Thu, 25 Jul 2002 06:06:33




> A couple of months ago I made some posts asking for help diagnosing a sick
> Ultra 60 at a remote site. An excerpt:

<snip>

> Jul 12 16:32:26 leo-build unix: WARNING: [AFT1] WP event on CPU0, errID 0x0000539e.5ace0078
> Jul 12 16:32:26 leo-build unix:     AFSR 0x00000000.00800008<WP> AFAR 0x000001fe.01800800
> Jul 12 16:32:26 leo-build unix:     AFSR.PSYND 0x0008(Score 95) AFSR.ETS 0x00 Fault_PC 0x31848
> Jul 12 16:32:26 leo-build unix:     UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
> Jul 12 16:32:29 leo-build unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x0000539f.25d84557
> Jul 12 16:32:29 leo-build unix:     AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.a5dffbe8
> Jul 12 16:32:29 leo-build unix:     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10020f98
> Jul 12 16:32:29 leo-build unix:     UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
> Jul 12 16:32:29 leo-build unix:     UDBL Syndrome 0x3 Memory Module U1001 U1002 U1003 U1004
> Jul 12 16:32:29 leo-build unix: WARNING: [AFT1] errID 0x0000539f.25d84557 Syndrome 0x3 indicates that this may not be a memory module problem
> Jul 12 16:32:29 leo-build unix: [AFT2] errID 0x0000539f.25d84557 PA=0x00000000.a5dffbe8
> Jul 12 16:32:29 leo-build unix:     E$tag 0x00000000.1ec014bb E$State: Exclusive E$parity 0x0f
> Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x00): 0x00000000.3d976458
> Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x08): 0x3d87671c.00000000
> Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x10): 0x00000000.3ddf057f
> Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x18): 0x3da27bda.00000000
> Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x20): 0x00000000.3ca69aae
> Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x28): 0x00000000.35b04e29 *Bad* PSYND=0x00ff
> Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x30): 0x00000000.00000000
> Jul 12 16:32:29 leo-build unix: [AFT2] E$Data (0x38): 0x00000000.3e0c6549
> Jul 12 16:32:29 leo-build unix: NOTICE: Scheduling clearing of error on page 0x00000000.a5dfe000
> Jul 12 16:32:29 leo-build unix: [AFT3] errID 0x0000539f.25d84557 Above Error detected by protected Kernel code
> Jul 12 16:32:29 leo-build unix:     that will try to clear error from system
> Jul 12 16:32:29 leo-build unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x0000539f.27340c72
> Jul 12 16:32:29 leo-build unix:     AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.a5dffbe8
> Jul 12 16:32:29 leo-build unix:     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10020f98
> Jul 12 16:32:29 leo-build unix:     UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
> Jul 12 16:32:29 leo-build unix:     UDBL Syndrome 0x3 Memory Module U1001 U1002 U1003 U1004
> Jul 12 16:32:29 leo-build unix: WARNING: [AFT1] errID 0x0000539f.27340c72 Syndrome 0x3 indicates that this may not be a memory module problem
> Jul 12 16:32:29 leo-build unix: [AFT2] errID 0x0000539f.27340c72 PA=0x00000000.a5dffbe8
> Jul 12 16:32:29 leo-build unix:     E$tag 0x00000000.1ec014bb E$State: Exclusive E$parity 0x0f

<snip>

We had a very similar problem on one of our Ultra-2s just the other day.
According to the log files (and our tech-support folks), the L2-cache on one
of the CPUs went bad.  At first glance, it looks like a memory problem.
But, looking a little closer at the log file, I noticed the entry:

E$tag 0x00000000.1ec014bb E$State: Exclusive E$parity 0x0f

That seems to indicate an L2-cache problem on CPU0.
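
A small consistency check on the quoted numbers, as a Python sketch. It assumes a 64-byte E-cache line (the log prints eight 8-byte E$Data rows, 0x00 through 0x38) and an 8 KB page size; the page size is an assumption about this platform and is not stated anywhere in the log. The function names are made up for illustration.

```python
ECACHE_LINE_BYTES = 64  # eight 8-byte E$Data rows (0x00..0x38) appear in the log
PAGE_BYTES = 8192       # assumption: 8 KB pages on this system


def ecache_row_offset(pa):
    """Offset, within its 64-byte line, of the 8-byte E$Data row that holds pa."""
    return pa & (ECACHE_LINE_BYTES - 1) & ~0x7


def page_base(pa):
    """Start address of the page that contains pa."""
    return pa & ~(PAGE_BYTES - 1)


pa = 0x00000000A5DFFBE8  # the PA/AFAR reported for all three UE events

print(hex(ecache_row_offset(pa)))  # 0x28
print(hex(page_base(pa)))          # 0xa5dfe000
```

The faulting address falls in row 0x28, which is the E$Data row the kernel marks *Bad*, and its page base matches the page named in the "Scheduling clearing of error" notice. This only shows that the report is internally consistent; it does not by itself prove which component is at fault.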

Looks like you may have the same problem.

-Ron

 
 
 

sick Ultra 60

Post by Dr. David Kirkby » Thu, 25 Jul 2002 16:40:02



> I won't say there was a bad batch of CPU modules, but let's just say I
> have encountered quite a few modules that had cache which were overly
> sensitive to sun spots and stray electrons (in 'Telco Grade' hardware no
> less). Do a Google and groups.google.com search on 'sun cpu aft1' and
> you'll see more messages to the same effect.

Are the CPUs you are aware of the faster (450 MHz) units ? I have a couple in my
Ultra 60, but before buying them from eBay, I was warned by someone of this
problem on the 450 MHz CPUs. I'm not sure if it affects the 360's too, but not
the 300s from what I have heard.

--
Dr. David Kirkby PhD,

web page: http://www.david-kirkby.co.uk
Amateur radio callsign: G8WRB

 
 
 

sick Ultra 60

Post by Scott Lawso » Thu, 25 Jul 2002 17:24:53


Sounds like you have a cache problem on CPU0. By the sound of your post you
have two CPUs? If so, pull out CPU0 and put CPU1 in its place, then run with
a single CPU for a while and see if it is stable. Also upgrade to the latest
kernel rev, as there have been problems with kernels being overly sensitive
to certain processor conditions (temperature).

Check your fans and also the heatsinks on the CPUs; make sure they are not
loose. If it is still in warranty, call Sun!

 
 
 

sick Ultra 60

Post by Chris Thompson » Thu, 25 Jul 2002 22:33:59




[...]

>It looks like bad memory at first glance, but what's that bit about "Syndrome 0x3
>indicates that this may not be a memory module problem" ?

When a dirty E-cache line is flushed to main memory, and is found to have bad
parity, then main memory is written with deliberately "uncorrectable" ECC, and
instruction flow is not affected. (This "WP" event is reported as an interrupt
unrelated to the current instruction flow: technically a "disrupting trap".)

The idea is that if the contents of the main memory line are never referenced
again, then no harm has been done. But of course, they usually are, and then a
UE event happens.

The message about syndrome 0x3 is saying "this looks like the sort of uncorrectable
ECC caused by a previous WP event, in which case it's not the main memory that
is at fault".
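
The reasoning above can be turned into a rough log triage, sketched here in Python. The regular expressions are based only on the message wording quoted in this thread, and the verdict strings and function name are made up for illustration. It is a heuristic for reading such logs, not a diagnostic tool.

```python
import re

EVENT = re.compile(r"\[AFT1\] (WP event|Uncorrectable Memory Error) on CPU(\d+)")
SYNDROME = re.compile(r"UDB[HL] Syndrome (0x[0-9a-fA-F]+)")


def classify(log_lines):
    """Return (cpu, verdict) for each uncorrectable error found in log_lines."""
    wp_seen = False  # has any WP event been logged so far?
    ue_cpu = None    # CPU of the UE whose syndrome line is still expected
    verdicts = []
    for line in log_lines:
        event = EVENT.search(line)
        if event:
            if event.group(1) == "WP event":
                wp_seen = True
                ue_cpu = None
            else:
                ue_cpu = int(event.group(2))
            continue
        syndrome = SYNDROME.search(line)
        if syndrome and ue_cpu is not None:
            if int(syndrome.group(1), 16) != 0x3:
                verdict = "other syndrome"
            elif wp_seen:
                verdict = "poisoned by earlier WP"
            else:
                verdict = "syndrome 0x3, no WP logged"
            verdicts.append((ue_cpu, verdict))
            ue_cpu = None
    return verdicts


LINES = [
    "WARNING: [AFT1] WP event on CPU0, errID 0x0000539e.5ace0078",
    "WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x0000539f.25d84557",
    "    UDBL Syndrome 0x3 Memory Module U1001 U1002 U1003 U1004",
]

print(classify(LINES))  # [(0, 'poisoned by earlier WP')]
```

A verdict of "poisoned by earlier WP" corresponds to the case described above: the UE is the delayed consequence of a cache line that was written back with bad parity, so the memory modules named in the message are probably not the culprit.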

The UltraSPARC User's Manual (802-7220-02) contains a lot of information about
the various sorts of cache and main memory errors on UltraSPARC-1 and -2 systems:
you should read it if you are trying to understand error reports like yours.

Chris Thompson
Email: cet1 [at] cam.ac.uk

 
 
 

sick Ultra 60

Post by Ed Wensell III » Fri, 26 Jul 2002 07:18:52



> Are the CPUs you are aware of the faster (450 MHz) units ? I have a couple in my
> Ultra 60, but before buying them from eBay, I was warned by someone of this
> problem on the 450 MHz CPUs. I'm not sure if it affects the 360's too, but not
> the 300s from what I have heard.

I have had this happen with several 440Mhz units in Netras and one
400Mhz in an Ultra5. Support contract replaced the ones in the Netras,
but the Ultra5 was not under contract and was six months out of warranty
(of course).

I almost bought a replacement for the Ultra5 ($1200 new Mar2002... ugh!)
when Googling usenet found me this:

http://groups.google.com/groups?selm=3AA7E503.6DA8884C%40aol.com&outp...

Gave the info to my local support rep and the numbers for the CPU in my
dead Ultra5 matched up. Got a free replacement. I'm not sure if this
covers all USII modules in all systems, but it's worth checking out in
the event you have problems. Later modules seem to be ok. Haven't had
any trouble for some time now (knock on wood).

--
Ed Wensell III
NetBSD/Alpha at home - Solaris/SPARC at work - OpenVMS in a past life
E-mail address is valid if you know the appropriate bits to drop.

 
 
 

sick Ultra 60

Post by Dr. David Kirkby » Fri, 26 Jul 2002 08:22:42



> I have had this happen with several 440Mhz units in Netras and one
> 400Mhz in an Ultra5. Support contract replaced the ones in the Netras,
> but the Ultra5 was not under contract and was six months out of warranty
> (of course).

> I almost bought a replacement for the Ultra5 ($1200 new Mar2002... ugh!)
> when Googling usenet found me this:

> http://groups.google.com/groups?selm=3AA7E503.6DA8884C%40aol.com&outp...

From that, it does seem to indicate this is a real problem, although that post
refers to U5s and U10s, which take different cpus to the Ultra 60 in question.
However, how different I don't know - they probably share a lot in common.

Someone I trust first mentioned it to me when he was aware I was looking to
replace my U60's 300 MHz CPUs with 450 MHz versions. He thought it was probably
more an issue of heat, since 450's generate more heat than 300s. After his
warning, I made sure to clean fans/case of dust, after replacing the CPUs.

Sun should do the decent thing and replace any affected component. Covering it
up does nobody favours. Intel had to replace early Pentiums due to the FDIV
problem - Sun should do likewise with any component that clearly had a
manufacturing fault.

The FDIV problem convinced me of one thing - never to buy a Dell PC again. I did
not blame Dell for the problem of my Dell PC (mistakes do happen and clearly
Intel made this one), but it was Dell's attitude about the whole thing that made
me so angry. Dell seemed unwilling to do anything about it, saying I should
contact Intel, despite the fact Dell had my company's money, not Intel. Under
UK law at least, Dell was responsible, not Intel, but they seemed unwilling to
accept that responsibility.

I think IT professionals accept problems do sometimes occur (like the FDIV error
on Pentiums, or cache problems on UltraSPARCs). It is attempting to cover them
up which annoys people and makes them wary of using a product in future.

If a company is honest about their mistakes, people will put trust in that
company. Sun needs that, since clearly they can not compete on price/performance
ratio with GNU/Linux PCs.
--
Dr. David Kirkby PhD,

web page: http://www.david-kirkby.co.uk
Amateur radio callsign: G8WRB

 
 
 
