Panic k_trap on OSR5

Post by Lucky Leavel » Sat, 27 Jul 1996 04:00:00



Just an additional post to add the output of crash trace and user commands.

Thank you,
Lucky

Lucky Leavell                      Phone: (800) 481-2393  or  (812) 945-6555
UniXpress - Your Source for SCO      FAX: (812) 949-9233

New Albany, IN 47150-2013                 71534,2674 (CompuServe)
WWW Home Page:  http://www.UniXpress.com   ftp://www.iglou.com/members/ris

> trace

KERNEL STACK TRACE FOR PROCESS 117:
STKADDR   FRAMEPTR  FUNCTION   POSSIBLE ARGUMENTS
e0000b98  e0000c0c  prf_task_s (0x4,0,0x5,0xe)
e0000c14  e0000c2c  cmn_err    (0x3,dmsize+0x210,0xe,u+0xc6c)
e0000c34  e0000c60  k_trap     (u+0xc6c)
          e0000c6c  kern_trap  from 0xf0156f62 in ifreeget  
  ax:       0 cx:    4099 dx:       0 bx:       5 fl:    10202 ds: 160 fs:   0
  sp:e0000c9c bp:e0000cc4 si:       1 di:fb901bf0 err:       2 es: 160 gs:   0
e0000c74  e0000cc4  ifreeget   (0x5,u+0xd1c,mount,inode+0x10c20)
e0000ccc  e0000cf0  igetput    (mount,0x5799,0x18,inode+0x10c20)
e0000cf8  e0000dd4  namei      (upath,0,u+0x1148)
e0000ddc  e0000de8  cmn_stat   (0x33,0x80585e0,0x8047b3c,u+0xe34)
e0000df0  e0000e00  xstat      (0x8005828c,0x80585c1,0x80585e0,u+0xe28)
e0000e08  e0000e28  systrap    (u+0xe34)
          e0000e34  scall_noke from 0x800504c9
  ax:      7b cx: 80585e0 dx: 8047b3c bx:8005828c fl:      202 ds:  1f fs:   0
  sp:e0000e64 bp: 8047b18 si: 80585c1 di: 80585e0 err:      7b es:  1f gs:   0

> user

PER PROCESS USER AREA FOR PROCESS 117
USER ID's:      uid: 5, gid: 5, real uid: 5, real gid: 5
        supplementary gids: 5
PROCESS TIMES:  user: 0, sys: 10, child user: 0, child sys: 0
PROCESS MISC:
        command: sh, psargs: sh -c /usr/lib/uucp/uudemon.poll > /dev/null
        proc: P#117, cntrl tty: maj(??) min(??)
        start: Fri Jul 26 10:10:00 1996
        mem: 0x4af, type: fork
        proc/text lock: none
        current directory: I#755
OPEN FILES AND POFILE FLAGS:
          [ 0]: F#345        [ 1]: F#414        [ 2]: F#357      
          [59]: F#445 c r  
FILE I/O:
        u_base: 0x80580e8, file offset: 180, bytes: 16
        segment: data, cmask: 0022, ulimit: 2097151
        file mode(s): read
SIGNAL DISPOSITION:
        sig#      signal oldmask sigmask
           1: ignore        -     1
           2: ignore        -     2
           3: ignore        -     3
           4:  0x804afb0    -     4
           5:  0x804afb0    -     5
           6:  0x804afb0    -     6
           7:  0x804afb0    -     7
           8:  0x804afb0    -     8
          10:  0x804afb0    -    10
          11:  0x804b100    -    11
          12:  0x804afb0    -    12
          13:  0x804afb0    -    13
          14:  0x804b100    -    14
          15:  0x804b100    -    15
          16:  0x804afb0    -    16
          17:  0x804afb0    -    17
 
 
 

Panic k_trap on OSR5

Post by Jean-Pierre Radl » Sat, 27 Jul 1996 04:00:00


Lucky Leavell writes:
> Just had a client machine do it the second time this week.  I had them
> save the dump file to tape, reloaded it to /tmp and looked at it with ye
> olde crash. Upon inspecting the panic messages area, the output was:

Is oss434a applied?

--


 
 
 

Panic k_trap on OSR5

Post by Lucky Leavel » Sat, 27 Jul 1996 04:00:00



> Is oss434a applied?

Yes, it is.

Thank you,
Lucky

Lucky Leavell                            Phone: (812) 945-6555
Relational Information Systems, Inc.       FAX: (812) 949-9233

New Albany, IN 47150-2013                       71534,2674 (CompuServe)
WWW Home Page:  http://www.iglou.com/ris   ftp://www.iglou.com/members/ris

 
 
 

Panic k_trap on OSR5

Post by Lucky Leavel » Sat, 27 Jul 1996 04:00:00


Just had a client machine do it the second time this week.  I had them
save the dump file to tape, reloaded it to /tmp and looked at it with ye
olde crash. Upon inspecting the panic messages area, the output was:

> panic

System Messages:

S init H HPPS initH HS init  H HTFS initH XENIX initH NFS init  H finit   H strinitH ksl initH iinit   H flckinitH seminit H msginitH xsdinitH xseminitH cfgmsginitI cpyrtstartI sockstart I arpstart I ipstart I ripstartI igmpstartI ic
mpstartI lo_start I tcpstartI udpstartI incfstartI rtestart mem: total = 16000k, kernel = 5116k, user = 10884k
J         K swapdev = 1/41, swplo = 0, nswap = 144000, swapmem = 72000k
M rootdev = 1/42, pipedev = 1/42, dumpdev = 1/41
kernel: Hz = 100, i/o bufs = 1424k

%disk     -             -       -       type=S ha=0 id=3 lun=0 bus=0 ht=ad
%Sdsk     -             -       -       cyls=511 hds=64 secs=32 fts=sb
%disk     -             -       -       type=S ha=0 id=1 lun=0 bus=0 ht=ad
%Sdsk     -             -       -       cyls=202 hds=64 secs=32 fts=sb

WARNING: NLM: RPC call failed: RPC error: RPC_PMAPFAILURE, errno 0

WARNING: NLM: RPC call failed: RPC error: RPC_PMAPFAILURE, errno 0

WARNING: NLM: RPC call failed: RPC error: RPC_PMAPFAILURE, errno 0

WARNING: HTFS: Invalid block 14720819 on dev hd (1/168)
%Stp-0    -             -       -       Vendor=TANDBERG Product= TDC 3800
Unexpected trap in kernel mode:
cr0 0x8001001B     cr2  0x00000008     cr3 0x00002000     tlb  0x00000000
ss  0x0000E000     uesp 0xF02719C4     efl 0x00010202     ipl  0x00000000
cs  0x00000158     eip  0xF0156F62     err 0x00000002     trap 0x0000000E
eax 0x00000000     ecx  0x00004099     edx 0x00000000     ebx  0x00000005
esp 0xE0000C9C     ebp  0xE0000CC4     esi 0x00000001     edi 0xFB901BF0
ds  0x00000160     es   0x00000160     fs  0x00000000     gs   0x00000000
cpu 0x00000001

PANIC: k_trap - Kernel mode trap type 0x0000000E
Trying to dump 4000 pages to dumpdev hd (1/41) at block 0, 50 pages per '.'
.....................................................................H DTF

Panic String: k_trap - Kernel mode trap type 0x%x

Kernel Trap.  Kernel Registers saved at 0xe0000c6c
ERR=2, TRAPNO=14
cs:eip=0158:f0156f62 Flags=10202
ds = 0160   es = 0160   fs = 0000   gs = 0000
esi= 00000001   edi= fb901bf0   ebp= e0000cc4   esp= e0000c9c
eax= 00000000   ebx= 00000005   ecx= 00004099   edx= 00000000

Kernel Stack before Trap:
STKADDR   FRAMEPTR  FUNCTION   POSSIBLE ARGUMENTS
e0000c9c  e0000cc4  ifreeget   (0x5,u+0xd1c,mount,inode+0x10c20)
e0000ccc  e0000cf0  igetput    (mount,0x5799,0x18,inode+0x10c20)
e0000cf8  e0000dd4  namei      (upath,0,u+0x1148)
e0000ddc  e0000de8  cmn_stat   (0x33,0x80585e0,0x8047b3c,u+0xe34)
e0000df0  e0000e00  xstat      (0x8005828c,0x80585c1,0x80585e0,u+0xe28)
e0000e08  e0000e28  systrap    (u+0xe34)

The following line

        WARNING: HTFS: Invalid block 14720819 on dev hd (1/168)

appears in neither the /usr/adm/messages nor /usr/adm/syslog file. The
/dev file for (1/168) is /dev/usr2 which is the oldest (a Maxtor 7213S).

Questions:

    1. Just what is an "Invalid block " message signifying?
    2. Does this mean the hd is dying and needs to be replaced?
    3. Suggestions, caveats, etc.?

Their system is a no-name P90 with 16MB (probably non-parity) RAM running
OSR5 with Rel. Supp. D, net100, oss434a (and a few others).

Thank you,
Lucky

Lucky Leavell                      Phone: (800) 481-2393  or  (812) 945-6555
UniXpress - Your Source for SCO      FAX: (812) 949-9233

New Albany, IN 47150-2013                 71534,2674 (CompuServe)
WWW Home Page:  http://www.UniXpress.com   ftp://www.iglou.com/members/ris

 
 
 

Panic k_trap on OSR5

Post by Steve Ra » Sat, 27 Jul 1996 04:00:00




>Just had a client machine do it the second time this week.  I had them
>save the dump file to tape, reloaded it to /tmp and looked at it with ye
>olde crash. Upon inspecting the panic messages area, the output was:

>> panic
>System Messages:

>S init H HPPS initH HS init  H HTFS initH XENIX initH NFS init  H finit   H strinitH ksl initH iinit   H flckinitH seminit H msginitH xsdinitH xseminitH cfgmsginitI cpyrtstartI sockstart I arpstart I ipstart I ripstartI igmpstartI ic
>mpstartI lo_start I tcpstartI udpstartI incfstartI rtestart mem: total = 16000k, kernel = 5116k, user = 10884k
>J         K swapdev = 1/41, swplo = 0, nswap = 144000, swapmem = 72000k
>M rootdev = 1/42, pipedev = 1/42, dumpdev = 1/41
>kernel: Hz = 100, i/o bufs = 1424k

>%disk     -         -       -       type=S ha=0 id=3 lun=0 bus=0 ht=ad
>%Sdsk     -         -       -       cyls=511 hds=64 secs=32 fts=sb
>%disk     -         -       -       type=S ha=0 id=1 lun=0 bus=0 ht=ad
>%Sdsk     -         -       -       cyls=202 hds=64 secs=32 fts=sb

>WARNING: NLM: RPC call failed: RPC error: RPC_PMAPFAILURE, errno 0

>WARNING: NLM: RPC call failed: RPC error: RPC_PMAPFAILURE, errno 0

>WARNING: NLM: RPC call failed: RPC error: RPC_PMAPFAILURE, errno 0

>WARNING: HTFS: Invalid block 14720819 on dev hd (1/168)
>%Stp-0    -         -       -       Vendor=TANDBERG Product= TDC 3800
>Unexpected trap in kernel mode:
>cr0 0x8001001B     cr2  0x00000008     cr3 0x00002000     tlb  0x00000000
>ss  0x0000E000     uesp 0xF02719C4     efl 0x00010202     ipl  0x00000000
>cs  0x00000158     eip  0xF0156F62     err 0x00000002     trap 0x0000000E
>eax 0x00000000     ecx  0x00004099     edx 0x00000000     ebx  0x00000005
>esp 0xE0000C9C     ebp  0xE0000CC4     esi 0x00000001     edi 0xFB901BF0
>ds  0x00000160     es   0x00000160     fs  0x00000000     gs   0x00000000
>cpu 0x00000001

>PANIC: k_trap - Kernel mode trap type 0x0000000E
>Trying to dump 4000 pages to dumpdev hd (1/41) at block 0, 50 pages per '.'
>.....................................................................H DTF

>Panic String: k_trap - Kernel mode trap type 0x%x

>Kernel Trap.  Kernel Registers saved at 0xe0000c6c
>ERR=2, TRAPNO=14
>cs:eip=0158:f0156f62 Flags=10202
>ds = 0160   es = 0160   fs = 0000   gs = 0000
>esi= 00000001   edi= fb901bf0   ebp= e0000cc4   esp= e0000c9c
>eax= 00000000   ebx= 00000005   ecx= 00004099   edx= 00000000

>Kernel Stack before Trap:
>STKADDR   FRAMEPTR  FUNCTION   POSSIBLE ARGUMENTS
>e0000c9c  e0000cc4  ifreeget   (0x5,u+0xd1c,mount,inode+0x10c20)
>e0000ccc  e0000cf0  igetput    (mount,0x5799,0x18,inode+0x10c20)
>e0000cf8  e0000dd4  namei      (upath,0,u+0x1148)
>e0000ddc  e0000de8  cmn_stat   (0x33,0x80585e0,0x8047b3c,u+0xe34)
>e0000df0  e0000e00  xstat      (0x8005828c,0x80585c1,0x80585e0,u+0xe28)
>e0000e08  e0000e28  systrap    (u+0xe34)

>The following line

>    WARNING: HTFS: Invalid block 14720819 on dev hd (1/168)

>appears in neither the /usr/adm/messages nor /usr/adm/syslog file. The
>/dev file for (1/168) is /dev/usr2 which is the oldest (a Maxtor 7213S).

>Questions:

>    1. Just what is an "Invalid block " message signifying?
>    2. Does this mean the hd is dying and needs to be replaced?
>    3. Suggestions, caveats, etc.?

>Their system is a no-name P90 with 16MB (probably non-parity) RAM running
>OSR5 with Rel. Supp. D, net100, oss434a (and a few others).

>Thank you,
>Lucky

>Lucky Leavell                      Phone: (800) 481-2393  or  (812) 945-6555
>UniXpress - Your Source for SCO      FAX: (812) 949-9233

>New Albany, IN 47150-2013                 71534,2674 (CompuServe)
>WWW Home Page:  http://www.UniXpress.com   ftp://www.iglou.com/members/ris

1) "Invalid block" means that the block number is invalid, so
   something is corrupted on disk.  How big is the disk?  The
   block number 14720819 is somewhere into the 14th gigabyte.

2) It doesn't necessarily mean that the disk is dying, but that's
   possible.  Usually a dying disk exhibits a continually increasing
   number of bad blocks.

3) Suggestions:
      - Run crash again and figure out what function is at address
        0xf0156f62.  This is the value of the instruction pointer
        when the panic occurred.  The system probably panicked from
        a bad pointer dereference (CR2 was 0x8, the address causing
        the page fault).
      - Unmount /usr2 and run "fsck -ofull /dev/rusr2" on the device
        and see what's been corrupted.  If it looks like major damage,
        you might have to restore some files from backups.
      - I don't see how this panic is a result of the single warning
        message displayed, but if other things are corrupted on disk,
        then all bets are off.
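
The trap number and error code in the dump can be decoded by hand. The sketch below is not from the original posts; it assumes the standard 32-bit x86 page-fault error-code layout (bit 0 = present, bit 1 = write, bit 2 = user mode), and the function name is made up for illustration. It takes the `err` and `cr2` values shown in the panic output.

```python
# Page-fault (trap 0x0E) error-code bits on 32-bit x86.
PF_PRESENT = 0x1  # 0: page not present, 1: protection violation
PF_WRITE = 0x2    # 0: read access,      1: write access
PF_USER = 0x4     # 0: kernel mode,      1: user mode


def describe_page_fault(err, cr2):
    """Turn a page-fault error code and CR2 value into a one-line summary."""
    cause = "protection violation" if err & PF_PRESENT else "page not present"
    access = "write" if err & PF_WRITE else "read"
    mode = "user" if err & PF_USER else "kernel"
    return f"{mode}-mode {access} to {cr2:#x}: {cause}"


# Values from the dump: err 0x00000002, cr2 0x00000008.
print(describe_page_fault(err=0x2, cr2=0x8))
# kernel-mode write to 0x8: page not present
```

A kernel-mode write to an address that close to zero is what a store through a null structure pointer (field at offset 8) would look like.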

Hope this helps.

Steve Rago

 
 
 

Panic k_trap on OSR5

Post by Bela Lubki » Sun, 28 Jul 1996 04:00:00



> Just had a client machine do it the second time this week.  I had them
> save the dump file to tape, reloaded it to /tmp and looked at it with ye
> olde crash. Upon inspecting the panic messages area, the output was:

[...]

> WARNING: HTFS: Invalid block 14720819 on dev hd (1/168)

[...]

> cr0 0x8001001B     cr2  0x00000008     cr3 0x00002000     tlb  0x00000000

[...]

> Kernel Stack before Trap:
> STKADDR   FRAMEPTR  FUNCTION   POSSIBLE ARGUMENTS
> e0000c9c  e0000cc4  ifreeget   (0x5,u+0xd1c,mount,inode+0x10c20)
> e0000ccc  e0000cf0  igetput    (mount,0x5799,0x18,inode+0x10c20)
> e0000cf8  e0000dd4  namei      (upath,0,u+0x1148)
> e0000ddc  e0000de8  cmn_stat   (0x33,0x80585e0,0x8047b3c,u+0xe34)
> e0000df0  e0000e00  xstat      (0x8005828c,0x80585c1,0x80585e0,u+0xe28)
> e0000e08  e0000e28  systrap    (u+0xe34)

> The following line

>    WARNING: HTFS: Invalid block 14720819 on dev hd (1/168)

> appears in neither the /usr/adm/messages nor /usr/adm/syslog file. The
> /dev file for (1/168) is /dev/usr2 which is the oldest (a Maxtor 7213S).

> Questions:

>     1. Just what is an "Invalid block " message signifying?

That message means "Hey!  Block number 14720819 is outside the actual
dimensions of this device!"  Filesystem blocks are 1K, so that block
number is in the 15th gigabyte of space -- presumably well past the end
of your old 7213S.
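
The arithmetic is easy to check. This is a small sketch, not part of the original post; it assumes 1K filesystem blocks as stated above and counts gigabytes in binary units (2^30 bytes). The helper name is invented for the example.

```python
BLOCK = 14720819  # block number from the HTFS warning


def gigabyte_position(block, block_size):
    """Return (offset in GiB, 1-based number of the gigabyte the block is in)."""
    offset = block * block_size
    return offset / 2**30, offset // 2**30 + 1


gib, nth = gigabyte_position(BLOCK, 1024)
print(f"{gib:.2f} GiB into the device, i.e. in gigabyte number {nth}")
# 14.04 GiB into the device, i.e. in gigabyte number 15
```

Whatever block size is assumed, the offset lands far beyond a drive of a few hundred megabytes, which is why the filesystem code rejects it.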

>     2. Does this mean the hd is dying and needs to be replaced?

Not necessarily.  It means the filesystem code got a bad number from
somewhere.  It could have come from the block list of a corrupted inode
on the disk (corrupted by any number of possible glitches), or from a
corrupted in-memory image of a block list.  If, as you say, the machine
has non-parity memory, practically anything could happen.

The invalid block message and the panic are probably not directly
related.  They probably both stem from the same underlying problem of
unstable hardware.

>     3. Suggestions, caveats, etc.?

> Their system is a no-name P90 with 16MB (probably non-parity) RAM running
> OSR5 with Rel. Supp. D, net100, oss434a (and a few others).
>Bela<

 
 
 

Panic k_trap on OSR5

Post by Troy DeJon » Sun, 28 Jul 1996 04:00:00




:: Just had a client machine do it the second time this week.  I had them
:: save the dump file to tape, reloaded it to /tmp and looked at it with ye
:: olde crash. Upon inspecting the panic messages area, the output was:
::

[snippage]

::
:: WARNING: HTFS: Invalid block 14720819 on dev hd (1/168)
:: %Stp-0    -          -       -       Vendor=TANDBERG Product= TDC 3800
:: Unexpected trap in kernel mode:
:: cr0 0x8001001B     cr2  0x00000008     cr3 0x00002000     tlb  0x00000000
:: ss  0x0000E000     uesp 0xF02719C4     efl 0x00010202     ipl  0x00000000
:: cs  0x00000158     eip  0xF0156F62     err 0x00000002     trap 0x0000000E
:: eax 0x00000000     ecx  0x00004099     edx 0x00000000     ebx  0x00000005
:: esp 0xE0000C9C     ebp  0xE0000CC4     esi 0x00000001     edi 0xFB901BF0
:: ds  0x00000160     es   0x00000160     fs  0x00000000     gs   0x00000000
:: cpu 0x00000001
::
:: PANIC: k_trap - Kernel mode trap type 0x0000000E

[snippage]

::
:: Kernel Stack before Trap:
:: STKADDR   FRAMEPTR  FUNCTION   POSSIBLE ARGUMENTS
:: e0000c9c  e0000cc4  ifreeget   (0x5,u+0xd1c,mount,inode+0x10c20)
:: e0000ccc  e0000cf0  igetput    (mount,0x5799,0x18,inode+0x10c20)
:: e0000cf8  e0000dd4  namei      (upath,0,u+0x1148)
:: e0000ddc  e0000de8  cmn_stat   (0x33,0x80585e0,0x8047b3c,u+0xe34)
:: e0000df0  e0000e00  xstat      (0x8005828c,0x80585c1,0x80585e0,u+0xe28)
:: e0000e08  e0000e28  systrap    (u+0xe34)
::
:: The following line
::
::      WARNING: HTFS: Invalid block 14720819 on dev hd (1/168)
::
:: appears in neither the /usr/adm/messages nor /usr/adm/syslog file. The
:: /dev file for (1/168) is /dev/usr2 which is the oldest (a Maxtor 7213S).
::
:: Questions:
::      
::     1. Just what is an "Invalid block " message signifying?
::     2. Does this mean the hd is dying and needs to be replaced?
::     3. Suggestions, caveats, etc.?
::
:: Their system is a no-name P90 with 16MB (probably non-parity) RAM running
:: OSR5 with Rel. Supp. D, net100, oss434a (and a few others).
::
:: Thank you,
:: Lucky
::

: 1) "Invalid block" means that the block number is invalid, so
:    something is corrupted on disk.  How big is the disk?  The
:    block number 14720819 is somewhere into the 14th gigabyte.

: 2) It doesn't necessarily mean that the disk is dying, but that's
:    possible.  Usually a dying disk exhibits a continually increasing
:    number of bad blocks.

: 3) Suggestions:
:       - Run crash again and figure out what function is at address
:       0xf0156f62.  This is the value of the instruction pointer
:       when the panic occurred.  The system probably panicked from
:       a bad pointer dereference (CR2 was 0x8, the address causing
:       the page fault).

You guys are having too much fun without me...  :-)

Noting that cr2 is 0x8 (as Steve mentioned) and that %eax=0, %edx=0, and
%ebx=5, it looks to me as if the likely places in ifreeget() for a fault
at address 0x8 are these points (offsets in decimal):

     <ifreeget+98>:     movl   %esi,0x8(%eax)
     <ifreeget+104>:    movl   %edi,0x8(%eax)
     <ifreeget+110>:    movl   %edx,0x8(%eax)
     <ifreeget+314>:    movl   %edx,0x8(%eax)

This is all assuming that the panic occurred in the ifreeget() function
(I never did like the fact that crash wasn't a little more verbose on
its panic info).
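
The reasoning behind picking those instructions can be sketched as follows. This is only an illustration, not output from crash: it takes the register values from the panic dump and asks which registers, used as the base of an `0x8(%reg)` operand, would produce the faulting address in cr2. The function name is hypothetical.

```python
# Register values copied from the panic dump.
REGS = {
    "eax": 0x00000000, "ebx": 0x00000005, "ecx": 0x00004099,
    "edx": 0x00000000, "esi": 0x00000001, "edi": 0xFB901BF0,
}
CR2 = 0x8  # faulting address reported in the dump


def base_candidates(displacement, fault_addr, regs):
    """Registers whose value plus the displacement equals the fault address."""
    return sorted(
        name for name, value in regs.items()
        if (value + displacement) & 0xFFFFFFFF == fault_addr
    )


print(base_candidates(0x8, CR2, REGS))
# ['eax', 'edx']
```

Only a base register holding zero fits, which matches the `0x8(%eax)` stores listed above with %eax=0.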

Lucky, armed with these offsets, guys like Steve and Bela can better pinpoint
your exact problem.  But it looks as if they already might have a pretty
good clue as to what your problem is...

:       - Unmount /usr2 and run "fsck -ofull /dev/rusr2" on the device
:       and see what's been corrupted.  If it looks like major damage,
:       you might have to restore some files from backups.
:       - I don't see how this panic is a result of the single warning
:       message displayed, but if other things are corrupted on disk,
:       then all bets are off.

: Hope this helps.

: Steve Rago

--
Troy de Jongh           "No matter how hard you push and no matter what the
                         priority, you can't increase the speed of light."  
                         Fundamental Networking Truth #2, RFC 1925

 
 
 

Panic k_trap on OSR5

Post by Lucky Leavel » Tue, 30 Jul 1996 04:00:00




> >Just had a client machine do it the second time this week.  I had them
> >save the dump file to tape, reloaded it to /tmp and looked at it with ye
> >olde crash. Upon inspecting the panic messages area, the output was:

> >WARNING: HTFS: Invalid block 14720819 on dev hd (1/168)

> I figure that block is about 7.5 GIGAbytes into the hard drive.
> If you aren't running a drive that large then it may be a
> problem with bad data on some spot on the hard drive - e.g. an
> inode pointing to a nonexistent point because of a data
> read or write error.

It is only a 210MB drive (soon to be replaced with a 1GB).

> Have you done an fsck on that drive?

Yes, and it found a partially allocated inode (whatever on earth that is)
and several duplicate inodes.  I went ahead and cleaned the other two
non-root filesystems as well.

Bela and others also recommended replacing the non-parity memory
motherboard with one supporting parity (at least) or ECC memory.  I have
located one using the Triton2 chipset but they only sell parity memory
though the Triton2 will also support ECC.  Know of any good sources for
ECC memory?

Thank you,
Lucky

Lucky Leavell                      Phone: (800) 481-2393  or  (812) 945-6555
UniXpress - Your Source for SCO      FAX: (812) 949-9233

New Albany, IN 47150-2013                 71534,2674 (CompuServe)
WWW Home Page:  http://www.UniXpress.com   ftp://www.iglou.com/members/ris

 
 
 

Panic k_trap on OSR5

Post by Bela Lubki » Tue, 30 Jul 1996 04:00:00



> Bela and others also recommended replacing the non-parity memory
> motherboard with one supporting parity (at least) or ECC memory.  I have
> located one using the Triton2 chipset but they only sell parity memory
> though the Triton2 will also support ECC.  Know of any good sources for
> ECC memory?

It is good that the Triton II chipsets are finally out, with support for
parity and ECC.  It's unfortunate there is so much misinformation
swirling around them.  :-(  Here's a summary of what I know.

Intel doesn't use the names "Triton" and "Triton II" in their official
literature; at least, not on their web site.  They use names like "430FX
PCIset".  If you are offered a "Triton II" motherboard, and the reason
you're interested in it is for parity/ECC support, you need to be
careful to make sure you're getting the right thing.

There appear to be four chipsets (today) in the 430 PCIset line.  The
oldest is the 430FX PCIset, commonly known as "Triton".  It doesn't
support parity or ECC.

The two newest models are the 430HX PCIset and the 430VX PCIset.

The 430HX supports both parity and ECC, using standard 36-bit parity
memories.  ECC is implemented by combining the 8 parity bits from a
64-bit memory word.  It uses perfectly standard parity memory modules to
implement ECC.  You do *not* need special memory, just 36-bit parity
SIMMs.  Most of the usual motherboard vendors are already shipping 430HX
motherboards.  Some are now starting to ship dual-Pentium 430HX boards.
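
Why 8 spare bits per 64-bit word are enough can be shown with a little arithmetic. The sketch below assumes a Hamming-style single-error-correct, double-error-detect (SEC-DED) code; the chipset's actual code is not described here, so treat that as an assumption. The function names are made up for the example.

```python
def sec_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2**r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """One extra overall-parity bit adds double-error detection."""
    return sec_check_bits(data_bits) + 1


for width in (32, 64):
    print(f"{width} data bits need {secded_check_bits(width)} check bits")
# 32 data bits need 7 check bits
# 64 data bits need 8 check bits
```

Under that assumption, a single 32-bit word would need more check bits than its 4 parity bits provide, while a 64-bit word needs exactly the 8 parity bits that come with it.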

NOTE: according to USENET posts, revisions of the 430HX chipset before
the A-2 stepping do not actually work right when parity or ECC is
enabled.  So make sure you're getting at least the A-2 stepping.

One final thing to watch out for: although the chipset supports both
parity and ECC, using the same memory hardware, the system BIOS may not
be that flexible.  I have seen at least one system which refused to
program the chipset for parity operation; ECC was the only choice.
Presumably a BIOS could make the opposite choice for you as well -- or
could even contrive to entirely waste these features by refusing to
enable either!  I would look in BIOS setup and make sure it gives me,
the end user, control over this important system setting.

The 430VX does *not* support parity or ECC.  So if you're looking for
these reliability features, do not buy a 430VX-based system.  Some
vendors are labeling the 430VX PCIset "Triton II VX", some are labeling
it "Triton III".  Look for the Intel product name, not the code names.

The fourth 430 PCIset is the 430MX PCIset, a "mobile" chipset designed
for laptops.  I couldn't tell from Intel's web pages whether this was
new or old.  Not particularly relevant, I suppose.  Vendors will
undoubtedly call it "Triton something".  It doesn't support parity or
ECC.

So far, all of Intel's Pentium Pro PCIsets (440FX "Mars", 450GX "Orion
ST", 450KX "Orion DT") support parity, ECC, or both.

BTW, when buying parity memory, watch out for "logic parity" SIMMs,
which really only have 32 bits of memory plus a circuit to dynamically
generate the "right" 4 bits of parity.  Such SIMMs have no protection
against memory corruption.  They exist only to allow low-budget
system vendors to put together parity-less systems using motherboards
which cannot be configured to ignore parity.
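
To see why generated parity protects nothing, here is a toy model. It is a sketch with assumed details (even parity, one parity bit per byte, hypothetical function names), not a description of any particular SIMM.

```python
def byte_parities(word):
    """One even-parity bit for each of the four bytes of a 32-bit word."""
    return [bin((word >> (8 * i)) & 0xFF).count("1") & 1 for i in range(4)]


def true_parity_ok(word_read, parity_stored):
    """Real parity SIMM: check against parity bits stored at write time."""
    return byte_parities(word_read) == parity_stored


def logic_parity_ok(word_read):
    """'Logic parity' SIMM: parity is generated from the data being read."""
    parity_generated = byte_parities(word_read)
    return byte_parities(word_read) == parity_generated  # can never fail


written = 0x12345678
stored = byte_parities(written)   # what a real parity SIMM keeps
corrupted = written ^ (1 << 5)    # one bit flips while in memory

print(true_parity_ok(corrupted, stored))  # False: the flip is caught
print(logic_parity_ok(corrupted))         # True: the flip goes unnoticed
```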

Summary:

  o Intel 430HX PCIset is the only "Triton anything" chipset that
    currently supports parity or ECC, and (according to USENET posting)
    you need stepping A-2 or later.  May be called "Triton HX" by
    advertisers.  If you see "Triton II" or "Triton III", it could be
    one of the other PCIsets; make sure it is actually 430HX.

  o It supports your choice of parity or ECC, as long as the BIOS lets
    you choose.  Both use the same standard 36-bit parity SIMMs.

  o Avoid "logic parity" SIMMs, which provide fake parity that won't
    protect you from memory glitches.

>Bela<

 
 
 

Panic k_trap on OSR5

Post by Bela Lubki » Thu, 01 Aug 1996 04:00:00


Lucky Leavell wrote (in private mail -- posted with permission):

> I checked with this MB vendor and their hardware person (which I
> obviously am NOT) confirmed that they use the 430HX chipset with A2
> stepping.  They further confirmed what you said about being able to
> handle ECC or parity but with the caveat that the ECC part only works
> with single bit errors.  Any more than one error and the ECC part gives a
> parity error.  Hopefully, single bit errors should help a lot though I've
> heard of more robust ECC implementations which can handle multiple bit
> (up to 12, I think) errors.

Ah, well, perhaps they're trying to scare you, but believe me, a
single-bit ECC implementation is more than enough.

Mainframe memories use single-bit-error-correct, double-bit-error-detect
ECC systems.  There is hardware which uses 12-bit-error-correct ECC
systems, but it's completely different hardware.  To be specific, disk,
tape and CD-ROM drives.  To understand the distinction, you need to
understand the possible failure modes of the hardware.

Dynamic RAM has a separate "cell", consisting of perhaps 5-10
transistors, for each memory bit.  Each cell must be "refreshed" many
times a second by being "strobed" by the memory controller.  (Newer
memory designs are pushing this functionality onto the RAM chips
themselves, but the idea is the same: they must be refreshed).  Typical
failure modes involve a single bit getting toggled by one of two
mechanisms: either a passing radioactive particle (typically an alpha
particle) destroys the current state, or refresh is delayed so long that
the bit manages to decay.  Now, if refresh is sufficiently delayed, a
*lot* of your bits will decay and the memory will just act completely
broken, in which case no form of ECC will correct the problem (but it
*will* *detect* it, which is enough for you to know that you have to
replace the SIMM or soften up your memory timings).

Furthermore, RAM cells are structured into arrays which physically
separate the bits which make up a single word of memory.  In other
words, the CPU may see memory as an array of 64-bit "cache lines", or
some other such pattern, but each bit of one of those words or cache
lines is physically distant from each other bit.

Thus, the normal failure mode of RAM affects only a single bit at a time.
Even in the rare case where an alpha particle might blunder through
several physically adjacent bits, the resulting errors will affect
different words of memory.  Because of that separation, single-bit-
correct ECC can almost always fix RAM problems.
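
For readers who want to see single-bit-correct, double-bit-detect behavior in action, here is a toy version on a 4-bit value. It is a sketch using an extended Hamming(8,4) code, chosen only because it is small; it is not the code any real memory controller is claimed to use, and the function names are invented for the example.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # non-power-of-two slots hold the data bits


def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SEC-DED) codeword."""
    code = [0] * 8  # index 0: overall parity, indexes 1..7: Hamming(7,4)
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> i) & 1
    for p in (1, 2, 4):  # check bit p covers every position with bit p set
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    code[0] = sum(code[1:]) % 2  # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); nibble is None if the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of set positions is 0 for a valid codeword
    if sum(code) % 2:  # odd number of flipped bits: treat as a single error
        code[syndrome] ^= 1  # syndrome 0 means the overall parity bit itself
        status = "corrected"
    elif syndrome:  # even number of flips but the word is inconsistent
        return "uncorrectable", None
    else:
        status = "ok"
    nibble = sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, nibble


word = encode(0b1011)
one_flip = list(word)
one_flip[6] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1
print(decode(word), decode(one_flip), decode(two_flips))
```

One flipped bit is repaired and the original value comes back; two flipped bits are reported as uncorrectable rather than silently returning wrong data.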

Contrast this with disks and tapes.  On these media, physically adjacent
bits belong to logically adjacent memory locations.  If you read a byte
of data from a tape, you receive 8 bits which were lined up one after
another on the physical tape ribbon.  A glitch on that ribbon would wipe
out several adjacent data bits.  It's also an inherently less reliable
medium -- moving parts, stretchable tape and magnetic heads instead of
nice quiet transistors.  For tapes and disks to be reliable, they *have*
to be able to reconstruct fairly long strings of lost bits.

Returning to RAM: many people will argue that RAM is now so reliable
that parity and ECC are a waste of money.  There is probably some truth
in the argument that the RAM itself is that reliable, when subjected to
perfect environmental conditions.  However, it's physically impossible
for it to be perfectly reliable in real-world conditions.  If you put a
strong alpha particle source next to your SIMMs, you can be sure they
will start failing.  Unless your computer, your building and your body
have been specially formulated out of purely non-radioactive materials,
(or the computer kept inside heavy lead shielding -- I hope the cooling
system's good) the SIMMs will be subject to *some* radiation and will
occasionally drop a bit.

According to Intel's white papers, a typical "PC" machine with typical
amounts of memory can expect a soft error (bad memory read which could
be repaired by ECC and would not repeat, i.e. the memory cell itself
isn't bad) about once every 20 years.  The numbers are all theoretical
and I disbelieve their analysis.  Anyway, whether you think it's once a
year or once every 20 years, that's for a single-bit error.  Because the
bits are physically separated, the odds of a double-bit error are on the
order of the square of the odds for a single-bit error.  So yes, a
double-bit error will cause the ECC circuitry to give up and report a
parity error, which will crash the machine.  But *that* should happen
more like once in a million years.  When it does, at least the bad bits
won't be propagated into your spreadsheet to haunt you later.

>Bela<

 
 
 
