how to interpret ide error messages (2.4)

Post by ME » Thu, 03 Apr 2003 14:30:14



Hello list,

Please help me interpret the following error log (kernel 2.4.18-5, Red Hat 7.3):

Mar 31 21:22:56:
kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
kernel: hdc: dma_intr: error=0x01 { AddrMarkNotFound }, LBAsect=20300322, sector=1263288
kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=20803307, sector=1766272
kernel: end_request: I/O error, dev 16:04 (hdc), sector 1766272
kernel: raid1: Disk failure on hdc4, disabling device.
kernel: ^IOperation continuing on 1 devices
kernel: raid1: hdc4: rescheduling block 1766272
kernel: md: updating md3 RAID superblock on device
kernel: md: (skipping faulty hdc4 )

Q1: does that mean that the first error (LBAsect=20300322, sector=1263288) was
    a soft one and the second error a hard error which resulted in the I/O error?

Q2: 19037025 (start of hdc4) + 1766272 = 20803297 and not 20803307 so what
    is the arithmetic magic here? I hope LBAsectors are counted from 0 up?

The affected sectors don't generate any error messages if I read them today...

Since this error happened, ClearCase has been complaining about a corrupted
replica packet, so I suspect that the errors somehow affected user space
as well. It is quite possible that the two are unrelated, but replica
corruption did not happen once during the whole 110 days of uptime with
lots of replica traffic.

BTW, it would be nice if 'dev 16:04' was more explicit about being hex
and not decimal.
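
For readers decoding the device number by hand, here is a minimal sketch, assuming both fields of 'dev 16:04' are hexadecimal (major:minor) and assuming the conventional IDE numbering in which major 22 is the second IDE interface with 64 minors per disk. The helper names parse_dev and ide_name are made up for illustration and are not kernel functions.

```python
# Hypothetical helpers: decode a 2.4-style "dev MM:mm" pair printed in hex.

# Assumed mapping: block major -> the two disks on that IDE interface.
IDE_MAJORS = {3: ("hda", "hdb"), 22: ("hdc", "hdd")}
MINORS_PER_DISK = 64  # assumed: minor 0 is the whole disk, 1..63 are partitions


def parse_dev(text):
    """Split 'MM:mm' and read both halves as hexadecimal."""
    major_hex, minor_hex = text.split(":")
    return int(major_hex, 16), int(minor_hex, 16)


def ide_name(major, minor):
    """Return a name such as 'hdc4' for an IDE major/minor pair."""
    disk = IDE_MAJORS[major][minor // MINORS_PER_DISK]
    partition = minor % MINORS_PER_DISK
    return disk if partition == 0 else f"{disk}{partition}"


major, minor = parse_dev("16:04")
print(major, minor, ide_name(major, minor))
```

Under these assumptions, 0x16 is major 22 and minor 4 is the fourth partition of the first disk on that interface, which agrees with the "raid1: Disk failure on hdc4" line in the log.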

explanations are welcome.

Greetings,
Karl

Disk /dev/hdc: 14946 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End  #sectors  Id  System
/dev/hdc1            63    144584    144522  fd  Linux raid autodetect
/dev/hdc2        144585   2249099   2104515  fd  Linux raid autodetect
/dev/hdc3       2249100  19037024  16787925  fd  Linux raid autodetect
/dev/hdc4      19037025 240107489 221070465  fd  Linux raid autodetect

--

GE Medical Kretztechnik
Tiefenbach 15
A-4871 Zipf         Tel: (++43) 7682-3800-710  Fax (++43) 7682-3800-47
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
 
 

Post by Alan Co » Thu, 03 Apr 2003 16:40:08



> kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> kernel: hdc: dma_intr: error=0x01 { AddrMarkNotFound }, LBAsect=20300322, sector=1263288
> kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }

The drive could not find the requested sector. That normally means bad
things, but for some drivers it can also mean the controller asked for a
totally bogus sector number.

> kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=20803307, sector=1766272
> kernel: end_request: I/O error, dev 16:04 (hdc), sector 1766272

Unrecoverable data error.
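
The names in braces are just the set bits of the drive's status and error registers. As a rough sketch of that decoding, the tables below follow the classic ATA register layout with names in the style of the log; the exact strings, and especially the bit 7 error label, are assumptions rather than a copy of the kernel's tables.

```python
# Sketch: turn ATA status/error register values into the names seen in the log.
# Bit assignments follow the classic ATA layout; name strings are assumed.

STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

ERROR_BITS = [
    (0x80, "BadSector/BadCRC"),  # meaning of bit 7 depends on context
    (0x40, "UncorrectableError"),
    (0x20, "MediaChanged"),
    (0x10, "SectorIdNotFound"),
    (0x08, "MediaChangeRequested"),
    (0x04, "DriveStatusError"),
    (0x02, "TrackZeroNotFound"),
    (0x01, "AddrMarkNotFound"),
]


def decode(value, table):
    """List the names of all bits set in value, highest bit first."""
    return [name for mask, name in table if value & mask]


print(decode(0x51, STATUS_BITS))
print(decode(0x01, ERROR_BITS))
print(decode(0x40, ERROR_BITS))
```

With these tables, status=0x51 decodes to the same three flags the log shows, and the two error values each have a single bit set.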

> The affected sectors don't generate any error messages if I read them today...

After an error, the next write to a bad sector will typically remap it
transparently to a spare block on the disk. A read obviously cannot
do the same. So if, for example, ClearCase ignored the I/O error and
wrote back what it thought it had read (but had not), it may have
recovered the sector with invalid data. It's also possible, of course,
that ClearCase actually handles I/O errors properly (which is hard).

Consult ClearCase support, I guess; there should be tools to verify
your clearcase datasets. You might also want to force an fsck on your
file systems while the box is down for disk replacement to check
everything out.


 
 
 

Post by ME » Thu, 03 Apr 2003 19:50:13




> > kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> > kernel: hdc: dma_intr: error=0x01 { AddrMarkNotFound }, LBAsect=20300322, sector=1263288
> > kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }

> The drive could not find the requested sector. That normally means bad
> things but for some drivers can also mean the controller asked for a
> totally bogon sector number

> > kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=20803307, sector=1766272
> > kernel: end_request: I/O error, dev 16:04 (hdc), sector 1766272

> Unrecoverable data error.

> > The affected sectors dont generate any error messages if I read them today...

> On errors the next write to a bad sector will typically remap it
> transparently to another spare block on the disk. Read obviously cannot
> do the same. That would mean that if for example clearcase ignored the
> I/O error and wrote back what it thought it saw but did not that it may
> have recovered the sector with invalid data. Its also possible of course
> clearcase actually handles I/O errors properly (which is hard).

What alarms me is this:

Since it is a raid1, I expected user space not to be affected.
(The other drive has not shown any error messages since installation;
both are Maxtor 6Y120L0 (120 GB) drives, cooled quite well.) So I thought
that ClearCase should not have seen any error return code.

In the meantime I have re-synced the faulty drive. There were no write
error messages, and the bad block seems to have been properly remapped.

> Consult the clearcase support I guess, there should be tools to verify
> your clearcase datasets. You might also want to force an fsck on your

Fortunately only one replica packet seems to be corrupted and I can
request it again. The database seems OK after a run of the equivalent
of fsck.

> file systems while the box is down for disk replacement to check
> everything out.

I am still interested in what "sector=1766272" actually means, e.g. given
a partition table and LBAsect, how to calculate 'sector' or vice versa.
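
The thread never answers this, so the following is only a sketch of the arithmetic under one plausible reading: 'sector' is the partition-relative start of the failing request, and LBAsect is the absolute address the drive reported, which may lie a few sectors into that request. The partition bounds are copied from the table in the first post; the function names are invented for illustration.

```python
# Sketch: relate LBAsect (absolute) to 'sector' (assumed partition-relative).

# (start, end) in 512-byte sectors, from the partition table posted earlier.
PARTITIONS = {
    "hdc1": (63, 144584),
    "hdc2": (144585, 2249099),
    "hdc3": (2249100, 19037024),
    "hdc4": (19037025, 240107489),
}


def partition_of(lba):
    """Return the name of the partition containing an absolute LBA."""
    for name, (start, end) in PARTITIONS.items():
        if start <= lba <= end:
            return name
    raise ValueError(f"LBA {lba} is outside every partition")


def to_relative(lba):
    """Convert an absolute LBA to (partition, sector within partition)."""
    name = partition_of(lba)
    return name, lba - PARTITIONS[name][0]


# (LBAsect, sector) pairs taken from the two logged errors.
for lba, sector in [(20300322, 1263288), (20803307, 1766272)]:
    name, relative = to_relative(lba)
    print(name, relative, "offset from logged sector:", relative - sector)
```

Both errors land in hdc4, at 9 and 10 sectors past the logged 'sector' value. That is consistent with the drive failing partway through a multi-sector request, though the log alone does not prove that reading.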

Thanks,

Karl

--

GE Medical Kretztechnik
Tiefenbach 15
A-4871 Zipf         Tel: (++43) 7682-3800-710  Fax (++43) 7682-3800-47

 
 
 

Post by Alan Co » Thu, 03 Apr 2003 20:20:33



> > On errors the next write to a bad sector will typically remap it
> > transparently to another spare block on the disk. Read obviously cannot
> > do the same. That would mean that if for example clearcase ignored the
> > I/O error and wrote back what it thought it saw but did not that it may
> > have recovered the sector with invalid data. Its also possible of course
> > clearcase actually handles I/O errors properly (which is hard).
> Since it is a raid1 I expected user space not being affected.

I didn't realise it was raid1. If it is raid1 you are right, the upper
layer will supply the data from the other drive.

> (The other drive did not show any error messages since installation,
> they are Maxtors 6Y120L0 (120 GB) cooled quite well) So I thought that
> ClearCase should not have seen any error return code.

Correct


 
 
 

1. Interpret Lilo error message

Hi,
I use lilo to write to a floppy in order to append parameters to the
boot command needed to get linux to recognize my bios-less SoundBlaster
SCSI II interface.  Lilo has always worked fine until I upgraded from
RedHat 2.1 to the new 3.03.  I also upgraded to Kernel 2.0.  I
configure kernels fine and can dd them to floppies.  However, without
using lilo to generate the floppies, I can't append the necessary
parameters (aha152x=0x340,11,7,1).

The error message is:

lilo: geo_comp_addr: Cylinder number too big (1025 > 1023)

Can anyone interpret this message?  I have already tried reloading the
3.03 version of lilo from the distribution disk and even stepping back
to the 2.1 version of lilo.  I can only think that some variable or
configuration file has been changed by one of the above updates.

Thanks in advance for any help!

Dave

2. linux kernel conf 0.4

3. Red Hat 6.1 Installation -- Help Interpreting Error Messages.

4. [PATCH] watchdog nowayout and timeout module parameters

5. Help needed interpreting SCSI error messages

6. How to reset modems & server from a remote location

7. Interpreting an error message

8. Comp.os.linux.hardware Q&A 18 Mar.

9. Need help interpreting xautolock error messages...

10. Interpret printer error message?

11. Error message from Answerbook on Solaris 2.4

12. I_PUSH error message when using telnet/rlogin (2.4)

13. kernel 2.4 test5 - please help with ide error