Hard disk "crashing"? How to debug?

Post by suneb.. » Thu, 22 Jun 2006 05:17:44



Hi,

I'm running a machine with md RAID (two SATA drives in a mirror).

About once a week, the machine just stalls, and on-screen messages like
"sense key medium error" show up (it has happened two times now). I
reboot, and the machine works just fine; the md system resyncs the
drives and there seem to be no further errors.

Is it the drive? Some other hardware problem? Or more importantly,
which log files can I examine to find out? I tried the usual syslog,
messages, etc, but they are no more generous than telling me the raid
array is broken and is being resynced.

Can I review the on-screen messages that scrolled when the machine
stalled? I think it looked pretty much the same as the one time I
happened to unplug a SATA drive while the machine was running.

(Also, the box is not stalling completely: it responds to pings and
will show the MOTD when I log in over SSH, but won't show me a prompt.)

Sorry if I'm not giving enough details, but I'm unsure where to look.

Kind regards
Sune Beck

Post by John-Paul Stewar » Thu, 22 Jun 2006 23:47:01



> Hi,

> I'm running a machine with md raid (two sata drives in a mirror).

> About once a week, the machine just stalls, and on-screen messages like
> "sense key medium error" show up (it has happened two times now). I
> reboot, and the machine works just fine; the md system resyncs the
> drives and there seem to be no further errors.

> Is it the drive?

Yes.  The "medium error" indicates a bad sector on the hard disk.  The
disk should automatically re-map these to spare sectors, however.  I'd
use 'smartctl' to check the health of the disk.  It may be fine
(re-mapping the bad sectors) or it may not (having run out of spare
sectors).
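
A minimal sketch of that check follows. The device name /dev/sda and the
helper name remap_counts are assumptions for illustration, and attribute
names can differ between drive vendors. The helper reads the attribute
table printed by 'smartctl -A' on stdin and prints the raw counts of the
two attributes that usually track re-mapping.

```shell
# remap_counts: hypothetical helper, not part of smartmontools.
# Reads the attribute table printed by `smartctl -A` on stdin and prints
# NAME=RAW_VALUE for the attributes that usually track sector re-mapping.
# In that table the attribute name is column 2 and the raw value column 10.
# The attribute names below are the common ones; a given drive may differ.
remap_counts() {
    awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" {
             print $2 "=" $10
         }'
}
```

Typical use, run as root (adding '-d ata' may help if smartctl does not
recognise a SATA disk by itself):

  smartctl -H /dev/sda
  smartctl -A /dev/sda | remap_counts

Counts that are non-zero and growing would point at the disk itself
rather than at cabling.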

Post by buck » Fri, 23 Jun 2006 01:43:53



>Hi,

>I'm running a machine with md raid (two sata drives in a mirror).

>About once a week, the machine just stalls, and on-screen messages like
>"sense key medium error" show up (it has happened two times now). I
>reboot, and the machine works just fine; the md system resyncs the
>drives and there seem to be no further errors.

>Is it the drive? Some other hardware problem? Or more importantly,
>which log files can I examine to find out? I tried the usual syslog,
>messages, etc, but they are no more generous than telling me the raid
>array is broken and is being resynced.

>Can I review the on-screen messages that scrolled when the machine
>stalled? I think it looked pretty much the same as the one time I
>happened to unplug a SATA drive while the machine was running.

>(Also, the box is not stalling completely: it responds to pings and
>will show the MOTD when I log in over SSH, but won't show me a prompt.)

>Sorry if I'm not giving enough details, but I'm unsure where to look.

>Kind regards
>Sune Beck

Although this probably is due to a bad sector, it is also possible
that the drive is disconnecting.  Try removing and replacing the
cable(s).  Since SATA has two ways to be cabled, you might also want
to try the alternate cable method.  Temperature changes can affect
dubious connections, as can pollution, oxidation, etc.  Back in the
Bad Old Days, we used an ArtGum eraser on the contacts of cards
inserted into the motherboard and aerosol contact cleaner on devices
with pins that the ArtGum couldn't clean.  Once I found an insect nest
in the parallel connection between my computer and the cable to the
printer...
--
buck

Post by suneb.. » Fri, 23 Jun 2006 04:14:07


Hi John-Paul,


> Yes.  The "medium error" indicates a bad sector on the hard disk.  The
> disk should automatically re-map these to spare sectors, however.  I'd
> use 'smartctl' to check the health of the disk.  It may be fine
> (re-mapping the bad sectors) or it may not (having run out of spare
> sectors).

OK, the drive is brand new so it should have plenty of spare sectors
left (or so I would believe). It's happened twice, I could always hope
this was it for now :) Smartctl says the drive doesn't have S.M.A.R.T
support.

I also tried refitting the cables, although they seemed snug (as the
other reply suggested). And I found no moths, flies or beetles inside
:)

Another thing: is there any way to prevent the machine from completely
stalling when a disk crashes? Everything (boot files, swap) is on a
software raid-1, can I make it simply continue operating with the
remaining disk?

--
Kind regards
Sune Beck

Post by Grant » Fri, 23 Jun 2006 05:43:40



>Hi John-Paul,


>> Yes.  The "medium error" indicates a bad sector on the hard disk.  The
>> disk should automatically re-map these to spare sectors, however.  I'd
>> use 'smartctl' to check the health of the disk.  It may be fine
>> (re-mapping the bad sectors) or it may not (having run out of spare
>> sectors).

>OK, the drive is brand new so it should have plenty of spare sectors
>left (or so I would believe). It's happened twice, I could always hope
>this was it for now :) Smartctl says the drive doesn't have S.M.A.R.T
>support.

dmesg?  lspci?  Correct SATA driver for chipset?  
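
As a sketch of how to gather that information: the filter below pulls
disk-related lines out of kernel log text supplied on stdin. The name
disk_errors is hypothetical, and the patterns are only a guess at what
the relevant messages look like on this box.

```shell
# disk_errors: hypothetical filter for kernel log text on stdin.
# Keeps lines that mention SCSI sense data, medium errors or I/O errors.
# The patterns are an assumption; adjust them to the messages actually seen.
disk_errors() {
    grep -i -E 'sense key|medium error|i/o error'
}
```

Possible use (the log path is the usual Debian location, assumed here):

  dmesg | disk_errors
  disk_errors < /var/log/kern.log
  lspci | grep -i -E 'sata|ide'
  lsmod | grep sata

Messages printed just before a stall may never reach the log files if
the disks could not accept writes at that moment.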

>Another thing: is there any way to prevent the machine from completely
>stalling when a disk crashes? Everything (boot files, swap) is on a
>software raid-1, can I make it simply continue operating with the
>remaining disk?

A foolish way to setup linux on dual SATA, and waste of a hard drive.

Grant.
--
Cats are smarter than dogs.  You can't make eight cats pull
a sled through the snow.

Post by Måns Rullgård » Fri, 23 Jun 2006 06:03:31




>>Another thing: is there any way to prevent the machine from completely
>>stalling when a disk crashes? Everything (boot files, swap) is on a
>>software raid-1, can I make it simply continue operating with the
>>remaining disk?

Normally it should keep on running happily.  If it locks up that is an
indication that something bad is happening in the kernel.
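
To see whether the array did drop a member, the status lines in
/proc/mdstat can be checked. A minimal sketch, assuming the usual mdstat
layout: degraded_arrays is a hypothetical helper name and /dev/md0 is an
assumed array name. An underscore inside the [UU] status brackets marks
a missing or failed member.

```shell
# degraded_arrays: hypothetical helper, reads /proc/mdstat text on stdin.
# Remembers each array name from its "mdN : ..." line and prints that name
# when a following status line shows "_" inside the [UU]-style brackets.
degraded_arrays() {
    awk '/^md[0-9]+ :/      { name = $1 }
         /\[[U_]*_[U_]*\]/  { print name }'
}
```

Possible use:

  degraded_arrays < /proc/mdstat
  mdadm --detail /dev/md0

If an array is listed as degraded but the box keeps running, the mirror
is doing its job; a lock-up despite that points elsewhere.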

> A foolish way to setup linux on dual SATA, and waste of a hard drive.

Why is that foolish?  Have you never had a disk die taking all the
data with it to the grave?

--
Måns Rullgård

Post by Grant » Fri, 23 Jun 2006 08:13:22




...
>> A foolish way to setup linux on dual SATA, and waste of a hard drive.

>Why is that foolish?

RAID1 is promoted as some easy, 'no effort' automatic backup, it's not,
since a hardware / software error may wipe both drives, in parallel.  

>  Have you never had a disk die taking all the
>data with it to the grave?

Back in the early '90s I had a 100MB HDD die a horrible death, I think
when I was running OS/2 -- got a new drive on warranty, dismantled
that drive some years later when 100MB HDD size became a joke ;)

I once spent the best part of a week restoring from a drawer full of
CDROMs after a partitioning thinko ;)  Zero 200GB on two drives and
start over :(  The sort of boo-boo one doesn't repeat ;)  

I've replaced two hard drives when they started making strange bearing
noises.  Both were well past their 'use-by' date, eight or ten years old.

Grant.
--
Cats are smarter than dogs.  You can't make eight cats pull
a sled through the snow.

Post by Måns Rullgård » Fri, 23 Jun 2006 10:02:38





> ...
>>> A foolish way to setup linux on dual SATA, and waste of a hard drive.

>>Why is that foolish?

> RAID1 is promoted as some easy, 'no effort' automatic backup, it's not,
> since a hardware / software error may wipe both drives, in parallel.  

RAID1 is not a replacement for backup.  It will however save you the
trouble of restoring all your data from backups, and will save
whatever data hasn't yet made it onto backup (you do actually use the
system, I presume).

>>  Have you never had a disk die taking all the
>>data with it to the grave?

> Back in the early '90s I had a 100MB HDD die a horrible death, think
> when I was running os/2 -- got a new drive on warranty, dismantled
> that drive some years later when 100MB HDD size became a joke ;)

But the replacement drive didn't come with your old data on it, now
did it?

> I once spent the best part of a week restoring from a drawer full of
> CDROMs after a partitioning thinko ;)  Zero 200GB on two drives and
> start over :(  The sort of boo-boo one doesn't repeat ;)  

That's the kind of thing RAID doesn't protect you against.

> I've replaced two hard drives when they started making strange bearing
> noises.  Both were well past their 'use-by' date, eight or ten years old.

Then you were lucky that you had time to replace them before they
died.  I once had an 80GB WD drive drop dead without any prior
indication only days after the 1-year warranty expired.

Disk drives are cheap enough nowadays that there is really no excuse
*not* to run some sort of RAID (1 or higher).

--
Måns Rullgård

Post by suneb.. » Fri, 23 Jun 2006 10:05:25


Hi Grant,


> dmesg?  lspci?  Correct SATA driver for chipset?

I posted my dmesg and lspci here: <http://pastebin.com/724548>. As far
as I can read, I'm using "sata_nv", and the chipset is nvidia so it
sounds right (?)

> A foolish way to setup linux on dual SATA, and waste of a hard drive.

The idea was to have the machine keep running smoothly even after a
disk crash. And yes, I know RAID-1 is no substitute for a back-up. :)

--
Sune Beck

Post by suneb.. » Fri, 23 Jun 2006 10:13:25


Hi Måns


> Normally it should keep on running happily.  If it locks up that is an
> indication that something bad is happening in the kernel.

That sounds... Well, bad. I just encountered something else weird on a
box with exactly the same setup and specs: during a RAID resync, the system
might "freeze" for a moment, especially when doing something like

  watch -n 1 cat /proc/mdstat

The display would update a few times, then lock up for up to 30
seconds, and then update again. The following would trigger lock-ups of
a shorter duration:

  watch -n 1 mkdir test

After the resync was done, no problems.
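
The resync rate can be watched and, if needed, capped. This is only a
sketch: resync_status is a hypothetical helper, the 10000 KB/s cap below
is an arbitrary example value, and whether throttling would cure these
particular freezes is untested.

```shell
# resync_status: hypothetical helper, reads /proc/mdstat text on stdin.
# For every array that is resyncing or recovering, prints the array name
# followed by the finish= and speed= fields from its progress line.
resync_status() {
    awk '/^md[0-9]+ :/ { name = $1 }
         /(resync|recovery) *=/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^(finish|speed)=/)
                     printf "%s %s\n", name, $i
         }'
}
```

Possible use (the last command needs root; the limit is in KB/s):

  resync_status < /proc/mdstat
  cat /proc/sys/dev/raid/speed_limit_max
  echo 10000 > /proc/sys/dev/raid/speed_limit_max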

But anyway, the dmesg and lspci output is here:

http://pastebin.com/724548

I tried carefully reviewing the dmesg for suspicious messages, but
didn't find anything related (except maybe for some nvidia devices
listed as "unknown"). If it's deep in the kernel, I'm unsure how to
debug.

--
Sune Beck

Post by Grant » Fri, 23 Jun 2006 12:13:47



>But the replacement drive didn't come with your old data on it, now
>did it?

No, I had backups on floppy disks, very slow ;)  And when I got that
PC, it was when a 100MB drive had to be split into 32MB max size
partitions for ms-dos to work.

>Disk drives are cheap enough nowadays that there is really no excuse
>*not* to run some sort of RAID (1 or higher).

I have several machines on LAN and keep redundant backups between
machines with tar or rsync.  Yes hard drives are cheap, but the
RAID1 solution doesn't capture my attention.  RAID5 would be overkill
here -- but if I were running a high-availability setup, RAID with
hot spare / hot swap might be an option, as would failover hardware.

Grant.
--
Cats are smarter than dogs.  You can't make eight cats pull
a sled through the snow.

Post by Grant » Fri, 23 Jun 2006 12:22:43



>Hi Grant,


>> dmesg?  lspci?  Correct SATA driver for chipset?

>I posted my dmesg and lspci here: <http://pastebin.com/724548>.

Update your pci.ids data file to lose those 'unknowns' ;)

Linux-2.6.8 is way too old for modern hardware, update it?  Current
stable is 2.6.16.21 (with .22 in the wings) or 2.6.17.1 (very new).

Also check OS -- I don't use Debian, dunno what their security and
driver backporting is like.

>I posted my dmesg and lspci here: <http://pastebin.com/724548>. As far
>as I can read, I'm using "sata_nv", and the chipset is nvidia so it
>sounds right (?)

Check Jeff Garzik's page for SATA status (google), nvidia still has
problems on some mobo's with the latest drivers.
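
A quick way to check the kernel-version point above. This is a sketch:
kernel_older_than is a hypothetical helper that compares only the first
three numeric components of a version string, and 2.6.16 is used as the
threshold simply because it is the stable series named above.

```shell
# kernel_older_than: hypothetical helper.
# $1 = version to test (for example the output of `uname -r`)
# $2 = reference version
# Succeeds (exit status 0) only if $1 is strictly older than $2, comparing
# the first three components numerically; suffixes such as "-2-386" are
# split off and ignored.
kernel_older_than() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, /[.-]/)
        split(b, y, /[.-]/)
        for (i = 1; i <= 3; i++) {
            if (x[i] + 0 < y[i] + 0) exit 0
            if (x[i] + 0 > y[i] + 0) exit 1
        }
        exit 1
    }'
}
```

Possible use:

  kernel_older_than "$(uname -r)" 2.6.16 && echo "kernel predates 2.6.16"
  lsmod | grep sata_nv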

Grant.
--
Cats are smarter than dogs.  You can't make eight cats pull
a sled through the snow.

Post by chuckca » Sat, 24 Jun 2006 05:16:41








>> ...
>>>> A foolish way to setup linux on dual SATA, and waste of a hard
>>>> drive.

>>>Why is that foolish?

>> RAID1 is promoted as some easy, 'no effort' automatic backup, it's
>> not, since a hardware / software error may wipe both drives, in
>> parallel.  

> RAID1 is not a replacement for backup.  It will however save you the
> trouble of restoring all your data from backups, and will save
> whatever data hasn't yet made it onto backup (you do actually use the
> system, I presume).

>>>  Have you never had a disk die taking all the
>>>data with it to the grave?

>> Back in the early '90s I had a 100MB HDD die a horrible death, think
>> when I was running os/2 -- got a new drive on warranty, dismantled
>> that drive some years later when 100MB HDD size became a joke ;)

> But the replacement drive didn't come with your old data on it, now
> did it?

>> I once spent the best part of a week restoring from a drawer full of
>> CDROMs after a partitioning thinko ;)  Zero 200GB on two drives and
>> start over :(  The sort of boo-boo one doesn't repeat ;)  

> That's the kind of thing RAID doesn't protect you against.

>> I've replaced two hard drives when they started making strange
>> bearing noises.  Both were well past their 'use-by' date, eight or
>> ten years old.

> Then you were lucky that you had time to replace them before they
> died.  I once had an 80GB WD drive drop dead without any prior
> indication only days after the 1-year warranty expired.

> Disk drives are cheap enough nowadays that there is really no excuse
> *not* to run some sort of RAID (1 or higher).

Wouldn't fsck do something to fix the problem? I don't have its man
page handy right now, but isn't that what it does?

--
(setq (chuck nil)  car(chuck) )