Proliant 2500R + Red Hat 6.2 - RAID5 suddenly loses drives

Post by The Archimag » Sun, 28 Jan 2001 13:46:43



OK, here's the scenario:

I have a Compaq Proliant 2500R, dual PPro 200's, a gig of EDO ECC RAM,
and an external Compaq F1 Raid array with 9 identical Seagate 9.1 gig
SCSI drives.  Several months ago (right when RH6.2 was released), I
installed RH6.2 in GUI mode, set up a 24 meg partition mounted on /boot,
and the rest of the array as /dev/md0, a RAID5 array, mounted on /.  

The server was at a remote site, and I have to admit I didn't watch the
logs as closely as I should have.  I ssh'd in on January 16th, and
noticed in /var/log/messages that a "disk failure" had occurred on one of
the disks, and the array was continuing on 6 disks.  I used
raidhotremove to remove the drive from the stripe set, and then
raidhotadd to add it back in and regenerate the stripe set.  The server
crashed, showing a second drive "failed."  I lost all the data on the
array.
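
For anyone following along, the raidtools-era sequence described above looks roughly like this. It is shown as a dry run (commands echoed, not executed), using the device names from this thread:

```shell
#!/bin/sh
# Dry-run sketch of the raidtools hot-swap sequence described above.
# /dev/md0 and /dev/sdd1 are the names from this thread; adjust as needed.
run() { echo "+ $*"; }   # change to:  run() { "$@"; }  to execute for real

run cat /proc/mdstat                   # confirm which member is marked (F)ailed
run raidhotremove /dev/md0 /dev/sdd1   # drop the failed member from the set
run raidhotadd /dev/md0 /dev/sdd1      # re-add it; the resync starts on its own
run cat /proc/mdstat                   # watch reconstruction progress
```

Note that while a RAID5 set is degraded there is no redundancy left, so any second error during the resync is fatal to the whole array, which matches what happened here.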

Assuming I had a bad drive or drives, I blew the machine away and formatted
the drives with an NT workstation install and format.com.  The drives
formatted and scandisked fine, so I blew NT away and reinstalled RH6.2
with RAID5, checking for bad blocks, configured again as mentioned
above.  The install didn't even finish before RAID errors were reported.

I figured maybe I had a bad CPU, so I replaced both just to be sure.

I called Compaq, and they told me to upgrade to the latest BIOS and
utilities, and then run the system erase utility.  I did, then I ran the
compaq diagnostics.  I ran two complete runs (took 66 hours), and the
machine, the CPUs, the memory, the SCSI controller, and the disks all
passed.  I figured I had it licked, so I reinstalled RH6.2 again, with
RAID5.  It installed beautifully, ran fine for a couple of hours, and
then I started running kernel compiles in six 10-cycle loops, one loop
on each console, to stress-test it.

Sure enough, a disk was marked bad and removed from the array.

I'm baffled.  This setup ran fine for MONTHS.  The machine tests fine
after ridiculously granular tests.  But it can't keep striping going.

Any clues where to look next?  I'm leaning towards bypassing the onboard
NCR SCSI card and putting an Adaptec 2940UW in a PCI slot and running
the array off of it.
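
Before swapping hardware, it may be worth confirming which driver the kernel actually bound to the onboard controller. A minimal sketch, run here against a canned sample line (the sym53c8xx string is a hypothetical example of what this era of NCR/Symbios hardware typically reports; on a real box pipe dmesg in instead):

```shell
#!/bin/sh
# Check which SCSI HBA driver the kernel attached. On a real system,
# replace the canned sample with:  dmesg | grep -i -E 'ncr|sym53c|aic7'
dmesg_sample='scsi0 : sym53c8xx - version 1.3g'   # hypothetical sample line

echo "$dmesg_sample" | grep -i -E 'ncr|sym53c|aic7' \
  || echo "no NCR/Symbios/Adaptec driver line found"
```

If the old ncr53c8xx driver is in use, trying the sym53c8xx driver is a cheaper isolation step than buying the Adaptec card.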

Thanks
The Archimage


Post by jw » Sun, 28 Jan 2001 15:34:42


On Sat, 27 Jan 2001 04:46:43 GMT, The Archimage

>OK, here's the scenario:

>I have a Compaq Proliant 2500R, dual PPro 200's, a gig of EDO ECC RAM,
>and an external Compaq F1 Raid array with 9 identical Seagate 9.1 gig
>SCSI drives.  Several months ago (right when RH6.2 was released), I
>installed RH6.2 in GUI mode, set up a 24 meg partition mounted on /boot,
>and the rest of the array as /dev/md0, a RAID5 array, mounted on /.  

>The server was at a remote site, and I have to admit I didn't watch the
>logs as closely as I should have.  I ssh'd in on January 16th, and
>noticed in /var/log/messages that a "disk failure" had occurred on one of
>the disks, and the array was continuing on 6 disks.  I used
>raidhotremove to remove the drive from the stripe set, and then
>raidhotadd to add it back in and regenerate the stripe set.  The server
>crashed, showing a second drive "failed."  I lost all the data on the
>array.

First of all, please trim down the list of newsgroups. Note the F'Up.

Second, are there any log messages that might indicate WHY some drives
failed?

Jurriaan

--
BOFH excuse #125:

we just switched to Sprint.
GNU/Linux 2.2.19pre7 SMP 2x1402 bogomips load av: 0.08 0.09 0.08


Post by The Archimag » Sun, 28 Jan 2001 22:37:29


Logs below.  They are long.

Jan 26 06:05:10 archimage kernel: SCSI disk error : host 0 channel 0 id
3 lun 0 return code = 2
Jan 26 06:05:10 archimage kernel: scsidisk I/O error: dev 08:31, sector
1747688
Jan 26 06:05:10 archimage kernel: raid5: Disk failure on sdd1, disabling
device. Operation continuing on 6 devices
Jan 26 06:05:10 archimage kernel: md: recovery thread got woken up ...
Jan 26 06:05:10 archimage kernel: md0: no spare disk to reconstruct
array! -- continuing in degraded mode
Jan 26 06:05:10 archimage kernel: md: recovery thread finished ...
Jan 26 06:05:10 archimage kernel: md: updating md0 RAID superblock on
device
Jan 26 06:05:10 archimage kernel: sdg1 [events: 00000006](write) sdg1's
sb offset: 8807296
Jan 26 06:05:10 archimage kernel: sdf1 [events: 00000006](write) sdf1's
sb offset: 8807296
Jan 26 06:05:10 archimage kernel: sde1 [events: 00000006](write) sde1's
sb offset: 8807296
Jan 26 06:05:10 archimage kernel: (skipping faulty sdd1 )
Jan 26 06:05:10 archimage kernel: sdc1 [events: 00000006](write) sdc1's
sb offset: 8807296
Jan 26 06:05:10 archimage kernel: sdb1 [events: 00000006](write) sdb1's
sb offset: 8807296
Jan 26 06:05:10 archimage kernel: sda5 [events: 00000006](write) sda5's
sb offset: 8811520
Jan 26 06:05:10 archimage kernel: .
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 4849728
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 4849736
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 4849744
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 4849752
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 4849768
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 4849776
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 4849784
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 830136
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 4500184
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 2140888
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 2271976
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 1747688
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 1747696
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 1747704
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 3648
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 3656
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 3664
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 4723088
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 4723096
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 7387792
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 7493680
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 7493688
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 7406000
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 7406008
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 6728400
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 13068280
Jan 26 06:05:10 archimage kernel: raid5: restarting stripe 14985976
Jan 26 21:58:55 archimage kernel: trying to remove sdd1 from md0 ...  
Jan 26 21:58:55 archimage kernel: RAID5 conf printout:
Jan 26 21:58:55 archimage kernel:  --- rd:7 wd:6 fd:1
Jan 26 21:58:55 archimage kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1
dev:sdb1
Jan 26 21:58:55 archimage kernel:  disk 1, s:0, o:1, n:1 rd:1 us:1
dev:sdc1
Jan 26 21:58:55 archimage kernel:  disk 2, s:0, o:0, n:2 rd:2 us:1
dev:sdd1
Jan 26 21:58:55 archimage kernel:  disk 3, s:0, o:1, n:3 rd:3 us:1
dev:sde1
Jan 26 21:58:55 archimage kernel:  disk 4, s:0, o:1, n:4 rd:4 us:1
dev:sdf1
Jan 26 21:58:55 archimage kernel:  disk 5, s:0, o:1, n:5 rd:5 us:1
dev:sdg1
Jan 26 21:58:55 archimage kernel:  disk 6, s:0, o:1, n:6 rd:6 us:1
dev:sda5
Jan 26 21:58:55 archimage kernel:  disk 7, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:58:55 archimage kernel:  disk 8, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:58:55 archimage kernel:  disk 9, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:58:55 archimage kernel:  disk 10, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:58:55 archimage kernel:  disk 11, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:58:55 archimage kernel: RAID5 conf printout:
Jan 26 21:58:55 archimage kernel:  --- rd:7 wd:6 fd:1
Jan 26 21:58:55 archimage kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1
dev:sdb1
Jan 26 21:58:55 archimage kernel:  disk 1, s:0, o:1, n:1 rd:1 us:1
dev:sdc1
Jan 26 21:58:55 archimage kernel:  disk 2, s:0, o:0, n:2 rd:2 us:0
dev:[dev 00:00]
Jan 26 21:58:55 archimage kernel:  disk 3, s:0, o:1, n:3 rd:3 us:1
dev:sde1
Jan 26 21:58:55 archimage kernel:  disk 4, s:0, o:1, n:4 rd:4 us:1
dev:sdf1
Jan 26 21:58:55 archimage kernel:  disk 5, s:0, o:1, n:5 rd:5 us:1
dev:sdg1
Jan 26 21:58:55 archimage kernel:  disk 6, s:0, o:1, n:6 rd:6 us:1
dev:sda5
Jan 26 21:58:55 archimage kernel:  disk 7, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:58:55 archimage kernel:  disk 8, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:58:55 archimage kernel:  disk 9, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:58:55 archimage kernel:  disk 10, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:58:55 archimage kernel:  disk 11, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:58:55 archimage kernel: unbind<sdd1,6>
Jan 26 21:58:55 archimage kernel: export_rdev(sdd1)
Jan 26 21:58:55 archimage kernel: md: updating md0 RAID superblock on
device
Jan 26 21:58:55 archimage kernel: sdg1 [events: 00000007](write) sdg1's
sb offset: 8807296
Jan 26 21:58:55 archimage kernel: md: updating md0 RAID superblock on
device
Jan 26 21:58:55 archimage kernel: sdg1 [events: 00000008](write) sdg1's
sb offset: 8807296
Jan 26 21:58:55 archimage kernel: sdf1 [events: 00000008](write) sdf1's
sb offset: 8807296
Jan 26 21:58:55 archimage kernel: sdf1 [events: 00000008](write) sdf1's
sb offset: 8807296
Jan 26 21:58:55 archimage kernel: sde1 [events: 00000008](write) sde1's
sb offset: 8807296
Jan 26 21:58:55 archimage kernel: sde1 [events: 00000008](write) sde1's
sb offset: 8807296
Jan 26 21:58:55 archimage kernel: sdc1 [events: 00000008](write) sdc1's
sb offset: 8807296
Jan 26 21:58:55 archimage kernel: sdc1 [events: 00000008](write) sdc1's
sb offset: 8807296
Jan 26 21:58:55 archimage kernel: sdb1 [events: 00000008](write) sdb1's
sb offset: 8807296
Jan 26 21:58:55 archimage kernel: sdb1 [events: 00000008](write) sdb1's
sb offset: 8807296
Jan 26 21:58:55 archimage kernel: sda5 [events: 00000008](write) sda5's
sb offset: 8811520
Jan 26 21:58:55 archimage kernel: sda5 [events: 00000008](write) sda5's
sb offset: 8811520
Jan 26 21:58:55 archimage kernel: .
Jan 26 21:58:55 archimage kernel: .
Jan 26 21:59:04 archimage kernel: trying to hot-add sdd1 to md0 ...  
Jan 26 21:59:04 archimage kernel: bind<sdd1,7>
Jan 26 21:59:04 archimage kernel: RAID5 conf printout:
Jan 26 21:59:04 archimage kernel:  --- rd:7 wd:6 fd:1
Jan 26 21:59:04 archimage kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1
dev:sdb1
Jan 26 21:59:04 archimage kernel:  disk 1, s:0, o:1, n:1 rd:1 us:1
dev:sdc1
Jan 26 21:59:04 archimage kernel:  disk 2, s:0, o:0, n:2 rd:2 us:0
dev:[dev 00:00]
Jan 26 21:59:04 archimage kernel:  disk 3, s:0, o:1, n:3 rd:3 us:1
dev:sde1
Jan 26 21:59:04 archimage kernel:  disk 4, s:0, o:1, n:4 rd:4 us:1
dev:sdf1
Jan 26 21:59:04 archimage kernel:  disk 5, s:0, o:1, n:5 rd:5 us:1
dev:sdg1
Jan 26 21:59:04 archimage kernel:  disk 6, s:0, o:1, n:6 rd:6 us:1
dev:sda5
Jan 26 21:59:04 archimage kernel:  disk 7, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:59:05 archimage kernel:  disk 8, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:59:05 archimage kernel:  disk 9, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:59:05 archimage kernel:  disk 10, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:59:05 archimage kernel:  disk 11, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:59:05 archimage kernel: RAID5 conf printout:
Jan 26 21:59:05 archimage kernel:  --- rd:7 wd:6 fd:1
Jan 26 21:59:05 archimage kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1
dev:sdb1
Jan 26 21:59:05 archimage kernel:  disk 1, s:0, o:1, n:1 rd:1 us:1
dev:sdc1
Jan 26 21:59:05 archimage kernel:  disk 2, s:0, o:0, n:2 rd:2 us:0
dev:[dev 00:00]
Jan 26 21:59:05 archimage kernel:  disk 3, s:0, o:1, n:3 rd:3 us:1
dev:sde1
Jan 26 21:59:05 archimage kernel:  disk 4, s:0, o:1, n:4 rd:4 us:1
dev:sdf1
Jan 26 21:59:05 archimage kernel:  disk 5, s:0, o:1, n:5 rd:5 us:1
dev:sdg1
Jan 26 21:59:05 archimage kernel:  disk 6, s:0, o:1, n:6 rd:6 us:1
dev:sda5
Jan 26 21:59:05 archimage kernel:  disk 7, s:1, o:0, n:7 rd:7 us:1
dev:sdd1
Jan 26 21:59:05 archimage kernel:  disk 8, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:59:05 archimage kernel:  disk 9, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:59:05 archimage kernel:  disk 10, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:59:05 archimage kernel:  disk 11, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jan 26 21:59:05 archimage kernel: md: updating md0 RAID superblock on
device
Jan 26 21:59:05 archimage kernel: sdd1 [events: 00000009](write) sdd1's
sb offset: 8807296
Jan 26 21:59:05 archimage kernel: md: updating md0 RAID superblock on
device
Jan 26 21:59:05 archimage kernel: sdd1 [events: 0000000a](write) sdd1's
sb offset: 8807296
Jan 26 21:59:05 archimage kernel: sdg1 [events: 0000000a](write) sdg1's
sb offset: 8807296
Jan 26 21:59:05 archimage kernel: sdg1 [events: 0000000a](write) sdg1's
sb offset: 8807296
Jan 26 21:59:05 archimage kernel: sdf1 [events: 0000000a](write) sdf1's
sb offset: 8807296
Jan 26 21:59:05 archimage kernel: sdf1 [events: 0000000a](write) sdf1's
sb offset: 8807296
Jan 26 21:59:05 archimage kernel: sde1 [events: 0000000a](write) sde1's
sb offset: 8807296
Jan 26 21:59:05 archimage kernel: sde1 [events: 0000000a](write) sde1's
sb offset: 8807296
Jan 26 21:59:05 archimage kernel: sdc1 [events: ...


Post by jw » Mon, 29 Jan 2001 18:53:02


On Sat, 27 Jan 2001 13:37:29 GMT, The Archimage


>> Second, are there any log-messages that might indicate WHY some drives
>> failed?
>Logs below.  They are long.

>Jan 26 06:05:10 archimage kernel: SCSI disk error : host 0 channel 0 id
>3 lun 0 return code = 2
>Jan 26 06:05:10 archimage kernel: scsidisk I/O error: dev 08:31, sector
>1747688
>Jan 26 06:05:10 archimage kernel: raid5: Disk failure on sdd1, disabling
>device. Operation continuing on 6 devices

This is the important part, I think. What happens after a disk has
failed is interesting, but it shouldn't fail in the first place.

Is it always at the same sector? Always the same disk? Is there
something always happening just before these errors?
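
Those questions can be answered mechanically from /var/log/messages. A rough sketch, using sample lines from the log posted above (point the grep at the real file in practice):

```shell
#!/bin/sh
# Tally SCSI I/O errors by device and sector.  Repeated hits on one sector
# point at bad media; scattered sectors point more at cabling, termination,
# or the driver.  Sample lines are from the log posted earlier in the thread.
cat <<'EOF' > /tmp/messages.sample
Jan 26 06:05:10 archimage kernel: scsidisk I/O error: dev 08:31, sector 1747688
Jan 26 06:05:10 archimage kernel: raid5: Disk failure on sdd1, disabling device. Operation continuing on 6 devices
EOF

# Extract "device sector" pairs and count how often each repeats.
grep 'I/O error' /tmp/messages.sample |
  sed 's/.*dev \([0-9a-f:]*\), sector \([0-9]*\).*/\1 \2/' |
  sort | uniq -c | sort -rn
```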

The message seems awfully short to me; when I get a SCSI error it looks
something like this:

Jan  5 13:13:30 middle kernel: scsi : aborting command due to timeout : pid 6254, scsi0, channel 0, id 1, lun 0 Read (10) 00 00 01 a9 ca 00 00 20 00
Jan  5 13:13:30 middle kernel: sym53c8xx_abort: pid=6254 serial_number=6272 serial_number_at_timeout=6272
Jan  5 13:14:00 middle kernel: scsi : aborting command due to timeout : pid 6263, scsi0, channel 0, id 1, lun 0 Read (10) 00 00 01 aa d0 00 00 20 00
Jan  5 13:14:00 middle kernel: sym53c8xx_abort: pid=6263 serial_number=6282 serial_number_at_timeout=6282
Jan  5 13:14:31 middle kernel: scsi : aborting command due to timeout : pid 6274, scsi0, channel 0, id 1, lun 0 Read (10) 00 00 01 ac 30 00 00 20 00
Jan  5 13:14:31 middle kernel: sym53c8xx_abort: pid=6274 serial_number=6294 serial_number_at_timeout=6294
Jan  5 13:15:06 middle kernel: scsi : aborting command due to timeout : pid 6537, scsi0, channel 0, id 1, lun 0 Read (10) 00 00 01 cb 77 00 00 04 00
Jan  5 13:15:06 middle kernel: sym53c8xx_abort: pid=6537 serial_number=6558 serial_number_at_timeout=6558

If it is always the same sector, it still looks like a hardware failure.
If it varies, perhaps a driver / cabling problem?

Good luck,
Jurriaan
--
All this being the case, I feel that the off-worlder's opinions should
be carefully heeded.
True, said Morlock, especially in view of the powerful warship.
        Jack Vance - Nightlamp
GNU/Linux 2.2.19pre7 SMP 2x1402 bogomips load av: 0.05 0.17 0.08


Post by Jez Thoma » Tue, 30 Jan 2001 00:55:17


This belongs firmly in alt.sys.pc-clone.compaq.servers, so follow-ups set.


Quote:> OK, here's the scenario:

> I have a Compaq Proliant 2500R, dual PPro 200's, a gig of EDO ECC RAM,
> and an external Compaq F1 Raid array with 9 identical Seagate 9.1 gig
> SCSI drives.
<snip - disks fail>
> I called Compaq, and they told me to upgrade to the latest BIOS and
> utilities, and then run the system erase utility.

You needed to do:
- Systemboard firmware
- RAID (SMART) card firmware
- Disk firmware
- System partition upgrade (Thinks - does Linux support the system
partition?)

IMO, the system erase should not have been necessary.

Quote:> I did, then I ran the
> compaq diagnostics.  I ran two complete runs (took 66 hours), and the
> machine, the CPUs, the memory, the SCSI controller, and the disks all
> passed.

Diags won't pick up a firmware bug.

Quote:> Any clues where to look next?  I'm leaning towards bypassing the onboard
> NCR SCSI card and putting an Adaptec 2940UW in a PCI slot and running
> the array off of it.

Eh? Are you not using a Compaq SMART Array card???
Where do you get an "onboard NCR SCSI card" from?

Can't RE-install software RAID on Red Hat + Proliant 2500R after CPUs overheated

Howdy all -

I have a customer who had a Compaq Proliant 2500R with dual Pentium Pro
200's and 1GB Compaq RAM running Red Hat 6.2 set up with software level
RAID across seven 9GB drives in an external drive enclosure.  

The CPU fans grimed up and seized, and the server wound up corrupting
the entire RAID array.  We blew the machine away and started to
re-install Red Hat 6.2.  The install failed (the RAID array came up at
the first boot missing a drive), and I figured maybe I'd cooked the
CPUs.  So, I replaced the CPUs and reinstalled.  Same problem.  After
several re-install attempts, I keep getting NCR SCSI controller errors
(similar to "SIR 17, CCB done queue overflow") from various drives at
various times.

I have run the latest Compaq diagnostics against the server twice, and
they come back fine.  So does anyone have any idea why, all of a sudden,
I can't install Red Hat on this box?  I am stretching here, but is it
possible that when the CPUs overheated they sent instructions to the
drives that actually damaged them?  I can't see how this would be
possible, but the system itself (including the SCSI controller) tests
fine.
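
One way to narrow down controller versus disks is to count controller-level complaints separately from per-disk I/O errors. A sketch over canned sample lines (the "ncr53c8xx:" prefix is a guess at how the driver tags the quoted SIR message; the I/O error line is from the log earlier in this thread):

```shell
#!/bin/sh
# Separate controller-level errors (SIR messages from the NCR driver) from
# per-disk I/O errors.  If the SIR count dominates and spans many drives,
# the HBA or its driver is a likelier culprit than any single disk.
cat <<'EOF' > /tmp/messages.sample2
kernel: ncr53c8xx: SIR 17, CCB done queue overflow
kernel: scsidisk I/O error: dev 08:31, sector 1747688
kernel: ncr53c8xx: SIR 17, CCB done queue overflow
EOF

ctrl=$(grep -c 'SIR' /tmp/messages.sample2)
disk=$(grep -c 'I/O error' /tmp/messages.sample2)
echo "controller-level errors: $ctrl, per-disk errors: $disk"
```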

If convenient, please cc: me directly with any responses.

Thomas Cameron
