physical drive replacement

Post by frank brow » Thu, 26 Jun 2003 00:27:31



I'm getting UNCORRECTABLE ECC errors on DUA12 which is actually a pair of
striped RZ29L (SCSI disks) shadowed with an identical 2-volume stripeset.
(We have 4 disks, configured as 2 shadowed stripesets.)  The drives are
attached to an HSD10 DSSI-SCSI controller in a Storageworks BA350 tower
connected to a VMS 5.5-2 VAXcluster.

My first problem is determining which of the 2 physical drives in stripeset
DUA12 is throwing the errors.  ANALYZE/ERROR identifies the unit as
_RAID1$DUA12.  Since DUA12 is a stripeset, how can I tell which physical drive
it is?

Once I identify the drive with the errors, I'd like to replace it with a
spare.  Ideally I'd love to simply dismount the volume, pop the old drive
module out of the enclosure, replace the drive inside the module, slide it
back into the enclosure, initialize the drive, remount it as part of the
shadowset and have the system rebuild the volume.  However I realize this is
improbable.

1. The HSD10 manual says I can warm-swap the drive but I need to 'quiesce
the SCSI bus'.  Is there a way to perform this operation from the VMS
command line or do I need to shut down VMS to get to the >>> console to enter
HSD10 commands?
2. Will I need to recreate the stripeset at the HSD10 or will the existing
stripeset definition work with the replacement drive (since it's the same
model in the same slot)?
3. Any other thoughts or suggestions on dealing with this situation.

-Frank Brown
Seattle Fire Dept.
http://www.inwa.net/~frog/

 
 
 

Post by John Santo » Thu, 26 Jun 2003 12:12:42



> 1. The HSD10 manual says I can warm-swap the drive but I need to 'quiesce
> the SCSI bus'.  Is there a way to perform this operation from the VMS
> command line or do I need to shutdown VMS to get to >>> console to enter
> HSD10 commands?

Frank - since no one else has responded yet (though I only see 36 new
messages today, so my ISP might be having news server problems), I'll
throw in my 2 cents.

I haven't used an HSD10, but on HSZ & HSJ controllers there are a set
of buttons on the front, one for each bus.  You hold down the button
for a few seconds and it starts flashing.  This means the controller
has noticed the button press and is stalling all I/O to that bus.  You
then have about 30 seconds to make changes.  (I'm not sure, but I think
you can press the button again to resume activity when you are done,
but the controller will time out after about 30 seconds and resume
even if you do nothing.)  While quiesced, the controller will continue
to accept I/O requests for drives on the bus, but won't do anything
about them.  To the hosts, it just looks like the disks have gotten
really slow, but nothing breaks.  You probably want to do this when
the system is relatively idle, just to keep your users from complaining.

> 2. Will I need to recreate the stripeset at the HSD10 or will the existing
> stripeset definition work with the replacement drive (since it's the same
> model in the same slot)?

Sorry, don't know how HSD's handle stripe sets...

> 3. Any other thoughts or suggestions on dealing with this situation.

You say "shadowed stripesets"... HBVS?  

My guess, since half the blocks will need to be replaced, is that
the write-logging stuff in later VMS (probably not available in V5.5
anyway) wouldn't help in this and you'll have to do a full shadow
copy, but I think you should be able to dismount the bad stripeset
from the shadowset, if it hasn't already been kicked out (do this
before pulling the bad drive), replace the broken drive, reconstitute
the stripe set (Don't know how to do this), init the reconstituted
DUA12: with the same volume label as your original stripe set,
and mount it into the shadow set, which should trigger a shadow copy
(not Merge!) to it.  (Half the blocks, more or less, should still be
identical to the source, but the other half will be blank or test
patterns or old data, so you definitely want it to do a copy.)
In an hour or so, the copy should complete and you should be all
set.
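
As a rough sketch of the VMS side of that sequence, assuming host-based
volume shadowing with a virtual unit named DSA12: and a volume label of
USERDATA (both names are hypothetical; substitute whatever SHOW DEVICE
reports on your system), the commands might look something like this:

```dcl
$ ! Assumed names: DSA12: (shadow set virtual unit), USERDATA (volume label).
$ SHOW DEVICE DSA12:                  ! confirm the current shadow set members
$ DISMOUNT RAID1$DUA12:               ! remove the failing stripeset from the set
$ ! ... replace the drive and rebuild the stripeset on the controller ...
$ INITIALIZE RAID1$DUA12: USERDATA    ! same label as the surviving member
$ MOUNT/SYSTEM DSA12: /SHADOW=(RAID1$DUA12:) USERDATA  ! should start a full copy
$ SHOW DEVICE DSA12:                  ! check on the copy
```

On a cluster you may need /CLUSTER rather than /SYSTEM, depending on how
the set is mounted on the other nodes.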

No down time for the application, provided you don't mind
running without the shadow backup you normally have.  If really
paranoid, you can shut down the application, backup the good
remaining stripeset (the good half of the shadow set), do the
drive replace and shadow set rebuild, and then turn the application
back on.  This method will result in considerable downtime, probably
about an hour plus whatever time it takes to backup the good disks and
to swap out the bad disk and rebuild the stripeset, but you will have
a good backup at all times.

Other people recently have discussed the benefits of doing a
physical backup of a shadow set to a new disk that you want to add
to the shadow set, in order to reduce the amount of copying that
the shadow copy needs to do.  I think shadow copying operates in
a very cautious way.  Something like read from the source disk,
read-check again to verify it reads okay, read from the destination
disk (looking for a potential bad block), read-check the destination
disk, compare the source and destination, and if different, write
to the destination disk, and then write-check the stuff just written.
If a bad block or check failure is found anywhere in all this, then
the bad block replacement process is initiated.  I think if you
backup/physical the source disk to the destination disk first, you
save the copy from having to do the writes and write-checks, but
of course the system still has to write all the data while doing
the backup, and has to read the data an extra time, so I don't
see how this saves you much, especially if you /verify the backup.
A Google search should find the thread that discussed this, a few
months ago.

> -Frank Brown
> Seattle Fire Dept.
> http://www.inwa.net/~frog/

Fire Dept?  Maybe you want to be paranoid :-)

--
John Santos
Evans Griffiths & Hart, Inc.
781-861-0670 ext 539

 
 
 

Post by mckinn.. » Thu, 26 Jun 2003 15:53:15





>> 3. Any other thoughts or suggestions on dealing with this situation.

> You say "shadowed stripesets"... HBVS?  

You'll want to first dissolve the shadowset with a DISMOUNT of the failing
member from VMS.

Then on the controller,
 record the SHOW display for unit, stripeset and the failing disk,
 DELETE the stripeset's unit,
 DELETE the stripeset,
 DELETE the disk, quiesce the bus and perform the physical swap,
 ADD the disk as seen in your prior SHOW display,
 INIT the disk,
 ADD the stripeset as seen in your prior SHOW display,
 INIT the stripeset,
 ADD the unit as seen in your prior SHOW display,
 and SET any characteristics that are absent on the unit.

And back in VMS re-form your shadowset with the appropriate MOUNT command.

--
- Jim
 
 
 

Post by John Santo » Sat, 28 Jun 2003 11:53:28






> You'll want to first disolve the shadowset with a DISMOUNT of the failing
> member from VMS.

I don't think you actually have to dissolve the shadow set, just remove
the failing member from it (by dismounting it with a simple
"$ dismount RAID1$DUA12:").  This should leave the other disk mounted
as a single-member shadow set.  (This could be a matter of terminology;
I think of dissolving the shadow set as meaning to dismount the set and
remount the individual members with /override=shadow_membership, thus
making them no longer be a shadow set.  Or did you just mean "reduce
the shadow set to one member"?)
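
To make the distinction concrete, here is a hedged sketch of the two
operations.  DSA12: as the virtual unit, OTHER$DUA13: as the second member
and USERDATA as the label are all hypothetical names, not taken from
Frank's configuration:

```dcl
$ ! Reduce the shadow set to one member (the set itself stays mounted):
$ DISMOUNT RAID1$DUA12:
$ SHOW DEVICE DSA12:                  ! should now list a single member
$
$ ! Dissolve the shadow set entirely (assumed names throughout):
$ DISMOUNT DSA12:
$ MOUNT/OVERRIDE=SHADOW_MEMBERSHIP OTHER$DUA13: USERDATA
```

Only the first form is needed for a drive replacement; the second is
shown purely for contrast.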

> Then on the controller,
>  record the SHOW display for unit, stripeset and the failing disk,
>  DELETE the stripeset's unit,
>  DELETE the stripeset,
>  DELETE the disk, quiesce the bus and perform the physical swap,

Above, I suggested there might be buttons on the HSD10 to quiesce
the bus, but someone in another reply said HSD10's don't have
buttons...  I have an HSD05 (no buttons, either) but have used a
lot of HSJ40's, HSJ50's, HSZ70's and HSZ80's, all of which do have
buttons.  Do you know how to quiesce the bus on the HSD10 (or
for that matter, on the HSD05?)  Is there actually a button on
it that I haven't noticed?

>  ADD the disk as seen in your prior SHOW display,
>  INIT the disk,
>  ADD the stripeset as seen in your prior SHOW display,
>  INIT the stripeset,
>  ADD the unit as seen in your prior SHOW display,
>  and SET any characteristics that are absent on the unit.

> And back in VMS re-form your shadowset with the appropriate MOUNT command.

--
John Santos
Evans Griffiths & Hart, Inc.
781-861-0670 ext 539
 
 
 

Post by Vinit Ad » Sat, 28 Jun 2003 21:45:47


There is no need to shut down the node!  After all, it's a VMS node.

You can access the HSD10 right from VMS while it's up and running.

SET HOST /DUP /SERVER=MSCP$DUP /TASK=PARAMS <HSD Name>

Alternatively, you could try /TASK=DIRECT or /TASK=CLI.

Different HSD models work with different tasks.  I work with so many
that I forget the exact HSD10 implementation.

In case you get an error about FYDRIVER not being loaded:

$ MCR SYSGEN CONN FYA0:/NOADAP/DRIVER=FYDRIVER
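
Put in order, a session might look roughly like the sketch below.  The
node name HSD10A is a placeholder for whatever name the controller has on
your DSSI bus, and which /TASK value the HSD10 accepts is not confirmed
here:

```dcl
$ ! HSD10A is a hypothetical controller node name.
$ MCR SYSGEN CONNECT FYA0/NOADAPTER/DRIVER=FYDRIVER  ! only if FYDRIVER is not loaded
$ SET HOST/DUP/SERVER=MSCP$DUP/TASK=PARAMS HSD10A    ! if refused, try DIRECT or CLI
```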

Hope this helps :)

Cheers,
Vinit Adya

 
 
 
