> I'm getting UNCORRECTABLE ECC errors on DUA12 which is actually a pair of
> striped RZ29L (SCSI disks) shadowed with an identical 2-volume stripeset.
> (We have 4 disks, configured as 2 shadowed stripesets.) The drives are
> attached to an HSD10 DSSI-SCSI controller in a Storageworks BA350 tower
> connected to a VMS 5.5-2 VAXcluster.
> My first problem is determining which of the 2 physical drives in stripeset
> DUA12 is throwing the errors. ANALYZE/ERROR identifies the unit as
> _RAID1$DUA12. Since DUA12 is a stripeset, how can I tell which physical drive
> it is?
> Once I identify the drive with the errors, I'd like to replace it with a
> spare. Ideally I'd love to simply dismount the volume, pop the old drive
> module out of the enclosure, replace the drive inside the module, slide it
> back into the enclosure, initialize the drive, remount it as part of the
> shadowset and have the system rebuild the volume. However I realize this is
> 1. The HSD10 manual says I can warm-swap the drive but I need to 'quiesce
> the SCSI bus'. Is there a way to perform this operation from the VMS
> command line or do I need to shutdown VMS to get to >>> console to enter
> HSD10 commands?
Frank - since no one else has responded yet (though I only see 36 new
messages today, so my ISP might be having news server problems), I'll
throw in my 2 cents.
I haven't used an HSD10, but on HSZ & HSJ controllers there is a set
of buttons on the front, one for each bus. You hold down the button
for a few seconds and it starts flashing. This means the controller
has noticed the button press and is stalling all I/O to that bus. You
then have about 30 seconds to make changes. (I'm not sure, but I think
you can press the button again to resume activity when you are done,
but the controller will time out after about 30 seconds and resume
even if you do nothing.) While quiesced, the controller will continue
to accept I/O requests for drives on the bus, but won't do anything
about them. To the hosts, it just looks like the disks have gotten
really slow, but nothing breaks. You probably want to do this when
the system is relatively idle, just to keep your users from complaining.
> 2. Will I need to recreate the stripeset at the HSD10 or will the existing
> stripeset definition work with the replacement drive (since it's the same
> model in the same slot)?
Sorry, don't know how HSD's handle stripe sets...
> 3. Any other thoughts or suggestions on dealing with this situation.
You say "shadowed stripesets"... HBVS?
My guess, since half the blocks will need to be replaced, is that
the write-logging stuff in later VMS (probably not available in V5.5
anyway) wouldn't help in this and you'll have to do a full shadow
copy, but I think you should be able to dismount the bad stripeset
from the shadowset, if it hasn't already been kicked out (do this
before pulling the bad drive), replace the broken drive, reconstitute
the stripe set (Don't know how to do this), init the reconstituted
DUA12: with the same volume label as your original stripe set,
and mount it into the shadow set, which should trigger a shadow copy
(not Merge!) to it. (Half the blocks, more or less, should still be
identical to the source, but the other half will be blank or test
patterns or old data, so you definitely want it to do a copy.)
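
To make the "half the blocks" point concrete, here is a toy sketch of a 2-member stripeset. The chunk size and the simple round-robin layout are assumptions for illustration only; I don't know how the HSD10 actually lays out its stripes, and the names `CHUNK_BLOCKS`, `member_for_block` and `stale_fraction` are made up for this example.

```python
# Toy model only: chunk size and round-robin layout are assumptions,
# not the HSD10's documented behaviour.
CHUNK_BLOCKS = 8   # hypothetical stripe chunk size, in blocks
MEMBERS = 2        # two drives per stripeset, as in the post


def member_for_block(lbn, chunk_blocks=CHUNK_BLOCKS, members=MEMBERS):
    """Return the index of the stripe member that holds logical block lbn."""
    return (lbn // chunk_blocks) % members


def stale_fraction(total_blocks, replaced_member):
    """Fraction of the stripeset's logical blocks that sit on the replaced drive."""
    stale = sum(
        1 for lbn in range(total_blocks)
        if member_for_block(lbn) == replaced_member
    )
    return stale / total_blocks


if __name__ == "__main__":
    # With two members, about half of the logical blocks land on the new drive.
    print(stale_fraction(1024, replaced_member=1))
```

Under these assumptions the stale blocks are interleaved across the whole volume in chunk-sized runs, which is why a full copy (rather than a merge) is the safe choice.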
In an hour or so, the copy should complete and you should be all set.
No down time for the application, provided you don't mind
running without the shadow backup you normally have. If really
paranoid, you can shut down the application, backup the good
remaining stripeset (the good half of the shadow set), do the
drive replace and shadow set rebuild, and then turn the application
back on. This method will result in considerable downtime, probably
about an hour plus whatever time it takes to backup the good disks and
to swap out the bad disk and rebuild the stripeset, but you will have
a good backup at all times.
Other people recently have discussed the benefits of doing a
physical backup of a shadow set to a new disk that you want to add
to the shadow set, in order to reduce the amount of copying that
the shadow copy needs to do. (I think shadow copying operates in
a very cautious way. Something like read from the source disk,
read-check again to verify it reads okay, read from the destination
disk (looking for a potential bad block), read-check the destination
disk, compare the source and destination, and if different, write
to the destination disk, and then write-check the stuff just written.
If a bad block or check failure is found anywhere in all this, then
the bad block replacement process is initiated.) I think if you
backup/physical the source disk to the destination disk first, you
save the copy from having to do the writes and write-checks, but
of course the system still has to write all the data while doing
the backup, and has to read the data an extra time, so I don't
see how this saves you much, especially if you /verify the backup.
A Google search should find the thread that discussed this, a few
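
The compare-then-write idea described above can be sketched roughly as follows. This only models the guess given in this post (read both sides, compare, write on mismatch); it is not the real shadowing driver's algorithm, it leaves out the read-checks, write-checks and bad block replacement, and `shadow_copy` is a made-up name.

```python
def shadow_copy(source, target):
    """Make target match source block by block, writing only on mismatch.

    Both arguments are lists of blocks of equal length; target is updated
    in place. Returns (blocks_compared, blocks_written).
    """
    if len(source) != len(target):
        raise ValueError("source and target must have the same block count")
    written = 0
    for i, src_block in enumerate(source):
        # Read both sides and compare; write only when they differ.
        if target[i] != src_block:
            target[i] = src_block
            written += 1
    return len(source), written


if __name__ == "__main__":
    source = [f"block-{i}" for i in range(100)]

    # Freshly initialized target: every block differs, so every block is written.
    blank = [""] * len(source)
    print(shadow_copy(source, blank))

    # Target preloaded by a physical backup: the copy compares but writes nothing.
    preloaded = list(source)
    print(shadow_copy(source, preloaded))
```

In this model the preloaded target saves all of the writes during the copy, but the reads and compares still happen for every block, and the preload itself had to write every block once already, which is the same reason for doubting that it saves much.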
> -Frank Brown
> Seattle Fire Dept.
Fire Dept? Maybe you want to be paranoid :-)
Evans Griffiths & Hart, Inc.
781-861-0670 ext 539