>> I recently bought some HSJ50's, and I was wondering if I should convert to
>> controller-based mirrorsets or stick with host-based shadowing (phase II).
>> Are there any advantages/disadvantages to controller-based mirroring?
> I've posted a number of times on this topic, but at the risk of
> boring everyone, here's my take on the subject, one more time. :-)
> It depends on your requirements, of course. Do you need to
> avoid any single point of failure, that is, is availability your
> primary criterion? Or is making maximum use of the admittedly
> expensive controllers more important? Are you doing a split-site
> cluster, for example?
> As background, we had all our storage on HSC's, with RA9x's and
> RA7x's dual ported to pairs of HSC's so we had no single point of
> failure. We had three VAX6xxx's on the CI serving the disks to
> about 30 VAXstations. I think the most important problems in our
> environment were that (a) the system disk was shadowed and (b) the
> quorum disk could not be shadowed. Because of (a) any system crash,
> including workstation crashes, would lead to a full system disk
> merge. _That_ was always painful! Because of (b) we had a
> potential single point of failure...which we finessed by suitable
> choice of votes, expected_votes, and qdskvotes such that the cluster
> would maintain quorum if all CI nodes were up but we lost the quorum
> disk [comments about not needing a quorum disk in this configuration
> are noted...we need the flexibility to run the cluster with a single
> voting node up]. The more shadow sets that were merging after a
> crash, the worse it was, of course. [And for reasons that are still
> unclear to me, it was very difficult to do even a normal shutdown of
> a CI node without generating at least a _few_ full merges...something
> about one of the nodes not receiving the shutting node's last
> gasp message...]
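To make the vote arithmetic in the paragraph above concrete, here is a minimal sketch in Python. It relies on the usual OpenVMS rule that quorum is (EXPECTED_VOTES + 2) / 2, rounded down. The specific values (1 vote per CI node, 2 votes for the quorum disk) are my own assumption for illustration, not the settings of the cluster described above; the function names are hypothetical too.

```python
def quorum(expected_votes):
    # OpenVMS computes quorum as (EXPECTED_VOTES + 2) / 2, rounded down.
    return (expected_votes + 2) // 2


def has_quorum(node_votes_up, qdsk_up, qdskvotes, expected_votes):
    # Sum the votes of the nodes that are up, plus the quorum disk if reachable.
    votes = sum(node_votes_up) + (qdskvotes if qdsk_up else 0)
    return votes >= quorum(expected_votes)


# Hypothetical settings: three CI nodes with 1 vote each, quorum disk with 2.
NODE_VOTES = [1, 1, 1]
QDSKVOTES = 2
EXPECTED_VOTES = sum(NODE_VOTES) + QDSKVOTES

# All CI nodes up, quorum disk lost: the cluster keeps running.
print(has_quorum(NODE_VOTES, False, QDSKVOTES, EXPECTED_VOTES))
# A single voting node plus the quorum disk: the cluster also keeps running.
print(has_quorum([1], True, QDSKVOTES, EXPECTED_VOTES))
# Two nodes up but no quorum disk: the cluster hangs waiting for quorum.
print(has_quorum([1, 1], False, QDSKVOTES, EXPECTED_VOTES))
```

With these assumed values both requirements are met: losing the quorum disk is survivable while all CI nodes are up, and a single voting node can still run the cluster as long as the quorum disk is available.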
> When we added the HSJ52/SW300 storage, I moved the VAX and (new)
> Alpha system disks to the HSJ's. I've configured _most_ of the
> disks in the SW300 as 2-member mirror sets and I've split the mirror
> sets between the two HSJ's. So I can survive a disk failure or a
> controller failure.
Actually, you've got a false sense of security (albeit small). In some
cases, you can NOT survive a disk failure. Speaking from experience here
(sigh...), I've had a disk go bad in such a way that it hung the SCSI
bus. The HSJ (a -40 in this case) tried to clear the problem by crashing
itself. Somebody wasn't thinking the situation through fully here when they
designed this, since you can guess what happens next. The disk fails over
to the redundant controller, like it should. Oops, the bus is still hung.
Yup, the second controller crashes. By now, the first controller is back
up, takes control of the failed disk, and crashes. Repeat forever... Oops,
some critical drives are on that controller pair. Oops, there goes the
cluster. MAJOR OOPS! My first change window at a new job (less than a week
later), and I had the drive fail right after an install of the CLUSIO patch
that replaces the shadowing software. Drove me nuts, and still worries me.
I worked this through with Digital for a month or so, and Digital
Engineering even took the drive for analysis. The upshot of all this is
that the problem is there, isn't likely to be fixed, and they suggested that
I never plug the drive in again. I sent it to them for replacement :-).
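The crash loop is easier to see as a toy simulation. This is only a sketch of the behaviour I observed, not of the HSJ firmware: the controller names, the function, and the simplification that the other controller is always back up by the time the disk fails over to it are all my assumptions.

```python
def simulate_failover(disk_hangs_bus, max_steps=6):
    """Toy model of a redundant controller pair sharing one SCSI bus."""
    controllers = ["A", "B"]
    owner = 0  # index of the controller that currently owns the disk
    events = []
    for _ in range(max_steps):
        name = controllers[owner]
        if not disk_hangs_bus:
            events.append(f"{name} serves the disk")
            break
        # The owning controller crashes itself to clear the hung bus, but the
        # faulty disk is still attached, so the bus stays hung after failover.
        events.append(f"{name} crashes trying to clear the hung bus")
        owner = 1 - owner  # the disk fails over to the partner controller
    return events


for event in simulate_failover(disk_hangs_bus=True, max_steps=4):
    print(event)
```

The point of the model is that failover moves the disk but not the fault: as long as the bad drive stays on the shared bus, every controller that takes ownership inherits the hang.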
I've been torn between controller-based mirroring and host-based shadowing.
My thinking now is that I'll continue to use HBVS, and make sure that I
split drives across cabinets, not controllers (currently my cluster is
configured with both members of the shadow set in the same cabinet, sharing
the redundant controller pair).
I may have to change my thinking later this year as we go to a
disaster-tolerant cluster. HBVS supports at most 3 member shadow sets, so I
can't keep 2 members at each of 2 physical locations. I'll have to choose between 2 member
sets, one at each site, or mirroring within each site, and then shadowing
across the sites. If I don't take the latter option, and if I do lose a
site, I've suddenly lost all shadowing, and I'd be forced to scramble to
roll in 200+GB of disk space to re-shadow everything on short notice.
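Here's the same trade-off as a few lines of Python. It's a rough sketch that just counts surviving copies of a volume after a site is lost; the site names, layouts, and function names are hypothetical.

```python
def copies_after_site_loss(layout, lost_site):
    # layout maps a site name to the number of disk copies of a volume there.
    return sum(n for site, n in layout.items() if site != lost_site)


def still_redundant(layout, lost_site):
    # A volume is still protected if at least 2 copies survive.
    return copies_after_site_loss(layout, lost_site) >= 2


# Option 1 (assumed layout): 2 member shadow set, one member at each site.
one_member_per_site = {"site_a": 1, "site_b": 1}

# Option 2 (assumed layout): a 2-disk controller mirrorset at each site,
# with the two mirrorsets shadowed across the sites by HBVS.
mirrorset_per_site = {"site_a": 2, "site_b": 2}

print(still_redundant(one_member_per_site, "site_b"))
print(still_redundant(mirrorset_per_site, "site_b"))
```

The second layout doubles the disk count, but it is the only one of the two that leaves every volume protected while the lost site is rebuilt.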
> So there's a cost/benefit analysis to be done here. I gather
> from some of Ed Wilts' posts in the past, that he chose to remain
> with HBVS in his former cluster, so you can rest assured that
> there's nothing inherently wrong with that approach. :-)
Thanks. I don't believe that there is anything inherently wrong with either
approach. There are costs and benefits (and risks) associated with both
options. The nice thing is that Digital allows us to choose which one is
right for our own individual requirements. The tricky part is that we all
have to understand those risks and benefits to make an intelligent decision.
.../Ed (who's playing with a new News client but isn't ready to commit
to a switch yet)
Ed Wilts
Mounds View, MN