Thanks for the replies on this but, as expected, some people gave me the
answer I already knew:
I know how to configure Error Notification via RAS Support. If you read the
original posting, below, it quite clearly states that.
With thanks to Rodney Clark, who seems to be the only person who actually
read the question, here's what I've done (for anybody else who wishes to do
the same):
Don't forget, this is a two-node cluster, which makes it a bit easier.
1. Configure the switch adapters as a network topology resource
2. Decide which node is the quickest to fail over (this will make sense in 3)
3. Configure a custom event, which simply pings the switch address of the
other node. If the ping fails (use 'ping -c1'; we don't want a
config_too_long), 'clstop' the node, with takeover. The calling script must
only be run on one node, or both will attempt to fail over, resulting in a
total cluster failure - believe me, hearts were broken! Nominate the node
from (2) as the one that will do the ping and fail, if necessary.
4. Customise the 'network_down' event, to call the 'pre-event' you created
in (3).
And there you have it. If either switch card fails, network_down will call
your custom event and, if the switch doesn't respond, the node will fail. If
the switch does respond, your script will exit and network_down will do
whatever it needs to do.
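For anybody who wants a starting point, the steps above can be sketched
roughly as below. This is only an outline under stated assumptions, not the
script I actually run: NOMINATED_NODE, PEER_SWITCH_ADDR, PING_CMD and STOP_CMD
are names made up for the illustration, the address is a placeholder, and
STOP_CMD deliberately defaults to a harmless echo. Put in the real 'clstop'
invocation (with takeover) for your HACMP level yourself.

```shell
#!/bin/sh
# Sketch of a pre-event script for network_down (illustrative only).
# All variable names and default values below are assumptions.

NOMINATED_NODE=${NOMINATED_NODE:-nodeA}          # hypothetical: the node chosen in step 2
PEER_SWITCH_ADDR=${PEER_SWITCH_ADDR:-192.0.2.2}  # placeholder: other node's switch address
PING_CMD=${PING_CMD:-ping}
STOP_CMD=${STOP_CMD:-"echo WOULD RUN: clstop with takeover"}  # replace with the real command

# Only the nominated node may act, otherwise both nodes could stop at once.
is_nominated_node() {
    [ "$(uname -n)" = "$NOMINATED_NODE" ]
}

# Return 0 if the given address answers a single ping.
peer_switch_alive() {
    $PING_CMD -c1 "$1" >/dev/null 2>&1
}

# Ping once; run the stop command only if the ping fails.
check_switch_and_fail() {
    if peer_switch_alive "$1"; then
        echo "switch OK"
        return 0
    fi
    echo "switch unreachable, stopping node"
    $STOP_CMD
}

main() {
    # Do nothing on the node that was not nominated.
    is_nominated_node || return 0
    check_switch_and_fail "$PEER_SWITCH_ADDR"
}

# Act only when called as "<script> run", so sourcing it has no side effects.
if [ "${1:-}" = "run" ]; then
    main
fi
```

The single ping keeps the event short, which is the point of the
config_too_long warning in (3), and the hostname check is one way of making
sure only the node from (2) ever tries to stop itself.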
The only (small) problem you have here is that the physical card, on the
nominated node from (2), could fail. In this case, as all of the traffic
will be internal to the node, there is no issue as far as live running is
concerned. The only issue is when you need to arrange for the card to be
replaced / repaired. What you need to do here is to edit your script, so
that the node doesn't fail (comment out the 'clstop'), fail all resources
over to the other node and Bob's your Mother's Brother.
Don't forget to change the script back afterwards.
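If you'd rather not edit the script by hand each time, one alternative (not
what I did, just a sketch of a different technique) is to guard the stop with
a flag file. MAINT_FLAG, its path and both function names are invented for
the illustration.

```shell
#!/bin/sh
# Sketch: skip the node stop while a maintenance flag file exists.
# MAINT_FLAG and its default path are assumptions for this example.

MAINT_FLAG=${MAINT_FLAG:-/tmp/switch_maintenance}

# Return 0 when the flag file is present.
in_maintenance() {
    [ -f "$MAINT_FLAG" ]
}

# Run the command passed as arguments unless the flag file is present.
stop_unless_maintenance() {
    if in_maintenance; then
        echo "maintenance flag present, not stopping node"
        return 0
    fi
    "$@"
}
```

Touch the flag file before the card is swapped and remove it afterwards;
removing it takes the place of changing the script back.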
Boomshanka,
John
Quote:
> Can anybody help?
> I'm trying to promote a switch failure (SP Switch, not HPS) to a total node
> failure, in HACMP 4.4 (AIX 4.3.3, PSSP 3.2 SP high node), which in itself
> isn't a problem. However, when I do an 'errpt -t' to try and decide what
> failures I am going to trap, using RAS, loads come out, and I don't want to
> promote them all to node failure, as some are not as serious as others -
> obviously. Ideally, I would like to configure them in groups, such as;
> "Worm_dead" errors - try to re-start Worm
> "CSS0 PERM H/W" errors - Fail node
> etc, etc
> If I was to configure them all, one-by-one, it'd take ages!
> Surely there's an easier way to trap all? I don't want to configure IP
> aliasing, so the switch fails onto the ethernet, for various reasons, so
> please don't suggest this.
> The answer I got, from CallAIX was a loud "RTFM", but all of the "HACMP"
> redbooks simply show you how to configure RAS, and I already know that.
> Surely somebody, somewhere, has done this before.
> Basically, can I trap certain classes of error, using HA?
> Is it possible to determine 'probable causes' and 'recommended actions',
> from errpt templates?
> Much appreciated,
> John
> PS. Please reply by e-mail, as well as placing a copy on the newsgroup for
> other interested parties