Promoting switch failures to node failures (HACMP 4.4)


Post by John Skyne » Thu, 22 Feb 2001 03:21:57



Can anybody help?

I'm trying to promote a switch failure (SP Switch, not HPS) to a total node
failure, in HACMP 4.4 (AIX 4.3.3, PSSP 3.2 SP high node), which in itself
isn't a problem. However, when I do an 'errpt -t' to try and decide what
failures I am going to trap, using RAS, loads come out, and I don't want to
promote them all to node failure, as some are not as serious as others -
obviously. Ideally, I would like to configure them in groups, such as;

    "Worm_dead" errors - try to re-start Worm
    "CSS0 PERM H/W" errors - Fail node
    etc, etc

If I was to configure them all, one-by-one, it'd take ages!

Surely there's an easier way to trap all? I don't want to configure IP
aliasing, so the switch fails onto the ethernet, for various reasons, so
please don't suggest this.

The answer I got, from CallAIX was a loud "RTFM", but all of the "HACMP"
redbooks simply show you how to configure RAS, and I already know that.
Surely somebody, somewhere, has done this before.

Basically, can I trap certain classes of error, using HA?

Is it possible to determine 'probable causes' and 'recommended actions',
from errpt templates?

Much appreciated,

John

PS. Please reply by e-mail, as well as placing a copy on the newsgroup for
other interested parties

 
 
 


Post by Matthew Land » Thu, 22 Feb 2001 05:08:59



> Can anybody help?

> I'm trying to promote a switch failure (SP Switch, not HPS) to a total node
> failure, in HACMP 4.4 (AIX 4.3.3, PSSP 3.2 SP high node), which in itself
> isn't a problem. However, when I do an 'errpt -t' to try and decide what
> failures I am going to trap, using RAS, loads come out, and I don't want to
> promote them all to node failure, as some are not as serious as others -
> obviously. Ideally, I would like to configure them in groups, such as;

[SNIP]

> Basically, can I trap certain classes of error, using HA?

> Is it possible to determine 'probable causes' and 'recommended actions',
> from errpt templates?

Yes, with HACMP there is a facility called "Error Notification".
This is a hook into the AIX error log daemon (errdemon, not syslogd) on
which you can build MANY test conditions for errors ENTERING the error
log.  Your application can then take the data passed to it, determine the
type of error that occurred and decide what to do with it.  You could easily
set up error notification to kick off your analysis script on any [P]erm
[H]ardware error on a specific adapter (the SP Switch adapter).  Your
script will then take the info passed to it and determine what to do.

Error notification is in standard AIX, but it is all ODM additions
(objects in the errnotify class).  HACMP is the only application that I
know of that gives a nice smitty interface for setting up error
notification.

smitty hacmp -> RAS support -> Error Notification
or
smitty cm_EN_menu
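
For anyone who would rather see the ODM side than the smitty panels, here is a minimal sketch of generating an errnotify stanza by hand. The object name, resource name and method path in the usage line are made up for illustration; the en_* field names and the $1/$6/$9 method arguments (sequence number, resource name, error label) are the standard AIX ones, but check them against your own level before loading anything.

```shell
# Print an errnotify stanza that fires on permanent hardware errors
# logged against a single resource.
#   $1 = notify object name (hypothetical)
#   $2 = resource name as shown by errpt (assumed, e.g. css0)
#   $3 = full path of the notify method (hypothetical script)
make_errnotify_stanza() {
    cat <<EOF
errnotify:
        en_name = "$1"
        en_persistenceflg = 1
        en_class = "H"
        en_type = "PERM"
        en_resource = "$2"
        en_method = "$3 \$1 \$6 \$9"
EOF
}
```

Typical use would be something like 'make_errnotify_stanza css0_perm css0 /usr/local/bin/css_notify.sh > /tmp/css0_perm.add' followed by 'odmadd /tmp/css0_perm.add'. One stanza per resource and error type covers a whole group of templates, which avoids configuring each error label one by one.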

 - Matt

--
_______________________________________________________________________

   << Comments, views, and opinions are mine alone, not IBM's. >>

 
 
 


Post by Tom Weav » Thu, 22 Feb 2001 05:18:53




>I'm trying to promote a switch failure (SP Switch, not HPS) to a total node
>failure, in HACMP 4.4 (AIX 4.3.3, PSSP 3.2 SP high node), which in itself
>isn't a problem. However, when I do an 'errpt -t' to try and decide what
>failures I am going to trap, using RAS, loads come out, and I don't want to
>promote them all to node failure, as some are not as serious as others -
>obviously. Ideally, I would like to configure them in groups, such as;

>    "Worm_dead" errors - try to re-start Worm
>    "CSS0 PERM H/W" errors - Fail node
>    etc, etc

>If I was to configure them all, one-by-one, it'd take ages!

>Surely there's an easier way to trap all?
>Basically, can I trap certain classes of error, using HA?

Some of this may have already been done for you with "Automatic Error
Notification" - at least for the ones that you might want to fail the node.

The error notification facility lets you specify a method to run when specific
errors are logged - e.g., permanent errors for resource CSS0.  However, you're
going to have to decide what's the appropriate response.
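
As a rough sketch of "deciding the appropriate response" in one place, a single notify method can sort errors into groups instead of having one method per template. The grouping rules and the action names below are assumptions made up for illustration (they only mirror the "Worm" and "CSS0 PERM H/W" groups from the question); the real labels have to come from your own 'errpt -t' output.

```shell
# classify_error LABEL TYPE RESOURCE
# Prints one action name: restart_worm, fail_node or log_only.
# The patterns are illustrative guesses, not real error labels.
classify_error() {
    err_label=$1
    err_type=$2
    err_resource=$3

    # Anything whose label mentions the Worm: try a restart first.
    case "$err_label" in
        *[Ww][Oo][Rr][Mm]*) echo restart_worm; return 0 ;;
    esac

    # Permanent errors on a css adapter: promote to node failure.
    case "$err_resource:$err_type" in
        css*:PERM) echo fail_node ;;
        *)         echo log_only ;;
    esac
}
```

A notify method registered with arguments such as '$9 $4 $6' (error label, error type, resource name) could call 'classify_error "$1" "$2" "$3"' and branch on the result.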

>Is it possible to determine 'probable causes' and 'recommended actions',
>from errpt templates?

If you know the error label, you could always enter:

errpt -J <label goes here> -t -a

(or use -j with the error id instead of -J with the label) and look at the
probable causes and recommended actions fields.

--
______________
Tom Weaver        (512) 838 8277, T/L 678-8277

 
 
 


Post by John Skyne » Sat, 24 Feb 2001 08:47:13


Thanks for the replies on this but, as expected, some people gave me the
answer I already knew:

I know how to configure Error Notification via RAS Support - If you read the
original posting, below, it quite clearly states that.

With thanks to Rodney Clark, who seems to be the only person who actually
read the question, here's what I've done (for anybody else who wishes to do
the same):

Don't forget, this is a two-node cluster, which makes it a bit easier.

1. Configure the switch adapters as a network topology resource.
2. Decide which node is the quickest to fail over (this will make sense
in 3).
3. Configure a custom event, which simply pings the switch address of the
other node. Do a single ping ('ping -c1'); we don't want a config_too_long.
If the ping fails, 'clstop' the node, with takeover. The calling script must
only be run on one node, or both will attempt to fail over, resulting in a
total cluster failure - believe me, hearts were broken! Nominate the node
from (2) as the one that will do the ping and fail, if necessary.
4. Customise the 'network_down' event to call the pre-event you created
in (3).

And there you have it. If either switch card fails, network_down will call
your custom event and, if the switch doesn't respond, the node will fail. If
the switch does respond, your script will exit and network_down will do
whatever it needs to do.
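
A minimal sketch of the kind of pre-event script described in steps 3 and 4, for anybody who wants a starting point. Everything configurable here is an assumption: the peer address and node name are placeholders, 'hostname' may not match your HACMP node name, and the default clstop path and flags are only my guess at "graceful with takeover", so check them against your own installation first.

```shell
# All defaults below are placeholders or assumptions; override as needed.
PEER_SWITCH_ADDR=${PEER_SWITCH_ADDR:-peer_sw}      # hypothetical switch address of the other node
NOMINATED_NODE=${NOMINATED_NODE:-nodeA}            # hypothetical: the node chosen in step 2
LOCAL_NODE_CMD=${LOCAL_NODE_CMD:-hostname}         # assumes hostname equals the node name
PING_CMD=${PING_CMD:-"ping -c 1"}                  # one ping only, to avoid config_too_long
STOP_CMD=${STOP_CMD:-"/usr/sbin/cluster/utilities/clstop -y -N -gr"}  # assumed path and flags

check_switch_and_fail() {
    local_node=$($LOCAL_NODE_CMD)

    # Only the nominated node may act, or both nodes try to fail over.
    if [ "$local_node" != "$NOMINATED_NODE" ]; then
        echo "skip: $local_node is not the nominated node"
        return 0
    fi

    # Switch still answers: let network_down carry on as normal.
    if $PING_CMD "$PEER_SWITCH_ADDR" >/dev/null 2>&1; then
        echo "ok: switch address $PEER_SWITCH_ADDR responds"
        return 0
    fi

    echo "fail: no reply from $PEER_SWITCH_ADDR, stopping node with takeover"
    $STOP_CMD
}
```

The script registered as the pre-event for network_down would simply end with a call to check_switch_and_fail. Keeping the ping and stop commands in variables makes it possible to dry-run the logic with harmless stand-ins before trusting it with a live cluster.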

The only (small) problem you have here is that the physical card, on the
nominated node from (2), could fail. In this case, as all of the traffic
will be internal to the node, there is no issue as far as live running is
concerned. The only issue is when you need to arrange for the card to be
replaced / repaired. What you need to do here is to edit your script, so
that the node doesn't fail (comment out the 'clstop'), fail all resources
over to the other node and Bob's your Mother's Brother.

Don't forget to change the script back afterwards.
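
One possible alternative to editing the script and remembering to change it back (this is not what was done above, just a sketch of a different approach): guard the stop with a flag file, so maintenance only means creating and later removing a file. The flag path is a made-up example.

```shell
# Hypothetical flag file; while it exists the node will not be stopped.
MAINT_FLAG=${MAINT_FLAG:-/tmp/switch_failover.disabled}

# Succeeds (exit status 0) only when the maintenance flag is absent.
failover_enabled() {
    [ ! -f "$MAINT_FLAG" ]
}
```

The custom event would then run its stop command only if failover_enabled succeeds; 'touch' the flag before the card is replaced and 'rm' it afterwards.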

Boomshanka,

John


> Can anybody help?

> I'm trying to promote a switch failure (SP Switch, not HPS) to a total node
> failure, in HACMP 4.4 (AIX 4.3.3, PSSP 3.2 SP high node), which in itself
> isn't a problem. However, when I do an 'errpt -t' to try and decide what
> failures I am going to trap, using RAS, loads come out, and I don't want to
> promote them all to node failure, as some are not as serious as others -
> obviously. Ideally, I would like to configure them in groups, such as;

>     "Worm_dead" errors - try to re-start Worm
>     "CSS0 PERM H/W" errors - Fail node
>     etc, etc

> If I was to configure them all, one-by-one, it'd take ages!

> Surely there's an easier way to trap all? I don't want to configure IP
> aliasing, so the switch fails onto the ethernet, for various reasons, so
> please don't suggest this.

> The answer I got, from CallAIX was a loud "RTFM", but all of the "HACMP"
> redbooks simply show you how to configure RAS, and I already know that.
> Surely somebody, somewhere, has done this before.

> Basically, can I trap certain classes of error, using HA?

> Is it possible to determine 'probable causes' and 'recommended actions',
> from errpt templates?

> Much appreciated,

> John

> PS. Please reply by e-mail, as well as placing a copy on the newsgroup for
> other interested parties

 
 
 
