HACMP Fails Wrong Node

Post by RR » Tue, 17 Oct 2000 04:00:00



I have two 44P-270's in a rotating HA 4.3.1 classic cluster, with AIX
4.3.3.0.05 and Firewall-1 4.0.7.

I monitor the status of three NICs and fail to the other node when one
fails.  This works great for one adapter.  I can pull the NIC's cable
from the hub and have the node fail fine every time.

But with the other two adapters, when I pull either cable, the same
node always goes down, regardless of whether or not it currently holds
the service addresses.

The systems are running Firewall-1 and maintaining on the order of
60,000 active TCP connections, spread roughly evenly across the three
adapters I monitor, and they handle this extreme load just fine.

I cannot figure out why one adapter works perfectly while the other two
always fail the same node.  Any help is much appreciated.

Thanks,
RR

 
 
 

HACMP Fails Wrong Node

Post by Gus Schlachte » Wed, 18 Oct 2000 04:00:00



> I monitor the status of three NICs and fail to the other node when one
> fails.  This works great for one adapter.  I can pull the NIC's cable
> from the hub and have the node fail fine every time.

> But with the other two adapters, when I pull either cable, the same
> node always goes down, regardless of whether or not it currently holds
> the service addresses.

HACMP does not automatically fallover to another node due to an adapter
failure.  What method did you use to customize this behavior?

Of course, I think the more important question is: Why not let HACMP move the IP
address of the failed adapter to a standby adapter, rather than having the whole
machine fallover?

Gus


 
 
 

HACMP Fails Wrong Node

Post by RR » Wed, 18 Oct 2000 04:00:00





> > I monitor the status of three NICs and fail to the other node when one
> > fails.  This works great for one adapter.  I can pull the NIC's cable
> > from the hub and have the node fail fine every time.

> > But with the other two adapters, when I pull either cable, the same
> > node always goes down, regardless of whether or not it currently holds
> > the service addresses.

> HACMP does not automatically fallover to another node due to an adapter
> failure.  What method did you use to customize this behavior?

I attached a post-event to the "network_down" event, where I shut down
the node whose name HA gives me.  I took this hint from the Redbook on
running Firewall-1 on AIX.  The script simply compares the hostname of
the system to the node name passed to it by HA.  If they match, the
node comes down immediately; if not, it continues on as usual.  But
even so, only one adapter fails the correct node; the other two
adapters always fail the same node regardless.
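
Roughly, the post-event looks like the sketch below.  This is a
reconstruction from memory rather than the script itself, so the
positional parameter holding the node name and the command used to
take the node down are approximations for my setup:

    #!/bin/ksh
    # network_down post-event (approximate reconstruction of the script
    # described above; the parameter position and the takedown command
    # are specific to my configuration and may need adjusting elsewhere).

    FAILED_NODE=$3                 # node name HACMP passes to the post-event
    LOCAL_HOST=$(hostname)         # this system's hostname

    if [ "$LOCAL_HOST" = "$FAILED_NODE" ]; then
        # The failed network belongs to this node: take it down so the
        # service addresses rotate to the surviving node.
        /usr/sbin/shutdown -F &
    else
        # The event was reported for the other node: carry on as usual.
        :
    fi

    exit 0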

> Of course, I think the more important question is: Why not let HACMP move the IP
> address of the failed adapter to a standby adapter, rather than having the whole
> machine fallover?

There are not enough slots in the machine for six Ethernet adapters in
total.  (I use the built-in adapter for the Firewall-1 management
network, and it is not under control of HA.)

I also realized something else just now, as I was typing.  The one
adapter that does fail properly is plugged into a 64-bit slot (though
it is still the same model card as the other two).  The other two
adapters are both in 32-bit slots.  I am not sure whether this has any
bearing here, but it is certainly a curious fact...

Thanks much,
RR

 
 
 

HACMP Fails Wrong Node

Post by Victor Le » Thu, 19 Oct 2000 04:00:00


Hi,
Just a hint: have you added the entries to /usr/sbin/cluster/netmon.cf?
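
The file is just a list of extra ping targets, one hostname or IP
address per line, that the network monitor uses to decide whether a
network is really down when there is no other traffic.  The addresses
below are placeholders only; pick stable hosts outside the cluster,
reachable through each of the networks you monitor:

    192.168.10.254
    192.168.20.254
    192.168.30.254
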
regards,
Victor
 
 
 

HACMP Fails Wrong Node

Post by Gus Schlachte » Thu, 19 Oct 2000 04:00:00



> I attached a post-event to the "network_down" event, where I shut down
> the node whose name HA gives me.  I took this hint from the Redbook on
> running Firewall-1 on AIX.  The script simply compares the hostname of
> the system to the node name passed to it by HA.  If they match, the
> node comes down immediately; if not, it continues on as usual.  But
> even so, only one adapter fails the correct node; the other two
> adapters always fail the same node regardless.

It would help tremendously if we could see the script.  Most likely there is an
error in its logic, but you can blame that on the Redbook if you like.  :-)

Gus

 
 
 

HACMP Fails Wrong Node

Post by RR » Fri, 20 Oct 2000 04:00:00





> > I attached a post-event to the "network_down" event, where I shut down
> > the node whose name HA gives me.  I took this hint from the Redbook on
> > running Firewall-1 on AIX.  The script simply compares the hostname of
> > the system to the node name passed to it by HA.  If they match, the
> > node comes down immediately; if not, it continues on as usual.  But
> > even so, only one adapter fails the correct node; the other two
> > adapters always fail the same node regardless.

> It would help tremendously if we could see the script.  Most likely there is an
> error in its logic, but you can blame that on the Redbook if you like.  :-)

The script is nothing more than a simple comparison of the system
hostname to parameter $3.  If they match, the node comes down; if not,
the script just exits.  There is no logic tied to which Ethernet
adapter is failing, so this would not explain why one adapter works
but the other two don't.

I believe the problem has to do with my /usr/sbin/cluster/netmon.cf
file, as another poster suggested.  I will be testing this at my next
maintenance window.

Thank you for the help!

RR

 
 
 

CLLOCKD, hacmp, and master/remaster node to node

An attempt to determine where my resources and locks are "mastered" in a
concurrent HA cluster has turned up information that is almost useless and
certainly unreliable.

The HACMP 4.2.1 documentation (the full set) falls terribly short in the
area of lock management within the cluster.  Have any of you found a better
source of information on the lock manager under AIX?  I'm particularly
interested in documentation that describes the output of the cldiag =>
debug => cllockd code.  Where are my lock resources allocated and mastered?
Also, I'd like to be able to write a tool that would implement re-mastering
of specific sets of locks.

Simply parsing through the cld_debug.out file to get lock info is not the
best method.  There must be a better, more dynamic way to get at the lock
structures in the kernel.

Please cc any responses to my email address.

Thanks in advance,

-Kevin Brand

--
remove the x for email response
