watchdog failure at same time every week

Post by Mike » Wed, 31 Oct 2001 01:38:17




The watchdog fails to communicate and shuts down one node and then the
whole database. The machines then reboot. This happens without fail.
It happened on Red Hat 7.1 and it happens on SuSE 7.2. Our SAs are not
aware of any crontab jobs running at that time. Any help would be
appreciated.

NODE 2
wdd.log
UTC:  Wed Oct 24 19:00:07 GMT 2001 (746103)
wddProcRegisterPacket: info: registered client
    name = /tmp/.watchdog/cl_sock_788_15373,
    pid = 788,
    tid = 15373,
    margin = 5000,
    level = 1,
    option = 0,
    description = ClientProcListen.
Time: Wed Oct 24 15:00:07 EDT 2001 (746170)
UTC:  Wed Oct 24 19:00:07 GMT 2001 (746170)
wddSendRegisterReply: info: sent register ack to client.
Time: Sun Oct 28 01:01:44 EDT 2001 (207767)
UTC:  Sun Oct 28 05:01:44 GMT 2001 (207767)
wddScanClients: fatal: client (name=cl_sock_714_5125) ping came too late
    (expiry=1004245303,764, now=1004245304,208).
wddPerformWatch: fatal: at least one client is late in checking in.
Time: Sun Oct 28 01:01:44 EDT 2001 (208071)
UTC:  Sun Oct 28 05:01:44 GMT 2001 (208071)
Shutting down the entire node...
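
The two values in the wddScanClients message above can be compared directly to see how late the ping was. A minimal sketch in Python, assuming each value is "seconds,milliseconds" since the Unix epoch. The log does not document this format; the assumption is plausible because the seconds field converts to the UTC time printed just before the message. The function names are illustrative and are not part of the watchdog software.

```python
import re
from datetime import datetime, timezone

# Assumption: "expiry" and "now" are "<epoch seconds>,<milliseconds>".
PING_RE = re.compile(r"expiry=(\d+),(\d+), now=(\d+),(\d+)")


def _parse(message):
    """Extract the four integers from a 'ping came too late' message."""
    match = PING_RE.search(message)
    if match is None:
        raise ValueError("no expiry/now pair found")
    return tuple(int(group) for group in match.groups())


def ping_lateness_ms(message):
    """Return how many milliseconds after its deadline the ping arrived."""
    exp_s, exp_ms, now_s, now_ms = _parse(message)
    return (now_s * 1000 + now_ms) - (exp_s * 1000 + exp_ms)


def deadline_utc(message):
    """Return the deadline (whole seconds) as a timezone-aware UTC datetime."""
    exp_s, _, _, _ = _parse(message)
    return datetime.fromtimestamp(exp_s, tz=timezone.utc)


line = "(expiry=1004245303,764, now=1004245304,208)."
print(ping_lateness_ms(line))            # 444
print(deadline_utc(line).isoformat())    # 2001-10-28T05:01:43+00:00
```

Under that reading the client missed its deadline by 444 ms, which points to a short stall rather than a process that stopped pinging altogether.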

nm.log

 | WARNING | ClusterListener (pid=708, tid=1026): WatchdogPing failed
(rc=12).
Sun Oct 28 01:02:36 2001
 | WARNING | ClusterListener (pid=708, tid=1026): WatchdogPing failed
(rc=12).
Sun Oct 28 01:02:37 2001
 | WARNING | ClusterListener (pid=708, tid=1026): WatchdogPing failed
(rc=12).
Sun Oct 28 01:02:37 2001
 | WARNING | ClusterListener (pid=708, tid=1026): WatchdogPing failed
(rc=12).
Sun Oct 28 01:02:38 2001
 | WARNING | ClusterListener (pid=708, tid=1026): WatchdogPing failed
(rc=12).
Sun Oct 28 01:02:38 2001
 | WARNING | ClusterListener (pid=708, tid=1026): WatchdogPing failed
(rc=12).
Sun Oct 28 01:02:39 2001
 | WARNING | ClusterListener (pid=708, tid=1026): WatchdogPing failed
(rc=12).

cm.log

 | WARNING | 340b | ClientProcListener (pid=784, tid=13323):
WatchdogPing failed (rc=12).
Sun Oct 28 01:02:39 2001
 | WARNING | 2006 | ClientProcListener (pid=779, tid=8198):
WatchdogPing failed (rc=12).
Sun Oct 28 01:02:39 2001
 | WARNING | 2c09 | ClientProcListener (pid=782, tid=11273):
WatchdogPing failed (rc=12).
Sun Oct 28 01:02:39 2001

--
Sent  by dbadba62 from hotmail within  field com

 
 
 

Post by Pat Welc » Wed, 31 Oct 2001 09:20:16




Could there be a backup *still* running or being launched at that time?
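
One way to check for a job launched around the failure time is to scan crontab entries for schedules that match it. A rough sketch in Python, assuming standard five-field crontab lines (minute, hour, day of month, month, day of week) with numeric fields only. The names `field_matches` and `jobs_near` and the sample entries are made up for illustration, and the day-of-month and month fields are ignored, so the result may over-report.

```python
def field_matches(field, value, low, high):
    """True if a numeric cron field such as '*', '1', '0-5', '*/15' or '1,3' matches value."""
    for part in field.split(","):
        rng, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            start, end = (int(x) for x in rng.split("-"))
        else:
            start = end = int(rng)
        if start <= value <= end and (value - start) % step == 0:
            return True
    return False


def jobs_near(crontab_text, weekday, hour):
    """Return crontab lines that can fire on weekday (0 = Sunday) during the given hour."""
    hits = []
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "@")):
            continue  # comments and @reboot-style shortcuts are skipped
        fields = line.split(None, 5)
        if len(fields) < 6 or "=" in fields[0]:
            continue  # environment assignments and malformed lines
        _minute, hr, _dom, _month, dow, _command = fields
        # cron accepts both 0 and 7 for Sunday
        dow_ok = field_matches(dow, weekday, 0, 7) or (
            weekday == 0 and field_matches(dow, 7, 0, 7)
        )
        if dow_ok and field_matches(hr, hour, 0, 23):
            hits.append(line)
    return hits


# Hypothetical entries, not taken from the poster's machines.
sample = """\
MAILTO=root
0 1 * * 0 /usr/local/bin/weekly_backup.sh
30 2 * * * /usr/local/bin/rotate_logs.sh
*/10 * * * * /usr/local/bin/check_disk.sh
"""

for entry in jobs_near(sample, weekday=0, hour=1):
    print(entry)
```

The failure in the log is on a Sunday shortly after 01:00 local time, hence `weekday=0, hour=1`. Since a long job may still be running at that point, it is worth repeating the check for the preceding hours and for every user's crontab as well as the system-wide one.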

--
----------------------------------------------------
Pat Welch, UBB Computer Services, a WCS Affiliate
           Caldera Authorized Partner  
           Unix/Linux/Windows/Hardware Sales/Support
           (209) 745-1401 Fax: (413) 714-2833
           Nationwide pager: (800) 608-7122

Hunt down and KILL anyone involved in NY/DC attacks!
----------------------------------------------------

 
 
 

1. NMI watchdog generating NMIs every ~2 minutes??

I have a machine with a UP Pentium III-M.  I am trying to set up the
nmi_watchdog using the local APIC support (nmi_watchdog=2).  While the
kernel is booting, I see the NMI watchdog being successfully tested,
as indicated by the excerpt from dmesg below:

testing NMI watchdog ... OK.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 1196.6183 MHz.
..... host bus clock speed is 132.9574 MHz.
cpu: 0, clocks: 1329574, slice: 664787
CPU0<T0:1329568,T1:664768,D:13,S:664787,C:1329574>

After the system boots, I see ~25 NMI interrupts in /proc/interrupts,
and then the NMI interrupt count increments about once every 1.5 to 2
minutes.  The Intel spec for the processor seems to indicate clearly
that this feature is supported (as further evidenced by the fact that
the boot-time test reports "OK").  I have examined the code in nmi.c
and decreased the PERFCTR0 value by a factor of 100 to see if more
frequent overflows of this counter would increase the frequency of the
NMIs.  This did not work, hopefully for some reason that is obvious to
someone more knowledgeable than I.  I am trying to debug a nasty
kernel hang and would very much like to take advantage of the built-in
NMI oopser capability.  Can anyone give me any insight into what I
might try to get this working?

Thanks in advance,

Dan Eaton

cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 11
model name      : Mobile Intel(R) Pentium(R) III CPU - M  1200MHz
stepping        : 4
cpu MHz         : 1196.592
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 2385.51
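
The NMI rate described in the post can be measured by comparing two snapshots of /proc/interrupts. A small sketch in Python, assuming the file has a line starting with "NMI:" followed by one counter per CPU (newer kernels append a text description, which is skipped). The sample snapshots are invented for illustration and the function names are not from any existing tool.

```python
def nmi_count(interrupts_text):
    """Sum the per-CPU counters on the 'NMI:' line of /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        label, _, rest = line.strip().partition(":")
        if label == "NMI":
            total = 0
            for token in rest.split():
                if not token.isdigit():
                    break  # trailing description text on newer kernels
                total += int(token)
            return total
    raise ValueError("no NMI line found")


def nmi_rate_per_minute(before_text, after_text, elapsed_seconds):
    """NMIs per minute between two snapshots taken elapsed_seconds apart."""
    delta = nmi_count(after_text) - nmi_count(before_text)
    return delta * 60.0 / elapsed_seconds


# Invented snapshots, taken 240 seconds apart on a single-CPU machine.
before = "  0:     123456          XT-PIC  timer\nNMI:         25\nLOC:     123400\n"
after = "  0:     147456          XT-PIC  timer\nNMI:         27\nLOC:     147400\n"

print(nmi_rate_per_minute(before, after, 240))  # 0.5
```

On a live system the two snapshots would come from reading /proc/interrupts twice with a known delay in between. Running the measurement once while the machine is idle and once under load would show whether the rate depends on CPU activity.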
