Software watchdog for monitoring?

Post by Tramm Huds » Fri, 12 Mar 1999 04:00:00



Kernel minded folks,

We are building yet another massive Linux cluster here at Sandia
Labs, currently with 512 DEC Alphas, and plan to add another 1024
sometime soon.  The Computational Plant project is similar to the
Beowulf systems, although we are writing our own system support
software.  More details may be found at:

        http://www.cs.sandia.gov/cplant/

Part of our goal is an automated fault detection system.  I have
written some software to monitor the network and some of the hardware,
and would like to be able to know within a few minutes or seconds if a
kernel crashes on one of our many compute nodes.

My first attempt has been to add a static counter to ./kernel/softirq.c
in the run_bottom_halves() routine.  I selected run_bottom_halves instead
of do_bottom_half, since run_bottom_halves is executed only once per
softirq handling pass rather than once per processor per softirq.

  static inline void run_bottom_halves(void)
  {
  ...
        {
                static unsigned long count = 0;
                /* emit one heartbeat line every 0x10000 calls */
                if( (count++ & 0xFFFF) == 0 )
                        printk("#####: watchdog output 0x%lx$\n", count>>16 );
        }
  ...
  }
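To make the arithmetic of that test easier to check outside the kernel, here is a small user-space model of the same counter.  It is only a sketch: watchdog_tick and struct watchdog_state are names made up for this illustration and are not part of the kernel or of the patch above.

```c
/* Hypothetical user-space model of the counter in the patch above. */
struct watchdog_state {
        unsigned long count;    /* number of calls seen so far */
};

/*
 * Returns 1 and stores the heartbeat number in *beat on every
 * 0x10000th call, starting with the very first one; returns 0
 * on all other calls.
 */
static int watchdog_tick(struct watchdog_state *ws, unsigned long *beat)
{
        if ((ws->count++ & 0xFFFF) == 0) {
                /* count has already been incremented at this point */
                *beat = ws->count >> 16;
                return 1;
        }
        return 0;
}
```

In this model the first call reports heartbeat 0, the next 0xFFFF calls stay silent, and the call after that reports heartbeat 1, which matches what the printk in the patch would print.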

run_bottom_halves is executed roughly 0x10000 (65536) times per minute
on our 500 MHz DEC Alpha EV56s running 2.2.3 when the system is idle,
so an idle node prints about one watchdog line per minute.  That is
frequent enough for my monitoring system to keep tabs on the kernel,
yet not so frequent as to bog down the kernel in I/O over the serial
line.
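For the monitoring side, the check could look roughly like the sketch below: pick the watchdog line out of the serial console output and flag a node when no heartbeat has arrived for some timeout.  This is an illustration under assumptions, not our actual monitor.  struct node_watch, parse_watchdog_line, node_feed and node_is_stale are hypothetical names, and the timeout is left for the caller to choose.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

#define WATCHDOG_PREFIX "#####: watchdog output 0x"

/* Hypothetical per-node bookkeeping kept by the monitor. */
struct node_watch {
        unsigned long last_beat;   /* last heartbeat number seen */
        time_t        last_seen;   /* when that heartbeat arrived */
};

/*
 * Parse one console line.  Returns 1 and stores the heartbeat number
 * in *beat if the line holds a complete watchdog message (prefix,
 * hex number, trailing '$'); returns 0 otherwise.
 */
static int parse_watchdog_line(const char *line, unsigned long *beat)
{
        const char *p = strstr(line, WATCHDOG_PREFIX);
        unsigned long value;
        char terminator;

        if (p == NULL)
                return 0;
        p += strlen(WATCHDOG_PREFIX);
        if (sscanf(p, "%lx%c", &value, &terminator) != 2 || terminator != '$')
                return 0;
        *beat = value;
        return 1;
}

/* Record a heartbeat if the line carries one; ignore other output. */
static void node_feed(struct node_watch *nw, const char *line, time_t now)
{
        unsigned long beat;

        if (parse_watchdog_line(line, &beat)) {
                nw->last_beat = beat;
                nw->last_seen = now;
        }
}

/* A node is considered stale once timeout_sec pass with no heartbeat. */
static int node_is_stale(const struct node_watch *nw, time_t now,
                         double timeout_sec)
{
        return difftime(now, nw->last_seen) > timeout_sec;
}
```

The trailing '$' is treated as an end-of-message marker, so a line cut off in the middle of the number is rejected rather than read as a smaller heartbeat value.  Since an idle node prints about one line per minute, the timeout would need to be a few minutes at least; the exact value is a guess to be tuned.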

Is this the best place to install this extension?  Is there a lower
overhead way to handle this sort of event?  Am I totally off base here?

Thanks for any input,
Tramm
--

 http://www.swcp.com/~hudson/          H 505.266.59.96