How to write a program that monitors a process

Post by Kurtis D. Rade » Sat, 27 Apr 2002 12:50:25




> I'm trying to write a unix program that knows when another process goes
> down,(normally or abnormally). I know the pid of the process that I want to
> monitor.
> I tried using named-pipes - creating the pipe in the monitored process using
> mkfifo, and then monitoring it using poll, but sometimes the poll returns
> unexpectedly with POLLHUP even if process is still active and the pipe
> hasn't been broken.

How did you determine the pipe has not been broken? If poll() is returning
POLLHUP then the kernel believes the last file descriptor open for writing on
the fifo has been closed. If that is not the case your kernel is seriously
broken and I wouldn't trust it. You should install the lsof(1) utility and
run it against the daemon (e.g., "lsof -p $pid"). It will tell you whether or
not your daemon still has the fifo open for writing.
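For reference, a minimal sketch of the monitoring side of the fifo approach. It assumes the daemon created the fifo with mkfifo() and keeps it open for writing; the function name wait_for_fifo_hangup and the path argument are made up for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: blocks until every writer on the fifo has closed it.
 * Returns 0 on hangup (last writer gone), -1 on error. */
int wait_for_fifo_hangup(const char *fifo_path)
{
    /* A blocking read-only open does not return until some process
     * has the fifo open for writing. */
    int fd = open(fifo_path, O_RDONLY);
    if (fd == -1)
        return -1;

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    for (;;) {
        int n = poll(&pfd, 1, -1);
        if (n == -1) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            close(fd);
            return -1;
        }
        if (pfd.revents & POLLIN) {
            /* Drain whatever was written; read() == 0 means end of file,
             * i.e. no writers are left. */
            char buf[256];
            ssize_t r = read(fd, buf, sizeof buf);
            if (r == 0) {
                close(fd);
                return 0;
            }
            if (r == -1 && errno != EINTR) {
                close(fd);
                return -1;
            }
            continue;
        }
        if (pfd.revents & POLLHUP) {    /* last writer closed the fifo */
            close(fd);
            return 0;
        }
        if (pfd.revents & (POLLERR | POLLNVAL)) {
            close(fd);
            return -1;
        }
    }
}
```

Note that the hangup fires when the last writer closes, so any other process that opens and closes the fifo for writing affects what the monitor sees.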

Do not use the pid for determining whether or not the daemon is running.
Between the time your monitoring process obtains the pid and uses it the
daemon could die and the pid be reassigned to another process. So solutions
such as "kill(pid,0);" are inherently unreliable.

Using what I call a "positive interlock" such as a fifo is far more reliable.
If you don't care whether or not the daemon is actually responding to
requests (i.e., you only want to know if the process still exists) I prefer
to use a file lock. The daemon opens a well-known file name and uses fcntl(),
lockf(), or flock() to obtain an exclusive lock on the file. The monitoring
process then attempts to lock the file in blocking or non-blocking mode as
dictated by the other requirements for the monitoring process. If the daemon
exits (normally or abnormally) the kernel will automatically release its lock
on the interlock file, allowing your monitoring process to obtain the
lock, which in turn tells it the daemon is no longer running.

But if you have control of the daemon source code, and hence its behavior,
the best solution is to implement a health-check protocol. The monitoring
process then sends periodic "are you okay" messages to the daemon and waits
for an "I'm okay" reply with a timeout. If the message can't be sent (e.g.,
the socket is no longer connected) or a reply isn't received within a
reasonable period, the monitoring process knows the daemon is either gone or
no longer processing requests.
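A minimal sketch of such a health check over an already-connected socket (a Unix domain socket, say). The "PING"/"PONG" strings and both function names are invented for this example, and it assumes each message arrives in a single recv(), which is reasonable for four bytes but not guaranteed on a stream socket.

```c
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Monitor side (hypothetical name): returns 1 if the daemon answered
 * within timeout_ms, 0 if it did not or the connection is dead. */
int daemon_is_healthy(int sock, int timeout_ms)
{
    char reply[16];

    /* MSG_NOSIGNAL: report a dead peer as an error instead of raising SIGPIPE. */
    if (send(sock, "PING", 4, MSG_NOSIGNAL) != 4)
        return 0;

    struct pollfd pfd = { .fd = sock, .events = POLLIN };
    int n;
    do {
        /* Note: the timeout restarts if a signal interrupts poll(). */
        n = poll(&pfd, 1, timeout_ms);
    } while (n == -1 && errno == EINTR);
    if (n <= 0)
        return 0;                       /* timed out, or poll() failed */

    ssize_t r = recv(sock, reply, sizeof reply, 0);
    return r == 4 && memcmp(reply, "PONG", 4) == 0;
}

/* Daemon side (hypothetical name): call from the daemon's main loop
 * whenever the socket is readable. */
void answer_health_check(int sock)
{
    char buf[16];
    ssize_t r = recv(sock, buf, sizeof buf, 0);
    if (r == 4 && memcmp(buf, "PING", 4) == 0)
        send(sock, "PONG", 4, MSG_NOSIGNAL);
}
```

Because the reply comes from the daemon's own main loop, a daemon that is alive but hung fails the check too, which the file lock cannot detect.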

 
 
 

Post by GoogleF » Sat, 27 Apr 2002 16:38:13




> > Thanks, I understand the rationale now... however, for me at least, this
> > is one of the most uncommon programming errors.  I can't recall the last
> > time I made this particular error...

> This error may be rare, but when you do make it, you'll
> probably spend 2 days trying to find it, because your
> mind will refuse to read the code "as written" instead of
> "as intended".

Agreed.

> > IMHO, the readability of the 'traditional' style *far*
> > outweighs any protection one might get from
> > using the other method.  

> It's a matter of opinion...

Agree that the traditional way is much easier to read and comprehend.

> Once you know what it is for, it is not at all difficult
> to read the expression "backwards" ...

Yes - but if it's difficult to read, it's a 'nuisance' to maintain. May
as well then use macros like:

#define EQ(a, b) ((a) == (b))

which would result in similar levels of horror.

> *Far* outweighs even when you lose 2 days tracking it down
> under deadline pressure? Maybe, maybe not...

Agreed. I cannot remember the last time I did this but it takes *y
ages to figure out why the impossible happens.

My solution is to occasionally take the line: solving this bug
requires logic, yet logic isn't solving the bug. So the bug is
illogical. Stop looking where you think it is and try a different
tactic:

   grep "if.* = *" *.c | grep -v ==

This is a good approximation for finding * assignments. (Of course, the
above is not guaranteed to be 100% accurate, but it helps take your
mind off the current problem for a while.)
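For anyone reading this part of the thread cold, here is a small sketch of the two comparison styles being argued about. The function names are made up, and the buggy form is left in a comment so the file compiles cleanly.

```c
#include <stdbool.h>

/* The mistake under discussion, shown only as a comment:
 *
 *     if (a = 5) { ... }
 *
 * This assigns 5 to a, and the condition is then always true.
 */

/* 'Traditional' style: variable on the left. Mistyping == as = here
 * still compiles, which is how the bug slips in. */
bool is_five(int a)
{
    return a == 5;
}

/* Constant-first style: mistyping == as = would give `5 = a`,
 * which the compiler rejects because 5 is not assignable. */
bool is_five_const_first(int a)
{
    return 5 == a;
}
```

Both functions behave identically; the only difference is which typo the compiler is able to catch.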

 
 
 

Post by zero » Sat, 27 Apr 2002 18:27:42




>>I'm trying to write a unix program that knows when another process goes
>>down,(normally or abnormally). I know the pid of the process that I want to
>>monitor.
>>I tried using named-pipes - creating the pipe in the monitored process using
>>mkfifo, and then monitoring it using poll, but sometimes the poll returns
>>unexpectedly with POLLHUP even if process is still active and the pipe
>>hasn't been broken.

> How did you determine the pipe has not been broken? If poll() is returning
> POLLHUP then the kernel believes the last file descriptor open for writing on
> the fifo has been closed. If that is not the case your kernel is seriously
> broken and I wouldn't trust it. You should install the lsof(1) utility and
> run it against the daemon (e.g., "lsof -p $pid"). It will tell you whether or
> not your daemon still has the fifo open for writing.

> Do not use the pid for determining whether or not the daemon is running.
> Between the time your monitoring process obtains the pid and uses it the
> daemon could die and the pid be reassigned to another process. So solutions
> such as "kill(pid,0);" are inherently unreliable.

> Using what I call a "positive interlock" such as a fifo is far more reliable.
> If you don't care whether or not the daemon is actually responding to
> requests (i.e., you only want to know if the process still exists) I prefer
> to use a file lock. The daemon opens a well-known file name and uses fcntl(),
> lockf(), or flock() to obtain an exclusive lock on the file. The monitoring
> process then attempts to lock the file in blocking or non-blocking mode as
> dictated by the other requirements for the monitoring process. If the daemon
> exits (normally or abnormally) the kernel will automatically release its lock
> on the interlock file, allowing your monitoring process to obtain the
> lock, which in turn tells it the daemon is no longer running.

> But if you have control of the daemon source code, and hence its behavior,
> the best solution is to implement a health-check protocol. The monitoring
> process then sends periodic "are you okay" messages to the daemon and waits
> for an "I'm okay" reply with a timeout. If the message can't be sent (e.g.,
> the socket is no longer connected) or a reply isn't received within a
> reasonable period, the monitoring process knows the daemon is either gone or
> no longer processing requests.

Couldn't the monitor start the program itself and catch SIGCHLD?
 
 
 

Post by Andy Isaacs » Sun, 28 Apr 2002 03:29:05





>> Thanks, I understand the rationale now... however, for me at least, this
>> is one of the most uncommon programming errors.  I can't recall the last
>> time I made this particular error...

>This error may be rare, but when you do make it, you'll
>probably spend 2 days trying to find it, because your
>mind will refuse to read the code "as written" instead of
>"as intended".

Or else you'll just add -Wall to your compile flags and get a nice
warning from the compiler:

foo.c:3: warning: suggest parentheses around assignment used as truth value

cc: Info: foo.c, line 3: In this statement, the assignment expression "a=5"
is used as the controlling expression of an if, while or for statement.
(controlassign)

>Once you know what it is for, it is not at all difficult
>to read the expression "backwards" ...

>*Far* outweighs even when you lose 2 days tracking it down
>under deadline pressure? Maybe, maybe not...

If you spend 2 days tracking down a bug in code that you're not compiling
with -Wall, you are a greater fool than ...

-andy

 
 
 

Post by Nils O. Sel?sd » Sun, 28 Apr 2002 03:48:44


In article <36186e0e.0204242345.7931b...@posting.google.com>, Einat Ariel wrote:
> Hi,
> I'm trying to write a unix program that knows when another process
> goes down,(normally or abnormally). I know the pid of the process that
> I want to monitor.
> I tried using named-pipes - creating the pipe in the monitored process
> using mkfifo, and then monitoring it using poll, but sometimes the
> poll returns unexpectedly with POLLHUP even if process is still active
> and the pipe hasn't been broken.
> If you know any other solutions for this problem, please answer.

> Thanks in advance,
> Einat Ariel

I did this once; it has been monitoring processes for me on a Solaris box
for 7 months now.
(You need to write code for daemon(..) if the standard libraries don't have
it.)

#include <stdio.h>
#include <syslog.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <signal.h>
#include <errno.h>
static int interval;
static int sleepinterval;
static char *cfgfilename;
static char *outfile;
static char *pscmd = "ps --no-headers -e -o comm";
struct processentry {
        char *name;
        int max;
        int min;
        int nrfound;
        struct processentry *next;

};

static volatile sig_atomic_t rereadconfig = 0; //volatile: written from the signal handler
static void clear();
static void freelist();
static void check();
static struct processentry *head;
static void doexit();
static void debuglist();
void confighandler(int signo);
static void parseconfig();
static void addproc(char *cmd)
{
        struct processentry *tmp = head;
        while (tmp) {
                if (!strcmp(cmd, tmp->name)) {
                        tmp->nrfound++;
                }
                tmp = tmp->next;
        }
}

static void mainloop()
{
        char cmd[128];
        FILE *pspipe = NULL;
        while (1) {
                //sleep() does not set errno; a nonzero return is the time left after a signal
                unsigned int ret = sleep(sleepinterval);
                if (ret > 0) {  //interrupted - dont care...
                        sleepinterval = ret;    //wait the rest of the interval
                        continue;
                }
                sleepinterval = interval;       //reset interval in case interrupted
                if (rereadconfig) {     //flag set by signal handler..
                        syslog(LOG_DEBUG, "Rereading configfile: %s",
                               cfgfilename);
                        freelist();     //the parseconfig better DAMN not fail.. else we are lost..
                        parseconfig();
                        rereadconfig = (sig_atomic_t) 0;
                }
                pspipe = popen(pscmd, "r");
                if (pspipe == NULL) {
                        syslog(LOG_ERR, "Unable to execute %s command: %s",
                               pscmd, strerror(errno));
                        errno = 0;
                        continue;
                }
                while (fscanf(pspipe, "%127s", cmd) != EOF) {   //127 chars + NUL fit cmd[128]
                        addproc(cmd);
                }
                check();
                clear();
                if (pclose(pspipe) != 0) {
                        syslog(LOG_ERR, "Executing %s failed.", pscmd);
                }

        }

}

static void clear()
{

        struct processentry *tmp = head;
        while (tmp) {
                tmp->nrfound = 0;
                tmp = tmp->next;
        }

}

static void check()
{
        struct processentry *tmp = head;
        while (tmp) {
                int haserror = 0;
                if (tmp->min < 0) {
                        if (tmp->max < 0) {
                                if (tmp->nrfound <= 0) {
                                        haserror = 1;
                                }
                        } else if (tmp->nrfound > tmp->max) {
                                haserror = 1;
                        }

                } else if (tmp->max < 0) {
                        if (tmp->nrfound < tmp->min) {
                                haserror = 1;
                        }
                } else if (tmp->min > tmp->nrfound
                           || tmp->max < tmp->nrfound) {
                        haserror = 1;

                }
                if (haserror) {
                        syslog(LOG_WARNING,
                               "Found process %s %d times ,limit is [%4d-%4d]",
                               tmp->name, tmp->nrfound, tmp->min,
                               tmp->max);

                }
                tmp = tmp->next;
        }

}

int main(int argc, char *argv[])
{
        int ch;
        while ((ch = getopt(argc, argv, "i:c:o:")) != -1) {
                switch ((char) ch) {
                case 'c':
                        cfgfilename = strdup(optarg);
                        break;
                case 'o':
                        outfile = strdup(optarg);
                        break;
                case 'i':
                        interval = atoi(optarg);
                        if (interval <= 0) {
                                fprintf(stderr,
                                        "Interval cannot be %d, must be above zero.\n",
                                        interval);
                                return 1;
                        }
                        break;

                default:

                        return 1;
                }
        }
        if (interval == 0) {

                fprintf(stderr,
                        "Interval cannot be %d, must be above zero.\n",
                        interval);
                return 1;
        }
        sleepinterval = interval;
        if (cfgfilename == NULL) {

                fprintf(stderr, "Configfile not supplied\n");
                return 1;
        }
        openlog("chkproc", LOG_PID, LOG_DAEMON);
        parseconfig();
        debuglist();
        if (daemon(0, 0) == -1) {
                perror("Could not daemonize");
                return 5;
        }
        atexit(doexit);
        signal(SIGHUP, confighandler);
        syslog(LOG_NOTICE,
               "chkproc starting, checking processes every %d seconds.",
               interval);
        mainloop();
        return 0;

}

static void freelist()
{
        struct processentry *tmp = head;
        while (tmp) {
                struct processentry *helper = tmp;
                tmp = tmp->next;
                free(helper->name);
                free(helper);

        }
        head = NULL;

}

static void doexit()
{
        syslog(LOG_NOTICE, "chkproc exiting");
}

static void parseconfig()
{

        FILE *cfile;
        char entry[128];
        int max = -1;   //-1 = not given; a config line may omit min and max
        int min = -1;
        struct processentry *tmp = NULL;

        cfile = fopen(cfgfilename, "r");
        if (cfile == NULL) {
                syslog(LOG_ERR, "Opening configfile %s failed",
                       cfgfilename);
                exit(2);
        }
        while (fscanf(cfile, "%127s %d %d", entry, &min, &max) != EOF) {
                struct processentry *mtmp;
                printf("[entry] %s %d %d\n", entry, min, max);
                mtmp = malloc(sizeof(struct processentry));
                if (mtmp == NULL) {
                        syslog(LOG_ERR,
                               "Out of memory when reading configfile %s",
                               cfgfilename);
                        exit(3);
                }
                mtmp->name = strdup(entry);
                mtmp->max = max;
                mtmp->min = min;
                mtmp->nrfound = 0;
                mtmp->next = NULL;
                if (tmp == NULL) {
                        head = mtmp;
                } else {
                        tmp->next = mtmp;
                }
                tmp = mtmp;
                min = -1;
                max = -1;

        }
        if (fclose(cfile) != 0) {
                syslog(LOG_DEBUG, "Could not close configfile: %s",
                       strerror(errno));
                errno = 0;

        }

}

static void debuglist()
{
        struct processentry *tmp;
        tmp = head;
        while (tmp) {
                printf("process %s min %d max %d found %d\n", tmp->name,
                       tmp->min, tmp->max, tmp->nrfound);
                tmp = tmp->next;
        }

}

void confighandler(int signo)
{
        rereadconfig = (sig_atomic_t) 1;

}

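It uses a config file like the one below. Each line is a process name
followed by an optional minimum and maximum number of instances; -1 means no
limit on that side, and a name on its own means at least one instance must
be running: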
xscreensaver 1 1
sshd 1 1
mingetty
syslogd 1 1
portmap 1 1
xinetd 1 -1
klogd 1 1