Monitoring processes and machines (program itself and central)

Post by Kriss Houglan » Sat, 30 Mar 1991 12:50:23



I'm interested in being able to keep tabs on our whole domain.  That way, when
people log off for the day, it's usable CPU time!  The unfortunate problem is
that sometimes the programs crash and burn by themselves and sometimes ye old
operator does a kill -9 on them.
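
One way to notice that a job has crashed or been killed, without running "ps",
is to probe its PID with signal 0, which performs the existence and permission
checks but delivers nothing. A minimal sketch, assuming Python on a POSIX
system; the name is_alive is made up for illustration:

```python
import os

def is_alive(pid):
    """Return True if a process with this PID currently exists."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        raise ValueError("pid must be positive")
    try:
        os.kill(pid, 0)   # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:
        return False      # no such process
    except PermissionError:
        return True       # it exists, but belongs to another user
    return True
```

A watcher could call is_alive() on each job's PID every minute or so. Note that
a child which has exited but has not yet been waited for still counts as
existing, and that a PID can eventually be reused by an unrelated process.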

What I am wondering is:
1) Where can I find, say, the source code for a "ps" function so I don't have
to C shell out and get the info?

2) I'm trying to find "ofiles" on a comp.sources.unix archive machine (so far,
no luck).

3) I'm trying to figure out if there is a way to totally swap out the program
(context or whatever) so I can resume execution later.  Or, at worst, have a
central program (a daemon) that will kill it remotely.  For example, when
someone comes in the morning and logs on, I want to either kill the process via
a central program on another machine (I'm trying to use sockets now) or swap
out the program so people don't gripe and get the operator to do a kill -9 on
it.
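
For item 3, a job can be suspended and resumed from outside, with no access to
its source, by sending SIGSTOP and SIGCONT. This is not a true swap-out or
checkpoint: the stopped process stays on the same machine and simply stops
using CPU. The sketch below, assuming Python on a POSIX system, shows the part
of a per-machine daemon that would act on one request line received over a
socket. The "VERB PID" request format, the verb names, and handle_command are
all hypothetical, chosen only for illustration:

```python
import os
import signal

# Hypothetical request verbs; not taken from any existing tool.
COMMANDS = {
    "STOP": signal.SIGSTOP,   # suspend; cannot be caught or ignored
    "CONT": signal.SIGCONT,   # resume a stopped process
    "TERM": signal.SIGTERM,   # request termination, gentler than kill -9
}

def handle_command(line):
    """Act on one 'VERB PID' request and return a short status string."""
    parts = line.split()
    if len(parts) != 2 or parts[0] not in COMMANDS:
        return "ERR bad request"
    try:
        pid = int(parts[1])
    except ValueError:
        return "ERR bad request"
    if pid <= 1:
        # 0 and negatives address process groups; 1 is init.
        return "ERR refusing pid"
    try:
        os.kill(pid, COMMANDS[parts[0]])
    except ProcessLookupError:
        return "ERR no such process"
    except PermissionError:
        return "ERR permission denied"
    return "OK"
```

The central program would send "STOP 1234" when someone logs on in the morning
and "CONT 1234" once the machine is idle again. The daemon has to run as the
job's owner or as root for the signals to be permitted, and a real one would
need to authenticate whoever is connecting before acting on a request.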

Currently, I don't have source for the number crunching programs.

Please post any comments or suggestions.  I hope I have not screwed up my
point, but I have a feeling that other people might be interested in
distributed computing the chunky way, other than using "at".

All rights given, All wrongs deserved!
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-
Addresses:              !Disclaimer:  All information is my own and is not that



Monitoring processes and machines (program itself and central)

Post by Brian J. Hafn » Mon, 01 Apr 1991 10:06:52



>I'm interested in being able to keep tabs on our whole domain.  That way, when
>people log off for the day, it's usable CPU time!  The unfortunate problem is
>that sometimes the programs crash and burn by themselves and sometimes ye old
>operator does a kill -9 on them.

You may be interested in "condor" from the Univ. of Wisconsin.
A portion of the condor_intro man page:

     Condor is a facility for executing UNIX jobs on a pool of
     cooperating workstations.  Jobs are queued and executed
     remotely on workstations at times when those workstations
     would otherwise be idle.  A transparent checkpointing
     mechanism is provided, and jobs migrate from workstation to
     workstation without user intervention.  When the jobs
     complete, users are notified by mail.

Condor may be obtained via anonymous FTP from shorty.cs.wisc.edu.

Brian J. Hafner
Computer Sciences Department
University of Wisconsin - Madison