MISalert - a beeper tool for Unix SysAdmins

Post by Stuart Cracra » Wed, 20 Sep 1995 04:00:00

MISalert - self-diagnosing computers send alphanumeric beeps to MIS staff
*** See our World Wide Web Page at http://www.interbahn.com/pub/cracraft ***

Unix system admins spend most of their time chasing down problems
reported by users or wading monk-like through reams of reports, printed
or E-mail.

What if there were another way? MISalert is one such way. Using this
tool, MIS staff receive alphanumeric diagnostics about critical areas of
systems, networks, databases, and daemons.

So you become like a doctor, only called when you have a sick patient.
And when you are beeped, you know exactly what the problem is because
you have a very precise message service.


MISalert activates a series of agents, each specialized to examine and
check a specific area: the system, network, databases, daemons,
attached peripheral hardware, etc. Typically these are the areas
considered critical to general system health.

Normally, an agent will detect no problem and it will give way to
another agent. Eventually, an agent may be encountered which reports
a problem, for example, runaway processes or filesystems nearing
a certain high watermark in usage.

At this point, the agent queues a message in an internal buffer.
Ultimately, after all agents have had a chance to run, this internal
buffer is scanned and compared to the last run. Any new problems are
extracted and stored for the next phase.
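
The pass described above can be sketched roughly as follows. This is
only an illustration in Python (MISalert itself is Perl and its source
is not shown here); the name run_pass, the agent signature, and the
sample message are assumptions made for the example.

```python
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical agent signature: return a message string when a problem
# is found, or None when the checked area is healthy.
Agent = Callable[[], Optional[str]]


def run_pass(agents: List[Agent],
             last_run: Set[str]) -> Tuple[List[str], Set[str]]:
    """Run every agent once and return (new_problems, all_problems)."""
    buffer: List[str] = []           # internal buffer of queued messages
    for agent in agents:
        message = agent()
        if message is not None and message not in buffer:
            buffer.append(message)
    # Only problems that were not already present in the last run
    # move on to the notification phase.
    new_problems = [m for m in buffer if m not in last_run]
    return new_problems, set(buffer)


# Two passes with the same unresolved problem: it is reported once.
agents = [lambda: None, lambda: "disk /usr at 95%"]
first, seen = run_pass(agents, set())
second, seen = run_pass(agents, seen)
```

In this sketch the first pass yields the disk message and the second
pass yields nothing, because the problem was already known.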

Certain queued agent messages are high-frequency and are sent via
electronic mail to individuals on MIS staff (or to the entire staff via
a system alias). Most are not high-frequency, and these are transmitted
as alert messages to the alphanumeric pagers of MIS staff.

Current agents are:

# High load averages
        When load average goes above a high-watermark.
# Runaway processes
        Flags any user or system processes above a high watermark in
        terms of system utilization
# Check disk
        Ensure that disk filesystems don't go above a certain high watermark
# Line printer daemon not running
        Standard Berkeley daemons
# Tape devices with no tape
        Complains if no tapes are loaded in tape units for day's backup
# Tmp directory
        Checks permissions of tmp directory to ensure writeability
# Systems down
        Reports if other systems are down. Cross-check by all hosts
# Link to Internet down
        Checks if Internet link is down
# Dead WordPerfect or Lotus 1-2-3 daemons
        When standard daemons, system or third-party, are down
# Financials production
        Oracle (or other db) database financials production down
# Production company database
        Oracle (or other db) production database down
# Specialized line printer daemons not running
        Various lp daemons
# Fax server
        Check that fax services are up
# XDM server
        Ensures X processes are up

These are current, implemented agents. Write others or consult with
the author or other MISalert users for more.


An "agent" is a small piece of code that checks for a condition that
an MIS staff person would normally have to be paid to check for. Now
staff can instead be paid to fix more of these problems and to do
higher-level work than scanning for errors.

An agent is easy to write. Typically when a problem occurs during
production, an irate user sends an electronic message to MIS. It
is best to make a list of such possible messages (or dig through
your email logs) and find out the types of problems that users
report.

After you've made a list, or found a problem that a user reports,
it is simple to verify whether MISalert has an agent for this
type of problem.  If it doesn't, writing one typically takes
10-20 minutes, including debugging and installing it for the regular
MISalert automatic run. The other agents serve as standard examples
or templates for your new agent.
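
To give a feel for how small an agent can be, here is a hedged sketch
of a disk high-watermark check. It is written in Python for
illustration only and is not one of the supplied Perl agents; the
function name, the 90% default, and the message format are all
assumptions.

```python
import shutil
from typing import Optional


def disk_agent(path: str = "/", high_watermark: float = 0.90,
               usage=shutil.disk_usage) -> Optional[str]:
    """Return an alert message if `path` is above the watermark.

    `usage` is injectable so the check can be tried without filling
    a real filesystem; by default it is shutil.disk_usage.
    """
    total, used, _free = usage(path)
    fraction = used / total
    if fraction >= high_watermark:
        return f"DISK {path} at {fraction:.0%} (limit {high_watermark:.0%})"
    return None                      # healthy: nothing to queue


# Trying the agent against made-up numbers (100 units total, 95 used).
message = disk_agent("/data", usage=lambda p: (100, 95, 5))
```

The agent returns nothing when the filesystem is healthy, which matches
the idea that most agents find no problem on most runs.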

From that point on, MISalert will monitor for this problem. If you
have your wakeup interval for MISalert set reasonably, then a user
need never again embarrass you or get angry about having seen this
problem. You will find it first.

Writing a typical agent takes 10 minutes, including testing. It is
far, far easier than coding an agent for SUN's SUNNET manager or
other similar systems. Basically, any system event you can track with
standard existing Unix tools can be tracked using MISalert agents, the
difference being that MISalert runs everything and reports it elegantly
to your pager with minimal effort on your part.

Expansion to hundreds of agents and a shorter granularity is possible.
The system is efficient because expensive statistics gathering is done
once per pass and, of course, because it is written in Perl.
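
One way to picture "gathered once per pass" is a shared snapshot that
every agent reads instead of measuring for itself. The Python sketch
below is an assumption about how such a design could look, not a
description of MISalert's internals; gather_stats, check_load,
check_disk, and the limits are hypothetical.

```python
import os
import shutil


def gather_stats(paths=("/",)):
    """Collect the expensive measurements a single time per pass."""
    return {
        "loadavg": os.getloadavg()[0],        # 1-minute load (Unix only)
        "disk": {p: tuple(shutil.disk_usage(p)) for p in paths},
    }


def check_load(stats, limit=8.0):
    """Agent: complain when the load average exceeds `limit`."""
    if stats["loadavg"] > limit:
        return [f"LOAD {stats['loadavg']:.2f} above {limit:.2f}"]
    return []


def check_disk(stats, limit=0.90):
    """Agent: complain about each filesystem at or above `limit`."""
    messages = []
    for path, (total, used, _free) in stats["disk"].items():
        if total and used / total >= limit:
            messages.append(f"DISK {path} at {used / total:.0%}")
    return messages


def run_agents(agents, stats):
    """Hand the same snapshot to every agent and collect their messages."""
    return [msg for agent in agents for msg in agent(stats)]
```

A pass would then call gather_stats() once and give the result to
run_agents(), so adding more agents does not repeat the measurements.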

Agents can be configured, based on type or class, for transmission to
beeper or via E-mail. To keep beeper activity (and charges) low,
high-volume alerts, like CPU or disk conditions, are typically
configured as E-mails, with everything else configured as beeps.
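
The class-based routing can be pictured as a small lookup table. The
following Python sketch is illustrative only; the class names, the
ROUTES table, and the dispatch function are assumptions, since the
actual configuration format is not described here.

```python
# Hypothetical routing table: high-volume classes go to E-mail,
# anything not listed falls through to the beeper.
ROUTES = {"cpu": "email", "disk": "email"}
DEFAULT_ROUTE = "beep"


def route(alert_class: str) -> str:
    """Return the transport configured for an alert class."""
    return ROUTES.get(alert_class, DEFAULT_ROUTE)


def dispatch(alerts):
    """Split (alert_class, message) pairs into per-transport queues."""
    queues = {"email": [], "beep": []}
    for alert_class, message in alerts:
        queues[route(alert_class)].append(message)
    return queues


queues = dispatch([("disk", "/usr at 95%"), ("daemon", "lpd not running")])
```

Here the disk alert lands in the E-mail queue and the daemon alert in
the beeper queue, mirroring the typical configuration described above.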


MISalert can be especially effective for organizations with only a
small staff and a large number of computers or services provided to
the user base.  Also, it would very likely be effective for large
staff sites that must trade off responsibilities in terms of a daily
"hot seat" or "system help desk" as the transport layer it uses
permits call-schedule times for on-call people.


MISalert maintains a master log with a timestamp for each alert.
The system can turn itself off and on at specific times, depending
upon MIS availability/commitments to your overall organization.
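
As a rough illustration of these two features, the Python sketch below
formats a timestamped log line and decides whether the monitor should
be active. The 07:00-19:00 window, the log line layout, and the
function names are assumptions, not MISalert's actual settings.

```python
from datetime import datetime, time

# Hypothetical active window; outside it the monitor stays quiet.
ACTIVE_START = time(7, 0)
ACTIVE_END = time(19, 0)


def is_active(now: datetime) -> bool:
    """True when `now` falls inside the configured active window."""
    return ACTIVE_START <= now.time() < ACTIVE_END


def log_line(now: datetime, message: str) -> str:
    """Format one master-log entry: timestamp, then the alert text."""
    return now.strftime("%Y-%m-%d %H:%M:%S") + " " + message


def append_to_log(log_path: str, now: datetime, message: str) -> None:
    """Append a timestamped alert to the master log file."""
    with open(log_path, "a") as log:
        log.write(log_line(now, message) + "\n")
```

A wrapper would check is_active() before running the agents and call
append_to_log() for every alert that is sent.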


The system consists of about 425 lines of Perl code and makes full use
of IxoBeep/Tpage for the transport layer. Since it is written in Perl,
it is extremely easy to add agents to. It is all currently running on
SUN systems but other systems should be able to run it. It does not
have any "hard-coding" dependency on the alphanumeric transport layer
it uses.

There is also an optional "cookie" feature to send out a motivational
fortune cookie if no problems are found, to keep MIS staff motivated
and interested (just kidding, we're all that way already, aren't we?)


MISalert's logical next evolutionary step is to not just self-diagnose
but to also self-fix system problems. For example, if disk is found to
be low on a particular filesystem, MISalert will go out and fix it.
If MISalert finds that a daemon is down, it will restart it. Only
problems that truly cannot be fixed will generate a beeper alert.
The rest will just generate email and logfile confirmations.
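
Since self-fixing is described as a future step, the following is
purely a sketch of how the decision could work, written in Python for
illustration. The problem dictionary, the fixers table, and the handle
function are hypothetical and do not reflect existing MISalert code.

```python
def handle(problem, fixers):
    """Try to repair a problem; beep only if it cannot be fixed.

    `problem` is a dict with "kind" and "text" keys (assumed layout).
    `fixers` maps a kind to a function that returns True on success.
    Returns (transport, message).
    """
    fixer = fixers.get(problem["kind"])
    fixed = False
    if fixer is not None:
        try:
            fixed = bool(fixer(problem))
        except Exception:
            fixed = False            # a failed repair counts as unfixed
    if fixed:
        return ("email", "fixed: " + problem["text"])
    return ("beep", problem["text"])


# Example: a daemon restart that succeeds, and a problem with no fixer.
fixers = {"daemon": lambda problem: True}
result_fixed = handle({"kind": "daemon", "text": "lpd not running"}, fixers)
result_beep = handle({"kind": "link", "text": "Internet link down"}, fixers)
```

Only the second problem would reach a pager in this sketch; the first
produces just an E-mail confirmation.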

        o requires Perl 4.036 or later
        o 14 current agents, expandable/customizable
        o 425 lines of Perl code
        o source code supplied on non-redistribution terms
        o new agents written by you or by others are sharable
        o requires ixobeep/tpaged (supplied)
        o requires alphanumeric beepers

         $149 plus California state sales tax (7.75%) if applicable

        o documentation & installation instructions
        o tips on how to write agents
        o Perl source code  (non-redistributable)
        o free advice/support via E-mail (by author)
        o permission to freely redistribute/collect any agents

        o available via Internet, U.S. First Class, Federal Express.

        The MSM Company
        25682 Cresta Loma
        Laguna Niguel, CA