System/Network Monitoring/Metrics Survey

Post by Pat Hu » Tue, 15 Oct 1996 04:00:00

Fellow Admins,

I am a sys admin in the Semiconductor group at Texas Instruments.  We have
many diverse UNIX/PC networks world-wide supporting various engineering
efforts within the SC group at TI, and although we do not all work directly
for the same organization, much of our business is so similar that we attempt
to take common approaches and share ideas on many of our efforts.  Some of
these efforts include such things as common ways to do DNS/NIS, mail,
automounting and file system layout, security, backups, etc.

One of the more hotly debated topics is how we should be collecting and
reporting system/network uptime metrics to the management of the various
design organizations that we support, as well as how we can use metrics and
various monitoring tools to become more proactive rather than reactive in our
efforts.  A team of admins from various networks (of which I am a member)
has been given the task of determining a method for doing this.  Personally,
I hold no particular fondness for collecting metrics.  I have seen situations
where metrics were applied in such a way that they actually hindered people's
productivity.  We are trying to avoid such an outcome, but still come up with
a usable solution that everyone can adapt to their own needs.

We attempted to tackle this task (at least a small portion of it) about a year
ago, before this team was actually formed, and managed to come up with a set
of relatively simple scripts that use the "uptime" and "fping" commands to
determine what percentage of time a machine was up and accessible from the
network during the course of a week.  Although their implementation had some
drawbacks, in my opinion these scripts worked very well in that they were
relatively accurate and gave management an easy-to-understand synopsis of the
health of our network.  However, we had difficulty getting buy-in from all
networks.  Many complained of having to "install" the scripts on every
machine in their network, some made personalized modifications to the
scripts, and we did not have a central group who wanted to take ownership of
maintaining them.  It was also argued that these scripts did not necessarily
reflect an accurate idea of "uptime" (whatever that may be), since a machine
could theoretically be up wrt the scripts but have some other problem that
the scripts did not account for.
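
To make the weekly percentage concrete, here is a minimal sketch of the kind of calculation described above.  It is not the original scripts: it assumes each host is polled at a fixed interval (for example by a cron job running fping) and that every poll is recorded as a (timestamp, reachable) pair.  The function name and the 5-minute interval are hypothetical.

```python
from typing import Iterable, Tuple


def availability_percent(samples: Iterable[Tuple[int, bool]]) -> float:
    """Return the percentage of polls in which the host answered.

    Each sample is (unix_timestamp, reachable). Polls are assumed to be
    evenly spaced, so every sample carries equal weight.
    """
    results = [reachable for _, reachable in samples]
    if not results:
        raise ValueError("no samples collected")
    return 100.0 * sum(results) / len(results)


# Hypothetical week of polls at 5-minute intervals (7 * 24 * 12 = 2016),
# with four consecutive missed polls (a 20-minute outage).
missed = {10, 11, 12, 13}
week = [(i * 300, i not in missed) for i in range(2016)]
weekly_uptime = availability_percent(week)
```

Note that a figure computed this way only says the machine answered a ping; it says nothing about whether the services on it were usable, which is the objection raised above.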

This new "team" has been tasked with developing a better approach, if
possible, to collecting and reporting some set of metrics that makes sense (if
it makes sense to do this at all).  That is why I am coming to the various
newsgroups, in an attempt to determine what, if anything, the rest of you are
doing out there.  I would greatly appreciate any feedback at all from other
system/network admins wrt your experiences or ideas on uptime metrics,
positive or negative.  Whether you are currently attempting to collect
metrics, are just monitoring your network in various ways, or are merely at
the stage of considering a similar approach, I would appreciate hearing from
all of you.  I would like to hear what some of you are measuring and why (or
why not), and what you are getting or expecting to get out of it.

Our focus currently is on an SNMP network monitoring tool called Spectrum
(from Cabletron).  We are using it in conjunction with an agent called
MaestroVision that runs on client machines.  Spectrum has a plethora of
configuration options for monitoring just about any system that can speak
SNMP or run a compatible agent such as MaestroVision.  Some of the SC groups
are also using a popular trouble reporting and tracking system called Remedy,
and we are trying to figure out how to fit data collected from this system
into our metrics.  The goals that have been defined for our team are as
follows:

1) Collect and report weekly uptime metrics for core networks using Spectrum

2) Achievement of 99.8% uptime for core computing environments during prime

3) Develop process for using metrics to improve computing environment.  

2) Prime time = Monday thru Friday 7:00 am - 7:00 pm

3) For purposes of these deliverables, uptime is measured for infrastructure
   servers only.  Those include at least email, DNS, compute, file, and
   web servers.

We are currently debating whether the 99.8% uptime goal of item #2 is a bit
unrealistic.  Our results so far indicate that this may be the case.
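
For a sense of scale, the arithmetic behind that goal can be sketched as follows.  This is only an illustration using the prime-time window defined above (Monday thru Friday, 7:00 am - 7:00 pm); the function and constant names are hypothetical.

```python
from datetime import datetime

PRIME_START_HOUR = 7    # 7:00 am
PRIME_END_HOUR = 19     # 7:00 pm
PRIME_DAYS = 5          # Monday thru Friday


def in_prime_time(ts: datetime) -> bool:
    """True if ts falls on a weekday between 7:00 am and 7:00 pm."""
    return ts.weekday() < PRIME_DAYS and PRIME_START_HOUR <= ts.hour < PRIME_END_HOUR


def weekly_downtime_budget_minutes(target_percent: float) -> float:
    """Minutes of prime-time downtime per week allowed by an uptime target."""
    prime_minutes = PRIME_DAYS * (PRIME_END_HOUR - PRIME_START_HOUR) * 60
    return prime_minutes * (100.0 - target_percent) / 100.0


# 5 days * 12 hours = 60 prime-time hours, i.e. 3600 minutes per week.
budget = round(weekly_downtime_budget_minutes(99.8), 2)
```

Under these assumptions a 99.8% target leaves about 7.2 minutes of prime-time downtime per server per week, so a single reboot during the day could use up most of a week's budget.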

ANY feedback I can get would be greatly appreciated!  Please contact me
directly through email (address below) as I may miss something if you post
back to the groups.

Many thanks for your input,
Pat H.


 Texas Instruments, Inc.        |
 M/S 802                        |      TI msg id:  PHUL              
 6430 Hwy 75 South              |      Phone:  903-868-7208          
 Sherman, TX  75090             |      FAX:    903-868-5980          


1. Survey: Solaris (Unix) system monitoring tools

Hi folks,

I'm doing a short survey to broaden our view of unix system monitoring
suite products. We have a need, but we are not sure that we know all of the
available products and issues. Could you please answer the following
questions?

1) What unix system monitoring suite products do you know of?

2) What unix system monitoring suite products do you use?

3) What are your major gripes about the unix system monitoring suite you
use and other suites in general?

Thanks for your time

Eric Dee.
