98% availability: is it possible?


Post by Russell Conne » Tue, 04 Nov 1997 04:00:00



A question:

My direct superior, the VP of Ops, who was once, long ago, an SA as well,
gave me a directive of 98% availability. Needless to say, that leaves a huge
'?'. I was wondering if you guys would help me define what the deliverable
really is when someone says 98% availability.

Here's our scenario:
1. The rookie hero: yours truly, a capable NT admin who has been doing SCO
Unix for the last 6 months and has had no formal training in system
administration, let alone UNIX. In short, don't make any assumptions as to my skills.
2. The suspect: an Intel-based machine with 128 MB of RAM, 14 GB on three
drives, two 200 MHz Pentium Pros, etc. Takes pride in being best of breed, with no
generic parts and no new gewgaws. Runs SCO 5.0.2. Keeps 5 programmers and
3 QA people in food, clothing and shelter. Has a tendency to be far too open
to change, with the root password on the loose.
3. The goal: how do I define uptime?
To wit:
What is a reasonable percentage and sample time?
If 98% is not reasonable, what can be achieved without redundant parts or systems?
What items are normally considered out of my control besides acts of God?
How do I record and verify this info?

I am in the hot seat, about to be reassigned to crosswalk duty or worse.
Please help.

Russ Conner
SA
Intrix Systems Grou

 
 
 


Post by test » Tue, 04 Nov 1997 04:00:00



> A question:

> My direct superior, The VP of OPS, who was once long ago an SA as well,
> gave me a directive of 98% availability. Needless to say that leaves a huge
> '?'. I was wondering if you guys would help me define what the deliverable
> really is when someone says 98% avail.

First, get him to define: 98% of what?  24x7 (i.e. 24 hrs/day, forever)?
7:00am - 6:00pm Mon-Fri? 98% of any given day? This will have a BIG
effect on how you approach the problem. It is just as important to
determine how many crashes are reasonable as well as how many hours up.
That is, 98% of 24x7 could be met with one week-long crash a year, while
98% of any given day allows for one short crash a day. Ideally, you want
to shoot for something in the middle. :)
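
To put rough numbers on how much the definition matters, here is a minimal Python sketch of the downtime budget at 98% under three different service windows. The windows and the 52-week year are illustrative assumptions, not anything the VP has specified.

```python
def downtime_budget_hours(availability, window_hours):
    """Hours of downtime allowed in a window at the given availability (0..1)."""
    return (1.0 - availability) * window_hours


# Hypothetical service windows; the first two cover one 52-week year.
windows = {
    "24x7, whole year": 24 * 7 * 52,
    "07:00-18:00 Mon-Fri, whole year": 11 * 5 * 52,
    "any single 24-hour day": 24,
}

for name, hours in windows.items():
    budget = downtime_budget_hours(0.98, hours)
    print(f"{name}: {budget:.2f} hours of downtime allowed")
```

The same 98% figure works out to roughly a week per year in the first case and under half an hour in the last, which is why the window has to be agreed on first.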

> Here's our scenario:
> 1. The rookie hero, yours truly, a capable NT admin who is now doing SCO
> Unix, for the last 6 months, who had not had a formal training in SA let
> alone UNIX. In short, don't make any assumptions as to my skills.

Include in your proposal a request for SA level training for yourself.

> 2. The suspect, a Intel based machine, consists of 128mg, 14 GB on three
> drives, 2 200PPros, etc. Takes pride in being the best of breed, with no
> generic parts, and no new gewgaws. Runs SCO 5.0.2. Keeps 5 programmers and
> 3 QA people in food, clothing and shelter. Has a tendency to be far to open
> to change, root password on the loose.

Immediately change the root password.  Have anyone who complains submit
to you a request in writing (you are going to need a certain amount of
paper for CYA!) with their needs for root access. Have them list the
functions they will be doing, and why it cannot be done another way. Keep
in mind there is legitimate stuff that falls into this category. Develop a
policy that says who has root access, and how often the root password
changes.  Make it an immediate policy that the root password is NEVER to
be hardcoded into any scripts (especially FTP) or programs on this, or
any other, system. If you do not have the backing from your boss to do
this, then tell him that what he wants is not doable.  No sense in
trying to be responsible for an area over which you do not have
authority.

Find out how many of the other login IDs have group zero
privileges (cat /etc/group, and check the GID field in /etc/passwd).
Wherever access is granted by group, this gives them essentially the same
security level as root, except for those processes that explicitly look
for the name 'root'.
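
As a rough illustration of that check, the Python sketch below lists every login that has GID 0 as its primary group (passwd file) or is listed as a member of the GID 0 group (group file). It assumes the standard colon-separated layouts of those two files. The sample accounts are made up, and on a real box you would feed it the contents of /etc/passwd and /etc/group.

```python
def gid_zero_accounts(passwd_text, group_text):
    """Return sorted login names whose primary or supplementary group is GID 0."""
    names = set()
    for line in passwd_text.splitlines():
        fields = line.split(":")
        # passwd layout: name:password:uid:gid:comment:home:shell
        if len(fields) >= 4 and fields[3].strip() == "0":
            names.add(fields[0])
    for line in group_text.splitlines():
        fields = line.split(":")
        # group layout: name:password:gid:member,member,...
        if len(fields) >= 4 and fields[2].strip() == "0":
            names.update(m.strip() for m in fields[3].split(",") if m.strip())
    return sorted(names)


# Hypothetical sample data, for illustration only.
sample_passwd = (
    "root:x:0:0:Super User:/:/bin/sh\n"
    "alice:x:201:0::/u/alice:/bin/sh\n"
    "bob:x:202:50::/u/bob:/bin/sh\n"
)
sample_group = "root::0:root,carol\nother::1:bob\n"

print(gid_zero_accounts(sample_passwd, sample_group))
```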

> 3. The goal: how do I define uptime?
> To wit:
> What is a reasonable percentage and sample time?

See above. Also, you have to allow for regular maintenance (upgrades,
repairs, file reorgs). Get your boss to agree ahead of time whether this
counts against the 98% or not.

> If not what can be achieved without redundant parts or systems?

Not much. You can go for RAID 5, which stripes your data with parity
across multiple drives so that a single drive failure does not lose it
(RAID 1 is the level that mirrors). Do your backups religiously (I would
recommend using vdump, if it is available).

> What items are normally considered out of my control besides acts of God?

Anything caused directly by any individual who has access to the system,
i.e. if one of the programmers crashed the telnet daemon (if you are
using telnet, or LAT, etc.) and you have to reboot to recover, that
should not be counted against you.  On the other hand, if, as system
admin, you have final say on what does and what does not get run, then
it could.

You also need to coordinate with your local power company as to what their
scheduled down times in your area are, and get a contact name from them
so you can deal with those "non-scheduled events". :)

> How do I record and verify this info?

Set up a cron job to run once an hour, every day, that runs uptime,
sar and/or vmstat; maybe even a 'ps aux' too (or 'ps -ef', the System V
form). Append both standard output and standard error to a daily log
file.  This will give you a consistent snapshot of both availability and
performance (in general).
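
One way to turn such an hourly log into a number is sketched below in Python: count how many of the expected hourly slots actually contain a log entry. This assumes you can extract a timestamp for each entry the cron job wrote. The function name and the sample day are hypothetical.

```python
from datetime import datetime, timedelta


def availability_from_samples(sample_times, start, end, interval=timedelta(hours=1)):
    """Estimate availability as the fraction of expected samples that were logged.

    sample_times: datetimes at which the cron job actually wrote a log entry.
    start, end:   reporting period; slots are start, start + interval, ... < end.
    """
    seen = set()
    for t in sample_times:
        if start <= t < end:
            # Index of the slot this sample falls in.
            seen.add((t - start) // interval)
    expected = (end - start) // interval
    return len(seen) / expected if expected else 0.0


# Hypothetical day in which the 13:00 and 14:00 samples are missing.
start = datetime(1997, 11, 3)
end = start + timedelta(days=1)
samples = [start + timedelta(hours=h, minutes=2) for h in range(24) if h not in (13, 14)]
print(f"{100 * availability_from_samples(samples, start, end):.1f}% of hourly samples present")
```

Keep in mind the resolution is only as good as the sampling interval: a reboot that completes between two hourly runs will not show up as a missing sample, although the uptime output in the log will reveal it.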

> I am in the hot seat, about to be reassigned to cross walk duty or worse.
> Please help.

> Russ Conner
> SA
> Intrix Systems Grou

It sounds as if this is a development box. Generally, the uptime
tolerance is a bit greater than if it were a production system, for most
companies. Good luck and hope this helps.
Courtesy copy e-mailed.
--

The above opinions are mine, not my employer's.

 
 
 


Post by Michael Vila » Tue, 04 Nov 1997 04:00:00




>A question:

>My direct superior, The VP of OPS, who was once long ago an SA as well,
>gave me a directive of 98% availability. Needless to say that leaves a huge
>'?'. I was wondering if you guys would help me define what the deliverable
>really is when someone says 98% avail.

>Here's our scenario:
>1. The rookie hero, yours truly, a capable NT admin who is now doing SCO
>Unix, for the last 6 months, who had not had a formal training in SA let
>alone UNIX. In short, don't make any assumptions as to my skills.
>2. The suspect, a Intel based machine, consists of 128mg, 14 GB on three
>drives, 2 200PPros, etc. Takes pride in being the best of breed, with no
>generic parts, and no new gewgaws. Runs SCO 5.0.2. Keeps 5 programmers and
>3 QA people in food, clothing and shelter. Has a tendency to be far to open
>to change, root password on the loose.
>3. The goal: how do I define uptime?
>To wit:
>What is a reasonable percentage and sample time?
>If not what can be achieved without redundant parts or systems?
>What items are normally considered out of my control besides acts of God?
>How do I record and verify this info?

>I am in the hot seat, about to be reassigned to cross walk duty or worse.
>Please help.

I don't know if SCO has a "failover" product which would allow you to
have one system be the "master" and the other the "slave" in hot standby.
On failure of the master, the slave would come up and have full use of
the master's disks.  This is doable with Sun's Storage Array products,
as a Fibre Channel connection can be in both master and slave systems.

To ride out a catastrophic power hit, which is usually bad for Unix
file systems, you would need a UPS and diesel generator for power backup.
I would think all the network equipment would be on the UPS and generator,
as well as the company PBX.

If there are any databases on the system, they need to be up even when
being backed up.  So you'll need a tape library and backup software to
ensure this.  Also institute a backup schedule and disaster recovery
procedure.

Being given this 98% metric as a charter is reasonable if you're running
the right hardware and solutions to accomplish it.  If you're expected
to do this without spending money, your boss is essentially saying YOU
are the squirrel in the cage when the power goes out, rather than a diesel
generator.  I'd update my resume at that point, but then I've been there
and don't want this type of responsibility any more.  If you price out
the stuff needed to provide the 98% metric and he won't go for it, cut
the cost into chunks so he can weigh risk against cost and implement
what he's willing to pay for.  Remember: TANSTAAFL (there ain't no such
thing as a free lunch).

/MeV/

--
Michael Vilain
Certified Rolfer(reg)
http://www.slip.net/~vilain
[remove "NOSPAM." from my address to reply]

 
 
 


Post by Timothy J. L » Tue, 04 Nov 1997 04:00:00



|someone writes:

|> 2. The suspect, a Intel based machine, consists of 128mg, 14 GB on three
|> drives, 2 200PPros, etc. Takes pride in being the best of breed, with no
|> generic parts, and no new gewgaws. Runs SCO 5.0.2. Keeps 5 programmers and
|> 3 QA people in food, clothing and shelter. Has a tendency to be far to open
|> to change, root password on the loose.
|
|Immediately change the root password.  Have anyone who complains submit
|to you a request in writing (going to need a certain amount of paper for
|CYA!) with their needs for root access. Have them list the functions
|they will be doing, and why it cannot be done another way. Keep in mind
|there is ligitimate stuff that falls into this category.

Also consider the use of sudo (freeware) or writing setuid programs
to allow doing specific tasks as root without giving out general root
access.  Of course, one has to be very, very careful to ensure that
these methods don't have bugs that allow general root access, since
bugs in setuid programs are a common source of security holes.

--
------------------------------------------------------------------------

Unsolicited bulk or commercial email is not welcome.             netcom.com
No warranty of any kind is provided with this message.

 
 
 


Post by Ivica Smol » Wed, 05 Nov 1997 04:00:00




>> well this is not too bad, if he means 98% of all days, this gives you at least
>>  6 days a year down time. Witch is not bad if your running unix.

>6 days a year downtime?  That's _very_ bad.

        Bad?!?! Why? It is easy to achieve.
        You can take 2 whole days twice a year for upgrades and
similar and still have 48 hours left.
        I can tell you that if you need more than 10 hours of downtime
because of hardware/software problems, something is wrong with your
computer system.

########################################################################

Company:

        Medimurska banka                <URL=http://www.open.hr/com/mb>

        Valenta Morandinija 37
        40000 Cakovec                   tel.: [385 [40]] 810638
        Croatia                         fax.: [385 [40]] 810623

########################################################################

 
 
 


Post by William C DenBeste » Wed, 05 Nov 1997 04:00:00



> My direct superior, The VP of OPS, who was once long ago an SA as well,
> gave me a directive of 98% availability. Needless to say that leaves a huge
> '?'. I was wondering if you guys would help me define what the deliverable
> really is when someone says 98% avail.

98% uptime is not difficult to achieve.  2% outage time is a generous
amount on a healthy system...

        2% of an hour is 1.2 minutes.
        2% of an 8 hour work day is nearly 10 minutes.
        2% of a 24 hour day is nearly 29 minutes.
        2% of a 40 hour work week is 48 minutes.
        2% of a 24x7 week is about 3.4 hours.

The tricks to making this work are:

1) Plan scheduled outage times which do not count against the average.

2) Have a UPS with good power conditioning, trim and boost capabilities,
   so that you don't have to deal with power glitches.

3) Pay attention to the 24 hour temperature of your machine room and
   the insides of cabinets.  Heat fatigues hardware and hard drives.

4) Keep resources available.  Continuous load average should be no greater
   than the number of processors; continuous disk and memory usage
   should leave "healthy" breathing room.  The machine should not swap.

5) Buy Hardware RAID subsystems for all hard drives, complete with hot
   spares.

6) Don't share the root password with anyone who does not share
   responsibility for the uptime average.

7) Have a second machine upon which you pre-flight risky actions.

 
 
 


Post by JDick788 » Wed, 05 Nov 1997 04:00:00


>A question:

>My direct superior, The VP of OPS, who was once long ago an SA as well,
>gave me a directive of 98% availability. Needless to say that leaves a huge
>'?'. I was wondering if you guys would help me define what the deliverable
>really is when someone says 98% avail.

Well, this is not too bad. If he means 98% of all days, this gives you at least
 6 days a year of downtime, which is not bad if you're running Unix.

>Here's our scenario:
>1. The rookie hero, yours truly, a capable NT admin who is now doing SCO
>Unix, for the last 6 months, who had not had a formal training in SA let
>alone UNIX. In short, don't make any assumptions as to my skills.
>2. The suspect, a Intel based machine, consists of 128mg, 14 GB on three
>drives, 2 200PPros, etc. Takes pride in being the best of breed, with no
>generic parts, and no new gewgaws. Runs SCO 5.0.2. Keeps 5 programmers and

Unless you really need two 200 MHz PPros, take one out and install it in another
 computer, along with half the RAM and one of the three drives, and make
 this into a test bed server (or a backup if the main one fails). On this
 machine you keep an identical software setup, and you try all major changes on
 the test bed before you touch the main system. This alone could give you 98%
 uptime, because the chances of human error are greatly reduced. And of course
 follow the advice about root access given in the other replies, and get an
 uninterruptible power supply.

With smart purchasing and reusing parts lying around, you could get this setup
 for about $300, though the main system will not be as fast... but of course
 you could also use the test bed for trivial tasks, removing them from
 the main system.

>3 QA people in food, clothing and shelter. Has a tendency to be far to open
>to change, root password on the loose.
>3. The goal: how do I define uptime?
>To wit:
>What is a reasonable percentage and sample time?
>If not what can be achieved without redundant parts or systems?
>What items are normally considered out of my control besides acts of God?
>How do I record and verify this info?

>I am in the hot seat, about to be reassigned to cross walk duty or worse.
>Please help.

>Russ Conner
>SA
>Intrix Systems Grou

 
 
 


Post by Andre » Wed, 05 Nov 1997 04:00:00



> well this is not too bad, if he means 98% of all days, this gives you at least
>  6 days a year down time. Witch is not bad if your running unix.

6 days a year downtime?  That's _very_ bad.

--

Sysadmin, Insync Internet Services  |  believing in it, doesn't go away."
BOFH, Wielder of the sacred LART    |           -- Philip K.*

 
 
 


Post by Geoffrey C Marshal » Wed, 05 Nov 1997 04:00:00



> A question:

> My direct superior, The VP of OPS, who was once long ago an SA as well,
> gave me a directive of 98% availability. Needless to say that leaves a huge
> '?'. I was wondering if you guys would help me define what the deliverable
> really is when someone says 98% avail.

> Here's our scenario:
> 1. The rookie hero, yours truly, a capable NT admin who is now doing SCO
> Unix, for the last 6 months, who had not had a formal training in SA let
> alone UNIX. In short, don't make any assumptions as to my skills.
> 2. The suspect, a Intel based machine, consists of 128mg, 14 GB on three
> drives, 2 200PPros, etc. Takes pride in being the best of breed, with no
> generic parts, and no new gewgaws. Runs SCO 5.0.2. Keeps 5 programmers and
> 3 QA people in food, clothing and shelter. Has a tendency to be far to open
> to change, root password on the loose.
> 3. The goal: how do I define uptime?
> To wit:
> What is a reasonable percentage and sample time?
> If not what can be achieved without redundant parts or systems?
> What items are normally considered out of my control besides acts of God?
> How do I record and verify this info?

> I am in the hot seat, about to be reassigned to cross walk duty or worse.
> Please help.

I can probably help, but you want to get this offline...
It is too big....

Geoff...

 
 
 


Post by Ivica Smol » Thu, 06 Nov 1997 04:00:00




>>>6 days a year downtime?  That's _very_ bad.

>>        Bad?!?! Why? It is easy to achive.

>Please reread the thread.  :)  Sure, it's easy to achieve six days downtime.
>Unplug the damn thing for a week.  ;-)

>I was saying that it's terribly easy with a UNIX system to have considerably
>higher than 98% uptime.  The worst lack of uptime I'm aware of on any of my
>UNIX machines over the past year is about 6 hours, with several machines
>having under 5 minutes downtime in the year.  Both of those values calculate
>to well above 99% uptime.

        Misunderstanding! Sorry!

########################################################################

Company:

        Medimurska banka                <URL=http://www.open.hr/com/mb>

        Valenta Morandinija 37
        40000 Cakovec                   tel.: [385 [40]] 810638
        Croatia                         fax.: [385 [40]] 810623

########################################################################

 
 
 


Post by Andre » Thu, 06 Nov 1997 04:00:00



>>6 days a year downtime?  That's _very_ bad.
>    Bad?!?! Why? It is easy to achive.

Please reread the thread.  :)  Sure, it's easy to achieve six days downtime.
Unplug the damn thing for a week.  ;-)

I was saying that it's terribly easy with a UNIX system to have considerably
higher than 98% uptime.  The worst lack of uptime I'm aware of on any of my
UNIX machines over the past year is about 6 hours, with several machines
having under 5 minutes downtime in the year.  Both of those values calculate
to well above 99% uptime.
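
For anyone who wants to check figures like these, a small Python sketch converting downtime into an availability percentage follows. It assumes a 365-day year of 8760 hours, and the function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Availability over the period, as a percentage."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


for label, hours in [("6 hours", 6), ("5 minutes", 5 / 60), ("6 days", 6 * 24)]:
    print(f"{label} down per year: {availability_percent(hours):.4f}% available")
```

Six hours and five minutes of downtime both land above 99.9%, while six full days comes out only a little above 98%.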

--

Sysadmin, Insync Internet Services  |  believing in it, doesn't go away."
BOFH, Wielder of the sacred LART    |           -- Philip K.*

 
 
 


Post by Doug Barne » Thu, 06 Nov 1997 04:00:00




> >>6 days a year downtime?  That's _very_ bad.

> >       Bad?!?! Why? It is easy to achive.

> Please reread the thread.  :)  Sure, it's easy to achieve six days downtime.
> Unplug the damn thing for a week.  ;-)

> I was saying that it's terribly easy with a UNIX system to have considerably
> higher than 98% uptime.  The worst lack of uptime I'm aware of on any of my
> UNIX machines over the past year is about 6 hours, with several machines
> having under 5 minutes downtime in the year.  Both of those values calculate
> to well above 99% uptime.

> --

> Sysadmin, Insync Internet Services  |  believing in it, doesn't go away."
> BOFH, Wielder of the sacred LART    |           -- Philip K.*

I agree, 98%+ for a UNIX box with "Well tested applications" (i.e.
Oracle, Informix, etc) should be easy to do.
--
Remove the NOSPAM to E-mail me.  Tired of the MLM schemes.
 
 
 

1. IP masq ftp's hang at 100 percent a possible solution

I was looking for a solution to the problem of ftp's hanging at %100 ot
pretty close.  I came across some things about MTU path discovery
causing some problems for trumpet winsock and Open transport.  So I
check yahoo and there was some talk about it but the link was dead.
I disabled the path discovery on a win95 machine behind the router and
did not get any hangs in a few downloads.

I was just wondering if anyone has found that this solves their problem
so I can stop downloading uneeded things trying to get a hang...
--
I've followed you, talked to your neighbors, tapped your phone, and even shot
at you to see how you would react.  From my observations I have come to one
irrefutable conclusion: You are Paranoid.  --unknown

2. unable to open /dev/hdc1

3. Is it possible to see Linux over network with Win 98?

4. The game without a name 0.5 released

5. Mount/unmount problems in 0.98.3 (Possible BUG?)

6. ntp problems under AIX - daylight saving

7. PCI UltraSCSI card REQUIRES Sol 2.6 HW 5/98 (not 3/98) ?

8. UNIX System V release 4, password security

9. Solaris 2.6 3/98 vs 5/98?

10. Differences between Solaris 2.6 March 98 and May 98

11. how to create 98 boot image in linux thro' 98 backup

12. Patched 2.6 3/98 and 2.6 5/98 the same?

13. Windows 98 crashes at COMDEX '98