Why are my boxes sooo slow sometimes? (w/ detailed 'Monitor' output)

Post by SR » Sat, 19 Oct 1996 04:00:00




Richard Ross writes:

   I've got a few RISC 6k (520) running AIX 3.2.5 networked
together. At night at ~7pm things slow to a crawl. I get
NFS server 'image' not responding on a machine w/ a few of
image's drives mounted (tabletop).  Also 'ls' seems
to work fine but when I use 'ls -al' it takes several minutes
to come back (what is the difference?). This happens on 'tabletop'
(the main box) and 'image' (legacy names).
  Pings from image to tabletop come back in 1ms no problem.
  What could make this happen like clockwork when there is
no big cron job occurring? An informix backup is running but
has been going for several hours w/ no noticeable drag on the
system and when it finishes I still have the problem.
  What else should I be looking at?
Below is detailed output from the 'Monitor' utility

(caution: use of wide margins important below)

Using 'monitor' (downloaded software) I find:

1   Users:   1 of  11 active 11 remote 502:02 sleep time
CPU: Sys  7.7% Wait 18.0% User  1.5% Idle 72.8%   Refresh: 10.00 s
0%             25%              50%               75%              100%
=====WWWWWWWWWWWW>

Runnable (Swap-in) processes  0.10 (0.00)  load average:  0.28,  0.22,  0.26

Memory    Real     Virtual    Paging (4kB)    Process events     File/TTY-IO
free      161 MB    360 MB       0.1 pgfaults     187 pswitch       3 iget
files     126 MB                 0.1 pgin         143 syscall       1 namei
total     384 MB    480 MB       1.5 pgout         38 read          0 dirblk
IO (kB/s) read  write busy%      0.1 pgsin         34 write    480815 readch
hdisk0     0.0    0.4    0       0.0 pgsout         0 fork     479077 writech
hdisk1     0.0    0.4    0                          0 exec          0 ttyrawch
hdisk2     0.0    0.0    0                          0 rcvint        0 ttycanch
hdisk3   467.2    0.0   18                          0 xmtint       26 ttyoutch
hdisk4     0.0    5.2    3                          0 mdmint
hdisk5     0.0    0.0    0
hdisk6     0.4    0.0    0                            Netw   read  write kB/s
                                                       lo0     0.0     0.0
                                                       sl1     0.0     0.0
                                                       en0     0.0     0.0
                                                       en1     0.4     0.9
                                                       en2     0.0     0.0
                                                       sl17     0.0     0.0
My top ten processes are:
root        514 46.2  0.0   12    8      - R      Oct 09 5237:53 kproc
lisah    161607  7.3  1.0  420 1736 pts/29 S    19:43:47  3:14 sqlturbo lisah 4
informix 256226  5.2  0.0  252  536      - R    17:25:11  9:34 /u/informix/bin/
lisah    224582  1.5  1.0  440 1264 pts/29 S    19:43:47  0:39 newclaim.4ge
root      21759  1.5  0.0 1652   12      - S      Oct 09 171:40 nsrmmd -n 1
root      20655  0.8  0.0  168   64      - S      Oct 09 89:14 /usr/etc/rpc.loc
root          0  0.5  0.0    8    8      - S      Oct 09 57:47 swapper
root     276692  0.5  0.0  204  280      - S    19:10:31  0:21 telnetd
root        771  0.4  0.0   16   16      - S      Oct 09 48:43 kproc
root       3251  0.4  0.0 1672 1104      - S      Oct 09 46:26 /usr/bin/nsrd
number of sqlturbos:      12

The informix job is the backup. hdisk3 access goes to zero when the backup
is finished but the slowness persists.
    Any ideas or direction very appreciated!
Richard Ross

 
 
 

Post by Stefan Weidenede » Sat, 19 Oct 1996 04:00:00




> Richard Ross writes:

>    I've got a few RISC 6k (520) running AIX 3.2.5 networked
> together. At night at ~7pm things slow to a crawl. I get
> NFS server 'image' not responding on a machine w/ a few of
> image's drives mounted (tabletop).  Also 'ls' seems
> to work fine but when I use 'ls -al' it takes several minutes
> to come back (what is the difference?). This happens on 'tabletop'
> (the main box) and 'image' (legacy names).
>   Pings from image to tabletop come back in 1ms no problem.
>   What could make this happen like clockwork when there is
> no big cron job occurring? An informix backup is running but
> has been going for several hours w/ no noticeable drag on the
> system and when it finishes I still have the problem.
>   What else should I be looking at?
> Below is detailed output from the 'Monitor' utility

Hi Richard,

I don't know much about your file system and your machine, but I do
know the difference between an "ls" and an "ls -al".
The ordinary "ls" command doesn't have to read the inode table.
It just reads the directory file, so there is no need to seek
on the disks. If you run "ls -l", it must read the inode
information for every entry, so your disk has to seek back and forth
between the inode table at the top and the data blocks below it.
Depending on the size of your filesystems, "ls -l" may take a very long time.
I would expect the inodes to reside in the filesystem's cache buffer, though.
Maybe your filesystem cache is too small?
Reading 20MB from your disk sequentially would take about 5 to 10
seconds. Reading the same amount of data with seeks would take
about 2 minutes.
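
To make the "ls" versus "ls -l" difference concrete, here is a minimal Python sketch. It is not how ls itself is implemented; the function names and the temporary test directory are only for illustration. The first function needs just the directory contents, the second makes one lstat() call per entry to fetch the inode data. On a local disk the gap is usually small, but on an NFS mount each lstat() may need a round trip to the server, which is where a slow server would hurt.

```python
import os
import tempfile
import time


def names_only(path):
    """Roughly what plain `ls` needs: read the directory file, nothing else."""
    return sorted(os.listdir(path))


def names_with_metadata(path):
    """Roughly what `ls -l` needs: one lstat() per entry to fetch inode data."""
    result = []
    for name in sorted(os.listdir(path)):
        st = os.lstat(os.path.join(path, name))
        result.append((name, st.st_size, st.st_mode))
    return result


if __name__ == "__main__":
    # Hypothetical scratch directory; point `d` at an NFS mount to compare.
    with tempfile.TemporaryDirectory() as d:
        for i in range(200):
            with open(os.path.join(d, f"file{i:03d}"), "w") as f:
                f.write("x" * i)
        t0 = time.perf_counter()
        names = names_only(d)
        t1 = time.perf_counter()
        detailed = names_with_metadata(d)
        t2 = time.perf_counter()
        print(f"{len(names)} names in {t1 - t0:.6f} s")
        print(f"{len(detailed)} lstat() calls in {t2 - t1:.6f} s")
```

Running the same two functions against a directory on the slow mount should show whether the time goes into the per-file attribute lookups.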

Perhaps this helps,

Stefan.


PS: When you have found the solution, please send a short reply. I'm very
interested in solving performance problems.

 
 
 

Post by Scott Nemet » Sat, 19 Oct 1996 04:00:00


> image's drives mounted (tabletop).  Also 'ls' seems
> to work fine but when I use 'ls -al' it takes several minutes
> to come back (what is the difference?). This happens on 'tabletop'
> (the main box) and 'image' (legacy names).

When I've seen this problem it was related to NFS and TCP/IP.  For some
reason under AIX 3.2 there is a problem when you take a machine down to
"init 0" and are running NFS.  This problem was corrected with AIX 4.  The
solution for AIX 3.2 is to execute /etc/nfs.clean immediately before
shutting down or immediately before running "init 0".  Check through the
crontab processes that run (or finish) around 7PM.  See if the machine is
being initialized to "init 0", or if TCP/IP is shutting down.  Try adding
the /etc/nfs.clean to the script immediately before the "init 0".  I know
there are other things that might cause this but this is what I ended up
doing and it did solve the problem.
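
One way to do the crontab check is a small script that lists every entry scheduled during a given hour. The sketch below is only an illustration: the sample crontab text and the command paths in it are made up, and the parser handles only plain numbers, comma lists, ranges and "*" in the hour field.

```python
def hour_matches(field, hour):
    """True if a crontab hour field ('*', '19', '18-20', '7,19') covers `hour`."""
    for part in field.split(","):
        if part == "*":
            return True
        if "-" in part:
            lo, _, hi = part.partition("-")
            if lo.isdigit() and hi.isdigit() and int(lo) <= hour <= int(hi):
                return True
        elif part.isdigit() and int(part) == hour:
            return True
    return False


def jobs_at_hour(crontab_text, hour):
    """Return the commands of crontab entries scheduled during the given hour."""
    hits = []
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # minute hour day-of-month month weekday command
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue
        if hour_matches(fields[1], hour):
            hits.append(fields[5])
    return hits


# Hypothetical crontab contents, for illustration only.
SAMPLE_CRONTAB = """\
# minute hour day month weekday command
0 2 * * * /usr/local/bin/nightly_report
55 18 * * 1-5 /usr/local/bin/prepare
0 19 * * * /etc/shutdown_script
30 18-20 * * * /usr/local/bin/sync_images
"""

if __name__ == "__main__":
    for cmd in jobs_at_hour(SAMPLE_CRONTAB, 19):
        print(cmd)
```

Feeding it the output of "crontab -l" for each user on both machines, with hours 18 and 19, would cover jobs that start or finish around 7PM.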
 
 
 

Post by David William » Sun, 20 Oct 1996 04:00:00




>Richard Ross writes:

>   I've got a few RISC 6k (520) running AIX 3.2.5 networked
>together. At night at ~7pm things slow to a crawl. I get
>NFS server 'image' not responding on a machine w/ a few of
>image's drives mounted (tabletop).  Also 'ls' seems
>to work fine but when I use 'ls -al' it takes several minutes
>to come back (what is the difference?). This happens on 'tabletop'
>(the main box) and 'image' (legacy names).
>  Pings from image to tabletop come back in 1ms no problem.
>  What could make this happen like clockwork when there is
>no big cron job occurring? An informix backup is running but
>has been going for several hours w/ no noticeable drag on the
>system and when it finishes I still have the problem.
>  What else should I be looking at?

  Sounds like a network problem to me. Try running netstat -i on both
  machines and look at the Ierrs/Oerrs columns. Error counts of more than
  a few percent of the packet counts indicate a network problem.
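
The percentage check can be scripted. The sketch below assumes the usual netstat -i layout where the last five columns are Ipkts, Ierrs, Opkts, Oerrs and Coll. The sample output, the interface counters and the 2% threshold are all made up for illustration, not taken from the machines in this thread.

```python
def error_rates(netstat_output):
    """Map interface name -> (input error %, output error %) from `netstat -i` text."""
    rates = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the header and anything too short to hold the counters.
        if len(fields) < 6 or fields[0] == "Name":
            continue
        try:
            # Counters are read from the right: Ipkts Ierrs Opkts Oerrs Coll.
            ipkts, ierrs, opkts, oerrs = (int(x) for x in fields[-5:-1])
        except ValueError:
            continue
        in_pct = 100.0 * ierrs / ipkts if ipkts else 0.0
        out_pct = 100.0 * oerrs / opkts if opkts else 0.0
        # An interface can appear on several rows; keep the first one seen.
        rates.setdefault(fields[0], (in_pct, out_pct))
    return rates


def suspect_interfaces(rates, threshold_pct=2.0):
    """Names of interfaces whose input or output error rate exceeds the threshold."""
    return sorted(name for name, (i, o) in rates.items()
                  if i > threshold_pct or o > threshold_pct)


# Hypothetical, simplified `netstat -i` output for illustration only.
SAMPLE_NETSTAT = """\
Name  Mtu   Network     Address     Ipkts  Ierrs   Opkts  Oerrs  Coll
lo0   1536  127         localhost   12000      0   12000      0     0
en0   1500  192.168.1   tabletop   200000   9000  180000     90     0
en1   1500  192.168.2   tabletop   500000      5  450000      0     0
"""

if __name__ == "__main__":
    rates = error_rates(SAMPLE_NETSTAT)
    for name, (in_pct, out_pct) in rates.items():
        print(f"{name}: input errors {in_pct:.3f}%, output errors {out_pct:.3f}%")
    print("suspect:", suspect_interfaces(rates))
```

In this made-up sample only en0 stands out, with 9000 input errors on 200000 packets (4.5%).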

>Below is detailed output from the 'Monitor' utility

>(caution: use of wide margins important below)

>Using 'monitor' (downloaded software) I find:

>1   Users:   1 of  11 active 11 remote 502:02 sleep time
>CPU: Sys  7.7% Wait 18.0% User  1.5% Idle 72.8%   Refresh: 10.00 s
>0%             25%              50%               75%              100%
>=====WWWWWWWWWWWW>

  It's not lack of CPU power

>Runnable (Swap-in) processes  0.10 (0.00)  load average:  0.28,  0.22,  0.26

>Memory    Real     Virtual    Paging (4kB)    Process events     File/TTY-IO
>free      161 MB    360 MB       0.1 pgfaults     187 pswitch       3 iget
>files     126 MB                 0.1 pgin         143 syscall       1 namei
>total     384 MB    480 MB       1.5 pgout         38 read          0 dirblk
>IO (kB/s) read  write busy%      0.1 pgsin         34 write    480815 readch
>hdisk0     0.0    0.4    0       0.0 pgsout         0 fork     479077 writech

                              It's not paging/lack of memory

>hdisk1     0.0    0.4    0                          0 exec          0 ttyrawch
>hdisk2     0.0    0.0    0                          0 rcvint        0 ttycanch
>hdisk3   467.2    0.0   18                          0 xmtint       26 ttyoutch

  Possibly disk I/O. Is the NFS mount point on hdisk3?

>hdisk4     0.0    5.2    3                          0 mdmint
>hdisk5     0.0    0.0    0
>hdisk6     0.4    0.0    0                            Netw   read  write kB/s
>                                                       lo0     0.0     0.0
>                                                       sl1     0.0     0.0
>                                                       en0     0.0     0.0
>                                                       en1     0.4     0.9
>                                                       en2     0.0     0.0
>                                                       sl17     0.0     0.0

       It's not network overload!

>My top ten processes are:
>root        514 46.2  0.0   12    8      - R      Oct 09 5237:53 kproc

   What is kproc?

>lisah    161607  7.3  1.0  420 1736 pts/29 S    19:43:47  3:14 sqlturbo lisah 4

  This could be worrying. Possibly this sqlturbo is scanning a database
  table on hdisk3. I'd check what the user is doing and get the
  developer of the associated .4ge to 'set explain on' and produce a
  trace of the SQL being executed. Get the developer to check that no
  database indices are missing.

>informix 256226  5.2  0.0  252  536      - R    17:25:11  9:34 /u/informix/bin/
>lisah    224582  1.5  1.0  440 1264 pts/29 S    19:43:47  0:39 newclaim.4ge
>root      21759  1.5  0.0 1652   12      - S      Oct 09 171:40 nsrmmd -n 1
  What is nsrmmd?
>root      20655  0.8  0.0  168   64      - S      Oct 09 89:14 /usr/etc/rpc.loc
>root          0  0.5  0.0    8    8      - S      Oct 09 57:47 swapper
>root     276692  0.5  0.0  204  280      - S    19:10:31  0:21 telnetd
>root        771  0.4  0.0   16   16      - S      Oct 09 48:43 kproc
>root       3251  0.4  0.0 1672 1104      - S      Oct 09 46:26 /usr/bin/nsrd
>number of sqlturbos:      12

>The informix job is the backup. hdisk3 access goes to zero when the backup
>is finished but the slowness persists.
>    Any ideas or direction very appreciated!
>Richard Ross


   E-mail me if this doesn't help - like Stefan I'm very interested in
   performance tuning.

--
David Williams

 
 
 

Post by joe do » Tue, 22 Oct 1996 04:00:00


The difference is that "ls" only reads a file (the directory file), which is
located in the current directory (i.e. "."; try od -xc . ). "ls -l" has to
access the inode table, which is located on the remote system.
Increase the number of biods on the local machine (and on the remote as well,
for the server caching daemon), and increase your mbufs. Consider mounting all
remote NFS filesystems with options=bg,soft,intr,retry=2,retrans=8 .
Check errpt for network problems. Check stats with netstat -s, netstat -m
and nfsstat.
Regards,
Bassel Bekdache


> > image's drives mounted (tabletop).  Also 'ls' seems
> > to work fine but when I use 'ls -al' it takes several minutes
> > to come back (what is the difference?). This happens on 'tabletop'
> > (the main box) and 'image' (legacy names).

> When I've seen this problem it was related to NFS and TCP/IP.  For some
> reason under AIX 3.2 there is a problem when you take a machine down to
> "init 0" and are running NFS.  This problem was corrected with AIX 4.  The
> solution for AIX 3.2 is to execute /etc/nfs.clean immediately before
> shutting down or immediately before running "init 0".  Check through the
> crontab processes that run (or finish) around 7PM.  See if the machine is
> being initialized to "init 0", or if TCP/IP is shutting down.  Try adding
> the /etc/nfs.clean to the script immediately before the "init 0".  I know
> there are other things that might cause this but this is what I ended up
> doing and it did solve the problem.

 
 
 

1. Confirmed D2 bug: cached updates don't work for master/detail - here's why

By tracing through the D2 VCL, I have uncovered a flaw in the design of cached
updates and am working on a solution to same.

The problem exists when updating tables having a master/detail relationship.  
It is necessary to update the master first, then the details (e.g. when
inserting a new tree of records), but when you do so, the detail tables are
requeried and the updates are lost.

Specifically, the flaw goes like this.  Let's assume for simplicity that the
master table is unmodified, and only the detail has updates pending:

(1)  Application calls DB.ApplyUpdates.  
(2)  DB calls the ApplyUpdates routine for the first table (master).
(3)  TDataset.ApplyUpdates calls ProcessUpdates(dbiUpdatePrepare)
(4)  In our case, nothing interesting happens until ProcessUpdates calls
Resync().
(5)  Nothing interesting happens in Resync until DataEvent(DataSetChange, 0).
(6)  However, at that time, all of the subordinate datasets are requeried!  
(You can trace right to the culprit statement if you want to but it would make
this memo much longer.)

The changes, inserts, etc. you made to the subordinate tables are now *gone.*  
The "UpdateStatus" of the table becomes dsUnmodified and no further cached
update queries are issued -- at all! -- against them.

Naturally, I am stunned and dismayed that Borland permitted the code to go out
in this condition.  I think that an hour's worth of honest-to-God testing
would have revealed this flaw, but it cost me two days.  Let's just say that I
don't think I paid $1,800 for this piece of software to _ever_ have this
experience, and no one else did, either.

If there's an update-fix CD-ROM that contains a solution to this problem, if
it's a known problem, then I hope one will arrive at my door and that it will
be complimentary.

/mr/
(602) 946-8259; fax (602) 874-2068

2. help with select statement

3. Sometimes it works, sometimes it doesn't

4. identifying tables with identity columns

5. Sometimes they display...sometimes they don't

6. Anyone know what '-1' means in sp_who?

7. File Save -- sometimes works and sometimes doesn't

8. JDBC Oracle XA and ConnectionPool documentation ?

9. 'ROUND' works sometimes

10. WIN: Why Can't detail table be edited in 1-1

11. Why named pipes connect slow and sometimes quick

12. Numeric Output problem with 'MS ODBC Driver to Oracle'/Access'97/Oracle

13. Why it's soooo slow