Unresponsive NFS server, high load, 100% in kernel

Post by Daniel Rychcik » Tue, 03 May 2005 07:19:43



Hi all,

I am having a very strange problem, and after a week of browsing through the
archives, Cockcroft's book, and the 'NFS performance tuning' guide on the Sun
website, trying various tricks found here and there, I still can't get it any
better :(

There is a NFS server (SunFire 280R, 2CPU, 1GB RAM) with two T3+ arrays
configured with RAID5 (no LVM, just one ~500GB partition on each). It is
connected through a gigabit network interface dedicated for NFS, all the other
networking is done through eri0. Machine runs Solaris 8. It serves as a
workarea for some scientific number-crunching in a private operations network,
not connected to the Internet. It is accessed by several (10-20, typically
around 10) computational machines, mostly V100's and V240's, running the
same OS. It's been running flawlessly for years, but recently 'something' has
changed:

For a few weeks now, from time to time (several times a day), the server goes nuts:

- the load slowly increases, up to 15 or more

- 100% of this load is in kernel (as shown by top/prstat/sar). And I mean it:
  this is not just a "majority" or something - it is a real 100% kernel time
  for at least a few minutes (looking at sar/prstat)

- all of this builds up over a few minutes (up to an hour), until the machine
  is almost unresponsive (it doesn't even answer pings) for a while

- then suddenly there is a 'cut' and everything goes back to normal - load
  <1.0, reasonable user/sys/iowait proportions etc.

I checked the following possible reasons during the problematic/normal
periods (a rough sketch of the commands follows the list):

- I/O is OK: almost no iowait, less than 10 MB/s of disk usage (iostat -xtcpn),
  not much different from normal work. The T3s show a typical throughput
  of tens of MB/s when tested locally with 'dd bs=1048576'

- the network shows no errors, overload, etc. (netstat -i)

- no lack of memory, swap barely touched

- no suspicious messages in the log, neither during the normal work, nor
  after the reboot. The T3+'s are also OK.

- tried to compare 'sar -a' statistics for a) a few hours without a problem
  and b) a period that was problematic. I lack a graphical way of
  analyzing sar output, but from what I understand of the numbers, there
  was not much difference apart from the system load

- tried adjusting DNLC size (see below)

- asked the users if they have any logs showing that at this 'cut' point
  (when everything goes back to normal) something had
  finished/failed/restarted/etc. - nothing.

- I have even tried to enable verbose NFS logging and compared the NFS usage
  pattern during the problematic/non-problematic system state - not much
  difference (looking at the areas used, client machines, percentage of
  different call types etc.). Of course, NFS worked a bit slower, but the
  same kinds of operations were done.
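
For reference, the kind of capture behind the observations above looks
roughly like this (a sketch - the 5-second intervals are arbitrary, and
mpstat is an addition here; its smtx/xcal columns are often the first hint
of kernel lock contention):

  # iostat -xtcpn 5      per-device I/O plus the CPU breakdown
  # mpstat 5             per-CPU sys time, cross-calls (xcal), mutex spins (smtx)
  # netstat -i 5         packet and error counts on the interfaces
  # sar -u 5 60          usr/sys/wio/idle over five minutes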

What is unusual here is that the NFS-served disk contains a few big
directories with lots (several thousand) of zero-length files that have very
long (>100-character) names. This place is used as a scoreboard to synchronize
various parts of the processing. Therefore, I tried to adjust the directory
name cache:

At first, vmstat -s showed that the DNLC hit ratio was very small, around
15%. I increased 'ncsize' with adb (step by step, up to a hundred times
the initial value); the hit ratio got better (over 50%), but the problem
stays the same. Apart from that, the DNLC is supposed to hold only entries
shorter than ~30 characters, isn't it? So in theory this shouldn't affect
the issue.
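
A sketch of those steps, for the record (the value 0t131072 is only an
illustration; live-patching a running kernel with adb is risky, and
/etc/system is the safe, reboot-time way):

  # vmstat -s | grep 'name lookups'               current DNLC hit rate
  # echo "ncsize/D" | adb -k /dev/ksyms /dev/mem  read the current size
  # adb -kw /dev/ksyms /dev/mem                   patch the running kernel:
  ncsize/W 0t131072
  # echo "set ncsize=131072" >> /etc/system       persistent after reboot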

I am aware that a 50% DNLC hit rate is referred to as 'very bad, look into it'
- but hey, it worked in this setup flawlessly for years :)

Obviously something has changed in the way this machine is used, and if I find
_what_ is causing this behavior, I will be able to discuss with the users
how to get around it. It's just that I don't know - and neither do they :(

Does anybody have any idea what else I should look into?

                                                                Daniel
--
\ Daniel Rychcik     INTEGRAL Science Data Centre, Versoix/Geneve, CH
 \--------------------------------------------------------------------
  \  GCM/CS/MU/M d- s++:+ a- C+++$ US+++$ P+>++ L+++$ E--- W++ N++ K-
   \ w- O- M PS+ PE Y+ PGP t+ 5 X- R tv b+ D++ G+ e+++ h--- r+++ y+++


Unresponsive NFS server, high load, 100% in kernel

Post by Daniel Rychcik » Wed, 04 May 2005 05:18:49



Quote:> Bad memory or bad CPU. Any crashes? Does the machine come back OK
> when you power it down (and all the peripherals), then up again?  

None of these. The machine passes the long diagnostics (after a
poweroff/poweron with the key in the 'diag' position) like a charm.
There is also no indication of CPU/memory failure during the operation,
no crashes, not even a coredump. No air-conditioning problems either.

I would just like to know what could be the possible reason for this
prolonged '100% kernel' thing.

                                                                Daniel
--
\ Daniel Rychcik     INTEGRAL Science Data Centre, Versoix/Geneve, CH
 \--------------------------------------------------------------------
  \  GCM/CS/MU/M d- s++:+ a- C+++$ US+++$ P+>++ L+++$ E--- W++ N++ K-
   \ w- O- M PS+ PE Y+ PGP t+ 5 X- R tv b+ D++ G+ e+++ h--- r+++ y+++


Unresponsive NFS server, high load, 100% in kernel

Post by Jorgen Moquis » Wed, 04 May 2005 08:20:06




>>Bad memory or bad CPU. Any crashes? Does the machine come back OK
>>when you power it down (and all the peripherals), then up again?  

> None of these. The machine passes the long diagnostics (after a
> poweroff/poweron with the key in the 'diag' position) like a charm.
> There is also no indication of CPU/memory failure during the operation,
> no crashes, not even a coredump. No air-conditioning problems either.

> I would just like to know what could be the possible reason for this
> prolonged '100% kernel' thing.

>                                                                 Daniel

Just a guess: a bad CD or a bad CD player can mess up a system quite well?
/jørgen

Unresponsive NFS server, high load, 100% in kernel

Post by Paul » Wed, 04 May 2005 18:26:07


Daniel Rychcik wrote:

Quote:> I would just like to know what could be the possible reason for this
> prolonged '100% kernel' thing.

lockstat(1M) may be useful ...
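
For example (a sketch - the profiling form is taken from the lockstat(1M)
examples; the 30 s duration is arbitrary), run while the load is climbing:

  # lockstat sleep 30              lock-contention summary for 30 s
  # lockstat -kIW -D 20 sleep 30   kernel profiling: where the CPU time goes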

Unresponsive NFS server, high load, 100% in kernel

Post by Michael Schreibe » Wed, 04 May 2005 20:30:06


please send an

# nfsstat

output
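
The RPC-level counters are usually the telling ones (a sketch; counter
names per nfsstat(1M)):

  # nfsstat -rs    server-side RPC stats: badcalls, dupchecks, dupreqs
  # nfsstat -rc    on a client: retrans and badxid hint at lost replies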

mike


Unresponsive NFS server, high load, 100% in kernel

Post by Daniel Rychcik » Sun, 08 May 2005 06:15:37



Quote:> please send an
> # nfsstat
> output

Tried this (and looked at the lockstat statistics, too) without
success. However, someone suggested that I could try changing NFS
from UDP to TCP for this particular area. And somehow, it seems
to do the trick! At least there has been just one load 'spike' during
the past few days, as opposed to several times a day before the change.
(The software guys swear that they didn't change anything in the
NFS-loading computations.)

I will keep an eye on this.
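
For the record, the change amounts to something like this on each client
(a sketch - the server name and paths are placeholders; NFSv3 assumed):

  # umount /workarea
  # mount -F nfs -o proto=tcp,vers=3 nfsserver:/export/workarea /workarea

or, persistently, in the client's /etc/vfstab:

  nfsserver:/export/workarea - /workarea nfs - yes proto=tcp,vers=3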

                                                                Daniel
--
\ Daniel Rychcik     INTEGRAL Science Data Centre, Versoix/Geneve, CH
 \--------------------------------------------------------------------
  \  GCM/CS/MU/M d- s++:+ a- C+++$ US+++$ P+>++ L+++$ E--- W++ N++ K-
   \ w- O- M PS+ PE Y+ PGP t+ 5 X- R tv b+ D++ G+ e+++ h--- r+++ y+++
 

Unresponsive NFS server, high load, 100% in kernel

Post by Greg Menk » Mon, 09 May 2005 02:06:52




> > please send an
> > # nfsstat
> > output

> Tried this (and looking at the lockstat statistics, too) without
> success. However, someone suggested that I could try changing NFS
> from UDP to TCP for this particular area. And somehow, it seems
> to do the trick! At least there was just one load 'spike' during
> past few days, contrary to several times a day before the change.
> (The software guys swear that they didn't change anything in the
> NFS-loading computations).

> I will keep an eye on this.

I had a problem vaguely similar to this once on a Linux box - little
relation, I know - and it turned out to be a flaky LAN board. Before
swapping it out I tried all kinds of buffering and VM config tweaks, which
didn't help in the least.

Gregm


Related thread: Surprisingly high load averages on an NFS server

I've recently set up an NFS server and I'm seeing strange behavior I
can't explain. It's a Red Hat 8.0 system, dual P3 733 with 1 GB of RAM
and two 15K RPM SCSI disks set up as a software RAID, acting as an NFS
server and a mail server and nothing else. There is only one NFS share,
although quite sizable, about 500 MB of stuff, with only two clients so
far, but they read/write files to the server every second, 24/7. Ethernet
traffic on the dedicated NFS port is about 100 Kb per second on average,
so I would say it is quite busy but not too much. The mail server is
practically not loaded at all, maybe 20-30 mails per hour.

So here's the strange part: I've set up MRTG to monitor load averages
too, and the graph looks like this - very low LA for 20-60 hours
straight (~0.1), then a gradual (practically linear) increase over
several hours to 1.2-1.5 LA; it stays fairly constant at that level for
a few hours, then drops sharply back to the 0.1 level for another 20-60
hours. If I run "top -id1" during the high-LA period, it doesn't show
any active processes (besides top itself) and the CPU idle is at 99%.
There is nothing in the log files to indicate any kind of problem.
Restarting NFS doesn't help to lower the high LA. The NFS clients
produce a fairly consistent load on the server, and there seems to be no
correlation between the load on the clients and the load on the server.
Any ideas what it could be and whether I should be concerned about it at
all?
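
One plausible angle (an assumption, not a diagnosis): on Linux the load
average counts tasks in uninterruptible sleep ('D' state, typically disk
or NFS waits) as well as runnable ones, so nfsd threads blocked on I/O can
raise the LA while the CPU sits idle. Something like this would show them:

  # ps -eo state,pid,comm | awk '$1 == "D"'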
