Hi all,
I am having a very strange problem, and after a week of browsing through the
archives, Cockcroft's book, the 'NFS performance tuning' guide on the Sun
website, and trying various tricks found here and there, it still isn't any
better :(
There is an NFS server (SunFire 280R, 2 CPUs, 1GB RAM) with two T3+ arrays
configured with RAID5 (no LVM, just one ~500GB partition on each). It is
connected through a gigabit network interface dedicated to NFS; all the other
networking is done through eri0. The machine runs Solaris 8. It serves as a
workarea for some scientific number-crunching in a private operations network,
not connected to the Internet. It is accessed by several (10-20, typically
around 10) computational machines, mostly V100's and V240's, running the
same OS. It's been running flawlessly for years, but recently 'something' has
changed:
For a few weeks now, from time to time (several times a day) the server goes nuts:
- the load slowly increases, up to 15 or more
- 100% of this load is in the kernel (as shown by top/prstat/sar). And I mean
  it: this is not just a "majority" or something - it is a real 100% kernel
  time for at least a few minutes (looking at sar/prstat)
- all of this builds up over a few minutes (sometimes up to an hour), until
  the machine is almost unresponsive (doesn't even answer pings) for a while
- then suddenly there is a 'cut' and everything goes back to normal - load
<1.0, reasonable user/sys/iowait proportions etc.
I checked the following possible reasons during the problematic/normal
periods:
- I/O is OK: almost no iowait, less than 10MB/s of disk traffic
  (iostat -xtcpn), not much different from normal work. The T3s show their
  typical throughput of tens of MB/s when tested locally with
  'dd bs=1048576' (the exact commands are sketched below, after this list)
- the network shows no errors, no overload, etc. (netstat -i)
- no lack of memory, swap barely touched
- no suspicious messages in the logs, neither during normal operation nor
  after a reboot. The T3+'s are also OK.
- tried to compare 'sar -a' statistics for a) a few hours without the problem
  and b) a problematic period. I lack a graphical way of analyzing sar
  output, but from what I understand of the numbers there was not much
  difference, apart from the system load
- tried adjusting DNLC size (see below)
- asked the users whether they have any logs showing that at this 'cut'
  point (when everything goes back to normal) something has
  finished/failed/restarted/etc. - nothing.
- I have even enabled verbose NFS logging and compared the NFS usage pattern
  between the problematic and non-problematic states - not much difference
  (looking at the areas accessed, the client machines, the percentage of
  different call types, etc.). Of course, NFS was a bit slower, but the same
  kinds of operations were being done.
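For reference, the checks above boil down to roughly the following; the
intervals, counts and the device name are just examples of what I happened
to run, nothing special:

  # CPU split, per-disk I/O and interface counters during a bad period
  sar -u 5 60
  iostat -xtcpn 5
  netstat -i 5
  prstat 5

  # raw sequential read from one T3+ LUN, done locally on the server
  # (device path below is a placeholder)
  dd if=/dev/rdsk/cXtYdZs0 of=/dev/null bs=1048576 count=1000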
What is unusual here is that the NFS-served disk contains a few big
directories with lots (several thousand) of zero-length files that have very
long (>100 character) names. This place is used as a scoreboard to
synchronize various parts of the processing. Therefore, I tried to adjust
the directory name lookup cache:
At first, vmstat -s showed that the DNLC hit ratio was very low, around
15%. I increased 'ncsize' with adb (step by step, up to a hundred times the
initial value); the hit ratio got better (over 50%), but the problem stays
the same. Apart from that, the DNLC is supposed to hold only entries shorter
than 30 characters, isn't it? - so theoretically this shouldn't affect the
issue anyway.
I am aware that a 50% DNLC hit rate is usually described as 'very bad, look
into it' - but hey, this setup has worked flawlessly for years :)
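For completeness, the DNLC poking was roughly this (the value shown is just
one of the steps I tried, not a recommendation):

  # current ncsize
  echo "ncsize/D" | adb -k /dev/ksyms /dev/mem

  # bump it on the running kernel (0t = decimal in adb)
  echo "ncsize/W 0t68000" | adb -kw /dev/ksyms /dev/mem

  # the equivalent /etc/system line, to make it stick across reboots:
  # set ncsize=68000

  # DNLC hit ratio
  vmstat -s | grep 'name lookups'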
Obviously something has changed in the way this machine is used, and if I
find out _what_ is causing this behavior, I will be able to discuss with the
users how to work around it. It's just that I don't know - and neither do
they :(
Does anybody have any idea what else I should look into?
Daniel
--
\ Daniel Rychcik INTEGRAL Science Data Centre, Versoix/Geneve, CH
\--------------------------------------------------------------------
\ GCM/CS/MU/M d- s++:+ a- C+++$ US+++$ P+>++ L+++$ E--- W++ N++ K-
\ w- O- M PS+ PE Y+ PGP t+ 5 X- R tv b+ D++ G+ e+++ h--- r+++ y+++