Hello. I work for an ISP in the San Diego area. We're currently in the process of *trying* to upgrade from Apache 1.2.6 to 1.3.4.
Our web servers are as follows:
Pentium II
256MBs RAM
FreeBSD 2.2.8-STABLE
Apache 1.3.4
Serving multiple hundreds of virtually hosted sites
There has been a consistent problem throughout all Apache 1.3.x distributions in that after a few hours of live operation (we've
actually tested it feeding customer sites since we can't seem to reproduce the problem without extensive load), chilld httpd
processes get stuck in Disk operation, and do not die even when the parent httpd process is killed. This is bad since system
resources will eventually run out (which we have seen happen). Especially considering that each httpd process takes up almost 9MB
of ram (see SIZE field of top output, below), which is an anamoly in of itself since the 1.2.6 processes are less than 2MB.
Furthermore, trying to manually kill the child processes with kill -9, after the parent httpd is killed, does not work. Ther server
must be rebooted.
The problem *appears* to be related to NFS in that the processes, when they stick, are in something called 'nfsvin' state
(consistent with a disk operation). Compare this to 1.2.6 which doesn't have this problem, whatever, on an identically configured
machine (we have both versions of apache installed on the same machine, and can quickly revert from one to the other).
I guess my questions are: what is done differently between 1.2.6 and 1.3.4 that would have this affect, possibly specifically
relating to NFS on FreeBSD? Has anyone else seen this behaviour on a similarly configured machine? If so, did you fix it? And if
so, how?
Matthew H. North
CTSNET Assistant Administrator
___ ____ ___
/ / /__ Network | Tel 619/637-3600 | Mail: cts...@cts.com
/__ / ___/ Services | Fax 619/637-3630 | WWW: http://www.cts.com
Following are the output from top and ps for both apache 1.3.4 and 1.2.6 on the same machine.
APACHE 1.3.4 top output: Notice the SIZE field for each httpd process, and the numerous httpd processes in 'nfsvin' state. Each of
these 'nfsvin' processes has a -4 priority...
last pid: 2829; load averages: 0.07, 0.10, 0.08
14:41:21
267 processes: 1 running, 266 sleeping
CPU states: 4.7% user, 0.0% nice, 4.3% system, 3.1% interrupt, 87.9%
idle
Mem: 121M Active, 36M Inact, 47M Wired, 45M Cache, 8347K Buf, 1072K Free
Swap: 256M Total, 2084K Used, 254M Free, 1% Inuse
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
80 root 2 0 736K 544K select 160:01 0.76% 0.76% routed
29847 xnouser 2 0 8884K 5180K sbwait 0:02 0.08% 0.08% httpd
1626 xnouser 2 0 8884K 5140K sbwait 0:01 0.08% 0.08% httpd
2763 xnouser 2 0 8884K 4756K sbwait 0:00 0.08% 0.08% httpd
2826 root 28 0 780K 1036K RUN 0:00 0.19% 0.08% top
29292 xnouser 2 0 8892K 5360K sbwait 0:02 0.04% 0.04% httpd
1631 xnouser 2 0 8884K 4996K sbwait 0:01 0.04% 0.04% httpd
2077 xnouser 2 0 8884K 4876K sbwait 0:01 0.04% 0.04% httpd
2790 xnouser 2 0 8884K 4720K accept 0:00 0.04% 0.04% httpd
2827 xnouser 2 0 8884K 4692K accept 0:00 0.21% 0.04% httpd
121 root 10 0 220K 16K nfsidl 5:31 0.00% 0.00% nfsiod
236 root 18 0 1896K 1072K pause 3:51 0.00% 0.00% perl
198 root 2 0 1820K 1248K select 3:47 0.00% 0.00% perl
107 root 2 -12 436K 400K select 1:22 0.00% 0.00% xntpd
11555 root 2 0 8884K 4212K select 1:21 0.00% 0.00% httpd
122 root 10 0 220K 16K nfsidl 0:57 0.00% 0.00% nfsiod
98 root 2 0 204K 328K select 0:34 0.00% 0.00% syslogd
145 root 2 0 212K 212K select 0:27 0.00% 0.00% inetd
184 root 2 0 340K 472K select 0:23 0.00% 0.00% sshd1
111 daemon 2 0 180K 360K select 0:19 0.00% 0.00% portmap
11569 root 2 0 3824K 1960K select 0:11 0.00% 0.00% httpsd
148 root 18 0 344K 388K pause 0:06 0.00% 0.00% cron
23916 xnouser -4 0 8836K 4280K nfsvin 0:04 0.00% 0.00% httpd
23915 xnouser -4 0 8836K 4232K nfsvin 0:04 0.00% 0.00% httpd
23857 xnouser -4 0 8836K 4224K nfsvin 0:04 0.00% 0.00% httpd
123 root 10 0 220K 16K nfsidl 0:03 0.00% 0.00% nfsiod
23871 xnouser -4 0 8836K 4340K nfsvin 0:04 0.00% 0.00% httpd
23865 xnouser -4 0 8836K 4336K nfsvin 0:04 0.00% 0.00% httpd
23847 xnouser -4 0 8900K 4284K nfsvin 0:04 0.00% 0.00% httpd
23843 xnouser -4 0 8844K 4248K nfsvin 0:04 0.00% 0.00% httpd
23863 xnouser -4 0 8836K 4236K nfsvin 0:03 0.00% 0.00% httpd
23853 xnouser -4 0 8900K 4196K nfsvin 0:04 0.00% 0.00% httpd
23874 xnouser -4 0 8836K 4180K nfsvin 0:04 0.00% 0.00% httpd
Apache 1.3.4 ps output: again, notice the STAT field for httpd processes is D:
PID TT STAT TIME COMMAND
0 ?? DLs 0:01.20 (swapper)
1 ?? Is 0:00.70 /sbin/init --
2 ?? DL 0:10.92 (pagedaemon)
3 ?? DL 0:00.00 (vmdaemon)
4 ?? DL 20:03.06 (update)
25 ?? Is 0:00.00 adjkerntz -i
80 ?? Ss 160:02.57 routed -P ripv2
98 ?? Is 0:33.89 syslogd
107 ?? S<s 1:22.03 xntpd -p /var/run/xntpd.pid
111 ?? Is 0:18.96 portmap
121 ?? I 5:30.99 nfsiod -n 8
122 ?? I 0:56.78 nfsiod -n 8
123 ?? I 0:03.48 nfsiod -n 8
124 ?? I 0:01.38 nfsiod -n 8
125 ?? I 0:00.91 nfsiod -n 8
126 ?? I 0:00.76 nfsiod -n 8
127 ?? I 0:00.68 nfsiod -n 8
128 ?? I 0:00.62 nfsiod -n 8
145 ?? Ss 0:27.47 inetd
148 ?? Is 0:06.07 cron
152 ?? Is 0:00.46 /usr/sbin/sendmail -q30m
184 ?? Is 0:22.55 /usr/local/sbin/sshd (sshd1)
649 ?? D 0:00.42 /usr/local/apache/sbin/httpd
2734 ?? Ss 0:00.09 telnetd
23861 ?? D 0:03.55 /usr/local/apache/sbin/httpd
28520 ?? D 0:00.01 /usr/local/apache/sbin/httpd
29296 ?? D 0:00.94 /usr/local/apache/sbin/httpd
2735 p0 Is 0:00.07 -bash (bash)
2738 p0 S 0:00.13 su (bash)
2864 p0 R+ 0:00.00 ps -ax
2865 p0 R+ 0:00.00 su (bash)
218 v0 Is 0:00.07 -bash (bash)
236 v0 I+ 3:51.02 /usr/local/bin/perl /web/webmaster/bin/trafmon
219 v1 Is+ 0:00.22 -bash (bash)
220 v2 Is+ 0:00.01 /usr/libexec/getty Pc ttyv2
221 v3 Is+ 0:00.01 /usr/libexec/getty Pc ttyv3
222 v4 Is+ 0:00.01 /usr/libexec/getty Pc ttyv4
223 v5 Is+ 0:00.01 /usr/libexec/getty Pc ttyv5
224 v6 Is+ 0:00.01 /usr/libexec/getty Pc ttyv6
APACHE 1.2.6 top output: Notice the SIZE field at ~2000K, and the httpd process state in 'sbwait':
last pid: 577; load averages: 0.16, 0.10, 0.03
14:54:43
100 processes: 2 running, 98 sleeping
CPU states: 10.9% user, 0.0% nice, 4.6% system, 1.5% interrupt, 83.0%
idle
Mem: 24M Active, 5104K Inact, 18M Wired, 8344K Buf, 204M Free
Swap: 256M Total, 64K Used, 256M Free
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
80 root 2 0 680K 564K select 0:07 1.98% 1.98% routed
573 xnouser 2 0 1988K 2136K sbwait 0:00 0.63% 0.11%
httpd_1.2.6fp-
569 xnouser 2 0 1988K 2152K sbwait 0:00 0.12% 0.04%
httpd_1.2.6fp-
503 xnouser 2 0 1988K 2148K sbwait 0:00 0.04% 0.04%
httpd_1.2.6fp-
548 xnouser 2 0 1988K 2148K select 0:00 0.04% 0.04%
httpd_1.2.6fp-
505 xnouser 2 0 1980K 2144K select 0:00 0.04% 0.04%
httpd_1.2.6fp-
501 xnouser 2 0 1980K 2136K select 0:00 0.04% 0.04%
httpd_1.2.6fp-
557 xnouser 2 0 1988K 2128K sbwait 0:00 0.06% 0.04%
httpd_1.2.6fp-
479 root 18 0 1968K 2016K pause 0:00 0.00% 0.00%
httpd_1.2.6fp-
489 root 18 0 1076K 1216K pause 0:00 0.00% 0.00%
httpd_1.2.6sf
152 root 18 0 576K 876K pause 0:00 0.00% 0.00% sendmail
148 root 18 0 344K 600K pause 0:00 0.00% 0.00% cron
25 root 18 0 208K 92K pause 0:00 0.00% 0.00% adjkerntz
218 root 10 0 752K 848K wait 0:00 0.00% 0.00% bash
239 root 10 0 744K 832K wait 0:00 0.00% 0.00% bash
236 mdavis 10 0 744K 828K wait 0:00 0.00% 0.00% bash
1 root 10 0 512K 248K wait 0:00 0.00% 0.00% init
128 root 10 0 220K 96K nfsidl 0:00 0.00% 0.00% nfsiod
127 root 10 0 220K 96K nfsidl 0:00 0.00% 0.00% nfsiod
123 root 10 0 220K 96K nfsidl 0:00 0.00% 0.00% nfsiod
121 root 10 0 220K 96K nfsidl 0:00 0.00% 0.00% nfsiod
126 root 10 0 220K 96K nfsidl 0:00 0.00% 0.00% nfsiod
122 root 10 0 220K 96K nfsidl 0:00 0.00% 0.00% nfsiod
124 root 10 0 220K 96K nfsidl 0:00 0.00% 0.00% nfsiod
125 root 10 0 220K 96K nfsidl 0:00 0.00% 0.00% nfsiod
222 root 3 0 180K 568K ttyin 0:00 0.00% 0.00% getty
219 root 3 0 180K 568K ttyin 0:00 0.00% 0.00% getty
220 root 3 0 180K 568K ttyin 0:00 0.00% 0.00% getty
221 root 3 0 180K 568K ttyin 0:00 0.00% 0.00% getty
223 root 3 0 180K 568K ttyin 0:00 0.00% 0.00% getty
224 root 3 0 180K 568K ttyin 0:00 0.00% 0.00% getty
520 xnouser 2 0 2004K 2204K sbwait 0:00 0.00% 0.00%
httpd_1.2.6fp-
544 xnouser 2 0 2004K 2200K sbwait 0:00 0.00% 0.00%
httpd_1.2.6fp-
Here's an extended output from ps showing Apache 1.3.4 servers. Notice how
many of them were in a 'D' (disk) operation. With apache 1.2.6, all of them
are either in an 'S' state or 'I' (which I'm sure means idle).
xnouser 174 11555 11555 5d36500 0 D ?? 0:01.81
/usr/local/apache/sbin/httpd
xnouser 175 11555 11555 5d36500 0 D ?? 0:01.86
/usr/local/apache/sbin/httpd
xnouser 176 11555 11555 5d36500 0 S ?? 0:02.20
/usr/local/apache/sbin/httpd
xnouser 177 11555 11555 5d36500 0 D ?? 0:01.09
/usr/local/apache/sbin/httpd
xnouser 178 11555 11555 5d36500 0 D ?? 0:00.87
/usr/local/apache/sbin/httpd
xnouser 180 11555 11555 5d36500 0 D ?? 0:00.29
/usr/local/apache/sbin/httpd
xnouser 186 11555 11555 5d36500 0 D ?? 0:00.91
/usr/local/apache/sbin/httpd
xnouser 189 11555 11555 5d36500 0 D ?? 0:00.85
/usr/local/apache/sbin/httpd
xnouser 191 11555 11555 5d36500 0 D ?? 0:00.23
/usr/local/apache/sbin/httpd
xnouser 199 11555 11555 5d36500 0 D ?? 0:00.21
/usr/local/apache/sbin/httpd
xnouser 213 11555 11555 5d36500 0 D ?? 0:00.06
/usr/local/apache/sbin/httpd
xnouser 291 11555 11555 5d36500 0 D ?? 0:00.36
/usr/local/apache/sbin/httpd
xnouser 292 11555 11555 5d36500 0 S ?? 0:02.05
/usr/local/apache/sbin/httpd
xnouser 293 11555 11555 5d36500 0 D ?? 0:00.40
/usr/local/apache/sbin/httpd
xnouser 295 11555 11555 5d36500 0 D ?? 0:00.62
/usr/local/apache/sbin/httpd
xnouser 296 11555 11555 5d36500 0 D ?? 0:00.74
/usr/local/apache/sbin/httpd
xnouser 353 11555 11555 5d36500 0 D ?? 0:00.04
/usr/local/apache/sbin/httpd
xnouser 444 11555 11555 5d36500 0 D ?? 0:00.40
/usr/local/apache/sbin/httpd
xnouser 445 11555 11555 5d36500 0 D ?? 0:01.72
/usr/local/apache/sbin/httpd
xnouser 446 11555 11555 5d36500 0 D ?? 0:00.30
/usr/local/apache/sbin/httpd
xnouser 608 11555 11555 5d36500 0 S ?? 0:01.34
/usr/local/apache/sbin/httpd
xnouser 609 11555 11555 5d36500 0 S ?? 0:01.62
/usr/local/apache/sbin/httpd
xnouser 611 11555 11555 5d36500 0 D ?? 0:00.24
/usr/local/apache/sbin/httpd
xnouser 614 11555 11555 5d36500 0 D ?? 0:00.04
/usr/local/apache/sbin/httpd
xnouser 648 11555 11555 5d36500 0 D ?? 0:00.07
/usr/local/apache/sbin/httpd
xnouser 649 11555 11555 5d36500 0 D ?? 0:00.42
/usr/local/apache/sbin/httpd
xnouser 651 11555 11555 5d36500 0 D ?? 0:00.27
/usr/local/apache/sbin/httpd
xnouser 652 11555 11555 5d36500 0 D ?? 0:00.03
/usr/local/apache/sbin/httpd
xnouser 653 11555 11555 5d36500 0 D ?? 0:01.34
/usr/local/apache/sbin/httpd
xnouser 742 11555 11555 5d36500 0 S ?? 0:01.76
/usr/local/apache/sbin/httpd
xnouser 803 11555 11555 5d36500 0 D ?? 0:00.19
/usr/local/apache/sbin/httpd
xnouser 821 11555 11555 5d36500 0 D ?? 0:00.04
/usr/local/apache/sbin/httpd
xnouser 838 11555 11555 5d36500 0 D ?? 0:00.03
/usr/local/apache/sbin/httpd
xnouser 839 11555 11555 5d36500 0 D ?? 0:00.04
/usr/local/apache/sbin/httpd
The only errors found in the log file came when apache couldn't kill it's 'stuck' children after the parent httpd process received a
kill signal... Note that kill -9 won't kill these, either. The only thing that does the job is a reboot.
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 23857 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 23871 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 23861 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 29296 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 29299 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 29302 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 29306 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 29308 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 27638 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 29317 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 28520 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 178 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 29557 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 444 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 353 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 648 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 649 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 1025 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 1316 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 1322 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 2361 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 2411 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 2533 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 2634 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 2637 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:42 1999] [error] child process 23861 still did not exit, sending a SIGKILL
httpd: [Sun Mar 14 14:41:42 1999] [error] child process 29296 still did not exit, sending a SIGKILL
httpd: [Sun Mar 14 14:41:42 1999] [error] child process 28520 still did not exit, sending a SIGKILL
httpd: [Sun Mar 14 14:41:42 1999] [error] child process 649 still did not exit, sending a SIGKILL
httpd: [Sun Mar 14 14:41:59 1999] [error] could not make child process 23861 exit, attempting to continue anyway
httpd: [Sun Mar 14 14:41:59 1999] [error] could not make child process 29296 exit, attempting to continue anyway
httpd: [Sun Mar 14 14:41:59 1999] [error] could not make child process 28520 exit, attempting to continue anyway
httpd: [Sun Mar 14 14:41:59 1999] [error] could not make child process 649 exit, attempting to continue anyway
httpd: [Sun Mar 14 14:41:59 1999] [notice] caught SIGTERM, shutting down
Matthew H. North
CTSNET Assistant Administrator
___ ____ ___
/ / /__ Network | Tel 619/637-3600 | Mail: cts...@cts.com
/__ / ___/ Services | Fax 619/637-3630 | WWW: http://www.cts.com