Critically anamolous 'stuck' httpd processes on Apache1.3.x

Critically anamolous 'stuck' httpd processes on Apache1.3.x

Post by cts.. » Tue, 16 Mar 1999 04:00:00



Hello.  I work for an ISP in the San Diego area.  We're currently in the process of *trying* to upgrade from Apache 1.2.6 to 1.3.4.
Our web servers are as follows:

Pentium II
256MBs RAM
FreeBSD 2.2.8-STABLE
Apache 1.3.4
Serving multiple hundreds of virtually hosted sites

There has been a consistent problem throughout all Apache 1.3.x distributions in that after a few hours of live operation (we've
actually tested it feeding customer sites since we can't seem to reproduce the problem without extensive load), chilld httpd
processes get stuck in Disk operation, and do not die even when the parent httpd process is killed.  This is bad since system
resources will eventually run out (which we have seen happen).  Especially considering that each httpd process takes up almost 9MB
of ram (see SIZE field of top output, below), which is an anamoly in of itself since the 1.2.6 processes are less than 2MB.
Furthermore, trying to manually kill the child processes with kill -9, after the parent httpd is killed, does not work.  Ther server
must be rebooted.

The problem *appears* to be related to NFS in that the processes, when they stick, are in something called 'nfsvin' state
(consistent with a disk operation).  Compare this to 1.2.6 which doesn't have this problem, whatever, on an identically configured
machine (we have both versions of apache installed on the same machine, and can quickly revert from one to the other).

I guess my questions are: what is done differently between 1.2.6 and 1.3.4 that would have this affect, possibly specifically
relating to NFS on FreeBSD?  Has anyone else seen this behaviour on a similarly configured machine?  If so, did you fix it?  And if
so, how?

Matthew H. North
CTSNET Assistant Administrator
   ___  ____  ___
  /      /   /__   Network | Tel 619/637-3600 | Mail: cts...@cts.com
 /__    /   ___/  Services | Fax 619/637-3630 | WWW: http://www.cts.com

Following are the output from top and ps for both apache 1.3.4 and 1.2.6 on the same machine.

APACHE 1.3.4 top output: Notice the SIZE field for each httpd process, and the numerous httpd processes in 'nfsvin' state.  Each of
these 'nfsvin' processes has a -4 priority...

last pid:  2829;  load averages:  0.07,  0.10,  0.08
14:41:21
267 processes: 1 running, 266 sleeping
CPU states:  4.7% user,  0.0% nice,  4.3% system,  3.1% interrupt, 87.9%
idle
Mem: 121M Active, 36M Inact, 47M Wired, 45M Cache, 8347K Buf, 1072K Free
Swap: 256M Total, 2084K Used, 254M Free, 1% Inuse

  PID USERNAME PRI NICE SIZE    RES STATE    TIME   WCPU    CPU COMMAND
   80 root      2   0   736K   544K select 160:01  0.76%  0.76% routed
29847 xnouser   2   0  8884K  5180K sbwait   0:02  0.08%  0.08% httpd
 1626 xnouser   2   0  8884K  5140K sbwait   0:01  0.08%  0.08% httpd
 2763 xnouser   2   0  8884K  4756K sbwait   0:00  0.08%  0.08% httpd
 2826 root     28   0   780K  1036K RUN      0:00  0.19%  0.08% top
29292 xnouser   2   0  8892K  5360K sbwait   0:02  0.04%  0.04% httpd
 1631 xnouser   2   0  8884K  4996K sbwait   0:01  0.04%  0.04% httpd
 2077 xnouser   2   0  8884K  4876K sbwait   0:01  0.04%  0.04% httpd
 2790 xnouser   2   0  8884K  4720K accept   0:00  0.04%  0.04% httpd
 2827 xnouser   2   0  8884K  4692K accept   0:00  0.21%  0.04% httpd
  121 root     10   0   220K    16K nfsidl   5:31  0.00%  0.00% nfsiod
  236 root     18   0  1896K  1072K pause    3:51  0.00%  0.00% perl
  198 root      2   0  1820K  1248K select   3:47  0.00%  0.00% perl
  107 root      2 -12   436K   400K select   1:22  0.00%  0.00% xntpd
11555 root      2   0  8884K  4212K select   1:21  0.00%  0.00% httpd
  122 root     10   0   220K    16K nfsidl   0:57  0.00%  0.00% nfsiod
   98 root      2   0   204K   328K select   0:34  0.00%  0.00% syslogd
  145 root      2   0   212K   212K select   0:27  0.00%  0.00% inetd
  184 root      2   0   340K   472K select   0:23  0.00%  0.00% sshd1
  111 daemon    2   0   180K   360K select   0:19  0.00%  0.00% portmap
11569 root      2   0  3824K  1960K select   0:11  0.00%  0.00% httpsd
  148 root     18   0   344K   388K pause    0:06  0.00%  0.00% cron
23916 xnouser  -4   0  8836K  4280K nfsvin   0:04  0.00%  0.00% httpd
23915 xnouser  -4   0  8836K  4232K nfsvin   0:04  0.00%  0.00% httpd
23857 xnouser  -4   0  8836K  4224K nfsvin   0:04  0.00%  0.00% httpd
  123 root     10   0   220K    16K nfsidl   0:03  0.00%  0.00% nfsiod
23871 xnouser  -4   0  8836K  4340K nfsvin   0:04  0.00%  0.00% httpd
23865 xnouser  -4   0  8836K  4336K nfsvin   0:04  0.00%  0.00% httpd
23847 xnouser  -4   0  8900K  4284K nfsvin   0:04  0.00%  0.00% httpd
23843 xnouser  -4   0  8844K  4248K nfsvin   0:04  0.00%  0.00% httpd
23863 xnouser  -4   0  8836K  4236K nfsvin   0:03  0.00%  0.00% httpd
23853 xnouser  -4   0  8900K  4196K nfsvin   0:04  0.00%  0.00% httpd
23874 xnouser  -4   0  8836K  4180K nfsvin   0:04  0.00%  0.00% httpd

Apache 1.3.4 ps output: again, notice the STAT field for httpd processes is D:

  PID  TT  STAT      TIME COMMAND
    0  ??  DLs    0:01.20  (swapper)
    1  ??  Is     0:00.70 /sbin/init --
    2  ??  DL     0:10.92  (pagedaemon)
    3  ??  DL     0:00.00  (vmdaemon)
    4  ??  DL    20:03.06  (update)
   25  ??  Is     0:00.00 adjkerntz -i
   80  ??  Ss   160:02.57 routed -P ripv2
   98  ??  Is     0:33.89 syslogd
  107  ??  S<s    1:22.03 xntpd -p /var/run/xntpd.pid
  111  ??  Is     0:18.96 portmap
  121  ??  I      5:30.99 nfsiod -n 8
  122  ??  I      0:56.78 nfsiod -n 8
  123  ??  I      0:03.48 nfsiod -n 8
  124  ??  I      0:01.38 nfsiod -n 8
  125  ??  I      0:00.91 nfsiod -n 8
  126  ??  I      0:00.76 nfsiod -n 8
  127  ??  I      0:00.68 nfsiod -n 8
  128  ??  I      0:00.62 nfsiod -n 8
  145  ??  Ss     0:27.47 inetd
  148  ??  Is     0:06.07 cron
  152  ??  Is     0:00.46 /usr/sbin/sendmail -q30m
  184  ??  Is     0:22.55 /usr/local/sbin/sshd (sshd1)
  649  ??  D      0:00.42 /usr/local/apache/sbin/httpd
 2734  ??  Ss     0:00.09 telnetd
23861  ??  D      0:03.55 /usr/local/apache/sbin/httpd
28520  ??  D      0:00.01 /usr/local/apache/sbin/httpd
29296  ??  D      0:00.94 /usr/local/apache/sbin/httpd
 2735  p0  Is     0:00.07 -bash (bash)
 2738  p0  S      0:00.13 su (bash)
 2864  p0  R+     0:00.00 ps -ax
 2865  p0  R+     0:00.00 su (bash)
  218  v0  Is     0:00.07 -bash (bash)
  236  v0  I+     3:51.02 /usr/local/bin/perl /web/webmaster/bin/trafmon
  219  v1  Is+    0:00.22 -bash (bash)
  220  v2  Is+    0:00.01 /usr/libexec/getty Pc ttyv2
  221  v3  Is+    0:00.01 /usr/libexec/getty Pc ttyv3
  222  v4  Is+    0:00.01 /usr/libexec/getty Pc ttyv4
  223  v5  Is+    0:00.01 /usr/libexec/getty Pc ttyv5
  224  v6  Is+    0:00.01 /usr/libexec/getty Pc ttyv6

APACHE 1.2.6 top output: Notice the SIZE field at ~2000K, and the httpd process state in 'sbwait':

last pid:   577;  load averages:  0.16,  0.10,  0.03
14:54:43
100 processes: 2 running, 98 sleeping
CPU states: 10.9% user,  0.0% nice,  4.6% system,  1.5% interrupt, 83.0%
idle
Mem: 24M Active, 5104K Inact, 18M Wired, 8344K Buf, 204M Free
Swap: 256M Total, 64K Used, 256M Free

  PID USERNAME PRI NICE SIZE    RES STATE    TIME   WCPU    CPU COMMAND
   80 root      2   0   680K   564K select   0:07  1.98%  1.98% routed
  573 xnouser   2   0  1988K  2136K sbwait   0:00  0.63%  0.11%
httpd_1.2.6fp-
  569 xnouser   2   0  1988K  2152K sbwait   0:00  0.12%  0.04%
httpd_1.2.6fp-
  503 xnouser   2   0  1988K  2148K sbwait   0:00  0.04%  0.04%
httpd_1.2.6fp-
  548 xnouser   2   0  1988K  2148K select   0:00  0.04%  0.04%
httpd_1.2.6fp-
  505 xnouser   2   0  1980K  2144K select   0:00  0.04%  0.04%
httpd_1.2.6fp-
  501 xnouser   2   0  1980K  2136K select   0:00  0.04%  0.04%
httpd_1.2.6fp-
  557 xnouser   2   0  1988K  2128K sbwait   0:00  0.06%  0.04%
httpd_1.2.6fp-
  479 root     18   0  1968K  2016K pause    0:00  0.00%  0.00%
httpd_1.2.6fp-
  489 root     18   0  1076K  1216K pause    0:00  0.00%  0.00%
httpd_1.2.6sf
  152 root     18   0   576K   876K pause    0:00  0.00%  0.00% sendmail
  148 root     18   0   344K   600K pause    0:00  0.00%  0.00% cron
   25 root     18   0   208K    92K pause    0:00  0.00%  0.00% adjkerntz
  218 root     10   0   752K   848K wait     0:00  0.00%  0.00% bash
  239 root     10   0   744K   832K wait     0:00  0.00%  0.00% bash
  236 mdavis   10   0   744K   828K wait     0:00  0.00%  0.00% bash
    1 root     10   0   512K   248K wait     0:00  0.00%  0.00% init
  128 root     10   0   220K    96K nfsidl   0:00  0.00%  0.00% nfsiod
  127 root     10   0   220K    96K nfsidl   0:00  0.00%  0.00% nfsiod
  123 root     10   0   220K    96K nfsidl   0:00  0.00%  0.00% nfsiod
  121 root     10   0   220K    96K nfsidl   0:00  0.00%  0.00% nfsiod
  126 root     10   0   220K    96K nfsidl   0:00  0.00%  0.00% nfsiod
  122 root     10   0   220K    96K nfsidl   0:00  0.00%  0.00% nfsiod
  124 root     10   0   220K    96K nfsidl   0:00  0.00%  0.00% nfsiod
  125 root     10   0   220K    96K nfsidl   0:00  0.00%  0.00% nfsiod
  222 root      3   0   180K   568K ttyin    0:00  0.00%  0.00% getty
  219 root      3   0   180K   568K ttyin    0:00  0.00%  0.00% getty
  220 root      3   0   180K   568K ttyin    0:00  0.00%  0.00% getty
  221 root      3   0   180K   568K ttyin    0:00  0.00%  0.00% getty
  223 root      3   0   180K   568K ttyin    0:00  0.00%  0.00% getty
  224 root      3   0   180K   568K ttyin    0:00  0.00%  0.00% getty
  520 xnouser   2   0  2004K  2204K sbwait   0:00  0.00%  0.00%
httpd_1.2.6fp-
  544 xnouser   2   0  2004K  2200K sbwait   0:00  0.00%  0.00%
httpd_1.2.6fp-

Here's an extended output from ps showing Apache 1.3.4 servers.  Notice how
many of them were in a 'D' (disk) operation.  With apache 1.2.6, all of them
are either in an 'S' state or 'I' (which I'm sure means idle).

xnouser    174 11555 11555 5d36500    0 D     ??    0:01.81
/usr/local/apache/sbin/httpd
xnouser    175 11555 11555 5d36500    0 D     ??    0:01.86
/usr/local/apache/sbin/httpd
xnouser    176 11555 11555 5d36500    0 S     ??    0:02.20
/usr/local/apache/sbin/httpd
xnouser    177 11555 11555 5d36500    0 D     ??    0:01.09
/usr/local/apache/sbin/httpd
xnouser    178 11555 11555 5d36500    0 D     ??    0:00.87
/usr/local/apache/sbin/httpd
xnouser    180 11555 11555 5d36500    0 D     ??    0:00.29
/usr/local/apache/sbin/httpd
xnouser    186 11555 11555 5d36500    0 D     ??    0:00.91
/usr/local/apache/sbin/httpd
xnouser    189 11555 11555 5d36500    0 D     ??    0:00.85
/usr/local/apache/sbin/httpd
xnouser    191 11555 11555 5d36500    0 D     ??    0:00.23
/usr/local/apache/sbin/httpd
xnouser    199 11555 11555 5d36500    0 D     ??    0:00.21
/usr/local/apache/sbin/httpd
xnouser    213 11555 11555 5d36500    0 D     ??    0:00.06
/usr/local/apache/sbin/httpd
xnouser    291 11555 11555 5d36500    0 D     ??    0:00.36
/usr/local/apache/sbin/httpd
xnouser    292 11555 11555 5d36500    0 S     ??    0:02.05
/usr/local/apache/sbin/httpd
xnouser    293 11555 11555 5d36500    0 D     ??    0:00.40
/usr/local/apache/sbin/httpd
xnouser    295 11555 11555 5d36500    0 D     ??    0:00.62
/usr/local/apache/sbin/httpd
xnouser    296 11555 11555 5d36500    0 D     ??    0:00.74
/usr/local/apache/sbin/httpd
xnouser    353 11555 11555 5d36500    0 D     ??    0:00.04
/usr/local/apache/sbin/httpd
xnouser    444 11555 11555 5d36500    0 D     ??    0:00.40
/usr/local/apache/sbin/httpd
xnouser    445 11555 11555 5d36500    0 D     ??    0:01.72
/usr/local/apache/sbin/httpd
xnouser    446 11555 11555 5d36500    0 D     ??    0:00.30
/usr/local/apache/sbin/httpd
xnouser    608 11555 11555 5d36500    0 S     ??    0:01.34
/usr/local/apache/sbin/httpd
xnouser    609 11555 11555 5d36500    0 S     ??    0:01.62
/usr/local/apache/sbin/httpd
xnouser    611 11555 11555 5d36500    0 D     ??    0:00.24
/usr/local/apache/sbin/httpd
xnouser    614 11555 11555 5d36500    0 D     ??    0:00.04
/usr/local/apache/sbin/httpd
xnouser    648 11555 11555 5d36500    0 D     ??    0:00.07
/usr/local/apache/sbin/httpd
xnouser    649 11555 11555 5d36500    0 D     ??    0:00.42
/usr/local/apache/sbin/httpd
xnouser    651 11555 11555 5d36500    0 D     ??    0:00.27
/usr/local/apache/sbin/httpd
xnouser    652 11555 11555 5d36500    0 D     ??    0:00.03
/usr/local/apache/sbin/httpd
xnouser    653 11555 11555 5d36500    0 D     ??    0:01.34
/usr/local/apache/sbin/httpd
xnouser    742 11555 11555 5d36500    0 S     ??    0:01.76
/usr/local/apache/sbin/httpd
xnouser    803 11555 11555 5d36500    0 D     ??    0:00.19
/usr/local/apache/sbin/httpd
xnouser    821 11555 11555 5d36500    0 D     ??    0:00.04
/usr/local/apache/sbin/httpd
xnouser    838 11555 11555 5d36500    0 D     ??    0:00.03
/usr/local/apache/sbin/httpd
xnouser    839 11555 11555 5d36500    0 D     ??    0:00.04
/usr/local/apache/sbin/httpd

The only errors found in the log file came when apache couldn't kill it's 'stuck' children after the parent httpd process received a
kill signal...  Note that kill -9 won't kill these, either.  The only thing that does the job is a reboot.

httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 23857 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 23871 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 23861 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 29296 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 29299 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 29302 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 29306 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 29308 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 27638 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 29317 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 28520 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 178 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 29557 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 444 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 353 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 648 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 649 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 1025 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 1316 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 1322 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 2361 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 2411 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 2533 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 2634 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:38 1999] [warn] child process 2637 still did not exit, sending a SIGTERM
httpd: [Sun Mar 14 14:41:42 1999] [error] child process 23861 still did not exit, sending a SIGKILL
httpd: [Sun Mar 14 14:41:42 1999] [error] child process 29296 still did not exit, sending a SIGKILL
httpd: [Sun Mar 14 14:41:42 1999] [error] child process 28520 still did not exit, sending a SIGKILL
httpd: [Sun Mar 14 14:41:42 1999] [error] child process 649 still did not exit, sending a SIGKILL
httpd: [Sun Mar 14 14:41:59 1999] [error] could not make child process 23861 exit, attempting to continue anyway
httpd: [Sun Mar 14 14:41:59 1999] [error] could not make child process 29296 exit, attempting to continue anyway
httpd: [Sun Mar 14 14:41:59 1999] [error] could not make child process 28520 exit, attempting to continue anyway
httpd: [Sun Mar 14 14:41:59 1999] [error] could not make child process 649 exit, attempting to continue anyway
httpd: [Sun Mar 14 14:41:59 1999] [notice] caught SIGTERM, shutting down

Matthew H. North
CTSNET Assistant Administrator
   ___  ____  ___
  /      /   /__   Network | Tel 619/637-3600 | Mail: cts...@cts.com
 /__    /   ___/  Services | Fax 619/637-3630 | WWW: http://www.cts.com