Process hangs in 2.4.19, RH7.latest, and 2.4.20-pre7-ac2

Post by Jeff Dike » Thu, 03 Oct 2002 02:00:11



I (and other people) have seen process hangs on stock 2.4.19, 2.4.20-pre7-ac2,
and (iirc) the latest RH 7.x kernel.  Any process that does the moral
equivalent of ps hangs.  The machine quickly becomes unusable, and needs to
be crashed.

It's been seen most often under heavy UML load; in my case, usually while
doing UML development inside UML (stock 2.4.19).  It's also been seen on
2.4.20-pre7-ac2 on a UML server.  However, I have had it happen with no
UMLs in sight.

We finally got sysrq information on this.  The hung processes all look like
this:
        Proc;  ps
        >>EIP; e352bef4 <_end+2307b320/386c042c>   <=====
        Trace; c032a955 <rwsem_down_read_failed+195/1c0>
        Trace; c016e3c0 <.text.lock.array+73/123>
        Trace; c016b340 <proc_info_read+50/110>
        Trace; c0148736 <sys_read+96/190>
        Trace; c0147fb3 <sys_open+53/b0>
        Trace; c01092cb <system_call+33/38>

The lock in question is the mmap_sem being acquired in proc_pid_stat.  There
should be a sleeping process which is holding the semaphore, but I haven't
spotted it among the multitudes that were running at the time.
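
For readers who haven't looked at fs/proc/array.c: the pattern is roughly
the following (a minimal sketch in 2.4 style, not verbatim kernel source;
format_stat() is a hypothetical stand-in for the field formatting):

        /*
         * Sketch of the locking pattern, not verbatim 2.4 source.
         * proc_pid_stat() read-locks the task's mmap_sem so the vma
         * list stays stable while /proc/<pid>/stat is formatted.
         */
        int proc_pid_stat(struct task_struct *task, char *buffer)
        {
                struct mm_struct *mm = task->mm;
                int res;

                down_read(&mm->mmap_sem);  /* blocks in rwsem_down_read_failed()
                                              while a writer holds the semaphore */
                res = format_stat(task, mm, buffer);  /* hypothetical helper */
                up_read(&mm->mmap_sem);
                return res;
        }

So any path that takes down_write(&mm->mmap_sem) and then sleeps
indefinitely will queue every ps-like reader behind it, which is what the
rwsem_down_read_failed frames above show.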

The full ksymoops-ed sysrq-t output is available at
        http://www.veryComputer.com/

I'm not including it here because it's too large.
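
(As an aside, for anyone trying to reproduce this: if CONFIG_MAGIC_SYSRQ is
enabled, the same task dump can be requested without a console keyboard by
writing 't' to /proc/sysrq-trigger.  A minimal sketch:)

        /* Minimal sketch: request a sysrq-t task dump from userspace.
         * Needs CONFIG_MAGIC_SYSRQ; output lands in the kernel log. */
        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/proc/sysrq-trigger", "w");

                if (!f) {
                        perror("/proc/sysrq-trigger");
                        return 1;
                }
                fputc('t', f);  /* 't' = show task states and stacks */
                return fclose(f) ? 1 : 0;
        }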

There should be one process which started this by grabbing the mmap_sem and
sleeping forever, and I would think its stack would be different from all
the others.  There are a few processes whose deepest IPs are unique:

        Proc;  grep
        >>EIP; 00000002 Before first symbol   <=====
        Trace; c0118120 <do_page_fault+0/438>

        Proc;  killall
        >>EIP; ea4b5ee4 <_end+2a005310/386c042c>   <=====
        Trace; c0130b72 <__vma_link+62/c0>
        Trace; c032a955 <rwsem_down_read_failed+195/1c0>
        Trace; c016e3c0 <.text.lock.array+73/123>
        Trace; c016b340 <proc_info_read+50/110>

        Proc;  init
        >>EIP; 00000000 Before first symbol
        Trace; c013f5e3 <__get_free_pages+13/30>

None of these looks like the culprit.  init is probably innocent, the grep
was processing the output of a hung ps (so it was already too late), and the
killall is itself hung.

I'd appreciate any clues about what's going on here.  If anyone needs more
info than what's in the sysrq output at the URL above, contact me or Bill
Stearns (wstearns at pobox dot com).

                                Jeff

1. NFS/UDP/IP performance - 2.4.19 vs. 2.4.20, 2.4.21-pre3

Greetings.

There seems to be a remarkable performance difference
between 2.4.19 and 2.4.20/2.4.21-pre3 with regard to
NFS writes/reads.  I am not sure, but the problem may not
be in NFS itself but somewhere lower (UDP/IP or core).

For example, in my kernel and network configuration, a
5 MB write to a new file over NFS takes about 2.5 seconds
on 2.4.19.  With everything else the same (including kernel
configuration), the same write takes 11 or more seconds on
2.4.20 and 2.4.21-pre3.
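
For reference, the measurement is nothing fancier than timing a 5 MB write
to a fresh file on the NFS mount; a minimal harness along these lines (the
path /mnt/nfs/testfile is a placeholder for your mount):

        /* Minimal timing harness for the 5 MB NFS write test. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/time.h>
        #include <unistd.h>

        #define TOTAL (5 * 1024 * 1024)   /* 5 MB, as in the test above */
        #define CHUNK 8192

        int main(void)
        {
                static char buf[CHUNK];
                struct timeval t0, t1;
                int fd, i;

                memset(buf, 'x', sizeof(buf));
                /* placeholder path on the NFS mount */
                fd = open("/mnt/nfs/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
                if (fd < 0) {
                        perror("open");
                        exit(1);
                }
                gettimeofday(&t0, NULL);
                for (i = 0; i < TOTAL / CHUNK; i++)
                        if (write(fd, buf, CHUNK) != CHUNK) {
                                perror("write");
                                exit(1);
                        }
                fsync(fd);      /* make sure the data reaches the server */
                gettimeofday(&t1, NULL);
                close(fd);
                printf("%.2f seconds\n", (t1.tv_sec - t0.tv_sec) +
                       (t1.tv_usec - t0.tv_usec) / 1e6);
                return 0;
        }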

Also, while this file write is in progress, system time
goes up to 15% on 2.4.19, whereas on 2.4.20/21-pre3
it is about 4%.  (I use sar/sysstat for this.)

Memory accesses don't seem to be the issue either.  Test
programs to check this show the same times and are OK (as I
expect on the board I use); a sketch of such a check follows
after the next paragraph.

"netstat -s", ifconfig, and tcpdump traces don't seem to
point to dropped packets, collisions, retransmissions,
etc.
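
The memory check mentioned above was along these lines (a sketch only;
buffer size and pass count are arbitrary):

        /* Sketch of a memcpy() bandwidth check: time a fixed number
         * of passes over a buffer and report MB/s. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/time.h>

        int main(void)
        {
                size_t size = 4 * 1024 * 1024;  /* arbitrary buffer size */
                char *src = malloc(size), *dst = malloc(size);
                struct timeval t0, t1;
                double secs;
                int i;

                if (!src || !dst) {
                        perror("malloc");
                        return 1;
                }
                memset(src, 1, size);
                gettimeofday(&t0, NULL);
                for (i = 0; i < 100; i++)       /* 100 passes, also arbitrary */
                        memcpy(dst, src, size);
                gettimeofday(&t1, NULL);
                secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
                printf("%.1f MB/s\n", 100.0 * size / (1024 * 1024) / secs);
                free(src);
                free(dst);
                return 0;
        }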

The hardware configuration is PowerPC based, and there
are no changes in the board-specific I/O subsystem between
2.4.19 and 2.4.20/21-pre3.  The same compiler is used for
building both kernels, and I have tried GCC 3.2 as well,
with the same results.

So I don't suspect a board- or compiler-related issue.

Also, I see some differences in the handling of bottom
halves in net/core/dev.c between 2.4.19 and 2.4.20/21-pre3,
although I have not gone through these in enough detail to
assert that this is indeed the problem area.

Questions:

  - Has anyone seen this?  Perhaps on other platforms (x86 etc.)?
    Is there some tunable that has been added (or whose default has
    changed) after 2.4.19, and which needs to be tuned?  (See the
    sketch after this list for one way to compare candidates.)

  - I have tried to enable kernel profiling to find any
    potential problem code areas, but given the low CPU
    utilization during these copies I am not sure it
    can give any useful info.

    Could anyone offer any ideas to debug this?
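
On the tunables question above: one low-tech way to compare is to dump the
/proc/sys/net/core values under both kernels and diff the output.
netdev_max_backlog, rmem_default, and wmem_default are examples that exist
on 2.4; whether any default actually changed between 2.4.19 and 2.4.20 is
exactly the open question.  A sketch:

        /* Dump a few /proc/sys/net/core tunables so the values can be
         * diffed between the 2.4.19 and 2.4.20/21-pre3 kernels. */
        #include <stdio.h>

        static void show(const char *path)
        {
                char buf[64];
                FILE *f = fopen(path, "r");

                if (f && fgets(buf, sizeof(buf), f))
                        printf("%-45s %s", path, buf);
                if (f)
                        fclose(f);
        }

        int main(void)
        {
                show("/proc/sys/net/core/netdev_max_backlog");
                show("/proc/sys/net/core/rmem_default");
                show("/proc/sys/net/core/wmem_default");
                return 0;
        }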

I would appreciate it if you would copy me on any responses
to this post; I don't subscribe to this list.

Best regards,
-Arun.


2. Problems with su and permissions

3. Oops in usb_submit_urb with US_FL_MODE_XLATE (2.4.19 and 2.4.20-pre7)

4. Usage of getitimer()

5. TCP hangs in 2.4.20-pre11 (and 2.4.19)

6. telnet as a device like /dev/term/a

7. Hangs in 2.4.19 and 2.4.20-pre5 (IDE-related?)

8. Audio cds on NEC 210

9. PROBLEM: 2.4.19 & 2.4.20 hang without oops...

10. Hangs in 2.4.19 and 2.4.20-pre5 (IDE-related?)

11. PROBLEM: 2.4.19 & 2.4.20 hang without oops...

12. Fix swsusp in 2.4.19-pre7-ac2 (fwd)

13. CONFIG_RAMFS in 2.4.19-pre7-ac2 ???