Hi!
I have a problem with an RPC server I wrote for HP-UX (originally for 9.x,
but now I'm using 10.10 on an HP9000/725) and I need any help I can get!
The server's purpose is to control up to 10 modem devices and to provide
services like open_connection, close_connection, read_packet, write_packet
and use_error_correction_protocol to the server's clients, using an RPC
interface.
The clients connect to the server from the same host or from some other host
on the local area network using RPC.
The main server gets the service-requests from the client and starts a
child server, which handles the request and communicates with the client.
The main server then goes back to listen for new requests. The main server
is configured to start up to 10 child servers at a time. All this is pretty
straightforward and usually works fine: in a production environment the
server handles an average of about 50000 requests in 30 days. I have this
server installed on about 10 HP-UX systems.
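For reference, the dispatch pattern described above can be sketched roughly
as follows. This is a simplified illustration under stated assumptions, not
the actual server code: the names spawn_child, reap_children, demo_handler
and active_children are made up for this post, and the RPC dispatch itself
is left out. It only shows the "fork a child per request, at most 10 at a
time" bookkeeping.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 10          /* limit mentioned in the post */

static int active_children = 0;  /* children forked and not yet reaped */

/* Hypothetical stand-in for real request handling: sleep, then succeed. */
static int demo_handler(void *arg)
{
    sleep(*(unsigned *)arg);
    return 0;
}

/* Collect exited children. If block is non-zero, wait for at least one.
 * Returns the number of children reaped by this call. */
static int reap_children(int block)
{
    int reaped = 0;
    int status;
    pid_t pid;

    while (active_children > 0) {
        pid = waitpid(-1, &status, (block && reaped == 0) ? 0 : WNOHANG);
        if (pid > 0) {
            active_children--;
            reaped++;
        } else if (pid == -1 && errno == EINTR) {
            continue;            /* interrupted by a signal: retry */
        } else {
            break;               /* 0: nothing exited yet; -1: no children */
        }
    }
    return reaped;
}

/* Fork a child that runs handler(arg) and exits with its return value.
 * Returns the child's pid, or -1 if the limit is reached or fork() fails. */
static pid_t spawn_child(int (*handler)(void *), void *arg)
{
    pid_t pid;

    assert(handler != NULL);
    reap_children(0);            /* free slots of children that have exited */
    if (active_children >= MAX_CHILDREN)
        return -1;

    pid = fork();
    if (pid == 0)
        _exit(handler(arg));     /* child: handle the request, then exit */
    if (pid > 0)
        active_children++;       /* parent: account for the new child */
    return pid;
}
```

In this sketch the parent never blocks unless it asks to (reap_children(1)),
so a full set of 10 busy children just makes spawn_child return -1 rather
than stalling the main loop.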
Now, sometimes the main server gets stuck in an endless loop and I don't know
how or why! In this situation it accumulates CPU time and doesn't react to
service requests any more. I'm still able to kill that process, though!
I got the HP-UX "trace" command to take a look into the process
(I'm quite used to debugging running server processes), and a "trace -k -p 29812"
(29812 is the PID of the server process) shows the following output:
[...]
29812: swtch()
29812: resume(nice=20)
29812: hardclock(state=CP_USER)
29812: hardclock(state=CP_USER)
29812: setrq()
29812: swtch()
29812: resume(nice=20)
29812: hardclock(state=CP_USER)
29812: hardclock(state=CP_USER)
29812: hardclock(state=CP_USER)
29812: hardclock(state=CP_USER)
29812: hardclock(state=CP_USER)
29812: hardclock(state=CP_USER)
29812: setrq()
29812: swtch()
29812: resume(nice=20)
29812: hardclock(state=CP_USER)
29812: hardclock(state=CP_USER)
29812: setrq()
29812: swtch()
29812: resume(nice=20)
29812: hardclock(state=CP_USER)
29812: hardclock(state=CP_USER)
29812: setrq()
29812: swtch()
29812: resume(nice=20)
[...]
These statements repeat in an endless loop. Note that these are not
ordinary system calls! A simple "trace -p 29812" shows no output at all!
The flag "-k" tells "trace" to trace kernel routines as well as system
calls, so I guess the symbols listed above are just kernel routines.
But here I'm stuck! I don't know how to debug this situation. Why doesn't
my server process resume normal operation? What can I do to prevent the
process from hanging in an endless loop of kernel routines? And what makes it
enter that loop? It's also very hard to reproduce: sometimes it takes
months before such a situation occurs (but it's still a problem!).
Of course I also tried tracing a normally operating server process, and
it shows the expected output: normal system calls get executed, and after
finishing the request it waits for a new one on its RPC socket interface.
I would be very interested in any information that could solve this mystery!
If you need more information about the problem, please let me know and I'll
give you whatever information I can provide!
TIA
- andreas
--
*x Software + Systeme | phone: +43.1.6001508 | on request.
Buchengasse 67/8 | +43.664.3004449 |
A-1100 Vienna, Austria | fax: +43.1.6001507 |