Kernel 2.0.27 hangs after 16 days uptime

Post by Hans-Joachim Baad » Sun, 22 Dec 1996 04:00:00



Hi,

Last week the NetWare server in our office crashed after 42 days
uptime. I thought with my Linux server box at home I could beat
this anytime, especially with the later kernels.

Alas, today after 16 days the system locked up completely. ping from
another box showed no reaction so I had to hit the Red Button :-(

At the time of the crash the system was heavily loaded with povray
rendering a complex image. A lot of daemons ran in the background
and xearth was running in the root window. Several modules were
loaded (at least the DOSEMU modules, CDROMs, network card and ISDN).
I was typing something into a shell command line when it happened.

One hour earlier I had a big problem with a 'make' command running
wild. After it had spawned 122 copies of itself, the system ran
out of virtual memory and I could not even start 'ls' or 'ps'
because they could not load libc. (Here's some good advice: Always
have statically linked versions of the fileutils around!)

I solved the problem by using a statically linked 'mv' to move
/usr/bin/make out of the way. Can it be that this has made the
system unstable?

I definitely think it wasn't a hardware failure. It couldn't be
a power failure either because the system has a UPS. The CPU fan
(AMD486DX-100) is also working correctly.

What can be done to find the cause? My logfiles don't contain
anything. I think I'll just run povray for extended periods to
test the possibility of a CPU bug. I can also write a program that
exhausts all virtual memory, to see what happens then.
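
A minimal sketch of such a memory-exhaustion test in C. The names
eat_memory and release_chunks are made up for illustration, and the
chunk size and byte limit are arbitrary parameters. Each chunk is
written to after allocation so that the pages are really backed by RAM
or swap, and the chunks are chained so they can be freed again.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One allocated chunk; chunks are chained so they can be freed later. */
struct chunk {
    struct chunk *next;
};

/*
 * Allocate and touch memory in chunk_size pieces until malloc fails or
 * `limit` bytes are held (limit == 0 means no cap). Returns the number
 * of bytes held; *head receives the chain for release_chunks().
 */
static size_t eat_memory(size_t chunk_size, size_t limit, struct chunk **head)
{
    size_t total = 0;

    assert(head != NULL);
    *head = NULL;
    if (chunk_size < sizeof(struct chunk))
        chunk_size = sizeof(struct chunk);

    while (limit == 0 || total + chunk_size <= limit) {
        struct chunk *c = malloc(chunk_size);
        if (c == NULL)
            break;                      /* allocator gave up */
        memset(c, 0xA5, chunk_size);    /* touch every page of the chunk */
        c->next = *head;
        *head = c;
        total += chunk_size;
    }
    return total;
}

/* Free every chunk collected by eat_memory(). */
static void release_chunks(struct chunk *head)
{
    while (head != NULL) {
        struct chunk *next = head->next;
        free(head);
        head = next;
    }
}
```

Calling eat_memory(4096, 0, &head) from main() removes the cap and
keeps going until malloc fails. That can make the box unusable, so it
is best tried from a console with nothing important running.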

Hans-Joachim
--
    Uncle Ed's Rule of Thumb:  Never use your thumb for a rule.
    You'll either hit it with a hammer or get a splinter in it.

 
 
 

Post by Albert D. Cahal » Fri, 27 Dec 1996 04:00:00



> last week the netware server in our office crashed after 42 days
> uptime. I thought with my linux server box at home I could beat
> this anytime, especially with the later kernels.

> Alas, today after 16 days the system locked up completely. ping from
> another box showed no reaction so I had to hit the Red Button :-(

There are hotkeys that can dump out registers, free page lists,
and other things. Try Alt, Shift, and Control in combination with
SysRq, Scroll Lock, and Break. On some kernels (patched or recent),
one of those hotkeys will kill all processes on the console
(so init spawns a new getty) and fix a messed up keyboard mode.

> One hour earlier I had a big problem with a 'make' command running
> wild. After it had spawned 122 copies of itself, the system ran
> out of virtual memory and I could not even start 'ls' or 'ps'
> because they could not load libc. (Here's some good advice: Always
> have statically linked versions of the fileutils around!)

They are good to have, but very bad to use unless needed. Even /bin/sh
and the filesystem tools should be dynamic to save memory. You might
want to have a rwx------ /static directory for them.

> I solved the problem by using a statically linked 'mv' to move
> /usr/bin/make out of the way. Can it be that this has made the
> system unstable?

Maybe. I think I have seen a few spots in the kernel that have
comments like "if kmalloc fails, we are dead anyway".

> I definitely think it wasn't a hardware failure. It couldn't be
> a power failure either because the system has an UPS. The CPU fan
> (AMD486DX-100) is also working correctly.
> What can be done to find the cause? My logfiles don't contain
> anything. I think I'll just run povray for extended periods to
> test the possibility of a CPU bug. I can also write a program that
> exhausts all virtual memory, to see what happens then.

Try the hotkeys. First figure out which one gives a register dump
and make sure klogd/syslogd don't swallow it. After you can get
a register dump to the screen when you want one, let the machine
run until it crashes. Take multiple register dumps so that you can
see if the registers are changing, then look up the addresses as
you would for a crash with register dump.
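
Looking up an address usually means finding the nearest symbol at or
below it in the System.map from the kernel build. A rough sketch in C,
assuming the usual "hexaddress type name" line format; the function
name lookup_symbol is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan a System.map-style stream ("hexaddress type name" per line) for
 * the symbol with the highest address that is still <= addr.
 * On success, copy the symbol name into `name` and return the offset of
 * addr from the start of that symbol. Return -1 if no symbol matches.
 */
static long lookup_symbol(FILE *map, unsigned long addr,
                          char *name, size_t namelen)
{
    char line[256], sym[128], type;
    unsigned long start, best = 0;
    int found = 0;

    assert(map != NULL && name != NULL && namelen > 0);

    while (fgets(line, sizeof line, map) != NULL) {
        /* Skip lines that do not look like "address type name". */
        if (sscanf(line, "%lx %c %127s", &start, &type, sym) != 3)
            continue;
        if (start <= addr && (!found || start >= best)) {
            best = start;
            strncpy(name, sym, namelen - 1);
            name[namelen - 1] = '\0';
            found = 1;
        }
    }
    return found ? (long)(addr - best) : -1;
}
```

Open the System.map that matches the running kernel with fopen() and
pass the EIP value from each register dump as addr. If successive dumps
resolve to the same function, the kernel is probably spinning there.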

--
Albert Cahalan
acahalan at cs.uml.edu (no junk mail please - I will hunt you down)