(for gurus) ISSUES: crashes, crash on boot, crash on shutdown

Post by Arthur Sower » Wed, 08 Nov 2000 04:00:00



I am learning Linux. I have a bunch of Linux books, and four books on
Unix. I've read most of most of them, a few cover to cover, to get the
gist of what I'm getting into. I've asked about the subject line in
several Linux newsgroups and not gotten definitive answers. One of my Unix
books (Unix Secrets, 2nd Ed., by James C. Armstrong, an IDG book,
1999) has a short and inadequate chapter (chapter 51, all of seven pages
long) on crashes, and it's broken down into hardware crashes and software
crashes.

Here is why I'm interested: I've played with Linux now, off and on, for
over 1-1/2 years, and many distribution installs (some crashed on install,
others always install without problems). I've read lots of posts on the
Linux newsgroups. The books say (if and when they talk about it) that
Linux needs a graceful shutdown. You don't just hit the reset button
(which triggers a warm boot) like in DOS or Windows. It also needs a
graceful startup, too.

In the course of my learning (and stumbling over doing stupid things),
I've gotten into things (applications, utilities, whatever) that I did
not know how to get out of and nothing I did would help. Ergo, I had no
option but to shut off the switch. Big surprise on next bootup: it crashes
somewhere before the bootup is completed. I've also had one shutdown that
stopped at some stage before it reached shutdown (this was, I think,
because I used the User Mount Tool in Red Hat version 5.2 of Linux on a
CDROM disk that gave an error message "too many filesystems" and I did not
check for a problem before shutdown. Incidentally, with RH ver 6.2 on a
different box, this did not happen when I used the file manager on the
same CDROM disk!). However, I am really not sure. The simplest solution
for these crashes was to re-install my OS (taking about 20-30 minutes).

I have run into three people on another non-comp newsgroup running Linux
who claim that they don't worry about graceful OS shutdowns and turn their
power switches on and off whenever they feel like it (or at least
this is what it sounds like they are saying). One says they have their box
running off the AC power line without a UPS between the power line and the
box. When I ask them to tell me how this is possible, I don't get a lot of
detail about how they do it OR why I'm seeing my OS get trashed whenever I
have a lockup and have to switch the power off. (I have since learned a few
things about processes that misbehave: I can kill them without an OS
shutdown, and in those cases where the process can't be killed, a graceful
shutdown will "flush" the problem out [at least so far it has].)

Linux install packages have an option for a rescue disk and a protocol for
booting up to get access to the filesystem, but then they don't tell you
much (like, zero) about how to diagnose bootup problems (or shutdown
problems). I've had boot failures and one shutdown failure, and I know that
whenever I get one, I'm in trouble and have to reinstall the OS.
At some point in the future I'm going to have to learn how to do
rescues. But I need to learn more about recovery from crashes. AND what
are these three guys doing, who say they can just turn off their
switches any time they want and turn them back on again [but can't tell
me in detail why they can do it and I can't]?

Is anyone aware of resources on this topic on one or more websites
anywhere?

Art Sowers

 
 
 

Post by Ed F. de Guzma » Wed, 08 Nov 2000 04:00:00


The answer to some of your questions lies in the fact that different kernel
versions of Linux behave differently. I have worked with Linux kernels 2.0.x,
2.1.x and 2.2.x and all behave differently, and to make things worse, they
also behave differently depending on who packaged them.

I have shut down Red Hat 5.0, 5.1, 5.2, 6.0, 6.1, 6.2, and 7.0 by just hitting
the power off switch, without any problem after reboot. The only thing the
kernel will do is run fsck to check the HD integrity. Incidentally,
Solaris does this too. (I have to inject some Solaris content since we are in
a Solaris newsgroup!)

What I suggest you do is run a hardware check. Your disk on the RH 5.2
box might be the culprit. I am also assuming that you have patched the
installation, i.e. updated RH 5.2 from the update rpm files.

OS crashes might harm the entire system, but they should not do so every now
and then.

By the way, I do run my Linux test box without a UPS too!!

Ed




 
 
 

Post by Jens.Toerr.. » Wed, 08 Nov 2000 04:00:00



> Here is why I'm interested: I've played with Linux now, off and on, for
> over 1-1/2 years, and many distribution installs (some crashed on install,
> others always install without problems). I've read lots of posts on the
> Linux newsgroups. The books say (if and when they talk about it) that
> Linux needs a graceful shutdown. You don't just hit the reset button
> (which triggers a warm boot) like in DOS or Windows. It also needs a
> graceful startup, too.

> In the course of my learning (and stumbling over doing stupid things),
> I've gotten into things (applications, utilities, whatever) that I did
> not know how to get out of and nothing I did would help. Ergo, I had no
> option but to shut off the switch. Big surprise on next bootup: it crashes
> somewhere before the bootup is completed. I've also had one shutdown that
> stopped at some stage before it reached shutdown (this was, I think,
> because I used the User Mount Tool in Red Hat version 5.2 of Linux on a
> CDROM disk that gave an error message "too many filesystems" and I did not
> check for a problem before shutdown. Incidentally, with RH ver 6.2 on a
> different box, this did not happen when I used the file manager on the
> same CDROM disk!). However, I am really not sure. The simplest solution
> for these crashes was to re-install my OS (taking about 20-30 minutes).

There are two major reasons why you shouldn't switch off a UNIX system
without a proper shutdown:

1. UNIX is a multi-user system, so switching off the machine without
   informing the other users and giving them a chance to save their
   files etc. won't make you many friends.
2. Data are not always written to disk immediately but caching is used,
   i.e. data that a program writes to the hard disk are temporarily kept in
   memory and only written out after some time. This is especially bad if
   files needed by the system are not written to the disk when you switch off
   the machine.

The first point is probably not very important to you because you seem to
use your computer alone. So the second is the really important one for you.
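
The caching in point 2 can be seen from inside a program. What follows is only
a minimal Python sketch, not something from the original post, and the file
name in it is made up for illustration. It shows the explicit steps a program
can take to push its own data out of memory and onto the disk, instead of
leaving it to be written "after some time":

```python
import os
import tempfile


def write_durably(path, text):
    """Write text to path and ask the OS to push it to the disk now."""
    with open(path, "w") as f:
        f.write(text)          # data may still sit in the program's own buffer
        f.flush()              # hand the data over to the kernel's cache
        os.fsync(f.fileno())   # ask the kernel to write its cached copy to disk


# Hypothetical file name, used only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "important.txt")
write_durably(demo_path, "saved before the power went out\n")

with open(demo_path) as f:
    saved = f.read()
print(saved, end="")
```

Most programs do not do this for every write, which is why a sudden power-off
can lose data that a program believed it had already saved.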

First of all, as you already found out, you can usually terminate a program
even though it no longer responds: get its PID using ps and send it
a signal via kill. Normally a TERM signal will do, but sometimes a real
KILL signal will be needed, i.e. "kill -KILL xxx" (xxx = PID).
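
The "TERM first, KILL only if needed" order can also be scripted. This is a
rough Python sketch under stated assumptions: the sleeping child process is
just a stand-in for a hung program, and the five-second grace period and the
function name are arbitrary choices for illustration, not anything prescribed
in this thread.

```python
import signal
import subprocess
import sys


def stop_process(proc, grace_seconds=5.0):
    """Send TERM first; fall back to KILL if the process is still alive."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "TERM"
    except subprocess.TimeoutExpired:
        proc.kill()   # sends SIGKILL on Unix; cannot be caught or ignored
        proc.wait()
        return "KILL"


# Stand-in for a hung program: a child process that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
used = stop_process(child)
print("signal that worked:", used, "exit status:", child.returncode)
```

TERM gives the program a chance to clean up after itself, which is why it is
worth trying before KILL.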

If your GUI seems to be frozen, you can often still get to a command-line
console by pressing Ctrl-Alt-F1 (at least with Linux), where you can then log
in and send a signal to kill the hung process.

If even this won't work, because X11 crashed badly and you can't log into your
computer from a different machine, the reset button is your only way out. In
this case wait for the disk activity of your programs to cease, then wait some
more time (a bit more than 10 seconds is a good guess) before you hit the
reset button. Waiting these 10 extra seconds is necessary because the
cached data are written to the disk automatically after some fixed time (when
I look at the disk's LED it seems to be about 10 s).
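
If you can still get a login somewhere, you do not have to judge by the LED
alone. The sketch below is an assumption-laden illustration in Python: it
relies on the "Dirty:" line that current Linux kernels report in
/proc/meminfo, which may not exist on the older kernels discussed in this
thread. The sample numbers are invented.

```python
import os


def dirty_kib(meminfo_text):
    """Return the 'Dirty' value in KiB from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("Dirty:"):
            return int(line.split()[1])
    return None


# Sample text in the /proc/meminfo layout; the numbers are invented.
sample = "MemTotal:  1024000 kB\nDirty:  348 kB\nWriteback:  0 kB\n"
sample_dirty = dirty_kib(sample)
print("sample:", sample_dirty, "KiB waiting to be written")

# On a Linux machine, read the live value as well.
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        print("live:", dirty_kib(f.read()), "KiB waiting to be written")
```

A value that has dropped to (or near) zero suggests the cached data has
reached the disk.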

Usually, when you do it this way, the system will boot up without any major
problems. But it will take quite some time, because the system will detect
that it wasn't properly shut down the last time and will check all your
disks. This is comparable to Windows' ScanDisk. In at least 9 out of 10
cases everything will work out OK (except for some diagnostic messages that
needn't worry you too much). Only in rare cases will the disk check program
(fsck) find errors on the disk that it can't handle; it will then stop and ask
you to do the repair manually. It is therefore a very good idea to read fsck's
man page in advance and have a hard copy around, just in case...

I never, ever had to re-install my system after switching off the computer
without a proper shutdown (and while playing around with a module I tried to
write I had to do it more often than I like to ;-). So I guess you mistook the
disk check for a crash, which it isn't.

                                                  HTH, Jens
--
        _  _____  _____

    _  | |  | |    | |           AG Moebius, Institut fuer Molekuelphysik
   | |_| |  | |    | |           Fachbereich Physik, Freie Universitaet Berlin
    \___/ens|_|homs|_|oerring    Tel: ++49 (0)30 838 - 53394 / FAX: - 56046

 
 
 

Post by Tim Hayne » Wed, 08 Nov 2000 04:00:00



> The answer to some of your questions lies in the fact that different
> kernel versions of Linux behave differently.

...and?

> I have worked with Linux Kernel 2.0.x, 2.1.x and 2.2.x and all behave
> differently, and to make things worse, they behave according to who had
> packaged them.

ITYM "what options you compiled in". See the APM support stuff...

> I have shut down Red Hat 5.0, 5.1, 5.2, 6.0, 6.1, 6.2, and 7.0 by just
> hitting the power off switch without any problem after reboot. The only
> thing the kernel will do is run fsck

*boggle*! Kernels don't run anything, apart from whatever init points to on
the kernel parameter line. (And in 2.4.x, the khttpd...)

> > I am learning Linux. I have a bunch of Linux books, and four books on
> > Unix. I've read most of most of them, a few cover to cover, to get the
> > gist of what I'm getting into. I've asked about the subject line in
> > several Linux newsgroups and not gotten definitive answers. One of my
> > Unix books (Unix Secrets, 2nd Ed., by James C. Armstrong, an IDG book,
> > 1999) has a short and inadequate chapter (chapter 51, all of seven pages
> > long) on crashes, and it's broken down into hardware crashes and software
> > crashes.

For the OP's benefit, if you're looking into hardware problems, bear this
in mind as a starting point to identify what's going wrong:

If you're getting SIGSEGVs (Signal 11) when the machine is under load, it's
most often one of two things:
        a) dodgy RAM, e.g. mis-seated or ill-fitting modules, incompatible
           speeds, or just plain Dodgy

        b) CPU overheating, e.g. due to a kaputt CPU fan

You can tell the two apart by a simple trick: recompile the kernel while
running X, make a mental note of where it fails and how long it takes.
Repeat. Rinse. If it takes progressively shorter time intervals, you're
looking at an overheating problem. Otherwise, it's probably memory.
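
That rule of thumb is simple enough to write down. Here is a small Python
sketch of it; the function name, the three-run minimum, and the timings are
all invented for illustration, and the result is a hint for where to look
first, not a verdict.

```python
def diagnose(seconds_to_failure):
    """Guess the cause from how long each repeated build ran before failing."""
    if len(seconds_to_failure) < 3:
        return "need more runs"
    # Overheating: every run fails sooner than the one before it.
    shrinking = all(
        later < earlier
        for earlier, later in zip(seconds_to_failure, seconds_to_failure[1:])
    )
    return "overheating suspected" if shrinking else "memory suspected"


# Invented timings, in seconds, for back-to-back kernel builds.
print(diagnose([610, 420, 250, 90]))
print(diagnose([300, 640, 120, 480]))
```

The idea is that a box that is already warm from the previous build reaches
the failing temperature sooner each time, while bad memory fails at times
that follow no such trend.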

There's a SIG11 FAQ floating around - see <http://www.bitwizard.nl/sig11/>
for that :)

HTH :)

~Tim
--

                                                | http://piglet.is.dreaming.org

 
 
 

Post by Arthur Sower » Wed, 08 Nov 2000 04:00:00


email and post...

I'm not surprised at your answer re the different kernels. I've noticed
differences on different hardware platforms even with the same
distribution. See below for more...


> The answer to some of your questions lies in the fact that different kernel
> versions of Linux behave differently. I have worked with Linux Kernel 2.0.x,
> 2.1.x and 2.2.x and all behave differently, and to make things worse, they
> behave according to who had packaged them.

> I have shut down Red Hat 5.0, 5.1, 5.2, 6.0, 6.1, 6.2, and 7.0 by just hitting
> the power off switch without any problem after reboot.

Amazing.

> The only thing the
> kernel will do is run fsck to determine the HD integrity. Incidentally,
> Solaris does this too. (I have to inject some Solaris content since we are at
> the Solaris News group!)

> What I suggest that you do is run a hardware check. Your disk on the RH 5.2
> might be the culprit. I am also assuming that you have patched - updated RH
> 5.2 from update rpm files - the installation.

I have other OSs on the same HD and same box. No problems. No, I have not
applied any update rpm files to the installation.

> OS crashes might do harm to the entire system but not do it every now and
> then.

This has only happened when I got into some "thing" and could not get out
of it and I'm sorry I don't have a written record of what I did that led
me into the traps. I just want to learn how to avoid these problems in the
future.

Thanks for your response, however.

Art Sowers


 
 
 

Post by Arthur Sower » Wed, 08 Nov 2000 04:00:00


email (in case the post does not get to your server) and post (for
everyone else's benefit)...

Jens, Thanks for your explanations. I had not thought about the "wait
at least ten seconds [or a little more]". I did start learning how to use
the "ps" and "kill" and also the virtual login (the ctrl-alt-F1
stuff) after I got more real time experience, then read the book some
more, then more real time experience, etc. The more I read and learn, the
less often I make mistakes and so I have not gotten these crashes as often
as before because I know more about what not to do (i.e. the learning
curve).

I also take it from your explanation below that if my boot up hangs at
some step (any step? or just some particular step?), then it might not be
a real crash but some kind of delay and I just need to wait a little
longer to see if it continues. Right?

Also, a compliment: both your writing and comprehension are better than
average. Thank you for taking the time to help. And, yes, I'm not planning
any multi-user work on this, just a single desktop operation (I want an
alternative to MS-Windows, even if I have to learn a lot, and for a number
of reasons).

Art Sowers


 
 
 

Post by Arthur Sower » Wed, 08 Nov 2000 04:00:00


email and post...

Thanks to Tim for the additional info and suggestions. I do have a cooling fan
with a bearing that is starting to make noise when it is cold, but I had
the crashes a long, long time ago and asked my questions on the Linux
newsgroups and got no good answers, no good speculations, not even lousy
guesses. So I figured bigger brains might be on the unix-solaris
NGs. Hence the post.

Art Sowers
