Kernel bug? Real Time Clock Freezes

Post by Adam New » Tue, 03 Jul 2001 20:20:23



We've had a recurring problem with our Dell PowerEdge 2450 servers,
running kernel 2.4.2 as part of Debian Linux, with SMP. We compiled
the kernel optimised for Pentium Pro with 'Enhanced Real Time Clock
support' on.

The servers are dual Pentium III Dell PowerEdge 2450 servers, with 2G
of RAM and four 9G disks configured in hardware for RAID 5. One
machine has 2x866MHz processors and the other has 2x1GHz processors.
They both have onboard Intel EtherExpressPro 10/100Mbps network cards,
forced to 100Mbps half duplex using kernel parameters.

The problem is:

--
Periodically, and without apparent pattern, the machine starts to
exhibit packet loss, which gradually worsens. Logged-in terminals drop
characters.

At this point, the real time clock stops on the machine, which
prevents many  time-dependent processes from running correctly.

The machine refuses to reboot with reboot, /sbin/shutdown or init 6.
Running a script to call the reboot() function hard-reboots the
machine.

Engineers on site report that the console is blank and the machine
does not respond to keyboard input.
--
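
The reboot() workaround described above could look something like the sketch below. It is not the script we actually run, just a minimal illustration that assumes Python 3 with ctypes and glibc's one-argument reboot() wrapper; hard_reboot is a made-up name. It skips init and the shutdown scripts entirely, so filesystems are not unmounted.

```python
import ctypes
import ctypes.util
import os

# LINUX_REBOOT_CMD_RESTART from <linux/reboot.h> (glibc calls it RB_AUTOBOOT).
RB_AUTOBOOT = 0x01234567


def hard_reboot():
    """Ask the kernel to restart at once, bypassing init and shutdown scripts.

    Needs root. Nothing is unmounted, so dirty buffers are flushed first.
    """
    os.sync()
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    # On success this call does not return; on failure it returns -1.
    if libc.reboot(RB_AUTOBOOT) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

# Usage (as root, and only when a clean shutdown has already failed):
#     hard_reboot()
```
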
One of the machines previously functioned as a MySQL database server,
and was extremely reliable.

After converting it to run as a Web server, running apache 1.3.12 (max
512 concurrent processes), and proprietary software (Perl scripts
running on perl 5.6 on this machine, on perl 5.005 on the other) and
cron jobs (cron version 3.0pl1-56), the above problem started.

We've run Dell's extensive hardware diagnostics package on the
machines and found nothing (NB: this software may switch
your network card into 10Mbps ('slow ethernet') mode).

Dell support have no ideas.

Any help with this problem would be much appreciated, as these
machines are going down with annoying frequency!

 
 
 

Post by Stev » Tue, 03 Jul 2001 21:49:36



[description snipped]

Personally I'd scrutinise the log files, and also try running one
machine at a time for as long as possible, if that's a workable
proposition; it may help you narrow the problem down to an individual
machine.

And if none of this yields anything, I'd change all the chip fans for
new ones; it sounds like something is stopping unexpectedly somewhere.

Just my 2p worth.

--
Cheers

%HAV-A-NICEDAY Error not enough coffee  0 pps.

web http://www.zeropps.uklinux.net/

or  http://start.at/zero-pps

  1:41pm  up 11:35,  2 users,  load average: 1.00, 1.00, 1.00

 
 
 

Post by James T. Denni » Sat, 07 Jul 2001 17:22:55



> We've had a recurring problem with our Dell PowerEdge 2450 servers,
> running kernel 2.4.2 as part of Debian Linux, with SMP. We compiled
> the kernel optimised for Pentium Pro with 'Enhanced Real Time Clock
> support' on.

        I'm currently deploying a stack of these 2450s and
        2550s (and some of their little 1U 1550s).

        Most of these have Rat Head, ^H^H^H^H^H Red Hat 7.0
        installed on them.  I have Debian/testing installed
        on one of them, using a custom-built 2.2.19 kernel
        with AACRAID driver patches to support the Adaptec RAID
        controllers that I have in most of them.

        Tomorrow I hope to track down the Dell OMSA (open
        management system architecture?) kernel patches which
        provide instrumentation interfaces for their UCD SNMP
        agent modules and MIBs.   Have you got those ported/linked
        to your kernel?

        (Dell uses Adaptec and AMI "MegaRAID" controllers in
        these --- refers to all of them as PERC/XXX and I haven't
        yet grokked the pattern to which systems use which controller
        and which /XXX extensions to /PERC refer to which
        systems/controllers).

        Does the 2.4.x kernel support AACRAID "out of the box?"

> The servers are dual Pentium III Dell PowerEdge 2450 servers, with 2G
> of RAM and four 9G disks configured in hardware for RAID 5. One
> machine has 2x866MHz processors and the other has 2x1GHz processors.
> They both have onboard Intel EtherExpressPro 10/100Mbps network cards,
> forced to 100Mbps half duplex using kernel parameters.
> The problem is:

        Mine have similar configurations (much less RAM,
        but they're dual processor, similar speed, same
        hard disks, though I'm just using mirroring, not RAID).

        I'm not fussing with the eepro settings (in fact I don't
        even know the options for that).  Why are you setting them
        to half duplex?

> --
> Periodically, and without apparent pattern, the machine starts to
> exhibit packet loss, which gradually worsens. Logged-in terminals drop
> characters.
> At this point, the real time clock stops on the machine, which
> prevents many  time-dependent processes from running correctly.
> The machine refuses to reboot with reboot, /sbin/shutdown or init 6.
> Running a script to call the reboot() function hard-reboots the
> machine.
> Engineers on site report that the console is blank and the machine
> does not respond to keyboard input.

        That sounds like a pretty hard lock up.  Have they tried
        Magic SysRq? (Do you have that feature enabled in your
        kernels; do they know how to use it?)

> --
> One of the machines previously functioned as a MySQL database server,
> and was extremely reliable.
> After converting it to run as a Web server, running apache 1.3.12 (max
> 512 concurrent processes), and proprietary software (Perl scripts
> running on perl 5.6 on this machine, on perl 5.005 on the other) and
> cron jobs (cron version 3.0pl1-56), the above problem started.
> We've run Dell's highly-extensive hardware diagnostics package on the
> machines and found nothing (NB take note: this software may switch
> your network card into 10Mbps ('slow ethernet') mode).
> Dell support have no ideas.
> Any help with this problem would be much appreciated, as these
> machines are going down with annoying frequency!

        Have you considered trying a newer 2.4.x kernel?
        Perhaps you're being bitten by one or more of the bugs
        that have been fixed between 2.4.2 and 2.4.6.  Perhaps you
        should also consider trying Alan Cox's 2.4.5ac25 or
        so.  At least have your engineers/programmers look through
        the change logs to see if any issues listed there relate
        to your hardware/software mixture.

        Have you considered going back to a 2.2.19 or 2.2.20pre7
        kernel?  If it's stable thereunder, and dies under
        2.4.6 then it's definitely something that the kernel
        team should hear about.

        What if you let the NICs run in full-duplex mode?  What
        if you pop a netgear or some other NIC into the backplane
        and run the web services off of eth1 (or eth2 as the case
        might be).  (I'm not suggesting that as a solution for all
        of your systems, but as a troubleshooting step on one of
        them.  The first goal is to isolate the problem to a
        specific driver or condition)

        Try increasing your kernel's verbosity (boot with the
        debug parameter; see bootparam(7)).  Set up a remote
        syslog host to capture as much of the log as you can,
        although the nature of your problem suggests that the
        network might fail before the most interesting log
        messages get through (since it's your network card that's
        getting hammered).  It's possible to configure syslog to write
        to a device (such as /dev/ttyS0); so you could have a
        null modem to another (more stable, less loaded) system
        which is running something like: tail -f /dev/ttySX |
        tee /var/log/othersystem.log
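
On the receiving machine, the tail/tee pipeline above does roughly what this sketch does. It is only an illustration, assuming Python 3 and a serial port already set to the right speed (e.g. with stty); tee_lines and the device and log names in the usage comment are placeholders, not anything from a real setup.

```python
import sys


def tee_lines(source, log, echo=sys.stdout):
    """Copy each line from source to both log and echo, flushing immediately.

    Flushing after every line matters here: the last lines before a hang
    are the interesting ones, and they must not sit in a buffer.
    Returns the number of lines copied.
    """
    count = 0
    for line in source:
        for out in (log, echo):
            out.write(line)
            out.flush()
        count += 1
    return count

# Usage (hypothetical paths):
#     with open("/dev/ttyS1", "r", errors="replace") as tty, \
#          open("/var/log/othersystem.log", "a") as log:
#         tee_lines(tty, log)
```
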

        BTW: consider configuring your system with a serial
        console; enable the appropriate option in the kernel
        and/or add the serial option to your LILO configuration
        (or are you using GRUB?) and run mgetty or agetty on one
        of the serial ports.  It's possible that your system can still respond
        on the serial console even when the video/keyboard console
        is completely hung.  (The problem might be that the kernel
        is still running, but it's lost its notion of the time and
        thus doesn't know how to unblank the screen. For that matter,
        try running setterm -blank 0 to disable the console video
        blanking).
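
One way to test the "lost its notion of the time" theory is to compare how far the hardware clock and the kernel's clock each advance over the same interval. The sketch below is only that, a sketch: it assumes the RTC driver exposes rtc_time and rtc_date lines in /proc/rtc (or /proc/driver/rtc on other kernels), and parse_rtc and system_clock_stalled are made-up helper names. If both clocks freeze together it will show nothing.

```python
import re
from datetime import datetime, timedelta

# Assumed field layout, e.g. "rtc_time\t: 20:20:23" and "rtc_date\t: 2001-07-03".
_RTC_TIME = re.compile(r"^rtc_time\s*:\s*(\d{2}:\d{2}:\d{2})", re.M)
_RTC_DATE = re.compile(r"^rtc_date\s*:\s*(\d{4}-\d{2}-\d{2})", re.M)


def parse_rtc(text):
    """Return the hardware clock reading found in the given /proc text."""
    time_match = _RTC_TIME.search(text)
    date_match = _RTC_DATE.search(text)
    if not (time_match and date_match):
        raise ValueError("no rtc_time/rtc_date fields found")
    stamp = date_match.group(1) + " " + time_match.group(1)
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")


def system_clock_stalled(rtc_elapsed, sys_elapsed,
                         tolerance=timedelta(seconds=2)):
    """True if the hardware clock advanced but the system clock lagged it."""
    return rtc_elapsed - sys_elapsed > tolerance

# Usage: read the RTC text and datetime.now() twice, a minute or so apart,
# then pass the two differences to system_clock_stalled().
```
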

        I'd also set a panic= directive (in LILO on the append
        line) or echo 240 > /proc/sys/kernel/panic; either of these
        will instruct the kernel to reboot after a delay (that's the
        number of seconds) when it panics.  Otherwise it may be
        sitting with a panic report on a blanked screen forever.
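
If you'd rather set this from a boot-time script than by hand, the echo above amounts to the following. A minimal sketch, assuming Python 3 and root privileges; the two function names are invented for illustration, and the path is the one given above.

```python
PANIC_PATH = "/proc/sys/kernel/panic"


def set_panic_timeout(seconds, path=PANIC_PATH):
    """Tell the kernel to reboot this many seconds after a panic (0 = never)."""
    with open(path, "w") as f:
        f.write("%d\n" % seconds)


def get_panic_timeout(path=PANIC_PATH):
    """Read back the current setting so a script can confirm it took effect."""
    with open(path) as f:
        return int(f.read().strip())

# Usage (as root):
#     set_panic_timeout(240)
```

The setting does not survive a reboot, so it has to be reapplied at every boot unless panic= is on the kernel command line.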

        Finally, recompile your kernel with watchdog timer support
        (software watchdog) and install the Debian watchdog
        package (apt-get install watchdog).  It will be interesting
        to see if it successfully "brings the system back from
        the dead" (by rebooting it).
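
For anyone curious what the watchdog daemon is doing underneath, the core of it is just a loop that keeps writing to /dev/watchdog. The sketch below is not the Debian package's code, only an illustration assuming Python 3; pet_watchdog and its parameters are hypothetical. The real daemon also runs health checks before each write, which is the whole point of using it.

```python
import time


def pet_watchdog(device="/dev/watchdog", interval=10.0, beats=None):
    """Write one byte to the watchdog device every `interval` seconds.

    Opening the device arms the timer. If the writes stop for longer than
    the driver's timeout (because this process or the kernel has hung),
    the machine is rebooted. beats=None means loop forever; a number
    limits the writes, which is mainly useful for testing.
    Returns the number of writes made.
    """
    sent = 0
    with open(device, "wb", buffering=0) as dog:
        while beats is None or sent < beats:
            dog.write(b"\0")
            sent += 1
            if beats is None or sent < beats:
                time.sleep(interval)
    return sent

# Usage (as root, with the software watchdog driver loaded):
#     pet_watchdog()
```

What happens when the device is closed depends on the driver and how it was built, so don't rely on a clean exit disarming the timer.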

        BTW: most of these suggestions should be standard practice
        for rackmount systems: serial consoles, panic=, watchdog
        support (possibly with the addition of hardware watchdog
        timer cards; Dell doesn't seem to have them built into these
        motherboards) and remote or serial syslog targets are
        de rigueur for the data center.  I also like starting certain
        critical daemons from /etc/inittab (under respawn directives)
        --- things like syslogd (use -n), cron, sshd (use the new -D
        option, in the latest OpenSSH).

 
 
 
