Segfaults on Server after long uptime using 2.2.x

Post by Tom Kyl » Thu, 13 May 1999 04:00:00



I have a dual P2 server that is experiencing segfaults if I try to use
fdisk or lilo after the server has been up for long periods of time.
Everything else seems to work fine, but then again this machine doesn't
see much action since it's just a test server for our department to see
how well Linux runs.

Anyway, the errors I receive are something like this:

May 11 14:36:18 empire kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000024
May 11 14:36:18 empire kernel: current->tss.cr3 = 0e50c000, %cr3 =
0e50c000
May 11 14:36:18 empire kernel: *pde = 00000000
May 11 14:36:18 empire kernel: Oops: 0000
May 11 14:36:18 empire kernel: CPU:    0
May 11 14:36:18 empire kernel: EIP:    0010:[<c01380d6>]
May 11 14:36:18 empire kernel: EFLAGS: 00010286
May 11 14:36:18 empire kernel: eax: 00000004   ebx: cfa8ab60   ecx:
00000000   edx: 00000000
May 11 14:36:18 empire kernel: esi: cfa8ab00   edi: 0000ffff   ebp:
00000000   esp: cd3d3f6c
May 11 14:36:18 empire kernel: ds: 0018   es: 0018   ss: 0018
May 11 14:36:18 empire kernel: Process fdisk (pid: 32599, process nr:
55, stackpage=cd3d3000)
May 11 14:36:18 empire kernel: Stack: 0000ffff 00000000 c01280ce
00000001 c0138366 cfa8ab60 00000000 00000000
May 11 14:36:18 empire kernel:        00000000 00000000 c01281e2
00000000 ffffffff 00000000 00000000 00000000
May 11 14:36:18 empire kernel:        00000000 cd3d2000 c0128213
00000000 cd3d2000 c0108c00 00000004 0000002c
May 11 14:36:18 empire kernel: Call Trace: [<c01280ce>] [<c0138366>]
[<c01281e2>] [<c0128213>] [<c0108c00>]
May 11 14:36:18 empire kernel: Code: 8b 74 91 24 f6 43 2c 01 74 09 53 e8
02 ff ff ff 83 c4 04 80
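
For anyone trying to read the dump: a rough sketch of decoding the first instruction on the "Code:" line by hand. It assumes the Code: bytes start at the faulting EIP (as 2.2-era oopses seem to print them), and it only handles this one encoding (opcode 8b with a SIB byte and an 8-bit displacement). The names REGS, CODE and decode_load are made up for the example.

```python
# Register values copied from the oops above.
REGS = {
    "eax": 0x00000004, "ebx": 0xCFA8AB60, "ecx": 0x00000000, "edx": 0x00000000,
    "esi": 0xCFA8AB00, "edi": 0x0000FFFF, "ebp": 0x00000000, "esp": 0xCD3D3F6C,
}

# Encoding order of the 32-bit general registers in x86 ModRM/SIB fields.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

# First four bytes of the "Code:" line.
CODE = bytes.fromhex("8b749124")


def decode_load(code, regs):
    """Decode `8b /r` (mov r/m32 -> r32) for the mod=01 + SIB form only."""
    opcode, modrm, sib, disp8 = code[:4]
    if opcode != 0x8B:
        raise ValueError("not a 32-bit register load")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm != 4:
        raise ValueError("only the disp8 + SIB form is handled here")
    scale, index, base = 1 << (sib >> 6), (sib >> 3) & 7, sib & 7
    if index == 4:
        raise ValueError("index field 100 means 'no index'; not handled")
    # The 8-bit displacement is sign-extended.
    disp = disp8 - 256 if disp8 >= 0x80 else disp8
    addr = (regs[REG32[base]] + regs[REG32[index]] * scale + disp) & 0xFFFFFFFF
    return REG32[reg], REG32[base], REG32[index], scale, disp, addr


dest, base, index, scale, disp, addr = decode_load(CODE, REGS)
print(f"mov 0x{disp:x}(%{base},%{index},{scale}),%{dest}  -> reads 0x{addr:08x}")
```

Under that assumption the instruction loads from ecx + edx*4 + 0x24, and with ecx and edx both zero that is address 0x24, which matches the "virtual address 00000024" in the first line of the oops. That would suggest a NULL base pointer in ecx rather than random memory corruption, but it does not say which kernel function is involved.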

I've gotten these errors since 2.2.5, and I've since recompiled from a
fresh tar file, just in case a file or two got corrupted.  Since the
problems seem to be related to SCSI disk access, here's a little info
about the SCSI setup on the system from dmesg:

(scsi0) <Adaptec AIC-7890/1 Ultra2 SCSI host adapter> found at PCI 6/0
(scsi0) Wide Channel, SCSI ID=7, 32/255 SCBs
(scsi0) Downloading sequencer code... 407 instructions downloaded
scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.1.10/3.2.4
       <Adaptec AIC-7890/1 Ultra2 SCSI host adapter>
scsi : 1 host.
  Vendor: TOSHIBA   Model: CD-ROM XM-6401TA  Rev: 1009
  Type:   CD-ROM                             ANSI SCSI revision: 02
Detected scsi CD-ROM sr0 at scsi0, channel 0, id 4, lun 0
(scsi0:0:4:0) Synchronous at 20.0 Mbyte/sec, offset 16.
  Vendor: HP        Model: C1537A            Rev: L706
  Type:   Sequential-Access                  ANSI SCSI revision: 02
Detected scsi tape st0 at scsi0, channel 0, id 5, lun 0
(scsi0:0:5:0) Synchronous at 10.0 Mbyte/sec, offset 32.
  Vendor: QUANTUM   Model: VIKING II 9.1WLS  Rev: 5520
  Type:   Direct-Access                      ANSI SCSI revision: 02
Detected scsi disk sda at scsi0, channel 0, id 6, lun 0
(scsi0:0:6:0) Synchronous at 80.0 Mbyte/sec, offset 31.
  Vendor: QUANTUM   Model: VIKING II 9.1WLS  Rev: 5520
  Type:   Direct-Access                      ANSI SCSI revision: 02
Detected scsi disk sdb at scsi0, channel 0, id 10, lun 0
(scsi0:0:10:0) Synchronous at 80.0 Mbyte/sec, offset 31.
scsi : detected 1 SCSI tape 1 SCSI cdrom 2 SCSI disks total.
Uniform CDROM driver Revision: 2.54
SCSI device sda: hdwr sector= 512 bytes. Sectors= 17836668 [8709 MB]
[8.7 GB]
SCSI device sdb: hdwr sector= 512 bytes. Sectors= 17836668 [8709 MB]
[8.7 GB]

I'm using the aic7xxx driver compiled into the kernel.  The chipset
itself is onboard an Asus P2B-DS motherboard.

I'd be *very* interested in hearing from anyone who's run into the same
problems and/or understands these register/stack dumps...
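
The usual first step with a dump like this is to map the EIP and Call Trace addresses to function names using the System.map that belongs to the exact kernel that oopsed (the ksymoops tool does this properly). Below is a minimal sketch of the idea only. The three symbol names in SAMPLE_MAP are placeholders invented for the example, not real symbols from this kernel, and the helper names are hypothetical.

```python
import bisect
import re

# Placeholder System.map contents; real entries come from the System.map
# built together with the running kernel.
SAMPLE_MAP = """\
c0108000 T placeholder_a
c0128000 T placeholder_b
c0138000 T placeholder_c
"""

# Two lines from the oops above, joined back onto single lines.
SAMPLE_OOPS = """\
EIP:    0010:[<c01380d6>]
Call Trace: [<c01280ce>] [<c0138366>] [<c01281e2>] [<c0128213>] [<c0108c00>]
"""


def load_system_map(text):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return (name, offset) of the nearest symbol at or below addr."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, name = symbols[i]
    return name, addr - start


def trace_addresses(oops_text):
    """Collect every [<xxxxxxxx>] address in the order it appears."""
    return [int(m, 16) for m in re.findall(r"\[<([0-9a-f]{8})>\]", oops_text)]


symbols = load_system_map(SAMPLE_MAP)
for addr in trace_addresses(SAMPLE_OOPS):
    name, offset = resolve(symbols, addr)
    print(f"{addr:08x}  {name}+0x{offset:x}")
```

With a real System.map the printed names would show which code path fdisk was in when the NULL pointer was dereferenced. The nearest-lower-symbol lookup is only an approximation, since it knows nothing about modules or where a function ends.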

Thanks,

Tom Kyle
Jr Unix Admin
Univ. of Missouri-St. Louis

 
 
 

1. BUG: AIX 4.3.2 nfs-server corrupts files using Linux 2.2 client

With Linux 2.2.10 as the NFS client and AIX 4.3.2
(bos.net.nfs.server.4.3.2.1) as the NFS server, we get corrupted files
under heavy NFS access (for example, running make on a large program
like Apache, where the corruption shows up as link errors).

We had _exactly_ this problem against our Solaris 2.5.1 fileserver, and
the bug was traced to the Solaris nfs-server, NOT the Linux NFS-client
(although it is slowish and such).

The patch we applied was T105299-02 (a test patch, for some reason, on
Solaris 2.5.1; the fix is included in Solaris 7). The SunSolve bug ID is
4071076, "data over length in nfs header was written to disk".

Linus Torvalds's comment on the Solaris bug was:
"Actually, it appears fine on the wire, this particular problem seems to
be due to Solaris getting a "merge adjacent packets" case wrong when the
merge happens to cross a 8kB boundary and the data payload of the first
packet is not divisible by four.."

Since the behaviour is the same we suspect it's the same bug. I have
been hoping that IBM would fix this but since it's existed for a while
now I've kind of lost my hope...

/Nikke - not fond of nfs corruption...
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

---------------------------------------------------------------------------
 I'm defending her honor - more than she ever did.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
