Filesystem Corruption (ext2) on Tyan S2462, 2xAMD1900MP, 2.4.17SMP (RH7.2)

Post by Martin Knoblauch » Thu, 21 Mar 2002 03:50:10



Hi,

 what could be the cause of filesystem corruption on a Tyan Thunder 2462
dual Athlon 1900MP? The system has 2GB "registered" ECC memory and is
running a 2.4.17SMP kernel.

 A customer of ours has 8 of these beasts and one has started "acting up",
basically destroying the root partition within minutes after booting a
fresh installation. The other 7 (identical) systems are OK. Is there
anything that one should look for (like the noapic thing)?

 Included is the messages file with three boot sequences. Interestingly
the first sequence seems to misdetect the system completely .... At the
end of the file the number of junk characters seems to increase quite a
bit :-(

 Unfortunately I am just in remote debugging mode right now. Guess I
will see the system on Thursday.

TIA
Martin
--
------------------------------------------------------------------
Martin Knoblauch         |    email:  Martin.Knobla...@TeraPort.de
TeraPort GmbH            |    Phone:  +49-89-510857-309
C+ITS                    |    Fax:    +49-89-510857-111
http://www.teraport.de   |    Mobile: +49-170-4904759

[ messages 77K ]
Mar 19 09:11:14 fems146 syslogd 1.4.1: restart.
Mar 19 09:11:14 fems146 syslog: syslogd startup succeeded
Mar 19 09:11:14 fems146 syslog: klogd startup succeeded
Mar 19 09:11:14 fems146 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Mar 19 09:11:14 fems146 kernel: Inspecting /boot/System.map-2.4.17smp
Mar 19 09:11:14 fems146 portmap: portmap startup succeeded
Mar 19 09:11:14 fems146 kernel: Loaded 16042 symbols from /boot/System.map-2.4.17smp.
Mar 19 09:11:14 fems146 kernel: Symbols match kernel version 2.4.17.
Mar 19 09:11:14 fems146 kernel: Loaded 10 symbols from 2 modules.
Mar 19 09:11:14 fems146 kernel: Linux version 2.4.17smp (root@dino) (gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)) #4 SMP Die Feb 5 14:00:50 CET 2002
Mar 19 09:11:14 fems146 kernel: BIOS-provided physical RAM map:
Mar 19 09:11:14 fems146 kernel:  BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
Mar 19 09:11:14 fems146 kernel:  BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
Mar 19 09:11:14 fems146 kernel:  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
Mar 19 09:11:14 fems146 kernel:  BIOS-e820: 0000000000100000 - 000000000ffd0000 (usable)
Mar 19 09:11:14 fems146 kernel:  BIOS-e820: 000000000ffd0000 - 000000000fff0000 (ACPI NVS)
Mar 19 09:11:14 fems146 kernel:  BIOS-e820: 000000000fff0000 - 0000000010000000 (usable)
Mar 19 09:11:14 fems146 kernel:  BIOS-e820: 00000000feea0000 - 0000000100000000 (reserved)
Mar 19 09:11:14 fems146 nfslock: rpc.statd startup succeeded
Mar 19 09:11:14 fems146 kernel: found SMP MP-table at 000fbfe0
Mar 19 09:11:14 fems146 kernel: hm, page 000fb000 reserved twice.
Mar 19 09:11:14 fems146 kernel: hm, page 000fc000 reserved twice.
Mar 19 09:11:14 fems146 rpc.statd[591]: Version 0.3.1 Starting
Mar 19 09:11:14 fems146 kernel: hm, page 000e9000 reserved twice.
Mar 19 09:11:14 fems146 kernel: hm, page 000ea000 reserved twice.
Mar 19 09:11:14 fems146 kernel: On node 0 totalpages: 65536
Mar 19 09:11:14 fems146 kernel: zone(0): 4096 pages.
Mar 19 09:11:14 fems146 kernel: zone(1): 61440 pages.
Mar 19 09:11:14 fems146 kernel: zone(2): 0 pages.
Mar 19 09:11:14 fems146 kernel: Intel MultiProcessor Specification v1.4
Mar 19 09:11:14 fems146 kernel:     Virtual Wire compatibility mode.
Mar 19 09:11:14 fems146 kernel: OEM ID: COMPAQ   Product ID: Deskpro      APIC at: 0xFEE00000
Mar 19 09:11:14 fems146 kernel: Processor #0 Pentium(tm) Pro APIC version 16
Mar 19 09:11:14 fems146 kernel: I/O APIC #8 Version 17 at 0xFEC00000.
Mar 19 09:11:14 fems146 kernel: Processors: 1
Mar 19 09:11:14 fems146 kernel: Kernel command line: BOOT_IMAGE=2.4.17smp ro root=302 BOOT_FILE=/boot/vmlinuz-2.4.17smp
Mar 19 09:11:14 fems146 kernel: Initializing CPU#0
Mar 19 09:11:14 fems146 kernel: Detected 730.907 MHz processor.
Mar 19 09:11:14 fems146 kernel: Console: colour VGA+ 80x25
Mar 19 09:11:14 fems146 kernel: Calibrating delay loop... 1458.17 BogoMIPS
Mar 19 09:11:14 fems146 kernel: Memory: 254928k/262144k available (1279k kernel code, 6700k reserved, 386k data, 236k init, 0k highmem)
Mar 19 09:11:14 fems146 kernel: Dentry-cache hash table entries: 32768 (order: 6, 262144 bytes)
Mar 19 09:11:14 fems146 kernel: Inode-cache hash table entries: 16384 (order: 5, 131072 bytes)
Mar 19 09:11:14 fems146 kernel: Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
Mar 19 09:11:14 fems146 kernel: Buffer-cache hash table entries: 16384 (order: 4, 65536 bytes)
Mar 19 09:11:14 fems146 kernel: Page-cache hash table entries: 65536 (order: 6, 262144 bytes)
Mar 19 09:11:14 fems146 kernel: CPU: L1 I cache: 16K, L1 D cache: 16K
Mar 19 09:11:14 fems146 kernel: CPU: L2 cache: 256K
Mar 19 09:11:14 fems146 kernel: Intel machine check architecture supported.
Mar 19 09:11:14 fems146 kernel: Intel machine check reporting enabled on CPU#0.
Mar 19 09:11:14 fems146 kernel: Enabling fast FPU save and restore... done.
Mar 19 09:11:14 fems146 kernel: Enabling unmasked SIMD FPU exception support... done.
Mar 19 09:11:14 fems146 kernel: Checking 'hlt' instruction... OK.
Mar 19 09:11:14 fems146 kernel: POSIX conformance testing by UNIFIX
Mar 19 09:11:14 fems146 kernel: mtrr: v1.40 (20010327) Richard Gooch (rgo...@atnf.csiro.au)
Mar 19 09:11:14 fems146 kernel: mtrr: detected mtrr type: Intel
Mar 19 09:11:14 fems146 kernel: CPU: L1 I cache: 16K, L1 D cache: 16K
Mar 19 09:11:14 fems146 kernel: CPU: L2 cache: 256K
Mar 19 09:11:14 fems146 kernel: Intel machine check reporting enabled on CPU#0.
Mar 19 09:11:14 fems146 kernel: CPU0: Intel Pentium III (Coppermine) stepping 06
Mar 19 09:11:14 fems146 kernel: per-CPU timeslice cutoff: 731.53 usecs.
Mar 19 09:11:14 fems146 kernel: enabled ExtINT on CPU#0
Mar 19 09:11:14 fems146 kernel: ESR value before enabling vector: 00000000
Mar 19 09:11:14 fems146 kernel: ESR value after enabling vector: 00000000
Mar 19 09:11:14 fems146 kernel: Error: only one processor found.
Mar 19 09:11:14 fems146 kernel: ENABLING IO-APIC IRQs
Mar 19 09:11:14 fems146 kernel: Setting 8 in the phys_id_present_map
Mar 19 09:11:14 fems146 kernel: ...changing IO-APIC physical APIC ID to 8 ... ok.
Mar 19 09:11:14 fems146 kernel: ..TIMER: vector=0x31 pin1=-1 pin2=-1
Mar 19 09:11:14 fems146 kernel: ...trying to set up timer (IRQ0) through the 8259A ...  failed.
Mar 19 09:11:14 fems146 kernel: ...trying to set up timer as Virtual Wire IRQ... works.
Mar 19 09:11:14 fems146 kernel: testing the IO APIC.......................
Mar 19 09:11:14 fems146 kernel:
Mar 19 09:11:14 fems146 kernel: .................................... done.
Mar 19 09:11:14 fems146 kernel: Using local APIC timer interrupts.
Mar 19 09:11:14 fems146 kernel: calibrating APIC timer ...
Mar 19 09:11:14 fems146 kernel: ..... CPU clock speed is 730.9247 MHz.
Mar 19 09:11:14 fems146 kernel: ..... host bus clock speed is 132.8952 MHz.
Mar 19 09:11:14 fems146 kernel: cpu: 0, clocks: 1328952, slice: 664476
Mar 19 09:11:14 fems146 kernel: CPU0<T0:1328944,T1:664464,D:4,S:664476,C:1328952>
Mar 19 09:11:14 fems146 kernel: Waiting on wait_init_idle (map = 0x0)
Mar 19 09:11:14 fems146 kernel: All processors have done init_idle
Mar 19 09:11:14 fems146 kernel: PCI: PCI BIOS revision 2.10 entry at 0xe838d, last bus=2
Mar 19 09:11:14 fems146 kernel: PCI: Using configuration type 1
Mar 19 09:11:14 fems146 kernel: PCI: Probing PCI hardware
Mar 19 09:11:14 fems146 ypbind: Setting NIS domain name fd.de:  succeeded
Mar 19 09:11:14 fems146 kernel: Unknown bridge resource 0: assuming transparent
Mar 19 09:11:14 fems146 kernel: Unknown bridge resource 2: assuming transparent
Mar 19 09:11:14 fems146 kernel: PCI: Using IRQ router PIIX [8086/2440] at 00:1f.0
Mar 19 09:11:14 fems146 kernel: PCI->APIC IRQ transform: (B0,I31,P2) -> 23
Mar 19 09:11:14 fems146 kernel: PCI->APIC IRQ transform: (B0,I31,P1) -> 17
Mar 19 09:11:14 fems146 kernel: PCI->APIC IRQ transform: (B1,I0,P0) -> 18
Mar 19 09:11:14 fems146 kernel: PCI->APIC IRQ transform: (B2,I8,P0) -> 20
Mar 19 09:11:14 fems146 kernel: PCI->APIC IRQ transform: (B2,I10,P0) -> 21
Mar 19 09:11:14 fems146 kernel: Linux NET4.0 for Linux 2.4
Mar 19 09:11:14 fems146 kernel: Based upon Swansea University Computer Society NET3.039
Mar 19 09:11:14 fems146 kernel: Initializing RT netlink socket
Mar 19 09:11:14 fems146 kernel: Starting kswapd
Mar 19 09:11:14 fems146 kernel: VFS: Diskquotas version dquot_6.4.0 initialized
Mar 19 09:11:14 fems146 kernel: pty: 256 Unix98 ptys configured
Mar 19 09:11:14 fems146 kernel: Serial driver version 5.05c (2001-07-08) with MANY_PORTS SHARE_IRQ SERIAL_PCI enabled
Mar 19 09:11:14 fems146 kernel: ttyS00 at 0x03f8 (irq = 4) is a 16550A
Mar 19 09:11:14 fems146 kernel: ttyS01 at 0x02f8 (irq = 3) is a 16550A
Mar 19 09:11:14 fems146 kernel: block: 128 slots per queue, batch=32
Mar 19 09:11:14 fems146 kernel: RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
Mar 19 09:11:14 fems146 kernel: Uniform Multi-Platform E-IDE driver Revision: 6.31
Mar 19 09:11:14 fems146 kernel: ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
Mar 19 09:11:14 fems146 kernel: PIIX4: IDE controller on PCI bus 00 dev f9
Mar 19 09:11:14 fems146 kernel: PIIX4: chipset revision 1
Mar 19 09:11:14 fems146 kernel: PIIX4: not 100%% native mode: will probe irqs later
Mar 19 09:11:14 fems146 kernel:     ide0: BM-DMA at 0x2460-0x2467, BIOS settings: hda:DMA, hdb:pio
Mar 19 09:11:14 fems146 kernel:     ide1: BM-DMA at 0x2468-0x246f, BIOS settings: hdc:DMA, hdd:pio
Mar 19 09:11:14 fems146 kernel: hda: ST340016A, ATA DISK drive
Mar 19 09:11:14 fems146 kernel: hdc: LTN485, ATAPI CD/DVD-ROM drive
Mar 19 09:11:14 fems146 kernel: ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Mar 19 09:11:14 fems146 kernel: ide1 at 0x170-0x177,0x376 on irq 15
Mar 19 09:11:14 fems146 kernel: hda: 78165360 sectors (40021 MB) w/2048KiB Cache, CHS=5169/240/63
Mar 19 09:11:14 fems146 kernel: hdc: ATAPI 48X CD-ROM drive, 120kB Cache, DMA
Mar 19 09:11:14 fems146 kernel: Uniform CD-ROM driver Revision: 3.12
Mar 19 09:11:14 fems146 kernel: Partition check:
Mar 19 09:11:14 fems146 kernel:  hda: [PTBL] [4865/255/63] hda1 hda2 hda3 hda4 < hda5 hda6 >
Mar 19 09:11:14 fems146 kernel: Floppy drive(s): fd0 is 1.44M
Mar 19 09:11:14 fems146 kernel: FDC 0 is a post-1991 82077
Mar 19 09:11:14 fems146 kernel: SCSI subsystem driver Revision: 1.00
Mar 19 09:11:14 fems146 kernel: request_module[scsi_hostadapter]: Root fs not mounted
Mar 19 09:11:14 fems146 kernel: usb.c: registered new driver hub
Mar 19 09:11:14 fems146 kernel: Initializing USB Mass Storage driver...
Mar 19 09:11:14 fems146 kernel: usb.c: registered new driver usb-storage
Mar 19 09:11:14 fems146 kernel: USB Mass Storage support registered.
Mar 19 09:11:14 fems146 kernel: md: linear personality registered as nr 1
Mar 19 09:11:14 fems146 kernel: md: raid0 personality registered as nr 2
Mar 19 09:11:14 fems146 kernel: md: raid1 personality registered as nr 3
Mar 19 09:11:14 fems146 kernel: md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
Mar 19 09:11:14 fems146 kernel: md: Autodetecting RAID arrays.
Mar 19 ...
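
One quick way to see how far off that first boot sequence is: total the ranges its BIOS-e820 map marks as usable and compare with the 2GB that should be there. A minimal Python sketch, assuming the line format shown in the log above; the function name `usable_e820_bytes` is made up for illustration.

```python
import re

# Matches the address range and type in lines such as
# "... kernel:  BIOS-e820: 0000000000100000 - 000000000ffd0000 (usable)"
E820_RE = re.compile(
    r"BIOS-e820:\s+([0-9a-fA-F]+)\s+-\s+([0-9a-fA-F]+)\s+\((.+?)\)"
)

def usable_e820_bytes(log_lines):
    """Sum the sizes of all ranges the BIOS reported as usable."""
    total = 0
    for line in log_lines:
        match = E820_RE.search(line)
        if match and match.group(3) == "usable":
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)
            total += end - start
    return total

# The seven e820 lines from the first boot sequence, syslog prefix removed.
first_boot = [
    "BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)",
    "BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)",
    "BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)",
    "BIOS-e820: 0000000000100000 - 000000000ffd0000 (usable)",
    "BIOS-e820: 000000000ffd0000 - 000000000fff0000 (ACPI NVS)",
    "BIOS-e820: 000000000fff0000 - 0000000010000000 (usable)",
    "BIOS-e820: 00000000feea0000 - 0000000100000000 (reserved)",
]

total = usable_e820_bytes(first_boot)
print(f"{total} bytes usable, about {total / 2**20:.1f} MiB")
```

For this map the total comes to roughly 256 MB, which agrees with the "Memory: 254928k/262144k" line in the same boot sequence and is nowhere near 2GB.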

Post by Ken Brownfiel » Thu, 21 Mar 2002 10:10:11


I'll take a leap and guess that this is DMA.

Disabling DMA, or going to MDMA2, would be my suggestion.  I don't think I've
used a Tyan board yet that doesn't get something horribly wrong.  Make
sure DMA is currently active with 'hdparm -d /dev/hda' though -- the
dmesg output can be misleading.

We're seeing this with Tyan 2410s and Seagate drives.  I think Tyan just
can't get DMA right.  Luckily we mainly lost docs or man pages before we
disabled DMA, although losing the rpm database sucked.  MDMA2 seems okay
but we haven't tested it long enough to form a lasting impression.

Side notes:

There are issues with CONFIG_IDEDMA{,_PCI}_AUTO in 2.4 that Martin fixed
(AFAIK), so I would suggest using hdparm at boot ('hdparm -d0 /dev/hda'
or 'hdparm -X34 /dev/hda') until those CONFIG options work when false.
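
For reference, the number given to -X encodes the transfer mode: per the hdparm man page, PIO mode n is 8+n, multiword DMA mode n is 32+n and UltraDMA mode n is 64+n, which is why -X34 means MDMA2. A small Python sketch of that mapping; the helper name `describe_xfer_mode` is made up for illustration, it covers only those three families, and it does not check whether a drive actually supports the mode.

```python
def describe_xfer_mode(value):
    """Translate an hdparm -X number into a transfer-mode name."""
    # Each family occupies eight consecutive values starting at its base.
    families = [(64, "udma"), (32, "mdma"), (8, "pio")]
    for base, label in families:
        if base <= value < base + 8:
            return f"{label}{value - base}"
    return "unknown"

for x in (12, 34, 66):
    print(f"-X{x} -> {describe_xfer_mode(x)}")
```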

I'm actually patching the ServerWorks driver to honor the CONFIG flag,
since even with hdparm there is a narrow risk to the fs during the boot
process before DMA is disabled.

Unless you want to compile DMA out of your kernels... but I actually
have some non-Tyan hardware that doesn't need to be crippled, and I
don't want to separate DMA vs non-DMA kernels.

Hope it helps,
--
Ken.
k...@irridia.com

On Tue, Mar 19, 2002 at 07:42:04PM +0100, Martin Knoblauch wrote:

| Hi,
|
|  what could be the cause of filesystem corruption on a Tyan Thunder 2462
| dual Athlon 1900MP? The system has 2GB "registered" ECC memory and is
| running a 2.4.17SMP kernel.
|
|  A customer of ours has 8 of these beasts and one has started "acting up",
| basically destroying the root partition within minutes after booting a
| fresh installation. The other 7 (identical) systems are OK. Is there
| anything that one should look for (like the noapic thing)?
|
|  Included is the messages file with three boot sequences. Interestingly
| the first sequence seems to misdetect the system completely .... At the
| end of the file the number of junk characters seems to increase quite a
| bit :-(
|
|  Unfortunately I am just in remote debugging mode right now. Guess I
| will see the system on Thursday.
|
| TIA
| Martin
| --
| ------------------------------------------------------------------
| Martin Knoblauch         |    email:  Martin.Knobla...@TeraPort.de
| TeraPort GmbH            |    Phone:  +49-89-510857-309
| C+ITS                    |    Fax:    +49-89-510857-111
| http://www.teraport.de   |    Mobile: +49-170-4904759
| [...]

Post by Alan Co » Thu, 21 Mar 2002 10:20:08


> We're seeing this with Tyan 2410s and Seagate drives.  I think Tyan just
> can't get DMA right.  Luckily we mainly lost docs or man pages before we
> disabled DMA, although losing the rpm database sucked.  MDMA2 seems okay
> but we haven't tested it long enough to form a lasting impression.
> I'm actually patching the ServerWorks driver to honor the CONFIG flag,
> since even with hdparm there is a narrow risk to the fs during the boot
> process before DMA is disabled.

I can confirm problems with serverworks OSB4 and UDMA. With UDMA and
a seagate disk you see 4 bytes repeat from one transfer into the next
shuffling all the data up 4 bytes (which since it includes inode and
metadata is *messy*). Current 2.4 has detect code that sometimes traps this
and panics to avoid fs death.
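
To picture what that does to a filesystem, here is a toy Python model of the failure as described: the last 4 bytes of one transfer reappear at the start of the next, pushing everything in that transfer up by 4 bytes. The transfer size, the function name and the choice to drop the bytes pushed off the end are all assumptions made for illustration, not details of the OSB4 hardware.

```python
def corrupt_transfer(previous, current, repeat=4):
    """Return `current` as it would arrive if the tail of `previous` leaked in.

    The last `repeat` bytes of the previous transfer are placed in front,
    and the result is cut back to the original length, so every byte of
    the real data lands `repeat` positions later than it should.
    """
    shifted = previous[-repeat:] + current
    return shifted[:len(current)]

prev_xfer = bytes(range(0x10, 0x20))   # 16 bytes: 10 11 ... 1f
next_xfer = bytes(range(0xA0, 0xB0))   # 16 bytes: a0 a1 ... af

received = corrupt_transfer(prev_xfer, next_xfer)
print("expected:", next_xfer.hex(" "))
print("received:", received.hex(" "))
```

Every field after the repeated bytes is read 4 bytes late, so an inode or directory block hit this way is misparsed from that point on.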

With MWDMA all was fine.

This was observed across a large number of boxes in a rendering farm so it's
not a one-off flawed box, and across two board vendors. I reported it to
serverworks who were interested but couldn't reproduce it in their lab.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Post by Ken Brownfiel » Thu, 21 Mar 2002 10:40:08


| I can confirm problems with serverworks OSB4 and UDMA. With UDMA and
[...]

Thanks.  This also points out that I mistakenly said DMA rather than
UDMA.  Someone else mentioned to me that MDMA worked for them as well.
It would have been "fine" if the serverworks driver didn't leave UDMA on
when it's off by default in the CONFIG.  At least then you would be
making the choice to specifically enable UDMA at your own risk...

| This was observed across a large number of boxes in a rendering farm so it's
| not a one-off flawed box, and across two board vendors. I reported it to
| serverworks who were interested but couldn't reproduce it in their lab.

Quite possible.  I'm only seeing this on ServerWorks mobos with IDE as
primary (vs SCSI).  I heard third-hand via a FreeBSD post that it's an
OSB4 issue affecting them as well.  Are Seagates a requirement for the
issues?

As to whether they can reproduce it... I'm not holding my breath for
them to try.

Thanks much for the info,
--
Ken.

Post by Andre Hedrick » Thu, 21 Mar 2002 10:40:07



> > We're seeing this with Tyan 2410s and Seagate drives.  I think Tyan just
> > can't get DMA right.  Luckily we mainly lost docs or man pages before we
> > disabled DMA, although losing the rpm database sucked.  MDMA2 seems okay
> > but we haven't tested it long enough to form a lasting impression.
> > I'm actually patching the ServerWorks driver to honor the CONFIG flag,
> > since even with hdparm there is a narrow risk to the fs during the boot
> > process before DMA is disabled.

> I can confirm problems with serverworks OSB4 and UDMA. With UDMA and
> a seagate disk you see 4 bytes repeat from one transfer into the next
> shuffling all the data up 4 bytes (which since it includes inode and
> metadata is *messy*). Current 2.4 has detect code that sometimes traps this
> and panics to avoid fs death.

> With MWDMA all was fine.

> This was observed across a large number of boxes in a rendering farm so it's
> not a one-off flawed box, and across two board vendors. I reported it to
> serverworks who were interested but couldn't reproduce it in their lab.

I am in their lab trying to reproduce the error and I have found some docs
which could help address the error of the 4byte FIFO issue in the engine.
It looks fixable on paper.

As for the AMD driver, who knows which version is in that kernel.

Next, that config option is a distro addition not mine, but it has crept
in so it is here.

Regards,

Andre Hedrick
LAD Storage Consulting Group

Post by Alan Co » Thu, 21 Mar 2002 10:50:05


> I am in their lab trying to reproduce the error and I have found some docs
> which could help address the error of the 4byte FIFO issue in the engine.
> It looks fixable on paper.

Andre - if you want the info I have from the previous stuff I was involved
in I can strip out customer company info and send it on.

> As for the AMD driver, who knows which version is in that kernel.

2.4.18 has a very old one
2.4.18-ac has the Andre/AMD updated one, but not the further updates.
                (eg it turns off SWDMA on more chipsets than it needs to)

Alan

Post by Alan Co » Thu, 21 Mar 2002 10:50:08


> It would have been "fine" if the serverworks driver didn't leave UDMA on
> when it's off by default in the CONFIG.  At least then you would be

That was a merge error from way back - now fixed (2.4.19pre)

> Quite possible.  I'm only seeing this on ServerWorks mobos with IDE as
> primary (vs SCSI).  I heard third-hand via a FreeBSD post that it's an
> OSB4 issue affecting them as well.  Are Seagates a requirement for the
> issues?

I wish I knew. If I did I'd slap a "no seagate UDMA" check in that driver
pronto.

> As to whether they can reproduce it... I'm not holding my breath for
> them to try.

They tried. They asked a lot of questions and while they failed I'm certain
they actually did try. While we've had some problems with serverworks
(notably no ECC docs which for some enterprise customers is a showstopper)
in general they are very co-operative nowadays, although they do like NDAs
and the like first.

Alan

Post by Ken Brownfiel » Thu, 21 Mar 2002 10:50:09


[...]
| Next, that config option is a distro addition not mine, but it has crept
| in so it is here.

Ya, Martin has cleaned this up for 2.4 I believe, and I'll do the grunt
work on patches for you and/or Martin to clean up 2.5.  The option is
fine, just that the specific IDE drivers aren't handling the logic
properly and ide-pci does it already.

Thanks!
--
Ken.

| Regards,
|
| Andre Hedrick
| LAD Storage Consulting Group

Post by Alan Co » Thu, 21 Mar 2002 10:50:11


> Ya, Martin has cleaned this up for 2.4 I believe, and I'll do the grunt

2.4 IDE cleanups are Andre mostly. Martin has been beating the *out of
2.5

Post by Andre Hedrick » Thu, 21 Mar 2002 11:50:05



> > I am in their lab trying to reproduce the error and I have found some docs
> > which could help address the error of the 4byte FIFO issue in the engine.
> > It looks fixable on paper.

> Andre - if you want the info I have from the previous stuff I was involved
> in I can strip out customer company info and send it on.

> > As for the AMD driver, who knows which version is in that kernel.

> 2.4.18 has a very old one
> 2.4.18-ac has the Andre/AMD updated one, but not the further updates.
>            (eg it turns off SWDMA on more chipsets than it needs to)

Why, SWDMA is obsoleted and there should not be any modern drives
reporting the support.

Cheers,

Andre Hedrick
LAD Storage Consulting Group

Post by Martin Knoblauch » Thu, 21 Mar 2002 22:40:17



> > I am in their lab trying to reproduce the error and I have found some docs
> > which could help address the error of the 4byte FIFO issue in the engine.
> > It looks fixable on paper.

> Andre - if you want the info I have from the previous stuff I was involved
> in I can strip out customer company info and send it on.

> > As for the AMD driver, who knows which version is in that kernel.

> 2.4.18 has a very old one
> 2.4.18-ac has the Andre/AMD updated one, but not the further updates.
>                 (eg it turns off SWDMA on more chipsets than it needs to)

 it is actually possible that the AMD driver is not enabled on the
kernel from our integrator. Could this give problems when someone
enables DMA on the IDE devices?

 I am still wondering why we did not see it on the seven other boxes
with the same setup. Maybe just luck?

Thanks
Martin
--
------------------------------------------------------------------

TeraPort GmbH            |    Phone:  +49-89-510857-309
C+ITS                    |    Fax:    +49-89-510857-111
http://www.teraport.de   |    Mobile: +49-170-4904759

Post by Alan Co » Thu, 21 Mar 2002 22:40:18


> > 2.4.18-ac has the Andre/AMD updated one, but not the further updates.
> >               (eg it turns off SWDMA on more chipsets than it needs to)

> Why, SWDMA is obsoleted and there should not be any modern drives
> reporting the support.

Yes, but the rev C4 (?) check is wrong for later chipsets, obsolete or
otherwise - wrong but harmless

1. Filesystem Corruption (ext2) on Tyan S2462, 2xAMD1900MP, 2.4.17SMP

[snip]
[snip]

Looking at it on a byte-by-byte level, it looks like (at least) these
types of bit flips are happening:

     --1-----
     --0-----
     ------0-
MSB->76543210<-LSB

That is, it looks like sometimes bit 5 is being flipped on or off, or bit
1 is being flipped off. (There could be others that I just haven't seen in
those logs yet.) I'm suspecting bad hardware (in case that wasn't
obvious), but I don't know exactly what component is defective. (By the
way, the BIOS has ECC error correction enabled, right??)

Also, do the weird capitalization changes in the logs happen on screen
too, or only in the logfile?
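
Flips like these can be picked out mechanically by XOR-ing a garbled string against what it should have said. A minimal Python sketch; the sample strings are invented for illustration and are not taken from the attached logs. Note that in ASCII, bit 5 (0x20) is the only difference between an upper-case and a lower-case letter, so a flipped bit 5 is enough on its own to change capitalization.

```python
def flipped_bits(expected, observed):
    """Yield (offset, bit, direction) for each bit that differs.

    direction is "on" if the bit went 0 -> 1 and "off" if it went 1 -> 0.
    """
    for offset, (good, bad) in enumerate(zip(expected, observed)):
        diff = good ^ bad
        for bit in range(8):
            if diff & (1 << bit):
                direction = "on" if bad & (1 << bit) else "off"
                yield offset, bit, direction

expected = b"kernel: Starting kswapd"
observed = b"kernel: starTing ksuapd"   # made-up corruption for the demo

for offset, bit, direction in flipped_bits(expected, observed):
    print(f"byte {offset}: bit {bit} flipped {direction}")
```

In the made-up sample, "S" to "s" is bit 5 going on, "t" to "T" is bit 5 going off, and "w" to "u" is bit 1 going off, matching the three patterns in the diagram.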


