2.4.20: / corruption on md-raid on large IDE discs


Post by Hom » Thu, 10 Apr 2003 11:30:16



Hi,
   I'm having some really hairy problems on a pair of servers, both
running 2.4.20.  In both cases I'm getting really bad corruption of
/ (hundreds of files going to lost+found, files suddenly becoming
inaccessible).

If it were only one box I'd blame it on RAM or the like, but it's two of
them, and I'm not sure what is happening.  Both use straight, unpatched
2.4.20 kernels (except for a separately loaded driver on one box; see
below).

  Let's call the boxes 'c' and 'n'.

box 'n':
P3-733MHz with 512MB of RAM and 2 x Western Digital 200GB IDE drives
connected to
    2 channels of an Intel 82801BA IDE U100 motherboard controller

A pair of MD RAID1 sets gives us a root partition and a 180GB working
directory.

   It runs a SuSE 8.1 installation.

It runs as a general server (NFS, Samba, DNS, YP, squid).

Here is hda:

/dev/hda:

  Model=WDC WD2000JB-00DUA1, FwRev=02.13B02, SerialNo=WD-WMACK1575630
  Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
  RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=74
  BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=16
  CurCHS=16383/16/63, CurSects=-66060037, LBA=yes, LBAsects=268435455
  IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
  PIO modes: pio0 pio1 pio2 pio3 pio4
  DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2
  AdvancedPM=no
  Drive Supports : Reserved : ATA-1 ATA-2 ATA-3 ATA-4 ATA-5 ATA-6
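One arithmetic observation on the hda output above: the negative CurSects value is what you get if the CHS sector count (16383 x 16 x 63) has its two 16-bit halves swapped and is then printed as a signed 32-bit number. The sketch below only checks that arithmetic. Whether the tool really swapped the words this way is an assumption, and nothing here shows that it is related to the corruption.

```python
def swap_words_signed32(value):
    """Swap the two 16-bit halves of a 32-bit value and read it as signed."""
    low = value & 0xFFFF
    high = (value >> 16) & 0xFFFF
    swapped = (low << 16) | high
    # Reinterpret as a signed 32-bit integer (two's complement).
    return swapped - (1 << 32) if swapped & 0x80000000 else swapped

# Sector count implied by CurCHS=16383/16/63 in the hda output.
chs_sectors = 16383 * 16 * 63

print(chs_sectors)                       # 16514064
print(swap_words_signed32(chs_sectors))  # -66060037, the CurSects shown for hda
```

If that reading is right, the odd CurSects on hda is a display quirk rather than a sign of a damaged drive.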

---------------------------------------------------------------------
box 'c':

P3-866MHz with 768MB of RAM and 5 x Western Digital 200GB IDE drives
connected to:
    1 channel of a PIIX4 82371AB on board IDE interface
    2 channels of a Promise Ultra133 TX2
    2 channels of an HPT302 controller

   (The mix of controllers was prompted by reports on lkml of people
having problems with multiple Promise controllers).

   A pair of MD RAID5 sets gives us a root partition and a big 800GB
main working directory.

   The HPT302 controller driver is built from source and loaded in an
initrd.

   It runs a SuSE 7.2 installation.

Box 'c' previously had a set of smaller 80GB drives in it (all on its
PIIX4), but with a non-RAID root, and had been happy for a long time.

It runs as a backup server: large amounts of data are rsynced to it
overnight and then streamed to tape.

Here is hdc:
/dev/hdc:

  Model=WDC WD2000JB-00DUA0, FwRev=65.13G65, SerialNo=WD-WMACK1345385
  Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
  RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=74
  BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=16
  CurCHS=4047/16/255, CurSects=16511760, LBA=yes, LBAsects=268435455
  IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
  PIO modes: pio0 pio1 pio2 pio3 pio4
  DMA modes: mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4 udma5
  Drive Supports : ATA-1 ATA-2 ATA-3 ATA-4 ATA-5 ATA-6

------------

Both boxes originally had ext3 root filesystems.  Having suspected
ext3, I replaced the ext3 root on box 'c' with a reiserfs partition,
which still has corruption issues.

The large partition on 'c' is reiserfs; the large partition on 'n' is
ext3.

So where is the problem?
   1) Could it be the MD code?  I've had no problems with it before.
   2) Some problem with the onboard IDE controllers (both Intel
derivatives)?
   3) Are there any known problems with these drives?
   4) LBA48 issues?
   5) Any known 2.4.20 issues?

We're running 2.4.20 because, as far as I can tell, it is the first
release that is happy with LBA48.
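For context on the LBA48 question: both hdparm listings above report LBAsects=268435455, which is exactly 2^28 - 1, the largest sector count a 28-bit LBA can express. A rough sketch of the arithmetic follows. The 512-byte sector size and the reading of "200GB" as 200 x 10^9 bytes are assumptions, not values taken from the posts.

```python
SECTOR_BYTES = 512                 # assumption: 512-byte logical sectors
NOMINAL_DRIVE_BYTES = 200 * 10**9  # assumption: "200GB" means 200 x 10^9 bytes

LBA28_MAX_SECTORS = 2**28 - 1
reported_lbasects = 268435455      # LBAsects from both hdparm listings

# The reported value is exactly the 28-bit ceiling.
print(reported_lbasects == LBA28_MAX_SECTORS)

# Bytes reachable with 28-bit addressing.
lba28_bytes = LBA28_MAX_SECTORS * SECTOR_BYTES
print(lba28_bytes)

# Sectors on a nominal 200GB drive, and the address bits they need.
drive_sectors = NOMINAL_DRIVE_BYTES // SECTOR_BYTES
bits_needed = (drive_sectors - 1).bit_length()
print(drive_sectors, bits_needed)
```

Under these assumptions roughly the first 137 GB of each drive is reachable with 28-bit addresses, and everything beyond that depends on the 48-bit path working correctly in both the driver and the controller.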

Any help gratefully received.

Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
 
 

1. fs corruption with 2.4.20 IDE+md+LVM

I observed filesystem corruption on my home workstation recently. I was
running kernel 2.4.20 (built myself with gcc 2.95.4), and ext3 with the
default journaling mode (ordered?).

I was downloading files, and noticed that they weren't being saved. I
immediately did a 'df -h', and it reported my home partition as having 7.3T
used, -64Z free.

I (foolishly) immediately did a 'du -sch ~/*' to see what might be taking up
all the space. After realizing what was going on (du reported filesystem
permission errors on files it shouldn't have), I shut down all programs and
dropped to runlevel 1.

I unmounted my LVM'ed partitions (/var, /usr, /home) and tried to fsck
/dev/sys/home (the /home partition). It couldn't find a good superblock and
fell back to using a backup superblock. fsck reported that the journal
was corrupt and discarded it. Many of the low-numbered inodes had wrong
refcounts or wrong modes.

Eventually it fixed the filesystem, but everything ended up in many files &
directories under lost+found. (I had to pull each home dir from one or more
dirs under lost+found.)

After fixing the filesystem, I gratuitously fsck -f'ed all my other
partitions; they came up clean.

Fortunately, it looks like the only stuff I really lost was some chunks of my
XFree86 source tree and some Linux kernel sources. Easily replaceable
stuff.

Here's my system architecture:
2x Western Digital 80GB Special Edition IDE drives (hde, hdf)
- / is an ext3 RAID1 /dev/md0 made of hde1 and hdf1
- /dev/md1 is LVM-formatted RAID1, made of hde2 and hdf2. This partition
contains /var, /usr, and /home.

/home is the only place that I saw this corruption.

I have since reverted back to kernel 2.4.18.

I'm thinking that my reaction *should* have been to power-cycle the box
immediately upon noticing the problem, to prevent further fs corruption,
and bring it back up in single-user read-only mode. Shutting down programs
nicely would have written more stuff to disk, worsening the corruption.

I will also point out that kernels 2.4.20-ac1 and 2.4.21-pre6 will not boot
on my machine; they kernel panic when detecting my IDE devices. I have not
tried 2.4.20-ac2 or 2.4.21-pre2 yet. 2.4.20 and 2.4.18 boot quite happily
though. I suppose I ought to try the latest versions and set up a serial
console to capture the oops before reporting a bug on this.

Carl Soderstrom.
--
Systems Administrator
Real-Time Enterprises
www.real-time.com

2. biz.sco ==>> comp.unix.sco

3. 2.4.20 Promise IDE RAID Locks up (gcc 3.2.1!)

4. filter in printcap for ps printer

5. 2.4.20-ac1 not seeing IDE disk on PIIX host adapter

6. Free Beer for All Linuxers!

7. IBM x440 problems on 2.4.20 to 2.4.20-rc1-ac3

8. Hp / PCL Printer . Filter?

9. 2.4.20 + XFS patches + rmap15a + Ingo's 2.4.20-rc3 O(1) sched

10. NFS/UDP/IP performance - 2.4.19 v/s 2.4.20, 2.4.20-pre3

11. IDE kernel parameter (was: 2.4.20 SMP, a PDC20269, and a huge Maxtor disk)

12. SCSI under 2.4.20-8 but not 2.4.20-18.9 (RH9)

13. PIIX4 IDE w/ MWDMA disk still broken in 2.4.20-rc1(bk)