Summary:
I've recently added a (brand new) Maxtor ATA/100 card (Promise
Ultra100) and a (brand new) 100GB Western Digital ATA/100 drive to my
system. I ran e2fsck to check for bad blocks and found none.
I immediately began to see "status errors" on ide2, so I did a little
detective work, below, to try and get my head around the issue.
Fearing that my new WD drive was DOA, I swapped out the 100GB drive
for a (less than 1-year old) Quantum 30GB ATA/66 drive, and got
different errors (see below).
I have a SCSI drive and controller in there that have worked just fine
for about a year now.
--------------------------------------------------------------
Exhibit 1: (/proc/interrupts)
           CPU0       CPU1
  0:     181978     203371    IO-APIC-edge   timer
  1:       1102       1338    IO-APIC-edge   keyboard
  2:          0          0          XT-PIC   cascade
  3:          3          3    IO-APIC-edge   serial
  5:         25         23   IO-APIC-level   AM53C974
  8:          0          1    IO-APIC-edge   rtc
 10:    6970093    6969779   IO-APIC-level   ide2, sym53c8xx
 11:        239        207   IO-APIC-level   eth0
 12:       1140       1034    IO-APIC-edge   PS/2 Mouse
 14:       2115       2210    IO-APIC-edge   ide0
 15:       9890      10936    IO-APIC-edge   ide1
NMI:          0          0
LOC:     385273     385272
ERR:          0
MIS:         30
-------------------------------------------------------------
Exhibit 2: (from /proc/pci)
PCI devices found:
  Bus 0, device 7, function 1:
    IDE interface: Intel Corp. 82371AB PIIX4 IDE (rev 1).
      Master Capable. Latency=64.
      I/O at 0xffa0 [0xffaf].
  Bus 0, device 13, function 0:
    SCSI storage controller: LSI Logic / Symbios Logic (formerly NCR) 53c895 (rev 1).
      IRQ 10.
      Master Capable. Latency=64. Min Gnt=30. Max Lat=64.
      I/O at 0xe800 [0xe8ff].
      Non-prefetchable 32 bit memory at 0xfebfff00 [0xfebfffff].
      Non-prefetchable 32 bit memory at 0xfebfe000 [0xfebfefff].
  Bus 0, device 18, function 0:
    Unknown mass storage controller: Promise Technology, Inc. 20267 (rev 2).
      IRQ 10.
      Master Capable. Latency=64.
      I/O at 0xeff0 [0xeff7].
      I/O at 0xefe4 [0xefe7].
      I/O at 0xefa8 [0xefaf].
      I/O at 0xefe0 [0xefe3].
      I/O at 0xef00 [0xef3f].
      Non-prefetchable 32 bit memory at 0xfebc0000 [0xfebdffff].
---------------------------------------------------------------
Exhibit 3: (from /var/log/messages)
[This is what happens when I try to copy ~6GB of data from /dev/sda1
to the WD drive (as /dev/hde1)]
command: find . | cpio -mpu /newdrive
Mar 23 15:57:02 gromit kernel: hde: status error: status=0x58 {
DriveReady SeekComplete DataRequest }
Mar 23 15:57:02 gromit kernel: hde: drive not ready for command
Mar 23 15:57:02 gromit kernel: hde: status timeout: status=0xd0 { Busy }
Mar 23 15:57:02 gromit kernel: hde: drive not ready for command
Mar 23 15:57:02 gromit kernel: ide2: reset: success
Mar 23 15:59:12 gromit kernel: hde: status error: status=0x58 {
DriveReady SeekComplete DataRequest }
Mar 23 15:59:12 gromit kernel: hde: drive not ready for command
Mar 23 15:59:12 gromit kernel: hde: status timeout: status=0xd0 { Busy }
Mar 23 15:59:12 gromit kernel: hde: drive not ready for command
Mar 23 15:59:12 gromit kernel: ide2: reset: success
Mar 23 16:07:44 gromit kernel: hde: status error: status=0x58 {
DriveReady SeekComplete DataRequest }
Mar 23 16:07:44 gromit kernel: hde: drive not ready for command
Mar 23 16:07:44 gromit kernel: hde: status timeout: status=0xd0 { Busy }
Mar 23 16:07:44 gromit kernel: hde: drive not ready for command
Mar 23 16:07:44 gromit kernel: ide2: reset: success
I've repeated this test about six times, and each time there have been
exactly three resets and they come at about the same time during the
copy (about a minute in).
Other than these errors, the data _appears_ to copy over intact.
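For anyone reading the status values in the log, here is a rough sketch
of how the status byte maps to the flag names the kernel prints. The
bit masks are the standard ATA status register bits. The names for
0x58 and 0xd0 are taken from the log above; the names for the other
bits, and the function name decode_status, are my own labels and may
not match the driver's exact wording. When the Busy bit is set the
other bits are not meaningful, which is why only "Busy" shows up for
0xd0.

```python
# ATA status register bits. Names for Busy, DriveReady, SeekComplete and
# DataRequest follow the log output; the remaining names are assumed labels.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(status):
    """Return the flag names set in an ATA status byte.

    If Busy is set, the other bits are not valid, so only "Busy" is
    reported (this mirrors what the log lines show for 0xd0).
    """
    if status & 0x80:
        return ["Busy"]
    return [name for mask, name in STATUS_BITS if status & mask]


if __name__ == "__main__":
    for value in (0x58, 0xD0):
        print(hex(value), "{", " ".join(decode_status(value)), "}")
```

So 0x58 means the drive is ready and asking for a data transfer at a
moment when the driver expected it to be idle, and 0xd0 means the drive
is still busy when the driver gives up waiting.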
-------------------------------------------------------------------
Exhibit 4: (from /var/log/messages)
[This is what happens when I try to copy ~6GB of data from /dev/sda1
to the Quantum drive (as /dev/hde1)]
command: find . | cpio -mpu /newdrive
Mar 23 16:37:09 gromit kernel: hde: irq timeout: status=0xd0 { Busy }
Mar 23 16:37:09 gromit kernel: ide2: reset: success
Mar 23 16:47:42 gromit kernel: hde: irq timeout: status=0xd0 { Busy }
Mar 23 16:47:42 gromit kernel: ide2: reset: success
Mar 23 16:57:37 gromit kernel: hde: irq timeout: status=0xd0 { Busy }
Mar 23 16:57:37 gromit kernel: ide2: reset: success
I've repeated this test about six times, and each time there have been
exactly three resets and they come at about the same time during the
copy (about a minute in).
Other than these errors, the data _appears_ to copy over intact.
------------------------------------------------------------------
Exhibit 5: (testing copy IDE1 -> IDE2)
When I try to copy ~4GB of data from /dev/hdc1 to the WD drive (as
/dev/hde1), I get no errors in the log, and the data appears to copy
over intact.
------------------------------------------------------------------
Exhibit 6: (testing copy IDE1 -> IDE2)
When I try to copy ~4GB of data from /dev/hdc1 to the Quantum drive
(as /dev/hde1), I get no errors in the log, and the data appears to
copy over intact.
------------------------------------------------------------------
Observations:
1. The new Ultra100 takes the same IRQ (10) as the SCSI host adapter.
2. The errors only occur when copying a large amount of data from the
SCSI drive to the ATA/100 drive, and not when copying from an ATA/33
drive to the ATA/100 drive.
3. The errors all seem to be very predictable and repeatable.
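
To double-check observation 1 without eyeballing the table, a small
script along these lines can list every IRQ that has more than one
device on it. This is only a sketch: the function name shared_irqs and
the ncpus parameter are made up for illustration, and it assumes the
/proc/interrupts layout shown in Exhibit 1 (one count column per CPU,
then the controller type, then a comma-separated device list).

```python
SAMPLE = """\
CPU0 CPU1
5: 25 23 IO-APIC-level AM53C974
10: 6970093 6969779 IO-APIC-level ide2, sym53c8xx
12: 1140 1034 IO-APIC-edge PS/2 Mouse
14: 2115 2210 IO-APIC-edge ide0
NMI: 0 0
ERR: 0
"""


def shared_irqs(text, ncpus=2):
    """Map IRQ number -> device names for IRQs with more than one device.

    Assumes each numbered line is: IRQ, one count per CPU, the
    controller type, then the device list separated by commas.
    """
    shared = {}
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        # Skip the CPU header and the NMI/LOC/ERR/MIS summary lines.
        if not sep or not head.strip().isdigit():
            continue
        fields = rest.split()
        # Drop the per-CPU counts and the controller type column.
        devices = " ".join(fields[ncpus + 1:])
        names = [d.strip() for d in devices.split(",") if d.strip()]
        if len(names) > 1:
            shared[int(head)] = names
    return shared


if __name__ == "__main__":
    # On a live system, read open("/proc/interrupts").read() instead.
    for irq, names in shared_irqs(SAMPLE).items():
        print("IRQ", irq, "is shared by:", ", ".join(names))
```

Run against the Exhibit 1 data, the only shared line is IRQ 10, with
ide2 and sym53c8xx on it.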
-------------------------------------------------------------------
Conclusions:
1. Both the new ATA/100 WD and the ATA/66 Quantum are fine (likely?)
OR both of them are identically broken (unlikely?).
2. The Ultra100 is broken (possible) OR it doesn't play nice when on
the same interrupt as another controller (possible). (Can the IRQ be
manually assigned easily?)
3. The Linux IDE driver is buggy (likely?)
Are there any hard drive experts out there who can help me out here?
Since the error is so repeatable, I'm happy to run any more detailed
tests for the kernel/driver guys if it helps fix this...
Thanks in advance,
WMB