Weird corruption problems

Post by Ross Vandegrif » Sat, 08 Jul 2000 04:00:00



Hi all,

        I've had a server running on a dedicated T1 for a long time and
have had relatively few problems with it.  It cooks along providing me
nice bandwidth and responsive service.  However, it has recently become
plagued with a bizarre corruption problem.

        I first noticed it while uploading some scripts from other Debian
boxen.  The script would run fine on the local machines, and when FTPed to
any other Debian box on the network.  However, when FTPed to the remote
one, the resulting file would be chock full o' errors.  Weird stuff -
sometimes a whole line would be left out of the middle, sometimes a few
characters, and sometimes control characters would show up.  I just redid
the upload and it was fine again.

        I noticed it a second time when I received a "CRC error" report
about my release 3 SlackReiser boot disk from an extremely helpful
gentleman trying my software.  He determined that the image on the remote
machine was corrupt, and I have verified this fact.  Hmm, now I realize
something is fishy.

        Today I got a call from a business associate who uses this remote
server for email, saying he couldn't relay mail from his new domain and
asking me to add it to our relay-domains file.  I said sure, and opened
it up.  Much to my surprise, one of the characters in the file had been
replaced with a control character.  I fixed the error, but now I'm really
scared about the rest of the data.  What can I do to guarantee that it's
all there and correct?  What on earth could be causing such a bizarre
problem?  Situations 1 and 2 point to communication problems, but number 3
involves the relay-domains file, which has never been sent over FTP, so
it seems to say that it's a filesystem/hardware problem.  Where should I
start looking?

Thanks,
        Ross Vandegrift
        Seitz Technical Products Inc
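
One way to approach the "is it all there and correct?" question is to compare checksums of the files on the suspect server against copies that are known to be good (for example on one of the local Debian boxes). The sketch below is only an illustration, not something from the thread: the function names are made up for this example, and it assumes both trees can be read from Python and that a trusted copy exists to compare against.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def compare_manifests(trusted, suspect):
    """Return (changed, missing, extra) relative paths, each sorted."""
    changed = sorted(p for p in trusted
                     if p in suspect and trusted[p] != suspect[p])
    missing = sorted(p for p in trusted if p not in suspect)
    extra = sorted(p for p in suspect if p not in trusted)
    return changed, missing, extra
```

A manifest built on the trusted machine can be saved and compared with one built on the remote server; any path reported as changed is a candidate for silent corruption. Running the same comparison twice on the suspect machine may also be informative, since digests that change between runs on untouched files would suggest the damage is happening on read rather than sitting on disk.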

 
 
 

Weird corruption problems

Post by Chad+n.. » Sat, 08 Jul 2000 04:00:00


I'd be looking at the disk subsystem.  What controllers are you using?  Sounds
to me like they may be going bad.  Anything in the log files?  I've heard of
cases where extremely busy SCSI buses can corrupt data; might that be the
case here?  Give us some more details about the hardware involved.

Regards,
Chad


>Date: Fri, 7 Jul 2000 16:49:38 -0400


>Newsgroups: comp.os.linux.networking, comp.os.linux.hardware
>Subject: Weird corruption problems

--
                                                 _\|/_
                                                 (o o)
----------------------------------------------oOO-(_)-OOo------    

Packet filtering for Linux
http://www.packetfilter.dynip.com/
Now hosting IPChains mailing list v2

"...Unix, MS-DOS, and Windows NT (also known as the Good,
the Bad, and the Ugly)."  (By Matt Welsh)

---------------------------------------------------------------

 
 
 

Weird corruption problems

Post by Ross Vandegrif » Sat, 08 Jul 2000 04:00:00


> I'd be looking at the disk subsystem.  What controllers are you using?  Sounds
> to me like they may be going bad.  Anything in the log files?  I've heard of
> cases where extremely busy SCSI buses can corrupt data; might that be the
> case here?  Give us some more details about the hardware involved.

Hmm, that's something along the lines of what I was thinking too.  We use
all IDE controllers on the server.  This one has four hard drives,
hd[abcd].  They're all Western Digital 13G UDMA 33 drives.  The
motherboard is ASUS, IDE controllers are Intel PIIX4.  hda4 is the root
partition, hd[bc]1 are RAID1'ed together, hdd2 is a separate disk for
shell users' home directories.  I can't imagine it has to do with the
hard disks failing, since IDE drives do sector relocation automatically.
Is there a chance it has to do with noise on the bus?  Also, the location
isn't known for having a particularly good electricity supply (it was as
low as 90VAC once).  I've seen this corruption happen on both /dev/hda4
and /dev/md0, so I doubt it's a bug in the RAID code.  (which, btw, is
kernel 2.2.11 with the raid0145 patch)

The server is a Cyrix 6x86MX 150, 64M of RAM, 128M of swap (swap is on
hdd1).  Network card is a 3Com 3C509B, video is a Bob's Generic Brand ISA
VGA card... It's probably a trident 1 megger or something old like that.

Thanks,
Ross
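
Since the corruption has been seen on both /dev/hda4 and /dev/md0, one possible experiment is to write a known pattern to a scratch file on each filesystem and read it back later. The following is a rough sketch under stated assumptions, not a tested diagnostic: the function names are invented for illustration, the pattern is derived from SHA-256 purely so it can be regenerated on demand, and a read served from the page cache will not exercise the disk at all, so the file would need to be larger than RAM (or the machine rebooted between write and verify) for the result to say anything about the drives or controller.

```python
import hashlib
import os


def pattern_block(seed, index, size):
    """Deterministic pseudo-random block derived from seed and block index."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(f"{seed}:{index}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:size])


def write_pattern(path, seed, blocks, block_size=4096):
    """Write `blocks` pattern blocks to path and ask the OS to flush them."""
    with open(path, "wb") as handle:
        for index in range(blocks):
            handle.write(pattern_block(seed, index, block_size))
        handle.flush()
        os.fsync(handle.fileno())


def verify_pattern(path, seed, blocks, block_size=4096):
    """Return the indexes of blocks whose contents differ from the pattern."""
    bad = []
    with open(path, "rb") as handle:
        for index in range(blocks):
            if handle.read(block_size) != pattern_block(seed, index, block_size):
                bad.append(index)
    return bad
```

If the same test fails on every filesystem, that would point away from any single drive and toward something shared, such as the controller, cabling, memory, or power.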


 
 
 

Weird corruption problems

Post by Chad+n.. » Sat, 08 Jul 2000 04:00:00



>Date: Fri, 7 Jul 2000 18:04:05 -0400



>Newsgroups: comp.os.linux.networking, comp.os.linux.hardware
>Subject: Re: Weird corruption problems

>> I'd be looking at the disk subsystem.  What controllers are you using?  Sounds
>> to me like they may be going bad.  Anything in the log files?  I've heard of
>> cases where extremely busy SCSI buses can corrupt data; might that be the
>> case here?  Give us some more details about the hardware involved.

>Hmm, that's something along the lines of what I was thinking too.  We use
>all IDE controllers on the server.  This one has four hard drives,
>hd[abcd].  They're all Western Digital 13G UDMA 33 drives.  The
>motherboard is ASUS, IDE controllers are Intel PIIX4.  hda4 is the root
>partition, hd[bc]1 are RAID1'ed together, hdd2 is a separate disk for
>shell users' home directories.  I can't imagine it has to do with the
>hard disks failing, since IDE drives do sector relocation automatically.

I tend to agree with you that the drives are not at fault, but WD does have a
diagnostic tool with a non-destructive mode that you can use to test each
drive.  It depends how much time you have, and how desperate you are.

>Is there a chance it has to do with noise on the bus?  Also, the location
>isn't known for having a particularly good electricity supply (it was as
>low as 90VAC once).  I've seen this corruption happen on both /dev/hda4
>and /dev/md0, so I doubt it's a bug in the RAID code.  (which, btw, is
>kernel 2.2.11 with the raid0145 patch)

I'm shooting at the power, or lack thereof.  Get a UPS NOW!!  That will at
least prevent future power problems from affecting you.  Since you have a T1,
I'm assuming you can afford to purchase a UPS; I think you can get them for
less than $100 now.

As to what part(s) the lack of power may have damaged, it will be a process of
elimination to figure it out.  I had a power supply go wacky because of power
problems; it would cause my workstation to reboot at random
intervals.  That was extremely annoying. ;-)  Coincidentally, a Cyrix system.

You've seen data corruption on both / and the RAID.  What level of RAID?  I'm
assuming one.  You'll have to weigh the time vs. money factors, but I think I'd
start with a new motherboard and perform the non-destructive disk tests.  If
you still see problems, then examine the disk cables and power supply.  You
might want to replace the power supply along with the motherboard.  If the
power supply is in fact bad, it might damage the new motherboard.

You could also examine the bug reports (if they exist) for the motherboard and
BIOS; you might find something, probably not though.

Summary
        o - UPS
        o - disk diag tests
        o - new power supply & new motherboard
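
Before swapping parts, it may help the process of elimination to look at what the damage actually looks like, by comparing a corrupt file byte for byte against a good copy. The sketch below is an illustration only, with function names invented for this example. As a loose heuristic (an assumption, not a finding from this thread), isolated single-bit differences are often associated with memory or bus errors, while damage spanning whole 512-byte sectors is more suggestive of the disk path.

```python
def diff_runs(good, bad):
    """Return (offset, length) runs where two equal-length byte strings differ."""
    if len(good) != len(bad):
        raise ValueError("inputs differ in length; bytes were inserted or dropped")
    runs = []
    start = None
    for offset, (a, b) in enumerate(zip(good, bad)):
        if a != b:
            if start is None:
                start = offset  # a new run of differing bytes begins here
        elif start is not None:
            runs.append((start, offset - start))
            start = None
    if start is not None:
        # The final run extends to the end of the data.
        runs.append((start, len(good) - start))
    return runs


def flipped_bits(good, bad, offset):
    """Count the bits that differ in the byte at the given offset."""
    return bin(good[offset] ^ bad[offset]).count("1")
```

Note that the FTP case described earlier, where whole lines went missing, would change the file length and trip the `ValueError` here; that kind of damage would need a line-oriented diff instead.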

>The server is a Cyrix 6x86MX 150, 64M of RAM, 128M of swap (swap is on
>hdd1).  Network card is a 3Com 3C509B, video is a Bob's Generic Brand ISA
>VGA card... It's probably a trident 1 megger or something old like that.

It must be Bob's Generic ISA VGA. ;)

If I can help further let me know, I'd like to know the root problem, if you
determine it.

Regards,
Chad

 
 
 

Weird corruption problems

Post by Ross Vandegrif » Tue, 11 Jul 2000 04:00:00


> I'm shooting at the power or lack thereof.  Get a UPS NOW!!  That will at
> least prevent future power problems from affecting you.  Since you have a T1,
> I'm assuming you can afford to purchase a UPS, I think you can get them for
> less than $100 now.  

The machine is on a very nice UPS.  It provides conditioned power, has
batteries less than a year old, and has been tested to be in tip-top
shape.  Maybe it's not the power?  Do you still think it might be the
power supply gone bad?

> You've seen data corruption on both / and the raid.  What level of raid?  I'm
> assuming one.  You'll have to weigh the time vs. money factors but I think I'd
> start with a new motherboard and perform the non-destructive disk tests.  If
> you still see problems then examine disk cables, power supply.  You might want
> to replace the power supply along with the motherboard.  If the power supply
> is in fact bad, it might damage the new motherboard.

We've had a lot of problems with WD disks in this machine before, but it
turned out that a batch of 13G's that we ordered had been damaged in
shipping.  Since we got new disks into the machine, they've been working
wonderfully.  Being paranoid about them going bad too, we've tested those
disks fairly seriously, and I'm willing to say with confidence that they
would pass the WD scan.  The motherboard and power supply could still be
issues, but the boss isn't going to like that one... it's a co-located
machine at a local ISP.

In addition, he is convinced that the ext2 filesystem is at
fault (he hasn't looked at it or anything, he just knows :-).  Is there
any chance that something weird could be happening at the filesystem layer
that is causing this??

Thanks,
        Ross Vandegrift

 
 
 

1. Weird Font Corruption Problem

Hello, I have an interesting problem which I haven't been able to figure
out, and I was wondering if any of you have heard of this one before.

The players:
Tyan dual P133
Diamond Stealth S3 Chipset
XFree 86 3.12 (Old slackware 3 dist)
Netscape 2 and 3
Shareware XV

So here is what happens:
X starts out fine no problems.

I start Netscape 2 OR 3. After about two or three pages, some of the
fonts go bad and all I see is random garbage for each character. If I select
some of the text with the mouse, then the patterns change, but the
characters are still messed up (which suggests to me that it's not a set
of corrupted fonts).

Then I run shareware XV v3.1. All the fonts in the control window are now
screwed up too. So I quit XV.

I quit Netscape.

I restart XV and all the fonts are fine.

I restart Netscape, play there for awhile, and then the XV fonts get
corrupted again.

I have tried a fresh copy of Netscape, with the same results, so that
doesn't seem to be the problem. Sometimes the problem will spread to
xconsole. However, the one time it did spread to xconsole I just
restarted fvwm and things were reconfined to XV and Netscape.

I can't figure this one out. I'm wondering if I'll have to upgrade to the
new version of XFree86 (which has been the plan).

If anyone has any ideas please let me know. I would almost guess that a
font cache somewhere was going bad, but I dunno.

Thanks in advance.

2. cannot print using slackware

3. Weird file corruption (448 bytes): FS? RAM?

4. Priorities on SWAP drives?

5. Weird FS corruptions

6. man pages

7. Weird Ping, weird FTP, weird Telnet... HELP!!!

8. Massive problems after compiling X 4.0.2

9. Weird, weird, weird issue ....

10. DVD writing corruption (DMA problem? 2GB limit problem?)

11. Weird, really weird???

12. Problem changing file permissions - a WEIRD problem!

13. Weird memory problem (was: Problem with SCSI in 1.0)