My first root filesystem corruption. :-(

Post by Randy Carpent » Wed, 28 Oct 1992 23:12:58




>Last week, we experienced an extended power failure that crashed our
>HP 9000/817 running HPUX 8.02.  When power returned and the system rebooted,
>fsck detected corruption in the root file system.  Needless to say, I was not
>amused.  This was my first corruption after about 1 year as a sysadmin.  After
>devouring several manuals and my class notes from my HP sysadmin class, I was
>able to run fsck manually, answer its questions intelligently (I hope), and
>check the lost+found directories for important missing files.

>My gripe is this: the brittleness of the Unix filesystem and the complexity of
>the recovery process are unacceptable for a mission-critical production
>environment.  Come on!  Corruption preventing reboot after a power failure???

Sounds like you don't have your HP configured correctly.  Your /etc/fstab
should have an option (at least our SGI systems do) to automatically fsck(1)
the filesystems upon reboot.  It's easy to slam Unix if you configure it
incorrectly.

>Sheesh.  And the recovery process asks some rather scary questions pertaining
>to data integrity.  There's no way a typical computer-room operator is going
>to be able to do this.  Highly skilled Technical Support personnel (me) must
>be in control.  And what about those lost+found directories?  You have to have
>extensive knowledge about what's going on in your system to reunite the
>orphaned files, named only by their inode numbers, with their original
>locations.  In a large production system with gigabytes of files, this could
>be impossible!

>I think I'm going to recommend to Management that they invest in some UPS
>systems that will keep our Unix systems alive long enough during power
>failures so they can be shut down in a clean manner...

In spite of the configuration error, this is a good idea anyway.

>This whole corruption business is very unpleasant.  HP's proprietary MPE
>OS has a very robust Transaction Manager (XM) to ensure data integrity.  It sure
>would be nice to have XM-level integrity under HPUX.  Can anybody at HP
>comment on whether HPUX will have better filesystem integrity in the future?

>IMHO, HPUX is never going to make it in the Real World of mission critical
>applications without some serious enhancement...
>--

>Coast Community College District   1370 Adams Avenue
>District Information Services      Costa Mesa, CA, USA  92626
>Technical Support                  (714) 432-5064
>"You can tune a file system, but you can't tune a fish." - tunefs(1M)

--
===========================================================================

Georgia State University   (404) 651-2648      ksh: matter: cannot create  
Wells Computer Center                          $
 
 
 

My first root filesystem corruption. :-(

Post by Mark Bix » Thu, 29 Oct 1992 01:44:32



>Sounds like you don't have your HP configured correctly.  Your /etc/fstab
>should have an option (at least our SGI systems do) to automatically fsck(1)
>the filesystems upon reboot.  It's easy to slam Unix if you configure it
>incorrectly.

The system is configured properly -- fsck gets run at every reboot.  The
corruption was detected by fsck at reboot time, not through strange software
symptoms.

 
 
 

My first root filesystem corruption. :-(

Post by Gary Hest » Thu, 29 Oct 1992 00:53:57



>Last week, we experienced an extended power failure that crashed our
>HP 9000/817 running HPUX 8.02.  When power returned and the system rebooted,
  [ ... ]
>My gripe is this: the brittleness of the Unix filesystem and the complexity of
>the recovery process are unacceptable for a mission-critical production
>environment.  Come on!  Corruption preventing reboot after a power failure???
  [ ... ]
>I think I'm going to recommend to Management that they invest in some UPS
>systems that will keep our Unix systems alive long enough during power
>failures so they can be shut down in a clean manner...

WHAT???!!! You have a MISSION CRITICAL computer, you're *ing about data
corruption from a power failure, and you *don't* have it on a UPS??

*All* of my servers and hosts are protected by UPSs. Not just the critical
ones (although, believe it or not, the Chairman was *ing about spending
the $1000 for it, for a production line generating $500K/shift in revenue),
but even the mundane news/mail systems.

I don't have print servers protected by UPSs, but I don't think they need
it--after all, if the individual workstations go down, I'm not worried
about their printouts.

At UPS prices, compared to the cost of downtime (how many programmers does it
take to cost $1K/hour? Fewer than 23, not counting disruption-induced
inefficiency), there is no excuse for not having UPS protection on systems,
particularly "mission critical" ones.
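To make the cost comparison concrete, here is a small Python calculation. The $1000 UPS price, the $1K/hour downtime figure, and the $500K/shift line revenue come from the post; the $50/hour loaded cost per programmer and the 8-hour shift are hypothetical assumptions, so substitute your own site's numbers.

```python
import math

# Figures taken from the post.
UPS_COST = 1000.0            # dollars for one UPS
DOWNTIME_COST = 1000.0       # dollars per hour of downtime ("$1K/hour")
SHIFT_REVENUE = 500_000.0    # dollars of revenue per production shift

# Hypothetical assumptions, not from the post.
PROGRAMMER_RATE = 50.0       # loaded cost per programmer-hour, in dollars
SHIFT_HOURS = 8.0            # hours per shift


def programmers_needed(hourly_cost: float, rate: float) -> int:
    """Smallest whole number of programmers whose combined rate reaches hourly_cost."""
    return math.ceil(hourly_cost / rate)


def break_even_minutes(ups_cost: float, loss_per_hour: float) -> float:
    """Minutes of avoided downtime after which the UPS has paid for itself."""
    return ups_cost / loss_per_hour * 60.0


if __name__ == "__main__":
    # Minimum loaded rate at which 22 programmers (fewer than 23) cost $1K/hour.
    print(round(DOWNTIME_COST / 22, 2))                                # 45.45
    print(programmers_needed(DOWNTIME_COST, PROGRAMMER_RATE))          # 20
    print(break_even_minutes(UPS_COST, DOWNTIME_COST))                 # 60.0
    line_loss_per_hour = SHIFT_REVENUE / SHIFT_HOURS                   # 62500.0
    print(round(break_even_minutes(UPS_COST, line_loss_per_hour), 2))  # 0.96
```

Under these assumptions, a $1000 UPS pays for itself after one hour of downtime at $1K/hour, and after roughly one minute of lost production on a $500K/shift line.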

Obviously, I consider your complaint to be out of line; Unix machines are
somewhat fragile, there's no secret about that, and it's the responsibility
of the owner to protect them from things like power failures.

Incidentally, there are reasons that data corruption occurs; it could be
almost completely prevented, at the expense of terrible performance. I'll
let someone with OS design experience explain the details.
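The post leaves the details to OS designers. As a rough illustration of the performance-versus-safety tradeoff it mentions, here is a minimal Python sketch. It assumes a POSIX-style system where writes normally sit in an in-memory buffer cache before reaching disk, and it works at the application level with `os.fsync`; it is not how HP-UX or any filesystem protects its own metadata. The function name, file name, and record format are hypothetical.

```python
from __future__ import annotations

import os
import tempfile


def write_records(path: str, records: list[bytes], synchronous: bool) -> None:
    """Write records to path; if synchronous, force each one to disk with fsync."""
    with open(path, "wb") as f:
        for rec in records:
            f.write(rec)
            if synchronous:
                f.flush()             # move Python's userspace buffer into the kernel
                os.fsync(f.fileno())  # ask the kernel to write it to the storage device now
    # Without fsync, recent writes may still sit in the kernel's buffer cache
    # when power fails, so they can be lost or left half-written.


if __name__ == "__main__":
    records = [b"record %d\n" % i for i in range(100)]
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data.log")  # hypothetical file name
        write_records(path, records, synchronous=True)
        with open(path, "rb") as f:
            print(f.read() == b"".join(records))  # True
```

With `synchronous=True` every record costs a trip to the disk, which is typically much slower than letting the kernel batch writes; that cost is the "terrible performance" side of the tradeoff.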

And if you think crashing an HP 9000 is *, you should be around when
a dual 3090 mainframe setup crashes. Imagine a dozen people running around
frantically for 8-10 hours....

--

The Chairman of the Board and the CFO speak for SCI. I'm neither.
"...I looked out my window, and saw Kyle Petty's car upside down, then I
thought 'One of us is in real trouble'." Davey Allison, re: a 150MPH crash

 
 
 

My first root filesystem corruption. :-(

UNIX is complex; I think this is good.  You are free to think it is bad.

Integrate this with an automatic shutdown utility.  There are several available.

There is code in UnixWorld? in the column by R. Thomas.  This included a cute
power-failure detection device.

I would recommend to you that you buy a packaged solution including UPS,
software, integration, and site survey.

Yes, it is.  It is also inevitable.  The complexity helps to identify potential
problems and provide maximum recovery options without too much overhead.
You must know quite a bit about UNIX and your system to utilize this
effectively.

I do believe it is already there.

--

418 Winfield Dr         (317) 284-7131   work
Greenfield, IN 46140