Filesystem semantics protecting meta data ... and users data

Filesystem semantics protecting meta data ... and users data

Post by David Holland » Thu, 09 Jun 1994 22:45:11




 > > <lots of flames about asynchronous write semantics of UNIX file systems
 > >  deleted>
 >
 > Get a clue. Whoever writes sensitive data through the normal file
 > system with normal semantics deserves to get corrupted data. If you
 > want to commit data into a UNIX file system, you can do that perfectly
 > well by either using fsync(2) or by opening the file with O_SYNC
 > in the first place.
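
(For reference, the two commit mechanisms mentioned above look roughly
like this - a minimal sketch in C, with error checks omitted and
made-up filenames:)

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "data that must survive a crash\n";

        /* Option 1: ordinary write, then commit it with fsync(2). */
        int fd = open("/tmp/example1", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        write(fd, buf, sizeof buf - 1);
        fsync(fd);        /* returns only once the data has reached the disk */
        close(fd);

        /* Option 2: O_SYNC makes every write(2) itself synchronous. */
        fd = open("/tmp/example2", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0600);
        write(fd, buf, sizeof buf - 1);   /* blocks until the data is on disk */
        close(fd);
        return 0;
    }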

You missed the point.

If metadata is written ahead of the actual data, a crash at the wrong
time can conceivably cause file X to contain the data from file Y.

If file Y is payroll records and file X is Joe User's email, this is
not very good; you can solve the problem by writing file X
synchronously, but most of the time Joe User isn't going to do that -
especially not if he's Joe Hacker trying to exploit the problem.

This is how it happens, as the original poster presented it:

                file Y is deleted --->
                                        <--- file X is written, using
                                             blocks formerly from file Y
        file Y's inode is written --->
                                        <--- file X's inode is written
                                   <CRASH>
                                        <--- file X's data was never written

Now, after recovery and reboot, file X contains some blocks that used
to be in file Y... which still contain the data from file Y.

Security breach.

I don't know if this is actually possible with current filesystems;
I'd hope not, but...

--
   - David A. Holland          | "The right to be heard does not automatically

 
 
 

Filesystem semantics protecting meta data ... and users data

Post by Burkhard Neidecker-Lutz » Fri, 10 Jun 1994 20:26:05



>This is how it happens, as the original poster presented it:

NOT

>            file Y is deleted --->

so blocks are being released, causing metadata to be updated *on disk*
before any other file gets a chance of reusing that block.

>                                        <--- file X is written, using
>                                             blocks formerly from file Y

Which are zero-filled at that time.

>        file Y's inode is written --->

Can't happen now (after all, Y was deleted and hence its inode
is *gone*). Even if Y was truncated by ftruncate(2), that would have
been noted on disk *before* the blocks could be gotten at from
X.

>                                        <--- file X's inode is written
>                               <CRASH>
>                                    <--- file X's data was never written

>Now, after recovery and reboot, file X contains some blocks that used
>to be in file Y... which still contain the data from file Y.

>Security breach.

Maybe on UNIX V6 15 years ago.

>I don't know if this is actually possible with current filesystems;
>I'd hope not, but...

Can't speak for SUN :-), but can't happen on any modern UNIX I know.

                Burkhard Neidecker-Lutz

Distributed Multimedia Group, CEC Karlsruhe
Advanced Technology Group, Digital Equipment Corporation


 
 
 

Filesystem semantics protecting meta data ... and users data

Post by Frank Lofa » Sat, 11 Jun 1994 01:51:18




>>This is how it happens, as the original poster presented it:

>NOT

>>                file Y is deleted --->

>so blocks are being released, causing metadata to be updated *on disk*
>before any other file gets a chance of reusing that block.

>>                                        <--- file X is written, using
>>                                             blocks formerly from file Y

>Which are zero-filled at that time.

>>        file Y's inode is written --->

>Can't happen now (after all, Y was deleted and hence its inode
>is *gone*). Even if Y was truncated by ftruncate(2), that would have
>been noted on disk *before* the blocks could be gotten at from
>X.

>>                                        <--- file X's inode is written
>>                                   <CRASH>
>>                                        <--- file X's data was never written

>>Now, after recovery and reboot, file X contains some blocks that used
>>to be in file Y... which still contain the data from file Y.

>>Security breach.

>Maybe on UNIX V6 15 years ago.

>>I don't know if this is actually possible with current filesystems;
>>I'd hope not, but...

>Can't speak for SUN :-), but can't happen on any modern UNIX I know.

What about Linux ext2fs?
Can it happen on it?

>            Burkhard Neidecker-Lutz

>Distributed Multimedia Group, CEC Karlsruhe
>Advanced Technology Group, Digital Equipment Corporation


 
 
 

Filesystem semantics protecting meta data ... and users data

Post by Stefan Esser » Sun, 12 Jun 1994 01:29:11



|> NOT
|> >
|> >              file Y is deleted --->
|>
|> so blocks are being released, causing metadata to be updated *on disk*
|> before any other file gets a chance of reusing that block.
|>
|> >                                      <--- file X is written, using
|> >                                           blocks formerly from file Y
|>
|> Which are zero-filled at that time.

They are zero-filled on disk????
In case of a crash it doesn't make much of a difference whether
they had been zero-filled in RAM ...

|> >                                      <--- file X's inode is written
|> >                                 <CRASH>
|> >                                      <--- file X's data was never written
|> >
|> >Now, after recovery and reboot, file X contains some blocks that used
|> >to be in file Y... which still contain the data from file Y.
|> >
|> >Security breach.
|>
|> Maybe on UNIX V6 15 years ago.

Really?

Is there an implied fsync on the file being closed, before writing
updated inode information to disk?

(I remember that closing a file under Ultrix 4 could take quite some
time, so maybe this is really done in Ultrix? It made our main server
often unusable for minutes, since one of the most important programs
wrote some 20 files of 64MB each on exit. We reduced the size of the
buffer cache to shorten the time to flush the files to disk, since
other disk operations on that drive were blocked from the moment fclose
was called until it returned ...)

How about indirect blocks on large files (say 2GB)? You can't keep
all of them in RAM at all times (2GB/8KB * 4 bytes = 1MB of pointers), can you?
(We DO write files of that size on our system regularly, so it's not
only of academic interest to me.)
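
Working that estimate through (a back-of-envelope check, assuming 8KB
blocks and classic 4-byte disk block pointers):

        2GB / 8KB           = 262,144 data blocks to map
        262,144 * 4 bytes   = 1MB of block pointers
        8KB / 4 bytes       = 2,048 pointers per indirect block
        262,144 / 2,048     = 128 single-indirect blocks to hold them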

If you update the metadata of large files before the corresponding
data blocks are guaranteed to be written to disk, the above scenario
doesn't seem impossible to me, even under modern UNIXes.

You'd have to keep buffer cache blocks and the corresponding metadata
linked in some way, to be sure you always write data before metadata.
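
A minimal sketch of what such a link might look like (illustrative
only - the names and structure here are invented, not taken from any
real kernel; real implementations are far more elaborate):

    #include <stdio.h>

    struct buf {
        const char *name;              /* e.g. "data block of file X" */
        int         dirty;
        struct buf *must_write_first;  /* flush that buffer before this one */
    };

    /* Stand-in for the real disk driver. */
    static void disk_write(struct buf *bp)
    {
        printf("writing %s\n", bp->name);
        bp->dirty = 0;
    }

    /* Honor the ordering link: data before the metadata naming it. */
    static void flush(struct buf *bp)
    {
        if (bp->must_write_first && bp->must_write_first->dirty)
            flush(bp->must_write_first);
        if (bp->dirty)
            disk_write(bp);
    }

    int main(void)
    {
        struct buf data  = { "data block of file X",  1, NULL  };
        struct buf inode = { "inode block of file X", 1, &data };
        flush(&inode);     /* writes the data block first, then the inode */
        return 0;
    }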

The main problem with asynchronous inode updates was that if a directory
had just been created, the inode already written to disk, but the data
block still contained ordinary file data (or, worse, a previously deleted
directory), then fsck often did silly things. Worst of all was the
possibility of an indirect block number being written into an inode (on
disk) while this block (on disk) still contained ordinary file data.

Always writing inode blocks synchronously when creating or removing a
directory or allocating a new indirect block makes fsck work much more
reliably.

But it doesn't guarantee that data blocks from another data file don't
end up in your data file, since that doesn't confuse fsck, but it may
confuse the previous owner of that data :).

And that's what the initiator of this thread said ...

--

 Mathematisches Institut                Tel:            +49 221 4706010
 Universitaet zu Koeln                  FAX:            +49 221 4705160
 Weyertal 80
 50931 Koeln

 
 
 

Filesystem semantics protecting meta data ... and users data

Post by Totally Lo » Sun, 12 Jun 1994 08:14:28




>>Now, after recovery and reboot, file X contains some blocks that used
>>to be in file Y... which still contain the data from file Y.

>>Security breach.

>Maybe on UNIX V6 15 years ago.

>>I don't know if this is actually possible with current filesystems;
>>I'd hope not, but...

>Can't speak for SUN :-), but can't happen on any modern UNIX I know.

>            Burkhard Neidecker-Lutz

I think the "three strikes and Burkhard is out" posting made it clear how
the UFS filesystem can leave trash data in a user's file at a crash.

On machines that ran UNIX V6, most were lucky to have enough memory
for 5-10 512-byte buffers ... systems with more than 25 buffers were
fairly rare. That is barely enough for the filesystem to run without
deadlock ... there was no code room or buffer space to implement a more
robust policy ... Ken did well. I'm amazed that I ran 3-4 users
on my first UNIX child ... a PDP-11/34 with 96KB memory, a 2.5MB disk,
two DECtapes, a 9-track tape, two plotters, two digitizing tablets,
two terminals, two high-performance Lundy graphics subsystems,
and a dream at Cal Poly San Luis Obispo.

Up until this point, room for kernel code was limited by the 16-bit
address space and the fact that most UNIX machines only had 128K to
256K of DRAM to run 16-64 users. Anything added to the kernel reduced
the amount of USER DRAM and increased the amount of swapping.

By 1980, when the BSD team started their work on the VAX, machines
with 512KB and larger were starting to be common, and the VAX relieved
the kernel of the 17-bit address space limit (dual 16-bit spaces
on the 11/70, 45, and 44).

The network code, which ran in user space as processes (done by
the ARPA community at Urbana, UCLA and BBN), was rewritten at
UCB to drop into the VAX kernel, and BSD was born soon after
with the native port of V7/32V to the VAX. The UNIX kernel then
started its rapid transition to huge.

All systems with filesystems based upon V6/V7/SVR3/BSD have the problem.
This is nearly every system ever shipped.

Fixing the problem means major changes to the filesystem/BIO/driver
relationships, key data structures, and key kernel internal programming
interfaces including driver and FFS/VNODE interfaces.

Twice I have come close to implementing this. First at Fortune Systems,
where Don (1st VP of Engr) and I wrote it into the technical part of
the business plan/product specification in Feb/Mar 1981. In May (??)
Don and I were replaced when we told Homer Dunn (founder) that unless the
software team and development computer funding that were promised
for March were in place in May (??), we would slip first customer ship from
early Dec 81 week by week until they were available. Steve Puthuff and Rick
Kiesig (who replaced Don and me) abandoned this requirement while slipping
the schedule from Dec 81 to Sept 82 week by week. The software staff budget
for 5-7 seasoned programmers turned into 25-plus kids in or just
out of school - none, including Rick, had the experience to do what they
blindly took on. Late in the spring of 82 Homer wanted to fire all of them
for slipping HIS schedule ... seeking my advice on replacing them, I
told him fat chance if he wanted to deliver the product, and that he
had just spent our company's reputation to build/train what would become
one of the better UNIX teams in the Valley. Rick chose to meet a
minimal filesystem hardening requirement by using part of the BSD code.

The second time was recently at SCO, where I attempted several times
to work out a contract to do some significant re-architecting of
the UNIX kernel and filesystem for performance and scaling reasons.
After a couple of false starts, ownership of the kernel technologies
was won by the London team and hopes vanished for doing anything
interesting in Santa Cruz. So what could twice have been a major
UNIX event is still a dream and code fragments in my lab.

The various LFS-style filesystems have the promise of reliability,
but the implementation tradeoffs are performance-crippling for most
desktop systems smaller than the Sprite-sized machines the work
was done for. Locality of data is severely compromised. At last
winter's USENIX WIP session I gave a wake-up call talk on part of these issues.

Every production machine I see is killed by UNIX filesystem I/O
running at 10-20% of what it should be ... by filesystem designers
who insist on using a horse and buggy as the prototype for a space
ship. The recent software bloat caused by X/Motif applications
continues the pressure on the I/O subsystem; combined with
incredibly faster processor technology, the pressure to
replace or rearchitect UNIX will continue into the 90's.

As with my comments about raw I/O in comp.os.research, the critical
problem is people attempting to continue to use outdated decisions
without re-evaluating the assumptions and tradeoffs involved.
The current UNIX filesystem architecture is critically flawed
on all major fronts - performance, reliability and security - and
lacks key features of the mainframe market it replaces.
OS work today is done mostly by following the herd; critical thinking
is a lost art.

Either Novell and the key players need to get the clue, or UNIX
will be replaced in the passing of time (the 90's).

John

 
 
 

Filesystem semantics protecting meta data ... and users data

Post by David Holland » Sat, 11 Jun 1994 19:25:22



 > >                                      <--- file X is written, using
 > >                                           blocks formerly from file Y
 >
 > Which are zero-filled at that time.

Are they? Is this *guaranteed*? Is there anything that ensures that
these zeros are *written out* before the blocks can be reused? Suppose
the crash occurs after the old file's cleared inode has been written,
but before the zeroed blocks have been written out? Then a second
crash, at the wrong time, could have this same effect. Although I
suppose in this case fsck could take care of the problem.

Not having source handy, I don't know the answer to these questions.

Nonetheless, the behavior you describe is still what I'd consider
broken: after the crash, the new file is found to exist and have the
correct length - but contain nothing. That is, the system fails
silently.

Btw...

 > >      file Y's inode is written --->
 >
 > Can't happen now (after all, Y was deleted and hence its inode
 > is *gone*).

...the inode has to be cleared or otherwise marked unused. Otherwise,
at fsck time, file Y will rise from the grave...

--
   - David A. Holland          | "The right to be heard does not automatically

 
 
 

Filesystem semantics protecting meta data ... and users data

Post by Totally Lo » Sun, 12 Jun 1994 23:43:14





>have the expertise and the ability.

>So why don't *you* write an efficient, secure filesystem for Linux
>(or one of the free versions of BSD, if that suits you better,
>though they may be a bit more entrenched in tradition than the
>Linux developers)?  You have the source code, so you can make any
>changes you need to in the interfaces, drivers, etc. to make it
>happen.  You might even be able to work with the people who wrote
>the device drivers.

I am self-employed, and run a small consulting company with
a house and kids to support (i.e. current fixed expenses >$5K/mo).
My short-term goal is to save enough to return to grad school soon.

This is not the only critical flaw of UFS and other V6/V7/SVRx-derived
filesystems. To do the job right means a ground-up redesign of the
filesystem, BIO, kernel interfaces, drivers, and supporting utilities.
It would take me and two or three junior helpers probably a year to
complete to production release status.

I am not ready to starve for that long and sacrifice my house and
kids (and education) to put a work this size into the public domain.
I'd put the core work into a RAID controller and attempt to make
a product out of it first.  Any other path requires a system
vendor willing to make major changes to the base OS, porting
all drivers and other filesystems (at SCO this is a BIG DEAL),
AND making life tough for a bunch of third parties that will have
to do the same. When I was at SCO they had just introduced SCO/UNIX
and had forced much of the XENIX customer base to make the big
change with them. Every five years is about as often as a
major vendor can afford a wrinkle of this size for its customers.

After SCO I discussed this with both NCR and Compaq
to see if major functionality improvements would be
an incentive to customize the SCO product. SCO is
a STANDARD in their market ... and not to be fiddled
with. Given the NIH factor and no existing relationship,
I didn't see any practical way to get DEC, Sun, or USL
to fund the project. I don't think there are any other
UNIX vendors left besides these that could afford to
pay for it.

The only other long shot is to do the work in FreeBSD
(or the Lite release, when it's out) and make a product out of it.  I
would have to think long and hard about that, and have been.

>Am I right that a priority-based buffer cache would be sufficient
>to get the characteristics you need for the reliability and security
>requirements?

As I discussed elsewhere, a simple priority approach doesn't
quite work ...

John

 
 
 

Filesystem semantics protecting meta data ... and users data

Post by Burkhard Neidecker-Lutz » Mon, 13 Jun 1994 03:06:11



>I think the "three strikes and Burkhard is out" posting made it clear how
>the UFS filesystem can leave trash data in a user's file at a crash.

Yes...

>All systems with filesystems based upon V6/V7/SVR3/BSD have the problem.
>This is nearly every system ever shipped.

So count out AIX and DEC OSF/1.

>Fixing the problem means major changes to the filesystem/BIO/driver
>relationships, key data structures, and key kernel internal programming
>interfaces including driver and FFS/VNODE interfaces.

Yes.

>Twice I have come close to implementing this.
> So what could twice have been a major
>UNIX event is still a dream and code fragments in my lab.

A reality in DEC OSF/1.

>The various LFS-style filesystems have the promise of reliability,
>but the implementation tradeoffs are performance-crippling for most
>desktop systems smaller than the Sprite-sized machines the work
>was done for. Locality of data is severely compromised. At last
>winter's USENIX WIP session I gave a wake-up call talk on part of these issues.

You don't need to do a log-structured file system to get the
reliability associated with keeping a log. AdvFS (and, as far as
I know, JFS in AIX) are not log-structured.

>Every production machine I see is killed by UNIX filesystem I/O
>running at 10-20% of what it should be ... by filesystem designers
>who insist on using a horse and buggy as the prototype for a space
>ship. The recent software bloat caused by X/Motif applications
>continues the pressure on the I/O subsystem; combined with
>incredibly faster processor technology, the pressure to
>replace or rearchitect UNIX will continue into the 90's.

>As with my comments about raw I/O in comp.os.research, the critical
>problem is people attempting to continue to use outdated decisions
>without re-evaluating the assumptions and tradeoffs involved.
>The current UNIX filesystem architecture is critically flawed
>on all major fronts - performance, reliability and security - and
>lacks key features of the mainframe market it replaces.
>OS work today is done mostly by following the herd; critical thinking
>is a lost art.

Maybe in the UNIX Sys V crowd, not here.

                Burkhard Neidecker-Lutz

Distributed Multimedia Group, CEC Karlsruhe
Advanced Technology Group, Digital Equipment Corporation

 
 
 

Filesystem semantics protecting meta data ... and users data

Post by John F. Haugh » Tue, 14 Jun 1994 07:53:07




>>Now, after recovery and reboot, file X contains some blocks that used
>>to be in file Y... which still contain the data from file Y.

>>Security breach.

>Maybe on UNIX V6 15 years ago.

Two nits -- V6 was more than 15 years ago, and the blocks were zero-
filled on allocation the same way as they are now.  The function in
traditional UNIX systems that allocates blocks is alloc().  Anyone
with a copy of Lions' book can verify my statement.
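
For the curious, the shape of that path (a paraphrase of the 6th
Edition alloc() as described in Lions' commentary - not the literal
source; getblk() and clrbuf() are real V6 kernel routines, while
take_from_free_list() stands in for the free-list code done inline
in the original):

    struct buf;                                   /* kernel buffer header */
    extern struct buf *getblk(int dev, int bno);  /* get a cache buffer */
    extern void clrbuf(struct buf *bp);           /* zero the buffer in core */
    extern int take_from_free_list(int dev);      /* next free block number */

    struct buf *alloc(int dev)
    {
        int bno = take_from_free_list(dev);
        struct buf *bp = getblk(dev, bno);
        clrbuf(bp);        /* zeroed in the buffer cache, not yet on disk */
        return bp;         /* the caller never sees an un-zeroed block */
    }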
--
John F. Haugh II  [ NRA-ILA ] [ Kill Barney ] !'s: ...!cs.utexas.edu!rpp386!jfh

 There are three documents that run my life: The King James Bible, the United
 States Constitution, and the UNIX System V Release 4 Programmer's Reference.
 
 
 

Filesystem semantics protecting meta data ... and users data

Post by John F. Haugh » Tue, 14 Jun 1994 07:59:07



>Are they? Is this *guaranteed*? Is there anything that ensures that
>these zeros are *written out* before the blocks can be reused? Suppose
>the crash occurs after the old file's cleared inode has been written,
>but before the zeroed blocks have been written out? Then a second
>crash, at the wrong time, could have this same effect. Although I
>suppose in this case fsck could take care of the problem.

The 6th Edition alloc() function wouldn't return the block number
until the block had been zero'd.  That is how strong the guarantee is.
--
John F. Haugh II  [ NRA-ILA ] [ Kill Barney ] !'s: ...!cs.utexas.edu!rpp386!jfh

 There are three documents that run my life: The King James Bible, the United
 States Constitution, and the UNIX System V Release 4 Programmer's Reference.
 
 
 

Filesystem semantics protecting meta data ... and users data

Post by Kurt Kovach » Tue, 14 Jun 1994 21:08:56


>The current UNIX filesystem architecture is critically flawed
>on all major fronts - performance, reliability and security - and
>lacks key features of the mainframe market it replaces.
>OS work today is done mostly by following the herd; critical thinking
>is a lost art.

>Either Novell and the key players need to get the clue, or UNIX
>will be replaced in the passing of time (the 90's).

Can we get a posting, by one or more of the authors of the prior
postings, of references to work that solves some or all of the problems
of performance, reliability and security?

I would like to get a clue, but have been at least partially blinded
by the fact that UNIX file systems and UFS are generally what is taught
and held up as examples of how to do file systems. So in particular
I would like to see how other OS's solve the problem.

Kurt Kovach                     "My opinions are my own."
Novell, Summit

 
 
 

Filesystem semantics protecting meta data ... and users data

Post by David Holland » Tue, 14 Jun 1994 23:45:52


 > The 6th Edition alloc() function wouldn't return the block number
 > until the block had been zero'd.  That is how strong the guarantee is.

That would do it.  

...as long as it's done synchronously and the cache doesn't trap it.

Btw, your news header is missing the domain name.

--
   - David A. Holland          | "The right to be heard does not automatically

 
 
 

Filesystem semantics protecting meta data ... and users data

Post by Ken Pizzini » Wed, 15 Jun 1994 05:02:34





>>Are they? Is this *guaranteed*? Is there anything that ensures that
>>these zeros are *written out* before the blocks can be reused? Suppose
>>the crash occurs after the old file's cleared inode has been written,
>>but before the zeroed blocks have been written out? Then a second
>>crash, at the wrong time, could have this same effect. Although I
>>suppose in this case fsck could take care of the problem.

>The 6th Edition alloc() function wouldn't return the block number
>until the block had been zero'd.  That is how strong the guarantee is.

To *disk*?

                --Ken Pizzini

 
 
 

Filesystem semantics protecting meta data ... and users data

Post by Dan Swartzendruber » Wed, 15 Jun 1994 20:33:00






>>The 6th Edition alloc() function wouldn't return the block number
>>until the block had been zero'd.  That is how strong the guarantee is.

>To *disk*?

No, why does that matter?  When you alloc a block from the free list,
before using it for any file, you zero it.  

--

#include <std_disclaimer.h>

Dan S.

 
 
 

Filesystem semantics protecting meta data ... and users data

Post by David Holland » Wed, 15 Jun 1994 21:12:55



 > >>The 6th Edition alloc() function wouldn't return the block number
 > >>until the block had been zero'd.  That is how strong the guarantee is.
 > >
 > >To *disk*?
 >
 > No, why does that matter?  When you alloc a block from the free list,
 > before using it for any file, you zero it.  

Haven't you been following the thread? If you don't flush those zeros
*to the disk* before using the block, a badly-timed system crash can
cause the UNZEROED blocks to appear in the new file - containing who
knows what kind of private data.
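
In timeline form (same notation as the diagram earlier in the thread,
assuming the block is zeroed only in the buffer cache):

        block freed, zeroed in cache --->
                                        <--- new file's inode is written,
                                             pointing at the block
                                   <CRASH>
                                        <--- the zeros never reached the disk;
                                             the old contents are still there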

--
   - David A. Holland          | "The right to be heard does not automatically