fs/locks.c rewrite for better locking

fs/locks.c rewrite for better locking

Post by Dave Hansen » Wed, 22 May 2002 08:50:08



I'm cc'ing LKML to see if anyone else has some input on this.

LKML folks,
   Matt and I are discussing moving fs/locks.c away from the BKL and
to a better locking scheme.

Here's an overview of what is happening right now:

Current scheme, locks are always:
1. on a per-inode list
AND
(
2. on the file_lock_list
OR
3. on the blocked_list
)

All operations on all of those lists are guarded using the BKL.

With the current layout of data structures, finely grained locking
will be a pain.  There will continue to be a need for these locks:
1. 1 for each of the 2 global lists
2. a lock for each inode's lock list
3. a lock (or refcount) for every single lock, because it could be
referenced from either the inode's list OR the global lists.
OR, a single global lock for everything (which I don't like)

   That's why I want to change the global lock list storage to use the
already existing per-inode lists.  I want a file_lock_list to be a
list of inodes that have locks outstanding on them.  This global list
of inodes will need race protection as will the per-inode flock list.
  The print_lock_info thing (whatever it's called) will look something
like this:

read_lock( global_flock_list )
for each inode with locks {
        read_lock( inode->flock_list_lock )
        for each flock on inode {
                print stuff for this lock
        }
        read_unlock( inode->flock_list_lock )
}
read_unlock( global_flock_list )
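To make that concrete, here's a rough C sketch of what I'm picturing.
Everything below is invented for illustration (global_flock_list,
global_flock_list_lock, and the two new inode fields); only
inode->i_flock and fl_next exist today:

/* Hypothetical sketch -- none of these names exist in the tree yet. */
static rwlock_t global_flock_list_lock = RW_LOCK_UNLOCKED;
static LIST_HEAD(global_flock_list);            /* inodes with locks */

/* New fields that would be added to struct inode:
 *      struct list_head i_flock_link;    entry on global_flock_list
 *      rwlock_t         flock_list_lock; guards inode->i_flock
 */

/* The /proc/locks-style walk from the pseudocode above: */
static void show_all_locks(void)
{
        struct list_head *p;

        read_lock(&global_flock_list_lock);
        list_for_each(p, &global_flock_list) {
                struct inode *inode =
                        list_entry(p, struct inode, i_flock_link);
                struct file_lock *fl;

                read_lock(&inode->flock_list_lock);
                for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
                        /* print stuff for this lock */
                }
                read_unlock(&inode->flock_list_lock);
        }
        read_unlock(&global_flock_list_lock);
}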

>>inode_list fl_inode_list* head;
>>Is the use of blocked_list just an optimization?  Could
>>posix_locks_deadlock theoretically go through all of the locks (from
>>the global list) and ignore the ones that aren't posix and aren't
>>blocking?

> Theoretically, yes, because that's what the code used to do ;-)
> That was really ugly code, though.  Not to mention that it's an O(n^2)
> algorithm, which you really don't want to do while holding a global lock.
> It's still O(n^2), but at least I reduced the size of N to the number
> of tasks currently waiting for a lock.  (Actually, it used to be O(n^3)
> because it was written badly.  But we'll skip over that.)

Since this will only require a read lock, it should be significantly
less costly than when the BKL was held.  Also, we're dealing with
quite small numbers here, or we wouldn't be using linked lists.  If
optimization is needed, perhaps the global-inodes-with-locks-on-them
list can be supplemented with a
global-inodes-with-locks-blocking-on-them list too.  However, I don't
like the looks of it and I think that to do it cleanly, it will be
necessary to add _another_ per-inode list for the blocked locks.  This
approach is starting to sound way too complex.
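For reference, my (possibly wrong) reading of what the current check
does is roughly this: start from the lock we would block on and chase
the chain of blockers through blocked_list, so only waiting tasks are
ever visited.  This is a paraphrase of the shape of the code, not the
literal posix_locks_deadlock source:

/* Paraphrase only.  Assumes blocked_list is the global list of
 * blocked locks, linked through fl_link as described above. */
static int would_deadlock(struct file_lock *caller_fl,
                          struct file_lock *block_fl)
{
        fl_owner_t owner = block_fl->fl_owner;
        struct list_head *p;

next_blocker:
        if (owner == caller_fl->fl_owner)
                return 1;       /* chain leads back to us: deadlock */
        list_for_each(p, &blocked_list) {
                struct file_lock *fl =
                        list_entry(p, struct file_lock, fl_link);
                if (fl->fl_owner == owner) {
                        /* that owner is itself waiting; follow its blocker */
                        owner = fl->fl_next->fl_owner;
                        goto next_blocker;
                }
        }
        return 0;               /* chain ends at a task that isn't waiting */
}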

>>2.4.17-something, one of our patch sets that we test.  There are some
>>BKL changes that mimic the 2.5 behavior.

> Ah, that might help... I was pounding on some SCSI discs trying to saturate
> them and got hung up on the io_request_lock.

You might want to check out our lse04 patch.  It has all of the stuff
that we can't get into 2.4, including (I think) io_request_lock removal
for the popular SCSI drivers.  It is quite stable and tested
thoroughly.  Look at the very bottom of our sourceforge files page:
http://www.veryComputer.com/

--
Dave Hansen


fs/locks.c rewrite for better locking

Post by Matthew Wilcox » Wed, 22 May 2002 09:40:06



> With the current layout of data structures, finely grained locking
> will be a pain.  There will continue to be a need for these locks:
> 1. 1 for each of the 2 global lists
> 2. a lock for each inode's lock list
> 3. a lock (or refcount) for every single lock, because it could be
> referenced from either the inode's list OR the global lists.

I don't believe #3 is necessary.  At any given point in a lock's
existence, it is either:

(a) blocking on an active lock (attached to its blocking lock through
    fl_block, attached to the list of blocked locks through fl_link,
    and pointing at its blocking lock with fl_next)
(b) an active lock (attached to any locks blocking on it through
    fl_block, attached to the list of global locks through fl_link and
    pointing at the next active lock with fl_next).

So I can't see the need for a reference count.  We always know where
it's referenced from and what its lifecycle is.  I'm not surprised
you're confused though -- locks.c is still a mess.  I think I've got
some improvements, but I'm waiting for OSDL to give me access to a quad
CPU machine to test on.

Quote:> OR, a single global lock for everything (which I don't like)

I prefer a single global lock.  Yes, it will be held for longer than
a per-inode lock would be, but it doesn't bloat struct inode any
further.  Also, there's never a lock inversion issue, which I believe
the algorithm you describe below has.

Insert lock: grab the inode-file-lock for write, grab the global-file-lock
for write, insert the lock, release both.
Display locks: grab the global-file-lock for read, grab the inode-file-lock
for read, display the information, release both.

So insert grabs the inode lock and display grabs the global lock; insert
spins waiting for exclusive access to the global lock, and display spins
waiting for insert's exclusive hold on the inode lock to be released.  Boom.
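A minimal sketch of that inversion, using your names from above (still
purely illustrative -- it assumes the global_flock_list_lock rwlock and
the per-inode flock_list_lock from your earlier mail):

/* Illustrative only: the AB-BA ordering problem with the two rwlocks. */
static void insert_lock(struct inode *inode, struct file_lock *fl)
{
        write_lock(&inode->flock_list_lock);   /* inode lock first... */
        write_lock(&global_flock_list_lock);   /* ...then global      */
        /* link fl onto the inode's list; link the inode onto the
         * global list if it isn't there already */
        write_unlock(&global_flock_list_lock);
        write_unlock(&inode->flock_list_lock);
}

static void display_locks(struct inode *inode)
{
        read_lock(&global_flock_list_lock);    /* global lock first... */
        read_lock(&inode->flock_list_lock);    /* ...then inode        */
        /* print the locks on this inode */
        read_unlock(&inode->flock_list_lock);
        read_unlock(&global_flock_list_lock);
}

/* CPU0 in insert_lock() holds the inode lock and spins for the global
 * lock; CPU1 in display_locks() holds the global lock for read and
 * spins for the inode lock.  Neither can make progress: deadlock. */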

Quote:>    That's why I want to change the global lock list storage to use the
> already existing per-inode lists.  I want a file_lock_list to be a
> list of inodes that have locks outstanding on them.  This global list
> of inodes will need race protection as will the per-inode flock list.
>   The print_lock_info thing (whatever it's called) will look something
> like this:

> read_lock( global_flock_list )
> for each inode with locks {
>    read_lock( inode->flock_list_lock )
>    for each flock on inode {
>            print stuff for this lock
>    }
>    read_unlock( inode->flock_list_lock )
> }
> read_unlock( global_flock_list )

> >>inode_list fl_inode_list* head;
> >>Is the use of blocked_list just an optimization?  Could
> >>posix_locks_deadlock theoretically go through all of the locks (from
> >>the global list) and ignore the ones that aren't posix and aren't
> >>blocking?

> > Theoretically, yes, because that's what the code used to do ;-)
> > That was really ugly code, though.  Not to mention that it's an O(n^2)
> > algorithm, which you really don't want to do while holding a global lock.
> > It's still O(n^2), but at least I reduced the size of N to the number
> > of tasks currently waiting for a lock.  (Actually, it used to be O(n^3)
> > because it was written badly.  But we'll skip over that.)

> Since this will only require a read lock, it should be significantly
> less costly than when the BKL was held.  Also, we're dealing with
> quite small numbers here, or we wouldn't be using linked lists.  If
> optimization is needed, perhaps the global-inodes-with-locks-on-them
> list can be supplemented with a
> global-inodes-with-locks-blocking-on-them list too.  However, I don't
> like the looks of it and I think that to do it cleanly, it will be
> necessary to add _another_ per-inode list for the blocked locks.  This
> approach is starting to sound way too complex.

The cost will still be high.  The N in the approach you want to take
is roughly the number of locks existing in the system (each lock has
to be checked to see whether any lock is currently blocking on it).
The N in the current approach is the number of locks which are currently
blocking in the system.  Since a task can only be blocking on one lock,
this is the number of tasks in the system currently blocking on a lock.
I think you underestimate how big a difference this really is.

Typically, there are maybe 8 processes blocked on locks (those apache
children using it as a mutex ;-).  O(N^2) for this case is O(64).
If we're talking about a system which is big enough that this kind of
lock scalability makes a difference, there must be thousands of locks
being taken and released every second.  O(N^2) would be O(1000000),
with a global lock held -- 15000 times longer!
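Putting rough numbers on that (the same 8 waiters vs. ~1000 live locks):

        8 waiters:    N^2 =    8 *    8 =        64 comparisons
        1000 locks:   N^2 = 1000 * 1000 = 1,000,000 comparisons
        1,000,000 / 64 is roughly 15,600 -- hence "15000 times longer".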

--
Revolutions do not require corporate support.

fs/locks.c: Fix posix locking for threaded tasks

Saurabh Desai believes that locks created by threads should not conflict
with each other.  I'm inclined to agree; I don't know why the test for
->fl_pid was added, but the comment suggests that whoever added it wasn't
sure either.

Frankly, I have no clue about the intended semantics for threads, and
SUS v3 does not offer any enlightenment.  But it seems reasonable that
processes which share a files_struct should share locks.  After all,
if one of them closes the fd, that close will remove the locks belonging
to the other process.
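For context, the relevant definitions as I remember them (the typedef
is from include/linux/fs.h; treat the assignment site as approximate,
I'm quoting from memory):

/* The owner of a POSIX lock is a files_struct pointer: */
typedef struct files_struct *fl_owner_t;

/* And when a POSIX lock is created, roughly:
 *      fl->fl_owner = current->files;
 *      fl->fl_pid   = current->pid;
 * Tasks cloned with CLONE_FILES share current->files, so with the
 * fl_pid test removed they compare as a single lock owner, which is
 * the behaviour Saurabh wants. */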

Here's a patch generated against 2.4; it also applies to 2.5.
Please apply.

===== fs/locks.c 1.9 vs edited =====
--- 1.9/fs/locks.c      Mon Jun  3 18:49:43 2002

 }

 /*
- * Check whether two locks have the same owner
- * N.B. Do we need the test on PID as well as owner?
- * (Clone tasks should be considered as one "owner".)
+ * Locks are deemed to have the same owner if the tasks share files_struct.
  */
 static inline int
 locks_same_owner(struct file_lock *fl1, struct file_lock *fl2)
 {
-       return (fl1->fl_owner == fl2->fl_owner) &&
-              (fl1->fl_pid   == fl2->fl_pid);
+       return (fl1->fl_owner == fl2->fl_owner);
 }

 /* Remove waiter from blocker's block list.

--
Revolutions do not require corporate support.