MM patches against 2.5.31

Post by Andrew Morton » Fri, 23 Aug 2002 11:30:06



I've uploaded a rollup of pending fixes and feature work
against 2.5.31 to

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/

The rolled up patch there is suitable for ongoing testing and
development.  The individual patches are in the broken-out/
directory and should all be documented.

broken-out/linus.patch
  Incremental BK patch from Linus' tree

broken-out/page_reserved.patch
  Test PageReserved in pagevec_release()

broken-out/scsi_hack.patch
  Fix block-highmem for scsi

broken-out/page_cache_release_lru_fix.patch
  Fix a race between __page_cache_release() and shrink_cache().

broken-out/page_cache_release_fix.patch
  Fix __page_cache_release() bugs

broken-out/mvm.patch
  Fix vmalloc bugs

broken-out/pte-chain-fix.patch
  Fix a VM lockup on uniprocessors

broken-out/func-fix.patch
  gcc-2.91.66 does not support __func__

broken-out/ext3-htree.patch
  Indexed directories for ext3

broken-out/rmap-mapping-BUG.patch
  Fix a BUG_ON(page->mapping == NULL) in try_to_unmap()

broken-out/misc.patch
  misc fixlets

broken-out/tlb-speedup.patch
  Reduce typical global TLB invalidation frequency by 35%

broken-out/buffer-slab-align.patch
  Don't align the buffer_head slab on hardware cacheline boundaries

broken-out/zone-rename.patch
  Rename zone_struct->zone, zonelist_struct->zonelist.  Remove zone_t,
  zonelist_t.

broken-out/per-zone-lru.patch
  Per-zone page LRUs

broken-out/per-zone-lock.patch
  Per-zone LRU list locking

broken-out/l1-max-size.patch
  Infrastructure for determining the maximum L1 cache size which the kernel
  may have to support.

broken-out/zone-lock-alignment.patch
  Pad struct zone to ensure that the lru and buddy locks are in separate
  cachelines.

broken-out/put_page_cleanup.patch
  Clean up put_page() and page_cache_release().

broken-out/anon-batch-free.patch
  Batched freeing and de-LRUing of anonymous pages

broken-out/writeback-sync.patch
  Writeback fixes and tuneups

broken-out/ext3-inode-allocation.patch
  Fix an ext3 deadlock

broken-out/ext3-o_direct.patch
  O_DIRECT support for ext3.

broken-out/jfs-bio.patch
  Convert JFS to use direct-to-BIO I/O

broken-out/discontig-paddr_to_pfn.patch
  Convert page pointers into pfns for i386 NUMA

broken-out/discontig-setup_arch.patch
  Rework setup_arch() for i386 NUMA

broken-out/discontig-mem_init.patch
  Restructure mem_init for i386 NUMA

broken-out/discontig-i386-numa.patch
  discontigmem support for i386 NUMA

broken-out/cleanup-mem_map-1.patch
  Clean up lots of open-coded uses of mem_map[].  For ia32 NUMA

broken-out/zone-pages-reporting.patch
  Fix the boot-time reporting of each zone's available pages

broken-out/enospc-recovery-fix.patch
  Fix the __block_write_full_page() error path.

broken-out/fix-faults.patch
  Back out the initial work for atomic copy_*_user()

broken-out/bkl-consolidate.patch
  Consolidate per-arch lock_kernel() implementations.

broken-out/might_sleep.patch
  Infrastructure to detect sleep-inside-spinlock bugs

broken-out/spin-lock-check.patch
  spinlock/rwlock checking infrastructure

broken-out/atomic-copy_user.patch
  Support for atomic copy_*_user()

broken-out/kmap_atomic_reads.patch
  Use kmap_atomic() for generic_file_read()

broken-out/kmap_atomic_writes.patch
  Use kmap_atomic() for generic_file_write()

broken-out/config-PAGE_OFFSET.patch
  Configurable kernel/user memory split

broken-out/throttling-fix.patch
  Fix throttling of heavy write()rs.

broken-out/dirty-state-accounting.patch
  Make the global dirty memory accounting more accurate

broken-out/rd-cleanup.patch
  Cleanup and fix the ramdisk driver (doesn't work right yet)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org

More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
 
 

MM patches against 2.5.31

Post by Christian Ehrhardt » Fri, 23 Aug 2002 20:40:06



> I've uploaded a rollup of pending fixes and feature work
> against 2.5.31 to

> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/

> The rolled up patch there is suitable for ongoing testing and
> development.  The individual patches are in the broken-out/
> directory and should all be documented.

Sorry, but we still have the page release race in multiple places.
Look at the following (page starts with page_count == 1):

Processor 1                          Processor 2
refill_inactive: lines 378-395
   as page count == 1 we'll
   continue with line 401

                                     __pagevec_release: line 138
                                       calls release_pages
                                     release_pages: line 100-111
                                       put_page_test_zero brings the
                                       page count to 0 and we'll continue
                                       at line 114. Note that this may
                                       happen while another processor holds
                                       the lru lock, i.e. there is no
                                       point in checking for page count == 0
                                       with the lru lock held because
                                       the lru lock doesn't protect against
                                       decrements of page count after
                                       the check.
  line 401: page_cache_get
  resurrects the page, page
  count is now 1.
  lines 402-448.
  line 448 calls __pagevec_release

__pagevec_release: line 138
  calls release_pages
release_pages: lines 100-111
  put_page_test_zero brings the
  page count back to 0 (!!!)
  i.e. we continue at line 114:

  lines 114-123.
  The page count == 0 check in line
  123 is successful and the page
  is returned to the buddy allocator

                                       lines 114-123.
                                       The page count == 0 check in line
                                       123 is successful, i.e. the page
                                       is returned to the buddy allocator
                                       a second time. ===> BOOM

Neither the lru lock nor any of the page count == 0 checks can
prevent this from happening.

    regards   Christian

--
THAT'S ALL FOLKS!

MM patches against 2.5.31

Post by Steven Cole » Sat, 24 Aug 2002 01:10:06



> I've uploaded a rollup of pending fixes and feature work
> against 2.5.31 to

> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/

> The rolled up patch there is suitable for ongoing testing and
> development.  The individual patches are in the broken-out/
> directory and should all be documented.


time since 2.5.31-vanilla there were _no_ BUG()s reported at all.


kjournald starting.  Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,3), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald: page allocation failure. order:0, mode:0x0

The kjournald failure message came out with dbench 48 running on an ext3
partition.  The test continued with only this one instance of this
message.

Steven


MM patches against 2.5.31

Post by Martin J. Bligh » Sat, 24 Aug 2002 01:20:04


> kjournald: page allocation failure. order:0, mode:0x0

I've seen this before, but am curious how we ever passed
a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
that does this?

Thanks,

M.


MM patches against 2.5.31

Post by Steven Cole » Sat, 24 Aug 2002 05:00:05



> > kjournald: page allocation failure. order:0, mode:0x0

> I've seen this before, but am curious how we ever passed
> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> that does this?

> Thanks,

> M.

I ran dbench 1..128 on 2.5.31-mm1 several more times with nothing
unusual happening, and then got this from pdflush with dbench 96.

pdflush: page allocation failure. order:0, mode:0x0

FWIW, this 2.5.31-mm1 kernel is SMP, HIGHMEM4G, no PREEMPT.

Steven


MM patches against 2.5.31

Post by Andrew Morton » Tue, 27 Aug 2002 10:50:04




> > I've uploaded a rollup of pending fixes and feature work
> > against 2.5.31 to

> > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/

> > The rolled up patch there is suitable for ongoing testing and
> > development.  The individual patches are in the broken-out/
> > directory and should all be documented.

> Sorry, but we still have the page release race in multiple places.
> Look at the following (page starts with page_count == 1):

So we do.  It's a hugely improbable race, so there's no huge rush
to fix it.  Looks like the same race is present in -ac kernels,
actually, if add_to_swap() fails.  Also perhaps 2.4 is exposed if
swap_writepage() is synchronous, and page reclaim races with
zap_page_range.  ho-hum.

What I'm inclined to do there is to change __page_cache_release()
to not attempt to free the page at all.  Just let it sit on the
LRU until page reclaim encounters it.  With the anon-free-via-pagevec
patch, very, very, very few pages actually get their final release in
__page_cache_release() - zero on uniprocessor, I expect.

And change pagevec_release() to take the LRU lock before dropping
the refcount on the pages.

That means there will need to be two flavours of pagevec_release():
one which expects the pages to become freeable (and takes the LRU
lock in anticipation of this).  And one which doesn't expect the
pages to become freeable.  The latter will leave the occasional
zero-count page on the LRU, as above.

Sound sane?

MM patches against 2.5.31

Post by Andrew Morton » Tue, 27 Aug 2002 11:10:05



> > kjournald: page allocation failure. order:0, mode:0x0

> I've seen this before, but am curious how we ever passed
> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> that does this?

Could be anywhere, really.  A network interrupt doing GFP_ATOMIC
while kjournald is executing.  A radix-tree node allocation
on the add-to-swap path perhaps.  (The swapout failure messages
aren't supposed to come out, but mempool_alloc() stomps on the
caller's setting of PF_NOWARN.)

Or:

mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
     89

MM patches against 2.5.31

Post by Martin J. Bligh » Tue, 27 Aug 2002 11:20:03


>> > kjournald: page allocation failure. order:0, mode:0x0

>> I've seen this before, but am curious how we ever passed
>> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
>> that does this?

> Could be anywhere, really.  A network interrupt doing GFP_ATOMIC
> while kjournald is executing.  A radix-tree node allocation
> on the add-to-swap path perhaps.  (The swapout failure messages
> aren't supposed to come out, but mempool_alloc() stomps on the
> caller's setting of PF_NOWARN.)

> Or:

> mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
>      89

No, GFP_ATOMIC is not 0:

#define __GFP_HIGH  0x20    /* Should access emergency pools? */
#define GFP_ATOMIC  (__GFP_HIGH)

Looking at all the options:

#define __GFP_WAIT  0x10    /* Can wait and reschedule? */
#define __GFP_HIGH  0x20    /* Should access emergency pools? */
#define __GFP_IO    0x40    /* Can start low memory physical IO? */
#define __GFP_HIGHIO    0x80    /* Can start high mem physical IO? */
#define __GFP_FS    0x100   /* Can call down to low-level FS? */

What worries me is that 0 seems to mean "you can't do anything
to try and free it, but you can't access the emergency pools either".
Seems doomed to failure to me. And the standard sets we have are

#define GFP_NOHIGHIO    (             __GFP_WAIT | __GFP_IO)
#define GFP_NOIO    (             __GFP_WAIT)
#define GFP_NOFS    (             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO)
#define GFP_ATOMIC  (__GFP_HIGH)
#define GFP_USER    (             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_HIGHUSER    (             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS | __GFP_HIGHMEM)
#define GFP_KERNEL  (             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_NFS     (             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_KSWAPD  (             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)

So I think someone's screwed something up, and this is accidental.
Or I'm just totally misunderstanding this, which is perfectly
possible.

M.


MM patches against 2.5.31

Post by Andrew Morton » Tue, 27 Aug 2002 11:30:04



> >> > kjournald: page allocation failure. order:0, mode:0x0

> >> I've seen this before, but am curious how we ever passed
> >> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> >> that does this?

> > Could be anywhere, really.  A network interrupt doing GFP_ATOMIC
> > while kjournald is executing.  A radix-tree node allocation
> > on the add-to-swap path perhaps.  (The swapout failure messages
> > aren't supposed to come out, but mempool_alloc() stomps on the
> > caller's setting of PF_NOWARN.)

> > Or:

> > mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
> >      89

> No, GFP_ATOMIC is not 0:

It's mempool_alloc(GFP_NOIO) or such.  mempool_alloc() strips
__GFP_WAIT|__GFP_IO on the first attempt.

It also disables the printk, so maybe I just dunno ;)  show_stack()
would tell.

MM patches against 2.5.31

Post by Steven Cole » Tue, 27 Aug 2002 12:10:04




> > >> > kjournald: page allocation failure. order:0, mode:0x0

> > >> I've seen this before, but am curious how we ever passed
> > >> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> > >> that does this?

> > > Could be anywhere, really.  A network interrupt doing GFP_ATOMIC
> > > while kjournald is executing.  A radix-tree node allocation
> > > on the add-to-swap path perhaps.  (The swapout failure messages
> > > aren't supposed to come out, but mempool_alloc() stomps on the
> > > caller's setting of PF_NOWARN.)

> > > Or:

> > > mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
> > >      89

> > No, GFP_ATOMIC is not 0:

> It's mempool_alloc(GFP_NOIO) or such.  mempool_alloc() strips
> __GFP_WAIT|__GFP_IO on the first attempt.

> It also disables the printk, so maybe I just dunno ;)  show_stack()
> would tell.

The "kjournald: page allocation failure. order:0, mode:0x0" message and
"pdflush: page allocation failure. order:0, mode:0x0" occurred only once
each on my dual p3 scsi ext3 test box running 2.5.31-mm1.  So, I added
something like this:
--- page_alloc.c.orig   Thu Aug 22 17:27:32 2002

                        printk("%s: page allocation failure."
                                " order:%d, mode:0x%x\n",
                                current->comm, order, gfp_mask);
+                       if (gfp_mask == 0)
+                               BUG();
                }
                return NULL;
        }

and continued testing on Friday with no repeats of the "page allocation failure"
messages.  I obtained a second dual p3 ext3 test box (ide this time) and left both
boxes running 2.5.31-mm1 and the dbench 1..128 stress test scripted to rerun many
times over the weekend.  Due to a couple of firewalls, I can't look at those boxes
from here, but I'll let you know what happened in about 10 to 11 hours.

Cheers,
Steven


MM patches against 2.5.31

Post by Christian Ehrhardt » Tue, 27 Aug 2002 18:20:06


On Sun, Aug 25, 2002 at 06:52:55PM -0700, Andrew Morton wrote:
> Christian Ehrhardt wrote:

> > On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:

> > > I've uploaded a rollup of pending fixes and feature work
> > > against 2.5.31 to

> > > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/

> > > The rolled up patch there is suitable for ongoing testing and
> > > development.  The individual patches are in the broken-out/
> > > directory and should all be documented.

> > Sorry, but we still have the page release race in multiple places.
> > Look at the following (page starts with page_count == 1):

> So we do.  It's a hugely improbable race, so there's no huge rush
> to fix it.

Actually the race seems to happen in real life (it does explain the
pte.chain != NULL BUG) and it is not that improbable with preempt.

> Looks like the same race is present in -ac kernels,
> actually, if add_to_swap() fails.  Also perhaps 2.4 is exposed if
> swap_writepage() is synchronous, and page reclaim races with
> zap_page_range.  ho-hum.

I didn't check each kernel, but I agree that most of the recent kernels
have the potential for this race. Your tree just happened to be the
one where I found it first.

> What I'm inclined to do there is to change __page_cache_release()
> to not attempt to free the page at all.  Just let it sit on the
> LRU until page reclaim encounters it.  With the anon-free-via-pagevec
> patch, very, very, very few pages actually get their final release in
> __page_cache_release() - zero on uniprocessor, I expect.

It's not just a problem with __page_cache_release, but yes, it seems
to be a SMP-only race.

> And change pagevec_release() to take the LRU lock before dropping
> the refcount on the pages.

> That means there will need to be two flavours of pagevec_release():
> one which expects the pages to become freeable (and takes the LRU
> lock in anticipation of this).  And one which doesn't expect the
> pages to become freeable.  The latter will leave the occasional
> zero-count page on the LRU, as above.

> Sound sane?

So the rules would be:
* if you bring the page count to zero, call __free_pages_ok unless the
  page is on the lru.
* if someone (page reclaim) walking the lru finds a page with page count
  zero, remove it from the lru and call __free_pages_ok.

This requires that ANYTHING that ends up calling put_page_testzero
must happen under the lru lock. That doesn't sound like a good idea,
but it seems to be possible to do it race free.

I'd actually go for the following (patch below, in an it-compiles but
otherwise untested state, to illustrate what I'm talking about):

The basic idea is to move the logic into page_cache_get. The rules
would be:
1. if you bring the page count to zero (using put_page_testzero) it
   is your job to call __free_pages_ok eventually. Before doing so
   make sure that the page is no longer on the lru.
2. You may only call page_cache_get to duplicate an existing reference,
   i.e. page_cache_get could be made to BUG_ON(page_count(page) == 0).
3. If you got a pointer to a page without holding a reference (this
   is only allowed to happen if we found the pointer on an lru list),
   call page_cache_get_lru UNDER the lru lock and just leave the page
   alone if that would resurrect the page. page_cache_get_lru would
   basically look like this (implementation details below):

   int page_cache_get_lru (struct page * page) {
        if (!atomic_inc_and_test_for_one (&page->count))
                return 1;
        atomic_dec (&page->count);
        return 0;
   }

Proof of correctness:
A page is called dead if its page count reached zero before (no matter
what the page count currently is). Once a page is dead there can be
at most two pointers to the page: One held by the lru and the other
one held by the thread freeing the page. Any thread accessing the page
via the lru list will first call page_cache_get_lru under the lru lock,
the thread freeing the page will not read the page count anymore.
As page_cache_get_lru will not resurrect the page there will never
be a page count != 0 visible outside the lru lock on a dead page.
This means that each thread trying to access the dead page via the lru
list will detect that the page is dead and leave it alone. It follows
that each page is freed at most once.
Suppose a page could be leaked under these rules. This would require
someone calling __put_page (or atomic_dec (&page->count)) to bring the
page count to zero on a not already dead page. However, the only place
where this happens is in page_cache_get_lru and it only happens if the
page is already dead.

Now let's look at the ugly part: implementation.
The basic problem is that we don't have an atomic_inc_and_test_for_one
function and it is unlikely that we'll get one on all architectures. The
solution (and this is the ugly part) is a change in the semantics of
page->count. The value -1 now means no reference, 0 means one reference etc.

This way we can define
put_page_testzero    as   atomic_add_negative (-1, &page->count);   and
get_page_testzero    as   atomic_inc_and_test (&page->count);

Here's the promised (untested) patch against bk7 to illustrate what
I'm talking about:

diff -ur linux-2.5.31-bk7/include/linux/mm.h linux-2.5.31-cae/include/linux/mm.h
--- linux-2.5.31-bk7/include/linux/mm.h Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/include/linux/mm.h Sun Aug 25 21:40:57 2002
@@ -184,6 +184,11 @@
 /*
  * Methods to modify the page usage count.
  *
+ * NOTE: Real page counts start at -1 for no reference. This is a hack
+ * to be able to implement get_page_testzero with the existing portable
+ * atomic functions. The value exposed via set_page_count and page_count
+ * is (1+page->count).
+ *
  * What counts for a page usage:
  * - cache mapping   (page->mapping)
  * - private data    (page->private)
@@ -192,12 +197,25 @@
  *
  * Also, many kernel routines increase the page count before a critical
  * routine so they can be sure the page doesn't go away from under them.
+ *
+ * A special Problem is the lru lists. Presence on one of these lists
+ * does not increase the page count. The FIRST thread that brings the
+ * page count back to zero is responsible to remove the page from the
+ * lru list and actually free it (__free_pages_ok). This means that we
+ * can only get a reference to a page that is on a lru list, if this
+ * page is not already dead, i.e. about to be removed from the lru list.
+ * To do this we call get_page_testzero which will increment the page
+ * count and return true if we just resurrected the page i.e. the real
+ * page->count is now zero indicating one user. In this case we drop
+ * the reference again using __put_page. Both calls must happen under
+ * the lru lock.
  */
 #define get_page(p)            atomic_inc(&(p)->count)
 #define __put_page(p)          atomic_dec(&(p)->count)
-#define put_page_testzero(p)   atomic_dec_and_test(&(p)->count)
-#define page_count(p)          atomic_read(&(p)->count)
-#define set_page_count(p,v)    atomic_set(&(p)->count, v)
+#define put_page_testzero(p)   atomic_add_negative(-1, &(p)->count)
+#define page_count(p)          (1+atomic_read(&(p)->count))
+#define set_page_count(p,v)    atomic_set(&(p)->count, v-1)
+#define get_page_testzero(p)   atomic_inc_and_test(&(p)->count)
 extern void FASTCALL(__page_cache_release(struct page *));
 #define put_page(p)                                                    \
        do {                                                            \
diff -ur linux-2.5.31-bk7/include/linux/pagemap.h linux-2.5.31-cae/include/linux/pagemap.h
--- linux-2.5.31-bk7/include/linux/pagemap.h    Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/include/linux/pagemap.h    Sun Aug 25 21:56:30 2002
@@ -22,7 +22,43 @@
 #define PAGE_CACHE_MASK                PAGE_MASK
 #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)

-#define page_cache_get(x)      get_page(x)
+/*
+ * Get a reference to the page. This function must not be called on
+ * a dead page, i.e. a page that has page count zero. If the page is
+ * still on a lru_list use page_cache_get_lru instead.
+ */
+static inline void page_cache_get (struct page * page)
+{
+       BUG_ON(page_count(page) == 0);
+       get_page(page);
+}
+
+/*
+ * Try to get a reference to page that we found on an lru list.
+ * The lru lists may contain pages with page count zero. We must
+ * not take a reference to such a page because it is already about
+ * to be freed (once it is of the lru lists). If we'd take a reference
+ * the page would eventually be freed twice.
+ *
+ * The return value is true if we sucessfully incremented the page count.
+ *
+ * This function must be called with the lru lock held.
+ */
+static inline int page_cache_get_lru (struct page * page)
+{
+       /*
+        * Yes there is a window where the page count is not zero
+        * even though the page is dead. This is one of the reasons
+        * why the caller must hold the lru lock. Due to the lru_lock
+        * only the thread that is about to free the page can have
+        * a reference to this page. This thread will not test the
+        * page count anymore.
+        */
+       if (!get_page_testzero (page))
+               return 1;
+       __put_page (page);
+       return 0;
+}

 static inline void page_cache_release(struct page *page)
 {
diff -ur linux-2.5.31-bk7/mm/swap.c linux-2.5.31-cae/mm/swap.c
--- linux-2.5.31-bk7/mm/swap.c  Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/mm/swap.c  Sun Aug 25 11:28:55 2002
@@ -77,7 +77,6 @@
 void __page_cache_release(struct page *page)
 {
        unsigned long flags;
-       BUG_ON(page_count(page) != 0);

        spin_lock_irqsave(&_pagemap_lru_lock, flags);
        if (TestClearPageLRU(page)) {
@@ -86,11 +85,8 @@
                else
                        del_page_from_inactive_list(page);
        }
-       if (page_count(page) != 0)
-               page = NULL;
        spin_unlock_irqrestore(&_pagemap_lru_lock, flags);
-       if (page)
-               __free_pages_ok(page, 0);
+       __free_pages_ok(page, 0);
 }

 /*
@@ -131,8 +127,7 @@
                        else
                                del_page_from_inactive_list(page);
                }
-               if (page_count(page) == 0)
-                       pagevec_add(&pages_to_free, page);
+               pagevec_add(&pages_to_free, page);
        }
        if (lock_held)
...


MM patches against 2.5.31

Post by Daniel Phillips » Tue, 27 Aug 2002 23:30:07



> + * A special Problem is the lru lists. Presence on one of these lists
> + * does not increase the page count.

Please remind me... why should it not?

--
Daniel

MM patches against 2.5.31

Post by Christian Ehrhardt » Wed, 28 Aug 2002 00:40:05




> > + * A special Problem is the lru lists. Presence on one of these lists
> > + * does not increase the page count.

> Please remind me... why should it not?

Pages that are only on the lru but not referenced by anyone are of no
use and we want to free them immediately. If we leave them on the lru
list with a page count of 1, someone else will have to walk the lru
list and remove pages that are only on the lru. One could argue that
try_to_free_pages could do this, but try_to_free_pages will process the
pages in lru order and push out other pages first.
The next suggestion that comes to mind is: let's have some magic in
page_cache_release that will remove the page from the lru list if
it is actually dead. However, this raises the question: how do we detect
that a page is now dead? The answer is something along the lines of

        if (put_page_testzero(page)) {
                __free_pages_ok(page, 0);
                return;
        }
        spin_lock_irq(&pagemap_lru_lock);
        if (PageLRU(page) && (page_count(page) == 1)) {
                lru_cache_del(page);
                spin_unlock_irq(&pagemap_lru_lock);
                page_cache_release(page);
                return;
        }
        spin_unlock_irq(&pagemap_lru_lock);
        return;

The sad truth is that this solution has all the same races that
we have now, plus it makes the fast path (decreasing the page count
to something not zero) slower. One problem in the above would be
that the last reference might well not be due to the lru
cache, i.e. at the time we call PageLRU(page) the page might
have been freed by another processor.

I know the idea is appealing (see one of my earlier mails on the
subject ;-) ) but it doesn't solve the problem.

      regards   Christian Ehrhardt

--
THAT'S ALL FOLKS!

MM patches against 2.5.31

Post by Daniel Phillips » Wed, 28 Aug 2002 03:20:16





> > > + * A special Problem is the lru lists. Presence on one of these lists
> > > + * does not increase the page count.

> > Please remind me... why should it not?

> Pages that are only on the lru but not referenced by anyone are of no
> use and we want to free them immediately. If we leave them on the lru
> list with a page count of 1, someone else will have to walk the lru
> list and remove pages that are only on the lru.

I don't understand this argument.  Suppose lru list membership is worth a
page count of one.  Then anyone who finds a page by way of the lru list can
safely put_page_testzero and remove the page from the lru list.  Anyone who
finds a page by way of a page table can likewise put_page_testzero and clear
the pte, or remove the mapping and pass the page to Andrew's pagevec
machinery, which will eventually do the put_page_testzero.  Anyone who
removes a page from a radix tree will also do a put_page_testzero.  Exactly
one of those paths will result in the page count reaching zero, which tells
us nobody else holds a reference and it's time for __free_pages_ok.  The page
is thus freed immediately as soon as there are no more references to it, and
does not hang around on the lru list.

Nobody has to lock against the page count.  Each put_page_testzero caller
only locks the data structure from which it's removing the reference.

This seems so simple, what is the flaw?

--
Daniel