scalable kmap (was Re: vm lock contention reduction) (fwd)


Post by Andrew Morton » Fri, 12 Jul 2002 12:10:05




> ...
> Andrew and Martin,

>         I ran this updated patch on 2.5.25 with dbench on
> the 8-way with 4 GB of memory, compared to clean 2.5.25.
> I saw a significant improvement in throughput, about 15%
> (averaged over 5 runs each).

Thanks, Hanna.

The kernel compile test isn't a particularly heavy user of
copy_to_user(), whereas with RAM-only dbench, copy_*_user()
is almost the only thing it does.  So that makes sense.

Tried dbench on the 2.5GB 4-way PIII Xeon: no improvement at all.
This thing seems to have quite poor memory bandwidth - maybe
250 megabytes/sec downhill with the wind at its tail.

>         Included is the pretty picture (akpm-2525.png), the
> data that picture came from (akpm-2525.data), and the raw
> results of the runs with min/max and timing results
> (2525akpmkmaphi and 2525clnhi).
>         I believe the drop at around 64 clients is caused by
> memory swapping leading to increased disk accesses, since the
> time increased by 200% in direct correlation with the decreased
> throughput.

Yes.  The test went to disk.   There are two reasons why
it will do this:

1: Some dirty data has been in memory for more than 30-35 seconds, or

2: More than 40% of memory is dirty.

In your case, the 64-client run was taking 32 seconds.  After that
the disks lit up.  Once that happens, dbench isn't a very good
benchmark.  It's an excellent benchmark when it's RAM-only
though.  Very repeatable and hits lots of code paths which matter.

You can run more clients before the disk I/O cuts in by
increasing /proc/sys/vm/dirty_expire_centisecs and
/proc/sys/vm/dirty_*_ratio.
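
For example, a tiny user-space helper along these lines would push the
expiry interval out to a couple of minutes.  This is purely illustrative;
only the knob named above is touched, and the exact knob names can differ
between kernel versions:

/*
 * Hypothetical helper, for illustration only: raise the writeback expiry
 * knob so dbench stays RAM-only for longer.  The dirty_*_ratio knobs can
 * be raised the same way.
 */
#include <stdio.h>

static void set_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (f == NULL) {
		perror(path);
		return;
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	/* ~2 minutes instead of the 30-odd seconds mentioned above */
	set_knob("/proc/sys/vm/dirty_expire_centisecs", "12000");
	return 0;
}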

The patch you tested only uses the atomic kmap across generic_file_read.
It is reasonable to hope that another 15% or more can be gained by holding
an atomic kmap across writes as well.  On your machine ;)

Here's what oprofile says about `dbench 40' with that patch (columns are
symbol address, sample count, percentage of samples, symbol name):

c0140f1c 402      0.609543    __block_commit_write    
c013dfd4 413      0.626222    vfs_write              
c01402cc 431      0.653515    __find_get_block        
c013a895 472      0.715683    .text.lock.highmem      
c017fe30 494      0.749041    ext2_get_block          
c012cef0 564      0.85518     unlock_page            
c013ee80 564      0.85518     fget                    
c01079f4 571      0.865794    apic_timer_interrupt    
c01e8ecc 594      0.900669    radix_tree_lookup      
c013da90 597      0.905218    generic_file_llseek    
c01514b4 607      0.92038     __d_lookup              
c0106ff8 687      1.04168     system_call            
c013a02c 874      1.32523     kunmap_high            
c0148388 922      1.39801     link_path_walk          
c0140b00 1097     1.66336     __block_prepare_write  
c01346d0 1138     1.72552     rmqueue                
c01127ac 1243     1.88473     smp_apic_timer_interrupt
c0139eb8 1514     2.29564     kmap_high              
c0105368 6188     9.38272     poll_idle              
c012d8a8 9564     14.5017     file_read_actor        
c012ea70 21326    32.3361     generic_file_write      

Not taking a kmap in generic_file_write is a biggish patch - it
means changing the prepare_write/commit_write API and visiting
all filesystems.  The API change would be: core kernel no longer
holds a kmap across prepare/commit. If the filesystem wants one
for its own purposes then it gets to do it for itself, possibly in
its prepare_write().
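
To illustrate the shape of that change, here's a rough sketch of a
prepare_write() under the new rules - "examplefs" and
examplefs_get_block() are made up, this is not from the patch:

/*
 * Hypothetical sketch only.  Under the proposed API the core kernel no
 * longer kmaps the page around ->prepare_write()/->commit_write(), so a
 * filesystem which needs a kernel-virtual address takes its own
 * short-lived atomic mapping.
 */
static int examplefs_prepare_write(struct file *file, struct page *page,
				unsigned from, unsigned to)
{
	void *kaddr;
	int ret;

	ret = block_prepare_write(page, from, to, examplefs_get_block);
	if (ret)
		return ret;

	/* fs-private peeking/poking at the page data goes here */
	kaddr = kmap_atomic(page, KM_USER0);
	/* ... */
	flush_dcache_page(page);
	kunmap_atomic(kaddr, KM_USER0);

	return 0;
}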

I think I'd prefer to get some additional testing and understanding
before undertaking that work.  It arguably makes sense as a small
cleanup/speedup anyway, but that's not a burning issue.

hmm.  I'll do just ext2, and we can take another look then.


scalable kmap (was Re: vm lock contention reduction) (fwd)

Post by Andrew Morton » Fri, 12 Jul 2002 14:20:07


Andrew Morton wrote:

> Not taking a kmap in generic_file_write is a biggish patch

OK, so I'm full of it.  It's actually quite simple and clean.

This patch is incremental to the one which you just tested.  It
takes an atomic kmap across generic_file_write().
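
For reference, here is a minimal sketch of the copy_user_state helpers
that the mm/filemap.c hunk below relies on.  This is an assumption
inferred from the call sites - the real helpers came with the previous
patch and presumably also deal with the user copy faulting while the
atomic kmap is held, which is glossed over here:

/*
 * Hypothetical sketch only, inferred from the call sites below.  The
 * fourth kmap_copy_user() argument is passed as 0 in the patch and its
 * meaning isn't visible in this excerpt, so it is ignored here.
 */
struct copy_user_state {
	void *kaddr;		/* where the page is currently mapped */
	enum km_type type;	/* which atomic-kmap slot was used */
};

static inline void *kmap_copy_user(struct copy_user_state *state,
				struct page *page, enum km_type type,
				int unused)
{
	state->type = type;
	state->kaddr = kmap_atomic(page, type);
	return state->kaddr;
}

static inline void kunmap_copy_user(struct copy_user_state *state)
{
	kunmap_atomic(state->kaddr, state->type);
}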

There's a copy of these patches at
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.25/

The patch also teaches ext2 and ext3 about the new API. If you're
using any other filesystems they will probably oops.

ext2 is still using regular kmap() for directory operations.

And this time it actually buys a 3% improvement on my lumbering hulk,
so your machine will hopefully see a nice improvement.

Profile looks better too.  kmap_high and the IPI have vanished.

c013c164 304      0.50923     __set_page_dirty_nobuffers
c0141008 355      0.59466     __block_commit_write    
c013e004 382      0.639887    vfs_write              
c01402fc 410      0.68679     __find_get_block        
c017fd70 471      0.788971    ext2_get_block          
c01e8eec 530      0.887802    radix_tree_lookup      
c012cef0 542      0.907903    unlock_page            
c013eeb0 562      0.941405    fget                    
c013dac0 585      0.979932    generic_file_llseek    
c01514e4 600      1.00506     __d_lookup              
c0106ff8 635      1.06369     system_call            
c01483b8 872      1.46069     link_path_walk          
c0134700 1014     1.69855     rmqueue                
c0140b30 1191     1.99504     __block_prepare_write  
c0105368 4029     6.74897     poll_idle              
c012d8a8 9520     15.9469     file_read_actor        
c012ea70 23301    39.0315     generic_file_write      

 fs/buffer.c     |   57 +++++++++++++++++++++++++++++++-------------------------
 fs/ext2/inode.c |   16 +++++++++++++--
 fs/ext3/inode.c |   30 +++++------------------------
 mm/filemap.c    |    6 ++---
 4 files changed, 55 insertions(+), 54 deletions(-)

--- 2.5.25/mm/filemap.c~kmap_atomic_writes      Wed Jul 10 21:18:19 2002
+++ 2.5.25-akpm/mm/filemap.c    Wed Jul 10 21:24:41 2002
@@ -2228,6 +2228,7 @@ generic_file_write(struct file *file, co
                unsigned long offset;
                long page_fault;
                char *kaddr;
+               struct copy_user_state copy_user_state;

                offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
                index = pos >> PAGE_CACHE_SHIFT;
@@ -2252,22 +2253,22 @@ generic_file_write(struct file *file, co
                        break;
                }

-               kaddr = kmap(page);
                status = a_ops->prepare_write(file, page, offset, offset+bytes);
                if (unlikely(status)) {
                        /*
                         * prepare_write() may have instantiated a few blocks
                         * outside i_size.  Trim these off again.
                         */
-                       kunmap(page);
                        unlock_page(page);
                        page_cache_release(page);
                        if (pos + bytes > inode->i_size)
                                vmtruncate(inode, inode->i_size);
                        break;
                }
+               kaddr = kmap_copy_user(&copy_user_state, page, KM_FILEMAP, 0);
                page_fault = __copy_from_user(kaddr + offset, buf, bytes);
                flush_dcache_page(page);
+               kunmap_copy_user(&copy_user_state);
                status = a_ops->commit_write(file, page, offset, offset+bytes);
                if (unlikely(page_fault)) {
                        status = -EFAULT;
@@ -2282,7 +2283,6 @@ generic_file_write(struct file *file, co
                                buf += status;
                        }
                }
-               kunmap(page);
                if (!PageReferenced(page))
                        SetPageReferenced(page);
                unlock_page(page);
--- 2.5.25/fs/buffer.c~kmap_atomic_writes       Wed Jul 10 21:25:11 2002
+++ 2.5.25-akpm/fs/buffer.c     Wed Jul 10 21:33:06 2002
@@ -1804,7 +1804,6 @@ static int __block_prepare_write(struct
        int err = 0;
        unsigned blocksize, bbits;
        struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
-       char *kaddr = kmap(page);

        BUG_ON(!PageLocked(page));
        BUG_ON(from > PAGE_CACHE_SIZE);
@@ -1845,13 +1844,19 @@ static int __block_prepare_write(struct
                                        set_buffer_uptodate(bh);
                                        continue;
                                }
-                               if (block_end > to)
-                                       memset(kaddr+to, 0, block_end-to);
-                               if (block_start < from)
-                                       memset(kaddr+block_start,
-                                               0, from-block_start);
-                               if (block_end > to || block_start < from)
+                               if (block_end > to || block_start < from) {
+                                       void *kaddr;
+
+                                       kaddr = kmap_atomic(page, KM_USER0);
+                                       if (block_end > to)
+                                               memset(kaddr+to, 0,
+                                                       block_end-to);
+                                       if (block_start < from)
+                                               memset(kaddr+block_start,
+                                                       0, from-block_start);
                                        flush_dcache_page(page);
+                                       kunmap_atomic(kaddr, KM_USER0);
+                               }
                                continue;
                        }
                }
@@ -1890,10 +1895,14 @@ out:
                if (block_start >= to)
                        break;
                if (buffer_new(bh)) {
+                       void *kaddr;
+
                        clear_buffer_new(bh);
                        if (buffer_uptodate(bh))
                                buffer_error();
+                       kaddr = kmap_atomic(page, KM_USER0);
                        memset(kaddr+block_start, 0, bh->b_size);
+                       kunmap_atomic(kaddr, KM_USER0);
                        set_buffer_uptodate(bh);
                        mark_buffer_dirty(bh);
                }
@@ -1979,9 +1988,10 @@ int block_read_full_page(struct page *pa
                                        SetPageError(page);
                        }
                        if (!buffer_mapped(bh)) {
-                               memset(kmap(page) + i*blocksize, 0, blocksize);
+                               void *kaddr = kmap_atomic(page, KM_USER0);
+                               memset(kaddr + i * blocksize, 0, blocksize);
                                flush_dcache_page(page);
-                               kunmap(page);
+                               kunmap_atomic(kaddr, KM_USER0);
                                set_buffer_uptodate(bh);
                                continue;
                        }
@@ -2089,7 +2099,7 @@ int cont_prepare_write(struct page *page
        long status;
        unsigned zerofrom;
        unsigned blocksize = 1 << inode->i_blkbits;
-       char *kaddr;
+       void *kaddr;

        while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) {
                status = -ENOMEM;
@@ -2111,12 +2121,12 @@ int cont_prepare_write(struct page *page
                                                PAGE_CACHE_SIZE, get_block);
                if (status)
                        goto out_unmap;
-               kaddr = page_address(new_page);
+               kaddr = kmap_atomic(new_page, KM_USER0);
                memset(kaddr+zerofrom, 0, PAGE_CACHE_SIZE-zerofrom);
                flush_dcache_page(new_page);
+               kunmap_atomic(kaddr, KM_USER0);
                __block_commit_write(inode, new_page,
                                zerofrom, PAGE_CACHE_SIZE);
-               kunmap(new_page);
                unlock_page(new_page);
                page_cache_release(new_page);
        }
@@ -2141,21 +2151,20 @@ int cont_prepare_write(struct page *page
        status = __block_prepare_write(inode, page, zerofrom, to, get_block);
        if (status)
                goto out1;
-       kaddr = page_address(page);
        if (zerofrom < offset) {
+               kaddr = kmap_atomic(page, KM_USER0);
                memset(kaddr+zerofrom, 0, offset-zerofrom);
                flush_dcache_page(page);
+               kunmap_atomic(kaddr, KM_USER0);
                __block_commit_write(inode, page, zerofrom, offset);
        }
        return 0;
 out1:
        ClearPageUptodate(page);
-       kunmap(page);
        return status;

 out_unmap:
        ClearPageUptodate(new_page);
-       kunmap(new_page);
        unlock_page(new_page);
        page_cache_release(new_page);
 out:
@@ -2167,10 +2176,8 @@ int block_prepare_write(struct page *pag
 {
        struct inode *inode = page->mapping->host;
        int err = __block_prepare_write(inode, page, from, to, get_block);
-       if (err) {
+       if (err)
                ClearPageUptodate(page);
-               kunmap(page);
-       }
        return err;
 }

@@ -2178,7 +2185,6 @@ int block_commit_write(struct page *page
 {
        struct inode *inode = page->mapping->host;
        __block_commit_write(inode,page,from,to);
-       kunmap(page);
        return 0;
 }

@@ -2188,7 +2194,6 @@ int generic_commit_write(struct file *fi
        struct inode *inode = page->mapping->host;
        loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
        __block_commit_write(inode,page,from,to);
-       kunmap(page);
        if (pos > inode->i_size) {
                inode->i_size = pos;
                mark_inode_dirty(inode);
@@ -2205,6 +2210,7 @@ int block_truncate_page(struct address_s
        struct inode *inode = mapping->host;
        struct page *page;
        struct buffer_head *bh;
+       void *kaddr;
        int err;

        blocksize = 1 << inode->i_blkbits;
@@ -2257,9 +2263,10 @@ int block_truncate_page(struct address_s
                        goto unlock;
        }

-       memset(kmap(page) + offset, 0, length);
+       kaddr = kmap_atomic(page, KM_USER0);
+       memset(kaddr + offset, 0, length);
        flush_dcache_page(page);
-       kunmap(page);
+       kunmap_atomic(kaddr, KM_USER0);

        mark_buffer_dirty(bh);
        err = 0;
@@ -2279,7 +2286,7 @@ int block_write_full_page(struct page *p
        struct inode * const inode = page->mapping->host;
        const unsigned long end_index = inode->i_size >> PAGE_CACHE_SHIFT;
        unsigned offset;
-       char *kaddr;
+       void *kaddr;

        /* Is the page fully inside i_size? */
        if (page->index < end_index)
@@ -2293,10 +2300,10 @@ int block_write_full_page(struct page *p
        }

        /* The page straddles i_size */
-       kaddr = kmap(page);
+       kaddr = kmap_atomic(page, KM_USER0);
        memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
        flush_dcache_page(page);
-       kunmap(page);
+       kunmap_atomic(kaddr, KM_USER0);
        return __block_write_full_page(inode, page, get_block);
 }

--- 2.5.25/fs/ext3/inode.c~kmap_atomic_writes   Wed Jul 10 21:34:16 2002
+++ 2.5.25-akpm/fs/ext3/inode.c Wed Jul 10 21:35:08 2002
@@ -1048,16 +1048,6 @@ static int ext3_prepare_write(struct fil
        if (ext3_should_journal_data(inode)) {
                ret = walk_page_buffers(handle, page_buffers(page),
                                from, to, NULL, do_journal_get_write_access);
-               if (ret) {
-                       /*
-                        * We're going to fail this prepare_write(),
-                        * so commit_write() will not be called.
-                        * We need to undo block_prepare_write()'s kmap().
-                        * AKPM: Do we need to clear PageUptodate?  I don't
-                        * think so.
-                        */
-                       kunmap(page);
-               }
        }
 prepare_write_failed:
        if (ret)
@@ -1117,7 +1107,6 @@ static int ext3_commit_write(struct file
                        from, to, &partial, commit_write_fn);
                if (!partial)
                        SetPageUptodate(page);
-               kunmap(page);
                if (pos > inode->i_size)
                        inode->i_size = pos;
                EXT3_I(inode)->i_state |= EXT3_STATE_JDATA;
@@ -1128,17 +1117,8 @@ static int ext3_commit_write(struct file
                }
                /* Be careful here if generic_commit_write becomes a
                 * required invocation after block_prepare_write. */
-               if (ret == 0) {
+               if (ret == 0)
                        ret = generic_commit_write(file, page, from, to);
-               } else {
-                       /*
-                        * block_prepare_write() was called, but we're not
-                        * going to call generic_commit_write().  So we
-                        * need to perform generic_commit_write()'s kunmap
-                        * by hand.
-                        */
-                       kunmap(page);
-               }
        }
        if (inode->i_size >
...


Enhanced profiling support (was Re: vm lock contention reduction)

[Excuse the quoting, I was out of the loop on this ...]

It makes me very unhappy too. There are a number of horrible things
there, mostly for the sake of convenience and performance.
sys_call_table is just the most obviously foul thing.  I'm glad to hear
there is interest in getting some kernel support for such things to be
done tastefully.

How do you see such dentry names being exported to user-space for the
profiling daemon to access?  The current oprofile scheme is, um, less
than ideal ...

I'm interested in doing so but I'd like to hear some more on how people
perceive this working. It essentially means a fork for a lot of the
kernel-side code, so it'd mean a lot more work for us (at least until I
can drop the 2.2/2.4 versions).

regards
john

--
"If a thing is not diminished by being shared, it is not rightly owned if
 it is only owned & not shared."
        - St. Augustine