fix obj vma sorting

fix obj vma sorting

Post by Hugh Dickins » Wed, 09 Apr 2003 20:20:09



Fix several points in objrmap's vma sorting:

1. It was adding all vmas, even private ones, to i_mmap_shared.
2. It was not quite sorting: list_add_tail is needed in all cases.
3. If vm_pgoff is changed on a file vma (as in vma_merge and split_vma)
   we must unlink vma from list and relink while holding i_shared_sem:
   move_vma_start to do this (holds page_table_lock too, as vma_merge
   did and split_vma did not: I think nothing needs that, rip it out
   if you like, but my guess was that you'd prefer the extra safety).

Sorry, no, this doesn't magically make it all a hundred times faster.
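
To spell out point 2: list_add_tail(new, pos) links new immediately in front of pos, so the one call covers both cases: inserting before the first vma with a greater-or-equal vm_pgoff, and appending at the tail when the walk runs off the end and pos is back at the list head.  A minimal sketch of the insert logic the hunk below converges on (an illustration reusing the shared/vm_pgoff fields, not a function from the patch itself):

#include <linux/list.h>
#include <linux/mm.h>

/* Keep the file's vma list sorted by vm_pgoff. */
static void vma_sorted_insert(struct list_head *head, struct vm_area_struct *vma)
{
        struct list_head *pos;

        list_for_each(pos, head) {
                struct vm_area_struct *tmp;

                tmp = list_entry(pos, struct vm_area_struct, shared);
                if (tmp->vm_pgoff >= vma->vm_pgoff)
                        break;
        }
        /*
         * list_add_tail() links vma just before pos: that is before the
         * first entry with a greater-or-equal vm_pgoff, or at the very
         * end of the list when the loop completed and pos == head.
         */
        list_add_tail(&vma->shared, pos);
}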

--- 2.5.67-mm1/mm/mmap.c        Tue Apr  8 14:02:06 2003

                else
                        vmhead = &mapping->i_mmap;

-               list_for_each(vmlist, &mapping->i_mmap_shared) {
+               list_for_each(vmlist, vmhead) {
                        struct vm_area_struct *vmtemp;
                        vmtemp = list_entry(vmlist, struct vm_area_struct, shared);
                        if (vmtemp->vm_pgoff >= vma->vm_pgoff)
                                break;
                }
-               if (vmlist == vmhead)
-                       list_add_tail(&vma->shared, vmlist);
-               else
-                       list_add(&vma->shared, vmlist);
+               list_add_tail(&vma->shared, vmlist);
        }
 }

        validate_mm(mm);
 }

+static void move_vma_start(struct vm_area_struct *vma, unsigned long addr)
+{
+       spinlock_t *lock = &vma->vm_mm->page_table_lock;
+       struct inode *inode = NULL;
+      
+       if (vma->vm_file) {
+               inode = vma->vm_file->f_dentry->d_inode;
+               down(&inode->i_mapping->i_shared_sem);
+       }
+       spin_lock(lock);
+       if (inode)
+               __remove_shared_vm_struct(vma, inode);
+       /* If no vm_file, perhaps we should always keep vm_pgoff at 0?? */
+       vma->vm_pgoff += (long)(addr - vma->vm_start) >> PAGE_SHIFT;
+       vma->vm_start = addr;
+       if (inode) {
+               __vma_link_file(vma);
+               up(&inode->i_mapping->i_shared_sem);
+       }
+       spin_unlock(lock);
+}
+
 /*
  * Return true if we can merge this (vm_flags,file,vm_pgoff,size)

                        unsigned long end, unsigned long vm_flags,
                        struct file *file, unsigned long pgoff)
 {
-       spinlock_t * lock = &mm->page_table_lock;
-
        if (!prev) {
                prev = rb_entry(rb_parent, struct vm_area_struct, vm_rb);

        if (prev->vm_end == addr &&
                        can_vma_merge_after(prev, vm_flags, file, pgoff)) {
                struct vm_area_struct *next;
+               spinlock_t *lock = &mm->page_table_lock;
                struct inode *inode = file ? file->f_dentry->d_inode : NULL;
                int need_up = 0;

                                pgoff, (end - addr) >> PAGE_SHIFT))
                        return 0;
                if (end == prev->vm_start) {
-                       spin_lock(lock);
-                       prev->vm_start = addr;
-                       prev->vm_pgoff -= (end - addr) >> PAGE_SHIFT;
-                       spin_unlock(lock);
+                       move_vma_start(prev, addr);
                        return 1;
                }

        if (new_below) {
                new->vm_end = addr;
-               vma->vm_start = addr;
-               vma->vm_pgoff += ((addr - new->vm_start) >> PAGE_SHIFT);
+               move_vma_start(vma, addr);
        } else {
                vma->vm_end = addr;
                new->vm_start = addr;

fix obj vma sorting

Post by Martin J. Bligh » Thu, 10 Apr 2003 19:10:13


Hmmm. Something somewhere went wrong. Some semaphore blew up
somewhere ... I'm not convinced that this is your patch
causing the problem; I just thought I'd mention it since vma_link seems
to have gone up rather in the profile. I'm playing with getting
some better data on what actually happened, but in case someone
is feeling psychic.

The main thing I changed here (66-mjb2 -> 67-mjb0.2) was to pick up
Andrew's rmap speedups, and drop the objrmap code I had for the stuff
he had. *However*, what he had worked fine. I also picked up your
sorting patch here Hugh ... this bit worries me:

+static void move_vma_start(struct vm_area_struct *vma, unsigned long addr)
+{
+       spinlock_t *lock = &vma->vm_mm->page_table_lock;
+       struct inode *inode = NULL;
+      
+       if (vma->vm_file) {
+               inode = vma->vm_file->f_dentry->d_inode;
+               down(&inode->i_mapping->i_shared_sem);
+       }
+       spin_lock(lock);
+       if (inode)
+               __remove_shared_vm_struct(vma, inode);
+       /* If no vm_file, perhaps we should always keep vm_pgoff at 0?? */
+       vma->vm_pgoff += (long)(addr - vma->vm_start) >> PAGE_SHIFT;
+       vma->vm_start = addr;
+       if (inode) {
+               __vma_link_file(vma);
+               up(&inode->i_mapping->i_shared_sem);
+       }
+       spin_unlock(lock);
+}

M.

DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only, and the run results
are non-compliant and not-comparable with any published results.

Results are shown as percentages of the first set displayed

SDET 128  (see disclaimer)
                           Throughput    Std. Dev
                   2.5.66       100.0%         0.2%
                   2.5.67        97.7%         5.1%
               2.5.66-mm2       176.1%         0.6%
               2.5.67-mm1       176.7%         0.2%
              2.5.66-mjb2       181.8%         0.0%
            2.5.67-mjb0.2       141.1%         0.1%

diffprofile {2.5.66-mjb2,2.5.67-mjb0.2}/sdetbench/128/profile
(these are at 100 Hz).

     12913    38.8% default_idle
     12472    20.2% total
      3085   912.7% __down
      1026   385.7% schedule
       946   666.2% __wake_up
       904     0.0% __d_lookup
       626     0.0% move_vma_start
       452  6457.1% __vma_link
       159    40.9% remove_shared_vm_struct
        84    36.4% do_no_page
        69    22.5% copy_mm
        65   125.0% vma_link
        37   528.6% default_wake_function
        31     9.0% do_wp_page
        29   290.0% rb_insert_color
        19    95.0% try_to_wake_up
        18   450.0% __vma_link_rb
        17     6.0% clear_page_tables
        15    20.3% handle_mm_fault
        14   140.0% find_vma_prepare
        14   700.0% __rb_rotate_left
        13    46.4% exit_mmap
        11   110.0% kunmap_atomic
        10    24.4% do_mmap_pgoff
...
      -102   -58.6% __read_lock_failed
      -124   -43.5% path_release
      -126   -17.7% __copy_to_user_ll
      -168   -20.1% release_pages
      -189   -21.1% page_add_rmap
      -223   -33.2% path_lookup
      -241   -15.3% zap_pte_range
      -247   -18.3% page_remove_rmap
      -310   -46.5% follow_mount
      -405   -70.7% .text.lock.dcache
      -425   -76.6% .text.lock.namei
      -551   -49.5% atomic_dec_and_lock
      -628   -71.2% .text.lock.dec_and_lock
     -1148   -98.5% d_lookup

diffprofile {2.5.67-mm1,2.5.67-mjb0.2}/sdetbench/128/profile

    110028    31.3% default_idle
     92085    14.2% total
     31265  1054.5% __down
     10473   428.0% schedule
      9351   611.6% __wake_up
      6260     0.0% move_vma_start
      4200  1076.9% __vma_link
      1328    32.0% remove_shared_vm_struct
       831  1695.9% find_trylock_page
       567    17.8% copy_mm
       428    57.7% vma_link
       380   633.3% default_wake_function
       294   306.2% rb_insert_color
       182    87.5% try_to_wake_up
       177   411.6% __vma_link_rb
       158    62.7% exit_mmap
       150     0.0% rcu_do_batch
       135   540.0% __rb_rotate_left
...
      -196   -31.3% block_invalidatepage
      -202   -39.5% ext2_new_block
      -204   -54.5% .text.lock.inode
      -208   -33.1% task_mem
      -213    -6.6% clear_page_tables
      -213   -55.6% d_lookup
      -218   -44.7% select_parent
      -228   -21.2% kmap_high
      -235   -46.5% read_block_bitmap
      -241   -47.2% d_path
      -244   -57.5% complete
      -261  -100.0% group_release_blocks
      -263   -31.6% proc_root_link
      -264   -17.9% number
      -295   -38.6% strnlen_user
      -296   -26.8% task_vsize
      -320   -49.2% generic_file_aio_write_nolock
      -331   -75.1% call_rcu
      -334   -23.3% __fput
      -336   -56.4% may_open
      -339   -35.3% dput
      -343   -51.0% __find_get_block_slow
      -348   -40.6% d_instantiate
      -354   -65.1% __alloc_pages
      -371   -41.6% prune_dcache
      -377  -100.0% group_reserve_blocks
      -380   -22.6% release_task
      -398   -55.4% generic_fillattr
      -398   -31.4% exit_notify
      -420   -51.9% unmap_vmas
      -424   -35.5% file_kill
      -427   -72.7% read_inode_bitmap
      -435   -41.2% proc_check_root
      -459   -16.3% free_pages_and_swap_cache
      -480   -14.3% do_anonymous_page
      -517   -43.9% ext2_new_inode
      -519   -72.2% ext2_get_group_desc
      -527   -35.9% fd_install
      -537   -28.8% d_alloc
      -559   -30.4% __find_get_block
      -574   -49.7% __mark_inode_dirty
      -575   -38.0% .text.lock.highmem
      -580   -44.6% .text.lock.attr
      -598   -24.6% file_move
      -603   -23.4% copy_process
      -628   -27.4% filemap_nopage
      -633   -42.1% __set_page_dirty_buffers
      -634   -24.0% proc_pid_stat
      -636   -42.2% .text.lock.base
      -705   -28.5% link_path_walk
      -716   -60.9% flush_signal_handlers
      -758   -11.5% __copy_to_user_ll
      -780   -52.0% .text.lock.file_table
      -781   -36.3% free_hot_cold_page
      -834   -91.2% update_atime
      -906   -42.2% buffered_rmqueue
      -916   -56.0% __read_lock_failed
      -920   -34.3% kmem_cache_free
      -993   -38.1% path_release
     -1002   -71.0% __brelse
     -1106   -13.6% page_add_rmap
     -1256   -39.7% pte_alloc_one
     -1303   -29.3% do_no_page
     -1365   -83.5% grab_block
     -1522   -80.4% current_kernel_time
     -1819   -21.4% release_pages
     -1902   -14.2% copy_page_range
     -2149   -32.4% path_lookup
     -2150   -62.3% .text.lock.namei
     -2464   -21.4% __d_lookup
     -2486   -26.1% find_get_page
     -2499   -41.2% follow_mount
     -3174   -22.3% page_remove_rmap
     -3217   -65.7% .text.lock.dcache
     -4119   -42.3% atomic_dec_and_lock
     -4359   -24.6% zap_pte_range
     -4551  -100.0% .text.lock.filemap
     -4665   -64.7% .text.lock.dec_and_lock

fix obj vma sorting

Post by Hugh Dickins » Thu, 10 Apr 2003 20:30:24



> Hmmm. Something somewhere went wrong. Some semaphore blew up
> somewhere ... I'm not convinced that this is your patch
> causing the problem; I just thought I'd mention it since vma_link seems
> to have gone up rather in the profile. I'm playing with getting
> some better data on what actually happened, but in case someone
> is feeling psychic.

> The main thing I changed here (66-mjb2 -> 67-mjb0.2) was to pick up
> Andrew's rmap speedups, and drop the objrmap code I had for the stuff

I haven't examined it, but I'm guessing 66-mjb2 did not have Dave's
vma sorting in at all?  Its linear search would certainly raise the
time spent in __vma_link (notable in your diffprofile), which would
increase the pressure on i_shared_sem.

(Whether it's a worthwhile optimization remains to be seen: like
rmap generally, it speeds up page_referenced and try_to_unmap at
the expense of the fast path.  One improvement would be for fork
to just slot dst vma in next to src vma instead of linear search.)

I don't think my fix to the sort order could have slowed it down
further (though once there are stray entries out of order, it may
be hard to predict how things will work out).  But without it
page_referenced and try_to_unmap sometimes couldn't quite find
all the mappings they were looking for.

> he had. *However*, what he had worked fine. I also picked up your
> sorting patch here Hugh ... this bit worries me:

> +static void move_vma_start(struct vm_area_struct *vma, unsigned long addr)

It does use i_shared_sem where it wasn't used before, yes, but it's
only called by one case of vma_merge and one case of split_vma:
unless your tests are doing a lot of vma splitting (e.g. mprotecting
ranges which break up vmas), I wouldn't expect it to figure highly.
I can see it's there in the plus part of your diffprofile, but I'm
too inexperienced at reading these things, without the original
profiles, to tell whether it's being used a surprising amount.

When you say "*However*, what he had worked fine", are you saying
you profiled before adding in my patch on top?  The diffprofile of
the before and after my patch should in that case illuminate.

Hugh


fix obj vma sorting

Post by Hugh Dickins » Thu, 10 Apr 2003 21:20:20


On Wed, 9 Apr 2003, Martin J. Bligh wrote:
> >> Hmmm. Something somewhere went wrong. Some semaphore blew up
> >> somewhere ... I'm not convinced that this is your patch
> >> causing the problem; I just thought I'd mention it since vma_link seems
> >> to have gone up rather in the profile. I'm playing with getting
> >> some better data on what actually happened, but in case someone
> >> is feeling psychic.

> >> The main thing I changed here (66-mjb2 -> 67-mjb0.2) was to pick up
> >> Andrew's rmap speedups, and drop the objrmap code I had for the stuff

> > I haven't examined it, but I'm guessing 66-mjb2 did not have Dave's
> > vma sorting in at all?  Its linear search would certainly raise the
> > time spent in __vma_link (notable in your diffprofile), which would
> > increase the pressure on i_shared_sem.

> No it didn't ... but I think 67-mm1 did.

> > (Whether it's a worthwhile optimization remains to be seen: like
> > rmap generally, it speeds up page_referenced and try_to_unmap at
> > the expense of the fast path.  One improvement would be for fork
> > to just slot dst vma in next to src vma instead of linear search.)

Ignore that last parenthetical sentence: I just took a look at copy_mm,
noticing it up in your diffprofile, and it does already slot new vma
in next to old vma without linear search.

> > I don't think my fix to the sort order could have slowed it down
> > further (though once there are stray entries out of order, it may
> > be hard to predict how things will work out).  But without it
> > page_referenced and try_to_unmap sometimes couldn't quite find
> > all the mappings they were looking for.

> It is that fix ... I just backed that one patch off and recompared:

Thanks.  Yes, seems conclusive, but I'm puzzled.
I hope a fresh pair of eyes can work it out for us.


> DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
> trademarks of the Standard Performance Evaluation Corporation. This
> benchmarking was performed for research purposes only, and the run results
> are non-compliant and not-comparable with any published results.

> Results are shown as percentages of the first set displayed

> SDET 32  (see disclaimer)
>                            Throughput    Std. Dev
>                    2.5.67       100.0%         0.3%
>             2.5.67-mjb0.2       151.7%         0.5%
>      2.5.67-mjb0.2-nosort       207.1%         0.0%

> SDET 64  (see disclaimer)
>                            Throughput    Std. Dev
>                    2.5.67       100.0%         0.4%
>             2.5.67-mjb0.2       147.0%         0.5%
>      2.5.67-mjb0.2-nosort       201.5%         0.2%

> SDET 128  (see disclaimer)
>                            Throughput    Std. Dev
>                    2.5.67       100.0%         5.1%
>             2.5.67-mjb0.2       144.5%         0.1%
>      2.5.67-mjb0.2-nosort       188.6%         0.3%

> I think it's that sem, which seems to be heavily contended.
> Quite possibly for glibc's address_space or something.
> (even though it says "-nosort", it's just your sort fix I
> backed out ... otherwise it's what was in -mm).

Certainly your idea of glibc's address_space is plausible: I can
well imagine (sorry, can't try right now) that it patches the mmap
of some jump tables, doing mprotect and split and merge.  But
split_vma and vma_merge didn't show all that high before.  Of
course, the inline __vma_link_file in move_vma_start will push
it quite high, but I still don't see why __down soars that high.


> >> he had. *However*, what he had worked fine. I also picked up your
> >> sorting patch here Hugh ... this bit worries me:

> >> +static void move_vma_start(struct vm_area_struct *vma, unsigned long addr)

> > It does use i_shared_sem where it wasn't used before, yes, but it's
> > only called by one case of vma_merge and one case of split_vma:
> > unless your tests are doing a lot of vma splitting (e.g. mprotecting
> > ranges which break up vmas), I wouldn't expect it to figure highly.
> > I can see it's there in the plus part of your diffprofile, but I'm
> > too inexperienced at reading these things, without the original
> > profiles, to tell whether it's being used a surprising amount.

> Here's the diffprofile for just your patch ... where it's positive,
> that's the increase in the number of ticks by applying your patch.
> Where it's negative, that's the decrease. The %age is the change from
> the first to the second profile:

> larry:/var/bench/results# diffprofile 2.5.67-mjb0.2{-nosort,}/sdetbench/64/profile
>       7148    24.9% total
>       6482    37.7% default_idle
>       1466   842.5% __down
>        442   566.7% __wake_up
>        435   378.3% schedule
>        251     0.0% move_vma_start
>        149   876.5% __vma_link
>         72    40.2% remove_shared_vm_struct
>         46    35.1% copy_mm
>         20    60.6% vma_link
>         12   300.0% default_wake_function
>         11   137.5% rb_insert_color
> ...
>        -20   -37.0% number
>        -20   -12.6% do_anonymous_page
>        -21   -36.8% fd_install
>        -23   -27.7% __find_get_block
>        -24   -55.8% flush_signal_handlers
>        -27   -45.0% __set_page_dirty_buffers
>        -28   -26.7% kmem_cache_free
>        -28    -7.5% find_get_page
>        -29   -34.1% buffered_rmqueue
>        -32   -34.8% path_release
>        -33   -32.0% file_move
>        -35   -60.3% __read_lock_failed
>        -35   -43.8% .text.lock.highmem
>        -37   -59.7% .text.lock.namei
>        -37   -29.1% pte_alloc_one
>        -40   -10.3% page_add_rmap
>        -41   -41.4% free_hot_cold_page
>        -44   -60.3% .text.lock.file_table
>        -54   -18.4% __copy_to_user_ll
>        -58   -43.0% follow_mount
>        -62   -29.0% path_lookup
>        -85   -20.9% __d_lookup
>        -86   -20.4% release_pages
>        -99   -68.8% .text.lock.dcache
>       -100   -15.4% page_remove_rmap
>       -106   -36.6% atomic_dec_and_lock
>       -126   -16.8% zap_pte_range
>       -141   -66.8% .text.lock.dec_and_lock

> Note the massive increase in down() (and some of the vma ops).
> The things that are cheaper are probably just because of less
> contention, I guess.

> > When you say "*However*, what he had worked fine", are you saying
> > you profiled before adding in my patch on top?  The diffprofile of
> > the before and after my patch should in that case illuminate.

> Well, I hadn't ... but I should have done, and I have now ;-)

> I'll attach the two raw profiles for you as well. profile.with
> is with your patch, profile.without is without ... I was looking
> at SDET 64, since it showed the most dramatic difference.

Thanks for all the info, I'm sorry, I must rush away now.
I'll try another think later, but hope someone can do better.

Hugh


fix obj vma sorting

Post by William Lee Irwin III » Thu, 10 Apr 2003 22:20:18



> Thanks.  Yes, seems conclusive, but I'm puzzled.
> I hope a fresh pair of eyes can work it out for us.

They're pounding ->i_shared_sem, which you already knew.

Here's what I see as far as number of processes mapping what. It seems
to indicate large scale sharing occurs for a number of objects, which
could very well lead to mutual interference for several objects.

It seems to indicate that more than glibc is involved, and that there's some
shm with large vma counts on "normal" systems as well.

-- wli

how many processes were mapping a given file
        (i.e. remove dups in /proc/$PID/maps)
---------------------------------------------
/lib/libc-2.2.5.so                                          151
/lib/ld-2.2.5.so                                            151
/lib/libnsl-2.2.5.so                                        110
/lib/libnss_compat-2.2.5.so                                 107
/lib/libdl-2.2.5.so                                         85
/lib/libm-2.2.5.so                                          70
/lib/libncurses.so.5.2                                      65
/usr/X11R6/lib/libX11.so.6.2                                44
/usr/X11R6/lib/libSM.so.6.0                                 43
/usr/X11R6/lib/libICE.so.6.3                                43
/lib/libcap.so.1.10                                         39
/usr/lib/zsh/4.0.4/zsh/zle.so                               35
/usr/lib/zsh/4.0.4/zsh/rlimits.so                           35
/usr/lib/zsh/4.0.4/zsh/complete.so                          35
/usr/lib/zsh/4.0.4/zsh/compctl.so                           35
/usr/X11R6/lib/libXpm.so.4.11                               35
/bin/zsh4                                                   35
/lib/libnss_files-2.2.5.so                                  32
/usr/lib/libz.so.1.1.4                                      31
/usr/X11R6/lib/libXext.so.6.4                               24
/lib/libcrypt-2.2.5.so                                      22
/lib/libresolv-2.2.5.so                                     21
/lib/libnss_dns-2.2.5.so                                    21

How many vma's total mapped a given file:
-----------------------------------------
/lib/libc-2.2.5.so            302
/lib/ld-2.2.5.so              302
/lib/libnsl-2.2.5.so          220
/SYSV00000000                 220
/lib/libnss_compat-2.2.5.so   214
/lib/libdl-2.2.5.so           170
/lib/libm-2.2.5.so            140
/lib/libncurses.so.5.2        130
/usr/X11R6/lib/libX11.so.6.2  88
/usr/X11R6/lib/libSM.so.6.0   86
/usr/X11R6/lib/libICE.so.6.3  86
/lib/libcap.so.1.10           78
/usr/lib/zsh/4.0.4/zsh/zle.so 70
/usr/X11R6/lib/libXpm.so.4.11 70
/bin/zsh4                     70
/lib/libnss_files-2.2.5.so    64
/usr/lib/libz.so.1.1.4        62
/usr/X11R6/lib/libXext.so.6.4 48
/lib/libcrypt-2.2.5.so        44
/lib/libresolv-2.2.5.so       42
/lib/libnss_dns-2.2.5.so      42
/usr/X11R6/lib/libXt.so.6.0   40
/usr/X11R6/bin/wterm          40

fix obj vma sorting

Post by Hugh Dickins » Fri, 11 Apr 2003 16:00:18




> > Here's the diffprofile for just your patch ... where it's positive,
> > that's the increase in the number of ticks by applying your patch.
> > Where it's negative, that's the decrease. The %age is the change from
> > the first to the second profile:

> > larry:/var/bench/results# diffprofile 2.5.67-mjb0.2{-nosort,}/sdetbench/64/profile
> >       7148    24.9% total
> >       6482    37.7% default_idle
> >       1466   842.5% __down
> >        442   566.7% __wake_up
> >        435   378.3% schedule
> >        251     0.0% move_vma_start
> >        149   876.5% __vma_link
> >         72    40.2% remove_shared_vm_struct
> >         46    35.1% copy_mm
> >         20    60.6% vma_link

> > Note the massive increase in down() (and some of the vma ops).

> Thanks for all the info, I'm sorry, I must rush away now.
> I'll try another think later, but hope someone can do better.

I've not reproduced this in testing myself (I don't have SDET);
but the conclusion I've come to is that the length of your vma lists
(for one or probably more files) was such that they were already
dangerously extending the hold of i_shared_sem with Dave's linear-
search-to-sort patch, and my additional downs in move_vma_start
then just pushed it over the edge into a thrash of collisions.

Clearly I was wrong to suppose that move_vma_start would scarcely be
called: even in my testing it showed up ~50% higher than __vma_link,
the other user of __vma_link_file.  But we cannot avoid i_shared_sem
there (can probably avoid page_table_lock and I did try doing without
that, just in case my up before spin_unlock had some hideous effect,
but apparently not).

I believe you've done the right thing in 2.5.67-mjb1: chucked out
both my patch and the vma list sorting: it's just too expensive on
the fast path, and you've shown that vividly.

Hugh


fix obj vma sorting

Post by Martin J. Bligh » Fri, 11 Apr 2003 16:40:05


>> > Here's the diffprofile for just your patch ... where it's positive,
>> > that's the increase in the number of ticks by applying your patch.
>> > Where it's negative, that's the decrease. The %age is the change from
>> > the first to the second profile:

>> > larry:/var/bench/results# diffprofile 2.5.67-mjb0.2{-nosort,}/sdetbench/64/profile
>> >       7148    24.9% total
>> >       6482    37.7% default_idle
>> >       1466   842.5% __down
>> >        442   566.7% __wake_up
>> >        435   378.3% schedule
>> >        251     0.0% move_vma_start
>> >        149   876.5% __vma_link
>> >         72    40.2% remove_shared_vm_struct
>> >         46    35.1% copy_mm
>> >         20    60.6% vma_link

>> > Note the massive increase in down() (and some of the vma ops).

>> Thanks for all the info, I'm sorry, I must rush away now.
>> I'll try another think later, but hope someone can do better.

> I've not reproduced this in testing myself (I don't have SDET);
> but the conclusion I've come to is that the length of your vma lists
> (for one or probably more files) was such that they were already
> dangerously extending the hold of i_shared_sem with Dave's linear-
> search-to-sort patch, and my additional downs in move_vma_start
> then just pushed it over the edge into a thrash of collisions.

> Clearly I was wrong to suppose that move_vma_start would scarcely be
> called: even in my testing it showed up ~50% higher than __vma_link,
> the other user of __vma_link_file.  But we cannot avoid i_shared_sem
> there (can probably avoid page_table_lock and I did try doing without
> that, just in case my up before spin_unlock had some hideous effect,
> but apparently not).

Yeah, sorry ... I guess someone should have published the phone conversation
we had yesterday ... </me pokes Dave in the eye>

We came to the conclusion that we should be adding the semaphore to the current
code even, as list_add_tail isn't atomic to a doubly linked list (unless
maybe you can do some fancy-pants compare and exchange thing after setting
up the prev pointer of the new element already). Which is probably going
to suck performance-wise, but I'd prefer correctness. From there we can
make a better judgment, but it sounds like it's going to contend horribly
on those busy semaphores.
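
For reference, list_add_tail() is just a handful of plain pointer stores (abridged below from include/linux/list.h of this era), which is why some exclusion such as i_shared_sem has to cover the insert: a concurrent walker of the list can otherwise observe a half-linked entry.

/* abridged from include/linux/list.h */
static inline void __list_add(struct list_head *new,
                              struct list_head *prev,
                              struct list_head *next)
{
        next->prev = new;
        new->next = next;
        new->prev = prev;
        prev->next = new;
}

/* link 'new' just before 'head', i.e. at the tail of the list */
static inline void list_add_tail(struct list_head *new, struct list_head *head)
{
        __list_add(new, head->prev, head);
}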

cat /proc/*/maps | nawk '{print $6}' | sort | uniq -c
reveals that we have 600 or so mappings to libc and ld splattered around,
which seems fairly low load ... SDET is doing bunches of shell scripts,
which probably generates the high operations on top of that.

I think the "list of lists" thing will help this, but unless we do
something like RCU here, I don't see how we can do much to this data
structure without death-by-semaphore contention.

> I believe you've done the right thing in 2.5.67-mjb1: chucked out
> both my patch and the vma list sorting: it's just too expensive on
> the fast path, and you've shown that vividly.

Yeah, I was being grumpy and threw it all out ;-) Needs more
thought before we decide what to do with this stuff.

M.


fix obj vma sorting

Post by Hugh Dickins » Fri, 11 Apr 2003 16:40:19



> Yeah, sorry ... I guess someone should have published the phone conversation
> we had yesterday ... </me pokes Dave in the eye>

No problem: I left you all*.

> We came to the conclusion that we should be adding the semaphore to the current
> code even, as list_add_tail isn't atomic to a doubly linked list

Sure, you can't list_add_tail without the semaphore: where is it missed?

Hugh


fix obj vma sorting

Post by Dave McCracke » Fri, 11 Apr 2003 17:00:08


--On Thursday, April 10, 2003 07:29:03 -0700 "Martin J. Bligh" wrote:


> Yeah, sorry ... I guess someone should have published the phone
> conversation we had yesterday ... </me pokes Dave in the eye>

> We came to the conclusion that we should be adding the semaphore to the
> current  code even, as list_add_tail isn't atomic to a doubly linked list
> (unless maybe you can do some fancy-pants compare and exchange thing
> after setting up the prev pointer of the new element already). Which is
> probably going to suck performance-wise, but I'd prefer correctness. From
> there we can make a better judgment, but it sounds like it's going to
> contend horribly on those busy semaphores.

I didn't publish the conversation because I realized that the semaphore is
taken outside the function, so it is held.  It's what I called you back to
tell you.

I'm guessing the contention we're seeing with Hugh's fix is because of the
way ld.so works.  It maps the entire library, then does an mprotect to
change the idata section from shared to private.  It does this for every
mapped library after every exec.
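
A hypothetical user-space illustration of that pattern (not ld.so's actual code; the library path, length and offsets are made up for the example): one big private file mapping followed by an mprotect over a sub-range, which forces the kernel to split the vma; with the sorting patch, the resulting re-linking of file vmas then happens under i_shared_sem (move_vma_start / __vma_link).

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        size_t len = 1024 * 1024;       /* illustrative size */
        char *base;
        int fd = open("/lib/libc-2.2.5.so", O_RDONLY);

        if (fd < 0)
                return 1;
        base = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED)
                return 1;

        /* Change protection on one page in the middle: one vma becomes three. */
        mprotect(base + 64 * 4096, 4096, PROT_READ | PROT_WRITE);

        munmap(base, len);
        close(fd);
        return 0;
}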

Dave

======================================================================
Dave McCracken          IBM Linux Base Kernel Team      1-512-838-3059


fix obj vma sorting

Post by Martin J. Bligh » Fri, 11 Apr 2003 17:00:26


>> Yeah, sorry ... I guess someone should have published the phone
>> conversation we had yesterday ... </me pokes Dave in the eye>

>> We came to the conclusion that we should be adding the semaphore to the
>> current  code even, as list_add_tail isn't atomic to a doubly linked list
>> (unless maybe you can do some fancy-pants compare and exchange thing
>> after setting up the prev pointer of the new element already). Which is
>> probably going to suck performance-wise, but I'd prefer correctness. From
>> there we can make a better judgment, but it sounds like it's going to
>> contend horribly on those busy semaphores.

> I didn't publish the conversation because I realized that the semaphore is
> taken outside the function, so it is held.  It's what I called you back to
> tell you.

Oh yeah. I guess I should poke myself in the eye instead ;-)
So it's OK the way it is.

> I'm guessing the contention we're seeing with Hugh's fix is because of the
> way ld.so works.  It maps the entire library, then does an mprotect to
> change the idata section from shared to private.  It does this for every
> mapped library after every exec.

Eeek. There's no way we can set this up to do it as two separate VMAs
initially, is there?

M.


fix obj vma sorting

Post by Hugh Dickins » Fri, 11 Apr 2003 17:30:07



> Eeek. There's no way we can set this up to do it as two separate VMAs
> initially, is there?

What if we could?  It's already shown the VMA sorting is (liable to be)
too slow.  Changing that most common case won't change the fact.

Hugh


fix obj vma sorting

Post by Martin J. Bligh » Fri, 11 Apr 2003 17:30:14


>> Eeek. There's no way we can set this up to do it as two separate VMAs
>> initially, is there?

> What if we could?  It's already shown the VMA sorting is (liable to be)
> too slow.  Changing that most common case won't change the fact.

Well, it'd thrash it substantially less, I guess. However, you're probably
right ... need a design change instead of tweaking. Doubling the number
of tasks would probably just take us back to where we were before ... need
something more radical.

M.


vma rbtree and vma merging

I rewrote the vma lookup engine with an rbtree; it avoids many walks of
the tree in many places this way and it drops the complexity of
rebalancing the tree as well. There is still some room for improvement
(get_unmapped_area walks the tree one more time in the non-MAP_FIXED
case and we could checkpoint there, but it would pollute some other code
and I thought it was a lower priority).
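
For context, the core of an rbtree-based vma lookup looks roughly like the sketch below (a sketch only, written with the type and field names that the later mainline find_vma() ended up using, which need not match this patch exactly): find the vma with the lowest vm_end that is still above addr.

#include <linux/mm.h>
#include <linux/rbtree.h>

/* Sketch: return the first vma with vm_end > addr, or NULL. */
struct vm_area_struct *find_vma_sketch(struct mm_struct *mm, unsigned long addr)
{
        struct rb_node *node = mm->mm_rb.rb_node;
        struct vm_area_struct *vma = NULL;

        while (node) {
                struct vm_area_struct *tmp;

                tmp = rb_entry(node, struct vm_area_struct, vm_rb);
                if (tmp->vm_end > addr) {
                        vma = tmp;              /* best candidate so far */
                        if (tmp->vm_start <= addr)
                                break;          /* addr lies inside it */
                        node = node->rb_left;
                } else
                        node = node->rb_right;
        }
        return vma;
}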

It also hopefully does the right merging for the anon mappings for mmap,
mremap and mprotect in all the cases (I got some mremap bit from Ben's
patch).  It's still very early work so don't apply it in any production
box but it works for me so far (running kde pre-2.9 over it, konqueror,
mutt, emacs, and some test programs and found no problems yet). I also
included it under the name of 70_mmap-rb-3 in the 2.4.9aa1 patchkit but
please make sure to back it out from the patchkit before doing any
production work. It needs to be tested better and it also needs to be
benchmarked carefully (should be a speed improvement).

against 2.4.9 vanilla:

        ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4....

Andrea
