[PATCH] Scalable Scheduling

[PATCH] Scalable Scheduling

Post by Linus Torvalds » Fri, 10 Aug 2001 01:50:07




> I have been working on scheduler scalability.  Specifically,
> the concern is running Linux on bigger machines (higher CPU
> count, SMP only for now).

Note that there is no way I will ever apply this particular patch for a
very simple reason: #ifdef's in code.

Why do you have things like

        #ifdef CONFIG_SMP
                .. use nr_running() ..
        #else
                .. use nr_running ..
        #endif

and

        #ifdef CONFIG_SMP
               list_add(&p->run_list, &runqueue(task_to_runqueue(p)));
        #else
               list_add(&p->run_list, &runqueue_head);
        #endif

when it just shows that you did NOT properly abstract your thinking to
realize that the non-SMP case should be the same as the SMP case with 1
CPU (+ optimization).

I find code like the above physically disgusting.

What's wrong with using

        nr_running()

unconditionally, and make sure that it degrades gracefully to just the
single-CPU case?

What's wrong with just using

        runqueue(task_to_runqueue(p))

and having the UP case realize that the "runqueue()" macro is a fixed
entry?

Same thing applies to that runqueue_lock stuff. That is some of the
ugliest code I've seen in a long time. Please use inline functions, sane
defines that work both ways, and take advantage of the fact that gcc will
optimize constant loops and numbers (it's ok to reference arrays in UP
with "array[smp_processor_id()]", and it's ok to have loops that look like
"for (i = 0; i < NR_CPUS; i++)" that will do the right thing on UP _and_
SMP).
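
(An editorial illustration, not code from the patch under discussion: with
a hypothetical per-CPU runqueues[] array, this pattern needs no #ifdef at
all, because the UP headers already pin NR_CPUS to 1 and
smp_processor_id() to 0.)

        /* All names hypothetical; one definition serves UP and SMP. */
        struct runqueue {
                int nr_running;
                /* ... lock, queue heads, etc. ... */
        };
        extern struct runqueue runqueues[NR_CPUS];

        /* On UP this is a fixed entry: &runqueues[0]. */
        #define this_rq()       (&runqueues[smp_processor_id()])

        static inline int total_nr_running(void)
        {
                int i, tot = 0;

                /* On UP gcc sees "i < 1" and folds the loop away. */
                for (i = 0; i < NR_CPUS; i++)
                        tot += runqueues[i].nr_running;
                return tot;
        }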

And make your #ifdef's be _outside_ the code.

I hate code that has #ifdef's. It's a major design mistake, and shows
that the person who coded it didn't think of it as _one_ problem, but as
two.

So please spend some time cleaning it up, I can't look at it like this.

                Linus

[PATCH] Scalable Scheduling

Post by Mike Kravetz » Fri, 10 Aug 2001 02:10:11


Thanks for the input.  I'll try to get the code into some form
that you can stomach.  Hopefully, after that we can discuss the
merits of the approach/design.

-
Mike

[PATCH] Scalable Scheduling

Post by Hubertus Franke » Fri, 10 Aug 2001 02:40:07


Linus, great input on the FLAME side, criticism accepted :-)

More importantly, we wanted to get some input (particularly from you)
on whether our approach is actually an acceptable one, notwithstanding
the #ifdef's :-). These are easy to fix, but we wanted to follow up on
this topic ASAP after OLS, before the thoughts on it got lost to time.

We will clean this code up ASAP and resubmit.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)


(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003





[PATCH] Scalable Scheduling

Post by Linus Torvalds » Fri, 10 Aug 2001 02:50:08



> Linus, great input on the FLAME side, criticism accepted :-)

> More importantly, we wanted to get some input (particularly from you)
> on whether our approach is actually an acceptable one, notwithstanding
> the #ifdef's :-)

I think what the code itself tried to do looked reasonable, but it was so
distracting to read the patch that I can't make any really intelligent
comments about it.

The only thing that looked really ugly was that real-time runqueue thing.
Does it _really_ have to be done that way?

                Linus


[PATCH] Scalable Scheduling

Post by Linus Torvalds » Fri, 10 Aug 2001 03:10:10



> The only thing that looked really ugly was that real-time runqueue
> thing. Does it _really_ have to be done that way?

Oh, and as I didn't actually run it, I have no idea what performance is
really like. I assume you've done lmbench runs across a wide variety (ie
UP to SMP) of machines with and without this?

"Scalability" is useless if the baseline you scale from is bad. In the
end, the only thing that matters is "performance", not "scalability".
Which is why sometimes O(n) is better than O(logn).

                Linus


[PATCH] Scalable Scheduling

Post by David S. Miller » Fri, 10 Aug 2001 04:00:10


   Date:        Wed, 8 Aug 2001 11:18:44 -0700

   Somebody really ought to take the time to make a cache miss counter program
   that works like /bin/time.  So I could do

           $ cachemiss lat_ctx 2
           10123 instruction, 22345 data, 50432 TLB flushes

   Has anyone done that?  If so, then what would be cool is if each of these
   wonderful new features that people propose come with cachemiss results for
   the related part of LMbench or some other benchmark.

On some platforms, such an app can basically be written already.

The kernel support is there for sparc64 wrt. the cache
miss stuff.  It uses the cpu performance counter stuff.
Have a look at linux/include/asm-sparc64/perfctr.h to see
what I mean.

The performance counters are fancy enough that you can
ask them to do stuff like:

1) tell me D-cache misses in user and/or kernel mode
2) tell me D-cache misses that hit the E-cache
   in user and/or kernel mode
3) tell me I-cache misses, but only those which actually
   ended up stalling the pipeline
4) tell me E-cache misses, where the chip was not able
   to get granted the memory bus immediately
5) Same as #4, but how many total bus cycles were spent
   waiting for bus grant for the E-cache miss

Later,
David S. Miller
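
(As a concrete shape for the /bin/time-style tool being asked for, here is
a minimal editorial sketch. read_miss_counters() is a stub: a real version
would go through a platform interface such as the sparc64 perfctr support
mentioned above.)

        #include <stdio.h>
        #include <unistd.h>
        #include <sys/wait.h>

        struct misses { unsigned long insn, data, tlb; };

        /* Stub: a real version fills this from the CPU perf counters. */
        static void read_miss_counters(struct misses *m)
        {
                m->insn = m->data = m->tlb = 0;
        }

        int main(int argc, char **argv)
        {
                struct misses before, after;

                if (argc < 2) {
                        fprintf(stderr, "usage: cachemiss cmd [args...]\n");
                        return 1;
                }
                read_miss_counters(&before);
                if (fork() == 0) {
                        execvp(argv[1], &argv[1]);      /* run the benchmark */
                        _exit(127);
                }
                wait(NULL);
                read_miss_counters(&after);
                printf("%lu instruction, %lu data, %lu TLB\n",
                       after.insn - before.insn,
                       after.data - before.data,
                       after.tlb - before.tlb);
                return 0;
        }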


[PATCH] Scalable Scheduling

Post by Hubertus Franke » Fri, 10 Aug 2001 04:10:10


There is a project called SIGMA that some of our people here are involved
with. They are looking at standardizing interfaces for this across all
platforms. I'll dig out a URL and forward it to the list.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)

(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003







[PATCH] Scalable Scheduling

Post by Linus Torvalds » Fri, 10 Aug 2001 04:20:07



> write:

>    inline int nr_running(void)
>    {
>    #ifdef CONFIG_SMP
>            int i = 0, tot=nt_running(REALTIME_RQ);
>            while (i < smp_num_cpus) {
>                    tot += nt_running(cpu_logical_map(i++));
>            }
>            return(tot);
>    #else
>            return nr_running;
>    #endif
>    }

Even more preferably, just have (in a header file)

        #ifdef CONFIG_SMP

        inline int nr_running(void)
        {
                ...
        }

        .. other SMP cases ..

        #else

        #define nr_running() (__nr_running)

        .. other UP cases ..

        #endif

if you just cannot make an efficient function that just works for both.

No, we don't adhere to this everywhere. But we should (and largely _do_)
try to.

Having the #ifdef's outside the code tends to have two advantages:

 - it makes the code much more readable, and doesn't split things up.

 - you have to choose your abstraction interfaces more carefully, which in
   turn tends to make for better code.

Abstraction is nice - _especially_ when you have a compiler that sees
through the abstraction and can generate code as if it wasn't there.
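
(Filled in along those lines, the header might look like this; an
editorial sketch only, and per_cpu_nr_running is a hypothetical name, not
from the actual patch.)

        #ifdef CONFIG_SMP

        extern int per_cpu_nr_running[NR_CPUS];

        static inline int nr_running(void)
        {
                int i, tot = 0;

                for (i = 0; i < smp_num_cpus; i++)
                        tot += per_cpu_nr_running[cpu_logical_map(i)];
                return tot;
        }

        #else

        extern int __nr_running;
        #define nr_running()    (__nr_running)  /* one load, no loop */

        #endif

Callers write nr_running() everywhere and never see the #ifdef; on UP it
compiles down to a single memory reference.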

                Linus


[PATCH] Scalable Scheduling

Post by Hubertus Franke » Fri, 10 Aug 2001 04:20:08


I thought Linus's suggestion was pretty straightforward and clear, just as
you describe it. We'll fix it and resubmit ASAP.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)

(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003







Subject:  Re: [RFC][PATCH] Scalable Scheduling


> Yes we have, we'll provide those numbers with the updated patch.
> One challenge will be maintaining the same level of performance
> for UP as in the current code.  The current code has #ifdefs to
> separate some of the UP/SMP code paths and we will try to eliminate
> these.

Does it help if I clarify what Linus was suggesting?  Instead of:

         #ifdef CONFIG_SMP
                 .. use nr_running() ..
         #else
                 .. use nr_running ..
         #endif

write:

     inline int nr_running(void)
     {
     #ifdef CONFIG_SMP
          int i = 0, tot=nt_running(REALTIME_RQ);
          while (i < smp_num_cpus) {
               tot += nt_running(cpu_logical_map(i++));
          }
          return(tot);
     #else
          return nr_running;
     #endif
     }

Then see if you can make the #ifdef's go away from that too.  (If that's
too hard, well, at least the #ifdef's are now reduced.)

--
Daniel
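
(One way the remaining #ifdef could go away, sketched editorially: this
keeps Daniel's nt_running()/REALTIME_RQ names and assumes the UP headers
define smp_num_cpus as 1 and cpu_logical_map(i) as 0.)

        static inline int nr_running(void)
        {
                int i, tot = nt_running(REALTIME_RQ);

                /* On UP the loop body runs exactly once, for CPU 0,
                 * and gcc reduces it to a single read. */
                for (i = 0; i < smp_num_cpus; i++)
                        tot += nt_running(cpu_logical_map(i));
                return tot;
        }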

[PATCH] Scalable Scheduling

Post by Victor Yodaiken » Fri, 10 Aug 2001 04:40:10



> One challenge will be maintaining the same level of performance
> for UP as in the current code.  The current code has #ifdefs to

How does the "current code" compare to the current Linux UP code?


[PATCH] Scalable Scheduling

Post by Victor Yodaiken » Fri, 10 Aug 2001 05:00:12



> We did not modify the UP code at all. There will be NO effects (positive
> or negative) whatsoever.

Cool. So the obvious next question is
        How does it compare on a dual to the current Linux scheduler?

Obviously:
        Performance sucks on two processors but scales well

would not be a good thing.


[PATCH] Scalable Scheduling

Post by Hubertus Franke » Sun, 12 Aug 2001 09:10:06


Chris, I looked into the availability of this stuff. The people here at
Watson will put some of their low-level stuff together, and hopefully
we'll get some insights in a few weeks. I'll keep you posted.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)

(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003






Subject:  Re: [RFC][PATCH] Scalable Scheduling

    1) tell me D-cache misses in user and/or kernel mode
    2) tell me D-cache misses that hit the E-cache
       in user and/or kernel mode
    3) tell me I-cache misses, but only those which actually
       ended up stalling the pipeline
    4) tell me E-cache misses, where the chip was not able
       to get granted the memory bus immediately
    5) Same as #4, but how many total bus cycles were spent
       waiting for bus grant for the E-cache miss

ia32 for PPro and above can do pretty much all of that too (perhaps
not exactly the same metrics, but hopefully equally useful).  The only
thing is, you can't read them all at once, only a small number of them,
and they count in all kernel/userland states, so you would need to
save/read them on context switches.

  --cw
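
(For reference, the per-counter read on ia32 is a single instruction; a
minimal sketch assuming GNU C, with event-select programming and the
per-task save/restore cw mentions left out. RDPMC needs CR4.PCE set to be
usable from user space.)

        /* Read performance-monitoring counter n; RDPMC returns EDX:EAX. */
        static inline unsigned long long read_pmc(unsigned int n)
        {
                unsigned int lo, hi;

                __asm__ __volatile__("rdpmc" : "=a" (lo), "=d" (hi) : "c" (n));
                return ((unsigned long long)hi << 32) | lo;
        }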


[PATCH] Scalable Scheduling

Post by Erik Corry » Thu, 16 Aug 2001 05:00:06



> Somebody really ought to take the time to make a cache miss counter program
> that works like /bin/time.  So I could do
>    $ cachemiss lat_ctx 2
>    10123 instruction, 22345 data, 50432 TLB flushes

Take a look at http://www.csd.uu.se/~mikpe/linux/perfctr/.

It's a patch to make the performance counters per-program, and
make them easy to control.

There's an example program in there called perfex which does what
you want, though the user interface isn't as simple as the above.
You can do

perfex -e 0x00430046 lat_ctx 2

The last two digits of the -e value are the counter to be printed,
which in this case (Athlon) is the data-TLB misses.  That stuff
is documented in
http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf
page 164/180

It would be nice if the patch found its way into the kernel.

If you have APIC support there is also infrastructure for profiling
based on event-sampling instead of time sampling (sample every 100
cache misses instead of every 100us).  (Sadly my old 0.25um Athlon
has no APIC support).


12. Batch scheduling and fixes added to performance patches (-ck)