Performance improvements binding threads to CPUs or thread priorities

Performance improvements binding threads to CPUs or thread priorities

Post by mara » Fri, 07 Mar 2003 05:40:14



I'm using a SunOS 5.8 Generic_108528-12 sun4u sparc SUNW,Ultra-60,
which has 2 CPUs.  In my case, there is a C++ application that
consists of several worker threads responsible for heavy
computations, several other threads for receiving data via UDP and
TCP, and another thread for sending results via TCP.

1. Can someone point out some advantages of binding a thread/LWP to a
CPU?  I'm looking for docs, or concrete cases where this would be
beneficial.  I tried using processor_bind() in a sample test (see the
sketch below), but could not see any performance gain.

2. I tried increasing the priority of the computational threads, and
likewise did not see any performance improvement.  Can anyone suggest
ways I can leverage priority scheduling for the above application?
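
For reference, here is a minimal sketch of the kind of calls I
experimented with (this assumes the Solaris processor_bind() and
thr_setprio() interfaces; the processor id 0 and the priority value 10
are just placeholders, and error handling is trimmed):

    #include <sys/types.h>
    #include <sys/processor.h>   /* processor_bind() */
    #include <sys/procset.h>     /* P_LWPID, P_MYID */
    #include <thread.h>          /* thr_self(), thr_setprio() */
    #include <stdio.h>

    /* Called at the start of a worker thread's run function. */
    static void tune_worker_thread()
    {
        /* Bind the calling LWP to processor 0 (placeholder id). */
        if (processor_bind(P_LWPID, P_MYID, 0, NULL) != 0)
            perror("processor_bind");

        /* Raise this thread's priority (placeholder value). */
        int err = thr_setprio(thr_self(), 10);
        if (err != 0)
            fprintf(stderr, "thr_setprio failed: error %d\n", err);
    }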

MR

 
 
 

Performance improvements binding threads to CPUs or thread priorities

Post by Steve Watt » Fri, 07 Mar 2003 15:16:08




>I'm using a SunOS 5.8 Generic_108528-12 sun4u sparc SUNW,Ultra-60 -
>which has 2 CPUs.  In my case, there is a c++ application that
>consists of several worker threads that are responsible for heavy
>computations, and several other threads for receiving data via udp and
>tcp, and another thread for sending results via tcp.

>1. Can someone point out some advantages of binding a thread/lwp to a
>cpu? I'm looking for some docs, or concrete cases of when this would
>be beneficial. I tried using processor_bind() in a sample test, but
>could not see performance gains.

>2. I tried to increase the priority of the computational threads, and
>likewise did not see any performance improvement. Can anyone suggest
>ways I can leverage priority scheduling for the above application ?

Why do you think either of these would improve performance?  What is the
specific problem you're really trying to solve?

If you've got n compute-bound threads, where n is greater than the number
of CPUs, there's probably not much you can do to increase performance
any further except add CPU.
--
Steve Watt KD6GGD  PP-ASEL-IA          ICBM: 121W 56' 57.8" / 37N 20' 14.9"

   Free time?  There's no such thing.  It just comes in varying prices...

 
 
 

Performance improvements binding threads to CPUs or thread priorities

Post by mara » Sat, 08 Mar 2003 02:11:35


So are you saying that neither (1) binding a thread to a CPU nor (2)
bumping up computational thread priority will help in terms of
performance?  The specific issue is that, within the current system
limitations, I'm trying to improve the computational performance -
increase the # of computations that can be performed per time interval
(i.e., per second).  This assumes fast input and output message rates,
where the bottleneck is the time it takes to perform the computation.

Simplified Scenario To Solve:
1. computation request received via socket
2. request added to worker thread pool queue
3. worker thread picks up the message and performs the calculation

So, without hardware modifications (i.e., increasing the # of CPUs),
what can be done to improve the performance of step (3)?
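
For concreteness, here is a stripped-down sketch of the hand-off in
steps 2 and 3, using POSIX mutex/condition-variable primitives
(Request and compute() are hypothetical stand-ins for the real message
type and calculation):

    #include <pthread.h>
    #include <queue>

    struct Request { /* parsed message fields would go here */ };

    void compute(const Request &req);   /* the heavy calculation (hypothetical) */

    static std::queue<Request> work_queue;
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

    /* Step 2: the receiving thread enqueues a request and wakes a worker. */
    void enqueue_request(const Request &req)
    {
        pthread_mutex_lock(&queue_lock);
        work_queue.push(req);
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_lock);
    }

    /* Step 3: each worker thread loops, dequeuing a request and computing. */
    void *worker_main(void *)
    {
        for (;;) {
            pthread_mutex_lock(&queue_lock);
            while (work_queue.empty())
                pthread_cond_wait(&queue_cond, &queue_lock);
            Request req = work_queue.front();
            work_queue.pop();
            pthread_mutex_unlock(&queue_lock);
            compute(req);
        }
        return 0;
    }

Every request pays for the lock/signal/wakeup round trip in that
hand-off, in addition to the calculation itself.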

Thanks

MR




> >I'm using a SunOS 5.8 Generic_108528-12 sun4u sparc SUNW,Ultra-60 -
> >which has 2 CPUs.  In my case, there is a c++ application that
> >consists of several worker threads that are responsible for heavy
> >computations, and several other threads for receiving data via udp and
> >tcp, and another thread for sending results via tcp.

> >1. Can someone point out some advantages of binding a thread/lwp to a
> >cpu? I'm looking for some docs, or concrete cases of when this would
> >be beneficial. I tried using processor_bind() in a sample test, but
> >could not see performance gains.

> >2. I tried to increase the priority of the computational threads, and
> >likewise did not see any performance improvement. Can anyone suggest
> >ways I can leverage priority scheduling for the above application ?

> Why do you think either of these would improve performance?  What is the
> specific problem you're really trying to solve?

> If you've got n compute-bound threads, where n is greater than the number
> of CPUs, there's probably not much you can do to increase performance
> any further except add CPU.

 
 
 

Performance improvements binding threads to CPUs or thread priorities

Post by Eric Sosman » Sat, 08 Mar 2003 03:02:03



> So are you saying that neither (1) binding a thread to a cpu or (2)
> bumping up computational thread priority will help in terms of
> performance?  The specific issue is that with the current system
> limitations, I'm trying to improve the computational performance -
> increase the # of computations that can be performed per time interval
> (ie, second) - this is assuming a fast input message rate and fast
> output message rate consumption where the bottleneck is the time it
> takes to perform the computation.

> Simplified Scenario To Solve:
> 1. computation request received via socket
> 2. request added to worker thread pool queue
> 3. worker thread picks up msg and performs calc

> So, without hardware modifications - ie, increasing # of CPUs, what
> can be done to improve performance of step (3) ?

    All the things that can be done to improve the performance
of any calculation: better algorithms, better implementations,
and so forth.

    One possibility to consider (depending on your data rates
and so on) might be to eliminate the overhead of handing off
the requests from a listener thread to a worker thread.  Why
not just have the thread that receives a request go off and
service it directly, instead of dumping it in a queue and
going to the bother of waking up some other thread to pull
it out and deal with it?
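
A sketch of that alternative, assuming (for illustration) one request
per UDP datagram so that several threads can safely share one socket,
and with parse_request() and compute() as hypothetical stand-ins: run
a small, fixed pool of threads, each of which reads a request and
services it in place, with no hand-off queue.

    #include <sys/types.h>
    #include <sys/socket.h>

    struct Request { /* parsed message fields */ };
    bool parse_request(const char *buf, ssize_t len, Request *out);  /* hypothetical */
    void compute(const Request &req);                                /* hypothetical */

    /* Each of the N pool threads runs this loop on the shared socket. */
    void *recv_and_compute(void *arg)
    {
        int sock = *(int *)arg;
        char buf[4096];
        for (;;) {
            ssize_t n = recv(sock, buf, sizeof buf, 0);
            if (n <= 0)
                break;                   /* error or shutdown */
            Request req;
            if (parse_request(buf, n, &req))
                compute(req);            /* serviced by the receiving thread */
        }
        return 0;
    }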

--

 
 
 

Performance improvements binding threads to CPUs or thread priorities

Post by David Schwartz » Sat, 08 Mar 2003 04:06:33



> So are you saying that neither (1) binding a thread to a cpu or (2)
> bumping up computational thread priority will help in terms of
> performance?  The specific issue is that with the current system
> limitations, I'm trying to improve the computational performance -
> increase the # of computations that can be performed per time interval
> (ie, second) - this is assuming a fast input message rate and fast
> output message rate consumption where the bottleneck is the time it
> takes to perform the computation.

> Simplified Scenario To Solve:
> 1. computation request received via socket
> 2. request added to worker thread pool queue
> 3. worker thread picks up msg and performs calc

> So, without hardware modifications - ie, increasing # of CPUs, what
> can be done to improve performance of step (3) ?

        You have X CPUs, each of which can perform Y computations per second.
It takes Z computations to process one request. If your program was
absolutely perfect, the number of requests per second you could handle
would be X times Y divided by Z.

        If there are reasons your code is falling way below that ideal limit,
you may be able to fix them. But you will never get above that limit.

        Measure how long it takes one CPU to do 1,000 requests. Divide that
time by the number of CPUs, then divide the result into 1,000; that is the
ideal rate. If you're getting close to handling that many requests per
second, then the only way to get faster is to process each request using
fewer computations. That is, all that would be left would be an algorithmic
optimization.
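
        (A worked example with made-up numbers: if one CPU needs 2.0 seconds
for 1,000 requests, then two CPUs ideally need 1.0 second, so the ideal
rate is 1,000 / 1.0 = 1,000 requests per second.)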

        Priorities and binding don't make the CPUs go any faster. In fact, they
tend to slow things down a small amount.

        DS

 
 
 

Performance improvements binding threads to CPUs or thread priorities

Post by Gavin Maltby » Sun, 09 Mar 2003 01:36:58


Hi


> So are you saying that neither (1) binding a thread to a cpu or (2)
> bumping up computational thread priority will help in terms of
> performance?  The specific issue is that with the current system
> limitations, I'm trying to improve the computational performance -
> increase the # of computations that can be performed per time interval
> (ie, second) - this is assuming a fast input message rate and fast
> output message rate consumption where the bottleneck is the time it
> takes to perform the computation.

> Simplified Scenario To Solve:
> 1. computation request received via socket
> 2. request added to worker thread pool queue
> 3. worker thread picks up msg and performs calc

> So, without hardware modifications - ie, increasing # of CPUs, what
> can be done to improve performance of step (3) ?

[cut]

With the limited CPU resources you have, you'll likely find that
the kernel's own scheduling makes the best overall use of those
resources.  Be sure to use the new libthread in Solaris 8
and later - besides typically being faster, it always binds
threads to LWPs and can lead to fewer surprises.  The kernel
will try to allow for CPU (really cache) affinity and run
a thread on the CPU on which it last ran, etc.  Rather than
performance tuning at the relatively coarse level of CPU binding,
you'd probably do better to streamline the computation code
in terms of code optimization, cache friendliness, etc.  As somebody
else has said, you *may* find that a producer/consumer model
has a significant overhead in terms of synchronisation, where
you could just have a thread run with the request it accepted.

Gavin

 
 
 

Performance improvements binding threads to CPUs or thread priorities

Post by mara » Sun, 09 Mar 2003 02:07:52




> > So are you saying that neither (1) binding a thread to a cpu or (2)
> > bumping up computational thread priority will help in terms of
> > performance?  The specific issue is that with the current system
> > limitations, I'm trying to improve the computational performance -
> > increase the # of computations that can be performed per time interval
> > (ie, second) - this is assuming a fast input message rate and fast
> > output message rate consumption where the bottleneck is the time it
> > takes to perform the computation.

> > Simplified Scenario To Solve:
> > 1. computation request received via socket
> > 2. request added to worker thread pool queue
> > 3. worker thread picks up msg and performs calc

> > So, without hardware modifications - ie, increasing # of CPUs, what
> > can be done to improve performance of step (3) ?

>     All the things that can be done to improve the performance
> of any calculation: better algorithms, better implementations,
> and so forth.

>     One possibility to consider (depending on your data rates
> and so on) might be to eliminate the overhead of handing off
> the requests from a listener thread to a worker thread.  Why
> not just have the thread that receives a request go off and
> service it directly, instead of dumping it in a queue and
> going to the bother of waking up some other thread to pull
> it out and deal with it?

The reason is that it's also a concurrent server and needs to continue
receiving requests.  If it serviced each request immediately, in an
iterative fashion, other requests could - and would - be lost.  It's a
typical server thread-pool model.
 
 
 

Performance improvements binding threads to CPUs or thread priorities

Post by mara » Sun, 09 Mar 2003 02:12:30




> > So are you saying that neither (1) binding a thread to a cpu or (2)
> > bumping up computational thread priority will help in terms of
> > performance?  The specific issue is that with the current system
> > limitations, I'm trying to improve the computational performance -
> > increase the # of computations that can be performed per time interval
> > (ie, second) - this is assuming a fast input message rate and fast
> > output message rate consumption where the bottleneck is the time it
> > takes to perform the computation.

> > Simplified Scenario To Solve:
> > 1. computation request received via socket
> > 2. request added to worker thread pool queue
> > 3. worker thread picks up msg and performs calc

> > So, without hardware modifications - ie, increasing # of CPUs, what
> > can be done to improve performance of step (3) ?

>    You have X CPUs, each of which can perform Y computations per second.
> It takes Z computations to process one request. If your program was
> absolutely perfect, the number of requests per second you could handle
> would be X times Y divided by Z.

>    If there are reasons your code is falling way below that ideal limit,
> you may be able to fix them. But you will never get above that limit.

>    Measure how long it takes one CPU to do 1,000 requests. Divide that
> time by the number of CPUs, then divide the result into 1,000; that is the
> ideal rate. If you're getting close to handling that many requests per
> second, then the only way to get faster is to process each request using
> fewer computations. That is, all that would be left would be an algorithmic
> optimization.

>    Priorities and binding don't make the CPUs go any faster. In fact, they
> tend to slow things down a small amount.

>    DS

That's an interesting way to look at an ideal upper limit on calcs per
second.  The relationship between CPU count and processing time isn't
quite linearly proportional, but it is close - running the same test
with 1000 calcs on 1 CPU took 1.63 seconds, while running it on 2
CPUs took 0.93 seconds - a 1.75x speedup (not 2x).

Is it a misconception that raising computational thread priorities
will dedicate more CPU time, resulting in better performance?  If so,
which case(s) could benefit from increasing thread priorities or
binding LWP(s) to CPUs?  Isn't there an advantage in using system
scope scheduling?

 
 
 

Performance improvements binding threads to CPUs or thread priorities

Post by Eric Sosman » Sun, 09 Mar 2003 03:17:14




> >     One possibility to consider (depending on your data rates
> > and so on) might be to eliminate the overhead of handing off
> > the requests from a listener thread to a worker thread.  Why
> > not just have the thread that receives a request go off and
> > service it directly, instead of dumping it in a queue and
> > going to the bother of waking up some other thread to pull
> > it out and deal with it?

> The reason for that is because it's also a concurrent server and needs
> to continue receiving requests. If it's iterative and services
> immediately then other requests may and will be lost. It's a typical
> server thread pool model.

    How will incoming requests be "lost?"  You wrote earlier
that these requests arrive on a socket, which means they'll
be perfectly happy to sit around in socket buffers until you're
ready to read them.  From all you've said thus far, I don't see
an advantage in reading them into user-space buffers merely to
ignore them until a worker thread is available; you might just
as well ignore them in kernel-space buffers ...?

--

 
 
 

Performance improvements binding threads to CPUs or thread priorities

Post by David Schwartz » Sun, 09 Mar 2003 04:18:50



> That's an interesting way to look at an ideal upper limit on calcs per
> second. Although, the CPU relationship to processing time should not
> be linearly proportional, but close - running the same test with 1000
> calcs on 1 CPU took 1.63 seconds, while running the same test on 2
> CPUs took .93 seconds - 1.75x (not 2x).

        It's theoretically possible that you might be able to push that closer
to 2x. 2x is the upper limit without algorithmic optimization of the
calculation itself. Two CPUs will never run more than twice as fast as
one. (Well, that's not totally true, but ....)

> Is it a misconception that raising computational thread priorities
> will dedicate more cpu time, resulting in better performance?

        Yep. Thread priorities allow you to adjust where the computation time
goes, but you don't get more of it.

> If so, which case(s) could benefit from increasing thread priorities
> or binding LWP(s) to CPUs?  Isn't there an advantage in using system
> scope scheduling?

        You should create as many computation threads as you have CPUs (or just
a few more) and give them system scope scheduling. This will probably
get you as close to the ideal as you can get.
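
        A minimal sketch of creating one such thread with system contention
scope via POSIX thread attributes (worker_main is a hypothetical thread
function, start_worker is an illustrative helper, and error checking is
trimmed):

    #include <pthread.h>

    extern void *worker_main(void *);   /* hypothetical computation loop */

    /* Illustrative helper: create one worker with system contention scope. */
    int start_worker(pthread_t *tid)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        /* System contention scope: the thread is scheduled by the kernel
           against all other LWPs in the system rather than being
           multiplexed by the user-level threads library. */
        pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
        int rc = pthread_create(tid, &attr, worker_main, NULL);
        pthread_attr_destroy(&attr);
        return rc;
    }

        You would call this once per CPU, or a few more times, as above.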

        Measure your performance against the ideal. If you're falling well
below it, something's probably wrong. Otherwise, look at optimizing the
CPU-intensive step.

        DS

 
 
 

Performance improvements binding threads to CPUs or thread priorities

Post by Car » Wed, 19 Mar 2003 14:57:33


Marat:

The overall flow of your program seems efficient enough.  The following
are the steps we used to speed up our application:

1. Tried to use minimal floating-point calculations.
2. Memory pooling (see the sketch below).
3. Caching.

You can also profile [gprof] your program to determine hot spots and
fix them.
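
By way of illustration only, here is a bare-bones, fixed-size memory
pool of the kind meant in point 2 (BlockPool is a made-up name, and it
is not thread-safe as written - a real pool would add a mutex or keep
one pool per worker thread):

    #include <cstddef>
    #include <new>

    /* Illustrative only: recycles fixed-size blocks from a free list
       instead of calling new/delete for every request. */
    class BlockPool {
    public:
        BlockPool(std::size_t block_size, std::size_t count) : free_list_(0)
        {
            if (block_size < sizeof(Node))
                block_size = sizeof(Node);      /* room for the link */
            for (std::size_t i = 0; i < count; ++i) {
                Node *n = static_cast<Node *>(::operator new(block_size));
                n->next = free_list_;
                free_list_ = n;
            }
        }
        void *acquire()                 /* O(1), no system allocator call */
        {
            if (free_list_ == 0)
                return 0;               /* pool exhausted */
            Node *n = free_list_;
            free_list_ = n->next;
            return n;
        }
        void release(void *p)           /* return a block to the free list */
        {
            Node *n = static_cast<Node *>(p);
            n->next = free_list_;
            free_list_ = n;
        }
    private:
        struct Node { Node *next; };
        Node *free_list_;
    };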

As far as I understand it, thread-to-CPU binding is only used for
management purposes, so that a single program does not hog all the
CPUs in, say, a 12-CPU machine.  Also, setting processor binding
programmatically is only allowed with root privileges.

Increasing process priority will only be effective if there are
several other computationally intensive processes running on the
system; otherwise it does no good.

Hope these points help

Ritesh Noronha
PCS Pvt Ltd
Embedded Tech Division [Storage]

 
 
 
