NT thread vs Linux thread...

Post by tawei » Fri, 18 Jul 1997 04:00:00



Hi,

  I am posting this more as a question to NT developers than to Linux
developers. However, the numbers from my tests may be of some interest
to Linux developers. Since there is no NT system development
group, I posted this to advocacy and hope this will not turn into a
flame war.

  I am writing a server that spawns a large number of threads to handle
events. Thus, thread creation performance is very important. I have
run some tests on two identical PCs. One PC runs NT 4.0 SP3 and the
other runs Red Hat Linux 4.2. The test results show that Linux threads
are much more efficient than NT threads.

Linux uses LinuxThreads 0.6, an implementation of Pthreads based on the
Linux clone system call.

NT uses the Win32 thread functions: CreateThread, CreateMutex,
WaitForSingleObjectEx, and ReleaseMutex.

Here are the numbers:

1. Spawn 1 million threads. The thread function is practically a null
   function containing only a return statement. I don't have exact
   timings, but here is the test result.

   When Linux finished, NT had spawned only 450k+ threads.

2. Create 1 million mutexes. Well... this is not exactly meaningful, but
   the number is interesting.

   When Linux finished, NT had created only 250k+ mutexes.

3. Lock and unlock a mutex 1 million times. This tests how fast mutex
   operations are.

   When Linux finished, NT had completed only 400k lock/unlock operations.
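
For reference, the first test could be sketched roughly as follows with POSIX threads. This is only an illustration, not the original test program: the names `null_thread` and `time_thread_creation` are made up here, and it assumes each thread is joined before the next one is created, which the post does not state.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Thread body that does nothing, like the "null function" in test 1. */
static void *null_thread(void *arg)
{
    return arg;
}

/* Create and join `count` threads, one at a time (an assumption; the
 * original test may have kept threads alive or detached them).
 * Returns elapsed wall-clock milliseconds, or -1.0 on failure. */
double time_thread_creation(long count)
{
    struct timespec start, end;

    assert(count >= 0);
    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;

    for (long i = 0; i < count; i++) {
        pthread_t tid;
        if (pthread_create(&tid, NULL, null_thread, NULL) != 0)
            return -1.0;
        /* Join immediately so only one extra thread exists at a time. */
        if (pthread_join(tid, NULL) != 0)
            return -1.0;
    }

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;
    return (end.tv_sec - start.tv_sec) * 1000.0
         + (end.tv_nsec - start.tv_nsec) / 1e6;
}
```

Calling `time_thread_creation(1000000)` and dividing the result by the count gives a per-thread cost. On older toolchains this may need to be built with `-pthread`.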

Any suggestions to improve performance are appreciated. Thanks in
advance.

--
Ta-Wei "David" Li

 
 
 

Post by David LeBla » Sun, 20 Jul 1997 04:00:00



>  I am writing a server that spawns a large number of threads to handle
>events.

This is completely ridiculous.  Creating 1 million threads, events,
whatever is quite silly.  What is much more important is what you are
trying to do.  You aren't going to run 1M threads anyway.  You are
going to _do_ something.  Your bottlenecks are really much more likely
to be in what you are doing than in the system calls.

Give us some idea of what you are trying to accomplish, and maybe
someone can give you some decent advice.

David LeBlanc           |Why would you want to have your desktop user,

                        |minicomputer-class computing environment?
                        |Scott McNealy

 
 
 

Post by Jonathan A. Maxwel » Sun, 20 Jul 1997 04:00:00


]
] >  I am writing a server that spawns a large number of threads
] >to handle events.
[ and NT is too slow ]

In NT the 'process' confuses resource sharing with task grouping,
resulting in a complicated model.  Many times, the more complex,
the harder to implement efficiently -- this could be why in your
tests, Linux was much faster creating threads.

] This is completely ridiculous.  Creating 1 million threads,
] events, whatever is quite silly.  What is much more important
  ^^^^^^

Most of the time, sure.  But have you ever tried to implement a
parallel Priority Queue?  One of my experiments is driven by
events, and the faster events are processed the better (the
actual operations can be made parallel) .. a million, I wish!
What's the point in implementing a Fibonacci heap if the gains
are just going to be erased by a slow OS?

] is what you are trying to do.  You aren't going to run 1M
] threads anyway.  You are going to _do_ something.  Your
] bottlenecks are really much more likely to be in what you are
] doing than in the system calls.
]
] Give us some idea of what you are trying to accomplish, and
] maybe someone can give you some decent advice.

No doubt he can do what he wants using only 10 threads, for some
number of 10.  But in the future, where massively parallel
systems will be increasingly used, it may be that the
event.process() will be done in parallel and take no time
compared to the priorityQueue.Insert().

] David LeBlanc           |Why would you want to have your desktop user,

        --JAM

 
 
 

Post by John Wiltshi » Sun, 20 Jul 1997 04:00:00



comp.os.ms-windows.nt.advocacy:

>Hi,

>  I am posting this more as a question to NT developers than to Linux
>developers. However, the numbers from my tests may be of some interest
>to Linux developers. Since there is no NT system development
>group, I posted this to advocacy and hope this will not turn into a
>flame war.

>  I am writing a server that spawns a large number of threads to handle
>events. Thus, thread creation performance is very important. I have
>run some tests on two identical PCs. One PC runs NT 4.0 SP3 and the
>other runs Red Hat Linux 4.2. The test results show that Linux threads
>are much more efficient than NT threads.

>Linux uses LinuxThreads 0.6, an implementation of Pthreads based on the
>Linux clone system call.

>NT uses the Win32 thread functions: CreateThread, CreateMutex,
>WaitForSingleObjectEx, and ReleaseMutex.

>Here are the numbers:

>1. Spawn 1 million threads. The thread function is practically a null
>   function containing only a return statement. I don't have exact
>   timings, but here is the test result.

>   When Linux finished, NT had spawned only 450k+ threads.

>2. Create 1 million mutexes. Well... this is not exactly meaningful, but
>   the number is interesting.

>   When Linux finished, NT had created only 250k+ mutexes.

>3. Lock and unlock a mutex 1 million times. This tests how fast mutex
>   operations are.

>   When Linux finished, NT had completed only 400k lock/unlock operations.

>Any suggestions to improve performance are appreciated. Thanks in
>advance.

Use thread pooling.  Don't create and destroy threads - create a set
of worker threads that sit idle and farm off work to them when it
comes in.  When they finish working let them return to the idle state
but don't terminate them.  You then adjust the number of worker
threads to suit your application.

Also, CreateMutex is slow.  Unless you have a reason for creating a
mutex that is valid across process boundaries, use critical sections
and not mutexes.  On a single-CPU machine a critical section basically
maps to a boolean flag instead of a kernel object.

You could also try investigating fibers if you want real speed at the
cost of a bit of programming effort (they are nonpreemptive).

Thread pooling should give you speed improvements independent of OS
(Unix has been doing it for years with processes).
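
A minimal sketch of the worker-pool idea, written with POSIX threads so that it is portable. Everything here is hypothetical and only for illustration: the names `pool`, `pool_init`, `pool_submit`, `pool_shutdown` and `pool_demo`, the pool size of 4, and the queue capacity of 64 are not from any post in this thread, and error handling is kept to a minimum.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define POOL_THREADS 4    /* hypothetical pool size */
#define QUEUE_CAP    64   /* hypothetical queue capacity */

typedef void (*task_fn)(void *);
typedef struct { task_fn fn; void *arg; } task;

typedef struct {
    pthread_t       workers[POOL_THREADS];
    task            queue[QUEUE_CAP];      /* ring buffer of pending tasks */
    size_t          head, count;
    int             stopping;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
} pool;

/* Each worker sleeps until a task arrives, runs it, then goes idle again. */
static void *worker_main(void *p)
{
    pool *pl = p;
    for (;;) {
        pthread_mutex_lock(&pl->lock);
        while (pl->count == 0 && !pl->stopping)
            pthread_cond_wait(&pl->not_empty, &pl->lock);
        if (pl->count == 0) {              /* stopping and queue drained */
            pthread_mutex_unlock(&pl->lock);
            return NULL;
        }
        task t = pl->queue[pl->head];
        pl->head = (pl->head + 1) % QUEUE_CAP;
        pl->count--;
        pthread_cond_signal(&pl->not_full);
        pthread_mutex_unlock(&pl->lock);
        t.fn(t.arg);                       /* run the task outside the lock */
    }
}

/* Start the workers once; returns 0 on success (minimal error handling). */
int pool_init(pool *pl)
{
    pl->head = pl->count = 0;
    pl->stopping = 0;
    pthread_mutex_init(&pl->lock, NULL);
    pthread_cond_init(&pl->not_empty, NULL);
    pthread_cond_init(&pl->not_full, NULL);
    for (int i = 0; i < POOL_THREADS; i++)
        if (pthread_create(&pl->workers[i], NULL, worker_main, pl) != 0)
            return -1;
    return 0;
}

/* Queue a task; blocks while the queue is full. */
void pool_submit(pool *pl, task_fn fn, void *arg)
{
    assert(fn != NULL);
    pthread_mutex_lock(&pl->lock);
    while (pl->count == QUEUE_CAP)
        pthread_cond_wait(&pl->not_full, &pl->lock);
    size_t tail = (pl->head + pl->count) % QUEUE_CAP;
    pl->queue[tail] = (task){ fn, arg };
    pl->count++;
    pthread_cond_signal(&pl->not_empty);
    pthread_mutex_unlock(&pl->lock);
}

/* Let the workers drain the queue, then join them. */
void pool_shutdown(pool *pl)
{
    pthread_mutex_lock(&pl->lock);
    pl->stopping = 1;
    pthread_cond_broadcast(&pl->not_empty);
    pthread_mutex_unlock(&pl->lock);
    for (int i = 0; i < POOL_THREADS; i++)
        pthread_join(pl->workers[i], NULL);
    pthread_mutex_destroy(&pl->lock);
    pthread_cond_destroy(&pl->not_empty);
    pthread_cond_destroy(&pl->not_full);
}

/* Example task and driver: run n counter increments through the pool. */
static atomic_long counter;
static void bump(void *unused) { (void)unused; atomic_fetch_add(&counter, 1); }

long pool_demo(int n)
{
    pool pl;
    atomic_store(&counter, 0);
    if (pool_init(&pl) != 0)
        return -1;
    for (int i = 0; i < n; i++)
        pool_submit(&pl, bump, NULL);
    pool_shutdown(&pl);
    return atomic_load(&counter);
}
```

The thread creation cost is paid `POOL_THREADS` times at startup instead of once per event, which is the point of the advice above. The same structure should carry over to Win32 threads, with the mutex and condition variables swapped for the native primitives.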

John Wiltshire

------------------------------------------------------
John Wiltshire              |  (w) +61 7 38342783

------------------------------------------------------
Fear: when you see B8 00 4C CD 21 and you know what it means.

 
 
 

Post by David LeBla » Sun, 20 Jul 1997 04:00:00




>] This is completely ridiculous.  Creating 1 million threads,
>] events, whatever is quite silly.  What is much more important
>  ^^^^^^
>Most of the time, sure.  But have you ever tried to implement a
>parallel Priority Queue?  

I'm not sure of your exact definition of this - I think so, actually.

>One of my experiments is driven by
>events, and the faster events are processed the better (the
>actual operations can be made parallel) .. a million, I wish!
>What's the point in implementing a Fibonacci heap if the gains
>are just going to be erased by a slow OS?

Well, yes - but the point here is that what you are actually doing
with all those threads and events is going to be what dominates your
results.  Those calls are always going to be very rapid.  For example,
you might get better results opening 500 sockets than 5000.

>] is what you are trying to do.  You aren't going to run 1M
>] threads anyway.  You are going to _do_ something.  Your
>] bottlenecks are really much more likely to be in what you are
>] doing than in the system calls.
>] Give us some idea of what you are trying to accomplish, and
>] maybe someone can give you some decent advice.
>No doubt he can do what he wants using only 10 threads, for some
>number of 10.  

Perhaps so - but he didn't say what he was doing.  He just made some
silly experiment that is very likely going to be a poor indicator of
actual results since he isn't really modeling his process.

>But in the future, where massively parallel
>systems will be increasingly used, it may be that the
>event.process() will be done in parallel and take no time
>compared to the priorityQueue.Insert().

Ah - but you may well find that which OS can create threads or events
faster may not predict the actual results of a real-world application.
You may also find that the areas of code which most need optimizing
for each OS may differ.

David LeBlanc           |Why would you want to have your desktop user,

                        |minicomputer-class computing environment?
                        |Scott McNealy

 
 
 

Post by Jim Fros » Sun, 20 Jul 1997 04:00:00



> I am writing a server that spawns a large number of threads to handle
> events. Thus, thread creation performance is very important. I have
> run some tests on two identical PCs. One PC runs NT 4.0 SP3 and the
> other runs Red Hat Linux 4.2. The test results show that Linux threads
> are much more efficient than NT threads.

That's true, however the appropriate approach to this problem is to
create a pool of threads and reuse them; this eliminates the creation
overhead.

For in-process synchronization you should use critical sections in NT,
rather than mutexes.  They have substantially better performance.  NT
mutexes are very expensive and should only be used when cross-process
synchronization is desired.
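
One way to follow this advice in a portable thread package is a thin wrapper that picks a critical section on Win32 and a Pthreads mutex elsewhere. The sketch below is an assumption about how such a wrapper might look; the `fast_lock_*` names are invented for illustration. The Win32 calls used (`InitializeCriticalSection`, `EnterCriticalSection`, `LeaveCriticalSection`, `DeleteCriticalSection`) are the standard critical section functions.

```c
#include <assert.h>
#ifdef _WIN32
#include <windows.h>
typedef CRITICAL_SECTION fast_lock;   /* in-process only */
#else
#include <pthread.h>
typedef pthread_mutex_t fast_lock;
#endif

/* Initialize an in-process lock. Returns 0 on success. */
int fast_lock_init(fast_lock *l)
{
#ifdef _WIN32
    InitializeCriticalSection(l);
    return 0;
#else
    return pthread_mutex_init(l, NULL);
#endif
}

void fast_lock_acquire(fast_lock *l)
{
#ifdef _WIN32
    EnterCriticalSection(l);
#else
    pthread_mutex_lock(l);
#endif
}

void fast_lock_release(fast_lock *l)
{
#ifdef _WIN32
    LeaveCriticalSection(l);
#else
    pthread_mutex_unlock(l);
#endif
}

void fast_lock_destroy(fast_lock *l)
{
#ifdef _WIN32
    DeleteCriticalSection(l);
#else
    pthread_mutex_destroy(l);
#endif
}

/* Example: protect a counter with the wrapper; returns the final count. */
long fast_lock_demo(long n)
{
    fast_lock l;
    long counter = 0;

    assert(n >= 0);
    if (fast_lock_init(&l) != 0)
        return -1;
    for (long i = 0; i < n; i++) {
        fast_lock_acquire(&l);
        counter++;
        fast_lock_release(&l);
    }
    fast_lock_destroy(&l);
    return counter;
}
```

A wrapper like this cannot synchronize between processes on Win32; for that case a mutex created with CreateMutex is still needed.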

Hope this helps,

jim frost

 
 
 

Post by tawei » Sun, 20 Jul 1997 04:00:00


> >  I am writing a server that spawns a large number of threads to handle
> >events.

> This is completely ridiculous.  Creating 1 million threads, events,
> whatever is quite silly.  

I was not clear when I posted the original message. My program
uses a large number of threads, but nowhere close to 1
million. However, since I have implemented a small thread package
that wraps the native thread implementation of the OS, I thought it
might be interesting to benchmark how fast the OS can create threads
and mutexes, and how fast mutex locking/unlocking is.

> Give us some idea of what you are trying to accomplish, and maybe
> someone can give you some decent advice.

I have implemented a thread pool on top of my thread package that
pre-spawns and reuses threads. Not all threads are created/destroyed on
the fly.

However, NT mutex locking/unlocking is still more expensive than
Linux's. I am wondering if there is an NT native API for threading.

--
Ta-Wei "David" Li

 
 
 

Post by Ingo Molna » Mon, 21 Jul 1997 04:00:00



: >  I am writing a server that spawns a large number of threads to handle
: >events.

: This is completely ridiculous.  Creating 1 million threads, events,
: whatever is quite silly.  What is much more important is what you are
: trying to do.  You aren't going to run 1M threads anyway.  You are
: going to _do_ something.  Your bottlenecks are really much more likely
: to be in what you are doing than in the system calls.

: Give us some idea of what you are trying to accomplish, and maybe
: someone can give you some decent advice.

come on ... all the system services he benchmarked are vital parts
of _any_ OS. Or are you trying to say that OS speed doesn't matter?

-- mingo

 
 
 

Post by Ingo Molna » Mon, 21 Jul 1997 04:00:00



: > I am writing a server that spawns a large number of threads to handle
: > events. Thus, thread creation performance is very important. I have
: > run some tests on two identical PCs. One PC runs NT 4.0 SP3 and the
: > other runs Red Hat Linux 4.2. The test results show that Linux threads
: > are much more efficient than NT threads.

: That's true, however the appropriate approach to this problem is to
: create a pool of threads and reuse them; this eliminates the creation
: overhead.

that might be a good workaround, but this shows the real problem:
thread creation under NT is slow (and not only thread creation, but all
the other system services that were benchmarked too). There is a lot
of useless bloat and legacy stuff (ACLs for _everything_, overly
wide APIs, link-time HAL), which slows things down.

Pre-creating threads complicates the application and introduces new
sources of bugs. [this isn't only true for NT, but also for older Unices
without threads, which have to do preforking ... e.g. Apache under
Linux does preforking too ... an ugly workaround for a speed
problem].

You can always find a workaround if a particular system service is
slow. In the worst case you can reimplement the whole OS, to make
it fast enough ;) Technically doable, but this isn't the point, I guess
...

and don't forget the other system services he mentioned, how do
you work around the slowness there? Thread creation is just the
simplest thing actually ...

-- mingo

ps. Linux takes a slightly different approach: if there is a
'workaround' that is faster, that workaround becomes the
implementation as fast as possible ;)

 
 
 

Post by David LeBla » Mon, 21 Jul 1997 04:00:00




>: This is completely ridiculous.  Creating 1 million threads, events,
>: whatever is quite silly.  What is much more important is what you are
>: trying to do.  
>come on ... all the system services he benchmarked are vital parts
>of _any_ OS. Or are you trying to say that OS speed doesn't matter?

I'm saying that if you propose to measure something, measure something
that corresponds to what you are actually _doing_.  For example, I can
measure the fuel efficiency of cars at 20MPH in 5th gear and come up
with interesting results which won't apply to anything in the real
world.

OS speed creating threads very well may not matter at all. For
example, my app spawns off 128 threads on initialization and then
doesn't create any more.  It is a one-time hit that might cost me 0.5
seconds from an overall run that could last from 30 minutes to a day.
If I can reduce that to 0.125 seconds, it doesn't look like a big win
to me.

The point is to put together your app, then profile it.  Find which
calls are most expensive, then optimize.  You may well have to
optimize different areas to get the best performance on different OSes
- you could find that OS X does some function very easily, but that OS
Y bogs down.  Different OSes reflect different design decisions, and you
want to play to their strengths and minimize their weaknesses.  You
may also find that something odd happened - for instance, it is
conceivable that the compiler optimized the Linux call that calls
clone() and does nothing right out of the code, whereas NT was still
going through all the thread setup.  As soon as you make that thread
actually _do_ something, the picture could change drastically.

David LeBlanc           |Why would you want to have your desktop user,

                        |minicomputer-class computing environment?
                        |Scott McNealy

 
 
 

Post by Timothy Watso » Mon, 21 Jul 1997 04:00:00



> conceivable that the compiler optimized the Linux call that calls
> clone() and does nothing right out of the code, whereas NT was still

That would really speed it up!! :)

--
________________________________________________________________________
T    i    m    o    t    h    y              W    a    t    s    o    n

  __/| Something there is that doesn't love a wall, that wants it down

 
 
 

Post by David LeBla » Mon, 21 Jul 1997 04:00:00




>> conceivable that the compiler optimized the Linux call that calls
>> clone() and does nothing right out of the code, whereas NT was still
>That would really speed it up!! :)

It crossed my mind that it could be doing exactly that - it is all
copy-on-write, and if there isn't anything being done, no copy, no
nothing.

David LeBlanc           |Why would you want to have your desktop user,

                        |minicomputer-class computing environment?
                        |Scott McNealy

 
 
 

Post by Ingo Molna » Tue, 22 Jul 1997 04:00:00





: >> conceivable that the compiler optimized the Linux call that calls
: >> clone() and does nothing right out of the code, whereas NT was still
:  
: >That would really speed it up!! :)

: It crossed my mind that it could be doing exactly that - it is all
: copy-on-write, and if there isn't anything being done, no copy, no
: nothing.

nah, don't be silly, clone() has serious side effects, GCC doesn't
optimize it away ...

yes, thread creation under Linux is very fast.

-- mingo

 
 
 

Post by tawei » Tue, 22 Jul 1997 04:00:00



> OS speed creating threads very well may not matter at all.

I agree that benchmarking thread creation time isn't very meaningful
for real-life applications. A thread pool can easily overcome the
slowness of thread creation.

However, the benchmark of mutex locking/unlocking has some real
meaning. NT's mutex locking/unlocking is 4 times slower than
Linux's. For that matter, I actually profiled 4 OSes with the same test
program, and NT came out the slowest. I think this is a benchmark
that is relevant to real-life applications.
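
As a rough illustration of what such a lock/unlock test measures, here is a sketch using a Pthreads mutex. It is not the original test program; the name `time_mutex_ops` is made up, and it assumes a single thread with no contention, so it only exercises the uncontended path.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Lock and unlock one uncontended mutex `count` times.
 * Returns elapsed wall-clock milliseconds, or -1.0 on failure. */
double time_mutex_ops(long count)
{
    pthread_mutex_t m;
    struct timespec start, end;

    assert(count >= 0);
    if (pthread_mutex_init(&m, NULL) != 0)
        return -1.0;

    int ok = clock_gettime(CLOCK_MONOTONIC, &start) == 0;
    for (long i = 0; i < count; i++) {
        pthread_mutex_lock(&m);     /* no other thread competes here */
        pthread_mutex_unlock(&m);
    }
    ok = ok && clock_gettime(CLOCK_MONOTONIC, &end) == 0;

    pthread_mutex_destroy(&m);
    if (!ok)
        return -1.0;
    return (end.tv_sec - start.tv_sec) * 1000.0
         + (end.tv_nsec - start.tv_nsec) / 1e6;
}
```

Dividing the result by `count` gives a per-operation cost. A server with many threads contending for the same lock may behave quite differently, so a contended variant would be worth measuring as well.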

I am still curious about why NT's thread creation is so much slower
than on other OSes. Again, here is the thread benchmark on 4 different OSes.

  NeXTSTEP 3.3 with Mach Cthreads on Pentium 166:
  time: 124344.00 ms for 1000000 threads (    0.12 ms per thread)

  Linux with LinuxThreads 0.6 on Pentium Pro 150:
  time: 379684.00 ms for 1000000 threads (    0.38 ms per thread)

  Solaris with UI threads on Sparc Ultra 1 with 150 MHz UltraSPARC:
  time: 423948.00 ms for 1000000 threads (    0.42 ms per thread)

  NT with Win32 threads on Pentium Pro 150:
  time: 750234.00 ms for 1000000 threads (    0.75 ms per thread)

--
Ta-Wei "David" Li

 
 
 

Post by David LeBla » Wed, 23 Jul 1997 04:00:00



>I have implemented a thread pool on top of my thread package that
>pre-spawns and reuses threads. Not all threads are created/destroyed on
>the fly.

That's a good decision, regardless of OS.

>However, NT mutex locking/unlocking is still more expensive than
>Linux's. I am wondering if there is an NT native API for threading.

Unless you need to lock and unlock across processes (which it sounds
like you don't), use critical sections instead.  Much lower overhead.

Look up CreateThread().  If all of this was done from some library,
you could be taking severe hits due to a bad library.

David LeBlanc           |Why would you want to have your desktop user,

                        |minicomputer-class computing environment?
                        |Scott McNealy