exit_free(), 2.5.31-A0

Post by Ingo Molnar » Thu, 15 Aug 2002 00:30:07



the attached patch implements a new syscall, exit_free():

        exit_free(error_code, addr, val);

this syscall is used as a performance optimization in glibc's threading
library.

when a child thread exits there is an ugly race condition that must be
solved: the freeing of the child stack. Old pthreads used to set up a
trampoline, jump to it, munmap() the child stack and call sys_exit(). It
hardly needs spelling out how much this hurts performance: the munmap(),
besides being slow for such a lightweight thing as thread-exit, also
flushes the TLB, possibly across CPUs. Locality of reference of thread
stacks is lost as well.

the new thread library solves this performance problem by introducing a
'thread stack cache' - a simple list of thread stacks (with some more
details to handle different stack sizes). The problem with the stack cache
is that there's a race condition: who releases the stack? We must not free
the thread stack until the very last instruction the current thread
executes, because a signal might arrive and its handler might run on an
already freed stack. Other threads might pick this stack and overwrite it
... Disabling signals upon thread-exit adds a syscall overhead, but it
still doesn't solve the fundamental problem of 'who frees the stack'. A
global semaphore for a trampoline stack would have to be introduced to be
able to free the stack safely - a messy, slow and unscalable solution.
Queueing the stack to a helper thread is equally messy.
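
For reference, the 'disable signals on exit' step mentioned above looks roughly like this in userspace. This is a minimal sketch, not glibc's code: mask_all_signals() and signal_is_blocked() are hypothetical helper names, and sigprocmask() is used for brevity (a threaded program would normally call pthread_sigmask()). It costs one extra syscall per thread exit and still leaves open who frees the stack afterwards.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper: block every blockable signal for the caller and
 * save the previous mask in *old. Returns 0 on success.
 * SIGKILL and SIGSTOP cannot be blocked, so they still get through. */
static int mask_all_signals(sigset_t *old)
{
    sigset_t all;

    sigfillset(&all);
    return sigprocmask(SIG_SETMASK, &all, old);
}

/* Hypothetical helper: 1 if signo is blocked in the current mask,
 * 0 if it is not, -1 on error. */
static int signal_is_blocked(int signo)
{
    sigset_t cur;

    /* A NULL 'set' leaves the mask unchanged and only reports it. */
    if (sigprocmask(SIG_SETMASK, NULL, &cur) != 0)
        return -1;
    return sigismember(&cur, signo);
}
```

After mask_all_signals() succeeds, signal_is_blocked(SIGUSR1) reports 1 until the saved mask is restored with sigprocmask(SIG_SETMASK, &old, NULL).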

again the kernel can give the threading library a helping hand to solve
this catch-22 problem, surprisingly easily in this case as well.  
exit_free() simply writes a user-provided word back to userspace. At that
point the user stack is not used anymore (nor will it ever be - sys_exit()
cannot fail), so freeing it is appropriate. The actual way glibc utilizes
exit_free() is that there's a "is this stack free" flag in the thread
stack control structure, and exit_free() sets this to '1'. So upon
thread-exit the thread frees the stack and puts it into the stack cache -
but other threads will skip over it because the 'free' flag is still 0.
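
The flag protocol described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: struct stack_slot, cache_put(), cache_get() and simulated_exit_free() are hypothetical names, the real glibc structures are not shown in this mail, and the kernel's final store is simulated by an ordinary function call.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NSLOTS 4

/* Hypothetical per-stack control structure; not glibc's real layout. */
struct stack_slot {
    atomic_ulong free;  /* 1 = reusable; 0 = owner may still be running on it */
    atomic_int cached;  /* 1 = queued in the stack cache */
};

static struct stack_slot cache[NSLOTS];  /* zero-initialized: nothing cached */

/* Exiting thread: queue its own stack but leave 'free' at 0, since it is
 * still executing on that stack. */
static void cache_put(struct stack_slot *s)
{
    atomic_store(&s->cached, 1);
}

/* Stand-in for the kernel side, put_user(val, addr) in sys_exit_free(),
 * which runs only after the thread has left its user stack for good. */
static void simulated_exit_free(atomic_ulong *addr, unsigned long val)
{
    atomic_store(addr, val);
}

/* Thread creation: claim a cached slot only if its 'free' flag is 1.
 * The compare-and-swap lets exactly one claimant win the slot. */
static struct stack_slot *cache_get(void)
{
    for (size_t i = 0; i < NSLOTS; i++) {
        unsigned long expected = 1;

        if (atomic_load(&cache[i].cached) &&
            atomic_compare_exchange_strong(&cache[i].free, &expected, 0UL)) {
            atomic_store(&cache[i].cached, 0);
            return &cache[i];
        }
    }
    return NULL;
}
```

Between cache_put() and the simulated kernel store, cache_get() skips the slot and returns NULL; once the flag is 1 the slot is handed out exactly once.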

with this syscall it was possible to implement single-syscall thread exit
in glibc. (well, actually not yet, the next patch i send will enable this
fully by solving the "who does the waitpid()?" problem.)

        Ingo

--- linux/arch/i386/kernel/entry.S.orig Tue Aug 13 17:13:30 2002

        .long sys_set_thread_area
        .long sys_get_thread_area
        .long sys_clone_startup /* 245 */
+       .long sys_exit_free

        .rept NR_syscalls-(.-sys_call_table)/4
                .long sys_ni_syscall
--- linux/include/asm-i386/unistd.h.orig        Tue Aug 13 17:13:00 2002

 #define __NR_set_thread_area   243
 #define __NR_get_thread_area   244
 #define __NR_clone_startup     245
+#define __NR_exit_free         246

 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */

--- linux/kernel/exit.c.orig    Tue Aug 13 17:12:06 2002

        do_exit((error_code&0xff)<<8);
 }

+asmlinkage long sys_exit_free(int error_code, unsigned long *addr, unsigned long val)
+{
+       put_user(val, addr);
+       do_exit((error_code&0xff)<<8);
+}
+
 asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru)
 {
        int flag, retval;
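
Both sys_exit() and the new syscall pass (error_code&0xff)<<8 to do_exit(). A small sketch of what that encoding means for a waiting parent on Linux: only the low 8 bits of the exit code survive, and they land where WEXITSTATUS() looks for them. encode_exit_code() is a hypothetical userspace helper that mirrors the kernel expression; it is not part of the patch.

```c
#include <sys/wait.h>

/* Hypothetical helper mirroring the expression handed to do_exit():
 * keep the low 8 bits of the code and shift them into bits 8..15 of
 * the wait status, leaving the low (signal) bits at 0. */
static int encode_exit_code(int error_code)
{
    return (error_code & 0xff) << 8;
}
```

For example, an exit code of 300 is truncated to 44 (300 & 0xff), and the resulting status reads as a normal exit with WEXITSTATUS() equal to 44.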

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

More majordomo info at  http://www.veryComputer.com/
Please read the FAQ at  http://www.veryComputer.com/

 
 
 


Post by Linus Torvalds » Thu, 15 Aug 2002 00:40:06



> the attached patch implements a new syscall, exit_free():

>    exit_free(error_code, addr, val);

> this syscall is used as a performance optimization in glibc's threading
> library.

This looks like a total glibc braindamage hack.

It may be small, but it's crap, unless you can explain to me why glibc
cannot just catch the death signal in the master thread and be done with
it (and do all maintenance in the master).

Too ugly to live.

                Linus


Post by Ingo Molnar » Thu, 15 Aug 2002 03:00:11



> It may be small, but it's crap, unless you can explain to me why glibc
> cannot just catch the death signal in the master thread and be done with
> it (and do all maintenance in the master).

we don't really want any signal overhead, and we also don't want any extra
context-switching to the 'master thread'. And there's no master thread
anymore either.

the pthreads API provides sensible ways to just get rid of a helper thread
without *any* handshaking or notification done after exit with any of the
other threads - the thread has finished its work and is gone forever.

the fundamental problem is getting rid of the stack atomically, it's a
catch-22. A thread can be interrupted by a signal on the last instruction
it executes, it can be ptrace debugged, etc. And something must notify
about completion once the stack is 100% unused.

(i'll add any other, userspace-only solution to the code if there's any
that has equivalent performance - i couldn't find any other solution so
far.)

        Ingo


Post by Linus Torvalds » Thu, 15 Aug 2002 03:00:19



> we don't really want any signal overhead, and we also don't want any extra
> context-switching to the 'master thread'. And there's no master thread
> anymore either.

That still doesn't make it any less crap: because any thread that exits
without calling the "magic exit-flag interface" will then silently be
lost, with no information left around anywhere.

The whole interface is bogus.

If you want to do this, you can do it at _clone_ time, by extending on the
notion of "when I die, tell the parent using signal X" and making that
notion be a more generic "when I die, do X", where "X" might include
updating some parent tables instead of sending a signal.

But the magic "exit_write()" has to die.

                        Linus


Post by Ingo Molnar » Thu, 15 Aug 2002 03:10:11



> > we don't really want any signal overhead, and we also don't want any extra
> > context-switching to the 'master thread'. And there's no master thread
> > anymore either.

> That still doesn't make it any less crap: because any thread that exits
> without calling the "magic exit-flag interface" will then silently be
> lost, with no information left around anywhere.

that should be a pretty rare occurrence: with the upcoming signals patch
any segmentation fault zaps all threads and does a proper (and
deadlock-free) multithreaded coredump. A sysadmin doing a kill(1) will get
all threads killed as well. The only possible way for an uncontrolled exit
is for the thread to call sys_exit() explicitly (which is not possible
without the glibc cleanup handlers being called), or for someone to send a
SIGKILL via sys_tkill().

but even in this rare and malicious case, whatever resources a thread has
are lost if there's an uncontrolled exit anyway. There's tons of other
stuff that glibc might have to clean up on exit - mutexes, malloc()s, etc.
Thread exit needs to be cooperative, no matter what. The stack cache does
not change this situation in the least.

        Ingo


Post by Ingo Molnar » Thu, 15 Aug 2002 03:20:04



> If you want to do this, you can do it at _clone_ time, by extending on
> the notion of "when I die, tell the parent using signal X" and making
> that notion be a more generic "when I die, do X", where "X" might include
> updating some parent tables instead of sending a signal.

> But the magic "exit_write()" has to die.

think about it - we have the *very same* problem in kernel-space, and we
had it for years. People wanted to get rid of parent notification in
helper processes for ages. A thread cannot free its own stack. We now can
do it only with very special care and atomicity. The same thing cannot be
done by user-space, because it has no 'atomic change and sys_exit()'
operation at its disposal. This capability is what the syscall provides -
perhaps it should be called 'exit_atomic()' instead?

(we got rid of all signal passing in the main fabric of pthreads - and
that's done rightfully so. Futexes are used for message passing and
eventing.)

        Ingo


Post by Ingo Molnar » Thu, 15 Aug 2002 03:20:06


I was actually surprised to see how much effort it takes on the glibc side
to solve this (admittedly conceptually hard) problem without any kernel
help - it's ugly and slow, and still not completely tight. By providing
this 'exit and free stack' capability we can help tremendously.

the thing that makes it special and hard is the completely shared VM.
There's just no way for a thread to 'get rid of itself' atomically and
also exit in the same round, without extensive locking and signal passing.
Linux actually has a very very fast clone()+exit() codepath, let's make it
possible for userspace to use it.

in essence the 'exit and send notification signal' thing now became a
simple word written into userspace. Should this be a more formal thing -
userspace mailboxes for the kernel to put events into? I think that might
be a bit overboard though.

        Ingo


Post by Linus Torvalds » Thu, 15 Aug 2002 03:20:06



> > That still doesn't make it any less crap: because any thread that exits
> > without calling the "magic exit-flag interface" will then silently be
> > lost, with no information left around anywhere.

> that should be a pretty rare occurrence: with the upcoming signals patch
> any segmentation fault zaps all threads and does a proper (and
> deadlock-free) multithreaded coredump.

That still doesn't change the fact that the interface is broken
_by_design_.

If the parent wants to get notified on child death, it should damn well
get notified on child death. Not "in case the child exits politely".

We don't depend on processes calling "exit()" to clean up all the stuff
they left behind. The VM gets cleaned up even for bad processes.

                Linus


Post by Linus Torvalds » Thu, 15 Aug 2002 03:30:09



> think about it - we have the *very same* problem in kernel-space, and we
> had it for years. People wanted to get rid of parent notification in
> helper processes for ages.

So add the capability to mark the child for proper exit semantics.

That's what I said: if you want to do this, you need to mark it at
_create_ time. Exactly so that the proper exit semantics can be done 100%
reliably, instead of just "sometimes". THAT is why _any_ interface that
depends on exit_xxxx() must die - because it is inherently broken for
accidental deaths, and does not leave the parent any way to recover
sanely.

                Linus


Post by Ingo Molnar » Thu, 15 Aug 2002 03:30:12



> If the parent wants to get notified on child death, it should damn well
> get notified on child death. Not "in case the child exits politely".

yes i agree. If the parent wants that, then it does not specify the
CLONE_DETACHED flag when creating the child thread. It is the parent that
specifies this flag and it has the freedom to decide whether it wants
signal based notification or not.

if CLONE_DETACHED is not specified upon creation then *no matter what the
child thread does* - both sys_exit() and sys_exit_free() notify the
parent. It's not a matter of politeness.

> We don't depend on processes calling "exit()" to clean up all the stuff
> they left behind. The VM gets cleaned up even for bad processes.

We'd be more than happy to do this cleanup in userspace, but how do you
free a stack which might as well be used by a debugger or a signal handler
right before executing the final "int $0x80" instruction?

should every signal handler start with code that tries to figure out
whether the stack is still valid (by calling gettid() and comparing it
with the TID written into a special offset on the stack)? Should the
exiting thread mask all signals before freeing the stack and calling
sys_exit()?

        Ingo


Post by Linus Torvalds » Thu, 15 Aug 2002 03:30:15



> I was actually surprised to see how much effort it takes on the glibc side
> to solve this (admittedly conceptually hard) problem without any kernel
> help - it's ugly and slow, and still not completely tight. By providing
> this 'exit and free stack' capability we can help tremendously.

Ingo, you're barking up the wrong tree.

I'm not against fixing it, but I'm very much against fixing it wrong.

I even told you how you can fix it right. You're arguing against the wrong
thing here.

                Linus


Post by Linus Torvalds » Thu, 15 Aug 2002 04:00:07



> So add the capability to mark the child for proper exit semantics.

Actually, this has nothing at all to do with exit().

The same thing comes up when you want to do an execve() (yes, I know
pthreads doesn't support a thread starting another process, but the fact
that pthreads is broken is no excuse for broken interfaces).

If the parent needs to be notified that the stack slot is no longer in
use, it needs to happen for execve() too, not just exit().

In fact, I'd say that this thing is tied in to "mm_release()", not
"exit()".

The fact that the child doesn't want to send a signal to the parent on
exit is a totally different matter, and should already be supported by
just giving a zero signal number.

                        Linus


Post by Ingo Molnar » Thu, 15 Aug 2002 04:20:09



> The fact that the child doesn't want to send a signal to the parent on
> exit is a totally different matter, and should already be supported by
> just giving a zero signal number.

exit signal 0 is already being used and relied on by kmod - i originally
implemented it that way. In that case the child thread becomes a zombie
until the parent exits, and then it gets reparented to init. I did not
want to break any existing semantics (no matter how broken they appeared
to me) thus i introduced CLONE_DETACHED. But thinking about it, 'a zombie
staying around indefinitely' is not a semantics that is worth carrying too
far? But in any case, if signal 0 is the preferred interface then i'm all
for it - this is not really a clone() property but an exit-signalling
property.

        Ingo


Post by Ingo Molnar » Thu, 15 Aug 2002 04:40:06



> The same thing comes up when you want to do an execve() (yes, I know
> pthreads doesn't support a thread starting another process, but the fact
> that pthreads is broken is no excuse for broken interfaces).

fixing this too would be very nice indeed.

> If the parent needs to be notified that the stack slot is no longer in
> use, it needs to happen for execve() too, not just exit().

> In fact, I'd say that this thing is tied in to "mm_release()", not
> "exit()".

yes.

from the practical POV right now we have a duality of APIs. exit() is a
way to exit the current thread and destroy it. execve() is a way to exit()
the current thread and bootstrap a completely new thread from scratch
- while carrying over some well-specified state into the new thread.

so for any threading library to handle execve() correctly, there needs to
be some way to specify an mm_release event. It's in essence the same
'exit' conceptual thing we do in both the sys_exit() and sys_execve()
case, but it's accessible via two external interfaces.

one solution would be a new syscall to set a 'VM exit notification' address
and value in the released VM. But since it would always come paired with
sys_exit() or sys_execve(), it would be nicer to have composite syscalls
as well - ie. exit_release_user_mm() and execve_release_user_mm(). I know
this is a pain to look at, but i don't have any better ideas right now. The
composite syscalls also have the advantage that no additional per-thread
field has to be used, since user-space can be notified right at the
beginning.

hm, maybe there's an idea: perhaps the most elegant way would be to handle
this at clone() time: if the CLONE_NOTIFY_MM_RELEASE flag is specified
then the top of the user stack address is taken as the notification
address. (or a new parameter can be used.) And the notification can as
well be an implicit 'set to 0' rule. [so it's basically a VM lock extended
to userspace.] The user-space stack's address is known at clone() time
already, nothing wants to change that address until exit() time.

mm_release() then sees this address set in current->, and notifies the
userspace VM of the release. No need for new syscalls, and *all* 'exit'
variants in the future will automatically have this capability, without
having to create clumsy composite syscalls.
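
A userspace model of the clone()-time idea above, as a minimal sketch. CLONE_NOTIFY_MM_RELEASE is the flag name proposed in this mail, but its numeric value here, struct task_model and the model_*() functions are made up for illustration; none of this is kernel code. The point it shows is that one release hook covers every way of leaving the old VM.

```c
#include <stddef.h>

/* Flag name from the proposal; the value is an arbitrary placeholder. */
#define CLONE_NOTIFY_MM_RELEASE 0x1UL

/* Hypothetical stand-in for the per-thread state the kernel would keep. */
struct task_model {
    unsigned long *release_addr;  /* NULL = no notification requested */
};

/* At clone() time: remember the notification address if the flag is set. */
static void model_clone(struct task_model *t, unsigned long flags,
                        unsigned long *notify_addr)
{
    t->release_addr = (flags & CLONE_NOTIFY_MM_RELEASE) ? notify_addr : NULL;
}

/* Single place where the old VM is let go: apply the implicit
 * 'set to 0' rule once, then forget the address. */
static void model_mm_release(struct task_model *t)
{
    if (t->release_addr != NULL) {
        *t->release_addr = 0;
        t->release_addr = NULL;
    }
}

/* Both exit paths funnel into the same hook, so neither needs its own
 * composite syscall. */
static void model_exit(struct task_model *t)   { model_mm_release(t); }
static void model_execve(struct task_model *t) { model_mm_release(t); }
```

With two tasks created with the flag, one leaving via model_exit() and the other via model_execve(), both notification words end up 0; a task created without the flag leaves its word untouched.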

> The fact that the child doesn't want to send a signal to the parent on
> exit is a totally different matter, and should already be supported by
> just giving a zero signal number.

yes.

        Ingo


Post by Linus Torvalds » Thu, 15 Aug 2002 04:50:04



> exit signal 0 is already being used and relied on by kmod - i originally
> implemented it that way. In that case the child thread becomes a zombie
> until the parent exits, and then it gets reparented to init. I did not
> want to break any existing semantics (no matter how broken they appeared
> to me) thus i introduced CLONE_DETACHED. But thinking about it, 'a zombie
> staying around indefinitely' is not a semantics that is worth carrying too
> far?

I think it makes more sense to say that since there was no notification of
the parent, we should just reparent at that point.

> But in any case, if signal 0 is the preferred interface then i'm all for
> it - this is not really a clone() property but an exit-signalling
> property.

Right. I think that it makes more sense to do it that way. Clearly the
parent doesn't care about the exit if the signal is zero.

                Linus
