Use __set_current_state() instead of current-> state = (take 1)

Use __set_current_state() instead of current-> state = (take 1)

Post by Perez-Gonzalez, Inak » Fri, 20 Dec 2002 02:50:12



Quote:> > In fs/*.c, many functions manually set the task state directly
> > accessing current->state, or with a macro, kind of
> > inconsistently. This patch changes all of them to use
> > [__]set_current_state().

> Some of these should probably be set_current_state().  I
> realize the current code is equivalent to __set_current_state()
> but it might as well be done right.

Agreed; however, I also don't want to introduce unnecessary
bloat, so I need to understand first what cases need it - it
is kind of hard for me. Care to let me know some gotchas?

Quote:> > -     current->state = TASK_INTERRUPTIBLE;
> > +     __set_current_state (TASK_INTERRUPTIBLE);
> >       add_wait_queue(fl_wait, &wait);
> >       if (timeout == 0)

> At least this guy should be set_current_state(), on quick glance.

Is that because it is called lockless? ... grunt, in some areas
is kind of very obscure to guess if it is or not.

Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my
fault]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
 
 

Use __set_current_state() instead of current-> state = (take 1)

Post by Robert Lov » Fri, 20 Dec 2002 03:10:07



> > Some of these should probably be set_current_state().  I
> > realize the current code is equivalent to __set_current_state()
> > but it might as well be done right.

> Agreed; however, I also don't want to introduce unnecessary
> bloat, so I need to understand first what cases need it - it
> is kind of hard for me. Care to let me know some gotchas?

set_current_state() includes a write barrier to ensure the setting of
the state is flushed before any further instructions.  This is to
provide a memory barrier for weak-ordering processors that can and will
rearrange the writes.

Not all processors like those made by your employer are strongly-ordered
:)

Anyhow, the problem is when the setting of the state is set to
TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.  In those cases, it may be
possible for the state to actually be flushed to memory AFTER the wake
up event has occurred.

So you have code like this:

        __set_current_state(TASK_INTERRUPTIBLE);
        add_waitqueue(...);

but the processor ends up storing the current->state write after the
task is added to the waitqueue.  In between those events, the wake up
event occurs and the task is woken up.  Then the write to current->state
is flushed.  So you end up with:

        add_waitqueue(...);
        interrupt or whatever occurs and wakes up the wait queue
        task is now woken up and running
        current->state = TASK_INTERRUPTIBLE /* BOOM! */

the task is marked sleeping now, but its wait event has already occurred
-- its in for a long sleep...

So to ensure the write is flushed then and there, and that the processor
(or compile, but we do not really worry about it because the call to
add_waitqueue will act as a compiler barrier, for example) does not move
the write to after the potential wake up, we use the write barrier.

In short, set_current_state() uses a memory barrier to ensure the state
setting occurs before any potential wake up activity, eliminating a
potential race and process deadlock.

It sounds complicated but its pretty simple... my explanation was
probably way too long - better than any I have seen here before on why
we have these things, at least.  Hope it helps.

If in doubt, just make all of them set_current_state().  That is the
standard API, and its at least safe.

        Robert Love

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
 
 

Use __set_current_state() instead of current-> state = (take 1)

Post by Perez-Gonzalez, Inak » Fri, 20 Dec 2002 04:00:14


Quote:> > Agreed; however, I also don't want to introduce unnecessary
> > bloat, so I need to understand first what cases need it - it
> > is kind of hard for me. Care to let me know some gotchas?

> set_current_state() includes a write barrier to ensure the setting of
> the state is flushed before any further instructions.  This is to
> provide a memory barrier for weak-ordering processors that
> can and will rearrange the writes.

It is what I was expecting, given xchg() being in the equation;
then it is reduced to a problem of guessing what can be the
delay for the flushing of the write ... beautiful ...

So, in that scenario, it means that:

- any setting before a return should be barriered unless we
  return to a place[s] known to be harmless

- any setting to TASK_RUNNING should be kind of safe

- exec.c:de_thread(),

        while (atomic_read(&oldsig->count) > count) {
                oldsig->group_exit_task = current;
-               current->state = TASK_UNINTERRUPTIBLE;
+               __set_current_state(TASK_UNINTERRUPTIBLE);
                spin_unlock_irq(&oldsig->siglock);

  Should be safe, as spin_unlock_irq() will do memory clobber
  on sti() [undependant from UP/SMP].

- namei.c:do_follow_link()

        if (current->total_link_count >= 40)
                goto loop;
        if (need_resched()) {
-               current->state = TASK_RUNNING;
+               __set_current_state(TASK_RUNNING);
                schedule();
        }
        err = security_inode_follow_link(dentry, nd);

  There is a function for it, cond_resched().

So, sending an updated patch right now

Quote:> Not all processors like those made by your employer are
> strongly-ordered :)

You'd be surprised how little about those*details I do know :]

Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my
fault]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

More majordomo info at  http://www.veryComputer.com/
Please read the FAQ at  http://www.veryComputer.com/

 
 
 

Use __set_current_state() instead of current-> state = (take 1)

Post by Robert Lov » Fri, 20 Dec 2002 04:10:08



> - any setting before a return should be barriered unless we
>   return to a place[s] known to be harmless

Not sure.

Quote:> - any setting to TASK_RUNNING should be kind of safe

Yes, I agree.  It may race, but with what?

Quote:> - exec.c:de_thread(),

>    while (atomic_read(&oldsig->count) > count) {
>            oldsig->group_exit_task = current;
> -          current->state = TASK_UNINTERRUPTIBLE;
> +          __set_current_state(TASK_UNINTERRUPTIBLE);
>            spin_unlock_irq(&oldsig->siglock);

>   Should be safe, as spin_unlock_irq() will do memory clobber
>   on sti() [undependant from UP/SMP].

The memory clobber only acts as a compiler barrier and insures the
compiler does not reorder the statements from the order in the C code.

What we need is a memory barrier to ensure the processor does not
reorder statements.  In other words, the processor can completely
rearrange loads and stores as they are issued to it, as long as it does
not break obvious data dependencies.  On a weakly ordered processor,
sans memory barrier, there is no telling when and where a store will
actually reach memory.  This is regardless of the order of the C code or
anything else.

That said, I do not know if the above example is a problem or not.  On a
very quick glance, the only issue I saw is the one I pointed out
earlier, and you fixed it.

        Robert Love

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
 
 

Use __set_current_state() instead of current-> state = (take 1)

Post by Perez-Gonzalez, Inak » Fri, 20 Dec 2002 04:50:05


Quote:> > - any setting before a return should be barriered unless we
> >   return to a place[s] known to be harmless

> Not sure.

Well, I think it makes kind of sense. If we know we are
returning to some place where nothing bad could happen
with reordering ... well, so be it, don't use __set_...()

Quote:> > - any setting to TASK_RUNNING should be kind of safe

> Yes, I agree.  It may race, but with what?

Hmmm ... I guess it could race, but it would not really
be that important [unless maybe certain constructs], as
the only setting it will go to would be [UN]INTERRUPTIBLE,
where it would be waken up afterwards, as it'd be out of the
rq, STOPPED, same thing, out of the rq, ZOMBIE or DEAD [gone].

I guess in this case, the only one I can come up with,
there would be an inconsistency between the actual state
of the task in task->state and the real one [or its position
in whatever rq or wq].

Could that have ugly consequences? I dunno ...

Quote:> > -             current->state = TASK_UNINTERRUPTIBLE;
> > +             __set_current_state(TASK_UNINTERRUPTIBLE);
> >               spin_unlock_irq(&oldsig->siglock);

> >   Should be safe, as spin_unlock_irq() will do memory clobber
> >   on sti() [undependant from UP/SMP].

> The memory clobber only acts as a compiler barrier and insures the
> compiler does not reorder the statements from the order in the C code.

while (1)
  printk ("Inaky, remember, clobber is a hint to gcc\n");

And that would now really work when CONFIG_X86_OOSTORE=1 is required
[after all, it is a write, so it'd need the equivalent of a wmb() or
xchg()].
Okay, changing that one too, just in case.

Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my
fault]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
 
 

Use __set_current_state() instead of current-> state = (take 1)

Post by Robert Lov » Fri, 20 Dec 2002 05:30:06



> Well, I think it makes kind of sense. If we know we are
> returning to some place where nothing bad could happen
> with reordering ... well, so be it, don't use __set_...()

Oh, I see.  If it returns to somewhere that immediately e.g. puts it on
a wait queue.  In that case, yep: need the barrier version.

Quote:> And that would now really work when CONFIG_X86_OOSTORE=1 is required
> [after all, it is a write, so it'd need the equivalent of a wmb() or
> xchg()].

Is this a hint that your employer may have an x86 chip in the future
with weak ordering? :)

Quote:> Okay, changing that one too, just in case.

Good - better safe than sorry.

        Robert Love

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
 
 

Use __set_current_state() instead of current-> state = (take 1)

Post by Perez-Gonzalez, Inak » Fri, 20 Dec 2002 21:10:11


Quote:> > And that would now really work when CONFIG_X86_OOSTORE=1 is required
> > [after all, it is a write, so it'd need the equivalent of a wmb() or
> > xchg()].

> Is this a hint that your employer may have an x86 chip in the future
> with weak ordering? :)

Hmmm ... taking into account that there are some many thousands of
employees in my company and that I know less than one hundred ...
and that they are all software ... well, I don't think I am into
the rumour mill :]

Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my
fault]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
 
 

Use __set_current_state() instead of current-> state = (take 1)

Post by Alan Co » Sat, 21 Dec 2002 04:40:05



> > > And that would now really work when CONFIG_X86_OOSTORE=1 is required
> > > [after all, it is a write, so it'd need the equivalent of a wmb() or
> > > xchg()].

> > Is this a hint that your employer may have an x86 chip in the future
> > with weak ordering? :)

> Hmmm ... taking into account that there are some many thousands of
> employees in my company and that I know less than one hundred ...
> and that they are all software ... well, I don't think I am into
> the rumour mill :]

Also OOSTORE is there because other vendors already make weak store
order capable x86 processors. One example of this is the Winchip - where
turning off strict store ordering is worth 30% performance.

In addition you have to treat store ordering/locking carefully due to
the pentium pro store fencing errata. (Thats why our < PII kernel
generates lock movb to unlock when in theory the lock isnt needed).

Alan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
 
 

Use __set_current_state() instead of current-> state = (take 1)

Post by Oliver Xymoro » Sat, 21 Dec 2002 21:40:13



> In addition you have to treat store ordering/locking carefully due to
> the pentium pro store fencing errata. (Thats why our < PII kernel
> generates lock movb to unlock when in theory the lock isnt needed).

Except the SMP causality test app I wrote got success reports from every
PPro stepping level.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/