async poll for 2.5

Post by Davide Libenzi » Wed, 16 Oct 2002 21:20:05





> > Something like this might work :

> > int sys_epoll_create(int maxfds);
> > void sys_epoll_close(int epd);
> > int sys_epoll_wait(int epd, struct pollfd **pevts, int timeout);

> > where sys_epoll_wait() returns the number of events available, 0 on
> > timeout, and -1 on error.

> There's no reason to make epoll_wait a new syscall -- poll events can
> easily be returned via the aio_complete mechanism (with the existing
> aio_poll experiment as a possible means for doing so).

Ben, one of the reasons for /dev/epoll's speed is how it returns events
and how it collapses them. A memory-mapped array is divided in two, and
while the user consumes events in one half, the kernel fills the other.
The next wait() switches the pointers. There is no copy from kernel to
user space. Doing:

int sys_epoll_wait(int epd, struct pollfd **pevts, int timeout);

the only data the kernel has to copy to user space is the 4 (or 8) bytes
for the "pevts" pointer.

- Davide


async poll for 2.5

Post by Shailabh Nagar » Wed, 16 Oct 2002 21:20:08




>>Something like this might work :

>>int sys_epoll_create(int maxfds);
>>void sys_epoll_close(int epd);
>>int sys_epoll_wait(int epd, struct pollfd **pevts, int timeout);

>>where sys_epoll_wait() returns the number of events available, 0 on
>>timeout, and -1 on error.

> There's no reason to make epoll_wait a new syscall -- poll events can
> easily be returned via the aio_complete mechanism (with the existing
> aio_poll experiment as a possible means for doing so).

So a user would set up an ioctx and use io_getevents to retrieve events on
an interest set of fds created and manipulated through the new system calls?
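
Something along these lines, I imagine (the IOCB_CMD_POLL opcode and the
aio_buf event-mask encoding below are assumptions about the experimental
aio_poll code, not a settled ABI):

#include <poll.h>
#include <string.h>
#include <time.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/aio_abi.h>

static long io_setup(unsigned nr, aio_context_t *ctx)
{
        return syscall(__NR_io_setup, nr, ctx);
}
static long io_submit(aio_context_t ctx, long n, struct iocb **iocbs)
{
        return syscall(__NR_io_submit, ctx, n, iocbs);
}
static long io_getevents(aio_context_t ctx, long min, long max,
                         struct io_event *ev, struct timespec *ts)
{
        return syscall(__NR_io_getevents, ctx, min, max, ev, ts);
}

int wait_for_pollin(int fd)
{
        aio_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        int ret = -1;

        if (io_setup(64, &ctx) < 0)
                return -1;

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_lio_opcode = IOCB_CMD_POLL;  /* experimental aio_poll opcode */
        cb.aio_buf = POLLIN;                /* requested events (assumed) */

        /* express interest, then reap the completion; ev.res would carry
         * the returned revents mask */
        if (io_submit(ctx, 1, cbs) == 1 &&
            io_getevents(ctx, 1, 1, &ev, NULL) == 1)
                ret = (int)ev.res;

        syscall(__NR_io_destroy, ctx);
        return ret;
}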

-- Shailabh


async poll for 2.5

Post by Benjamin LaHaise » Wed, 16 Oct 2002 21:20:13



> Ben, one of the reasons for /dev/epoll's speed is how it returns events
> and how it collapses them. A memory-mapped array is divided in two, and
> while the user consumes events in one half, the kernel fills the other.
> The next wait() switches the pointers. There is no copy from kernel to
> user space. Doing:

> int sys_epoll_wait(int epd, struct pollfd **pevts, int timeout);

> the only data the kernel has to copy to user space is the 4 (or 8) bytes
> for the "pevts" pointer.

Erm, the aio interface has support for the event ringbuffer being accessed
by userspace (it lives in user memory and the kernel acts as a writer, with
userspace as a reader), that's one of its advantages -- completion events
are directly accessible from userspace after being written to by an
interrupt.  Ideally this is to be wrapped in a vsyscall, but we don't have
support for that yet on x86, although much of the code written for x86-64
should be reusable.
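
Roughly, a user-level reader of the ring would look like the sketch below.
The struct mirrors the layout in fs/aio.c (it is not an exported ABI), and
io_setup() returning the user address of the ring as the context value is
an implementation detail, so take this as a sketch of the mechanism only:

#include <linux/aio_abi.h>

#define AIO_RING_MAGIC  0xa10a10a1

struct aio_ring {
        unsigned        id;
        unsigned        nr;              /* capacity, in io_events */
        unsigned        head;            /* consumer index */
        unsigned        tail;            /* producer index, written by kernel */
        unsigned        magic;
        unsigned        compat_features;
        unsigned        incompat_features;
        unsigned        header_length;   /* size of this header */
        struct io_event events[0];
};

/* Peek at the next completion without entering the kernel; returns 1 and
 * fills *out if an event is pending, 0 if the ring is empty, -1 if the
 * mapping doesn't look like a ring (fall back to sys_io_getevents). */
static int aio_ring_peek(aio_context_t ctx, struct io_event *out)
{
        struct aio_ring *ring = (struct aio_ring *)ctx;

        if (ring->magic != AIO_RING_MAGIC)
                return -1;
        if (ring->head == ring->tail)
                return 0;
        *out = ring->events[ring->head % ring->nr];
        return 1;
}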

                -ben

async poll for 2.5

Post by Davide Libenzi » Wed, 16 Oct 2002 21:30:16




> > Ben, one of the reasons for /dev/epoll's speed is how it returns events
> > and how it collapses them. A memory-mapped array is divided in two, and
> > while the user consumes events in one half, the kernel fills the other.
> > The next wait() switches the pointers. There is no copy from kernel to
> > user space. Doing:

> > int sys_epoll_wait(int epd, struct pollfd **pevts, int timeout);

> > the only data the kernel has to copy to user space is the 4 (or 8) bytes
> > for the "pevts" pointer.

> Erm, the aio interface has support for the event ringbuffer being accessed
> by userspace (it lives in user memory and the kernel acts as a writer, with
> userspace as a reader), that's one of its advantages -- completion events
> are directly accessible from userspace after being written to by an
> interrupt.  Ideally this is to be wrapped in a vsyscall, but we don't have
> support for that yet on x86, although much of the code written for x86-64
> should be reusable.

In general I would like to have a "common" interface to retrieve IO
events, but IMHO the two solutions should be benchmarked before adopting
one or the other.

- Davide


async poll for 2.5

Post by Dan Kegel » Wed, 16 Oct 2002 21:30:22





> > > Ben, one of the reasons for /dev/epoll's speed is how it returns events
> > > and how it collapses them. A memory-mapped array is divided in two, and
> > > while the user consumes events in one half, the kernel fills the other.
> > > The next wait() switches the pointers. There is no copy from kernel to
> > > user space. Doing:

> > > int sys_epoll_wait(int epd, struct pollfd **pevts, int timeout);

> > > the only data the kernel has to copy to user space is the 4 (or 8) bytes
> > > for the "pevts" pointer.

> > Erm, the aio interface has support for the event ringbuffer being accessed
> > by userspace (it lives in user memory and the kernel acts as a writer, with
> > userspace as a reader), that's one of its advantages -- completion events
> > are directly accessible from userspace after being written to by an
> > interrupt.  Ideally this is to be wrapped in a vsyscall, but we don't have
> > support for that yet on x86, although much of the code written for x86-64
> > should be reusable.

> In general I would like to have a "common" interface to retrieve IO
> events, but IMHO the two solutions should be benchmarked before adopting
> one or the other.

Seems like /dev/epoll uses a double-buffering scheme rather than
a ring buffer, and this is not just a trivial difference; it's
related to how redundant events are collapsed, right?
- Dan

async poll for 2.5

Post by Davide Libenzi » Wed, 16 Oct 2002 21:50:12






> > > > Ben, one of the reasons for /dev/epoll's speed is how it returns events
> > > > and how it collapses them. A memory-mapped array is divided in two, and
> > > > while the user consumes events in one half, the kernel fills the other.
> > > > The next wait() switches the pointers. There is no copy from kernel to
> > > > user space. Doing:

> > > > int sys_epoll_wait(int epd, struct pollfd **pevts, int timeout);

> > > > the only data the kernel has to copy to user space is the 4 (or 8) bytes
> > > > for the "pevts" pointer.

> > > Erm, the aio interface has support for the event ringbuffer being accessed
> > > by userspace (it lives in user memory and the kernel acts as a writer, with
> > > userspace as a reader), that's one of its advantages -- completion events
> > > are directly accessible from userspace after being written to by an
> > > interrupt.  Ideally this is to be wrapped in a vsyscall, but we don't have
> > > support for that yet on x86, although much of the code written for x86-64
> > > should be reusable.

> > In general I would like to have a "common" interface to retrieve IO
> > events, but IMHO the two solutions should be benchmarked before adopting
> > one or the other.

> Seems like /dev/epoll uses a double-buffering scheme rather than
> a ring buffer, and this is not just a trivial difference; it's
> related to how redundant events are collapsed, right?

It's just a matter of implementation. With a double buffer you clearly have
two distinct working zones: one is the user zone and the other is the
kernel zone. With a ring buffer you have to mark the area that is currently
handed to the user as the event set and prevent the kernel from overflowing
into that area. The double buffer is probably faster and easier to
implement ( for event collapsing ).
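
Just to illustrate the double-buffer idea (this is not the actual
/dev/epoll code, and fd_state here is only hypothetical per-fd
bookkeeping):

#include <poll.h>

struct fd_state {
        int     posted;                 /* did this fd already post an event
                                           into the current kernel half? */
};

struct evt_buf {
        struct pollfd   *half[2];       /* the two halves of the mmap'ed array */
        int             nevents[2];     /* events queued in each half */
        int             kidx;           /* index of the half the kernel fills */
};

/* kernel side: queue an event, collapsing duplicates from the same fd */
static void post_event(struct evt_buf *b, struct fd_state *st,
                       int fd, short revents)
{
        struct pollfd *p;

        if (st->posted)                 /* one slot per fd per pass */
                return;
        p = &b->half[b->kidx][b->nevents[b->kidx]++];
        p->fd = fd;
        p->revents = revents;
        st->posted = 1;
}

/* wait(): hand the filled half to the user, start filling the other one
 * (clearing "posted" for the fds in the handed-out half is elided) */
static int swap_halves(struct evt_buf *b, struct pollfd **pevts)
{
        int uidx = b->kidx;

        b->kidx ^= 1;
        b->nevents[b->kidx] = 0;
        *pevts = b->half[uidx];
        return b->nevents[uidx];
}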

- Davide


async poll for 2.5

Post by Benjamin LaHaise » Wed, 16 Oct 2002 22:50:07




> >Erm, the aio interface has support for the event ringbuffer being accessed
> >by userspace

> Making the event ringbuffer visible to userspace conflicts with being
> able to support event priorities.  To support event priorities, the
> ringbuffer would need to be replaced with some other data structure.

No it does not.  Event priorities are easily accomplished via separate
event queues for events of different priorities.  Most hardware implements
event priorities in this fashion.
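
A sketch of that scheme: one completion queue (aio context) per priority
level, scanned highest-first. The periodic rescan below stands in for a
proper blocking wait across all the queues and is only there to keep the
example short:

#include <time.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/aio_abi.h>

#define NPRIO 3         /* ctxs[0] is the highest priority */

static long io_getevents(aio_context_t ctx, long min, long max,
                         struct io_event *ev, struct timespec *ts)
{
        return syscall(__NR_io_getevents, ctx, min, max, ev, ts);
}

/* Return the priority of the event written to *ev. */
static int next_event(aio_context_t ctxs[NPRIO], struct io_event *ev)
{
        struct timespec zero = { 0, 0 };
        struct timespec tick = { 0, 1000000 };  /* 1 ms */
        int prio;

        for (;;) {
                /* drain in strict priority order, never blocking on a
                 * lower queue while a higher one has events pending */
                for (prio = 0; prio < NPRIO; prio++)
                        if (io_getevents(ctxs[prio], 0, 1, ev, &zero) == 1)
                                return prio;
                /* nothing anywhere: sleep briefly on the lowest queue,
                 * then rescan from the top */
                if (io_getevents(ctxs[NPRIO - 1], 1, 1, ev, &tick) == 1)
                        return NPRIO - 1;
        }
}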

                -ben
--
"Do you seek knowledge in time travel?"

async poll for 2.5

Post by Dan Kegel » Wed, 16 Oct 2002 23:00:20




> >If you look at how /dev/epoll does it, the collapsing of readiness
> >events is very elegant: a given fd is only allowed to report a change
> >in its state once per run through the event loop.

> And the way /dev/epoll does it has a key flaw: it only works with single
> threaded callers.  If you have multiple threads simultaneously trying to
> get events, then race conditions abound.

Delaying the "get next batch of readiness events" call as long as possible
increases the amount of event collapsing possible, which is important
because the network stack seems to generate lots of spurious events. Thus I
suspect you don't want multiple threads all calling the "get next batch of
events" entry point frequently. The most effective way to use something
like /dev/epoll in a multithreaded program might be to have one thread call
"get next batch of events", then divvy up the events across multiple
threads. Thus I disagree that the way /dev/epoll does it is flawed.

> I certainly hope /dev/epoll itself doesn't get accepted into the kernel,
> the interface is error prone.  Registering interest in a condition when
> the condition is already true should immediately generate an event, the
> epoll interface did not do that last time I saw it discussed.  This
> deficiency in the interface requires callers to include more complex
> workaround code and is likely to result in subtle, hard to diagnose bugs.

With queued readiness notification schemes like SIGIO and /dev/epoll,
it's safest to allow readiness notifications from the kernel to be wrong
sometimes; this happens at least in the case of accept readiness, and
possibly other places. Once you allow that, it's easy to handle the
condition you're worried about by generating a spurious readiness
indication when registering a fd. That's what I do in my wrapper library.

Also, because /dev/epoll and friends are single-shot notifications of
*changes* in readiness, there is little reason to register interest in
this or that event, and change that interest over time; instead, apps
should simply register interest in any event they might ever be interested
in. The number of extra events they then have to ignore is very small,
since if you take no action on a 'read ready' event, no more of those
events will occur.

So I pretty much disagree all around :-) but I do understand where you're
coming from. I used to feel similarly until I figured out the 'right' way
to use one-shot readiness notification systems (sometime last week :-)

- Dan

async poll for 2.5

Post by Davide Libenzi » Wed, 16 Oct 2002 23:20:16




> >If you look at how /dev/epoll does it, the collapsing of readiness
> >events is very elegant: a given fd is only allowed to report a change
> >in its state once per run through the event loop.

> And the way /dev/epoll does it has a key flaw: it only works with single
> threaded callers.  If you have multiple threads simultaneously trying to
> get events, then race conditions abound.

> >The ioctl that swaps
> >event buffers acts as a barrier between the two possible reports.

> Which assumes there are only single threaded callers.  To work correctly
> with multithreaded callers, there needs to be a more explicit mechanism
> for a caller to indicate it has completed handling an event and wants to
> rearm its interest.

> There are also additional interactions with cancellation.  How does the
> cancellation interface report and handle the case where an associated
> event is being delivered or handled by another thread?  What happens
> when that thread then tries to rearm the canceled interest?

Why would you need to use threads with a multiplex-like interface like
/dev/epoll? The reason for these interfaces ( poll(), select(), /dev/epoll,
/dev/poll ) is to be able to handle many file descriptors inside a _single_
task.

> I certainly hope /dev/epoll itself doesn't get accepted into the kernel,
> the interface is error prone.  Registering interest in a condition when
> the condition is already true should immediately generate an event, the
> epoll interface did not do that last time I saw it discussed.  This
> deficiency in the interface requires callers to include more complex
> workaround code and is likely to result in subtle, hard to diagnose bugs.

It works exactly like rt-signals, and all you have to do is change your
code from:

int myread(...) {

        if (wait(POLLIN))
                read();
}

to:

int myread(...) {

        while (read() < 0 && errno == EAGAIN)
                wait(POLLIN);
}

- Davide


async poll for 2.5

Post by Davide Libenzi » Thu, 17 Oct 2002 00:30:15




> >Why would you need to use threads with a multiplex-like interface like
> >/dev/epoll ?

> Because in some applications processing an event can cause the thread to
> block, potentially for a long time.  Multiple threads are needed to
> isolate that block to the context associated with the event.

I don't want this to become the latest pro/against-threads debate, but if
your processing thread blocks for a long time you should consider handling
the blocking condition asynchronously. If your processing thread blocks,
your application model should very likely be redesigned, or you should just
go with threads ( and then you do not need any multiplex interface ).

> >       while (read() < 0 && errno == EAGAIN)
> >               wait(POLLIN);

> Assuming registration of interest is inside wait(), this has a race.  If
> the file becomes readable between the time that read() returns and the
> time that wait() can register interest, the connection will hang.

Your assumption is wrong: the registration is done as soon as the fd is
"born" ( at socket() or accept(), for example ) and is typically removed
when it dies.
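
For example (assuming the write()-a-pollfd registration style of the
/dev/epoll patch; the exact details may differ from the posted code),
registration happens once, at accept() time, and the read path never has
to re-arm anything:

#include <errno.h>
#include <poll.h>
#include <unistd.h>
#include <sys/socket.h>

static int epfd;        /* fd obtained by opening /dev/epoll */

static int on_accept(int listen_fd)
{
        struct pollfd pfd;
        int fd = accept(listen_fd, NULL, NULL);

        if (fd < 0)
                return -1;
        pfd.fd = fd;                     /* register interest as soon as */
        pfd.events = POLLIN | POLLOUT;   /* the fd is born...            */
        pfd.revents = 0;
        write(epfd, &pfd, sizeof(pfd));
        return fd;
}

static void on_readable(int fd, char *buf, size_t len)
{
        /* ...and just consume until EAGAIN when an event arrives; the
         * next wait() will report the fd again only after new activity */
        for (;;) {
                ssize_t n = read(fd, buf, len);

                if (n < 0 && errno == EAGAIN)
                        break;
                if (n <= 0) {            /* error or EOF */
                        close(fd);       /* interest dies with the fd */
                        break;
                }
                /* process n bytes here */
        }
}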

- Davide
