xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Post by Andrea Arcangeli » Sat, 22 Feb 2003 00:00:27



On Thu, Feb 20, 2003 at 10:35:43AM -0800, Andrew Morton wrote:
> Marc-Christian Petersen <m....@wolk-project.de> wrote:

> > On Wednesday 19 February 2003 18:49, Andrea Arcangeli wrote:

> > Hi Andrea,

> > > Marcelo please include this:
> > >  http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/...
> > >1pre4aa3/10_inode-highmem-2
> > great. Thanks. Now let's hope Marcelo uses this :)

> > > other fixes should be included too but they don't apply cleanly yet
> > > unfortunately, I (or somebody else) should rediff them against mainline.
> > Can you tell me what specifically you mean? I'd do this.

> Andrea's VM patches, against 2.4.21-pre4 are at

>    http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/

> The applying order is in the series file.

Cool!

> These have been rediffed, and apply cleanly.  They have not been
> tested much though.

If they didn't reject in a non-obvious way they should work fine too ;)
If Marcelo merges them I'll verify everything when I update to his tree,
as I do regularly with everything else that rejects.

btw, I finished today fixing a deadlock condition in the xdr layer
triggered by nfs on highmem machines; here's the fix against 2.4.21pre4.
Please apply it now to pre4, or it will have to live in my tree with the
other hundred patches, as happened to some of the patches we're
discussing in this thread.

Explanation is very simple: you _can't_ kmap two times in the context of
a single task (especially if more than one task can run the same code at
the same time). I don't yet have the confirmation that this fixes the
deadlock though (it takes days to reproduce so it will take weeks to
confirm), but I can't see anything else wrong at the moment, and this
remains a genuine highmem deadlock that has to be fixed.  The fix is
optimal, no change unless you run out of kmaps and in turn you can
deadlock, i.e. all the light workloads won't be affected at all.

Note, this was developed on top of 2.4.21pre4aa3, so I had to rework it
to make it apply cleanly to mainline, the version I tested and included
in -aa is different, so this one is untested but if it compiles it will
work like a charm ;).

2.5.62 has the very same deadlock condition in xdr triggered by nfs too.
Andrew, if you're forward porting it yourself like with the filebacked
vma merging feature just let me know so we make sure not to duplicate
effort.

diff -urNp nfs-ref/include/asm-i386/highmem.h nfs/include/asm-i386/highmem.h
--- nfs-ref/include/asm-i386/highmem.h  2003-02-14 07:01:58.000000000 +0100
+++ nfs/include/asm-i386/highmem.h      2003-02-20 21:42:17.000000000 +0100
@@ -56,16 +56,19 @@ extern void kmap_init(void) __init;
 #define PKMAP_NR(virt)  ((virt-PKMAP_BASE) >> PAGE_SHIFT)
 #define PKMAP_ADDR(nr)  (PKMAP_BASE + ((nr) << PAGE_SHIFT))

-extern void * FASTCALL(kmap_high(struct page *page));
+extern void * FASTCALL(kmap_high(struct page *page, int nonblocking));
 extern void FASTCALL(kunmap_high(struct page *page));

-static inline void *kmap(struct page *page)
+#define kmap(page) __kmap(page, 0)
+#define kmap_nonblock(page) __kmap(page, 1)
+
+static inline void *__kmap(struct page *page, int nonblocking)
 {
        if (in_interrupt())
                out_of_line_bug();
        if (page < highmem_start_page)
                return page_address(page);
-       return kmap_high(page);
+       return kmap_high(page, nonblocking);
 }

 static inline void kunmap(struct page *page)
diff -urNp nfs-ref/include/linux/sunrpc/xdr.h nfs/include/linux/sunrpc/xdr.h
--- nfs-ref/include/linux/sunrpc/xdr.h  2003-02-19 01:12:41.000000000 +0100
+++ nfs/include/linux/sunrpc/xdr.h      2003-02-20 21:39:51.000000000 +0100
@@ -137,7 +137,7 @@ void xdr_zero_iovec(struct iovec *, int,
  * XDR buffer helper functions
  */
 extern int xdr_kmap(struct iovec *, struct xdr_buf *, unsigned int);
-extern void xdr_kunmap(struct xdr_buf *, unsigned int);
+extern void xdr_kunmap(struct xdr_buf *, unsigned int, int);
 extern void xdr_shift_buf(struct xdr_buf *, size_t);

 /*
diff -urNp nfs-ref/mm/highmem.c nfs/mm/highmem.c
--- nfs-ref/mm/highmem.c        2002-11-29 02:23:18.000000000 +0100
+++ nfs/mm/highmem.c    2003-02-20 21:45:27.000000000 +0100
@@ -77,7 +77,7 @@ static void flush_all_zero_pkmaps(void)
        flush_tlb_all();
 }

-static inline unsigned long map_new_virtual(struct page *page)
+static inline unsigned long map_new_virtual(struct page *page, int nonblocking)
 {
        unsigned long vaddr;
        int count;
@@ -96,6 +96,9 @@ start:
                if (--count)
                        continue;

+               if (nonblocking)
+                       return 0;
+
                /*
                 * Sleep for somebody else to unmap their entries
                 */
@@ -126,7 +129,7 @@ start:
        return vaddr;
 }

-void *kmap_high(struct page *page)
+void *kmap_high(struct page *page, int nonblocking)
 {
        unsigned long vaddr;

@@ -138,11 +141,15 @@ void *kmap_high(struct page *page)
         */
        spin_lock(&kmap_lock);
        vaddr = (unsigned long) page->virtual;
-       if (!vaddr)
-               vaddr = map_new_virtual(page);
+       if (!vaddr) {
+               vaddr = map_new_virtual(page, nonblocking);
+               if (!vaddr)
+                       goto out;
+       }
        pkmap_count[PKMAP_NR(vaddr)]++;
        if (pkmap_count[PKMAP_NR(vaddr)] < 2)
                BUG();
+ out:
        spin_unlock(&kmap_lock);
        return (void*) vaddr;
 }
diff -urNp nfs-ref/net/sunrpc/xdr.c nfs/net/sunrpc/xdr.c
--- nfs-ref/net/sunrpc/xdr.c    2002-11-29 02:23:23.000000000 +0100
+++ nfs/net/sunrpc/xdr.c        2003-02-20 21:39:51.000000000 +0100
@@ -180,7 +180,7 @@ int xdr_kmap(struct iovec *iov_base, str
 {
        struct iovec    *iov = iov_base;
        struct page     **ppage = xdr->pages;
-       unsigned int    len, pglen = xdr->page_len;
+       unsigned int    len, pglen = xdr->page_len, first_kmap;

        len = xdr->head[0].iov_len;
        if (base < len) {
@@ -203,9 +203,17 @@ int xdr_kmap(struct iovec *iov_base, str
                ppage += base >> PAGE_CACHE_SHIFT;
                base &= ~PAGE_CACHE_MASK;
        }
+       first_kmap = 1;
        do {
                len = PAGE_CACHE_SIZE;
-               iov->iov_base = kmap(*ppage);
+               if (first_kmap) {
+                       first_kmap = 0;
+                       iov->iov_base = kmap(*ppage);
+               } else {
+                       iov->iov_base = kmap_nonblock(*ppage);
+                       if (!iov->iov_base)
+                               goto out;
+               }
                if (base) {
                        iov->iov_base += base;
                        len -= base;
@@ -223,20 +231,23 @@ map_tail:
                iov->iov_base = (char *)xdr->tail[0].iov_base + base;
                iov++;
        }
+ out:
        return (iov - iov_base);
 }

-void xdr_kunmap(struct xdr_buf *xdr, unsigned int base)
+void xdr_kunmap(struct xdr_buf *xdr, unsigned int base, int niov)
 {
        struct page     **ppage = xdr->pages;
        unsigned int    pglen = xdr->page_len;

        if (!pglen)
                return;
-       if (base > xdr->head[0].iov_len)
+       if (base >= xdr->head[0].iov_len)
                base -= xdr->head[0].iov_len;
-       else
+       else {
+               niov--;
                base = 0;
+       }

        if (base >= pglen)
                return;
@@ -250,7 +261,11 @@ void xdr_kunmap(struct xdr_buf *xdr, uns
                 * we bump pglen here, and just subtract PAGE_CACHE_SIZE... */
                pglen += base & ~PAGE_CACHE_MASK;
        }
-       for (;;) {
+       /*
+        * In case we could only do a partial xdr_kmap, all remaining iovecs
+        * refer to pages. Otherwise we detect the end through pglen.
+        */
+       for (; niov; niov--) {
                flush_dcache_page(*ppage);
                kunmap(*ppage);
                if (pglen <= PAGE_CACHE_SIZE)
@@ -322,9 +337,22 @@ void
 xdr_shift_buf(struct xdr_buf *xdr, size_t len)
 {
        struct iovec iov[MAX_IOVEC];
-       unsigned int nr;
+       unsigned int nr, len_part, n, skip;
+
+       skip = 0;
+       do {
+
+               nr = xdr_kmap(iov, xdr, skip);
+
+               len_part = 0;
+               for (n = 0; n < nr; n++)
+                       len_part += iov[n].iov_len;
+
+               xdr_shift_iovec(iov, nr, len_part);
+
+               xdr_kunmap(xdr, skip, nr);

-       nr = xdr_kmap(iov, xdr, 0);
-       xdr_shift_iovec(iov, nr, len);
-       xdr_kunmap(xdr, 0);
+               skip += len_part;
+               len -= len_part;
+       } while (len);
 }
diff -urNp nfs-ref/net/sunrpc/xprt.c nfs/net/sunrpc/xprt.c
--- nfs-ref/net/sunrpc/xprt.c   2003-01-29 06:14:32.000000000 +0100
+++ nfs/net/sunrpc/xprt.c       2003-02-20 21:39:51.000000000 +0100
@@ -226,23 +226,34 @@ xprt_sendmsg(struct rpc_xprt *xprt, stru
        /* Dont repeat bytes */
        skip = req->rq_bytes_sent;
        slen = xdr->len - skip;
-       niov = xdr_kmap(niv, xdr, skip);
+       oldfs = get_fs(); set_fs(get_ds());
+       do {
+               unsigned int slen_part, n;

-       msg.msg_flags   = MSG_DONTWAIT|MSG_NOSIGNAL;
-       msg.msg_iov     = niv;
-       msg.msg_iovlen  = niov;
-       msg.msg_name    = (struct sockaddr *) &xprt->addr;
-       msg.msg_namelen = sizeof(xprt->addr);
-       msg.msg_control = NULL;
-       msg.msg_controllen = 0;
+               niov = xdr_kmap(niv, xdr, skip);

-       oldfs = get_fs(); set_fs(get_ds());
-       clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-       result = sock_sendmsg(sock, &msg, slen);
+               msg.msg_flags   = MSG_DONTWAIT|MSG_NOSIGNAL;
+               msg.msg_iov     = niv;
+               msg.msg_iovlen  = niov;
+               msg.msg_name    = (struct sockaddr *) &xprt->addr;
+               msg.msg_namelen = sizeof(xprt->addr);
+               msg.msg_control = NULL;
+               msg.msg_controllen = 0;
+
+               slen_part = 0;
+               for (n = 0; n < niov; n++)
+                       slen_part += niv[n].iov_len;
+
+               clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+               result = sock_sendmsg(sock, &msg, slen_part);
+
+               xdr_kunmap(xdr, skip, niov);
+
+               skip += slen_part;
+               slen -= slen_part;
+       } while (result >= 0 && slen);
        set_fs(oldfs);

-       xdr_kunmap(xdr, skip);
-
        dprintk("RPC:      xprt_sendmsg(%d) = %d\n", slen, result);

        if (result >= 0)

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Post by Jeff Garzik » Sat, 22 Feb 2003 01:10:07




>      > 2.5.62 has the very same deadlock condition in xdr triggered by
>      >        nfs too.
>      > Andrew, if you're forward porting it yourself like with the
>      > filebacked vma merging feature just let me know so we make sure
>      > not to duplicate effort.

> For 2.5.x we should rather fix MSG_MORE so that it actually works
> instead of messing with hacks to kmap().

> For 2.4.x, Hirokazu Takahashi had a patch which allowed for a safe
> kmap of > 1 page in one call. Appended here as an attachment FYI
> (Marcelo do *not* apply!).

One should also consider kmap_atomic...  (bcrl suggest)

        Jeff


xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Post by Trond Myklebust » Sat, 22 Feb 2003 01:20:09


     > One should also consider kmap_atomic...  (bcrl suggest)

The problem is that sendmsg() can sleep. kmap_atomic() isn't really
appropriate here.

Cheers,
  Trond

xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Post by Andreas Dilger » Sat, 22 Feb 2003 01:20:12



> Explanation is very simple: you _can't_ kmap two times in the context of
> a single task (especially if more than one task can run the same code at
> the same time). I don't yet have the confirmation that this fixes the
> deadlock though (it takes days to reproduce so it will take weeks to
> confirm), but I can't see anything else wrong at the moment, and this
> remains a genuine highmem deadlock that has to be fixed.  The fix is
> optimal, no change unless you run out of kmaps and in turn you can
> deadlock, i.e. all the light workloads won't be affected at all.

We had a similar problem in Lustre, where we have to kmap multiple pages
at once and hold them over a network RPC (which is doing zero-copy DMA
into multiple pages at once), and there is possibly a very heavy load
of kmaps because the client and the server can be on the same system.

What we did was set up a "kmap reservation", which used an atomic_dec()
+ wait_event() to reschedule the task until it could get enough kmaps
to satisfy the request without deadlocking (i.e. exceeding the kmap cap,
which we conservatively set at 3/4 of all kmap space).

A single "server" task could exceed the kmap cap by enough to satisfy the
maximum possible request size, so that a single system with both clients
and servers can always make forward progress even in the face of clients
trying to kmap more than the total amount of kmap space.

This works for us because we are the only consumer of huge amounts of kmaps
on our systems, but it would be nice to have a generic interface to do that
so that multiple apps don't deadlock against each other (e.g. NFS + Lustre).

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/


xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Post by Andrea Arcangeli » Sat, 22 Feb 2003 11:50:08




>      > 2.5.62 has the very same deadlock condition in xdr triggered by
>      >        nfs too.
>      > Andrew, if you're forward porting it yourself like with the
>      > filebacked vma merging feature just let me know so we make sure
>      > not to duplicate effort.

> For 2.5.x we should rather fix MSG_MORE so that it actually works
> instead of messing with hacks to kmap().

> For 2.4.x, Hirokazu Takahashi had a patch which allowed for a safe
> kmap of > 1 page in one call. Appended here as an attachment FYI
> (Marcelo do *not* apply!).

you can't do it this way: the number of kmaps available can be just 1,
and this way you can ask for 10000 in a row. Furthermore you want to be
able to use all the kmaps available; think of having 11 kmaps, with 10
constantly in use. I much prefer my approach, which is the most
fine-grained and scalable, and it doesn't risk deadlocking as a function
of the number of kmaps in the pool and the max reservation you make. I
considered the approach implemented in the patch you quoted and
discarded it for the reasons explained above.

Andrea

xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Post by Andrea Arcangeli » Sat, 22 Feb 2003 11:50:12




>      > One should also consider kmap_atomic...  (bcrl suggest)

> The problem is that sendmsg() can sleep. kmap_atomic() isn't really
> appropriate here.

100% correct.

Andrea

xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Post by Andrea Arcangeli » Sat, 22 Feb 2003 11:50:12





> >      > 2.5.62 has the very same deadlock condition in xdr triggered by
> >      >        nfs too.
> >      > Andrew, if you're forward porting it yourself like with the
> >      > filebacked vma merging feature just let me know so we make sure
> >      > not to duplicate effort.

> > For 2.5.x we should rather fix MSG_MORE so that it actually works
> > instead of messing with hacks to kmap().

> > For 2.4.x, Hirokazu Takahashi had a patch which allowed for a safe
> > kmap of > 1 page in one call. Appended here as an attachment FYI
> > (Marcelo do *not* apply!).

> One should also consider kmap_atomic...  (bcrl suggest)

impossible: either you submit page structures to the IP layer, or you
*must* have persistence. Depending on a sock_sendmsg that can't schedule
would be totally broken (or the preemptive thing is a joke). nfs client
O_DIRECT zerocopy would be a nice feature, but this is 2.4.

the only option would be atomic and at the same time persistent kmaps
in the process address space, which don't work well with threads... but
again this is 2.4, and we miss them even in 2.5 because of the troubles
they generate.

Andrea

xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Post by Andrea Arcangeli » Sat, 22 Feb 2003 11:50:17




> > Explanation is very simple: you _can't_ kmap two times in the context of
> > a single task (especially if more than one task can run the same code at
> > the same time). I don't yet have the confirmation that this fixes the
> > deadlock though (it takes days to reproduce so it will take weeks to
> > confirm), but I can't see anything else wrong at the moment, and this
> > remains a genuine highmem deadlock that has to be fixed.  The fix is
> > optimal, no change unless you run out of kmaps and in turn you can
> > deadlock, i.e. all the light workloads won't be affected at all.

> We had a similar problem in Lustre, where we have to kmap multiple pages
> at once and hold them over a network RPC (which is doing zero-copy DMA
> into multiple pages at once), and there is possibly a very heavy load
> of kmaps because the client and the server can be on the same system.

> What we did was set up a "kmap reservation", which used an atomic_dec()
> + wait_event() to reschedule the task until it could get enough kmaps
> to satisfy the request without deadlocking (i.e. exceeding the kmap cap
> which we conservatively set at 3/4 of all kmap space).

Your approach was fragile (every arch is free to give you just 1 kmap in
the pool and you still must not deadlock) and it's not capable of using
the whole kmap pool at the same time. The only robust and efficient way
to fix it is kmap_nonblock IMHO.

> A single "server" task could exceed the kmap cap by enough to satisfy the
> maximum possible request size, so that a single system with both clients
> and servers can always make forward progress even in the face of clients
> trying to kmap more than the total amount of kmap space.

> This works for us because we are the only consumer of huge amounts of kmaps
> on our systems, but it would be nice to have a generic interface to do that
> so that multiple apps don't deadlock against each other (e.g. NFS + Lustre).

This isn't the problem: if NFS weren't broken it couldn't deadlock
against Lustre even with your design (assuming you don't fall into the
two problems mentioned above). But still, your design is more fragile
and less scalable, especially for a generic implementation where you
don't know how many pages you'll reserve on average, and you don't know
how many kmap entries the architecture can provide. Of course with
kmap_nonblock you have to fall back to submitting single pages if it
fails; it's a bit more difficult, but it's more robust and optimized IMHO.

Andrea

xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Post by Andreas Dilger » Sat, 22 Feb 2003 21:50:08




> > What we did was set up a "kmap reservation", which used an atomic_dec()
> > + wait_event() to reschedule the task until it could get enough kmaps
> > to satisfy the request without deadlocking (i.e. exceeding the kmap cap
> > which we conservatively set at 3/4 of all kmap space).

> Your approach was fragile (every arch is free to give you just 1 kmap in
> the pool and you still must not deadlock) and it's not capable of using
> the whole kmap pool at the same time. the only robust and efficient way
> to fix it is the kmap_nonblock IMHO

So (says the person who only ever uses i386 and ia64), does an arch exist
which needs highmem/kmap, but only ever gives 1 kmap in the pool?

> > This works for us because we are the only consumer of huge amounts of kmaps
> > on our systems, but it would be nice to have a generic interface to do that
> > so that multiple apps don't deadlock against each other (e.g. NFS + Lustre).

> This isn't the problem, if NFS wouldn't be broken it couldn't deadlock
> against Lustre even with your design (assuming you don't fall in the two
> problems mentioned above). But still your design is more fragile and
> less scalable, especially for a generic implementation where you don't
> know how many pages you'll reserve in mean, and you don't know how many
> kmaps entries the architecture can provide to you. But of course with
> kmap_nonblock you'll have to fallback submitting single pages if it
> fails, it's a bit more difficult but it's more robust and optimized IMHO.

In our case, Lustre (well, Portals really, the underlying network
protocol) always knows in advance the number of pages it will need to
kmap, because the client needs to tell the server in advance how much
bulk data it is going to send.  This is required to be able to do RDMA.
It might be possible to have the server do the transfer in multiple
parts if kmap_nonblock() failed, but that is not how things are
currently set up, which is why we block in advance until we know we can
get enough pages.

This is very similar to ext3 journaling, which requests in advance the
maximum number of journal blocks it might need, and blocks until it can
get them all.

The only problem happens when other parts of the kernel start acquiring
multiple kmaps without using the same reservation/accounting system as us.
Each works fine in isolation, but in combination it fails.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/


xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Post by Andrea Arcangeli » Sat, 22 Feb 2003 21:50:13





> > > What we did was set up a "kmap reservation", which used an atomic_dec()
> > > + wait_event() to reschedule the task until it could get enough kmaps
> > > to satisfy the request without deadlocking (i.e. exceeding the kmap cap
> > > which we conservatively set at 3/4 of all kmap space).

> > Your approach was fragile (every arch is free to give you just 1 kmap in
> > the pool and you still must not deadlock) and it's not capable of using
> > the whole kmap pool at the same time. the only robust and efficient way
> > to fix it is the kmap_nonblock IMHO

> So (says the person who only ever uses i386 and ia64), does an arch exist
> which needs highmem/kmap, but only ever gives 1 kmap in the pool?

> > > This works for us because we are the only consumer of huge amounts of kmaps
> > > on our systems, but it would be nice to have a generic interface to do that
> > > so that multiple apps don't deadlock against each other (e.g. NFS + Lustre).

> > This isn't the problem, if NFS wouldn't be broken it couldn't deadlock
> > against Lustre even with your design (assuming you don't fall in the two
> > problems mentioned above). But still your design is more fragile and
> > less scalable, especially for a generic implementation where you don't
> > know how many pages you'll reserve in mean, and you don't know how many
> > kmaps entries the architecture can provide to you. But of course with
> > kmap_nonblock you'll have to fallback submitting single pages if it
> > fails, it's a bit more difficult but it's more robust and optimized IMHO.

> In our case, Lustre (well Portals really, the underlying network protocol)
> always knows in advance the number of pages that it will need to kmap
> because the client needs to tell the server in advance how much bulk data
> is going to send.  This is required for being able to do RDMA.  It might
> be possible to have the server do the transfer in multiple parts if
> kmap_nonblock() failed, but that is not how things are currently set up,
> which is why we block in advance until we know we can get enough pages.

> This is very similar to ext3 journaling, which requests in advance the
> maximum number of journal blocks it might need, and blocks until it can
> get them all.

> The only problem happens when other parts of the kernel start acquiring
> multiple kmaps without using the same reservation/accounting system as us.
> Each works fine in isolation, but in combination it fails.

no, if the other places are not buggy it won't fail, regardless of
whether they use your mechanism or kmap_nonblock. You don't have to use
your mechanism everywhere to make your mechanism work. For instance, you
will be fine with the kmap_nonblock fix in combination with your current
code. Not sure why you think otherwise.

I understand it may be simpler to do the full reservation; with ext3 you
don't even risk anything because you know how large the pool is. But I
think for these cases kmap_nonblock is superior, because the reservation
has an obvious dependency on the architecture and isn't able to make the
best use of the whole kmap pool (and here there's no transaction that
has to be committed all at once, so it's doable).  Still, in practice it
will work fine in combination with the other safe usages (like
kmap_nonblock) if you reserve few enough pages at a time.

Andrea

xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Post by Andrew Morton » Sat, 22 Feb 2003 23:00:24




>      > 2.5.62 has the very same deadlock condition in xdr triggered by
>      >        nfs too.
>      > Andrew, if you're forward porting it yourself like with the
>      > filebacked vma merging feature just let me know so we make sure
>      > not to duplicate effort.

> For 2.5.x we should rather fix MSG_MORE so that it actually works
> instead of messing with hacks to kmap().

Is the fixing of MSG_MORE likely to actually happen?

> For 2.4.x, Hirokazu Takahashi had a patch which allowed for a safe
> kmap of > 1 page in one call. Appended here as an attachment FYI
> (Marcelo do *not* apply!).

Andrea's patch is quite simple.  Although I wonder if this, in
xdr_kmap():

+               } else {
+                       iov->iov_base = kmap_nonblock(*ppage);
+                       if (!iov->iov_base)
+                               goto out;
+               }

should be skipping the map_tail thing?

xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Post by Trond Myklebust » Sat, 22 Feb 2003 23:40:16


    >> For 2.5.x we should rather fix MSG_MORE so that it actually
    >> works instead of messing with hacks to kmap().

     > Is the fixing of MSG_MORE likely to actually happen?

We had better try. The server/knfsd has already been converted to
sendpage + MSG_MORE 8-)

That won't work for 2.4.x though, since it doesn't have support for
sendpage over UDP.

Cheers,
  Trond

xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Post by David S. Miller » Sun, 23 Feb 2003 02:00:14





> >      > One should also consider kmap_atomic...  (bcrl suggest)

> > The problem is that sendmsg() can sleep. kmap_atomic() isn't really
> > appropriate here.

> 100% correct.

It actually depends upon whether you have sk->priority set
to GFP_ATOMIC or GFP_KERNEL.


xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Post by Andrea Arcangeli » Mon, 24 Feb 2003 17:40:10






> > >      > One should also consider kmap_atomic...  (bcrl suggest)

> > > The problem is that sendmsg() can sleep. kmap_atomic() isn't really
> > > appropriate here.

> > 100% correct.

> It actually depends upon whether you have sk->priority set
> to GFP_ATOMIC or GFP_KERNEL.

You must not disable preemption when entering sock_sendmsg, regardless
of sk->priority. Disabling preemption inside sock_sendmsg is way too
late, so even if you had such a preemption bug in sock_sendmsg it
wouldn't help; if anything, you would need to disable preemption in the
caller, before doing the kmap_atomic. And again, that is a preemption bug.

Not to mention you'd need to allocate a big pool of atomic kmaps to do
that, and this would eat hundreds of megs of virtual address space since
it's replicated per-cpu. This makes even less sense: the machines where
the highmem deadlock triggers eat the normal zone big time.

Really, the claim that it can be solved with atomic kmaps doesn't make
any sense to me, nor does the claim that sock_sendmsg will not schedule
if called with GFP_ATOMIC. Of course it must not schedule if it can be
called from an irq with priority=GFP_ATOMIC, but that isn't the case
we're discussing here: an irq implicitly disables preemption by design,
and calling sock_sendmsg from an irq isn't really desirable (even if
technically possible, maybe with priority=GFP_ATOMIC according to you)
because it will take some time.

Andrea

1. filesystem access slowing system to a crawl

Hi,

maybe you could help me out with a really weird problem we're having
with a NFS fileserver for a couple of webservers:

- Dual Xeon 2.2 GHz
- 6 GB RAM
- QLogic FCAL Host adapter with about 5.5 TB on a several RAIDs
- Debian "woody" w/Kernel 2.4.19

Running just "find /" (or ls -R or tar on a large directory) locally
slows the box down to absolute unresponsiveness - it takes minutes
to just run ps and kill the find process. During that time, kupdated
and kswapd gobble up all available CPU time.

The system performs great otherwise, so I've ruled out a hardware
problem. It can't be a load problem because during normal operation,
the system is more or less bored out of its mind (70-90% idle time).

I'm really at the end of my wits here :-(

Any help would be greatly appreciated!

TIA,
Thomas

