2 times faster rawio and several fixes (2.4.3aa3)

Post by Andrea Arcangeli » Sun, 08 Apr 2001 01:20:07



I merged some of SCT's fixes, fixed another couple of bugs myself, and then
boosted the code to run faster. There's still room for improvement, for
example by using a ring of iobufs to walk the pagetables and lock down the
pages for the next atomic I/O chunk while the I/O on the previous iobuf is
still in progress (before waiting synchronously), but even with these first
basic improvements it runs exactly 2 times faster than vanilla 2.4.3 on my
hardware.  NOTE: since I made the atomic I/O 512k, to stay in sync with the
max size of an I/O request and to take advantage of the large I/O requests,
the KIO_MAX_SECTORS-sized arrays have grown so much that they cannot live on
the stack anymore (it was a bad idea to put them on the stack in the first
place anyway), so for things like the buffer array I preallocate a helper
buffer in the kiovec structure.
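
To make the pipelining idea concrete, here is a rough sketch of that design
(NOT code from the patch: submit_kiobuf_async() is a made-up primitive,
since the stock brw_kiovec() waits for completion, and the surrounding
variables are assumed to be as in rw_raw_dev() in drivers/char/raw.c;
map_user_kiobuf()/kiobuf_wait_for_io()/unmap_kiobuf() are the usual 2.4
kiobuf calls):

        /*
         * Sketch only: overlap the pagetable walk and page lockdown for
         * chunk N+1 with the disk I/O of chunk N, using two kiobufs.
         */
        struct kiobuf *ring[2], *cur, *prev = NULL;
        int i = 0, err = 0;

        while (size > 0 && !err) {
                size_t chunk = size < (KIO_MAX_ATOMIC_IO << 10) ?
                               size : (KIO_MAX_ATOMIC_IO << 10);

                cur = ring[i ^= 1];
                err = map_user_kiobuf(rw, cur, vaddr, chunk); /* lock pages */
                if (!err)
                        submit_kiobuf_async(rw, cur, dev, blocknr); /* hypothetical */
                if (prev) {
                        kiobuf_wait_for_io(prev); /* chunk N-1 completes while */
                        unmap_kiobuf(prev);       /* chunk N was being mapped  */
                }
                prev = cur;
                vaddr += chunk;
                size -= chunk;
                blocknr += chunk >> sector_bits;
        }
        if (prev) {
                kiobuf_wait_for_io(prev);
                unmap_kiobuf(prev);
        }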

This should boost Oracle very significantly when the working set doesn't fit
in cache, because the rawio path should now be quite efficient (comparable to
regular I/O through the page cache).

2.4.3aa3 without rawio-1:

alpha:/home/andrea # time ./rawio-bench
Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m10.323s
user    0m0.002s
sys     0m1.248s
alpha:/home/andrea # time ./rawio-bench
Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m10.299s
user    0m0.002s
sys     0m1.247s
alpha:/home/andrea # time ./rawio-bench
Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m10.557s
user    0m0.004s
sys     0m1.267s
alpha:/home/andrea # time ./rawio-bench
Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m10.310s
user    0m0.003s
sys     0m1.282s
alpha:/home/andrea #

2.4.3aa3 with rawio-1:


Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m5.208s
user    0m0.001s
sys     0m1.162s

Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m5.233s
user    0m0.002s
sys     0m1.184s

Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m5.378s
user    0m0.002s
sys     0m1.213s

Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m5.258s
user    0m0.001s
sys     0m1.183s

Original patch is here:

        ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2...

however, to apply it cleanly on top of lvm you first need to apply the lvm
patches in the 2.4.3aa3 directory, to upgrade to 0.9.1 beta6 (btw, many
thanks to the Sistina folks for going back to IOP 10 as suggested a few
weeks ago! :)

I also ported the patch to vanilla 2.4.3 for inclusion (that version is
untested, but the only rejects were in lvm-snap.c and they were obvious
enough not to require testing); lvm people, please look at the other patch,
which will apply cleanly to your CVS tree:

        ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2...

Andrea

2 times faster rawio and several fixes (2.4.3aa3)

Post by Andrea Arcangeli » Sun, 08 Apr 2001 01:50:08



> 2.4.3aa3 with rawio-1:


> Opening /dev/raw1
> Allocating 50MB of memory
> Reading from /dev/raw1
> Writing data to /dev/raw1

> real    0m5.208s
> user    0m0.001s
> sys     0m1.162s

> Opening /dev/raw1
> Allocating 50MB of memory
> Reading from /dev/raw1
> Writing data to /dev/raw1

> real    0m5.233s
> user    0m0.002s
> sys     0m1.184s

> Opening /dev/raw1
> Allocating 50MB of memory
> Reading from /dev/raw1
> Writing data to /dev/raw1

> real    0m5.378s
> user    0m0.002s
> sys     0m1.213s

> Opening /dev/raw1
> Allocating 50MB of memory
> Reading from /dev/raw1
> Writing data to /dev/raw1

> real    0m5.258s
> user    0m0.001s
> sys     0m1.183s


with this patch:

--- 2.4.3aa/include/linux/iobuf.h       Fri Apr  6 16:33:12 2001

  * entire iovec.
  */

-#define KIO_MAX_ATOMIC_IO      512 /* in kb */
+#define KIO_MAX_ATOMIC_IO      1024 /* in kb */
 #define KIO_STATIC_PAGES       (KIO_MAX_ATOMIC_IO / (PAGE_SIZE >> 10) + 1)
 #define KIO_MAX_SECTORS                (KIO_MAX_ATOMIC_IO * 2)

applied on top of 2.4.3aa3 I get even better results:

alpha:/home/andrea # time ./rawio-bench
Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m4.898s
user    0m0.003s
sys     0m1.138s
alpha:/home/andrea # time ./rawio-bench
Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m4.935s
user    0m0.002s
sys     0m1.159s
alpha:/home/andrea # time ./rawio-bench
Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m4.925s
user    0m0.003s
sys     0m1.162s
alpha:/home/andrea # time ./rawio-bench
Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m4.941s
user    0m0.004s
sys     0m1.166s
alpha:/home/andrea #

this is most probably because I'm striping across two scsi disks, and with
1024k atomic I/O we can send a 512k request to each disk.

NOTE: userspace reads and writes also have to be >=512kbytes in granularity,
or you'll generate small requests, because rawio is always synchronous.
Using decently sized reads/writes is a good idea anyway, to reduce the
kernel enter/exit overhead.
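
For illustration, a minimal userspace sketch of what that granularity means
(read_big_chunks() is just an example name, not part of any API, and `buf'
must be properly aligned, as in the proggy further down):

#include <unistd.h>

#define CHUNK (512 * 1024)      /* >= the 512k atomic I/O size */

/* Example only: read `size' bytes from a raw device fd in 512k pieces.
 * Smaller pieces would generate correspondingly small (and slow) disk
 * requests, since rawio is fully synchronous. */
static ssize_t read_big_chunks(int fd, char *buf, size_t size)
{
        size_t done = 0;

        while (done < size) {
                size_t len = size - done < CHUNK ? size - done : CHUNK;
                ssize_t ret = read(fd, buf + done, len);
                if (ret <= 0)
                        return ret ? ret : (ssize_t) done;
                done += ret;
        }
        return (ssize_t) done;
}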

However, we can probably stay with the 512k atomic I/O, otherwise the iobuf
structure will grow by another order of 2. With 512k of atomic I/O the
kiovec structure is just 8756 bytes in size (in fact I should probably
allocate some of the structures dynamically instead of statically inside
the kiobuf; as it is now my patch is not very reliable, since it needs an
order-2 allocation).
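
As a quick sanity check of how those macros scale (assuming 4k pages here;
alpha uses 8k pages, so the exact numbers differ there), using the
definitions quoted from iobuf.h above:

#include <stdio.h>

#define PAGE_SIZE         4096  /* assumed: 4k pages (8k on alpha) */
#define KIO_MAX_ATOMIC_IO  512  /* in kb; 1024 with the experimental bump */
#define KIO_STATIC_PAGES  (KIO_MAX_ATOMIC_IO / (PAGE_SIZE >> 10) + 1)
#define KIO_MAX_SECTORS   (KIO_MAX_ATOMIC_IO * 2)

int main(void)
{
        /* the per-kiobuf arrays scale linearly with the atomic I/O size,
         * which is why going from 512k to 1024k roughly doubles the
         * structure and pushes it to the next allocation order */
        printf("atomic io %dk: %d static pages, %d max sectors\n",
               KIO_MAX_ATOMIC_IO, KIO_STATIC_PAGES, KIO_MAX_SECTORS);
        return 0;
}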

BTW, some more description of the testcase: it first reads 50mbytes
physically contiguous, then lseeks back to zero and writes 50mbytes, so the
mean disk throughput is 100mbyte / 5 sec = 20mbyte/sec.

It uses anonymous memory as the in-core backend. It looks like a perfect
testcase to me, and those are the fastest disks I have around here.

Here's the proggy:


#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <asm/page.h>

#define MB (1024*1024)
#define BUFSIZE (50*MB)

int main(void)
{
        int fd, size, ret;
        int filemap;
        char * buf, * end, * tmp;

        printf("Opening /dev/raw1\n");
        fd = open("/dev/raw1", O_RDWR);
        if (fd < 0)
                perror("open /dev/raw1"), exit(1);

#if 1
        printf("Allocating %dMB of memory\n", BUFSIZE/MB);
        buf = (char *) malloc(BUFSIZE);
        if (!buf)
                perror("malloc"), exit(1);
        /* round `end' down and `buf' up to the nearest page boundary */
        end = (char *) ((unsigned long) (buf + BUFSIZE) & PAGE_MASK);
        buf = (char *) ((unsigned long)(buf + ~PAGE_MASK) & PAGE_MASK);
#else
        printf("Mapping %dMB of memory\n", BUFSIZE/MB);
        filemap = open("deleteme", O_RDWR|O_TRUNC|O_CREAT, 0644);
        if (filemap < 0)
                perror("open"), exit(1);
        {
                int i;
                char buf[4096];
                for (i = 0; i < BUFSIZE; i += 4096)
                        write(filemap, &buf, 4096);
        }
        ftruncate(filemap, BUFSIZE);
        buf = mmap(0, BUFSIZE, PROT_READ|PROT_WRITE, MAP_SHARED, filemap, 0);
        if ((long) buf < 0)
                perror("mmap"), exit(1);
        if ((unsigned long) buf & ~PAGE_MASK)
                perror("mmap misaligned"), exit(1);
        end = buf + BUFSIZE;
#endif
        size = end - buf;

        printf("Reading from /dev/raw1\n");
        ret = read(fd, buf, size);
        if (ret < 0)
                perror("read /dev/raw1"), exit(1);
        if (ret != size)
                fprintf(stderr, "read only %d of %d bytes\n", ret, size);
        printf("Writing data to /dev/raw1\n");
        if (lseek(fd, 0, SEEK_SET) < 0)
                perror("lseek"), exit(1);
        ret = write(fd, buf, size);
        if (ret < 0)
                perror("read /dev/raw1"), exit(1);
        if (ret != size)
                fprintf(stderr, "write only %d of %d bytes\n", ret, size);

        return 0;
}
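
It's compiled and run exactly as in the transcripts above, i.e. something
like "gcc -O2 -o rawio-bench rawio-bench.c" and then "time ./rawio-bench",
with /dev/raw1 already bound to the underlying block device beforehand.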

Andrea

2 times faster rawio and several fixes (2.4.3aa3)

Post by Andi Kleen » Sun, 08 Apr 2001 02:10:05



> However, we can probably stay with the 512k atomic I/O, otherwise the iobuf
> structure will grow by another order of 2. With 512k of atomic I/O the
> kiovec structure is just 8756 bytes in size (in fact I should probably
> allocate some of the structures dynamically instead of statically inside
> the kiobuf; as it is now my patch is not very reliable, since it needs an
> order-2 allocation).

8756 bytes wastes most of an order-2 allocation. Wouldn't it make more sense
to round it up to 16k and use the four pages fully? (If the increased atomic
size doesn't have other bad effects -- I guess it's no problem anymore to
lock down that much memory?)

-Andi


2 times faster rawio and several fixes (2.4.3aa3)

Post by Andrea Arcangeli » Sun, 08 Apr 2001 02:30:09




> > However, we can probably stay with the 512k atomic I/O, otherwise the iobuf
> > structure will grow by another order of 2. With 512k of atomic I/O the
> > kiovec structure is just 8756 bytes in size (in fact I should probably
> > allocate some of the structures dynamically instead of statically inside
> > the kiobuf; as it is now my patch is not very reliable, since it needs an
> > order-2 allocation).

> 8756 bytes wastes most of an order-2 allocation. Wouldn't it make more sense
> to round it up to 16k and use the four pages fully? (If the increased atomic

I prefer to get rid of the order-2 allocation entirely, to avoid having to
deal with fragmentation. The patch introduces arrays that take 1 page each
(on x86 and alpha) when the atomic I/O is 512k, so I can allocate them with
separate kmallocs.

OTOH on x86-64 we have a 4k PAGE_SIZE and 8-byte words, so maybe I should
use vmalloc instead? The performance of vmalloc is not an issue, because
those allocations no longer happen in any fast path; the only worry with
vmalloc is the 3 additional global TLB entries (but OTOH with kmalloc there
is also the chance that the code will use a few more global TLB entries, if
the memory returned for all the kiovec structures doesn't fit in the same
naturally aligned 2/4Mbyte area). So I will probably take the vmalloc way,
which is more generic and shouldn't hurt performance (I will measure that
to be sure, though).
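
Concretely the direction is just something like the sketch below: vmalloc()
and vfree() are the stock kernel primitives, while the `blocks' pointer
member is the hypothetical part (stock 2.4 embeds the array inline in the
kiobuf):

#include <linux/errno.h>
#include <linux/iobuf.h>
#include <linux/vmalloc.h>

/* Sketch only: allocate the big per-kiobuf array with vmalloc so that no
 * physically contiguous multipage allocation is needed. */
static int alloc_kiobuf_blocks(struct kiobuf *iobuf)
{
        iobuf->blocks = vmalloc(KIO_MAX_SECTORS * sizeof(unsigned long));
        if (!iobuf->blocks)
                return -ENOMEM;
        return 0;
}

static void free_kiobuf_blocks(struct kiobuf *iobuf)
{
        vfree(iobuf->blocks);
}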

Andrea

2 times faster rawio and several fixes (2.4.3aa3)

Post by Andrea Arcangeli » Sun, 08 Apr 2001 03:10:06



> naturally aligned 2/4Mbyte area). So I will probably take the vmalloc way,

As expected, the additional 2 TLB entries from vmalloc aren't visible in the
numbers (which are mostly dominated by I/O anyway); I think it's the best
solution to avoid the order-2 multipage allocation:

alpha:/home/andrea # time ./rawio-bench
Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m5.241s
user    0m0.002s
sys     0m1.119s
alpha:/home/andrea # time ./rawio-bench
Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m5.176s
user    0m0.003s
sys     0m1.128s
alpha:/home/andrea # time ./rawio-bench
Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m5.196s
user    0m0.002s
sys     0m1.132s
alpha:/home/andrea # time ./rawio-bench
Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m5.477s
user    0m0.004s
sys     0m1.146s
alpha:/home/andrea # time ./rawio-bench
Opening /dev/raw1
Allocating 50MB of memory
Reading from /dev/raw1
Writing data to /dev/raw1

real    0m5.217s
user    0m0.004s
sys     0m1.149s
alpha:/home/andrea #

Tomorrow maybe I will try to speed it up further using the design described
in the first email.

The s/kmem_cache_alloc/vmalloc/ change is here for now, and it is rock solid
for me (regression testing is still happy):

        ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2...

I think it's ok for inclusion.

Andrea