gcc 2.8.1 vs gcc 2.95.3 optimization on Sparc V8

Post by Dennis Clarke » Sun, 27 May 2001 10:40:02



Not exactly the most exciting topic in the world, but here it is.  I was very
happy with gcc version 2.8.1 on my Sparc20 at home.  For reasons unknown and
better left unsaid, I thought I would install 2.95.3 to see what improvement
in optimization I would get in some of my more numerically intensive code.  I
was surprised to see that the same source code ran considerably slower on my
Sparc20 when compiled with gcc 2.95.3.  I wonder what causes that?  It could be
a lot of little things, so I wrote a cute little program that computes pi using
the most inefficient method that I know of.  Essentially, (pi^2)/6 is equal to
the infinite sum of 1/(n^2) for n>0.  The series converges very slowly: an
estimate for pi accurate to about seven digits after the decimal point is
reached with n=1073741823.  That's a lot of iterations through the central
loop.  I compiled this program with gcc 2.95.3 using various optimization
options but never got anything close to the performance produced by gcc 2.8.1.
Here is the source:

$ cat pi.c
/* Standard PI calculation using an infinite series - Dennis Clarke */
/********************************************************************/
/* $ uname -a                                                       */
/* SunOS yay 5.7 Generic_106541-15 sun4m sparc SUNW,SPARCstation-20 */
/*                                                                  */
/* $ psrinfo -v                                                     */
/* Status of processor 0 as of: 05/25/01 21:24:41                   */
/*   Processor has been on-line since 05/21/01 05:37:39.            */
/*   The sparc processor operates at 60 MHz,                        */
/*         and has a sparc floating point processor.                */
/* Status of processor 2 as of: 05/25/01 21:24:41                   */
/*   Processor has been on-line since 05/21/01 05:37:43.            */
/*   The sparc processor operates at 60 MHz,                        */
/*         and has a sparc floating point processor.                */
/********************************************************************/

#include <locale.h>
#include <stdio.h>
#include <sys/time.h>
#include <math.h>
#include <stdlib.h>     /* declares exit() */

int main(int argc, char *argv[]) {

        double pi = (double) 0.0;
        unsigned long i;

        /*****************************************************/
        /** sum the series 1/(x^2)                          **/
        /*****************************************************/
        fprintf ( stdout, "\n\n" );
        for (i = 1; i < 1073741823; i++) {
                pi = pi + (double)1.0/( (double)i * (double)i );
        }

        fprintf(stdout, " pi at n=%9u is %.12g \n", i, sqrt( pi * (double)6.0 ));

        exit(1);

}

Well, I was going to print the start and stop time in the code but decided to
simply use the Solaris time program instead.  In any case, here are the results
of my test :

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Run 1

Using gcc 2.8.1 thus :

gcc -Wall -v -O3 -c -o pi.o pi.c
gcc -v -o pi pi.o -lm

results in a file 24664 bytes in length and of type :

ELF 32-bit MSB executable SPARC Version 1, dynamically linked, not stripped

The run time is

$ time -p ./pi

 pi at n=1073741823 is 3.14159264498

real 305.58
user 305.49
sys 0.02

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Run 2

Using gcc 2.95.3 20010315 (release) thus :

gcc -Wall -v -O3 -c -msupersparc -mcpu=supersparc -mtune=supersparc -o pi.o pi.c
gcc -v -o pi pi.o -lm

results in a file 7076 bytes in length and of type :

ELF 32-bit MSB executable SPARC Version 1, dynamically linked, not stripped

The run time is

$ time -p ./pi

 pi at n=1073741823 is 3.14159264498

real 377.32
user 377.25
sys 0.02

-=-=-=-=-=-=-=-=-=-
Run 3

Compile with gcc 2.95.3 20010315 (release) thus :

gcc -Wall -v -O3 -c -mcpu=v8 -mtune=v8 -o pi.o pi.c
gcc -v -o pi pi.o -lm

$ file pi
pi:             ELF 32-bit MSB executable SPARC Version 1, dynamically linked, stripped
$ time -p ./pi

 pi at n=1073741823 is 3.14159264498

real 377.38
user 377.35
sys 0.02

-=-=-=-=-=-=-=-=-=-=-=-=-

Run 4

$ gcc -Wall -O3 -c -mcpu=v8 -o pi.o pi.c
pi.c: In function `main':
pi.c:26: warning: unsigned int format, long unsigned int arg (arg 3)
$ gcc -o pi pi.o -lm
$ ls -lap pi
-rwxr-xr-x   1 dclarke  staff       7068 May 25 19:09 pi
$ strip pi
$ ls -lap pi
-rwxr-xr-x   1 dclarke  staff       4664 May 25 19:09 pi
$ time -p ./pi

 pi at n=1073741823 is 3.14159264498

real 377.31
user 377.26
sys 0.03

-=-=-=-=-=-=-=-=-=-=-=-=-
Run 5

$ gcc -Wall -O3 -c -o pi.o pi.c
pi.c: In function `main':
pi.c:26: warning: unsigned int format, long unsigned int arg (arg 3)
$ gcc -o pi pi.o -lm
$ ls -lap pi
-rwxr-xr-x   1 dclarke  staff       7068 May 25 19:27 pi
$ strip pi
$ ls -lap pi
-rwxr-xr-x   1 dclarke  staff       4664 May 25 19:27 pi
$ time -p ./pi

 pi at n=1073741823 is 3.14159264498

real 377.40
user 377.38
sys 0.01

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Run 6

$ gcc -Wall -O0 -c -o pi.o pi.c
pi.c: In function `main':
pi.c:26: warning: unsigned int format, long unsigned int arg (arg 3)
$ gcc -o pi pi.o -lm
$ ls -lap pi
-rwxr-xr-x   1 dclarke  staff       7172 May 25 19:50 pi
$ strip pi
$ ls -lap pi
-rwxr-xr-x   1 dclarke  staff       4768 May 25 19:51 pi
$ time -p ./pi

 pi at n=1073741823 is 3.14159264498

real 557.21
user 557.13
sys 0.02

Well, so there it is.  The results seem to show that gcc 2.8.1 produces a
faster binary from the same source with no special optimizations.  Geez, I wonder
why.  Just how different can the machine code be?  Maybe I should disassemble the
two and have a look see.

Dennis

ps: I'll try the same thing on an Ultra2 using the SparcV9 cpu optimization
option but I don't expect much to be different.  Then again, who knows.

 
 
 

Post by Drazen Kacar » Sun, 27 May 2001 12:35:56



> Well, so there it is.  The results seem to show that gcc 2.8.1 will
> produce a faster binary with the same source with no special
> optimizations.  Geez I wonder why.  Just how different can the machine
> code be?  Maybe I should disassemble to two and have a look see.

You could try adding "-ffast-math -fexpensive-optimizations" flags. I
couldn't see any difference, but it wasn't on the same CPU, so maybe.

> ps: I'll try the same thing on an Ultra2 using the SparcV9 cpu optimization
> option but I don't expect much to be different.  Then again ,  who knows.

I tried with Sun's 6.1 compiler. "cc -fast -xbuiltin" produces code which
runs roughly twice as fast as what was produced by gcc 2.95.3. But when
I added "-xdepend" it was slower. The assembly output shows that -xdepend
somehow killed a very clever loop unrolling/restructuring/however-that-trick-is-called
(/me being very impressed). -xautopar had the same effect. Hmmm.

--
 .-.   .-.    Sarcasm is just one more service we offer.
(_  \ /  _)

     |

 
 
 

Post by Dennis Clarke » Mon, 28 May 2001 00:57:59



> You could try adding "-ffast-math -fexpensive-optimizations" flags. I
> couldn't see any difference, but it wasn't on the same CPU, so maybe.

Good suggestion.  I pkgrm'd gcc 2.95.3 and reinstalled 2.8.1 for a while.  The
optimizations you mention above result in the following :

$ time -p ./pi

 pi at n=1073741823 is 3.14159264498

real 305.05
user 305.03
sys 0.02

Which is a little quicker than the following :

Run 8 with gcc 2.8.1 with optimizations for Sparc V8, no -ffast-math
-fexpensive-optimizations

$ time -p ./pi

 pi at n=1073741823 is 3.14159264498

real 305.37
user 305.33
sys 0.03  

> I tried with Sun's 6.1 compiler. "cc -fast -xbuiltin" produces code which
> runs roughly twice faster than what was produced by gcc 2.95.3.

TWICE !!  Really!  Geez, I think I'll drop the cash for a copy of that compiler
after all.  I wonder what other wonderful things it can do for numerically
intense code.  I have the SparcWorks Compiler version 4.1 and Sun says that I
can upgrade for about $2000.00 to the complete Forte 6.1.  It may be worth it in
the long run [ pun intended ] :)

> But when I added "-xdepend" it was slower.

I don't know what that means so I'll just have to buy the compiler.  I know that
it is HOSTID locked so there is no way to simply "get a copy" from someone.  I
have no intention of changing my hostid so I guess I'm stuck with buying it.

By the way, here is the compile and run on an Ultra 10 with a 300 MHz cpu.

$ uname -a
SunOS mars 5.8 Generic_108528-05 sun4u sparc SUNW,Ultra-5_10

$ gcc -v -Wall -O3 -c -mcpu=ultrasparc -mtune=ultrasparc -ffast-math -fexpensive-optimizations -o pi.o pi.c
Reading specs from /usr/local/lib/gcc-lib/sparc-sun-solaris2.8/2.95.3/specs
gcc version 2.95.3 20010315 (release)
 /usr/local/lib/gcc-lib/sparc-sun-solaris2.8/2.95.3/cpp0 -lang-c -v -D__GNUC__=2
-D__GNUC_MINOR__=95 -Dsparc -Dsun -Dunix -D__svr4__ -D__SVR4 -D__sparc__
-D__sun__ -D__unix__ -D__svr4__ -D__SVR4 -D__sparc -D__sun -D__unix
-Asystem(unix) -Asystem(svr4) -D__OPTIMIZE__ -D__FAST_MATH__ -Wall
-D__sparc_v9__ -D__GCC_NEW_VARARGS__ -Acpu(sparc) -Amachine(sparc) pi.c
/var/tmp/ccCFTyKa.i
GNU CPP version 2.95.3 20010315 (release) (sparc)
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /usr/local/lib/gcc-lib/sparc-sun-solaris2.8/2.95.3/../../../../sparc-sun-solaris2.8/include
 /usr/local/lib/gcc-lib/sparc-sun-solaris2.8/2.95.3/include
 /usr/include
End of search list.
The following default directories have been omitted from the search path:
 /usr/local/lib/gcc-lib/sparc-sun-solaris2.8/2.95.3/../../../../include/g++-3
End of omitted list.
 /usr/local/lib/gcc-lib/sparc-sun-solaris2.8/2.95.3/cc1 /var/tmp/ccCFTyKa.i -quiet -dumpbase pi.c -mcpu=ultrasparc -mtune=ultrasparc -O3 -Wall -version -ffast-math -fexpensive-optimizations -o /var/tmp/ccY9LdXh.s
GNU C version 2.95.3 20010315 (release) (sparc-sun-solaris2.8) compiled by GNU C version 2.95.3 20010315 (release).
pi.c: In function `main':
pi.c:35: warning: unsigned int format, long unsigned int arg (arg 3)
 /usr/ccs/bin/as -V -Qy -s -xarch=v8plusa -o pi.o /var/tmp/ccY9LdXh.s
/usr/ccs/bin/as: Sun WorkShop 6 99/08/18

$ gcc -v -o pi pi.o -lm
Reading specs from /usr/local/lib/gcc-lib/sparc-sun-solaris2.8/2.95.3/specs
gcc version 2.95.3 20010315 (release)
 /usr/local/lib/gcc-lib/sparc-sun-solaris2.8/2.95.3/collect2 -V -Y P,/usr/ccs/lib:/usr/lib -Qy -o pi /usr/local/lib/gcc-lib/sparc-sun-solaris2.8/2.95.3/crt1.o /usr/local/lib/gcc-lib/sparc-sun-solaris2.8/2.95.3/crti.o /usr/ccs/lib/values-Xa.o /usr/local/lib/gcc-lib/sparc-sun-solaris2.8/2.95.3/crtbegin.o -L/usr/local/lib/gcc-lib/sparc-sun-solaris2.8/2.95.3 -L/usr/ccs/bin -L/usr/ccs/lib -L/usr/local/lib pi.o -lm -lgcc -lc -lgcc /usr/local/lib/gcc-lib/sparc-sun-solaris2.8/2.95.3/crtend.o /usr/local/lib/gcc-lib/sparc-sun-solaris2.8/2.95.3/crtn.o
ld: Software Generation Utilities - Solaris-ELF (4.0)

$ file pi
pi:             ELF 32-bit MSB executable SPARC32PLUS Version 1, V8+ Required, UltraSPARC1 Extensions Required, dynamically linked, not stripped

$ ls -lap pi
-rwxr-xr-x   1 dclarke  staff       6220 May 26 11:51 pi
$ strip pi
$ ls -lap pi
-rwxr-xr-x   1 dclarke  staff       3856 May 26 11:52 pi

$ time -p ./pi

 pi at n=1073741823 is 3.14159264498

real 147.97
user 147.71
sys 0.00

Dennis Clarke

ps : Roy Williams at CalTech creates some very fast code for estimating pi to
thousands of digits

 
 
 

Post by Rich Teer » Mon, 28 May 2001 03:37:50



> I don't know what that means so I'll just have to buy the compiler.  I know that
> it is HOSTID locked so there is no way to simply "get a copy" from someone.  I
> have no intentions of changing my hostid so I guess I stuck with buying it.

You can download a free "trial" version from Sun, so you can try it before
handing over large sums of cash.  Also, you can get (at extra cost) a floating
version of the compiler.  This isn't nodelocked, although you can still only use
one copy at a time.

--
Rich Teer

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-online.net

 
 
 

Post by Oscar del Ri » Mon, 28 May 2001 04:02:21


> > I tried with Sun's 6.1 compiler. "cc -fast -xbuiltin" produces code which
> > runs roughly twice faster than what was produced by gcc 2.95.3.

> TWICE !!  Really!  Geez, I think I drop the cash for a copy of that compiler
> after all.  I wonder what other wonderful things it can do for numericaly
> intense code.  I have the SparcWorks Compiler version 4.1 and Sun says that I
> can upgrade for about $2000.00 to the complete Forte 6.1.  It may be worth it in
> the long run [ pun intended ] :)

I get the following times running your code on an Ultra E450, 400 MHz,
Solaris 7, using Sun Workshop cc 5.0 and gcc 2.8.1:

cc -O  (105.48 s)
cc -fast  (59.52 s)
gcc -O   (116.46 s)
gcc -O3   (105.68 s)
gcc -o pigcc3 -O3 -fexpensive-optimizations -ffast-math  (105.59 s)

That's right, "cc -fast" is almost twice as fast.

 
 
 

Post by Drazen Kacar » Mon, 28 May 2001 04:24:58


Dennis Clarke wrote:
> Drazen Kacar wrote:
> > I tried with Sun's 6.1 compiler. "cc -fast -xbuiltin" produces code which
> > runs roughly twice faster than what was produced by gcc 2.95.3.

> TWICE !!  Really!

Almost twice, not to be confused with TWICE. :-)

A minute for Sun cc against a minute and fifty-something for gcc 2.95.3.

> > But when I added "-xdepend" it was slower.

> I don't know what that means so I'll just have to buy the compiler.

Additional optimization flag. From the man page:

          (SPARC) Analyzes loops for inter-iteration data dependencies
          and does loop restructuring.

One would assume that it wouldn't hurt even if it doesn't help, but the problem
with Sun's compilers is that you never know which combination of
optimization flags will produce the best result this week. It's sort of
black magic. I gave up on that a long time ago.

Below is the assembly produced by the compiler with the command line:

   cc -fast -xarch=v8 -xbuiltin -S pi.c

You can compile it with "gcc pi.s" and run it on a v8 or better SPARC. The
assembly was produced on US-II with 2MB cache, so scheduling should be
optimized for that, but I think it shouldn't make much of a difference on
something else. The main trick is to run several FP instructions in
parallel.

        .section        ".text",#alloc,#execinstr
        .file   "pi.c"

        .section        ".rodata1",#alloc
        .align  4
!
! CONSTANT POOL
!
.L224:
        .ascii  "\n\n\000"
        .align  4
!
! CONSTANT POOL
!
.L230:
        .ascii  " pi at n=%9u is %.12g \n\000"

        .section        ".text",#alloc,#execinstr
/* 000000          0 */         .align  8
!
! CONSTANT POOL
!
                       ___const_seg_900000104:
/* 000000          0 */         .word   0,0
/* 0x0008            */         .word   1072693248,0
/* 0x0010            */         .word   1127219200,0
/* 0x0018            */         .word   1075314688,0
/* 0x0020          0 */         .type   ___const_seg_900000104,1
/* 0x0020          0 */         .size   ___const_seg_900000104,(.-___const_seg_900000104)
/* 0x0020          0 */         .align  4
! FILE pi.c

!    1                !#include <locale.h>
!    2                !#include <stdio.h>
!    3                !#include <sys/time.h>
!    4                !#include <math.h>
!    6                !int main(int argc, char *argv[]) {

!
! SUBROUTINE main
!
! OFFSET    SOURCE LINE LABEL   INSTRUCTION

                        .global main
                       main:
/* 000000          6 */         save    %sp,-104,%sp

!    8                !        double pi = (double) 0.0;
!    9                !        unsigned long i;
!   11                !        /*****************************************************/
!   12                !        /** sum the series 1/(x^2)                          **/
!   13                !        /*****************************************************/
!   14                !        fprintf ( stdout, "\n\n" );

/* 0x0004         14 */         sethi   %hi(__iob+16),%g2
/* 0x0008          0 */         sethi   %hi(___const_seg_900000104+8),%l0
/* 0x000c         14 */         add     %g2,%lo(__iob+16),%l1
/* 0x0010            */         sethi   %hi(.L224),%g2
/* 0x0014            */         or      %g0,%l1,%o0
/* 0x0018          0 */         add     %l0,%lo(___const_seg_900000104+8),%l2
/* 0x001c         14 */         call    fprintf ! params =  %o0 %o1     ! Result =
/* 0x0020            */         add     %g2,%lo(.L224),%o1

!   15                !        for (i = 1; i < 1073741823; i++) {

/* 0x0024         15 */         ldd     [%l2-8],%f2
/* 0x0028            */         or      %g0,1,%g4
/* 0x002c            */         sethi   %hi(0x3ffffc00),%g2

!   16                !                pi = pi + (double)1.0/( (double)i * (double)i );

/* 0x0030         16 */         st      %g4,[%sp+92]
/* 0x0034         15 */         add     %g2,1023,%g3
/* 0x0038         16 */         or      %g0,2,%g2
/* 0x003c            */         ldd     [%l2+8],%f6
/* 0x0040            */         or      %g0,3,%o0
/* 0x0044         15 */         nop ! volatile
/* 0x0048            */         ldd     [%l0+%lo(___const_seg_900000104+8)],%f4
/* 0x004c            */         sub     %g3,3,%g4
/* 0x0050         16 */         fmovs   %f6,%f0
/* 0x0054            */         ld      [%sp+92],%f1
/* 0x0058            */         fmovs   %f6,%f8
/* 0x005c            */         st      %g2,[%sp+92]
/* 0x0060            */         fsubd   %f0,%f6,%f0
/* 0x0064            */         fmuld   %f0,%f0,%f0
/* 0x0068            */         ld      [%sp+92],%f9
/* 0x006c            */         st      %o0,[%sp+92]
/* 0x0070            */         fdivd   %f4,%f0,%f10
/* 0x0074            */         fsubd   %f8,%f6,%f0
/* 0x0078            */         fmuld   %f0,%f0,%f0
/* 0x007c            */         fdivd   %f4,%f0,%f0
/* 0x0080            */         faddd   %f2,%f10,%f2
                       .L900000114:
/* 0x0084         15 */         nop ! volatile
/* 0x0088            */         nop ! volatile
/* 0x008c            */         nop ! volatile
/* 0x0090            */         nop ! volatile
/* 0x0094            */         nop ! volatile
/* 0x0098            */         nop ! volatile
/* 0x009c            */         nop ! volatile
/* 0x00a0            */         nop ! volatile
/* 0x00a4            */         nop ! volatile
/* 0x00a8            */         nop ! volatile
/* 0x00ac            */         nop ! volatile
/* 0x00b0            */         nop ! volatile
/* 0x00b4            */         nop ! volatile
/* 0x00b8         16 */         ld      [%sp+92],%f9
/* 0x00bc         15 */         nop ! volatile
/* 0x00c0            */         nop ! volatile
/* 0x00c4            */         nop ! volatile
/* 0x00c8            */         nop ! volatile
/* 0x00cc            */         nop ! volatile
/* 0x00d0            */         nop ! volatile
/* 0x00d4            */         nop ! volatile
/* 0x00d8            */         nop ! volatile
/* 0x00dc            */         nop ! volatile
/* 0x00e0            */         nop ! volatile
/* 0x00e4            */         nop ! volatile
/* 0x00e8            */         nop ! volatile
/* 0x00ec            */         nop ! volatile
/* 0x00f0            */         nop ! volatile
/* 0x00f4         16 */         fmovs   %f6,%f8
/* 0x00f8            */         fsubd   %f8,%f6,%f8
/* 0x00fc         15 */         nop ! volatile
/* 0x0100            */         nop ! volatile
/* 0x0104            */         nop ! volatile
/* 0x0108            */         nop ! volatile
/* 0x010c            */         nop ! volatile
/* 0x0110            */         nop ! volatile
/* 0x0114            */         nop ! volatile
/* 0x0118         16 */         fmuld   %f8,%f8,%f8
/* 0x011c         15 */         nop ! volatile
/* 0x0120            */         nop ! volatile
/* 0x0124            */         nop ! volatile
/* 0x0128         16 */         add     %o0,1,%g2
/* 0x012c         15 */         nop ! volatile
/* 0x0130            */         nop ! volatile
/* 0x0134         16 */         st      %g2,[%sp+92]
/* 0x0138         15 */         nop ! volatile
/* 0x013c         16 */         faddd   %f2,%f0,%f0
/* 0x0140            */         fdivd   %f4,%f8,%f2
/* 0x0144         15 */         nop ! volatile
/* 0x0148            */         nop ! volatile
/* 0x014c            */         nop ! volatile
/* 0x0150            */         nop ! volatile
/* 0x0154            */         nop ! volatile
/* 0x0158            */         nop ! volatile
/* 0x015c            */         nop ! volatile
/* 0x0160            */         nop ! volatile
/* 0x0164            */         nop ! volatile
/* 0x0168            */         nop ! volatile
/* 0x016c            */         nop ! volatile
/* 0x0170            */         nop ! volatile
/* 0x0174            */         nop ! volatile
/* 0x0178         16 */         ld      [%sp+92],%f9
/* 0x017c         15 */         nop ! volatile
/* 0x0180            */         nop ! volatile
/* 0x0184            */         nop ! volatile
/* 0x0188            */         nop ! volatile
/* 0x018c            */         nop ! volatile
/* 0x0190            */         nop ! volatile
/* 0x0194            */         nop ! volatile
/* 0x0198            */         nop ! volatile
/* 0x019c            */         nop ! volatile
/* 0x01a0            */         nop ! volatile
/* 0x01a4            */         nop ! volatile
/* 0x01a8            */         nop ! volatile
/* 0x01ac            */         nop ! volatile
/* 0x01b0            */         nop ! volatile
/* 0x01b4         16 */         fmovs   %f6,%f8
/* 0x01b8            */         fsubd   %f8,%f6,%f8
/* 0x01bc         15 */         nop ! volatile
/* 0x01c0            */         nop ! volatile
/* 0x01c4            */         nop ! volatile
/* 0x01c8            */         nop ! volatile
/* 0x01cc            */         nop ! volatile
/* 0x01d0            */         nop ! volatile
/* 0x01d4            */         nop ! volatile
/* 0x01d8         16 */         fmuld   %f8,%f8,%f8
/* 0x01dc         15 */         nop ! volatile
/* 0x01e0            */         nop ! volatile
/* 0x01e4            */         nop ! volatile
/* 0x01e8         16 */         add     %o0,2,%o0
/* 0x01ec         15 */         nop ! volatile
/* 0x01f0            */         nop ! volatile
/* 0x01f4         16 */         st      %o0,[%sp+92]
/* 0x01f8         15 */         nop ! volatile
/* 0x01fc         16 */         cmp     %o0,%g4
/* 0x0200            */         faddd   %f0,%f2,%f2
/* 0x0204            */         bcs     .L900000114
/* 0x0208            */         fdivd   %f4,%f8,%f0
                       .L900000117:
/* 0x020c         16 */         fmovs   %f6,%f8
/* 0x0210            */         ld      [%sp+92],%f9
/* 0x0214            */         add     %o0,1,%g2
/* 0x0218            */         fmovs   %f6,%f10
/* 0x021c            */         st      %g2,[%sp+92]
/* 0x0220            */         add     %o0,2,%g4
/* 0x0224            */         cmp     %g4,%g3
/* 0x0228            */         fsubd   %f8,%f6,%f8
/* 0x022c            */         faddd   %f2,%f0,%f0
/* 0x0230            */         ld      [%sp+92],%f11
/* 0x0234            */         fmuld   %f8,%f8,%f8
/* 0x0238            */         fsubd   %f10,%f6,%f6
/* 0x023c            */         fdivd   %f4,%f8,%f2
/* 0x0240            */         fmuld   %f6,%f6,%f6
/* 0x0244            */         fdivd   %f4,%f6,%f6
/* 0x0248            */         faddd   %f0,%f2,%f0
/* 0x024c            */         bcc     .L77000022
/* 0x0250            */         faddd   %f0,%f6,%f2
/* 0x0254            */         st      %g4,[%sp+92]
                       .L900000118:
/* 0x0258         16 */         add     %g4,1,%g4
/* 0x025c            */         ldd     [%l2+8],%f6
/* 0x0260            */         cmp     %g4,%g3
/* 0x0264            */         ld      [%sp+92],%f1
/* 0x0268            */         fmovs   %f6,%f0
/* 0x026c            */         fsubd   %f0,%f6,%f0
/* 0x0270            */         fmuld   %f0,%f0,%f0
/* 0x0274            */         fdivd   %f4,%f0,%f0
/* 0x0278            */         faddd   %f2,%f0,%f2
/* 0x027c            */         bcs,a   .L900000118
/* 0x0280            */         st      %g4,[%sp+92]

!   17                !        }
!   19                !        fprintf(stdout, " pi at n=%9u is %.12g \n", i, sqrt( pi * (double)6.0
!   20                !));

                       .L77000022:
/* 0x0284         20 */         ldd     [%l2+16],%f4
/* 0x0288            */         sethi   %hi(.L230),%g2
/* 0x028c            */         or      %g0,%l1,%o0
/* 0x0290            */         add     %g2,%lo(.L230),%o1
/* 0x0294            */         or      %g0,%g4,%o2
/* 0x0298            */         fmuld   %f2,%f4,%f4
/* 0x029c            */         fsqrtd  %f4,%f4
/* 0x02a0            */         st      %f4,[%sp+100]
/* 0x02a4            */         st      %f5,[%sp+96]
/* 0x02a8            */         ld      [%sp+100],%o3
/* 0x02ac            */         call    fprintf ! params =  %o0 %o1 %o2 %o3 %o4 ! Result =
/* 0x02b0            */         ld      [%sp+96],%o4

!   22                !        exit(1);

/* 0x02b4         22 */         call    exit    ! params =  %i0 ! Result =      ! (tail call)
/* 0x02b8            */         restore %g0,1,%o0
/* 0x02bc          0 */         .type   main,2
/* 0x02bc          0 */         .size   main,(.-main)
/* 0x02bc          0 */         .global __fsr_init_value
/* 0x02bc            */          __fsr_init_value=1

--
 .-.   .-.    Sarcasm is just one more service we offer.
(_  \ /  _)
     |        d...@arsdigita.com
     |

 
 
 

Post by Paul Floyd » Mon, 28 May 2001 04:37:36


[snip]

>I don't know what that means so I'll just have to buy the compiler.  I know that
>it is HOSTID locked so there is no way to simply "get a copy" from someone.  I
>have no intentions of changing my hostid so I guess I stuck with buying it.

Well, the enterprise version is intended for installation such that you
have a "floating" license server, and any workstation can obtain a
license, up to the number of licenses you have available on the server.

The personal edition is intended to be node locked to the workstation
where you have the compiler installed.

A single enterprise Forte license costs more, but most organizations buy
in bulk and get cheaper prices.

Lastly, if you have a fast/free internet connection, you can download
the "try and buy" version (this is a couple of hundred Mbytes). It comes
with a 1 month temporary license which is not node locked.

A bientot
Paul
--
Paul Floyd                 http://paulf.free.fr (for what it's worth)

If more is better, are double standards better than single ones?

 
 
 

Post by Dennis Clarke » Mon, 28 May 2001 09:06:28



> You can download a free "trial" version from Sun, so you can try it before
> handing over large sums of cash.  Also, you can get (at extra cost) a floating
> version of the compiler.  This isn't nodelocked, although you can still only use
> one copy at a time.

Thanks Rich, but I already have a quote in hand from Sun for an upgrade from
WorkShop 4.1 to Forte 6.1 for about $1900.00 or thereabouts.  I was
moderately curious about how different versions of gcc compare with each other
when running a binary that is stripped down to the bare minimum.  I'd prefer to
do an OpenGL based test or some other test that requires moving large arrays of
data about, perhaps finding the inverse of a large matrix or doing a Fourier
transform on a whack ( technical term ) of data.  That would involve L2 cache
differences as well as other CPU architecture features, so I went with the bare
minimum dog-slow pi calculation for a comparison test.

This is really all connected to another topic that I made reference to a few
weeks back: comparison of the SunBlade 100 to the Sun Ultra 10 in real world
terms.  Well, that test has been completed and you will never guess the
results.  Well, maybe YOU will, but most people look at me like a German
shepherd that just heard a noise off in the distance: head turned sideways with
a lost look in their eyes, wanting to know more.  The customer that ran the
tests would like to recompile a few things for the SunBlade, but I warned them
that the cost of doing an application tweak was far greater than the cost of a
set of 280R's.  I'll post the results and wait for the lawsuit from Sun.

Dennis Clarke

 
 
 

Post by Dennis Clarke » Mon, 28 May 2001 09:13:17


> I get the following times running your code on a Ultra E450, 400 MHz,
> Solaris 7, using Sun Workshop cc 5.0 and gcc 2.8.1:

> cc -O  (105.48 s)
> cc -fast  (59.52 s)
> gcc -O   (116.46 s)
> gcc -O3   (105.68 s)
> gcc -o pigcc3 -O3 -fexpensive-optimizations -ffast-math  (105.59 s)

> That's right, "cc -fast" is almost twice as fast.

Wow!  

    Really.  

        Wow!

   Imagine a factor of 2 reduction in run time based on nothing more than a
compiler choice and a few optimizations.  No code change.  Who would have
guessed.  As I said in my reply to Rich Teer, I will be getting the Forte
compiler, but I have yet to choose my workstation for the next few years.  Looks
like it will be a SunBlade 1000 but I'm not sure yet.  I'm very curious about
the benefits of the new UltraSparc III processor when compared to the UltraSparc
II versions.  Very curious and cautious.

Dennis Clarke

 
 
 

Post by Dennis Clarke » Tue, 29 May 2001 07:18:56





> > > I tried with Sun's 6.1 compiler. "cc -fast -xbuiltin" produces code which
> > > runs roughly twice faster than what was produced by gcc 2.95.3.

> > TWICE !!  Really!

> Almost twice, not to be confused with TWICE. :-)

> A minute for Sun cc against minute and fifty-something for gcc 2.95.3.

That feels like twice to me!  Very cool.

> > > But when I added "-xdepend" it was slower.

> > I don't know what that means so I'll just have to buy the compiler.

> Additional optimization flag. From the man page:

>           (SPARC) Analyzes loops for inter-iteration data dependencies
>           and does loop restructuring.

> One would assume that it won't harm if it doesn't help, but the problem
> with Sun's compilers is that you never know which combination of
> optimization flags would produce the best result this week. It's sort of a
> black magic. I gave up on that a long time ago.

Hmmm.  Does that mean you have to face east and kill a chicken under a
full moon during compile to get the best results? :)

> Below is the assembly produced by the compiler with the command line:

>    cc -fast -xarch=v8 -xbuiltin -S pi.c

> You can compile it with "gcc pi.s" and run on v8 or better SPARC. The
> assembly was produced on US-II with 2MB cache, so scheduling should be
> optimized for that, but I think it shouldn't make much of a difference on
> something else. The main trick is to run several FP instructions in
> parallel.

I'll run it right away and let you know what I get.  There are a lot of
NOPs in there!  For timing and scheduling?  Do we really need to wait so
long for a floating point operation to complete?  Don't we have the option
to have multiple pipelines of execution going simultaneously?  Can I have
all that with fries and super-size them while you're at it?  ;)

Well ... here it is and things are not so good on the SS20 :

$ gcc -v -o pi_drazen pi.s -lm
Reading specs from /usr/local/lib/gcc-lib/sparc-sun-solaris2.7/2.8.1/specs
gcc version 2.8.1
 /usr/ccs/bin/as -V -Qy -s -o /var/tmp/ccR1aW4c1.o pi.s
/usr/ccs/bin/as: WorkShop Compilers 5.0 98/12/21
 /usr/ccs/bin/ld -V -Y P,/usr/ccs/lib:/usr/lib -Qy -o pi_drazen
    /usr/local/lib/gcc-lib/sparc-sun-solaris2.7/2.8.1/crt1.o
    /usr/local/lib/gcc-lib/sparc-sun-solaris2.7/2.8.1/crti.o
    /usr/ccs/lib/values-Xa.o
    /usr/local/lib/gcc-lib/sparc-sun-solaris2.7/2.8.1/crtbegin.o
    -L/usr/local/lib/gcc-lib/sparc-sun-solaris2.7/2.8.1
    -L/usr/local/sparc-sun-solaris2.7/lib -L/usr/ccs/bin -L/usr/ccs/lib
    -L/usr/local/lib /var/tmp/ccR1aW4c1.o -lm -lgcc -lc -lgcc
    /usr/local/lib/gcc-lib/sparc-sun-solaris2.7/2.8.1/crtend.o
    /usr/local/lib/gcc-lib/sparc-sun-solaris2.7/2.8.1/crtn.o
ld: Software Generation Utilities - Solaris-ELF (4.0)
$ ls -lap pi_drazen
-rwxr-xr-x   1 dclarke  staff      25148 May 27 17:54 pi_drazen
$ file pi_drazen
pi_drazen:      ELF 32-bit MSB executable SPARC Version 1, dynamically linked, not stripped
$ strip pi_drazen
$ ls -lap pi_drazen
-rwxr-xr-x   1 dclarke  staff       8604 May 27 17:54 pi_drazen
$ time -p ./pi_drazen

 pi at n=1073741823 is 3.14159264498

real 457.47
user 457.43
sys 0.03

Which is a whole lot slower than our previous attempts.  Then again, this
is an SS20 with SM61s and not UltraSparc II.  Let's run your code on two
other systems, an Ultra10 and an Ultra 2.  Both have 300MHz cpus with the
exception that the U2 has a lot more L2 cache.  Shouldn't matter for this
sort of test.

On the Ultra 10 we have the following for your assembly code :

$ time -p ./pi_sparc1

 pi at n=1073741823 is 3.14159264498

real 81.97
user 81.78
sys 0.00

and that is compared with the following binary created from source by gcc
2.95.3 on that same Ultra10

$ time -p ./pi

 pi at n=1073741823 is 3.14159264498

real 152.65
user 152.31
sys 0.01

So there you have it.  Your assembly code is twice as fast ( we keep
saying that ) on this system.  I'll run this on the Ultra2 but I don't
expect any miracles at all.

Dennis Clarke

 
 
 


Post by Roland Main » Tue, 29 May 2001 18:00:09



> > You can download a free "trial" version from Sun, so you can try it before
> > handing over large sums of cash.

> You might also try Forte 6u2 EA if you're adventurous enough.

You don't need to be "adventurous"... it even works for large, complex
projects like Mozilla5 _very_ well (better than Workshop 5/6u1)... :-))

----

Bye,
Roland

--
  __ .  . __


  /O /==\ O\  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
 (;O/ \/ \O;) TEL +49 641 99-41370 FAX +49 641 99-41359

 
 
 


Post by Roland Main » Tue, 29 May 2001 18:02:43


> I tried with Sun's 6.1 compiler. "cc -fast -xbuiltin" produces code which
> runs roughly twice faster than what was produced by gcc 2.95.3. But when
> I added "-xdepend" it was slower. The assembly output shows that -xdepend
> somehow killed very clever loop unrolling/restructuring/however-is-that-
> trick-called (/me being very impressed). -xautopar had the same effect. Hmmm.

Uhm... -xautopar and co. are for machines with multiple CPUs - and you
need code which can be parallelized a lot (for example: many independent
vars)...

----

Bye,
Roland

--
  __ .  . __


  /O /==\ O\  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
 (;O/ \/ \O;) TEL +49 641 99-41370 FAX +49 641 99-41359

 
 
 


Post by Drazen Kac » Tue, 29 May 2001 18:33:06


Dennis Clarke wrote:
> On Sat, 26 May 2001, Drazen Kacar wrote:
> > One would assume that it won't harm if it doesn't help, but the problem
> > with Sun's compilers is that you never know which combination of
> > optimization flags would produce the best result this week. It's sort of a
> > black magic. I gave up on that a long time ago.

> Hmmm.  Does that mean you have to face east and kill a chicken under a
> full moon during compile to get the best results? :)

No, it just means you have to test things regularly, like after applying a
patch. Time invested in that should pay off, if you really need the best
possible results. I don't and it wasn't much fun, so I gave up.

> I'll run it right away and let you know what I get.  There are a lot of
> NOPs in there!  For timing and scheduling?

I think so. I was never interested in FP, so I don't know what various
sparcs have available in this area. But I would assume that nops are there
to keep the pipeline happy. Hopefully somebody else can fill in more
details.

> Do we really need to wait so long to get a floating point operation
> complete?  Don't we have the option to have multiple pipelines of
> execution going simultaneously?

It's not long; all CPUs need some time to perform FP operations. And then
there are loads and stores, because sparcs don't have instructions to load
fp registers from integer registers (I think; haven't checked). But that
should be optimized by the CPU itself, so I don't expect a penalty because
of memory write operations.

> real 457.47
> user 457.43
> sys 0.03

> Which is a whole lot slower than our previous attempts.  Then again, this
> is a SS20 with SM61s and not UltraSparc II.  Lets run your code on two

I thought supersparcs could execute multiple fp instructions. Hmm. But
obviously not in the same way as US-II.

> On the Ultra 10 we have the following for your assembly code :

> real 81.97
> user 81.78
> sys 0.00

> and that is compared with the following binary created from source by gcc
> 2.95.3 on that same Ultra10

>  pi at n=1073741823 is 3.14159264498

> real 152.65
> user 152.31
> sys 0.01

> So there you have it.  Your assembly code is twice as fast ( we keep

As far as I understand the copyright law, that assembly code is yours. :-)

> saying that ) on this system.  I'll run this on the Ultra2 but I don't
> expect any miracles at all.

That would be on US-I cpu. I don't know what would happen. Unfortunately I
don't have a lot of different sparcs any more, so I can't test.

But the whole thing shows a general weakness in RISC theory and practice.
The theory was that you'd have a simple CPU and a clever compiler, so the
overall result would be better, given some time to improve the compiler
technology. But then, after that time, you end up with different CPUs
that need binaries optimized in different ways. I expect the IA-64 world
will suffer from this quite a bit.

But let's have another try. This one is produced with:

   cc -fast -xbuiltin -xchip=super -xarch=v8 -xcache=generic -S pi.c

I could have used -xchip=super2, but I don't know which chip you actually
have. -xcache=generic is also not the best, but again, I don't know the
details about the cache on your CPU, so I couldn't do anything about it.

On my US-II this completes in 78 seconds. The one optimized for US-II
needed 61 seconds. Pretty good CPU performance compatibility, I would say.

        .section        ".text",#alloc,#execinstr
        .file   "pi.c"

        .section        ".rodata1",#alloc
        .align  4
!
! CONSTANT POOL
!
.L224:
        .ascii  "\n\n\000"
        .align  4
!
! CONSTANT POOL
!
.L230:
        .ascii  " pi at n=%9u is %.12g \n\000"

        .section        ".text",#alloc,#execinstr
/* 000000          0 */         .align  8
!
! CONSTANT POOL
!
                       ___const_seg_900000104:
/* 000000          0 */         .word   0,0
/* 0x0008            */         .word   1072693248,0
/* 0x0010            */         .word   1127219200,0
/* 0x0018            */         .word   1075314688,0
/* 0x0020          0 */         .type   ___const_seg_900000104,1
/* 0x0020          0 */         .size   ___const_seg_900000104,(.-___const_seg_900000104)
/* 0x0020          0 */         .align  4
! FILE pi.c

!    1                !#include <locale.h>
!    2                !#include <stdio.h>
!    3                !#include <sys/time.h>
!    4                !#include <math.h>
!    6                !int main(int argc, char *argv[]) {

!
! SUBROUTINE main
!
! OFFSET    SOURCE LINE LABEL   INSTRUCTION

                        .global main
                       main:
/* 000000          6 */         save    %sp,-104,%sp

!    8                !        double pi = (double) 0.0;
!    9                !        unsigned long i;
!   11                !        /*****************************************************/
!   12                !        /** sum the series 1/(x^2)                          **/
!   13                !        /*****************************************************/
!   14                !        fprintf ( stdout, "\n\n" );

/* 0x0004         14 */         sethi   %hi(__iob+16),%l4
/* 0x0008            */         add     %l4,%lo(__iob+16),%l2
/* 0x000c            */         sethi   %hi(.L224),%g2
/* 0x0010            */         or      %g0,%l2,%o0
/* 0x0014          0 */         sethi   %hi(___const_seg_900000104+8),%l0
/* 0x0018          0 */         add     %l0,%lo(___const_seg_900000104+8),%l3
/* 0x001c         14 */         call    fprintf ! params =  %o0 %o1     ! Result =
/* 0x0020            */         add     %g2,%lo(.L224),%o1

!   15                !        for (i = 1; i < 1073741823; i++) {
!   16                !                pi = pi + (double)1.0/( (double)i * (double)i );

/* 0x0024         16 */         ldd     [%l3+8],%f10
/* 0x0028            */         fmovs   %f10,%f12
/* 0x002c         15 */         or      %g0,1,%g4
/* 0x0030         16 */         st      %g4,[%sp+92]
/* 0x0034         15 */         sethi   %hi(0x3ffffc00),%g2
/* 0x0038         16 */         ld      [%sp+92],%f13
/* 0x003c         15 */         add     %g2,1023,%g3
/* 0x0040         16 */         or      %g0,2,%g4
/* 0x0044            */         fsubd   %f12,%f10,%f6
/* 0x0048         15 */         ldd     [%l3-8],%f2
/* 0x004c         16 */         add     %g3,-1,%g2
/* 0x0050         15 */         ldd     [%l0+%lo(___const_seg_900000104+8)],%f4
/* 0x0054         16 */         st      %g4,[%sp+92]
/* 0x0058            */         fmuld   %f6,%f6,%f6
/* 0x005c            */         fdivd   %f4,%f6,%f8
                       .L900000118:
/* 0x0060         16 */         fmovs   %f10,%f12
/* 0x0064            */         ld      [%sp+92],%f13
/* 0x0068            */         add     %g4,1,%g4
/* 0x006c            */         faddd   %f2,%f8,%f2
/* 0x0070            */         cmp     %g4,%g2
/* 0x0074            */         fsubd   %f12,%f10,%f0
/* 0x0078            */         fmuld   %f0,%f0,%f6
/* 0x007c            */         fdivd   %f4,%f6,%f8
/* 0x0080            */         bcs     .L900000118
/* 0x0084            */         st      %g4,[%sp+92]
                       .L900000116:
/* 0x0088         16 */         fmovs   %f10,%f12
/* 0x008c            */         add     %g4,1,%g4
/* 0x0090            */         ld      [%sp+92],%f13
/* 0x0094            */         cmp     %g4,%g3
/* 0x0098            */         faddd   %f2,%f8,%f6
/* 0x009c            */         fsubd   %f12,%f10,%f8
/* 0x00a0            */         fmuld   %f8,%f8,%f8
/* 0x00a4            */         fdivd   %f4,%f8,%f8
/* 0x00a8            */         bcc     .L77000022
/* 0x00ac            */         faddd   %f6,%f8,%f2
/* 0x00b0            */         st      %g4,[%sp+92]
                       .L900000117:
/* 0x00b4         16 */         fmovs   %f10,%f12
/* 0x00b8            */         ld      [%sp+92],%f13
/* 0x00bc            */         add     %g4,1,%g4
/* 0x00c0            */         cmp     %g4,%g3
/* 0x00c4            */         fsubd   %f12,%f10,%f6
/* 0x00c8            */         fmuld   %f6,%f6,%f6
/* 0x00cc            */         fdivd   %f4,%f6,%f6
/* 0x00d0            */         faddd   %f2,%f6,%f2
/* 0x00d4            */         bcs,a   .L900000117
/* 0x00d8            */         st      %g4,[%sp+92]

!   17                !        }
!   19                !        fprintf(stdout, " pi at n=%9u is %.12g \n", i, sqrt( pi * (double)6.0
!   20                !));

                       .L77000022:
/* 0x00dc         20 */         sethi   %hi(.L230),%g2
/* 0x00e0            */         ldd     [%l3+16],%f4
/* 0x00e4            */         or      %g0,%l2,%o0
/* 0x00e8            */         add     %g2,%lo(.L230),%o1
/* 0x00ec            */         fmuld   %f2,%f4,%f4
/* 0x00f0            */         or      %g0,%g4,%o2
/* 0x00f4            */         fsqrtd  %f4,%f4
/* 0x00f8            */         st      %f4,[%sp+96]
/* 0x00fc            */         ld      [%sp+96],%o3
/* 0x0100            */         st      %f5,[%sp+92]
/* 0x0104            */         call    fprintf ! params =  %o0 %o1 %o2 %o3 %o4 ! Result =
/* 0x0108            */         ld      [%sp+92],%o4

!   22                !        exit(1);

/* 0x010c         22 */         call    exit    ! params =  %i0 ! Result =      ! (tail call)
/* 0x0110            */         restore %g0,1,%o0
/* 0x0114          0 */         .type   main,2
/* 0x0114          0 */         .size   main,(.-main)
/* 0x0114          0 */         .global __fsr_init_value
/* 0x0114            */          __fsr_init_value=1

--
 .-.   .-.    Sarcasm is just one more service we offer.
(_  \ /  _)
     |        d...@arsdigita.com
     |

 
 
 


Post by Drazen Kac » Tue, 29 May 2001 18:41:42




> > I tried with Sun's 6.1 compiler. "cc -fast -xbuiltin" produces code which
> > runs roughly twice faster than what was produced by gcc 2.95.3. But when
> > I added "-xdepend" it was slower. The assembly output shows that -xdepend
> > somehow killed very clever loop unrolling/restructuring/however-is-that-
> > trick-called (/me being very impressed). -xautopar had the same effect. Hmmm.

> Uhm... -xautopar and co. are for machines with multiple CPUs - and you
> need code which can be parallelized a lot (for example: many independent
> vars)...

I know. I have two CPUs on that machine, so I built with the PARALLEL=2
setting. I didn't expect any performance improvement, because there are no
loops to parallelize in that code, but I was interested to see what would
happen. I'm not sure if a performance decrease with -xautopar can be called
a bug or if it's just a case of improper use (which it was), but a huge
performance decrease with -xdepend is a bug, as far as I understand
things.

--
 .-.   .-.    Sarcasm is just one more service we offer.
(_  \ /  _)

     |

 
 
 


Post by Roland Main » Tue, 29 May 2001 19:08:09



> > Well, so there it is.  The results seem to show that gcc 2.8.1 will
> > produce a faster binary with the same source with no special
> > optimizations.  Geez I wonder why.  Just how different can the machine
> > code be?  Maybe I should disassemble to two and have a look see.

> You could try adding "-ffast-math -fexpensive-optimizations" flags. I
> couldn't see any difference, but it wasn't on the same CPU, so maybe.

> > ps: I'll try the same thing on an Ultra2 using the SparcV9 cpu optimization
> > option but I don't expect much to be different.  Then again ,  who knows.

> I tried with Sun's 6.1 compiler. "cc -fast -xbuiltin" produces code which
> runs roughly twice faster than what was produced by gcc 2.95.3. But when
> I added "-xdepend" it was slower. The assembly output shows that -xdepend
> somehow killed very clever loop unrolling/restructuring/however-is-that-
> trick-called (/me being very impressed). -xautopar had the same effect. Hmmm.

Some values from Sun Workshop 6 Update 2 _EarlyAccess_ 2
(http://access1.sun.com/fortedevprod/), Ultra5/333MHz/2MB-2ndLevel cache
running Solaris 7/106541-15:
-- snip --
% cc -fast -xbuiltin pi.c
cc: Warning: -xarch=native has been explicitly specified, or implicitly
specified by a macro option, -xarch=native on this architecture implies
-xarch=v8plusa which generates code that does not run on pre-UltraSPARC
processors
% time ./a.out

 pi at n=1073741823 is 3.14159264498

real    1m12.836s
user    1m11.810s
sys     0m0.010s
% cc -fast pi.c
cc: Warning: -xarch=native has been explicitly specified, or implicitly
specified by a macro option, -xarch=native on this architecture implies
-xarch=v8plusa which generates code that does not run on pre-UltraSPARC
processors
% time ./a.out

 pi at n=1073741823 is 3.14159264498

real    1m12.681s
user    1m11.750s
sys     0m0.000s
% cc -fast -xarch=v9a pi.c
% time ./a.out

 pi at n=1073741823 is 3.14159264498

real    2m35.473s
user    2m33.390s
sys     0m0.010s

# what ??
# ahhh... code uses _long_ - which is 64bit in sparcv9... changing
# "long" to "int":
% nedit pi.c&
[1] 3921
% cc -fast -xarch=v9a pi.c
% time ./a.out

 pi at n=1073741823 is 3.14159264498

real    1m12.952s
user    1m11.850s
sys     0m0.010s

% cc -fast -xarch=v9 pi.c
% time ./a.out

 pi at n=1073741823 is 3.14159264498

real    1m12.742s
user    1m11.840s
sys     0m0.010s
-- snip --

But: IMHO this test is too simple (see sparcv9 vs. sparcv9a results) to
show the advantage of complex optimisations... something like an MPEG
encoder would give a far better view of compiler optimisation "quality"...

----

Bye,
Roland

--
  __ .  . __


  /O /==\ O\  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
 (;O/ \/ \O;) TEL +49 641 99-41370 FAX +49 641 99-41359

 
 
 
