gcc/gcc3 generating FAR slower code than egcs

gcc/gcc3 generating FAR slower code than egcs

Post by John Caru » Thu, 28 Mar 2002 08:55:41



In the process of planning a migration from RedHat 6.2 to RedHat 7.2, I
ran into a striking difference in the speed of code compiled with egcs
2.91.66 on RH6.2 and either gcc 2.96 or 3.0.4 on RH7.2.  I was compiling
a test program that I use to busy out CPUs (source included at the end
of this posting), and saw the following results for the code generated
by the various compilers with and without optimization:

   RPM             gcc version              Unoptimized     Optimized (-O3)
   ===             ===========              ===========     ===============
   egcs-1.1.2-30   egcs-2.91.66 19990314      17.97 sec      7.89 sec
   gcc-2.96-98     2.96 20000731              27.31 sec     18.47 sec
   gcc3-3.0.4-1    3.0.4                      28.16 sec     18.37 sec

The time specified is total elapsed time, which for this (completely CPU-
bound) program is equal to the user CPU time.  The egcs compiles were done
on the RH6.2 system, and the gcc compiles were done on the RH7.2 system.
The egcs package was the default compiler on the RH6.2 system, and gcc
2.96 was the default compiler on the RH7.2 system; I installed and tested
gcc 3.0.4 on the RH7.2 system to see if it would perform differently.
The resulting binaries were run on both systems (RH6.2 and RH7.2), and
they produced identical results on each system.

As you can see, gcc 2.96 and 3.0.4 produced roughly equivalent code, and
it was approximately 50% slower (unoptimized) or 130% slower (optimized)
than the egcs code (!!!).  Even with -O3 optimization, the gcc compilers
couldn't produce code as fast as the *unoptimized* egcs code.

Does anyone know what's going on here?  Could this have anything to do
with RedHat itself?  I find it hard to believe that the newer gcc 2.96/
3.0.4 packages are really so much worse than the older egcs release,
especially since the egcs modifications (improvements?) were supposedly
folded into the gcc code base.  I'm dismayed, to say the least, at the
prospect of rebuilding RPMs on a RedHat 7.2 system if that's going to
get me code that runs anywhere from 50% to 130% slower.

Here's the test program source code (eatcpu.c):

   #include <stdio.h>
   #include <stdlib.h>

   int main (int argc, char **argv)
   {
           long loops, x, y[20];
           int scale;

           loops = argc > 1 ? atol(argv[1]) : 1000;
           printf("Executing %ld loops...\n", loops);

           for (scale = 0; scale < 5000; scale++) {
                   x = 0;
                   while (x < loops) {
                           y[x % 20] = x++ + 1;
                   }
           }

           exit(0);
   }

The tests were run with "time eatcpu 100000".

- John

gcc/gcc3 generating FAR slower code than egcs

Post by Tom Tromey » Thu, 28 Mar 2002 09:58:25


John> As you can see, gcc 2.96 and 3.0.4 produced roughly equivalent
John> code, and it was approximately 50% slower (unoptimized) or 130%
John> slower (optimized) than the egcs code (!!!).  Even with -O3
John> optimization, the gcc compilers couldn't produce code as fast as
John> the *unoptimized* egcs code.

Report this as a bug on gcc.gnu.org.  Or, better yet, compare the
respective assembly output and include your analysis in the bug
report.  That would help a gcc hacker figure out what went wrong.

John> Does anyone know what's going on here?  Could this have anything
John> to do with RedHat itself?

It has nothing to do with Red Hat.

John> I'm dismayed, to say the least, at the prospect of rebuilding
John> RPMs on a RedHat 7.2 system if that's going to get me code that
John> runs anywhere from 50% to 130% slower.

This benchmark probably doesn't actually give you much useful
information about the performance of actual applications.  Tuning a
compiler for a benchmark often causes it to perform worse on real
programs.

John>                           y[x % 20] = x++ + 1;

I think this is undefined.

Tom

gcc/gcc3 generating FAR slower code than egcs

Post by John Reiser » Thu, 28 Mar 2002 09:57:16


> Here's the test program source code (eatcpu.c):
> ...
>            for (scale = 0; scale < 5000; scale++) {
>                    x = 0;
>                    while (x < loops) {
>                            y[x % 20] = x++ + 1;
>                    }
>            }
> ...

If the purpose is to drain the available CPU cycles, then why
does it matter how "fast" it runs?

Second, just _look_ at the generated code:
        $ gdb eatcpu
        (gdb) x/100i main
        (gdb) x/100i  ## if necessary
Or, "gcc -S <other_flags> eatcpu.c" then view eatcpu.s .

Without even looking, it is an easy guess that one of the compilers
recognized the iteration-to-iteration behavior of "x % 20", and reduced
the strength of '%' on a loop induction variable, becoming "add-
compare-branch-reset" instead of taking a remainder by division.
Sure, that's a regression, but it isn't worth getting upset about.
Just rewrite the source.
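
For instance, a hand-strength-reduced version of the inner loop, assuming
the exact index values don't matter, might look something like this (a
sketch meant as a drop-in replacement for the inner loop of eatcpu.c, not
code that was actually timed):

   long idx = 0;                           /* explicit wrap-around index */
   x = 0;
   while (x < loops) {
           y[idx] = x + 1;                 /* plain variable index, no '%' */
           x++;
           if (++idx == 20)                /* add-compare-reset instead of a divide */
                   idx = 0;
   }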

And of course, aggressive optimization would recognize the "no output"
situation, removing _both_ loops entirely, in favor of 20 simple
assignment statements, or even nothing at all.

gcc/gcc3 generating FAR slower code than egcs

Post by Andre Kostur » Thu, 28 Mar 2002 10:52:34




> John>                   y[x % 20] = x++ + 1;

> I think this is undefined.

What's the undefined part?  x isn't modified twice between sequence
points.....

gcc/gcc3 generating FAR slower code than egcs

Post by John Caru » Thu, 28 Mar 2002 11:28:40



>> Here's the test program source code (eatcpu.c):
>> ...
>>            for (scale = 0; scale < 5000; scale++) {
>>                    x = 0;
>>                    while (x < loops) {
>>                            y[x % 20] = x++ + 1;
>>                    }
>>            }
>> ...

>If the purpose is to drain the available CPU cycles, then why
>does it matter how "fast" it runs?

The entire point of my posting was the *difference* between the run times
for the different compilers.  In general I don't care how fast this code
runs, since I just feed it a larger number to busy out the CPU for as long
as required in any particular instance.

>Second, just _look_ at the generated code:
>    $ gdb eatcpu
>    (gdb) x/100i main
>    (gdb) x/100i  ## if necessary
>Or, "gcc -S <other_flags> eatcpu.c" then view eatcpu.s .

>Without even looking, it is an easy guess that one of the compilers
>recognized the iteration-to-iteration behavior of "x % 20", and reduced
>the strength of '%' on a loop induction variable, becoming "add-
>compare-branch-reset" instead of taking a remainder by division.

The difference occurs even with optimization turned off, as I said
(unless the optimization you're talking about could occur even with
optimization disabled).  However, it is the case that the gcc 2.96/3.0.4
code is using idivl where the egcs code uses various shift operations
(e.g. sarl and sall) and simple arithmetic operations.  So perhaps it's
just a difference in the implementation of the modulus operator.

In fact, I just tested a version with the modulus operation replaced
by a simple variable reference and it executes at about the same speed
under egcs and both gcc compilers (and in fact gcc 3.0.4 beats egcs with
optimization, though gcc 2.96 is still slowest).  So this appears to be
the source of the difference, and it may or may not reflect a difference
in the quality of other code generated by gcc or egcs.  Hopefully there
aren't too many other such differences, and hopefully not many of the
packages I'll be compiling do a lot of modulus operations....
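
A variant along these lines, with the remainder replaced by a plain
variable index, illustrates the change (the exact code shown here is a
sketch, not necessarily the version that was timed):

   long idx = 0;

   for (scale = 0; scale < 5000; scale++) {
           x = 0;
           while (x < loops) {
                   y[idx] = x++ + 1;       /* simple variable index, no '%' */
           }
   }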

- John

gcc/gcc3 generating FAR slower code than egcs

Post by Tom Tromey » Thu, 28 Mar 2002 12:40:20


John> y[x % 20] = x++ + 1;

Tom> I think this is undefined.

Andre> What's the undefined part?

When is `x % 20' evaluated in relationship to `x++'?

Tom

gcc/gcc3 generating FAR slower code than egcs

Post by Erik Max Francis » Thu, 28 Mar 2002 12:47:54



> What's the undefined part?  x isn't modified twice between sequence
> points.....

No, the rule is about both modifying and referencing an object with no
intervening sequence point.

        x = x++;

is undefined behavior.

--

 __ San Jose, CA, US / 37 20 N 121 53 W / ICQ16063900 / &tSftDotIotE
/  \ Laws are silent in time of war.
\__/ Cicero
    Esperanto reference / http://www.alcyone.com/max/lang/esperanto/
 An Esperanto reference for English speakers.

gcc/gcc3 generating FAR slower code than egcs

Post by John Caru » Thu, 28 Mar 2002 14:18:48



>John>                       y[x % 20] = x++ + 1;

>I think this is undefined.

I believe that's true.  I was aware of that when I wrote it (several
years ago), but the purpose was to "hide" the modification of the loop
control variable within an expression, and also to prevent a given C
compiler from applying some optimization to short-circuit the inner
loop (that's why I used the modulus).  I wasn't concerned about which
element of y was modified on any given iteration; I just wanted to write
an expression that would hopefully defeat the default optimizations.
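
For what it's worth, a common way to keep a compiler from optimizing the
loop away without leaning on undefined behaviour is to make the array
volatile.  A minimal sketch along those lines, as an alternative rather
than the original program:

   #include <stdio.h>
   #include <stdlib.h>

   int main (int argc, char **argv)
   {
           volatile long y[20];    /* volatile: the stores can't be elided */
           long loops, x;
           int scale;

           loops = argc > 1 ? atol(argv[1]) : 1000;
           printf("Executing %ld loops...\n", loops);

           for (scale = 0; scale < 5000; scale++) {
                   for (x = 0; x < loops; x++) {
                           y[x % 20] = x + 1;      /* x modified only in the loop header */
                   }
           }

           exit(0);
   }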

- John

gcc/gcc3 generating FAR slower code than egcs

Post by John Caru » Thu, 28 Mar 2002 14:48:35



>In fact, I just tested a version with the modulus operation replaced
>by a simple variable reference and it executes at about the same speed
>under egcs and both gcc compilers (and in fact gcc 3.0.4 beats egcs with
>optimization, though gcc 2.96 is still slowest).  So this appears to be
>the source of the difference, and it may or may not reflect a difference
>in the quality of other code generated by gcc or egcs.

A little further testing shows that egcs appears to have an optimization
for modulus operations with a constant divisor.  The execution time
increases as the constant increases, but in my tests it always ran faster
than the gcc-generated code.  If you use a variable instead, though,
egcs will generate a division instruction just as gcc 2.96/3.0.4 do.
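
A rough sketch of the kind of transformation involved, assuming unsigned
32-bit arithmetic for simplicity: the constant 0xCCCCCCCD is ceil(2^36/20),
so the quotient can be formed with a multiply and a shift, and the
remainder recovered with another multiply and a subtract.  Presumably the
sarl/sall in the egcs output come from the extra adjustment steps needed
for the signed case.  This is illustrative only, not egcs's actual output:

   #include <stdio.h>
   #include <stdint.h>

   /* x % 20 without a division instruction, valid for all 32-bit
      unsigned x */
   static uint32_t mod20(uint32_t x)
   {
           uint32_t q = (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 36);
           return x - 20u * q;
   }

   int main (void)
   {
           uint32_t x;

           for (x = 0; x < 2000000u; x++)          /* spot-check against '%' */
                   if (mod20(x) != x % 20u)
                           printf("mismatch at %lu\n", (unsigned long)x);
           printf("done\n");
           return 0;
   }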

I guess the gcc team decided not to carry over this methodology when
egcs was merged into gcc--possibly because it does require a fair amount
of additional code.

- John

gcc/gcc3 generating FAR slower code than egcs

Post by Floyd Davidson » Thu, 28 Mar 2002 14:24:48




>> What's the undefined part?  x isn't modified twice between sequence
>> points.....

>No, the rule is about both modifying and referencing an object with no
>intervening sequence point.

>    x = x++;

>is undefined behavior.

However, your example is one of x being modified twice between
sequence points.

The previous code,

  y[x % 20] = x++ + 1;

which, for purposes of evaluating undefined behavior, can be
reduced to,

  y[x] = x++;

which does not modify x twice, but is undefined behavior because
x is both modified and used for a purpose other than computing
its new value, without an intervening sequence point.

The text from the C89 Standard is:

   Between the previous and next sequence point an object shall
   have its stored value modified at most once by the evaluation
   of an expression.  Furthermore, the prior value shall be
   accessed only to determine the value to be stored.

Text from the C99 Standard is slightly different:

   Between the previous and next sequence point an object shall
   have its stored value modified at most once by the evaluation
   of an expression.  Furthermore, the prior value shall be
   read only to determine the value to be stored.
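
A well-defined way to write what was presumably intended is to separate
the read of x from the increment, so that a sequence point falls between
them.  For example:

   y[x % 20] = x + 1;      /* index and stored value both use the old x */
   x++;                    /* increment after the full expression */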

--
Floyd L. Davidson         <http://www.ptialaska.net/~floyd>

gcc/gcc3 generating FAR slower code than egcs

Post by Carlos Moreno » Fri, 29 Mar 2002 05:05:40





>>John>                   y[x % 20] = x++ + 1;

>>I think this is undefined.

> What's the undefined part?  x isn't modified twice between sequence
> points.....

That's a slightly modified version of *the* classical example
of undefined behaviour:  arr[i] = i++;

When the compiler accesses x to evaluate x % 20, there is
no guarantee whatsoever whether the side effect of the x++ has
already taken place, since there is no intervening sequence
point.

Carlos