Optimization hints for Alpha 21164


Post by Serge Jooris » Tue, 19 May 1998 04:00:00



Hi all,
I'm looking for optimization hints for the Alpha 21164. Any input would
be great.
Please reply by mail.
Thank you.
--
Serge Jooris
Researcher at the Information and Decision Systems Department
Free University of Brussels.

S.L.N.
av. Franklin Roosevelt 50 (CP 165)
1050 Bruxelles (Brussels)
Belgium

Phone
00 32 67 88 95 51 or 00 32 2 650 22 93
Fax
00 32 67 88 95 52 or 00 32 2 650 22 98
e-mail



Post by Joachim Wesne » Wed, 20 May 1998 04:00:00



> Hi all,
> I'm looking for optimization hints for the Alpha 21164. Any input will
> be great.
> please reply by mail.

I think this would be of great interest to others too.

When asking in a newsgroup, please refrain from limiting answers to
private mail.

Joachim



Post by Olaf Schnapauff » Wed, 20 May 1998 04:00:00


:> I'm looking for optimization hints for the Alpha 21164. Any input will

As far as compiling goes: get a recent egcs (egcs.cygnus.com) and install it.
Then compile with something like:

-fomit-frame-pointer -funroll-loops -O5 -finline-functions -ffast-math

and only use -mieee if you really need it (if you get weird floating-point
exceptions, try it).

Olaf

--
-------------------------------------------------------------------------
"The number of Unix installations
- The Unix Programmer's Manual, 2nd Edition, June, 1972.

Olaf Schnapauff, http://www.tu-bs.de/~c0033014/ for PGP Public key
Key fingerprint = AD C4 8A F0 45 D0 28 59  77 24 99 53 3B 07 4B EC



Post by Georg Ach » Thu, 21 May 1998 04:00:00


|> :> I'm looking for optimization hints for the Alpha 21164. Any input will
|>
|> As far as compiling goes: Get a recent egcs (egcs.cygnus.com), install it.
|> Compile with something like:
|>
|> -fomit-frame-pointer -funroll-loops -O5 -finline-functions -ffast-math
|>
|> and only use -mieee if you really need it (if you get weird Floating point
|> exceptions, try it).

But egcs still doesn't produce good code...

Some suggestions:

Avoid byte/short accesses: without the BWX (byte/word) extension of the 21164PC,
each one translates to at least 4 to 5 instructions. The sieve benchmark runs over
50% faster with an int array instead of a char array, despite the higher memory
consumption.
BTW: a Pentium II loses more than half of its performance with the int array :-/

Try to prefetch data into registers well before it is used. I've made measurements
with the inner loop of the mpg123 decoder. It consists mainly of the following loops
(already unrolled in the code):

float *a,*b;
for(n...)
{
        sum=*a++ * *b++;        /* multiply and accumulate MAC*/
        sum+=*a++ * *b++;
        /* 16 times total */

        /* store sum... */

}

egcs-1.01 does a stupid job here and translates each line to roughly the
following:

        ldt $f1,($1)
        ldt $f2,($2)
        mult $f1,$f2,$f3
        addt $f4,$f3,$f4
        lda $1,8($1)
        lda $2,8($2)

This is very inefficient, since the mult always has to wait for the data (and
memory latency is very high compared to the instruction cycle time). So the
pipeline is never filled and you can forget about your MFLOPS...

To avoid that, you can prefetch the data into registers before it is needed:

        ldt $f1,($1)
        ldt $f2,($2)
        ldt $f3,8($1)
        ldt $f4,8($2)
        /* etc */
        mult $f1,$f2,$f0
        mult $f3,$f4,$f3
        /* etc */
        addt $f0,$f3,$f0
        /* etc */

I've tried this in assembler with 4 MACs (i.e. prefetching 8 values), interleaving
the ldts with the mults (after a mult, its source registers are free and can be
loaded with the next data). This cut the required processor cycles by 50%!

IMHO this can also be achieved in C with the right compiler hints (register
variables) and careful coding.

Just my 0.02 (hey, where's the euro key?)
--
        Bye

         http://www.veryComputer.com/~acher/
          "Oh no, not again !" The bowl of petunias