Alpha Architecture Quirks

Post by Keith Scidmo » Tue, 15 Dec 1992 16:29:41



Can any of you DEC people on the net answer the following 21064-AA Alpha
Questions?  You can mail your replies or post them if I have my facts
wrong or you feel the need to continue this thread.

1)  I read that the 21064 divide unit is not pipelined and takes one cycle
    per bit to produce a result.  Isn't this going to kill floating point
    performance?  Divides by FP constants can be converted by the compiler,
    but what about the rest of the divides?  Doesn't the scoreboard have to
    stall for 64 cycles to get the result of a divide with data dependencies?
    I'm led to think that DEC isn't interested in the FP performance of this
    chip, or do I have my facts wrong?

2)  The 21064 has no integer divide.  Again, isn't this going to make math
    performance poor?  Divides by integer constants can be converted to
    multiplies, but this is only a partial solution.

3)  The manuals say that the 21064 is superpipelined.  Where?  How can this
    claim be justified in light of the floating point divider not being
    pipelined at all?  It seems to me that all issues are superscalar but
    that there is no superpipelining.

4)  In one place I'm sure I read that all memory accesses are 64-bit, yet
    there are provisions for longword writes discussed elsewhere (in
    multiprocessor environments, for example).  Which is it?

Thanks in advance.

Keith R. Scidmore

P.S.  I cancelled two earlier versions of this article that were mangled.
I hope it worked.

 
 
 


Post by John F Ca » Wed, 16 Dec 1992 09:41:58




>1)  I read that the 21064 divide unit is not pipelined and takes one cycle
>    per bit to produce a result.  Isn't this going to kill floating point
>    performance? [...] I'm
>    led to think that DEC isn't interested in the FP performance of this chip,
>    or do I have my facts wrong?

A double precision floating point divide is slow, but look at the benchmark
results: the Alpha AXP 10000 is rated at 200 SPECfp92.  If you buy the
low-end model you get 112 SPECfp92.  Obviously divide performance isn't
critical to this large selection of real-world applications.

In any case, considering the clock speed difference between Alpha AXP and
competitive systems you are better off counting nanoseconds instead of
cycles:

                DEC Alpha AXP                IBM RS/6000
                        133 MHz  150 MHz             41 MHz   62.5 MHz
                cycles  ns       ns          cycles  ns       ns
FP mult or add  6       45       40          2       48       32
FP divide       63      473      420         20      480      320

This shows floating point latency of the low/mid range Alpha systems to be
similar to the IBM RS/6000.  But this is the worst-case* performance.  The
Alpha can execute 6 independent floating point operations in parallel, and
most floating point code can use at least part of this parallelism.

*Both Alpha and RS/6000 are slower when handling infinities, NaNs, or
denormals, but for most applications this is not an important consideration.
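
To make the latency-versus-throughput point concrete, here is a rough Python sketch. It assumes an idealized pipeline that can start one independent operation per cycle and never stalls. That is a simplification for illustration, not a model of the real 21064 issue rules. The function names are made up for this example.

```python
def dependent_chain_cycles(n_ops, latency):
    """Cycles for n_ops where each op needs the previous result.

    Nothing can overlap, so the full latency is paid every time.
    """
    return n_ops * latency


def independent_ops_cycles(n_ops, latency, issue_per_cycle=1):
    """Cycles for n_ops with no data dependencies (idealized pipeline).

    Assumption: issue_per_cycle ops can start each cycle; the last op
    starts after the others have issued and finishes `latency` cycles later.
    """
    if n_ops == 0:
        return 0
    return (n_ops - 1) // issue_per_cycle + latency


if __name__ == "__main__":
    latency = 6  # FP mult/add latency in cycles, from the table above
    for n in (1, 6, 60):
        print(n, dependent_chain_cycles(n, latency),
              independent_ops_cycles(n, latency))
```

Under these assumptions, six dependent adds cost 36 cycles while six independent adds cost 11, which is why the per-operation latency in the table is a worst case. The divide is a different story, since an unpipelined divider cannot overlap divides at all.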

>3)  The manuals say that the 21064 is superpipelined.  Where?  How can this
>    claim be justified in light of the floating point divider not being
>    pipelined at all?

I first heard the term "superpipelined" applied to the MIPS R4000, which
does not pipeline integer multiply or divide (I don't know if it pipelines
floating point divide).

        John Carr

        All facts and opinions in this article are the responsibility
        of the author, not DEC.

 
 
 


Post by Peter May » Wed, 16 Dec 1992 11:12:42



>2)  The 21064 has no integer divide.  Again, isn't this going to make math
>    performance poor?  Integer constants can be converted to multiplies
>    but this is only a partial solution.

From the Alpha Architecture Handbook, p A-12, or the Alpha Architecture
Reference Manual, also p A-12:

Integer division does not exist as a hardware opcode. Division by a
constant can always be done via UMULH of another appropriate constant,
followed by a right shift. General quadword division by true variables
can be done via a subroutine. The subroutine could test for small
divisors (less than about 1000 in absolute value) and for those, do a
table lookup on the exact constant and shift count for an UMULH/shift
sequence. For the remaining cases, a table lookup on about a 1000-entry
table and a multiply can give a linear approximation to 1/divisor that
is accurate to 16 bits. Using this approximation, a multiply and a
back-multiply and a subtract can generate one 16-bit quotient "digit"
plus a 48-bit new partial dividend. Three more such steps can generate
the full quotient. Having prior knowledge of the possible sizes of the
divisor and dividend, normalizing away leading bytes of zeros, and
performing an early-out test can reduce the average number of
multiplies to about 5 (compared to a best case of 1 and a worst case of
9).
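
The UMULH-plus-shift idea for constant divisors can be sketched in Python, which has arbitrary-precision integers, so UMULH is simulated here. This is not the handbook's code: the constant selection follows the well-known Granlund-Montgomery scheme for unsigned division, which needs one extra subtract/add fixup so the constant fits in 64 bits. All names (`umulh`, `magic_for`, `div_by_const`, `divide`, `SMALL_DIVISORS`) are hypothetical, only unsigned operands are handled, and the large-divisor path is left as a placeholder rather than the multi-step reciprocal method described above.

```python
MASK64 = (1 << 64) - 1


def umulh(a, b):
    """High 64 bits of the unsigned 128-bit product (simulates UMULH)."""
    return ((a & MASK64) * (b & MASK64)) >> 64


def magic_for(d):
    """Return (multiplier, shift1, shift2) for unsigned 64-bit division by d."""
    if not 1 <= d <= MASK64:
        raise ValueError("divisor must be in 1..2**64-1")
    l = (d - 1).bit_length()                    # ceil(log2(d))
    m = ((1 << 64) * ((1 << l) - d)) // d + 1   # fits in 64 bits
    return m, min(l, 1), max(l - 1, 0)


def div_by_const(n, magic):
    """Compute n // d for 0 <= n < 2**64 using one UMULH plus shifts."""
    m, sh1, sh2 = magic
    t = umulh(m, n)
    # t <= n, so the intermediate sum cannot overflow 64 bits on real hardware.
    return (t + ((n - t) >> sh1)) >> sh2


# Table for small divisors, mirroring the "less than about 1000" suggestion.
SMALL_DIVISORS = {d: magic_for(d) for d in range(1, 1000)}


def divide(n, d):
    """Unsigned divide: table lookup for small d, placeholder otherwise."""
    magic = SMALL_DIVISORS.get(d)
    if magic is not None:
        return div_by_const(n, magic)
    return n // d  # placeholder for the reciprocal-approximation path


if __name__ == "__main__":
    print(divide(1_000_000_007, 10))
    print(div_by_const(MASK64, magic_for(7)) == MASK64 // 7)
```

A compiler would compute the constants at compile time, so the run-time cost for a constant divisor is a single UMULH plus a few cheap integer operations.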

>Thanks in advance.

>Keith R. Scidmore

PJDM
--
Peter Mayne                     | My statements, not Digital's.
Digital Equipment Corporation   |
Canberra, ACT, Australia        | "AXP!": Bill the Cat
 
 
 


Post by Will Walk » Wed, 23 Dec 1992 04:54:57





> >1)  I read that the 21064 divide unit is not pipelined and takes one cycle
> >    per bit to produce a result.  Isn't this going to kill floating point
> >    performance? [...] I'm
> >    led to think that DEC isn't interested in the FP performance of this chip,
> >    or do I have my facts wrong?

> A double precision floating point divide is slow, but look at the benchmark
> results: the Alpha AXP 10000 is rated at 200 SPECfp92.  If you buy the
> low-end model you get 112 SPECfp92.  Obviously divide performance isn't
> critical to this large selection of real-world applications.

Quite right, SPECfp92 is not very sensitive to divide or square root
latency.  Which applications are?  Possibly 3D graphics.  Perhaps Mr.
Scidmore's applications are sensitive to divide latency since he
brought it up?  I don't know.

> In any case, considering the clock speed difference between Alpha AXP and
> competitive systems you are better off counting nanoseconds instead of

  ^^^^^^^^^^^^^^^^^^^

  That's right, you should compare latencies in nanoseconds, not cycles.
  I'll add an HP machine to your table.

>                 DEC Alpha AXP                IBM RS/6000
>                         133 MHz  150 MHz             41 MHz   62.5 MHz
>                 cycles  ns       ns          cycles  ns       ns
> FP mult or add  6       45       40          2       48       32
> FP divide       63      473      420         20      480      320

                           HP 735
                           100 MHz
                        cycles  ns
  FP mult or add         2       20
  FP divide             15      150
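
For anyone who wants to redo the cycles-to-nanoseconds arithmetic, a small Python sketch follows. The cycle counts and clock rates are the ones quoted in this thread. The helper name `latency_ns` and the layout of the `MACHINES` list are just for illustration. The posted figures look truncated or rounded, so the computed values can differ from the tables by a few ns (most visibly for the 41 MHz column).

```python
def latency_ns(cycles, clock_mhz):
    """Latency in nanoseconds for `cycles` clock cycles at `clock_mhz` MHz."""
    return cycles * 1000.0 / clock_mhz


# (machine, clock in MHz, FP mult/add cycles, FP divide cycles),
# all values taken from the tables in this thread.
MACHINES = [
    ("DEC Alpha AXP", 133.0, 6, 63),
    ("DEC Alpha AXP", 150.0, 6, 63),
    ("IBM RS/6000", 41.0, 2, 20),
    ("IBM RS/6000", 62.5, 2, 20),
    ("HP 735", 100.0, 2, 15),
]

if __name__ == "__main__":
    for name, mhz, muladd, divide in MACHINES:
        print("%-14s %6.1f MHz  mult/add %6.1f ns  divide %6.1f ns"
              % (name, mhz, latency_ns(muladd, mhz), latency_ns(divide, mhz)))
```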

- Will Walker
  my own opinions

 
 
 


Post by Dileep Bhandark » Wed, 23 Dec 1992 19:03:11



>Quite right, SPECfp92 is not very sensitive to divide or square root
>latency.  Which applications are?  Possibly 3D graphics.  Perhaps Mr.
>Scidmore's applications are sensitive to divide latency since he
>brought it up?  I don't know.

The ora benchmark in cfp92 is very sensitive to the speed of square root.

/d

 
 
 


Post by Vinod Grov » Thu, 24 Dec 1992 03:12:31



>Quite right, SPECfp92 is not very sensitive to divide or square root
>latency.  Which applications are?  

The benchmark 048.ora in SPECfp92 is very sensitive to the latency of
sqrt.

Vinod Grover

 
 
 
