Hotchips presentation of the 21164

Post by Burkhard Neidecker-Lutz » Sat, 20 Aug 1994 01:15:55



                Transcript of HOTCHIPS VI presentation of
                      the 21164 microprocessor

Key attributes:

        new design (not like 21064 -> 21064A)
        4-way issue superscalar
        Large on-chip L2 cache
        7-stage integer pipeline
        9-stage floating point pipeline
        low latencies at high clock rate
        high-throughput memory subsystem

Other properties:

        40b physical address  (1 Terabyte)
        43b virtual address   (8 Terabyte)
        128b external cache interface
        L3 cache controller integrated
        Instruction translation buffer 48 entries
        Data translation buffer 64 entries
        16.5 mm x 18.1 mm die size (slightly smaller than original Pentium)
        0.5 micron, 4-layer metal CMOS5 process

Execution pipelines:

        Integer Pipeline 0: arith, logical, ld/st, shift
        Integer Pipeline 1: arith, logical, ld, br/jmp, Int mul
        FP Pipeline 0: add, subtract, compare, FP branch
        FP Pipeline 1: multiply
        FP div hangs off FP pipe 0, but runs independently

Latencies:

        Most int ops                    1
        CMOV                            2
        Int mul                         8 - 16
        Float ops                       4
        loads (L1 cache hit)            2
        compare or logical op to
        CMOV or conditional BR          0

Onchip data caches:

        dual-ported L1 data cache (8Kbyte, write through, non-blocking)
        On-Chip L2 cache (96Kbyte,  3-way set assoc., write back, pipelined)  
        Miss Address File (MAF), 6 entry, between L1 and L2
        MAF merges loads to the same cache block
        Up to 21 loads, multiple loads merge regardless of order
        Up to two register file fills per cycle
        Bus Address File (BAF),  2 entry, between L2 and external memory
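For intuition, the MAF merging behaviour described above can be sketched as a toy model (hypothetical Python; the 6-entry depth and 32-byte block size come from the figures above, everything else is invented for illustration and is not DEC's implementation):

```python
BLOCK = 32     # L1 cache block size in bytes (from the figures above)
ENTRIES = 6    # MAF depth

class MissAddressFile:
    """Toy model of load-miss merging between L1 and L2."""

    def __init__(self):
        # outstanding block address -> list of (offset, dest register)
        self.entries = {}

    def load_miss(self, addr, dest_reg):
        """Record a load miss, merging with an in-flight miss to the
        same block. Returns 'merged', 'allocated', or 'stall'."""
        block = addr - (addr % BLOCK)
        if block in self.entries:              # merge, regardless of order
            self.entries[block].append((addr % BLOCK, dest_reg))
            return 'merged'
        if len(self.entries) >= ENTRIES:       # MAF full: the load must wait
            return 'stall'
        self.entries[block] = [(addr % BLOCK, dest_reg)]
        return 'allocated'

    def fill(self, block):
        """Block arrives from L2: complete every merged load at once."""
        return self.entries.pop(block, [])
```

Two loads to the same 32-byte block thus cost one L2 access between them, which is the point of the merging.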

L3 cache (off-chip)

        Direct-mapped write-back superset of L2 cache
        Up to 2 outstanding reads
        Programmable wave pipelining
        L3 cache is optional

Instruction prefetching

        Aggressive prefetching from L2 cache,
        At least three 32-byte blocks ahead of the current issue point
        Continuous integer instruction issue out of L2 cache (2 per cycle)
        60% of peak issue rate possible out of L2 cache (2.4 per cycle)

Latency and bandwidth of memory operations

                Latency (cycles) Bandwidth (bytes/cycle)

        L1              2               16
        L2              8               16
        L3              >= 12           <= 4

        L1 cache block size 32 bytes
        L2, L3 cache block sizes 64 bytes (with 32-byte block size option)
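Combining the two tables above: the best-case time to deliver a whole block from each level follows directly (a quick sanity check in Python; it treats latency and transfer as strictly sequential, which is a simplification, and the L3 numbers use its quoted bounds, so they are an optimistic limit):

```python
# Latency (cycles) and bandwidth (bytes/cycle) from the table above.
latency   = {'L1': 2, 'L2': 8, 'L3': 12}    # L3 quoted as ">= 12"
bandwidth = {'L1': 16, 'L2': 16, 'L3': 4}   # L3 quoted as "<= 4"

def fill_cycles(level, block_bytes):
    """Best-case cycles to deliver one full cache block."""
    return latency[level] + block_bytes / bandwidth[level]

print(fill_cycles('L1', 32))   # 4.0
print(fill_cycles('L2', 64))   # 12.0
print(fill_cycles('L3', 64))   # 28.0 -- and that's the optimistic bound
```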

Cycle count improvements over the 21064/21064A

                                21164           21064/21064A
        shifts/byte ops         1               2
        int mul                 8-16            19-23
        cmp->branch             0               1
        float ops               4               6
        L1 data cache           2               3

                Burkhard Neidecker-Lutz

GLASS Project, CEC Karlsruhe
Advanced Technology Group, Digital Equipment Corporation


Hotchips presentation of the 21164

Post by Christian B Edstr » Sat, 20 Aug 1994 03:14:50


--
What type of memory is L3 cache?  I've never heard of it before.  Is it
just another level of SRAM after the L2?  What is the read time for it
(in general)?  Thanx.

                                David Golombek



Hotchips presentation of the 21164

Post by Doug Siebert » Sat, 20 Aug 1994 03:48:39



>            Transcript of HOTCHIPS VI presentation of
>                  the 21164 microprocessor

[...]

Quote:>Onchip data caches:
>    dual-ported L1 data cache (8Kbyte, write through, non-blocking)
>    On-Chip L2 cache (96Kbyte,  3-way set assoc., write back, pipelined)  
>    Miss Address File (MAF), 6 entry, between L1 and L2
>    MAF merges loads to the same cache block
>    Up to 21 loads, multiple loads merge regardless of order
>    Up to two register file fills per cycle
>    Bus Address File (BAF),  2 entry, between L2 and external memory

Could you explain more about the L1 & L2 cache both being onchip in this
design?  Is the very large (for on-chip!) L2 cache implemented in a different
way with fewer and/or denser transistors that slowed it down enough it
couldn't work as an L1 cache?  Or was it simple timing considerations that
dictated this separation?  Basically what I'm asking is why not a 64-104K L1
cache -- was it timing or die size that caused the split into two onchip
caches?

--
Doug Siebert             |  I have a proof that everything I have stated above


Hotchips presentation of the 21164

Post by Dirk Grunwald » Sat, 20 Aug 1994 06:21:06


DS> Could you explain more about the L1 & L2 cache both being onchip
DS> in this design?  Is the very large (for on-chip!) L2 cache
DS> implemented in a different way with fewer and/or denser
DS> transistors that slowed it down enough it couldn't work as an L1
DS> cache?  Or was it simple timing considerations that dictated this
DS> separation?  Basically what I'm asking is why not a 64-104K L1
DS> cache -- was it timing or die size that caused the split into two
DS> onchip caches?
--

There was an interesting paper at ISCA '94 by Norm Jouppi about this
topic. Basically, you'll get better overall performance if you use a
smaller 8KB L1 and a larger (and possibly slower) L2 than if you had a
single large L1.
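The argument can be made concrete with a back-of-the-envelope average-memory-access-time comparison. All the miss rates and the memory latency below are invented for illustration; the 2:1 access-time penalty for the single big cache matches the figure quoted elsewhere in the thread:

```python
def amat_two_level(t1, m1, t2, m2, t_mem):
    """Average access time: L1, then L2 on a miss, then memory."""
    return t1 + m1 * (t2 + m2 * t_mem)

def amat_one_level(t1, m1, t_mem):
    """Average access time for a single cache in front of memory."""
    return t1 + m1 * t_mem

# Assumed: 8KB L1 (2 cycles, 8% miss), 96KB L2 (8 cycles, 20% local
# miss), ~40-cycle memory.  A single large L1 might miss as rarely as
# the two-level pair (1.6% global) but run at twice the access time.
split = amat_two_level(2, 0.08, 8, 0.20, 40)   # 3.28 cycles
big   = amat_one_level(4, 0.08 * 0.20, 40)     # 4.64 cycles
print(split, big)                              # the split wins
```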


Hotchips presentation of the 21164

Post by Michael Gordon Weav » Sat, 20 Aug 1994 08:51:12



Quote:

>What type of memory is L3 cache?  I've never heard of it before.  Is it
>just another level of SRAM after the L2?  What is the read time for it
>(in general)?  Thanx.

The 21164 has two levels of cache on the chip. So the external cache is
called L3 (i.e. level three cache).

I think that the presenter at Hot Chips said that you could program the
wait states on the L3 cache, but I may be misremembering. The paper only says
that the L3 cache has a bandwidth of less than or equal to 4 bytes per
processor cycle (300 MHz clock). The bus connecting L3 to the CPU is 16
bytes wide.

Michael.


Hotchips presentation of the 21164

Post by Gints Kliman » Sat, 20 Aug 1994 10:17:19



Quote:(Doug Siebert) writes:


|>
|> >              Transcript of HOTCHIPS VI presentation of
|> >                    the 21164 microprocessor
|>
|> [...]
|>
|> >Onchip data caches:
|>
|> >      dual-ported L1 data cache (8Kbyte, write through, non-blocking)
|> >      On-Chip L2 cache (96Kbyte,  3-way set assoc., write back, pipelined)
|> >      Miss Address File (MAF), 6 entry, between L1 and L2
|> >      MAF merges loads to the same cache block
|> >      Up to 21 loads, multiple loads merge regardless of order
|> >      Up to two register file fills per cycle
|> >      Bus Address File (BAF),  2 entry, between L2 and external memory

96 KByte on-chip cache?  96 KBytes = 786432 bits.  At 6 transistors/cell,
that approaches 5 million transistors.  At 4/cell, that's still over 3
million transistors.
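Redoing the arithmetic above in Python (data bits only; tags, valid bits and
the set-associative overhead would add more):

```python
bits = 96 * 1024 * 8       # 96 KByte of data storage
print(bits)                # 786432
print(bits * 6)            # 4718592 transistors at 6T/cell (~5 million)
print(bits * 4)            # 3145728 at 4T/cell (just over 3 million)
```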


Hotchips presentation of the 21164

Post by Krste Asanovic » Sat, 20 Aug 1994 12:38:30


|> 96 KByte on-chip cache?  96 KBytes = 786432 bits.  At 6 transistors/cell,
|> that approaches 5 million transistors.  At 4/cell, that's still over 3
|> million transistors.

9.3 million transistors total.

--

International Computer Science Institute,     phone: +1 (510) 642-4274 x143
Suite 600, 1947 Center Street,                  fax: +1 (510) 643-7684
Berkeley, CA 94704-1198, USA                   http://http.icsi.berkeley.edu


Hotchips presentation of the 21164

Post by Krste Asanovic » Sat, 20 Aug 1994 12:42:28


|> Could you explain more about the L1 & L2 cache both being onchip in this
|> design?  Is the very large (for on-chip!) L2 cache implemented in a different
|> way with fewer and/or denser transistors that slowed it down enough it
|> couldn't work as an L1 cache?  Or was it simple timing considerations that
|> dictated this separation?  Basically what I'm asking is why not a 64-104K L1
|> cache -- was it timing or die size that caused the split into two onchip
|> caches?

Timing, and the fact that the 8KB primary data cache is dual
ported. This is true dual porting with two read ports per bit, not
some form of interleaved cache banks. Only one store per cycle,
though. (Single-ended reads and differential writes, perhaps?)

Dual porting adds significant area; the L2 cache can use much denser
single-ported cells.

--

International Computer Science Institute,     phone: +1 (510) 642-4274 x143
Suite 600, 1947 Center Street,                  fax: +1 (510) 643-7684
Berkeley, CA 94704-1198, USA                   http://http.icsi.berkeley.edu


Hotchips presentation of the 21164

Post by Krste Asanovic » Sat, 20 Aug 1994 12:52:36



|>           Transcript of HOTCHIPS VI presentation of
|>                 the 21164 microprocessor

|> Onchip data caches:
|>
|>   dual-ported L1 data cache (8Kbyte, write through, non-blocking)
|>   On-Chip L2 cache (96Kbyte,  3-way set assoc., write back, pipelined)  
|>   Miss Address File (MAF), 6 entry, between L1 and L2
|>   MAF merges loads to the same cache block
|>   Up to 21 loads, multiple loads merge regardless of order
|>   Up to two register file fills per cycle
|>   Bus Address File (BAF),  2 entry, between L2 and external memory

|> Latency and bandwidth of memory operations
|>
|>           Latency (cycles) Bandwidth (bytes/cycle)
|>
|>   L1              2               16
|>   L2              8               16
|>   L3           >= 12         <= 4
|>
|>   L1 cache block size 32 bytes
|>   L2, L3 cache block sizes 64 bytes (with 32-byte block size option)

What is the pipelining scheme of the L2 cache?

The talk mentioned that instruction fetching could be sustained at 2.4
instructions/cycle from the second level cache. Given the three 32B
block ahead instruction prefetching scheme, I'd assume that this
represents the real peak L2 bandwidth, and that's only 2.4*4=9.6B per
cycle. Assuming 32B lines transferred in an X-1-1 pattern, that would
imply 3.3 cycles per 32B block from L2, which isn't an integral number
of cycles.
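Redone numerically (same assumptions as the paragraph above: 4-byte Alpha
instructions, 2.4 sustained issues per cycle out of L2):

```python
issue_rate = 2.4            # sustained instructions/cycle out of L2
instr_bytes = 4             # every Alpha instruction is 4 bytes
fetch_bw = issue_rate * instr_bytes
print(fetch_bw)             # 9.6 bytes/cycle implied fetch bandwidth
print(32 / fetch_bw)        # ~3.33 cycles per 32B block -- not integral
```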

--

International Computer Science Institute,     phone: +1 (510) 642-4274 x143
Suite 600, 1947 Center Street,                  fax: +1 (510) 643-7684
Berkeley, CA 94704-1198, USA                   http://http.icsi.berkeley.edu


Hotchips presentation of the 21164

Post by Jan Vorbrueggen » Sat, 20 Aug 1994 18:00:18



Quote:(Burkhard Neidecker-Lutz) writes:

                   Transcript of HOTCHIPS VI presentation of
                         the 21164 microprocessor

Thanks, Burkhard, for the summary...just a few questions. All properties are
defined in terms of cycles. What internal (pipeline)/external speeds will the
chip have? What is the announced availability of chips/systems incorporating
it?

           40b physical address  (1 Terabyte)

Up from 34 bits, correct? Thus, six more pins.

           43b virtual address   (8 Terabyte)

Same as before.

           128b external cache interface

Same as before (?).

           Integer Pipeline 0: arith, logical, ld/st, shift
           Integer Pipeline 1: arith, logical, ld, br/jmp Int mul

Interesting decision to have two load, but only one store unit. Makes sense to
me.

           FP div hangs off FP pipe 0, but runs independently

I.e., consumes an FP pipe 0 issue slot, but is independent afterwards? I would
then expect a bubble to appear when the result is written back to the register
file. Or does this have to be software-controlled as with some instructions on
the MIPS?

   Latencies:
           Int mul                              8 - 16

Does the latency depend on whether high bits are set, and thus the order of
multiplication might be important for performance?

           loads (L1 cache hit)                 2

Ugh!

   Onchip data caches:

           dual-ported L1 data cache (8Kbyte, write through, non-blocking)

Thus, instructions always come from the L2 cache, I presume, with the three
32-byte-block prefetcher being the buffer between the 2-words/cycle read rate
from the L2 cache and the 4-words/cycle issue unit?

           On-Chip L2 cache (96Kbyte,  3-way set assoc., write back, pipelined)  
Wow! At 4 transistors/bit that's 3.15M transistors for the L2 cache alone!

           Miss Address File (MAF), 6 entry, between L1 and L2
           MAF merges loads to the same cache block
           Up to 21 loads, multiple loads merge regardless of order
           Up to two register file fills per cycle
           Bus Address File (BAF),  2 entry, between L2 and external memory

What about writes? Are they pipelined and merged as well? Any info on how this
design and its detailed parameters were chosen (trade-offs involved etc)?

   L3 cache (off-chip)
           Programmable wave pipelining

Err, what's wave pipelining?

   Instruction prefetching
           At least three 32-byte blocks ahead of the current issue point

How are branches handled, especially if the branch address is known early?

        Jan


Hotchips presentation of the 21164

Post by Zalman Stern » Sat, 20 Aug 1994 17:53:13


Jan Vorbrueggen writes

Quote:>       FP div hangs off FP pipe 0, but runs independently

> I.e., consumes an FP pipe 0 issue slot, but is independent afterwards? I
> would then expect a bubble to appear when the result is written back to
> the register file. Or does this have to be software-controlled as with
> some instructions on the MIPS?

The architecture does not allow for software-controlled scheduling. (Are  
there any ISA extensions in the 21164? Somehow I doubt it.) One can save the  
result of the divide in a buffer until another divide is issued. The second  
divide will provide a free writeback slot. The buffer can be bypassed or  
not, in which case accessing the result of the divide would force the  
writeback. What is the floating-point divide latency? If it is anything like  
the 21064 (~30 cycles single precision, ~60 cycles double precision) then an  
extra cycle for writeback won't matter much.

Quote:>    Latencies:
>       Int mul                              8 - 16

> Does the latency depend on whether high bits are set, and thus the order
> of multiplication might be important for performance?

If the implementation is like the 21064, the shorter time is for a longword  
multiply (32 bit result) and the longer time is for a quadword multiply (64  
bit result). The architecture has distinct instructions for each.
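For illustration, the distinction referred to (MULL for a 32-bit longword result, MULQ for a 64-bit quadword result) can be modeled like this; a sketch of the architectural semantics only, nothing about the hardware:

```python
def mull(a, b):
    """Longword multiply: low 32 bits of the product, sign-extended."""
    r = (a * b) & 0xFFFFFFFF
    return r - (1 << 32) if r & (1 << 31) else r

def mulq(a, b):
    """Quadword multiply: low 64 bits of the product."""
    return (a * b) & 0xFFFFFFFFFFFFFFFF
```

mull(0x7FFFFFFF, 2) wraps to -2, while mulq keeps the full 64-bit product; the wider datapath pass is what costs the extra cycles.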

Quote:>    L3 cache (off-chip)
>       Programmable wave pipelining

> Err, what's wave pipelining?

I think it is a technique in which the propagation delay of a bus is used to
carry multiple simultaneous operations. In this case, perhaps multiple
addresses will be present on the cache request bus at once? Doesn't HP use
something similar? Is it "programmable" in the digital sense or the analog
sense?
--

Adobe Systems, 1585 Charleston Rd., POB 7900, Mountain View, CA 94039-7900
It seems like once people grow up, they have no idea what's cool. - Calvin

Hotchips presentation of the 21164

Post by Burkhard Neidecker-Lutz » Sat, 20 Aug 1994 17:56:25




>Could you explain more about the L1 & L2 cache both being onchip in this
>design? Basically what I'm asking is why not a 64-104K L1
>cache -- was it timing or die size that caused the split into two onchip
>caches?

I didn't design the thing, but there are a few basic things about caches
(and a lot of suggested reading from much brighter people than myself).

1. size/organization vs. speed:

        In any given technology, a larger cache is slower than a smaller
        one, and a cache of higher associativity is slower than one of
        lower associativity. If you want to see how much, get yourself
        WRL Research Report 93/5, "An Enhanced Access and Cycle Time
        Model for On-Chip Caches" by Steven Wilton and Norman Jouppi.
        It comes with software to predict (within 10% of a real SPICE
        simulation) how fast a given on-chip cache would be. The
        software is at

                gatekeeper.dec.com:pub/DEC/cacti.tar.Z

        While the absolute times it reports are going to be wrong unless
        your semiconductor process matches the model, the relative speeds
        are a good approximation.

        In the case of the 21164 caches (8K, direct mapped, 32 byte block
        vs. 96 K, 3-way assoc. 64 byte block), the model predicts a cycle
        time difference of roughly 2:1. So from a timing perspective the
        basic clock would have to be twice as slow to accommodate this cache.

2. Separate caches vs. bandwidth:

        Separate caches have twice the bandwidth of a unified cache,
        unless you multiport it. In the 21164 the L1 data cache is
        already dual-ported, and the I-cache needs to deliver up to 4
        instructions each cycle. A unified L1 would have meant triple
        porting or something else that wouldn't be very practical from
        a cycle time perspective.

3. Separate caches vs. unified caches for wildly different codes:

        In contrast to point 2, unified caches have lower miss rates
        than separate caches of equal total size, because they can
        dynamically adapt to different types of code. As the 21164
        attempts to be good both at scientific code (tight loops with a
        very small I-cache footprint and terrible data locality) and at
        commercial code (database code: tens of KBytes of straight-line
        code without a loop in sight), a unified cache is an obvious
        choice. For cycle time and bandwidth reasons it can't be the
        first-level cache, so this organization is the next best thing.

        For an in-depth discussion of the tradeoffs, read WRL Research
        Report 93/3,

         "Tradeoffs in Two-Level On-Chip Caching"

        by the same authors as mentioned above. WRL techreports are
        available from

                FTP: gatekeeper.dec.com:pub/DEC/WRL/research-reports/*

                    Burkhard Neidecker-Lutz

GLASS Project, CEC Karlsruhe
Advanced Technology Group, Digital Equipment Corporation

"August 94: DEC 7000/700, SPECint 193.8, SPECfp 292.6, 275 Mhz 21064A"


Hotchips presentation of the 21164

Post by Burkhard Neidecker-Lutz » Sat, 20 Aug 1994 19:27:57



>96 KByte on-chip cache?  96 KBytes  = 786432 bits.  At 6 transistors
>/cell, that approaches 5 million transistors.

That's 96 KByte + 8 KByte + 8 KByte + the TLBs plus the MAF. The 21164
has slightly more than 9 million transistors. What was the question?

                Burkhard Neidecker-Lutz

GLASS Project, CEC Karlsruhe
Advanced Technology Group, Digital Equipment Corporation

"VLIW only looks good to people who cannot figure out how to issue
 a billion instructions per second using 1994/95 superscalar technology"


Hotchips presentation of the 21164

Post by Jan Vorbrueggen » Sun, 21 Aug 1994 01:19:06



Quote:(Burkhard Neidecker-Lutz) writes:

           In any given technology, a larger cache is slower than a smaller
           one and a cache of higher associativity is slower than one of
           lesser.

Well, Inmos seems to be, then, the lone preacher in the desert with their
decision to go with a fully associative, pseudo-random replacement cache for
the T9000. They're also saying that their solution is more economical on power
than small set-associative caches.

        Jan


Hotchips presentation of the 21164

Post by Michael Brown » Sat, 20 Aug 1994 07:52:44


Michael Brown
Market Development Manager
Supercomputing Systems Division
Silicon Graphics Inc.


telephone: +1(415)390.35.48
telefax: +1(415)390.35.62


Alpha 21164


ZKS> Okay, this is not a rhetorical question. This 21164 looks like a
ZKS> hot chip, so what systems will it be in, and where could I get
ZKS> one outside a DEC machine running (ugh!!) VMS?
--

How about in a DEC system running OSF/1 or Windows-NT?

Rumour has it that systems based on the 21164 will be out early next
year. In the meantime, we can just "make do" with the 275 MHz 21064A.
