The Locality Index

The Locality Index

Post by Mark Thorson » Mon, 16 Dec 1996 04:00:00



I'm still trying to figure out the implications of the
excellent article "Interconnect scaling -- the real
limiter to high performance ULSI" by Mark Bohr in the
September 1996 issue of _Solid_State_Technology_.

The author is director of process architecture and
integration at Intel, and he worked out a lot of good
numbers for technology trends for semiconductors.

He defines one semiconductor process generation as an
0.7x reduction in die size.  Given that definition,
he says wire length also goes down at 0.7x per generation,
die size goes down at 0.5x per generation (assuming that
the design stays constant -- this is just the feature
size trend squared), and RC delay goes up at 1.3x per
generation.  The calculation of the latter is somewhat
complex, involving things like metal line pitch and aspect
ratio.
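
Bohr's actual calculation uses real pitch and aspect-ratio
numbers, but a crude back-of-the-envelope sketch (my own
simplification, not his) shows why the delay per unit length
rises at all when everything shrinks together:

    # Toy model of RC delay per mm of wire, arbitrary units.
    # Assumes resistance/mm ~ 1/(width * thickness) and
    # capacitance/mm from a lateral term (thickness/spacing)
    # plus a vertical term (width/dielectric height).  These
    # assumptions are mine, not Bohr's.
    def rc_per_mm(width, thickness, spacing, dielectric_height):
        r = 1.0 / (width * thickness)
        c = thickness / spacing + width / dielectric_height
        return r * c

    s = 0.7  # one generation: all dimensions shrink 0.7x
    old = rc_per_mm(1.0, 1.0, 1.0, 1.0)
    new = rc_per_mm(s, s, s, s)
    print(new / old)   # ~2.0 -- naive scaling doubles delay per mm

Bohr's 1.3x is milder than that, presumably because pitch and
aspect ratio don't scale so naively in practice.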

The trend of wire delay going up is what troubles me.
The transistors get smaller and denser, the chips get
larger, but the wires get slower.  That is going to impact
computer architecture in subtle ways.  And the trend is
much worse than Bohr describes.

I propose a "locality index", i.e. the pressure on the
computer architect to organize the system as many small
functional blocks rather than fewer large blocks.  This
would be the product of:

wire delay (e.g. the delay for a signal to travel 1 mm)
transistors per mm2
die size

Wire delay increases the speed penalty for passing a
signal outside of the neighborhood of the transmitter.
More transistors per unit area increases the functionality
within the neighborhood.  Die size increases the number of
neighborhoods on a chip.

So, Bohr's number for wire delay is 1.3x per generation.
Transistors per mm2 goes up as the inverse square of
feature size, which is about 2x per generation.  Die size
is a bit more difficult to estimate.  Based on a chart
given at Emcon '94 by Integrated Circuit Engineering,
I would say it increases by about 1.5x per generation.
Note that this is a trend entirely driven by progress
in reducing defect density, which is the limiter on how
big a die can be economically manufactured.
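
Spelled out as a trivial Python sketch, just so the compounding
over several generations is visible:

    # Per-generation trends, from the numbers above.
    wire_delay = 1.3   # RC delay per mm (Bohr)
    density    = 2.0   # transistors per mm2 (inverse square of 0.7x)
    die_size   = 1.5   # economically manufacturable die area (ICE chart)

    per_generation = wire_delay * density * die_size
    print(per_generation)                       # 3.9

    for n in range(1, 5):
        print(n, round(per_generation ** n, 1)) # 3.9, 15.2, 59.3, 231.3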

Multiplying all these trends together, the locality
index is rising at 3.9x per generation.  That's a lot!
And you can see it just by looking at some microprocessor
chips.  Chips up to and including the 386 usually had one
main ALU/shifter/register data path that would be the
largest single structure on the chip.  There would usually
be a large PLA for the microcode, and maybe some other
large structures like the address generation path.  The
486 generation of chips introduced on-chip cache, which
would usually be the largest structure on the chip.

What these chips had in common was that a signal could
cross most of the chip in a fraction of a cycle.  When
that condition is present, wire delay is not a very
important driver for the architecture being used.

The first chip I remember that may have been affected
by wire delay was the Alpha, which had two levels of
on-chip cache.  However, this may have been due to loading
on the wires crossing the cache array; I'm not sure it was
truly wire delay.  The first-level cache offered single-
cycle access while the second-level cache was two cycles.

But what will life be like when we are looking at chips
with a very high locality index?  You won't see large
functional blocks that take up half the die.  You will see
many small functional blocks.  What will these blocks do?

I'm tempted to say they will be FPGA-like cells with
high functionality.  Perhaps instead of operating on bits,
they will operate on bytes, integers, or floating-point
numbers, with the ability to switch among these data
formats.

But won't VLIW allow control over many functional units?
I don't think so.  Somewhere there's going to be an
instruction dispatcher issuing multiple instructions
per cycle.  Some instructions will be to nearby units,
but some will be to units several cycles distant from
the dispatcher.  Can VLIW do that?  Maybe somebody's
already got a clever solution for that problem.  If so,
please inform me.  Also, not all units will be able to
communicate with each other in a single cycle unless
they're neighbors.

No, what I think I'm describing is actually a form of
cellular automata, albeit with state transition rules of
enormous complexity.  How does this differ from having
an array of von Neumann machines, like many multiprocessor
concepts?  I'm not sure that it does.  The latter is a
subset of the former, but there might be other forms
of the former that would be simpler, such as one or more
finite state machine controllers.

 
 
 

The Locality Index

Post by Phil Koopman » Mon, 16 Dec 1996 04:00:00



>I'm still trying to figure out the implications of the
>excellent article "Interconnect scaling -- the real
>limiter to high performance ULSI" by Mark Bohr in the
>September 1996 issue of _Solid_State_Technology_.
>...

I haven't played the CPU design game in a while, but I think you're
right -- physical locality is going to be an issue.  Locality used to
be a matter of whether two functions were on the same board (vs. the
backplane).  Then, whether they were on the same chip.  It appears
that the pressure is now to get functions on the same region of the
chip.

>... what I think I'm describing is actually a form of
>cellular automata, albeit with state transition rules of
>enormous complexity.

Well, bit-serial SIMD machines fit what you're saying, and they scale
pretty well.  Now, if only they would run Windows...

-- Phil



 
 
 

The Locality Index

Post by Mark Thorson » Mon, 16 Dec 1996 04:00:00



>He defines one semiconductor process generation as an
>0.7x reduction in die size.  Given that definition,

                   ^^^--that should be "feature", not "die"

You can look at something a zillion times, and on the
zillion+1th time see something you never noticed the
previous zillion times.

 
 
 

The Locality Index

Post by Bernd Paysan » Mon, 16 Dec 1996 04:00:00



> The first chip I remember that may have been affected
> by wire delay was the Alpha which had two levels of
> on-chip cache.  However, this may have been due to loading
> on the wires crossing the cache array, I'm not sure it was
> truly wire delay.  The first-level cache offered single-
> cycle access while the second-level cache was two cycles.

I don't think it's wire delay. The first-level cache on the 21164 is
direct mapped, so it's very simple and fast to access: no multiplexer
at the output. It's even possible to use the L1-cache result before
checking whether it was valid; on a cache miss, you can stop the
instruction that used the wrong value before it writes that value to a
register. I think they did this. On the 21264, there's a larger L1
cache again, with more than one way, and it has a two-cycle latency,
mostly because of the compare-and-select part.
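
A toy sketch of that trick as I understand it (not DEC's actual
pipeline, of course): the direct-mapped array hands back its data
immediately, the tag compare finishes a bit later, and on a mismatch
the consumer gets squashed before it writes a register.

    # Toy model of a direct-mapped L1 whose data can be used before
    # the tag check completes.  Purely illustrative; names and sizes
    # are made up.
    class DirectMappedL1:
        def __init__(self, lines=256, line_bytes=32):
            self.line_bytes = line_bytes
            self.nlines = lines
            self.tags = [None] * lines
            self.data = [0] * lines

        def probe(self, addr):
            index = (addr // self.line_bytes) % self.nlines
            tag = addr // (self.line_bytes * self.nlines)
            speculative_value = self.data[index]   # available right away
            hit = (self.tags[index] == tag)        # known one cycle later
            return speculative_value, hit

    cache = DirectMappedL1()
    value, hit = cache.probe(0x1000)
    if not hit:
        pass   # squash the dependent instruction, replay after the miss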

Wire delay depends on wire size. Local interconnections must shrink with
the overall feature size, since you need a certain number of local
interconnections per gate. However, larger interconnections may shrink
more slowly (or not at all) if you add new metal layers. They do. Some
of the new metal layers are used for ground and power supply. Some are
used for pad space (moving the pads over the active area rather than to
the side), but the rest can be used for wider interconnections.
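
Rough numbers for that effect (my own toy model, with crude assumptions
about resistance and capacitance per mm): a top-level wire that keeps
its old width and thickness keeps roughly its old delay per mm, while
the scaled local wires get about 2x worse.

    # Toy RC model: r/mm ~ 1/(w*t), c/mm ~ t/s + w/h, arbitrary units.
    def rc_per_mm(w, t, s, h):
        return (1.0 / (w * t)) * (t / s + w / h)

    shrink = 0.7
    local_old = rc_per_mm(1.0, 1.0, 1.0, 1.0)
    local_new = rc_per_mm(shrink, shrink, shrink, shrink)  # scales down
    fat_new   = rc_per_mm(1.0, 1.0, 1.0, 1.0)              # unscaled top layer
    print(local_new / local_old)   # ~2.0x slower per mm
    print(fat_new / local_old)     # 1.0x -- unchanged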

> But won't VLIW allow control over many functional units?
> I don't think so.  Somewhere there's going to be an
> instruction dispatcher issuing multiple instructions
> per cycle.  Some instructions will be to nearby units,
> but some will be to units several cycles distant from
> the dispatcher.  Can VLIW do that?  Maybe somebody's
> already got a clever solution for that problem.  If so,
> please inform me.  Also, not all units will be able to
> communicate with each other in a single cycle unless
> they're neighbors.

VLIW is fine-grain parallelism. With increasing parallelism, larger
grains of parallelism must be used, because fine-grain parallelism is
limited. So VLIW is good up to some degree of parallelism. It keeps
decoding simple, so the units are nearby. It makes resource splits
visible, so the registers are near the units working on them. For
larger parallelism, you must use other ways to parallelize your code.
Communication is still needed, but latency isn't the same issue: it's
no longer measured in fractions of a cycle, but in cycles.

I think the next level will use something we see between computers now:
switched networks, e.g. SCI to couple distant memories. Or perhaps
something in between (sending from register to register).

--
Bernd Paysan
"Late answers are wrong answers!"
http://www.informatik.tu-muenchen.de/~paysan/

 
 
 

The Locality Index

Post by Joe Hinrichs » Wed, 18 Dec 1996 04:00:00



[snip]

> He defines one semiconductor process generation as an
> 0.7x reduction in die size.  Given that definition,
> he says wire length also goes down at 0.7x per generation,
> die size goes down at 0.5x per generation (assuming that
> the design stays constant -- this is just the feature
> size trend squared), and RC delay goes up at 1.3x per
> generation.  The calculation of the latter is somewhat
> complex, involving things like metal line pitch and aspect
> ratio.

> The trend of wire delay going up is what troubles me.
> The transistors get smaller and denser, the chips get
> larger, but the wires get slower.  That is going to impact
> computer architecture in subtle ways.  And the trend is
> much worse than Bohr describes.

Pardon an ignorant question: does the 1.3x per generation
RC delay change mean that a signal propagates over the
inverse of 1.3 squared as much area?  This means that the
die size covered by the reduced signal is about .6x, while
the old gate set covers .5x; so as long as the gates don't
change, the RC factor doesn't hurt, except for two things:

        a) Gates tend to quadruple, i.e. die size never
           changes.  This has to be a law with someone's
           name on it already, yes?  Claim below is that
           it goes up at 1.5x per generation.

        b) While die size stays the same, time has to
           shrink.  Say 2x faster?  ((1.5 *2)/.6) = 5x
           net?

        c) Finer wires take less heat - voltage will be
           coming down, too.  MORE than 5x net?

So in other words there is a fence closing in at about
5x per generation?  So whenever this fence comes across
the die edge, the rules change a lot, yes?  Where is the
fence now?


> I propose a "locality index", i.e. the pressure on the
> computer architect to organize the system as many small
> functional blocks rather than fewer large blocks  This
> would be the product of:

> wire delay (e.g. the delay for a signal to travel 1 mm)
> transistors per mm2
> die size

> Wire delay increases the speed penalty for passing a
> signal outside of the neighborhood of the transmitter.
> More transistors per unit area increases the functionality
> within the neighborhood.  Die size increases the number of
> neighborhoods on a chip.

> So, Bohr's number for wire delay is 1.3x per generation.
> Transistors per mm2 goes up as the inverse square of
> feature size, which is about 2x per generation.  Die size
> is a bit more difficult to estimate.  Based on a chart
> given at Emcon '94 by Integrated Circuit Engineering,
> I would say it increases by about 1.5x per generation.
> Note that this is a trend entirely driven by progress
> in reducing defect density, which is the limiter on how
> big a die can be economically manufactured.

> Multiplying all these trends together, the locality
> index is rising at 3.9x per generation.

I missed your number; what was the math?


>That's a lot!
> And you can see it just looking at some microprocessor
> chips.  Chips up to and including the 386 usually had one
> main ALU/shifter/register data path that would be the
> largest single structure on the chip.  There would usually
> be a large PLA for the microcode, and maybe some other
> large structures like the address generation path.  The
> 486 generation of chips introduced cache, which would
> usually be the largest structure on the chip.

> What these chips had in common was that a signal could
> cross most of the chip in a fraction of a cycle.  When
> that condition is present, wire delay is not a very
> important driver for the architecture being used.

> The first chip I remember that may have been affected
> by wire delay was the Alpha which had two levels of
> on-chip cache.  However, this may have been due to loading
> on the wires crossing the cache array, I'm not sure it was
> truly wire delay.  The first-level cache offered single-
> cycle access while the second-level cache was two cycles.

> But what will life be like when we are looking at chips
> with a very high locality index?  You won't see large
> functional blocks that take up half the die.  You will see
> many small functional blocks.  What will these blocks do?

> I'm tempted to say they will be FPGA-like cells with
> high functionality.  Perhaps instead of operating on bits,
> they will operate on bytes, integers, or floating-point
> numbers, with the ability to switch among these data
> formats.

Not ready to stipulate the 'many' or the 'small' yet; if the
fence is at the die edge of the P6, then the P7 lump can still
be one-fifth of a P6, hardly math-ckt or FPGA territory.

But eventually we may have something like local autonomous
neurons.  Hmmmmmmm.  About as many cells as, say, the brain
of a house fly, but clocking seven to nine orders of
magnitude faster.  Lllllloook out!

> But won't VLIW allow control over many functional units?
> I don't think so.  Somewhere there's going to be an
> instruction dispatcher issuing multiple instructions
> per cycle.  Some instructions will be to nearby units,
> but some will be to units several cycles distant from
> the dispatcher.  Can VLIW do that?  Maybe somebody's
> already got a clever solution for that problem.  If so,
> please inform me.  Also, not all units will be able to
> communicate with each other in a single cycle unless
> they're neighbors.

Somebody's going to think of a new form of parallelism.

> No, what I think I'm describing is actually a form of
> cellular automata, albeit with state transition rules of
> enormous complexity.

Yeah - neurons.

> How does this differ from having
> an array of von Neumann machines, like many multiprocessor
> concepts?  I'm not sure that it does.  The latter is a
> subset of the former, but there might be other forms
> of the former that would be simpler, such as one or more
> finite state machine controllers.

Not likely to resemble any linear extrapolation from older
classes of actually-built machine.

Good hunting!

JoeH

 
 
 

The Locality Index

Post by Mark Thorson » Sat, 21 Dec 1996 04:00:00


In article <32B62190....@churchill.columbiasc.ncr.com>,
Joe Hinrichs  <jhinr...@churchill.columbiasc.ncr.com> wrote:

>Mark Thorson wrote:

>> He defines one semiconductor process generation as an
>> 0.7x reduction in feature size.  Given that definition,
>> he says wire length also goes down at 0.7x per generation,
>> die size goes down at 0.5x per generation (assuming that
>> the design stays constant -- this is just the feature
>> size trend squared), and RC delay goes up at 1.3x per
>> generation.  The calculation of the latter is somewhat
>> complex, involving things like metal line pitch and aspect
>> ratio.

>> The trend of wire delay going up is what troubles me.
>> The transistors get smaller and denser, the chips get
>> larger, but the wires get slower.  That is going to impact
>> computer architecture in subtle ways.  And the trend is
>> much worse than Bohr describes.

>Pardon an ignorant question: does the 1.3x per generation
>RC delay change mean that a signal propagates over the
>inverse of 1.3 squared as much area?  This means that the

I think what you are saying is correct.  In the context
of the original article, RC delay going up by 1.3x means
it will take 1.3x longer to travel a constant distance
(e.g. 1 mm).

>die size covered by the reduced signal is about .6x, while
>the old gate set covers .5x; so as long as the gates don't
>change, the RC factor doesn't hurt, except for two things:

>    a) Gates tend to quadruple, i.e. die size never
>       changes.  This has to be a law with someone's
>       name on it already, yes?  Claim below is that
>       it goes up at 1.5x per generation.

Who says die size never changes?  Defect density per unit
area is going down, so the size of the chips that can be
made with reasonable yields is going up.  There might be
a temporary hang-up waiting for new optical equipment to
take advantage of the new possibilities in die size, but
that equipment will come.

>    b) While die size stays the same, time has to
>       shrink.  Say 2x faster?  ((1.5 *2)/.6) = 5x
>       net?

I'm not sure I follow you.  You seem to be saying:

((die size increase * speed increase)/(area covered by signal)) = 5x

This ignores the increase in transistor count in the area
covered by the signal.  Also, you seem to correlate speed
increase strictly to switch speed, when RC delay is actually
the performance limiter.

>    c) Finer wires take less heat - voltage will be
>       coming down, too.  MORE than 5x net?

Both the original article and my own comments are looking
strictly at signal wires, not power wires.  More about that
below.

>So in other words there is a fence closing in at about
>5x per generation?  So whenever this fence comes across
>the die edge, the rules change a lot, yes?  Where is the
>fence now?

Good question.  The numbers change all the time, and you
have to distinguish between numbers from an ISSCC paper
and what's running in volume at Intel right now.


>> I propose a "locality index", i.e. the pressure on the
>> computer architect to organize the system as many small
>> functional blocks rather than fewer large blocks  This
>> would be the product of:

>> wire delay (e.g. the delay for a signal to travel 1 mm)
>> transistors per mm2
>> die size

>> Wire delay increases the speed penalty for passing a
>> signal outside of the neighborhood of the transmitter.
>> More transistors per unit area increases the functionality
>> within the neighborhood.  Die size increases the number of
>> neighborhoods on a chip.

>> So, Bohr's number for wire delay is 1.3x per generation.
>> Transistors per mm2 goes up as the inverse square of
>> feature size, which is about 2x per generation.  Die size
>> is a bit more difficult to estimate.  Based on a chart
>> given at Emcon '94 by Integrated Circuit Engineering,
>> I would say it increases by about 1.5x per generation.
>> Note that this is a trend entirely driven by progress
>> in reducing defect density, which is the limiter on how
>> big a die can be economically manufactured.

>> Multiplying all these trends together, the locality
>> index is rising at 3.9x per generation.

>I missed your number; what was the math?

wire delay * transistors per unit area * die size

Wire delay because high delay means you have to keep
fast structures in a confined area.  Transistors per unit
area because they give you higher functionality per unit
area.  Die size because that gives you more area to work
with, but doesn't affect either of the first two.  All
of these trends are working together toward large chips
with many small functional blocks.


>>That's a lot!
>> And you can see it just looking at some microprocessor
>> chips.  Chips up to and including the 386 usually had one
>> main ALU/shifter/register data path that would be the
>> largest single structure on the chip.  There would usually
>> be a large PLA for the microcode, and maybe some other
>> large structures like the address generation path.  The
>> 486 generation of chips introduced cache, which would
>> usually be the largest structure on the chip.

>> What these chips had in common was that a signal could
>> cross most of the chip in a fraction of a cycle.  When
>> that condition is present, wire delay is not a very
>> important driver for the architecture being used.

>> But what will life be like when we are looking at chips
>> with a very high locality index?  You won't see large
>> functional blocks that take up half the die.  You will see
>> many small functional blocks.  What will these blocks do?

>> I'm tempted to say they will be FPGA-like cells with
>> high functionality.  Perhaps instead of operating on bits,
>> they will operate on bytes, integers, or floating-point
>> numbers, with the ability to switch among these data
>> formats.

>Not ready to stipulate the 'many' or the 'small' yet; if the
>fence is at the die edge of the P6, then the P7 lump can still
>be one-fifth of a P6, hardly math-ckt or FPGA territory.

That may be true today, but the time is fast approaching when
there will be many little fences on-chip.  The large haciendas
we have today will become many single-family farms.


>But eventually we may have something like local autonomous
>neurons.  Hmmmmmmm.  About as many cells as, say, the brain
>of a house fly, but clocking seven to nine orders of
>magnitude faster.  Lllllloook out!

>> But won't VLIW allow control over many functional units?
>> I don't think so.  Somewhere there's going to be an
>> instruction dispatcher issuing multiple instructions
>> per cycle.  Some instructions will be to nearby units,
>> but some will be to units several cycles distant from
>> the dispatcher.  Can VLIW do that?  Maybe somebody's
>> already got a clever solution for that problem.  If so,
>> please inform me.  Also, not all units will be able to
>> communicate with each other in a single cycle unless
>> they're neighbors.

>Somebody's going to think of a new form of parallelism.

I think so too.  That's my basic question.  What will it be?

>> No, what I think I'm describing is actually a form of
>> cellular automata, albeit with state transition rules of
>> enormous complexity.

>Yeah - neurons.

>> How does this differ from having
>> an array of von Neumann machines, like many multiprocessor
>> concepts?  I'm not sure that it does.  The latter is a
>> subset of the former, but there might be other forms
>> of the former that would be simpler, such as one or more
>> finite state machine controllers.

>Not likely to resemble any linear extrapolation from older
>classes of actually-built machine.

Another article that is relevant to this discussion appeared
in the October 1996 issue of _Nikkei_Microdevices_, pages
92 to 98.  (disclaimer:  I know the author)  This article
describes an emerging packaging technology in which the
package is built up as layers on the face of the wafer,
then the wafer is diced into individual chip-scale packages.

The author is proposing routing critical signal paths on
polyimide flex circuit that is part of the package
construction.  The figures he gives for on-chip RC delay
are R = 12 ohms/cm and C = 2 pF/cm.  On the flex-circuit,
R = 0.5 ohms/cm and C = 0.75 pF/cm.  Therefore, signals
can run much faster on the flex-circuit than on the chip.
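
To put a number on "much faster" (my own arithmetic from the
figures above, using a simple distributed-RC estimate,
t ~ 0.4*R*C*L^2, so only the ratio really matters):

    # Distributed-RC delay estimate from the per-cm figures above.
    # ohm/cm * pF/cm gives ps/cm^2, so the result is in picoseconds.
    # The 1.5 cm length and the 0.4 prefactor are illustrative choices;
    # driver resistance and loading are ignored -- the ratio is the point.
    def wire_delay_ps(r_ohm_per_cm, c_pf_per_cm, length_cm):
        return 0.4 * r_ohm_per_cm * c_pf_per_cm * length_cm ** 2

    L = 1.5   # cm, something like a corner-to-corner run on a big die
    on_chip = wire_delay_ps(12.0, 2.0, L)     # ~21.6 ps
    on_flex = wire_delay_ps(0.5, 0.75, L)     # ~0.34 ps
    print(on_chip / on_flex)                  # ~64x faster on the flex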

He's proposing using wiring on the flex-circuit for both
power and critical signals.  You can't get anything like
the signal density on the chip for the wiring on the flex-
circuit, so this doesn't solve the problem of the rising
locality index.  But it impacts architecture because it
allows some signals to skip over long distances without
the on-chip wire delay penalty, sort of like the "wormholes"
on _Star_Trek_.  How will wormholes affect computer
architecture?  I'd really like to hear some good answers
to that one.

One possibility is that these wires will be used for
an interprocessor bus, if we scale existing multiprocessor
concepts to fit on the large, dense dice of the future.
But that would be a singularly unimaginative use of these
wires.  Maybe they will be used by the instruction dispatcher
of a VLIW machine?  That might be a way to make VLIW
practical for a large die with a high locality index.

 
 
 

The Locality Index

Post by Joe Hinrichs » Tue, 24 Dec 1996 04:00:00


Mark Thorson wrote:

>In article <32B62190....@churchill.columbiasc.ncr.com>,
>Joe Hinrichs  <jhinr...@churchill.columbiasc.ncr.com> wrote:
>>Mark Thorson wrote:
[snip]
>>Pardon an ignorant question: does the 1.3x per generation
>>RC delay change mean that a signal propagates over the
>>inverse of 1.3 squared as much area?  This means that the

>I think what you are saying is correct.  In the context
>of the original article, RC delay going up by 1.3x means
>it will take 1.3x longer to travel a constant distance
>(e.g. 1 mm).

Thanks.  But if gates are .5 the size while the RC delay
goes up by 1.3, then the relative RC delay is in fact a
speed increase of 1 / (.5 * 1.3 ) or about 1.5 x faster
given the same number of gates.  The 1.5x faster may not
cut it; and the number of gates has historically gone up
a bunch, right?

>>die size covered by the reduced signal is about .6x, while
>>the old gate set covers .5x; so as long as the gates don't
>>change, the RC factor doesn't hurt, except for two things:

>>       a) Gates tend to quadruple, i.e. die size never
>>          changes.  This has to be a law with someone's
>>          name on it already, yes?  Claim below is that
>>          it goes up at 1.5x per generation.

>Who says die size never changes?  Defect density per unit
>area is going down, so the size of the chips that can be
>made with reasonable yields is going up.  There might be
>a temporary hang-up waiting for new optical equipment to
>take advantage of the new possibilities in die size, but
>that equipment will come.

My bad - should stick to programming :)

>>       b) While die size stays the same, time has to
>>          shrink.  Say 2x faster?  ((1.5 *2)/.6) = 5x
>>          net?

>I'm not sure I follow you.  You seem to be saying:

>((die size increase * speed increase)/(area covered by signal)) = 5x

1.5 was net die area, at 6x gate increase; 2 was desired clock
speed gain; .6 was inverse of (1.3 ** 2) - which should have been
1.3 * sqrt(1.5) * 2;  re-doing it,

        doubled clock while RC change cuts performance by 1.3
        across 1.5x more area or sqrt(1.5) longer die edge =
        2 * 1.3 * 1.22 =  3.2x net

>This ignores the increase in transistor count in the area
>covered by the signal.  Also, you seem to correlate speed
>increase strictly to switch speed, when RC delay is actually
>the performance limiter.

>>       c) Finer wires take less heat - voltage will be
>>          coming down, too.  MORE than 5x net?

>Both the original article and my own comments are looking
>strictly at signal wires, not power wires.  More about that
>below.

>>So in other words there is a fence closing in at about
>>5x per generation?  So whenever this fence comes across
>>the die edge, the rules change a lot, yes?  Where is the
>>fence now?

>Good question.  The numbers change all the time, and you
>have to distinguish between numbers from an ISSCC paper
>and what's running in volume at Intel right now.

[snip]

>>I missed your number; what was the math?

>wire delay * transistors per unit area * die size

>Wire delay because high delay means you have to keep
>fast structures in a confined area.  Transistors per unit
>area because they give you higher functionality per unit
>area.  Die size because that gives you more area to work
>with, but doesn't affect either of the first two.  All
>of these trends are working together toward large chips
>with many small functional blocks.

[snip]
>>> I'm tempted to say they will be FPGA-like cells with
>>> high functionality.  Perhaps instead of operating on bits,
>>> they will operate on bytes, integers, or floating-point
>>> numbers, with the ability to switch among these data
>>> formats.

>>Not ready to stipulate the 'many' or the 'small' yet; if the
>>fence is at the die edge of the P6, then the P7 lump can still
>>be one-fifth of a P6, hardly math-ckt or FPGA territory.

>That may be true today, but the time is fast approaching when
>there will be many little fences on-chip.  The large haciendas
>we have today will become many single-family farms.

IOW supposing gate count has to shrink by sqrt(10) per generation
(since 3.2 is about that much), then the number of steps between
a P6 and something so much smaller as to require a whole new
paradigm is reasonably small.  BUT - is it one, two, or ten?
Eight steps equates to four orders of magnitude, taking a 10**7
gate P6 to a 10**3 gate whateveritis - which only says that
linear extrapolations tend to contain the seeds of their own
destruction, their own reductio ad absurdum.  The value
equations that specify "6x greater gates overall" and "2x
faster clock overall" should have to bend somewhere.

But if ten generations leads to an absurd Hundred Gate Flower
Pot, the next half-dozen will still be really interesting.
I certainly accept your point about the Family Farm, without
pretending to know to which of the next N generations it will
best apply.

[snip]


>>Somebody's going to think of a new form of parallelism.

>I think so too.  That's my basic question.  What will it be?

>>> No, what I think I'm describing is actually a form of
>>> cellular automata, albeit with state transition rules of
>>> enormous complexity.

>>Yeah - neurons.

>>> How does this differ from having
>>> an array of von Neumann machines, like many multiprocessor
>>> concepts?  I'm not sure that it does.  The latter is a
>>> subset of the former, but there might be other forms
>>> of the former that would be simpler, such as one or more
>>> finite state machine controllers.

>>Not likely to resemble any linear extrapolation from older
>>classes of actually-built machine.

>Another article that is relevant to this discussion appeared
>in the October 1996 issue of _Nikkei_Microdevices_, pages
>92 to 98.  (disclaimer:  I know the author)  This article
>describes an emerging packaging technology in which the
>package is built up as layers on the face of the wafer,
>then the wafer is diced into individual chip-scale packages.

>The author is proposing routing critical signal paths on
>polyimide flex circuit that is part of the package
>construction.  The figures he gives for on-chip RC delay
>are R = 12 ohms/cm and C = 2 pf/cm.  On the flex-circuit,
>R = 0.5 ohms/cm and C = 0.75 pf/cm.  Therefore, signals
>can run much faster on the flex-circuit than on the chip.

Not sure where to find the article; but you seem to be
saying that a flexible element, i.e. something with a
3-dimensional shape, is capable of traversing the die
from contact to contact, with greatly superior RC numbers
and thus much-enhanced clocking/bussing characteristics.

>He's proposing using wiring on the flex-circuit for both
>power and critical signals.  You can't get anything like
>the signal density on the chip for the wiring on the flex-
>circuit, so this doesn't solve the problem of the rising
>locality index.  But it impacts architecture because it
>allows some signals to skip over long distances without
>the on-chip wire delay penalty, sort of like the "wormholes"
>on _Star_Trek_.  How will wormholes affect computer
>architecture?  I'd really like to hear some good answers
>to that one.

How big do the contacts need to be?  How fine can the
polyimide lines be?  How independent of each other?  Are
they all lithographically placed on something that is one-
for-one with the die and bonds to it at all N bonding sites,
or do you see analogs of little signal cables lying around
all over the die surface, giving the die a Medusa appearance?
Again, please forgive the ignorant questions.  Maybe the
answer is "TBD".

>One possibility is that these wires will be used for
>an interprocessor bus, if we scale existing multiprocessor
>concepts to fit on the large, dense dice of the future.
>But that would be a singularly unimaginative use of these
>wires.  Maybe they will be used by the instruction dispatcher
>of a VLIW machine?  That might be a way to make VLIW
>practical for a large die with a high locality index.

Good thought.  I don't really think we're going to be ready
for the near-chaotic complexities of truly neuron-like
elements at any time in my productive life.  We have always
depended, and will for the indefinite foreseeable future continue
to depend, on one-bit-one-path kinds of computing, with simple
languages for simple things like humans to do simple serial
things like write databases in.  Oh, well.  But instead of
VLIW, with the implication that one die will do one or a
small number of serial instruction streams, my money says
that the CISC/RISC, SIW/LIW/VLIW religious wars will not
drive chip evolution; instead there will be increases in
the massiveness of parallelism on the chip.  All those
fences closing in will mean that one die will do more by
being more.  Not sure what the cache implications are, or
how all the necessary cache/RAM/CPU parts will fit inside
the hacienda fence, but that's where I think the action
will be.

JoeH

 
 
 
