I'm still trying to figure out the implications of the
excellent article "Interconnect scaling -- the real
limiter to high performance ULSI" by Mark Bohr in the
September 1996 issue of _Solid_State_Technology_.
The author is director of process architecture and
integration at Intel, and he worked out a lot of good
numbers for semiconductor technology trends.
He defines one semiconductor process generation as a
0.7x reduction in feature size. Given that definition,
he says wire length also goes down at 0.7x per generation,
die area goes down at 0.5x per generation (assuming that
the design stays constant -- this is just the feature
size trend squared), and RC delay goes up at 1.3x per
generation. The calculation of the latter is somewhat
complex, involving things like metal line pitch and
aspect ratio.
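To make these trends concrete, here is a little
back-of-the-envelope calculation (my own compounding of
the article's numbers, not something from the article
itself):

```python
# Toy model compounding Bohr's per-generation scaling trends.
# The per-generation factors are from the article; the function
# and the compounding over generations are my own illustration.

def scale(n_generations):
    """Return cumulative scaling factors after n process generations."""
    feature = 0.7 ** n_generations   # feature size: 0.7x per generation
    area = 0.5 ** n_generations      # die area of a fixed design: 0.7^2 = 0.5x
    rc_delay = 1.3 ** n_generations  # RC wire delay: 1.3x per generation
    return feature, area, rc_delay

# After three generations a fixed design has shrunk to about
# 1/8 of its original area, while its wire delay has grown
# by about 2.2x.
f, a, rc = scale(3)
print(f, a, rc)
```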
The trend of wire delay going up is what troubles me.
The transistors get smaller and denser, the chips get
larger, but the wires get slower. That is going to impact
computer architecture in subtle ways. And the trend is
much worse than Bohr describes.
I propose a "locality index", i.e. a measure of the
pressure on the computer architect to organize the system
as many small functional blocks rather than fewer large
blocks. This would be the product of:
wire delay (e.g. the delay for a signal to travel 1 mm)
transistors per mm2
die size (total area in mm2)
Wire delay increases the speed penalty for passing a
signal outside of the neighborhood of the transmitter.
More transistors per unit area increases the functionality
within the neighborhood. Die size increases the number of
neighborhoods on a chip.
So, Bohr's number for wire delay is 1.3x per generation.
Transistors per mm2 goes up as the inverse square of
feature size, which is about 2x per generation. Die size
is a bit more difficult to estimate. Based on a chart
given at Emcon '94 by Integrated Circuit Engineering,
I would say it increases by about 1.5x per generation.
Note that this is a trend entirely driven by progress
in reducing defect density, which is the limiter on how
big a die can be economically manufactured.
Multiplying all these trends together, the locality
index is rising at 3.9x per generation. That's a lot!
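As a sanity check on the arithmetic, here is the product
spelled out (the factor values are those given above; the
framing as a script is mine):

```python
# Per-generation growth factors for the locality index.
WIRE_DELAY = 1.3  # Bohr's number for RC wire delay
DENSITY = 2.0     # transistors per mm2: ~inverse square of the 0.7x feature trend
DIE_SIZE = 1.5    # die size growth, from the ICE chart estimate

# The locality index is the product of the three trends.
locality_index = WIRE_DELAY * DENSITY * DIE_SIZE
print(locality_index)  # roughly 3.9 per generation
```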
And you can see it just looking at some microprocessor
chips. Chips up to and including the 386 usually had one
main ALU/shifter/register data path that would be the
largest single structure on the chip. There would usually
be a large PLA for the microcode, and maybe some other
large structures like the address generation path. The
486 generation of chips introduced cache, which would
usually be the largest structure on the chip.
What these chips had in common was that a signal could
cross most of the chip in a fraction of a cycle. When
that condition is present, wire delay is not a very
important driver for the architecture being used.
The first chip I remember that may have been affected
by wire delay was the Alpha, which had two levels of
on-chip cache. However, this may have been due to loading
on the wires crossing the cache array; I'm not sure it was
truly wire delay. The first-level cache offered single-
cycle access while the second-level cache took two cycles.
But what will life be like when we are looking at chips
with a very high locality index? You won't see large
functional blocks that take up half the die. You will see
many small functional blocks. What will these blocks do?
I'm tempted to say they will be FPGA-like cells with
high functionality. Perhaps instead of operating on bits,
they will operate on bytes, integers, or floating-point
numbers, with the ability to switch among these data types.
But won't VLIW allow control over many functional units?
I don't think so. Somewhere there's going to be an
instruction dispatcher issuing multiple instructions
per cycle. Some instructions will be to nearby units,
but some will be to units several cycles distant from
the dispatcher. Can VLIW do that? Maybe somebody's
already got a clever solution for that problem. If so,
please inform me. Also, not all units will be able to
communicate with each other in a single cycle unless they
are physically close together.
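To illustrate the dispatch problem, here is a toy model
(entirely hypothetical numbers and parameters, not anybody's
real design) of how the round trip for a result grows with a
unit's distance from the dispatcher:

```python
# Hypothetical sketch: a central dispatcher issues an instruction
# to a functional unit some wire distance away. The signal travels
# out, the unit executes, and the result travels back, so completion
# time grows with distance. All parameters are made-up round numbers.

def completion_cycle(issue_cycle, unit_distance_mm,
                     wire_delay_cycles_per_mm=0.5, execute_cycles=1):
    """Cycle at which the result is back at the dispatcher."""
    travel = unit_distance_mm * wire_delay_cycles_per_mm
    return issue_cycle + travel + execute_cycles + travel

# A nearby unit (1 mm) vs. a distant one (8 mm), both issued
# at cycle 0: the distant unit's result arrives much later.
print(completion_cycle(0, 1))
print(completion_cycle(0, 8))
```

The point of the sketch is that a scheduler would need to
track a different latency for every unit, which is exactly
the complication raised above.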
No, what I think I'm describing is actually a form of
cellular automata, albeit with state transition rules of
enormous complexity. How does this differ from having
an array of von Neumann machines, like many multiprocessor
concepts? I'm not sure that it does. The latter is a
subset of the former, but there might be other forms
of the former that would be simpler, such as one or more
finite state machine controllers.
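To make the cellular-automaton picture concrete, here is a
minimal sketch (with a trivial stand-in update rule; the real
transition rules would be enormously more complex, as noted
above) of a grid of cells, each updating from its own state
and its four neighbors:

```python
# Minimal sketch of a cellular automaton of word-sized cells.
# Each cell updates synchronously from its own state and its
# four nearest neighbors -- a purely local communication pattern,
# which is the property wire delay pushes toward.

def step(grid):
    """One synchronous update of a 2-D grid of integer-valued cells."""
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighborhood = [grid[r][c]]
            if r > 0:
                neighborhood.append(grid[r - 1][c])
            if r < rows - 1:
                neighborhood.append(grid[r + 1][c])
            if c > 0:
                neighborhood.append(grid[r][c - 1])
            if c < cols - 1:
                neighborhood.append(grid[r][c + 1])
            # Stand-in rule: each cell becomes the sum of its neighborhood.
            new[r][c] = sum(neighborhood)
    return new

# A single nonzero cell spreads one neighborhood per step.
print(step([[0, 0, 0], [0, 1, 0], [0, 0, 0]]))
```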