Does the cache control function work on a 4D/320S and should I
use it when talking to a memory mapped device on the VME?
| Does the cache control function work on a 4D/320S and should I
| use it when talking to a memory mapped device on the VME?
No. I/O addresses are never cached anyway.
The most beautiful things in the world are | Dave Olson
those from which all excess weight has been | Silicon Graphics
I have LOTS of questions here about _data cache_ on SGI machines.
Can someone there please explain to me how a programmer can control the use
of the _CPU data cache_ on a multiple-CPU SGI machine?
I would like to control what data is copied from memory into each of the
individual caches (one cache per CPU), and when the data is copied.
Is this possible?
If it is not possible, is there some way to tell how efficiently the caches
are being utilized during a run?
It's my understanding that multi-CPU SGI machines use shared memory.
Apparently when a program is run on several CPUs in parallel, these machines
generate a certain amount of overhead making sure that when one CPU computes
a new value for some variable, each of the CPU caches (one for each CPU)
reflect the updated value for that variable (if the variable is represented
in any of the CPU caches). I believe this is called maintaining cache
coherence. When more CPUs are crunching in parallel, more cross-checking must
be done between the caches, creating more overhead.
Apparently the newer SGI machines (e.g., Power Challenge) crunch numbers much
faster than the older ones. HOWEVER, the newer ones do not maintain cache
coherence proportionately faster than the older ones. Consequently, while a
program runs much faster on a newer SGI using N-cpus than on an older SGI using
N-cpus, the speed-up one gets from going from N-cpus to N+1 cpus is MUCH LESS
than it was on the older SGIs.
When we run the NPARC code on the older SGIs (e.g., renegade), we cut our CPU
time by almost 50% by going from 1 CPU to 2. But on the Power Challenge, we
only get a 30% speedup. The speedup in adding a 3rd CPU to a Power Challenge
run is dismal. We think the machine is bottlenecked maintaining cache
coherence. However, it is also possible that the problem could be drastically
reduced if
our arrays were structured differently. But how should they be structured?
When an array element is accessed by a program, is other data moved into the
cache too? Which data? When is it, and when is it not?
Is there some way to manage the cache to improve the performance of a program
running on a Power Challenge?
Robert Michael Wood | "Do you know what it's like to die
| It's redundant." - Bakersfield P.D.