> > same on all three. Numbers, anyone?
BTW, there are some more recent numbers in the
following online paper (there seems to be no
information on whether it has received further
publication):
  "Cache Behaviour of the SPEC95 Benchmark Suite",
  by Sanjoy Dasgupta and Edouard Servan-Schreiber
The paper looks at a subset of SPEC95, on a SPARC.
The paper suggests that 32 Byte block sizes are optimal
for SPEC95 with small (< 128KB) caches on the machine
in question (presumably with 32-bit addresses). From these
data, it appears to me that a 64-bit address machine would
likely do better with 64 Byte blocks, since the optimum is
already leaning in that direction. Larger caches also do
better with larger blocks, so machines with larger unified
L1/L2 caches would likely do better still.
In short, it looks like the vendors have probably already
done a pretty good job of optimizing their machines to run SPEC95.
Quote:> My point was that different machines do have different line sizes, and
> the differences are quite large. On the SGI R10000, the secondary line
> size is 128 Bytes. On some IBM Power 2's, the line size is 256 Bytes.
> I'm pretty sure that some other vendors use 32 Byte line sizes.
> Why different vendors use different line sizes is probably related to
> both system issues and to which types of applications they try to
We seem to be in raging agreement up to this point.
Quote:> But, it is irrelevant to the benchmarking issue.
I still like to think that microbenchmarks like lmbench and STREAM,
larger benchmarks like SPEC95, and full-sized application performance,
could be correlated, and even "understood" starting from basic machine
performance. So, I think I disagree with the above statement.
Quote:> is that lmbench measures the latency for fetching a single pointer. On
> such a benchmark a large-line machine will look relatively worse
> compared to the competition than if instead one used a benchmark that
> measured the latency of fetching a cache line.
Certainly true. Of course, in some cases, the machines which have
long main memory latencies are *also* the same machines with poor
Quote:> Now which benchmark is "better". I think both are interesting. Which
> is more relevant to a typical integer application? I don't know.
I don't think there is any doubt that the latencies of the (entire)
memory hierarchy are a major determinant of "integer" performance.
For engineering-and-scientific code, the picture is murkier. Some
codes are pretty much 100% bandwidth determined. Others are not
much different from "integer" performance [assuming you have a modern,
fast, FP implementation]. It is actually the middle ground that is
most "interesting": the codes which can't be trivially transformed
to contiguous memory references, which have independently computed addresses,
and so on. This is the area where "concurrency", as distinguished from
the ratio of bandwidth:latency, gets interesting.