benchmarking discussion at Usenix?

Post by Skipper Smith » Fri, 17 Jan 1997 04:00:00



[...]

Quote:> I don't have any hard numbers, but the industry seems to have decided
> that cache sector sizes equal to 4 or 8 data beats (with the bulk
> choosing 4; 8 is usually an available choice or a side effect) are best
> and that, when organizational purposes require it, multiple sectors per
> block are acceptable but are generally avoided.  Therefore, you need to
> look at your bus width to decide what the "line" size should be.

Oh, all of that applied to on-chip caches.  When going to external
caches, one needs to look very carefully at the kinds of programs that
are expected to run before making any real decisions.  Going with a
generic choice could be a real performance killer in many specific
cases.
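
To make the rule of thumb concrete, here is a trivial sketch (the bus
widths and beat counts are illustrative assumptions, not data from any
particular part):

  #include <stdio.h>

  /* Rule of thumb from above: line (sector) size = data beats per
   * fill times bus width in bytes.  All numbers are illustrative. */
  int main(void)
  {
      int bus_bytes[] = { 4, 8, 16 };      /* 32-, 64-, 128-bit buses */
      int beats[]     = { 4, 8 };          /* common burst lengths    */
      int i, j;

      for (i = 0; i < 3; i++)
          for (j = 0; j < 2; j++)
              printf("%3d-bit bus, %d beats -> %3d byte line\n",
                     8 * bus_bytes[i], beats[j],
                     beats[j] * bus_bytes[i]);
      return 0;
  }

E.g. a 64-bit bus with a 4-beat fill gives a 32 byte line.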

--
Skipper Smith
Somerset Design Center
All opinions are my own and not those of my employer

 
 
 

benchmarking discussion at Usenix?

Post by Hugh LaMaster » Sat, 18 Jan 1997 04:00:00




> > same on all three.  Numbers, anyone?

BTW, there are some more recent numbers in the
following online paper (there is no indication that it
has been published elsewhere; it appears to have been
done as a class project):

  "Cache Behaviour of the SPEC95 Benchmark Suite",
  Sanjoy Dasgupta and Edouard Servan-Schreiber

  http://http.cs.berkeley.edu/~dasgupta/paper/rep/rep.html

The paper looks at a subset of SPEC95, on a SPARC.
The paper suggests that 32 Byte block sizes are optimal
for SPEC95, for small (< 128KB) caches, on the machine
in question (presumably with 32-bit addresses).  It appears
to me from this data that a 64-bit address machine would
likely do better with 64 Byte blocks, since the optimum is
leaning that direction already.  Larger cache sizes also do
better with larger blocks, so machines with larger unified
L1/L2 caches would likely do better with larger blocks.  
In short, it looks like the vendors have probably already
done a pretty good job of optimizing their machines to run SPEC95.  
[Surprise, surprise].

Quote:> My point was that different machines do have different line sizes, and
> the differences are quite large.  On the SGI R10000, the secondary line
> size is 128 Bytes. On some IBM Power 2's, the line size is 256 Bytes.
> I'm pretty sure that some other vendors use 32 Byte line sizes.
> Why different vendors use different line sizes is probably related to
> both system issues and to which types of applications they try to
> optimize.  

We seem to be in raging agreement up to this point.

Quote:>            But, it is irrelevant to the benchmarking issue.

I still like to think that microbenchmarks like lmbench and STREAM,
larger benchmarks like SPEC95, and full-sized application performance,
could be correlated, and even "understood" starting from basic machine
performance.  So, I think I disagree with the above statement.

Quote:> The issue is that lmbench measures the latency for fetching a
> single pointer.  On such a benchmark a large-line machine will look
> relatively worse compared to the competition than if instead one used
> a benchmark that measured the latency of fetching a cache line.

Certainly true.  Of course, in some cases, the machines which have
long main memory latencies are *also* the same machines with poor
bandwidth.
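
For the curious, here is a minimal sketch of the kind of dependent-load
loop such latency tests use (my own illustration, not lmbench's actual
code; the array size and stride are arbitrary):

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* Dependent pointer chase: each load must finish before the next
   * address is known, so the loop time is pure load latency, and
   * only one word of each cache line is ever touched. */
  #define N      (1 << 20)          /* 1M pointers (size arbitrary)  */
  #define STRIDE 16                 /* ~a cache line apart, or more  */
  #define ITERS  100000000L

  int main(void)
  {
      void **ring = malloc(N * sizeof *ring);
      long i;

      /* A real harness would visit the lines in random order to
       * defeat hardware prefetching; sequential kept for brevity. */
      for (i = 0; i < N; i++)
          ring[i] = &ring[(i + STRIDE) % N];

      void **p = ring;
      clock_t t0 = clock();
      for (i = 0; i < ITERS; i++)
          p = *p;                   /* serialized: latency-bound */
      clock_t t1 = clock();

      printf("%.1f ns/load (end=%p)\n",
             (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / ITERS,
             (void *)p);            /* print p so the loop isn't dead */
      return 0;
  }

Because each load's address depends on the previous load's result, no
amount of bandwidth helps here; only latency matters.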

Quote:> Now, which benchmark is "better"?  I think both are interesting.  Which
> is more relevant to a typical integer application?  I don't know.

I don't think there is any doubt that the latencies of the (entire)
memory hierarchy are a major determinant of "integer" performance.
For engineering-and-scientific code, the picture is murkier.  Some
codes are pretty much 100% bandwidth determined.  Others are not
much different from "integer" performance [assuming you have a modern,
fast, FP implementation].  It is actually the middle ground that is
most "interesting": the codes which can't be trivially transformed
to contiguous memory references, which have independent computed
indices, and so on.  This is the area where "concurrency", as
distinguished from the ratio of bandwidth:latency, gets interesting.
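
To illustrate what I mean by independent computed indices, a
hypothetical sketch:

  /* Gather with precomputed indices: unlike a pointer chase, no load
   * depends on a previous load's result, so a machine that can keep
   * several cache misses in flight can overlap them.  Performance
   * then depends on memory concurrency, not just raw latency. */
  double gather_sum(const double *a, const int *idx, int n)
  {
      double s = 0.0;
      int i;

      for (i = 0; i < n; i++)
          s += a[idx[i]];   /* the miss for a[idx[i+1]] can start
                               before a[idx[i]] returns */
      return s;
  }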

 
 
 

benchmarking discussion at Usenix?

Post by J Wunsch » Sun, 19 Jan 1997 04:00:00



> > > Another part is exception and interrupt handling.

> > Hmm.  This can hardly be expressed in a benchmark.  OTOH, if the
> > system keeps reasonable statistics, the effect of interrupt and
> > exception handling should be foreseeable.  I estimate this at less
> > than ~ 10 % of the total time on a modern machine.

> I guess it depends on whether or not you have to drive
> dumb ugly serial ports and such like.  You can spend

That's why I wrote ``on a modern machine''. :-)

Quote:> a lot of time handling interrupts with such devices.  
> One paper at Usenix showed Linux on a Pentium dropping
> interrupts/data when driving a 115 Kbps serial port...

Hmm, there are perhaps better ways to estimate interrupt overhead. :)
Simply run it on a sufficiently slow machine and see where the
throughput backs off.  My old 386sx/16 notebook (with an older version
of FreeBSD) has only an on-board 16450-style UART.  I can run it up to
38400 bps without flow control, meaning that a constant interrupt rate
of ~ 4 kHz (3.8 kHz from the serial port, 100 Hz from the system
timer, 128 Hz from the statclock) doesn't saturate it.  I can increase
the baud rate with hardware flow control up to 115 kbps, and this
yields an effective throughput of ~ 75 % of the raw bitrate.  This
means the system levels out at a maximal interrupt rate of ~ 8 kHz.
(Needless to say, the clock seriously loses interrupts then.)
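
The arithmetic, for anyone who wants to check it (assuming 10 bits on
the wire per character and one interrupt per character on a FIFO-less
UART):

  #include <stdio.h>

  /* Rough check of the rates quoted above: 1 start + 8 data + 1 stop
   * = 10 bits per character, one interrupt per character (16450, no
   * FIFO), plus the 100 Hz timer and the 128 Hz statclock. */
  int main(void)
  {
      double clocks = 100.0 + 128.0;            /* timer + statclock */
      double slow   = 38400.0 / 10.0;           /* 3840 chars/s      */
      double fast   = 115200.0 * 0.75 / 10.0;   /* 8640 chars/s      */

      printf("38.4 kbps : %4.1f kHz\n", (slow + clocks) / 1000.0);
      printf("115.2 kbps: %4.1f kHz\n", (fast + clocks) / 1000.0);
      return 0;
  }

which lands in the ballpark of the figures above.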

As I wrote, that's a rough estimate.  It's always good to keep a
slow machine around for testing. :-))

--
cheers, Jörg


Never trust an operating system you don't have sources for. ;-)