benchmarking discussion at Usenix?

Post by Larry McVoy » Sun, 05 Jan 1997 04:00:00



(I posted this before, I thought, but it seems to have disappeared into a
black hole...)

I'll be at Usenix and I thought I might put together a little BOF to discuss
benchmarking issues, if people are interested.  Topics could include

        . lmbench 2.0 changes
                - finer grain accuracy
                - web benchmarks
                - scaling/load
                - multi pointer memory latency
        . osbench
                - Steve Kleiman suggested that I (or someone) grab a big hunk
                 of OS code and port it to userland and call it osbench.  This
                 is an interesting idea.
        . freespec
                - I'm unhappy about the current spec.  I'd like to build a
                  freeSpec97 that is similar to spec (uses the same tests)
                  but has lmbench style reporting rules (cc -O/f77 -O) and
                  is free of any charges.  Any interest?
        . others?

Let me know if you want me to find some space so we can talk at Usenix.
--

 
 
 

benchmarking discussion at Usenix?

Post by Jörg Wunsch » Mon, 06 Jan 1997 04:00:00



>    . osbench
>            - Steve Kleiman suggested that I (or someone) grab a big hunk
>             of OS code and port it to userland and call it osbench.  This
>             is an interesting idea.

Only curious: what are the goals for this?

[No, i certainly can't afford to come to Usenix. :)]
--
cheers, Jörg


Never trust an operating system you don't have sources for. ;-)

 
 
 

benchmarking discussion at Usenix?

Post by Bernd Paysan » Tue, 07 Jan 1997 04:00:00




> >       . osbench
> >               - Steve Kleiman suggested that I (or someone) grab a big hunk
> >                of OS code and port it to userland and call it osbench.  This
> >                is an interesting idea.

> Only curious: what are the goals for this?

OS code has somewhat different characteristics from user code. If you
look into something like the Linux kernel, you'll see a number of
functions with many tests (e.g. for error conditions), often calls to
other functions, most of them small, and few loops.
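
For illustration, here's a compilable sketch of that shape (every name
in it is invented, it's not taken from any real kernel):

    /* Hypothetical kernel-style code: many error tests, calls to
     * small functions, and no hot loop. */
    struct req { int perm; int qlen; };

    static int perm_ok(struct req *r)    { return r->perm != 0; }
    static int queue_full(struct req *r) { return r->qlen >= 64; }
    static void enqueue(struct req *r)   { r->qlen++; }

    int do_request(struct req *r)
    {
        if (r == 0)         return -1;   /* error test               */
        if (!perm_ok(r))    return -2;   /* test + small call        */
        if (queue_full(r))  return -3;   /* test + small call        */
        enqueue(r);                      /* short straight-line work */
        return 0;
    }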

But I wonder if porting to userland gives a good benchmark. Part of the
OS workload game is the switch between user land and kernel land for
each OS call. Another part is exception and interrupt handling. Porting
parts of an OS to user land gives a benchmark that shows how typical OS
code would perform. In other words: you can vaguely guess how fast Linux
might run on this CPU, and more vaguely guess how other OSes will run
(given that the code will be taken from Linux). Not as accurate as
running real Linux (or NetBSD, to avoid any holy war ;-).

Advantages over the real thing: much more portable, and it's easier to
get a stable basis (think of all the Linux hackers benchmarking the
latest hacker kernel ;-).

> [No, i certainly can't afford to come to Usenix. :)]

Sad, but me too.

--
Bernd Paysan
"Late answers are wrong answers!"
http://www.informatik.tu-muenchen.de/~paysan/

 
 
 

benchmarking discussion at Usenix?

Post by Jörg Wunsch » Thu, 09 Jan 1997 04:00:00



> > >       . osbench
> > Only curious: what are the goals for this?

> OS code has somewhat different characteristics from user code. If you

That's been my basic question, yes.  I wasn't aware that the typical
usage pattern of kernel code differs that much from user code.

> But I wonder if porting to userland gives a good benchmark. Part of the
> OS workload game is the switch between user land and kernel land for
> each OS call.

Well, i think the effect of this is often overestimated.  Funnily
enough, lmbench itself bears this out. :-)  lmbench clearly shows that
Linux has much faster syscall handling than other systems, and yet
(no, i'm not fighting over 10 or 20 % here) Linux as a whole is not
_way_ faster than other systems under identical conditions.  The
conclusion is that syscall overhead is most likely not the typical
bottleneck.
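
(A null-syscall timing loop in the lmbench spirit is only a few lines
of C.  This is just an illustration using gettimeofday(); lmbench's
own harness is more careful about clock resolution and loop overhead:)

    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void)
    {
        struct timeval t0, t1;
        double usec;
        int i, n = 1000000;

        gettimeofday(&t0, NULL);
        for (i = 0; i < n; i++)
            getppid();          /* about the cheapest real syscall */
        gettimeofday(&t1, NULL);

        usec = (t1.tv_sec - t0.tv_sec) * 1e6
             + (t1.tv_usec - t0.tv_usec);
        printf("%.3f usec per syscall\n", usec / n);
        return 0;
    }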

> Another part is exception and interrupt handling.

Hmm.  This can hardly be expressed in a benchmark.  OTOH, if the
system keeps reasonable statistics, the effect of interrupt and
exception handling should be foreseeable.  I estimate it at less
than ~ 10 % of the total time on a modern machine.

> Advantages over the real thing: much more portable, and it's easier to
> get a stable basis

Yes.

--
cheers, Jörg


Never trust an operating system you don't have sources for. ;-)

 
 
 

benchmarking discussion at Usenix?

Post by Hugh LaMaster » Thu, 09 Jan 1997 04:00:00



> I'll be at Usenix and I thought I might put together a little BOF to discuss
> benchmarking issues, if people are interested.  Topics could include

>         . lmbench 2.0 changes
>                 - finer grain accuracy
>                 - web benchmarks
>                 - scaling/load
>                 - multi pointer memory latency

IMHO, lmbench currently attempts to do exactly the right thing wrt
what is commonly understood as memory latency.  I would vehemently
argue against *replacing* it with multipointer memory latency.

"Multipointer memory latency" is another way of referring
to one type of concurrency.

Question: Aside from (pure) latency (currently measured by lmbench)
and bandwidth (measured by lmbench and STREAM), does concurrency matter?

Answer: I think so.  There are a number of cases of interest.
"Bandwidth" is (often) based on stride-1 concurrency.  Also interesting:
Concurrency with stride-N.  Concurrency on gather/scatter.

Putting everything in units of time (seconds × 10^-N), the time to:

  latency (lmbench):        fetch          a random address or datum
  stride-1 (1/bandwidth):   fetch&process  a unit of contiguous data
  stride-N (1/bandwidth):   fetch&process  every N-th datum
  gather/scatter (1/bw):    fetch&process  random data
  subword (1/bandwidth):    fetch&process  8/16/(32) data bits within a word

For some machines, bandwidth is roughly constant over all of these
cases; on some machines, extra load/store paths allow 2-3X
improvements; on some machines, subword instructions (e.g. VIS and
so on) vastly speed up "vector/parallel" operations within a word.
A major battle of the early 80's was the CDC Cyber 205 vs. the
Cray-1/S.  The Cyber 205 had greater stride-1 and gather/scatter
bandwidth, the Cray-1/S better latency and stride-N performance.
Each machine had its applications where it outperformed the other.
All these types of bandwidths and concurrencies are worthwhile to
examine systematically.  Most "scalar" code, including compilers,
tends to be dominated by latency, while many engineering,
scientific, and graphics/image processing applications tend to be
more bandwidth-intensive.
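
The two endpoints of this spectrum are easy to sketch (a toy version,
not a careful harness; a real one would randomize the chase pattern to
defeat prefetching and would time the two loops separately):

    #include <stdio.h>

    #define N (1 << 20)
    static void *chain[N];      /* dependent loads: latency     */
    static int   data[N];      /* independent loads: bandwidth */

    int main(void)
    {
        void **p;
        long i, sum = 0;

        for (i = 0; i < N - 1; i++)
            chain[i] = &chain[i + 1];
        chain[N - 1] = &chain[0];

        p = (void **)chain[0];
        for (i = 0; i < N; i++)     /* each load depends on the last */
            p = (void **)*p;

        for (i = 0; i < N; i++)     /* stride-1: loads can overlap */
            sum += data[i];

        printf("%p %ld\n", (void *)p, sum);  /* keep results live */
        return 0;
    }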

>         . osbench
>                 - Steve Kleiman suggested that I (or someone) grab a big hunk
>                  of OS code and port it to userland and call it osbench.  This
>                  is an interesting idea.

An interesting idea.  Various papers over the years (e.g. Alan J. Smith
at U.C. Berkeley?) have noted that when real hardware is instrumented,
operating systems tend to show much higher cache miss rates than
application measurements would predict.  By porting OS code to
user-land, hopefully at least that much of its behavior could be
captured.  [There are obviously a lot of difficult-to-simulate OS
activities as well...]

>         . freespec
>                 - I'm unhappy about the current spec.  I'd like to build a
>                   freeSpec97 that is similar to spec (uses the same tests)
>                   but has lmbench style reporting rules (cc -O/f77 -O) and
>                   is free of any charges.  Any interest?

Great idea.

>         . others?

Hank Dietz (et al.) of Purdue is keeping the "global aggregate function"
flame alive - these are functions which are globally shared by N
parallel processors/processes, such as barrier synch, and bitwise
functions such as broadcast, AND, OR, NAND, NOR - and pointing out
how important these are (and how useful they are/were on the machines
which implemented them quickly in hardware).  See the following Website
for details:

  http://www.veryComputer.com/~papers/Arch/

A useful benchmark would be to compute these times as a function
of N processes.  A portable reference version using SysV IPC would
pin down the reference behavior and set a baseline for performance.
[Some hardware implementations provide shockingly good performance
compared to others.]
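
As a crude sketch of that SysV IPC reference point, here is a
two-process ping-pong timed over SysV semaphores (not a real N-way
barrier, and it assumes semget() hands back zero-initialized
semaphores, which is true on most but not all systems):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>
    #include <sys/wait.h>

    static void op(int id, int num, int delta)
    {
        struct sembuf sb;
        sb.sem_num = num; sb.sem_op = delta; sb.sem_flg = 0;
        semop(id, &sb, 1);
    }

    int main(void)
    {
        int id = semget(IPC_PRIVATE, 2, IPC_CREAT | 0600);
        int i, n = 10000;
        struct timeval t0, t1;
        double usec;

        if (fork() == 0) {        /* child: wait on sem 0, post sem 1 */
            for (i = 0; i < n; i++) { op(id, 0, -1); op(id, 1, 1); }
            exit(0);
        }
        gettimeofday(&t0, NULL);  /* parent: post sem 0, wait on sem 1 */
        for (i = 0; i < n; i++) { op(id, 0, 1); op(id, 1, -1); }
        gettimeofday(&t1, NULL);
        wait(NULL);
        usec = (t1.tv_sec - t0.tv_sec) * 1e6
             + (t1.tv_usec - t0.tv_usec);
        printf("%.1f usec per round trip\n", usec / n);
        semctl(id, 0, IPC_RMID);
        return 0;
    }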

 
 
 

benchmarking discussion at Usenix?

Post by Patrick Gaine » Fri, 10 Jan 1997 04:00:00




> >         . osbench
> >                 - Steve Kleiman suggested that I (or someone) grab a big hunk
> >                  of OS code and port it to userland and call it osbench.  This
> >                  is an interesting idea.

> An interesting idea.  Various papers over the years (e.g. Alan J. Smith
> at U.C. Berkeley?) have noted that when real hardware is instrumented,
> operating systems tend to show much higher cache miss rates than
> application measurements would predict.  By porting OS code to
> user-land, hopefully at least that much of its behavior could be
> captured.  [There are obviously a lot of difficult-to-simulate OS
> activities as well...]

Now I know this may not be of general applicability, but for those who
are interested: you can profile the kernel today on SGI Challenge
machines running with an R10000 processor.  The on-chip counters may
be accessed from user applications and may be configured to monitor
only user events, only kernel events, or both.  The types of events
one can monitor with the R10000 are pretty numerous, and some really
good data may be collected.

Having done some comparisons of different types of workloads,
monitoring all sorts of things from types of cache misses to CPI
values, it is very reasonable to assume the kernel has higher cache
miss rates and lower IPC values than user applications (including
relational database workloads - which are pretty bad in their own
right).
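
(If memory serves, IRIX exposed these counters to ordinary users
through the perfex(1) wrapper, along the lines of

    perfex -a ./myworkload    # multiplex across all countable events

but treat the flag and the workload name as illustrative from memory;
check the man page.)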

> >         . freespec
> >                 - I'm unhappy about the current spec.  I'd like to build a
> >                   freeSpec97 that is similar to spec (uses the same tests)
> >                   but has lmbench style reporting rules (cc -O/f77 -O) and
> >                   is free of any charges.  Any interest?

> Great idea.

Yep, this is the best suggestion of the lot.  It's not been uncommon
for certain folks to criticize lmbench by saying some of the things it
measured weren't that relevant.  A freespec would be a wonderful
addition.

Pat

 
 
 

benchmarking discussion at Usenix?

Post by Dror Maydan » Fri, 10 Jan 1997 04:00:00



> Question: Aside from (pure) latency (currently measured by lmbench)
> and bandwidth (measured by lmbench and STREAM), does concurrency matter?

> Answer: I think so.  There are a number of cases of interest.
> "Bandwidth" is (often) based on stride-1 concurrency.  Also interesting:
> Concurrency with stride-N.  Concurrency on gather/scatter.

> Putting everything in units of time (seconds × 10^-N), the time to:
>   latency (lmbench):        fetch          a random address or datum
>   stride-1 (1/bandwidth):   fetch&process  a unit of contiguous data
>   stride-N (1/bandwidth):   fetch&process  every N-th datum
>   gather/scatter (1/bw):    fetch&process  random data
>   subword (1/bandwidth):    fetch&process  8/16/(32) data bits within a word

> For some machines, bandwidth is roughly constant over all of these
> cases; on some machines, extra load/store paths allow 2-3X
> improvements; on some machines, subword instructions (e.g. VIS and
> so on) vastly speed up "vector/parallel" operations within a word.
> A major battle of the early 80's was the CDC Cyber 205 vs. the
> Cray-1/S.  The Cyber 205 had greater stride-1 and gather/scatter
> bandwidth, the Cray-1/S better latency and stride-N performance.
> Each machine had its applications where it outperformed the other.
> All these types of bandwidths and concurrencies are worthwhile to
> examine systematically.  Most "scalar" code, including compilers,
> tends to be dominated by latency, while many engineering,
> scientific, and graphics/image processing applications tend to be
> more bandwidth-intensive.

One more interesting category is the latency accessing objects bigger
than 4 bytes.  On many cache based machines accessing everything in a
cache line is just as fast as accessing one element.  I've never seen
measurements, but my guess is that many data elements in compilers are
bigger than 4 bytes; i.e., spatial locality works for compilers.
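
One could test this with a pointer chase over fat nodes, touching
either one field per node or the whole node (a toy sketch; the node
layout is an assumption, and a real harness would randomize the chain
and time the two loops separately):

    #include <stdio.h>

    struct node {
        struct node *next;
        long pad[3];            /* fatten the node past one pointer */
    };

    #define N 100000
    static struct node a[N];

    int main(void)
    {
        struct node *p;
        long i, sum = 0;

        for (i = 0; i < N - 1; i++) a[i].next = &a[i + 1];
        a[N - 1].next = &a[0];

        for (p = a[0].next; p != &a[0]; p = p->next)
            ;                                  /* one field per node */

        for (p = a[0].next; p != &a[0]; p = p->next)
            sum += p->pad[0] + p->pad[1] + p->pad[2]; /* whole node */

        printf("%ld\n", sum);
        return 0;
    }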

Dror

 
 
 

benchmarking discussion at Usenix?

Post by Hugh LaMaster » Tue, 14 Jan 1997 04:00:00




> > > Only curious: what are the goals for this?
> > OS code has somewhat different characteristics from user code. If you
> That's been my basic question, yes.  I wasn't aware that the typical
> usage pattern of kernel code differs that much from user code.

Many "user code" benchmarks run with almost zero cache misses if
the cache is large enough.  But the work I was referring to
has shown substantially higher actual cache miss rates on kernel
code vs user code.  It depends on what you consider "user code",
though - apparently some RDBMSs and some bulky C++ codes are now
as bad as or worse than kernel code used to be for the same footprint
comparisons.  If we are looking at the more well-behaved and compact
"user code" benchmarks of yesteryear, then kernel code looks worse.
[Supposedly.  I don't have the citations handy.]

> > Another part is exception and interrupt handling.
> Hmm.  This can hardly be expressed in a benchmark.  OTOH, if the
> system keeps reasonable statistics, the effect of interrupt and
> exception handling should be foreseeable.  I estimate it at less
> than ~ 10 % of the total time on a modern machine.

I guess it depends on whether or not you have to drive
dumb ugly serial ports and such like.  You can spend
a lot of time handling interrupts with such devices.  
One paper at Usenix showed Linux on a Pentium dropping
interrupts/data when driving a 115 Kbps serial port...
 
 
 

benchmarking discussion at Usenix?

Post by Hugh LaMaster » Tue, 14 Jan 1997 04:00:00



> One more interesting category is the latency accessing objects bigger
> than 4 bytes.  On many cache based machines accessing everything in a
> cache line is just as fast as accessing one element.  I've never seen
> measurements, but my guess is that many data elements in compilers are
> bigger than 4 bytes; i.e., spatial locality works for compilers.

Well, optimum cache line sizes have been studied extensively.
I'm sure there must be tables in H&P et al. showing hit rate
as a function of line size and total cache size.  For reasonably
large caches, I think the optimum used to be near 16 Bytes for
32-bit byte-addressed machines.  I don't know that I have seen more
recent tables for 64-bit code on, say, Alpha, but my guess is that
32 bytes is probably superior to 16 bytes given the larger address
sizes, not to mention alignment considerations.  Just a guess.
Also, we often (but not always) have two levels of cache now,
and sometimes three, and the optimum isn't necessarily the
same on all three.  Numbers, anyone?
 
 
 

benchmarking discussion at Usenix?

Post by Dror Maydan » Thu, 16 Jan 1997 04:00:00




> > One more interesting category is the latency accessing objects bigger
> > than 4 bytes.  On many cache based machines accessing everything in a
> > cache line is just as fast as accessing one element.  I've never seen
> > measurements, but my guess is that many data elements in compilers are
> > bigger than 4 bytes; i.e., spatial locality works for compilers.

> Well, optimum cache line sizes have been studied extensively.
> I'm sure there must be tables in H&P et al. showing hit rate
> as a function of line size and total cache size.  For reasonably
> large caches, I think the optimum used to be near 16 Bytes for
> 32-bit byte-addressed machines.  I don't know that I have seen more
> recent tables for 64-bit code on, say, Alpha, but my guess is that
> 32 bytes is probably superior to 16 bytes given the larger address
> sizes, not to mention alignment considerations.  Just a guess.
> Also, we often (but not always) have two levels of cache now,
> and sometimes three, and the optimum isn't necessarily the
> same on all three.  Numbers, anyone?

My point was that different machines do have different line sizes, and
the differences are quite large.  On the SGI R10000, the secondary line
size is 128 bytes.  On some IBM POWER2s, the line size is 256 bytes.
I'm pretty sure that some other vendors use 32-byte line sizes.
Why different vendors use different line sizes is probably related both
to system issues and to which types of applications they try to
optimize.  But that is irrelevant to the benchmarking issue.  The issue
is that lmbench measures the latency of fetching a single pointer.  On
such a benchmark a large-line machine will look relatively worse
compared to the competition than it would on a benchmark that measured
the latency of fetching an entire cache line.
Now, which benchmark is "better"?  I think both are interesting.  Which
is more relevant to a typical integer application?  I don't know.
 
 
 

benchmarking discussion at Usenix?

Post by Matt Dillon » Thu, 16 Jan 1997 04:00:00




:>> > > Only curious: what are the goals for this?
:>
:>> > OS code has somewhat different characteristics from user code. If you
:>
:>> That's been my basic question, yes.  I wasn't aware that the typical
:>> usage pattern of kernel code differs that much from user code.
:>
:>Many "user code" benchmarks run with almost zero cache misses if
:>the cache is large enough.  But the work I was referring to
:>has shown substantially higher actual cache miss rates on kernel
:>code vs user code.  It depends on what you consider "user code",
:>though - apparently some RDBMSs and some bulky C++ codes are now
:>as bad as or worse than kernel code used to be for the same footprint
:>comparisons.  If we are looking at the more well-behaved and compact
:>"user code" benchmarks of yesteryear, then kernel code looks worse.
:>[Supposedly.  I don't have the citations handy.]
:>  
:>
:>> > Another part is exception and interrupt handling.
:>
:>> Hmm.  This can hardly be expressed in a benchmark.  OTOH, if the
:>> system keeps reasonable statistics, the effect of interrupt and
:>> exception handling should be foreseeable.  I estimate it at less
:>> than ~ 10 % of the total time on a modern machine.
:>
:>I guess it depends on whether or not you have to drive
:>dumb ugly serial ports and such like.  You can spend
:>a lot of time handling interrupts with such devices.  
:>One paper at Usenix showed Linux on a Pentium dropping
:>interrupts/data when driving a 115 Kbps serial port...

    The ISA bus cycles used to access the serial ports (in most cases)
    are slow, but not THAT slow.  Any problem Linux has with serial
    data overruns is readily attributable to long interrupt disables
    in other parts of the kernel and the fact that the PC-standard
    serial chipset is braindead when it comes to on-chip hardware
    handshaking.  It has nothing to do with the processing
    time required to handle the serial interrupt.

    PC serial ports are the exception rather than the rule.

                                        -Matt

 
 
 

benchmarking discussion at Usenix?

Post by Matt Dillon » Thu, 16 Jan 1997 04:00:00




:>> One more interesting category is the latency accessing objects bigger
:>> than 4 bytes.  On many cache based machines accessing everything in a
:>> cache line is just as fast as accessing one element.  I've never seen
:>> measurements, but my guess is that many data elements in compilers are
:>> bigger than 4 bytes; i.e., spatial locality works for compilers.
:>
:>Well, optimum cache line sizes have been studied extensively.
:>I'm sure there must be tables in H&P et al. showing hit rate
:>as a function of line size and total cache size.  For reasonably
:>large caches, I think the optimum used to be near 16 Bytes for
:>32-bit byte-addressed machines.  I don't know that I have seen more
:>recent tables for 64-bit code on, say, Alpha, but my guess is that
:>32 bytes is probably superior to 16 bytes given the larger address
:>sizes, not to mention alignment considerations.  Just a guess.
:>Also, we often (but not always) have two levels of cache now,
:>and sometimes three, and the optimum isn't necessarily the
:>same on all three.  Numbers, anyone?

    The speed at which you can access memory from a program
    is limited by the maximum size of the data object you can
    read or write in a single (memory) instruction cycle, which is
    usually a long or quad word (4 or 8 bytes).  For the most part,
    it is unrelated to the cache line size.

    What IS related to the cache line size is the memory-to-cache
    and secondary-to-primary cache bandwidth.  When you run an
    instruction that reads data element N into a register, the
    processor may end up transferring elements N+1, N+2, etc...
    into the primary cache at the same time, but you still have to
    issue instructions to read those elements to actually get hold
    of them.
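
    The distinction shows up directly if you touch one byte per line
    versus every byte (a toy sketch; the 64-byte line size here is an
    assumption, pick your machine's):

        #include <stdio.h>

        #define N (1 << 22)
        static char buf[N];

        int main(void)
        {
            long i, sum = 0;

            for (i = 0; i < N; i += 64)   /* one load per line:       */
                sum += buf[i];            /* dominated by line fills  */

            for (i = 0; i < N; i++)       /* every byte: dominated by */
                sum += buf[i];            /* the loads you must issue */

            printf("%ld\n", sum);
            return 0;
        }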

    The cache line size is also a topological tradeoff in the design
    of the cache memory.  The larger the line size, the fewer tag
    bits you need AND the higher the ratio of data bits to tag bits.
    It's a two-way street, though... if the cache line size is too
    large, you lose efficiency due to data address collisions.

                                                -Matt

 
 
 

benchmarking discussion at Usenix?

Post by Hugh LaMaster » Fri, 17 Jan 1997 04:00:00






> :>> > Another part is exception and interrupt handling.
> :>
> :>> Hmm.  This can hardly be expressed in a benchmark.  OTOH, if the
> :>> system keeps reasonable statistics, the effect of interrupt and
> :>> exception handling should be foreseeable.  I estimate it at less
> :>> than ~ 10 % of the total time on a modern machine.
> :>
> :>I guess it depends on whether or not you have to drive
> :>dumb ugly serial ports and such like.  You can spend
> :>a lot of time handling interrupts with such devices.
> :>One paper at Usenix showed Linux on a Pentium dropping
> :>interrupts/data when driving a 115 Kbps serial port...

>     The ISA bus cycles used to access the serial ports (in most cases)
>     are slow, but not THAT slow.  Any problem Linux has with serial
>     data overruns is readily attributable to long interrupt disables
>     in other parts of the kernel

Yes, and ... ?  I'm not attacking Linux, BTW.  A particular design
decision was made for reasonable reasons.

>     and the fact that the PC-standard
>     serial chipset is braindead when it comes to on-chip hardware
>     handshaking.

The point of the article/presentation, as well as what to do
about it, centered on the tradeoffs involved with various
strategies, including trading off time lost due to PIC accesses
vs losing interrupts.

>     It has nothing to do with the processing
>     time required to handle the serial interrupt.

"It has nothing to do with the CPU cycles required to handle
the interrupt."  That isn't the same thing as "time".  "Time"
is the time lost to an interrupted process when handling lots
of interrupts.  There is a direct tradeoff in the PC required
between losing some interrupts and time lost, due to the
"braindead" design [your word above].  There is also a proposed
method which should minimize the number of lost interrupts while
still minimizing time impact.  Of course, "real" computers
have hardware that minimizes overhead due to MP synchronization
and due to interrupts.  But, most of us use PCs now, not "real"
computers, so, it is interesting that a software technique exists
which can extend the utility of the PC in demanding situations.

>     PC serial ports are the exception rather than the rule.

I'm not sure what rule they are an exception to, but anybody
with a PC [almost everybody] who is surprised how "slow" their
PC is when driving a fast modem via a serial port may think such
behavior is the rule rather than the exception.  Of course,
many people avoid using dumb PC serial ports for exactly that
reason.

I agree with somebody above that you can't really benchmark
this in userland, but I consider "number of interrupts/sec
handled without losing interrupts" to be an interesting number
to know about a (hardware/software) system.

 
 
 

benchmarking discussion at Usenix?

Post by Skipper Smith » Fri, 17 Jan 1997 04:00:00




> > One more interesting category is the latency accessing objects bigger
> > than 4 bytes.  On many cache based machines accessing everything in a
> > cache line is just as fast as accessing one element.  I've never seen
> > measurements, but my guess is that many data elements in compilers are
> > bigger than 4 bytes; i.e., spatial locality works for compilers.

> Well, optimum cache line sizes have been studied extensively.
> I'm sure there must be tables in H&P et al. showing hit rate
> as a function of line size and total cache size.  For reasonably
> large caches, I think the optimum used to be near 16 Bytes for
> 32-bit byte-addressed machines.  I don't know that I have seen more
> recent tables for 64-bit code on, say, Alpha, but my guess is that
> 32 bytes is probably superior to 16 bytes given the larger address
> sizes, not to mention alignment considerations.  Just a guess.
> Also, we often (but not always) have two levels of cache now,
> and sometimes three, and the optimum isn't necessarily the
> same on all three.  Numbers, anyone?

There are at least a couple of other concerns that must be addressed
when looking at optimal cache block lengths.  For example:
1) Interrupt latency
2) Data bus width
3) Cache blocking

1) No matter what, your cache sector (a piece of a block which can be
valid independently of the rest of the block) cannot be so large that
attempts to load it cause interrupt latency to be excessive.  This
is at odds with the benefits derived from locality of reference.  Since
data that is near what you are currently fetching is more likely to
be used in the near future, there is an obvious benefit to fetching as
much data at one time as you can, particularly since bursting makes it
much more efficient to continue a memory access than it is to start a
new one.

2) Since worst-case interrupt latency puts an upper limit on how much
time we are willing to have the bus in use, the best-case amount of
data we can associate with each sector depends more on bus width
than on whether a processor is 64-bit or 32-bit.  If we presume that
4 or 8 beats (data accesses) is the optimal number (acceptable
interrupt latency + most local data feasible), then a 32-bit data bus
will yield 16- or 32-byte sectors, a 64-bit data bus will yield 32- or
64-byte sectors, and a 128-bit data bus will yield 64- or 128-byte
sectors.  If that is adequate to reach our cache organization goals,
then fine; otherwise we might assign more than one sector to a block,
associating additional data with one tag and allowing the extra data
to be brought in on a "time-available" basis.  See the PowerPC 601
(or, for that matter, the MC68030) as an example of a chip that used
two sectors per block to achieve cache organization goals while still
keeping interrupt latency at an acceptable level.

3) Finally, it must be remembered that while locality of reference is
the rule, there are likely to be many exceptions in any given group of
algorithms.  Because of this, caches need a way to deal with the
fact that the cache is blocked while it is being accessed.
If the cache goes off and does an 8- or 16-beat access because you
need one byte in that sector, how long should you be expected to
twiddle your thumbs waiting for access to a different block in the
cache (or the bus) when the stride of your memory accesses takes your
next access outside of that sector?  While this time can be minimized
by implementing cache load buffers, those bring their own challenges
and impacts, which get harder to avoid at each level.

I don't have any hard numbers, but the industry seems to have decided
that cache sector sizes equal to 4 or 8 data beats (with the bulk
choosing 4... 8 is usually an available choice or a side effect) are
best, and that, when organizational purposes require it, multiple
sectors per block are acceptable but are generally avoided.  Therefore,
look to your bus width to decide what the "line" size should be.

--
Skipper Smith
Somerset Design Center
All opinions are my own and not those of my employer

 
 
 

benchmarking discussion at Usenix?

Post by Eugene Miya » Fri, 17 Jan 1997 04:00:00




> Well, optimum cache line sizes have been studied extensively.

Well, there is the question of continuing validity...

....

> same on all three.  Numbers, anyone?

Look up the papers of Alan Jay Smith.  He used to read comp.arch
until the S/N ratio got too bad.  Between us we produced a student,
now at USC.
 
 
 

1. benchmarking discussion at Usenix?

I'll be at Usenix and I thought I might put together a little BOF to discuss
benchmarking issues, if people are interested.  Topics could include

        . lmbench 2.0 changes
                - finer grain accuracy
                - web benchmarks
                - scaling/load
                - multi pointer memory latency
        . osbench
                - Steve Kleiman suggested that I (or someone) grab a big hunk
                 of OS code and port it to userland and call it osbench.  This
                 is an interesting idea.
        . freespec
                - I'm unhappy about the current spec.  I'd like to build a
                  freeSpec97 that is similar to spec (uses the same tests)
                  but has lmbench style reporting rules (cc -O/f77 -O) and
                  is free of any charges.  Any interest?
        . others?

Let me know if you want me to find some space so we can talk at Usenix.
--
---

2. reexecuting a killed program

3. USENIX roomshare (see comp.org.usenix.roomshare)

4. changing IP and gateway using 'ifconfig'

5. Welcome to NetBSD discussion forums, message boards

6. Any X-Clients available , which can run from windows can contact the linux boxes?

7. Message list font in Netscape Mail&Discussion

8. Apache UserDir on Sparc 20 - does not work...

9. securing shell accounts discussion

10. Old Discussions on NewsGroup

11. Discussion: FS tree for large packages

12. Discussion on Syslog Configuration

13. Permission Discussion