Buffer cache lousy throughput?

Post by M. K. Shenk » Fri, 06 Jan 1995 11:26:55



Ok, here's what's up:  when I use iozone to test buffer cache throughput
(iozone 10 for me, on a 32 meg system) I get 4 megs/sec plus change.
When I do this I usually have about 26 megs free for buffer cache.

Is this not abysmally slow?  Many multi-gig disks will sustain this
right off the platter.

This is a 486/66, Asus SP3G mboard.  Kernel 1.1.75.  

Is this my hardware, is iozone a lousy test for buffer cache throughput,
or is this a kernel problem?

I seem to recall much better numbers a while back, although my hardware
was different.  I am wondering how much significant work has been done
on the buffer caching since the early post-1.0 versions.

My motherboard appears to be set optimally, and by all reports seems to
be a fast motherboard.  Nothing else seems sluggish.

If it matters (it shouldn't, should it?) I am using the onboard SCSI -- an
NCR 53c810 chip.  

I know that I'm not somehow missing the buffer cache, because my disk
will not sustain this sort of throughput.
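
(For concreteness, here is a rough sketch of what a test like this boils
down to -- write a big file through the filesystem, then read it back,
timing both passes.  The file name, size and record size below are
arbitrary placeholders, not iozone's exact parameters, so treat it as an
illustration only:)

/* bufcache_test.c -- crude iozone-style throughput sketch (illustration
 * only, not iozone).  Writes FILE_MB megabytes in RECORD-byte records,
 * then reads them back, reporting MB/s for each pass.  With a file this
 * small on a machine with plenty of free RAM, both passes should be
 * served almost entirely by the buffer cache. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define FILE_MB 10                /* assumed size, like "iozone 10" */
#define RECORD  4096              /* assumed record size */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    char buf[RECORD];
    long i, records = (long)FILE_MB * 1024 * 1024 / RECORD;
    double t0, t1;
    int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

    if (fd < 0) { perror("open"); return 1; }
    memset(buf, 0xAA, sizeof(buf));

    t0 = now();
    for (i = 0; i < records; i++)
        if (write(fd, buf, RECORD) != RECORD) { perror("write"); return 1; }
    t1 = now();
    printf("write: %.2f MB/s\n", FILE_MB / (t1 - t0));

    lseek(fd, 0, SEEK_SET);
    t0 = now();
    for (i = 0; i < records; i++)
        if (read(fd, buf, RECORD) != RECORD) { perror("read"); return 1; }
    t1 = now();
    printf("read:  %.2f MB/s\n", FILE_MB / (t1 - t0));

    close(fd);
    unlink("testfile");
    return 0;
}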

Anybody involved with this part of the kernel have any ideas?  Any
other folks with similar/different hardware have numbers to report?

tia,
Craig.

 
 
 

Buffer cache lousy throughput?

Post by Linus Torvalds » Fri, 06 Jan 1995 21:11:52




Quote:

>Ok, here's what's up:  when I use iozone to test buffer cache throughput
>(iozone 10 for me, on a 32 meg system) I get 4 megs/sec plus change.
>When I do this I usually have about 26 megs free for buffer cache.

>Is this not abysmally slow?  Many multi-gig disks will sustain this
>right off the platter.

>This is a 486/66, Asus SP3G mboard.  Kernel 1.1.75.  

>Is this my hardware, is iozone a lousy test for buffer cache throughput,
>or is this a kernel problem?

You might time your memcpy(): the memory speeds may actually be
something of a bottle-neck here.  On my Pentium-90, I get only 18MB/s in
memory bandwidth with a normal "memcpy()".  And the buffer cache will be
slower due to having more overhead than just the copy (but you're
probably right about the overhead being a bit too much).

                Linus

 
 
 

Buffer cache lousy throughput?

Post by David Hinds » Sat, 07 Jan 1995 07:02:43



: <
: <Ok, here's what's up:  when I use iozone to test buffer cache throughput
: <(iozone 10 for me, on a 32 meg system) I get 4 megs/sec plus change.
: <When I do this I usually have about 26 megs free for buffer cache.
: <
: <Is this not abysmally slow?  Many multi-gig disks will sustain this
: <right off the platter.

: Not that many.. disk drives quote Mb/sec (bits), iozone and hdparm
: (more accurate) quote MB/sec (bytes).  There are very few drive models
: that can sustain more than 40Mb/s or 5MB/sec today, Seagate Barracudas
: being a notable exception due to their 7200rpm platter speed.

These statements are not inconsistent.  Yes, not that many single drives
can sustain more than 5MB/sec, but the last time I looked, I thought
that a lot of the multi-gig drives I saw were in that ballpark.

Anyway, 4MB/sec through the filesystem is not that terrible, really.
Here are a few numbers to compare:

SGI Indigo2, 150MHz R4400, 48 MB, 1 GB SCSI, IRIX 5.2
  iozone  96  512: writes:  2.9 MB/s, reads:  2.7 MB/s
  iozone   8  512: writes:  9.8 MB/s, reads:  8.4 MB/s
  iozone   8 4096: writes: 22.  MB/s, reads: 30.  MB/s

DEC Alpha 3000AXP, 150 MHz 21064, 64 MB, 1GB SCSI, OSF 3.0
  iozone 128  512: writes:  2.2 MB/s, reads:  3.0 MB/s
  iozone   8  512: writes:  6.5 MB/s, reads:  3.3 MB/s
  iozone   8 4096: writes: 18.  MB/s, reads:  2.5 MB/s

NEC Versa, 25 MHz 486SL, 12 MB, 340 MB IDE, Linux 1.1.76
  iozone  24  512: writes:  0.8 MB/s, reads:  0.6 MB/s
  iozone   4  512: writes:  0.9 MB/s, reads:  1.7 MB/s
  iozone   4 4096: writes:  1.0 MB/s, reads:  3.3 MB/s

If there is anything to be drawn from this, I'd say that Linux is
relatively weak on write caching, and OSF/1 is relatively weak on read
caching.

        -- Dave Hinds

 
 
 

Buffer cache lousy throughput?

Post by M. K. Shenk » Sat, 07 Jan 1995 14:41:33




><
><Ok, here's what's up:  when I use iozone to test buffer cache throughput
><(iozone 10 for me, on a 32 meg system) I get 4 megs/sec plus change.
><When I do this I usually have about 26 megs free for buffer cache.
><
><Is this not abysmally slow?  Many multi-gig disks will sustain this
><right off the platter.

>Not that many.. disk drives quote Mb/sec (bits), iozone and hdparm
>(more accurate) quote MB/sec (bytes).  There are very few drive models

I am aware of this.  I believe you will find that a number of the
modern, higher-density 5400 RPM drives will do 4 MB/s, as well as the
Barracudas you mention.

Please don't assume idiocy.  Some folks do know the difference between
bytes and bits.  

>that can sustain more than 40Mb/s or 5MB/sec today, Seagate Barracudas
>being a notable exception due to their 7200rpm platter speed.
>--


Very few?  I think you will find that with the proliferation of 4+
gig, 3.5-inch drives, this is no longer the case.

The fastest Barracuda, if reports are correct, will sustain 8 megs/sec
or thereabouts.  I believe that any 4 gig, 5400 RPM, 3.5-inch drive is likely
to be able to manage 4 megs/sec.  DEC and IBM also now manufacture
7200 RPM drives.  

The number of drives that can do it was pretty much beside my point,
which was to make a comparison between RAM and a mechanical medium --
i.e. there should be no comparison.  

But thank you for your completely tangential nit-picking.

 
 
 

Buffer cache lousy throughput?

Post by Mark Lord » Sat, 07 Jan 1995 04:18:11


<
<Ok, here's what's up:  when I use iozone to test buffer cache throughput
<(iozone 10 for me, on a 32 meg system) I get 4 megs/sec plus change.
<When I do this I usually have about 26 megs free for buffer cache.
<
<Is this not abysmally slow?  Many multi-gig disks will sustain this
<right off the platter.

Not that many.. disk drives quote Mb/sec (bits), iozone and hdparm
(more accurate) quote MB/sec (bytes).  There are very few drive models
that can sustain more than 40Mb/s or 5MB/sec today, Seagate Barracudas
being a notable exception due to their 7200rpm platter speed.
--

 
 
 

Buffer cache lousy throughput?

Post by Craig A. Johnston » Thu, 12 Jan 1995 09:51:13






>>Ok, here's what's up:  when I use iozone to test buffer cache throughput
>>(iozone 10 for me, on a 32 meg system) I get 4 megs/sec plus change.
>>When I do this I usually have about 26 megs free for buffer cache.
>> [snip]

>You might time your memcpy(): the memory speeds may actually be
>something of a bottle-neck here.  On my Pentium-90, I get only 18MB/s in
>memory bandwidth with a normal "memcpy()".  And the buffer cache will be
>slower due to having more overhead than just the copy (but you're
>probably right about the overhead being a bit too much).

>            Linus

I'm getting 19.6-19.7 MB/s in memory bandwidth (memcpy() timed with
gettimeofday()), so, considering this and some cache benchmark figures
from others, I'm assuming my system is normal.  I guess my memory
of better numbers before was faulty, unless the buffer cache code
has been played with a lot since pre-1.0?
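
(For reference, roughly the sort of memcpy()/gettimeofday() loop being
described here -- the 4 MB buffer size and pass count are arbitrary
choices for illustration, not the exact test used above:)

/* memspeed.c -- rough memcpy() bandwidth sketch. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BUFSZ  (4 * 1024 * 1024)   /* assumed: bigger than the CPU cache */
#define PASSES 20

int main(void)
{
    char *src = malloc(BUFSZ), *dst = malloc(BUFSZ);
    struct timeval t0, t1;
    double secs;
    int i;

    if (!src || !dst) { perror("malloc"); return 1; }
    memset(src, 1, BUFSZ);         /* touch the pages up front */
    memset(dst, 2, BUFSZ);

    gettimeofday(&t0, NULL);
    for (i = 0; i < PASSES; i++)
        memcpy(dst, src, BUFSZ);
    gettimeofday(&t1, NULL);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("memcpy: %.1f MB/s\n",
           (double)PASSES * BUFSZ / (1024 * 1024) / secs);
    return 0;
}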

-Craig (who was posting from wife mkshenk's acct)
--
"A psychotic is a guy  |  Craig A. Johnston                     |  MS:

 what's going on."     |  finger for PGP public key             |   Inside"
   -- W.S. Burroughs   |  (C) 1995; all slights deserved.       |

 
 
 

Buffer cache lousy throughput?

Post by system adm » Tue, 17 Jan 1995 23:42:16







: >>
: >>Ok, here's what's up:  when I use iozone to test buffer cache throughput
: >>(iozone 10 for me, on a 32 meg system) I get 4 megs/sec plus change.
: >>When I do this I usually have about 26 megs free for buffer cache.
: >> [snip]
: >
: >You might time your memcpy(): the memory speeds may actually be
: >something of a bottle-neck here.  On my Pentium-90, I get only 18MB/s in
: >memory bandwidth with a normal "memcpy()".  And the buffer cache will be
: >slower due to having more overhead than just the copy (but you're
: >probably right about the overhead being a bit too much).
: >
: >          Linus

: I'm getting 19.6-7 MB/s in memory bandwidth (memcpy() timed with
: gettimeofday() ) , so, considering this and some cache benchmark figures
: from others, I'm assuming my system is normal.  I guess my memory
: of better numbers before was faulty, unless the buffer cache code
: has been played with a lot since pre-1.0?

While we're on this subject, I'm a bit puzzled by some comparative
benchmarking between my system and a box with similar hardware running
FreeBSD.  Justin Gibbs at Berkeley optimized the Adaptec 7770/7870
SCSI driver, and was seeing iozone throughput jump from ~2.5-3 MB/s
(using 16-meg r/w size) to ~5 MB/s!  With high hopes, I slaved to port
his improvements to the linux 1.1.69 kernel.  Unfortunately, it made
absolutely _no_ measurable difference..  I am still "only" seeing 1.8
MB/s on writes and 1.2 MB/s on reads.

Is there some inherent inefficiency in the scsi code and/or
buffer-caching that might be masking the driver speedup?  Can anyone
shed some light on the particulars of what scsi.c is doing?  The code
is so dense and tersely-commented that I can't really make heads or
tails out of it (not meant as a criticism - I know how it goes..).

I'm anxious for any input from those more knowledgeable about the inner
workings of the scsi subsystem and/or kernel.

 
 
 

Buffer cache lousy throughput?

Post by system adm » Wed, 18 Jan 1995 06:20:45







: : >>
: : >>Ok, here's what's up:  when I use iozone to test buffer cache throughput
: : >>(iozone 10 for me, on a 32 meg system) I get 4 megs/sec plus change.
: : >>When I do this I usually have about 26 megs free for buffer cache.
: : >> [snip]
: : >
: : >You might time your memcpy(): the memory speeds may actually be
: : >something of a bottle-neck here.  On my Pentium-90, I get only 18MB/s in
: : >memory bandwidth with a normal "memcpy()".  And the buffer cache will be
: : >slower due to having more overhead than just the copy (but you're
: : >probably right about the overhead being a bit too much).
: : >
: : >                Linus

: : I'm getting 19.6-7 MB/s in memory bandwidth (memcpy() timed with
: : gettimeofday() ) , so, considering this and some cache benchmark figures
: : from others, I'm assuming my system is normal.  I guess my memory
: : of better numbers before was faulty, unless the buffer cache code
: : has been played with a lot since pre-1.0?

: While we're on this subject, I'm a bit puzzled by some comparative
: benchmarking between my system and a box with similar hardware running
: FreeBSD.  Justin Gibbs at Berkeley optimized the Adaptec 7770/7870
: SCSI driver, and was seeing iozone throughput jump from ~2.5-3 MB/s
: (using 16-meg r/w size) to ~5 MB/s!  With high hopes, I slaved to port
: his improvements to the linux 1.1.69 kernel.  Unfortunately, it made
: absolutely _no_ measurable difference..  I am still "only" seeing 1.8
: MB/s on writes and 1.2 MB/s on reads.

As a followup to my own followup (?), I unearthed a patch to
scsi_ioctl.c by Eric Youngdale that supports benchmarking of raw,
driver-level reads from any scsi block device to the ramdisk area,
thus bypassing the filesystem and buffer cache.

After some minor mods to bring it into compatibility with the current
data structures, I was able to confirm my suspicions.  My primary
drive, a Seagate ST11200, clocks at 2.37 MB/s using srawread.  The
other drive on the system, a Maxtor MXT-540SL, measures 3.6 MB/s --
_double_ what I have ever seen via the file system and cache.

I agree with Eric's assessment in the file header; this _is_ an
accurate test of the raw driver.  My 2x Texel CD-ROM measures almost
precisely at 330 KB/s!

This tends to suggest that there _is_ a bottleneck in the cache
buffering.  At the time that Eric wrote the benchmark, drives and
controllers in general use were a bit slower than today - thus the
bottleneck was masked.

With Justin measuring 5 MB/s under iozone, I have to wonder just what
the FreeBSD team has done to yield throughput on this level?  I'm not
trying to start a religious war, but it might be educational to
examine the architecture.

 
 
 

Buffer cache lousy throughput?

Post by john dyson » Fri, 20 Jan 1995 01:18:37



>With Justin measuring 5 MB/s under iozone, I have to wonder just what
>the FreeBSD team has done to yield throughput on this level?  I'm not
>trying to start a religious war, but it might be educational to
>examine the architecture.

David Greenman, I, and other members of and contributors to FreeBSD have
spent DAYS on various performance aspects of the kernel, VM and I/O
subsystems.  Actually, the version of FreeBSD that Justin used for
benchmarking had some minor performance quirks.  Many of those are being
solved (and sometimes made worse :-( ) as we speak.  There is ABSOLUTELY
no reason that FreeBSD should deliver less than full hardware performance
at the 5 MByte/second level (on a reasonably fast machine with good
hardware).  The original 4.4Lite code is good stuff, but needed some work
in the filesystem and VM performance areas.

I have found that system performance is very dependent on the caching,
clustering, and pageout algorithms used.  Under light load conditions, it
really doesn't make much difference.  But under heavy, power-user loads,
the difference can be night and day.

For example, the "clock" pageout algorithm that is taught in comp-sci 101 is
faulty.  There are slightly more sophisticated algorithms that can perform much
better under heavy memory loads.  The problem with these more complex
algorithms is that they can impose an unintended policy on memory usage.  It
takes lots of testing to make sure that they really do work better.  Some
development versions of "improved" pageout daemons have been worse
than clock, and others have been much, much better.  Additionally, there
are many potential clashes between process-level VM and a dynamic
buffer cache.  The merged VM/buffer cache scheme of FreeBSD sidesteps
these problems so that they simply don't arise.
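
(For readers who haven't run into it, here is a toy user-space sketch of
the textbook "clock"/second-chance scan being referred to above --
strictly the comp-sci-101 version, not FreeBSD's or Linux's actual
pageout code, and the access pattern below is made up:)

/* clock_demo.c -- toy "clock" (second-chance) page replacement sketch. */
#include <stdio.h>

#define NPAGES 8

struct page {
    int referenced;     /* stands in for the hardware referenced bit */
    int id;
};

static struct page pages[NPAGES];
static int hand;        /* the clock hand */

/* Sweep the clock hand until an unreferenced page is found.  Referenced
 * pages get a "second chance": clear the bit and move on. */
static int clock_evict(void)
{
    for (;;) {
        struct page *p = &pages[hand];
        hand = (hand + 1) % NPAGES;
        if (p->referenced)
            p->referenced = 0;
        else
            return p->id;           /* the victim page */
    }
}

int main(void)
{
    int i;
    for (i = 0; i < NPAGES; i++) {
        pages[i].id = i;
        pages[i].referenced = (i % 3 != 0);   /* fake access pattern */
    }
    printf("evict page %d\n", clock_evict());
    printf("evict page %d\n", clock_evict());
    return 0;
}

One weakness is that a single recent reference looks the same to this
scan as sustained heavy use, which is roughly the sort of thing the more
sophisticated algorithms mentioned above try to do better -- at the risk
of imposing the unintended policies warned about.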

PLEASE recognize that FreeBSD is NOT perfect, but is striving to be.  This
note is not meant to be flame-bait, but any technical discussion is invited
either by email or by newsgroup postings!!!!

There will be some documentation and/or a book coming out on the FreeBSD
kernel and OS in general sometime (hopefully soon).  I intend to produce a
VM internals document soon (probably in the FreeBSD 2.1 timeframe).

Best wishes
John Dyson (FreeBSD-core)


 
 
 

Buffer cache lousy throughput?

Post by Jeff Kuehn » Fri, 20 Jan 1995 02:30:22



: >This tends to suggest that there _is_ a bottleneck in the cache
: >buffering.  At the time that Eric wrote the benchmark, drives and
: >controllers in general use were a bit slower than today - thus the
: >bottleneck was masked.

: Actually, some time ago I brought this point forward, claiming that my
: system performed much better under DOS than it did under Linux.  I
: questioned the efficiency of the Adaptec 1542 driver.
: Then, Eric gave me this benchmark and indeed it resulted in the same
: speed as I got when using DOS.  I.e. it gives the maximum attainable
: speed of the disk and/or the controller.

: Puzzled by what could be causing this, Eric set off at an incredible pace
: and wrote all kinds of simulation tools.  He changed a lot in the
: buffering code, and made the requests between buffers and the driver
: more effective.  These improvements became known as the 'clustering
: patches', and went into the kernel early in the 1.1 series.  There was
: a big improvement in I/O performance at that time.

I think this was actually measured as an improvement to the raw device
performance.  I don't know if it was actually benchmarked at a higher
level...

: But indeed, it is still not near the raw device performance.
: [At least you know that a lot of effort already was spent on this problem]

Back in the days of 1.1.54, I ran the Byte Magazine benchmarks on my
Linux system to examine how performance had varied over the series of
1.1.x kernel releases.   (I refuse to defend the Byte Mag Benchmark as
"good"... I chose it because it was there...)  Anyway, I ran the Bench
for the following kernels: (under *very* strictly controlled conditions)

        1.0.4
        1.0.9
        1.1.0
        1.1.1
        1.1.2
                ********write drops to 70% of 1.1.0
        1.1.3
        1.1.4
        1.1.5
        1.1.10
        1.1.15
                ********write drops to 60% of 1.1.0
                ********read drops to 80% of 1.1.0
        1.1.20
        1.1.25
        1.1.30
        1.1.35
        1.1.40
        1.1.45
        1.1.50
        1.1.51
        1.1.52
        1.1.53
        1.1.54

        (hw: dx2/66 16MB + aha1742 eisa/scsi controller)
        (btw: I compiled all of the above kernels with the
        same compiler)

The results of the benchmark, at least for disk performance, were
disappointing.  In going from 1.1.2 to 1.1.3, write performance
dropped to 70% of the speed of 1.1.0.  Between 1.1.15 and 1.1.20, write
performance dropped again to only 60% of the speed of 1.1.0, and read
performance dropped to only 80% of the speed of 1.1.0.  We gained a
little ground between 1.1.20 and 1.1.30, but lost it again between
1.1.30 and 1.1.40.  There was essentially no change between 1.1.40 and
1.1.54.  For those who are interested, I seem to remember (and forgive
me if my memory isn't what it used to be) that the 1.1.3 patches
included much of the clustering code, the switch to bdflush, and
several changes to the buffer cache management (including several new
buffer management functions which struck me as looking CPU-intensive).

While the clustering patches clearly improved the performance of the
raw devices, is it possible that something in the clustering patches
and buffer cache changes screwed up the performance at a higher
level?  

Does this seem to support the theory that there's a bottleneck in the
buffer cache (or the filesystem) code?  Maybe it would be worthwhile
to re-examine this code.

Probably a good place to start would be to profile the kernel and find
out where the time is actually being spent.  I'm willing to re-run some
of the benchmarks for a more detailed analysis.  While I have a lot of
experience in performance analysis and tuning (for supercomputers), I
need a clue on how to use kernel profiling under Linux.  Also, are
there any Linux tools available to examine the data?

--Jeff Kuehn, NCAR/SCD Consulting Group

 
 
 

Buffer cache lousy throughput?

Post by Steven N. Hirsch » Fri, 20 Jan 1995 00:09:51



: >This tends to suggest that there _is_ a bottleneck in the cache
: >buffering.  At the time that Eric wrote the benchmark, drives and
: >controllers in general use were a bit slower than today - thus the
: >bottleneck was masked.

: Actually, some time ago I brought this point forward, claiming that my
: system performed much better under DOS than it did under Linux.  I
: questioned the efficiency of the Adaptec 1542 driver.
: Then, Eric gave me this benchmark and indeed it resulted in the same
: speed as I got when using DOS.  I.e. it gives the maximum attainable
: speed of the disk and/or the controller.

: Puzzled by what could be causing this, Eric set off at an incredible pace
: and wrote all kinds of simulation tools.  He changed a lot in the
: buffering code, and made the requests between buffers and the driver
: more effective.  These improvements became known as the 'clustering
: patches', and went into the kernel early in the 1.1 series.  There was
: a big improvement in I/O performance at that time.

: But indeed, it is still not near the raw device performance.
: [At least you know that a lot of effort already was spent on this problem]

Well, "not near" raw device performance is one thing.  Shy by half is
quite another!  Turning on clustering made a small, but measurable
difference and is included in my figures..

As a quick reality check, I'm going to set up a small BSD partition and
see what kind of filesystem benchmarks come forth.

--
____________________________________________________________________________
|Steven N. Hirsch                        "Anything worth doing is worth    |
|University of Vermont                    overdoing.." - Hunter S. Thompson|
|Computer Science / EE                                                     |
----------------------------------------------------------------------------

 
 
 

Buffer cache lousy throughput?

Post by Mark Lord » Fri, 20 Jan 1995 08:51:02


<As a followup to my own followup (?), I unearthed a patch to
<scsi_ioctl.c by Eric Youngdale that supports benchmarking of raw,
<driver-level reads from any scsi block device to the ramdisk area,
<thus bypassing the filesystem and buffer cache.
<
<After some minor mods to bring it into compatibility with the current
<data structures, I was able to confirm my suspicions.  My primary
<drive, a Seagate ST11200, clocks at 2.37 MB/s using srawread.  The
<other drive on the system, a Maxtor MXT-540SL, measures 3.6 MB/s. -
<_double_ what I have ever seen via the file system and cache.

For similar (and easier) accuracy in benchmarking disk drives,
you can use the freely available "hdparm -t", originally written
for timing IDE read speed.. it works for SCSI disks too.

It also measures & reports buffer cache speed, and then subtracts
that overhead from its final estimate of "raw driver" speed.

My IDE drives give around 17.49 MB/sec for the buffer cache,
and around 4MB/sec for the disk driver (AMD DX2/80).
--

 
 
 

Buffer cache lousy throughput?

Post by Steven N. Hirsch » Sun, 22 Jan 1995 09:37:42



: >Well, "not near" raw device performance is one thing.  Shy by half is
: >quite another!  Turning on clustering made a small, but measurable
: >difference and is included in my figures..

: >As a quick reality check, I'm going to set up a small BSD partition and
: >see what kind of filesystem benchmarks come forth.

: When you benchmark other systems, be sure to include benchmarks that need
: most of the RAM as cache.  Some systems reserve a small amount of total
: RAM for buffers.  They may get a higher throughput on tests that use a
: very small number of blocks (fewer cache blocks to check!), but this does not
: necessarily mean the system performs as well in the real world.

Believe me, I'm aware of that!  I have made sure to use extremely
large values for r/w (>20-meg).  With iozone, you can easily see the
throughput "hit the wall" when the cache is finally saturated :-).

I think this discussion is worth pursuing.  I trust that everyone here
is mature enough to keep a technical discussion out of the realm of
religious flaming..

I'll give the hdparm utility a shake.

--
____________________________________________________________________________
|Steven N. Hirsch                        "Anything worth doing is worth    |
|University of Vermont                    overdoing.." - Hunter S. Thompson|
|Computer Science / EE                                                     |
----------------------------------------------------------------------------