Hi!
>Earlier I had asked how to better buffer the reading of a file.
>Many people replied; there is a function in ANSI C named setvbuf
>which hooks a buffer (which you supply) for use with a FILE * you
>specify.
>H & S also say "you almost never need to modify the default buffering".
>It looks like they are right in my (limited) timing tests.
This is not always true: it depends on the combination of machine
speed, machine load, operating system, file cache, disk cache, disk
access time and so on; see (#) below.
>So here are numbers:
It's best to analyse the numbers that BSD csh's time gives you and
build some intuition about what they mean. The fields are, roughly:
user seconds, system seconds, elapsed time, percentage of CPU, memory
use, block input + output operations, and page faults + swaps:
>1. C program, reading a file and merely counting lines, with default IO
> 5.7u 2.1s 0:12 63% 0+144k 1494+1io 1494pf+0w
^^^^ ^^^^
A lot of I/O and too much paging here for my liking: it sounds like an
inefficient C program that reads a load of data into a huge buffer
before processing it (no, I haven't read the source yet).
With 128Mb of memory, either you didn't test this on an empty machine,
or the process has a small working set (i.e. a limited number of pages
allowed to be resident simultaneously).
It seems too much of a coincidence that the number of reads equals the
number of page faults; it looks as if every (512 byte) block read took
a page fault to get the memory in which to store the data.
>2. #1, with 10 Meg of buffer (through setvbuf)
> 6.1u 2.9s 0:12 74% 0+9008k 629+0io 632pf+0w
^^^^ ^^^ ^^^
A 10Mb buffer is vastly excessive; you will gain nothing on a
demand-paged operating system. Don't forget that both your setvbuf
buffer and fgets's buffer are in pageable areas of memory, and the
maxim is: if there's a demand, it's paged!
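Something far more modest usually does just as well. A minimal sketch
in the same spirit as your program, with the 64Kb size and the file
name "bigfile" purely for illustration:

#include <stdio.h>

/* Sketch only: a modest, fully buffered stream.  64Kb is a guess,
   not a recommendation; measure on your own machine (see below). */
#define BUFSZ (64 * 1024)

int main(void)
{
    static char buf[BUFSZ];
    FILE *f = fopen("bigfile", "r");

    if (f == NULL)
        return 1;
    /* setvbuf must come after fopen but before the first read or
       write; it returns non-zero if the request is refused, in which
       case the default buffering stays in effect. */
    if (setvbuf(f, buf, _IOFBF, (size_t) BUFSZ) != 0)
        fprintf(stderr, "setvbuf failed, using default buffering\n");

    /* ... read with fgets/fgetc as usual ... */

    fclose(f);
    return 0;
}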
If you run a disk-testing program (source for one called
"disktest"(#) was posted on the net once) you will find that the
combination of operating system, on-disk cache, disk access timing,
different sizes of reads and writes, etc. gives you some idea of the
values at which your program tends to perform best on your machine.
For example, one particular brand of hard disk on a Macintosh IIfx
under System 7 delivers a higher data rate when constantly reading
buffers of around 250Kb than it does with 160Kb or 80Kb.
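If you want a feel for this on your own machine without hunting down
disktest, something along these lines will do; the request sizes are
arbitrary, and remember the later passes are helped by the file cache,
so run each size from a fresh start for honest numbers:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

/* Read the same file with a few different request sizes and time
   each pass. */
int main(void)
{
    static size_t sizes[] = { 8 * 1024, 64 * 1024, 256 * 1024 };
    char *buf = malloc(256 * 1024);
    int i;

    if (buf == NULL)
        return 1;
    for (i = 0; i < 3; i++) {
        int fd = open("bigfile", O_RDONLY);
        struct timeval t0, t1;
        long n, total = 0;

        if (fd < 0)
            return 1;
        gettimeofday(&t0, NULL);
        while ((n = read(fd, buf, sizes[i])) > 0)
            total += n;
        gettimeofday(&t1, NULL);
        close(fd);
        printf("%ldk reads: %ld bytes in %.2f seconds\n",
               (long) (sizes[i] / 1024), total,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
    }
    free(buf);
    return 0;
}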
Not only that: taking up 10Mb of store is no use to the other people
sharing the machine; they'll be paging too!
I once wrote a program on the Mac as an MPW Tool, and it had a file
buffer that I set up myself. The tool ran quickly, but when I ported
the same code to unix it paged like hell and ran as slow as an old
dog. So I #ifdef'ed the reading functions back to plain old fgetc to
debug it and work out what was happening. The result: it ran faster
without the extra buffer space, and the code has stayed exactly the
same since then.
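The switch was nothing cleverer than this sort of shape (the macro and
function names here are invented for the sketch, not what was in the
tool):

#include <stdio.h>

#ifdef USE_OWN_BUFFER
extern int my_buffered_getc(FILE *f);   /* the hand-rolled reader */
#define NEXTCHAR(f)  my_buffered_getc(f)
#else
#define NEXTCHAR(f)  fgetc(f)           /* plain stdio, which won on unix */
#endif

/* Count lines using whichever reader the build selects. */
long count_lines(FILE *f)
{
    long lines = 0;
    int c;

    while ((c = NEXTCHAR(f)) != EOF)
        if (c == '\n')
            ++lines;
    return lines;
}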
>3. gawk 'END {print NR}'
> 3.6u 2.3s 0:07 74% 0+496k 561+0io 572pf+0w
>4. echo 'END {print NF}' | a2p > a.perl; chmod +x a.perl; a.perl
> 9.2u 10.3s 02 86% 0+760k 76+1io 122pf+0w
>5. wc -l
> 20.9u 2.0s 0:25 88% 0+280k 589+0io 591pf+0w
>wc is a joke; gawk is great; perl is worse than awk; the default IO
>of the C program is not improved by a 10 Meg buffer.
'Scuse me, but what's the important factor here? Time taken to do the
job? Speed of reading? I ask because you say wc is a joke and gawk is
great, yet they have much the same I/O and page-fault counts: neither
is an improvement in terms of machine performance; gawk is only better
because it takes less time.
perl looks good: less I/O and fewer page faults. But don't be deluded;
the file you read may still have been partly cached in the unix file
buffering system from the earlier runs. Oddly, it also has the
heaviest system time, which is strange!
There are trade-offs involved here; you can get at them from (#) with
careful programming.
Examining the code: fgets is not a beautiful function anyway. It
probably calls fgetc for each character, stores it, and breaks out of
the loop on EOF or '\n'. So it is really transferring data from one
area of memory to another (e.g. from the setvbuf buffer into the
buffer you hand to fgets); I'd guess this is the source of all the
page faults.
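In other words, something with roughly this shape (a guess at a
typical stdio fgets, not the real source):

#include <stdio.h>

/* Pull one character at a time off the stdio buffer and copy it into
   the caller's buffer, stopping at '\n', EOF or when the buffer is
   full.  Every byte is touched and copied a second time. */
char *my_fgets(char *s, int n, FILE *f)
{
    char *p = s;
    int c = EOF;

    while (--n > 0 && (c = getc(f)) != EOF) {
        *p++ = c;
        if (c == '\n')
            break;
    }
    if (p == s && c == EOF)
        return NULL;            /* nothing read at all */
    *p = '\0';
    return s;
}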
>Compiler/flags for C program didn't matter much. I tried /bin/cc -O4
>and gcc v1.37 -O.
There's hardly anything to optimise in your program, and I would
expect both compilers to generate roughly the same instructions, so
it's the buffering strategy that is the problem.
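If all you want is a line count you can skip fgets's second copy
entirely: read big blocks and just count the newlines. A sketch, with
the 64Kb block size again only a guess to be tuned as above:

#include <stdio.h>
#include <string.h>

#define BLOCK (64 * 1024)

int main(void)
{
    static char block[BLOCK];
    FILE *f = fopen("bigfile", "r");
    long lines = 0;
    size_t got;

    if (f == NULL)
        return 1;
    /* fread copies the data once, into this block; memchr then scans
       it in place for newlines, so there is no per-line copy. */
    while ((got = fread(block, 1, sizeof block, f)) > 0) {
        char *p = block, *end = block + got;

        while ((p = memchr(p, '\n', end - p)) != NULL) {
            ++lines;
            ++p;
        }
    }
    printf("%ld\n", lines);
    fclose(f);
    return 0;
}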
>Info about the data file: it was 180,000 lines, roughly 18 Meg.
With this amount of data, wouldn't you be better off putting it into
some database system and structuring it properly?
>This computer:
>OS/MP 4.1A.1 Export(S5GENERIC/root)#0: Mon Sep 30 16:11:19 1991
>System type is Solbourne Series5e/900 with 128 MB of memory.
>The C program:
>#include <stdio.h>
>int main(argc, argv)
> int argc;
> char **argv;
>{
> int bufsize, lines=0;
> char *buffer, *s;
> FILE *f;
> if (argc != 2) {
> fprintf(stderr, "Usage: %s buffersize\n", argv[0]);
> fprintf(stderr, "buffersize = 0 ==> setvbuf() is NOT called.\n");
> return 1;
> }
> bufsize = atoi(argv[1]);
> buffer = (char *) malloc(bufsize*sizeof(char));
> f = fopen("bigfile", "r");
> if (bufsize != 0) setvbuf(f, buffer, _IOFBF, bufsize);
>#define MAXCHARS 65536
> s = (char *) malloc(MAXCHARS*sizeof(char));
> while (NULL != fgets(s, MAXCHARS, f))
> ++lines;
> printf("%d\n", lines);
> return 0;
>}
>--
Happy hacking,
+--: Andy Edwards :----------*=================*------------------------------+
| level 2 porter and PAP | Barrington | applelink: harlequin |
| server clone originator. | Cambridge | voice: 0223 872522 |
| *Life is strange, yeah | CB2 5RG | +44-223-872-522 |
| compared to what?* | England | fax: 0223 872519 |
+----------------------------*=================*------------------------------+